Questions tagged [k-means]
k-means is a method to partition data into clusters by finding a specified number of means, k, such that when data are assigned to the cluster with the nearest mean, the within-cluster sum of squares is minimized.
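The objective in this description can be written out as a short formula. The notation is an assumption introduced here for illustration and does not come from the tag description itself: $x_j$ are the data points, $S_1, \dots, S_k$ are the clusters, and $\mu_i$ is the mean of cluster $S_i$.

```latex
\min_{S_1,\dots,S_k} \; \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j
```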
1,051 questions
0
votes
0
answers
29
views
Why bother with k-means number of clusters? Why not generate them all and see which one works? [closed]
I'm a sociologist with a CS background. I'm analyzing longitudinal data and I'm not up to speed with the statistical lingo around the whole thing. I'm trying to figure out the statistical names of the ...
2
votes
0
answers
16
views
How to cluster based on x and y coordinates
I am trying to identify rows in groups of points using clustering algorithms. The bigger picture problem I'm trying to solve is to identify shelves given x and y coordinates of products. I can cluster ...
0
votes
0
answers
11
views
Identify predictors for clustering output?
I have a dataset with variables collected years ago, and many variables collected this year as outcome variables. I want to combine all the variables collected this year to get one outcome, e.g. ...
1
vote
0
answers
19
views
Question about running k means cluster analysis
In a previous analysis I had 3 groups of subjects - group x with 35 subjects, control group y with 25 subjects, and control group z with 25 subjects. For each group I have levels of 6 different ...
1
vote
0
answers
40
views
Question on using the elbow method for calculating ideal number of clusters for k means cluster analysis
Newb to cluster analysis here. I have a group of 35 subjects. For all of the subjects I have data for different measures of IQ (verbal, math, etc) and different biomarkers. There are 6 IQ measures in ...
0
votes
0
answers
19
views
Clustering Mixed Data Types: Algorithm Selection, Distance Measurement, and Feature Weighting
I have a database of 74,000 records with 29 features. Fourteen of these features are categorical and are either 0 or 1, while the other 15 features are continuous and have been normalized and scaled ...
1
vote
1
answer
28
views
Is this the right approach to cluster using many different evaluations on the same dimension?
I'm working on a project where I want to sort political parties into two groups. I want to do so using the answers of many respondents in a survey who indicated for each party where they see them on a ...
1
vote
0
answers
38
views
What if PCA is unable to group my samples, but K-means perfectly clusters them? Is there any problem with my data analysis? Is it possible? [closed]
I am not an expert, but I am currently using unsupervised methods to better explain my mass spectrometry data obtained via DART-MS analyses. I am still learning.
It turned out that when analyzing my ...
0
votes
0
answers
14
views
Manual calculation of the C-index for clustering [duplicate]
Can anyone give me a worked example of the C-index clustering validity test, calculated manually?
0
votes
0
answers
25
views
Calculation of the C-index for clustering [duplicate]
Can anyone give me a worked example of the C-index clustering validity test, calculated manually?
1
vote
0
answers
22
views
Spatial Temporal Clustering evenly spaced over time
I have a large dataset of spatio-temporal data. It has longitude and latitude coordinates, and a date for each observation. For example:
Long    Lat      Date
50      20.43    9-19-2010
51      19.5     10-4-2010
51      19.3     ...
0
votes
0
answers
11
views
What are the right metrics to validate the performance of a custom clustering model with three possible outcomes?
I have developed a custom clustering model on top of MiniBatchKMeans that has three possible outcomes for each data point:
Assign the point to the correct cluster.
Assign the point to the wrong ...
0
votes
0
answers
24
views
Curse of dimensionality in Time series with K-means
I have been looking at the following notebook, time series clustering, where the writer says that the dataset is affected by the "Curse of Dimensionality", so applying TimeSeriesKMeans ...
3
votes
1
answer
28
views
What is "clall" in index.Gap in "clusterSim" R package?
I am using the "clusterSim" package in my project (https://cran.r-project.org/web/packages/clusterSim/clusterSim.pdf, page 39) and I do not understand the meaning of the "clall" ...
0
votes
0
answers
62
views
Variable importance in cluster analysis
I'm new to cluster analysis. I have read lots of things, but I'm not able to understand how the variables are ordered into clusters. I mean, I find that my data are clustered into 3 different clusters, but ...
0
votes
0
answers
16
views
Should the same environmental variable measured with different methods be removed before K-means? What about variables represented separately and by their ratio?
So I'm running the K-means clustering algorithm on environmental variables measured at different locations. The aim is to see if the environmental variables can be clustered into separate clusters.
Same ...
1
vote
0
answers
125
views
k-means clustering on a probability distribution instead of a dataset
Normally, clustering algorithms such as $k$-means are defined on a dataset in the following sense: if $D$ is a dataset, find a partition of $D$ into sets $\{S_1, \dots, S_n\}$ that minimises the ...
0
votes
1
answer
150
views
Applying clustering algorithms after t-SNE in R
So I'm doing my bachelor's work and I'm applying different clustering algorithms on certain data. Before all the clustering of course I'm using a dimensionality reduction algorithm such as t-SNE for ...
2
votes
1
answer
204
views
What is the standard threshold value that is best for accuracy when employing Euclidean distance as a metric for gauging textual similarity?
I'm using Euclidean distance as a metric to compare two sentences for similarity while clustering them using my custom incremental KMeans algorithm. The current threshold value I'm using is 0.7 which ...
0
votes
0
answers
10
views
What is normalized winning frequency in kernel self-organizing map (SOM)?
In the k-means based kernel SOM, proposed by MacDonald and Fyfe (2000), the update of the mean is based on a soft learning algorithm
$m_i(t+1) = m_i(t) + \Lambda[\phi(x) - m_i(t)]$
where $\Lambda$ is the normalized ...
0
votes
0
answers
43
views
Why does this K-Means cluster example show 'overlap' between clusters?
I was reading the hypertools docs and came across this pictorial that shows 10 clusters (some seem to share very similar coloring) generated from some (mushroom) ...
1
vote
0
answers
68
views
K-means clustering - weird PCA visualization
I performed PCA on 4 variables, and the results are shown in this visualization:
At first glance it doesn't look convincing, and some clusters seem weird.
The data was cleaned and standardized beforehand. Only ...
0
votes
1
answer
20
views
K means clustering of image with k=1 vs mean of all pixels
I have relatively uniformly colored images and I extracted colors using k-means. k-means with k = 1 showed the best results for my modeling purposes, k = 2 not so much, and with k = 3 there ceased to ...
0
votes
0
answers
21
views
Method for pairwise ordering two datasets
Given two rather small but unordered multidimensional vectors/datasets (e.g., sets of a handful of 3D coordinates), what is a simple method for pairwise alignment/ordering?
I've thought about using ...
1
vote
1
answer
59
views
Elbow method not giving a proper curve in Python code
I am trying to determine how many clusters to use for my k-means clustering using different methods.
First I used the following code to calculate different metrics per cluster number and different ...
3
votes
2
answers
455
views
Termination conditions for K-means and their interconnection
As far as I know, there are two termination criteria for K-means clustering algorithm:
assignments of data points do not change
centroids do not change
I wonder if there is any kind of relation ...
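The two criteria listed above can be compared with a minimal sketch of Lloyd's iterations in plain Python. This is an illustration under stated assumptions, not the asker's code: the function names `assign`, `update`, and `lloyd`, the toy points, and the starting centroids are all hypothetical. In this sketch, once the assignments stop changing, recomputing the means gives the same centroids, and unchanged centroids give the same assignments on the next pass.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def update(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(len(old)))
        )
    return new_centroids

def lloyd(points, centroids, max_iter=100):
    """Iterate and report which stopping criterion fired first."""
    labels = None
    for iteration in range(1, max_iter + 1):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            return centroids, labels, iteration, "assignments unchanged"
        labels = new_labels
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            return centroids, labels, iteration, "centroids unchanged"
        centroids = new_centroids
    return centroids, labels, max_iter, "max_iter reached"

# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centroids, labels, iteration, reason = lloyd(points, [(1.0, 0.0), (9.0, 0.0)])
print(centroids, labels, iteration, reason)
```

Implementations that compare centroids with a numerical tolerance rather than exact equality can stop earlier than the assignment-based check, so the equivalence shown here applies only to the exact version.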
1
vote
1
answer
46
views
Mathematics behind standardizing the data points in machine learning algorithms (e.g., K-means clustering)
For the K-means algorithm, among other methods using distance-based measurements to determine similarity between data points, why do we have to standardize the data points to mean 0 and standard ...
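One way to see why scale matters for a Euclidean-distance method is a small numerical sketch. The data below (income in dollars, age in years) and the helper `standardize` are hypothetical, chosen only to show how the feature with the larger numeric range dominates the raw distance and how rescaling each feature to mean 0 and standard deviation 1 puts them on comparable footing.

```python
import math
import statistics

def standardize(column):
    """Rescale a list of numbers to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(x - mean) / sd for x in column]

# Hypothetical data for four people.
income = [30_000.0, 32_000.0, 90_000.0, 92_000.0]
age = [25.0, 60.0, 26.0, 61.0]

raw = list(zip(income, age))
scaled = list(zip(standardize(income), standardize(age)))

# Person 1 has a similar income to person 0 but a very different age.
# Person 2 has a similar age to person 0 but a very different income.
raw_d01 = math.dist(raw[0], raw[1])
raw_d02 = math.dist(raw[0], raw[2])
scaled_d01 = math.dist(scaled[0], scaled[1])
scaled_d02 = math.dist(scaled[0], scaled[2])

print("raw distances:   ", raw_d01, raw_d02)        # income difference dominates
print("scaled distances:", scaled_d01, scaled_d02)  # both differences count similarly
```

On the raw data the 35-year age gap barely registers next to the income gap; after standardization the two gaps contribute about equally. Whether that is desirable depends on the application.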
0
votes
1
answer
53
views
Continuous monitoring of KMeans model post production
I am in the process of deploying a KMeans model for a customer segmentation use case into production. KMeans doesn’t produce the same results every time, and after production cluster sizes and arrangements ...
0
votes
1
answer
169
views
Proving that K-means corresponds to an EM algorithm?
Just wanted to make sure that my proof is correct and that I am not missing anything in the process. Any thoughts?
"To demonstrate mathematically that the K-means algorithm corresponds to an ...
2
votes
1
answer
36
views
Can I use kmeans on paired data?
I want to see if a treatment brings patients closer to controls using multiple dependent variables. Can I do kmeans and see if the controls are separate from the patients before treatment, but cluster ...
4
votes
2
answers
409
views
Question about Silhouette index calculation using scikit
I am currently working with continuous data measured from different sensors (thermometers and voltmeters). I have a matrix whose columns represent the sensors and the rows are normalized measurements (...
0
votes
0
answers
41
views
Turning heatmap into clusters - Classification
Assume that you have a heatmap that looks like this. The goal is to classify all the "dots" inside the image. How can that be done?
The assumptions of the image:
The image always has black ...
0
votes
1
answer
49
views
In unsupervised learning, is a result of 2 clusters meaningful?
I used both agglomerative clustering and k-means on a dataset and see the results below. The result from agglomerative clustering was evaluated with the silhouette score, while k-means was evaluated with inertia. ...
1
vote
0
answers
18
views
Method to find group associated with a target variable [closed]
The business question that I am trying to answer is: what group(s) of people have the highest chance of default? The features that I have are income, debt to income ratio, fico, etc. How do I find the ...
1
vote
1
answer
110
views
How to tell whether segments from K Means clustering result are "successful" and will impact business metrics?
Background
I'm a data analyst. The business unit I'm assigned to needs to segment users into power vs non-power users so they can target each segment with proper treatments.
Goal
Segment users (...
0
votes
1
answer
337
views
Dummy Variable Trap in KMeans Clustering
My data set has a Gender column, so I have to apply one-hot encoding to perform KMeans clustering.
Q1. Should I take care about ...
2
votes
1
answer
160
views
Clustering algorithm puts data points that are visually far apart in the same cluster
I am trying to cluster a very large set of data points, of roughly (20000, 100) shape. I could not run density based DBSCAN or SpectralClustering due to the ...
1
vote
2
answers
384
views
Interpreting results of K-means after PCA
I have this dataset about an airline company's customers with 22 explanatory variables. My goal is to perform some sort of customer segmentation with the k-means algorithm. One problem that I've found ...
0
votes
0
answers
66
views
General technique for loss function minimization
I was trying to rationalize the K-Means algorithm and came up with the following thoughts.
Suppose we need to compute:
$T=\min_x L(x)$
but we struggle because $L$ is complex. Suppose we find $L'$ s.t.:
...
1
vote
1
answer
1k
views
Elbow method vs. gap statistic: which one? Challenging for data scientists
I am working on hourly-weather data. It contains four features: rain, wind speed, humidity, and temperature. Obviously, all of them are continuous values. The number of records is around 17000. Other ...
1
vote
2
answers
92
views
Can I use K-Means to group customers based on a single variable?
I have a test dataset of 11m records. The dataset contains a global customer id and spend figure.
I need to group customers into the following categories:
0 Low
1 Low/Med
2 Med
3 Med/High
4 High
I ...
0
votes
0
answers
58
views
How to identify the clusters in SSE plot?
How to determine the number of clusters from the following plot?
1
vote
1
answer
336
views
Unsupervised learning: How to identify differences between clusters?
I'm learning about unsupervised learning and I tried to use KMeans, AgglomerativeClustering and DBSCAN on the same dataset. The result was OK; they seem to work fine according to silhouette_score() ...
2
votes
0
answers
37
views
Does it make sense to transform a feature containing hours (24h) into two features with xy-coordinates of each hour in the space? [duplicate]
I have a clustering problem that I might solve with an algorithm based on Euclidean distance (e.g. K-Means).
One potential feature is the "hour" at which each user began an interaction.
As ...
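The transformation described in this question can be sketched in a few lines. This is a minimal illustration, assuming hours are numbers in [0, 24); the helper names `hour_to_xy` and `circle_distance` are hypothetical. Each hour is placed on the unit circle, so that under Euclidean distance 23:00 and 00:00 end up as close together as 00:00 and 01:00.

```python
import math

def hour_to_xy(hour, period=24):
    """Map an hour in [0, period) to a point on the unit circle."""
    angle = 2 * math.pi * hour / period
    return (math.cos(angle), math.sin(angle))

def circle_distance(h1, h2):
    """Euclidean distance between two hours after the circular mapping."""
    return math.dist(hour_to_xy(h1), hour_to_xy(h2))

print(circle_distance(23, 0), circle_distance(0, 1))  # the same one-hour gap
print(circle_distance(0, 12))                         # opposite sides of the circle
```

Note that the resulting distance is a chord length, not the elapsed time, so it grows more slowly than the hour difference for hours that are far apart.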
0
votes
0
answers
19
views
How do I choose k for k means clustering [duplicate]
Given a set of points, I'm trying to find the right cluster. However, I am lost on what the process is. Here is the graph of all possible points.
I am unsure what I should look at.
2
votes
1
answer
370
views
Choosing the best clustering algorithm and evaluating the results
I'm trying to separate my data into clusters using the k-means algorithm and the hierarchical algorithm, choose which algorithm fits my data the best, and evaluate the results. However, all of my ...
0
votes
0
answers
31
views
How to interpret the Scatter Plot result from PCA? [duplicate]
I have a project in school about clustering analysis. I have applied standardization and principal component analysis (PCA) to my dataset (I used K-means), which is about heart disease patients. I ...
1
vote
1
answer
140
views
In $k$-means, how is it NP-hard if the dimensionality of the data is at least $2$ ($d\geq 2$)?
In $k$-means, how is it NP-hard if the dimensionality of the data is at least $2$ ($d\geq 2$)? Can someone justify or give reasons to this statement?
Any guidance would be appreciated.
0
votes
0
answers
22
views
K-means on linearly projected features
I am looking for references on K-Means applied to linearly projected features instead of to the original features, in the sense that both K-Means and the projection matrix are learned at the same time....
1
vote
0
answers
48
views
Can K-means put most of the noise in the same cluster?
I am working on clustering text data (very short sentences) vectorized with tf-idf. The data are characterized by high sparseness and the presence of abundant noise (considered here as documents that ...