
Questions tagged [k-means]

k-means is a method to partition data into clusters by finding a specified number of means, k, such that when each data point is assigned to the cluster with the nearest mean, the within-cluster sum of squares is minimized.
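
As a rough illustration of the definition above, here is a minimal NumPy sketch of the usual alternating (Lloyd-style) iterations together with the within-cluster sum of squares they try to reduce. The function names `wcss` and `lloyd_kmeans`, the random initialisation at data points, and the toy data are assumptions made for this example only; the procedure reaches a local optimum, not necessarily the global minimum.

```python
import numpy as np

def wcss(X, labels, centers):
    """Within-cluster sum of squares: squared distance of each point to its assigned mean."""
    return float(np.sum((X - centers[labels]) ** 2))

def assign(X, centers):
    """Assign each point to the cluster with the nearest mean (squared Euclidean distance)."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations (hypothetical helper, illustrative only)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise the means at k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centers)
        # Update step: each mean becomes the average of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its previous mean
                new_centers[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment so that labels match the returned means.
    return assign(X, centers), centers

# Toy data: two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = lloyd_kmeans(X, k=2)
total_wcss = wcss(X, labels, centers)
print(labels, total_wcss)
```

In practice a library implementation with several restarts is usually preferable, since a single run can settle in a poor local optimum.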

0 votes
0 answers
29 views

Why bother with k-means number of clusters? Why not generate them all and see which one works? [closed]

I'm a sociologist with a CS background. I'm analyzing longitudinal data and I'm not up to speed with the statistical lingo around the whole thing. I'm trying to figure out the statistical names of the ...
Guillaume's user avatar
2 votes
0 answers
16 views

How to cluster based on x and y coordinates

I am trying to identify rows in groups of points using clustering algorithms. The bigger picture problem I'm trying to solve is to identify shelves given x and y coordinates of products. I can cluster ...
Tommy Wolfheart's user avatar
0 votes
0 answers
11 views

Identify predictors for clustering output?

I have a dataset with variables collected years ago, and many variables collected this year as outcome variables. I want to combine all the variables collected this year to get one outcome, e.g. ...
NPpsy's user avatar
  • 43
1 vote
0 answers
19 views

Question about running k means cluster analysis

In a previous analysis I had 3 groups of subjects - group x with 35 subjects, control group y with 25 subjects, and control group z with 25 subjects. For each group I have levels of 6 different ...
FastBallooningHead's user avatar
1 vote
0 answers
40 views

Question on using the elbow method for calculating ideal number of clusters for k means cluster analysis

Newb to cluster analysis here. I have a group of 35 subjects. For all of the subjects I have data for different measures of IQ (verbal, math, etc) and different biomarkers. There are 6 IQ measures in ...
FastBallooningHead's user avatar
0 votes
0 answers
19 views

Clustering Mixed Data Types: Algorithm Selection, Distance Measurement, and Feature Weighting

I have a database of 74,000 records with 29 features. Fourteen of these features are categorical and are either 0 or 1, while the other 15 features are continuous and have been normalized and scaled ...
peiman razavi's user avatar
1 vote
1 answer
28 views

Is this the right approach to cluster using many different evaluations on the same dimension?

I'm working on a project where I want to sort political parties into two groups. I want to do so using the answers of many respondents in a survey who indicated for each party where they see them on a ...
1 vote
0 answers
38 views

What if PCA is unable to group my samples, but K-means perfectly clusters them? Is there any problem with my data analysis? Is it possible? [closed]

I am not an expert, but I am currently using unsupervised methods to better explain my mass spectrometry data obtained via DART-MS analyses. I am still learning. It turned out that when analyzing my ...
Isabela's user avatar
  • 11
0 votes
0 answers
14 views

calculation of the C-index clustering for manual [duplicate]

Can anyone give me an example of working on the C-index clustering validity test, but calculating manually??
Raaa's user avatar
  • 1
0 votes
0 answers
25 views

calculation of the C-index clustering [duplicate]

Can anyone give me an example of working on the C-index clustering validity test, but calculating manually??
Raaa's user avatar
  • 1
1 vote
0 answers
22 views

Spatial Temporal Clustering evenly spaced over time

I have a large dataset of spatio-temporal data. It has longitude and latitude coordinates, and a date for each observation. For example: Long Lat Date 50 20.43 9-19-2010 51 19.5 10-4-2010 51 19.3 ...
Robertmg's user avatar
  • 121
0 votes
0 answers
11 views

What are the right metrics to validate the performance of a custom clustering model with three possible outcomes?

I have developed a custom clustering model on top of MiniBatchKmeans, that has three possible outcomes for each data point: Assign the point to the correct cluster. Assign the point to the wrong ...
Sanjay Mythili's user avatar
0 votes
0 answers
24 views

Curse of dimensionality in Time series with K-means

I have been looking at the following notebook: time series clustering where the writer says that the dataset is affected by the "Curse of Dimensionality", so applying TimeSeriesKMeans ...
Zackbord's user avatar
3 votes
1 answer
28 views

What is "clall" in index.Gap in "clusterSim" R package?

I am using the "clusterSim" package in my project (https://cran.r-project.org/web/packages/clusterSim/clusterSim.pdf, page 39) and I do not understand the meaning of the "clall" ...
user2702's user avatar
0 votes
0 answers
62 views

Variable importance in cluster analysis

I'm new to cluster analysis. I have read a lot, but I'm not able to understand how the variables are ordered into clusters. I mean, I find that my data are clustered into 3 different clusters, but ...
Riccardo's user avatar
  • 101
0 votes
0 answers
16 views

Should the same environmental variable measured with different methods be removed before K-means? What about variables represented separately and by their ratio?

So I'm running K-means clustering algorithm on environmental variables measured on different locations. The aim is to see if the environmental variables can be clustered into separate clusters. Same ...
Cordex's user avatar
  • 77
1 vote
0 answers
125 views

k-means clustering on a probability distribution instead of a dataset

Normally, clustering algorithms such as $k$-means are defined on a dataset in the following sense: if $D$ is a dataset, find a partition of $D$ into sets $\{S_1, \dots, S_n\}$ that minimises the ...
Harry Partridge's user avatar
0 votes
1 answer
150 views

Applying clustering algorithms after t-SNE in R

So I'm doing my bachelor's work and I'm applying different clustering algorithms on certain data. Before all the clustering, of course, I'm using a dimensionality reduction algorithm such as t-SNE for ...
2 votes
1 answer
204 views

What is the standard threshold value that is best for accuracy when employing Euclidean distance as a metric for gauging textual similarity?

I'm using Euclidean distance as a metric to compare two sentences for similarity while clustering them using my custom incremental KMeans algorithm. The current threshold value I'm using is 0.7 which ...
sanjay M's user avatar
0 votes
0 answers
10 views

What is normalized winning frequency in kernel self organizing map(SOM)?

In the k-means based kernel SOM, proposed by MacDonald and Fyfe (2000), the update of the mean is based on a soft learning algorithm mi(t + 1) = mi(t) + Λ[φ(x) − mi(t)] where Λ is the normalized ...
Anshuman Jayaprakash's user avatar
0 votes
0 answers
43 views

Why does this K-Means cluster example show 'overlap' between clusters?

I was reading the hypertools docs and came across this pictorial that shows 10 clusters (some seem to share very similar coloring) generated from some (mushroom) ...
Vincent Karuri's user avatar
1 vote
0 answers
68 views

K-means clustering - weird PCA visualization

I performed PCA on 4 variables, and the results are shown in this visualization: At first glance it doesn't look convincing, and some clusters seem weird. The data was cleaned and standardized beforehand. Only ...
Simon's user avatar
  • 11
0 votes
1 answer
20 views

K means clustering of image with k=1 vs mean of all pixels

I have relatively uniformly colored images and I extracted colors using k-means. k means 1 showed the best results for my modeling purposes, k means 2 not so much, and with k-means 3 there ceased to ...
phil27's user avatar
  • 1
0 votes
0 answers
21 views

Method for pairwise ordering two datasets

Given two rather small but unordered multidimensional vectors/datasets (e.g. sets of a handful of 3D coordinates), what is a simple method for pairwise alignment/ordering? I've thought about using ...
joaocandre's user avatar
1 vote
1 answer
59 views

Elbow method not giving a proper curve in python code

I am trying to determine how many clusters to use for my k-means clustering using different methods. First I used the following code to calculate different metrics per cluster number and different ...
rebwar's user avatar
  • 11
3 votes
2 answers
455 views

Termination conditions for K-means and their interconnection

As far as I know, there are two termination criteria for the K-means clustering algorithm: (1) assignments of data points do not change; (2) centroids do not change. I wonder if there is any kind of relation ...
Artem Tartakovskiy's user avatar
1 vote
1 answer
46 views

Mathematics behind standardizing the data points in machine learning algorithms (e.g., K-means clustering)

For the K-means algorithm, among other methods that use distance-based measurements to determine similarity between data points, why do we have to standardize the data points with mean as 0 and standard ...
Sophia's user avatar
  • 121
0 votes
1 answer
53 views

Continuous monitoring of KMeans model post production

In the process of deploying a KMeans model for a customer segmentation use case into production. KMeans doesn’t produce the same results every time and after production cluster sizes and arrangements ...
ibarbo's user avatar
  • 65
0 votes
1 answer
169 views

Proving that K-means corresponds to an EM algorithm?

Just wanted to make sure that my proof is correct and that I am not missing anything in the process. Any thoughts? " To demonstrate mathematically that the K-means algorithm corresponds to an ...
Naomi Pomella's user avatar
2 votes
1 answer
36 views

Can I use kmeans on paired data?

I want to see if a treatment brings patients closer to controls using multiple dependent variables. Can I do kmeans and see if the controls are separate from the patients before treatment, but cluster ...
maglorismyspiritanimal's user avatar
4 votes
2 answers
409 views

Question about Silhouette index calculation using scikit

I am currently working with continuous data measured from different sensors (thermometers and voltmeters). I have a matrix whose columns represent the sensors and the rows are normalized measurements (...
slow_learner's user avatar
0 votes
0 answers
41 views

Turning heatmap into clusters - Classification

Assume that you have a heatmap that looks like this. The goal is to classify all the "dots" inside the image. How can that be done? The assumptions of the image: The image has always black ...
euraad's user avatar
  • 425
0 votes
1 answer
49 views

In unsupervised learning, is a result of 2 clusters meaningful?

I used both agglomerative clustering and k-means on a dataset and see the results below. Result from agglomerative clustering was demonstrated with silhouette score while kmeans with inertia score. ...
LCheng's user avatar
  • 219
1 vote
0 answers
18 views

Method to find group associated with a target variable [closed]

The business question that I am trying to answer is: what group(s) of people have the highest chance of default? The features that I have are income, debt to income ratio, fico, etc. How do I find the ...
Victoria B's user avatar
1 vote
1 answer
110 views

How to tell whether segments from K Means clustering result are "successful" and will impact business metrics?

Background: I'm a data analyst. The business unit I'm assigned to needs to segment users based on power vs non-power users so they can target each segment with proper treatments. Goal: Segment users (...
Blaze Tama's user avatar
0 votes
1 answer
337 views

Dummy Variable Trap in KMeans Clustering

My data set has a column Gender, so I have to apply One Hot Encoding to perform KMeans Clustering. Q1. Should I take care about ...
mainak mukherjee's user avatar
2 votes
1 answer
160 views

Clustering algorithms puts data points that are visually far apart in same cluster

I am trying to cluster a very large set of data points, of roughly (20000, 100) shape. I could not run density based DBSCAN or SpectralClustering due to the ...
pingo's user avatar
  • 29
1 vote
2 answers
384 views

Interpreting results of K-means after PCA

I have this dataset about an airline company customers with 22 explanatory variables. My goal is to perform some sort of customer segmentation with the k-means algorithm. One problem that I've found ...
ScarceChicken's user avatar
0 votes
0 answers
66 views

General technique for loss function minimization

I was trying to rationalize the K-Means algorithm and came up with the following thoughts. Suppose we need to compute: $T=\min_x L(x)$ but we struggle because $L$ is complex. Suppose we find $L'$ s.t.: ...
Thomas's user avatar
  • 952
1 vote
1 answer
1k views

Elbow method Vs Gap statistics, which one? challenging for data scientist

I am working on hourly-weather data. It contains four features: rain, wind speed, humidity, and temperature. Obviously, all of them are continuous values. The number of records is around 17000. Other ...
Asa Ya's user avatar
  • 73
1 vote
2 answers
92 views

Can I use K-Means to group customers based on a single variable?

I have a test dataset of 11m records. The dataset contains a global customer id and spend figure. I need to group customers into the following categories: 0 Low, 1 Low/Med, 2 Med, 3 Med/High, 4 High. I ...
John Edwards's user avatar
0 votes
0 answers
58 views

How to identify the clusters in SSE plot?

How to determine the number of clusters from the following plot?
Niro's user avatar
  • 1
1 vote
1 answer
336 views

Unsupervised learning: How to identify differences between clusters?

I'm learning about unsupervised learning and I tried to use KMeans, AgglomerativeClustering and DBSCAN on the same dataset. The result was ok, they seem to work fine according to silhouette_score() ...
Antonio Caipora's user avatar
2 votes
0 answers
37 views

Does it make sense to transform a feature containing hours (24h) into two features with xy-coordinates of each hour in the space? [duplicate]

I have a clustering problem that I might solve with an algorithm based on Euclidean distance (e.g. K-Means). One potential feature is the "hour" at which each user began an interaction. As ...
rusiano's user avatar
  • 566
0 votes
0 answers
19 views

How do I choose k for k means clustering [duplicate]

Given a set of points, I'm trying to find the right cluster. However, I am lost on what the process is. Here is the graph of all possible points. I am unsure what I should look at
2 votes
1 answer
370 views

Choosing the best clustering algorithm and evaluating the results

I'm trying to separate my data into clusters using the k-means algorithm and the hierarchical algorithm, choose which algorithm fits my data the best, and evaluate the results. However, all of my ...
Jim's user avatar
  • 61
0 votes
0 answers
31 views

How to interpret the Scatter Plot result from PCA? [duplicate]

I have a project in school about clustering analysis. I have applied standardization and principal component analysis (PCA) to my dataset (I used K-means), which is about heart disease patients. I ...
AK6000W's user avatar
1 vote
1 answer
140 views

In $k$-means, how is it NP-hard if the dimensionality of the data is at least $2$ ($d\geq 2$)?

In $k$-means, how is it NP-hard if the dimensionality of the data is at least $2$ ($d\geq 2$)? Can someone justify or give reasons to this statement? Any guidance would be appreciated.
Maryam Faheem's user avatar
0 votes
0 answers
22 views

K-means on linearly projected features

I am looking for references on K-Means applied to linearly projected features instead of to the original features, in the sense that both K-Means and the projection matrix are learned at the same time....
f10w's user avatar
  • 213
1 vote
0 answers
48 views

Can K-means put most of the noise in the same cluster?

I am working on clustering text data (very short sentences) vectorized with tf-idf. The data are characterized by high sparseness and the presence of abundant noise (considered here as documents that ...
zurgo's user avatar
  • 11
