I have a dataset of 310,225 tweets. I want to find out how many tweets were same or similar. I calculated the similarity between the tweets using Quanteda's textstat frequency. I found that the frequency of distance 1 and 0.9999 in the similarity matrix as below:
0.9999 1
2288 162743
Here's my code:
dfmat_users <- dfm_data %>%
dfm_select(min_nchar = 2) %>%
dfm_trim(min_termfreq = 10)
dfmat_users <- dfmat_users[ntoken(dfmat_users) > 10,]
tstat_sim <- textstat_simil(dfmat_users, method = "cosine", margin = "documents", min_simil = 0.9998)
table(tstat_sim@x) #result of this code is given above.
I need to find out the number of similar or same tweets in the dataset. How should I interpret the results above?