
I have a dataset of 310,225 tweets. I want to find out how many tweets are the same or similar. I computed pairwise similarities between the tweets using quanteda's textstat_simil(). Tabulating the values stored in the resulting similarity object, the counts for similarity values 0.9999 and 1 are as below:

0.9999            1 
 2288           162743 

Here's my code:

dfmat_users <- dfm_data %>%
    dfm_select(min_nchar = 2) %>%
    dfm_trim(min_termfreq = 10)

dfmat_users <- dfmat_users[ntoken(dfmat_users) > 10,]

tstat_sim <- textstat_simil(dfmat_users, method = "cosine", margin = "documents", min_simil = 0.9998)

table(tstat_sim@x) #result of this code is given above.

I need to find out the number of similar or same tweets in the dataset. How should I interpret the results above?

1 Answer

The easiest way is to convert the textstat_simil() output to a data.frame of unique pairs, then filter for the pairs whose cosine value is at or above your threshold (here, 0.9999).

To illustrate, we can reshape the built-in inaugural address corpus into sentences, compute the similarity matrix on these, coerce it to a data.frame, and then use dplyr to filter the results you want.

library("quanteda", warn.conflicts = FALSE)
## Package version: 2.1.0.9000
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.

sim_df <- data_corpus_inaugural %>%
  corpus_reshape(to = "sentences") %>%
  dfm() %>%
  textstat_simil(method = "cosine") %>%
  as.data.frame()

nrow(sim_df)
## [1] 12508670

For your data, adjust the condition below to 0.9999; here I'm using 0.99 as an illustration.

library("dplyr", warn.conflicts = FALSE)
filter(sim_df, cosine > .99)
##            document1       document2 cosine
## 1    1861-Lincoln.69 1861-Lincoln.71      1
## 2    1861-Lincoln.69 1861-Lincoln.73      1
## 3    1861-Lincoln.71 1861-Lincoln.73      1
## 4  1953-Eisenhower.6   1985-Reagan.6      1
## 5  1953-Eisenhower.6    1989-Bush.15      1
## 6      1985-Reagan.6    1989-Bush.15      1
## 7      1989-Bush.140  2009-Obama.108      1
## 8      1989-Bush.140   2013-Obama.87      1
## 9     2009-Obama.108   2013-Obama.87      1
## 10     1989-Bush.140    2017-Trump.9      1
## 11    2009-Obama.108    2017-Trump.9      1
## 12     2013-Obama.87    2017-Trump.9      1

(And: yeah, that's a very fast computation of cosine similarity between 12.5 million sentence pairs!)
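To get from the filtered pairs back to your actual question, the number of tweets involved in a same-or-similar pair, you can count the distinct document names appearing in either column of the pair data.frame. A minimal sketch, assuming a sim_df built as above (columns document1, document2, cosine) and your 0.9999 threshold:

```r
library("dplyr", warn.conflicts = FALSE)

# keep only near-duplicate pairs (adjust the threshold for your data)
near_dupes <- filter(sim_df, cosine >= 0.9999)

# each document (tweet) may appear in many pairs, so count distinct names
# across both columns of the pair table
n_similar <- n_distinct(c(
    as.character(near_dupes$document1),
    as.character(near_dupes$document2)
))
n_similar
```

Note that this counts documents that have at least one near-duplicate partner; it is not the same as the number of rows in near_dupes, since a cluster of k identical tweets produces k * (k - 1) / 2 pairs but only k documents.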
