
I have a dataset of 310,225 tweets. I want to find out how many tweets are the same or similar. I computed pairwise similarities between the tweets using quanteda's textstat_simil(). Tabulating the values stored in the resulting similarity object, the counts for similarity values 0.9999 and 1 are as below:

0.9999            1 
 2288           162743 

Here's my code:

dfmat_users <- dfm_data %>%
    dfm_select(min_nchar = 2) %>%
    dfm_trim(min_termfreq = 10)

dfmat_users <- dfmat_users[ntoken(dfmat_users) > 10,]

tstat_sim <- textstat_simil(dfmat_users, method = "cosine", margin = "documents", min_simil = 0.9998)

table(tstat_sim@x) #result of this code is given above.

I need to find out the number of similar or same tweets in the dataset. How should I interpret the results above?

1 Answer

The easiest way is to convert the textstat_simil() output to a data.frame of unique pairs, then filter for the pairs whose cosine value is at or above your threshold (here, 0.9999).

To illustrate, we can reshape the built-in inaugural address corpus into sentences, compute the similarity matrix on these, coerce it to a data.frame, and then use dplyr to filter the results you want.

library("quanteda", warn.conflicts = FALSE)
## Package version: 2.1.0.9000
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.

sim_df <- data_corpus_inaugural %>%
  corpus_reshape(to = "sentences") %>%
  dfm() %>%
  textstat_simil(method = "cosine") %>%
  as.data.frame()

nrow(sim_df)
## [1] 12508670

For your data, adjust the condition below to 0.9999; here I'm using 0.99 as an illustration.

library("dplyr", warn.conflicts = FALSE)
filter(sim_df, cosine > .99)
##            document1       document2 cosine
## 1    1861-Lincoln.69 1861-Lincoln.71      1
## 2    1861-Lincoln.69 1861-Lincoln.73      1
## 3    1861-Lincoln.71 1861-Lincoln.73      1
## 4  1953-Eisenhower.6   1985-Reagan.6      1
## 5  1953-Eisenhower.6    1989-Bush.15      1
## 6      1985-Reagan.6    1989-Bush.15      1
## 7      1989-Bush.140  2009-Obama.108      1
## 8      1989-Bush.140   2013-Obama.87      1
## 9     2009-Obama.108   2013-Obama.87      1
## 10     1989-Bush.140    2017-Trump.9      1
## 11    2009-Obama.108    2017-Trump.9      1
## 12     2013-Obama.87    2017-Trump.9      1

(And: yeah, that's a very fast computation of cosine similarity between 12.5 million sentence pairs!)
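To get from the filtered pairs back to your actual question, the number of tweets involved in a same-or-similar pair, you can count the distinct document names appearing in either column of the pair data.frame. A minimal sketch, assuming a sim_df built as above (columns document1, document2, cosine) and your 0.9999 threshold:

```r
library("dplyr", warn.conflicts = FALSE)

# keep only near-duplicate pairs (adjust the threshold for your data)
near_dupes <- filter(sim_df, cosine >= 0.9999)

# each document (tweet) may appear in many pairs, so count distinct names
# across both columns of the pair table
n_similar <- n_distinct(c(
    as.character(near_dupes$document1),
    as.character(near_dupes$document2)
))
n_similar
```

Note that this counts documents that have at least one near-duplicate partner; it is not the same as the number of rows in near_dupes, since a cluster of k identical tweets produces k * (k - 1) / 2 pairs but only k documents.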
