Cosine Similarity with two Term Frequency vectors in R

Question

Using tm in R, I made a DocumentTermMatrix (dtm). If I understand correctly, this matrix shows, for each document, how often each term occurs. When I inspect this matrix I get

    Terms
Docs     can design door easy finish include light provide use water
  176004   1      2   11    8      0       3     3       4   4     4
  181288   1      2   11    8      0       2     3       4   4     4
  182465   4      4    0    2      0       0    42      13   6     0
etc.

How can I now retrieve the vector for (for example) document 181288, so that I get something like

1      2   11    8      0       2     3       4   4     4 ………

Also, it says my dtm's sparsity is 100%. Does that mean it is (approximately) 100% empty?

phiver · Accepted Answer · 2018-06-21 14:02:04Z

You can retrieve your vector in several ways.

Simple, but not recommended except for a quick test:

my_doc <- inspect(dtm[dtm$dimnames$Docs == "181288",])

Doing it this way limits you to what inspect does, and inspect shows at most 10 documents.

A better way: create a selection vector and use it to filter the dtm. This keeps the sparse matrix format; you can then convert the part you need into a data.frame for further manipulation.

my_selection <- c("181288", "182465")

# selection in case of dtm
my_dtm_selection <- dtm[dtm$dimnames$Docs %in% my_selection, ]

# selection in case of tdm
my_tdm_selection <- tdm[, tdm$dimnames$Docs %in% my_selection]

# create data.frame with document names as first column, followed by the terms
my_df_selection <- data.frame(docs = Docs(my_dtm_selection), as.matrix(my_dtm_selection))
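If you want a plain named vector for one document rather than a data.frame, row indexing on the dense matrix does it. This is a minimal sketch: the matrix m is a hand-built stand-in for as.matrix(dtm), holding only the three rows and ten terms shown in the question, and the names m, terms and doc_vec are just for illustration.

```r
# Stand-in for as.matrix(dtm): only the three rows shown in the question.
terms <- c("can", "design", "door", "easy", "finish",
           "include", "light", "provide", "use", "water")
m <- matrix(
  c(1, 2, 11, 8, 0, 3,  3,  4, 4, 4,
    1, 2, 11, 8, 0, 2,  3,  4, 4, 4,
    4, 4,  0, 2, 0, 0, 42, 13, 6, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("176004", "181288", "182465"), Terms = terms)
)

# Indexing a single row by document name drops to a named numeric vector,
# with the terms as names.
doc_vec <- m["181288", ]
print(doc_vec)
```

On the real dtm, filtering first and converting afterwards, for example as.matrix(dtm[Docs(dtm) == "181288", ])[1, ], should give the same kind of vector while avoiding a dense copy of the whole matrix.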

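The answer to your second question: yes, almost empty. Or better framed, almost all cells are zero. The reported sparsity is a rounded percentage, so 100% does not mean that every cell is empty, and you might have more data than you think if you have a lot of documents and terms.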
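To make the sparsity figure concrete, here is a small sketch that computes the share of zero cells by hand. It assumes a dense stand-in matrix m built from the three rows in the question (not your full dtm), so the number it gives is for this excerpt only.

```r
# Stand-in for as.matrix(dtm): only the three rows shown in the question.
m <- matrix(
  c(1, 2, 11, 8, 0, 3,  3,  4, 4, 4,
    1, 2, 11, 8, 0, 2,  3,  4, 4, 4,
    4, 4,  0, 2, 0, 0, 42, 13, 6, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("176004", "181288", "182465"), Terms = NULL)
)

# Sparsity = fraction of cells that are zero (6 of the 30 cells here).
sparsity <- mean(m == 0)
print(sparsity)

# Assumption: on the sparse dtm itself the same share can be read off
# without a dense copy, since dtm$v stores only the non-zero entries:
#   1 - length(dtm$v) / prod(dim(dtm))
```

With thousands of terms, most documents use only a small fraction of them, which is why the rounded figure ends up at 100%.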

edited Jun 21, 2018 at 14:02

answered Jun 21, 2018 at 10:22

I wrote my_df_selection[[5]] and I get two values: 0 0. I assume this means that the word in column 5 occurs 0 times in both document 181288 and document 182465?
– Rich_Rich
Commented Jun 21, 2018 at 11:36
That is correct. It is better to use my_df_selection[5]. Then you get a small table with the document names and the term as the column header.
– phiver
Commented Jun 21, 2018 at 11:48
You have been a big help!
– Rich_Rich
Commented Jun 21, 2018 at 12:08
I updated the answer to include a tdm version. It is almost the same; only the placement of the document selection differs.
– phiver
Commented Jun 21, 2018 at 14:02
Thanks a lot! If all goes well, I will now be able to calculate the cosine similarity between the documents given in my_selection <- c(a, b)
– Rich_Rich
Commented Jun 21, 2018 at 14:15
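
For that last step, cosine similarity is the dot product of the two term frequency vectors divided by the product of their lengths: cos(a, b) = sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))). A minimal sketch in base R, assuming the two vectors are the rows for documents 181288 and 182465 as shown in the question (ten terms only); cosine_similarity is a helper written for this example, not a tm function.

```r
# Cosine similarity of two equal-length numeric vectors.
# Returns NA if either vector is all zeros, since the angle is undefined.
cosine_similarity <- function(a, b) {
  stopifnot(length(a) == length(b))
  norm_a <- sqrt(sum(a^2))
  norm_b <- sqrt(sum(b^2))
  if (norm_a == 0 || norm_b == 0) {
    return(NA_real_)
  }
  sum(a * b) / (norm_a * norm_b)
}

# Term frequencies copied from the inspect output in the question.
doc_a <- c(1, 2, 11, 8, 0, 2,  3,  4, 4, 4)   # document 181288
doc_b <- c(4, 4,  0, 2, 0, 0, 42, 13, 6, 0)   # document 182465

sim <- cosine_similarity(doc_a, doc_b)
print(sim)
```

Because term frequencies are never negative, the result lies between 0 (no shared terms) and 1 (same direction). On the full dtm the vectors would cover every term, so the value will differ from the one for this ten-term excerpt.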
