I created a DocumentTermMatrix (dtm) in R using tm. If I understand correctly, this matrix shows, for each document, how often each possible term occurs. When I inspect this matrix I get

    Terms
Docs     can design door easy finish include light provide use water
  176004   1      2   11    8      0       3     3       4   4     4
  181288   1      2   11    8      0       2     3       4   4     4
  182465   4      4    0    2      0       0    42      13   6     0
etc.

How can I now retrieve the vector of (for example) document 181288? So I will get something like

1      2   11    8      0       2     3       4   4     4 ………

Also, it says my dtm's sparsity is 100%. Does that mean it is (approximately) 100% empty?

1 Answer

You can retrieve your vector in several ways.

Simple, but not recommended except for a quick test:

my_doc <- inspect(dtm[dtm$dimnames$Docs == "181288",])

Doing it this way limits you to what inspect() displays, which is at most 10 documents.

A better way: create a selection vector and use it to filter the dtm. This keeps the sparse matrix format; you can then convert what you need into a data.frame for further manipulation.

my_selection <- c("181288", "182465")

# selection in case of dtm
my_dtm_selection <- dtm[dtm$dimnames$Docs %in% my_selection, ]

# selection in case of tdm
my_tdm_selection <- tdm[, tdm$dimnames$Docs %in% my_selection]

# create data.frame with document names as first column, followed by the terms
my_df_selection <- data.frame(docs = Docs(my_dtm_selection), as.matrix(my_dtm_selection))
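To get a single document as a plain named vector, as asked in the question, one option is to filter down to that one row and then drop the matrix dimension. A minimal sketch, assuming a toy dtm built from a hand-made count matrix (the counts are illustrative, loosely based on the question's output, and `counts`, `one_doc` and `doc_vector` are names made up here):

```r
library(tm)

# Toy counts standing in for the real dtm (illustrative values only)
counts <- matrix(
  c(1, 2, 11,
    1, 2, 11,
    4, 4,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs  = c("176004", "181288", "182465"),
                  Terms = c("can", "design", "door"))
)
dtm <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(counts),
                             weighting = weightTf)

# Keep only the wanted document; the result is still a sparse dtm
one_doc <- dtm[Docs(dtm) == "181288", ]

# Convert the one-row selection to a dense matrix and take that row,
# which gives a numeric vector named by term
doc_vector <- as.matrix(one_doc)[1, ]
print(doc_vector)
```

Because as.matrix() produces a dense matrix, it is usually safer to filter first and convert only the small selection.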

The answer to your second question: yes, it is almost empty. Or, better put, it has a lot of empty cells. But you might still have more data than you think if you have a lot of documents and terms.
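For reference, sparsity is the share of cells that are zero. A rough sketch of computing it without rounding, assuming the dtm is stored as a slam simple triplet matrix (as tm does), where the `v` component holds only the non-zero entries. The toy matrix and the names `toy_dtm`, `non_zero`, `total` and `sparsity` are made up for illustration:

```r
library(tm)

# Toy 2 x 4 count matrix with only two non-zero cells
counts <- matrix(
  c(0, 0, 3, 0,
    0, 1, 0, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(Docs = c("a", "b"), Terms = c("w", "x", "y", "z"))
)
toy_dtm <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(counts),
                                 weighting = weightTf)

non_zero <- length(toy_dtm$v)    # number of stored (non-zero) entries
total    <- prod(dim(toy_dtm))   # documents x terms
sparsity <- 1 - non_zero / total
print(sparsity)                  # 0.75 for this toy matrix
```

The summary printed for a dtm appears to show this value as a whole percentage, so a reported 100% can still leave many non-zero counts in a large matrix.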

  • I wrote my_df_selection[[5]] and I get two values: 0 0. I assume this means that the word at [[5]] occurs 0 times in both document 181288 and document 182465?
    – Rich_Rich
    Commented Jun 21, 2018 at 11:36
  • That is correct. It is better to use my_df_selection[5]. Then you will get a small table with the document names and the term as the column header.
    – phiver
    Commented Jun 21, 2018 at 11:48
  • You have been a big help!
    – Rich_Rich
    Commented Jun 21, 2018 at 12:08
  • I updated the answer to include a tdm version. It is almost the same; only the placement of the document selection differs.
    – phiver
    Commented Jun 21, 2018 at 14:02
  • Thanks a lot! If all goes well, I will now be able to calculate the cosine similarity between the documents given in my_selection <- c(a, b)
    – Rich_Rich
    Commented Jun 21, 2018 at 14:15
