DocumentTermMatrix misses some words

Question

I am using DocumentTermMatrix to find a list of keywords in a long text. Most of the words in my list are found correctly, but a couple are missing. I would love to post a minimal working example, but the problem is this: one of the missed words is "insolvency" (so not a short word, as in the problem here), and it sits in a 32-page document. The word actually appears on page 7 of the text. But if I reduce my text with text <- text[7], then DocumentTermMatrix does find it! So I am not able to reproduce this with a minimal working example...

Do you have any ideas?

Below is a sketch of my script:

library(fastpipe)
library(openxlsx)
library(tm)

`%>>%` <- fastpipe::`%>>%`

source("cleanText.R") # Custom function to clean up the text from reports

keywords_xlsx <- read.xlsx(paste0(getwd(),"/Keywords.xlsx"),
                           sheet = "all",
                           startRow = 1,
                           colNames = FALSE,
                           skipEmptyRows = TRUE,
                           skipEmptyCols = TRUE)

keywords <- keywords_xlsx[1] %>>%
  tolower(as.character(.[,1]))

# Custom function to read pdfs
read <- readPDF(control = list(text = "-layout"))

# Extract text from pdf
report <- "my_report.pdf"
document <- Corpus(URISource(paste0("./Annual reports/", report)), readerControl = list(reader = read))
text <- content(document[[1]]) 

text <- cleanText(report, text) # This is a custom function to clean up the texts

# text <- text[7] # If I do this, my word is found! Otherwise it is missed

# Create a corpus  
text_corpus <- Corpus(VectorSource(text))


matrix <- t(as.matrix(inspect(DocumentTermMatrix(text_corpus,
                                                 list(dictionary = keywords,
                                                      list(wordLengths=c(1, Inf))
                                                 )
))))
  
  
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)

What package are all the non-base functions (e.g. Corpus, cleanText) from? — SamR, Commented Sep 28, 2022 at 9:54
Corpus is from the tm package (I believe?). cleanText is my own function, defined on another script. Going to edit my question — gitcanzo, Commented Sep 28, 2022 at 9:55

phiver · Accepted Answer · 2022-09-28 12:12:37Z

The problem lies in your use of inspect. Use inspect only to check interactively that your code works and that a dtm has any values. Never use inspect inside functions or transformations, because by default it shows at most 10 rows and 10 columns of a document-term matrix, and that truncated sample is what gets passed on to as.matrix.
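To see the truncation for yourself, here is a minimal sketch on a toy corpus (the 12 one-word documents are made up for illustration, not taken from the question). It compares the dimensions of the full matrix with whatever inspect hands back; the exact size of the sample may depend on your installed tm version, so check the printed dimensions rather than relying on a fixed number.

```r
library(tm)

# Hypothetical toy corpus: 12 documents, each holding one distinct word,
# so the full document-term matrix has 12 rows and 12 columns.
docs <- paste0("word", letters[1:12])
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Dense copy of the complete matrix.
full <- as.matrix(dtm)

# inspect() prints a summary plus a sample; keep whatever it returns.
shown <- inspect(dtm)

# Compare the two: the sample is expected to be smaller than the full matrix.
print(dim(full))
print(dim(shown))
```

If the second set of dimensions is smaller than the first, any term or document outside the sample is silently dropped, which would explain a keyword on page 7 of 32 going missing.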

Also, if you want the transpose of a dtm, build a TermDocumentMatrix directly instead of transposing the result.

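The call that builds your matrix should be:

# Build the term-document matrix directly; do not wrap the call in inspect().
# dictionary and wordLengths go side by side in one flat control list.
mat <- as.matrix(TermDocumentMatrix(text_corpus,
                                    control = list(dictionary = keywords,
                                                   wordLengths = c(1, Inf))))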

Note that turning a dtm / tdm into a matrix will use a lot more memory than having the data inside a sparse matrix.
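If you only need the keyword totals, one way to avoid the dense matrix is to sum over the sparse object directly. The sketch below assumes the slam package (a dependency of tm, so it should already be installed) and uses three made-up pages and a made-up keyword list in place of the real report.

```r
library(tm)

# Hypothetical stand-ins for the cleaned pages and the keyword list.
pages <- c("insolvency risk and insolvency law",
           "credit risk report",
           "annual report on risk")
keywords <- c("insolvency", "risk", "credit")

corpus <- VCorpus(VectorSource(pages))
tdm <- TermDocumentMatrix(corpus,
                          control = list(dictionary = keywords,
                                         wordLengths = c(1, Inf)))

# Sum each term over all documents while the data stays sparse.
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)

# Same word / frequency table as in the question, without as.matrix().
df <- data.frame(word = names(freq), freq = freq)
print(df)
```

The result has the same shape as the df built in the question, so the rest of the script should not need to change.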
