I am using DocumentTermMatrix to find a list of keywords in a long text. Most of the words in my list are found correctly, but a couple are missing. I would love to post a minimal working example, but here is the problem: one of the missed words ("insolvency", so not a short word as in the problem here) occurs in a 32-page document, on page 7. If I reduce my text with text <- text[7], DocumentTermMatrix does find it! So I cannot reproduce the issue with a minimal working example.

Do you have any ideas?

Below is a sketch of my script:

library(fastpipe)
library(openxlsx)
library(tm)

`%>>%` <- fastpipe::`%>>%`

source("cleanText.R") # Custom function to clean up the text from reports

keywords_xlsx <- read.xlsx(paste0(getwd(),"/Keywords.xlsx"),
                           sheet = "all",
                           startRow = 1,
                           colNames = FALSE,
                           skipEmptyRows = TRUE,
                           skipEmptyCols = TRUE)

keywords <- keywords_xlsx[1] %>>%
  tolower(as.character(.[,1]))

# Custom function to read pdfs
read <- readPDF(control = list(text = "-layout"))

# Extract text from pdf
report <- "my_report.pdf"
document <- Corpus(URISource(paste0("./Annual reports/", report)), readerControl = list(reader = read))
text <- content(document[[1]]) 

text <- cleanText(report, text) # This is a custom function to clean up the texts

# text <- text[7] # If I do this, my word is found! Otherwise it is missed

# Create a corpus  
text_corpus <- Corpus(VectorSource(text))


matrix <- t(as.matrix(inspect(DocumentTermMatrix(text_corpus,
                                                 list(dictionary = keywords,
                                                      list(wordLengths=c(1, Inf))
                                                 )
))))
  
  
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)

  • What package are all the non-base functions (e.g. Corpus, cleanText) from?
    – SamR
    Commented Sep 28, 2022 at 9:54
  • Corpus is from the tm package (I believe?). cleanText is my own function, defined in another script. I'll edit my question.
    – gitcanzo
    Commented Sep 28, 2022 at 9:55

1 Answer

The problem lies in your use of inspect. Use inspect only to check interactively that your code works and that a dtm has any values. Never use inspect inside functions or transformations: by default it shows, and returns, at most 10 rows and 10 columns of a document-term matrix, so any keyword that falls outside that slice silently disappears from your result.
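To see the effect in isolation, here is a minimal sketch on made-up data (the 12 one-word "pages" are hypothetical, not taken from the report). It assumes a recent tm version in which inspect returns the preview it prints; compare the two dim() results on your own installation rather than taking the sizes for granted.

```r
library(tm)

# Hypothetical toy data: 12 "pages", each containing one distinct term
pages <- c("alpha", "bravo", "charlie", "delta", "echo", "foxtrot",
           "golf", "hotel", "india", "juliet", "kilo", "lima")
corpus <- Corpus(VectorSource(pages))
dtm <- DocumentTermMatrix(corpus)

full  <- as.matrix(dtm)  # the complete matrix: every document and every term
shown <- inspect(dtm)    # prints a preview and returns what was previewed

dim(full)   # size of the complete matrix
dim(shown)  # size of the preview; smaller when the dtm exceeds the preview limit
```

If the second dim() is smaller than the first, any downstream rowSums on the inspect result is working on incomplete data.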

Also, if you want the transpose of a dtm, build a TermDocumentMatrix directly instead of transposing the result.

Your matrix statement should be:

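# Pass all options in one flat control list; nested in an extra unnamed
# list(), wordLengths is not picked up and the default minimum length applies
mat <- as.matrix(TermDocumentMatrix(text_corpus,
                                    control = list(dictionary = keywords,
                                                   wordLengths = c(1, Inf))))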

Note that turning a dtm / tdm into a matrix will use a lot more memory than having the data inside a sparse matrix.
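If memory becomes a concern, the totals can be computed on the sparse object itself. The sketch below is one way to do that, under a few assumptions: the three "pages" and the keyword list are made up for illustration, the control list is written flat, and row_sums comes from the slam package (which tm depends on, so it should already be installed).

```r
library(tm)
library(slam)  # sparse-matrix helpers such as row_sums()

# Hypothetical stand-ins for the cleaned report pages and the keyword list
pages <- c("the company faces insolvency",
           "no risk of insolvency or default",
           "revenue grew this year")
keywords <- c("insolvency", "default", "revenue")

tdm <- TermDocumentMatrix(Corpus(VectorSource(pages)),
                          control = list(dictionary = keywords,
                                         wordLengths = c(1, Inf)))

# Sum each term across all documents without building a dense matrix
words <- sort(row_sums(tdm), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)
df
```

This produces the same word/freq data frame as the original script, minus the inspect and as.matrix steps.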
