I am using DocumentTermMatrix to find a list of keywords in a long text. Most of the words in my list are found correctly, but a couple are missed. I would love to post a minimal working example, but here is the problem: one of the missed words ("insolvency", so not a short word as in the problem here) occurs in a 32-page document, on page 7 of the text. If I reduce my text with text <- text[7], then DocumentTermMatrix actually finds it! So I am not able to reproduce this with a minimal working example...
Do you have any ideas?
Below is a sketch of my script:
library(fastpipe)
library(openxlsx)
library(tm)
`%>>%` <- fastpipe::`%>>%`
source("cleanText.R") # Custom function to clean up the text from reports
keywords_xlsx <- read.xlsx(paste0(getwd(), "/Keywords.xlsx"),
                           sheet = "all",
                           startRow = 1,
                           colNames = FALSE,
                           skipEmptyRows = TRUE,
                           skipEmptyCols = TRUE)
keywords <- keywords_xlsx[1] %>>%
  tolower(as.character(.[, 1]))
# Custom function to read pdfs
read <- readPDF(control = list(text = "-layout"))
# Extract text from pdf
report <- "my_report.pdf"
document <- Corpus(URISource(paste0("./Annual reports/", report)), readerControl = list(reader = read))
text <- content(document[[1]])
text <- cleanText(report, text) # This is a custom function to clean up the texts
# text <- text[7] # If I do this, my word is found! Otherwise it is missed
# Create a corpus
text_corpus <- Corpus(VectorSource(text))
# Options must be passed as one named control list; a nested unnamed list is ignored
matrix <- t(as.matrix(inspect(DocumentTermMatrix(text_corpus,
                                                 control = list(dictionary = keywords,
                                                                wordLengths = c(1, Inf))))))
words <- sort(rowSums(matrix), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)
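
In case it helps anyone trying to reproduce this, here is a minimal sketch of a toy setup with the same shape as my data. The page texts and the keyword list are made up, and I have not confirmed that it triggers the problem. It builds the matrix both through inspect(), as in my script, and directly with as.matrix(), so the two results can be compared. That inspect() might hand back something smaller than the full matrix is only an assumption to check, not something I have verified.

```r
library(tm)

# Hypothetical toy data: 32 short "pages", with the keyword only on page 7
pages <- rep("annual report revenue growth", 32)
pages[7] <- "the company faced insolvency this year"
keywords <- c("insolvency", "revenue", "growth")

toy_corpus <- Corpus(VectorSource(pages))
dtm <- DocumentTermMatrix(toy_corpus,
                          control = list(dictionary = keywords,
                                         wordLengths = c(1, Inf)))

# Route 1: convert the document-term matrix directly (documents x terms)
full <- as.matrix(dtm)

# Route 2: go through inspect() first, as in the script above
# (assumption to check: its return value may not be the full matrix)
sampled <- as.matrix(inspect(dtm))

# Compare the shapes and the keyword counts from both routes
dim(full)
dim(sampled)
colSums(full)
colSums(sampled)
```

If the two dimensions differ, the counts computed from the inspect() route would not cover every page, which would be one possible explanation for a keyword going missing.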
Where are Corpus and cleanText from?
Corpus is from the tm package (I believe?). cleanText is my own function, defined in another script. Going to edit my question.