Lost one document during tokenization

Question

I lost one row of data in the tokenization process.

There are three documents in this data set:

data <-
structure(list(ID = c("N12277Y", "N12284X", "N12291W"), corrected = c("I am living in  I like living in  I would not like to emigrate because you never hardly see your parents at all and brothers and sisters I would be nursing in a hospital I will drive a car and I would like to wear fashionable clothes I am married I like having parties and going out on nights If I had a girl and a boy I would call the girl  and I would call the boy  The little girl is two and the little boy is one month. My hobbies are making dresses knitting and Swimming I like going on holiday I like going to other countries.  ", 
"I do not know.  ", "I emigrated* to Australia* last year. I have have a small farm* just outside Sydney. I have 250 acres* of land and on that I *****ly plow and keepanimals on. I go into Town (Sydney) about twice a week mostly to get ca*** and hay, my wife does all the Shopping. So I don't have to worry about that. We have two girls one is twelve and the other is ten.  the oldest has just got to the stage of pop and Horse riding,  the younger one has just finished her first play with the school and she came in yesterday saying that* the c***** teacher* said that she was the best of all we have just got over the worst summer* for years. The sun was so hot - that it dried* up all the ***nds and all the crop*. 500 sheep and 100 cows died* with lack of water and we almost dried up as well. But we seem to have* got over that and we are all back to normal again. The two Children went back to school after the summer* holidays three weeks ago. The road* is* very dust and one of s* friends was injured with a * up thought* from the dust. I miss the football a lot but U have plenty of cricket*. The school is about three miles away its only a little place but it only cost two pounds every three weeks. There isnt so much field* in England there is only a pinch* compared to here well there isnt much more to tell so goodbye.  "
), father = structure(c(2L, 2L, 1L), .Label = c("1", "2"), class = "factor"), 
    financial = structure(c(1L, 1L, 1L), .Label = "1", class = "factor")), row.names = 598:600, class = "data.frame")

Then, I executed the following code:

library(dplyr)
library(tidytext)
library(SnowballC)

tokens <- data %>%
  unnest_tokens(output = "word", token = "words", input = corrected) %>%
  anti_join(stop_words, by = "word") %>% # remove stop words
  mutate(word = wordStem(word)) # stem words

essay_matrix <- tokens %>%
  count(ID, word) %>%
  cast_dtm(document = ID, term = word, value = n, weighting = tm::weightTfIdf)

But it shows the matrix only contains 2 documents.

<<DocumentTermMatrix (documents: 2, terms: 87)>>
Non-/sparse entries: 84/90
Sparsity           : 52%
Maximal term length: 9
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

I have located the problem: the second row leads to this error:

Error in (function (cl, name, valueClass) : assignment of an object of class “numeric” is not valid for @‘Dim’ in an object of class “dgTMatrix”; is(value, "integer") is not TRUE

I am not sure why this row is problematic, as I have over 4000 data entries but only this row leads to the error. Could someone help?

Thank you in advance.

It's easier to help you if you include a simple reproducible example with sample input and desired output that can be used to test and verify possible solutions. It's really hard to guess what might be going on with just this information. Is one of the documents empty? — MrFlick, Commented Jun 25, 2021 at 0:28
@MrFlick Thanks for your reply. I have updated the question. Now, with the data and code, the problem should be reproducible — karyn-h, Commented Jun 25, 2021 at 1:10
You have a document that says "I do not know". Those are all stop words. When you run anti_join(stop_words) you are removing all values for that document. Thus it disappears from the collection. — MrFlick, Commented Jun 25, 2021 at 2:00

karyn-h · Accepted Answer · 2021-06-25 18:19:37Z

As @MrFlick mentioned, every word in "I do not know" is a stop word, so once stop words are removed this document has no tokens left and drops out of the document-term matrix.
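A quick way to check this is to tokenize that one document on its own and compare its tokens against the stop_words table from tidytext. This is only a sketch: the one-row tibble `doc` and the names `doc_tokens` and `remaining` are made up for illustration and are not part of the original data.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row copy of the problematic document
doc <- tibble(ID = "N12284X", corrected = "I do not know.  ")

# Same tokenization step as in the question
doc_tokens <- doc %>%
  unnest_tokens(output = "word", token = "words", input = corrected)

# TRUE when every token of the document is listed in stop_words
all_stop <- all(doc_tokens$word %in% stop_words$word)

# Rows left for this document after stop word removal
remaining <- doc_tokens %>%
  anti_join(stop_words, by = "word")

print(all_stop)
print(nrow(remaining))
```

If `remaining` has no rows, count() has nothing to count for that ID, so cast_dtm() never sees the document.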

To work around this, I dropped the empty documents with the following code and used data_ready for later analysis.

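# documents that still have at least one term in the DTM
data_ready <- data[data$ID %in% essay_matrix[["dimnames"]][["Docs"]], ]
# documents that were dropped because no tokens were left
data_empty <- data[!data$ID %in% essay_matrix[["dimnames"]][["Docs"]], ]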
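The same split can probably be done before the matrix is built, by comparing the original data frame with the token table. The sketch below uses a small made-up data frame (`data_demo`) rather than the real data, and the `_demo` names are hypothetical; semi_join() and anti_join() are standard dplyr functions.

```r
library(dplyr)
library(tidytext)

# Made-up two-document example; "b" contains only stop words
data_demo <- tibble(
  ID = c("a", "b"),
  corrected = c("My hobbies are knitting and swimming", "I do not know.")
)

tokens_demo <- data_demo %>%
  unnest_tokens(output = "word", token = "words", input = corrected) %>%
  anti_join(stop_words, by = "word")

# Documents with at least one token left after stop word removal
data_ready_demo <- semi_join(data_demo, tokens_demo, by = "ID")

# Documents that lost all of their tokens
data_empty_demo <- anti_join(data_demo, tokens_demo, by = "ID")

print(data_ready_demo$ID)
print(data_empty_demo$ID)
```

This does not depend on the internal structure of the DocumentTermMatrix object, and it shows which documents are empty before cast_dtm() is called.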
