Machine: Windows 7 (64-bit). R version: 3.1.2 (2014-10-31), "Pumpkin Helmet".

I am stemming some text for an analysis, and everything works up to the 'stemCompletion' step. More context is below.

Packages:

  1. tm
  2. SnowballC
  3. rJava
  4. RWeka
  5. RWekajars
  6. NLP

Sample list of words

test <- as.vector(c('win', 'winner', 'wins', 'wins', 'winning'))

Convert to Corpus

Test_Corpus <- Corpus(VectorSource(test))

Text manipulations

Test_Corpus <- tm_map(Test_Corpus, content_transformer(tolower))
Test_Corpus <- tm_map(Test_Corpus, removePunctuation)
Test_Corpus <- tm_map(Test_Corpus, removeNumbers)

Stemming using tm_map under the tm package

Test_stem <- tm_map(Test_Corpus, stemDocument, language = 'english')

Below is the result from stemming above, which is all correct so far:

  1. win
  2. winner
  3. win
  4. win
  5. win

Now comes the issue: I try to use Test_Corpus as a dictionary to complete the stemmed words back into a readable form, using the following code:

Test_complete <- tm_map(Test_stem, stemCompletion, Test_Corpus)

Below are the warning messages that I am getting:

Warning messages:

1: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
2: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
3: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
4: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
5: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used

I have tried several fixes suggested in previous posts, which other people with the same problem also tried without luck. They are listed below:

  1. Updated Java
  2. Used content_transformer
  3. Used PlainTextDocument
  • I'm not sure your formatting is doing what you think it is. Indent for code blocks (including comments) and try to avoid overuse of headers. Commented Feb 20, 2015 at 1:19

1 Answer

I think you need to save your Test_Corpus as a dictionary before the stemming process. You could try something like corpus <- Test_Corpus, then do the stemming, and use corpus later on as the dictionary in Test_complete <- tm_map(Test_stem, stemCompletion, corpus).
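
A rough sketch of the "keep a dictionary before stemming" idea, in case it helps. The grep(sprintf("^%s", w), dictionary, ...) call in the warnings suggests that stemCompletion wants a plain character vector of stems, whereas tm_map hands it whole documents. Assuming that is the cause, one workaround is to stem and complete the words as ordinary strings. The names dict, stems and completed are made up for this example, and type = "shortest" is just one of the completion heuristics, not a recommendation.

```r
library(tm)         # provides stemCompletion
library(SnowballC)  # provides wordStem

test <- c("win", "winner", "wins", "wins", "winning")

# Keep the unstemmed words as the dictionary before stemming anything
dict <- test

# Stem the plain character vector, one word per element
stems <- wordStem(test, language = "english")

# Complete each stem against the dictionary; x is a character vector here,
# so each grep pattern has length 1
completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
print(completed)
```

If the completed words need to go back into a corpus, Corpus(VectorSource(unname(completed))) should rebuild one, although I have not checked this against tm on R 3.1.2.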

  • Changing the name of the corpus at the point of stemming does the same thing, right? Commented Feb 20, 2015 at 20:12
