
All Questions

0 votes | 1 answer | 188 views

How to extract entity names with spacyr with personalized data?

Good afternoon, I am trying to sort a large corpus of normative texts of different lengths, and to tag the parts of speech (POS). For that purpose, I was using the tm and udpipe libraries, and given ...
— Sergio A. Gottret Rios
1 vote | 2 answers | 388 views

Convert a corpus from quanteda to tm

My data mycorpus is a quanteda corpus (created with quanteda's corpus() function), which I need to convert to a corpus under the tm package. I know about quanteda's convert() function. This, though, only converts ...
— arndtupb
0 votes | 1 answer | 495 views

How to break a corpus into paragraphs using custom delimiters

I am scraping the New York Times webpages to do some natural language processing on them. I want to split each webpage into paragraphs when building the corpus, in order to do frequency counts on words that ...
— SLE
0 votes | 0 answers | 1k views

Unknown error when using readtext with PDFs

I am a complete novice working on some textual analysis in R. I have a folder of ~12000 PDF documents I am trying to convert into a corpus for analysis. I have attempted to do so several different ...
— Lewis Jackson
0 votes | 1 answer | 174 views

Error when importing a tm VCorpus into a quanteda corpus

This code snippet worked just fine until I decided to update R (3.6.3) and RStudio (1.2.5042) yesterday, though it is not obvious to me that this is the source of the problem. In a nutshell, I convert 91 ...
— NZU
2 votes | 1 answer | 555 views

Approximate string matching in R between two datasets

I have the following dataset containing film titles and the corresponding genre, while another dataset contains plain text where these titles might be quoted or not: dt1 title ...
— Carbo
0 votes | 1 answer | 585 views

"subscript out of bounds" error in str_extract_all

I am trying to extract date information from multiple text files using str_extract_all. If I do a single file, it works fine. But when I put it in a for loop, it gives me this error. I have already tried ...
— James G Wilson
2 votes | 1 answer | 2k views

Remove words from a dtm

I have created a dtm:
library(tm)
corpus = Corpus(VectorSource(dat$Reviews))
dtm = DocumentTermMatrix(corpus)
I used it to remove rare terms:
dtm = removeSparseTerms(dtm, 0.98)
After ...
— Banjo
0 votes | 1 answer | 1k views

How do I remove texts from a corpus in R?

I'm dividing a long document into chapters using the corpus_segment function in the quanteda package. After running the pattern, I'm still left with a couple of unwanted chapters. I'd like to somehow ...
— Erlend Tangeraas Lygre
3 votes | 1 answer | 260 views

How can I bootstrap text readability statistics using quanteda?

I'm new to both bootstrapping and the quanteda package for text analysis. I have a large corpus of texts organized by document group type that I'd like to obtain readability scores for. I can easily ...
— beddotcom
0 votes | 1 answer | 445 views

Apply a custom (weighted) dictionary to text based on sentiment analysis

I am looking to adjust this code so that I can assign each one of these modal verbs a different weight. The idea is to use something similar to the NRC lexicon, where we have the "numbers" 1-5 ...
— Emily Casey-Wagemaker
1 vote | 1 answer | 5k views

Trying to remove special characters and non-English words from my data in R

I am trying to clean up my data to remove: (i) special characters (e.g. +_), (ii) specific words (e.g. retweet, followers, couldn, better, person), and (iii) words that do not appear in the English ...
— Emm
0 votes | 1 answer | 519 views

Stemming each word

I want to stem each word. For example, 'hardworking employees' should be converted to 'hardwork employee', not 'hardworking employee'. In simple words, it should stem both words separately. I know it ...
— john
6 votes | 1 answer | 573 views

Stem completion in R replaces names, not data

My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the quanteda package in R. I'd like to reduce words to word stems before the topic modeling ...
— J. Trimarco
6 votes | 1 answer | 2k views

tidytext, quanteda, and tm returning different tf-idf scores

I am trying to work on a tf-idf weighted corpus (where I expect tf to be a proportion by document rather than a simple count). I would expect the same values to be returned by all the classic text mining ...
— Radim
0 votes | 2 answers | 40 views

ntoken applied to a VCorpus

I execute the following commands:
library(tm)
library(dplyr)
library(stringi)
library(quanteda)
df <- structure(list(text = c("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean ...
— HelenVcl
0 votes | 0 answers | 249 views

R regular expression to search citations of law using tidytext and tm

I use tidytext, tm and quanteda for text mining. I am trying to: filter a tibble with plain, processed text according to the presence of a citation of law, and count the number of the same citation per text ...
— captcoma
3 votes | 2 answers | 2k views

Remove ngrams with leading and trailing stopwords

I want to identify major n-grams in a bunch of academic papers, including n-grams with nested stopwords, but not n-grams with leading or trailing stopwords. I have about 100 PDF files. I converted ...
— syre
2 votes | 4 answers | 1k views

A lemmatizing function using a hash dictionary does not work with the tm package in R

I would like to lemmatize Polish text using a large external dictionary (in a format like the txt variable below). Unfortunately, Polish is not available as an option in the popular text mining packages. The answer ...
— Jacek Kotowski
1 vote | 1 answer | 1k views

Lemmatization using a txt file with lemmas in R

I would like to use an external txt file with Polish lemmas structured as follows (source for lemmas for many other languages: http://www.lexiconista.com/datasets/lemmatization/): Abadan Abadanem Abadan ...
— Jacek Kotowski
0 votes | 1 answer | 316 views

Display matching sentences by text typed in a Shiny app text box

I am trying to build a Shiny app that can dynamically display sentences from a database column by matching a corpus against a text box, i.e. as the user starts typing the text in the text box, all the ...
— Vikram Karthic
0 votes | 1 answer | 194 views

tm, quanteda, text2vec: get strings to the left of a term in a wordlist according to a regex pattern

I would like to analyse a big folder of texts for the presence of names, addresses and telephone numbers in several languages. These will usually be preceded with a word "Address", "telephone number"...
— Jacek Kotowski
2 votes | 2 answers | 1k views

How to calculate the proximity of words to a specific term in a document

I am trying to figure out a way to calculate word proximities to a specific term in a document, as well as the average proximity (by word). I know there are similar questions on SO, but nothing that ...
— DHranger
0 votes | 1 answer | 2k views

I can't remove • and some other special characters such as '- using tm_map

I searched through the questions and was able to replace • in my first set of commands. But when I apply it to my corpus, it doesn't work; the • still appears. The corpus has 6570 elements (2.3 MB), so it seems to ...
— Etalo
1 vote | 1 answer | 525 views

R: removeCommonTerms with the quanteda package?

The removeCommonTerms function for the tm package is found here:
removeCommonTerms <- function (x, pct) {
    stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")), ...
— hhh
5 votes | 1 answer | 4k views

What is the best way to remove non-ASCII characters from a text corpus when using quanteda in R? [duplicate]

I am in dire need. I have a corpus that I have converted into a common language, but some of the words were not properly converted into English. Therefore, my corpus has non-ASCII characters such as U+...
— Ricardo
1 vote | 2 answers | 1k views

Adding metadata to an STM in R

I am having trouble with the stm package in R. I have built a corpus in quanteda and I want to convert it into the STM format. I have saved the metadata as an independent CSV file and I want code that ...
— Ricardo
3 votes | 2 answers | 797 views

Assigning weights to different features in R

Is it possible to assign weights to different features before formulating a DFM in R? Consider this example in R:
str="apple is better than banana"
mydfm=dfm(str, ignoredFeatures = stopwords("english"...
— Rahul Chawla
2 votes | 2 answers | 850 views

How to keep the beginning and end of sentence markers with quanteda

I'm trying to create 3-grams using R's quanteda package. I'm struggling to find a way to keep the beginning- and end-of-sentence markers, <s> and </s>, in the n-grams, as in the code below. ...
— Giuseppe Romagnuolo
2 votes | 1 answer | 6k views

Form bigrams without stopwords in R

I have been having some trouble with bigrams in text mining using R recently. The purpose is to find meaningful keywords in news, for example "smart car" and "data mining". Let's say I have a string ...
— John Chou
2 votes | 1 answer | 644 views

Import LexisNexis output into R quanteda

I would like to use Benoit's R package quanteda to analyze articles exported from LexisNexis. The export is in the standard HTML format. I use the tm package + plugin to read the LexisNexis output. ...
— bstn
2 votes | 2 answers | 921 views

R tm package: how to compare text to a positive reference word list and return the count of positive word occurrences

What is the best approach for using the tm library to compare text to a positive reference word list and return the count of positive word occurrences? I want to be able to return the sum of positive words in ...
— Technophobe01
4 votes | 2 answers | 6k views

Generating all word unigrams through trigrams in R

I am trying to generate a list of all unigrams through trigrams in R to, eventually, make a document-phrase matrix with columns including all single words, bigrams, and trigrams. I expected to find ...
— miratrix