All Questions
33 questions
0 votes · 1 answer · 188 views
How to extract entity names with spacyr using custom data?
Good afternoon,
I am trying to sort a large corpus of normative texts of different lengths and to tag the parts of speech (POS). For that purpose, I was using the tm and udpipe libraries, and given ...
1 vote · 2 answers · 388 views
Convert Corpus from quanteda to tm
My data mycorpus is in a quanteda-corpus (corpus-function from quanteda) which I need to convert to a corpus under the tm package. I know about quanteda's convert-function. This, though, only converts ...
0 votes · 1 answer · 495 views
How to break a corpus into paragraphs using custom delimiters
I am scraping New York Times webpages to do some natural language processing on them. I want to split each webpage into paragraphs when building the corpus, in order to do frequency counts on words that ...
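For splitting scraped text into paragraphs, quanteda offers `corpus_reshape(x, to = "paragraphs")`; when the delimiter is custom, a base-R sketch like the following can do it (the blank-line delimiter `"\n\n"` is an assumption about the scraped text):

```r
# Split a scraped page into paragraphs on a custom delimiter (base R sketch).
split_paragraphs <- function(text, delim = "\n\n") {
  parts <- unlist(strsplit(text, delim, fixed = TRUE))
  parts <- trimws(parts)
  parts[nzchar(parts)]  # drop empty fragments left by trailing delimiters
}

page <- "First paragraph.\n\nSecond paragraph.\n\n"
split_paragraphs(page)  # "First paragraph." "Second paragraph."
```

Each element of the result can then be treated as one document for frequency counts.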
0 votes · 0 answers · 1k views
Unknown error when using readtext with PDFs
I am a complete novice working on some textual analysis in R.
I have a folder of ~12000 PDF documents I am trying to convert into a corpus for analysis.
I have attempted to do so several different ...
0 votes · 1 answer · 174 views
Error when importing tm VCorpus into quanteda corpus
This code snippet worked just fine until I decided to update R (3.6.3) and RStudio (1.2.5042) yesterday, though it is not obvious to me that this is the source of the problem.
In a nutshell, I convert 91 ...
2 votes · 1 answer · 555 views
Approximate string matching in R between two datasets
I have the following dataset containing film titles and the corresponding genres, while another dataset contains plain text in which these titles may or may not be quoted:
dt1
title ...
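Base R can do approximate (Levenshtein) matching of titles inside free text via `agrepl()`; a minimal sketch, where the titles, texts, and the edit-distance tolerance are all illustrative assumptions:

```r
# Fuzzy-match film titles inside plain text with base R's agrepl(),
# which allows approximate substring matches up to max.distance edits.
titles <- c("The Godfather", "Pulp Fiction")
texts  <- c("I watched The Godfathr last night",  # one letter missing
            "nothing relevant here")

matches <- sapply(titles, function(t)
  agrepl(t, texts, max.distance = 2, ignore.case = TRUE))
matches  # logical matrix: one row per text, one column per title
```

`adist()` can be used instead when the actual edit distance, rather than a yes/no match, is needed.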
0 votes · 1 answer · 585 views
"subscript out of bounds" error in str_extract_all
I am trying to extract date information from multiple text files using str_extract_all. If I process a single file, it works fine. But when I put it in a for loop, it gives me this error.
I have already tried ...
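A "subscript out of bounds" in a loop like this usually means one file had no match, so the corresponding list element is empty and indexing into it fails. A base-R sketch of the guard, using `regmatches()` in place of `str_extract_all()` (the date pattern is an assumption):

```r
# Safely extract the first date-like match from a text; returns NA
# instead of erroring when a file contains no match.
extract_first_date <- function(text, pattern = "\\d{4}-\\d{2}-\\d{2}") {
  m <- regmatches(text, gregexpr(pattern, text))[[1]]
  if (length(m) == 0) NA_character_ else m[1]
}

extract_first_date("Filed on 2020-05-01 and 2020-06-01")  # "2020-05-01"
extract_first_date("no date in this file")                # NA
```

The same `length(...) == 0` check works on each element of an `str_extract_all()` result list.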
2 votes · 1 answer · 2k views
Remove words from a dtm
I have created a dtm.
library(tm)
corpus = Corpus(VectorSource(dat$Reviews))
dtm = DocumentTermMatrix(corpus)
I used it to remove rare terms.
dtm = removeSparseTerms(dtm, 0.98)
After ...
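tm's DocumentTermMatrix supports matrix-style indexing, so specific terms can be dropped by excluding their columns, e.g. `dtm[, !colnames(dtm) %in% drop_terms]`. A plain-matrix sketch of the same idiom (the term names are illustrative):

```r
# Drop unwanted terms by column name; a dtm behaves like this matrix.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("doc1", "doc2"),
                            c("good", "bad", "ugly")))
drop_terms <- c("bad", "ugly")
m2 <- m[, !colnames(m) %in% drop_terms, drop = FALSE]
colnames(m2)  # "good"
```

`drop = FALSE` keeps the result a matrix even when only one term remains.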
0 votes · 1 answer · 1k views
How do I remove texts from a corpus in R?
I'm dividing a long document into chapters using the corpus_segment function in the quanteda package.
After running the pattern, I'm still left with a couple of unwanted chapters. I'd like to somehow ...
3 votes · 1 answer · 260 views
How can I bootstrap text readability statistics using quanteda?
I'm new to both bootstrapping and the quanteda package for text analysis. I have a large corpus of texts organized by document group type that I'd like to obtain readability scores for. I can easily ...
0 votes · 1 answer · 445 views
Apply a custom (weighted) dictionary to text based on sentiment analysis
I am looking to adjust this code so that I can assign each one of these modal verbs with a different weight. The idea is to use something similar to the NRC library, where we have the "numbers" 1-5 ...
1 vote · 1 answer · 5k views
Trying to remove special characters and non-English words from my data in R
I am trying to clean up my data to remove: i) special characters (e.g.
+_), ii) specific words (e.g. retweet, followers, couldn, better, person), and iii) words that do not appear in the English ...
0 votes · 1 answer · 519 views
Stemming each word
I want to stem each word. For example, 'hardworking employees' should be converted to 'hardwork employee' not 'hardworking employee'. In simple words, it should stem both words separately. I know it ...
6 votes · 1 answer · 573 views
Stem completion in R replaces names, not data
My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the Quanteda package in R. I'd like to reduce words to word stems before the topic modeling ...
6 votes · 1 answer · 2k views
tidytext, quanteda, and tm returning different tf-idf scores
I am trying to work on tf-idf weighted corpus (where I expect tf to be a proportion by document rather than simple count). I would expect the same values to be returned by all the classic text mining ...
0 votes · 2 answers · 40 views
ntokens applied to VCorpus
I execute the following commands:
library(tm)
library(dplyr)
library(stringi)
library(quanteda)
df <- structure(list(text = c("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean ...
0 votes · 0 answers · 249 views
R regular expression to search for citations of law using tidytext and tm
I use tidytext, tm and quanteda for text mining.
I am trying to:
filter a tibble of plain, processed text according to the presence of a citation of law
count the number of occurrences of the same citation per text ...
3 votes · 2 answers · 2k views
Remove ngrams with leading and trailing stopwords
I want to identify major n-grams in a bunch of academic papers, including n-grams with nested stopwords, but not n-grams with leading or trailing stopwords.
I have about 100 pdf files. I converted ...
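Once the n-grams exist as joined strings, those with leading or trailing stopwords can be filtered in base R. The `"_"` separator below is an assumption matching quanteda's default concatenator in `tokens_ngrams()`; the stopword list and n-grams are illustrative:

```r
# Keep n-grams whose first and last tokens are not stopwords,
# while still allowing stopwords nested in the middle.
drop_edge_stopwords <- function(ngrams, stops, sep = "_") {
  parts <- strsplit(ngrams, sep, fixed = TRUE)
  keep <- vapply(parts, function(p) {
    !(tolower(p[1]) %in% stops || tolower(p[length(p)]) %in% stops)
  }, logical(1))
  ngrams[keep]
}

stops <- c("the", "of", "in")
grams <- c("rate_of_return", "the_stock_market", "capital_gains_tax")
drop_edge_stopwords(grams, stops)  # "rate_of_return" "capital_gains_tax"
```

"rate_of_return" survives because the stopword "of" is nested, not at an edge.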
2 votes · 4 answers · 1k views
A lemmatizing function using a hash dictionary does not work with tm package in R
I would like to lemmatize Polish text using a large external dictionary (format like in the txt variable below). Unfortunately, Polish is not available as an option in the popular text mining packages. The answer ...
1 vote · 1 answer · 1k views
Lemmatization using a txt file with lemmas in R
I would like to use external txt file with Polish lemmas structured as follows:
(source for lemmas for many other languages http://www.lexiconista.com/datasets/lemmatization/)
Abadan Abadanem
Abadan ...
0 votes · 1 answer · 316 views
Display matching sentences by text typed in a Shiny app text box
I am trying to build a Shiny app that dynamically displays sentences from a database column by matching them against the contents of a text box, i.e. as the user starts typing in the text box, all the ...
0 votes · 1 answer · 194 views
tm, quanteda, text2vec: get strings to the left of a term in a word list according to a regex pattern
I would like to analyse a big folder of texts for the presence of names, addresses and telephone numbers in several languages.
These will usually be preceded by a word such as "Address" or "telephone number"...
2 votes · 2 answers · 1k views
How to calculate proximity of words to a specific term in a document
I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on SO, but nothing that ...
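A base-R sketch of one way to compute this: token positions are compared against every occurrence of the target term, and distances are averaged per word. Whitespace tokenisation and the example sentence are assumptions:

```r
# Distance (in token positions) of every word from the nearest occurrence
# of a target term, averaged per word type.
proximity_to <- function(text, term) {
  toks <- tolower(strsplit(text, "\\s+")[[1]])
  anchor <- which(toks == tolower(term))
  if (length(anchor) == 0) return(NULL)  # term absent
  d <- vapply(seq_along(toks),
              function(i) min(abs(i - anchor)), numeric(1))
  tapply(d, toks, mean)  # average proximity by word
}

proximity_to("the cat sat near the dog", "dog")
```

For "the", which occurs at positions 1 and 5, the distances to "dog" (position 6) are 5 and 1, averaging to 3.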
0 votes · 1 answer · 2k views
I can't remove • and some other special characters such as '- using tm_map
I searched through the questions and was able to replace • in my first set of commands.
But when I apply it to my corpus, it doesn't work; the • still appears.
The corpus has 6570 elements (2.3 MB), so it seems to ...
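When a literal • in the script fails to match the corpus, the usual culprit is an encoding mismatch. Writing the character by its Unicode escape (the bullet is U+2022, the curly apostrophe U+2019) sidesteps that; a base-R sketch (with tm, this function would be wrapped in `content_transformer()`, which is an assumption about the asker's pipeline):

```r
# Remove bullets and curly apostrophes by Unicode escape rather than
# by pasting the literal characters into the script.
strip_bullets <- function(x) gsub("[\u2022\u2019]", "", x)

strip_bullets("\u2022 item one \u2022 item two")  # " item one  item two"
```

The same pattern extends to any other troublesome character once its code point is known.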
1 vote · 1 answer · 525 views
R: removeCommonTerms with Quanteda package?
The removeCommonTerms function is found here for the TM package such that
removeCommonTerms <- function (x, pct)
{
stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
...
5 votes · 1 answer · 4k views
What is the best way to remove non-ASCII characters from a text corpus when using quanteda in R? [duplicate]
I am in dire need. I have a corpus that I have converted into a common language, but some of the words were not properly converted into English. Therefore, my corpus has non-ASCII characters such as U+...
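Base R's `iconv()` can strip every non-ASCII character before (or after) the corpus is built; `sub = ""` deletes anything that cannot be represented in the target encoding. A minimal sketch, assuming the texts are UTF-8:

```r
# Delete all non-ASCII characters from a character vector.
to_ascii <- function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = "")

to_ascii("na\u00efve caf\u00e9")  # "nave caf"
```

When transliteration (é → e) is preferred over deletion, `to = "ASCII//TRANSLIT"` may work, though its behaviour is platform-dependent.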
1 vote · 2 answers · 1k views
Adding metadata to STM in R
I am having trouble with the STM package in R. I have built a corpus in Quanteda and I want to convert it into the STM format. I have saved the metadata as an independent CSV file and I want code that ...
3 votes · 2 answers · 797 views
Assigning weights to different features in R
Is it possible to assign weights to different features before formulating a DFM in R?
Consider this example in R
str="apple is better than banana"
mydfm=dfm(str, ignoredFeatures = stopwords("english"...
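One route is to build the dfm first and then scale the columns of chosen features by their weights. A plain-matrix sketch of the idiom (the same column scaling applies to a dfm, since it indexes like a matrix; the feature names and weights are illustrative):

```r
# Scale each feature's column by a per-feature weight after the
# document-feature matrix is built.
dfm_counts <- matrix(c(2, 0, 1, 3), nrow = 2,
                     dimnames = list(c("d1", "d2"),
                                     c("apple", "banana")))
w <- c(apple = 2, banana = 0.5)
weighted <- sweep(dfm_counts, 2, w[colnames(dfm_counts)], `*`)
weighted  # apple counts doubled, banana counts halved
```

Indexing the weight vector by `colnames()` keeps the weights aligned with the features regardless of column order.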
2 votes · 2 answers · 850 views
How to keep the beginning and end of sentence markers with quanteda
I'm trying to create 3-grams using R's quanteda package.
I'm struggling to find a way to keep in the n-grams beginning and end of sentence markers, the <s> and </s> as in the code below.
...
2 votes · 1 answer · 6k views
Form bigrams without stopwords in R
I have recently run into some trouble with bigrams in text mining using R.
The purpose is to find the meaningful keywords in news, for example are "smart car" and "data mining".
Let's say if I have a string ...
2 votes · 1 answer · 644 views
Import LexisNexis output into R quanteda
I would like to use Benoit's R package quanteda to analyze articles exported from LexisNexis. The export is in the standard HTML format. I use the tm package + plugin to read the LexisNexis output. ...
2 votes · 2 answers · 921 views
R tm package: How to compare text to a positive reference word list and return the count of positive word occurrences
What is the best approach to using the tm library to compare text to a positive reference word list and return the count of positive word occurrences? I want to be able to return the sum of positive words in ...
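Counting hits against a reference list reduces to tokenising and testing membership with `%in%`; a base-R sketch (the positive word list and whitespace/punctuation tokenisation are assumptions):

```r
# Count occurrences of positive reference words in each document.
count_positive <- function(texts, pos_words) {
  toks <- strsplit(tolower(texts), "[^a-z']+")  # crude tokeniser
  vapply(toks, function(t) sum(t %in% pos_words), integer(1))
}

pos <- c("good", "great", "excellent")
count_positive(c("A good, great film", "nothing here"), pos)  # 2 0
```

With tm, the equivalent is to build a DocumentTermMatrix and take row sums over the columns matching the reference list.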
4 votes · 2 answers · 6k views
Generating all word unigrams through trigrams in R
I am trying to generate a list of all unigrams through trigrams in R to, eventually, make a document-phrase matrix with columns including all single words, bigrams, and trigrams.
I expected to find ...
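A base-R sketch of the generation step, sliding windows of width 1 through 3 over the token sequence (whitespace tokenisation and the example sentence are assumptions; packages like quanteda do this via `tokens_ngrams(x, n = 1:3)`):

```r
# Generate all unigrams, bigrams, and trigrams from a text.
ngrams_up_to <- function(text, max_n = 3) {
  toks <- strsplit(text, "\\s+")[[1]]
  out <- character(0)
  for (n in seq_len(max_n)) {
    if (length(toks) < n) break
    for (i in seq_len(length(toks) - n + 1)) {
      out <- c(out, paste(toks[i:(i + n - 1)], collapse = " "))
    }
  }
  out
}

ngrams_up_to("text mining in R")  # 4 unigrams, 3 bigrams, 2 trigrams
```

The resulting phrases can serve as the column labels of a document-phrase matrix.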