
All Questions

0 votes | 1 answer | 188 views

How to extract entity names with spacyr with personalized data?

Good afternoon, I am trying to sort a large corpus of normative texts of different lengths, and to tag the parts of speech (POS). For that purpose, I was using the tm and udpipe libraries, and given ...
— Sergio A. Gottret Rios
1 vote | 2 answers | 388 views

Convert a corpus from quanteda to tm

My data mycorpus is a quanteda corpus (created with quanteda's corpus() function), which I need to convert to a corpus under the tm package. I know about quanteda's convert() function. This, though, only converts ...
— arndtupb
0 votes | 1 answer | 495 views

How to break a corpus into paragraphs using custom delimiters

I am scraping the New York Times webpages to do some natural language processing on them. I want to split each webpage into paragraphs when building the corpus, in order to do frequency counts on words that ...
— SLE
0 votes | 0 answers | 1k views

Unknown error when using readtext with PDFs

I am a complete novice working on some textual analysis in R. I have a folder of ~12000 PDF documents I am trying to convert into a corpus for analysis. I have attempted to do so several different ...
— Lewis Jackson
0 votes | 1 answer | 174 views

Error when importing a tm VCorpus into a quanteda corpus

This code snippet worked just fine until I decided to update R (3.6.3) and RStudio (1.2.5042) yesterday, though it is not obvious to me that this is the source of the problem. In a nutshell, I convert 91 ...
— NZU
2 votes | 1 answer | 555 views

Approximate string matching in R between two datasets

I have the following dataset containing film titles and the corresponding genre, while another dataset contains plain text where these titles might be quoted or not: dt1 title ...
— Carbo
0 votes | 1 answer | 585 views

"subscript out of bounds" error in str_extract_all

I am trying to extract date information from multiple text files using str_extract_all. If I do a single file, it works fine. But when I put it in a for loop, it gives me this error. I have already tried ...
— James G Wilson
2 votes | 1 answer | 2k views

Remove words from a dtm

I have created a dtm:
library(tm)
corpus = Corpus(VectorSource(dat$Reviews))
dtm = DocumentTermMatrix(corpus)
I used it to remove rare terms:
dtm = removeSparseTerms(dtm, 0.98)
After ...
— Banjo
0 votes | 1 answer | 1k views

How do I remove texts from a corpus in R?

I'm dividing a long document into chapters using the corpus_segment function in the quanteda package. After running the pattern, I'm still left with a couple of unwanted chapters. I'd like to somehow ...
— Erlend Tangeraas Lygre
3 votes | 1 answer | 260 views

How can I bootstrap text readability statistics using quanteda?

I'm new to both bootstrapping and the quanteda package for text analysis. I have a large corpus of texts organized by document group type that I'd like to obtain readability scores for. I can easily ...
— beddotcom
0 votes | 1 answer | 445 views

Apply a custom (weighted) dictionary to text based on sentiment analysis

I am looking to adjust this code so that I can assign each one of these modal verbs a different weight. The idea is to use something similar to the NRC lexicon, where we have the "numbers" 1-5 ...
— Emily Casey-Wagemaker
1 vote | 1 answer | 5k views

Trying to remove special characters and non-English words from my data in R

I am trying to clean up my data to remove: (i) special characters (e.g. +_), (ii) specific words (e.g. retweet, followers, couldn, better, person), and (iii) words that do not appear in the English ...
— Emm
0 votes | 1 answer | 519 views

Stemming each word

I want to stem each word. For example, 'hardworking employees' should be converted to 'hardwork employee', not 'hardworking employee'. In simple words, it should stem both words separately. I know it ...
— john
6 votes | 1 answer | 573 views

Stem completion in R replaces names, not data

My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the quanteda package in R. I'd like to reduce words to word stems before the topic modeling ...
— J. Trimarco
6 votes | 1 answer | 2k views

tidytext, quanteda, and tm returning different tf-idf scores

I am trying to work on a tf-idf weighted corpus (where I expect tf to be a proportion by document rather than a simple count). I would expect the same values to be returned by all the classic text mining ...
— Radim
0 votes | 2 answers | 40 views

ntoken applied to a VCorpus

I execute the following commands:
library(tm)
library(dplyr)
library(stringi)
library(quanteda)
df <- structure(list(text = c("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean ...
— HelenVcl
0 votes | 0 answers | 249 views

R regular expression to search citations of law using tidytext and tm

I use tidytext, tm and quanteda for text mining. I am trying to: filter a tibble with plain, processed text according to the presence of a citation of law, and count the number of the same citation per text ...
— captcoma
3 votes | 2 answers | 2k views

Remove ngrams with leading and trailing stopwords

I want to identify major n-grams in a bunch of academic papers, including n-grams with nested stopwords, but not n-grams with leading or trailing stopwords. I have about 100 PDF files. I converted ...
— syre
2 votes | 4 answers | 1k views

A lemmatizing function using a hash dictionary does not work with the tm package in R

I would like to lemmatize Polish text using a large external dictionary (in a format like the txt variable below). Unfortunately, Polish is not available as an option in the popular text mining packages. The answer ...
— Jacek Kotowski
1 vote | 1 answer | 1k views

Lemmatization using a txt file with lemmas in R

I would like to use an external txt file with Polish lemmas structured as follows (source for lemmas for many other languages: http://www.lexiconista.com/datasets/lemmatization/): Abadan Abadanem Abadan ...
— Jacek Kotowski
0 votes | 1 answer | 316 views

Display matching sentences by text typed in a Shiny app text box

I am trying to build a Shiny app that can dynamically display sentences from a database column by matching a corpus against a text box, i.e. as the user starts typing the text in the text box, all the ...
— Vikram Karthic
0 votes | 1 answer | 194 views

tm, quanteda, text2vec: get strings to the left of a term in a wordlist according to a regex pattern

I would like to analyse a big folder of texts for the presence of names, addresses and telephone numbers in several languages. These will usually be preceded with a word "Address", "telephone number"...
— Jacek Kotowski
2 votes | 2 answers | 1k views

How to calculate the proximity of words to a specific term in a document

I am trying to figure out a way to calculate word proximities to a specific term in a document, as well as the average proximity (by word). I know there are similar questions on SO, but nothing that ...
— DHranger
0 votes | 1 answer | 2k views

I can't remove • and some other special characters such as '- using tm_map

I searched through the questions and was able to replace • in my first set of commands. But when I apply it to my corpus, it doesn't work; the • still appears. The corpus has 6570 elements (2.3 MB), so it seems to ...
— Etalo
1 vote | 1 answer | 525 views

R: removeCommonTerms with the quanteda package?

The removeCommonTerms function for the tm package is found here:
removeCommonTerms <- function (x, pct) {
    stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")), ...
— hhh
5 votes | 1 answer | 4k views

What is the best way to remove non-ASCII characters from a text corpus when using quanteda in R? [duplicate]

I am in dire need. I have a corpus that I have converted into a common language, but some of the words were not properly converted into English. Therefore, my corpus has non-ASCII characters such as U+...
— Ricardo
1 vote | 2 answers | 1k views

Adding metadata to an STM in R

I am having trouble with the stm package in R. I have built a corpus in quanteda and I want to convert it into the STM format. I have saved the metadata as an independent CSV file and I want code that ...
— Ricardo
3 votes | 2 answers | 797 views

Assigning weights to different features in R

Is it possible to assign weights to different features before formulating a DFM in R? Consider this example in R:
str="apple is better than banana"
mydfm=dfm(str, ignoredFeatures = stopwords("english"...
— Rahul Chawla
2 votes | 2 answers | 850 views

How to keep the beginning and end of sentence markers with quanteda

I'm trying to create 3-grams using R's quanteda package. I'm struggling to find a way to keep the beginning- and end-of-sentence markers, <s> and </s>, in the n-grams, as in the code below. ...
— Giuseppe Romagnuolo
2 votes | 1 answer | 6k views

Form bigrams without stopwords in R

I have been having some trouble with bigrams in text mining using R recently. The purpose is to find meaningful keywords in news, for example "smart car" and "data mining". Let's say I have a string ...
— John Chou
2 votes | 1 answer | 644 views

Import LexisNexis output into R quanteda

I would like to use Benoit's R package quanteda to analyze articles exported from LexisNexis. The export is in the standard HTML format. I use the tm package + plugin to read the LexisNexis output. ...
— bstn
2 votes | 2 answers | 921 views

R tm package: how to compare text to a positive reference word list and return the count of positive word occurrences

What is the best approach for using the tm library to compare text to a positive reference word list and return the count of positive word occurrences? I want to be able to return the sum of positive words in ...
— Technophobe01
4 votes | 2 answers | 6k views

Generating all word unigrams through trigrams in R

I am trying to generate a list of all unigrams through trigrams in R to, eventually, make a document-phrase matrix with columns including all single words, bigrams, and trigrams. I expected to find ...
— miratrix