0 votes
1 answer
32 views

How to extract terms and probabilities from tmResult$terms in topic modeling?

I'd like to create separate word clouds for each of my 8 topics in an LDA model. I extracted the top 40 words across 8 topics - an object of length 320 containing top words and occurrence probabilities. I ...
NoaMi's user avatar
  • 41
0 votes
0 answers
22 views

How to obtain and save trigrams from text mining program TM - in text or csv format

I'm hoping to identify trigrams and phrases in a corpus using TM and save the output as a text or csv file. I haven't found a way to do this in Quanteda: How to save n-gram output This reproducible ...
bgreen's user avatar
  • 87
0 votes
0 answers
33 views

Errors attaching metadata to corpus

I am trying to generate a corpus with two documents: one is responses of participants characterized as "supporters" and one is responses of "non-supporters". I've entered this as ...
Nicolette's user avatar
1 vote
0 answers
56 views

How do I remove list() from a Corpus?

I have three text files. After preprocessing them using Corpus and the tm package, the resulting text includes the phrases "list(language = "portuguese")" and "list()". ...
Carla's user avatar
  • 23
0 votes
0 answers
19 views

Get document ID from LDA output R

I'm trying to do LDA over two very large corpora of documents. I need to compare the LDA output (planning to use the Kullback-Leibler similarity measure) across time for each pair of documents. ...
JF96's user avatar
  • 169
0 votes
1 answer
45 views

Undo stemming after tm::stemDocument()?

I have a list of stemmed words in R. Now, I want to undo my stemming in order to receive a list of all the "complete" words in R. This is the code I used for stemming my wordlist: library(tm)...
lili4491li's user avatar
-1 votes
1 answer
24 views

Error while creating the TDM - "No applicable method for 'meta' applied to an object of class "character""

While creating the tm package TermDocumentMatrix, I am getting an error. I used the following code: int_vc <- VCorpus(int_vc) int_vc <- tm_map(int_vc, tolower) int_vc <- tm_map(int_vc, ...
yem's user avatar
  • 29
2 votes
0 answers
91 views

Unable to edit metadata in corpus

I have the following corpus: library(jsonlite) library(tm) query = "https://www.ebi.ac.uk/ebisearch/ws/rest/pride?query=submitter_country:Norway&size=1000&fields=submitter_keywords&...
Illimar Rekand's user avatar
0 votes
1 answer
120 views

Searching for specific words in Corpus with R (tm package)

I have a Corpus (tm package), containing a collection of 1.300 different text documents [Content: documents: 1.300]. My goal is now to search the frequency of a specific wordlist in each of those ...
Li4991's user avatar
  • 81
1 vote
0 answers
26 views

Issue with stemCompletion in R

Dear stack overflow community, I have an issue when trying to complete a stemmed Corpus in R using the function stemCompletion within the tm package (https://cran.r-project.org/web/packages/tm/tm.pdf)....
d4rkneo's user avatar
  • 11
0 votes
1 answer
188 views

How to extract entities names with SpacyR with personalized data?

Good afternoon, I am trying to sort a large corpus of normative texts of different lengths, and to tag the parts of speech (POS). For that purpose, I was using the tm and udpipe libraries, and given ...
Sergio A. Gottret Rios's user avatar
1 vote
1 answer
77 views

Error in tm package while topic modelling

I am running into an error while trying to make a corpus object from the tm package in R. The data have been scraped from a website and I have included the full code below so you can run and see how ...
I_like_insights's user avatar
3 votes
1 answer
151 views

Find overlap in terms between a pair of documents

I have a sparse term-document matrix produced by tm's TermDocumentMatrix. I am trying to write a function that takes two document names and k as its arguments, finds all terms that occur in both ...
dimitriy's user avatar
  • 9,440
2 votes
1 answer
425 views

Using R to analyse pubmed articles. Trying to create wordcloud but also association with year of publication

MOST RECENT EDIT: I have successfully created my required data frames containing pmid, year and abstract as columns from a literature search on pubmed. I then split this data frame into many separate ...
Aidi's user avatar
  • 23
1 vote
1 answer
90 views

Text analysis with dictionary of words: NGramTokenizer not working

I am trying to look for a list of keywords in a text. Some of these keywords are n-grams. However, the TermDocumentMatrix will only find single words. I already had a look at several similar questions ...
gitcanzo's user avatar
  • 129
1 vote
1 answer
33 views

DocumentTermMatrix misses some words

I am using DocumentTermMatrix to find a list of keywords in a long text. Most of the words in my list are correctly found, but there are a couple that are missing. Now, I would love to post here a ...
gitcanzo's user avatar
  • 129
0 votes
1 answer
310 views

localtime() returns a pointer to a structure with uninitialized members

char datetime[DATETIME_LEN]; time_t timer; struct tm* tm_info; timer = time(NULL); tm_info = localtime(&timer); // debug: tm_info: 0xcccccccccccccccc {tm_sec=??? tm_min=??? tm_hour=??? ...} if (...
cxↄ's user avatar
  • 1,330
0 votes
1 answer
160 views

Is package tm suitable for extracting scores from text data?

I have many cognitive assessment data stored as txt files. Each file looks like this: patient number xxxxxx score A (98) (95)ile% score B (100) (97)ile% test C score D (76) ...
Ian Wang's user avatar
  • 157
0 votes
0 answers
317 views

Unused argument error using stopwords in R

I am trying to clean and process post data from twitter. The initial corpus produces the following after cleaning: text_corpus[[1]]$content [1] "I actually would love if my Mad Scientist was ...
Macy's user avatar
  • 11
1 vote
1 answer
181 views

Remove Words with less than Certain Character Lengths plus Noise Reduction before Tokenization

I have the following data frame report <- data.frame(Text = c("unit 1 crosses the street", "driver 2 was speeding and saw driver# 1", "year 2019 was the ...
S Das's user avatar
  • 3,391
2 votes
1 answer
235 views

Remove Numbers, Punctuations, White Spaces before Tokenization

I have the following data frame report <- data.frame(Text = c("unit 1 crosses the street", "driver 2 was speeding and saw driver# 1", "year 2019 was the ...
S Das's user avatar
  • 3,391
0 votes
1 answer
85 views

row_sums vs findFreqTerms for subsetting TermDocMatrix to include words with a given min frequency

My question is straightforward. I have a (binary) TDM and I want to reduce the number of rows to include only those rows that appear in at least two documents. I thought that these two methods would ...
KArrow'sBest's user avatar
2 votes
1 answer
211 views

Turkish characters problem while plotting graphs in R igraph

I have a dataset which includes Tweets in Turkish language. I'm trying to do text mining with tm package and plot the networks with igraph R packages. library(tm) #build corpus corpus <- iconv(...
Naim Cinar's user avatar
2 votes
1 answer
103 views

How to add a target variable indicating whether a sentence belongs to data 1 or data 2?

I am working on a project. I would like to summarize it with a similar case. I need to collect n tweets with different hashtags. Here is similar code: library(tm) #tweets from first hashtag ...
Narimanoglu's user avatar
0 votes
0 answers
60 views

Restore original data from document term matrix in R

I want to know if there is a way to go back to my original database (df) after I have made it a document term matrix. Here is an example of what I want to do. df <- data.frame(group=c("A",...
Sergio Parra's user avatar
0 votes
0 answers
78 views

Filter the content of a corpus with a custom function in R

I want to analyze texts filtered by a custom function (a function with parameters) using R. I used the readlines function to extract my text and I get a large list with 258 lists. Then, using VCorpus(...
Sari's user avatar
  • 5
0 votes
2 answers
296 views

How to create a document term incidence matrix from long format text data?

I've got data that look like this: ID word 1 blue 1 red 1 green 1 yellow 2 blue 2 purple 2 orange 2 green But I want to transform them into a binary incidence matrix denoting whether or not ...
nlplearner's user avatar
1 vote
1 answer
264 views

How can I extract bigrams from text without removing the hash symbol?

I am using the following function (based on https://rpubs.com/sprishi/twitterIBM) to extract bigrams from text. However, I want to keep the hash symbol for analysis purposes. The function to clean ...
Chamil Rathnayake's user avatar
0 votes
1 answer
69 views

TermDocumentMatrix Error after Cleaning Corpus

My problem is that I want to pass my corpus to the tm function TermDocumentMatrix() and it fails with the error: Error in UseMethod("meta", x): no applicable method for 'meta' applied to an ...
Mauras's user avatar
  • 1
1 vote
2 answers
2k views

How to remove these special characters in r in a set of string : ’s, …

I have this string which contains special characters. I am not able to remove these characters from the main data frame; however, when I prepared a separate object by dft and then I use the following ...
Sachin's user avatar
  • 145
0 votes
1 answer
196 views

Unable to remove these characters from the data in a string in r

I am trying to remove the special characters from the following string with the following code, but I am not getting the expected result: library(tm) v <- "rt shibxwarrior hodl trust processsome ...
Sachin's user avatar
  • 145
0 votes
1 answer
79 views

Text Mining: Cluster Analysis phrases. ERROR: cannot take a sample larger than the population

I'm working on a dataset of thousands of sentences. The dataset is structured as one column and k rows. I have to find similarities between them, and I'm doing a cluster analysis. I created a corpus and ...
GIORIGO's user avatar
  • 59
-1 votes
1 answer
493 views

Extract table from unstructured text file in r

I have a text file namely data.txt containing multiple tables in the following format. // // TABLE ET_ARCMAT // ARCID MATID VALTO VALFR ...
abdul samad's user avatar
0 votes
0 answers
128 views

Custom dictionary for word removing in R

I'd like to create a custom dictionary of words to be removed from a Corpus. I'm using the tm_map command. I'd like to start from a .txt file (like word1,word2,word3; file.txt), import it into R and ...
1 vote
0 answers
23 views

Why does the clean.text() function change word frequencies?

I am doing text analysis and reading articles into R. When I use the clean.text() function from TextReg to clean the text of a corpus and then look up word frequencies using term_stats() from tm, the ...
user6542495's user avatar
2 votes
1 answer
75 views

Some words won't be stemmed using tm ("easier" or "easiest")

I have a large questionnaire dataset where some of the features need to be stemmed, with the goal being to assign a topic to each response. However, I'm having trouble stemming some words using the ...
Chris Oosthuizen's user avatar
0 votes
2 answers
359 views

subscript out of bounds error in document-term matrix

I am doing text mining on the following data, but I get the following error at the end: Error in `[.simple_triplet_matrix`(dtm, 1:10, 1:10) : subscript out of bounds. Can you help me address this error? ...
Cina's user avatar
  • 10.2k
2 votes
4 answers
2k views

Have mktime() ignore DST and local time zone in C++

Our system receives data from a vendor in ASCII format "20210715083015". This time is US Eastern time, and is already adjusted for Daylight Savings. Our backend needs this time in ...
Flyboy Wilson's user avatar
1 vote
2 answers
388 views

Convert Corpus from quanteda to tm

My data mycorpus is in a quanteda-corpus (corpus-function from quanteda) which I need to convert to a corpus under the tm package. I know about quanteda's convert-function. This, though, only converts ...
arndtupb's user avatar
1 vote
1 answer
36 views

tm package removeWords function concatenates words in R

I am cleaning the sample data using removeWords from the tm package, but the removeWords function concatenates the words after removal. It should be "environmental dead frog" "environmental dead ...
Dhinesh G's user avatar
1 vote
2 answers
420 views

search for word/phrase from column in R

I have data that looks like this: > head(df) ID Comment 1 1 I ate dinner. 2 2 ...
user11015000's user avatar
1 vote
1 answer
79 views

Lost one document during tokenization

I lost one row of data in the tokenization process. There are three documents in this data set structure(list(ID = c("N12277Y", "N12284X", "N12291W"), corrected = c("...
karyn-h's user avatar
  • 133
0 votes
1 answer
27 views

R TM package produces strange results with Inspect command

I'm having a little trouble with the inspect function from the tm package in R. I have a sample 2-row data.table as defined below: dt <- data.table(doc_id = c(1, 2), text = c('the driver of the 1st ...
AlexP's user avatar
  • 637
2 votes
1 answer
44 views

combining words in tm R is not achieving desired result

I am trying to combine a few words so that they count as one. In this example I want val and valuatin to be counted as valuation. The code I have been using to try and do this is below: #load in ...
user11015000's user avatar
0 votes
1 answer
1k views

Cosine Similarity Matrix in R

I have a document term matrix, "mydtm" that I have created in R, using the 'tm' package. I am attempting to depict the similarities between each of the 557 documents contained within the dtm/...
Luke Hansen's user avatar
0 votes
1 answer
186 views

Dealing with several text columns in a labeled data set while running NLP in R

Hope all of you guys are healthy and well. I am new to the world of NLP and my question may sound stupid, so I apologize in advance. I would like to perform NLP on some text data which is labeled and ...
Alex's user avatar
  • 245
0 votes
1 answer
65 views

R Tm package dictionary matching leads to higher frequency than actual words of text

I have been using the code below to load text as a corpus and using the tm package to clean the text. As a next step I am loading a dictionary and cleaning it as well. Then I am matching the words ...
user15721704's user avatar
0 votes
1 answer
495 views

How to break a corpus into paragraphs using custom delimiters

I am scraping the New York Times webpages to do some natural language processing on them. I want to split the webpage into paragraphs when using corpus in order to do frequency counts on words that ...
SLE's user avatar
  • 85
0 votes
1 answer
120 views

Issue with adding breaks to a tm_object

I am having trouble adding fixed breaks to a tm_map. I tried the same code as another topic on this forum (Customize how R tmap legend values are printed) on a different dataset, but the ...
Jelmer Visser's user avatar
0 votes
2 answers
308 views

Calculating term frequencies in a big corpus efficiently regardless of document boundaries

I have a corpus of almost 2m documents. I want to calculate the term frequencies of the terms in the whole corpus, regardless of document boundaries. A naive approach would be combining all the ...
Rafs's user avatar
  • 796
