1,082 questions
0
votes
1
answer
32
views
How to extract terms and probabilities from tmResult$terms in topic modeling?
I would like to create separate word clouds for each of my 8 topics in an LDA model. I extracted the top 40 words across 8 topics - an object of length 320 containing top words and occurrence probabilities.
I ...
0
votes
0
answers
22
views
How to obtain and save trigrams from text mining program TM - in text or csv format
I'm hoping to identify trigrams and phrases in a corpus using TM and save the output as a text or csv file. I haven't found a way to do this in Quanteda: How to save n-gram output
This reproducible ...
0
votes
0
answers
33
views
Errors attaching metadata to corpus
I am trying to generate a corpus with two documents: one is responses of participants characterized as "supporters" and one is responses of "non-supporters". I've entered this as ...
1
vote
0
answers
56
views
How do I remove list() from a Corpus?
I have three text files. After preprocessing them using Corpus and the tm package, the resulting text includes the phrases "list(language = "portuguese")" and "list()".
...
0
votes
0
answers
19
views
Get document ID from LDA output R
I'm trying to do LDA over two very large corpora of documents.
I need to compare the LDA output (planning to use the Kullback-Leibler similarity measure) across time for each pair of documents. ...
0
votes
1
answer
45
views
Undo stemming after tm::stemDocument()?
I have a list of stemmed words in R. Now, I want to undo my stemming in order to receive a list of all the "complete" words in R.
This is the code I used for stemming my wordlist:
library(tm)...
-1
votes
1
answer
24
views
Error while creating the TDM - "No applicable method for 'meta' applied to an object of class "character""
While creating a TermDocumentMatrix with the tm package, I am getting an error. I used the following code:
int_vc <- VCorpus(int_vc)
int_vc <- tm_map(int_vc, tolower)
int_vc <- tm_map(int_vc, ...
2
votes
0
answers
91
views
Unable to edit metadata in corpus
I have the following corpus:
library(jsonlite)
library(tm)
query = "https://www.ebi.ac.uk/ebisearch/ws/rest/pride?query=submitter_country:Norway&size=1000&fields=submitter_keywords&...
0
votes
1
answer
120
views
Searching for specific words in Corpus with R (tm package)
I have a Corpus (tm package), containing a collection of 1.300 different text documents [Content: documents: 1.300].
My goal is now to search the frequency of a specific wordlist in each of those ...
1
vote
0
answers
26
views
Issue with stemCompletion in R
Dear stack overflow community,
I have an issue when trying to complete a stemmed Corpus in R using the function stemCompletion within the tm package (https://cran.r-project.org/web/packages/tm/tm.pdf)....
0
votes
1
answer
188
views
How to extract entities names with SpacyR with personalized data?
Good afternoon,
I am trying to sort a large corpus of normative texts of different lengths, and to tag the parts of speech (POS). For that purpose, I was using the tm and udpipe libraries, and given ...
1
vote
1
answer
77
views
Error in tm package while topic modelling
I am running into an error while trying to make a corpus object from the tm package in R.
The data have been scraped from a website and I have included the full code below so you can run and see how ...
3
votes
1
answer
151
views
Find overlap in terms between a pair of documents
I have a sparse term-document matrix produced by tm's TermDocumentMatrix.
I am trying to write a function that takes two document names and k as its arguments, finds all terms that occur in both ...
2
votes
1
answer
425
views
Using R to analyse pubmed articles. Trying to create wordcloud but also association with year of publication
MOST RECENT EDIT:
I have successfully created my required data frames containing pmid, year and abstract as columns from a literature search on pubmed. I then split this data frame into many separate ...
1
vote
1
answer
90
views
Text analysis with dictionary of words: NGramTokenizer not working
I am trying to look for a list of keywords in a text. Some of these keywords are n-grams. However, the TermDocumentMatrix will only find single words. I already had a look at several similar questions ...
1
vote
1
answer
33
views
DocumentTermMatrix misses some words
I am using DocumentTermMatrix to find a list of keywords in a long text. Most of the words in my list are correctly found, but there are a couple that are missing. Now, I would love to post here a ...
0
votes
1
answer
310
views
localtime() returns a pointer to a structure with uninitialized members
char datetime[DATETIME_LEN];
time_t timer;
struct tm* tm_info;
timer = time(NULL);
tm_info = localtime(&timer); // debug: tm_info: 0xcccccccccccccccc {tm_sec=??? tm_min=??? tm_hour=??? ...}
if (...
0
votes
1
answer
160
views
is package tm suitable for extracting scores from text data?
I have a lot of cognitive assessment data stored as txt files. Each file looks like this:
patient number xxxxxx
score A (98) (95)ile%
score B (100) (97)ile%
test C
score D (76)
...
0
votes
0
answers
317
views
Unused argument error using stopwords in R
I am trying to clean and process post data from twitter. The initial corpus produces the following after cleaning:
text_corpus[[1]]$content
[1] "I actually would love if my Mad Scientist was ...
1
vote
1
answer
181
views
Remove Words with less than Certain Character Lengths plus Noise Reduction before Tokenization
I have the following data frame
report <- data.frame(Text = c("unit 1 crosses the street",
"driver 2 was speeding and saw driver# 1",
"year 2019 was the ...
2
votes
1
answer
235
views
Remove Numbers, Punctuations, White Spaces before Tokenization
I have the following data frame
report <- data.frame(Text = c("unit 1 crosses the street",
"driver 2 was speeding and saw driver# 1",
"year 2019 was the ...
0
votes
1
answer
85
views
row_sums vs findFreqTerms for subsetting TermDocMatrix to include words with a given min frequency
My question is straightforward. I have a (binary) TDM and I want to reduce the number of rows to include only those rows that appear in at least two documents:
I thought that these two methods would ...
2
votes
1
answer
211
views
Turkish characters problem while plotting graphs in R igraph
I have a dataset which includes tweets in Turkish. I'm trying to do text mining with the tm package and plot the networks with the igraph package in R.
library(tm)
#build corpus
corpus <- iconv(...
2
votes
1
answer
103
views
How to add target variable whether to see sentence belongs to data 1 or data 2?
I am working on a project, which I would like to summarize with a similar case. I need to collect n tweets with different hashtags.
Here is similar code:
library(tm)
#tweets from first hashtag
...
0
votes
0
answers
60
views
Restore original data from document term matrix in R
I want to know if there is a way to go back to my original database (df) after I have made it a document term matrix.
Here is an example of what I want to do.
df <- data.frame(group=c("A",...
0
votes
0
answers
78
views
filtered content of corpus by custom function with R
I want to analyze texts filtered by a custom function (a function with parameters) using R.
I used the readLines function to extract my text and got a large list of 258 lists. Then, using VCorpus(...
0
votes
2
answers
296
views
How to create a document term incidence matrix from long format text data?
I've got data that look like this:
ID  word
1   blue
1   red
1   green
1   yellow
2   blue
2   purple
2   orange
2   green
But I want to transform them into a binary incidence matrix denoting whether or not ...
1
vote
1
answer
264
views
How can I extract bigrams from text without removing the hash symbol?
I am using the following function (based on https://rpubs.com/sprishi/twitterIBM) to extract bigrams from text. However, I want to keep the hash symbol for analysis purposes. The function to clean ...
0
votes
1
answer
69
views
TermDocumentMatrix Error after Cleaning Corpus
My problem is that I want to pass my corpus to the tm function TermDocumentMatrix() and it fails with the error: Error in UseMethod("meta", x): no applicable method for 'meta' applied to an ...
1
vote
2
answers
2k
views
How to remove these special characters in r in a set of string : ’s, …
I have this string, which contains special characters. I am not able to remove these characters from the main data frame; however, when I prepared a separate object by dft and then I use the following ...
0
votes
1
answer
196
views
Unable to remove these characters from the data in a string in r
I am trying to remove the special characters from the following string with the following code, but I am not getting the expected result:
library(tm)
v <- "rt shibxwarrior hodl trust processsome ...
0
votes
1
answer
79
views
Text Mining: Cluster Analysis phrases. ERROR: cannot take a sample larger than the population
I'm working on a dataset of thousands of sentences. The dataset has one column and k rows.
I have to find some similarities between them and I'm doing a cluster analysis. I created a corpus and ...
-1
votes
1
answer
493
views
Extract table from unstructured text file in r
I have a text file named data.txt containing multiple tables in the following format.
//
// TABLE ET_ARCMAT
// ARCID MATID VALTO VALFR ...
0
votes
0
answers
128
views
Custom dictionary for word removing in R
I'd like to create a custom dictionary of words to be removed from a Corpus. I'm using the tm_map command.
I'd like to start from a .txt file (like word1,word2,word3; file.txt), import it in R and ...
1
vote
0
answers
23
views
Why does the clean.text() function change word frequencies?
I am doing text analysis and reading articles into R. When I use the clean.text() function from TextReg to clean the text of a corpus and then look up word frequencies using term_stats() from tm, the ...
2
votes
1
answer
75
views
Some words won't be stemmed using tm ("easier" or "easiest")
I have a large questionnaire dataset where some of the features need to be stemmed, with the goal being to assign a topic to each response. However, I'm having trouble stemming some words using the ...
0
votes
2
answers
359
views
subscript out of bounds error in document-term matrix
I am doing text mining on the following data, but I get the following error at the end:
Error in `[.simple_triplet_matrix`(dtm, 1:10, 1:10) :
subscript out of bounds
Can you help me address this error?
...
2
votes
4
answers
2k
views
Have mktime() ignore DST and local time zone in C++
Our system receives data from a vendor in ASCII format "20210715083015". This time is US Eastern time, and is already adjusted for Daylight Savings.
Our backend needs this time in ...
1
vote
2
answers
388
views
Convert Corpus from quanteda to tm
My data mycorpus is in a quanteda-corpus (corpus-function from quanteda) which I need to convert to a corpus under the tm package. I know about quanteda's convert-function. This, though, only converts ...
1
vote
1
answer
36
views
tm package removeWords function concatenate words in R
I am cleaning the sample data using removeWords from the tm package, but the removeWords function concatenates the words after removal. It should be "environmental dead frog" "environmental dead ...
1
vote
2
answers
420
views
search for word/phrase from column in R
I have data that looks like this:
> head(df)
ID Comment
1 1 I ate dinner.
2 2 ...
1
vote
1
answer
79
views
Lost one document during tokenization
I lost one row of data in the tokenization process.
There are three documents in this data set
structure(list(ID = c("N12277Y", "N12284X", "N12291W"), corrected = c("...
0
votes
1
answer
27
views
R TM package produces strange results with Inspect command
I'm having a little trouble with the inspect function from the tm package in R.
I have a sample 2-row data.table as defined below:
dt <- data.table(doc_id = c(1, 2), text = c('the driver of the 1st ...
2
votes
1
answer
44
views
combining words in tm R is not achieving desired result
I am trying to combine a few words so that they count as one.
In this example I want val and valuatin to be counted as valuation.
The code I have been using to try and do this is below:
#load in ...
0
votes
1
answer
1k
views
Cosine Similarity Matrix in R
I have a document term matrix, "mydtm" that I have created in R, using the 'tm' package. I am attempting to depict the similarities between each of the 557 documents contained within the dtm/...
0
votes
1
answer
186
views
Dealing with several text columns in a labeled data set while running NLP in R
Hope all of you guys are healthy and well.
I am new to the world of NLP and my question may sound stupid, so I apologize in advance. I would like to perform NLP on some text data which is labeled and ...
0
votes
1
answer
65
views
R Tm package dictionary matching leads to higher frequency than actual words of text
I have been using the code below to load text as a corpus and using the tm package to clean the text. As a next step I am loading a dictionary and cleaning it as well. Then I am matching the words ...
0
votes
1
answer
495
views
How to break a corpus into paragraphs using custom delimiters
I am scraping the New York Times webpages to do some natural language processing on them. I want to split the webpage into paragraphs when using corpus in order to do frequency counts on words that ...
0
votes
1
answer
120
views
Issue with adding breaks to a tm_object
I am having trouble adding fixed breaks to a tm_map. I tried the same code as another topic on this forum (Customize how R tmap legend values are printed) on a different dataset, but the ...
0
votes
2
answers
308
views
Calculating term frequencies in a big corpus efficiently regardless of document boundaries
I have a corpus of almost 2m documents. I want to calculate the term frequencies of the terms in the whole corpus, regardless of document boundaries.
A naive approach would be combining all the ...