All Questions
26 questions
-1
votes
1
answer
493
views
Extract table from unstructured text file in r
I have a text file named data.txt that contains multiple tables in the following format.
//
// TABLE ET_ARCMAT
// ARCID MATID VALTO VALFR ...
1
vote
1
answer
251
views
How can I use the tm_map, removeWords, function with regex values?
I am working with a list of previously clustered re-tweet usernames, which I would like to load into a document-term matrix for further comparison per cluster. Each cluster is stored as a ...
1
vote
3
answers
2k
views
Extracting text from *.txt files in R
I've used Expressions for Mac to confirm my Regex works but I can't find a command to extract information from my text file. I have 2,500 text files and I need to pull out the date of each document in ...
0
votes
0
answers
249
views
R Regular expression to search citations of law using tidytext and tm
I use tidytext, tm and quanteda for text mining.
I try to:
filter a tibble with plain, processed text according to presence of a citation of law
count the number of the same citation per text ...
3
votes
3
answers
5k
views
R: removing part of the word in a character string
I have a character vector
words <- c("somethingspan.", "..span?", "spanthank", "great to hear", "yourspan")
And I'm trying to remove span AND punctuation from every word in the vector
> ...
6
votes
1
answer
3k
views
R: regexpr() how to use a vector in pattern parameter
I would like to learn the positions of terms from a dictionary found in a set of short texts. The problem is in the last lines of the following code roughly based on From of list of strings, identify ...
0
votes
2
answers
232
views
Finding repeated sentences/words/phrases by group over time
I have a data set in which each column is a variable and each row is an observation (like time series data). It looks like this (I apologize for the format, but I can't show the data):
I'd like to ...
1
vote
1
answer
793
views
R errors because of PCRE configuration, unicode properties
I am using the removeWords and tm_map() functions in the tm package in order to parse some text data. My understanding is that it simply uses Perl regular expressions through gsub() to complete the ...
1
vote
1
answer
111
views
Extracting unknown dates from txt/HTML files using R
I want to extract dates from txt (or HTML) documents using a pattern that I identified in the text using the R tm package. I have newspaper articles on my PC in the folders data_X_txt and data_X (in ...
5
votes
1
answer
7k
views
R tm substitute words in Corpus using gsub
I have a large document corpus with more than 200 documents. As you can expect from such a large corpus, some of the words are misspelled, used in different formats, and so on and so forth. I have ...
0
votes
1
answer
656
views
R- Subset a corpus by meta data (id) matching partial strings
I'm using the R (3.2.3) tm package (0.6-2) and would like to subset my corpus according to partial string matches contained in the metadatum "id".
For example, I would like to filter all documents ...
0
votes
1
answer
1k
views
Removing @mentions using the 'tm' package R
I have a corpus of tweets, some of which contain @mentions that I want to remove. I am using the tm_map function of the tm package but am not getting the desired result. Here is an example:
...
2
votes
1
answer
512
views
Why does extractHTMLStrip() from tm.plugin.webmining truncate strings under 61 characters?
I have a set of messages, some of which are plain text and others that are marked up with HTML tags. The messages with HTML tags do not appear to contain the tags <html> or <body>; I've only ...
1
vote
2
answers
1k
views
removing phrases (stopphrases) from corpus in R?
I can easily remove stop words using the tm package but is there an easy way to remove specific phrases? I'd like to be able to remove the phrase, "good morning" but not remove cases where good is not ...
7
votes
2
answers
2k
views
How to extract sentences containing specific person names using R
I am using R to extract sentences containing specific person names from texts and here is a sample paragraph:
Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg ...
2
votes
2
answers
3k
views
Splitting a document from a tm Corpus into multiple documents
A bit of a bizarre question: is there a way to split corpus documents that have been imported using the Corpus function in tm into multiple documents that can then be reread in my Corpus as separate ...
3
votes
1
answer
108
views
Search a string by a mix of syntactical and regex patterns
I would like to use R to search a text for patterns expressed through a mix of POS and actual strings. (I have seen this functionality in a Python library here: http://www.clips.ua.ac.be/pages/pattern-...
0
votes
1
answer
1k
views
R : Text Analysis - tm Package - stemComplete error
Machine: Windows 7 - 64 bit
R Version : R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
I am working on stemming some text for an analysis that I am doing. I am able to do everything all the way up ...
1
vote
0
answers
104
views
How to remove whitespace between two specific letters in TM pkg using gsub
I'm using the 'tm' package to clean and mine a large set of social media posts related to e-cigarettes as a precursor to running principal components analysis on the output to identify key themes. I ...
3
votes
2
answers
2k
views
Searching R corpus for all words ending in "esque"
I am using R's tm package to get word frequencies using the dictionary method. I want to find all words that end with "esque" whether they are spelled "abcd-esque", "abcdesque" or "abcd esque" (since ...
2
votes
1
answer
291
views
R - does failed RegEx pattern matching originate in file conversion or use of tm package?
As a relative novice in R and programming, my first ever question in this forum is about regex pattern matching, specifically line breaks. First some background. I am trying to perform some ...
3
votes
2
answers
3k
views
How to give space between 2 words after removing Punctuation and Numbers text mining in R
In the example below, after removing the number 3054 and the punctuation mark - from the given string "BG3054-suhas B-DC chr 23.7-22.8.13", the output is combined as bgsuhas, but I need a space ...
0
votes
1
answer
2k
views
Remove characters in text corpus
I'm analyzing a corpus of emails. Some emails contain URLs. When I apply the removePunctuation function from the tm library, I get httpwww, and then I lose the info of a web address. What I would like ...
1
vote
2
answers
2k
views
How to remove rows from a data frame that contain only few words in R?
I'm trying to remove rows from my data frame that contain fewer than 5 words.
e.g.
mydf <- as.data.frame(read.xlsx("C:\\data.xlsx", 1, header=TRUE))
head(mydf)
NO ARTICLE
1 34 The ...
1
vote
2
answers
1k
views
How to use a regular expression inside TermDocumentMatrix for text mining?
I know that I can use the tm package to count the occurrences of specific words in a corpus using the Dictionary function:
require(tm)
data(crude)
dic <- Dictionary("crude")
tdm <- ...
1
vote
3
answers
2k
views
Removing everything but html tags from a corpus
I'm using the package tm. I have a corpus full of HTML documents and I would like to remove everything but the HTML tags. I've been trying to do that for a few days but I don't seem to be able to find ...