All Questions

-1 votes
1 answer
493 views

Extract table from unstructured text file in R

I have a text file named data.txt containing multiple tables in the following format. // // TABLE ET_ARCMAT // ARCID MATID VALTO VALFR ...
abdul samad
1 vote
1 answer
251 views

How can I use the tm_map, removeWords, function with regex values?

I am working with a list of previously clustered re-tweet usernames, which I would like to upload in a Document-Term-Matrix for further comparison per cluster. Each cluster is hereby stored as a ...
hyde
  • 53
1 vote
3 answers
2k views

Extracting text from *.txt files in R

I've used Expressions for Mac to confirm my Regex works but I can't find a command to extract information from my text file. I have 2,500 text files and I need to pull out the date of each document in ...
IanLux
  • 13
0 votes
0 answers
249 views

R Regular expression to search citations of law using tidytext and tm

I use tidytext, tm and quanteda for text mining. I am trying to: filter a tibble with plain, processed text according to the presence of a citation of law; count the number of the same citation per text ...
captcoma
  • 1,898
3 votes
3 answers
5k views

R: removing part of the word in a character string

I have a character vector words <- c("somethingspan.", "..span?", "spanthank", "great to hear", "yourspan") And I'm trying to remove span AND punctuation from every word in the vector > ...
Kasia Kulma
  • 1,722
6 votes
1 answer
3k views

R: regexpr() how to use a vector in pattern parameter

I would like to learn the positions of terms from a dictionary found in a set of short texts. The problem is in the last lines of the following code roughly based on From of list of strings, identify ...
Jacek Kotowski
0 votes
2 answers
232 views

Finding repeated sentences/words/phrases by group over time

I have a dataset in which each column is a variable and each row is an observation (like time series data). It looks like this (I apologize for the format, but I can't show the data): I'd like to ...
Alex
  • 77
1 vote
1 answer
793 views

R errors because of PCRE configuration, unicode properties

I am using the removeWords and tm_map() functions in the tm package in order to parse some text data. My understanding is that it simply uses Perl regular expressions through gsub() to complete the ...
RickyB
  • 627
1 vote
1 answer
111 views

Extracting unknown dates from txt/HTML files using R

I want to extract dates from txt (or HTML) documents using a pattern which I identified in the text using the R tm package. I have newspaper articles on my PC in the folders data_X_txt and data_X (in ...
Marvin Schopf
5 votes
1 answer
7k views

R tm substitute words in Corpus using gsub

I have a large document corpus with more than 200 documents. As you can expect from such a large corpus, some of the words are misspelled, used in different formats, and so on and so forth. I have ...
DotPi
  • 4,107
0 votes
1 answer
656 views

R- Subset a corpus by meta data (id) matching partial strings

I'm using the R (3.2.3) tm package (0.6-2) and would like to subset my corpus according to partial string matches contained in the metadatum "id". For example, I would like to filter all documents ...
tarti
  • 35
0 votes
1 answer
1k views

Removing @mentions using the 'tm' package R

I have a corpus of tweets, some of which contain @mentions that I want to remove. I am using the tm_map function of the tm package but am not getting the desired result. Here is an example: ...
Anurag H
  • 999
2 votes
1 answer
512 views

Why does extractHTMLStrip() from tm.plugin.webmining truncate strings under 61 characters?

I have a set of messages, some of which are plain text and others marked up with HTML tags. The messages with HTML tags do not appear to contain the tags <html> or <body>; I've only ...
matmat
  • 905
1 vote
2 answers
1k views

Removing phrases (stop phrases) from a corpus in R?

I can easily remove stop words using the tm package but is there an easy way to remove specific phrases? I'd like to be able to remove the phrase, "good morning" but not remove cases where good is not ...
Roshman
  • 33
7 votes
2 answers
2k views

How to extract sentences containing specific person names using R

I am using R to extract sentences containing specific person names from texts and here is a sample paragraph: Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg ...
Frown
  • 259
2 votes
2 answers
3k views

Splitting a document from a tm Corpus into multiple documents

A bit of a bizarre question, is there a way to split corpus documents that have been imported using the Corpus function in tm into multiple documents that can then be reread in my Corpus as separate ...
src471
  • 91
3 votes
1 answer
108 views

Search a string by a mix of syntactical and regex patterns

I would like to use R to search a text for patterns expressed through a mix of POS and actual strings. (I have seen this functionality in a python library here: http://www.clips.ua.ac.be/pages/pattern-...
nassimhddd
  • 8,510
0 votes
1 answer
1k views

R : Text Analysis - tm Package - stemComplete error

Machine: Windows 7, 64-bit. R version: 3.1.2 (2014-10-31), "Pumpkin Helmet". I am working on stemming some text for an analysis that I am doing. I am able to do everything all the way up ...
Jacob Johnston
1 vote
0 answers
104 views

How to remove whitespace between two specific letters in the tm package using gsub

I'm using the 'tm' package to clean and mine a large set of social media posts related to e-cigarettes as a precursor to running principal components analysis on the output to identify key themes. I ...
soytri
  • 11
3 votes
2 answers
2k views

Searching R corpus for all words ending in "esque"

I am using R's tm package to get word frequencies using the dictionary method. I want to find all words that end with "esque" whether they are spelled "abcd-esque", "abcdesque" or "abcd esque" (since ...
monarque13
2 votes
1 answer
291 views

R - does failed RegEx pattern matching originate in file conversion or use of tm package?

As a relative novice in R and programming, my first ever question in this forum is about regex pattern matching, specifically line breaks. First some background. I am trying to perform some ...
Brigitte
3 votes
2 answers
3k views

How to keep a space between two words after removing punctuation and numbers when text mining in R

In the example below, after removing the number 3054 and punctuation marks (-) from the given string "BG3054-suhas B-DC chr 23.7-22.8.13", the output is combined as bgsuhas, but I need a space ...
Suhas
  • 41
0 votes
1 answer
2k views

Remove characters in text corpus

I'm analyzing a corpus of emails. Some emails contain URLs. When I apply the removePunctuation function from the tm library, I get httpwww, and then I lose the info of a web address. What I would like ...
Yoav
  • 1,029
1 vote
2 answers
2k views

How to remove rows from a data frame that contain only a few words in R?

I'm trying to remove rows from my data frame that contain fewer than 5 words. E.g. mydf <- as.data.frame(read.xlsx("C:\\data.xlsx", 1, header=TRUE)) head(mydf) NO ARTICLE 1 34 The ...
cptn
  • 693
1 vote
2 answers
1k views

How to use a regular expression inside TermDocumentMatrix for text mining?

I know that I can use the tm package to count the occurrences of specific words in a corpus using the Dictionary function: require(tm) data(crude) dic <- Dictionary("crude") tdm <- ...
Christine Forrester
1 vote
3 answers
2k views

Removing everything but html tags from a corpus

I'm using the package tm. I have a corpus full of HTML documents and I would like to remove everything but the HTML tags. I've been trying to do that for a few days but I don't seem to be able to find ...
Simon-Okp
  • 697