Newest 'tm+regex' Questions

-1 votes

1 answer

493 views

Extract table from unstructured text file in r

I have a text file named data.txt containing multiple tables in the following format. // // TABLE ET_ARCMAT // ARCID MATID VALTO VALFR ...

abdul samad

61

asked Oct 10, 2021 at 3:16

1 vote

1 answer

251 views

How can I use the tm_map, removeWords, function with regex values?

I am working with a list of previously clustered re-tweet usernames, which I would like to upload in a Document-Term-Matrix for further comparison per cluster. Each cluster is hereby stored as a ...

hyde

53

asked May 27, 2020 at 22:58

1 vote

3 answers

2k views

Extracting text from *.txt files in R

I've used Expressions for Mac to confirm my Regex works but I can't find a command to extract information from my text file. I have 2,500 text files and I need to pull out the date of each document in ...

IanLux

13

asked Dec 4, 2018 at 16:51

0 votes

0 answers

249 views

R Regular expression to search citations of law using tidytext and tm

I use tidytext, tm and quanteda for text mining. I am trying to: filter a tibble with plain, processed text according to the presence of a citation of law; count the number of the same citation per text ...

captcoma

1,898

asked Jan 13, 2018 at 20:10

3 votes

3 answers

5k views

R: removing part of the word in a character string

I have a character vector words <- c("somethingspan.", "..span?", "spanthank", "great to hear", "yourspan") And I'm trying to remove span AND punctuation from every word in the vector > ...

Kasia Kulma

1,722

asked Dec 14, 2017 at 10:10

6 votes

1 answer

3k views

R: regexpr() how to use a vector in pattern parameter

I would like to learn the positions of terms from a dictionary found in a set of short texts. The problem is in the last lines of the following code roughly based on From of list of strings, identify ...

Jacek Kotowski

704

asked Jul 14, 2017 at 7:00

0 votes

2 answers

232 views

Finding repeated sentences/words/phrases by group over time

I have a data set in which each column is a variable and each row is an observation (like time series data). It looks like this (I apologize for the format, but I can't show the data): I'd like to ...

Alex

77

asked Jun 15, 2017 at 13:57

1 vote

1 answer

793 views

R errors because of PCRE configuration, unicode properties

I am using the removeWords and tm_map() functions in the tm package in order to parse some text data. My understanding is that it simply uses Perl regular expressions through gsub() to complete the ...

RickyB

627

asked Feb 16, 2017 at 15:01

1 vote

1 answer

111 views

Extracting unknown dates from txt/HTML files using R

I want to extract dates from txt (or HTML) documents using a pattern which I identified in the text using the R tm package. I have newspaper articles on my PC in the folders data_X_txt and data_X (in ...

Marvin Schopf

161

asked Nov 1, 2016 at 13:35

5 votes

1 answer

7k views

R tm substitute words in Corpus using gsub

I have a large document corpus with more than 200 documents. As you can expect from such a large corpus, some of the words are misspelled, used in different formats, and so on and so forth. I have ...

DotPi

4,107

asked Jul 27, 2016 at 7:00

0 votes

1 answer

656 views

R- Subset a corpus by meta data (id) matching partial strings

I'm using the R (3.2.3) tm package (0.6-2) and would like to subset my corpus according to partial string matches contained within the metadatum "id". For example, I would like to filter all documents ...

tarti

35

asked Mar 22, 2016 at 9:26

0 votes

1 answer

1k views

Removing @mentions using the 'tm' package R

I have a corpus of tweets, some of which have @mentions that I want to remove. I am using the tm_map function of the tm package but not getting the desired result. Here is an example: ...

Anurag H

999

asked Mar 3, 2016 at 12:20

2 votes

1 answer

512 views

Why does extractHTMLStrip() from tm.plugin.webmining truncate strings under 61 characters?

I have a set of messages, some of which are plain text and others that are marked up with HTML tags. The messages with HTML tags do not appear to contain the tags <html> or <body>; I've only ...

matmat

905

asked Dec 3, 2015 at 1:39

1 vote

2 answers

1k views

removing phrases (stopphrases) from corpus in R?

I can easily remove stop words using the tm package but is there an easy way to remove specific phrases? I'd like to be able to remove the phrase, "good morning" but not remove cases where good is not ...

Roshman

33

asked Jul 24, 2015 at 13:40

7 votes

2 answers

2k views

How to extract sentences containing specific person names using R

I am using R to extract sentences containing specific person names from texts and here is a sample paragraph: Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg ...

Frown

259

asked Jul 21, 2015 at 9:24

2 votes

2 answers

3k views

Splitting a document from a tm Corpus into multiple documents

A bit of a bizarre question: is there a way to split corpus documents that have been imported using the Corpus function in tm into multiple documents that can then be reread in my Corpus as separate ...

src471

91

asked Jun 17, 2015 at 20:31

3 votes

1 answer

108 views

Search a string by a mix of syntactical and regex patterns

I would like to use R to search a text for patterns expressed through a mix of POS and actual strings. (I have seen this functionality in a python library here: http://www.clips.ua.ac.be/pages/pattern-...

nassimhddd

8,510

asked Mar 30, 2015 at 8:17

0 votes

1 answer

1k views

R : Text Analysis - tm Package - stemComplete error

Machine: Windows 7, 64-bit. R version: R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet". I am working on stemming some text for an analysis that I am doing. I am able to do everything all the way up ...

Jacob Johnston

141

asked Feb 20, 2015 at 1:11

1 vote

0 answers

104 views

How to remove whitespace between two specific letters in TM pkg using gsub

I'm using the 'tm' package to clean and mine a large set of social media posts related to e-cigarettes as a precursor to running principal component analysis on the output to identify key themes. I ...

soytri

11

asked Jan 28, 2015 at 20:33

3 votes

2 answers

2k views

Searching R corpus for all words ending in "esque"

I am using R's tm package to get word frequencies using the dictionary method. I want to find all words that end with "esque" whether they are spelled "abcd-esque", "abcdesque" or "abcd esque" (since ...

monarque13

578

asked Dec 19, 2014 at 3:24

2 votes

1 answer

291 views

R - does failed RegEx pattern matching originate in file conversion or use of tm package?

As a relative novice in R and programming, my first ever question in this forum is about regex pattern matching, specifically line breaks. First some background. I am trying to perform some ...

Brigitte

77

asked Nov 3, 2014 at 2:17

3 votes

2 answers

3k views

How to give space between 2 words after removing Punctuation and Numbers text mining in R

In the example below, after removing the number 3054 and the punctuation marks from the string "BG3054-suhas B-DC chr 23.7-22.8.13", the output is combined as bgsuhas, but I need a space ...

Suhas

41

asked Aug 3, 2014 at 14:33

0 votes

1 answer

2k views

Remove characters in text corpus

I'm analyzing a corpus of emails. Some emails contain URLs. When I apply the removePunctuation function from the tm library, I get httpwww, and then I lose the info of a web address. What I would like ...

Yoav

1,029

asked May 28, 2014 at 8:21

1 vote

2 answers

2k views

How to remove rows from a data frame that contain only few words in R?

I'm trying to remove rows from my data frame that contain fewer than 5 words, e.g. mydf <- as.data.frame(read.xlsx("C:\\data.xlsx", 1, header=TRUE)) head(mydf) NO ARTICLE 1 34 The ...

cptn

693

asked Mar 3, 2014 at 6:32

1 vote

2 answers

1k views

How to use a regular expression inside TermDocumentMatrix for text mining?

I know that I can use the tm package to count the occurrences of specific words in a corpus using the Dictionary function: require(tm) data(crude) dic <- Dictionary("crude") tdm <- ...

Christine Forrester

281

asked Aug 22, 2013 at 14:18

1 vote

3 answers

2k views

Removing everything but html tags from a corpus

I'm using the package tm. I have a corpus full of HTML documents and I would like to remove everything but the HTML tags. I've been trying to do that for a few days but I don't seem to be able to find ...

Simon-Okp

697

asked Mar 26, 2012 at 15:51
