Using regular expressions in text processing
A regular expression is a sequence of characters that defines a search pattern. Natural language processing and text mining are two areas where regular expressions are used heavily, although there are other application areas as well. In this recipe, you will perform text data pre-processing using regular expressions rather than the tm library.
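As a quick illustration of a search pattern, the following minimal sketch uses two base R functions, grepl() and gsub(). The sample strings and variable names are made up for this example and are not part of the recipe's data.

```r
# Hypothetical sample text, invented for illustration only.
x <- c("R is used for text mining.", "Big data in R: pbdR", "No match here")

# grepl() returns TRUE for each element where the pattern matches.
# "\\bR\\b" matches a capital R that stands alone as a word.
has_r <- grepl("\\bR\\b", x, perl = TRUE)

# gsub() replaces every match of the pattern.
# "[^A-Za-z ]" matches any character that is neither a letter nor a space.
letters_only <- gsub("[^A-Za-z ]", "", x)

print(has_r)
print(letters_only)
```

Here the first pattern is used to search and the second to clean, which are the two roles regular expressions typically play in pre-processing.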
Getting ready
Suppose you have a corpus of documents and your objective is to find the frequent words in the corpus. The first step is to pre-process the text and then create a term-document matrix. In this recipe, you will apply regular expressions to text data retrieved from the following web page using the readLines() function:
https://en.wikipedia.org/wiki/Programming_with_Big_Data_in_R
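To make the objective concrete, here is a rough sketch of finding frequent words with base R and regular expressions only. It is one possible approach under stated assumptions, not necessarily the steps this recipe follows: the two sample lines stand in for what readLines() would return, and the names page_lines, text, words, and word_freq are hypothetical.

```r
# Hypothetical stand-in for the character vector returned by readLines().
# To work on the real page, one could instead use:
#   page_lines <- readLines("https://en.wikipedia.org/wiki/Programming_with_Big_Data_in_R")
page_lines <- c(
  "<p>Programming with <b>Big Data</b> in R</p>",
  "<p>R packages for big data; data analysis with R.</p>"
)

text <- gsub("<[^>]+>", " ", page_lines)  # replace HTML tags with a space
text <- tolower(text)                     # normalise to lower case
text <- gsub("[^a-z]+", " ", text)        # collapse non-letter runs into one space
text <- trimws(text)                      # drop leading and trailing spaces

# Split each line into words and discard any empty strings.
words <- unlist(strsplit(text, " +"))
words <- words[nchar(words) > 0]

# Count the occurrences of each word, most frequent first.
word_freq <- sort(table(words), decreasing = TRUE)
print(head(word_freq, 3))
```

The tag-stripping pattern is deliberately simple and only suits a rough word count. A real page also contains scripts, styles, and character entities that would need further handling.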
How to do it…
Let's take a look at the following steps to learn how to use a regular expression in text processing:
- To read...