Handling corpus-raw text
In this section, we will see how to get the raw text; in the following section, we will preprocess the text and identify the sentences.
The process for this section is given in Figure 4.1:

Figure 4.1: Process of handling corpus-raw text
Getting raw text
In this section, we will get raw text data from three sources.
The following are the data sources:
- Raw text file
- Raw text defined inside a script in the form of a local variable
- Any of the corpora available in nltk
Let's begin:
- Raw text file access: I have a .txt file saved on my local computer which contains text data in the form of a paragraph. I want to read the content of that file and load it; I will then run a sentence tokenizer to get the sentences out of it.
- Define raw data text inside a script in the form of a local variable: If we have a small amount of data, then we can assign the data to a local string variable. For example: Text = This is the sentence, this is...
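The first two sources can be sketched in a few lines of Python. This is a minimal illustration, not the book's own listing: the helper name `read_raw_text`, the sample file name, and the placeholder sentences are all assumptions made for the example, and the file is assumed to be UTF-8 encoded.

```python
import os
import tempfile
from pathlib import Path


def read_raw_text(path, encoding="utf-8"):
    """Return the whole content of a text file as one string.

    `read_raw_text` is a hypothetical helper name; the UTF-8 default
    is an assumption about how the file was saved.
    """
    return Path(path).read_text(encoding=encoding)


# Source 2: a small amount of raw text assigned to a local string variable.
# The sentences are placeholder text, not data taken from the book.
text = "This is the first sentence. This is the second sentence."


if __name__ == "__main__":
    # Source 1: create a throwaway .txt file so the demo is self-contained,
    # then read it back the same way we would read a real file.
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample_path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name
        Path(sample_path).write_text(text, encoding="utf-8")
        print(read_raw_text(sample_path))
```

In practice we would pass the path of our own .txt file to `read_raw_text`. Either way, the result is a plain Python string, which is the input a sentence tokenizer expects.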