You're reading from Hands-On Natural Language Processing with Python A practical guide to applying deep learning architectures to your NLP applications

Product type Paperback

Published in Jul 2018

Publisher Packt

ISBN-13 9781789139495

Length 312 pages

Edition 1st Edition

Languages

Processing

Tools

NLTK

Concepts

Deep Learning

Authors (2):

Rajalingappaa Shanmugamani

Rajesh Arumugam

Basic concepts and terminologies in NLP

The following are some of the important terms and concepts in NLP, most of them related to language data. Becoming familiar with them will help you get up to speed with the later chapters of the book:

Text corpus or corpora
Paragraph
Sentences
Phrases and words
N-grams
Bag-of-words

We will explain these in the following sections.

Text corpus or corpora

The language data that all NLP tasks depend on is called the text corpus, or simply the corpus (plural: corpora). A corpus is a large set of text data in a language such as English or French. It can consist of a single document or a collection of documents. Sources of text corpora include social network sites such as Twitter, blog sites, open discussion forums such as Stack Overflow, books, and many others. Some tasks, such as machine translation, require a multilingual corpus. For example, we might need both the English and French versions of the same document content to develop a machine translation model. For speech tasks, we also need human voice recordings and the corresponding transcribed corpus.

In most of the later chapters, we will use text corpora and speech recordings available from the internet or open source data repositories. For many NLP tasks, the corpus is split into chunks for further analysis. These chunks can be at the paragraph, sentence, or word level. We will touch upon these in the following sections.

Paragraph

A paragraph is the largest unit of text handled by an NLP task. Paragraph-level boundaries by themselves may not be of much use unless the paragraph is broken down into sentences, though a paragraph is sometimes treated as a context boundary. Tokenizers that can split a document into paragraphs are available in some Python libraries. We will look at such tokenizers in later chapters.
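
As a rough illustration of paragraph-level splitting, the following sketch uses only the Python standard library. It assumes that paragraphs are separated by one or more blank lines, which may not hold for every corpus; the function name split_paragraphs is hypothetical and is not one of the library tokenizers discussed in later chapters.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs, assuming blank lines separate them."""
    # A newline, optional whitespace, and another newline marks a blank line.
    parts = re.split(r"\n\s*\n", text.strip())
    # Drop empty pieces and trim surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

document = (
    "NLP works with language data.\nIt often starts from raw text.\n"
    "\n"
    "A second paragraph begins here."
)
paragraphs = split_paragraphs(document)
for number, paragraph in enumerate(paragraphs, start=1):
    print(number, repr(paragraph))
```

A single line break inside a paragraph is left untouched; only a blank line starts a new paragraph.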

Sentences

Sentences are the next level of lexical unit in language data. A sentence encapsulates a complete meaning or thought, along with its context. It is usually extracted from a paragraph based on boundaries determined by punctuation, such as a period. A sentence may also convey an opinion or sentiment. In general, sentences consist of parts of speech (POS) entities such as nouns, verbs, and adjectives. Tokenizers are available to split paragraphs into sentences based on punctuation.
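
The idea of punctuation-based sentence splitting can be sketched with a regular expression. This is a deliberately naive example, not the tokenizer used later in the book: it assumes that every period, exclamation mark, or question mark followed by whitespace ends a sentence, so abbreviations such as "Dr." are split incorrectly. The function name split_sentences is hypothetical.

```python
import re

def split_sentences(paragraph):
    """Naively split a paragraph after '.', '!' or '?' followed by whitespace."""
    # The lookbehind keeps the punctuation mark attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [s for s in parts if s]

paragraph = (
    "Tomorrow is going to be a rainy day. Take an umbrella! "
    "Will it clear up later?"
)
sentences = split_sentences(paragraph)
for sentence in sentences:
    print(sentence)
```

Trained sentence tokenizers exist largely to handle the cases this rule gets wrong, such as abbreviations and decimal numbers.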

Phrases and words

A phrase is a group of consecutive words within a sentence that conveys a specific meaning. For example, in the sentence "Tomorrow is going to be a rainy day", the part "going to be a rainy day" expresses a specific thought. Some NLP tasks extract key phrases from sentences for search and retrieval applications. The next smallest unit of text is the word. Common tokenizers split sentences into words based on delimiters such as spaces and commas. One of the problems in NLP is that the same word can have different meanings in different contexts. We will see how this ambiguity is handled when we discuss word embeddings.
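
To make word-level tokenization concrete, the sketch below compares plain whitespace splitting with a small regular expression that separates punctuation from words. Both are simplifications chosen for illustration; the function name tokenize_words is hypothetical and does not correspond to any particular library.

```python
import re

def tokenize_words(sentence):
    """Return word tokens and punctuation marks as separate tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

sentence = "Tomorrow, it is going to be a rainy day."

# Splitting on whitespace leaves punctuation attached to neighboring words.
whitespace_tokens = sentence.split()

# The regular expression emits punctuation as tokens of their own.
regex_tokens = tokenize_words(sentence)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, "Tomorrow," and "day." keep their punctuation, so they would be counted as different words from "Tomorrow" and "day".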

N-grams

A sequence of N consecutive characters or words forms an N-gram. For example, a character unigram consists of a single character, a character bigram consists of a sequence of two characters, and so on. Similarly, a word N-gram consists of a sequence of N words. In NLP, N-grams are used as features for tasks such as text classification.
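
A minimal sketch of N-gram extraction follows. It slides a window of size n over any sequence, so the same hypothetical helper, ngrams, works for both characters (a string) and words (a list of tokens). Splitting on spaces to obtain the words is an assumption made to keep the example short.

```python
def ngrams(items, n):
    """Return all runs of n consecutive items as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L yields L - n + 1 windows (none if n > L).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Character bigrams of a single word.
char_bigrams = ngrams("rainy", 2)

# Word trigrams of a sentence, using whitespace-separated tokens.
words = "tomorrow is going to be a rainy day".split()
word_trigrams = ngrams(words, 3)

print(char_bigrams)
print(word_trigrams)
```

Unlike a plain list of words, each N-gram preserves the local order of its items, which is what makes N-grams useful as features.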

Bag-of-words

Bag-of-words, in contrast to N-grams, does not consider word order or sequence. It captures only the frequency with which each word occurs in the text corpus. Bag-of-words is also used as a feature representation in tasks such as sentiment analysis and topic identification.
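
The loss of word order can be seen in a small sketch that builds a bag-of-words with collections.Counter from the standard library. Lowercasing and whitespace splitting are simplifying assumptions, and the function name bag_of_words is hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercased, whitespace-separated word occurs."""
    return Counter(text.lower().split())

# Two sentences with the same words in a different order.
bag_a = bag_of_words("the dog bit the man")
bag_b = bag_of_words("the man bit the dog")

print(bag_a)
print(bag_a == bag_b)
```

The two sentences mean different things, yet they produce identical bags, because only the word counts are kept.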

In the following sections, we will look at an overview of the following applications of NLP: