You're reading from Natural Language Processing and Computational Linguistics A practical guide to text analysis with Python, Gensim, spaCy, and Keras

Product type Paperback

Published in Jun 2018

Publisher Packt

ISBN-13 9781788838535

Length 306 pages

Edition 1st Edition

Author (1):

Bhargav Srinivasa-Desikan

Similarity queries

Now that we can compare two documents, we can set up our algorithms to extract the documents most similar to an input query: index each document, compute the distance between the query and every document in the corpus, and return the documents with the lowest distance values, since these are the most similar. Luckily for us, Gensim has built-in structures for this document similarity task!
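To make the manual approach concrete, here is a minimal sketch of a brute-force similarity query using only the standard library. It assumes bag-of-words vectors and cosine distance as a stand-in for whichever distance metric you prefer; the helper names (`bow`, `cosine_distance`, `most_similar`) and the toy documents are illustrative and not part of Gensim.

```python
import math
from collections import Counter


def bow(text):
    """Turn a string into a bag-of-words count vector (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def most_similar(query, documents, top_n=2):
    """Index every document, then return the top_n (doc_id, distance) pairs."""
    index = [bow(doc) for doc in documents]  # the "index" is just a list of vectors
    query_vec = bow(query)
    scored = [(cosine_distance(query_vec, vec), doc_id)
              for doc_id, vec in enumerate(index)]
    scored.sort()  # lowest distance first
    return [(doc_id, dist) for dist, doc_id in scored[:top_n]]


# Toy corpus, made up for illustration
documents = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
results = most_similar("cat on a mat", documents)
print(results)
```

This scans every document for every query, which is fine for a handful of texts but is exactly the cost that a purpose-built index is meant to reduce.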

We will be using the similarities module to construct this structure.

from gensim import similarities

We previously mentioned creating an index; we can do this far faster with the similarities module. As the Gensim documentation for the Similarity class explains, the class splits the index into several smaller sub-indexes (shards), which are disk-based. If your entire index fits in memory (hundreds of thousands of documents for 1 GB of RAM), you can also use the MatrixSimilarity or SparseMatrixSimilarity...