Artificial Vision and Language Processing for Robotics: Create end-to-end systems that can power robots with artificial vision and deep learning techniques

Product type Paperback

Published in Apr 2019

Publisher Packt

ISBN-13 9781838552268

Length 356 pages

Edition 1st Edition

Authors (3):

Morena Alberola

Molina Gallego

Garay Maestre

Chapter 3: Fundamentals of Natural Language Processing

Activity 3: Process a Corpus

Solution

Import the TfidfVectorizer and TruncatedSVD classes from scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
Load the corpus:
docs = []
ndocs = ["doc1", "doc2", "doc3"]
for n in ndocs:
    # A context manager closes each file once it has been read
    with open("dataset/" + n + ".txt", "r", encoding="utf8") as aux:
        docs.append(aux.read())
With spaCy, let's add some new stop words, tokenize the corpus, and remove the stop words. The new corpus without these words will be stored in a new variable:
import spacy
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS
nlp = en_core_web_sm.load()
nlp.vocab["\n\n"].is_stop = True
nlp.vocab["\n"].is_stop = True
nlp.vocab["the"].is_stop = True
nlp.vocab["The"].is_stop = True
newD = []
for d in docs:
    doc = nlp(d)
    # Keep only tokens that are neither stop words nor punctuation
    tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
    newD.append(' '.join(tokens))
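The filtering step can also be illustrated without a language model. The following is a minimal sketch, not the spaCy pipeline used above: the regex tokenizer, the `STOP` set, and the `remove_stop_words` helper are hypothetical stand-ins, chosen only to show the shape of the operation (tokenize, drop stop words and punctuation, rejoin).

```python
import re

# Hypothetical stop-word set; spaCy's STOP_WORDS list is much larger.
STOP = {"the", "a", "is", "on", "of"}

def remove_stop_words(text, stop_words=STOP):
    """Tokenize on word characters and drop stop words (case-insensitive)."""
    # \w+ matches runs of letters, digits, and underscores, so punctuation
    # and newlines never become tokens.
    tokens = re.findall(r"\w+", text)
    kept = [t for t in tokens if t.lower() not in stop_words]
    return " ".join(kept)

cleaned = remove_stop_words("The robot is on the table.\n\nThe arm moves.")
print(cleaned)  # robot table arm moves
```

Unlike this sketch, spaCy's tokenizer treats whitespace runs such as "\n\n" as tokens, which is why the solution marks them as stop words explicitly.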
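Create the TF-IDF matrix. A few parameters are set to improve the results: both unigrams and bigrams are counted (ngram_range), the IDF weights are smoothed (smooth_idf), and terms that occur in more than half of the documents are ignored (max_df):
vectorizer = TfidfVectorizer(use_idf=True,
                             ngram_range=(1, 2),
                             smooth_idf=True,
                             max_df=0.5)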
X = vectorizer.fit_transform(newD)
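To make the effect of these parameters concrete, here is a rough pure-Python sketch of the n-gram extraction, the `max_df` filter, and the smoothed IDF formula that scikit-learn documents for `smooth_idf=True`. The toy corpus and the `ngrams` helper are assumptions made up for illustration; this is not how `TfidfVectorizer` is implemented internally.

```python
import math
from collections import Counter

def ngrams(tokens, lo=1, hi=2):
    """Return all n-grams with lo <= n <= hi, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

# Hypothetical toy corpus, already stripped of stop words.
toy_docs = ["robot arm moves", "robot camera sees", "dialogue system answers"]
doc_terms = [set(ngrams(d.split())) for d in toy_docs]
n_docs = len(doc_terms)

# Document frequency: the number of documents in which each term occurs.
df = Counter(term for terms in doc_terms for term in terms)

# max_df=0.5 drops terms found in more than 50% of the documents.
vocab = sorted(t for t, count in df.items() if count / n_docs <= 0.5)

# Smoothed IDF: ln((1 + n_docs) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

print("robot" in vocab)      # False: "robot" occurs in 2 of 3 documents
print("robot arm" in vocab)  # True: the bigram occurs in only 1 document
```

In the real vectorizer, each IDF weight is then multiplied by the term's count in the document, and every row is L2-normalized by default.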
Perform LSA by fitting a truncated SVD to the TF-IDF matrix:
lsa = TruncatedSVD(n_components=100, algorithm='randomized', n_iter=10, random_state=0)
lsa.fit_transform(X)
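Each row of `lsa.components_` is a right singular vector of the TF-IDF matrix, that is, one weight per term for a given concept. As a rough illustration of what the first concept represents, the sketch below approximates the leading right singular vector with plain power iteration. This is a simpler substitute for the randomized solver that TruncatedSVD uses, and the matrix `X_toy` and the helper `first_component` are hypothetical.

```python
import math

def first_component(X, n_iter=100):
    """Approximate the leading right singular vector of X by power iteration."""
    n_docs, n_features = len(X), len(X[0])
    v = [1.0] * n_features  # deterministic starting vector
    for _ in range(n_iter):
        # w = X^T (X v): one multiplication by X^T X without forming it.
        xv = [sum(row[j] * v[j] for j in range(n_features)) for row in X]
        w = [sum(X[i][j] * xv[i] for i in range(n_docs)) for j in range(n_features)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]  # renormalize to unit length
    return v

# Hypothetical weight matrix: 3 documents (rows) x 4 terms (columns).
X_toy = [[1.0, 1.0, 0.0, 0.0],
         [1.0, 0.8, 0.0, 0.1],
         [0.0, 0.0, 1.0, 0.9]]
concept = first_component(X_toy)
print([round(c, 3) for c in concept])
```

The first two documents share the first two terms, so those terms receive the largest weights in the resulting vector. A library solver may return the same vector with the opposite sign.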
With pandas, build a DataFrame for each of the first three concepts that pairs every term with its weight in that concept, then sort it by weight:
import pandas as pd
# Feature names in the column order of X
# (on scikit-learn versions before 1.0, use get_feature_names() instead)
terms = vectorizer.get_feature_names_out()
dic1 = {"Terms": terms, "Components": lsa.components_[0]}
dic2 = {"Terms": terms, "Components": lsa.components_[1]}
dic3 = {"Terms": terms, "Components": lsa.components_[2]}
f1 = pd.DataFrame(dic1)
f2 = pd.DataFrame(dic2)
f3 = pd.DataFrame(dic3)
f1.sort_values(by=['Components'], ascending=False)
f2.sort_values(by=['Components'], ascending=False)
f3.sort_values(by=['Components'], ascending=False)
The output is as follows: