Preprocessing – similarity measured as a similar number of common words
As we have seen earlier, the bag-of-words approach is both fast and robust. It is not, however, without challenges. Let's dive directly into them.
Converting raw text into a bag of words
We do not have to write custom code for counting words and representing those counts as a vector. Scikit-learn's CountVectorizer class does the job efficiently and also has a very convenient interface:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)
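To see what the vectorizer actually produces, here is a minimal sketch with two made-up sentences (the toy corpus is our own; get_feature_names_out() requires scikit-learn 1.0 or later, with older versions offering get_feature_names() instead):

>>> content = ["How to format my hard disk", "Hard disk format problems"]
>>> X = vectorizer.fit_transform(content)
>>> vectorizer.get_feature_names_out()
array(['disk', 'format', 'hard', 'how', 'my', 'problems', 'to'],
      dtype=object)
>>> X.toarray()
array([[1, 1, 1, 1, 1, 0, 1],
       [1, 1, 1, 0, 0, 1, 0]])

Each row is the count vector of one sentence over the shared vocabulary, which is exactly the bag-of-words representation we are after.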
The min_df parameter determines how CountVectorizer treats rarely occurring words (minimum document frequency). If it is set to an integer, all words occurring in fewer documents than that number will be dropped. If it is set to a fraction, all words that occur in less than that fraction of the overall dataset will be dropped.
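To make both flavors concrete, here is a small sketch on a made-up three-document corpus. With min_df=2, a word must occur in at least two documents to survive; with min_df=0.3, it must occur in at least 30 percent of the documents (0.9 documents here, so every word passes):

>>> corpus = ["disk format", "disk failure", "disk problems"]
>>> CountVectorizer(min_df=2).fit(corpus).get_feature_names_out()
array(['disk'], dtype=object)
>>> CountVectorizer(min_df=0.3).fit(corpus).get_feature_names_out()
array(['disk', 'failure', 'format', 'problems'], dtype=object)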
The max_df parameter works in a similar manner for overly frequent words. If we print the instance, we can see what other parameters scikit provides together with their default values:
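Note that what print shows depends on the scikit-learn version: older releases echoed every parameter with its default value, whereas recent releases print only the parameters that differ from their defaults. A version-independent way to inspect them all is get_params(); the sketch below shows the behavior of recent versions, with max_df and ngram_range picked as just two examples of the available knobs:

>>> print(vectorizer)                # min_df=1 is the default, so nothing to show
CountVectorizer()
>>> params = vectorizer.get_params() # all parameters as a dict
>>> params['max_df'], params['min_df'], params['ngram_range']
(1.0, 1, (1, 1))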