You're reading from Machine Learning Algorithms: Popular algorithms for data science and machine learning

Product type Paperback

Published in Aug 2018

Publisher Packt

ISBN-13 9781789347999

Length 522 pages

Edition 2nd Edition

Languages

Python

Tools

Scikit-learn

Concepts

Data Science

Author (1):

Giuseppe Bonaccorso

Creating training and test sets

When a dataset is large enough, it's good practice to split it into training and test sets: the former is used to train the model and the latter to evaluate its performance. The following diagram shows a schematic representation of this process:

Training/test set split process schema

There are two main rules in performing such an operation:

Both datasets must reflect the original distribution
The original dataset must be randomly shuffled before the split phase in order to avoid correlation between consecutive elements

With scikit-learn, this can be achieved by using the train_test_split() function:

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1000)
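The call above assumes that X and Y already exist. As a minimal, self-contained sketch, the following snippet builds a synthetic dataset with make_classification() (an assumption made purely for illustration; the sample and feature counts are arbitrary and not taken from the text) and checks the sizes of the resulting subsets:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic dataset used only for illustration: 1000 samples, 10 features
X, Y = make_classification(n_samples=1000, n_features=10, random_state=1000)

# Same call as in the text: the data is shuffled and 25% goes to the test set
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=1000
)

print(X_train.shape, X_test.shape)  # (750, 10) (250, 10)
```

Fixing random_state makes the shuffle reproducible, so repeated runs yield the same partition.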

The test_size parameter (as well as train_size) allows you to specify the proportion of elements to put into the test/training set. In this case, the ratio is 75 percent for training and 25 percent for the...
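The first rule stated earlier (both sets should reflect the original distribution) can be supported, for classification labels, by the stratify argument of train_test_split(), which keeps the class proportions approximately equal in both subsets. The sketch below assumes an imbalanced synthetic dataset (the 90/10 class weights are arbitrary and chosen only for illustration); shuffling is already the default behavior (shuffle=True) and is written out here only for clarity:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: roughly 90% class 0 and 10% class 1
X, Y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=1000
)

# stratify=Y preserves the class proportions of Y in both subsets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, shuffle=True, stratify=Y, random_state=1000
)

# Fraction of class-1 samples in the full, training, and test sets
full_ratio = np.mean(Y == 1)
train_ratio = np.mean(Y_train == 1)
test_ratio = np.mean(Y_test == 1)

print(f"full: {full_ratio:.3f}, train: {train_ratio:.3f}, test: {test_ratio:.3f}")
```

Without stratify, a purely random split of a small or heavily imbalanced dataset can leave the minority class under-represented in one of the two subsets.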