Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Learning Data Mining with Python

You're reading from   Learning Data Mining with Python Use Python to manipulate data and build predictive models

Arrow left icon
Product type Paperback
Published in Apr 2017
Publisher Packt
ISBN-13 9781787126787
Length 358 pages
Edition 2nd Edition
Languages
Concepts
Arrow right icon
Author (1):
Arrow left icon
Robert Layton Robert Layton
Author Profile Icon Robert Layton
Robert Layton
Arrow right icon
View More author details
Toc

Table of Contents (20) Chapters Close

Title Page
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface
1. Getting Started with Data Mining FREE CHAPTER 2. Classifying with scikit-learn Estimators 3. Predicting Sports Winners with Decision Trees 4. Recommending Movies Using Affinity Analysis 5. Features and scikit-learn Transformers 6. Social Media Insight using Naive Bayes 7. Follow Recommendations Using Graph Mining 8. Beating CAPTCHAs with Neural Networks 9. Authorship Attribution 10. Clustering News Articles 11. Object Detection in Images using Deep Neural Networks 12. Working with Big Data 13. Next Steps...

Introducing data mining


Data mining provides a way for a computer to learn how to make decisions with data. This decision could be predicting tomorrow's weather, blocking a spam email from entering your inbox, detecting the language of a website, or finding a new romance on a dating site. There are many different applications of data mining, with new applications being discovered all the time.

Note

Data mining is part algorithm design, statistics, engineering, optimization, and computer science. However, combined with these base skills in the area, we also need to apply domain knowledge (expert knowledge)of the area we are applying the data mining. Domain knowledge is critical for going from good results to great results. Applying data mining effectively usually requires this domain-specific knowledge to be integrated with the algorithms.

Most data mining applications work with the same high-level view, where a model learns from some data and is applied to other data, although the details often change quite considerably.

Data mining applications involve creating data sets and tuning the algorithm as explained in the following steps

  1. We start our data mining process by creating a dataset, describing an aspect of the real world. Datasets comprise of the following two aspects:
  • Samples: These are objects in the real world, such as a book, photograph, animal, person, or any other object. Samples are also referred to as observations, records or rows, among other naming conventions.
  • Features: These are descriptions or measurements of the samples in our dataset. Features could be the length, frequency of a specific word, the number of legs on an animal, date it was created, and so on. Features are also referred to as variables, columns, attributes or covariant, again among other naming conventions.
    1. The next step is tuning the data mining algorithm. Each data mining algorithm has parameters, either within the algorithm or supplied by the user. This tuning allows the algorithm to learn how to make decisions about the data.

    As a simple example, we may wish the computer to be able to categorize people as short or tall. We start by collecting our dataset, which includes the heights of different people and whether they are considered short or tall:

    Person

    Height

    Short or tall?

    1

    155cm

    Short

    2

    165cm

    Short

    3

    175cm

    Tall

    4

    185cm

    Tall

    As explained above, the next step involves tuning the parameters of our algorithm. As a simple algorithm; if the height is more than x, the person is tall. Otherwise, they are short. Our training algorithms will then look at the data and decide on a good value for x. For the preceding data, a reasonable value for this threshold would be 170 cm. A person taller than 170 cm is considered tall by the algorithm. Anyone else is considered short by this measure. This then lets our algorithm classify new data, such as a person with height 167 cm, even though we may have never seen a person with those measurements before.

    In the preceding data, we had an obvious feature type. We wanted to know if people are short or tall, so we collected their heights. This feature engineering is a critical problem in data mining. In later chapters, we will discuss methods for choosing good features to collect in your dataset. Ultimately, this step often requires some expert domain knowledge or at least some trial and error.

    In this book, we will introduce data mining through Python. In some cases, we choose clarity of code and workflows, rather than the most optimized way to perform every task. This clarity sometimes involves skipping some details that can improve the algorithm's speed or effectiveness.

    You have been reading a chapter from
    Learning Data Mining with Python - Second Edition
    Published in: Apr 2017
    Publisher: Packt
    ISBN-13: 9781787126787
    Register for a free Packt account to unlock a world of extra content!
    A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
    Unlock this book and the full library FREE for 7 days
    Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
    Renews at $15.99/month. Cancel anytime
    Visually different images