You're reading from Learning Data Mining with Python: Harness the power of Python to analyze data and create insightful predictive models

Product type Paperback

Published in Jul 2015

Publisher Packt

ISBN-13 9781784396053

Length 344 pages

Edition 1st Edition

Languages

Python

Tools

IPython

Concepts

Data Mining

Author (1):

Robert Layton

Chapter 10 – Clustering News Articles

Evaluation

Evaluating clustering algorithms is a difficult problem. On the one hand, we can roughly tell what good clusters look like; on the other hand, if we really knew that, we could label some instances and use a supervised classifier instead. Much has been written on this topic. The following slideshow is a good introduction to the challenges:

http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf

In addition, a very comprehensive (although now a little dated) paper on this topic is here: http://web.itu.edu.tr/sgunduz/courses/verimaden/paper/validity_survey.pdf.

The scikit-learn package implements a number of the metrics described in those links; an overview is available here: http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation.

Using some of these metrics, you can start evaluating which parameter values produce better clusterings. With a grid search, we can find the parameters that maximize a metric, just as in classification.
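As a minimal sketch of that idea, the following loop tries several values of n_clusters and keeps the one with the highest silhouette coefficient, one of the metrics covered in the scikit-learn overview. The synthetic data from make_blobs, the helper name best_n_clusters, and the candidate range are assumptions made for illustration; in this chapter's setting, X would be the vectorized news articles, and another metric or parameter could be swapped in.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: synthetic 2D points in four well-separated groups stand in
# for the real document vectors.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                  random_state=14)


def best_n_clusters(X, candidates):
    """Hypothetical helper: score each candidate n_clusters by silhouette."""
    scores = {}
    for k in candidates:
        model = KMeans(n_clusters=k, n_init=10, random_state=14)
        labels = model.fit_predict(X)
        # Silhouette ranges from -1 (poor) to 1 (dense, well separated).
        scores[k] = silhouette_score(X, labels)
    best = max(scores, key=scores.get)
    return best, scores


best_k, scores = best_n_clusters(X, range(2, 8))
print("Best n_clusters:", best_k)
for k, score in sorted(scores.items()):
    print(k, round(score, 3))
```

The silhouette coefficient needs no ground-truth labels, which is why it suits this setting; metrics such as the adjusted mutual information score instead compare two labelings.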

Temporal analysis

The code we developed in this chapter can be rerun over many months. By adding some tags to each cluster, you can track which topics stay active over time, getting a longitudinal viewpoint of what is being discussed in the world news.

To compare the clusters, consider a metric such as the adjusted mutual information score, which is described in the scikit-learn documentation linked earlier. See how the clusters change after one month, two months, six months, and a year.
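One way to make such a comparison, sketched here under several assumptions, is to assign the same set of documents to clusters using both an older and a newer model, then compare the two labelings with adjusted_mutual_info_score. The synthetic data and the names X_old, X_new, km_old, and km_new are illustrative only. With real articles, both models would also need to share one fitted vectorizer so that the feature columns line up.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score

# Assumption: two synthetic "months" of data drawn around the same topics.
centers = [[0, 0], [10, 0], [0, 10]]
X_old, _ = make_blobs(n_samples=150, centers=centers, cluster_std=0.5,
                      random_state=1)
X_new, _ = make_blobs(n_samples=150, centers=centers, cluster_std=0.5,
                      random_state=2)

# One model per time period.
km_old = KMeans(n_clusters=3, n_init=10, random_state=14).fit(X_old)
km_new = KMeans(n_clusters=3, n_init=10, random_state=14).fit(X_new)

# Label the same documents with both models so the labelings are comparable.
labels_old = km_old.predict(X_new)
labels_new = km_new.predict(X_new)

# The score ignores how cluster IDs are numbered; 1.0 means the two
# labelings describe the same partition, values near 0 mean chance agreement.
ami = adjusted_mutual_info_score(labels_old, labels_new)
print("Adjusted mutual information:", round(ami, 3))
```

A score that drops from one period to the next suggests that the topics being discussed have shifted.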

Real-time clusterings

The k-means algorithm can be trained iteratively and updated over time, rather than being rerun as a discrete analysis at fixed intervals. Cluster movement can be tracked in a number of ways. For instance, you can track which words are popular in each cluster and how far the centroids move per day. Keep the API limits in mind: you probably only need to do one check every few hours to keep your algorithm up to date.
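The sketch below shows one possible way to do this, assuming scikit-learn's MiniBatchKMeans is used as the online variant of k-means. Its partial_fit method updates the centroids with each new batch, and the distance each centroid moved is recorded after every update. The fake_batch helper and the synthetic data are hypothetical stand-ins for freshly collected, vectorized articles.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(14)
# Assumption: three synthetic "topics" stand in for real article vectors.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])


def fake_batch(n=60):
    """Hypothetical stand-in for one batch of newly collected documents."""
    idx = rng.randint(0, len(centers), size=n)
    return centers[idx] + rng.normal(scale=0.5, size=(n, 2))


model = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=14)
movements = []   # one array of per-centroid distances per update
previous = None

for step in range(5):
    # Update the existing centroids with the new batch only.
    model.partial_fit(fake_batch())
    current = model.cluster_centers_.copy()
    if previous is not None:
        # Euclidean distance each centroid moved since the last update.
        movements.append(np.linalg.norm(current - previous, axis=1))
    previous = current

for step, moved in enumerate(movements, start=1):
    print("Update", step, "centroid movement:", np.round(moved, 3))
```

Large movements after an update indicate that the incoming articles differ from what the clusters previously represented.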