Chapter 4. Building a Spam Classification Pipeline
Two pillars of Google's Gmail service stand out: the Inbox folder, which receives benign or wanted email messages, and the Spam folder, which receives unsolicited junk email, or simply spam.
The emphasis of this chapter is on identifying spam and classifying it as such. It explores the following topics concerning spam detection:
- What techniques exist for separating spam from ham (legitimate email)?
- If spam filtering is one suitable technique, how can it be formalized as a supervised learning classification task?
- Why is a certain algorithm better than another for spam filtering, and in what respect?
- Where are the tangible benefits of effective spam filtering most felt?
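To make the second question concrete before the chapter develops its own pipeline, here is a minimal sketch of spam filtering framed as supervised binary classification: labeled messages go in, and a function from message text to a label comes out. It uses plain Scala and a small multinomial naive Bayes model with add-one smoothing. Both the algorithm and every name in it (SpamTaskSketch, LabeledMessage, tokenize, train) are assumptions made for illustration, not the chapter's Spark ML implementation.

```scala
// Illustrative sketch only: all names here are hypothetical, and naive Bayes
// is an assumed stand-in for whatever algorithm the chapter actually uses.
object SpamTaskSketch {
  sealed trait Label
  case object Spam extends Label
  case object Ham extends Label

  // One supervised training example: raw text plus its known label.
  final case class LabeledMessage(text: String, label: Label)

  // Feature extraction: lowercase word tokens (a bag-of-words view).
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Learn word counts per label and return a classifier function.
  def train(data: Seq[LabeledMessage]): String => Label = {
    val labels: Seq[Label] = Seq(Spam, Ham)
    val byLabel = data.groupBy(_.label)
    val vocab = data.flatMap(m => tokenize(m.text)).toSet

    // For each label, how often each word occurs in messages with that label.
    val wordCounts: Map[Label, Map[String, Int]] = labels.map { l =>
      val words = byLabel.getOrElse(l, Seq.empty).flatMap(m => tokenize(m.text))
      l -> words.groupBy(identity).map { case (w, ws) => w -> ws.size }
    }.toMap
    val totals: Map[Label, Int] =
      wordCounts.map { case (l, counts) => l -> counts.values.sum }

    // Log prior plus summed log likelihoods, both with add-one smoothing.
    // Words never seen in training are ignored.
    def logScore(l: Label, tokens: Seq[String]): Double = {
      val prior = math.log(
        (byLabel.getOrElse(l, Seq.empty).size + 1.0) / (data.size + labels.size))
      val likelihood = tokens.filter(vocab.contains).map { w =>
        math.log((wordCounts(l).getOrElse(w, 0) + 1.0) / (totals(l) + vocab.size))
      }.sum
      prior + likelihood
    }

    // The learned classifier: pick the label with the higher score (ties go to Ham).
    text => {
      val tokens = tokenize(text)
      if (logScore(Spam, tokens) > logScore(Ham, tokens)) Spam else Ham
    }
  }
}
```

Calling SpamTaskSketch.train on a handful of LabeledMessage values returns a String => Label function that can be applied to unseen text. The toy model only shows the shape of the task: labeled examples, a feature extraction step, and a learned decision rule. A real filter needs far more data and a proper evaluation on held-out messages.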
This chapter implements a spam filtering data analysis pipeline.
The overall learning objective of this chapter is to implement a spam classifier with Scala and machine learning (ML). Starting from the datasets we created for you, we will rely on the Spark ML library's machine learning APIs and its supporting...