References
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia (O'Reilly), provides a much more complete introduction to Spark than this chapter can. I thoroughly recommend it.
If you are interested in learning more about information theory, I recommend David MacKay's book Information Theory, Inference, and Learning Algorithms.
Information Retrieval, by Manning, Raghavan, and Schütze, describes how to analyze textual data (including lemmatization and stemming). An online
For more on the Ling-Spam dataset and how to analyze it, see http://www.aueb.gr/users/ion/docs/ir_memory_based_antispam_filtering.pdf.
This blog post delves into the Spark Web UI in more detail: https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html.
This blog post, by Sandy Ryza, is the first in a two-part series discussing Spark internals, and how to leverage them to improve performance: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache...