Introduction to Spark MLlib
MLlib is the machine learning library of the Apache Spark project. One reason to choose MLlib is that it is built on Spark, a fast and general engine for large-scale data processing, so its algorithms can run on distributed datasets. Extensive documentation on MLlib is available at http://spark.apache.org/docs/latest/ml-guide.html. Out of the box, MLlib provides machine learning algorithms such as the following:
- Classification: This assigns items to predefined categories. Gmail, for example, uses it to decide whether an email is spam or not.
- Clustering: This groups similar items together without predefined labels. Google uses this to group news articles into categories such as sports, politics, weather, and so on, based on the title and content.
- Collaborative Filtering: This is used by recommendation engines. YouTube and Amazon are classic examples, as they recommend items based on likes and ratings from users.
Since we are building a recommendation engine, we will use the Collaborative Filtering algorithm for our use case.
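To make the idea behind collaborative filtering concrete, here is a minimal sketch in plain Python. It is not MLlib's implementation: MLlib's collaborative filtering is based on alternating least squares (ALS), whereas this sketch uses a simpler user-based approach with cosine similarity, chosen only because it fits in a few lines. The ratings data, the user and item names, and the functions `cosine_similarity`, `predict`, and `recommend` are all hypothetical and exist only for illustration.

```python
from math import sqrt

# Hypothetical ratings (assumption, not from MLlib): user -> {item: rating, 1-5}
ratings = {
    "alice": {"A": 5.0, "B": 4.0, "C": 1.0},
    "bob":   {"A": 4.0, "B": 5.0, "D": 4.0},
    "carol": {"A": 1.0, "C": 5.0, "D": 2.0},
}


def cosine_similarity(u, v):
    """Cosine similarity between two users over the items both have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item.

    Returns None when no other user has rated the item.
    """
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += sim
    return num / den if den else None


def recommend(user, n=1):
    """Return up to n unseen items, ordered by predicted rating (highest first)."""
    seen = set(ratings[user])
    candidates = {i for r in ratings.values() for i in r} - seen
    scored = [(predict(user, i), i) for i in candidates]
    scored = [(s, i) for s, i in scored if s is not None]
    return [i for _, i in sorted(scored, reverse=True)[:n]]


if __name__ == "__main__":
    print(recommend("alice"))
    print(predict("alice", "D"))
```

With these sample ratings, alice has not rated item D, so it is her only candidate. Her predicted rating for D falls between bob's 4 and carol's 2, and lands closer to bob's because her ratings agree more with his. A production system on Spark would instead train a model on a distributed ratings dataset, but the principle is the same: use the ratings of many users to estimate the ratings a given user has not yet supplied.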
PredictionIO...