Summary
In this chapter, we saw how to combine several big data tools, such as Spark, H2O, and ADAM, to handle a large-scale genomics dataset. We applied the Spark-based K-means algorithm to genetic variant data from the 1000 Genomes Project, aiming to cluster genotypic variants at the population scale.
Then we applied an H2O-based deep learning (DL) algorithm and Spark-based Random Forest models to predict geographic ethnicity. Additionally, we learned how to install and configure H2O for DL, knowledge that will be used in later chapters. Finally, and importantly, we learned how to use H2O to compute variable importance in order to select the most important features in a training set.
In the next chapter, we will see how effectively the Latent Dirichlet Allocation (LDA) algorithm can find useful patterns in data. We will compare LDA with other topic modeling algorithms and examine its scalability. In addition, we will utilize Natural Language Processing (NLP) libraries such as...