Spark word count
Now that we have seen some of the functionality, let's explore further. We can use a script similar to the following to count the word occurrences in a file:
import pyspark

if 'sc' not in globals():
    sc = pyspark.SparkContext()

# load the file as an RDD of lines
text_file = sc.textFile("Spark File Words.ipynb")

# split each line into words, pair every word occurrence with a count of 1,
# and sum the counts for each word
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
# print each word and its count
for x in counts.collect():
    print(x)

We have the same preamble as before: we only create a SparkContext if one does not already exist, since a single process can have at most one active context. Then, we load the text file into memory.
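One nuance worth knowing: textFile and the transformations that follow are lazy, so at this point Spark only builds an execution plan; the file is actually read when an action runs. A minimal sketch of this behavior (the variable names here are our own):

rdd = sc.textFile("Spark File Words.ipynb")   # returns immediately; no I/O yet
lengths = rdd.map(lambda line: len(line))     # transformations are lazy too
print(lengths.take(5))                        # the action triggers the actual read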
Once the file is loaded, we split each line into words and use a lambda function to emit a count of one for each occurrence of a word. The code is truly creating a new record for every word occurrence, for example the record ('at', 1) each time the word at appears. The idea is that this processing could be split across multiple processors, where each processor generates these low-level records. We are not concerned with optimizing...
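To make those intermediate records concrete, here is a minimal sketch that runs the same three stages over a small in-memory sample (a hypothetical two-line text, using sc.parallelize instead of a file) and prints what each stage produces:

# a hypothetical two-line sample standing in for the file
lines = sc.parallelize(["spark counts words", "spark counts"])

# stage 1: one record per word occurrence
words = lines.flatMap(lambda line: line.split(" "))
print(words.collect())
# ['spark', 'counts', 'words', 'spark', 'counts']

# stage 2: each occurrence becomes a (word, 1) pair
pairs = words.map(lambda word: (word, 1))
print(pairs.collect())
# [('spark', 1), ('counts', 1), ('words', 1), ('spark', 1), ('counts', 1)]

# stage 3: pairs with the same word are summed; in a cluster this step
# can combine partial counts produced on different processors
totals = pairs.reduceByKey(lambda a, b: a + b)
print(totals.collect())
# for example [('spark', 2), ('counts', 2), ('words', 1)] -- order may vary

Because reduceByKey merges values pairwise with the supplied function, partial sums can be computed on each partition and then combined, which is what allows the count to be spread over many processors.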