Cleaning raw data with generator functions
One of the tasks that arise in exploratory data analysis is cleaning up raw source data. This is often done as a composite operation applying several scalar functions to each piece of input data to create a usable dataset.
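As a minimal sketch of this idea, the following generator function applies two scalar functions to each raw string in turn. The names strip_text(), to_float(), and clean() are hypothetical, chosen only for illustration, and the sample values are made up to show stray whitespace around numeric text.

```python
from typing import Iterable, Iterator


def strip_text(text: str) -> str:
    """Scalar function (hypothetical name): trim whitespace from one value."""
    return text.strip()


def to_float(text: str) -> float:
    """Scalar function (hypothetical name): convert one string to a float."""
    return float(text)


def clean(values: Iterable[str]) -> Iterator[float]:
    """Generator function: lazily apply both scalar functions to each value."""
    for value in values:
        yield to_float(strip_text(value))


# Illustrative input only: numeric text with stray whitespace.
raw = [" 10.0", "8.04 ", "\t13.0\n"]
cleaned = list(clean(raw))
```

Because clean() is a generator, no value is converted until something consumes the result; here list() forces the evaluation.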
Let's look at a simplified set of data that is commonly used to show techniques in exploratory data analysis. It's called Anscombe's quartet, and it comes from the article Graphs in Statistical Analysis by F. J. Anscombe, which appeared in The American Statistician in 1973. The following are the first few rows of a downloaded file with this dataset:
Anscombe's quartet
I            II           III          IV
x     y      x     y      x     y      x     y
10.0  8.04   10.0  9.14   10.0  7.46   8.0   6.58
8.0   6.95   8.0   8.14   8.0   6.77   8.0   5.76
13.0  7.58   13.0  8.74   13.0  12.74  8.0   7.71
Sadly, we can't trivially process this with the csv module. We have to do a little bit of parsing to extract the useful information from this file. Since the data is properly tab...