Cleaning web log data
We're going to show the importance of cleaning your data. I have some web log data from a little website that I own, and we are just going to try to find the most-viewed pages on that website. Sounds pretty simple, but as you'll see, it's actually quite challenging! If you want to follow along, TopPages.ipynb is the notebook that we're working from here. Let's start!
I have an access log taken from my actual website. It's a real HTTP access log from Apache and is included in your book materials. If you do want to play along here, make sure you update the path to point to wherever you saved the book materials:
logPath = "E:\\sundog-consult\\Packt\\DataScience\\access_log.txt"
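Before going further, it can help to confirm that the path really points at the log file. The following is a minimal sketch, not part of the original notebook: the helper name `count_lines` is hypothetical, and the UTF-8 decoding with replacement of bad bytes is an assumption about how you might want to read the file.

```python
from pathlib import Path

def count_lines(path):
    """Return the number of lines in the file, or None if the file is missing.

    Hypothetical helper for sanity-checking the log path; not from the book.
    """
    p = Path(path)
    if not p.is_file():
        return None
    # errors="replace" avoids a crash on stray non-UTF-8 bytes (an assumption
    # about the log's encoding, which the text does not specify).
    with p.open("r", encoding="utf-8", errors="replace") as f:
        return sum(1 for _ in f)
```

Calling `count_lines(logPath)` should give a line count if the path is right, and `None` if the file is not where you said it was.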
Applying a regular expression on the web log
So, I went and got the following little snippet of code from the Internet that will parse an Apache access log line into a bunch of named fields:
format_pat= re.compile( r"(?P<host>[\d\.]+)\s" r"(?P<identity...
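To see how named groups such as `host` turn one log line into fields, here is a minimal sketch using a deliberately simplified pattern. It is not the full pattern from the snippet above: every group name other than `host`, the name `simple_pat`, and the sample log line are assumptions made up for illustration.

```python
import re

# Simplified, illustrative pattern; NOT the complete pattern from the text.
# Adjacent string literals are concatenated into one regular expression.
simple_pat = re.compile(
    r"(?P<host>[\d\.]+)\s"        # client IP address
    r"(?P<ident>\S+)\s"           # identity field, usually "-"
    r"(?P<user>\S+)\s"            # authenticated user, usually "-"
    r"\[(?P<time>[^\]]+)\]\s"     # timestamp inside square brackets
    r'"(?P<request>[^"]*)"\s'     # quoted request line
    r"(?P<status>\d{3})\s"        # three-digit HTTP status code
    r"(?P<bytes>\S+)"             # response size, or "-"
)

# Made-up sample line in the usual Apache access log layout.
sample = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'

match = simple_pat.match(sample)
# groupdict() returns a dictionary keyed by the group names.
fields = match.groupdict() if match else {}
```

Here `fields["request"]` holds the quoted request string, which is where the requested page lives. Checking `match` before calling `groupdict()` matters because a line that does not fit the pattern gives `None` rather than an error.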