Getting the data
The data we will use for the first part of this chapter is a set of books from Project Gutenberg at www.gutenberg.org, which is a repository of public domain literary works. The books I used for these experiments come from a variety of authors:
- Booth Tarkington (22 titles)
- Charles Dickens (44 titles)
- Edith Nesbit (10 titles)
- Arthur Conan Doyle (51 titles)
- Mark Twain (29 titles)
- Sir Richard Francis Burton (11 titles)
- Emile Gaboriau (10 titles)
Overall, there are 177 documents from 7 authors, giving a significant amount of text to work with. A full list of the titles, along with download links and a script to automatically fetch them, is given in the code bundle in a file called getdata.py. If running the code results in significantly fewer books than listed above, the mirror may be down. For more mirror URLs to try in the script, see https://www.gutenberg.org/MIRRORS.ALL
To download these books, we use the requests library to fetch the files and save them into our data directory.
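As a rough illustration of that step, here is a minimal sketch of fetching a single book and saving it to disk. It is not the getdata.py script from the code bundle: the function name download_book, its parameters, and the 30-second timeout are assumptions made for this example. The only requests features it relies on are requests.get, raise_for_status, and the content attribute of the response.

```python
import os


def download_book(url, data_folder, filename, get=None):
    """Fetch one book and save it under data_folder; return the saved path.

    This is a hypothetical helper, not part of the code bundle. `get` is
    any callable that behaves like requests.get, which is the default.
    """
    if get is None:
        import requests  # third-party; only needed when really downloading
        get = requests.get
    os.makedirs(data_folder, exist_ok=True)
    path = os.path.join(data_folder, filename)
    if os.path.exists(path):
        return path  # already on disk, so skip the request
    response = get(url, timeout=30)  # timeout value is an arbitrary choice
    response.raise_for_status()  # raise on a 404 or a mirror that is down
    with open(path, "wb") as outf:
        outf.write(response.content)  # raw bytes, no decoding attempted
    return path
```

Calling this once per download link, with a distinct filename for each title, would fill the data directory. Skipping files that already exist means the loop can be rerun after switching to another mirror without fetching everything again.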
First, in a new...