The newsgroups data
The first project in this book is about the 20 newsgroups dataset found in scikit-learn. The data contains approximately 20,000 documents across 20 online newsgroups. A newsgroup is a place on the Internet where you can ask and answer questions about a certain topic. The data is already split into training and test sets, with the cutoff based on a specific posting date. The original data comes from http://qwone.com/~jason/20Newsgroups/. The 20 newsgroups are as follows:
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
misc.forsale
talk.politics.misc
talk.politics.guns
talk.politics.mideast
talk.religion.misc
alt.atheism
soc.religion.christian
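As a minimal sketch of how the predefined training and test splits might be inspected, the helper below loads one split with scikit-learn's fetch_20newsgroups and counts the documents per newsgroup. The function name summarize_split and its fetch parameter are hypothetical, introduced only for this illustration; the import is deferred so that the helper can also be tried with a stand-in loader.

```python
from collections import Counter


def summarize_split(subset, fetch=None):
    """Return (number of documents, documents per newsgroup) for one split.

    `subset` is "train" or "test". `fetch` is a hypothetical hook for this
    sketch; when omitted, scikit-learn's loader is used.
    """
    if fetch is None:
        # Imported lazily; the first real call downloads the data
        # and caches it locally.
        from sklearn.datasets import fetch_20newsgroups as fetch
    bunch = fetch(subset=subset)
    names = list(bunch.target_names)
    # bunch.target holds one integer label per document;
    # map each label back to its newsgroup name.
    counts = Counter(names[label] for label in bunch.target)
    return len(bunch.data), counts
```

With scikit-learn installed, calling summarize_split("train") and summarize_split("test") returns the size of each split together with a per-newsgroup count, which is a quick way to check how evenly the documents are spread across the 20 groups.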
All the documents in the dataset are in English, and you can deduce the topics from the newsgroup names.
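To illustrate how much the names alone reveal, the following sketch groups the 20 newsgroup names by their first dot-separated component. The helper group_by_hierarchy is a hypothetical name for this example, and the prefix is only a rough proxy for topic: groups with different prefixes can still cover similar subjects.

```python
from collections import defaultdict

# The 20 newsgroup names listed in this section.
NEWSGROUPS = [
    "comp.graphics", "comp.os.ms-windows.misc", "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware", "comp.windows.x", "rec.autos", "rec.motorcycles",
    "rec.sport.baseball", "rec.sport.hockey", "sci.crypt", "sci.electronics",
    "sci.med", "sci.space", "misc.forsale", "talk.politics.misc",
    "talk.politics.guns", "talk.politics.mideast", "talk.religion.misc",
    "alt.atheism", "soc.religion.christian",
]


def group_by_hierarchy(names):
    """Group newsgroup names by their top-level prefix (text before the first dot)."""
    groups = defaultdict(list)
    for name in names:
        top_level = name.split(".", 1)[0]
        groups[top_level].append(name)
    return dict(groups)


groups = group_by_hierarchy(NEWSGROUPS)
for top_level, members in groups.items():
    print(f"{top_level}: {len(members)} newsgroup(s)")
```

Grouping on the prefix is a deliberately simple choice: it needs no document text at all, so it only hints at which categories a classifier may find hard to tell apart.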
Some of the newsgroups are closely related or even overlapping, for instance, those five computer...