The newsgroups data
The first project in this book uses the 20 newsgroups dataset, which is available through scikit-learn. The data contains approximately 20,000 documents across 20 online newsgroups. A newsgroup is a place on the Internet where you can ask and answer questions about a certain topic. The data is already split into training and test sets, with the cutoff between them set at a certain date. The original data comes from http://qwone.com/~jason/20Newsgroups/. The 20 newsgroups are as follows:
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
misc.forsale
talk.politics.misc
talk.politics.guns
talk.politics.mideast
talk.religion.misc
alt.atheism
soc.religion.christian
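As a minimal sketch of how the predefined split can be reached from code, the helper below wraps scikit-learn's `fetch_20newsgroups` loader. The helper name `load_newsgroups` and the `VALID_SUBSETS` constant are illustrative assumptions rather than part of scikit-learn, and the snippet assumes scikit-learn is installed and that the data can be downloaded on first use.

```python
# Hypothetical helper for illustration; only fetch_20newsgroups is scikit-learn's.
VALID_SUBSETS = ("train", "test", "all")


def load_newsgroups(subset="train"):
    """Return one split of the 20 newsgroups data as a scikit-learn Bunch."""
    if subset not in VALID_SUBSETS:
        raise ValueError(f"subset must be one of {VALID_SUBSETS}, got {subset!r}")
    # Imported lazily so that defining this helper does not require scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    # The first call downloads the data and caches it locally.
    return fetch_20newsgroups(subset=subset)
```

Calling `load_newsgroups("train")` or `load_newsgroups("test")` returns an object whose `data` attribute holds the raw documents, whose `target` attribute holds the integer labels, and whose `target_names` attribute lists the newsgroup names.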
All the documents in the dataset are in English, and you can deduce the topics from the newsgroup names.
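To illustrate how much the names alone reveal, the following sketch groups the 20 newsgroups by the first component of their names. The `NEWSGROUPS` list simply copies the names listed above, and the function name `group_by_prefix` is an assumption made for this example.

```python
from collections import defaultdict

# The 20 newsgroup names, copied from the list above.
NEWSGROUPS = [
    "comp.graphics", "comp.os.ms-windows.misc", "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware", "comp.windows.x", "rec.autos", "rec.motorcycles",
    "rec.sport.baseball", "rec.sport.hockey", "sci.crypt", "sci.electronics",
    "sci.med", "sci.space", "misc.forsale", "talk.politics.misc",
    "talk.politics.guns", "talk.politics.mideast", "talk.religion.misc",
    "alt.atheism", "soc.religion.christian",
]


def group_by_prefix(names):
    """Group newsgroup names by the part before the first dot."""
    groups = defaultdict(list)
    for name in names:
        prefix = name.split(".", 1)[0]
        groups[prefix].append(name)
    return dict(groups)


groups = group_by_prefix(NEWSGROUPS)
for prefix, members in groups.items():
    print(f"{prefix}: {len(members)} newsgroup(s)")
```

The top-level prefix is only a rough guide to the subject area, but it is often enough to see which newsgroups are likely to share vocabulary.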
Some of the newsgroups are closely related or even overlapping, for instance, those five computer...