Similarity queries
Now that we can compare two documents, we can set up our algorithms to retrieve the documents most similar to an input query: simply index each document, compute the distance between the query and every document in the corpus, and return the documents with the lowest distance values, as these are the most similar. Luckily for us, however, Gensim has built-in structures for this document similarity task!
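To make the manual approach concrete, here is a minimal sketch of such a brute-force search. It assumes a tiny tokenized toy corpus and uses the Jaccard distance on token sets as the distance measure. Both are illustrative choices, and the helper names jaccard_distance and most_similar are hypothetical, not part of Gensim.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two token lists (0.0 means identical token sets)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def most_similar(query, documents, top_n=2):
    """Return (document id, distance) pairs for the top_n closest documents."""
    scored = [(doc_id, jaccard_distance(query, doc))
              for doc_id, doc in enumerate(documents)]
    scored.sort(key=lambda pair: pair[1])  # lowest distance first
    return scored[:top_n]


# Toy corpus (an assumption for illustration): each document is a list of tokens.
documents = [
    ["bank", "river", "water"],
    ["bank", "money", "loan"],
    ["river", "water", "fish"],
]

results = most_similar(["river", "fish"], documents)
print(results)
```

Every query here is compared against every document, so the cost grows linearly with the size of the corpus; this is the work that a prebuilt index is meant to take off our hands.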
We will use Gensim's similarities module to build a document similarity index.
from gensim import similarities
We previously mentioned creating an index; we can do this far faster with the similarities module. As the Gensim documentation for the Similarity class explains, the Similarity class splits the index into several smaller sub-indexes (shards), which are disk-based. If your entire index fits in memory (hundreds of thousands of documents for 1 GB of RAM), you can also use the MatrixSimilarity or SparseMatrixSimilarity...