Doc2vec Get Most Similar Documents

I am trying to build a document retrieval model that returns documents ordered by their relevance to a query or search string. For this I trained a doc2vec mode

Solution 1:

Use infer_vector to get a document vector for the new text. Calling it does not alter the underlying model.

Here is how you do it:

tokens = "a new sentence to match".split()

new_vector = model.infer_vector(tokens)
# Returns the top 10 (document tag, cosine similarity) pairs, most similar first.
# Note: in gensim 4.x, model.docvecs was renamed to model.dv.
sims = model.docvecs.most_similar([new_vector])
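
For intuition, the ranking that most_similar returns can be reproduced by hand: score every stored document vector by its cosine similarity to the inferred vector, then sort in descending order. The following is a minimal pure-Python sketch of that idea, not gensim's actual implementation. The function names cosine and rank_by_cosine, the tags, and the toy 3-dimensional vectors are all made up for illustration, and it assumes no vector is all zeros.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, doc_vecs, topn=10):
    """Return up to `topn` (tag, similarity) pairs, most similar first."""
    scored = [(tag, cosine(query_vec, vec)) for tag, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy data: document tags mapped to made-up vectors.
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}

# Stands in for the vector returned by model.infer_vector(tokens).
query_vec = [2.0, 0.0, 0.0]

ranking = rank_by_cosine(query_vec, doc_vecs, topn=2)
print(ranking)
```

Because cosine similarity ignores vector length, the query [2, 0, 0] matches doc_a exactly even though their magnitudes differ.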

Edit:

Here is an example showing that the underlying model does not change after infer_vector is called.

import numpy as np

words = "king queen man".split()

len_before = len(model.docvecs)  # number of docs (model.dv in gensim 4.x)

# Word vectors for king, queen, man. Copy them so the later comparison is
# against a snapshot rather than a view into the model's own array.
w_vec0 = model.wv[words[0]].copy()
w_vec1 = model.wv[words[1]].copy()
w_vec2 = model.wv[words[2]].copy()

new_vec = model.infer_vector(words)

len_after = len(model.docvecs)

print(np.array_equal(model.wv[words[0]], w_vec0))  # True
print(np.array_equal(model.wv[words[1]], w_vec1))  # True
print(np.array_equal(model.wv[words[2]], w_vec2))  # True
print(len_before == len_after)  # True
