Document Similarity Algorithms in 2020: A Beginner's Tutorial (Part 3)

Doc2vec

Word2vec was introduced in 2013 and impressed developers at the time. You may have heard of its famous demonstration:

king - man + woman ≈ queen
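To make the arithmetic concrete, here is a minimal sketch of the analogy using plain Python. The three-dimensional vectors are hand-picked, hypothetical values (real Word2vec embeddings are learned and have hundreds of dimensions), and the helper names `cosine` and `analogy` are introduced only for this illustration.

```python
import math

# Hypothetical 3-dimensional "embeddings" chosen by hand for illustration.
# The axes can be read loosely as (royalty, maleness, femaleness).
TOY_VECTORS = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.1],
    "apple":  [0.0, 0.1, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors=TOY_VECTORS):
    """Return the word closest to vec(a) - vec(b) + vec(c), skipping a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))
```

With these toy values the nearest remaining word is "queen". The query words themselves are excluded from the candidates, which is the usual convention when evaluating word analogies.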

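Word2vec is very good at capturing the meaning of individual words, but vectorizing a whole sentence with it takes a long time, let alone a whole document.
Instead, we will use Doc2vec, a similar embedding algorithm that vectorizes paragraphs rather than individual words. This blog post gives a gentle introduction: https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e
Unfortunately, there is no official pre-trained model for Doc2vec. We will use a model pre-trained by someone else. It was trained on English Wikipedia (the number of documents is not stated, but the model is about 1.5 GB): https://github.com/jhlau/doc2vec
The official Doc2vec documentation states that the input can be of any length. Once tokenized, the whole document is fed into the gensim library.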

    import string

    import nltk
    from gensim.models.doc2vec import Doc2Vec
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from sklearn.metrics.pairwise import cosine_similarity

    nltk.download('stopwords')
    nltk.download('wordnet')
    nltk.download('punkt')

    lemmatizer = WordNetLemmatizer()

    base_document = "This is an example sentence for the document to be compared"
    documents = ["This is the collection of documents to be compared against the base_document"]


    def preprocess(text):
        # Steps:
        # 1. Lowercase
        # 2. Tokenize
        # 3. Remove stop words
        # 4. Remove punctuation
        # 5. Remove tokens of length 1
        # 6. Lemmatize
        lowered = str.lower(text)
        stop_words = set(stopwords.words('english'))
        word_tokens = word_tokenize(lowered)

        words = []
        for w in word_tokens:
            if w not in stop_words:
                if w not in string.punctuation:
                    if len(w) > 1:
                        lemmatized = lemmatizer.lemmatize(w)
                        words.append(lemmatized)
        return words


    def process_doc2vec_similarity():
        # Both pre-trained models are available in jhlau's public repository.
        # URL: https://github.com/jhlau/doc2vec
        # filename = './models/apnews_dbow/doc2vec.bin'
        filename = './models/enwiki_dbow/doc2vec.bin'
        model = Doc2Vec.load(filename)

        tokens = preprocess(base_document)
        # Keep only words that appear in the pre-trained doc2vec vocabulary.
        # The enwiki_dbow model contains 669549 vocabulary entries.
        # Note: model.wv.vocab is the gensim 3.x API; gensim 4 exposes the
        # same lookup as model.wv.key_to_index.
        tokens = list(filter(lambda x: x in model.wv.vocab.keys(), tokens))
        base_vector = model.infer_vector(tokens)

        vectors = []
        for i, document in enumerate(documents):
            tokens = preprocess(document)
            tokens = list(filter(lambda x: x in model.wv.vocab.keys(), tokens))
            vector = model.infer_vector(tokens)
            vectors.append(vector)
            print("making vector at index:", i)

        scores = cosine_similarity([base_vector], vectors).flatten()

        # Find the document with the highest similarity score.
        highest_score = 0
        highest_score_index = 0
        for i, score in enumerate(scores):
            if highest_score < score:
                highest_score = score
                highest_score_index = i

        most_similar_document = documents[highest_score_index]
        print("Most similar document by Doc2vec with the score:", most_similar_document, highest_score)


    process_doc2vec_similarity()
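The preprocessing and vocabulary filtering above can be tried without downloading NLTK data or a model. The following is only a rough sketch using the standard library: the stop-word list and the vocabulary are tiny, hypothetical stand-ins, whitespace splitting replaces `word_tokenize`, and lemmatization is left out entirely. The names `toy_preprocess` and `keep_known` are not from the original code.

```python
import string

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
TOY_STOP_WORDS = {"this", "is", "an", "a", "the", "for", "to", "be", "of"}


def toy_preprocess(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, split on whitespace, then drop stop words, punctuation and 1-char tokens."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)  # trim punctuation around the word
        if len(token) > 1 and token not in stop_words:
            tokens.append(token)
    return tokens


def keep_known(tokens, vocabulary):
    """Keep only tokens that the (pre-trained) model has a vector for."""
    return [t for t in tokens if t in vocabulary]


tokens = toy_preprocess("This is an example sentence, for the document to be compared!")
print(tokens)

# Hypothetical vocabulary standing in for the pre-trained model's word list.
toy_vocabulary = {"example", "sentence", "document"}
print(keep_known(tokens, toy_vocabulary))
```

Filtering out-of-vocabulary words before calling `infer_vector` means the inferred vector is based only on words the model saw during training.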

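Universal Sentence Encoder (USE)

This is a popular algorithm that Google published more recently, in May 2018. Implementation details: https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder
We will use Google's latest official pre-trained model, Universal Sentence Encoder 4 (https://tfhub.dev/google/universal-sentence-encoder/4).
As the name suggests, it was built with sentences in mind. However, the official documentation places no limit on input size, so nothing stops us from using it for a document comparison task.
The whole document is passed to TensorFlow as is. No tokenization is performed.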

    from sklearn.metrics.pairwise import cosine_similarity
    import tensorflow as tf
    import tensorflow_hub as hub

    base_document = "This is an example sentence for the document to be compared"
    documents = ["This is the collection of documents to be compared against the base_document"]


    def process_use_similarity():
        filename = "./models/universal-sentence-encoder_4"
        model = hub.load(filename)

        base_embeddings = model([base_document])
        embeddings = model(documents)

        scores = cosine_similarity(base_embeddings, embeddings).flatten()

        # Find the document with the highest similarity score.
        highest_score = 0
        highest_score_index = 0
        for i, score in enumerate(scores):
            if highest_score < score:
                highest_score = score
                highest_score_index = i

        most_similar_document = documents[highest_score_index]
        print("Most similar document by USE with the score:", most_similar_document, highest_score)


    process_use_similarity()
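The scoring step is the same for every model in this tutorial: compute the cosine similarity between the base vector and each candidate vector, then take the best one. As a minimal sketch of what that step does, here it is in plain Python. The three-dimensional vectors are hypothetical stand-ins for real embeddings (USE produces 512-dimensional vectors), and `most_similar` is a helper name invented for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(base_vector, vectors):
    """Return (index, score) of the vector with the highest cosine similarity to base_vector."""
    scores = [cosine(base_vector, v) for v in vectors]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


# Hypothetical toy embeddings for illustration only.
base = [1.0, 2.0, 0.0]
candidates = [
    [0.0, 0.0, 3.0],  # orthogonal to base
    [2.0, 4.0, 0.0],  # same direction as base, different length
    [1.0, 0.0, 1.0],
]
index, score = most_similar(base, candidates)
print(index, round(score, 3))
```

Cosine similarity ignores vector length, so the second candidate is the best match even though it is twice as long as the base vector. Unlike the loop above, which starts from a score of 0, taking the maximum directly also behaves sensibly when every score is negative.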

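BERT

This is the heavyweight. Google open-sourced the BERT algorithm in November 2018. The following year, Google's vice president of Search published a blog post calling BERT their biggest leap forward in the past five years.
It was built specifically to understand search queries. When it comes to understanding the context of a sentence, BERT appears to outperform all the other techniques mentioned here.
The original BERT tasks were not designed to handle large amounts of text input. To embed multiple sentences, we will use the open-source sentence-transformers project published by UKPLab (from a German university) (https://github.com/UKPLab/sentence-transformers), which computes faster. They also provide a pre-trained model comparable to the original model (https://github.com/UKPLab/sentence-transformers#performance).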