A Summary of Annoy and Its Application in Natural Language Processing


    Distance Metrics

    Euclidean distance

    Taxicab geometry (Manhattan distance)

    Cosine similarity

    Hamming distance: in information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. In other words, it measures the minimum number of substitutions required to change one string into the other, or equivalently the minimum number of errors that could have transformed one string into the other. For example: "karolin" and "kathrin" differ in 3 positions; "karolin" and "kerstin" in 3; "kathrin" and "kerstin" in 4; 1011101 and 1001001 in 2; 2173896 and 2233796 in 3.

    Dot product
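
    As a quick illustration (an addition, not part of the original post), the sketch below computes these measures for two small NumPy vectors and reproduces the Hamming-distance examples above; the vectors a and b are arbitrary:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 0.0, 4.0])

    euclidean = np.linalg.norm(a - b)            # sqrt(sum((a_i - b_i)^2))
    manhattan = np.sum(np.abs(a - b))            # sum(|a_i - b_i|)
    cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
    dot_product = a.dot(b)

    # Hamming distance between two equal-length strings:
    # count the positions where the symbols differ
    def hamming(s1, s2):
        assert len(s1) == len(s2)
        return sum(c1 != c2 for c1, c2 in zip(s1, s2))

    print(euclidean, manhattan, cosine, dot_product)
    print(hamming("karolin", "kathrin"))   # 3
    print(hamming("1011101", "1001001"))   # 2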

    Annoy API Reference

    AnnoyIndex(f, metric) creates an index object for reading, writing, and saving vectors; f is the vector dimension and metric is the distance measure: "angular", "euclidean", "manhattan", "hamming", or "dot".
    a.add_item(i, v) adds vector v to the index; i (any non-negative integer) is the identifier for vector v.
    a.build(n_trees, n_jobs=-1) builds a forest of n_trees trees. More trees give higher precision at query time. After calling build, no more items can be added. n_jobs specifies the number of threads used to build the trees; n_jobs=-1 uses all available CPU cores.
    a.save(fn, prefault=False) saves the index to disk and loads it (see the next function). After saving, no more items can be added.
    a.load(fn, prefault=False) loads (mmaps) an index from disk. If prefault is set to True, it pre-reads the entire file into memory (using mmap with MAP_POPULATE). The default is False.
    a.unload() unloads the index.
    a.get_nns_by_item(i, n, search_k=-1, include_distances=False) returns the n nearest neighbors of item i. During the query it inspects up to search_k nodes (defaulting to n_trees * n if not provided). search_k gives you a run-time trade-off between accuracy and speed. If include_distances is set to True, it returns a 2-element tuple of two lists: the second one contains the corresponding distances.
    a.get_nns_by_vector(v, n, search_k=-1, include_distances=False) is the same as the previous function, but queries by vector v.
    a.get_item_vector(i) returns the vector of the previously added item i.
    a.get_distance(i, j) returns the distance between items i and j.
    a.get_n_items() returns the number of items in the index.
    a.get_n_trees() returns the number of trees in the index.
    a.on_disk_build(fn) builds the index in the given file instead of in RAM (call it before adding items; no need to save after building).
    a.set_seed(seed) initializes the random number generator with the given seed. It only affects tree building, so it only needs to be called before adding items; it has no effect after calling a.build(n_trees) or a.load(fn).
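
    A minimal sketch (not from the original post) of the query-time options described above; the random data, the dimension f = 3, and the euclidean metric are assumptions rather than anything the post prescribes:

    from annoy import AnnoyIndex
    import random

    f = 3
    idx = AnnoyIndex(f, 'euclidean')
    for i in range(100):
        idx.add_item(i, [random.gauss(0, 1) for _ in range(f)])
    idx.build(10)

    # ask for 5 neighbors of item 0; a larger search_k inspects more nodes
    # (slower but more accurate); include_distances=True returns a tuple
    # (item ids, distances) instead of just the item ids
    items, distances = idx.get_nns_by_item(0, 5, search_k=1000, include_distances=True)
    print(items, distances)

    print(idx.get_distance(items[0], items[1]))   # distance between two stored items
    print(idx.get_n_items(), idx.get_n_trees())   # 100, 10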

    Official Example

    from annoy import AnnoyIndex
    import random

    f = 3
    t = AnnoyIndex(f, 'angular')  # create an index object for reading/writing and saving vectors
    for i in range(10):
        v = [random.gauss(0, 1) for z in range(f)]  # v is the vector for item i, with dimension 3
        print("index: {}".format(i))
        print("vector: {}".format(v))
        t.add_item(i, v)  # add vector v under item i

    t.build(10)  # 10 trees
    t.save('test.ann')

    # ...

    u = AnnoyIndex(f, 'angular')
    u.load('test.ann')
    print(u.get_nns_by_item(0, 2))  # return the 2 nearest neighbors of item 0

    Output:

    index: 0
    vector: [0.2681347993218612, -0.24441751037756473, -0.5967289646953606]
    index: 1
    vector: [-0.0607644149591005, -1.449121861964382, 2.4451388443493056]
    index: 2
    vector: [2.003848929132559, -0.2541425662139462, -0.497085614620346]
    index: 3
    vector: [0.5764345458809472, -1.714689709068186, 1.4701779426872346]
    index: 4
    vector: [-1.1456338128767158, 0.6860479014952721, -1.8517137042760479]
    index: 5
    vector: [-0.6706489069574595, -0.26963909331623775, 0.9244132638827116]
    index: 6
    vector: [-1.8022712228819096, 0.4153152891427675, 0.39534408712543206]
    index: 7
    vector: [-0.47381612602560474, -0.1329901736873041, 0.5639810171598271]
    index: 8
    vector: [-0.3766129897610793, 0.4341122247930549, 0.6912182958278317]
    index: 9
    vector: [0.23223634021211337, 0.5001789600926715, 1.237195532406417]
    [0, 2]
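
    As a variant of the example above (an addition, not in the original post), the same index could be built directly on disk with on_disk_build, which is useful when the index is too large to hold in RAM; the file name 'test_disk.ann' is arbitrary:

    from annoy import AnnoyIndex
    import random

    f = 3
    t = AnnoyIndex(f, 'angular')
    t.on_disk_build('test_disk.ann')   # must be called before adding any items
    for i in range(10):
        t.add_item(i, [random.gauss(0, 1) for _ in range(f)])
    t.build(10)                        # the index is written to 'test_disk.ann'; no save() needed

    u = AnnoyIndex(f, 'angular')
    u.load('test_disk.ann')
    print(u.get_nns_by_item(0, 2))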

    Application in Natural Language Processing: Word Analogies

    import torch
    import torch.nn as nn
    from tqdm import tqdm
    from annoy import AnnoyIndex
    import numpy as np


    class PreTrainedEmbeddings(object):
        """ A wrapper around pre-trained word vectors and their use """

        def __init__(self, word_to_index, word_vectors):
            """
            Args:
                word_to_index (dict): mapping from word to integers
                word_vectors (list of numpy arrays)
            """
            self.word_to_index = word_to_index
            self.word_vectors = word_vectors
            self.index_to_word = {v: k for k, v in self.word_to_index.items()}

            # create an index object for reading/writing and saving vectors
            self.index = AnnoyIndex(len(word_vectors[0]), metric='euclidean')
            print("Building Index!")
            for _, i in self.word_to_index.items():
                self.index.add_item(i, self.word_vectors[i])  # add the vector for item i
            self.index.build(50)  # build a forest of 50 trees
            print("Finished!")

        @classmethod
        def from_embeddings_file(cls, embedding_file):
            """Instantiate from a pre-trained vector file.

            The vector file should be of the format:
                word0 x0_0 x0_1 x0_2 x0_3 ... x0_N
                word1 x1_0 x1_1 x1_2 x1_3 ... x1_N

            Args:
                embedding_file (str): location of the file
            Returns:
                instance of PreTrainedEmbeddings
            """
            word_to_index = {}
            word_vectors = []

            with open(embedding_file, 'rb') as fp:
                for line in fp.readlines():
                    line = line.decode('utf8').split(" ")
                    word = line[0]
                    vec = np.array([float(x) for x in line[1:]])

                    word_to_index[word] = len(word_to_index)
                    word_vectors.append(vec)

            return cls(word_to_index, word_vectors)

        def get_embedding(self, word):
            """
            Args:
                word (str)
            Returns:
                an embedding (numpy.ndarray)
            """
            return self.word_vectors[self.word_to_index[word]]

        def get_closest_to_vector(self, vector, n=1):
            """Given a vector, return its n nearest neighbors

            Args:
                vector (np.ndarray): should match the size of the vectors in the Annoy index
                n (int): the number of neighbors to return
            Returns:
                [str, str, ...]: words that are nearest to the given vector.
                    The words are not ordered by distance
            """
            nn_indices = self.index.get_nns_by_vector(vector, n)
            return [self.index_to_word[neighbor] for neighbor in nn_indices]

        def compute_and_print_analogy(self, word1, word2, word3):
            """Prints the solutions to analogies using word embeddings

            Analogies are word1 is to word2 as word3 is to __
            This method will print: word1 : word2 :: word3 : word4

            Args:
                word1 (str)
                word2 (str)
                word3 (str)
            """
            vec1 = self.get_embedding(word1)
            vec2 = self.get_embedding(word2)
            vec3 = self.get_embedding(word3)

            # now compute the fourth word's embedding!
            spatial_relationship = vec2 - vec1
            vec4 = vec3 + spatial_relationship

            closest_words = self.get_closest_to_vector(vec4, n=4)
            existing_words = set([word1, word2, word3])
            closest_words = [word for word in closest_words
                             if word not in existing_words]

            if len(closest_words) == 0:
                print("Could not find nearest neighbors for the computed vector!")
                return

            for word4 in closest_words:
                print("{} : {} :: {} : {}".format(word1, word2, word3, word4))

    Load the pre-trained word vectors:

    embeddings = PreTrainedEmbeddings.from_embeddings_file('/data/.vector_cache/glove.6B.100d.txt')

    Output:

    Building Index!
    Finished!
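
    As a quick sanity check (an addition, not in the original post), the loaded embeddings can also be queried directly for the nearest neighbors of a single word; the word 'beautiful' here is just an arbitrary example:

    vec = embeddings.get_embedding('beautiful')
    print(embeddings.get_closest_to_vector(vec, n=5))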

    Word analogies:

    embeddings.compute_and_print_analogy('man', 'he', 'woman')

    man : he :: woman : she
    man : he :: woman : her

    embeddings.compute_and_print_analogy('fly', 'plane', 'sail')

    fly : plane :: sail : ship
    fly : plane :: sail : vessel

    embeddings.compute_and_print_analogy('cat', 'kitten', 'dog')

    cat : kitten :: dog : puppy
    cat : kitten :: dog : puppies
    cat : kitten :: dog : hound
    cat : kitten :: dog : mannequin