A Summary of Annoy and Its Application in Natural Language Processing


    Distance Metrics

    Euclidean distance

    Taxicab geometry (Manhattan distance)

    Cosine similarity

    Hamming distance: in information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. In other words, it measures the minimum number of substitutions required to change one string into the other, or equivalently the minimum number of errors that could have transformed one string into the other. For example: "karolin" and "kathrin" differ in 3 positions; "karolin" and "kerstin" in 3; "kathrin" and "kerstin" in 4; 1011101 and 1001001 in 2; 2173896 and 2233796 in 3.

    Dot product
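
    As a quick illustration (an addition, not part of the original post), the sketch below computes these measures for two small NumPy vectors and reproduces the Hamming-distance examples above; the vectors a and b are arbitrary:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 0.0, 4.0])

    euclidean = np.linalg.norm(a - b)            # sqrt(sum((a_i - b_i)^2))
    manhattan = np.sum(np.abs(a - b))            # sum(|a_i - b_i|)
    cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
    dot_product = a.dot(b)

    # Hamming distance between two equal-length strings:
    # count the positions where the symbols differ
    def hamming(s1, s2):
        assert len(s1) == len(s2)
        return sum(c1 != c2 for c1, c2 in zip(s1, s2))

    print(euclidean, manhattan, cosine, dot_product)
    print(hamming("karolin", "kathrin"))   # 3
    print(hamming("1011101", "1001001"))   # 2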

    Annoy API Reference

    AnnoyIndex(f, metric) creates an index object for reading, writing, and saving vectors; f is the vector dimension and metric is the distance measure: "angular", "euclidean", "manhattan", "hamming", or "dot".
    a.add_item(i, v) adds vector v to the index; i (any non-negative integer) is the identifier for vector v.
    a.build(n_trees, n_jobs=-1) builds a forest of n_trees trees. More trees give higher precision at query time. After calling build, no more items can be added. n_jobs specifies the number of threads used to build the trees; n_jobs=-1 uses all available CPU cores.
    a.save(fn, prefault=False) saves the index to disk and loads it (see the next function). After saving, no more items can be added.
    a.load(fn, prefault=False) loads (mmaps) an index from disk. If prefault is set to True, it pre-reads the entire file into memory (using mmap with MAP_POPULATE). The default is False.
    a.unload() unloads the index.
    a.get_nns_by_item(i, n, search_k=-1, include_distances=False) returns the n nearest neighbors of item i. During the query it inspects up to search_k nodes (defaulting to n_trees * n if not provided). search_k gives you a run-time trade-off between accuracy and speed. If include_distances is set to True, it returns a 2-element tuple of two lists: the second one contains the corresponding distances.
    a.get_nns_by_vector(v, n, search_k=-1, include_distances=False) is the same as the previous function, but queries by vector v.
    a.get_item_vector(i) returns the vector of the previously added item i.
    a.get_distance(i, j) returns the distance between items i and j.
    a.get_n_items() returns the number of items in the index.
    a.get_n_trees() returns the number of trees in the index.
    a.on_disk_build(fn) builds the index in the given file instead of in RAM (call it before adding items; no need to save after building).
    a.set_seed(seed) initializes the random number generator with the given seed. It only affects tree building, so it only needs to be called before adding items; it has no effect after calling a.build(n_trees) or a.load(fn).
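
    A minimal sketch (not from the original post) of the query-time options described above; the random data, the dimension f = 3, and the euclidean metric are assumptions rather than anything the post prescribes:

    from annoy import AnnoyIndex
    import random

    f = 3
    idx = AnnoyIndex(f, 'euclidean')
    for i in range(100):
        idx.add_item(i, [random.gauss(0, 1) for _ in range(f)])
    idx.build(10)

    # ask for 5 neighbors of item 0; a larger search_k inspects more nodes
    # (slower but more accurate); include_distances=True returns a tuple
    # (item ids, distances) instead of just the item ids
    items, distances = idx.get_nns_by_item(0, 5, search_k=1000, include_distances=True)
    print(items, distances)

    print(idx.get_distance(items[0], items[1]))   # distance between two stored items
    print(idx.get_n_items(), idx.get_n_trees())   # 100, 10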

    Official Example

    from annoy import AnnoyIndex
    import random

    f = 3
    t = AnnoyIndex(f, 'angular')  # create an index object for reading/writing and saving vectors
    for i in range(10):
        v = [random.gauss(0, 1) for z in range(f)]  # v is the vector for item i, with dimension 3
        print("index: {}".format(i))
        print("vector: {}".format(v))
        t.add_item(i, v)  # add vector v under item i

    t.build(10)  # 10 trees
    t.save('test.ann')

    # ...

    u = AnnoyIndex(f, 'angular')
    u.load('test.ann')
    print(u.get_nns_by_item(0, 2))  # return the 2 nearest neighbors of item 0

    Output:

    index: 0
    vector: [0.2681347993218612, -0.24441751037756473, -0.5967289646953606]
    index: 1
    vector: [-0.0607644149591005, -1.449121861964382, 2.4451388443493056]
    index: 2
    vector: [2.003848929132559, -0.2541425662139462, -0.497085614620346]
    index: 3
    vector: [0.5764345458809472, -1.714689709068186, 1.4701779426872346]
    index: 4
    vector: [-1.1456338128767158, 0.6860479014952721, -1.8517137042760479]
    index: 5
    vector: [-0.6706489069574595, -0.26963909331623775, 0.9244132638827116]
    index: 6
    vector: [-1.8022712228819096, 0.4153152891427675, 0.39534408712543206]
    index: 7
    vector: [-0.47381612602560474, -0.1329901736873041, 0.5639810171598271]
    index: 8
    vector: [-0.3766129897610793, 0.4341122247930549, 0.6912182958278317]
    index: 9
    vector: [0.23223634021211337, 0.5001789600926715, 1.237195532406417]
    [0, 2]
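
    As a variant of the example above (an addition, not in the original post), the same index could be built directly on disk with on_disk_build, which is useful when the index is too large to hold in RAM; the file name 'test_disk.ann' is arbitrary:

    from annoy import AnnoyIndex
    import random

    f = 3
    t = AnnoyIndex(f, 'angular')
    t.on_disk_build('test_disk.ann')   # must be called before adding any items
    for i in range(10):
        t.add_item(i, [random.gauss(0, 1) for _ in range(f)])
    t.build(10)                        # the index is written to 'test_disk.ann'; no save() needed

    u = AnnoyIndex(f, 'angular')
    u.load('test_disk.ann')
    print(u.get_nns_by_item(0, 2))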

    Application in Natural Language Processing: Word Analogies

    import torch
    import torch.nn as nn
    from tqdm import tqdm
    from annoy import AnnoyIndex
    import numpy as np


    class PreTrainedEmbeddings(object):
        """ A wrapper around pre-trained word vectors and their use """

        def __init__(self, word_to_index, word_vectors):
            """
            Args:
                word_to_index (dict): mapping from word to integers
                word_vectors (list of numpy arrays)
            """
            self.word_to_index = word_to_index
            self.word_vectors = word_vectors
            self.index_to_word = {v: k for k, v in self.word_to_index.items()}

            # create an index object for reading/writing and saving vectors
            self.index = AnnoyIndex(len(word_vectors[0]), metric='euclidean')
            print("Building Index!")
            for _, i in self.word_to_index.items():
                self.index.add_item(i, self.word_vectors[i])  # add the vector for item i
            self.index.build(50)  # build a forest of 50 trees
            print("Finished!")

        @classmethod
        def from_embeddings_file(cls, embedding_file):
            """Instantiate from a pre-trained vector file.

            The vector file should be of the format:
                word0 x0_0 x0_1 x0_2 x0_3 ... x0_N
                word1 x1_0 x1_1 x1_2 x1_3 ... x1_N

            Args:
                embedding_file (str): location of the file
            Returns:
                instance of PreTrainedEmbeddings
            """
            word_to_index = {}
            word_vectors = []

            with open(embedding_file, 'rb') as fp:
                for line in fp.readlines():
                    line = line.decode('utf8').split(" ")
                    word = line[0]
                    vec = np.array([float(x) for x in line[1:]])

                    word_to_index[word] = len(word_to_index)
                    word_vectors.append(vec)

            return cls(word_to_index, word_vectors)

        def get_embedding(self, word):
            """
            Args:
                word (str)
            Returns:
                an embedding (numpy.ndarray)
            """
            return self.word_vectors[self.word_to_index[word]]

        def get_closest_to_vector(self, vector, n=1):
            """Given a vector, return its n nearest neighbors

            Args:
                vector (np.ndarray): should match the size of the vectors in the Annoy index
                n (int): the number of neighbors to return
            Returns:
                [str, str, ...]: words that are nearest to the given vector.
                    The words are not ordered by distance
            """
            nn_indices = self.index.get_nns_by_vector(vector, n)
            return [self.index_to_word[neighbor] for neighbor in nn_indices]

        def compute_and_print_analogy(self, word1, word2, word3):
            """Prints the solutions to analogies using word embeddings

            Analogies are word1 is to word2 as word3 is to __
            This method will print: word1 : word2 :: word3 : word4

            Args:
                word1 (str)
                word2 (str)
                word3 (str)
            """
            vec1 = self.get_embedding(word1)
            vec2 = self.get_embedding(word2)
            vec3 = self.get_embedding(word3)

            # now compute the fourth word's embedding!
            spatial_relationship = vec2 - vec1
            vec4 = vec3 + spatial_relationship

            closest_words = self.get_closest_to_vector(vec4, n=4)
            existing_words = set([word1, word2, word3])
            closest_words = [word for word in closest_words
                             if word not in existing_words]

            if len(closest_words) == 0:
                print("Could not find nearest neighbors for the computed vector!")
                return

            for word4 in closest_words:
                print("{} : {} :: {} : {}".format(word1, word2, word3, word4))

    Load the pre-trained word vectors:

    embeddings = PreTrainedEmbeddings.from_embeddings_file('/data/.vector_cache/glove.6B.100d.txt')

    Output:

    Building Index!
    Finished!
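
    As a quick sanity check (an addition, not in the original post), the loaded embeddings can also be queried directly for the nearest neighbors of a single word; the word 'beautiful' here is just an arbitrary example:

    vec = embeddings.get_embedding('beautiful')
    print(embeddings.get_closest_to_vector(vec, n=5))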

    Word analogies:

    embeddings.compute_and_print_analogy('man', 'he', 'woman')

    man : he :: woman : she
    man : he :: woman : her

    embeddings.compute_and_print_analogy('fly', 'plane', 'sail')

    fly : plane :: sail : ship
    fly : plane :: sail : vessel

    embeddings.compute_and_print_analogy('cat', 'kitten', 'dog')

    cat : kitten :: dog : puppy
    cat : kitten :: dog : puppies
    cat : kitten :: dog : hound
    cat : kitten :: dog : mannequin