Distance formulas
Euclidean distance
Taxicab geometry (Manhattan distance)
Cosine similarity
Hamming distance: In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other. For example: "karolin" and "kathrin" is 3; "karolin" and "kerstin" is 3; "kathrin" and "kerstin" is 4; 1011101 and 1001001 is 2; 2173896 and 2233796 is 3.
Dot product
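These are the metrics Annoy can work with. As a quick illustration (a minimal numpy sketch with made-up vectors, not Annoy code), they can be computed as follows:

import numpy as np

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])

euclidean = np.linalg.norm(a - b)                             # Euclidean distance
manhattan = np.abs(a - b).sum()                               # Manhattan (taxicab) distance
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))      # cosine similarity
dot = a @ b                                                   # dot product

# Hamming distance between two equal-length strings
s1, s2 = "karolin", "kathrin"
hamming = sum(c1 != c2 for c1, c2 in zip(s1, s2))             # -> 3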
Annoy API description
AnnoyIndex(f, metric): creates an index object used to add, query, and save vectors. f is the vector dimension and metric is the distance metric: "angular", "euclidean", "manhattan", "hamming", or "dot".
a.add_item(i, v): adds vector v to the index; i (any non-negative integer) is the id assigned to vector v.
a.build(n_trees, n_jobs=-1): builds a forest of n_trees trees. More trees give higher precision when querying. After build has been called, no more items can be added. n_jobs specifies the number of threads used to build the trees; n_jobs=-1 uses all available CPU cores.
a.save(fn, prefault=False): saves the index to disk and loads it (see the next function). After saving, no more items can be added.
a.load(fn, prefault=False): loads (mmaps) an index from disk. If prefault is set to True, it will pre-read the entire file into memory (using mmap with MAP_POPULATE). Default is False.
a.unload(): unloads the index.
a.get_nns_by_item(i, n, search_k=-1, include_distances=False): returns the n nearest neighbors of item i. During the query it will inspect up to search_k nodes (which defaults to n_trees * n if not provided). search_k gives you a run-time tradeoff between accuracy and speed. If include_distances is set to True, it returns a 2-element tuple containing two lists: the second one holds all the corresponding distances.
a.get_nns_by_vector(v, n, search_k=-1, include_distances=False): same as the previous call, but queries by vector v.
a.get_item_vector(i): returns the vector of item i that was previously added.
a.get_distance(i, j): returns the distance between items i and j.
a.get_n_items(): returns the number of items in the index.
a.get_n_trees(): returns the number of trees in the index.
a.on_disk_build(fn): builds the index on the given file instead of in RAM (call it before adding items; no separate save is needed after building).
a.set_seed(seed): initializes the random number generator with the given seed. It is only used for building the trees, so it only needs to be passed before adding items. It has no effect after calling a.build(n_trees) or a.load(fn).
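As a small illustration of the query options (a hedged sketch that assumes an index t which has already been built, as in the official example below):

# t is assumed to be an AnnoyIndex that has already been built (see the example below)
ids, dists = t.get_nns_by_item(0, 5, search_k=10 * 5, include_distances=True)
print(ids)    # ids of the 5 nearest items
print(dists)  # the corresponding distances, in the same order

print(t.get_distance(0, 1))   # distance between items 0 and 1
print(t.get_item_vector(0))   # the stored vector of item 0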
Official example
from annoy import AnnoyIndex
import random

f = 3                              # vector dimension
t = AnnoyIndex(f, 'angular')       # index using the angular metric
for i in range(10):
    v = [random.gauss(0, 1) for z in range(f)]
    print("index:{}".format(i))
    print("vector:{}".format(v))
    t.add_item(i, v)
t.build(10)                        # build a forest of 10 trees
t.save('test.ann')

u = AnnoyIndex(f, 'angular')
u.load('test.ann')                 # load (mmap) the saved index
print(u.get_nns_by_item(0, 2))     # the 2 nearest neighbors of item 0
Output:
index:0
vector:[0.2681347993218612, -0.24441751037756473, -0.5967289646953606]
index:1
vector:[-0.0607644149591005, -1.449121861964382, 2.4451388443493056]
index:2
vector:[2.003848929132559, -0.2541425662139462, -0.497085614620346]
index:3
vector:[0.5764345458809472, -1.714689709068186, 1.4701779426872346]
index:4
vector:[-1.1456338128767158, 0.6860479014952721, -1.8517137042760479]
index:5
vector:[-0.6706489069574595, -0.26963909331623775, 0.9244132638827116]
index:6
vector:[-1.8022712228819096, 0.4153152891427675, 0.39534408712543206]
index:7
vector:[-0.47381612602560474, -0.1329901736873041, 0.5639810171598271]
index:8
vector:[-0.3766129897610793, 0.4341122247930549, 0.6912182958278317]
index:9
vector:[0.23223634021211337, 0.5001789600926715, 1.237195532406417]
[0, 2]
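The same index can also be built directly on disk with on_disk_build, which avoids holding everything in RAM. A minimal sketch under that assumption (the file name test_on_disk.ann is just an illustration):

from annoy import AnnoyIndex
import random

f = 3
t = AnnoyIndex(f, 'angular')
t.on_disk_build('test_on_disk.ann')   # build the index directly in this file
for i in range(10):
    t.add_item(i, [random.gauss(0, 1) for _ in range(f)])
t.build(10)                           # no separate save() is needed after an on-disk build

u = AnnoyIndex(f, 'angular')
u.load('test_on_disk.ann')
print(u.get_nns_by_item(0, 2))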
Application in natural language processing: word analogies
import torch
import torch.nn as nn
from tqdm import tqdm
from annoy import AnnoyIndex
import numpy as np


class PreTrainedEmbeddings(object):
    """ A wrapper around pre-trained word vectors and their use """
    def __init__(self, word_to_index, word_vectors):
        """
        Args:
            word_to_index (dict): mapping from word to integers
            word_vectors (list of numpy arrays)
        """
        self.word_to_index = word_to_index
        self.word_vectors = word_vectors
        self.index_to_word = {v: k for k, v in self.word_to_index.items()}

        self.index = AnnoyIndex(len(word_vectors[0]), metric='euclidean')
        print("Building Index!")
        for _, i in self.word_to_index.items():
            self.index.add_item(i, self.word_vectors[i])
        self.index.build(50)
        print("Finished!")

    @classmethod
    def from_embeddings_file(cls, embedding_file):
        """Instantiate from pre-trained vector file.

        Vector file should be of the format:
            word0 x0_0 x0_1 x0_2 x0_3 ... x0_N
            word1 x1_0 x1_1 x1_2 x1_3 ... x1_N

        Args:
            embedding_file (str): location of the file
        Returns:
            instance of PreTrainedEmbeddings
        """
        word_to_index = {}
        word_vectors = []
        with open(embedding_file, 'rb') as fp:
            for line in fp.readlines():
                line = line.decode('utf8').split(" ")
                word = line[0]
                vec = np.array([float(x) for x in line[1:]])

                word_to_index[word] = len(word_to_index)
                word_vectors.append(vec)
        return cls(word_to_index, word_vectors)

    def get_embedding(self, word):
        """
        Args:
            word (str)
        Returns
            an embedding (numpy.ndarray)
        """
        return self.word_vectors[self.word_to_index[word]]

    def get_closest_to_vector(self, vector, n=1):
        """Given a vector, return its n nearest neighbors

        Args:
            vector (np.ndarray): should match the size of the vectors
                in the Annoy index
            n (int): the number of neighbors to return
        Returns:
            [str, str, ...]: words that are nearest to the given vector.
                The words are not ordered by distance
        """
        nn_indices = self.index.get_nns_by_vector(vector, n)
        return [self.index_to_word[neighbor] for neighbor in nn_indices]

    def compute_and_print_analogy(self, word1, word2, word3):
        """Prints the solutions to analogies using word embeddings

        Analogies are word1 is to word2 as word3 is to __
        This method will print: word1 : word2 :: word3 : word4

        Args:
            word1 (str)
            word2 (str)
            word3 (str)
        """
        vec1 = self.get_embedding(word1)
        vec2 = self.get_embedding(word2)
        vec3 = self.get_embedding(word3)

        # analogy as a spatial relationship: vec4 = vec3 + (vec2 - vec1)
        spatial_relationship = vec2 - vec1
        vec4 = vec3 + spatial_relationship

        closest_words = self.get_closest_to_vector(vec4, n=4)
        existing_words = set([word1, word2, word3])
        closest_words = [word for word in closest_words
                         if word not in existing_words]

        if len(closest_words) == 0:
            print("Could not find nearest neighbors for the computed vector!")
            return

        for word4 in closest_words:
            print("{} : {} :: {} : {}".format(word1, word2, word3, word4))
Loading the pre-trained word vectors:
embeddings = PreTrainedEmbeddings.from_embeddings_file('/data/.vector_cache/glove.6B.100d.txt')
Output:
Building Index!
Finished!
Word analogies:
embeddings.compute_and_print_analogy('man', 'he', 'woman')
man : he :: woman : she
man : he :: woman : her
embeddings.compute_and_print_analogy('fly', 'plane', 'sail')
fly : plane :: sail : ship
fly : plane :: sail : vessel
embeddings.compute_and_print_analogy('cat', 'kitten', 'dog')
cat : kitten :: dog : puppy
cat : kitten :: dog : puppies
cat : kitten :: dog : hound
cat : kitten :: dog : mannequin
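The same wrapper can also be used for plain nearest-neighbor lookups rather than analogies. A small illustrative query (the returned words depend on the embedding file, and the query word itself may appear among its own neighbors):

vec = embeddings.get_embedding('cat')
print(embeddings.get_closest_to_vector(vec, n=5))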