How to deal with multi-word phrases (or n-grams) while building a custom embedding
If you want to model the unique meaning of commonly occurring n-grams, often called collocations, the best solution is to train a new embedding space to model those specific semantics.
An alternate approach is "aggregating", i.e. embedding each word separately and then taking the average of those vectors as the embedding for the combination. But this only captures part of the collocation's meaning. In simpler terms, when two word vectors are added together, the sum is often not a faithful representation of the phrase formed by joining the two words.
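To make the limitation concrete, here is a minimal sketch of the aggregating approach. It assumes a trained gensim Word2Vec model named model (for instance the one trained later in this post); average_embedding is just an illustrative helper, not part of gensim:

import numpy as np

def average_embedding(model, phrase):
    # "Aggregating": embed each word separately and average the vectors.
    # The result is a valid vector, but it usually misses the collocation's
    # own meaning (compare the average of "new" and "york" with "new_york").
    words = [w for w in phrase.lower().split() if w in model.wv]
    return np.mean([model.wv[w] for w in words], axis=0) if words else None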
Another alternate approach is fine-tuning/transfer learning. It is used when you do not have a lot of data. Fine-tuning/transfer learning takes an existing model architecture and weights, then does additional training with more data. An example of fine-tuning for word2vec in Keras can be found here. In this case, Google’s Wikipedia model is taken and trained with custom collocations.
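The linked Keras example is not reproduced here, but the same idea can be sketched directly in gensim by continuing training on new sentences. This is only a rough sketch assuming a reasonably recent gensim (3.x or later): 'pretrained.model' and new_collocation_sentences are placeholder names, and a full Word2Vec model (not just exported KeyedVectors) is needed to keep training:

import gensim

# Continue training an existing Word2Vec model on additional data.
model = gensim.models.Word2Vec.load('pretrained.model')
model.build_vocab(new_collocation_sentences, update=True)   # extend the vocabulary
model.train(new_collocation_sentences,                      # must be a list of tokenized sentences
            total_examples=len(new_collocation_sentences),
            epochs=model.epochs)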
Okay, let's get into it then. First things first, import your libraries.
import gensim
from nltk import ngrams
from nltk.corpus import stopwords
from collections import Counter

# The stopwords (and, later, brown) corpora may need a one-time nltk.download().
stoplist = stopwords.words('english')

Now let's get a sample dataset. I have used the 'brown' data from the nltk corpus.
from nltk.corpus import brown

words = brown.words()
sents = brown.sents()
print("Number of sentences: ", len(sents))
print("Number of words: ", len(words))
brown.sents()[2:3]

[out]: Number of sentences: 57340
Number of words: 1161192
[['The', 'September-October', 'term', 'court', 'had', 'been', 'presided', 'by', 'Fulton', 'Superior', 'Court', 'Judge', 'Durwood', 'Pye']]
So the corpus is fairly large, with 57340 sentences and 1161192 words. The sentences are already tokenized, which is great. (Remember, the gensim word2vec model takes a list of tokenized sentences, i.e. a list of lists, as input.) Now let's extract all the n-grams from the corpus:
def get_ngrams(words):
    words = [word.lower() for word in words if word not in stoplist and len(word) > 2]
    bigram = ["_".join(phrases) for phrases in list(ngrams(words, 2))]
    trigram = ["_".join(phrases) for phrases in list(ngrams(words, 3))]
    fourgram = ["_".join(phrases) for phrases in list(ngrams(words, 4))]
    return bigram, trigram, fourgram

bigram, trigram, fourgram = get_ngrams(words)
print("Top 3 bigrams: ", Counter(bigram).most_common()[:3])
print("Top 3 trigrams: ", Counter(trigram).most_common()[:3])
print("Top 3 fourgrams: ", Counter(fourgram).most_common()[:3])

[out]: Top 3 bigrams: [('united_states', 392), ('new_york', 296), ('per_cent', 146)]
Top 3 trigrams: [('united_states_america', 29), ('new_york_city', 27), ('government_united_states', 25)]
Top 3 fourgrams: [('government_united_states_america', 17), ('john_notte_jr._governor', 15), ('average_per_capita_income', 10)]
They look beautiful, don't they? Note that the stopwords have been removed. Also keep in mind that embeddings are created for single tokens, which is why the words in each n-gram are joined with the "_" character.
Let's generate the training data for our custom embeddings now:
training_data = []
for sentence in sents:
    l1 = [word.lower() for word in sentence if word not in stoplist and len(word) > 2]
    l2, l3, l4 = get_ngrams(sentence)
    training_data.append(l1 + l2 + l3 + l4)

training_data[2:3]

[out]: [['the', 'september-october', 'term', ... 'september-october_term', 'term_jury', 'jury_charged', 'charged_fulton', 'fulton_superior', 'superior_court', ... 'the_september-october_term', 'september-october_term_jury', 'term_jury_charged', 'jury_charged_fulton', ...]]
Alright, so the data looks ready to be used to train our custom embedding.
model = gensim.models.Word2Vec(training_data)
model.save('custom.embedding')
model = gensim.models.Word2Vec.load('custom.embedding')

And your custom embeddings with n-grams are ready.
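If you want explicit control over training, the constructor exposes the usual word2vec hyperparameters. A minimal sketch with the defaults spelled out (parameter names assume gensim 4.x; in gensim 3.x and earlier, vector_size is called size):

model = gensim.models.Word2Vec(
    training_data,
    vector_size=100,   # dimensionality of the embeddings (default is 100)
    window=5,          # context window on each side of the target word
    min_count=5,       # ignore tokens that appear fewer than 5 times
    workers=4,         # number of training threads
)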
Each word is represented in a 100-dimensional space (gensim's default vector size):
len(model['india'])

[out]: 100
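A compatibility note: gensim 4.x removed direct indexing on the model object, so on newer versions the same lookup goes through the model's KeyedVectors:

len(model.wv['india'])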
Note: Always save the embedding so you don't have to train it every time you rerun your notebook.
Let's look at some supporting functions already implemented in Gensim to manipulate word embeddings. For example, to compute the cosine similarity between 2 words:
model.similarity('university', 'school') > 0.3

[out]: True
Finding the top n words that are similar to a target word:
model.most_similar(positive=['india'], topn=5)

[out]: [('britain', 0.9997713565826416), ('the_government', 0.9996576309204102), ('court', 0.9996564388275146), ('government_india', 0.9996494650840759), ('secretary_state', 0.9995989799499512)]
You can pass multiple positive and negative words as well; the positive vectors are added and the negative vectors subtracted, which enables analogy-style queries:
model.most_similar(positive=['india', 'britain'], negative=['paris'], topn=5)

[out]: [('america', 0.999089241027832), ('government_united_states', 0.9990255832672119), ('united_states_america', 0.9989479780197144), ('pro-western', 0.9975387454032898), ('foreign_countries', 0.9969671964645386)]
So that’s all for this article, folks. Thank you for reading. Cheers!
Translated from: https://medium.com/@suyashkhare619/how-to-deal-with-multi-word-phrases-or-n-grams-while-building-a-custom-embedding-eec547d1ab45