NLP's Rise of the Transformers
A quick discussion on the recent progression of NLP, the fall of LSTM, and an introduction to (and tutorial for) BERT from Google
You'll get the most out of this article if you already have:

- Some familiarity with neural networks/neural network architecture (how data moves through the system, activations, etc.)
- A basic understanding of NLP tasks
- Experience programming in Python (+ Tensorflow, for the BERT tutorial)

I'm attempting to cover a lot of ground here, so I've linked lots of relevant resources and documentation after each section and throughout the article that I recommend exploring. The goal is to provide you with a basic understanding of the topics, and resources for you to explore and experiment on your own.
We're going to take a deep look at the transformer model, a relatively new and incredibly effective NLP solution. We're going to talk about the recent progression of NLP techniques, and why some of these solutions aren't optimal. We'll learn how transformers differ from these earlier approaches (such as LSTM), and how information is learned within the network. Then we're going to discuss how modern algorithms understand "what is language? what is context?", and move on to BERT, a modern transformer architecture by Google that can be used in a huge range of tasks from document classification and textual entailment, to sentiment analysis and question answering. Finally, we'll solidify our understanding and wrap up by answering the question "what makes transformers so great?".
For a little background and comparison, we’re going to start by talking a little bit about the supervised-learning subset of NLP where we take in a document of words as input and attempt to predict some output (such as whether or not this document is spam). For us to do this, we have to be able to take in our document and represent it as a fixed-size vector. We say fixed-size because due to the laws of linear algebra at play inside our system, we are only able to compare vectors of the same shape.
Our first challenge here is that we need a method of meaningfully encoding these differently shaped documents into a vector with a fixed length.
Along comes the most common method for dealing with this vector-shape mismatch: the bag-of-words approach. In a bag of words, we create one vector dimension per word in our vocabulary… the whole vocabulary we wish to consider. So for example, if we're working within the English language, which has roughly 100,000 words, we'd have a roughly 100,000-dimensional vector. We then populate this vector with the frequency count of each word as it appears in the document. You can imagine that this naturally leads to very sparse data. This sparse data is still stored efficiently, ignoring 0's and simply recording their positions. Data stored might look like this:
If values exist…

list(tuple(position: int, value: float))

If values are 0…

list(position: int)
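To make this concrete, here's a minimal sketch of that sparse bag-of-words encoding (the three-word vocabulary is hypothetical; a real English vocabulary would have ~100,000 entries):

from collections import Counter

# hypothetical toy vocabulary mapping each word to its vector position
vocab = {"work": 0, "to": 1, "live": 2}

def bow_sparse(document):
    # count word frequencies, then store only the non-zero positions
    counts = Counter(w for w in document.lower().split() if w in vocab)
    return sorted((vocab[w], float(c)) for w, c in counts.items())

bow_sparse("work to live")  # [(0, 1.0), (1, 1.0), (2, 1.0)]
bow_sparse("live to work")  # [(0, 1.0), (1, 1.0), (2, 1.0)] -- identical vectors!

Notice that the two example documents encode identically; that order-blindness is exactly the limitation discussed next.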
Although this method is fairly efficient, and may actually be adequate for our spam/no-spam problem, there is still one massive limitation. The order of words within documents typically matters; it's how we create context and understand lengthy documents or complex ideas. However, in a bag-of-words algorithm, no attention is paid to order. For example, take the two sentences "work to live" and "live to work". These two documents could be completely different content-wise (we know that working to live and living to work are two completely different concepts), but they have the same word counts and therefore would be encoded as identical vectors. Now, we could try to help our algorithm better understand context through the application of n-grams, or comparing n-length groupings of words, but this would drive our vector dimensionality into the trillions or higher, and that can cause major problems.
BoW resources:
- A Gentle Introduction to Bag of Words Models
- BoW and TF-IDF
The solution researchers arrived at for solving this vector-length encoding problem was to implement an RNN (a recurrent neural network). RNNs help us answer the question "how do we solve a function on a variable-length set of inputs?". Remember that in a bag of words, our vector length is still essentially the size of the English vocabulary. We want to be able to effectively compare two documents of different lengths without having to create these 100k-dimensional vectors. RNN's recursive nature allows us to analyze documents bit by bit (single words) instead of all at once (100k-d vector), meaning we can now consume these differently shaped documents. It does this iteratively using the mathematical equivalent of a for-loop that recursively defines the output at each stage as a function of the inputs at the previous stages and the previous output. Our final output is our final hidden state after the loop has finished. That sounds pretty confusing, so let's visualize this process:
On the right, we have roughly the architecture we would see if we were to code an RNN. We have our input (x_t), our recursive activation cell (A), and our final output (h_t). On the left, we have the "unrolled" version showing the different stages of recursion. We see the first vector feature, x_0, pass through our activation and generate an output. Our next layer then introduces x_1; we perform the activation again on x_1 as well as our previous output, and then pass that value along to be used in the next loop. We can imagine the result is a very deep network, where each value in the vector creates a new layer of activation.
You can imagine already that this recursive nature could lead to problems with values getting too large or too small (exploding or vanishing gradients), and this is what makes RNNs so difficult to use except on short sequences.
If the RNN process is still a little confusing, that’s okay. For now just know that we have a network that is able to iteratively analyze differently shaped documents, and it does so using for-loop-like activation. We’re going to try to deepen our understanding by exploring the LSTM model next (which is simply an RNN containing a slightly more sophisticated cell).
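To make the for-loop analogy concrete, here's a minimal sketch of that recursion (the weight matrices W_x and W_h, the bias b, and the initial state h0 are hypothetical learned parameters):

import numpy as np

def rnn_document_vector(tokens, W_x, W_h, b, h0):
    # fold a variable-length sequence of token vectors into one fixed-size state
    h = h0
    for x in tokens:                        # the "for-loop" at the heart of an RNN
        h = np.tanh(W_x @ x + W_h @ h + b)  # new state from current input + previous state
    return h                                # final hidden state = our document representation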
RNN resources:
- RNNs and LSTM
- A great visual explanation
LSTM stands for Long Short Term Memory, and it is an RNN architecture initially developed back in the 90s by Sepp Hochreiter and Jürgen Schmidhuber for natural language processing. The way it works is actually very similar to a ResNet CNN architecture, where new values are added onto the activation as you move through the layers, solving our exploding/vanishing gradients problem.
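To see that additive path concretely, here's a minimal sketch of a single LSTM cell step (W, U, and b are hypothetical stacked gate parameters; real implementations differ in detail):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    # one gated update; W, U, b stack the input/forget/output/candidate gate parameters
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g   # additive cell-state update: the ResNet-like shortcut for gradients
    h = o * np.tanh(c)       # hidden state exposed to the rest of the network
    return h, c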
Of course, there are still limitations. LSTMs are pretty difficult to train because you still have very long gradient paths — you are still propagating gradients from the end all the way through the transformation cell to the beginning. So for long documents (e.g. 100 words), you would have a very deep network (e.g. comparable to a 100-layer network).
Another major limitation when it comes to LSTM is that transfer learning never really worked on these models. One of the great things about CNNs is that you can train your network on a massive dataset like ImageNet, and then take that neural network and fine-tune it for a specific task on a much smaller amount of new data. This type of transfer learning is possible with an LSTM, but just not very reliable, creating the need for new, labeled datasets specific to each new task (this is expensive).
LSTM resources:
- A visual explanation
- RNNs and LSTM for NLP
Transformers are the missing link that allows our computers to actually understand “what is language? what is context?”.
When you hear the term Muppets in reference to transformer models, the speaker is referring to the collection of papers published a little over two years ago on the topic. The first paper was titled "Attention Is All You Need" ¹, and it introduces the concepts we will discuss in just a second. The next major paper, published by Google, proposed an architecture called BERT, followed by a paper proposing an architecture called ELMo, and so on. Researchers have since adopted the term "Muppet" model to describe any of these types of systems.
The first paper ¹ published on the topic of transformers addresses the problem of machine translation; think translating a document from English to French. The classic way to do this using a neural network is to use an encoder and decoder. That system looks something like this:
(Image source: https://www.wolfram.com/language/12/neural-network-framework/use-transformer-neural-nets.html?product=language)

The encoder (left) processes the input iteratively and the decoder (right) does the same thing to the output of the encoder. We're just going to focus on the encoder portion for now, as this is all you need for supervised learning and the decoder is relatively similar.
(Image source: https://www.wolfram.com/language/12/neural-network-framework/use-transformer-neural-nets.html?product=language)

At a high level, our encoder only contains three parts: a positional encoding, an attention mechanism, and a final dense layer normalizing and converting our output.
General transformer resources:
- Illustrated Guide to Transformers
- Huggingface encoder-decoder documentation
First, we’re going to talk about the attention mechanism, which is located in the middle of the encoder, and is how the algorithm is able to work with documents of variable length. This mechanism performs an all-to-all comparison where, for every layer of the net, for every output of the next layer, it considers every possible input from the previous layer. Every output is a weighted sum of every input, where the weighting is a learned function, and then we apply a fully connected (dense) layer after it.
The way this works in our transformer relies on something called a relevance score, which is simply an interpretable dot product of what we call our query and key vectors. These vectors are generated by our model: for every output position we generate a query, and for every input we're considering we generate a key. The process for how all of these pieces come together to form a prediction can be very confusing, so at a high level, here's what's going on.
Combine query and key values
Q: query token (output token)
K: key token (input token)
____________________________
Gives us the relevance scores
Relevance = Q*K
____________________________
Use softmax to normalize the relevance scores and perform a weighted average of the values (the third version of each token) to get our output
V: value (input token)
OUT = Softmax(relevance)*V

Let's also write some pseudo-code to get a better understanding of what our algorithm is doing at each step.
# X_input is a list of tensors, one per input token
# Q, K, and V are learned matrices

def attention(self, X_input):
    # for every token, transform the previous layer's output
    for i in range(self.sequence_len):
        query[i] = self.Q * X_input[i]
        key[i] = self.K * X_input[i]
        value[i] = self.V * X_input[i]

    # compute output values, one at a time
    for i in range(self.sequence_len):
        this_query = query[i]

        # how relevant is this input to this output?
        for j in range(self.sequence_len):
            relevance[j] = this_query * key[j]

        # normalize relevance scores to sum to 1
        relevance = scaled_softmax(relevance)

        # compute a weighted sum of the values
        out[i] = 0
        for j in range(self.sequence_len):
            out[i] += relevance[j] * value[j]

    return out

Cool! Now we have a basic understanding of this attention mechanism, and how it creates relevance scores.
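Note that those nested loops collapse into a few matrix multiplications, which is also what makes the computation GPU-friendly (more on this later). Here's a minimal NumPy sketch, where Wq, Wk, and Wv stand in for the learned Q, K, and V matrices:

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    # X is (seq_len, d_model); project into queries, keys, and values
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    relevance = Q @ K.T / np.sqrt(K.shape[-1])  # every query scored against every key
    return softmax(relevance) @ V               # weighted sum of values per output position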
Multi-headed attention is a novel innovation on the attention mechanism in our transformer algorithm, and its goal is to provide our model with a more comprehensive understanding of the language it's presented. The idea behind this multi-headed approach is that we simply do the same thing (the attention mechanism) 8 times (or however many heads you choose to use), each with different Q, K, and V matrices (learned by the model). This allows our network to learn 8 different things to pay attention to: in the case of translation, for example, we might learn separate attention mechanisms for grammar, vocabulary, gender, tense, etc., allowing our model to look at different parts of the input document for different purposes (and it can do this at each layer).
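Building on the sketch above, a multi-headed version just runs that computation once per head with its own learned projections and concatenates the results (a sketch only; real implementations also apply a final learned output projection):

def multi_head_attention(X, heads):
    # heads: a list of (Wq, Wk, Wv) tuples, e.g. 8 of them
    outputs = [scaled_dot_product_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outputs, axis=-1)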
Attention resources:
- Deconstructing BERT
Okay, that's a lot on the attention mechanism; let's talk about the positional encoding. It's important to note that without this positional encoding, attention mechanisms are just a bag of words. That is, there is nothing saying "work to live" is different than "live to work." We still don't have any context. The way positional encoding fixes this is by taking our input vector and using word2vec to encode some new vector for each input token. Onto that token embedding, we add the sine and cosine of some frequencies (a concept loosely based on Fourier theory), starting with pi and stretching our wave out longer and longer. What this does is allow our model to reason about the relative positions of any tokens.
You can imagine that if our assigned frequencies for each token are increasing as we iteratively encode them, our model would be able to infer that one word came before or after another based on its wavelength. Because our model has a record of all these different wavelengths, it can look across the whole document at arbitrary scales to see if one idea is before or after another. The key here is that this is how the system understands position within documents when performing attention, differentiating it from a bag of words.
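For reference, here's a minimal sketch of the sinusoidal scheme from the original transformer paper ¹, where even dimensions get a sine and odd dimensions a cosine at progressively longer wavelengths:

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                           # token positions
    i = np.arange(d_model)[None, :]                             # embedding dimensions
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)  # longer wavelength per dimension
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe  # added elementwise onto the token embeddings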
Positional encoding resources:
- Positional Encoding Explained
BERT stands for Bidirectional Encoder Representations from Transformers, and it is the transformer solution published by Google, developed by Jacob Devlin and his colleagues in 2018. There is a pre-trained implementation in Tensorflow and decent documentation on how to use it on the google-research GitHub (here).
Remember that neural machine translation, question answering, sentiment analysis, text summarization, and pretty much any other complex NLP task all require an understanding of language (vocabulary, grammar, and context). Our solution? Train BERT in two steps. First, pre-train BERT to understand language, then fine-tune it for specific tasks.
This process is how our model learns to answer the questions "What is language? What is context?". It answers these questions by training on two unsupervised tasks simultaneously: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
MLM helps BERT understand bidirectional context within a sentence. It does this by taking in a sentence, replacing random words in the sentence with mask tokens, and attempting to predict the masked words (kind of like fill in the blanks).
in: The [MASK1] brown fox [MASK2] over the lazy dog
out: MASK1 = quick, MASK2 = jumped

In NSP, the model takes in two sentences as input and determines whether the second follows the first. Think of this as a binary classification problem, where we output 1 when sentence B follows sentence A, and 0 otherwise. This helps BERT understand context across different sentences, and is how the model contextualizes separate ideas in different locations across the document.
in: A = Kevin is a cool guy. B = He lives in Ohio.
out: 1 (Yes, sentence B follows sentence A)

These two processes together (MLM and NSP) are what make BERT so good at understanding language and context.
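If you want to poke at the masked-language-modeling objective yourself, a pretrained checkpoint makes this a one-liner (a minimal sketch using the huggingface transformers pipeline API):

from transformers import pipeline

# returns the model's top candidates for the masked position
fill_mask = pipeline("fill-mask", model="bert-base-cased")
fill_mask("The quick brown fox [MASK] over the lazy dog.")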
To fine-tune our model for a specific task, all we need to do is replace the final (output) layer of BERT with a layer that architecturally fits our task and then train our model.
BERT has been made incredibly easy to use; here's a very simple BERT implementation in Tensorflow from the lovely folks at huggingface on GitHub. There are tons of resources for helping you experiment with fine-tuning your own BERT, so just ask Google for some help.
import tensorflow as tf
import tensorflow_datasets
from transformers import (BertTokenizer, TFBertForSequenceClassification,
                          glue_convert_examples_to_features)

# Load dataset, tokenizer, and model from a pretrained model/vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
data = tensorflow_datasets.load('glue/mrpc')

# Prepare dataset for GLUE as a tf.data.Dataset instance
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
valid_dataset = valid_dataset.batch(64)

# Prepare training: compile the tf.keras model with an optimizer, loss, and metric
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

If you're familiar with building neural networks in Tensorflow or Keras, this code should look pretty friendly to you. If PyTorch is your native interface, huggingface has implementations for you on their page as well.
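From there, kicking off the actual fine-tuning is a standard Keras call (a sketch, assuming the model and datasets defined above):

# fine-tune on MRPC and evaluate on the validation split
model.fit(train_dataset, epochs=2, validation_data=valid_dataset)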
BERT resources:
- Understanding searches better than ever before using BERT
- BERT Explained
- Huggingface transformers repo
Our all-to-all comparisons can be run fully in parallel, drastically reducing our compute costs. So even though our process is N² in complexity, due to the matrix orientation of our values we can run these operations simultaneously on a GPU. This is a huge advantage over RNNs such as LSTM, which have to process their tokens in sequence (e.g. you can't do anything with token 11 until you are completely done processing token 10). With GPUs allowing us to perform these calculations in parallel, our compute costs are almost "free."
Transformers make use of ReLU as opposed to sigmoid and hyperbolic tangent (tanh) activations. These activations are built into the LSTM model and are problematic. Why? These functions squash our activations to lie between 0 and 1, or between -1 and 1. If we have a neuron with a very high (or very low) activation value, our values start to cluster (saturate) around 0 and 1, or -1 and 1. Our gradient descent has a hard time telling the difference between activations in these oversaturated regions, and our optimizer can get confused.
ReLU allows each neuron to express a stronger opinion. In a sigmoid activation, there is no significant difference between an activation being 3, 8, or 30 — they are all going to cluster around 1. With ReLU, we can say we have an activation of 3, or 8, or 30, and they are all meaningfully different values. We go from being able to say “yes”, “no”, or “maybe”, to being able to express an opinion with a specific amount of strength.
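To see that saturation for yourself, run the same three activations through each function (a quick NumPy check):

import numpy as np

x = np.array([3.0, 8.0, 30.0])
1 / (1 + np.exp(-x))  # sigmoid: ~[0.9526, 0.9997, 1.0000] -- saturated, nearly indistinguishable
np.maximum(0.0, x)    # ReLU:     [3.0, 8.0, 30.0] -- each activation keeps its "strength"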
ReLU is also less sensitive to random initialization, runs great on low-precision hardware, and it’s stupidly easy to compute the gradient (either 1 or 0).
The downsides of ReLU are minimal, but it is worth noting the “dead neurons”, meaning some outputs will always be 0 (can be fixed with a leaky ReLU).
This advantage is pretty obvious, but you can now effectively perform transfer learning for NLP tasks! And you can train these networks on unsupervised tasks (unlabeled data).
Transformers have leveled up NLP; the days of LSTMs and bag-of-words models are over. What makes these transformers so great? So many things (see the list we just made above)! Experiment with Google's implementation, try out some custom builds, and fine-tune a BERT for your own task. You'll be surprised at the results if you're used to working with an LSTM or other RNN-type network.
Source: https://medium.com/voice-tech-podcast/nlps-rise-of-the-transformers-1e203d4f836a