PEGASUS API Usage Guide
In the last week of December 2019, the Google Brain team released PEGASUS, a state-of-the-art summarization model whose name expands to Pre-training with Extracted Gap-sentences for Abstractive Summarization. In this article, we will just be looking at how to generate summaries using the pre-trained model; for information on how the pre-training took place, refer here.
As one can see in the original paper itself, it produces great abstractive summaries. For example, with one of its models fine-tuned on XSum data, the following happened for an input:
Fig 1: Example

Not bad for a machine-generated summary, eh?
Coming to the point of this article, let's see how we can use the given pre-trained model to generate summaries for our own text. Since this is ongoing research, there is no quick out-of-the-box method to get summaries for our text yet. So until we get one from the authors, the approach in this article can be used.
As the first step, visit the GitHub repository and follow the steps mentioned in the documentation to install the library and download the model checkpoints. Be careful about the way you install gsutil: on some Linux distributions, a different package with the same name gets installed. The documentation is now updated, so just make sure that you read through the steps carefully.
The next step is to install the dependencies mentioned in requirements.txt. Caution is required here as well: keep track of the versions of the dependencies you are using. In my case, everything worked flawlessly with tensorflow version 1.15.
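Since version drift is the most common cause of breakage here, a tiny sanity check like the following (my own addition, not part of the repository) can save some debugging time:

import tensorflow as tf

# The setup in this article was verified against TensorFlow 1.15;
# fail early if a different version is installed.
assert tf.__version__.startswith("1.15"), \
    "Expected tensorflow 1.15, found %s" % tf.__version__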
Great! Now that we are done with the setup, let's get to the action. The pegasus directory is laid out as follows:
Fig 2: Cloned pegasus repository.

In the top-most directory, named ckpt, we have our model checkpoint trained on C4 data. Along with that, you will find models fine-tuned on 12 TensorFlow datasets. Refer to Fig 3.
Fig 3: Model checkpoints.

Everything seems fine so far. One can use any of these model checkpoints to generate summaries for their custom text. But wait, before getting excited about these models: if you think about it, there must be some form in which the model expects its input, right? So let's work on creating the input data first.
The input needs to be a .tfrecord. So let's see how we are going to create our input data. The following piece of code ought to do it for you. Just one thing to take care of here: make sure the .tfrecord is saved inside the testdata directory, which is inside pegasus/data/.
import pandas as pd
import tensorflow as tf

save_path = "<Your path>/pegasus/data/testdata/test_pattern_1.tfrecords"

input_dict = dict(
    inputs=[
        # Your text inputs to be summarized.
    ],
    targets=[
        # Corresponding targets for the inputs.
    ]
)

data = pd.DataFrame(input_dict)
with tf.io.TFRecordWriter(save_path) as writer:
    for row in data.values:
        inputs, targets = row[:-1], row[-1]
        example = tf.train.Example(
            features=tf.train.Features(
                feature={
                    "inputs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[inputs[0].encode('utf-8')])),
                    "targets": tf.train.Feature(bytes_list=tf.train.BytesList(value=[targets.encode('utf-8')])),
                }
            )
        )
        writer.write(example.SerializeToString())

In the gist above, you will see that targets are also passed. The targets list is supposed to hold the actual summaries, i.e. the ground truth. Since we are only trying to generate summaries from the model and not train it, you can pass empty strings, but we can't omit the field because the model expects input in that format.
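To make this concrete, here is a minimal sketch of what a filled-in dictionary could look like, followed by a read-back of the written file to verify it; the input text is a placeholder of my own, not from the repository:

input_dict = dict(
    inputs=[
        "PEGASUS was released by the Google Brain team in December 2019 ..."  # placeholder text
    ],
    targets=[
        ""  # empty string: we only run inference, we don't train
    ]
)

# Optional: read the records back to confirm the file was written as expected.
# tf.compat.v1.io.tf_record_iterator is available in tensorflow 1.15.
for record in tf.compat.v1.io.tf_record_iterator(save_path):
    example = tf.train.Example.FromString(record)
    text = example.features.feature["inputs"].bytes_list.value[0].decode("utf-8")
    print(text[:80])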
Awesome! Now that our data is prepared, there is just one more step before we start getting summaries: registering our tfrecord in pegasus's registry (locally). Great! Let's move forward. Just remember to keep track of the save_path from the code we used to generate the input data.
save_path = "<Your path>/pegasus/data/testdata/test_pattern_1.tfrecords"

@registry.register("test_transformer")
def test_transformer(param_overrides):
    return transformer_params(
        {
            "train_pattern": save_path,
            "dev_pattern": save_path,
            "test_pattern": save_path,
            "max_input_len": 1024,
            "max_output_len": 256,
            "train_steps": 180000,
            "learning_rate": 0.0001,
            "batch_size": 8,
        },
        param_overrides)

In the pegasus directory on your system, go to the path pegasus/params/public_params.py and paste the above code at the end of the script. In the gist above, you will see that all three patterns, train_pattern, dev_pattern and test_pattern, are assigned the same tfrecord. You may create different tfrecords for all three, but since we are only looking to infer, it doesn't matter. And we are done!
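As a quick sanity check (my own suggestion, assuming you run it from the repository root), you can try importing the module you just edited; the @registry.register decorator executes at import time, so a typo in the pasted snippet will surface immediately:

# Importing runs the module top to bottom, including our new
# @registry.register("test_transformer") entry.
from pegasus.params import public_params  # noqa: F401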
Switch to the pegasus directory in your terminal and run the command:
python3 pegasus/bin/evaluate.py --params=test_transformer \
  --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
  --model_dir=ckpt/pegasus_ckpt/

This will start creating the summaries for your input data. Once done, you will see three text files created in the directory of the model that you picked. These files correspond to the input text, the target text and the predicted summaries.
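If you'd rather inspect the outputs programmatically than in a text editor, a few lines of Python will do; the filenames below are placeholders of mine, so substitute whatever names the run actually produced in your model directory:

# Hypothetical filenames: check the model directory for the actual ones.
for name in ["inputs.txt", "targets.txt", "predictions.txt"]:
    with open("ckpt/pegasus_ckpt/" + name, encoding="utf-8") as f:
        print("=== %s ===" % name)
        print(f.read()[:500])  # first 500 characters of each file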
You can open these text files and analyze the summaries. While you do, you might see that the summaries appear to be extractive rather than abstractive. That can be cured by fine-tuning the model on your data, even with a very small sample. See this note from the contributors.
This article presented one workaround for generating summaries from the pre-trained abstractive summarization model provided by the Google Brain team. It may not be a clean or efficient method, but it ought to do the job until we get such functionality from the authors. If readers have some other way to make use of these models for creating summaries, please comment or reach out.
Thank you so much for taking the time to read this article. Find me at https://chauhanakash23.github.io/
https://www.youtube.com/watch?v=GQs2AiohjpM
https://github.com/google-research/pegasus
https://towardsdatascience.com/pegasus-google-state-of-the-art-abstractive-summarization-model-627b1bbbc5ce
Translated from: https://medium.com/thecyphy/generating-abstractive-summaries-using-googles-pegasus-model-18eef8ae985b