Building a Faster and More Accurate Search Engine on a Custom Dataset with Transformers

Technology, 2025-03-01

Introduction

In this article, we will build a search engine over a large custom corpus. Given a query or question, it returns not only the answer but also a passage of surrounding context (about 1,000 characters in the examples below).

Using transformers 🤗 makes all of this considerably faster and more accurate.

Example:

Question: What is the impact of coronavirus on pregnant women?

Answer: pregnant woman may be more vulnerable to severe infection (Favre et al. 2020 ) and evidence from previous viral outbreaks suggests a higher risk of unfavourable maternal and neonatal outcomes in this population

Research Paper: COVID 19 in babies: Knowledge for neonatal care

Context: The disease manifests with a spectrum of symptoms ranging from mild upper respiratory tract infection to severe pneumonitis, acute respiratory distress syndrome (ARDS) and death. Relatively few cases have occurred in children and neonates who seem to have a more favourable clinical course than other age groups (De Rose et al. 2020). While not initially identified as a population at risk, pregnant woman may be more vulnerable to severe infection (Favre et al. 2020 ) and evidence from previous viral outbreaks suggests a higher risk of unfavourable maternal and neonatal outcomes in this population (Alfaraj et al. 2019). Moreover, the associated policies developed as a result of the pandemic relating to social distancing and prevention of cross infection have led to important considerations specific to the field of maternal and neonatal health, and a necessity to consider unintended consequences for both the mother and baby (Buekens et al. 2020)

I have published a Kaggle notebook here.


To achieve this, we will need:

A corpus of data

The Transformers library, to build the QA model

The Haystack library, to scale the QA model to thousands of documents and build a search engine

Let's start.

Data

For this article, we will use Kaggle’s COVID-19 Open Research Dataset Challenge (CORD-19).


CORD-19 is a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.


This dataset is well suited to building a document retrieval system because it contains the full content of each research paper as text. We are interested in the following columns:

paper_id: unique identifier of the research paper

title: title of the research paper

abstract: brief summary of the research paper

full_text: full text of the research paper

In the Kaggle folder structure, two directories, pmc_json and pdf_json, contain the data in JSON format. We will take 25,000 articles from pmc_json and 25,000 from pdf_json, for a total of 50,000 research articles to build our search engine.

We will extract paper_id, title, abstract and full_text and put them in an easy-to-use pandas.DataFrame.

import json
import os

import pandas as pd
from tqdm import tqdm

base_dir = "../input/CORD-19-research-challenge/document_parses"
dirs = ["pmc_json", "pdf_json"]
docs = []

for d in dirs:
    print(d)
    counts = 0
    for file in tqdm(os.listdir(f"{base_dir}/{d}")):
        file_path = f"{base_dir}/{d}/{file}"
        with open(file_path, "rb") as f:
            j = json.load(f)

        # Keep the last 7 characters: this drops the 'PMC' prefix, and it
        # shortens the GUID-style ids in pdf_json, which are hard to plot.
        paper_id = j['paper_id'][-7:]
        title = j['metadata']['title']

        # Some papers have no abstract.
        try:
            abstract = j['abstract'][0]['text']
        except (KeyError, IndexError):
            abstract = ""

        # Concatenate all body paragraphs into one string.
        full_text = ""
        for text in j['body_text']:
            full_text += text['text']

        docs.append([paper_id, title, abstract, full_text])

        # Remove the next three lines to process every file in the directory.
        counts = counts + 1
        if counts >= 25000:
            break

df = pd.DataFrame(docs, columns=['paper_id', 'title', 'abstract', 'full_text'])
print(df.shape)
df.head()

Output:

pmc_json
pdf_json
(50000, 4)
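To make the per-file extraction easier to test in isolation, here is a minimal sketch that pulls the same four fields out of a single parsed JSON document. The `SAMPLE_JSON` record and the `parse_record` helper are hypothetical, made up for illustration; real CORD-19 parse files contain many more fields.

```python
import json

# A tiny, made-up stand-in for one CORD-19 parse file (assumption: only the
# fields used in this article are shown).
SAMPLE_JSON = """
{
  "paper_id": "PMC1234567",
  "metadata": {"title": "An example paper"},
  "abstract": [{"text": "A short summary."}],
  "body_text": [{"text": "First paragraph. "}, {"text": "Second paragraph."}]
}
"""


def parse_record(record):
    """Return [paper_id, title, abstract, full_text] for one parsed document."""
    paper_id = record["paper_id"][-7:]  # keep the last 7 characters of the id
    title = record.get("metadata", {}).get("title", "")
    abstract_parts = record.get("abstract") or []  # missing or empty abstract
    abstract = abstract_parts[0]["text"] if abstract_parts else ""
    full_text = "".join(part["text"] for part in record.get("body_text", []))
    return [paper_id, title, abstract, full_text]


row = parse_record(json.loads(SAMPLE_JSON))
print(row)
```

Keeping the extraction in a small function like this makes it possible to check the handling of missing abstracts on a handful of records before looping over 50,000 files.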

We now have 50,000 articles with the columns paper_id, title, abstract and full_text.

The title and full_text columns are the ones we will use to build the engine. Let's set up a search engine on top of full_text, which contains the full content of the research papers.

Haystack

Now, welcome Haystack: the secret sauce behind setting up a search engine and scaling any QA model to thousands of documents.

Haystack helps you scale QA models to large collections of documents. You can read more about the library at https://github.com/deepset-ai/haystack

For installation: ! pip install git+https://github.com/deepset-ai/haystack.git

To give some background, Haystack has three major components.

Document Store: the database that stores the documents for our search. The Haystack authors recommend Elasticsearch, but there are also lighter-weight options for fast prototyping (SQL or in-memory).

Retriever: a fast, simple algorithm that identifies candidate passages from a large collection of documents. Options include TF-IDF or BM25, custom Elasticsearch queries, and embedding-based approaches. The Retriever narrows the scope for the Reader to smaller units of text where a given question could be answered.

Reader: a powerful neural model that reads through texts in detail to find an answer. You can use models such as BERT, RoBERTa or XLNet, trained via FARM or Transformers on SQuAD-like tasks. The Reader takes multiple passages of text as input and returns the top-n answers with corresponding confidence scores. You can load a pretrained model from Hugging Face's model hub or fine-tune one on your own domain data.

Finally, the Finder glues a Reader and a Retriever together into a pipeline that provides an easy-to-use question answering interface.
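The division of labour between the components can be sketched in plain Python. This is a toy illustration only, not Haystack code: `retrieve`, `read` and `find` are hypothetical names, word overlap stands in for BM25, and picking the best-matching sentence stands in for a neural reader.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, documents, top_k_retriever=2):
    """Cheap first pass: keep the documents sharing the most words with the question."""
    q_words = tokenize(question)
    scored = [(len(q_words & tokenize(doc["text"])), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k_retriever] if score > 0]


def read(question, candidates, top_k_reader=1):
    """Expensive second pass (toy version): rank sentences of the candidates only."""
    q_words = tokenize(question)
    answers = []
    for doc in candidates:
        for sentence in doc["text"].split(". "):
            score = len(q_words & tokenize(sentence))
            answers.append({"answer": sentence, "score": score, "name": doc["name"]})
    answers.sort(key=lambda a: a["score"], reverse=True)
    return answers[:top_k_reader]


def find(question, documents, top_k_retriever=2, top_k_reader=1):
    """Glue the two passes together, as the Finder does for Retriever and Reader."""
    return read(question, retrieve(question, documents, top_k_retriever), top_k_reader)


docs = [
    {"name": "Paper A", "text": "Masks reduce transmission. Vaccines reduce severe disease"},
    {"name": "Paper B", "text": "Bats are a reservoir for coronaviruses"},
    {"name": "Paper C", "text": "Pregnant women may face higher risk. Neonates show mild symptoms"},
]
result = find("What risk do pregnant women face", docs)
print(result)
```

The point of the two-stage design is cost: the slow reader only ever sees the few documents that the fast retriever lets through.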

Now, we can set up Haystack in 3 steps:

1. Install haystack and import its required modules

2. Set up the DocumentStore

3. Set up the Retriever, Reader and Finder

1. Install haystack

Let's install haystack and import all the required modules.

# Install Haystack
! pip install git+https://github.com/deepset-ai/haystack.git

# Import the necessary dependencies
from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

2. Setting up DocumentStore

Haystack finds answers to queries within the documents stored in a DocumentStore. The current implementations of DocumentStore include ElasticsearchDocumentStore, SQLDocumentStore, and InMemoryDocumentStore.

The Haystack authors recommend ElasticsearchDocumentStore because it comes preloaded with features like full-text queries, BM25 retrieval, and vector storage for text embeddings.

So let's set up an ElasticsearchDocumentStore.

! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT

# Start Elasticsearch as the daemon user
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1))

# Wait until Elasticsearch has started
! sleep 30

# Connect to Elasticsearch
from haystack.database.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
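A fixed 30-second sleep can be either too short or needlessly long. As an alternative sketch, the helper below polls a TCP port until it accepts connections. `wait_for_port` is a hypothetical helper written for this article, not part of Haystack or Elasticsearch, and an open port only shows that the server process is listening, not that the cluster is fully ready.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait a little and try again.
            time.sleep(interval)
    return False


# Elasticsearch serves HTTP on port 9200 by default, so in the notebook one
# could call (assumption: default port, local host):
# ready = wait_for_port("localhost", 9200)
```

If the function returns False, the server did not come up in time, and it is worth inspecting the output of the `es_server` process before going on.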

Once the ElasticsearchDocumentStore is set up, we will write our documents to it.

ElasticsearchDocumentStore expects documents as a list of dictionaries in the following format:

[
    {"name": "<some-document-name>", "text": "<the-actual-text>"},
    {"name": "<some-document-name>", "text": "<the-actual-text>"},
    {"name": "<some-document-name>", "text": "<the-actual-text>"}
]

(Optionally, you can add more key-value pairs here. They will be indexed as fields in Elasticsearch and can be accessed later for filtering or shown in the responses of the Finder.)

We will pass the title column as name and the full_text column as text.

# Write the documents to the document store: title becomes name, full_text becomes text.
document_store.write_documents(
    df[['title', 'full_text']]
    .rename(columns={'title': 'name', 'full_text': 'text'})
    .to_dict(orient='records')
)
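The same list-of-dictionaries format can be built without pandas, which also gives a natural place to skip papers with no body text. This is a minimal sketch; `to_documents` and the `"untitled"` fallback are assumptions for illustration, not something Haystack requires.

```python
def to_documents(rows):
    """Convert (title, full_text) pairs into [{"name": ..., "text": ...}, ...]."""
    documents = []
    for title, full_text in rows:
        text = (full_text or "").strip()
        if not text:
            # A paper without body text gives the reader nothing to search.
            continue
        documents.append({"name": title or "untitled", "text": text})
    return documents


rows = [
    ("Paper A", "Some text."),
    ("Paper B", ""),            # dropped: empty body
    (None, "  Body only  "),    # kept, with a placeholder name
]
documents = to_documents(rows)
print(documents)
```

With the DataFrame from earlier, the pairs could come from `zip(df['title'], df['full_text'])`.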

3. Setup Retriever, Reader and Finder

Retrievers narrow the scope for the Reader to smaller units of text where a given question could be answered. They use simple but fast algorithms.

Here, we use Elasticsearch's default BM25 algorithm.

from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)
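For intuition about what BM25 rewards, here is a small sketch of the textbook scoring formula. It is not Elasticsearch's implementation: the tokenizer is a simple regular expression and `bm25_scores` is a hypothetical helper. The defaults k1=1.2 and b=0.75 match Elasticsearch's default BM25 parameters.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into a list of word tokens."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in docs (a non-empty list of strings) against query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalised by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "coronavirus impact on pregnant women",
    "bats and coronaviruses in caves",
    "neonatal care during the pandemic",
]
scores = bm25_scores("pregnant women", corpus)
print(scores)
```

Only the first document contains the query terms, so it is the only one with a non-zero score. No neural network is involved, which is why this stage is fast enough to run over the whole corpus.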

A Reader scans the texts returned by the Retriever in detail and extracts the k best answers. Readers are based on powerful but slower deep learning models.

Haystack currently supports Readers based on the FARM and Transformers frameworks. With both, you can load either a local model or one from Hugging Face's model hub (https://huggingface.co/models).

Here, we use a medium-sized RoBERTa QA model with a FARM-based Reader (https://huggingface.co/deepset/roberta-base-squad2).

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True, context_window_size=500)
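The context_window_size argument controls how much text is shown around each answer. The sketch below illustrates the general idea of cutting a character window centred on an answer span. It is an illustration under my own assumptions, not Haystack's actual implementation, and `context_window` is a hypothetical helper.

```python
def context_window(text, answer_start, answer_end, window_size=500):
    """Return about window_size characters of text centred on the answer span."""
    answer_len = answer_end - answer_start
    # Split the characters left over after the answer evenly between both sides.
    padding = max(window_size - answer_len, 0) // 2
    start = max(answer_start - padding, 0)        # clamp at the start of the text
    end = min(answer_end + padding, len(text))    # clamp at the end of the text
    return text[start:end]


text = "a" * 100 + "ANSWER" + "b" * 100
snippet = context_window(text, answer_start=100, answer_end=106, window_size=20)
print(snippet)
```

Because the window is cut by position rather than at word or sentence boundaries, a context can begin or end in the middle of a word, as some of the contexts in the example output do.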

And finally, the Finder ties the Reader and Retriever together in a pipeline to fetch answers for our query.

finder = Finder(reader, retriever)

🥳 Voila! We’re Done.

Let's see how well our search engine works. For simplicity, we will limit the number of answers returned to 2 using the top_k_reader parameter, but this can be raised as needed in production.

Whenever we query our DocumentStore, each result contains 3 things:

the answer

the context around the answer (about 1,000 characters in the output below)

the name/title of the research paper

Example 1: What is the impact of coronavirus on babies?

question = "What is the impact of coronavirus on babies?"
number_of_answers_to_fetch = 2

prediction = finder.get_answers(question=question, top_k_retriever=10, top_k_reader=number_of_answers_to_fetch)

print(f"Question: {prediction['question']}")
print("\n")
for i in range(number_of_answers_to_fetch):
    print(f"#{i+1}")
    print(f"Answer: {prediction['answers'][i]['answer']}")
    print(f"Research Paper: {prediction['answers'][i]['meta']['name']}")
    print(f"Context: {prediction['answers'][i]['context']}")
    print('\n\n')

Output:

Question: What is the impact of coronavirus on babies?

#1
Answer: While babies have been infected, the naivete of the neonatal immune system in relation to the inflammatory response would appear to be protective, with further inflammatory responses achieved with the consumption of human milk.
Research Paper: COVID 19 in babies: Knowledge for neonatal care
Context: ance to minimize the health systems impact of this pandemic across the lifespan.The Covid-19 pandemic has presented neonatal nurses and midwives with challenges when caring for mother and babies. This review has presented what is currently known aboutCovid-19 and neonatal health, and information and research as they are generated will add to a complete picture of the health outcomes. While babies have been infected, the naivete of the neonatal immune system in relation to the inflammatory response would appear to be protective, with further inflammatory responses achieved with the consumption of human milk. The WHO has made clear recommendations about the benefits of breastfeeding, even if the mother and baby dyad is Covid-19 positive, if they remain well. The mother and baby should not be separated, and the mother needs to be able to participate in her baby's care and develop her mothering role. The complexities of not being able to access her usual support people mean that the mother

#2
Answer: neonate are mild, with low-grade fever and gastrointestinal signs such as poor feeding and vomiting. The respiratory symptoms are also limited to mild tachypnoea and/or tachycardia.
Research Paper: COVID 19 in babies: Knowledge for neonatal care
Context: Likewise, if the mother and baby are well, skin-to-skin and breast feeding should be encouraged, as the benefits outweigh any potential harms. If a neonate becomes unwell and requires intensive care, they should be nursed with droplet precautions in a closed incubator in a negative pressure room. The management is dictated by the presenting signs and symptoms. It would appear the presenting symptoms in the neonate are mild, with low-grade fever and gastrointestinal signs such as poor feeding and vomiting. The respiratory symptoms are also limited to mild tachypnoea and/or tachycardia. However, as there has been a presentation of seizure activity with fever a neurological examination should be part of the investigations.As of writing this paper, there has been a preliminary study published in the general media The WHO has also welcomed these preliminary results and state it is "looking forward to a full data analysis" (https://www.who.int/news-room/detail/16-06-2020-who-welcomesprelimin
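The printing loop indexes a fixed number of answers, which would raise an IndexError if the reader returned fewer than requested. A slightly more defensive sketch iterates over whatever answers are present. The `format_answers` helper and the `mock_prediction` dictionary are hypothetical; the dictionary only mimics the keys used in the code in this article.

```python
def format_answers(prediction, max_answers=2):
    """Render a prediction dict as text, tolerating fewer answers than requested."""
    lines = [f"Question: {prediction['question']}", ""]
    for rank, ans in enumerate(prediction["answers"][:max_answers], start=1):
        lines.append(f"#{rank}")
        lines.append(f"Answer: {ans['answer']}")
        lines.append(f"Research Paper: {ans['meta']['name']}")
        lines.append(f"Context: {ans['context']}")
        lines.append("")
    return "\n".join(lines)


# A made-up prediction with a single answer, shaped like the one used above.
mock_prediction = {
    "question": "What is the impact of coronavirus on babies?",
    "answers": [
        {
            "answer": "symptoms in the neonate are mild",
            "meta": {"name": "An example paper"},
            "context": "It would appear the presenting symptoms in the neonate are mild.",
        }
    ],
}
report = format_answers(mock_prediction, max_answers=2)
print(report)
```

Returning a string instead of printing directly also makes the output easy to log or to check in a test.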

Example 2: What is the impact of coronavirus on pregnant women?


question = "What is the impact of coronavirus on pregnant women?"
number_of_answers_to_fetch = 2

prediction = finder.get_answers(question=question, top_k_retriever=10, top_k_reader=number_of_answers_to_fetch)

print(f"Question: {prediction['question']}")
print("\n")
for i in range(number_of_answers_to_fetch):
    print(f"#{i+1}")
    print(f"Answer: {prediction['answers'][i]['answer']}")
    print(f"Research Paper: {prediction['answers'][i]['meta']['name']}")
    print(f"Context: {prediction['answers'][i]['context']}")
    print('\n\n')

Output:

Question: What is the impact of coronavirus on pregnant women?

#1
Answer: pregnant woman may be more vulnerable to severe infection (Favre et al. 2020 ) and evidence from previous viral outbreaks suggests a higher risk of unfavourable maternal and neonatal outcomes in this population
Research Paper: COVID 19 in babies: Knowledge for neonatal care
Context: na. The disease manifests with a spectrum of symptoms ranging from mild upper respiratory tract infection to severe pneumonitis, acute respiratory distress syndrome (ARDS) and death.Relatively few cases have occurred in children and neonates who seem to have a more favourable clinical course than other age groups (De Rose et al. 2020) . While not initially identified as a population at risk, pregnant woman may be more vulnerable to severe infection (Favre et al. 2020 ) and evidence from previous viral outbreaks suggests a higher risk of unfavourable maternal and neonatal outcomes in this population (Alfaraj et al. 2019) .Moreover, the associated policies developed as a result of the pandemic relating to social distancing and prevention of cross infection have led to important considerations specific to the field of maternal and neonatal health, and a necessity to consider unintended consequences for both the mother and baby (Buekens et al. 2020) .Countries are faced with a rapidly deve

#2
Answer: While not initially identified as a population at risk, pregnant woman may be more vulnerable to severe infection (Favre et al., 2020) and evidence from previous viral outbreaks suggests a higher risk of unfavourable maternal and neonatal outcomes in this population
Research Paper: COVID 19 in babies: Knowledge for neonatal care
Context: tified in Wuhan, Hubei, China. The disease manifests with a spectrum of symptoms ranging from mild upper respiratory tract infection to severe pneumonitis, acute respiratory distress syndrome (ARDS) and death. Relatively few cases have occurred in children and neonates who seem to have a more favourable clinical course than other age groups (De Rose et al., 2020). While not initially identified as a population at risk, pregnant woman may be more vulnerable to severe infection (Favre et al., 2020) and evidence from previous viral outbreaks suggests a higher risk of unfavourable maternal and neonatal outcomes in this population (Alfaraj et al., 2019). Moreover, the associated policies developed as a result of the pandemic relating to social distancing and prevention of cross infection have led to important considerations specific to the field of maternal and neonatal health, and a necessity to consider unintended consequences for both the mother and baby (Buekens et al., 2020).Countries

Thanks for reading.

I have published a Kaggle notebook here.

If you’re interested in reading more Machine Learning/Deep Learning articles

Translated from: https://medium.com/analytics-vidhya/building-a-faster-and-accurate-search-engine-on-custom-dataset-with-transformers-d1277bedff3d
