Many people who have just started learning Elasticsearch confuse the Text and Keyword field data types. The difference between them is simple but crucial. In this article, I will explain the difference, how to use each type, how each behaves, and which one to choose.
The crucial difference is that Elasticsearch analyzes a Text field before storing it in the Inverted Index, while it does not analyze a Keyword field. Whether a field is analyzed determines how it behaves when queried.
If you’re just starting to learn Elasticsearch and don’t yet know what the Inverted Index and an Analyzer are, I recommend reading a basic guide to Elasticsearch first.
If you index a document containing a string without first defining a mapping for its fields, Elasticsearch creates a dynamic mapping that covers both types: the field is mapped as Text with a Keyword sub-field. Even though dynamic mapping works, I suggest defining a mapping that fits your use case before you index any documents, to save space and increase write speed.
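For reference, the mapping that dynamic mapping generates for a new string field looks roughly like the Python dictionary below. This is a sketch based on the default dynamic-mapping behaviour as I understand it (a text field with a keyword sub-field limited by `ignore_above: 256`); the field name `title` is a made-up example and is not part of this article's index.

```python
import json

def default_string_mapping():
    """Approximate mapping Elasticsearch generates for an unmapped string field."""
    return {
        "type": "text",
        "fields": {
            "keyword": {"type": "keyword", "ignore_above": 256}
        },
    }

# "title" is a hypothetical field name used only for illustration.
mapping = {"properties": {"title": default_string_mapping()}}
print(json.dumps(mapping, indent=2))
```

Every dynamically mapped string is therefore indexed twice, which is where the extra space and write cost come from.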
Below are examples of the mapping settings for the Text and Keyword types. Note that I use an index named “text-vs-keyword”, which I created beforehand for this example.
Keyword Mapping
    curl --request PUT \
      --url http://localhost:9200/text-vs-keyword/_mapping \
      --header 'content-type: application/json' \
      --data '{
        "properties": {
          "keyword_field": { "type": "keyword" }
        }
      }'

Text Mapping
    curl --request PUT \
      --url http://localhost:9200/text-vs-keyword/_mapping \
      --header 'content-type: application/json' \
      --data '{
        "properties": {
          "text_field": { "type": "text" }
        }
      }'

Multi Fields
    curl --request PUT \
      --url http://localhost:9200/text-vs-keyword/_mapping \
      --header 'content-type: application/json' \
      --data '{
        "properties": {
          "text_and_keyword_mapping": {
            "type": "text",
            "fields": {
              "keyword_type": { "type": "keyword" }
            }
          }
        }
      }'

The two field types are indexed differently in the Inverted Index, and that difference affects the results you get when you query Elasticsearch.
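With the Multi Fields mapping, the keyword sub-field is addressed by joining the parent and sub-field names with a dot. The sketch below only builds the two request bodies as Python dictionaries; the helper names `full_text_query` and `exact_query` are my own, and the Match and Term queries they use are explained later in this article.

```python
import json

FIELD = "text_and_keyword_mapping"   # text field from the Multi Fields mapping
SUBFIELD = f"{FIELD}.keyword_type"   # its keyword sub-field, addressed with a dot

def full_text_query(text):
    """Body of a match query against the analyzed text field."""
    return {"query": {"match": {FIELD: text}}}

def exact_query(value):
    """Body of a term query against the non-analyzed keyword sub-field."""
    return {"query": {"term": {SUBFIELD: value}}}

print(json.dumps(full_text_query("quick fox")))
print(json.dumps(exact_query("The quick brown fox jumps over the lazy dog")))
```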
Let’s index a document as an example:
    curl --request POST \
      --url http://localhost:9200/text-vs-keyword/_doc/example \
      --header 'content-type: application/json' \
      --data '{
        "keyword_field": "The quick brown fox jumps over the lazy dog",
        "text_field": "The quick brown fox jumps over the lazy dog"
      }'

After running the curl command above, fetching all the documents in the index should return:
    [
      {
        "_index": "text-vs-keyword",
        "_type": "_doc",
        "_id": "example",
        "_score": 1.0,
        "_source": {
          "keyword_field": "The quick brown fox jumps over the lazy dog",
          "text_field": "The quick brown fox jumps over the lazy dog"
        }
      }
    ]

Let’s start with the simpler one, Keyword. Elasticsearch doesn’t analyze Keyword fields, which means the string you index stays exactly as it is.
So, with the example above, what does the string look like in the Inverted Index?
[Figure: Example keyword in Inverted Index]

Yes, you’re right: it’s stored exactly as you wrote it.
Unlike a Keyword field, a string indexed into a Text field goes through the analysis process before it is stored in the Inverted Index. By default, Elasticsearch’s standard analyzer splits the string into individual terms and lowercases them. You can learn more about the standard analyzer in Elasticsearch’s documentation.
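To make "split and lowercase" concrete, here is a rough Python approximation of what the standard analyzer does to our sentence. It is only a sketch: the real analyzer follows Unicode text segmentation rules and handles many more cases, and the function name `simple_standard_analyze` is my own.

```python
import re

def simple_standard_analyze(text):
    """Rough stand-in for the standard analyzer: split on non-word
    characters, then lowercase each token."""
    return [token.lower() for token in re.findall(r"\w+", text)]

print(simple_standard_analyze("The quick brown fox jumps over the lazy dog"))
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```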
Elasticsearch has an API that shows what a text looks like after analysis. We can try it with:
    curl --request POST \
      --url 'http://localhost:9200/text-vs-keyword/_analyze?pretty' \
      --header 'content-type: application/json' \
      --data '{
        "analyzer": "standard",
        "text": "The quick brown fox jumps over the lazy dog"
      }'

According to the response, which lists the tokens the, quick, brown, fox, jumps, over, the, lazy, and dog, this is what the Inverted Index should look like for the text_field field:
[Figure: Example text in Inverted Index]

Only a little different from the keyword one, right? But pay attention to what is stored in the Inverted Index, because it strongly affects the query process.
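The two index layouts can be reproduced with a toy inverted index in Python. This is a simplified model under my own assumptions (a regex tokenizer instead of the real analyzer, and a plain dictionary from term to document ids), not how Lucene actually stores data.

```python
import re
from collections import defaultdict

def analyze(text):
    """Simplified analysis: split on non-word characters and lowercase."""
    return [t.lower() for t in re.findall(r"\w+", text)]

def build_inverted_index(docs, analyzed):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        # A keyword field keeps the whole value as one term.
        terms = analyze(value) if analyzed else [value]
        for term in terms:
            index[term].add(doc_id)
    return dict(index)

docs = {"example": "The quick brown fox jumps over the lazy dog"}
keyword_index = build_inverted_index(docs, analyzed=False)
text_index = build_inverted_index(docs, analyzed=True)

print(keyword_index)       # one term: the whole sentence
print(sorted(text_index))  # eight distinct lowercase terms
```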
Now that we understand how Text and Keyword behave when indexed, let’s look at how they behave when queried.
First, we must know that there are two types of query for strings:
Match Query
Term Query

As with Text and Keyword, the difference between Match Query and Term Query is that the query string in a Match Query is analyzed into terms first, while the query string in a Term Query is not.
Elasticsearch answers a query by matching the queried terms against the terms in the Inverted Index. A queried term and an indexed term must be exactly the same, or they won’t match. This means that whether a string was analyzed at index time and at query time can produce very different results.
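The difference between the two query types can be sketched in a few lines of Python. The toy index and the functions `term_query` and `match_query` below are illustrative stand-ins of my own, not Elasticsearch APIs.

```python
import re

def analyze(text):
    """Simplified analysis: split on non-word characters and lowercase."""
    return [t.lower() for t in re.findall(r"\w+", text)]

# Toy inverted index for an analyzed field: term -> set of document ids.
INDEX = {"quick": {"example"}, "fox": {"example"}}

def term_query(index, value):
    """Look the value up as-is, with no analysis."""
    return set(index.get(value, set()))

def match_query(index, value):
    """Analyze the value first, then collect documents matching any term."""
    hits = set()
    for term in analyze(value):
        hits |= index.get(term, set())
    return hits

print(term_query(INDEX, "Quick"))   # set()  -> "Quick" is not an indexed term
print(match_query(INDEX, "Quick"))  # {'example'}  -> analyzed to "quick"
```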
With a Term Query on a Keyword field, neither the field value nor the query is analyzed, so the two must be exactly the same to produce a result.
If we try with exactly the same string:
    curl --request POST \
      --url http://localhost:9200/text-vs-keyword/_search \
      --header 'content-type: application/json' \
      --data '{
        "query": {
          "term": {
            "keyword_field": "The quick brown fox jumps over the lazy dog"
          }
        }
      }'

Elasticsearch will return a result:
    {
      "_index": "text-vs-keyword",
      "_type": "_doc",
      "_id": "example",
      "_score": 0.2876821,
      "_source": {
        "keyword_field": "The quick brown fox jumps over the lazy dog",
        "text_field": "The quick brown fox jumps over the lazy dog"
      }
    }

If we try with something that is not an exact match, even a word that appears in the indexed string:
    curl --request POST \
      --url http://localhost:9200/text-vs-keyword/_search \
      --header 'content-type: application/json' \
      --data '{
        "query": {
          "term": {
            "keyword_field": "The"
          }
        }
      }'

It returns no result, because the term in the query doesn’t match any of the terms in the Inverted Index.
Let’s first try querying keyword_field with the same string, “The quick brown fox jumps over the lazy dog”, using a Match Query and see what happens:
    curl --request POST \
      --url http://localhost:9200/text-vs-keyword/_search \
      --header 'content-type: application/json' \
      --data '{
        "query": {
          "match": {
            "keyword_field": "The quick brown fox jumps over the lazy dog"
          }
        }
      }'

The result should be:
    {
      "_index": "text-vs-keyword",
      "_type": "_doc",
      "_id": "example",
      "_score": 0.2876821,
      "_source": {
        "keyword_field": "The quick brown fox jumps over the lazy dog",
        "text_field": "The quick brown fox jumps over the lazy dog"
      }
    }

Wait, shouldn’t it produce no result, since the terms produced by analyzing the query aren’t an exact match for “The quick brown fox jumps over the lazy dog” in the Inverted Index? Why is it producing a result?
That’s right, the query was analyzed because we’re using a Match Query. But instead of the standard analyzer, Elasticsearch used the analyzer associated with the field being searched. A Keyword field is not tokenized: its whole value is treated as a single term, so Elasticsearch changed nothing in the query.
Now, let’s try with the standard analyzer:
    curl --request POST \
      --url http://localhost:9200/text-vs-keyword/_search \
      --header 'content-type: application/json' \
      --data '{
        "query": {
          "match": {
            "keyword_field": {
              "query": "The quick brown fox jumps over the lazy dog",
              "analyzer": "standard"
            }
          }
        }
      }'

No result is produced, because the query is analyzed into separate terms and none of them is an exact match for the single term in the Inverted Index.
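The two Match Query attempts against keyword_field can be mimicked in Python by making the analyzer a parameter. This is a sketch under simplifying assumptions: `keyword_analyze` stands in for the no-op analysis of a Keyword field, `standard_analyze` is a regex approximation of the standard analyzer, and all names are my own.

```python
import re

def keyword_analyze(text):
    """No-op analysis: the whole input becomes a single term."""
    return [text]

def standard_analyze(text):
    """Simplified standard analysis: split on non-word characters, lowercase."""
    return [t.lower() for t in re.findall(r"\w+", text)]

SENTENCE = "The quick brown fox jumps over the lazy dog"
# Toy inverted index of keyword_field: one term, the full sentence.
KEYWORD_INDEX = {SENTENCE: {"example"}}

def match_query(index, value, analyzer=keyword_analyze):
    """Analyze the query with the given analyzer, then match any term."""
    hits = set()
    for term in analyzer(value):
        hits |= index.get(term, set())
    return hits

print(match_query(KEYWORD_INDEX, SENTENCE))                    # {'example'}
print(match_query(KEYWORD_INDEX, SENTENCE, standard_analyze))  # set()
```

The default argument models the field's own analysis being used unless the query overrides it.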
As we saw in the previous section, a document indexed into a Text field produces many terms. To show how a Term Query is matched against those terms in the Inverted Index, let’s try two queries. The first query sends the entire sentence to Elasticsearch:
    curl --request POST \
      --url 'http://localhost:9200/text-vs-keyword/_search?pretty' \
      --header 'content-type: application/json' \
      --data '{
        "query": {
          "term": {
            "text_field": "The quick brown fox jumps over the lazy dog"
          }
        }
      }'

The second one sends only “The”:
    curl --request POST \
      --url 'http://localhost:9200/text-vs-keyword/_search?pretty' \
      --header 'content-type: application/json' \
      --data '{
        "query": {
          "term": {
            "text_field": "The"
          }
        }
      }'

Both queries produce no results.
The first query produced no result because the entire sentence was never stored in the Inverted Index; the indexing process stores only the terms that were split out of the text.
The second query also produced no result. There is a “The” in the indexed document, but remember that the analyzer lowercased the word, so in the Inverted Index it is stored as “the”.
Let’s try the Term Query again with “the”:
    curl --request POST \
      --url 'http://localhost:9200/text-vs-keyword/_search?pretty' \
      --header 'content-type: application/json' \
      --data '{
        "query": {
          "term": {
            "text_field": "the"
          }
        }
      }'

Yep! It produced a result because the queried “the” is an exact match for the “the” in the Inverted Index.
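The three Term Query attempts against text_field boil down to a set-membership check. The sketch below assumes the simplified regex analysis used for illustration in place of the real standard analyzer; `term_query_hits` is a name of my own.

```python
import re

SENTENCE = "The quick brown fox jumps over the lazy dog"
# Terms stored for text_field after simplified analysis (split + lowercase).
TEXT_TERMS = {t.lower() for t in re.findall(r"\w+", SENTENCE)}

def term_query_hits(value):
    """A term query matches only if the raw value is itself an indexed term."""
    return value in TEXT_TERMS

for value in (SENTENCE, "The", "the"):
    print(repr(value), "->", term_query_hits(value))
# Only 'the' is found; the full sentence and 'The' are not indexed terms.
```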
Now it’s time for the Text type with a Match Query. Since both the field and the query are analyzed, it is easy to get results. Let’s try two queries.
The first query sends “The” to Elasticsearch. We know that a Term Query produces no result for it, but what about a Match Query?
The second query sends “the LAZ dog tripped over th QUICK brown dog”. Some of its words are in the Inverted Index and some are not. Will Elasticsearch produce any result from it?
    curl --request POST \
      --url 'http://localhost:9200/text-vs-keyword/_search?pretty' \
      --header 'content-type: application/json' \
      --data '{
        "query": {
          "match": {
            "text_field": "The"
          }
        }
      }'

    curl --request POST \
      --url 'http://localhost:9200/text-vs-keyword/_search?pretty' \
      --header 'content-type: application/json' \
      --data '{
        "query": {
          "match": {
            "text_field": "the LAZ dog tripped over th QUICK brown dog"
          }
        }
      }'

Yep! Both of them produced a result:
    {
      "_index": "text-vs-keyword",
      "_type": "_doc",
      "_id": "example",
      "_score": 0.39556286,
      "_source": {
        "keyword_field": "The quick brown fox jumps over the lazy dog",
        "text_field": "The quick brown fox jumps over the lazy dog"
      }
    }

The first query produced a result because “The” in the query was analyzed and became “the”, which is an exact match for the one in the Inverted Index.
The second query still produces a result even though not all of its terms are in the Inverted Index. With the Match Query’s default OR operator, Elasticsearch returns a document even if only one of the queried terms exactly matches a term in the Inverted Index.
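To see which words of the second query actually hit the index, the sketch below analyzes the query and keeps only the terms that exist among the indexed terms. It uses the same simplified regex analysis as an assumption, and `matching_terms` is a helper name of my own.

```python
import re

def analyze(text):
    """Simplified analysis: split on non-word characters and lowercase."""
    return [t.lower() for t in re.findall(r"\w+", text)]

INDEXED_TERMS = set(analyze("The quick brown fox jumps over the lazy dog"))

def matching_terms(query):
    """Analyzed query terms that exist in the index, in order, deduplicated."""
    found = []
    for term in analyze(query):
        if term in INDEXED_TERMS and term not in found:
            found.append(term)
    return found

query = "the LAZ dog tripped over th QUICK brown dog"
found = matching_terms(query)
print(found)                              # ['the', 'dog', 'over', 'quick', 'brown']
print("document matches:", bool(found))   # one matching term is enough
```

If every term must be present, the Match Query accepts `"operator": "and"`, in which case this query would no longer match because “laz”, “tripped”, and “th” are not in the index.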
If you look at the result, there is a _score field. The number of query terms that exactly match terms in the Inverted Index is one of the things that affects the score, but let’s save score calculation for another day.
Use the keyword field data type if:
You want an exact match query
You want to make Elasticsearch function like other databases
You want to use it for wildcard queries

Use the text field data type if:
You want to create an autocomplete
You want to create a search system

Understanding how the text and keyword field data types work is one of the things you will want to learn in Elasticsearch. The difference seems simple but matters a lot.
You will want to understand and choose the field data type that suits your use case. If you need both field data types, you can use the Multi Fields feature when creating the mapping.
Lastly, I hope this article helps you learn Elasticsearch and understand the differences between the text and keyword field data types. Thanks for reading!
Translated from: https://medium.com/better-programming/elasticsearch-text-vs-keyword-2ccb99ec72ae