This article introduces several of the tokenizers/analyzers that es ships with, as well as a commonly used open-source tokenizer/analyzer.
Tokenizers
standard
The default tokenizer in es. It splits text conservatively along the word boundaries defined in Unicode Standard Annex #29; most punctuation and symbols are dropped, and characters such as Chinese are split one by one. The max_token_length parameter sets the maximum token length; any token longer than this is split again at that length. For example:
POST/GET localhost:9200/_analyze { "tokenizer" : { "type":"standard", "max_token_length": 4 }, "text" : "123 Brown-Foxes, jump!" }
This is tokenized into:
["123","Brow","n","Foxe","s","jump"]
nGram
A tokenization method that ignores semantics and tries to collect every possible index term from the raw characters, producing overlapping slices position by position. Its parameters:
min_gram | Minimum gram length after splitting, default 1
max_gram | Maximum gram length after splitting, default 2
token_chars | An array restricting tokenization to the specified character classes, with options such as digit, letter and symbol; once token_chars is set, characters outside the listed classes are ignored
POST/GET localhost:9200/_analyze { "tokenizer" : { "type":"nGram", "min_gram":2, "max_gram":3 }, "text" : "12 Brown Foxes," }
This is tokenized into:
["12","12 ","2 ","2 B"," B"," Br","Br","Bro","ro","row","ow","own","wn","wn ","n ","n F"," F"," Fo","Fo","Fox","ox","oxe","xe","xes","es","es,","s,"]
nGram tokenization is very thorough, but such an aggressive approach, used carelessly, makes the indexed docs balloon in size and drags down query performance.
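One way to keep the gram explosion in check is the token_chars parameter from the table above: when only letter and digit are listed, spaces and punctuation act as boundaries instead of being folded into grams. A sketch of the same call with that restriction (output omitted, but no gram should span a space or the trailing comma):
POST/GET localhost:9200/_analyze
{
  "tokenizer": {
    "type": "nGram",
    "min_gram": 2,
    "max_gram": 3,
    "token_chars": ["letter", "digit"]
  },
  "text": "12 Brown Foxes,"
}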
whitespace
The rule is simple: split on whitespace and apply no other rules:
POST/GET localhost:9200/_analyze { "tokenizer" : { "type":"whitespace" }, "text" : "123 Brown Foxes, jump!" }
This is tokenized into:
["123","Brown","Foxes,","jump!"]
uax_url_email
Built on top of standard, it additionally recognizes email addresses and URLs and keeps them intact as single tokens; it accepts the same max_token_length parameter:
POST/GET localhost:9200/_analyze { "tokenizer" : { "type":"uax_url_email" }, "text" : "www.baidu.com.cn .Brown Foxes, jump! bailong<test@163.com>" }
This is tokenized into:
["www.baidu.com.cn","Brown","Foxes,","jump","bailong","test@163.com"]
Open-source tokenizers: ik_smart and ik_max_word
The tokenizers above already satisfy most requirements, but language-aware segmentation is poorly supported: Chinese, for example, is either kept whole as a sentence or split character by character. For this there is the IK family of tokenizers (and the corresponding analyzers), ported over from Lucene; the setup instructions can be found on its official GitHub repository.
ik_smart ships with a dictionary of common words, giving it proper Chinese word segmentation with the coarsest and therefore most efficient split; custom dictionaries are also supported.
Let's look at an example first:
POST/GET localhost:9200/_analyze { "tokenizer":"ik_smart", "text" : "中国驻洛杉矶领事馆遭亚裔男子的枪击,嫌犯现已自首" }
This is tokenized into:
["中国","驻","洛杉矶,","领事馆","遭","亚裔","男子","的","枪击","嫌犯","现已","自首"]
However, this segmentation sometimes cannot cover every search: if we search for just "领事", that term simply will not be found.
For such more demanding searches, ik_max_word segments the text as exhaustively as possible:
POST/GET localhost:9200/_analyze { "tokenizer":"ik_max_word", "text" : "中国驻洛杉矶领事馆遭亚裔男子的枪击,嫌犯现已自首" }
This is tokenized into:
["中国","驻","洛杉矶,","领事馆","领事","馆","遭","亚裔","男子","的","枪击","嫌犯","现已","自首"]
Here 领事馆 is split up further, which accommodates a much broader range of searches.
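A common way to combine the two is to index with ik_max_word for maximum recall and analyze queries with ik_smart for precision. A sketch, assuming the IK plugin is installed; the index name news and the field content are invented for illustration:
PUT localhost:9200/news
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}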
Reindex
You may have noticed that although es cannot modify an existing field or the analyzer attached to it, new fields and new analyzers can be added. Once an analyzer is added, new or updated documents are naturally indexed with it, but what about the old data? es does nothing to it. In other words, a newly added analyzer has no effect on historical data unless we run a Reindex operation on it to rebuild the inverted index:
POST localhost:9200/_reindex { "source": { "index": "my-index-000001" }, "dest": { "index": "my-new-index-000001" } }
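The destination index has to exist with the new analyzer before the copy starts, so the full workflow is roughly: create my-new-index-000001 with the desired mapping, then run the _reindex call above. A sketch of the first step (the field content and the ik_max_word analyzer here are only illustrative):
PUT localhost:9200/my-new-index-000001
{
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "ik_max_word" }
    }
  }
}
Once the _reindex call finishes, queries can be pointed at my-new-index-000001 and the old index can be dropped or kept as a backup.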