Setup for today's exercise
First, delete the experimental index left over from last time:
curl -XDELETE http://127.0.0.1:9200/synctest/article
output:
{"acknowledged":true}
Create the new mapping:
curl -XPUT 'http://127.0.0.1:9200/servcie/_mapping/massage' -d '
{
"massage":{
"properties":{
"location":{
"type":"geo_point"
},
"name":{
"type":"string"
},
"age":{
"type":"integer"
},
"address":{
"type":"string"
},
"price":{
"type":"double",
"index":"not_analyzed"
},
"is_open":{
"type":"boolean"
}
}
}
}'
View the newly created mapping (note that the "index":"not_analyzed" setting on price does not come back; for numeric fields that value is already the default, so the response omits it):
curl -XGET http://127.0.0.1:9200/servcie/massage/_mapping?pretty
{
"servcie" : {
"mappings" : {
"massage" : {
"properties" : {
"address" : {
"type" : "string"
},
"age" : {
"type" : "integer"
},
"is_open" : {
"type" : "boolean"
},
"location" : {
"type" : "geo_point"
},
"name" : {
"type" : "string"
},
"price" : {
"type" : "double"
}
}
}
}
}
}
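As a quick sanity check, we can index one sample document against the new mapping (all field values below are made up for illustration):
curl -XPUT 'http://127.0.0.1:9200/servcie/massage/1' -d '
{
"name":"test shop",
"age":3,
"address":"Hangzhou, Zhejiang",
"price":99.5,
"is_open":true,
"location":{"lat":30.27,"lon":120.16}
}'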
Now on to our tokenization test:
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"text":"波多菠萝蜜"}'
{
"tokens" : [ {
"token" : "波",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
}, {
"token" : "多",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
}, {
"token" : "菠",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
}, {
"token" : "萝",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
}, {
"token" : "蜜",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
} ]
}
Notice that the default standard analyzer has split the Chinese text into individual ideographic characters. An analyzer is made up of a single tokenizer plus zero or more token filters; the next request runs the same default analysis over plain English text:
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"text":"abc dsf,sdsf"}'
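The _analyze API can also assemble a one-off analyzer from these parts, which makes the composition explicit. A sketch for this 2.x-era cluster, where the body field is named filters (newer releases renamed it to filter):
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '
{
"tokenizer":"standard",
"filters":["lowercase"],
"text":"abc DSF,sdsf"
}'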
Chinese search
To search Chinese text, you also need a Chinese analyzer; the most widely used one is probably the IK analyzer.
Install the IK plugin
./bin/plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.9.3/elasticsearch-analysis-ik-1.9.3.zip
After restarting, list the plugins to confirm it loaded:
curl -XGET http://localhost:9200/_cat/plugins
Marrow analysis-ik 1.9.3 j
Analyzing with the ik analyzer:
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik","text":"波多菠萝蜜"}'
{
"tokens" : [ {
"token" : "波",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_WORD",
"position" : 0
}, {
"token" : "多",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
}, {
"token" : "菠萝蜜",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "菠萝",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 3
}, {
"token" : "菠",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 4
}, {
"token" : "萝",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 5
}, {
"token" : "蜜",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 6
} ]
}
You can see that 菠萝 (pineapple) and 菠萝蜜 (jackfruit) are now produced as whole tokens.
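The ik analyzer used above is the fine-grained variant and emits overlapping tokens. The plugin also registers a coarser-grained ik_smart analyzer that keeps only one split per span; if you want fewer, non-overlapping tokens, try (output omitted):
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik_smart","text":"波多菠萝蜜"}'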
As language evolves and new business jargon appears, some words are simply not in IK's dictionary, so even a match_phrase query can fail to retrieve matching documents. What can we do about that?
For example, suppose we want to be able to retrieve the slang word 吊炸天 (roughly "awesome"), which IK 1.9.3 has not collected:
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik","text":"吊炸天天不容"}'
{
"tokens" : [ {
"token" : "吊",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_WORD",
"position" : 0
}, {
"token" : "炸",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
}, {
"token" : "天天",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "不容",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 3
} ]
}
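This shows why the phrase can be unfindable: in this context the characters were indexed as 吊 / 炸 / 天天 / 不容, so the positions no longer line up with the query tokens 吊 / 炸 / 天, and a match_phrase query comes up empty. A hypothetical example, assuming an ik-analyzed string field named content:
curl -XPOST 'http://127.0.0.1:9200/servcie/massage/_search?pretty' -d '
{
"query":{
"match_phrase":{"content":"吊炸天"}
}
}'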
If we really do need those matches, this is the point where we have to extend IK's dictionary.
Edit the mydict.dic file under analysis-ik/config/ik/custom; this file exists precisely for our custom vocabulary. Append the new word at the end, save, and restart Elasticsearch.
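For example, from the plugin's config layout described above (adjust the path for your install, one word per line, and keep the file UTF-8 encoded):
echo "吊炸天" >> analysis-ik/config/ik/custom/mydict.dic
# then restart Elasticsearch so IK reloads its dictionaries
After the restart, re-run the analysis: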
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik","text":"吊炸天天不容"}'
{
"tokens" : [ {
"token" : "吊炸天",
"start_offset" : 0,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 0
}, {
"token" : "吊",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_WORD",
"position" : 1
}, {
"token" : "炸",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 2
}, {
"token" : "天天",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 3
}, {
"token" : "不容",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 4
} ]
}
We can see that 吊炸天 is now emitted as a single token.
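To put IK to work beyond the _analyze API, declare it on the string fields that should be segmented with it. A minimal sketch against a fresh index (the names cn_test, article, and content are made up for illustration):
curl -XPUT 'http://127.0.0.1:9200/cn_test' -d '
{
"mappings":{
"article":{
"properties":{
"content":{
"type":"string",
"analyzer":"ik"
}
}
}
}
}'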