Elasticsearch Analyzers
How to tokenize a piece of content:
GET _analyze
{
  "text": "你好,我是小明,很高兴认识你",
  "analyzer": "standard"
}
Normalization
Normalization applies a unified format to both the query text and the content stored in ES, handling things like letter case, singular/plural forms, and meaningless stop words, so that documents can still be matched at search time.
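As a rough illustration (not ES's actual implementation), normalization can be sketched in Python as lowercasing plus stop-word removal; the stop-word list here is a made-up sample, not ES's default:

```python
def normalize(tokens):
    """Sketch of normalization: lowercase tokens and drop meaningless stop words."""
    stop_words = {"the", "a", "an", "is"}  # assumed sample list, not ES's default
    return [t.lower() for t in tokens if t.lower() not in stop_words]

print(normalize(["The", "Quick", "Fox", "is", "Fast"]))  # ['quick', 'fox', 'fast']
```

With this kind of folding, a query for "fox" matches a document that stored "Fox".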
Character Filter
Character filters strip out characters that should not take part in matching. There are three types:
HTML Strip
Strips HTML tags.
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "a"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_filter"]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>你好<h1><a>小明</a></h1>"
}
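What html_strip does can be approximated outside ES. A minimal Python sketch (the regex is a simplification for illustration, not the filter's real implementation):

```python
import re

def html_strip(text, escaped_tags=()):
    # Drop every HTML tag except those whose name is in escaped_tags,
    # mimicking the escaped_tags option of the html_strip char filter.
    def repl(match):
        name = match.group(1).lower()
        return match.group(0) if name in escaped_tags else ""
    return re.sub(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>", repl, text)

print(html_strip("<p>你好<h1><a>小明</a></h1>", escaped_tags=("a",)))
# 你好<a>小明</a>
```

As with the index above, `<p>` and `<h1>` are removed while the escaped `<a>` tag survives.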
Mapping
A custom filter: replaces specified characters or strings according to user-defined mapping rules.
PUT my_index1
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_filter": {
          "type": "mapping",
          "mappings": [
            "你 => *",
            "滚 => *"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_filter"]
        }
      }
    }
  }
}
GET my_index1/_analyze
{
  "analyzer": "my_analyzer",
  "text": "你好,滚"
}
Pattern Replace
Regex replacement: replaces the matched portion of the text via a regular expression.
PUT my_index2
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_filter"]
        }
      }
    }
  }
}
GET my_index2/_analyze
{
  "analyzer": "my_analyzer",
  "text": "13812345678"
}
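The pattern_replace rule in my_index2 masks the middle four digits of an 11-digit phone number. The same regex works in Python, where `re.sub` uses `\1` instead of `$1` for backreferences (the phone number below is a made-up sample):

```python
import re

# Mask the middle 4 digits of an 11-digit phone number, as the
# pattern_replace char filter above does; the number is a made-up sample.
masked = re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", "13812345678")
print(masked)  # 138****5678
```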