Elasticsearch 使用不同分词器导致搜索排名的问题

相信我们很多人做中文搜索的时候,在Github找了ik中分分词插件
然后建立mapping的时候,很自然的使用这样的参数(参照官方分词文档实例)

{
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            }
        }
}

假设我们在已经建立的testindex
那么我们来看一下全部数据(打火车和火车两条数据)

curl 127.0.0.1:9200/test/_search | jq
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 1,
        "_source": {
          "id": 1,
          "title": "打火车"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 1,
        "_source": {
          "id": 2,
          "title": "火车"
        }
      }
    ]
  }
}

这时候我们开始搜索(打火车)

curl 127.0.0.1:9200/test/_search?q=打火车 | jq
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.21110919,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 0.21110919,
        "_source": {
          "id": 2,
          "title": "火车"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 0.160443,
        "_source": {
          "id": 1,
          "title": "打火车"
        }
      }
    ]
  }
}

这时候我们惊奇的发现火车的分值是0.21110919居然比打火车的0.160443还高

中间经过一路排查, 首先感谢https://github.com/mobz/elasticsearch-head插件, 让排查数据的时候减少很多操作.
之后查看文档分词结果就得知了答案

curl 127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title | jq
{
  "_index": "test",
  "_type": "_doc",
  "_id": "Video_1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "title": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 2,
        "sum_ttf": 3
      },
      "terms": {
        "打火": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 2
            }
          ]
        },
        "火车": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 1,
              "end_offset": 3
            }
          ]
        }
      }
    }
  }
}

很惊奇的发现打火车被划分成打火和火车两个词, 所以这之中肯定有问题了(当然对于搜索引擎是没有问题的).
打火车文档中的火车得到了分值,但打火会使搜索得分下降, 导致火车文档的排名靠前
所以我决定把两个分词器设置成一样

{
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_smart",
                "search_analyzer": "ik_smart"
            }
        }
}

然后再看一下分词数据(这次分词的数据的确是我们预想的)

curl 127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title | jq
{
  "_index": "test",
  "_type": "_doc",
  "_id": "Video_1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "title": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 2,
        "sum_ttf": 3
      },
      "terms": {
        "打": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 1
            }
          ]
        },
        "火车": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 1,
              "end_offset": 3
            }
          ]
        }
      }
    }
  }
}

这时我们再搜索一次数据排名, 看到得分值排名的确是我们想要的了.

curl  127.0.0.1:9200/test/_search?q=打火车 | jq
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.77041256,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 0.77041256,
        "_source": {
          "id": 1,
          "title": "打火车"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 0.21110919,
        "_source": {
          "id": 2,
          "title": "火车"
        }
      }
    ]
  }
}