Does ElasticSearch support Unicode / Chinese?

kerwin · Nov 11, 2013 · Viewed 8.1k times

I am doing text search with ElasticSearch, and I have a problem when querying with the term query. Basically, what I am doing below is:

  1. Add a document with a Chinese string (你好).
  2. Query with the text query; it returns the document.
  3. Query with the term query; it returns nothing.

Why does this happen, and how can I resolve it?

➜  curl -XPOST 'http://localhost:9200/test/test/' -d '{ "name" : "你好" }'

{
  "ok": true,
  "_index": "test",
  "_type": "test",
  "_id": "VdV8K26-QyiSCvDrUN00Nw",
  "_version": 1
}

➜  curl -XGET 'http://localhost:9200/test/test/_mapping?pretty=1'

{
  "test" : {
    "properties" : {
      "name" : {
        "type" : "string"
      }
    }
  }
}

➜  curl -XGET 'http://localhost:9200/test/test/_search?pretty=1'

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "VdV8K26-QyiSCvDrUN00Nw",
        "_score": 1.0,
        "_source": {
          "name": "你好"
        }
      }
    ]
  }
}

➜  curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{
  "query": {
    "text": {
      "name": "你好"
    }
  }
}'

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.8838835,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "VdV8K26-QyiSCvDrUN00Nw",
        "_score": 0.8838835,
        "_source": {
          "name": "你好"
        }
      }
    ]
  }
}

➜  curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{
  "query": {
    "term": {
      "name": "你好"
    }
  }
}'

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Answer

Torsten Engelbrecht · Nov 11, 2013

From the ElasticSearch docs about term query:

Matches documents that have fields that contain a term (not analyzed).

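The name field is analyzed by default, so the index holds the tokens the analyzer produced rather than the original string. A term query does not analyze its input; it looks for exactly the term you pass, so the full value is only found on fields that are not analyzed. This is not specific to Chinese: if you index another document with a multi-word, non-Chinese name such as "Hello World", a term query for the full name will not find it either. If you are now wondering why the following search query does return results: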

curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{"query" : {"term" : { "name" : "好" }}}'

That is because each token the analyzer produces is stored as a term of its own, and the term query matches those terms exactly. If you indexed a document with the name "你好吗", a term query for "好吗" or "你好" would still find nothing, but a term query for "你", "好" or "吗" would find it.
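To make the difference concrete, here is a toy simulation in Python. It is not Elasticsearch code. The helpers `analyze`, `build_index`, `term_query` and `match_query` are hypothetical names, and the tokenizer only assumes the character-by-character splitting described in this answer. The "match" lookup stands in for an analyzed query such as the text query, simplified to "any token matches".

```python
from collections import defaultdict


def analyze(text):
    """Rough stand-in for the analyzer on Chinese text: one token per character."""
    return [ch for ch in text if not ch.isspace()]


def build_index(docs):
    """Build a toy inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def term_query(index, term):
    """Look the query string up as-is, without analysis, like a term query."""
    return set(index.get(term, set()))


def match_query(index, text):
    """Analyze the query string first, then return docs containing any token."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits


docs = {1: "你好吗"}
index = build_index(docs)

print(term_query(index, "你好"))   # set(): the whole string is not a term in the index
print(term_query(index, "好"))     # {1}: a single character is a term
print(match_query(index, "你好"))  # {1}: the analyzed query finds the document
```

The index only ever contains "你", "好" and "吗", so an unanalyzed lookup for "你好" has nothing to match.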

For Chinese you may need to pay special attention to the analyzer used. For me the standard analyzer seems good enough, though: it tokenizes Chinese text character by character rather than on whitespace.
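If you need a term query to match the whole value "你好", the field would have to be indexed without analysis. The following is a minimal sketch of such a mapping body, assuming the string mapping options of the ElasticSearch releases from around the time of this answer (`"index": "not_analyzed"`; newer releases use a separate `keyword` type instead). It also assumes the index is recreated so the new mapping takes effect before documents are indexed. The script only builds and prints the JSON body; it does not talk to a server.

```python
import json

# Hypothetical mapping body for the "test" type from the question:
# index "name" as one unanalyzed term instead of per-character tokens.
# "not_analyzed" is the option from older releases (an assumption about
# the version in use).
mapping = {
    "test": {
        "properties": {
            "name": {"type": "string", "index": "not_analyzed"}
        }
    }
}

# ensure_ascii=False keeps any non-ASCII text readable in the output.
body = json.dumps(mapping, ensure_ascii=False, indent=2)
print(body)
```

With a mapping like this, the whole string is stored as a single term, so a term query for "你好" can match it, while queries for the individual characters no longer will.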