ElasticSearch Analyzer and Tokenizer for Emails

LYu picture LYu · May 8, 2015 · Viewed 14.7k times · Source

I could not find a perfect solution either in Google or ES for the following situation, hope someone could help here.

Suppose there are five email addresses stored under field "email":

1. {"email": "[email protected]"}
2. {"email": "[email protected], [email protected]"}
3. {"email": "[email protected]"}
4. {"email": "[email protected]}
5. {"email": "[email protected]"}

I want to fulfill the following searching scenarios:

[Search -> Receive]

"[email protected]" -> 1,2

"[email protected]" -> 2,4

"[email protected]" -> 5

"john.doe" -> 1,2,3,4

"john" -> 1,2,3,4,5

"gmail.com" -> 1,2

"outlook.com" -> 2,3,4

The first three matchings is a MUST, and for the rest of them the more precise the better. Have already tried different combinations of index/search analyzers, tokenizers, and filters. Also tried to work on the condition for match queries, but did not find an ideal solution, any thought is welcome, and no limit to the mappings, analyzers, or which kind of query to use, thanks.

Answer

Andrei Stefan picture Andrei Stefan · May 8, 2015

Mapping:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "email": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)",
            "([^-@]+)"
          ]
        }
      },
      "analyzer": {
        "email": {
          "tokenizer": "uax_url_email",
          "filter": [
            "email",
            "lowercase",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "emails": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "email"
        }
      }
    }
  }
}

Test data:

POST /test/emails/_bulk
{"index":{"_id":"1"}}
{"email": "[email protected]"}
{"index":{"_id":"2"}}
{"email": "[email protected], [email protected]"}
{"index":{"_id":"3"}}
{"email": "[email protected]"}
{"index":{"_id":"4"}}
{"email": "[email protected]"}
{"index":{"_id":"5"}}
{"email": "[email protected]"}

Query to be used:

GET /test/emails/_search
{
  "query": {
    "term": {
      "email": "[email protected]"
    }
  }
}