Multilingual elasticsearch indexing best practice/experiences

ilijaluve · Mar 3, 2014 · Viewed 12.7k times

I'm wondering what the best practices or experiences are for multilingual indexing and search in Elasticsearch. I read through a number of resources, and as best I can distill it, the available options for indexing are:

  1. separate index per language;

  2. multi field type for multilingual field;

  3. separate field for each possible language.

So I'm wondering what the side effects are of choosing one or the other of these options (or some other that I've missed). I guess having more indices does not really slow down the cluster (unless there is a huge number of languages), so I'm not sure what I would gain from choosing 2 or 3, except perhaps easier maintenance.

Any help welcomed!

Answer

Shote · Jun 19, 2014

This is a somewhat old question, but the info might be helpful anyway. The index/mapping structure mainly depends on your use case.
Do you need to use all the languages simultaneously, or is only one language used at a time?

  • Option 1: a multilanguage website, for example, where users only see and search in the language they have currently chosen. In this case my experience is that index-per-language is a good solution, especially if you need to be able to add and remove languages easily. The data is split between the indices (a performance benefit, since each search only touches one language's index). Analyzers are easy to set up for each language, especially if their settings differ only by the language name. Personally, I'm currently using this option for one of my projects.
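
To illustrate option 1, here is a minimal sketch of how the per-language indices could be described so that only the analyzer name differs. It only builds the index names and create-index bodies as plain Python dicts. The "books" prefix, the "book" mapping type and the "book_title" field are made-up names for illustration, and the mapping uses the old "string" type to match the rest of this answer (newer Elasticsearch versions use "text" and no mapping types). The analyzer names "english", "german" and "italian" are built-in language analyzers.

```python
# Hypothetical names: the "books" prefix, the "book" type and the
# "book_title" field are illustrative only. The analyzer names are
# Elasticsearch built-in language analyzers.
LANGUAGE_ANALYZERS = {"en": "english", "de": "german", "it": "italian"}


def index_name(lang, prefix="books"):
    """Return the per-language index name, e.g. 'books_en'."""
    if lang not in LANGUAGE_ANALYZERS:
        raise ValueError("unsupported language: %s" % lang)
    return "%s_%s" % (prefix, lang)


def index_body(lang):
    """Build a create-index body; only the analyzer name depends on the language."""
    return {
        "mappings": {
            "book": {
                "properties": {
                    "book_title": {
                        "type": "string",
                        "analyzer": LANGUAGE_ANALYZERS[lang],
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Print what would be sent when creating each index.
    for lang in sorted(LANGUAGE_ANALYZERS):
        print(index_name(lang), index_body(lang))
```

Adding a language is then one new entry in the table plus one new index, and removing a language means deleting its index without touching the others.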

General notes for options 2 and 3: either option lets you score documents differently based on the language, since you can define scoring for each language field. You can add new fields to a mapping if you need to add more languages, but there is no way to remove or change existing fields. Hence, to remove a language you will have to reindex all your content and set the property for the removed language to empty. You will also need to add a new analyzer for every new language, which requires closing the index first and reopening it after the changes are made.
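
The close / change / reopen sequence for adding a language can be sketched roughly as below. The function only returns the ordered list of REST calls as (method, path, body) tuples and does not send anything. The index, type, field and analyzer names are placeholders, and the paths follow the Elasticsearch 1.x-era API that this answer was written against, so treat them as an assumption to check against your version.

```python
def add_language_requests(index, doc_type, field, analyzer_name, analyzer_def):
    """Return the ordered (method, path, body) calls for adding one language.

    All arguments are caller-supplied placeholders. The analyzer can only be
    added while the index is closed, so the settings update is wrapped in
    _close and _open; the new field is added to the mapping afterwards.
    """
    return [
        ("POST", "/%s/_close" % index, None),
        ("PUT", "/%s/_settings" % index,
         {"analysis": {"analyzer": {analyzer_name: analyzer_def}}}),
        ("POST", "/%s/_open" % index, None),
        ("PUT", "/%s/_mapping/%s" % (index, doc_type),
         {doc_type: {"properties": {
             field: {"type": "string", "analyzer": analyzer_name}}}}),
    ]


if __name__ == "__main__":
    # Illustrative custom analyzer for a hypothetical new Spanish field.
    spanish = {"type": "custom", "tokenizer": "standard", "filter": ["lowercase"]}
    for method, path, body in add_language_requests(
            "books", "book", "book_title_spanish", "my_spanish", spanish):
        print(method, path, body)
```

Note that the index is unavailable for search between the close and open calls, which is one more reason option 1 can be easier to operate when languages change often.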

  • Option 2: If you need to search in all languages at once the multi-field gives you the easiest access as you can address all its sub-fields at once:

    "book_title": {
        "type": "multi_field",
        "fields": {
            "english": {
                "type": "string"
            },
            "german": {
                "type": "string"
            },
            "italian": {
                "type": "string"
            }
        }
    }

Here you can search in a specific language (e.g. "book_title.english") or in all languages (using "book_title"). Be careful not to update the field using the "book_title" name; use "book_title.[language]" instead. Using "book_title" will update all the sub-fields with identical data (which is probably not what you want).

  • Option 3: completely separate fields. You will need to list them all in the search query if you want to search across all languages as in option 2. This is safer in terms of indexing, as you cannot overwrite all the languages by mistake.

  • Idea for option 4: use a type per language. This can be used if you have only one type of document, and it lets you have different fields per language. It is not useful if you have multiple document types.
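
For option 3, listing every language field in the query and scoring them differently could look something like the sketch below. It builds a multi_match query body as a plain dict, using the "field^boost" syntax to weight each language. The field names "book_title_english" and "book_title_german" and the boost values are assumptions made up for the example, not something from the original mapping.

```python
def multilingual_query(text, boosts):
    """Build a multi_match query body over one field per language.

    `boosts` maps a (hypothetical) field name to its boost, for example
    {"book_title_english": 2, "book_title_german": 1}. Fields are sorted
    by name so the generated body is deterministic.
    """
    fields = ["%s^%s" % (name, boost) for name, boost in sorted(boosts.items())]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


if __name__ == "__main__":
    body = multilingual_query(
        "faust", {"book_title_english": 2, "book_title_german": 1})
    print(body)
```

Every new language means one more entry in the field list, which is the extra query maintenance that option 3 trades for safer indexing.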