I'm wondering what the best practices or experiences are for multilingual indexing and search in Elasticsearch. I've read through a number of resources, and as best I can distill it, the available options for indexing are:
1. a separate index per language;
2. a multi field type for the multilingual field;
3. a separate field for each possible language.
So I'm wondering what the side effects are of choosing one or another of these options (or some other that I've missed). I guess having more indices does not really slow down the cluster (unless there is a huge number of languages), so I'm not sure what I would gain from choosing 2 or 3, except perhaps easier maintenance.
Any help welcomed!
This is a somewhat old question, but the info might be helpful anyway.
The index/mapping structure mainly depends on your use case.
Do you need to use all the languages simultaneously, or is only one language used at a time?
General notes for options 2 and 3: Either option lets you score documents differently per language, because you can define scoring for each language field. You can add new fields to an existing mapping when you need more languages, but there is no way to remove or change existing fields. So to drop a language you will have to reindex all your content with the property for the removed language set to empty. Every new language also needs its own analyzer, and adding analyzers requires closing the index first and reopening it after the changes are made.
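As a rough sketch of the add-a-language workflow described above, the snippet below builds the sequence of REST requests as plain Python data instead of sending them. The index name `books`, the type name `book`, the field name `book_title_french` and the analyzer name `title_french` are all made up for illustration. The mapping endpoint shape and the `string` field type match older Elasticsearch versions and differ in newer ones, so treat the paths as assumptions to check against the docs for your version.

```python
import json


def add_language_requests(index, doc_type, field, analyzer_name, analyzer_type):
    """Return (method, path, body) tuples for adding one language field.

    Nothing is sent; the tuples only show the order of operations.
    """
    settings = {
        "analysis": {
            "analyzer": {
                # Hypothetical analyzer name, built on a stock language analyzer.
                analyzer_name: {"type": analyzer_type}
            }
        }
    }
    mapping = {
        doc_type: {
            "properties": {
                # New fields can be added to an existing mapping.
                field: {"type": "string", "analyzer": analyzer_name}
            }
        }
    }
    return [
        ("POST", f"/{index}/_close", None),        # close before changing analysis
        ("PUT", f"/{index}/_settings", settings),  # register the new analyzer
        ("POST", f"/{index}/_open", None),         # reopen the index
        ("PUT", f"/{index}/_mapping/{doc_type}", mapping),  # add the new field
    ]


requests_to_send = add_language_requests(
    "books", "book", "book_title_french", "title_french", "french"
)
for method, path, body in requests_to_send:
    print(method, path, json.dumps(body) if body is not None else "")
```

Keeping the close, settings, open and mapping steps in one ordered list makes it harder to forget reopening the index after the analyzer change.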
"book_title": { "type": "multi_field", "fields": { "english": { "type": "string" }, "german": { "type": "string" }, "italian": { "type": "string" }, } }
Here you can search in a specific language (e.g. "book_title.english") or in all languages (using "book_title"). Be careful not to update the field using the "book_title" name; use "book_title.[language]" instead. Updating via "book_title" fills all the subfields with identical data, which is probably not what you want.
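To make the single-language search concrete, here is a minimal sketch that builds a `match` query body against one subfield. The helper name `subfield_match_query` and the search text are invented for this example; only the "book_title.english" naming comes from the mapping above.

```python
import json


def subfield_match_query(base_field, language, text):
    """Build a match query against one language subfield of a multi_field."""
    return {"query": {"match": {f"{base_field}.{language}": text}}}


body = subfield_match_query("book_title", "english", "war and peace")
print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the body of a search request, for instance through whichever HTTP or Elasticsearch client you already use.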
Option 3: completely separate fields. You will need to list them all in the search query if you want to search across languages as in option 2. It is safer in terms of indexing, as you cannot overwrite all the languages by mistake.
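A sketch of what "put them all in the search query" could look like with separate fields: a `multi_match` query that lists one field per language, with an optional per-field boost using the `field^boost` syntax. The field names (`book_title_english` and so on) and the boost values are assumptions for illustration, not something prescribed by the mapping.

```python
import json


def all_languages_query(text, field_boosts):
    """Build a multi_match query over one field per language.

    field_boosts maps a field name to its boost; a boost of 1 is left implicit.
    """
    fields = [
        name if boost == 1 else f"{name}^{boost}"
        for name, boost in field_boosts.items()
    ]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


body = all_languages_query(
    "krieg und frieden",
    {"book_title_english": 1, "book_title_german": 2, "book_title_italian": 1},
)
print(json.dumps(body, indent=2))
```

Boosting one field is also one way to score documents differently per language, as mentioned in the general notes.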
Idea for option 4: use a type per language. This works if you have only one type of document, and it lets you have different fields per language. It is not useful if you have multiple document types.