I am using tesseract ocr to extract text from an image. Preserving the structure of the document is very important to me. Currently tesseract does not preserve the structure, infact it changes the order of text. My input is the image below.
and the output I am getting is as follows:
Someto the left
Someto the left
Some in the middle
Some in the middle
Some with some tab
Some with some tab
Some with some space between them
Some with some space between them
Sometext here
Sometext here
this much
this much
How do I get the desired output as of the same structure in image?
i.e. as follows:
Some text here
Some text here
Some to the left
Some to the left
Some in the middle
Some in the middle
Some with some tab
Some with some tab
Some with some space between them this much
Some with some space between them this much
Newer versions of tesseract (3.04) have an option called preserve_interword_spaces
which should do what you want.
Note that the number of spaces tesseract detects between words may not always be the same between similar lines. So words that are left-aligned with a run of spaces preceding them (as in your example) may not be output this way -- the preserve_interword_spaces
option does not attempt to do anything fancy, it merely preserves the spaces extraction found. By default tesseract collapses runs of spaces into one.
Details on this option are here.