Highly inconsistent OCR result for tesseract

codefreaK picture codefreaK · Sep 13, 2017 · Viewed 12.2k times · Source

enter image description here

This is the original screenshot and I cropped the image into 4 parts and cleared the background of the image to the extent that I can possibly do but tesseract only detects the last column here and ignores the rest.

enter image description here

The output from the tesseract is shown as it is there are blank spaces which I remove while processing result

  Femme—Fatale.



  DaRkLoRdEIa
  aChineseN1gg4

  Noob_Diablo_

enter image description here

The output from the tesseract is shown as it is there are blank spaces which I remove while processing result

Kicked.

NosNoel
ChikiZD
Death_Eag|e_42

Chai—.

enter image description here

3579 10 1 7 148

2962 3 O 7 101

2214 2 2 7 99

2205 1 3 6 78

enter image description here

8212

7198

6307

5640

4884

15

40

40

6O

80

80

Am just dumping the output of

result = `pytesseract.image_to_string(Image.open("D:/newapproach/B&W"+str(i)+".jpg"),lang="New_Language")`

But I do not know how to proceed from here to get a consistent result.Is there anyway so that I can force the tesseract to recognize the text area and make it scan that.Because in trainer (SunnyPage), tesseract on default recognition scan it fails to recognize some areas but once I select the manually everything is detected and translated to text correctly

Code

Answer

Manoj picture Manoj · Sep 26, 2017

Tried with the command line which gives us option to decide which psm value to be used.

Can you try with this:

pytesseract.image_to_string(image, config='-psm 6')

Tried with the image provided by you and below is the result:

Extracted Text Out of Image

The only problem I am facing is that my tesseract dictionary is interpreting "1" provided in your image to ""I" .

Below is the list of psm options available:

pagesegmode values are: 0 = Orientation and script detection (OSD) only.

1 = Automatic page segmentation with OSD.

2 = Automatic page segmentation, but no OSD, or OCR

3 = Fully automatic page segmentation, but no OSD. (Default)

4 = Assume a single column of text of variable sizes.

5 = Assume a single uniform block of vertically aligned text.

6 = Assume a single uniform block of text.

7 = Treat the image as a single text line.

8 = Treat the image as a single word.

9 = Treat the image as a single word in a circle.

10 = Treat the image as a single character.