How to extract data from image that contains tabular data?

Question 1

How to extract data from image that contains tabular data?

python opencv ocr tesseract python-tesseract

developer · Jan 14, 2019 · Viewed 10.5k times · Source

Answer

Answer

Using cv2 is good after cv2.imwrite(temp_filename, gray_img)

import PIL.Image  
Use config='-psm 6'
page_str = image_to_string(Image.open(temp_filename), lang="eng", config='-psm 6')

This will return good data from table images

Question 2

I am using pytesseract, pillow,cv2 to OCR an image and get the text present in the image. Since my input is a scanned PDF document, I first converted it into an image (JPEG) format and then tried extracting the text. I am only half way there. The input is a table and the titles are not being displayed, since the titles have a black background. I also tried getstructuringelement but unable to figure out a way. Here is what I have done until now-

import cv2
import os  
import numpy as np 
import pytesseract
#import pillow 

#Since scanned PDF can't be handled by pdf2image, convert the scanned PDF into a JPEG format using the below code- 
filename = path   
from pdf2image import convert_from_path 
pages = convert_from_path(filename, 500) for page in pages:
page.save("dest", 'JPEG')


imgname = "path" 
oriimg = cv2.imread(imgname,cv2.IMREAD_COLOR) 
cv2.imshow("original image", oriimg)
cv2.waitKey(0)


#img = cv2.resize(oriimg,None,fx=0.5,fy=0.5,interpolation=cv2.INTER_CUBIC) 
img = cv2.resize(oriimg,(700,1500),interpolation=cv2.INTER_AREA) 
#here length height  
cv2.imshow("lol", img) 
cv2.waitKey(0) 
cv2.imwrite("changed_dimensionsimgpath", img)


import PIL.Image  
image = cv2.imread(imgname,cv2.IMREAD_COLOR) 
grayedimg = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) grayedimg = 
cv2.threshold(grayedimg, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1] 
cv2.imwrite("H://newim.jpg", grayedimg)


pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract- 
OCR\tesseract.exe"


text = pytesseract.image_to_string(PIL.Image.open("path"))
print(text)

My input table looks like below. The regions which have black background are not being identified by OCR and not being extracted as text. Any help would be greatly appreciated.

Output of this code for the image sample-

Sun by Select .

F'I‘L‘Mlm":[ [Juir SHIIEF'. ”ﬁllﬁt Fadll'fi



Brand Type Fragranm Unit: Ithange Dollm 'LChanga Men
Eleanit' Sprayl Grange J.?IEBﬂI-Eﬂ' 11% '5H'1Elﬁ9ﬂﬂﬂ 35% I E
Eleanlt! kﬁmnsul' Grange IEEEESWI 39% I521LESM1MH 1113553 ‘ E
Dehuxe F‘mmr [emu 525.940 461% '51:EE?,GED,00 433.6% 5
Datum: Anus»! ﬁring?) 4,3341%} 29% 513573300119 215% E
Dem Spray ‘Drangr: £432,100 09% 515.223.:53000 154%

Min Blaster Aemgul: Dramge ”2114033111 59% :SHSiMMﬂ H94:

DiFlEIESIEf Sprawl Drama “NEW. 50% ‘5E1D1_E-BDM 141% I
Incredlme Spray Lem 1.513.410" 483% a HELENE] $11143 I E

t“ In

1'"

How to extract data from image that contains tabular data?

Answer

Related questions