I am building an OCR project using a .NET wrapper for Tesseract. The samples that come with the wrapper don't show how to handle a PDF as input. Given a PDF as input, how do I produce a searchable PDF using C#?
How can I get the text from a PDF while preserving the layout of the original PDF?
This is a page from the PDF. I don't want only the text; I want the text to be placed in the same shapes as in the original PDF. Sorry for my poor English.
Just for documentation purposes, here is an example of OCR using tesseract and pdf2image to extract text from an image PDF.
import pdf2image
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

def pdf_to_img(pdf_file):
    # Render every page of the PDF to a PIL image.
    return pdf2image.convert_from_path(pdf_file)

def ocr_core(file):
    # Run Tesseract on one page image and return the recognized text.
    text = pytesseract.image_to_string(file)
    return text

def print_pages(pdf_file):
    # OCR each page in order and print its text.
    images = pdf_to_img(pdf_file)
    for pg, img in enumerate(images):
        print(ocr_core(img))

print_pages('sample.pdf')
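The example above returns plain text only. For the searchable-PDF part of the question, a minimal sketch in the same Python setup could use pytesseract's image_to_pdf_or_hocr, which returns a PDF (the page image with the recognized text attached) as bytes. The helper names tesseract_pdf_bytes and write_searchable_pages, the page_NNN.pdf naming, and the output folder are assumptions made for illustration. This is not the .NET wrapper's API, and it writes one PDF per page rather than a single merged file.

```python
from pathlib import Path


def tesseract_pdf_bytes(image):
    """OCR one page image and return a single-page searchable PDF as bytes."""
    # Imported lazily so write_searchable_pages can be tried with a stand-in
    # OCR function on a machine without Tesseract installed.
    import pytesseract
    return pytesseract.image_to_pdf_or_hocr(image, extension='pdf')


def write_searchable_pages(images, out_dir, ocr=tesseract_pdf_bytes):
    """Write one PDF per page image into out_dir and return the paths.

    `ocr` is any callable that takes a page image and returns PDF bytes;
    it defaults to the Tesseract-backed helper above (hypothetical name).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for page_number, image in enumerate(images, start=1):
        path = out_dir / f"page_{page_number:03d}.pdf"
        path.write_bytes(ocr(image))
        paths.append(path)
    return paths
```

With pdf2image installed, calling write_searchable_pages(pdf2image.convert_from_path('sample.pdf'), 'searchable_pages') should leave one searchable PDF per page in that folder. Merging them into a single file would need a separate PDF library.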