PDF to Text extractor in nodejs without OS dependencies

bartium · Jun 9, 2015 · Viewed 11k times

Is there a way to extract text from PDFs in Node.js without any OS dependencies (like pdf2text, or xpdf on Windows)? I wasn't able to find any 'native' PDF packages for Node.js. The ones I found are always a wrapper or utility on top of an existing OS command. Thanks

Answer

Eugene · Jun 15, 2015

Have you checked pdf2json? It is built on top of PDF.js. It does not provide the text output as a single string, but I believe you can reconstruct the final text from the generated JSON output. From its documentation:

'Texts': an array of text blocks with position, actual text and styling information:

- 'x' and 'y': relative coordinates for positioning
- 'clr': a color index in color dictionary, same 'clr' field as in 'Fill' object. If a color can be found in color dictionary, 'oc' field will be added to the field as 'original color' value.
- 'A': text alignment, including:
    - left
    - center
    - right
- 'R': an array of text run, each text run object has two main fields:
    - 'T': actual text
    - 'S': style index from style dictionary. More info about 'Style Dictionary' can be found at 'Dictionary Reference' section
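To make the "reconstruct the final text" idea concrete, here is a minimal sketch that relies only on the 'Texts' structure quoted above. The helper names (`decodeRun`, `textsToString`), the `yTolerance` value, and the choice to join blocks with a single space are my own assumptions, not part of pdf2json. I am also assuming 'T' may be URI-encoded, so the code falls back to the raw value when decoding fails.

```javascript
// Hypothetical helpers, not part of pdf2json. They only assume that each
// entry in 'Texts' has 'x', 'y' and an 'R' array of runs with a 'T' field.

function decodeRun(t) {
  // Assumption: 'T' may be URI-encoded. If it is not valid encoded text,
  // decodeURIComponent throws, so we return the raw value instead.
  try {
    return decodeURIComponent(t);
  } catch (err) {
    return t;
  }
}

function textsToString(texts, yTolerance = 0.1) {
  // Sort top-to-bottom first, then left-to-right, without mutating the input.
  const sorted = [...texts].sort((a, b) => a.y - b.y || a.x - b.x);

  // Group blocks whose 'y' is within yTolerance of the line's first block.
  const lines = [];
  for (const block of sorted) {
    const text = (block.R || []).map((run) => decodeRun(run.T)).join('');
    const last = lines[lines.length - 1];
    if (last && Math.abs(block.y - last.y) <= yTolerance) {
      last.parts.push({ x: block.x, text });
    } else {
      lines.push({ y: block.y, parts: [{ x: block.x, text }] });
    }
  }

  // Within each line, order by 'x' and join with a space (a simplification).
  return lines
    .map((line) =>
      line.parts
        .sort((a, b) => a.x - b.x)
        .map((part) => part.text)
        .join(' ')
    )
    .join('\n');
}

// Made-up sample data in the shape described above.
const sample = [
  { x: 10, y: 2, R: [{ T: 'world' }] },
  { x: 1, y: 2, R: [{ T: 'Hello' }] },
  { x: 1, y: 5, R: [{ T: 'second%20line' }] },
];

console.log(textsToString(sample));
```

You would call `textsToString` once per page on that page's 'Texts' array. Real PDFs will probably need a tuned tolerance and smarter spacing between blocks, so treat this as a starting point only.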