How to extract bold text from a PDF using PDFBox?

Lipu · Nov 4, 2013

I am using Apache PDFBox to extract text. I can extract the text from a PDF, but I don't know how to tell whether a word is bold. (A code suggestion would be appreciated.) Here is the code for extracting plain text from a PDF, which works fine:

PDDocument document = PDDocument.load("/home/lipu/workspace/MRCPTester/test.pdf");
if (document.isEncrypted()) {
    try {
        document.decrypt("");
    } catch (InvalidPasswordException e) {
        System.err.println("Error: Document is encrypted with a password.");
        System.exit(1);
    }
}

// PDFTextStripperByArea stripper = new PDFTextStripperByArea();
// stripper.setSortByPosition(true);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(2);
stripper.setSortByPosition(true);
String st = stripper.getText(document);
// Release the document's resources once the text has been extracted.
document.close();

Answer

mkl · Nov 4, 2013

The result of PDFTextStripper is plain text, so once the text has been extracted the formatting information is gone and it is too late to filter. You can, however, override certain of its methods and only let through text that is formatted the way you want.

In case of the PDFTextStripper you have to override

protected void processTextPosition( TextPosition text )

In your override you check whether the text in question fulfills your requirements (TextPosition contains much information on the text in question, not only the text itself), and if it does, forward the TextPosition text to the super implementation.
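The filter-and-forward pattern described above can be sketched without PDFBox on the classpath. Everything in the block below is a hypothetical stand-in: `GlyphSketch` plays the role of TextPosition, `CollectingStripperSketch` the role of PDFTextStripper, and the predicate is whatever boldness check you settle on. With the real library, the subclass would extend PDFTextStripper and override the method with the signature shown above.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for TextPosition: only the fields this sketch needs.
final class GlyphSketch {
    final String character;
    final String fontName;

    GlyphSketch(String character, String fontName) {
        this.character = character;
        this.fontName = fontName;
    }
}

// Hypothetical stand-in for PDFTextStripper: collects every glyph it is handed.
class CollectingStripperSketch {
    private final StringBuilder out = new StringBuilder();

    protected void processTextPosition(GlyphSketch text) {
        out.append(text.character);
    }

    String getText() {
        return out.toString();
    }
}

// The override: forward a glyph to the super implementation only if it passes the check.
final class FilteringStripperSketch extends CollectingStripperSketch {
    private final Predicate<GlyphSketch> keep;

    FilteringStripperSketch(Predicate<GlyphSketch> keep) {
        this.keep = keep;
    }

    @Override
    protected void processTextPosition(GlyphSketch text) {
        if (keep.test(text)) {
            super.processTextPosition(text);
        }
        // Glyphs that fail the check are dropped and never reach the output.
    }
}
```

Because rejected glyphs are never forwarded, the text collected at the end contains only the characters that passed the check.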

The main problem is, though, to recognize which text is bold.

One criterion for boldness may be the word bold in the font name, e.g. Courier-BoldOblique. You access the font of the text using text.getFont() and the PostScript name of the font using the font's getBaseFont() method:

String postscriptName = text.getFont().getBaseFont();
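A minimal sketch of the name check follows, written as plain Java so it does not depend on PDFBox. The class and method names are hypothetical. It assumes the name may be null (some fonts carry no base font entry) and compares case-insensitively, so a subset-prefixed name such as ABCDEF+Arial-BoldMT also matches.

```java
import java.util.Locale;

final class BoldNameCheck {
    private BoldNameCheck() {}

    // True if the PostScript font name contains "bold" in any letter case.
    // A null name is treated as "not bold" rather than as an error.
    static boolean nameSuggestsBold(String postscriptName) {
        if (postscriptName == null) {
            return false;
        }
        return postscriptName.toLowerCase(Locale.ROOT).contains("bold");
    }
}
```

Inside the override you would pass the value obtained above, e.g. `BoldNameCheck.nameSuggestsBold(postscriptName)`. Note that names such as Semibold or DemiBold also match a plain substring test, which may or may not be what you want.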

Another criterion may come from the font descriptor. You get the font descriptor of a font using the getFontDescriptor method, and a font descriptor has an optional font weight value:

float fontWeight = text.getFont().getFontDescriptor().getFontWeight();

The value is defined as:

(Optional; PDF 1.5; should be used for Type 3 fonts in Tagged PDF documents) The weight (thickness) component of the fully-qualified font name or font specifier. The possible values shall be 100, 200, 300, 400, 500, 600, 700, 800, or 900, where each number indicates a weight that is at least as dark as its predecessor. A value of 400 shall indicate a normal weight; 700 shall indicate bold.

The specific interpretation of these values varies from font to font.

EXAMPLE 300 in one font may appear most similar to 500 in another.

(Table 122, Section 9.8.1, ISO 32000-1)
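Based on the definition quoted above, a weight check could look like the following sketch. The class and method names are hypothetical, and treating everything from 700 upwards as bold is a choice, not something the specification mandates. Since the entry is optional, the sketch assumes that a missing weight reaches you as 0 or some other value below 700, in which case the check simply reports "not bold".

```java
final class BoldWeightCheck {
    // 700 is the value the specification designates as bold.
    static final float BOLD_WEIGHT = 700f;

    private BoldWeightCheck() {}

    // True if the font descriptor's weight is at least the bold weight.
    // Missing or unset weights (assumed to arrive as 0) are not considered bold.
    static boolean weightSuggestsBold(float fontWeight) {
        return fontWeight >= BOLD_WEIGHT;
    }
}
```

Because the interpretation of the values varies from font to font, this is best combined with other criteria rather than used alone. It may also be worth checking that the font descriptor itself is not null before asking it for the weight.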

There may be additional hints of boldness to check, e.g. a large line width

double lineWidth = getGraphicsState().getLineWidth();

when the rendering mode draws an outline, too:

int renderingMode = getGraphicsState().getTextState().getRenderingMode();
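The two values can be combined as in the sketch below. It reads "draws an outline, too" as the text rendering modes that both fill and stroke the glyphs, which in the PDF specification are mode 2 (fill, then stroke) and mode 6 (fill, then stroke, and add to the clipping path). The class name, method names, and the `minLineWidth` threshold are hypothetical; a suitable threshold depends on the font size and on your documents.

```java
final class OutlineBoldCheck {
    private OutlineBoldCheck() {}

    // True for the text rendering modes that fill the glyph and also stroke its outline.
    static boolean fillsAndStrokes(int renderingMode) {
        return renderingMode == 2 || renderingMode == 6;
    }

    // True if the glyph outline is stroked on top of the fill with a line
    // at least minLineWidth wide, which makes the text look heavier.
    static boolean looksEmboldened(int renderingMode, double lineWidth, double minLineWidth) {
        return fillsAndStrokes(renderingMode) && lineWidth >= minLineWidth;
    }
}
```

Plain filled text (mode 0) never passes this check regardless of line width, because the line width only affects glyphs whose outline is actually stroked.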

You may have to test against the documents you have at hand to find out which criteria suffice.