Editing PDF with XPDF (or with something else)

Marek Szanyi · Jan 19, 2010

I would like to ask whether it is possible to edit PDF files using the xpdf library and, if so, how. I guess it is possible, but I could not find any tutorial or documentation for xpdf, so I really have no idea :( . I'm also open to using another library if one supports PDF editing. My only requirement is that it be a C++ library, or at least a C one, and that it be cross-platform (Windows and Linux).

I only need basic editing of a PDF file, for example:

"this is a text in a pdf document" would be changed to "this is a text in pdf" with a different text color as well.

Thanks for all your replies!

Answer

plinth · Jan 20, 2010

Just so you understand the scope of what you're getting into, "basic editing" of PDF content is nearly always non-trivial.

Page content in PDF is represented by short RPN programs that paint on the page. It's a small language similar to PostScript in semantics, but without looping structures or function definitions (so there is no halting problem). In a sane world, your text on the page is going to be represented by something like this:

BT /F1 12 Tf 72 720 Td (this is a text in a pdf document) Tj ET

which, when translated into something more familiar, is this:

BeginText();
SetFont(F1, 12.0);  // Font 1, 12.0 pt
TextMoveTo(72, 720);
ShowText("this is a text in a pdf document");
EndText();
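As a rough illustration of the postfix ("operands first, operator last") structure, here is a minimal Python sketch that groups each operator with the operands preceding it. It is not how xpdf or any other library parses PDF; the function name `parse_content_stream` is hypothetical, and the tokenizer assumes a tiny subset of the syntax (no escaped or nested parentheses, no arrays, hex strings, or dictionaries).

```python
import re

# Simplified token pattern: literal string, name, number, or alphabetic operator.
# Assumption: strings contain no escapes or nested parentheses.
TOKEN = re.compile(r"\([^()\\]*\)|/[^\s()/\[\]<>]+|[-+]?\d*\.?\d+|[A-Za-z*]+")

def parse_content_stream(stream):
    """Group operands with the operator that follows them (postfix order)."""
    operations = []
    operands = []
    for token in TOKEN.findall(stream):
        if token.startswith("("):
            operands.append(token[1:-1])          # string operand, parentheses removed
        elif token.startswith("/"):
            operands.append(token)                # name operand, e.g. /F1
        elif token[0].isdigit() or token[0] in "+-.":
            operands.append(float(token))         # numeric operand
        else:
            operations.append((token, operands))  # operator takes the pending operands
            operands = []
    return operations

ops = parse_content_stream(
    "BT /F1 12 Tf 72 720 Td (this is a text in a pdf document) Tj ET"
)
for operator, operands in ops:
    print(operator, operands)
```

Each printed pair corresponds to one of the function calls above: `Tf` with the font name and size, `Td` with the two coordinates, and `Tj` with the string.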

So in this case, you have to transform this into something like this:

BeginText();
SetFont(F1, 12.0);  // Font 1, 12.0 pt
TextMoveTo(72, 720);
ShowText("this is a ");
SetFont(F2, 12);
ShowText("text");
SetFont(F1, 12);
ShowText(" in a pdf document");
EndText();

which would become:

BT /F1 12 Tf 72 720 Td (this is a ) Tj /F2 12 Tf (text) Tj /F1 12 Tf
( in a pdf document) Tj ET

in the equivalent PDF. The problem is many-fold:

  1. You have to extract out the page and all its resources (non-trivial)
  2. You have to generate a new page, inserting new resources (you're adding a new font), embedding the font if allowable
  3. Alter the content stream of the page to include your changed content.
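For the friendly single-string case only, step 3 can be sketched as a plain string rewrite. This is a minimal Python sketch under loud assumptions: the word appears verbatim inside one `(...) Tj` string, the stream is uncompressed text, and the new font name (here `F2`) has already been added to the page's resources in step 2. The helper `restyle_word` is hypothetical and not part of any library.

```python
def restyle_word(stream, word, new_font, old_font, size):
    """Split one literal string so that `word` is shown in `new_font`.

    Handles only the simplest layout: the first "(...) Tj" in the stream
    contains the word verbatim. Returns the stream unchanged otherwise.
    """
    start = stream.find("(")
    end = stream.find(") Tj", start)
    if start < 0 or end < 0:
        return stream
    text = stream[start + 1:end]
    before, found, after = text.partition(word)
    if not found:
        return stream
    # Show the leading text, switch fonts for the word, then switch back.
    replacement = (
        f"({before}) Tj /{new_font} {size} Tf "
        f"({word}) Tj /{old_font} {size} Tf "
        f"({after}) Tj"
    )
    return stream[:start] + replacement + stream[end + len(") Tj"):]

original = "BT /F1 12 Tf 72 720 Td (this is a text in a pdf document) Tj ET"
edited = restyle_word(original, "text", "F2", "F1", 12)
print(edited)
```

The result matches the rewritten stream shown earlier, apart from the line break. Anything that deviates from this one layout falls straight through the `return stream` branches.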

And 3 is where you're going to get hung up, because there are an infinite number of ways to generate a page that has the content you describe and even with a decent library, you're going to have a hard time getting maybe 70% of them. Let me briefly describe why this is as bad as it sounds. There are PDF generation programs (I'm looking at you, troff) that lay all the plain text on a page first, then lay all the italic text, then all the bold text. I swear, I'm not making this up. Some programs want to lay text down very precisely, so if you're lucky, they'll use the TJ operator, which lays out text with specific kerning. If you're not lucky (which is most of the time), they'll instead lay out the text with a set of moves before every single glyph on the page. And what if your text is laid out on a curve or in an unusual orientation (maps, ads)? What about the cases where someone subtly changes the font size for a greater distinction between upper and lower case or simulates small caps?
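To make that concrete, here is a small Python sketch with three made-up encodings of the same visible word. The streams and the kerning and move values are invented for illustration, and the helper `shown_text` is hypothetical. It assumes strings carry plain ASCII with no escapes, which real fonts and encodings do not guarantee.

```python
import re

# Three hypothetical ways a producer might paint the word "text".
PLAIN = "BT /F1 12 Tf 72 720 Td (text) Tj ET"
KERNED = "BT /F1 12 Tf 72 720 Td [(te) -20 (xt)] TJ ET"
PER_GLYPH = ("BT /F1 12 Tf 72 720 Td (t) Tj 6 0 Td (e) Tj "
             "6 0 Td (x) Tj 6 0 Td (t) Tj ET")

def shown_text(stream):
    """Concatenate every literal string in stream order.

    Ignores positioning entirely, so it is only right when the strings
    happen to appear in reading order.
    """
    return "".join(re.findall(r"\(([^()\\]*)\)", stream))

for stream in (PLAIN, KERNED, PER_GLYPH):
    # First column: does a naive search for "(text)" find the word?
    print("(text)" in stream, shown_text(stream))
```

Only the first stream contains the literal `(text)`. The other two paint the same word but have to be reassembled from fragments, and even this reassembly breaks as soon as the producer emits text out of reading order.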

This is why, when I wrote the find text tool for Acrobat 1.0, it took me two months of sweat to handle as many of the edge cases as I could. This is not editing text - it's just trying to find a single word or phrase.

I'm not going to recommend a library for you - sorry - I gave xpdf a brief look over and it's not clear whether it has PDF generation capabilities or is simply a consumer of PDF. PdfLib, which is a commercial product, appears to be able to generate PDF, although it's not clear if it can consume it, but you could certainly get both sides by gluing them together.

If it were me, I would use tools that I've developed and I'd still be a little shy of this task. My library is being used by Atalasoft, the company I work for, to generate PDFs from whole cloth and to do editing within a very limited domain (annotations, document metadata). The hardest part is that we do our very best to hide the complexity of PDF from our customers. In general, our customers want us to understand the spec so they don't have to, and to make the rest easy - but tasks like this (redaction is another one) are really hard to do without understanding the depth of the PDF specification. If you start entering the library world of PDF manipulation, you should start by reading the spec, especially chapter 8 (Graphics) and chapter 9 (Text), and you'll get a better understanding of what you're going to have to do with the library.