I can fetch the field names from most PDF files using PDFBox, but not from an income tax form. Is something restricted in that form?
Although the form contains multiple fields, only one field is shown.
This is the output:
topmostSubform[0].
My code:
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
List fields = acroForm.getFields();
@SuppressWarnings("rawtypes")
java.util.Iterator fieldsIter = fields.iterator();
System.out.println(fields.size());
while (fieldsIter.hasNext())
{
    PDField field = (PDField) fieldsIter.next();
    System.out.println(field.getFullyQualifiedName());
    System.out.println(field.getPartialName());
}
It is called from:
public static void main(String[] args) throws IOException {
    PDDocument pdDoc = null;
    try {
        pdDoc = PDDocument.load("income.pdf");
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    Ggdfgdgdgf feilds = new Ggdfgdgdgf();
    feilds.printFields(pdDoc);
}
The PDF in question is a hybrid AcroForm/XFA form. This means that it contains the form definition both in AcroForm and in XFA format.
PDFBox primarily supports AcroForm (which is the PDF form technology presented in the PDF specification), but as both formats are present, PDFBox can at least inspect the AcroForm form definition.
Your code overlooks that PDAcroForm.getFields() does not return all field definitions but merely the definitions of the root fields, cf. the JavaDoc comments:
/**
* This will return all of the documents root fields.
*
* A field might have children that are fields (non-terminal field) or does not
* have children which are fields (terminal fields).
*
* The fields within an AcroForm are organized in a tree structure. The documents root fields
* might either be terminal fields, non-terminal fields or a mixture of both. Non-terminal fields
* mark branches which contents can be retrieved using {@link PDNonTerminalField#getChildren()}.
*
* @return A list of the documents root fields.
*
*/
public List<PDField> getFields()
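To illustrate why only one name shows up, here is a minimal, library-free sketch of such a tree. The Node class and the child names (Page1[0], f1_1[0], f1_2[0]) are hypothetical stand-ins made up for illustration, not PDFBox classes or the actual fields of your form. It only shows that the list of root fields can have a single entry while terminal fields sit further down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FieldTreeSketch {

    /** Hypothetical stand-in for a form field; this is NOT a PDFBox class. */
    static final class Node {
        final String partialName;
        final List<Node> children = new ArrayList<>();

        Node(String partialName, Node... kids) {
            this.partialName = partialName;
            this.children.addAll(Arrays.asList(kids));
        }

        /** A terminal field has no child fields. */
        boolean isTerminal() {
            return children.isEmpty();
        }
    }

    /** Builds a tiny tree with one root field; the child names are invented. */
    static List<Node> sampleRoots() {
        Node page = new Node("Page1[0]", new Node("f1_1[0]"), new Node("f1_2[0]"));
        Node root = new Node("topmostSubform[0]", page);
        return new ArrayList<>(Arrays.asList(root));
    }

    /** Counts the terminal fields at or below the given node. */
    static int countTerminals(Node node) {
        if (node.isTerminal()) {
            return 1;
        }
        int count = 0;
        for (Node child : node.children) {
            count += countTerminals(child);
        }
        return count;
    }

    public static void main(String[] args) {
        List<Node> roots = sampleRoots();
        System.out.println("root fields: " + roots.size());
        int total = 0;
        for (Node root : roots) {
            total += countTerminals(root);
        }
        System.out.println("terminal fields: " + total);
    }
}
```

In this toy tree the root list has one entry while two terminal fields exist below it, which mirrors the single topmostSubform[0] line in your output.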
If you want to access all fields, you have to walk the form field tree, e.g. like this:
public void test() throws IOException
{
    // Open the stream and the document together so both are closed afterwards.
    try (InputStream resource = getClass().getResourceAsStream("f2290.pdf");
         PDDocument pdfDocument = PDDocument.load(resource))
    {
        PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
        PDAcroForm acroForm = docCatalog.getAcroForm();
        List<PDField> fields = acroForm.getFields();
        for (PDField field : fields)
        {
            list(field);
        }
    }
}

// Prints the names of this field, then descends into its children, if any.
void list(PDField field)
{
    System.out.println(field.getFullyQualifiedName());
    System.out.println(field.getPartialName());
    if (field instanceof PDNonTerminalField)
    {
        PDNonTerminalField nonTerminalField = (PDNonTerminalField) field;
        for (PDField child : nonTerminalField.getChildren())
        {
            list(child);
        }
    }
}
This prints a long list of fields for your document.
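If you only care about the fields that can actually hold values, you can restrict the walk to terminal fields. The following is a library-free sketch of that idea, using an explicit stack instead of recursion. The Node class and the child names are hypothetical stand-ins, not PDFBox types or fields from your form. It also assumes the usual convention that a fully qualified name is the chain of partial names from the root, joined by periods.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class QualifiedNameSketch {

    /** Hypothetical stand-in for a form field; this is NOT a PDFBox class. */
    static final class Node {
        final String partialName;
        final List<Node> children;
        Node parent;

        Node(String partialName, Node... kids) {
            this.partialName = partialName;
            this.children = Arrays.asList(kids);
            for (Node kid : kids) {
                kid.parent = this;
            }
        }

        /** Joins the partial names from the root down to this node with periods. */
        String fullyQualifiedName() {
            return parent == null
                    ? partialName
                    : parent.fullyQualifiedName() + "." + partialName;
        }
    }

    /** Collects the fully qualified names of all terminal fields, depth first. */
    static List<String> terminalNames(List<Node> roots) {
        List<String> names = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        // Push in reverse so that the first root is visited first.
        for (int i = roots.size() - 1; i >= 0; i--) {
            stack.push(roots.get(i));
        }
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node.children.isEmpty()) {
                names.add(node.fullyQualifiedName());
                continue;
            }
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Node root = new Node("topmostSubform[0]",
                new Node("Page1[0]", new Node("f1_1[0]"), new Node("f1_2[0]")));
        for (String name : terminalNames(Arrays.asList(root))) {
            System.out.println(name);
        }
    }
}
```

The explicit stack avoids deep recursion on heavily nested forms. With PDFBox itself, the equivalent check would be the instanceof PDNonTerminalField test from the recursive version above.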
PS: You have not stated which PDFBox version you use. Since the PDFBox developers currently recommend using the 2.0.0 release candidates, I assumed that version in my answer.