Extracting text from a PDF file

Attilah · Nov 15, 2009 · Viewed 10.1k times

I'm using PDFBox in a C# .NET project, and I'm getting a TypeInitializationException ("The type initializer for 'java.lang.Throwable' threw an exception.") when executing the following block of code:

  FileStream stream = new FileStream(@"C:\1.pdf", FileMode.Open);

  //retrieve the pdf bytes from the stream (at most 65000 bytes are read).
  byte[] pdfbytes = new byte[65000];

  stream.Read(pdfbytes, 0, 65000);

  //get the pdf file bytes.
  byte[] allbytes = pdfbytes;

  //create a stream from the file bytes.
  java.io.InputStream ins = new java.io.ByteArrayInputStream(allbytes);
  string txt;

  //load the doc
  PDDocument doc = PDDocument.load(ins);
  PDFTextStripper stripper = new PDFTextStripper();

  //retrieve the pdf doc's text
  txt = stripper.getText(doc);
  doc.close();

The exception occurs at this statement:

  PDDocument doc = PDDocument.load(ins);

What can I do to solve this?

This is the stack trace:

   at java.lang.Throwable.__<map>(Exception , Boolean )
   at org.pdfbox.pdfparser.PDFParser.parse()
   at org.pdfbox.pdmodel.PDDocument.load(InputStream input, RandomAccess scratchFile)
   at org.pdfbox.pdmodel.PDDocument.load(InputStream input)
   at At.At.ExtractTextFromPDF(InputStream fileStream) in C:\Users\Administrator\Documents\Visual Studio 2008\Projects\AtProject\Att\At.cs:line 61

Inner exception of the InnerException:

  • InnerException {"Could not load file or assembly 'IKVM.Runtime, Version=0.30.0.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58' or one of its dependencies. The system cannot find the file specified.":"IKVM.Runtime, Version=0.30.0.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58"} System.Exception {System.IO.FileNotFoundException}

OK, I solved the previous problem by copying some of the PDFBox .dll files to my bin folder, but now I'm getting this error: expected='/' actual='.'--1 org.pdfbox.io.PushBackInputStream@283d742

Are there any alternatives to PDFBox? Is there another reliable library I can use to extract text from PDF files?

Answer

Sasha · Nov 15, 2009

It looks like you are missing some of the libraries that PDFBox depends on. You need:

  • IKVM.GNU.Classpath.dll
  • PDFBox-X.X.X.dll
  • FontBox-X.X.X-dev.dll
  • IKVM.Runtime.dll
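
A quick way to confirm that these files actually reached the output folder is to look for them at startup. The following is only a minimal sketch and is not part of PDFBox or IKVM: the class name PdfBoxDependencyCheck and the method FindMissing are made up for illustration, and the version parts of the PDFBox and FontBox file names are matched with wildcards because they depend on the release you downloaded.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of PDFBox or IKVM.
static class PdfBoxDependencyCheck
{
    // File-name patterns taken from the list above; versions are wildcards.
    static readonly string[] RequiredPatterns =
    {
        "IKVM.GNU.Classpath.dll",
        "PDFBox-*.dll",
        "FontBox-*.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the patterns for which no matching file exists in binFolder.
    public static List<string> FindMissing(string binFolder)
    {
        List<string> missing = new List<string>();
        foreach (string pattern in RequiredPatterns)
        {
            if (Directory.GetFiles(binFolder, pattern).Length == 0)
            {
                missing.Add(pattern);
            }
        }
        return missing;
    }
}
```

Calling PdfBoxDependencyCheck.FindMissing(AppDomain.CurrentDomain.BaseDirectory) before the first PDFBox call and logging the result should show which files still need to be copied. It only checks that the files exist, not that their versions match what PDFBox was built against.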

See the topic "Read from a PDF file using C#"; a similar problem is discussed in its comments.
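
One more thing worth ruling out for the second error: the snippet in the question reads at most 65000 bytes into a fixed-size buffer, so a larger PDF is truncated and a smaller one is padded with zero bytes before it reaches the parser. Whether that is what triggers the "expected='/'" message is only a guess, but reading the whole file is easy to try. The sketch below uses a hypothetical helper named PdfBytes.ReadFully and only standard System.IO calls.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; uses only System.IO.
static class PdfBytes
{
    // Reads the stream to its end, whatever its length, and returns
    // exactly the bytes that were read (no padding, no truncation).
    public static byte[] ReadFully(Stream input)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            byte[] chunk = new byte[8192];
            int read;
            // Stream.Read may return fewer bytes than requested,
            // so keep reading until it reports 0 (end of stream).
            while ((read = input.Read(chunk, 0, chunk.Length)) > 0)
            {
                buffer.Write(chunk, 0, read);
            }
            return buffer.ToArray();
        }
    }
}
```

With this in place, the question's code could use byte[] allbytes = PdfBytes.ReadFully(stream); and pass allbytes to java.io.ByteArrayInputStream as before. For a file on disk, File.ReadAllBytes(@"C:\1.pdf") gives the same result in one call.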