Difference between the Apache POI API and the Apache Tika API?

java apache-poi apache-tika

Krishna · Sep 19, 2013 · Viewed 8.2k times

I have a requirement to extract specific columns/rows from an Excel/CSV file. Somebody suggested using Tika for this task.

While going through Tika, I came across the POI API and found it friendlier to use.

We may also need to parse PDF files in the future.

I am new to this technology. I would like to know the difference between the two and which one is more suitable for my requirement.

Thanks, Krishna

Answer

Apache Tika provides a common way to extract consistent text and metadata from a wide range of formats. It also provides content detection, language detection and a few other bits. If you write your code to work with Apache Tika, then your code will be able to work with a huge range of formats in the same way. You don't need to worry about whether one format has a Title, or another calls the same logical thing a LongTitle or a Subject. You don't need to worry about what library to use for what format. You call Tika, it does the hard work for you, and back comes your consistent Metadata and Textual Content.
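To give a feel for the "call Tika and get text plus metadata back" workflow, here is a minimal sketch using the `Tika` facade class. It assumes the Tika parsers are on the classpath alongside tika-core; the class name `TikaExtractExample`, the helper `extractText` and the input path taken from `args[0]` are made up for illustration.

```java
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

public class TikaExtractExample {

    // Hypothetical helper: returns the plain text of whatever format Tika
    // recognises, and fills the supplied Metadata object as a side effect.
    static String extractText(InputStream in, Metadata metadata) throws Exception {
        Tika tika = new Tika();
        return tika.parseToString(in, metadata);
    }

    public static void main(String[] args) throws Exception {
        // Assumed input: any supported file, e.g. a spreadsheet or a PDF.
        File file = new File(args[0]);

        // Content detection: reports a MIME type for the file.
        System.out.println("Detected type: " + new Tika().detect(file));

        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream(file)) {
            System.out.println(extractText(in, metadata));
        }

        // Metadata keys are normalised by Tika across formats.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```

Note that the text comes back as one flattened string, so picking out particular rows or columns from it would mean re-parsing that string yourself.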

Apache POI is one of the libraries that Tika uses. POI supports most of the main Microsoft Office formats, including Excel (.xls and .xlsx). It provides access to the whole of the file format, allowing you complete control over what information you read out. (It also supports writing.) Tika uses POI to get text and metadata out of the various Microsoft formats, but doesn't extract everything. Using POI directly would allow you to decide what you care about and get that.
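For the "specific columns/rows" case in the question, a rough sketch with POI's `ss.usermodel` API might look like the following. It assumes the first sheet and a 0-based column index of 1, both picked arbitrarily; the class name `PoiColumnReader` and the helper `readColumn` are hypothetical. Reading .xlsx this way needs poi-ooxml on the classpath as well as poi, and CSV files are not something POI reads.

```java
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class PoiColumnReader {

    // Hypothetical helper: collects the displayed text of one column
    // (0-based index) from the first sheet, one entry per existing row.
    static List<String> readColumn(Workbook workbook, int columnIndex) {
        DataFormatter formatter = new DataFormatter();
        List<String> values = new ArrayList<>();
        Sheet sheet = workbook.getSheetAt(0);
        for (Row row : sheet) { // iterates only rows that exist in the file
            Cell cell = row.getCell(columnIndex); // null when the cell is absent
            values.add(formatter.formatCellValue(cell)); // "" for a null cell
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // WorkbookFactory picks the .xls or .xlsx implementation itself.
        try (Workbook workbook = WorkbookFactory.create(new File(args[0]))) {
            for (String value : readColumn(workbook, 1)) {
                System.out.println(value);
            }
        }
    }
}
```

`DataFormatter` is used here so that numbers and dates come back as they appear in Excel, rather than having to branch on each cell's type.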

If you want to support lots of file formats, use Tika. If you want full control of how you get the information out, use POI.
