It is said that Java is about 10x faster than Python in terms of performance, and that's what I see from benchmarks too. But what really brings Java down is the JVM startup time.
This is a test I made:
$time xlsx2csv.py Types\ of\ ESI\ v2.doc-emb-Package-9
...
<output skipped>
real 0m0.085s
user 0m0.072s
sys 0m0.013s
$time java -jar -client /usr/local/bin/tika-app-0.7.jar -m Types\ of\ ESI\ v2.doc-emb-Package-9
real 0m2.055s
user 0m2.433s
sys 0m0.078s
Same file, a 12 KB MS XLSX file embedded inside a DOCX, and Python is about 24x faster!! WTH!!
It takes 2.055 sec for Java.
I know it is all due to startup time, but I need to call it from a script to parse some documents, because I do not want to reinvent the wheel in Python.
With 10k+ files to parse, though, it is just not practical.
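To put a number on "not practical", here is a rough back-of-the-envelope calculation using the two "real" timings above. It assumes exactly 10,000 files (the lower bound of "10k+") and that every file costs the same as this one sample, which is only an approximation.

```python
# Wall-clock ("real") seconds copied from the `time` output above.
PYTHON_REAL_S = 0.085
JAVA_REAL_S = 2.055
N_FILES = 10_000  # assumption: the lower bound of "10k+"


def total_seconds(per_file_s, n_files=N_FILES):
    """Total wall-clock time if every file pays the full per-process cost."""
    return per_file_s * n_files


ratio = JAVA_REAL_S / PYTHON_REAL_S

print(f"ratio:  {ratio:.1f}x")
print(f"java:   {total_seconds(JAVA_REAL_S) / 3600:.1f} h")
print(f"python: {total_seconds(PYTHON_REAL_S) / 60:.1f} min")
```

Under those assumptions, launching a fresh JVM per file comes to roughly 5.7 hours, against about 14 minutes for the Python script.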
Is there any way to speed it up? I already tried the -client option, and it only sped things up a little (about 20%).
My other idea: run it as a long-running daemon and communicate with it locally over UDP or Linux IPC sockets?
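To make the daemon idea concrete, here is a minimal sketch of the shape I have in mind, written in Python only to show the protocol. One process stays alive and answers one request per connection over a Unix domain socket, so the startup cost is paid once instead of once per file. Everything here is an assumption for illustration: `parse_document` is a hypothetical stand-in for the real parser (in practice the long-running side would be the JVM), and the one-line-per-request protocol and the socket path are made up.

```python
import os
import socket
import socketserver
import tempfile
import threading


def parse_document(path):
    """Hypothetical stand-in for the real parser call inside the warm process."""
    return f"parsed:{path}"


class ParseHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One request per connection: a single line holding the file path.
        path = self.rfile.readline().decode("utf-8").rstrip("\n")
        self.wfile.write(parse_document(path).encode("utf-8"))


def start_daemon(socket_path):
    """Start the server on a background thread and return it."""
    server = socketserver.UnixStreamServer(socket_path, ParseHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def request_parse(socket_path, path):
    """Client side: send one path, then read the reply until the server closes."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(path.encode("utf-8") + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


if __name__ == "__main__":
    sock_path = os.path.join(tempfile.mkdtemp(), "parser.sock")
    server = start_daemon(sock_path)
    try:
        for name in ["a.xlsx", "b.xlsx"]:
            print(request_parse(sock_path, name))
    finally:
        server.shutdown()
        server.server_close()
```

I would lean towards a stream socket rather than UDP here, since parsed output can easily exceed a single datagram and the client can simply read until the connection closes.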