How to read an ORC file stored locally in Python Pandas?

Della · Oct 19, 2018 · Viewed 13.6k times

Can I think of an ORC file as similar to a CSV file with column headings and row labels containing data? If so, can I somehow read it into a simple pandas dataframe? I am not that familiar with tools like Hadoop or Spark, but is it necessary to understand them just to see the contents of a local ORC file in Python?

The filename is someFile.snappy.orc

I can see online that spark.read.orc('someFile.snappy.orc') works, but even after import pyspark, it throws an error.

Answer

Rafal Janik · Dec 4, 2018

I haven't been able to find any great options; there are a few dead projects that try to wrap the Java reader. However, pyarrow does have an ORC reader that doesn't require pyspark. It's a bit limited, but it works.

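import pandas as pd  # pandas must be installed for to_pandas()
import pyarrow.orc as orc

filename = 'someFile.snappy.orc'

# ORC is a binary format, so open the file in binary mode
with open(filename, 'rb') as file:
    data = orc.ORCFile(file)
    df = data.read().to_pandas()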