I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. array<map<string,string>>). An example of the df would be:
import pandas as pd

df = pd.DataFrame.from_records([
    (1, [{'job_id': 1, 'started': '2019-07-04'}, {'job_id': 2, 'started': '2019-05-04'}], 100),
    (5, [{'job_id': 3, 'started': '2015-06-04'}, {'job_id': 9, 'started': '2019-02-02'}], 540)],
    columns=['uid', 'job_history', 'latency']
)
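For context, the file can be reproduced with something like the following PySpark write. This is only a sketch with a hypothetical output path; the explicit schema forces the array<map<string,string>> type, so the map values are written as strings:

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, LongType,
                               ArrayType, MapType, StringType)

spark = SparkSession.builder.getOrCreate()

# explicit schema so the nested column is written as array<map<string,string>>
schema = StructType([
    StructField('uid', LongType()),
    StructField('job_history', ArrayType(MapType(StringType(), StringType()))),
    StructField('latency', LongType()),
])

rows = [
    (1, [{'job_id': '1', 'started': '2019-07-04'},
         {'job_id': '2', 'started': '2019-05-04'}], 100),
    (5, [{'job_id': '3', 'started': '2015-06-04'},
         {'job_id': '9', 'started': '2019-02-02'}], 540),
]

spark.createDataFrame(rows, schema).write.parquet('jobs.parquet')  # hypothetical path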
When using engine='fastparquet', Dask reads all the other columns fine but returns a column of Nones for the column with the complex type. When I set engine='pyarrow', I get the following exception:
ArrowNotImplementedError: lists with structs are not supported.
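Roughly what the reads look like (the path is a stand-in for the real file):

import dask.dataframe as dd

# fastparquet loads without error, but 'job_history' comes back as all None
ddf = dd.read_parquet('jobs.parquet', engine='fastparquet')
print(ddf.compute()['job_history'])

# pyarrow raises the ArrowNotImplementedError above once the nested column is actually read
ddf = dd.read_parquet('jobs.parquet', engine='pyarrow')
ddf.compute()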
A lot of googling has made it clear that reading a column with a nested array just isn't really supported right now, and I'm not totally sure what the best way to handle this is. I figure my options are:

- Have that column serialized as a JSON string upstream and parse it with the json library after loading. The schema is simple and that would do the job if possible (see the sketch after this list).
- Split the nested structure into separate top-level list columns, one per key, and note that the data across these columns are related/map to each other by index (e.g. the elements at idx 0 across these keys/columns all came from the same source). This would work, but frankly, breaks my heart :(

I'd love to hear how others have navigated around this limitation. My company uses nested arrays in their parquet files frequently, and I'd hate to have to let go of using Dask because of this.
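A minimal sketch of that first option, assuming the nested column has been written upstream as a JSON string (the path and column name are placeholders):

import json
import dask.dataframe as dd

# 'job_history' is assumed to be stored as a JSON string rather than a nested type
ddf = dd.read_parquet('jobs_json.parquet', engine='pyarrow')
ddf['job_history'] = ddf['job_history'].map(json.loads, meta=('job_history', 'object'))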
I'm dealing with pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported when I try to read using pandas; however, when I read using PySpark and then convert to pandas, the data at least loads:
import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.read.load(path)
pdf = df.toPandas()
and the offending field is now rendered as a PySpark Row object, which has some structured parsing, but you would probably have to write custom pandas functions to extract data from it:
>>> pdf["user"][0]["sessions"][0]["views"]
[Row(is_search=True, price=None, search_string='ABC', segment='listing', time=1571250719.393951), Row(is_search=True, price=None, search_string='ZYX', segment='homepage', time=1571250791.588197), Row(is_search=True, price=None, search_string='XYZ', segment='listing', time=1571250824.106184)]
An individual record can be rendered as a dictionary; simply call .asDict(recursive=True) on the Row object you want.
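For example, a sketch that flattens the nested Rows from the example above into a plain pandas DataFrame (the column names here are specific to that example and will differ for your schema):

import pandas as pd

# convert each nested Row to a plain dict, then build a flat DataFrame from them
views = [v.asDict(recursive=True)
         for v in pdf["user"][0]["sessions"][0]["views"]]
views_df = pd.DataFrame(views)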
Unfortunately, it takes ~5 seconds to start the SparkSession context, and every Spark action also takes much longer than pandas operations (for small to medium datasets), so I would greatly prefer a more Python-native option.