Pandas dataframe to JSONL (JSON Lines) conversion

Arnold Klein picture Arnold Klein · Aug 9, 2018 · Viewed 10.2k times · Source

I need to convert pandas data frame to JSONL format. I couldn't find a good package to do it and tried to implement myself, but it looks a bit ugly and not efficient.

For example, given a pandas df:

        label      pattern
  0      DRUG      aspirin
  1      DRUG    trazodone
  2      DRUG   citalopram

I need to convert to txt file of the form:

{"label":"DRUG","pattern":[{"lower":"aspirin"}]}
{"label":"DRUG","pattern":[{"lower":"trazodone"}]}
{"label":"DRUG","pattern":[{"lower":"citalopram"}]}

I tried with to_dict('records'), but I'm missing [ ] and nested 'lower' key.

df.to_dict('record')

creates:

[{'label': 'DRUG', 'pattern': 'aspirin'},
 {'label': 'DRUG', 'pattern': 'trazodone'},
 {'label': 'DRUG', 'pattern': 'citalopram'}]

I thought about converting the 'pattern' columns and include nested 'lower'?

UPD

So far, I succeeded to convert 'pattern' into list:

df_new = pd.concat((df[['label']], df[['pattern']].apply(lambda x: x.tolist(), axis=1)), axis=1)
df_new.columns = ['label', 'pattern']
df_new.head()

The result:

    label   pattern
0   DRUG    [aspirin]
1   DRUG    [trazodone]
2   DRUG    [citalopram]

and then:

df_new.to_dict(orient='records')

[{'label': 'DRUG', 'pattern': ['aspirin']},
 {'label': 'DRUG', 'pattern': ['trazodone']},
 {'label': 'DRUG', 'pattern': ['citalopram']}]

UPD 2

Eventually, I managed to get what I want, but in the most non-pythonic way.

df_1 = pd.DataFrame(df[['pattern']].apply(lambda x: {'lower': x[0]}, axis=1))
df_1.columns = ['pattern']

df_fin = pd.concat((df[['label']], df_1[['pattern']].apply(lambda x: x.tolist(), axis=1)), axis=1)
df_fin.columns = ['label', 'pattern']
df_fin.to_json(orient='records')

 '{'label': 'DRUG', 'pattern': [{'lower': 'aspirin'}]}
  {'label': 'DRUG', 'pattern': [{'lower': 'trazodone'}]}
  {'label': 'DRUG', 'pattern': [{'lower': 'citalopram'}]}'

Any chance you can show a neat solution?

Answer

kmsquire picture kmsquire · Oct 17, 2018

In versions of Pandas > 0.19.0, DataFrame.to_json has a parameter, lines, that will write out JSONL format.

Given that, a more succinct version of your solution might look like this:

import pandas as pd

data = [{'label': 'DRUG', 'pattern': 'aspirin'},
        {'label': 'DRUG', 'pattern': 'trazodone'},
        {'label': 'DRUG', 'pattern': 'citalopram'}]
df = pd.DataFrame(data)

# Wrap pattern column in a dictionary
df["pattern"] = df.pattern.apply(lambda x: {"lower": x})

# Output in JSONL format
print(df.to_json(orient='records', lines=True))

Output:

{"label":"DRUG","pattern":{"lower":"aspirin"}}
{"label":"DRUG","pattern":{"lower":"trazodone"}}
{"label":"DRUG","pattern":{"lower":"citalopram"}}