Convert Outlook PST to json using libpst

dimid · Jun 30, 2016

I have an Outlook PST file, and I'd like to get the emails as JSON, e.g. something like:

{"emails": [
{"from": "[email protected]",
 "to": "[email protected]",
 "bcc": "[email protected]",
 "subject": "mitm",
 "content": "be careful!"
}, ...]}

I've thought of using readpst to convert to MH format and then scanning the result with a Ruby/Python/Bash script. Is there a better way?

Unfortunately, the ruby-msg gem doesn't work on my PST files (and it looks like it hasn't been updated since 2014).

Answer

dimid · Jun 30, 2016

I found a way to do it in two stages, first converting to mbox and then to JSON:

# requires installing libpst
pst2json my.pst
# or you can specify a custom output dir and an outlook mail folder,
# e.g. Inbox, Sent, etc.
pst2json -o email/ -f Inbox my.pst

Here pst2json is my script, and mbox2json is slightly modified from Mining the Social Web.

pst2json:

#!/usr/bin/env bash

usage(){
    echo "usage: $(basename "$0") [-o <output-dir>] [-f <folder>] <pst-file>"
    echo "default output-dir: email/mbox-all/<pst-file>"
    echo "default folder: Inbox"
    exit 1
}

# readpst ships with libpst; check quietly that it is on the PATH
command -v readpst >/dev/null || { echo "Error: libpst not installed" >&2; exit 1; }
folder=Inbox

while (( $# > 0 )); do
    [[ -n "$pst_file" ]] && usage
    case "$1" in
        -o)
            if [[ -n "$2" ]]; then
                out_dir="$2"
                shift 2
            else
                usage
            fi
            ;;
        -f)
            if [[ -n "$2" ]]; then
                folder="$2"
                shift 2
            else
                usage
            fi
            ;;
        *)
            pst_file="$1"
            shift
    esac
done

# a PST file is mandatory; without it there is nothing to convert
[[ -n "$pst_file" ]] || usage
default_out_dir="email/mbox-all/$(basename "$pst_file")"
out_dir=${out_dir:-"$default_out_dir"}
mkdir -p "$out_dir"
readpst -o "$out_dir" "$pst_file"
[[ -f "$out_dir/$folder" ]] || { echo "Error: folder $folder is missing or empty."; exit 1; }
res="$out_dir"/"$folder".json
mbox2json "$out_dir/$folder" "$res" && echo "Success: result saved to $res"
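
For anyone who would rather not do the option handling in bash, here is a rough Python sketch of the same argument parsing and path defaults. It only builds the readpst command line and the expected paths; it does not run anything. The function names (parse_args, readpst_command, planned_paths) are made up for this illustration, and unlike the bash loop above, argparse accepts the options in any position.

```python
import argparse
import os


def parse_args(argv):
    """Parse the same options as the bash wrapper: -o, -f and the PST file."""
    parser = argparse.ArgumentParser(prog="pst2json")
    parser.add_argument("-o", dest="out_dir", metavar="output-dir", default=None)
    parser.add_argument("-f", dest="folder", metavar="folder", default="Inbox")
    parser.add_argument("pst_file", metavar="pst-file")
    args = parser.parse_args(argv)
    if args.out_dir is None:
        # Same default as the script: email/mbox-all/<pst-file>
        args.out_dir = os.path.join(
            "email", "mbox-all", os.path.basename(args.pst_file))
    return args


def readpst_command(args):
    """Command line for the first stage (PST -> one mbox file per folder)."""
    return ["readpst", "-o", args.out_dir, args.pst_file]


def planned_paths(args):
    """Return (mbox path for the chosen folder, JSON result path)."""
    mbox_path = os.path.join(args.out_dir, args.folder)
    return mbox_path, mbox_path + ".json"
```

The list returned by readpst_command can be passed to subprocess.run after creating the output directory, which mirrors the mkdir -p and readpst lines of the script.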

mbox2json (Python 2.7):

# -*- coding: utf-8 -*-

import sys
import mailbox
import email
import quopri
import json
from BeautifulSoup import BeautifulSoup

MBOX = sys.argv[1]
OUT_FILE = sys.argv[2]
SKIP_HTML=True

def cleanContent(msg):

    # Decode message from "quoted printable" format

    msg = quopri.decodestring(msg)

    # Strip out HTML tags, if any are present

    soup = BeautifulSoup(msg)
    return ''.join(soup.findAll(text=True))


def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')

    # The To, CC, and Bcc fields, if present, could have multiple items
    # Note that not all of these fields are necessarily defined

    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
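        # json_msg[k] is already unicode here, so no further decode is needed
        # (decoding a unicode object raises on non-ASCII characters in Python 2)
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r'
                , '').replace(' ', '').split(',')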

    try:
        for part in msg.walk():
            json_part = {}
            if part.get_content_maintype() == 'multipart':
                continue
            content_type = part.get_content_type()
            if SKIP_HTML and content_type == 'text/html':
                continue
            json_part['contentType'] = content_type
            content = part.get_payload(decode=False).decode('utf-8', 'ignore')
            json_part['content'] = cleanContent(content)

            json_msg['parts'].append(json_part)
    except Exception as e:
        sys.stderr.write('Skipping message - error encountered (%s)\n' % (str(e), ))
    finally:
        return json_msg

# There's a lot of data to process, so use a generator to do it. See http://wiki.python.org/moin/Generators
# Using a generator requires a trivial custom encoder be passed to json for serialization of objects
class Encoder(json.JSONEncoder):
    def default(self, o):
        return {'emails': list(o)}


# The generator itself...
def gen_json_msgs(mb):
    while 1:
        msg = mb.next()
        if msg is None:
            break
        yield jsonifyMessage(msg)

mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)
json.dump(gen_json_msgs(mbox),open(OUT_FILE, 'wb'), indent=4, cls=Encoder)
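
The script above targets Python 2.7 and BeautifulSoup 3. As a rough sketch of the same second stage on Python 3, using only the standard library, something like the following might work. It is not a drop-in port: it keeps only text parts, does not strip HTML tags, and builds the whole list in memory instead of using a generator. The function names (decode_part, message_to_dict, mbox_to_json) are my own for this illustration.

```python
import json
import mailbox
from email.utils import getaddresses


def decode_part(part):
    """Return the text of one message part as str."""
    payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
    if payload is None:
        return ""
    charset = part.get_content_charset() or "utf-8"
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:  # unknown or misspelled charset name
        return payload.decode("utf-8", errors="replace")


def message_to_dict(msg, skip_html=True):
    """Convert one mailbox message into a JSON-serializable dict."""
    record = {"parts": []}
    # Copy headers as plain strings; a repeated header keeps its last value.
    for name, value in msg.items():
        record[name] = str(value)
    # Address headers may list several recipients; keep just the addresses.
    for name in ("To", "Cc", "Bcc"):
        values = msg.get_all(name)
        if values:
            pairs = getaddresses([str(v) for v in values])
            record[name] = [addr for _, addr in pairs if addr]
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # containers and attachments are skipped in this sketch
        content_type = part.get_content_type()
        if skip_html and content_type == "text/html":
            continue
        record["parts"].append(
            {"contentType": content_type, "content": decode_part(part)})
    return record


def mbox_to_json(mbox_path, out_path, skip_html=True):
    """Write {"emails": [...]} for every message in mbox_path; return the count."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        emails = [message_to_dict(msg, skip_html) for msg in box]
    finally:
        box.close()
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump({"emails": emails}, fh, indent=4, ensure_ascii=False)
    return len(emails)
```

Calling mbox_to_json("email/Inbox", "email/Inbox.json") would then play the role of the mbox2json step, assuming readpst has already written the Inbox mbox file to that path.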

Now the file is easy to process, e.g. to get just the contents of the emails:

jq '.emails[] | .parts[] | .content' < out/Inbox.json
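
If jq isn't available, the same filter is a few lines of Python. This is a minimal sketch that assumes the JSON has the {"emails": [{"parts": [{"content": ...}]}]} layout produced above; iter_contents and load_contents are names chosen for this example.

```python
import json


def iter_contents(doc):
    """Yield the content of every part of every email, like the jq filter."""
    for message in doc.get("emails", []):
        for part in message.get("parts", []):
            # A part without "content" yields None, where jq would print null.
            yield part.get("content")


def load_contents(path):
    """Read a JSON file written by mbox2json and return all part contents."""
    with open(path, encoding="utf-8") as fh:
        return list(iter_contents(json.load(fh)))


sample = {"emails": [{"Subject": "mitm",
                      "parts": [{"contentType": "text/plain",
                                 "content": "be careful!"}]}]}
print(list(iter_contents(sample)))  # ['be careful!']
```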