How to force PyYAML to load strings as unicode objects?

Petr Viktorin · May 23, 2010 · Viewed 11.5k times

The PyYAML package loads unmarked strings as either unicode or str objects, depending on their content.

I would like to use unicode objects throughout my program (and, unfortunately, can't switch to Python 3 just yet).

Is there an easy way to force PyYAML to always load strings as unicode objects? I do not want to clutter my YAML with !!python/unicode tags.

# Encoding: UTF-8

import yaml

menu = u"""---
- spam
- eggs
- bacon
- crème brûlée
- spam
"""

print yaml.load(menu)

Output: ['spam', 'eggs', 'bacon', u'cr\xe8me br\xfbl\xe9e', 'spam']

I would like: [u'spam', u'eggs', u'bacon', u'cr\xe8me br\xfbl\xe9e', u'spam']

Answer

cryo · Jun 3, 2010

Here's a version that overrides PyYAML's string handling so that it always returns unicode objects. In practice it probably produces the same result as the other answer I posted, only shorter. As with that answer, if you use custom handlers you still have to make sure that strings in custom classes are converted to unicode, or pass unicode strings in yourself:

# -*- coding: utf-8 -*-
import yaml
from yaml import Loader, SafeLoader

def construct_yaml_str(self, node):
    # Override the default string handling function
    # to always return unicode objects
    return self.construct_scalar(node)

Loader.add_constructor(u'tag:yaml.org,2002:str', construct_yaml_str)
SafeLoader.add_constructor(u'tag:yaml.org,2002:str', construct_yaml_str)

print yaml.load(u"""---
- spam
- eggs
- bacon
- crème brûlée
- spam
""")

(The above gives [u'spam', u'eggs', u'bacon', u'cr\xe8me br\xfbl\xe9e', u'spam'])

I haven't tested it with LibYAML (the C-based parser), since I couldn't get it to compile, so I'll leave the other answer as it is.
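
Registering the constructor on Loader and SafeLoader changes string handling for every caller of those loaders in the process. If that is a concern, one possible variation is to register the same constructor on a subclass and pass that subclass explicitly. This is only a sketch: UnicodeSafeLoader is a made-up name, and it assumes a PyYAML version where yaml.load accepts the Loader keyword.

```python
# -*- coding: utf-8 -*-
import yaml


class UnicodeSafeLoader(yaml.SafeLoader):
    """Hypothetical SafeLoader subclass; the name is illustrative only."""


def construct_yaml_str(loader, node):
    # Return the scalar value as-is instead of letting PyYAML
    # try to encode it to an ASCII str first.
    return loader.construct_scalar(node)


# add_constructor on the subclass works on a copy of the constructor
# table, so yaml.SafeLoader itself keeps its default behaviour.
UnicodeSafeLoader.add_constructor(u'tag:yaml.org,2002:str', construct_yaml_str)

# Escapes are used here so the snippet does not depend on the file encoding.
menu = yaml.load(u"""---
- spam
- eggs
- cr\u00e8me br\u00fbl\u00e9e
""", Loader=UnicodeSafeLoader)
```

On Python 2 every item in menu should then be a unicode object, while yaml.safe_load calls elsewhere in the program are left alone.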