How to parse csv using boost::spirit

user841550 picture user841550 · Aug 21, 2013 · Viewed 10.2k times · Source

I have this csv line

std::string s = R"(1997,Ford,E350,"ac, abs, moon","some "rusty" parts",3000.00)";

I can parse it using boost::tokenizer:

typedef boost::tokenizer< boost::escaped_list_separator<char> , std::string::const_iterator, std::string> Tokenizer;
boost::escaped_list_separator<char> seps('\\', ',', '\"');
Tokenizer tok(s, seps);
for (auto i : tok)
{
    std::cout << i << std::endl;
}

It gets it right except token "rusty" should have double quotes which are getting stripped.

Here is my attempt to use boost::spirit

boost::spirit::classic::rule<> list_csv_item = !(boost::spirit::classic::confix_p('\"', *boost::spirit::classic::c_escape_ch_p, '\"') | boost::spirit::classic::longest_d[boost::spirit::classic::real_p | boost::spirit::classic::int_p]);
std::vector<std::string> vec_item;
std::vector<std::string>  vec_list;
boost::spirit::classic::rule<> list_csv = boost::spirit::classic::list_p(list_csv_item[boost::spirit::classic::push_back_a(vec_item)],',')[boost::spirit::classic::push_back_a(vec_list)];
boost::spirit::classic::parse_info<> result = parse(s.c_str(), list_csv);
if (result.hit)
{
  for (auto i : vec_item)
  {
    cout << i << endl;
   }
}

Problems:

  1. does not work, prints the first token only

  2. why boost::spirit::classic? can't find examples using Spirit V2

  3. the setup is brutal .. but I can live with this

** I really want to use boost::spirit because it tends to be pretty fast

Expected output:

1997
Ford
E350
ac, abs, moon
some "rusty" parts

3000.00

Answer

sehe picture sehe · Aug 21, 2013

For a background on parsing (optionally) quoted delimited fields, including different quoting characters (', "), see here:

For a very, very, very complete example complete with support for partially quoted values and a

splitInto(input, output, ' ');

method that takes 'arbitrary' output containers and delimiter expressions, see here:

Addressing your exact question, assuming either quoted or unquoted fields (no partial quotes inside field values), using Spirit V2:

Let's take the simplest 'abstract datatype' that could possibly work:

using Column  = std::string;
using Columns = std::vector<Column>;
using CsvLine = Columns;
using CsvFile = std::vector<CsvLine>;

And the repeated double-quote escapes a double-quote semantics (as I pointed out in the comment), you should be able to use something like:

static const char colsep = ',';

start  = -line % eol;
line   = column % colsep;
column = quoted | *~char_(colsep);
quoted = '"' >> *("\"\"" | ~char_('"')) >> '"';

The following complete test program prints

[1997][Ford][E350][ac, abs, moon][rusty][3001.00]

(Note the BOOST_SPIRIT_DEBUG define for easy debugging). See it Live on Coliru

Full Demo

//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>

namespace qi = boost::spirit::qi;

using Column  = std::string;
using Columns = std::vector<Column>;
using CsvLine = Columns;
using CsvFile = std::vector<CsvLine>;

template <typename It>
struct CsvGrammar : qi::grammar<It, CsvFile(), qi::blank_type>
{
    CsvGrammar() : CsvGrammar::base_type(start)
    {
        using namespace qi;

        static const char colsep = ',';

        start  = -line % eol;
        line   = column % colsep;
        column = quoted | *~char_(colsep);
        quoted = '"' >> *("\"\"" | ~char_('"')) >> '"';

        BOOST_SPIRIT_DEBUG_NODES((start)(line)(column)(quoted));
    }
  private:
    qi::rule<It, CsvFile(), qi::blank_type> start;
    qi::rule<It, CsvLine(), qi::blank_type> line;
    qi::rule<It, Column(),  qi::blank_type> column;
    qi::rule<It, std::string()> quoted;
};

int main()
{
    const std::string s = R"(1997,Ford,E350,"ac, abs, moon","""rusty""",3001.00)";

    auto f(begin(s)), l(end(s));
    CsvGrammar<std::string::const_iterator> p;

    CsvFile parsed;
    bool ok = qi::phrase_parse(f,l,p,qi::blank,parsed);

    if (ok)
    {
        for(auto& line : parsed) {
            for(auto& col : line)
                std::cout << '[' << col << ']';
            std::cout << std::endl;
        }
    } else
    {
        std::cout << "Parse failed\n";
    }

    if (f!=l)
        std::cout << "Remaining unparsed: '" << std::string(f,l) << "'\n";
}