Export large amounts of data from Cassandra to CSV

KrzysztofZalasa · Jul 22, 2014 · Viewed 14k times

I'm using Cassandra 2.0.9 to store quite big amounts of data, let's say 100 GB, in one column family. I would like to export this data to CSV quickly. I tried:

  • sstable2json - it produces quite big JSON files which are hard to parse, because the tool puts data in one row and uses a complicated schema (e.g., a 300 MB data file becomes ~2 GB of JSON); it takes a lot of time to dump, and Cassandra likes to change source file names according to its internal mechanism
  • COPY - causes timeouts on quite fast EC2 instances for a large number of records (see the sketch after this list)
  • CAPTURE - as above, causes timeouts
  • reads with pagination - I used timeuuid for it, but it returns only about 1.5k records per second
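For reference, a minimal sketch of the kind of COPY command that hits these timeouts; the keyspace, table, and file names are placeholders, and the PAGETIMEOUT and PAGESIZE tuning options only exist in newer cqlsh versions than the one shipped with 2.0.9:

COPY mykeyspace.mytable TO '/path/to/export.csv' WITH HEADER = true AND PAGETIMEOUT = 60 AND PAGESIZE = 1000;

Raising PAGETIMEOUT and lowering PAGESIZE can reduce the timeouts, at the cost of a slower export.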

I use an Amazon EC2 instance with fast storage, 15 GB of RAM, and 4 cores.

Is there any better option for exporting gigabytes of data from Cassandra to CSV?

Answer

Alex Ott · Jun 11, 2020

Update for 2020: DataStax provides a dedicated tool called DSBulk for loading and unloading data from Cassandra (starting with Cassandra 2.1) and DSE (starting with DSE 4.7/4.8). In the simplest case, the command line looks like this:

dsbulk unload -k keyspace -t table -url path_to_unload

DSBulk is heavily optimized for loading/unloading operations and has a lot of options, including import/export from/to compressed files, support for custom queries, etc.
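As a hedged illustration of those options (flag names checked against recent DSBulk releases; the keyspace, table, column, and path names below are placeholders), an unload with a custom query and gzip-compressed CSV output might look roughly like:

dsbulk unload -query "SELECT id, value FROM keyspace.table" -url path_to_unload --connector.csv.compression gzip

The -query shortcut replaces -k/-t when you need to select specific columns or filter rows, and --connector.csv.compression makes DSBulk write compressed .csv.gz files directly.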

There is a series of blog posts about DSBulk that provide more information and examples: 1, 2, 3, 4, 5, 6