Parquet vs Cassandra using Spark and DataFrames

M.Rez · Jun 14, 2016 · Viewed 7.2k times

I have come to a dilemma: I cannot decide which solution is better for me. I have a very large table (a couple of hundred GBs) and a couple of smaller ones (a couple of GBs each). In order to create my data pipeline in Spark and use Spark ML, I need to join these tables and do a couple of groupBy (aggregate) operations. Those operations were really slow for me, so I chose to do one of these two (a sketch of the kind of job I mean follows this list):

  • Use Cassandra and use indexing to speed up the groupBy operations.
  • Use Parquet and partitioning based on the layout of the data.
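
To make the workload concrete, here is a minimal sketch of such a pipeline in Spark (Scala); the paths and column names (`customer_id`, `amount`) are placeholders, not taken from the actual data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("JoinAndAggregate").getOrCreate()

// Hypothetical inputs: one very large fact table, one small lookup table.
val big   = spark.read.parquet("/data/big_table")   // a couple of 100 GBs
val small = spark.read.parquet("/data/small_table") // a couple of GBs

// Broadcasting the small table avoids shuffling the large one during the join.
val joined = big.join(broadcast(small), Seq("customer_id"))

// The groupBy (aggregate) step that is slow today.
val features = joined
  .groupBy("customer_id")
  .agg(count("*").as("num_events"), sum("amount").as("total_amount"))
```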

I can say that Parquet partitioning works faster and scales better, with less memory overhead than what Cassandra uses. So the question is this:

If the developer understands the data layout and the way it is going to be used, wouldn't it be better to just use Parquet, since you will have more control over it? Why should I pay the price for the overhead that Cassandra causes?

Answer

Citrullin · Jun 14, 2016

Cassandra is also a good solution for analytics use cases, but in a different way. Before you model your keyspaces, you have to know how you need to read the data. You can also use WHERE and range queries, but only in a strictly restricted way. Sometimes you will hate this restriction, but there are reasons for it. Cassandra is not like MySQL. In MySQL, performance is not a key feature; it's more about flexibility and consistency. Cassandra, by contrast, is a high-performance write/read database, better at writes than reads, and it scales linearly.
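
As an illustration of this query-first restriction, here is a minimal sketch of reading a Cassandra table from Spark, assuming the DataStax spark-cassandra-connector is on the classpath; the keyspace, table, and column names are made up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CassandraRead")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()
import spark.implicits._

// Read a Cassandra table as a DataFrame through the connector.
val events = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "analytics", "table" -> "events_by_customer"))
  .load()

// Fast: the filter hits the partition key, so the connector pushes it
// down and Cassandra reads only the matching partition.
val oneCustomer = events.filter($"customer_id" === "c42")

// Slow: filtering on a non-key column means scanning the whole table
// across the cluster -- exactly the restriction described above.
val bigSpenders = events.filter($"amount" > 100)
```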

Okay, a bit about your use case: Parquet is the better option for you. Here is why:

  • You aggregate raw data over really large, unsplit datasets.
  • Your Spark ML job sounds like a scheduled, short-lived job, not a long-running one (once a week, once a day?).

This fits Parquet's use cases better. Parquet is a solution for ad-hoc analysis and filter-style analytics. Parquet is really nice if you need to run a query only once or twice a month. Parquet is also a nice solution if a marketing person wants to know one thing and the response time is not so important. In short:

  • Use Cassandra if you know the queries in advance.
  • Use Cassandra if a query will be used in daily business.
  • Use Cassandra if real time matters (I am talking about a maximum of 30 seconds of latency from the moment a customer takes an action to the moment the result shows up in your dashboard).

  • Use Parquet if real time doesn't matter.
  • Use Parquet if the query won't run 100 times a day.
  • Use Parquet if you want to do batch processing (see the sketch below).
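
To close, here is a minimal sketch of the partitioned-Parquet batch setup, assuming a hypothetical `event_date` column to partition on:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("ParquetBatch").getOrCreate()

// Lay the data out on disk according to how it will be queried:
// one directory per value of event_date (.../event_date=2016-06-14/).
spark.read.parquet("/data/raw_events")
  .write
  .partitionBy("event_date")
  .mode("overwrite")
  .parquet("/data/events_partitioned")

// A later batch job that filters on the partition column only reads the
// matching directories (partition pruning), not the whole dataset.
val oneDay = spark.read
  .parquet("/data/events_partitioned")
  .filter(col("event_date") === "2016-06-14")
```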