COPY cassandra table from csv file

siliconchris picture siliconchris · Feb 21, 2015 · Viewed 23.2k times · Source

I'm setting up a demo landscape for Cassandra, Apache Spark and Flume on my Mac (Mac OS X Yosemite with Oracle jdk1.7.0_55). The landscape shall work as a proof of concept for a new analytics platform and therefore I need some test data in my cassandra db. I am using cassandra 2.0.8.

I created some demo data in excel and exported that as a CSV file. The structure is like this:

ProcessUUID;ProcessID;ProcessNumber;ProcessName;ProcessStartTime;ProcessStartTimeUUID;ProcessEndTime;ProcessEndTimeUUID;ProcessStatus;Orderer;VorgangsNummer;VehicleID;FIN;Reference;ReferenceType
0F0D1498-D149-4FCC-87C9-F12783FDF769;AbmeldungKl‰rfall;1;Abmeldung Kl‰rfall;2011-02-03 04:05+0000;;2011-02-17 04:05+0000;;Finished;SIXT;4278;A-XA 1;WAU2345CX67890876;KLA-BR4278;internal

I then created a keyspace and a column family in cqlsh using:

CREATE KEYSPACE dadcargate 
WITH REPLICATAION  = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' };

use dadcargate;

CREATE COLUMNFAMILY Process (
  ProcessUUID uuid, ProcessID varchar, ProcessNumber bigint, ProcessName varchar, 
  ProcessStartTime timestamp, ProcessStartTimeUUID timeuuid, ProcessEndTime timestamp, 
  ProcessEndTimeUUID timeuuid, ProcessStatus varchar, Orderer varchar,
  VorgangsNummer varchar, VehicleID varchar, FIN varchar, Reference varchar,
  ReferenceType varchar, 
PRIMARY KEY (ProcessUUID))
WITH COMMENT='A process is like a bracket around multiple process steps';

The column family name and all columns in it are created with all lower case - will have to investigate into this as well some day, but that is not so relevant at the moment.

Now I take my CSV file, which has around 1600 entries and want to import that in my table named process like this:

cqlsh:dadcargate> COPY process (processuuid, processid, processnumber, processname, 
processstarttime, processendtime, processstatus, orderer, vorgangsnummer, vehicleid,
fin, reference, referencetype) 
FROM 'Process_BulkData.csv' WITH DELIMITER = ';' AND HEADER = TRUE;

It gives the following error:

Record #0 (line 1) has the wrong number of fields (15 instead of 13).
0 rows imported in 0.050 seconds.

Which is essentially true, As I do NOT have the timeUUID Fields in my cvs-export.

If I try the COPY command without explicit column-names like this (given the fact, that I actually do miss two fields):

cqlsh:dadcargate> COPY process from 'Process_BulkData.csv' 
WITH DELIMITER = ';' AND HEADER = TRUE;

I end up with another error:

Bad Request: Input length = 1
Aborting import at record #0 (line 1). Previously-inserted values still present.
0 rows imported in 0.009 seconds.

Hm. Kinda strange, but okay. Maybe the COPY command does not like the fact that there are two fields missing. I still think this to be strange, as the missing fields are of course there (from a structural point of view) but only empty.

I still have another shot: I deleted the missing columns in excel, exported the file again as cvs and try to import WITHOUT header line in my csv BUT explicit column names, like this:

cqlsh:dadcargate> COPY process (processuuid, processid, processnumber, processname, 
processstarttime, processendtime, processstatus, orderer, vorgangsnummer, vehicleid, 
fin, reference, referencetype) 
FROM 'Process_BulkData-2.csv' WITH DELIMITER = ';' AND HEADER = TRUE;

I get this error:

Bad Request: Input length = 1
Aborting import at record #0 (line 1). Previously-inserted values still present.
0 rows imported in 0.034 seconds.

Can ANYONE tell me what I'm doing wrong here? According to the documentation of copy-command, the way I setup my commands, should work for at least two of them. Or so I would think.

But nah, I'm obviously missing something important here.

Answer

Aaron picture Aaron · Feb 21, 2015

cqlsh's COPY command can be touchy. However, in the COPY documentation is this line:

The number of columns in the CSV input is the same as the number of columns in the Cassandra table metadata.

Keeping that in-mind, I did manage to get your data to import with a COPY FROM, by naming the empty fields (processstarttimeuuid and processendtimeuuid, respectively):

aploetz@cqlsh:stackoverflow> COPY process (processuuid, processid, processnumber, 
processname, processstarttime, processstarttimeuuid, processendtime, 
processendtimeuuid, processstatus, orderer, vorgangsnummer, vehicleid, fin, reference, 
referencetype) FROM 'Process_BulkData.csv' WITH DELIMITER = ';' AND HEADER = TRUE;

1 rows imported in 0.018 seconds.
aploetz@cqlsh:stackoverflow> SELECT * FROM process ;

 processuuid                          | fin               | orderer | processendtime            | processendtimeuuid | processid         | processname        | processnumber | processstarttime          | processstarttimeuuid | processstatus | reference  | referencetype | vehicleid | vorgangsnummer
--------------------------------------+-------------------+---------+---------------------------+--------------------+-------------------+--------------------+---------------+---------------------------+----------------------+---------------+------------+---------------+-----------+----------------
 0f0d1498-d149-4fcc-87c9-f12783fdf769 | WAU2345CX67890876 |    SIXT | 2011-02-16 22:05:00+-0600 |               null | AbmeldungKl‰rfall | Abmeldung Kl‰rfall |             1 | 2011-02-02 22:05:00+-0600 |                 null |      Finished | KLA-BR4278 |      internal |    A-XA 1 |           4278

(1 rows)