I have a 5-node Cassandra cluster with ~650 GB of data on each node and a replication factor of 3. I have recently started seeing the following error in /var/log/cassandra/system.log:
INFO [ReadStage-5] 2017-10-17 17:06:07,887 NoSpamLogger.java:91 - Maximum memory usage reached (1.000GiB), cannot allocate chunk of 1.000MiB
I have attempted to increase file_cache_size_in_mb, but sooner rather than later the same error catches up with me. I have gone as high as 2 GB for this parameter, but to no avail.
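For context, as far as I can tell the limit reported in that log line tracks the file_cache_size_in_mb setting in cassandra.yaml (you can see it jump from 1.000GiB to 2.000GiB in the log excerpt below after I raised it). A minimal illustration of the change I tried, with values only as examples of my own settings:

# cassandra.yaml
# file_cache_size_in_mb: 1024   # earlier value, matching the 1.000GiB limit in the log
file_cache_size_in_mb: 2048     # raised to 2 GB; the same message still appears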
When the error happens, CPU utilisation soars and read latencies become terribly erratic. I see this surge show up approximately every half hour. Note the timestamps in the log lines below.
INFO [ReadStage-5] 2017-10-17 17:06:07,887 NoSpamLogger.java:91 - Maximum memory usage reached (1.000GiB), cannot allocate chunk of 1.000MiB
INFO [ReadStage-36] 2017-10-17 17:36:09,807 NoSpamLogger.java:91 - Maximum memory usage reached (1.000GiB), cannot allocate chunk of 1.000MiB
INFO [ReadStage-15] 2017-10-17 18:05:56,003 NoSpamLogger.java:91 - Maximum memory usage reached (2.000GiB), cannot allocate chunk of 1.000MiB
INFO [ReadStage-28] 2017-10-17 18:36:01,177 NoSpamLogger.java:91 - Maximum memory usage reached (2.000GiB), cannot allocate chunk of 1.000MiB
Two of my tables are partitioned by hour, and the partitions are large. For example, here are their outputs from nodetool tablestats:
Read Count: 4693453
Read Latency: 0.36752741680805157 ms.
Write Count: 561026
Write Latency: 0.03742310516803143 ms.
Pending Flushes: 0
Table: raw_data
SSTable count: 55
Space used (live): 594395754275
Space used (total): 594395754275
Space used by snapshots (total): 0
Off heap memory used (total): 360753372
SSTable Compression Ratio: 0.20022598072758296
Number of keys (estimate): 45163
Memtable cell count: 90441
Memtable data size: 685647925
Memtable off heap memory used: 0
Memtable switch count: 1
Local read count: 0
Local read latency: NaN ms
Local write count: 126710
Local write latency: 0.096 ms
Pending flushes: 0
Percent repaired: 52.99
Bloom filter false positives: 167775
Bloom filter false ratio: 0.16152
Bloom filter space used: 264448
Bloom filter off heap memory used: 264008
Index summary off heap memory used: 31060
Compression metadata off heap memory used: 360458304
Compacted partition minimum bytes: 51
**Compacted partition maximum bytes: 3449259151**
Compacted partition mean bytes: 16642499
Average live cells per slice (last five minutes): 1.0005435888450147
Maximum live cells per slice (last five minutes): 42
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 0
Read Count: 4712814
Read Latency: 0.3356051004771247 ms.
Write Count: 643718
Write Latency: 0.04168356951335834 ms.
Pending Flushes: 0
Table: customer_profile_history
SSTable count: 20
Space used (live): 9423364484
Space used (total): 9423364484
Space used by snapshots (total): 0
Off heap memory used (total): 6560008
SSTable Compression Ratio: 0.1744084338623116
Number of keys (estimate): 69
Memtable cell count: 35242
Memtable data size: 789595302
Memtable off heap memory used: 0
Memtable switch count: 1
Local read count: 2307
Local read latency: NaN ms
Local write count: 51772
Local write latency: 0.076 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 384
Bloom filter off heap memory used: 224
Index summary off heap memory used: 400
Compression metadata off heap memory used: 6559384
Compacted partition minimum bytes: 20502
**Compacted partition maximum bytes: 4139110981**
Compacted partition mean bytes: 708736810
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
Here is the cfhistograms output for both tables:
cdsdb/raw_data histograms
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 0.00 61.21 0.00 1955666 642
75% 1.00 73.46 0.00 17436917 4768
95% 3.00 105.78 0.00 107964792 24601
98% 8.00 219.34 0.00 186563160 42510
99% 12.00 315.85 0.00 268650950 61214
Min 0.00 6.87 0.00 51 0
Max 14.00 1358.10 0.00 3449259151 7007506
cdsdb/customer_profile_history histograms
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 0.00 73.46 0.00 223875792 61214
75% 0.00 88.15 0.00 668489532 182785
95% 0.00 152.32 0.00 1996099046 654949
98% 0.00 785.94 0.00 3449259151 1358102
99% 0.00 943.13 0.00 3449259151 1358102
Min 0.00 24.60 0.00 5723 4
Max 0.00 5839.59 0.00 5960319812 1955666
Could you please suggest a way forward to mitigate this issue?
Based on the cfhistograms output posted, the partitions are enormous.
The 95th percentile partition size of the raw_data table is 107 MB, with a maximum of 3.44 GB. The 95th percentile of customer_profile_history is 1.99 GB, with a maximum of 5.96 GB.
This clearly relates to the surge you notice every half hour, as these huge partitions are written out to SSTables. The data model has to change: given the partition sizes above, it is better to use a partition interval of "minute" instead of "hour", so a 2 GB hourly partition would shrink to roughly a 33 MB partition (2 GB / 60).
The recommendation is to keep partitions to a maximum of about 100 MB. You can theoretically store more than that, but performance will suffer: every read of such a partition drags more than 100 MB of data over the wire. In your case it is over 2 GB, with all the performance implications that go with it.
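A hypothetical sketch of that data-model change for raw_data (the keyspace cdsdb comes from your histograms output, but the column names here are assumptions since the actual schema was not posted): move the time bucket in the partition key from hour to minute granularity.

-- Illustrative only: substitute your real columns for the assumed ones.
-- Current layout: one partition per (source, hour) bucket, which grows to multi-GB.
CREATE TABLE cdsdb.raw_data (
    source_id     text,
    hour_bucket   timestamp,   -- event time truncated to the hour
    event_time    timestamp,
    payload       blob,
    PRIMARY KEY ((source_id, hour_bucket), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Proposed layout: one partition per (source, minute) bucket,
-- roughly 1/60th the size, so a 2 GB hourly partition becomes ~33 MB.
CREATE TABLE cdsdb.raw_data_by_minute (
    source_id     text,
    minute_bucket timestamp,   -- event time truncated to the minute
    event_time    timestamp,
    payload       blob,
    PRIMARY KEY ((source_id, minute_bucket), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

The application would compute the minute bucket on both writes and reads, and a read spanning a longer window would simply query a range of minute buckets instead of one hour bucket.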