My goal is to ensure that records published by a DynamoDB stream are processed in the "correct" order. My table contains events for customers: the hash key is an event ID, the range key a timestamp. "Correct" order means that events for the same customer ID are processed in order; events for different customer IDs can be processed in parallel.
I'm consuming the stream via Lambda functions, and a consumer is spawned automatically per shard. So if the runtime decides to split the stream into multiple shards, consumption happens in parallel (if I understand this correctly), and I run the risk of processing a CustomerAddressChanged event before the corresponding CustomerCreated event, for example.
The docs imply that there is no way to influence the sharding, but they don't say so explicitly. Is there a way, e.g., by using a combination of customer ID and timestamp for the range key?
The assumption that shard assignment is determined by the table's keys seems to be correct. My solution is to use the customer ID as the hash key and the timestamp (or event ID) as the range key.
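For illustration, here is a minimal sketch of such a table definition using boto3. The table name, attribute names, and stream view type are my own assumptions, not anything prescribed by AWS:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Sketch of the revised schema: customer ID as hash (partition) key,
# timestamp as range (sort) key. All names here are assumptions.
dynamodb.create_table(
    TableName="CustomerEvents",
    AttributeDefinitions=[
        {"AttributeName": "CustomerId", "AttributeType": "S"},
        {"AttributeName": "Timestamp", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "CustomerId", "KeyType": "HASH"},
        {"AttributeName": "Timestamp", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
    # Enable the stream so Lambda can consume changes.
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)
```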
This AWS blog says:
The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.
This slide confirms it. I still wish the DynamoDB docs would explicitly say so...
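Given that guarantee, a handler can simply process the records of each batch in list order. A minimal sketch, assuming the revised key schema above (the attribute names and the process_event helper are hypothetical):

```python
# Lambda handler sketch. It relies on the quoted guarantee: all records for
# a given customer ID land in at most one active shard, and Lambda delivers
# each shard's records to the function in order, one batch at a time. So
# iterating the batch in order preserves per-customer ordering.
def handler(event, context):
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        customer_id = keys["CustomerId"]["S"]  # hash key; name is an assumption

        if record["eventName"] == "INSERT":
            new_image = record["dynamodb"]["NewImage"]
            process_event(customer_id, new_image)


def process_event(customer_id, image):
    # Hypothetical downstream processing, e.g. applying the event to a
    # read model. Per-customer ordering follows from the shard semantics.
    print(f"processing event for customer {customer_id}: {image}")
```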