How to query DynamoDB by date (range key), with no obvious hash key?

James Skidmore · Mar 12, 2016

I need to keep local data on an iOS app in sync with data in a DynamoDB table. The DynamoDB table is ~2K rows, with only a hash key (id), and the following attributes:

  • id (uuid)
  • lastModifiedAt (timestamp)
  • name
  • latitude
  • longitude

I am currently scanning and filtering by lastModifiedAt, where lastModifiedAt is greater than the app's last refresh date, but I imagine that will become expensive.

The best answer I can find is to add a Global Secondary Index with lastModifiedAt as the range, but there is no obvious hash key for the GSI.

What is best practice when needing to query by range using a GSI, but there is no obvious hash key? Alternatively, if a full scan is the only option, are there any best practices to keep down the cost?

Answer

bsd · Apr 8, 2016

Although a Global Secondary Index seems to fit your requirements, any attempt to include timestamp-related information in your hash key will most likely create what is known as a "hot partition", which is extremely undesirable.

The access will be uneven because the most recent items are going to be retrieved far more frequently than the old ones. This will not only hurt your performance but also make your solution less cost-effective.

See some details from the documentation:

For example, if a table has a very small number of heavily accessed partition key values, possibly even a single very heavily used partition key value, request traffic is concentrated on a small number of partitions – potentially only one partition. If the workload is heavily unbalanced, meaning that it is disproportionately focused on one or a few partitions, the requests will not achieve the overall provisioned throughput level. To get the most out of DynamoDB throughput, create tables where the partition key has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.

Based on this, the id does seem to be a good choice for your hash key (a.k.a. partition key). I wouldn't change that, since GSI keys work the same way as far as partitioning is concerned. As a separate note, reads are most efficient when you retrieve your data by providing the entire primary key, so we should try to find a solution that does that whenever possible.

I would suggest creating separate tables to store the primary keys based on how recently they were updated. You can segment the data into tables at whatever granularity best fits your use cases. For example, say that you want to segment the updates by day:

a. Your daily updates could be stored in tables with the following naming convention: updates_DDMM

b. The updates_DDMM tables would hold only the ids (the hash keys of the main table)
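
As a rough sketch of this convention in Python (the language is my choice, nothing above requires it): `updates_table_name` and `record_update` are hypothetical helper names, and `client` is assumed to behave like a boto3 low-level DynamoDB client (`boto3.client("dynamodb")`). It also assumes the daily table already exists with `id` as its hash key.

```python
from datetime import date


def updates_table_name(day: date) -> str:
    """Return the daily table name in the updates_DDMM convention."""
    return "updates_" + day.strftime("%d%m")


def record_update(client, item_id: str, day: date) -> None:
    """Store only the id of a modified item in that day's updates table.

    `client` is assumed to expose put_item like boto3.client("dynamodb").
    The daily table is assumed to exist already (hypothetical setup).
    """
    client.put_item(
        TableName=updates_table_name(day),
        # Low-level API format: "S" marks a string attribute.
        Item={"id": {"S": item_id}},
    )
```

You would call `record_update` next to every write to the main table, so each daily table ends up listing the ids touched that day.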

Now say that the last app refresh was 2 days ago (04/07/16) and you need to get the recent records. You would then need to:

i. Scan the tables updates_0504 and updates_0604 to get all the hash keys.

ii. Finally obtain the records from the main table (containing lat/lng, name, etc) by submitting a BatchGetItem with all the obtained hash keys.
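
Step i could look roughly like the following. This is only a sketch under a few assumptions: `tables_since`, `scan_ids`, and `changed_ids` are hypothetical names, `client` is assumed to behave like `boto3.client("dynamodb")`, and a table is assumed to exist for every day in the range. I also include the table for the refresh day itself, on the assumption that items may have changed later that same day.

```python
from datetime import date, timedelta


def tables_since(last_refresh: date, today: date) -> list:
    """Daily table names from the last refresh day through today, inclusive."""
    names = []
    day = last_refresh
    while day <= today:
        names.append("updates_" + day.strftime("%d%m"))
        day += timedelta(days=1)
    return names


def scan_ids(client, table_name: str) -> set:
    """Collect every id in one daily table, following Scan pagination."""
    ids = set()
    kwargs = {"TableName": table_name}
    while True:
        page = client.scan(**kwargs)
        ids.update(item["id"]["S"] for item in page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return ids
        # More pages remain: resume the scan where this page stopped.
        kwargs["ExclusiveStartKey"] = last_key


def changed_ids(client, last_refresh: date, today: date) -> set:
    """Union of ids across all daily tables since the last refresh."""
    ids = set()
    for name in tables_since(last_refresh, today):
        ids |= scan_ids(client, name)
    return ids
```

Using a set means an item modified on several days is only fetched once from the main table.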

BatchGetItem reads items directly by primary key, up to 100 items per request, so it is very fast and well suited to this job.
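
A single BatchGetItem request accepts at most 100 keys and may hand some of them back as UnprocessedKeys, so the ids need to be chunked and retried. A minimal sketch, assuming `client` behaves like `boto3.client("dynamodb")`; `batch_get_items` is a hypothetical helper name and the main table name is whatever you pass in:

```python
def batch_get_items(client, table_name, ids, batch_size=100):
    """Fetch full items by id, sending at most 100 keys per BatchGetItem call."""
    ids = list(ids)
    items = []
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        request = {table_name: {"Keys": [{"id": {"S": i}} for i in chunk]}}
        while request:
            response = client.batch_get_item(RequestItems=request)
            items.extend(response.get("Responses", {}).get(table_name, []))
            # Keys that were not served come back in UnprocessedKeys;
            # resubmit them until nothing is left.
            request = response.get("UnprocessedKeys") or {}
    return items
```

In real code you would probably add a backoff delay between retries of unprocessed keys; it is left out here to keep the sketch short.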

One can argue that creating additional tables will add cost to your overall solution. Well, with a GSI you are essentially duplicating your table (if you are projecting all fields) and paying that additional cost for all ~2K records, whether they were recently updated or not.

It seems counterintuitive to create tables like this, but it is actually a best practice when dealing with time series data (from the AWS DynamoDB documentation):

[...] the applications might show uneven access pattern across all the items in the table where the latest customer data is more relevant and your application might access the latest items more frequently and as time passes these items are less accessed, eventually the older items are rarely accessed. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where data access rate is high, request higher throughput and for tables storing older data, you could dial down the throughput and save on resources.

You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
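
Applied to the daily updates tables, dropping old data by deleting whole tables might be sketched like this. `expired_table_names` and `drop_expired` are hypothetical names, the retention numbers are arbitrary, `client` is assumed to behave like `boto3.client("dynamodb")`, and every table in the computed range is assumed to exist.

```python
from datetime import date, timedelta


def expired_table_names(today: date, keep_days: int, lookback_days: int) -> list:
    """Daily table names older than the newest keep_days tables.

    Looks back at most lookback_days, since updates_DDMM names repeat yearly.
    """
    return [
        "updates_" + (today - timedelta(days=n)).strftime("%d%m")
        for n in range(keep_days, lookback_days + 1)
    ]


def drop_expired(client, today: date, keep_days: int = 7, lookback_days: int = 30) -> list:
    """Delete each expired daily table and return the names that were dropped."""
    names = expired_table_names(today, keep_days, lookback_days)
    for name in names:
        # One DeleteTable call replaces a delete per stored id.
        client.delete_table(TableName=name)
    return names
```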

I hope that helps. Regards.