I'm coming from a relational database background and trying to work with Amazon's DynamoDB.
I have a table with a hash key "DataID" and a range key "CreatedAt", and a bunch of items in it.
I'm trying to get all the items that were created after a specific date, sorted by date, which is pretty straightforward in a relational database.
In DynamoDB the closest thing I could find is a query with a greater-than condition on the range key. The only issue is that to perform a query I need a hash key, which defeats the purpose.
So what am I doing wrong? Is my table schema wrong? Shouldn't the hash key be unique? Or is there another way to query?
Given your current table structure, this is not currently possible in DynamoDB. The big challenge is to understand that each value of the table's hash key (the partition key) should be treated as creating a separate table. In some ways this is really powerful (think of partition keys as creating a new table for each user or customer, etc.).
Queries can only be done within a single partition. That's really the end of the story. This means that if you want to query by date (you'll want to use msec since epoch), then all the items you want to retrieve in a single query must have the same hash (partition) key.
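To make that concrete, here is a minimal sketch of the request such a query would need. It only builds the parameters for the low-level Query call, so it runs without AWS access. The table name "Events", the value "sensor-42", and the attribute types (DataID as a string, CreatedAt as a number holding msec since epoch) are assumptions for illustration, not something stated in the question.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def build_query_params(table_name: str, data_id: str, created_after: datetime) -> dict:
    """Build keyword arguments for a DynamoDB Query against one partition.

    Assumes DataID is stored as a string and CreatedAt as a number (msec).
    """
    return {
        "TableName": table_name,
        # The hash key must be matched with "="; only the range key
        # accepts a comparison such as ">".
        "KeyConditionExpression": "#pk = :id AND #sk > :after",
        "ExpressionAttributeNames": {"#pk": "DataID", "#sk": "CreatedAt"},
        "ExpressionAttributeValues": {
            ":id": {"S": data_id},
            ":after": {"N": str(to_epoch_ms(created_after))},
        },
        # True returns items in ascending range-key order (oldest first).
        "ScanIndexForward": True,
    }


params = build_query_params(
    "Events", "sensor-42", datetime(2024, 1, 1, tzinfo=timezone.utc)
)
# With boto3 installed and credentials configured, this would be sent as:
#   boto3.client("dynamodb").query(**params)
```

Note that there is no way to leave ":id" out: the key condition always needs an equality test on the hash key, which is exactly the limitation described above.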
I should qualify this. You absolutely can scan by the criterion you are looking for; that's no problem, but it means you will be looking at every single row in your table, and then checking if that row has a date that matches your parameters. This is really expensive, especially if you are in the business of storing events by date in the first place (i.e. you have a lot of rows).
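For comparison, a scan with a filter might look roughly like the sketch below. The function names and the "Events" table are hypothetical; `client` is assumed to be a boto3 low-level DynamoDB client (or anything with a compatible `scan` method). The filter is applied after the items are read, so every row is still examined, and the results have to be sorted on the client because a scan does not return them in date order.

```python
def build_scan_params(table_name: str, created_after_ms: int) -> dict:
    """Build keyword arguments for a Scan that keeps items newer than a cutoff."""
    return {
        "TableName": table_name,
        "FilterExpression": "#created > :after",
        "ExpressionAttributeNames": {"#created": "CreatedAt"},
        "ExpressionAttributeValues": {":after": {"N": str(created_after_ms)}},
    }


def scan_created_after(client, table_name: str, created_after_ms: int) -> list:
    """Scan the whole table page by page and return matches sorted by date."""
    request = build_scan_params(table_name, created_after_ms)
    items = []
    while True:
        page = client.scan(**request)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        # More pages remain: continue from where the previous page stopped.
        request["ExclusiveStartKey"] = last_key
    # Scan results are not ordered by CreatedAt, so sort on the client.
    items.sort(key=lambda item: int(item["CreatedAt"]["N"]))
    return items
```

It would be called as `scan_created_after(boto3.client("dynamodb"), "Events", cutoff_ms)`, with the cost growing with the size of the table rather than the size of the result.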
You may be tempted to put all the data in a single partition to solve the problem, and you absolutely can; however, your throughput will be painfully low, given that each partition only receives a fraction of the table's total provisioned throughput.
The best thing to do is to determine more useful partitions in which to store the data:
Do you really need to look at all the rows, or is it only the rows by a specific user?
Would it be okay to first narrow down the list by Month, and do multiple queries (one for each month)? Or by Year?
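The month-by-month idea could be sketched like this, assuming a hypothetical table whose partition key is a "MonthBucket" string such as "2024-01" and whose range key is still CreatedAt (msec since epoch). Neither that attribute nor the "EventsByMonth" table name comes from the question; they are made up for illustration. The code only builds the query parameters, one set per month.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def month_buckets(start: datetime, end: datetime) -> list:
    """List "YYYY-MM" buckets from start's month through end's month, in order."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


def build_bucket_queries(table_name: str, start: datetime, end: datetime) -> list:
    """Build one set of Query parameters per month bucket (timezone-aware input)."""
    after_ms = (start - EPOCH) // timedelta(milliseconds=1)
    return [
        {
            "TableName": table_name,
            "KeyConditionExpression": "#bucket = :bucket AND #created > :after",
            "ExpressionAttributeNames": {
                "#bucket": "MonthBucket",  # hypothetical partition key
                "#created": "CreatedAt",
            },
            "ExpressionAttributeValues": {
                ":bucket": {"S": bucket},
                ":after": {"N": str(after_ms)},
            },
            "ScanIndexForward": True,
        }
        for bucket in month_buckets(start, end)
    ]


queries = build_bucket_queries(
    "EventsByMonth",
    datetime(2023, 11, 15, tzinfo=timezone.utc),
    datetime(2024, 2, 1, tzinfo=timezone.utc),
)
```

Because the buckets are generated in chronological order and each query returns its items in ascending CreatedAt order, running the queries in sequence and concatenating the results should give one list sorted by date.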
If you are doing time series analysis there are a couple of options: change the partition key to something computed on PUT to make the query easier, or use another AWS product like Kinesis, which lends itself to append-only logging.
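As a rough illustration of computing the partition key on PUT, the sketch below derives a month string from the creation time when the item is written. The "MonthBucket" attribute and "EventsByMonth" table are hypothetical names, and the item layout (DataID as a string, CreatedAt as msec since epoch) is an assumption. It only builds the parameters for a PutItem call.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def build_put_params(table_name: str, data_id: str, created_at: datetime) -> dict:
    """Build PutItem arguments with a partition key derived from created_at.

    created_at must be timezone-aware; the bucket is computed in UTC.
    """
    created_utc = created_at.astimezone(timezone.utc)
    created_ms = (created_utc - EPOCH) // timedelta(milliseconds=1)
    return {
        "TableName": table_name,
        "Item": {
            # Computed at write time so later queries can target one month.
            "MonthBucket": {"S": created_utc.strftime("%Y-%m")},
            "CreatedAt": {"N": str(created_ms)},
            "DataID": {"S": data_id},
        },
    }


put_params = build_put_params(
    "EventsByMonth", "sensor-42", datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)
)
# With boto3 installed and credentials configured, this would be sent as:
#   boto3.client("dynamodb").put_item(**put_params)
```

One caveat with this particular layout: if MonthBucket and CreatedAt together form the primary key, two items created in the same millisecond would overwrite each other, so a real schema would likely need a tie-breaker in the range key.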