How can I implement versioning without replacing with previous record in DynamoDB?

iCodeLikeImDrunk picture iCodeLikeImDrunk · Jun 18, 2014 · Viewed 19.2k times · Source

Currently, I see that when I use versioning in DynamoDB, it changes the version number but the new entry will replace the old entry; ie:

old

{ object:one, name:"hey", version:1}

new

{ object:one, name:"ho", version:2}

What I want is to have BOTH entries in the db; ie:

old

{ object:one, name:"hey", version:1 }

new

{ object:one, name:"hey", version:1}
{ object:one, name:"ho", version:2}

Any way to accomplish this?

Answer

ShortFuse picture ShortFuse · Feb 8, 2019

I've been experimenting and calculating what's most efficient in terms of read/write units and cost, considering race conditions where updates occur while a version is being logged, and avoiding of duplication of data. I've narrowed down a couple of possible solutions. You'd have to consider your best variation.

The basic concepts revolve around consider version 0 as the latest version. Also, we will use a revisions key that will list how many revisions exist before this item, but also will be used to determine the current version of the item (version = revisions + 1). Being able to calculate how versions exists is a requirement and, in my opinion, revisions fulfills that need as well as a value that can be presented to the user.

So the first row will be created with version: 0 and revisions: 0. While this is technically the first version (v1), we don't apply a version number until it's archived. When this row changes, version stays at 0, which still denotes latest, and revisions is incremented to 1. A new row is created with all the previous values, with the exception that now that row denotes version: 1.

To summarize:

On item creation:

  • Create item with revisions: 0 and version 0

On item update or overwrite:

  • Increment revisions
  • Insert old row exactly as before, but change version: 0 to the new version which can easily be calculated as version: revisions + 1.

Here's what converting to convertion would look like on a table with just primary key:

Primary Key: id

  id  color
9501  violet
9502  cyan
9503  magenta

Primary Key: id+version

id    version  revisions  color
9501        0          6  violet
9501        1          0  red
9501        2          1  orange
9501        3          2  yellow
9501        4          3  green
9501        5          4  blue
9501        6          5  indigo

Here's converting a table that already uses a sort key:

Primary key: id+date

id    date     color
9501  2018-01  violet
9501  2018-02  cyan
9501  2018-03  black

Primary key: id+date_ver

id    date_ver     revisions  color
9501  2018-01__v0          6  violet
9501  2018-01__v1          0  red
9501  2018-01__v2          1  orange
9501  2018-01__v3          2  yellow
9501  2018-01__v4          3  green
9501  2018-01__v5          4  blue
9501  2018-01__v6          5  indigo

Alternative #2:

id    date_ver     revisions  color
9501  2018-01              6  violet
9501  2018-01__v1          0  red
9501  2018-01__v2          1  orange
9501  2018-01__v3          2  yellow
9501  2018-01__v4          3  green
9501  2018-01__v5          4  blue
9501  2018-01__v6          5  indigo

We actually have the option of either putting previous versions in the same table or separate them into their own table. Both options have their different advantages and disadvantages.

Using the same table:

  • Primary key is composed of Partition key and Sort key
  • Version must be used in sort key either alone as a number or appended to an existing sort key as a string

Advantages:

  • All data exists in one table

Disadvantages:

  • Possibly limits your use of the table sort keys
  • Versioning uses the same writes units as your primary table
  • Sort keys can only be configured during table creation
  • Possibly need to readjust your code to query against v0
  • Previous versions will also be affected by indexes

Using a secondary tables:

  • Add revision key to both tables
  • If not using sort key, build a sort key for secondary table called version. The primary table would always have version: 0. Use of this key on the primary table isn't mandatory.
  • If already using a sort key, see "Alternative #2" above

Advantages:

  • Primary table does not need to change any keys or be recreated. get requests do not change.
  • Primary table keeps its sort key
  • Secondary table can have independent read and write capacity units
  • Secondary table has its own indexes

Disadvantages:

  • Requires managing of a second table

Regardless of how you decide to partition the data, now we have to decide how create the revision rows. Here are a couple of different methods:

On-demand, synchronous item-overwrite/update and revision-insert

Summary: Get the current version of the row. Perform both an update on the current row and insert the previous version with one transaction.

To avoid race conditions, we need to write both the update and insert in the same operation using TransactWriteItems. Also, we need to make sure that the version we are updating is the right version by the time the request reaches database server. We achieve this by either one of two checks, or even both:

  1. In the Update command in TransactItems, the ConditionExpression must check that the revision in the row to be updated matches the revision in the object we performed a Get on before.
  2. In the Put command in TransactItems, the ConditionExpression checks to make sure the row does not already exist.

Cost

  • 1 Read capacity unit per 4K for Get on v0
  • 1 Write capacity unit for preparing TransactWriteItem
  • 1 Write capacity unit per 1K for Put/Update on v0
  • 1 Write capacity unit per 1K for Put on revision
  • 1 Write capacity unit for committing TransactWriteItem

Notes:

  • Items are limited to 400KB

On-demand, asynchronous item-get, item-overwrite/update, and revision-insert

Summary: Get and store current row. When overwriting or updating a row, check against current revision and increment revisions. Insert previously stored row with version number.

Perform an update with

{
  UpdateExpression: 'SET revisions = :newRevisionCount',
  ExpressionAttributeValues: {
    ':newRevisionCount': previousRow.revisions + 1,
    ':expectedRevisionCount': previousRow.revisions,
  },
  ConditionExpression: 'revisions = :expectedRevisionCount',
}

We can use the same ConditionExpression with put when overwriting a previously existing row.

In the response, we are watching for ConditionalCheckFailedException. If this is returned, that means the revision was already changed by another process and we need to repeat the process from the beginning or abort entirely. If there are no exceptions, then we can insert the previous stored row after update the value on your version attribute as appropriate (numeric or string).

Cost

  • 1 read capacity unit per 4K for Get on v0
  • 1 write capacity unit per 1KB for Put/UpdateItem on v0
  • 1 write capacity unit per 1KB for Put on revision

On-demand, asynchronous blind item-update and revision-insert

Summary: Perform a "blind" update on the v0 row while incrementing revisions and requesting the old attributes. Use the return value to create a new row with the version number.

Perform an update-item with

{
  UpdateExpression: 'ADD revisions :revisionIncrement',
  ExpressionAttributeValues: {
    ':revisionIncrement': 1,
  },
  ReturnValues: 'ALL_OLD',
}

The ADD action will automatically create revisions if it doesn't exist and will consider it 0. One nice benefit of ReturnValues is:

There is no additional cost associated with requesting a return value aside from the small network and processing overhead of receiving a larger response. No read capacity units are consumed.

In the update response, the Attributes value will be the data from the old record. The version of this record is the value of Attributes.revisions + 1. Update the value on your version attribute as appropriate (numeric or string).

Now you can insert this record into your target table.

Cost

  • 1 write capacity unit per 1KB for Update on v0
  • 1 write capacity unit per 1KB for Put on revision

Notes:

  • The returned object's Attributes length is limited to 65535.
  • No solution for overwriting rows.

Automated asynchronous revision-insert

Summary: Perform "blind" updates and inserts on the primary while incrementing revisions. Use a Lambda trigger watching for changes to revision to insert revisions asynchronously.

Perform an update with

{
  UpdateExpression: 'ADD revisions :revisionIncrement',
  ExpressionAttributeValues: {
    ':revisionIncrement': 1,
  },
}

The ADD action will automatically create revisions if it doesn't exist and will consider it 0.

For overwriting records with put increment revisions value based on a previous get request.

Configure a DynamoDB Stream view type to return both new and old images. Setup a Lambda trigger against the database table. Here's a piece of sample code for NodeJS that would compare the old and new images and call a function to write the revisions in batch.

/**
 * @param {AWSLambda.DynamoDBStreamEvent} event
 * @return {void}
 */
export function handler(event) {
  const oldRevisions = event.Records
    .filter(record => record.dynamodb.OldImage
      && record.dynamodb.NewImage
      && record.dynamodb.OldImage.revision.N !== record.dynamodb.NewImage.revision.N)
    .map(record => record.dynamodb.OldImage);
  batchWriteRevisions(oldRevisions);
}

This is just sample, but production code would likely include more checks.

Cost

  • 1 read capacity unit per 4K for get on v0 (only when overwriting)
  • 1 write capacity unit per 1KB for Put/Update on v0
  • 1 DynamoDB Stream read request unit per GetRecords command
  • 1 write capacity unit per 1KB for Put of revision

Notes:

  • DynamoDB Stream shard data expires after 24 hours
  • DynamoDB Stream read request units are independent of table read capacity units
  • Use of Lambda functions has its own pricing
  • Changing stream view type requires disabling and re-enabling the stream
  • Works with Write, Put, BatchWriteItems, TransactWriteItems commands

For my use cases, I'm already using a DynamoDB Streams and I don't expect users to request versioned rows all that often. I also can let users wait a bit for the revisions to be ready since it's asynchronous. That makes using a second table and the automated lambda process the more ideal solution for me.

For the asynchronous options there are some points of failure. Still, it's something that you can either retry immediately on the on-demand requests, or schedule for later for the DynamoDB Stream solution.

If anybody has any other solutions or critiques, please comment. Thanks!