Achieving consistency in Azure Table Storage #2

In the previous post I presented some easy ways of achieving consistency in Table Storage by storing data within one partition. This works in the majority of scenarios; there are some cases, however, when you have to divide records (because of the design, to preserve scalability or to logically separate different concerns) and still ensure that you can perform operations within a transaction. You may be a bit concerned - after all, we just established that storing data within a single partition (at least from the transaction point of view) is required to actually be able to perform EGTs. Well, as always there's a way to go beyond that limit and achieve what we're aiming for. Let's go!

Eventually consistent transactions

Just a note before we start - this pattern doesn't guarantee that a transaction is isolated. This means that a client will be able to read data while the transaction is still being processed. Unfortunately, there's no easy way to completely lock tables while an inter-partition operation is being performed.

Let's get back to eventual consistency. What does it actually mean? The answer is pretty simple - once a transaction has finished, our data can be considered consistent. All right, but what's the difference compared to a transaction performed as an EGT?

In an EGT you perform at most 100 operations without any possibility of observing the ongoing process - you always see only the result of the transaction. With eventual consistency you can divide the process into steps:

  • get an entity from Partition 1
  • insert an entity into Partition 2
  • delete an entity from Partition 1

Of course you can have more than just 3 steps. The crux here is the clear division between the steps. Now let's consider other read operations performed while such a transaction is running:

  • get an entity from Partition 1
  • get an entity from Partition 2
  • insert an entity into Partition 2
  • get an entity from Partition 1
  • delete an entity from Partition 1

The whole picture should be clearer now. With eventual consistency, the two additional read operations (the second and the fourth step) are clear victims of read phenomena - the first of them doesn't see the entity in Partition 2 yet, while the second one still sees it in Partition 1, even though the transaction is already in progress. Always consider the possible drawbacks of a solution like this and, if needed, use another database which isolates transactions.
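
To make the division between the steps more tangible, here's a minimal sketch of such a flow written against the classic Storage SDK. The partition and row keys are placeholders, CustomerEntity is the type from the Azure documentation sample quoted in the previous post (any TableEntity would do) and table is an already initialized CloudTable reference:

// 1. Get an entity from Partition 1 (the keys below are placeholders).
TableResult retrieved = table.Execute(
    TableOperation.Retrieve<CustomerEntity>("Partition1", "entity-1"));
CustomerEntity entity = (CustomerEntity)retrieved.Result;

// 2. Insert a copy into Partition 2.
// InsertOrReplace is used so that re-running this step is harmless.
CustomerEntity copy = new CustomerEntity
{
    PartitionKey = "Partition2",
    RowKey = entity.RowKey,
    Email = entity.Email,
    PhoneNumber = entity.PhoneNumber
};
table.Execute(TableOperation.InsertOrReplace(copy));

// 3. Delete the original from Partition 1.
// Between step 2 and step 3 another client can see the entity in both
// partitions - this is exactly the lack of isolation mentioned above.
table.Execute(TableOperation.Delete(entity));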

Achieving eventual consistency

To achieve eventual consistency we have to introduce a few additional components into our architecture. We'll need at least two things:

  • a queue which holds the actions that should be performed within a transaction
  • worker roles which read messages from the queue and perform the actual transactions

Now let's talk about each new component in detail.

Queue

By using a queue we're able to easily orchestrate the operations that should be performed by worker roles. The simplest example is a project which archives records stored in Table Storage. Thanks to a queue we can post a message saying 'archive this record', which can be read by other components and processed. Finally, workers can post their own messages saying that an action has finished.
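
Just to illustrate the idea, here's a minimal sketch of posting such a message - the queue name and the payload are only examples, and storageAccount is assumed to be a CloudStorageAccount created the usual way:

// Requires the Microsoft.WindowsAzure.Storage.Queue namespace.
CloudQueueClient queueClient = storageAccount.CreateCloudQueueClient();

// Get (and create, if needed) the queue holding archive requests.
CloudQueue queue = queueClient.GetQueueReference("archive-requests");
queue.CreateIfNotExists();

// Post a message describing the record to archive.
// Here the payload is simply "PartitionKey;RowKey" of the entity to move.
queue.AddMessage(new CloudQueueMessage("Partition1;entity-1"));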

Worker role

When we talk about workers, we mean simple services which perform some part of a flow. In the eventual consistency pattern they're responsible for handling the transaction logic. Coming back to the example from the previous point, a worker would be responsible for copying an entity from one table to another and then deleting the original. The important note here is idempotence - you have to ensure that you won't add more than one instance of an entity if the flow is restarted. The same goes for deleting things - you should delete an entity only if it still exists.
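
Below is a rough sketch of such an idempotent worker, reusing the hypothetical queue and payload from the previous snippet; sourceTable and archiveTable are assumed to be initialized CloudTable references:

// Read the next archive request from the queue.
CloudQueueMessage message = queue.GetMessage();
if (message != null)
{
    string[] keys = message.AsString.Split(';');

    TableResult result = sourceTable.Execute(
        TableOperation.Retrieve<CustomerEntity>(keys[0], keys[1]));
    CustomerEntity entity = (CustomerEntity)result.Result;

    if (entity != null)
    {
        // Idempotent insert - running this twice leaves the same end state.
        archiveTable.Execute(TableOperation.InsertOrReplace(entity));

        // Unconditional delete ("*" ETag); tolerate 404, which only means
        // the entity was already removed by a previous, interrupted run.
        entity.ETag = "*";
        try
        {
            sourceTable.Execute(TableOperation.Delete(entity));
        }
        catch (StorageException ex) when (ex.RequestInformation.HttpStatusCode == 404)
        {
        }
    }

    // Remove the message only once the whole flow has succeeded.
    queue.DeleteMessage(message);
}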

Considerations

You can apply this pattern not only to operations spanning different partitions - it also works when you combine tables with other components like blobs. It has some obvious drawbacks, like the lack of isolation or the additional components (queue, workers) that have to be handled in your code. On the other hand it's a perfectly valid approach, especially in a table-plus-other-storage scenario.

Achieving consistency in Azure Table Storage #1

In the upcoming two posts I'll show you two ways of achieving consistency in Azure Table Storage. I split this topic into two parts mostly because one has to understand how transactions work in Table Storage before we go any further. Enough talking for now - let's dive deeper!

EGTs or Entity Group Transactions

This is something that isn't obvious initially and, to be honest, I wasn't aware of it when I started working with Table Storage. This is mostly due to a simple reason - in the documentation the two terms, EGT and batch transaction, are used interchangeably, because in reality they are the same thing. I guess most people are familiar with batching in this Azure component, but for the sake of clarity, I'll quote a bit of information here.

Tables in Azure Storage allow you to batch operations so that they are executed as a single atomic operation. Consider the following example (taken from https://docs.microsoft.com/en-us/azure/cosmos-db/table-storage-how-to-use-dotnet):

// Retrieve the storage account from the connection string.
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
    CloudConfigurationManager.GetSetting("StorageConnectionString"));

// Create the table client.
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();

// Create the CloudTable object that represents the "people" table.
CloudTable table = tableClient.GetTableReference("people");

// Create the batch operation.
TableBatchOperation batchOperation = new TableBatchOperation();

// Create a customer entity and add it to the table.
CustomerEntity customer1 = new CustomerEntity("Smith", "Jeff");
customer1.Email = "Jeff@contoso.com";
customer1.PhoneNumber = "425-555-0104";

// Create another customer entity and add it to the table.
CustomerEntity customer2 = new CustomerEntity("Smith", "Ben");
customer2.Email = "Ben@contoso.com";
customer2.PhoneNumber = "425-555-0102";

// Add both customer entities to the batch insert operation.
batchOperation.Insert(customer1);
batchOperation.Insert(customer2);

// Execute the batch operation.
table.ExecuteBatch(batchOperation);

There are however some restrictions:

  • all operations have to be executed within a single partition
  • you're limited to only 100 operations per batch

Depending on how you designed your table and what you're actually building, those restrictions can be more or less problematic. Let's investigate some patterns which can be helpful here.

Denormalization

Since many developers come from the world of relational databases, normalizing tables is second nature to them. This is a great skill... and unfortunately it becomes a real pain in the arse when working with NoSQL storage. Let's say we have a many-to-one relationship. If the 'many' side holds more than 100 items which we'd like to move to another entity, we can lose consistency, since the number of operations we can perform at once is limited. In such a scenario it could be viable to store references to the items on the main entity, so that a transaction is still possible (as long as we stay within a single partition).
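
As a simple illustration (the entity and property names below are made up for this example), the references could live on the main entity itself, so updating them is a single-entity operation and never hits the batch limit:

// A denormalized entity which keeps references to its related items
// as one serialized property instead of 100+ separate rows.
public class OrderEntity : TableEntity
{
    public OrderEntity() { }

    public OrderEntity(string customerId, string orderId)
    {
        PartitionKey = customerId;
        RowKey = orderId;
    }

    // Row keys of the related item entities, e.g. "item-1;item-2;item-3".
    public string ItemReferences { get; set; }
}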

Denormalization & performance

We can extend the previous example a bit and consider the following scenario - we'd like to improve performance by limiting the number of requests needed when a query spans more than one table (e.g. employee + employee's last payslip). To do so we could duplicate data and store the second table as an extension of the first. To keep things consistent we'd have to ensure that both tables live in the same partition, so we can update them both in one transaction.
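
A small sketch of what this could look like - the keys and property names are only an example, but the important part is that both entities share a partition key, so the duplicated value can be updated within a single EGT:

// The employee entity and the duplicated "last payslip" entity live in the
// same partition (here: the employee id), so both copies of the amount
// can be updated atomically in one batch.
TableBatchOperation batch = new TableBatchOperation();

DynamicTableEntity employee = new DynamicTableEntity("employee-123", "employee");
employee.Properties["LastPayslipAmount"] = EntityProperty.GeneratePropertyForDouble(4200.0);

DynamicTableEntity payslip = new DynamicTableEntity("employee-123", "payslip-2018-06");
payslip.Properties["Amount"] = EntityProperty.GeneratePropertyForDouble(4200.0);

batch.InsertOrMerge(employee);
batch.InsertOrMerge(payslip);

table.ExecuteBatch(batch);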

Intra-partition secondary index pattern

This is a similar pattern to the inter-partition secondary index pattern, which I described previously - this one, however, lets you achieve consistency by using EGTs, since all data is stored within the same partition.
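
For example (the key prefixes and property names are just a made-up convention), an employee could be stored twice within the same partition - once keyed by its id and once by its e-mail address - and both rows committed in one EGT:

// The same employee stored under two row keys within one partition (the
// department), so a point query can use either the id or the e-mail address.
DynamicTableEntity byId = new DynamicTableEntity("Sales", "empid_000123");
byId.Properties["Email"] = EntityProperty.GeneratePropertyForString("jeff@contoso.com");

DynamicTableEntity byEmail = new DynamicTableEntity("Sales", "email_jeff@contoso.com");
byEmail.Properties["EmployeeId"] = EntityProperty.GeneratePropertyForString("000123");

TableBatchOperation batch = new TableBatchOperation();
batch.Insert(byId);
batch.Insert(byEmail);

// Either both rows are committed or none of them is.
table.ExecuteBatch(batch);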

Considerations

When considering consistency in Table Storage and storing data in the same partition (or at least duplicating it by creating secondary indexes), you have to think about your scalability targets. As you may know, minimizing the number of partitions can affect how your solution scales in the future, because partitions are the main factor in load balancing requests. As always, it all depends on the characteristics of your application and which features of this storage you're most interested in.

What's next?

In the next post we'll focus on inter-partition transactions and what can be done in that area. Stay tuned!