Achieving consistency in Azure Table Storage #1

In the next two posts I'll present two ways of achieving consistency in Azure Table Storage. I split this topic into two parts mostly because one has to understand how transactions work in Table Storage before we can go any further. Enough talking for now - let's dive in!

EGTs or Entity Group Transactions

This is something not so obvious at first and, to be honest, I wasn't aware of it when I started working with Table Storage. This is mostly due to a simple reason - in the documentation the two terms - EGT and batch transaction - are often used interchangeably, but in reality they are the same thing. I guess most people are familiar with batching in this Azure component, but for the sake of clarity, I'll quote a bit of information here.

Tables in Azure Storage allow you to perform batches, which will be executed as a single operation. Consider the following example (taken from https://docs.microsoft.com/en-us/azure/cosmos-db/table-storage-how-to-use-dotnet):

// The snippet uses the classic WindowsAzure.Storage table SDK.
// CustomerEntity is a simple TableEntity subclass with Email and PhoneNumber properties.
using Microsoft.Azure;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

// Retrieve the storage account from the connection string.
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
    CloudConfigurationManager.GetSetting("StorageConnectionString"));

// Create the table client.
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();

// Create the CloudTable object that represents the "people" table.
CloudTable table = tableClient.GetTableReference("people");

// Create the batch operation.
TableBatchOperation batchOperation = new TableBatchOperation();

// Create a customer entity and add it to the table.
CustomerEntity customer1 = new CustomerEntity("Smith", "Jeff");
customer1.Email = "Jeff@contoso.com";
customer1.PhoneNumber = "425-555-0104";

// Create another customer entity and add it to the table.
CustomerEntity customer2 = new CustomerEntity("Smith", "Ben");
customer2.Email = "Ben@contoso.com";
customer2.PhoneNumber = "425-555-0102";

// Add both customer entities to the batch insert operation.
batchOperation.Insert(customer1);
batchOperation.Insert(customer2);

// Execute the batch operation.
table.ExecuteBatch(batchOperation);

There are however some restrictions:

  • all operations have to be executed within a single partition
  • you're limited to only 100 operations per batch
  • the total payload of a single batch cannot exceed 4 MB

Depending on how you designed your table and what you're actually building, those limits can be more or less problematic. Let's investigate some patterns which can be helpful here.

Denormalization

Since many developers come from the world of relational databases, normalizing tables is second nature to them. This is a great skill... and unfortunately it becomes a real pain in the arse when working with NoSQL storage. Let's say we have a many-to-one relation and the "many" side holds more than 100 items which we'd like to move to another entity. We can lose consistency, because the number of operations we can perform in a single batch is limited. In such a scenario it could be viable to store references to the items inside the main entity, so the whole change still fits in one transaction (of course, as long as we stay within a single partition) - see the sketch below.
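
To make this more concrete, here's a minimal sketch of the idea (the DepartmentEntity class, the EmployeeIdsJson property and the use of Json.NET are my own assumptions for illustration): instead of 100+ child rows, the parent entity carries the references itself, so updating the relation is a single write rather than a batch that might not fit into one transaction.

using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage.Table;
using Newtonsoft.Json;

// Hypothetical denormalized entity - instead of 100+ child rows,
// the parent keeps a serialized list of references to its items.
public class DepartmentEntity : TableEntity
{
    public DepartmentEntity() { }

    public DepartmentEntity(string companyId, string departmentId)
        : base(companyId, departmentId) { }

    // References to the "many" side, stored as JSON inside the parent entity.
    public string EmployeeIdsJson { get; set; }

    // Convenience accessor - not persisted as a separate column.
    [IgnoreProperty]
    public List<string> EmployeeIds
    {
        get => string.IsNullOrEmpty(EmployeeIdsJson)
            ? new List<string>()
            : JsonConvert.DeserializeObject<List<string>>(EmployeeIdsJson);
        set => EmployeeIdsJson = JsonConvert.SerializeObject(value);
    }
}

Keep in mind the size limits if the list can grow large: a string property can hold up to 64 KB and a whole entity up to 1 MB.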

Denormalization & performance

We can extend the previous example a bit and consider the following scenario - we'd like to improve performance by limiting the number of requests we have to make when a query spans more than one table (e.g. an employee plus the employee's last payslip). To do so we could duplicate the data and store the second entity type alongside the first. To achieve consistency we'd have to ensure that both entity types live in the same table and partition, so we can update both in a single transaction - see the sketch below.
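
A minimal sketch of such an update, assuming the duplicated payslip summary lives in the same table and partition as the employee (EmployeeEntity, PayslipSummaryEntity and the "employees" table name are hypothetical):

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public class EmployeeEntity : TableEntity
{
    public EmployeeEntity() { }
    public EmployeeEntity(string pk, string rk) : base(pk, rk) { }
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public class PayslipSummaryEntity : TableEntity
{
    public PayslipSummaryEntity() { }
    public PayslipSummaryEntity(string pk, string rk) : base(pk, rk) { }
    public double Amount { get; set; }
    public DateTime PeriodEnd { get; set; }
}

CloudStorageAccount account = CloudStorageAccount.Parse("<connection-string>");
CloudTable table = account.CreateCloudTableClient().GetTableReference("employees");

// Both entities share the same partition key, so they can be written atomically.
var employee = new EmployeeEntity("contoso", "employee_001")
{
    FirstName = "Jeff",
    LastName = "Smith"
};

var lastPayslip = new PayslipSummaryEntity("contoso", "employee_001_lastpayslip")
{
    Amount = 4200,
    PeriodEnd = new DateTime(2018, 5, 31)
};

var batch = new TableBatchOperation();
batch.InsertOrReplace(employee);
batch.InsertOrReplace(lastPayslip);

// Executed as a single Entity Group Transaction - either both writes succeed or neither does.
table.ExecuteBatch(batch);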

Intra-partition secondary index pattern

This is a similar pattern to the inter-partition secondary index pattern, which I described previously (this one, however, lets you achieve consistency by using EGTs, since all the data is stored in the same partition).

Considerations

When considering consistency in Table Storage and storing data in the same partition (or at least duplicating it by creating secondary indexes), you have to think about your scalability targets. As you may know, minimizing the number of partitions can affect how your solution scales in the future, because partitions are the main unit of load balancing. As always, everything depends on the characteristics of your application and which features of this storage you're interested in the most.

What's next?

In the next post we'll focus on inter-partition transactions and what can be done in that area. Stay tuned!

Azure Table Storage good practices - Intra-partition secondary index pattern

Searching for records in Table Storage is super fast when using Partition and Row Keys. For most scenarios such a setup is sufficient - either you don't need additional properties to filter with, or you're interested in large portions of data which will be processed in your code. Sometimes, however, you'd like to be able to perform quick filtering using a custom column. This is where the intra-partition secondary index pattern helps.

The problem

The reason why queries using PKs and RKs are so fast lies in the fact that Table Storage automatically indexes entities using those columns. This is how it was designed and developed - in reality most scenarios are covered by this setup. On the other hand, we'd like to be able to create a table which keeps this superb performance and still allows querying by other columns (like FirstName, City or Street in an Employee table). Of course it's possible to perform partition scans and filter the proper values in your code, yet the additional overhead could be something you cannot accept - see the snippet below. We have to design the table in such a way that the additional filters are incorporated into the internal design of Table Storage.
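
Just to illustrate that overhead, here's a sketch of the naive approach (EmployeeEntity and the "employees" table name are my own naming): the whole partition travels to the client and the LastName filter is only applied in memory.

using System.Linq;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public class EmployeeEntity : TableEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

CloudStorageAccount account = CloudStorageAccount.Parse("<connection-string>");
CloudTable table = account.CreateCloudTableClient().GetTableReference("employees");

// Partition scan - every entity in the 'employee' partition is read...
var partitionScan = new TableQuery<EmployeeEntity>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "employee"));

// ...and only then do we filter by LastName on our side. Pushing this condition
// into the query wouldn't help much either - LastName isn't indexed, so the
// service still has to scan the whole partition.
var does = table.ExecuteQuery(partitionScan)
    .Where(e => e.LastName == "Doe")
    .ToList();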

The solution

The solution here (as with most solutions I present) is pretty simple. Since we know that indexes are created using PKs and RKs, we have to encode the additional values in them. This will allow us to take advantage of the indexing feature and avoid the extra overhead of filtering values ourselves. Let's consider the following example:

PK       | RK | FirstName | LastName
employee | 1  | John      | Doe
employee | 2  | Jane      | Doe

If we'd like to filter the retrieved records by LastName = `Doe`, we're forced to do it on our side, possibly fetching more records than we need and lowering the performance of our application. Now let's redesign it a little bit:

PK       | RK             | FirstName | LastName
employee | 1              | John      | Doe
employee | 2              | Jane      | Doe
employee | lastname_Doe_1 | John      | Doe
employee | lastname_Doe_2 | Jane      | Doe

Note that the original row key is appended to each index entity's row key - a partition key/row key pair has to be unique, so two rows with the same `lastname_Doe` key wouldn't be allowed. Now we can perform the following filtering on this table:

$filter=(PartitionKey eq 'employee') and (RowKey ge 'lastname_Doe_') and (RowKey lt 'lastname_Doe`')

This returns only the records we need - every row key starting with the `lastname_Doe_` prefix and nothing else (the backtick is the character directly after the underscore, which turns the range into an exact prefix match).
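
To close the loop with the first post, here's a hedged sketch of how the index rows could be kept consistent with the data rows - the EmployeeEntity class and the "employees" table name are my assumptions. Because the data entity and its index entity share a partition key, both inserts can go into one EGT, and the prefix query then hits only the index rows:

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public class EmployeeEntity : TableEntity
{
    public EmployeeEntity() { }
    public EmployeeEntity(string pk, string rk) : base(pk, rk) { }
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

CloudStorageAccount account = CloudStorageAccount.Parse("<connection-string>");
CloudTable table = account.CreateCloudTableClient().GetTableReference("employees");

var employee = new EmployeeEntity("employee", "1") { FirstName = "John", LastName = "Doe" };

// Index entity: same data, but the row key encodes the value we want to filter by.
var index = new EmployeeEntity("employee", "lastname_Doe_1") { FirstName = "John", LastName = "Doe" };

// Same partition key => both writes form a single Entity Group Transaction,
// so the data row and its index row can never get out of sync.
var batch = new TableBatchOperation();
batch.Insert(employee);
batch.Insert(index);
table.ExecuteBatch(batch);

// Query only the index rows for the 'Doe' last name (prefix range on the row key).
var byLastName = new TableQuery<EmployeeEntity>().Where(
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "employee"),
        TableOperators.And,
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, "lastname_Doe_"),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, "lastname_Doe`"))));

foreach (var doe in table.ExecuteQuery(byLastName))
{
    // doe.FirstName, doe.LastName...
}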

Summary

As you can see, small changes in Table Storage design can result in significant performance improvements. The main consideration here is the duplication of data, which has to be handled on your side. If performance is important to you, this is a small price to pay for the overall "level-up" in the areas you care about the most.