Azure Table Storage good practices - Intra-partition secondary index pattern

Searching for records in Table Storage is super fast when using Partition and Row Keys. For most scenarios such a setup will be sufficient - you either don't need additional properties to filter with, or you're interested in large portions of data which will be processed in your code. However, sometimes you'd like to make it possible to perform quick filtering using a custom column. This is where the intra-partition secondary index pattern helps.

The problem

The reason why queries using PKs and RKs are so fast lies in the fact that Table Storage automatically indexes entities using those columns. This is how it was designed and developed - in reality most scenarios are covered by this setup. On the other hand, we'd like to be able to create a table which keeps superb performance and still allows querying other columns (like FirstName, City or Street in an Employee table). Of course it's possible to perform partition scans and filter the proper values in your code, yet the additional overhead could be something you cannot accept. We have to design the table in such a way that we somehow incorporate additional filters into the internal design of Table Storage.
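
For example, a filter like the one below (querying a hypothetical Employee table) has to scan the whole partition, because LastName is not part of the index:

```
$filter=(PartitionKey eq 'employee') and (LastName eq 'Doe')
```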

The solution

The solution here (as with most solutions I present) is pretty simple. If we know that indexes are created using PKs and RKs, we have to add additional values to them. This will allow us to take advantage of the indexing feature and avoid additional overhead when filtering values. Let's consider the following example:

```
PK       | RK | FirstName | LastName
employee | 1  | John      | Doe
employee | 2  | Jane      | Doe
```

If we'd like to filter the retrieved records using LastName = `Doe`, we'd be forced to do it on our side, possibly fetching more records than we need and lowering the performance of our application. Now let's redesign it a little bit:

```
PK       | RK             | FirstName | LastName
employee | 1              | John      | Doe
employee | 2              | Jane      | Doe
employee | lastname_Doe_1 | John      | Doe
employee | lastname_Doe_2 | Jane      | Doe
```

Note that each PK and RK combination must be unique - two rows with the row key `lastname_Doe` would collide - which is why the employee's id is appended to the index row key. Now we can perform the following prefix filtering on this table:

```
$filter=(PartitionKey eq 'employee') and (RowKey ge 'lastname_Doe_') and (RowKey lt 'lastname_Doe`')
```

Since the backtick is the character directly after `_`, this range matches every row key with the `lastname_Doe_` prefix, retrieving only those records we need.
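
In C# the same query could look like the sketch below - a minimal example assuming the Azure.Data.Tables SDK (the table name and connection string are made up):

```csharp
using Azure.Data.Tables;

var client = new TableClient("<connection string>", "employees");

// The backtick is the character directly after '_', so this range covers
// every row key starting with the "lastname_Doe_" prefix.
var doeFamily = client.Query<TableEntity>(
    "PartitionKey eq 'employee' and RowKey ge 'lastname_Doe_' and RowKey lt 'lastname_Doe`'");

foreach (var employee in doeFamily)
{
    Console.WriteLine($"{employee.GetString("FirstName")} {employee.GetString("LastName")}");
}
```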

Summary

As you can see, small changes in Table Storage design can result in significant performance improvements. Some consideration should be given to the duplication of data this pattern introduces - keeping the index rows in sync with the source rows has to be handled on your side. If performance is important to you, this is a small price for the overall "level-up" in the areas you care about the most.
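
Because the source row and its index row share the same partition key, you can keep them consistent by writing both in a single entity group transaction. A minimal sketch, again assuming the Azure.Data.Tables SDK:

```csharp
using Azure.Data.Tables;

var client = new TableClient("<connection string>", "employees");

var source = new TableEntity("employee", "2")
{
    ["FirstName"] = "Jane",
    ["LastName"] = "Doe"
};
var index = new TableEntity("employee", "lastname_Doe_2")
{
    ["FirstName"] = "Jane",
    ["LastName"] = "Doe"
};

// Both entities live in the same partition, so they can be inserted
// atomically - either both rows exist afterwards or neither does.
await client.SubmitTransactionAsync(new[]
{
    new TableTransactionAction(TableTransactionActionType.Add, source),
    new TableTransactionAction(TableTransactionActionType.Add, index)
});
```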

Azure Table Storage good practices - log tail pattern

Although Azure Table Storage is a pretty simple and straightforward solution, designing it so you can get the most from it is not an easy task. I've seen projects where invalid decisions regarding table structure, wrong partition key values or "who cares" row keys led to constant degradation of performance and rising maintenance costs. Many of those problems could be avoided by introducing smart yet easy-to-adopt patterns, of which the one I like the most is the log tail pattern.

The problem

Let's consider the following table:

```
PK | RK | Timestamp | Event
```

As you may know, rows inserted into an Azure Storage table are always stored in ascending lexicographical order by row key. This allows you to go from the start to the end of a segment pretty quickly and with predictable performance. If we assume that we have the following rows in our table:

```
PK  | RK    | Timestamp  | Event
foo | 00001 | 01/01/2000 | { "Event": "Some_event" }
foo | 00002 | 01/01/2000 | { "Event": "Some_event" }
foo | 00003 | 01/01/2000 | { "Event": "Some_event" }
(...)
foo | 09999 | 01/01/2000 | { "Event": "Some_event" }
foo | 10000 | 01/01/2000 | { "Event": "Some_event" }
```

we can fetch a particular row really quickly, because we can query it directly using PK and RK. We know that 00001 will be stored before 00010, and 00100 before 06578 (note the zero-padding - compared as plain strings, 9999 would sort after 10000). The problem appears when we cannot quickly determine which RK is greater - mostly because we used e.g. a custom identifier like a combination of a GUID and a timestamp. This often forces us to query large portions of a table just to find the most recent records. It'd be possible to use a condition like WHERE RowKey > $some_value in our query, but it still introduces some overhead, which could be critical in some scenarios. How can we store our data in Table Storage so that the most recent records can be retrieved quickly and efficiently?
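
For completeness, that workaround would look like this in $filter syntax - it still has to range-scan the partition from that point on:

```
$filter=(PartitionKey eq 'foo') and (RowKey gt '09990')
```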

Log tail pattern

Fortunately the solution is really easy here, and if the decision is made early, it doesn't require much effort to introduce. The idea is simple - find a row key which will "reorder" our table in such a way that the newest rows are also the first ones in the table. The concept may seem tricky initially, but it will soon be the first thing you think about when you hear "Azure Table Storage" ;)

Let's consider the following solution (taken from here):

```csharp
// Subtracting the current ticks from DateTime.MaxValue.Ticks inverts the sort order,
// and "D19" left-pads with zeroes so the string sorts the same way as the number.
string invertedTicks = string.Format("{0:D19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks);
DateTime dt = new DateTime(DateTime.MaxValue.Ticks - Int64.Parse(invertedTicks));
```

This reverses the order in which rows are stored in a table and allows you to quickly fetch the ones you're interested in the most. It's especially useful when creating all kinds of logs and appending to them, where you're usually interested mostly in the most recent records.
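
To show the pattern end to end, here is a minimal sketch, assuming the Azure.Data.Tables SDK and a hypothetical logs table - the newest entries come back first without any extra filtering:

```csharp
using System.Linq;
using Azure.Data.Tables;

var client = new TableClient("<connection string>", "logs");

// The inverted-ticks row key makes the newest entry sort first in the partition.
string invertedTicks = string.Format("{0:D19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks);
await client.AddEntityAsync(new TableEntity("foo", invertedTicks)
{
    ["Event"] = "Some_event"
});

// Rows come back in row key order, so the first 10 results
// are automatically the 10 most recent entries.
var latest = client.Query<TableEntity>(e => e.PartitionKey == "foo").Take(10);
```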

Summary

I rely heavily on this pattern in many projects I've created (both for my private use and commercial ones) and it really helps in creating an efficient table structure, which can be easily queried and is optimized for a particular use. Of course it cannot be used in all scenarios, but for some it's a must-have.

One of the things you have to consider here is that you must pad the inverted tick value with leading zeroes to ensure the string value sorts as expected (something that the "D19" format specifier passed to string.Format() takes care of in the example). Without this fix you can end up with incorrectly ordered rows. Nonetheless it's a small price to pay for a proper design and performance.
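
A quick illustration of why the padding matters - row keys are compared as strings, not as numbers:

```csharp
// Without padding, "10" sorts before "9" because '1' < '9' in ordinal comparison.
Console.WriteLine(string.CompareOrdinal("10", "9") < 0);                  // True
// With zero-padding, lexicographical order matches numeric order again.
Console.WriteLine(string.CompareOrdinal("0000000009", "0000000010") < 0); // True
```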