Dat tricky Throughput Unit

This is a perfect test of whether you understand the service or not. Pardon me if this was/is obvious to you - apparently for me it was not. What's more, some people still seem to be confused. And we're here to avoid confusion. Let's start!

Hit the limit

Event Hub's model (from both a technical and a pricing point of view) is one of the easiest to understand and really, really straightforward. Let's perform a quick calculation:

I need to perform 1K ops / sec - I need 1 TU
I need to perform 10K ops / sec - I need 10 TUs

In the overall price we can ignore the cost related to the number of events processed (it's in the region of 5% of the money spent on this service). Now there's one more thing worth mentioning - TU capacity comes as a pair: 1 TU gives you 1 MB/s (or 1K events/s) of ingress and 2 MB/s (or 2K events/s) of egress. The question is:

"How to kill Event Hub's egress?"

Let's focus for a moment and try to find a scenario. We cannot easily exceed 1 MB/s of egress when the maximum ingress is 1 MB/s. Maybe it'd be doable by loading lots of data into EH first and then introducing a consumer which is able to fetch and process 2 MB of events per second. Still, that doesn't allow us to exceed the maximum of 2 MB/s of egress. We're safe.

But what if you introduce another *consumer group*? Since there's no filtering in Event Hub, each consumer group gets the same set of events (in other words - with N consumer groups, you read the stream N times). Now consider the following scenario:

1MB of ingress
Consumer1(Consumer Group A)
Consumer2(Consumer Group B)

You've just hit the limit of 1 TU (since you now have 2 MB/s of egress). Now let's try to scale and extend this solution by introducing another consumer group:

1MB of ingress
Consumer1(Consumer Group A)
Consumer2(Consumer Group B)
Consumer3(Consumer Group C)

Now 1 TU is not sufficient. By scaling out our Event Hub to 2 TUs we can handle up to 4 MB/s of egress, and at the same time take in up to 2 MB/s of ingress. So if for some reason throttling was your friend and kept the load below a certain limit, you can quickly face problems again and have to scale out once more.
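To make the arithmetic explicit, here's a minimal C# sketch (my own helper, not an official SDK call) that estimates how many TUs a given ingress rate and number of consumer groups require, assuming 1 TU = 1 MB/s ingress and 2 MB/s egress, and that every consumer group reads the full stream:

using System;

public static class ThroughputUnitEstimator
{
	// Rough estimate only - assumes every consumer group reads the whole stream
	// and ignores the per-event (1K events/s) dimension of a TU.
	public static int RequiredTus(double ingressMbPerSec, int consumerGroups)
	{
		var egressMbPerSec = ingressMbPerSec * consumerGroups;

		var tusForIngress = (int)Math.Ceiling(ingressMbPerSec / 1.0); // 1 MB/s of ingress per TU
		var tusForEgress = (int)Math.Ceiling(egressMbPerSec / 2.0);   // 2 MB/s of egress per TU

		return Math.Max(tusForIngress, tusForEgress);
	}
}

// RequiredTus(1, 2) == 1 -> two consumer groups exactly saturate a single TU
// RequiredTus(1, 3) == 2 -> a third consumer group forces a scale-out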

Be smarter

As you can see, mistakes are easy to make, and relying on consumer groups to filter (or orchestrate) events is not the way to go. In such a scenario it'd be much better to post events directly to e.g. Event Grid, or to use Service Bus topics, so messages can be routed easily. You have to understand that the main purpose of Event Hub is to act as a really big pipe for data which can be easily digested - misusing it can give you serious headaches.
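If routing is what you actually need, a Service Bus topic with filtered subscriptions gives you exactly that. Below is a rough sketch using the Azure.Messaging.ServiceBus package; the topic and subscription names, the filter properties and the connection string placeholder are all made up for illustration:

using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;
using Azure.Messaging.ServiceBus.Administration;

public static class EventRoutingSketch
{
	private const string ConnectionString = "<service-bus-connection-string>";
	private const string TopicName = "telemetry";

	public static async Task SetUpRoutingAsync()
	{
		var admin = new ServiceBusAdministrationClient(ConnectionString);

		await admin.CreateTopicAsync(TopicName);

		// Each subscription only receives the messages matching its filter -
		// which is exactly what Event Hub consumer groups cannot do.
		await admin.CreateSubscriptionAsync(
			new CreateSubscriptionOptions(TopicName, "alerts"),
			new CreateRuleOptions("HighPriorityOnly", new SqlRuleFilter("Priority = 'High'")));

		await admin.CreateSubscriptionAsync(
			new CreateSubscriptionOptions(TopicName, "audit"),
			new CreateRuleOptions("AuditOnly", new SqlRuleFilter("MessageType = 'Audit'")));
	}

	public static async Task SendAsync()
	{
		await using var client = new ServiceBusClient(ConnectionString);
		var sender = client.CreateSender(TopicName);

		var message = new ServiceBusMessage("payload");
		message.ApplicationProperties["Priority"] = "High"; // routed to the "alerts" subscription only

		await sender.SendMessageAsync(message);
	}
}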

 

Don't be afraid to store it twice

In most cases we try to follow the DRY approach - Don't Repeat Yourself. This simple yet powerful principle, which - if executed correctly - can turn meh projects into good ones, doesn't work well in every scenario and use case. As someone once said, "one size doesn't fit all" - you shouldn't follow every pattern or approach blindly. You can easily get burnt.

Problem

Imagine the following scenario (or rather requirement):

We have the following concepts in the application: Company, Group and User. A User can have access to multiple Companies (one of which is their main one). A Group is a logical concept which exists within a Company.

The question is - how do we store the data in Table Storage so that we can easily query both all Groups within a Company and a single Group?

Let's skip User for a second and consider the following implementation:

[FunctionName("GroupList")]
public static Task<HttpResponseMessage> GroupList(
	[HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "group")] HttpRequestMessage req,
	[Table(TableName, Connection = "AzureWebJobsStorage")] IQueryable<GroupEntity> groups,
	[Table(Company.TableName, Connection = "AzureWebJobsStorage")] IQueryable<Company.CompanyEntity> companies,
	TraceWriter log)
{
	var groupsQuery = groups.Take(100).ToList().Select(_ => new
	{
		Id = _.RowKey,
		Name = _.Name,
		Description = _.Description,
		// ReSharper disable once ReplaceWithSingleCallToFirst
		CompanyName = companies.Where(company => company.RowKey == _.CompanyId.ToString()).FirstOrDefault()?.Name
	}).ToList();

	var response = req.CreateResponse(HttpStatusCode.OK,
		groupsQuery);

	return Task.FromResult(response);
}

[FunctionName("GroupGet")]
public static Task<HttpResponseMessage> GroupGet(
	[HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "group/{id}")] HttpRequestMessage req,
	// note: {partitionKey} is the Company identifier - the caller has to know it upfront
	[Table(TableName, "{partitionKey}", "{id}", Connection = "AzureWebJobsStorage")] GroupEntity group,
	TraceWriter log)
{
	var response = req.CreateResponse(HttpStatusCode.OK, group);
	return Task.FromResult(response);
}

public class GroupEntity : TableEntity
{
	public GroupEntity()
	{
		// parameterless constructor required by the Table bindings and object initializers
	}

	public GroupEntity(Guid companyId)
	{
		PartitionKey = companyId.ToString();
		RowKey = Guid.NewGuid().ToString();
	}

	public string Name { get; set; }
	public Guid CompanyId { get; set; }
	public string Description { get; set; }
}

All right, initially it all looks fine. If we want to get a list of User-specific Groups, we'll just add some context (e.g. store all of the user's companies somewhere and use their identifiers as parameters in a query). Now let's request a single Group. We have a Group identifier, so we can get the whole row from the database...

NO WE CAN'T!

Since GroupEntity's PK is equal to the Company's identifier, we'd have to make one query per each Company a user has access to. Not very smart, not very clean. What to do? Change the PK in GroupEntity to a generic one? We'd lose the ability to make fast queries for all Groups within a Company. Make a combined identifier and use it as the PK? We'd still have to perform multiple queries. Go for SQL and perform proper JOINs? That's definitely a possibility - but we don't need the other features of a relational database. Is it a dead end?
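To see why this hurts, here's a rough sketch of what a single-Group lookup degenerates into when the PK is the Company identifier. The FindGroupAsync helper and its parameters are hypothetical, and I'm assuming the classic CloudTable client:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

public static class GroupLookupSketch
{
	public static async Task<GroupEntity> FindGroupAsync(
		CloudTable groupsTable, IEnumerable<Guid> userCompanyIds, string groupId)
	{
		// We don't know which partition the Group lives in, so in the worst case
		// we pay one round-trip per Company the User has access to.
		foreach (var companyId in userCompanyIds)
		{
			var retrieve = TableOperation.Retrieve<GroupEntity>(companyId.ToString(), groupId);
			var result = await groupsTable.ExecuteAsync(retrieve);

			if (result.Result is GroupEntity group)
			{
				return group;
			}
		}

		return null;
	}
}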

Solution

One thing in Azure Storage is really cheap - the storage itself. So how can we remodel our tables to improve performance and lower the number of transactions? Well, we can store our data twice!

Before you start throwing tomatoes at me, consider the following example:

[FunctionName("GroupCreate")]
public static async Task<HttpResponseMessage> GroupCreate(
	[HttpTrigger(AuthorizationLevel.Anonymous, "post", Route = "group")] HttpRequestMessage req,
	[Table(TableName, Connection = "AzureWebJobsStorage")] IAsyncCollector<GroupEntity> groups,
	TraceWriter log)
{
	var data = JsonConvert.DeserializeObject<GroupEntity>(await req.Content.ReadAsStringAsync());

	var group = new GroupEntity(data.CompanyId)
	{
		Name = data.Name,
		CompanyId = data.CompanyId,
		Description = data.Description
	};

	await groups.AddAsync(group);

	// the same data once more, this time under the generic "group" partition,
	// so a single Group can be fetched with just its identifier
	await groups.AddAsync(new GroupEntity
	{
		PartitionKey = "group",
		RowKey = group.RowKey,
		Name = data.Name,
		CompanyId = data.CompanyId,
		Description = data.Description
	});

	var response = req.CreateResponse(HttpStatusCode.OK, group);
	return response;
}

Here we're storing the row twice, with one slight change. The first insert stores the data exactly as before. The second one is the crucial one - it changes the PK to a generic one named "group". Thanks to this we get two almost identical rows: one that keeps fast listing of all Groups within a Company (Company identifier as PK), and one that lets us fetch a single Group directly by its identifier (the generic "group" PK with the Group identifier as RK).

Now you may ask - how does this prevent a row from being returned when a User doesn't have access to a given Company? That's exactly why we store the Company identifier along with the row, in the CompanyId column. This is a much quicker and cleaner solution than performing several requests to Table Storage - we can cache the User's Company identifiers locally and just check whether they match.
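To make that concrete, here's how the single-Group function could look once the duplicated row exists. This is only a sketch in the style of the functions above; GetCompaniesForCurrentUser is a hypothetical helper standing in for wherever you cache the User's Company identifiers:

[FunctionName("GroupGetV2")]
public static Task<HttpResponseMessage> GroupGetV2(
	[HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v2/group/{id}")] HttpRequestMessage req,
	[Table(TableName, "group", "{id}", Connection = "AzureWebJobsStorage")] GroupEntity group,
	TraceWriter log)
{
	// hypothetical helper - e.g. Company identifiers cached locally for the current User
	var userCompanies = GetCompaniesForCurrentUser(req);

	if (group == null || !userCompanies.Contains(group.CompanyId))
	{
		return Task.FromResult(req.CreateResponse(HttpStatusCode.NotFound));
	}

	return Task.FromResult(req.CreateResponse(HttpStatusCode.OK, group));
}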

Summary

Modeling Table Storage is both challenging and rewarding - you can easily hit performance problems if tables are not designed carefully; on the other hand, a wise design allows you to really push the limits. Such redesigns are important for one more reason - they save time. And in the cloud, time = money. Make sure you pay only as much as you need.