Appending data to Azure Storage Blob concurrently

Append Blob is a fairly common feature of Azure Storage that makes all kinds of logging and data aggregation a piece of cake. While the whole concept is super easy (at least from the SDK client's point of view), using it in a real scenario can give you headaches. Why, you may ask? Well, the nature of appends is not so obvious, and at first glance our perceptions can be deceived.

A quick look at the documentation reveals how limited our options are when it comes to appending data to a blob concurrently.

Four Horsemen of the Parallelypse

There are 5 methods in total available on the CloudAppendBlob type:

  • AppendBlock

  • AppendFromStream
  • AppendText
  • AppendFromByteArray
  • AppendFromFile

As you can see, I grouped them a little, so we have two categories:

  • a method for the multiple-writer scenario (AppendBlock)
  • methods for the single-writer scenario (the remaining four)

Now the question is: how do we know that a given method is designed for a specific scenario? Well, the easiest option is to read the documentation. This is the description of AppendText taken from the API reference:

Appends a string of text to an append blob.
This API should be used strictly in a single writer scenario because the API internally uses the append-offset conditional header to avoid duplicate blocks which does not work in a multiple writer scenario.

So what happens if you try to use such a method in, e.g., an Azure Function, where many instances may write to the same blob at once?

The remote server returned an error: (412) The append position condition specified was not met 

That's not the most helpful description, is it?
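To see where that 412 comes from, here is a toy in-memory model of the append-offset condition. It is only a sketch: ToyAppendBlob, AppendIfPositionEquals and AppendDemo are hypothetical names made up for illustration, not part of the Azure SDK, and the real check happens on the service side.

```csharp
using System;
using System.Collections.Generic;

// Toy in-memory stand-in for an append blob. Hypothetical, NOT the Azure SDK.
public sealed class ToyAppendBlob
{
    private readonly List<string> _blocks = new List<string>();
    private long _length;

    public long Length => _length;
    public IReadOnlyList<string> Blocks => _blocks;

    // Conditional append: succeeds only if the blob still ends at expectedOffset.
    // Returns an HTTP-like status: 201 on success, 412 when the offset is stale.
    public int AppendIfPositionEquals(long expectedOffset, string block)
    {
        if (_length != expectedOffset) return 412;
        return Append(block);
    }

    // Unconditional append: the block lands at whatever the current end is.
    public int Append(string block)
    {
        _blocks.Add(block);
        _length += block.Length;
        return 201;
    }
}

public static class AppendDemo
{
    public static int[] Run()
    {
        var blob = new ToyAppendBlob();

        // Two writers read the current length at the same moment.
        long offsetSeenByA = blob.Length;
        long offsetSeenByB = blob.Length;

        // Writer A gets there first, so its offset is still valid.
        int a = blob.AppendIfPositionEquals(offsetSeenByA, "from A\n");
        // Writer B still holds the old offset, so the condition fails.
        int b = blob.AppendIfPositionEquals(offsetSeenByB, "from B\n");
        // Without the condition, B's block is simply added at the end.
        int c = blob.Append("from B\n");

        return new[] { a, b, c };
    }
}
```

Calling AppendDemo.Run() yields the statuses 201, 412 and 201: the second writer loses the race only because it insisted on a specific offset.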

Gimme a code snippet!

The easiest way to fix this issue is to write the string to a stream and pass that stream to AppendBlock:

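// Requires: using System.IO;
// 'blob' is assumed to be an existing CloudAppendBlob reference.
using (var ms = new MemoryStream())
using (var sw = new StreamWriter(ms))
{
	await sw.WriteLineAsync("Serialized_data");
	// Flush so the buffered text actually reaches the MemoryStream.
	await sw.FlushAsync();

	// Rewind so the upload starts at the beginning of the stream.
	ms.Position = 0;

	// By default AppendBlock sends no append-offset condition,
	// so concurrent writers do not trip over each other.
	await blob.AppendBlockAsync(ms);
}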

Summary

Personally, I found this issue far from obvious. I instinctively reached for the AppendText method, which looked like the best match for my code, and only after some time did I notice those 412 error codes. The one thing you have to remember when using AppendBlock is that a single block cannot exceed 4 MB. This, along with the limit of 50k append operations, allows for building a blob with a maximum size of about 195 GB, which should be fine for most projects.
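The arithmetic behind that number, plus one way to stay under the per-block limit, can be sketched as follows. The limits are the ones quoted above (they may differ for other service versions), and AppendBlobLimits and AppendInBlocksAsync are hypothetical helper names. The helper takes the actual append call as a delegate, so it does not depend on any SDK type.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class AppendBlobLimits
{
    // Limits as quoted in the text; treat them as assumptions and check the current docs.
    public const int MaxBlockBytes = 4 * 1024 * 1024;
    public const int MaxBlocks = 50000;

    // 4 MiB * 50,000 blocks = 209,715,200,000 bytes.
    public static long MaxBlobBytes => (long)MaxBlockBytes * MaxBlocks;

    // The same value in GiB: 195.3125.
    public static double MaxBlobGiB => MaxBlobBytes / (1024.0 * 1024 * 1024);

    // Splits 'data' into chunks of at most MaxBlockBytes and hands each chunk
    // to 'appendBlock'. Returns the number of chunks sent.
    public static async Task<int> AppendInBlocksAsync(byte[] data, Func<Stream, Task> appendBlock)
    {
        int blocks = 0;
        int offset = 0;
        while (offset < data.Length)
        {
            int count = Math.Min(MaxBlockBytes, data.Length - offset);
            // Read-only view over the slice; no copy of the payload is made.
            using (var chunk = new MemoryStream(data, offset, count, writable: false))
            {
                await appendBlock(chunk);
            }
            offset += count;
            blocks++;
        }
        return blocks;
    }
}
```

With a real blob the call could look like `await AppendBlobLimits.AppendInBlocksAsync(bytes, s => blob.AppendBlockAsync(s));`. Keep in mind that only a single block is appended atomically: with several writers, the chunks of one large payload may end up interleaved with blocks from other writers.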

What's Azure Event Grid all about?

Azure Event Grid is one of the newest products available in the Azure cloud stack. Since it's still in preview, we are not offered full functionality (e.g. only two regions can be selected, and not all event publishers have been added). However, with all the goodness provided by this component, we can start thinking about "reactive programming in the cloud", or at least this is what the documentation tells us. Let's dive deeper into Event Grid and find out why it's so special.

Competitors

Event Grid is all about events. You may ask how it differs from similar products like Event Hub or Service Bus. If you take a look at the basic architecture, you'll find very similar concepts, like topics and subscribers (well, at least for Service Bus). So why do I need Event Grid (which will complicate my architecture even more) when I can easily connect e.g. my Azure Functions to a topic and achieve the same functionality with ease? Well, this is only partially true.

Event Grid functional model (source: https://docs.microsoft.com/en-us/azure/event-grid/overview)

The downside of other solutions is the need for polling. The details don't matter now; you have to implement some way of communication between your app and an event publisher. It can be long polling, event sourcing, WebSockets: whatever works can be used. So even if you establish a persistent connection, you have to talk to the other side and await messages. You're not passive in this model, and that's why you cannot simply "react" to events passed to you. You only parse them and pass them on.

Event Grid allows you to make your components "passive": they sit somewhere in the cloud and are only interested in the data you send to them. They don't have to maintain any connections; it's up to Event Grid to distribute messages and deliver them to the configured subscribers. Microsoft states that this approach is suited for serverless scenarios, and I can agree: you can make the underlying infrastructure even more abstract and control the flow of events from a single point. For me, the possibility to configure the connection between Event Hub and several Azure Functions using the Azure Portal (so I don't have to pass an EH connection string to each individual component) is definitely a big YES to Event Grid.
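The "passive subscriber" idea can be illustrated with a tiny in-process router. This is a toy sketch only: ToyEventRouter is a hypothetical class, it has nothing to do with the Event Grid SDK or its wire protocol, and the real service delivers events over the network rather than through delegates. What it shows is the shape of the model: subscribers register once and then just get called.

```csharp
using System;
using System.Collections.Generic;

// Toy in-process dispatcher. Hypothetical, NOT the Event Grid SDK.
public sealed class ToyEventRouter
{
    private readonly Dictionary<string, List<Action<string>>> _subscriptions =
        new Dictionary<string, List<Action<string>>>();

    // A subscriber registers interest once; it never polls or holds a connection.
    public void Subscribe(string eventType, Action<string> handler)
    {
        if (!_subscriptions.TryGetValue(eventType, out var handlers))
        {
            handlers = new List<Action<string>>();
            _subscriptions[eventType] = handlers;
        }
        handlers.Add(handler);
    }

    // The router pushes the event to every matching subscriber
    // and returns how many handlers were invoked.
    public int Publish(string eventType, string data)
    {
        if (!_subscriptions.TryGetValue(eventType, out var handlers)) return 0;
        foreach (var handler in handlers) handler(data);
        return handlers.Count;
    }
}
```

A publisher calls Publish and knows nothing about who is listening; adding another subscriber means one more Subscribe call in a single place, with no connection string handed to the new component.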

Should I go for it?

I still think that, although Event Grid simplifies and improves working with serverless architecture (what am I saying, it actually enables you to start thinking about serverless at all...), you cannot just take it, write a couple of Functions and say "this is how we're making applications today in our company". It still requires proper planning, it's still not suitable for each and every application (with Event Hub, Event Grid and Azure Functions, you may assume that an event will reach its destination... at some point in time), and it forces you to change your mindset to being "reactive" (which is sometimes a challenge in itself).

Event Grid as the "man-in-the-middle" in serverless architecture (source: https://docs.microsoft.com/en-us/azure/event-grid/overview)

On the other hand, I like how smoothly it integrates with the cloud. For now only a few publishers are available, but we've been promised that this will change soon. I treat it as a serverless orchestrator: it's the centre of my architecture, where I can separate concerns seamlessly. Combine it with the negligible cost ($0.60 per million operations, with the first 100k free) and an easy learning curve, and ask yourself: why haven't you tested it yet?
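To put the pricing into numbers, here is a rough estimate based only on the figures quoted above. EventGridCost is a hypothetical helper, and it assumes the free 100k operations are deducted once per billing period; actual pricing may differ.

```csharp
using System;

public static class EventGridCost
{
    // Figures quoted in the text; treat them as assumptions.
    public const long FreeOperations = 100000;
    public const decimal PricePerMillion = 0.60m;

    // Estimated charge in USD for the given number of operations.
    public static decimal Estimate(long operations)
    {
        long billable = Math.Max(0L, operations - FreeOperations);
        return billable * PricePerMillion / 1000000m;
    }
}
```

Under these assumptions, 10.1 million operations cost $6.00, and anything up to 100k operations costs nothing.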