Demystifying things - WASB in Azure

When you want to interact with Azure Storage, the easiest way is to access it via an HTTP client (whether we're talking about the REST API or hitting an endpoint manually). Such access is fairly easy to recognize - consider the following URL:

https://yourstorageaccount.blob.core.windows.net/path

Simple as that. However, the documentation also mentions another protocol when it comes to using Blob Storage:

wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
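To see how the parts of that address fit together, here's a minimal sketch that splits a WASB URL into its components. The helper `parse_wasb_url` and the example account/container names are my own, purely for illustration:

```python
from urllib.parse import urlparse

def parse_wasb_url(url: str) -> dict:
    """Split a wasb[s]:// URL into its container, account and path parts."""
    parsed = urlparse(url)
    # netloc looks like "<containername>@<accountname>.blob.core.windows.net"
    container, _, host = parsed.netloc.partition("@")
    account = host.split(".")[0]
    return {
        "scheme": parsed.scheme,        # "wasb" or "wasbs" (TLS)
        "container": container,
        "account": account,
        "path": parsed.path.lstrip("/"),
    }

parse_wasb_url("wasbs://mycontainer@myaccount.blob.core.windows.net/data/input.csv")
# → {'scheme': 'wasbs', 'container': 'mycontainer',
#    'account': 'myaccount', 'path': 'data/input.csv'}
```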

What is this mysterious WASB and what is it used for?

Hadoop-in-the-cloud

When working with Hadoop, one of its core components is HDFS, the file system it uses to manage data and storage. The limitation of HDFS is that it has access only to the local storage available to your cluster. So there's a question - what if I'd like to access files stored in Blob Storage? Well, this is where WASB comes into play.

WASB - or Windows Azure Storage Blob - is an implementation of the Hadoop file system interface backed by Azure Blob Storage. It allows Hadoop (or HDInsight, because we're talking about Hadoop in the Azure cloud) to integrate seamlessly with Azure Blob Storage. What is more, it allows multiple clusters to access data stored in one place. But what does it really give you?

Sharing is fun!

Before Hadoop can start working on data, it has to load it from somewhere. Normally you either store it in your cluster or load it from an external source. The important thing here is the following statement - the data has to be accessible locally. Now imagine that you'd like to destroy a cluster each time a computation has been made (e.g. it runs twice a week, so there's no need to pay for it every day). Moving and loading the data each time you want to do something with it consumes time and resources.

When using HDInsight you no longer have to worry about those caveats. Thanks to WASB, Hadoop can load data from Blob Storage directly - you can connect multiple consumers and run computations at the same time. What is more, WASB can be installed with a traditional installation of Hadoop, so even when you provision a cluster on your own in Azure, it's still possible to use WASB in it.
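For a self-managed cluster, wiring WASB up roughly means putting the hadoop-azure module on the classpath and pointing Hadoop at your storage account in core-site.xml. A sketch, with placeholder account/container names - check the hadoop-azure documentation for the exact setup for your Hadoop version:

```xml
<!-- core-site.xml (sketch): connect a self-managed Hadoop cluster to Blob Storage.
     Assumes the hadoop-azure module is available; names below are placeholders. -->
<configuration>
  <property>
    <name>fs.azure.account.key.myaccount.blob.core.windows.net</name>
    <value>YOUR_STORAGE_ACCOUNT_KEY</value>
  </property>
  <property>
    <!-- optional: make WASB the default file system for the cluster -->
    <name>fs.defaultFS</name>
    <value>wasbs://mycontainer@myaccount.blob.core.windows.net</value>
  </property>
</configuration>
```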

 

"You're older than I expected" - tricky message retention in Azure Event Hub

Event Hub (or Service Bus, if we're talking about the service as "a whole") has a concept of message retention. In short - you can set a fixed period which determines after how many days a message is considered outdated and is no longer available on the bus. So far so good - nothing tricky here and the concept is fairly simple. Unfortunately, there are some not-so-obvious gotchas which can hit you really hard if you forget about them.

You're older than I expected

The confusion comes from the fact that Event Hub is part of a bigger service called Service Bus and offers only a subset of the available functionality. When we're considering Service Bus, we're talking about queues, topics and so on. All of those have a property called TTL (time-to-live), which is attached to the messages being passed. Although TTL means something slightly different for each concept (e.g. at-least-once/at-most-once delivery for queues), it's there and its definition is intuitive. The question is - how is this related to the message retention mentioned earlier?

The answer lies in the fact that different services in Service Bus are designed for different purposes - because of that, each treats the definition of a message or entity in a slightly different way. Since Event Hubs is considered a big-scale solution (as opposed to e.g. queues), it tracks whole blocks of messages rather than each single message being pushed through a pipeline.

That being said, there's a reason why message retention is not always what you'd expect - if you're using Event Hub for a fairly small amount of data (tens of thousands of events per day at most), there's quite a big likelihood that a message won't be considered "outdated" until the container which holds it is full.
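Event Hub's internals aren't public, so here's only a toy model of the idea: when retention is evaluated per block of messages rather than per message, a block that never fills up is never reclaimed, and its old events stay visible past the retention period. All names and the block capacity below are invented for illustration:

```python
from dataclasses import dataclass, field
from typing import Optional

BLOCK_CAPACITY = 4  # toy value; real storage segments are far larger

@dataclass
class Block:
    events: list = field(default_factory=list)
    sealed_at: Optional[int] = None  # the "day" on which the block filled up

class ToyEventStore:
    """Toy model: retention is evaluated per *block*, not per message.
    A block becomes eligible for deletion only after it is sealed (full)
    and the retention period has passed since sealing."""

    def __init__(self, retention_days: int):
        self.retention_days = retention_days
        self.blocks = [Block()]

    def append(self, event, day: int):
        if len(self.blocks[-1].events) == BLOCK_CAPACITY:
            self.blocks[-1].sealed_at = day   # block is full - seal it
            self.blocks.append(Block())
        self.blocks[-1].events.append((day, event))

    def visible_events(self, today: int):
        out = []
        for b in self.blocks:
            # an unsealed (not yet full) block is never reclaimed,
            # no matter how old its events are
            if b.sealed_at is not None and today - b.sealed_at > self.retention_days:
                continue
            out.extend(b.events)
        return out

store = ToyEventStore(retention_days=1)
store.append("lonely event", day=0)
store.visible_events(today=7)  # → [(0, 'lonely event')] - still there after a week
```

With low traffic the single block never reaches capacity, so the one-day "retention" never kicks in - which is exactly the surprise described above.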

I saw this... twice?

Now imagine the following situation - you're about to go live with a solution, currently on a production environment, which uses Event Hub as its heart. Let's say Event Hub has been gathering data for the last seven days (message retention is set to only one day, so this shouldn't be a problem) because you wanted to avoid a cold start. Now the consumers start and... your clients are receiving events/notifications/pop-ups from a week ago.

The first problem - you forgot to check in your code whether a message is still valid from your point of view. This happens, especially if you treat the documentation as the only source of information about a service.
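Such a consumer-side validity check can be as small as the sketch below. It assumes events arrive as (enqueued time, payload) pairs - with a real Event Hub client you'd read the enqueued time from the event's metadata instead, so treat the shape of the input as an assumption:

```python
from datetime import datetime, timedelta, timezone

def drop_stale(events, max_age: timedelta, now=None):
    """Keep only events whose enqueued time is within max_age of 'now'.

    'events' is an iterable of (enqueued_at, payload) pairs; adapt the
    accessor to whatever your client library actually exposes.
    """
    now = now or datetime.now(timezone.utc)
    return [(ts, payload) for ts, payload in events if now - ts <= max_age]

now = datetime(2020, 1, 8, tzinfo=timezone.utc)
events = [
    (datetime(2020, 1, 1, tzinfo=timezone.utc), "week-old"),
    (datetime(2020, 1, 8, tzinfo=timezone.utc), "fresh"),
]
drop_stale(events, timedelta(days=1), now=now)
# → only the "fresh" event survives
```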

The second - well, nowhere was it said that message retention is not what it looks like at first glance.

Summary

As you can see, it's a good idea to remodel your way of thinking about messages and entities when working with Event Hub to avoid confusion. Apply a certain level of abstraction to your infrastructure and ask yourself a question - am I working with single messages, or with whole blobs which make sense only when fully processed? If your answer is the former, you may trick yourself sooner or later.