Demystifying things - WASB in Azure

When you want to interact with Azure Storage, the easiest way is to access it via an HTTP client (whether we're talking about the REST API or hitting an endpoint manually). This is fairly easy to spot; consider the following URL:

https://yourstorageaccount.blob.core.windows.net/path
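
To make that concrete, here is a minimal sketch (Java 11+) of fetching a blob over plain HTTPS. The account, container and blob names are placeholders, and an unauthenticated GET like this only works for publicly readable blobs; for private ones you'd append a SAS token or sign the request.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BlobHttpGet {
    public static void main(String[] args) throws Exception {
        // Placeholder account/container/blob names - adjust to your storage account.
        // A plain GET works only for publicly readable blobs; private ones need
        // an Authorization header or a SAS token appended to the URL.
        String url = "https://yourstorageaccount.blob.core.windows.net/yourcontainer/data.csv";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println("Status: " + response.statusCode());
        System.out.println(response.body());
    }
}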

Simple as that. However, when it comes to using Blob Storage, the documentation also mentions another protocol:

wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>

What is this mysterious WASB and what is it used for?

Hadoop-in-the-cloud

When working with Hadoop, one of its core components is HDFS, the file system it uses to manage data and storage. The limitation of HDFS is that it only has access to files stored locally on your cluster. So here's the question: what if I'd like to access files stored inside my blob storage? Well, this is where WASB comes into play.

WASB - or Windows Azure Storage Blob - is an abstraction built on top of HDFS. It allows Hadoop (or HDInsight, because we're talking about Hadoop in the Azure cloud) to seamlessly integrate with Azure Blob Storage. What is more, it allows multiple clusters to access data stored in one place. But what does it really give you?
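
To give you an idea of how transparent this is from Hadoop's perspective, here is a minimal sketch that lists files sitting in Blob Storage through the standard Hadoop FileSystem API. The account, container and key are placeholders, and it assumes the hadoop-azure module is on the classpath.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WasbListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Storage account key so the WASB driver can authenticate
        // (in a real cluster this usually lives in core-site.xml).
        conf.set("fs.azure.account.key.yourstorageaccount.blob.core.windows.net",
                 "<your-storage-account-key>");

        // The wasbs:// URI points at a container, not at a local disk.
        FileSystem fs = FileSystem.get(
                URI.create("wasbs://yourcontainer@yourstorageaccount.blob.core.windows.net/"),
                conf);

        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}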

Sharing is fun!

Before Hadoop can start working on the data, it actually has to load it from somewhere. Normally you either store it in your cluster or load it from an external source. The important thing here is the following statement: data has to be accessible locally. Now imagine a situation where you'd like to destroy a cluster each time a computation has been made (e.g. it happens twice a week and there's no need to pay for it every day). Moving and loading data each time you want to do something with it consumes time and resources.

When using HDInsight you no longer have to worry about those caveats. Thanks to WASB, Hadoop can load data from blob storage directly - you can connect multiple consumers and run computations at the same time. What is more, WASB can be installed with a traditional installation of Hadoop, so even when you provision a cluster on your own in Azure, it's still possible to use WASB in it.
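
As a sketch of what "loading data directly from blob storage" looks like in practice (again with placeholder names, and assuming the hadoop-azure module is available, which it is out of the box on HDInsight), reading a file through a wasbs:// path is no different from reading one from HDFS:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WasbRead {
    public static void main(String[] args) throws Exception {
        // On HDInsight the account credentials are already in core-site.xml,
        // so the configuration can stay empty here.
        Configuration conf = new Configuration();
        Path file = new Path(
                "wasbs://yourcontainer@yourstorageaccount.blob.core.windows.net/input/sample.txt");

        FileSystem fs = file.getFileSystem(conf);
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file)))) {
            // Print the first few lines straight from Blob Storage.
            reader.lines().limit(10).forEach(System.out::println);
        }
    }
}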


Building your Big Data playground with Azure

Let's say you were assigned a task which requires you to provision a whole new environment using technologies that are not "cool" when used on your dev machine. Take Hadoop as an example - it is becoming more and more popular, yet it's still a black box for many people (including me). What if you'd like to play with it a little? Well, there are instructions for what you have to do to install and run it on Windows. Trust me, it's doable... and that's the only "good" part of the whole process.

Do it for me?

I don't like wasting my time and my computer's resources on temporary things which I need only for a few hours. What I like is to have something do it for me. If you take a look at the Azure Marketplace, you'll see plenty of available software images, many of them containing OSS software. It can be installed and used without any additional charges. Does that include Hadoop? Yes, it does. Let's grab it and install it.

Do I have money for it?

If you have an Azure subscription, don't worry about installing this image. As I said, you're charged only for the resources you're using. When you're done with it, you can either delete the whole resource group with the provisioned resources or stop (deallocate) the VM used by Hadoop - the latter saves you time when you need it next, and the cost in that case is negligible.
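
If you prefer to script that instead of clicking through the portal, here is a rough sketch using the fluent Azure management libraries for Java. The auth file path, resource group and VM name are placeholders, and the Azure CLI or portal does exactly the same job.

import java.io.File;
import com.microsoft.azure.management.Azure;
import com.microsoft.azure.management.compute.VirtualMachine;

public class StopHadoopVm {
    public static void main(String[] args) throws Exception {
        // Auth file created for the SDK; path, resource group and VM name
        // are placeholders for this sketch.
        Azure azure = Azure.authenticate(new File("my.azureauth"))
                           .withDefaultSubscription();

        VirtualMachine vm = azure.virtualMachines()
                                 .getByResourceGroup("hadoop-rg", "hadoop-vm");

        // Deallocating releases the compute resources, so you stop paying for
        // the VM itself while the disks (and your data) stay around.
        vm.deallocate();
    }
}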

Got it! What's next?

The Linux VM instance on which Hadoop is installed is accessible through an SSH client and requires either an SSH key or the password you provided to connect (you can use whichever client you like, e.g. PuTTY or even the terminal bundled with SourceTree). Once connected, you can run tasks and scripts designed for Hadoop.
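
If you'd rather script that connection than click through PuTTY, here is a minimal sketch using the JSch library (host, user and key path are placeholders); it opens an SSH session with your private key and runs a single Hadoop command:

import java.io.InputStream;
import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class HadoopVmSsh {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        // Placeholder key path, user and host - use the values from your VM.
        jsch.addIdentity("/home/me/.ssh/id_rsa");
        Session session = jsch.getSession("azureuser", "<your-vm-public-ip>", 22);
        session.setConfig("StrictHostKeyChecking", "no"); // fine for a throwaway playground
        session.connect();

        ChannelExec channel = (ChannelExec) session.openChannel("exec");
        channel.setCommand("hdfs dfs -ls /");
        InputStream output = channel.getInputStream();
        channel.connect();

        // Stream the command output until the remote side closes the channel.
        byte[] buffer = new byte[1024];
        int read;
        while ((read = output.read(buffer)) != -1) {
            System.out.print(new String(buffer, 0, read));
        }

        channel.disconnect();
        session.disconnect();
    }
}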

Just to make things clear - in the Azure Portal, when you go to the Overview tab of the VM provisioned for Hadoop, you'll see a public IP address which can be used to connect to it. What is more, you can use SFTP to upload files to the VM or download them. Open your SFTP-capable client, use your_VM_IP:22 as the host and enter your credentials. You'll see the default directory of your VM. From this point everything is set - you have your very own Hadoop playground, which you can use whenever you want.
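
The SFTP part can be scripted the same way. Here is a small sketch with JSch again (key path, host and file names are placeholders), uploading a data file to the VM and pulling results back:

import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class HadoopVmSftp {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        jsch.addIdentity("/home/me/.ssh/id_rsa"); // placeholder key path
        Session session = jsch.getSession("azureuser", "<your-vm-public-ip>", 22);
        session.setConfig("StrictHostKeyChecking", "no");
        session.connect();

        ChannelSftp sftp = (ChannelSftp) session.openChannel("sftp");
        sftp.connect();

        // Upload an input file into the VM's home directory...
        sftp.put("data/input.csv", "input.csv");
        // ...and download results produced by a Hadoop job (names are examples).
        sftp.get("results/part-r-00000", "data/part-r-00000");

        sftp.disconnect();
        session.disconnect();
    }
}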