Azure Functions, WebJobs and Data Lake - writing a custom extension #2

In the previous post I showed you how to write a simple WebJobs extension, which we were able to execute using TimerTrigger. Beyond just running, it didn't provide much value - that's why today I'm introducing functionality that will actually work with a Data Lake instance and push our simple project even further!

Extending DataLakeProvider

Previously DataLakeProvider was only a dummy class, which didn't carry any real value. Today we'll make it the centre of our logic, enabling easy work with Data Lake and acting as a simple adapter to our storage. Let's focus on our binding signature:

public static async Task CustomBinding([TimerTrigger("*/15 * * * * *")] TimerInfo timerInfo,
            [DataLake("clientId", "clientSecret")]
            DataLakeProvider dataLake)

As you can see, we're passing two parameters - clientId and clientSecret - to the DataLakeProvider instance. You may ask what those values are and why we need them. Well, consider the following snippet:

using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.Management.DataLake.Store;
using Microsoft.Azure.Management.DataLake.Store.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest.Azure.Authentication;

public class DataLakeProvider : IDisposable
{
	private readonly DataLakeStoreFileSystemManagementClient _client;

	public DataLakeProvider(string clientId, string clientSecret)
	{
		// "domainId" is the Azure AD tenant (domain) ID - replace it with your own
		var clientCredential = new ClientCredential(clientId, clientSecret);
		// Note: .Result blocks the calling thread - good enough for this sample,
		// since the binding creates the provider synchronously
		var creds = ApplicationTokenProvider.LoginSilentAsync("domainId", clientCredential).Result;
		_client = new DataLakeStoreFileSystemManagementClient(creds);
	}

	public Task CreateDirectory(string path)
	{
		// "datalakeaccount" is the name of the target Data Lake Store account
		return _client.FileSystem.MkdirsAsync("datalakeaccount", path);
	}

	public async Task AppendToFile(string destinationPath, string content)
	{
		using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(content)))
		{
			// Autocreate makes sure the file is created if it doesn't exist yet
			await _client.FileSystem.ConcurrentAppendAsync("datalakeaccount", destinationPath, stream, appendMode: AppendModeType.Autocreate);
		}
	}

	public void Dispose()
	{
		_client.Dispose();
	}
}

This is all we need to be able to:

  • create a directory in Data Lake
  • perform a concurrent append to a chosen file

The logic behind working on files stored in Data Lake is pretty simple, so I won't focus on it for now. What certainly requires some explanation is authentication. As you can see, I'm doing a couple of things:

  • I'm creating a ClientCredential instance, which is a wrapper for AD credentials (we'll go through this later)
  • Next I log in silently to my AD to obtain an access token
  • With the token received I can finally create a Data Lake client

This flow is required since all actions on Data Lake storage are authorized using permissions assigned to a specific user or group in Azure. Once we're done here, we can do two more things - fix DataLakeAttributeBindingProvider so it passes the attribute parameters to DataLakeProvider, and extend our function so it performs some real work.

Doing it for real!

We need to change one thing in DataLakeAttributeBindingProvider - previously we didn't need to pass anything to DataLakeProvider, so GetValueAsync() looked like this:

public Task<object> GetValueAsync()
{
	var value = new DataLakeProvider();

	return Task.FromResult<object>(value);
}

The only thing to do now is to use the right constructor:

public Task<object> GetValueAsync()
{
	var value = new DataLakeProvider(_resolvedAttribute.ClientId, _resolvedAttribute.ClientSecret);

	return Task.FromResult<object>(value);
}
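For completeness - _resolvedAttribute is simply the DataLake attribute taken from the function signature. Assuming the same shape as in the previous post, it might look more or less like this (a minimal sketch, not the definitive implementation):

using System;

// Carries the Azure AD credentials from the function signature
// down to DataLakeAttributeBindingProvider.
[AttributeUsage(AttributeTargets.Parameter)]
public sealed class DataLakeAttribute : Attribute
{
	public DataLakeAttribute(string clientId, string clientSecret)
	{
		ClientId = clientId;
		ClientSecret = clientSecret;
	}

	public string ClientId { get; }
	public string ClientSecret { get; }
}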

Let's also extend our function and try to create a directory and append something to a file:

public static async Task CustomBinding([TimerTrigger("*/15 * * * * *")] TimerInfo timerInfo,
            [DataLake("clientId", "clientSecret")]
            DataLakeProvider dataLake)
{
	using (dataLake)
	{
		var path = Path.Combine("This", "Is", "Just", "A", "Test");
		await dataLake.CreateDirectory(path);
		await dataLake.AppendToFile(Path.Combine(path, "foo"), "THIS IS JUST A TEST");
	}
}

Result

When you run the function, you should see a result similar to mine:

In the final post on this topic I'll show you how to integrate this extension with a Function App and describe how to obtain clientId and clientSecret - for those who are not familiar with Azure Active Directory :)

Tips&Tricks - When was my data replicated in RA-GRS?

Today is Friday, so it's time for something quick and easy. When working with Azure Storage, you might wonder from time to time when your data was last replicated. This gives some insight into how Azure Storage works internally and what the drawbacks of this component are. Let's find the last synchronization date and consider for a moment what it really means for us.

Synchronization date and time

Before we start - to actually be able to find the synchronization timestamp, replication has to happen in the first place. This means that for a storage account in LRS mode, this feature just won't work. You may ask why - the answer is fairly simple. LRS replicates data only within a single datacenter and, what is even more important, it does so synchronously. There's no replication between regions, so there's no such thing as synchronization - either the data is saved and replicated, or the whole operation fails.

Presumably the synchronization timestamp should be available in the other three models of data replication - ZRS, GRS and RA-GRS - but surprisingly... it's not. This feature works only for RA-GRS accounts for one simple reason - this is the only mode that allows you to read data from the secondary location. Of course it has some limits (for example, you cannot initiate a failover to another region yourself), but at least you'll be able to read the replicated data.

You can easily read the last synchronization date by going to the Azure Portal and accessing e.g. Tables:

Initial status of Table Storage with RA-GRS mode
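
If you'd rather check it from code than in the portal, the classic storage SDK exposes the same information through service statistics. Below is a minimal sketch - it assumes the WindowsAzure.Storage package and an RA-GRS connection string kept in an environment variable I named STORAGE_CONNECTION_STRING:

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.RetryPolicies;

public static class LastSyncTimeCheck
{
	public static void Main()
	{
		// Assumption: the RA-GRS connection string is stored in an environment variable
		var connectionString = Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING");
		var account = CloudStorageAccount.Parse(connectionString);
		var tableClient = account.CreateCloudTableClient();

		// Geo-replication statistics are served only by the secondary endpoint,
		// so force the request to go there
		tableClient.DefaultRequestOptions.LocationMode = LocationMode.SecondaryOnly;

		var stats = tableClient.GetServiceStats();
		Console.WriteLine($"Replication status: {stats.GeoReplication.Status}");
		Console.WriteLine($"Last sync time: {stats.GeoReplication.LastSyncTime}");
	}
}

The same statistics are available for Blobs and Queues via their respective clients.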

 

Is my data replicated?

This is a serious question - if geo-replication happens asynchronously, will my data be copied to the secondary location without loss? Well, it depends - there's an obvious gap between the primary and the secondary storage, and the answer is directly related to when a disaster happens. What is important here is that Azure Storage out of the box doesn't guarantee that each and every record or blob will be replicated on time. Of course this doesn't mean that half a day's worth of data will be lost - we're talking about a few minutes - but for some systems losing even one record means being totally unreliable.

What can you do to improve your guarantees and consistency in geo-replication? I strongly advise reading this article regarding possible outages in Azure Storage and their consequences. What you can certainly do is implement your own backup policy and support the built-in replication mechanism by performing synchronous writes to additional storage. Depending on your needs and expectations, using a different storage (like Cosmos DB, which introduced Table Storage on steroids) could also be a viable solution.
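
As a rough illustration of that last idea - synchronous writes to an additional storage account - here's a minimal sketch using the classic Table Storage SDK; the table name and the way the two accounts are provided are just placeholders:

using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public class DualWriteRepository
{
	private readonly CloudTable _primary;
	private readonly CloudTable _backup;

	// Assumption: two independent storage accounts, resolved elsewhere
	public DualWriteRepository(CloudStorageAccount primaryAccount, CloudStorageAccount backupAccount)
	{
		_primary = primaryAccount.CreateCloudTableClient().GetTableReference("orders");
		_backup = backupAccount.CreateCloudTableClient().GetTableReference("orders");
	}

	public Task InsertAsync(ITableEntity entity)
	{
		var insert = TableOperation.Insert(entity);

		// Treat the write as successful only when both accounts accepted it,
		// instead of relying solely on asynchronous geo-replication
		return Task.WhenAll(
			_primary.ExecuteAsync(insert),
			_backup.ExecuteAsync(insert));
	}
}

This obviously trades write latency and availability for stronger durability, so treat it as a starting point rather than a complete backup policy.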