
Tuesday, 31 January 2017

Announcing Zipkin Collector for Azure EventHub

If you are reading this, you have probably heard of Zipkin. If not, please take my word for it, leave this post and spend 10 minutes reading up on it - a very worthwhile 10 minutes which will introduce you to one of the best, yet simplest, distributed tracing systems. In one word, it tells you where the time to serve requests is mostly being spent, helping you to optimise your Microservice architecture.

Zipkin, used by the likes of Twitter and Netflix, has already created a storm in the Java/JVM ecosystem, but many of us in the .NET community have not heard of it - and that is frankly a real pity. And if you have heard of it and want to use it, yes, of course we could try to port the whole system over to .NET, but that would be a huge amount of work and frankly a waste, since Zipkin is designed to work across different stacks as long as you can somehow get your data over to it. The data is normally pushed to Kafka, and Zipkin consumes messages from Kafka via a component called the Collector. Data then gets stored in a storage backend (currently MySQL, Cassandra or Elasticsearch) and then served by the UI.

Of course nothing stops you from running Kafka in your cloud or on-premise environment, but if you have never done it, to say the least, ZooKeeper (a consensus system required for running Kafka) is not the easiest service to operate. And frankly, if you are on Azure it makes a lot of sense to use EventHub, an Azure PaaS service with functionality very similar to Kafka. Sadly there was no collector for it.

I have been very keen to bring Zipkin to ASOS, but could not quite justify running ZK and Kafka, even for myself. Hence I felt something had to be done about it. The only problem: I had never done a Java/Maven project before.

*     *     *

I have been doing what I have been doing - being a professional developer - for some time now. And I have had my ups and downs, both moments that I am proud of and moments of embarrassment because I have messed up. But never have I just picked up a completely different stack and built something like what I am going to share, within a couple of weeks. [Yeah, I am talking about the Zipkin Collector for Azure EventHub]



This really has been a testament to how pluggable and nicely designed Zipkin is, and above all it has a truly amazing community - championed by Adrian Cole. Help was always around the corner, be it on hardcore stuff such as how to modularise the collector, or my noob problems with Maven.

Not to forget that the Azure EventHub SDK basically made it completely trivial to implement a working solution. All the heavy lifting is done by the EventProcessorHost, so all that is left is a bit of plumbing to get the configuration over to these components.
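To give a rough idea of what that pattern looks like, here is a hedged C# sketch of an EventProcessorHost-based consumer (the collector itself is Java, but the .NET SDK follows the same shape; this is condensed, illustrative code, not the collector's actual implementation):

// A hedged sketch of the EventProcessorHost pattern, using Microsoft.ServiceBus.Messaging
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

// The processor is handed batches of events per partition
public class SpanProcessor : IEventProcessor
{
    public Task OpenAsync(PartitionContext context) { return Task.FromResult(0); }

    public Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
    {
        foreach (var message in messages)
        {
            var json = Encoding.UTF8.GetString(message.GetBytes()); // a JSON array of spans
            // ... hand the spans over to the collector/storage here ...
        }
        return context.CheckpointAsync(); // offsets are checkpointed to blob storage for you
    }

    public Task CloseAsync(PartitionContext context, CloseReason reason) { return Task.FromResult(0); }
}

// The 'plumbing': the host takes care of leases, partition balancing and checkpoints
var host = new EventProcessorHost("zipkin-collector", "zipkin", EventHubConsumerGroup.DefaultGroupName,
    "<eventhub connection string>", "<storage connection string>");
host.RegisterEventProcessorAsync<SpanProcessor>().Wait();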

*     *     *

How to use EventHub Collector

So the idea is that you run zipkin-server (which hosts the Zipkin UI) and in the same process you run your collector. Zipkin uses Spring Boot's auto-configuration mechanism to load the right collector based on the configuration provided. The project is hosted on GitHub. [UPDATE: the project has moved to the OpenZipkin organisation here]

The EventHub Collector gets triggered by the presence of the "zipkin.collector.eventhub.eventHubConnectionString" configuration on the command line. The rest of the necessary configuration can be passed via an application.properties or application.yaml file.

So to run the EventHub collector you need:

1- zipkin.jar (zipkin-server)
2- application.properties file
3- zipkin-collector-eventhub-autoconfig module jar (which contains transitive dependencies too). This jar is not on maven yet

So in order to run:

1- Clone the source and build

git clone [email protected]:aliostad/zipkin-collector-eventhub.git
cd zipkin-collector-eventhub
mvn package

If you do not have maven, get maven here.

2- Unpackage MODULE jar into an empty folder

copy zipkin-collector-eventhub-autoconfig-x.x.x-SNAPSHOT-module.jar (which has been packaged into the target folder) into an empty folder and unpack it:

jar xf zipkin-collector-eventhub-autoconfig-0.1.0-SNAPSHOT-module.jar

You may then delete the jar itself.

3- Download zipkin-server jar


Download the latest zipkin-server jar (which is named zipkin.jar) from here. For more information visit zipkin-server homepage.

4- Create an application.properties file for configuration next to the zipkin.jar file

Populate the configuration - make sure the resources (Azure Storage, EventHub, etc.) exist. Only storageConnectionString is mandatory; the rest are optional and should be used only to override the defaults:

zipkin.collector.eventhub.storageConnectionString=<azure storage connection string>
zipkin.collector.eventhub.eventHubName=<name of the eventhub, default is zipkin>
zipkin.collector.eventhub.consumerGroupName=<name of the consumer group, default is $Default>
zipkin.collector.eventhub.storageContainerName=<name of the storage container, default is zipkin>
zipkin.collector.eventhub.processorHostName=<name of the processor host, default is a randomly generated GUID>
zipkin.collector.eventhub.storageBlobPrefix=<the path within container where blobs are created for partition lease, processorHostName>


5- Run the server along with the collector

Assuming zipkin.jar and application.properties are in the current working directory, run this from the command line (note that the connection string to the eventhub itself is passed in the command line):

java -Dloader.path=/where/jar/was/unpackaged -cp zipkin.jar org.springframework.boot.loader.PropertiesLauncher --spring.config.location=application.properties --zipkin.collector.eventhub.eventHubConnectionString="<eventhub connection string, make sure quoted otherwise won't work>"


After running, Spring Boot and the rest of the stack get loaded and you should see some INFO output from the collector echoing the configuration you have passed.


You should be up and running and can start pushing spans to your EventHub.

Span serialisation guideline

The EventHub Collector expects spans serialised as a JSON array of spans. The payload is read as a UTF-8 string and deserialised by the zipkin-server components.
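To give a flavour of what that means from the .NET side, here is a minimal, hedged sketch of sending a span to the EventHub using the Microsoft.ServiceBus.Messaging client (the span JSON below is illustrative and far from complete; the field values are made up):

// A hedged sketch: push a UTF-8 encoded JSON array of spans to the EventHub
using System.Text;
using Microsoft.ServiceBus.Messaging;

var json = @"[{
    ""traceId"": ""48485a3953bb6124"",
    ""id"": ""48485a3953bb6124"",
    ""name"": ""get /orders"",
    ""timestamp"": 1485878400000000,
    ""duration"": 25000
}]";

var client = EventHubClient.CreateFromConnectionString(
    "<eventhub connection string>", "zipkin");            // default hub name is zipkin
client.Send(new EventData(Encoding.UTF8.GetBytes(json))); // payload is read back as a UTF-8 string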

Roadmap

The next step is to get the jar onto Maven Central. I will also start working on a .NET library to make building spans easier.



Wednesday, 20 July 2016

Singleton HttpClient? Beware of this serious behaviour and how to fix it

If you are consuming a Web API in your server-side code (or .NET client-side app), you are very likely to be using an HttpClient.

HttpClient is a very nice and clean implementation that came as part of Web API and replaced its clunky predecessor WebClient (although only in its HTTP functionality, WebClient can do more than just HTTP).

HttpClient is usually meant to be used for more than just a single request. It conveniently allows for default headers to be set and applied to all requests, and you can plug in a CookieContainer to share cookies across all requests.
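For example (a trivial sketch of those two conveniences; the header values are just placeholders):

// A minimal sketch: one shared HttpClient with default headers and a CookieContainer
using System.Net;
using System.Net.Http;

var handler = new HttpClientHandler { CookieContainer = new CookieContainer() };
var client = new HttpClient(handler);
client.DefaultRequestHeaders.Add("Accept", "application/json");   // applied to every request
client.DefaultRequestHeaders.Add("x-client-name", "my-service");  // illustrative custom header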

Now, ironically, it also implements IDisposable, suggesting a short-lived lifetime and disposing it as soon as you are done with it. This led to several discussions in the community (here from Microsoft Patterns and Practices, Darrel Miller here and a few references on StackOverflow here) on whether it can be used with a longer lifetime and, more importantly, whether it needs disposal.

Singleton HttpClient matters, especially when it comes to the performance [Dragan Brankovich - Flickr]

HttpClient implements IDisposable only indirectly through HttpMessageHandler, and only as a just-in-case rather than an immediate need - I am not aware of an implementation of HttpMessageHandler that holds unmanaged resources (the sole reason for implementing IDisposable).

In short, the community agreed that it was 100% safe not only not to dispose the HttpClient, but also to use it as a singleton. The main concern was thread safety when making concurrent HTTP calls - and even official documentation said there is no risk in doing that.

But it turns out there is a serious issue: DNS changes are NOT honoured and HttpClient (through HttpClientHandler) hogs the connections until the socket is closed. Indefinitely. So when does a DNS change occur? Every time you do a blue-green deployment (in Azure Cloud Services, when you deploy to the staging slot and then swap the production/staging slots). Every time you change settings in your Azure Traffic Manager. Failover scenarios. Internally in a myriad of PaaS offerings.

And this has been going on for more than 2 years without being reported... makes me wonder what kind of applications we build with .NET?

Now if the reason for the DNS change is failover, your connection would have been faulted anyway, so this time the connection would open against the new server. But if this were a blue-green deployment, you swap the staging and production slots and your calls would still go to the staging environment - a behaviour we had seen but had "fixed" by bouncing the dependent servers, thinking it was possibly an Azure oddity. What a fool was I - it was there in the code! Whose code? Well, debatable...

Analysis

All of this goes back to the implementation in HttpClientHandler, which uses HttpWebRequest to make connections, none of which is open sourced. But obviously, using JetBrains' dotPeek we can look into the decompiled code and see that HttpClientHandler creates a connection group (named with its hashcode) and does not close the connections in the group until it gets disposed. This basically means the DNS check never happens as long as a connection is open. This is really terrifying...
protected override void Dispose(bool disposing)
{
    if (disposing && !this.disposed)
    {
        this.disposed = true;
        ServicePointManager.CloseConnectionGroups(this.connectionGroupName);
    }
    base.Dispose(disposing);
}
As you can see, the ServicePoint class plays an important role here: controlling the number of concurrent connections to a 'service point'/endpoint as well as keep-alive behaviours.

Solution

A naive solution would be to dispose the HttpClient (hence the HttpClientHandler) every time you use it. As explained this is not how HttpClient is intended to be used.

Another solution is to set ConnectionClose property of DefaultRequestHeaders on your HttpClient:
var client = new HttpClient();
client.DefaultRequestHeaders.ConnectionClose = true;
This will set the HTTP Connection header to close (i.e. turn off keep-alive), so the socket will be closed after a single request. It turns out this can add roughly an extra 35ms (with long tails, i.e. amplifying outliers) to each of your HTTP calls, preventing you from taking advantage of re-using a socket. So what is the solution then?

Well, courtesy of my good friend Andy Jutton of Amido, the solution lies in an obscure feature of the ServicePoint class. Basically, as we said, ServicePoint controls many aspects of TCP connections, and one of its properties is ConnectionLeaseTimeout, which controls how many milliseconds a TCP socket should be kept open. Its default value is -1, which means connections will stay open indefinitely… well, in real terms, until the server closes the connection or there is a network disruption - or the HttpClientHandler gets disposed, as discussed.

So the root cause is basically the default value of -1, which is, IMHO, a wrong and potentially dangerous setting.

Now to fix it, all we need to do is to get hold of the ServicePoint object for the endpoint, by passing the URL to it, and set its ConnectionLeaseTimeout:
var sp = ServicePointManager.FindServicePoint(new Uri("http://foo.bar/baz/123?a=ab"));
sp.ConnectionLeaseTimeout = 60*1000; // 1 minute
So this is something you would want to do only at the startup of your application, once, and for all the endpoints your application is going to hit (if endpoints are decided at runtime, you would set it at the time of discovery). Bear in mind, the path and query string are ignored and only the host, port and scheme matter. Depending on your scenario, values of 1-5 minutes probably make sense.
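Putting it together, a startup routine might look something like the sketch below (hedged: the endpoint list and the one-minute lease are illustrative choices, not prescriptions):

// A hedged sketch of startup configuration: set the lease timeout per endpoint, then share one client
using System;
using System.Net;
using System.Net.Http;

var endpoints = new[] { new Uri("http://foo.bar"), new Uri("https://api.example.com") }; // illustrative
foreach (var endpoint in endpoints)
{
    var servicePoint = ServicePointManager.FindServicePoint(endpoint);
    servicePoint.ConnectionLeaseTimeout = (int)TimeSpan.FromMinutes(1).TotalMilliseconds;
}

// A single long-lived HttpClient shared across the application
var client = new HttpClient();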

Conclusion

Using a singleton HttpClient results in your instance not honouring DNS changes, which can have serious implications. The solution is to set the ConnectionLeaseTimeout of the ServicePoint object for each endpoint.

Tuesday, 14 June 2016

After all, it might not matter - A commentary on the status of .NET


Do you know what the most menacing nightmare for a peasant soldier in medieval wars was? The approach of a knight.

Approaching of a knight - a peasant soldier's nightmare [image source]

Famous for gallantry and bravery, armed to the teeth and having many years of training and battle experience, knights were the ultimate war machine for the better part of the Medieval period. The likelihood of survival for a peasant soldier in an encounter with a knight was very small. He had to somehow deflect or evade the attack of the knight's sword or lance while wielding a heavy sword himself, landing a blow at exactly the right moment as the knight passed. Not many peasants had the training or prowess to do so.


Appearing around 1000 AD, the dominance of knights started following the conquest of William of Normandy in the 11th century and reached its height in the 14th century:
“When the 14th century began, knights were as convinced as they had always been that they were the topmost warriors in the world, that they were invincible against other soldiers and were destined to remain so forever… To battle and win renown against other knights was regarded as the supreme knightly occupation” [Knights and the Age of Chivalry,1974]
And then something happened. Something that changed the military combat for the centuries to come: the projectile weapons.
“During the fifteenth century the knight was more and more often confronted by disciplined and better equipped professional soldiers who were armed with a variety of weapons capable of piercing and crushing the best products of the armourer’s workshop: the Swiss with their halberds, the English with their bills and long-bows, the French with their glaives and the Flemings with their hand guns” [Arms and Armor of the Medieval Knight: An Illustrated History of Weaponry in the Middle Ages, 1988]
The development of the longsword had made the knight's attack more effective, but no degree of training or improved plate armour could stop the rise of projectile weapons:
“Armorers could certainly have made the breastplates thick enough to withstand arrows and bolts from longbows and crossbows, but the knights could not have carried such a weight around all day in the summer time without dying of heat stroke.”
And the final blow was the handguns:
“The use of hand guns provided the final factor in the inevitable process which would render armor obsolete” [Arms and Armor of the Medieval Knight: An Illustrated History of Weaponry in the Middle Ages, 1988]
And with the advent of arbalests, the importance of lifelong training disappeared, since "an inexperienced arbalestier could use one to kill a knight who had a lifetime of training".

Projectile weapons [image source]

Over the course of the century, knighthood gradually disappeared from the face of the earth.

A paradigm shift. A disruption.

*       *       *

After the big promise of web 1.0 was not delivered, resulting in the .com crash of 2000-2001, the development of robust RPC technologies combined with better languages and tooling gradually rose to fulfil the same promise in web 2.0. On the enterprise front, the need to reduce cost by automating business processes led to the growth of IT departments in virtually any company that had a chance of surviving the 2000s.

In small-to-medium enterprises, the solutions almost invariably involved some form of a database in the backend, storing CRUD operations performed on data entry forms. The need for reporting on those databases resulted in creating Business Intelligence functions employing more and more SQL experts.

With the rise of e-Commerce, most companies needed to have an online presence and the ability to offer some form of shopping experience online. On the other hand, to reduce the cost of postage and paper, companies started offering account management online.

Whether SOA or not, these systems functioned pretty well for the limited functionality they were offering. The important skills the developers of these systems needed were a good command of the language used, object-oriented design principles (e.g. SOLID, etc.), TDD and knowledge of agile principles and processes. In terms of scalability and performance, these systems were rarely, if ever, pressed hard enough to break - even sticky sessions could work as long as you had enough servers (it was often said "we are not Google or Facebook"). Obviously availability suffered, but downtime was something businesses had got used to and it was accepted as the general failure of IT.

True, some of these systems were actually “lifted and shifted” to the cloud, but in reality not much had changed from the naive solutions of the early 2000s. And I call these systems The Simpleton Swamps.

Did you see what was lacking in all of above? Distributed Computing.

*       *       *

It is a fair question that we need to ask ourselves: what was it that we, as the .NET community, were doing during the last 10 years of innovation? The first wave of innovation was the introduction of the revolutionary papers on BigTable and Dynamo, which later resulted in the emergence of the NoSQL movement with Apache Cassandra, Riak and Redis (and later Elasticsearch). [During this time I guess we were busy with WPF and Silverlight. Where are they now?]

The second wave was the Big Data revolution with the Apache Hadoop ecosystem (HDFS, Pig, Hive, Mahout, Flume, HBase). [I guess we were doing Windows Phone development, building Metro UIs back then. Where are they now?]

The third wave started with Kafka (and streaming solutions that followed), Grid Computing platforms with YARN and Mesos and also the extended Big Data family such as Spark, Storm, Impala, Drill, too many to name. In the meantime, Machine Learning became mainstream and the success of Deep Learning brought yet another dimension to the industry. [I guess we were rebuilding our web stack with Katana project. Where is it now?]

And finally we have the Docker family and extended Grid Computing (registry, discovery and orchestration) software such as DCOS, Kubernetes, Marathon, Consul, etcd… Also the logging/monitoring stacks such as Kibana, Grafana, InfluxDB, etc which had started along the way as an essential ingredient of any such serious venture. The point is neither the creators nor the consumers of these frameworks could do any of this without in-depth knowledge of Distributed Computing. These platforms are not built to shield you from it, but to merely empower you to make the right decisions without having to implement a consensus algorithm from scratch or dealing with the subtleties of building a gossip protocol.


And what was it that we have been doing recently? Well, I guess we were rebuilding our stacks again with #vNext aka #DNX aka #aspnetcore. Where are they now? Well, actually a release is coming soon: the 27th of June to be exact. But anyone who has been following the events closely knows that due to recent changes in direction, we are still - give or take - 9 to 18 months away from a stable platform that can be built upon.

So a big storm of paradigm shifts swept the whole industry while we were still tinkering with our simpleton swamps. Please just have a look at this big list; only a single one of them is C#: Greg Young's EventStore. And by looking at the list you see the same pattern, the same shifts in focus.

The .NET ecosystem is dangerously oblivious to distributed computing. True, we have recent exceptions such as Akka.NET (a port from the JVM) or Orleans, but they have not really penetrated and infused the ecosystem. If all we want to do is simply build front-end APIs (akin to nodejs) or cross-platform native apps (using Xamarin Studio), that is not a problem. But if we are not supposed to build the sizeable chunk of backend services, let's make that clear here.

*       *       *

Actually there is a fair amount of distributed computing happening in .NET. Over the last 7 years Microsoft has built a significant number of services that are out to compete with the big list mentioned above: Azure Table Storage (arguably a BigTable implementation), Azure Blob Storage (Amazon Dynamo?) and EventHub (rubbing shoulders with Kafka). Also a highly-available RDBMS (SQL Azure), a message broker (Azure Service Bus) and a consensus implementation (Service Fabric). There is plenty of Machine Learning as well and, although slowly, Microsoft is picking up on Grid Computing - the alliance with Mesosphere and the DCOS offering on Azure.

But none of these have been open sourced. True, Amazon does not open source its bread-and-butter cloud either. But AWS has mainly been an IaaS offering, while Azure is banking on its PaaS capabilities, making Distributed Computing easy for its predominantly .NET consumers. It feels as if Microsoft is saying: you know, let me deal with the really hard stuff, but for sure, I will leave a button in Visual Studio so you can deploy it to Azure.


At points it feels as if Microsoft, as the lord of the .NET stack fiefdom, having discovered gunpowder, is sending us knights and peasant soldiers to attack with our lances, axes and swords, while keeping the gunpowder weapons and their science safely locked away for the protection of the castle. The .NET community is to a degree contributing to #dotnetcore while also waiting for the silver bullet that #dotnetcore has been promised to be, revolutionising and disrupting the entire stack. But ask yourself, when was the last time that better abstractions and tooling brought about disruption? The knight is dead, gunpowder has changed the horizon, yet there seem to be no ears to hear.

Fiefdom of .NET stack
We cannot fault any business entity for keeping its trade secrets. But if the soldiers fall, ultimately the castle will fall too.

In fact, a single company is not able to pull the weight of re-inventing these emerging innovations on its own. While the quantity of technologies that have emerged from Azure is astounding, quality has not always followed. I have complained to Microsoft about the performance of Azure Table Storage, others have found the same, and some abandon the Azure ship completely.


No single company is big enough to do it all by itself. Not even Microsoft.

*       *       *

I remember when we used to make fun of Java and Java developers (uninspiring, slow, Eclipse was a nightmare). They actually built most of the innovations of the last decade, from Hadoop to Elasticsearch to Storm to Kafka... In fact, looking at the top 100 Java repositories on GitHub (minus Android Java), you find 24 distributed computing projects, 4 machine learning library repos and 2 languages. In C# you get only 3 with claims to distributed computing: ServiceStack, Orleans and Akka.NET.

But maybe it is fine, we have our jobs and we focus on solving different kinds of problems? Errrm... let's look at some data.

The market share of the IIS web server has halved over the last 6 years - according to multiple independent sources. [This source confirms the share was >20% in 2010]

IIS share of the market has almost halved in the last 6 years [source]

Now the market share of C# ASP.NET developer jobs is also decreasing towards half of its peak of around 4%:

Job trend for C# ASP.NET developer [source]
And if you do not believe that, see another comparison with other stacks from another source:

Comparing trend of C# (dark blue) and ASP.NET (red) jobs with that of Python (yellow), Scala (green) and nodejs (blue). C# and ASP.NET dropping while the rest growing [source]

OK, that was actually nothing; what I care more about is OSS. The Open Source revolution in .NET, which had been growing at a steady pace since 2008-2009, almost reached a peak in 2012 with the ASP.NET Web API excitement and then grew at a slower pace (almost a plateau, visible on the 4M chart - see appendix). [By the way, I have had my share of these repos: 7 of them are mine.]

OSS C# project creation in Github over the last 6 years (10 stars or more). Growth slowed since 2012 and there is a marked drop after March 2015 probably due to "vNext". [Source of the data: Github]

What is worse is that the data shows that with the announcement of #vNext aka #DNX aka #dotnetcore there was a sharp decline in new OSS C# projects - the community is in limbo waiting for the release: people find it pointless to create OSS projects on the current platform, and the future platform is so much in flux that it is not stable enough for innovation. With the recent changes announced, it will practically take another 12-18 months to stabilise (some might argue 6-12 months; fair enough, take what you like). For me this is the most alarming of all.

So all is lost?

All is never lost. You still find good COBOL or FoxPro developers and since it is a niche market, they are usually paid very well. But the danger is losing relevance…

Can Microsoft practically pull it off? Perhaps. I do not believe it is hopeless; I feel that with a radical change, by taking the steps below, Microsoft could materially reverse the decay:
  1. Your best community brains in Distributed Computing and Machine Learning are in the F# community; they have already built many OSS projects on both - sadly remaining obscure and used by only a few. Support and promote F# not just as a first-class language but as THE preferred language of the .NET stack (and by the way, wherever I said .NET stack, I meant C# and VB). Ask everyone to gradually move. I don't know why you have not done it. I think someone somewhere in Redmond does not like it and he/she is your biggest enemy.
  2. Open source a good part of Azure's distributed services. Let the community help you improve them. Believe me, you are behind the state of the art; frankly, no one is going to copy it. Who would copy from Azure Table Storage and not Cassandra?!
  3. Stop promoting deployment to Azure from Visual Studio with the click of a button, making Distributed Computing look trivial. Tell them the truth, tell them it is hard, tell them so few succeed and hence they need to go back and study, and forever forget about the one-button-click stuff. You are not doing them or yourself a favour. No one should be expected to deploy anything in a distributed fashion without sound knowledge of Distributed Computing.

Last word

So when I am asked whether I am optimistic about the future of .NET or about the progress of dotnetcore, I usually keep silent: we seem to be missing the point on where we need to go with .NET - a paradigm shift has been ignored by our ecosystem. True, dotnetcore will be released on the 27th, but after all, it might not matter as much as we care to think. One of the reasons we are losing to other stacks is that we are losing our relevance. We do not have all the time in the world. Time is short...

Appendix

Github Data

Gathering the data from GitHub is possible, but due to search results being limited to 1000 and due to rate-limiting, it takes a while to process. The best approach I found was to list repos by update date and keep moving up. I used a Python script to gather the data.

It is sensible to use the number of stars as the bar for the quality and importance of GitHub projects. But choosing the threshold is not easy, and there is usually a lag between the creation of a project and it gaining popularity. That is why the threshold has been chosen very low. But if you think the drop in creation of C# projects on GitHub was due to this lag, think again. Here is the chart of all C# projects regardless of their stars (0 stars or more):


All C# projects in github (0 stars and more) - marked drop in early 2015 and beyond

F# is showing healthy growth, but the number of projects and stars is much lower than that of C#. Hence here we look at projects with 3 stars or more:


OSS F# projects in Github - 3 stars or more
Projects with 0 stars or more (possibly showing people starting to pick it up and play with it) are looking very healthy:


All F# projects regardless of stars - steady rise.


Data is available for download: C# here and F# here

My previous predictions

This is actually my second post of this nature. I wrote one 2.5 years ago, raising alarm bells for the lack of innovation in .NET and predicting 4 things that would happen in 5 years (2.5 years from now):
  1. All Data problems will be Big Data problems
  2. Any server-side code that cannot be horizontally scaled is gonna die
  3. Data locality will still be an issue so technologies closer to data will prevail
  4. We need 10x or 100x more data scientists and AI specialists
Judge for yourself...


Deleted section

For the sake of brevity I had to delete this section, but it puts into context how we now have many more hyperscale companies:

"In the 2000s, not many had the problem of scale. We had Google, Yahoo and Amazon, and later Facebook and Twitter. These companies had to solve serious computing problems in terms of scalability and availability that on one hand lead to the Big Data innovations and on the other hand made Grid Computing more accessible.

By commoditising the hardware, cloud computing allowed companies to experiment with problems of scale and to innovate for achieving high availability. The results have been completely re-platformed enterprises (such as Netflix) and the emergence of a new breed of hyperscale startups such as LinkedIn, Spotify, Airbnb, Uber, Gilt and Etsy. The rise of companies building software to solve problems related to these architectures, such as Hashicorp, Docker, Mesosphere, etc., has added another dimension to all this.

And last but not least is the importance of the close relationship between academia and the industry, which seems to be happening again after a long (and sad) hiatus. This has led to many academic lecturers acting as Chief Scientists, etc., influencing the science underlying the disruptive changes.

There was a paradigm shift here. Did you see it?"

Thursday, 27 August 2015

No-nonsense Azure Monitoring in 20 Minutes (maybe 21) using ECK stack

The Azure platform has been around for 6 years now and is going from strength to strength. With the release of many different services and options (and sometimes too many services), it is now difficult to think of a technology, tool or paradigm which is not "there" - albeit perhaps not exactly in the shape that you had wished for. Having said that, monitoring - even by the admission of some of the product teams - has not been the strongest of features in Azure. Sadly, when building cloud systems, monitoring/telemetry is not a feature: it is a must.

I do not want to rant for hours about why and how a product that is mainly built for external customers differs from an internal one that, on the strength of its success, gets packaged up and released (as is the case with AWS), but a consistent and working telemetry option in Azure is pretty much missing - there are bits and pieces here and there but no consolidated story. I am informed that even internal teams within Microsoft had to build their own monitoring solutions (something similar to what I am about to describe further down). And as the last piece of rant, let me tell you: whoever designed this chart with this puny level of data resolution must be punished with the most severe penalty ever known to man: actually using it - to investigate a production issue.

A 7-day chart, with 14 data points. Whoever designed this UI should be punished with the most severe penalty known to man ... actually using it - to investigate a production issue.

What are you on about?

Well, if you have used Azure to deliver any serious solution and then tried to do any sort of support, investigation or root cause analysis without using one of the paid telemetry solutions (and even with them), painfully browsing through gigs of data in Table Storage, you would know the pain. Yeah, that's what I am talking about! I know you have been there, me too.

And here, I am presenting a solution to the telemetry problem that can give you these kinds of sexy charts, very quickly, on top of your existing Azure WAD tables (and other data sources) - tried, tested and working, requiring some setup and very little maintenance.


If you are already familiar with the ELK (Elasticsearch, LogStash and Kibana) stack, you might be saying you have already got that. True. But while LogStash is great and has many groks, it has very much been designed with the Linux mindset: just a daemon running locally on your box/VM, reading your syslog and delivering it over to Elasticsearch. The way Azure works is totally different: the local monitoring agent running on the VM keeps shovelling your data to durable and highly available storage (Table or Blob) - which I quite like. With VMs being essentially ephemeral, it makes a lot of sense to master your logging outside the boxes and to read the data from those storages. Now, that is all well and good, but when you have many instances of the same role (say you have scaled to 10 nodes) writing to the same storage, the data is usually much bigger than what a single process can handle, and the shovelling needs to be scaled, requiring centralised scheduling.

The gist of it: I am offering ECK (Elasticsearch, ConveyorBelt and Kibana), an alternative to LogStash that is Azure-friendly (it typically runs in a Worker Role), can tap into your existing WAD logs (as well as custom ones) out of the box and, with the push of a button, can be horizontally scaled to N to handle the load for all your projects - and for your enterprise if you work for one. And it is open source, and can be extended to shovel data from any other source.

At its core, ConveyorBelt employs a clustering mechanism that breaks down the work into chunks (scheduling), keeps a pointer to the last scheduled point, pushes data to Elasticsearch in parallel and in batches, and gracefully retries the work if it fails. It is headless, so any node can fail, be shut down, restarted, added or removed - without affecting the integrity of the cluster. All of this without waking you up at night and, basically, after a few days, making you forget it ever existed. In the enterprise I work for, we use just 3 medium instances to power analytics from 70 different production storage tables (and blobs).

Basic Concepts

Before you set up your own ConveyorBelt (CB) cluster, it is better to know a few concepts and facts.

First of all, there is a one-to-one mapping between an Elasticsearch cluster and a ConveyorBelt cluster. ConveyorBelt has a list of DiagnosticSources, typically stored in an Azure Table Storage table, each of which contains all data (and state) pertaining to a source. A source is typically a Table Storage table, or a blob folder containing diagnostic (or other) data - but CB is extensible to accept other data stores such as SQL, files or even Elasticsearch itself (yes, if you ever wanted to copy data from one ES to another). A DiagnosticSource contains the connection information for CB to connect. CB continuously breaks down (schedules) the work for its DiagnosticSources and keeps updating the LastOffset.

Once the work is broken down into bite-size chunks, they are picked up by actors (it internally uses BeeHive) and the data within each chunk is pushed up to your Elasticsearch cluster. There is usually a delay between data being captured and it being copied (something you typically set in the Azure diagnostics configuration: how often to copy the data), so you set a Grace Period after which, if the data is not there, it is assumed it never will be. Your Elasticsearch data will usually be behind realtime by the Grace Period. If you left everything as defaults, Azure copies data every minute, so a Grace Period of 3-5 minutes is safe. For IIS logs this is usually longer (I use 15-20 minutes).

The data that is pushed to the Elasticsearch requires:
  • An index name: by default the date in the yyyyMMdd format is used as the index name (but you can provide your own index)
  • The type name: default is PartitionKey + _ + RowKey (or the one you provide)
  • Elasticsearch mapping: the Elasticsearch equivalent of a schema, which defines how to store and index data for a source. These mappings are stored at a URL (a web folder or a public read-only Azure Blob folder) - mappings for typical Azure data (WAD logs, WAD perf data and IIS logs) are already available by default and you just need to copy them to your site or public Blob folder (a rough sketch of what a mapping looks like follows this list).
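To give an idea, an Elasticsearch 1.x field mapping roughly looks like the snippet below. This is only an illustrative sketch with made-up field names; the files shipped in the repository's mappings folder are the authoritative versions of what CB actually expects:

{
  "properties": {
    "PartitionKey": { "type": "string", "index": "not_analyzed" },
    "RowKey":       { "type": "string", "index": "not_analyzed" },
    "Timestamp":    { "type": "date" },
    "Message":      { "type": "string" }
  }
}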

Set up your own monitoring suite

OK, now it is time to create our own ConveyorBelt cluster! Basically the CB cluster will shovel the data to an Elasticsearch cluster, and you would need Kibana to visualise your data. Further below I explain how to set up Elasticsearch and Kibana on a Linux VM. But ...

if you are just testing the waters and want to try CB, you can create a Windows VM, download Elasticsearch and Kibana and run their batch files and then move to setting up CB. But after you have seen it working, come back to the instructions and set it up in a Linux box, its natural habitat.

So setting this up on Windows is just a matter of downloading the files from the links below, unzipping them and then running the batch files elasticsearch.bat and kibana.bat. Make sure you expose ports 5601 and 9200 from your VM by creating endpoints.

https://download.elastic.co/kibana/kibana/kibana-4.1.1-windows.zip
https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.1.zip

Set up ConveyorBelt

As discussed above, ConveyorBelt is typically deployed as an Azure Cloud Service. In order to do that, you need to clone the GitHub repo, build it and then deploy it with your own credentials and settings - and all of this should be pretty easy. Once deployed, you would need to define various diagnostic sources and point them to your Elasticsearch, and then just relax and let CB do its work. So we will look at the steps now.

Clone and build ConveyorBelt repo

You can use command line:
git clone https://github.com/aliostad/ConveyorBelt.git
Or use your tool of choice to clone the repo. Then open an administrative PowerShell window, move to the build folder and execute .\build.ps1

Deploy mappings

Elasticsearch is able to guess the data types of your data and index them in a format that is usually suitable. However, this is not always the case, so we need to tell Elasticsearch how to store each field, and that is why CB needs to know this in advance.

To deploy mappings, create a Blob Storage container with the option "Public Container" - this allows the content to be publicly available in a read-only fashion. 

You would need the URL for the next step. It is in the format:
https://<storage account name>.blob.core.windows.net/<container name>/

Then use the tool of your choice to copy the mapping files, found in the mappings folder under the ConveyorBelt directory, into that container.

Configure and deploy

Once you have built the solution, rename the tokens.json.template file to tokens.json and edit it (if you need some more info, find the instructions here). Then, in the same PowerShell window, run the command below, replacing the placeholders with your own values:
.\PublishCloudService.ps1 `
  -serviceName <name your ConveyorBelt Azure service> `
  -storageAccountName <name of the storage account needed for the deployment of the service> `
  -subscriptionDataFile <your .publishsettings file> `
  -selectedsubscription <name of subscription to use> `
  -affinityGroupName <affinity group or Azure region to deploy to>
After running the command, you should see PowerShell deploying CB to the cloud with a single Medium instance. In the storage account you defined, you should now find a new table, whose name you defined in the tokens.json file.

Configure your diagnostic sources

Configuring the diagnostic sources can differ wildly depending on the type of the source. But for standard tables such as WADLogsTable, WADPerformanceCountersTable and WADWindowsEventLogsTable (whose mapping files you just copied) it is straightforward.

Now choose an Azure diagnostics Storage Account with some data, and in the diagnostic source table create a new row and add the entries below (if you would rather script this, see the sketch after the list):

  • PartitionKey: whatever you like - commonly <top level business domain>_<mid level business domain>
  • RowKey: whatever you like - commonly <env: live/test/integration>_<service name>_<log type: logs/wlogs/perf/iis/custom>
  • ConnectionString (string): connection string to the Storage Account containing WADLogsTable (or others)
  • GracePeriodMinutes (int): Depends on how often your logs gets copied to Azure table. If it is 10 minutes then 15 should be ok, if it is 1 minute then 3 is fine.
  • IsActive (bool): True
  • MappingName (string): WADLogsTable. ConveyorBelt will look for the mapping at the URL "X/Y.json", where X is the value you defined in your tokens.json for the mappings path and Y is the TableName (see below).
  • LastOffsetPoint (string): set to the ISO date (seconds and milliseconds MUST BE ZERO!!) from which you want the data to be copied, e.g. 2015-02-15T19:34:00.0000000+00:00
  • LastScheduled (datetime): set it to a date in the past, the same as the LastOffsetPoint. Why do we have both? Well, each does something different, so we need both.
  • MaxItemsInAScheduleRun (int): 100000 is fine
  • SchedulerType (string): ConveyorBelt.Tooling.Scheduling.MinuteTableShardScheduler
  • SchedulingFrequencyMinutes (int): 1
  • TableName (string): WADLogsTable, WADPerformanceCountersTable or WADWindowsEventLogsTable
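For those who prefer scripting the row, here is a hedged C# sketch using the WindowsAzure.Storage SDK. The PartitionKey/RowKey values follow the naming convention above but are purely illustrative, and the table name is whatever you configured in tokens.json:

// A hedged sketch: inserting a DiagnosticSource row with the WindowsAzure.Storage SDK
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

var account = CloudStorageAccount.Parse("<CB storage account connection string>");
var table = account.CreateCloudTableClient().GetTableReference("<table name from tokens.json>");

var row = new DynamicTableEntity("commerce_checkout", "live_orderapi_logs"); // PartitionKey, RowKey (illustrative)
row.Properties["ConnectionString"] = new EntityProperty("<diagnostics storage connection string>");
row.Properties["GracePeriodMinutes"] = new EntityProperty(3);
row.Properties["IsActive"] = new EntityProperty(true);
row.Properties["MappingName"] = new EntityProperty("WADLogsTable");
row.Properties["LastOffsetPoint"] = new EntityProperty("2015-02-15T19:34:00.0000000+00:00");
row.Properties["LastScheduled"] = new EntityProperty(new DateTimeOffset(2015, 2, 15, 19, 34, 0, TimeSpan.Zero));
row.Properties["MaxItemsInAScheduleRun"] = new EntityProperty(100000);
row.Properties["SchedulerType"] = new EntityProperty("ConveyorBelt.Tooling.Scheduling.MinuteTableShardScheduler");
row.Properties["SchedulingFrequencyMinutes"] = new EntityProperty(1);
row.Properties["TableName"] = new EntityProperty("WADLogsTable");

table.Execute(TableOperation.Insert(row));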
Once the row is saved, CB will start shovelling your data to your Elasticsearch and you should start seeing some data. If you do not, look at the entries you created in the Table Storage and you will find an Error column which tells you what went wrong. To investigate further, just RDP to one of your ConveyorBelt VMs and run DebugView with "Capture Global Win32" enabled - you should see some activity similar to the picture below. Any exceptions will also show up in there.


OK, that is it... you are done! ... well barely 20 minutes, wasn't it? :)


Now in case you are interested in setting up ES+Kibana in Linux, here is your little guide.

Set up your Elasticsearch in Linux

You can run Elasticsearch on Windows or Linux - I prefer the latter. To set up an Ubuntu box on Azure, you can follow the instructions here. Ideally you need to add a disk volume, as the VM disks are ephemeral - all you need to know is outlined here. Make sure you follow the instructions to re-mount the drive after reboots. Another alternative, especially for your dev and test environments, is to go with D-series machines (SSD disks) and use the ephemeral disks - they are fast, and basically if you lose the data, you can always get ConveyorBelt to re-add it, which it does quickly. As I said before, never use Elasticsearch to master your logging data, so that you can recover from losing it.

Almost all of the commands and settings below need to be run in an SSH session. If you are a geek with a lot of Linux experience, you might find some of the details below obvious and unnecessary - in which case just move on.

SSH is your best friend

Anyway, back to setting up ES - after you have got your VM provisioned, SSH to the box and install the Oracle JDK:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
And then install Elasticsearch:
wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.1.deb
sudo dpkg -i elasticsearch-1.7.1.deb
Now you have installed ES v1.7.1. To set Elasticsearch to start at reboots (the equivalent of a Windows service), run these commands in SSH:
sudo update-rc.d elasticsearch defaults 95 10
sudo /etc/init.d/elasticsearch start
Now, ideally you would want to move the data and logs to the durable drive you have mounted; just edit the Elasticsearch config in vim:
sudo vim /etc/elasticsearch/elasticsearch.yml
and then (note uncommented lines):
path.data: /mounted/elasticsearch/data
# Path to temporary files:
#
#path.work: /path/to/work

# Path to log files:
#
path.logs:  /mounted/elasticsearch/data
Now you are ready to restart Elasticsearch:
sudo service elasticsearch restart
Note: Elasticsearch is memory, CPU and IO hungry. SSD drives really help, but if you do not have them (D-series VMs), make sure you provide plenty of RAM and enough CPU. Searches are CPU-heavy, so it will depend on the number of concurrent users.
If your machine has a lot of RAM, make sure you set ES memory settings as the default ones will be small. So update the file below and set the memory to 50-60% of the total memory size of the box:
sudo vim /etc/default/elasticsearch
And uncomment this line and set the memory size to half of your box’s memory (here 14GB, just an example!):
ES_HEAP_SIZE=14g
There are potentially other changes that you might want to make (two examples are given below). For instance, based on the number of your nodes, you want to set index.number_of_replicas in your elasticsearch.yml - if you have a single node, set it to 0. Also turn off multicast/Zen discovery, since it will not work in Azure. But these are things you can start learning about once you are completely hooked on the power of the information provided by the solution. Believe me, more addictive than narcotics!
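For reference, those two tweaks are single lines in elasticsearch.yml - a hedged example assuming a single-node cluster:

# Single node: no replicas needed
index.number_of_replicas: 0
# Multicast/Zen discovery does not work in Azure; disable it
discovery.zen.ping.multicast.enabled: false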

Set up the Kibana in Linux

Up until version 4, Kibana was simply a set of static HTML+CSS+JS files that would run locally in your browser by just opening the root HTML file. This model was not really sustainable, and with version 4, Kibana runs as a service on a box, most likely different from your ES nodes. But for PoCs and small use cases it is absolutely fine to run it on the same box.
Installing Kibana is straightforward. You just need to download and unpack it:
wget https://download.elastic.co/kibana/kibana/kibana-4.1.1-linux-x64.tar.gz
tar xvf kibana-4.1.1-linux-x64.tar.gz
So now Kibana will be downloaded to your home directory and unpacked into the kibana-4.1.1-linux-x64 folder. If you want to see where that folder is, run pwd to get the full path.
Now run the commands below to start Kibana:
cd kibana-4.1.1-linux-x64/bin
./kibana
That will do for testing whether it works, but you need to configure it to start at boot. We can use upstart for this. Just create a file in the /etc/init folder:
sudo vim /etc/init/kibana.conf
and copy the below (path could be different) and save:
description "Kibana startup"
author "Ali"
start on runlevel [2345]
stop on runlevel [!2345]
exec /home/azureuser/kibana-4.1.1-linux-x64/bin/kibana
Now run this command to make sure there is no syntax error:
init-checkconf /etc/init/kibana.conf
If all is good, then start the service:
sudo start kibana
If you have installed Kibana on the same box as Elasticsearch and left all ports as they are, you should now be able to browse to the server on port 5601 (make sure you expose this port on your VM by configuring endpoints) and you should see the Kibana screen (obviously with no data yet).



Sunday, 5 October 2014

What should I do?

Level [C1]

TLDR;: I was charged for a huge egress on one of my VMs and I have no way of knowing what caused it, or whether it was an infrastructure glitch that had nothing to do with the VM.

OK, here is the snippet of the last email I received back:

"I understand what you’re saying. Because this involves a non-windows VM, we wouldn’t be able to determine what caused this. we can only validate the usage, and as you already know, the data usage seems quite appropriate, comparing to our logs. Had this been a Windows machine, we could have engaged another team(s) to have this matter looked into. As of now, I am afraid, this is all we have. You might want to check with Ubuntu support to see what has caused this."

The story started two weeks ago. I have, you know, an MSDN account courtesy of my work, which provides around £95/month of free Windows Azure credit - for which I am really grateful. It has allowed me to run some kind of pre-startup stuff on a shoestring. I recently realised that the free credit can only take you so far, so I started using Azure services more liberally, knowing that I was going to be charged. At the end of the day, nothing valuable comes out of nothing. But before doing that, I had also registered for AWS, and as you know, it provides some level of free services, which I again took advantage of.

But I have not said anything about the problem yet. It was around the end of the month and I knew my remaining credit would be enough to carry me to the next month. Then I noticed my credit panel turning from green to orange (this is quite handy, telling you that at the current rate of usage you will soon run out of credit), which I thought was bizarre, and then the next day I realised all my services had disappeared. Totally gone! Bang! I had run out of credit...

This was a Saturday and I spent Saturday and Sunday reinstating my services. So I learnt the lesson that I needed to remove the spending cap - which is not the reason you are reading this. The reason I ran out of credit was egress (= data out) from one of my Linux boxes... this box used to have an egress of a few MB to at most a few hundred MB a day, which suddenly shot up to 175GB and 186GB! OK, either there was a mistake or my box had been hacked into - with the latter more likely.

Here is the egress from that "renegade" Linux box:
8/30/2014 "Data Transfer Out (GB)" "GB" 0.004967
8/31/2014 "Data Transfer Out (GB)" "GB" 0.006748
9/1/2014 "Data Transfer Out (GB)" "GB" 0.001735
9/2/2014 "Data Transfer Out (GB)" "GB" 0.17618
9/3/2014 "Data Transfer Out (GB)" "GB" 0.003499
9/4/2014 "Data Transfer Out (GB)" "GB" 0.013394
9/5/2014 "Data Transfer Out (GB)" "GB" 0.016147
9/6/2014 "Data Transfer Out (GB)" "GB" 0.005412
9/7/2014 "Data Transfer Out (GB)" "GB" 0.005803
9/8/2014 "Data Transfer Out (GB)" "GB" 0.001547
9/9/2014 "Data Transfer Out (GB)" "GB" 0.003044
9/10/2014 "Data Transfer Out (GB)" "GB" 0.002179
9/11/2014 "Data Transfer Out (GB)" "GB" 0.02876
9/12/2014 "Data Transfer Out (GB)" "GB" 0.008922
9/13/2014 "Data Transfer Out (GB)" "GB" 0.28983
9/14/2014 "Data Transfer Out (GB)" "GB" 0.099229
9/15/2014 "Data Transfer Out (GB)" "GB" 0.002653
9/16/2014 "Data Transfer Out (GB)" "GB" 0.00191
9/17/2014 "Data Transfer Out (GB)" "GB" 0.00182
9/18/2014 "Data Transfer Out (GB)" "GB" 175.69292
9/19/2014 "Data Transfer Out (GB)" "GB" 182.974478

This box was running an ElasticSearch instance which had barely 1GB of data. And yes, it was not protected, so it could have been hacked into. So what I did, with a bunch of bash commands which I conveniently copied and pasted from Google searches, was to create a list of files that had been changed on the box, ordered by date, and send it to support. There was nothing suspicious there - and the support team did not find it any more useful [in fact the comment was that it was "poorly formatted", I assume due to the difference in newline characters on Linux :)].

So it seemed less likely that it was hacked, but maybe someone had been running queries against the ElasticSearch, which had been secured only by its obscurity. But hang on! If that were the case, the ingress should somehow correspond:
8/30/2014 "Data Transfer In (GB)" "GB" 0.004335
8/31/2014 "Data Transfer In (GB)" "GB" 0.005579
9/1/2014 "Data Transfer In (GB)" "GB" 0.000744
9/2/2014 "Data Transfer In (GB)" "GB" 0.021571
9/3/2014 "Data Transfer In (GB)" "GB" 0.002983
9/4/2014 "Data Transfer In (GB)" "GB" 0.002571
9/5/2014 "Data Transfer In (GB)" "GB" 0.002961
9/6/2014 "Data Transfer In (GB)" "GB" 0.001994
9/7/2014 "Data Transfer In (GB)" "GB" 0.001642
9/8/2014 "Data Transfer In (GB)" "GB" 0.000483
9/9/2014 "Data Transfer In (GB)" "GB" 0.001879
9/10/2014 "Data Transfer In (GB)" "GB" 0.002022
9/11/2014 "Data Transfer In (GB)" "GB" 0.017067
9/12/2014 "Data Transfer In (GB)" "GB" 0.002644
9/13/2014 "Data Transfer In (GB)" "GB" 0.347959
9/14/2014 "Data Transfer In (GB)" "GB" 0.089146
9/15/2014 "Data Transfer In (GB)" "GB" 0.000404
9/16/2014 "Data Transfer In (GB)" "GB" 0.001912
9/17/2014 "Data Transfer In (GB)" "GB" 0.001733
9/18/2014 "Data Transfer In (GB)" "GB" 0.012967
9/19/2014 "Data Transfer In (GB)" "GB" 0.021446

which it does on all days other than the 18th and 19th. Which made me think it was perhaps all a mistake, and maybe an Azure infrastructure agent or something had gone out of control and started doing this.

So I asked support to start investigating the issue. It took a week for them to get back to me, and the investigation provided only the hourly breakdown (I was hoping for more, perhaps some kind of explanation or identification of the IP address all this egress was going to). The pattern is also bizarre. For example, here is the hourly breakdown for the 18th:
2014-09-18T00:00:00 2014-09-18T01:00:00 DataTrOut 166428 External
2014-09-18T01:00:00 2014-09-18T02:00:00 DataTrOut 374040 External
2014-09-18T02:00:00 2014-09-18T03:00:00 DataTrOut 2588121384 External
2014-09-18T03:00:00 2014-09-18T04:00:00 DataTrOut 539993671 External
2014-09-18T04:00:00 2014-09-18T05:00:00 DataTrOut 1128216 External
2014-09-18T05:00:00 2014-09-18T06:00:00 DataTrOut 25462 External
2014-09-18T06:00:00 2014-09-18T07:00:00 DataTrOut 18308 AM2
2014-09-18T06:00:00 2014-09-18T07:00:00 DataTrOut 63250 External
2014-09-18T07:00:00 2014-09-18T08:00:00 DataTrOut 24588 External
2014-09-18T08:00:00 2014-09-18T09:00:00 DataTrOut 82296 External
2014-09-18T09:00:00 2014-09-18T10:00:00 DataTrOut 59362 External
2014-09-18T10:00:00 2014-09-18T11:00:00 DataTrOut 10573316727 External
2014-09-18T11:00:00 2014-09-18T12:00:00 DataTrOut 11443247791 External
2014-09-18T12:00:00 2014-09-18T13:00:00 DataTrOut 13854724048 External
2014-09-18T13:00:00 2014-09-18T14:00:00 DataTrOut 8115190263 External
2014-09-18T14:00:00 2014-09-18T15:00:00 DataTrOut 13748807057 External
2014-09-18T15:00:00 2014-09-18T16:00:00 DataTrOut 10389478694 External
2014-09-18T16:00:00 2014-09-18T17:00:00 DataTrOut 19979259451 External
2014-09-18T17:00:00 2014-09-18T18:00:00 DataTrOut 21398993891 External
2014-09-18T18:00:00 2014-09-18T19:00:00 DataTrOut 22843598777 External
2014-09-18T19:00:00 2014-09-18T20:00:00 DataTrOut 23087199863 External
2014-09-18T20:00:00 2014-09-18T21:00:00 DataTrOut 16958070173 External
2014-09-18T21:00:00 2014-09-18T22:00:00 DataTrOut 13126214430 External
2014-09-18T22:00:00 2014-09-18T23:00:00 DataTrOut 352327 External
2014-09-18T23:00:00 2014-09-19T00:00:00 DataTrOut 358377 External

So what should I do?

So first of all, I have now put the ElasticSearch box behind a proxy and access to it requires authentication with the proxy - better to do it now rather than later. The ES box is now also protected by IPsec.

But really the big question is: when you are on the cloud and you do not own any of the infrastructure or its monitoring, how can you make sure you are being charged fairly? My £40 bill for the egress is not huge, but it makes me wonder: what if it happens again? What would I do?

There are also other questions: would this have been different on another provider? I am not really sure [although at least they could have opened a file with Linux line endings :)], but using a cloud platform requires building a trust relationship, which is essential. I really appreciate the general attitude of Azure (and Microsoft) towards Open Source in embracing everything non-Windows and I think it is the right direction, but I think the support model should also be developed in line with that. AWS is a more mature platform, but have you seen anything like this there? And if so, how was your experience?


Wednesday, 6 August 2014

Thank you Microsoft, nine months on ...

Level [C1]

I felt that, almost a year after my blog post Thank you Microsoft and so long, it was the right time to look back and contemplate the decision I made back then. If you have not read the post, well, in a nutshell, I decided to gradually move towards non-Microsoft technologies - mainly due to Microsoft's lack of innovation, especially in the Big Data space.

A few things have changed since then. TLDR; Generally it really feels like I made the right decision to diversify. Personally, I have learnt a lot and at the same time had a lot of fun. I have built a bridge to the other side and I can easily communicate with non-Microsoft peers and translate my skills. On the technology landscape, however, there have been some major changes that make me feel a hybrid skillset is much more important than a complete shift to any particular platform.

I will first look at the technology landscape as it stands now, and then share my personal journey so far in adopting alternative platforms and technologies.

A point on the predictions

OK, I made some predictions in the previous post, such as "In 5 years, all data problems become Big Data problems". Some felt this was completely wrong - which it could be - and left some not very nice messages. At the end of the day, predictions are free (and that is why I like them) and you can make your own. I am sharing my views; take them for whatever they are worth to you. I have a track record of making predictions: some came true and some did not. I predicted a massive financial crash in 2011, which did not happen - instead we got one of the biggest bull markets ever (well, my view is they pumped money into the economy and artificially made the bull market) - and I lost some money. On the other hand, back in 2010 I predicted in my StackOverflow profile something that I think is now called the Internet of Things, so I guess I was lucky (by the way, I am predicting a financial crash in the next couple of months). Anyway, take it easy :)

Technology Horizon

The New Microsoft

Since I wrote that blog post, a lot has changed, especially at Microsoft. It now has a new CEO and a radically different view on Open Source. Releasing the source of a big chunk of the .NET Framework is a harbinger of a shift whose size is difficult to guess at the moment. Is it a mere gesture? I doubt it: this was the result of years of internal campaigning from the likes of Phil Haack and Scott Hanselman, and it has finally worked its way up the hierarchy.

But adopting Open Source is not just a community-service gesture; it has important financial significance. With the rate of change in the industry, you need to keep an army of developers constantly working on and improving products at this scale. No company is big enough on its own to build what can be built by an organic and healthy ecosystem. So crowd-sourcing is an important technique for improving your product without paying for the time spent. It is probably true that the community around your product, and not so much the actual code, is the real IP of most cloud platforms.

Microsoft is also relinquishing its push strategy towards its operating system, and to be honest I am not surprised at all. Many have talked about the WebOS, but the reality is we have already had it for the last couple of years. Your small smartphone or tablet comes to life when it is connected - enabling you to do most of what you can do on a laptop/PC, with the only limitation being the screen size. On the other hand, Microsoft has released the web version of Office and, to be fair, it is capable of doing pretty much everything you can do in the desktop versions - and sometimes it does it better. So for the majority of consumers, all you need is the WebOS. The value of a desktop operating system matters less and less when most of the applications you use daily are web-based or cloud-based.


Cloud and Azure

I have been doing a lot of Azure, both at work and outside it. Apart from HDInsight, I think Azure is expanding at a phenomenal rate in both features and reliability, and this is where I feel Microsoft is closing the innovation gap. It is mind-blowing to look at the list of new features coming out of Azure every month.

Focusing mainly on the PaaS products, I think the future of Azure, in terms of adoption by smaller companies, looks more and more attractive compared to AWS, which has traditionally been the IaaS platform of choice. Companies like Netflix have built their entire software empire on AWS, but they had an army of great developers to write the tooling and integration pieces.

All-in-all I feel Azure is here to stay and might even overtake AWS in the next 5 years. The decider will be the pace of innovation.

Non-Hadoop platforms

A recent trend that could change the balance is the proliferation of non-Hadoop approaches to Big Data, which will favour Microsoft and Google. With Hadoop 2.0 trying to abstract the algorithm further away from resource management, I think there is an opportunity for Microsoft to jump in and plug in a whole host of Microsoft languages in a real way - it was possible to use C# and F# before, but hardly anyone did.

Microsoft has announced the release of AzureML, the PaaS offering of machine learning on the Azure platform. It is early to say, but it looks like this could be used for smaller-than-big-data machine learning and analysis. The platform is basically a productionisation of the machine learning behind the Bing search engine.

Another sign that Hadoop's elephant is getting old is Google's announcement that it is dropping MapReduce: "We invented it and now we are retiring it". Basically, in-memory processing looks more and more appealing due to the need for quicker feedback cycles and faster processes. There also seems to be a resurgence of focus on in-memory grid computing, perhaps as a result of the popularity of actor frameworks.

In terms of technologies, Spark and, to a degree, Storm are getting a lot of traction, and the next few months are essential to confirm the trend. These still very much come from the JVM ecosystem, but there is potential for building competitor products.

Personal progress

MacBook

This is the first thing I did after making the decision 9 months ago: I bought a MacBook. I was probably the farthest thing from an Apple fanboy, but well, it has put its hooks in me too now. I wasn't sure whether I should get a Windows laptop and run a Linux VM on it, or buy a MacBook and run a Windows VM. Funnily enough, and despite my presumptions, I found the second option to be cheaper. In fact I could not find a Windows ultrabook with 16 GB of RAM, and that is what I needed to be able to comfortably run a VM. So buying a 13.3" MacBook Pro proved both economical (in light of what you get back for the money) and the right choice - since you want your VM to be your secondary platform.

Initially I did not like OSX, but it helped me get better at using the command line - albeit the OSX variants of the Linux commands. Six months on, similar to what some of my Twitter friends had said, I don't think I will ever go back to Windows.

I have used the Mac for everything apart from Visual Studio and the occasional Visio - and using some Azure tools also had to be done on Windows. I think I now spend probably only 20% of my time in Windows, and the rest in Linux (an Azure VM) and OSX.

Linux, Shell scripting and command line

I felt ignorant discovering the wealth of command-line tools at my disposal on OSX and Linux: find, grep, sort, sed, tail, head, etc. - just amazing stuff. I admit, for some of them there might be Windows equivalents I have not heard of (which I doubt), but they make it so much easier to automate and manage your servers. So I have been working on understanding services on Linux and OSX, and learning about Apache and how to configure it... I am no expert by any stretch, but it has been fun and I have learnt a lot.
And yes, I did use Vim - and yes, I did find it difficult to exit the first time :) I am not mad about it; I just have to use it on the Linux VMs I manage, for configs etc., but I cannot see myself using it for development - at least not anytime soon.

Languages

As I said then, I had decided to start with some JVM languages. Scala felt like the right choice at the time and, knowing more about it now, even more so. It is powerful, yet the whole wealth of Java libraries is at your fingertips. It is widely adopted (and Clojure, the second candidate, not so much). Erlang is probably not appropriate for me right now, and Go is non-JVM, so I am happy with that decision.

Having said that, I have not been able to learn a lot of it. Instead I had to focus on Python for a personal NLP project - as you know, most NLP and data science tools are in Python. I had to learn to code in it, understand its OOP and functional sides, its versioning and distribution, and above all how to serve REST APIs (using Flask and Flask-RESTful) for interop with my other C# code.
My view on it? Python is simple and has nice built-in support for important data structures (list, dict, tuple, etc.), making it ideal for working with data. So it is a very useful language, but it is nowhere near as elegant as Scala or even C#. For complex stuff, I would still rather code in C#, until I properly pick up Scala again. I am also not very comfortable with distributing non-compiled code - although that is what we normally do in JavaScript (minification aside), perhaps another point of similarity between the two.
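Just to illustrate the kind of interop I mean, a minimal Flask + Flask-RESTful endpoint looks something like the sketch below - the resource name and the whitespace "tokeniser" are placeholders standing in for a real NLP pipeline, not the actual project code.

from flask import Flask
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)

class Tokens(Resource):
    # Toy stand-in for the real NLP work: just split the text on whitespace
    def get(self, text):
        return {"tokens": text.split()}

api.add_resource(Tokens, "/tokens/<string:text>")

if __name__ == "__main__":
    app.run(port=5000)  # any HTTP client - the C# code included - can call GET /tokens/<text>

The nice thing about this arrangement is that the C# side does not care what language sits behind the endpoint; HTTP and JSON are the only contract.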

Apart from these, I have still been doing a ton of C#, as I had predicted in the previous blog post. I have been working on a cloud actor mini-framework called BeeHive, which I am currently using myself. I still enjoy writing C# and am planning to try out Mono as well (.NET on OSX and Linux). Having said that, I feel tools and languages are best used in their native platform and ecosystem, so I am not sure Mono would be a viable option for me.

Conclusion

All-in-all, I think by embracing the non-Microsoft world I have made the right decision. A whole new world has suddenly opened up for me, with a lot of exciting things to learn and do. I wish I had done this earlier.

Do I think I will completely abandon my previous skills? I really doubt it: the future is not mono-colour; it is a democratised, hybrid one, where different skillsets cross-pollinate and produce better software. It feels like having a hybrid skillset is becoming more and more important, and if you are looking to position yourself better as a developer/architect, this is the path you need to take. Currently cross-platform/hybrid skills are a plus; in 5 years they will be a necessity.