I’m a tech ops kind of guy, having spent the bulk of my career monitoring performance and building systems. I spent many years at a transaction-heavy company called Betfair.
It’s a peer-to-peer gambling site where customers set the odds and bet against each other online. Speed was, and still is, critical to the site’s success: a sluggish transaction means a customer could lose money or miss out on a profitable deal.
The ability to monitor and troubleshoot systems at a fine level of granularity was critical. Of course, there weren’t any commercial tools available to do that. The best we could do for monitoring back-end (Oracle) performance (and this was 10 years ago) was to get 15-minute slices of data. That’s an eternity in the world of network and application monitoring.
Fortuitously, I met a guy down the road who was (truly) running a software business out of his garage. He said he had a product that could deliver one-minute slices of data. I replied that this was nice, and certainly better than what we had, but what we really needed was one-second data.
Betfair's customers bet on sporting events, like football, while a match is in progress. Hence usage is highly event-driven – a goal being scored, a penalty or red card occurring. This results in short periods of intense system activity.
We simply had no clue, mining our Oracle database, what was happening during those peak periods of usage. Our new garage entrepreneur friend came back with a one-second monitoring tool.
His application was hard to use, with a messy interface. Yet with it we achieved some wonderful performance improvements and could finally understand why slowdowns were occurring. The guy was clearly onto something: he later sold his application to a major IT monitoring vendor.
The point of this narrative from the scrappy days of IT monitoring is that one-second monitoring mattered back then – and it matters immeasurably more today. Nearly every customer of any type of business expects instantaneous response times. If your site or app can’t deliver, they'll go elsewhere before you get the problem fixed – and complain loud and clear on the way.
If a five-second response time is seen as too slow, then one-minute data points are not going to cut it. The customer may be getting poor service and you’ll be neither aware of it nor able to accurately troubleshoot it.
Here’s an example from the world of online travel bookings: Expedia. A few years ago, we were collecting performance data on a one-minute basis. We determined that the standard deviation of CPU utilisation was 16% and set an alerting threshold accordingly.
From this, we knew that if a one-minute average CPU reading showed 85% or higher, then we were in trouble because there would be a period within that minute in which the system was maxing out and producing a delay for customers.
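In code, that rule amounts to little more than one comparison. The sketch below is purely illustrative – the names are hypothetical and the 16% and 85% figures are simply restated from above, not lifted from our actual alerting config:

```python
# Illustrative restatement of the rule above; names and figures come from the
# text, not from any real tooling.
WITHIN_MINUTE_STDDEV = 16.0   # measured spread of per-second CPU around the minute average (%)
CPU_CEILING = 100.0

# 100% - 16% = 84%, which is roughly where the 85% threshold came from: an
# average that high almost certainly hides seconds running flat out.
ALERT_THRESHOLD = CPU_CEILING - WITHIN_MINUTE_STDDEV

def should_alert(one_minute_avg_cpu: float) -> bool:
    """Flag one-minute CPU averages that likely conceal maxed-out seconds."""
    return one_minute_avg_cpu >= ALERT_THRESHOLD

print(should_alert(85.0))   # True  -- investigate or add capacity
print(should_alert(60.0))   # False -- but only while the spread really stays near 16%
```

That last comment is the catch, and it is exactly what bit us a couple of years later.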
This was all well and good for a while: we knew the capacity and profile of the system, so we could set appropriate alerts and order more hardware to stay ahead of the growth curve with time to spare.
A couple of years later, we began to receive a barrage of complaints from customers that our site was too slow. Our one-minute slices showed 60% CPU, which would indicate normal response times.
After further investigation, we discovered that a popular hotel finder app was driving frantic bursts of activity on our site. The standard deviation had shifted from 16% to over 60%. What we couldn’t see was that, for around 10 seconds within that minute, system usage was at 100%.
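To see how a one-minute average masks that, here is a small, hypothetical simulation. The 52% off-burst figure is chosen only to make the arithmetic land on a 60% average; none of these numbers are the real Expedia measurements:

```python
import statistics

# Hypothetical per-second CPU readings for a single minute: a traffic burst
# pegs the box at 100% for ten seconds; the other fifty seconds idle at 52%.
samples = [100.0] * 10 + [52.0] * 50

minute_average = statistics.mean(samples)
saturated_seconds = sum(1 for s in samples if s >= 100.0)

print(f"one-minute average: {minute_average:.0f}%")    # 60% -- looks perfectly healthy
print(f"peak second:        {max(samples):.0f}%")      # 100% -- where customers stall
print(f"seconds at 100%:    {saturated_seconds}")      # 10 -- invisible at one-minute resolution
```

The one-minute view reports a comfortable 60%, while a sixth of the minute is spent with the system completely maxed out.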
When the CEO found out that customers were periodically experiencing 15-second delays, he was flabbergasted. How could we have no clue that these delays were occurring? The simple answer was that we didn’t have the tools to monitor every second of activity and correlate it with the user experience.
To fix the issue, we twisted enough arms to procure the hardware needed to resolve the capacity crunch, and response times came back to normal.
Today, though, we’d have the advantage of real-time monitoring systems like Boundary, whose cloud-based event monitoring collects and analyses network traffic data every second, integrates that data with other metrics such as CPU and server statistics, and then correlates all of the event data with performance.
Averages lie; they disguise reality. But when it comes to OS stats, averages (or aggregates) are all you’re going to get. So ask yourself how big a lie you’re prepared to accept, and plan for it to be smaller than what the customer can perceive.
In reality, much of your company’s data is already recorded with sufficient resolution. Your access logs are (or should be) capturing a great deal of granular data – who did what, when, and how long it took – all to the millisecond. Add the right collection and analysis tools, along with the processes and the mindset to think in seconds and react accordingly, and those episodes of customer suffering will be a distant memory.
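As a rough sketch of what that looks like in practice, the snippet below rolls millisecond-resolution access-log entries up into one-second buckets and surfaces the worst seconds by 99th-percentile response time. The CSV layout, field order and file name are assumptions for illustration, not any particular server’s log format:

```python
import csv
from collections import defaultdict
from statistics import quantiles

# Assumed log layout: CSV with an ISO-8601 timestamp (millisecond precision)
# and a response time in milliseconds, e.g.
#   2013-06-12T14:03:07.412,hotel-search,184
# Field names and format are hypothetical; adapt to your own access logs.

def worst_seconds(path: str, top_n: int = 5) -> None:
    per_second = defaultdict(list)  # second -> list of response times (ms)
    with open(path, newline="") as fh:
        for timestamp, _endpoint, response_ms in csv.reader(fh):
            second = timestamp[:19]            # truncate to one-second resolution
            per_second[second].append(float(response_ms))

    # Rank each second by its p99 response time rather than its average,
    # since the average is exactly what hides the pain.
    def p99(values):
        return quantiles(values, n=100)[98] if len(values) >= 2 else values[0]

    ranked = sorted(per_second.items(), key=lambda kv: p99(kv[1]), reverse=True)
    for second, values in ranked[:top_n]:
        print(f"{second}  requests={len(values):4d}  p99={p99(values):7.0f} ms")

worst_seconds("access_log.csv")
```

Even a rough cut like this turns “the site felt slow last night” into a specific second you can go and investigate.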