You may have heard of it in passing or been somewhat confused by it, but observability could be a game-changer for your business. Morgan McLean, Director of Product Management at Splunk, talks through everything you need to know.
For the uninitiated, could you tell us what observability is?
There’s this rote definition of the ability to observe any system and understand how it works, which can be applied to anything like how our car functions, how a clothing factory functions, things like that. But I think in this space, what we’re referring to is the ability to observe how a set of services and infrastructure – basically a giant web application – functions, so that our customers can better understand what they’ve built, its current status, things that might break it. Or, if it’s broken, how they can fix it. It’s super important for Splunk’s traditional set of end users, which is a mix of security and operators, and to some extent developers, though it tends to be more focused on developers and system operators.
Could you give us a sort of sense of what people might use observability platforms for?
When Splunk started, it was very focused on capturing logs that were generated by applications and by infrastructure, basically analysing machine data and pulling insights out of it. If you wind the clock back 15 years, it was a mix of developers and operators running large backend services who wanted to go and look at crash reports and other such things. Then Splunk’s vision expanded to also use those logs to derive security insights.
Where observability comes in is actually returning to that original class of customers and how their needs have changed in the intervening 15-20 years while Splunk has expanded. They are no longer just running some set of 1,000 named virtual machines with known applications on them, where everyone at the company knows what’s doing what. Now they’re running tens, if not hundreds of thousands of hosts in the cloud that are spinning up and spinning down, or using technologies like Kubernetes to build these extraordinarily high-scale applications, as well as managing the infrastructure to back those. So, log analytics is no longer, strictly speaking, sufficient to meet their needs. You can run a business strictly with that, but they also need things like time series data, and they need distributed traces to show how the thousands of services they’ve built interact and operate.
So, Splunk Observability Cloud is a solution for the extraordinarily high-scale next generation of applications and the infrastructure to back them. It gives people very opinionated views of how all their applications work, and of what they need to monitor within their infrastructure.
One use case would be if you’re using Splunk, or other log solutions like Elastic or Sumo Logic. You can eventually trace why something crashed, but it’s going to take you hours. In order to monitor things, you’re going to be consuming huge amounts of money, because you’re basically ingesting metrics and logs. If you go to Splunk Cloud, it’s very powerful. But a lot of it’s so dry. You have to write your queries, it’s this deep analytics solution. But if you want to find out how something works, and you don’t always have all the data you need, you go into Splunk Observability Cloud and it shows you right away: here’s all your infrastructure. I see you’ve got 100,000 VMs (virtual machines), a bunch deployed here on AWS, a bunch on Azure. Here’s the current health status, here’s how you can dive into them. Oh, and we found you have 1,000 different applications, and here’s how they interact on a visual map. And oh, here’s a given customer request that’s failing, we’ve already highlighted that for you. On that map, by the way, here is how it bounces around all those services. Here’s where it’s failing.
It’s this very purpose-built thing for developers and for operators who are managing these big services. Splunk Cloud is a very powerful platform that is often used for those use cases, but it is not purpose-built for them. This is a dedicated solution that acts as a complement to Splunk Cloud or Splunk Enterprise.
Which company systems are typically observable?
Historically for observability products, including ours, the focus has been on backend services. It’s typically infrastructure and code that they’ve written that’s running on that infrastructure, or third-party applications, like databases and message queues, that are also running on that infrastructure. This is typically running in cloud environments. These are people with data in AWS, GCP (Google Cloud Platform), Azure. And to some extent, they’re also still managing their own data centres. That really has been the focus: ensuring that all that stuff works, that they have visibility into it, and that they can fix it when it breaks. More recently, across the industry, and we’ve been a leader in this, that vision has expanded to the entire application stack. If a customer is running an ecommerce website – think of something like amazon.com – they have frontend web experiences that are opened by customers in web browsers, and they typically have applications that are running on iOS or Android or even desktop.
We’ve expanded Splunk Observability Cloud over the last few years to also provide visibility into those client applications so that when a customer is having an issue on, say, an iPhone mobile application, the company that’s actually hosting this can understand what the issue is. First off, that the issue is happening. And secondly, is the issue with the app that I’ve built? Is it specifically with the code deployed there, that’s distributed through the Apple App Store? Is it with this set of customer connections over the Internet to my backend services? Maybe they’re all located in a particular region, or they’re using the same internet service provider? Or is the problem in my backend services where I need to go fix it? Observability Cloud gives you this full visibility, and can help you really quickly pinpoint where problems are and where problems might crop up.
How could heightening your observability solutions help with that digital resilience?
I wouldn’t even say it helps with it; it’s just critical, right? If someone is not using a strong, solid observability solution, like Splunk Observability Cloud, they are leaving themselves extremely exposed, particularly if they’re running a lot of applications at that high scale. It’s just essential for anyone who’s investing millions of dollars, if not more, into their own code and infrastructure. For almost every business these days, that’s critical. I would say that operating without an observability solution is like running an ecommerce store without doing any inventory. You order stuff, it goes in the warehouse, and then you sell it and no one’s really keeping track of it. No one would do that, even 100 years ago.
Things will go wrong more often, and when things do go wrong, it’s going to take a long time to fix. And perhaps the most overlooked part is when your developers are going in and building new features and extending the system, building the new things that enhance your business. If they don’t have an observability tool to look at, it takes them a long time to understand how to do their work and they often do it in a very fragile way. Whereas if they can actually go look at how all of your infrastructure works and how your applications interact, it makes their life so much easier.
If you are a business owner coming to an observability platform for the first time, what do you need to know to get started?
The first thing you need to do on any of these platforms is to get your data into it. Historically, for log analytics solutions, like Splunk, that was relatively easy. I don’t mean to trivialise this, but you would grab logs from all of your infrastructure and send those back to Splunk and we process those. You would usually deploy an agent to do it. For observability solutions – not just ours, but any of them – you need more data. In addition to those logs that you capture from each host, you also need system metrics and application metrics and profiles and distributed traces and everything else. There are additional layers of complexity here. Now you’re not just capturing human readable logs from operating systems, you’re capturing all these other types of data from the individual applications that people have written. That requires hooks into all of the hundreds of thousands of libraries that software developers use. I think that historically has held back this industry to a fairly large degree.
We rely on a project that I co-founded with a number of other people and a number of other companies. It’s called OpenTelemetry. This is a solution that’s part of the Cloud Native Computing Foundation. It’s totally free, totally open source, very, very powerful. It has all of these hooks, all of these integrations into the hundreds of thousands of different bits of software that people use. You deploy OpenTelemetry – we have our own step-by-step guidance on this and a bunch of enhancements we’ve made to it. You deploy OpenTelemetry into your app and into your infrastructure, it goes in and instruments your applications, which is now pretty automatic, and it extracts the data that’s needed and sends that back to us, where we then process and analyse it.
So, how do you get started? You deploy OpenTelemetry, start sending that data stream to us, and we can provide really powerful insights on it.
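For readers who want a feel for what "instrumenting" an application means, here is a minimal sketch in plain Python. It is a toy, not the OpenTelemetry API: the ToyTracer class, the Span fields, the first_failure helper and the service names are all hypothetical, chosen only to illustrate the idea from the interview of recording each hop of a request as a span and then pointing at the hop that failed.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One timed step of a request (hypothetical shape, not a real SDK type)."""
    trace_id: str             # shared by every span in the same request
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    status: str = "ok"
    duration_ms: float = 0.0


class ToyTracer:
    """In-memory stand-in for an instrumentation library (illustrative only)."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str):
        parent = self._stack[-1] if self._stack else None
        current = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            name=name,
        )
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        except Exception:
            current.status = "error"  # record the failure, then re-raise
            raise
        finally:
            current.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.finished.append(current)


def first_failure(spans: List[Span]) -> Optional[Span]:
    """Return a failing span that has no failing child, i.e. where the error began."""
    failed = [s for s in spans if s.status == "error"]
    parents_of_failures = {s.parent_id for s in failed}
    for s in failed:
        if s.span_id not in parents_of_failures:
            return s
    return None


tracer = ToyTracer()


def handle_checkout() -> None:
    # Hypothetical request that touches two made-up backend services.
    with tracer.span("checkout"):
        with tracer.span("inventory-service"):
            pass
        with tracer.span("payment-service"):
            raise RuntimeError("card processor timed out")


try:
    handle_checkout()
except RuntimeError:
    pass

culprit = first_failure(tracer.finished)
print(culprit.name)  # prints "payment-service"
```

In a real deployment the spans would be exported to an observability backend rather than kept in a list, and much of the wrapping shown here would be done by the instrumentation libraries rather than by hand.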