When the world’s largest storage systems company, EMC, acquired SMARTS, a niche player in network management systems, in February 2005, most observers thought, perhaps logically, that its aim was to apply that specialist technology to the management of storage networks. Far from it.
What is now becoming clear – in EMC customer briefings and sessions with analysts – is that the company has always foreseen a much wider potential for SMARTS, viewing it as the starting point for an audacious assault on the traditional infrastructure management market, where companies such as IBM, Hewlett-Packard (HP) and Cisco dominate.
Just look at the PowerPoints being shown by EMC executives such as Howard Elias, the company’s head of global marketing and corporate development. He identifies EMC’s five areas for “ongoing and future technology investment”: up there alongside the obvious candidates of virtualisation, information lifecycle management, security and grid computing, is something not so familiar – “model-based resource management”.
And in the company’s collective mind, that represents a vast new opportunity – some even compare it to early-days VMware, EMC’s hugely successful virtualisation play.
“There is a very significant change underway in the management of resources within the IT infrastructure,” says Elias. “We intend to take the capability of model-based management [acquired with SMARTS] and apply it to every infrastructure domain.”
Although systems, network and storage management has long been an integral part of any IT set-up, existing systems are now struggling to cope with the complexity and pivotal position of modern, networked and, increasingly, virtualised environments.
“Model-based management is one of EMC’s five areas for ongoing and future technology investment.”
Howard Elias, EMC
Elias argues that there is a misalignment between the traditional management applications and the needs of current infrastructures – infrastructures that may contain millions of network, system, application, database, storage, security and other elements, configured in meshed, multi-layer network topologies with complex application and service inter-dependencies.
Such environments demand a much more ambitious – and automated – set of management capabilities, including the ability to discover, analyse, report and act on the changing status of the estate, and to relate all of that right up to the applications layer it serves.
Nowhere is that more evident than when things go wrong. According to Elias, infrastructure managers constantly find it difficult to understand the business impact of a single fault or failure from the storm of alerts that the incident triggers. Traditionally, such events have kicked off an intensive, largely manual effort right across the infrastructure to diagnose the root cause. The trouble is that growing complexity is making that diagnosis more, rather than less, opaque.
By collecting endless streams of data from devices, today’s management suites, EMC maintains, are putting unacceptable pressure on systems troubleshooters.
What they really need is an approach that can “fully automate incident management and triage”, says Elias – one that correlates events to the root cause and presents a plan of action. Addressing that has required some fresh thinking, he argues: above all, next-level resource management will be characterised by a move from the framework-based management typical of an IBM Tivoli or HP OpenView environment to model-based management.
His key point is that most resource management tools work by collecting a group of events and trying to make sense of them; model-based management turns that on its head, by modelling the infrastructure in near real-time, establishing and reporting on the status of all ‘discoverable’ devices and maintaining an understanding of their dynamic connections, behaviours and dependencies.
Such model-based systems generate all possible signatures for all possible failures across all resource domains, says Elias, and correlate appropriate responses. The upshot of applying that approach is that the fault is isolated much faster, he claims. “There is a massive reduction in the number of trouble tickets.”
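To make that idea concrete, here is a minimal sketch – with invented names and a toy topology, not EMC’s actual data model – of how a model-based system might hold the infrastructure as a dependency graph and pre-compute, for each possible root fault, the signature of alerts that fault would generate.

```python
# A minimal sketch of the model-based idea: hold the infrastructure as a
# dependency graph and derive, for any root fault, the set of elements that
# would raise alerts - its failure "signature". All names are illustrative.
from collections import defaultdict

dependents = defaultdict(list)  # component -> components that depend on it

def add_dependency(component: str, depends_on: str) -> None:
    dependents[depends_on].append(component)

def failure_signature(root_fault: str) -> frozenset:
    """All elements that would alert if `root_fault` failed."""
    seen, stack = set(), [root_fault]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dependents[node])
    return frozenset(seen)

# Toy topology: a fibre link carries a switch, which carries an app server.
add_dependency("core-switch", "fibre-link-7")
add_dependency("app-server", "core-switch")
add_dependency("loan-app", "app-server")

# Pre-compute a signature for every element in the model.
elements = set(dependents) | {d for deps in dependents.values() for d in deps}
signatures = {e: failure_signature(e) for e in elements}
print(sorted(signatures["fibre-link-7"]))  # everything the link failure touches
```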
That is evident at Microsoft, an early adopter of SMARTS and related EMC resource management technology for its global internal network. Before deploying the EMC software, it used 19 servers for network fault management and could map the environment only once every 20 to 30 minutes. Having implemented SMARTS, the software giant has cut the number of monitoring servers from 19 to four, increased the frequency of monitoring to once every 10 minutes and added root-cause analysis.
As EMC boasts on Microsoft’s behalf, there has been a reduction of 30% in the event-to-trouble ticket ratio and a 60% fall in the number of alarms.
Such examples reinforce the sense of epiphany at EMC. “All the frameworks that exist for [infrastructure] management today are not going to last,” predicts EMC’s chief development officer Mark Lewis. “They are going to be yesterday’s technology – including our own.”
The reason: current frameworks “micro-manage” environments, producing hundreds of reports on events. “That doesn’t work: it is dying under its own weight,” he says.
EMC may not be the only one to have come to that realisation. Indeed, some observers point to IBM’s acquisition of network management technology specialist Micromuse in early 2006 as a parallel – maybe even a reaction – to EMC’s buy-out of SMARTS and subsequent purchase of nLayers, an infrastructure discovery tools vendor. “It is IBM’s attempt to move forward,” suggests Lewis.
New model army
Whoever the vendor, the ultimate goal is automation – something that has been elusive in many areas of existing infrastructure management. And for EMC, the modelling of the environment is the key.
According to Howard Elias, modelling creates a simpler, actionable view of infrastructure that enables the integrated management of server, network and storage resources from a centralised control point.
John Premus, CTO at Sumitomo Mitsui Banking Corp (SMBC) in New York, gives an example from his testing of SMARTS. With its largest and most complex application, Loan IQ, modelled using SMARTS, SMBC triggered a failure on its network. “We received an alert from SMARTS that a fibre link had failed at our offices in the Chrysler Building in New York City,” says Premus. “Although CiscoWorks was showing the ports as active, our engineers were able to verify the fibre link failure that SMARTS reported. [It] was intelligent enough to recognise that no traffic was able to traverse that link. And without it the backup piece of the fibre would not have picked up service, and as a result, the business would have been impacted.”
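A hypothetical illustration of the behavioural check Premus describes: a port can report “up” while moving no traffic, so a model-based check compares operational status against traffic-counter deltas. The function and threshold below are invented for the sketch, not a SMARTS or CiscoWorks API.

```python
# Invented helper: a link is only healthy if it both reports "up" AND is
# actually moving traffic between two counter samples.
def link_is_healthy(oper_status: str, bytes_before: int, bytes_after: int,
                    expected_min_delta: int = 1) -> bool:
    return oper_status == "up" and (bytes_after - bytes_before) >= expected_min_delta

# Port shows "up" (what CiscoWorks saw) but the counters are frozen (what the
# behavioural check notices) -> flag the link as failed.
print(link_is_healthy("up", bytes_before=10_000, bytes_after=10_000))  # False
```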
The SMARTS approach works by passively sniffing traffic across all applications, routers, databases and so forth, and modelling their behaviour and interaction. A rules-based engine then triggers certain actions depending on the type of event. This ‘codebook correlation technology’ is the “secret sauce” that creates a unique signature for each failure, says Chris Gahagan, EMC’s senior vice president for resource management software. “We generate all possible signatures for all possible failures.”
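In sketch form – reusing the toy signatures from the earlier example, and with an overlap measure that is an assumption rather than EMC’s actual algorithm – codebook correlation can be read as nearest-signature matching: the storm of observed alerts is compared against every pre-computed failure signature, and the best-scoring root cause wins.

```python
# Codebook correlation as nearest-signature matching: compare the observed
# alert set against every pre-computed failure signature and return the best
# match. Jaccard overlap stands in for the real distance measure.
def correlate(observed: set, signatures: dict) -> str:
    def score(signature: frozenset) -> float:
        union = observed | signature
        return len(observed & signature) / len(union) if union else 0.0
    return max(signatures, key=lambda cause: score(signatures[cause]))

# An alert storm from the switch, the server and the application collapses
# to a single root cause: the failed fibre link.
alerts = {"core-switch", "app-server", "loan-app"}
toy_signatures = {
    "fibre-link-7": frozenset({"fibre-link-7", "core-switch",
                               "app-server", "loan-app"}),
    "app-server": frozenset({"app-server", "loan-app"}),
}
print(correlate(alerts, toy_signatures))  # -> fibre-link-7
```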
Automation applies at several levels. First is the automation of discovery of the environment – which elements are active in the infrastructure, how they relate to each other, what applications they are running and how they are tied into delivering business services.
Second is automation of the analysis of the information collected, with the goal of reducing the average time to respond to an incident by identifying the problem and how it needs to be solved.
Lastly, there is the automation of the impact analysis – the means of understanding which business entities are impacted by any failure.
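Continuing the toy example, that third level might be sketched as a walk over a service-dependency map – the mapping below is invented purely for illustration – from a failed element up to the business services it underpins.

```python
# Impact analysis as a walk over a (made-up) service-dependency map: from a
# failed element, follow dependents upward and collect the business services
# they deliver.
SERVICE_MAP = {"loan-app": ["commercial lending"]}

def impacted_services(failed: str, dependents_of: dict) -> list:
    hit, seen, stack = [], set(), [failed]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        hit.extend(SERVICE_MAP.get(node, []))
        stack.extend(dependents_of.get(node, []))
    return hit

deps = {"fibre-link-7": ["core-switch"], "core-switch": ["app-server"],
        "app-server": ["loan-app"]}
print(impacted_services("fibre-link-7", deps))  # -> ['commercial lending']
```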
Drivers for change
The pressing need for greater automation is very clear: the cost of infrastructure management is growing at three to five times the rate of the cost of the systems it is trying to manage.
Another key driver behind some of the fresh thinking in infrastructure management is the need for tools that can cope with an increasingly flexible, yet demanding, infrastructure. The distributed computing challenge means an explosion of complexity, warns Mark Lewis, with the result that many organisations will struggle to guarantee quality of service levels. In particular, as they embrace virtualisation across multiple parts of the infrastructure, organisations will have to employ “automated actionable intelligence”, he says.
Where that involves next-generation services such as voice over IP (VoIP), guaranteeing quality of service through automation is paramount. “If you are going to develop VoIP, or video on demand for that matter, you need a new paradigm. Tightly coupled management isn’t going to work,” says EMC’s Lewis.
Talk of a new paradigm may be a little premature. But the promise of a new approach that claims to streamline the operations of the world’s largest, most complex IT infrastructures through automated real-time root cause and impact analysis is already causing some large organisations to question their historic approaches.
According to Dr Patricia Soares Florissi, CTO for resource management at EMC, that is going to relieve some of the pressure on IT people to deal with the problems of a constantly changing IT infrastructure. “The model-based approach will put the burden of analysing the collected management information back onto the tool,” she says.
The role of CMDBs
Configuration management database systems (CMDBs) play a major role in the understanding of the complexities of managing the IT infrastructure – but not all provide the desired level of detail.
A recent survey of 190 IT managers by business service management software company Managed Objects showed an overwhelming desire for more sophisticated measures of the health of different parts of the infrastructure.
Almost 90% of those questioned in the survey think it important that a dynamically updated measure of ‘state’ should be included as an attribute for all configuration items within a CMDB, and that this measure should go beyond indicating mere availability.
The ultimate goal of a CMDB is to bring together highly diverse data sources into a single, federated system, explains Dennis Drogseth, a vice president at industry analyst group Enterprise Management Associates. “The end result should paint an accurate picture of the health and availability of critical business resources. To achieve this, customers should have a solid understanding of the applications environment and how configuration changes and problems in one domain impact the multiple layers of the IT infrastructure.”
That was the sentiment among survey respondents. “In the past, state was often defined only by availability – whether a [device] such as a server was up or down,” the report suggests. “Today’s IT managers, with the goal of more proactive management of their IT infrastructure, also expect measures of state to answer questions such as: How available is it? How well is it performing in comparison to benchmarks?”
IT leaders see this information as necessary for CMDBs to play a larger role than simply asset management and relationship mapping, the report concludes.
“They recognise that without this critical element [of CMDBs], you just have an asset management database on steroids,” says Dustin McNabb, VP of marketing for Managed Objects.
Ultimately, a CMDB that dynamically incorporates real-time information from an organisation’s relevant data sources allows it to quickly identify and remediate IT infrastructure problems and to prevent or minimise the impact of outages. Root-cause, impact and predictive analyses – all of which enable proactive management of IT infrastructures – require accurate measures of state to succeed, stresses McNabb.
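As a sketch of that richer notion of ‘state’ – with field names invented for illustration rather than taken from any vendor’s CMDB schema – a configuration item could carry a dynamically updated performance measure alongside plain up/down availability:

```python
# An illustrative CMDB configuration item: 'state' reflects not just
# availability but live performance compared against a benchmark.
from dataclasses import dataclass

@dataclass
class ConfigurationItem:
    name: str
    available: bool       # the traditional up/down view
    latency_ms: float     # live measurement fed in from monitoring
    benchmark_ms: float   # expected performance for this item

    @property
    def state(self) -> str:
        if not self.available:
            return "down"
        # Available, but how well is it performing against its benchmark?
        if self.latency_ms > 1.5 * self.benchmark_ms:
            return "degraded"
        return "healthy"

ci = ConfigurationItem("app-server", available=True,
                       latency_ms=480.0, benchmark_ms=200.0)
print(ci.state)  # "degraded": up, yet well outside its performance benchmark
```

Feeding the live measurement from monitoring is what would turn such an entry from a static asset record into the dynamic measure of state the survey respondents call for.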