“We have an issue” was the last thing I wanted to hear. Early in my career, I ran operations for a large application service provider, and many of the problems we faced were as frustrating as they were wasteful. A sudden issue would send us into the war room, only for the people around the table to tell me what they wanted me to hear. Network? No problem here. Apps? No problem here. You’ve seen this too, I’m sure. Meanwhile, the problem was still out there in the data center, shredding SLAs while we were still hunting for the cause, let alone the solution. And our strategic plans had to wait.
Today, ops leaders have far more complex environments. Not too long ago, a traditional stack would have maybe 150 metrics: 100 from the operating system and 50 from the application. Today, that number can be well above 10,000. A 100-container cluster with two underlying hosts easily has 200 metrics from the operating system, 50 from the orchestrator, 5,000 from the containers and 5,000 more from the applications. When there’s a performance problem, you can be sure that many of those metrics will change at once and send alerts to the operations team. That’s overwhelming, and we end up in the war room again.
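To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The per-unit figures are assumptions inferred from the totals above (100 operating system metrics per host, 50 per container, 50 per containerized application), and the function name is purely illustrative.

```python
def metric_count(hosts, containers,
                 os_per_host=100,      # assumed: matches the traditional-stack OS figure
                 orchestrator=50,      # taken as a flat figure for the cluster
                 per_container=50,     # assumed: 5,000 / 100 containers
                 per_app=50):          # assumed: one application per container
    """Return a per-source breakdown and the total number of metrics."""
    breakdown = {
        "operating system": hosts * os_per_host,
        "orchestrator": orchestrator,
        "containers": containers * per_container,
        "applications": containers * per_app,
    }
    return breakdown, sum(breakdown.values())


# The cluster described above: 100 containers on two hosts.
breakdown, total = metric_count(hosts=2, containers=100)

for source, count in breakdown.items():
    print(f"{source:>16}: {count:>6,}")
print(f"{'total':>16}: {total:>6,}")
```

Under these assumptions the total comes to 10,250, roughly 68 times the 150 metrics of the traditional stack, and nearly all of the growth comes from the container and application layers, which scale with the number of containers rather than the number of hosts.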
Fortunately, we have far better tools today, and we need them. But the growing pile of monitoring tools adds another level of complexity when the ops team is trying to pin down a root cause. Any big operations center will have literally millions of alerts to sort through, and the war room is no longer an effective, scalable solution. Across enterprises, time-to-detect and time-to-resolve scores are getting worse. And the cost of pulling staff away from strategic IT planning is often not measured at all.
We’ve been thinking deeply about artificial intelligence for IT operations (AIOps) as a better way to manage operations. It shows great promise for modern data centers and agile businesses. We already have core capabilities in our recently announced AIOps software portfolio, and we’re aggressively pursuing further development in machine learning and the integration of our IT and operational technology (OT) experience to extend them. AIOps is absolutely the future, and we are building it today.
No longer does the operations team have to start with the problem and look backward for the cause and solution. In a modern data center, ops teams have predictive analytics that learn from previous incidents and examine the current data path to identify potential critical issues. That happens before those issues bring your data center to a halt and your staff to a war room. And if a critical event does happen, instead of a thousand alerts, you’ll get a specific recommendation of what should be done to fix it. Promptly.
But AIOps software is far more useful than that. In addition to accelerating fixes, AIOps helps you plan equipment purchases and prevent performance dips. Even that’s not the end goal. Our plan is to one day take the modern data center into autonomous operations. The people in your data center operations are far too valuable to sit in problem-resolution meetings. With digital transformation sweeping the world’s businesses, IT must be more engaged with the business and help drive the innovation that makes the difference between success and irrelevance. Because it takes over many tactical tasks, AIOps is becoming a real contributor to IT strategy and business growth.
I’m glad that we are finally able to bring the IT operations infrastructure to the point where managers can ask of their staff, “Come to me bearing solutions, not problems.”