The Microsoft-CrowdStrike outage on July 19 was possibly the biggest in IT history, costing Fortune 500 companies alone over $5 billion in losses. The outage, caused by a faulty update, brought call centres, hospitals, banks, and airports across the globe to a complete halt for a few hours.
The outage underscores the fragility of modern technology: a single vulnerability can disrupt critical systems, signalling the need for robust safeguards and resilience in digital infrastructure.
Observability players were quick to point out the importance of comprehensive monitoring and real-time analytics in detecting and mitigating such vulnerabilities, emphasising that enhanced visibility can prevent or minimise the impact of future disruptions.
AIM inquired with both established and new observability providers about whether their solutions can help prevent similar outages in the future.
Rohit Ramanand, the GVP of engineering, India, at New Relic, said that full-stack observability platforms provide real-time insights into system performance and health, making them an invaluable tool to help prevent outages, or mitigate them when they occur.
New Relic is one of the leading players in the space, with over 80,000 customers worldwide and over 12,000 customers in India alone.
Observability Can Avert System Downtime
“Observability enhances operational efficiency through three key mechanisms. First, it enables early issue detection with real-time insights, allowing engineering teams to resolve problems before they impact customers,” Ramanand told AIM.
Second, it offers a unified source of truth, streamlining the process of identifying root causes during outages by consolidating data from various sources.
Lastly, AI-driven observability platforms leverage historical data to build predictive models, helping to foresee and mitigate similar issues in the future. This integrated approach ensures a more proactive and efficient management of potential disruptions.
AIM posed the same question to Middleware, a new-age startup based in San Francisco with roots in Ahmedabad. Sawaram Suthar, the founding director, echoed the sentiment.
Suthar believes observability solutions can significantly help prevent a situation like the CrowdStrike outage.
“Development and operations teams can collect metrics on performance, latency, and error rates, enabling proactive responses to anomalies. Furthermore, they can centralise logs to gain a unified view of system activity and streamline root cause analysis,” he said.
Suthar added that the real-time feedback mechanisms in observability solutions notify teams immediately, reducing the mean time to detect and respond. “We ourselves have helped companies achieve over 20% reduction in time to resolution,” he said.
“We’ve noticed that debugging often ends up accounting for 50% of a developer’s effort. With observability tools, they can focus on building applications, dedicating only about 10% of their time to debugging and problem resolution,” he added.
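The metric-based anomaly detection Suthar describes can be illustrated with a minimal sketch. It assumes error rates are sampled at regular intervals and flags any sample that deviates sharply from a trailing baseline. The function name, window size, threshold, and sample values are all hypothetical, chosen for illustration, and do not reflect how Middleware or New Relic implement detection.

```python
from statistics import mean, stdev


def find_anomalies(error_rates, window=5, threshold=3.0, min_sigma=1e-6):
    """Return indices of samples that deviate from the trailing-window mean
    by more than `threshold` standard deviations (a rolling z-score check)."""
    anomalies = []
    for i in range(window, len(error_rates)):
        history = error_rates[i - window:i]
        baseline = mean(history)
        # Floor the spread so a perfectly flat baseline does not divide by zero.
        spread = max(stdev(history), min_sigma)
        z_score = abs(error_rates[i] - baseline) / spread
        if z_score > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical error-rate samples (fraction of failed requests per minute).
samples = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.250, 0.012]
flagged = find_anomalies(samples)
print(flagged)  # indices an alerting system might notify the team about
```

In practice, a flagged index would feed an alerting pipeline rather than a print statement, and production systems typically use more robust baselines than a short trailing window.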
Can AI Help Enterprises Prepare Better?
Even if enterprises believe they are monitoring every aspect of their systems, they can still encounter blind spots without the right tools, which highlights the importance of full-stack observability.
AI’s ability to learn from historical data and make predictions could help organisations act early and take a preventive approach.
New Relic’s AI capabilities let enterprises monitor AI-specific metrics such as token usage, costs, and response quality, and integrate them with traditional application performance monitoring.
Having an integrated view of metrics, events, traces, and logs simplifies and accelerates root cause identification. “Comprehensive application performance monitoring (APM) capabilities enhance anomaly detection, leading to quicker remediation,” Ramanand said.
Besides implementing robust monitoring and logging, enterprises should develop automated alerting and notification systems, regularly conduct system audits and develop disaster recovery and business continuity plans, according to Suthar.
However, sometimes an outage might be inevitable. Enterprises should be well-equipped to mitigate risk and minimise the impact through effective response strategies and robust contingency plans.
“With observability, organisations gain a deeper understanding of systems to identify how to mitigate incidents when they do occur and ultimately prevent such events from reoccurring,” Ramanand added.
Observability in the Generative AI Era
Overall, the observability market is projected to grow from $2.4 billion in 2023 to $4.1 billion by 2028, reflecting a compound annual growth rate (CAGR) of 11.7% over the forecast period, according to a MarketsandMarkets research report.
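As a quick sanity check on these figures, the standard CAGR formula can be applied to the rounded endpoints quoted above. This is a sketch only: the function name is hypothetical, and the five-year period (2023 to 2028) is an assumption about how the report counts the forecast window.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1


# Rounded market sizes in billions of dollars, as quoted in the article.
implied = cagr(2.4, 4.1, 2028 - 2023)
print(f"Implied CAGR: {implied:.1%}")
```

The rounded endpoints imply a rate of roughly 11.3%, in the same range as the reported 11.7%. The gap may come from rounding in the published endpoints or a different base period, neither of which the article specifies.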
An increasing number of observability providers have begun incorporating generative AI into their products and services. Companies are also developing solutions to monitor LLMs as enterprises integrate these models into their business operations.
An AIM Research report revealed that several leading players in the AI observability market, including Dynatrace, Datadog, and New Relic, have expanded their offerings to include observability capabilities tailored for GenAI-infused applications, addressing the specific needs of this emerging field.
Another interesting observation from the report is that around 80% of the companies offering tools for generative AI observability are startups, most of them established in the last three years. This signals the growing prominence of observability, especially in the generative AI era.