AIOps
Top challenges organizations face when managing a complicated IT infrastructure
Managing the additional complexity of a hybrid multicloud environment across multiple tools, systems and processes ranked among the top three primary challenges faced by IT leaders surveyed. It was cited as the primary challenge by 60% of respondents. The primary challenges indicated by the MD&I survey fall into three main areas:
- Lack of visibility
- Excessive complexity and cost
- Lack of insight into IT health and problems
Solution
- Pod out of memory, 2nd order debugging
- Check metrics
- Why memory is increasing
- Thread dumps
- Processes
- Check Deployments
- Create SOPs and RCAs
- Check metrics
- Code reviews
- Kubernetes manifest suggestions
- E.g. no resource requests or memory limits provided
- Configuration Validation
- Env based validation
- Check whether DB and DNS settings point to dev or not
- Pod CrashLoopBackOff
- Explain why it happened in an easy-to-read format
- IT Inventory Matured view
- Service Map
- UpStream, Downstream
- AI Alerts
- Database alerts, Kubernetes events, middleware events
- Consume Hit - 98%
- Create dependency graph - manual graph
- Check dependent problems
- Make correlation
Deep Check - JSON or eBPF
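A rough sketch of the "Kubernetes manifest suggestions" idea, covering only the case where requests or limits are missing. It assumes the manifest is already parsed into a plain dict (for example from YAML) and follows the Deployment layout `spec.template.spec.containers`. The function name `find_missing_resources` and the sample manifest are made up for illustration.

```python
from typing import Any


def find_missing_resources(manifest: dict[str, Any]) -> list[str]:
    """Return one finding per missing cpu/memory request or limit."""
    findings = []
    # Deployment-style manifests keep the pod template under spec.template.spec.
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources") or {}
        for section in ("requests", "limits"):
            declared = resources.get(section) or {}
            for resource in ("cpu", "memory"):
                if resource not in declared:
                    findings.append(f"{name}: missing resources.{section}.{resource}")
    return findings


# Hypothetical manifest: "api" declares only a CPU request, "worker" is complete.
deployment = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "api", "resources": {"requests": {"cpu": "100m"}}},
        {"name": "worker", "resources": {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"cpu": "500m", "memory": "256Mi"},
        }},
    ]}}},
}

for finding in find_missing_resources(deployment):
    print(finding)
```

The findings are plain strings so they can be dropped into a code-review comment or handed to an LLM as context. A real check would also need to handle other workload kinds and init containers.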
Prompt
Create a detailed proposal for building an advanced GenAI observability (AIOps) and auto-remediation system for Airtel Africa, a telecom giant. The system will consume events from disparate sources such as:
- Database alerts
- Kubernetes events
- Middleware events
- etc
The system will also maintain a dependency graph that holds the whole inventory and the relationships between these assets, both upstream and downstream. The graph can be populated by automated discovery, a manual feed, or a combination of the two.
Using these events and alerts, it will detect anomalies and identify all impacted systems to create an automated RCA. It will also help debug issues through second-order debugging and share remediation steps.
The system will be self-learning and will learn from each incident.
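A minimal sketch of how the dependency graph in the prompt could be used to correlate alerts. It assumes a manually fed graph where each service lists the services it depends on. The service names, `DEPENDS_ON`, and all function names are hypothetical. The heuristic here (an alerting service with no alerting dependency is a root-cause candidate) is just one simple option, not a full RCA method.

```python
from collections import deque

# Hypothetical manual feed: service -> services it depends on.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["db", "cache"],
    "reports": ["db"],
    "db": [],
    "cache": [],
}


def dependencies_of(service, graph):
    """Return every service reachable from `service` by following dependencies."""
    seen, queue = set(), deque(graph.get(service, []))
    while queue:
        node = queue.popleft()
        if node not in seen:  # the seen set also protects against cycles
            seen.add(node)
            queue.extend(graph.get(node, []))
    return seen


def root_cause_candidates(alerting, graph):
    """Keep alerting services that have no alerting dependency of their own."""
    alerting = set(alerting)
    return sorted(s for s in alerting if not (dependencies_of(s, graph) & alerting))


def impacted_by(service, graph):
    """Return services that depend on `service`, directly or transitively."""
    return sorted(s for s in graph if service in dependencies_of(s, graph))


print(root_cause_candidates(["web", "api", "db"], DEPENDS_ON))
print(impacted_by("db", DEPENDS_ON))
```

With alerts on `web`, `api`, and `db`, only `db` is left as a candidate because the other two have an alerting dependency. `impacted_by` gives the blast radius to list in the RCA.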
Data
- Metrics
- Logs
- Infra inventory
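For the metrics feed, the anomaly detection mentioned in the prompt could start from something as simple as a trailing-window z-score. This is a sketch under assumptions, not a recommendation: the window size, threshold, function name, and sample memory readings are all illustrative.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: a z-score is undefined, so skip this point.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical pod memory readings in MB with one spike.
memory_mb = [512, 515, 510, 514, 511, 513, 900, 512]
print(find_anomalies(memory_mb))
```

Only the spike at index 6 is flagged. The reading after it is not, because the spike itself inflates the trailing standard deviation, which is one known weakness of this approach.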
Tools
- Robusta
- Coroot - open-source observability platform built for simplicity
- n8n - workflow automation software and tools
- Eyer - headless AIOps
- Ops AI by Middleware - Observability copilot to resolve production issues instantly - YouTube
- OpsAI by Middleware – AI-Powered Observability Co-Pilot
- From Stateful Stream Processing to Stateful Sandbox | by Yingjun Wu | Feb, 2026 | Data Engineer Things
References
- Root cause analysis with logs: Elastic Observability's AIOps Labs
- Deploy LLM locally to use as API - Ollama
- Langfuse monitoring
- Langchain / LlamaIndex / Langsmith / LangGraph
- OpenWebUI
- RAG + Agents
- Harness Developer Hub
- AIOps - Prometheus, Logs, Kubernetes Metrics, RAG - RCAs
- https://n8n.io/workflows/3066-automate-multi-platform-social-media-content-creation-with-ai/
- Nofire.ai
- Model - LiNGAM
- Medical use cases
- Causal
- Regression
- Automatic Root Cause Analysis
- Time Travel Troubleshooting
- ChatOps
- https://www.kyndryl.com/content/dam/kyndrylprogram/cs_ar_as/AIOps_AS_USEN.pdf