Job Description
3 to 4 years of experience: should have worked as an Incident Manager.
Technical Knowledge: Strong understanding of IT infrastructure, applications, networks, and cloud services.
Communication Skills: Ability to convey technical issues in clear, non-technical terms.
Problem-Solving: Strong analytical and troubleshooting skills.
Leadership & Decision-Making: Ability to make decisions under pressure and lead cross-functional teams.
Organizational Skills: Ability to manage multiple incidents simultaneously, ensuring that high-priority issues receive appropriate focus.
Knowledge of ITIL Framework: Familiarity with the Information Technology Infrastructure Library (ITIL) or similar incident management best practices.
Experience: Prior experience in an IT support, incident management, or service delivery role.
APM Tools: Good working knowledge of multiple tools such as Dynatrace, Grafana, ELK, and Prometheus.
Responsibilities
Incident Handling & Resolution:
Lead the incident management process for IT or operational disruptions, coordinating resources and ensuring swift resolution.
Prioritize and categorize incidents based on impact and urgency.
Ensure that incidents are investigated, diagnosed, and assigned to the correct team for resolution.
Coordinate major incident bridges or war rooms to facilitate rapid problem-solving.
Escalate issues as needed to ensure appropriate levels of attention.
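As an illustration of prioritizing incidents by impact and urgency, the following is a minimal sketch of an ITIL-style priority matrix. The three-level scale, the P1 to P5 labels, and the names Level and priority are assumptions made for this example; the actual matrix is defined by each organization's incident management policy.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-level scale used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> str:
    """Map an (impact, urgency) pair to a priority label.

    Assumed rule: add the two levels (range 2..6) and map the sum to P1..P5,
    so high impact plus high urgency yields P1 and low plus low yields P5.
    """
    score = int(impact) + int(urgency)
    return {2: "P1", 3: "P2", 4: "P3", 5: "P4", 6: "P5"}[score]


# Example: a widespread outage that needs immediate action.
major_outage = priority(Level.HIGH, Level.HIGH)
# Example: a cosmetic issue that affects one user and can wait.
minor_issue = priority(Level.LOW, Level.LOW)
```

A lookup table keyed on the sum keeps the rule easy to audit. A team that wants impact to outweigh urgency could replace it with an explicit table keyed on the pair.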
Communication:
Serve as the central point of contact for incident updates, keeping all stakeholders informed.
Provide timely and clear communication to both internal teams and external customers on incident status and impact.
Ensure that post-incident reports detailing the root cause and resolution steps are communicated.
Process Improvement:
Participate in post-incident reviews (PIRs) to identify lessons learned and improvement areas.
Collaborate with problem management teams to ensure that recurring incidents are addressed.
Continuously improve incident management processes by proposing enhancements based on incident data and trends.
Collaboration:
Work closely with IT, network, security, and operational teams to resolve incidents.
Engage with vendors or third-party providers if the incident involves external systems.
Ensure that all teams follow best practices in incident handling and escalation.
Documentation & Reporting:
Maintain accurate documentation of incidents, including timelines, actions taken, and resolution details.
Create reports summarizing incident statistics, resolution timeframes, and any emerging trends.
Track metrics such as Mean Time to Restore (MTTR), frequency of incidents, and service-level agreement (SLA) compliance.
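The metrics named above can be computed from basic incident records. The following is a minimal sketch, assuming each record holds the time the incident was opened, the time service was restored, and a per-incident SLA restoration target. The Incident class, its field names, and the sample data are hypothetical and not tied to any particular ticketing tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    opened: datetime       # when the incident was logged
    restored: datetime     # when service was restored
    sla_target: timedelta  # assumed maximum allowed time to restore


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time to Restore: average of (restored - opened)."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.restored - i.opened for i in incidents), timedelta())
    return total / len(incidents)


def sla_compliance(incidents: list[Incident]) -> float:
    """Fraction of incidents restored within their SLA target."""
    if not incidents:
        raise ValueError("at least one incident is required")
    met = sum(1 for i in incidents if i.restored - i.opened <= i.sla_target)
    return met / len(incidents)


# Sample data: restore times of 30, 90, and 60 minutes against a 60-minute target.
start = datetime(2024, 1, 1, 9, 0)
target = timedelta(minutes=60)
sample = [
    Incident(start, start + timedelta(minutes=30), target),
    Incident(start, start + timedelta(minutes=90), target),
    Incident(start, start + timedelta(minutes=60), target),
]

average_restore = mttr(sample)         # (30 + 90 + 60) / 3 = 60 minutes
compliance = sla_compliance(sample)    # 2 of 3 incidents met the target
```

Incident frequency for a reporting period is simply the number of records whose opened time falls within that period.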
Incident Response Coordination:
Develop and maintain incident response plans.
Train staff on incident response procedures.
Ensure that recovery plans are activated during major incidents or crises.