SRE Operations Engineering

McCain Foods • Gurugram, India

JOB PURPOSE:

The Major Incident Manager is responsible for overseeing McCain's major incident management process, bringing SRE- and automation-driven thought leadership within global technology and ensuring a timely and effective response to significant disruptions or infrastructure incidents that impact business operations. Major incidents include, but are not limited to, those affecting infrastructure, network, cloud, and on-premises applications.

KEY RESPONSIBILITIES:

Lead McCain's major incident management process, including the identification, assessment, and resolution of significant disruptions or incidents affecting business operations.
Establish and maintain predefined criteria and procedures for categorizing and prioritizing major incidents based on severity, impact, and urgency.
Coordinate cross-functional response efforts during major incidents, working closely with internal teams, external vendors, and stakeholders to minimize downtime and restore services expeditiously.
Serve as the primary point of contact and escalation for major incidents, providing regular updates and communication to stakeholders, including senior management, customers, and regulatory authorities.
Conduct post-incident reviews and analysis to identify root causes, lessons learned, and opportunities for improvement in incident response procedures.
Develop and maintain relationships with key stakeholders, including I&O teams, business units, and external partners, to facilitate effective incident response and resolution.
Implement and maintain robust monitoring and alerting systems to proactively identify potential issues and mitigate risks before they escalate into major incidents.
Provide guidance and support to incident response teams, including training, coaching, and knowledge sharing, to enhance their effectiveness and efficiency in managing major incidents.
Participate in the development and implementation of business continuity and disaster recovery plans to ensure the organization's ability to respond to and recover from major incidents.
Continuously work to improve problem identification and service restoration by leading and overseeing efforts to define, enhance, and deliver automated alerting and response systems with intelligent, self-healing capabilities.
Continuously work to improve the reliability, stability, and performance of the infrastructure and associated platforms by overseeing the implementation of fully automated telemetry, observation, and applied intelligence systems.
Fulfill the role of Escalation Manager/Critical Incident Manager on major incidents, facilitating incident resolution by leading teams through effective service restoration.
Communicate and provide timely status and incident reports to senior leadership.
Collaborate with admins and platform engineers through implementation decisions to achieve highly reliable infrastructure, systems, and integrations.
Lead conversations and provide business and engineering support for both in-house and external customers.
Provide advanced incident management and problem management support to teams to effectively identify, remediate, and resolve issues related to platform reliability, stability, and performance through careful analysis of telemetry data and system logs.
Document all changes in line with controls, procedures, and documentation standards, and raise issues and concerns with recommendations for follow-up action.
Partner with the Site Reliability Engineering team to integrate and enhance monitoring and alerting systems that detect anomalies and potential incidents before they escalate.
Partner with the Observability team to co-develop incident resolution playbooks that detail the steps for common incident types, ensuring quick and effective responses.
Partner with the Site Reliability Engineering team to identify opportunities for automating incident response processes, such as automated rollback procedures or self-healing scripts.
Implement and utilize automation tools available and recommended by the SRE team to streamline incident management processes.
Drive automation with predictive intelligence and AI for incident categorization, smart routing, and AI-driven root cause analysis (RCA), and leverage clustering algorithms to group similar incidents, helping to identify common root causes and patterns.
Partner with the ServiceNow Platform team to drive and support adoption of platform automation and predictive intelligence capabilities.

KEY QUALIFICATIONS & EXPERIENCE:

Bachelor's degree in computer science, information technology, or a related field.
12+ years of IT operations experience, including at least 5 years in incident management and major incident management in a complex environment at a global organization.
10+ years of experience working in global organizations, with the ability to communicate effectively with executives, leaders, and individual contributors across the organization.
5+ years of SRE experience working with telemetry, observation, self-healing solutions, and platform automation.
Experience with monitoring, logging, and telemetry tools such as New Relic, Splunk, ELK, Nagios, SolarWinds, Prometheus, AWS CloudWatch, and Datadog.
Azure/AWS, Microsoft, or Red Hat certifications, and knowledge of ITIL/MOF practices.
Strong technical expertise in areas of IT infrastructure, networking, security and applications support.
Excellent communication and interpersonal skills, with the ability to interact effectively with stakeholders at all levels of the organization.
Proven leadership and decision-making skills, with the ability to remain calm under pressure and make effective decisions in high-stress situations.
Relevant industry certifications (e.g., ITIL, SRE, PMP, CISSP) are a plus.