```html
Position Description
As a Site Reliability Engineer (SRE) within our team, you will play a key role in ensuring the reliability, scalability, and performance of our systems and infrastructure. Reporting to the OCC, ITSM & ServiceNow Manager, you will be responsible for closely collaborating with cross-functional teams to implement best practices, automate processes, and proactively monitor our systems to maintain optimal uptime and a satisfactory user experience.
Your Future Duties and Responsibilities
- SITE SREs will improve operational standards & update documentation:
- Evaluate current operational practices and identify areas for improvement.
- Develop and implement standardized processes and procedures to enhance efficiency and effectiveness.
- Maintain up-to-date documentation in Confluence (KB, FEX, etc.).
- SITE SREs will collaborate with DevOps teams to create a robust CI/CD pipeline for fully automated application and platform deployment:
- Design and architect a Continuous Integration/Continuous Deployment (CI/CD) pipeline to automate the build, test, and deployment processes.
- Implement tools and technologies such as Jenkins, GitLab CI/CD, or similar, to streamline the pipeline.
- Integrate automated testing frameworks to ensure code quality and reliability throughout the deployment pipeline.
- Be the primary point of contact for code deployments.
- SITE SREs will take ownership of, manage, and enhance the release process, focusing on scalability, efficiency, and quality:
- Lead the planning, coordination, and execution of releases across multiple products and environments.
- Continuously monitor, improve, and validate release processes based on feedback and metrics.
- SITE SREs will provide support for regular production updates and Job AppWorks corrections:
- Coordinate with development teams to prioritize and schedule production & maintenance updates.
- Execute deployment plans and verify successful updates while minimizing downtime and impact on users.
- Troubleshoot and resolve critical issues with job execution, including errors, failures, and unexpected behavior.
- Analyze job execution logs and metrics to identify any errors, failures, or performance bottlenecks.
- Reduce redundant, duplicate, and obsolete alerts, and take part in alert optimization.
- SITE SREs will provide on-call support and incident handling:
- Participate in an on-call rotation to provide 24/7 support for production systems, responding to alerts and incidents in a timely manner.
- Document incident response procedures and lessons learned for continuous improvement.
- Monitor system health and respond promptly to incidents, escalating as necessary for resolution.
- SITE SREs will be responsible for validation & sanity checks:
- Perform post-PPR and production deployment sanity checks to ensure system stability and functionality.
- Utilize both manual and automated checks to validate the integrity and coherence of the deployed code and configurations.
- Document and report any issues discovered during the validation process for further investigation and resolution.
- SITE SREs will be responsible for ServiceNow Ticket handling:
- Monitor, prioritize, and manage ServiceNow tickets according to defined SLAs and operational priorities.
- Assign tickets to appropriate teams or individuals for resolution and ensure timely follow-up and closure.
- Maintain accurate records and documentation within the ServiceNow platform.
- SITE SREs will be responsible for Capacity planning & Security Alert prioritization:
- Perform capacity testing to validate the scalability of systems and infrastructure under various load conditions.
- Prioritize security alerts based on severity and potential impact on system integrity and data confidentiality.
- Coordinate with security teams to assess and respond to security alerts promptly, implementing appropriate mitigation measures.
- SITE SREs will monitor DevOps Platform Products:
- Monitor the stability, performance, and availability of DevOps platform products such as JFrog, GitLab, Vault, Kong, ELK, Rancher, and Kubernetes (K8s).
- SITE SREs will define Monitoring Objectives:
- Collaborate with stakeholders to determine the key objectives and metrics for monitoring latency, traffic, errors, and saturation.
- Identify critical service-level indicators (SLIs) and objectives (SLOs) to ensure the monitoring aligns with business and user expectations.
Required Qualifications
- Bachelor's or Master's degree in Software Engineering, Computer Science, or equivalent.
- 2+ years of experience with Kubernetes.
- 5+ years of expertise in Linux administration.
- 3+ years of strong coding skills in languages and frameworks such as Java, React.js, etc.
- 2+ years of experience with infrastructure-related tools (Terraform, Ansible, VS Code, Postman, etc.).
- Experience monitoring infrastructure and applications (Splunk, ECK, Grafana, Prometheus, etc.).
- A solid understanding of CI/CD concepts, version control systems, and testing (experience with Jenkins, AppWorks, Git, Docker, GitLab, etc.).
- Experience with collaboration tools (Jira/Confluence, ServiceNow).
- Deep understanding of task automation.
- Proficiency in DevOps principles to ensure effective collaboration between IT operations and developers.
- Expertise in incident management and application security.
- Ability to define Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Service Level Indicators (SLIs).
- Excellent communication skills to collaborate with diverse teams.
- Analytical mindset to understand and solve complex problems.
- Autonomy and sense of responsibility to manage various aspects of the role.
- Ability to work well under pressure and manage multiple priorities.
- Must be amenable to working onsite 2 days a week in Taguig.
Skills
Linux, Kubernetes
```