About the Job:
Position: SRE DevOps
Location: Chennai/Bangalore/Hyderabad/Pune/Mumbai
Experience: 5 to 8 Years only
Primary Skills: SRE, Dynatrace, Prometheus, Grafana, Kubernetes, AWS Native components, CloudWatch, Puppet/Chef/Ansible, CDK
Responsibilities
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
• Improve the end-to-end availability and performance of mission-critical services, and build automation to prevent problem recurrence.
• Partner with business and technical product owners to set SLOs / SLIs / error budgets to manage the reliability of infrastructure and applications.
• Scale and optimize existing infrastructure and services sustainably through mechanisms such as automation, and evolve them by improving reliability and efficiency.
• Maintain infrastructure (as code) and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages, and security threats in Development, UAT, Staging, and Production environments.
• Practice sustainable incident response and blameless postmortems
• Build infrastructure and drive projects that deliberately break things in order to improve the robustness of production systems.
• Preserve operational visibility and response capabilities by fixing and improving our dashboards, alerts, and automation.
• Maintain operational uptime and reliability by participating in triage and issue support calls for mission-critical systems.
Tech skills
• Bachelor’s degree in design, computer science, or a related technical field
• Strong debugging, troubleshooting, and problem-solving skills
• Proficient in Node.js; familiarity with other scripting languages and tools (JavaScript, Python, Maven, Ansible, Bash, etc.) is a plus.
• Experience with monitoring and alerting systems like Dynatrace, Prometheus, Grafana.
• Experience with log and metrics analytics platforms like Sumo Logic, Splunk.
• Experience setting SLOs / SLIs / error budgets and managing reliability for infrastructure and applications using Kubernetes, AWS Native components, CloudWatch, Dynatrace.
• Experience handling large numbers of diverse systems with configuration management systems like Puppet, Chef, Ansible.
• Proven history of leveraging automation
• Experience using tools like PagerDuty for managing incidents.
• Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI model, subnetting, and load balancing strategies.
• Experience in Serverless Application Framework
• Experience in containerized workloads and management platforms such as Docker or Kubernetes
• Familiarity with distributed systems, including microservices, is a plus.
• Experience with infrastructure automation tools such as CDK.
• Understanding of CI/CD processes and experience with deployment automation tools such as CodePipeline, CodeDeploy, Jenkins, Bamboo.
• Effective communication, collaboration & negotiation skills with the ability to interface with various business units and vendors.
• Experience liaising with developers, operations engineers, and third-party resources.
• Experience consuming APIs.
Soft Skills
• Ability to work in a team and independently.
• Excellent verbal and written communication skills
• Multitasking
• Time management