Essential Responsibilities:
- Partner with product owners and business SMEs to analyze business needs and improve the supportability, scalability and recovery of the engineered solution.
- Ensure that the overall technical solution is aligned with the business needs and the operational teams' methodologies
- Drive the improvement of service availability to reduce the mean time to recovery using automation.
- Develop methods for autonomous recovery and self-repairing systems.
- Ensure the solution is consistent with RFPIO architecture, design and development standards
- Coordinate and plan system releases and hotfixes.
- Develop methods that simplify triage through checklists, runbooks and standard operating procedures.
- Make adjustments to adopt new methodologies that provide the business with increased flexibility and agility
- Support software development by providing operational improvements to non-functional requirements.
- Develop enhancements to improve service levels by leveraging key performance indicators drawn from monitoring, non-functional testing and availability reports.
- Provide a service-focused approach leveraging continuous process improvement.
- Participate in chaos testing to improve system resiliency.
- Mentor other engineers.
- Provide overall technical leadership to smaller working teams as needed
- Stay current with the latest development tools, technology ideas, patterns and methodologies; share knowledge by clearly articulating results and ideas to key stakeholders
Experience:
- At least 3 to 5 years in a Site Reliability Engineering, DevOps, or infrastructure-focused role
- Experience supporting internet-facing production services and distributed systems
- Ability to implement and coordinate telemetry using monitoring and observability tools such as Splunk, Grafana or Prometheus
- Coding experience in a high-level programming language such as Java or Python
- Automation advocate - you truly believe in removing operational load via software
- A strong sense of ownership.
- Experience managing, scaling, and troubleshooting Java applications
- Familiarity with cloud infrastructure concepts (zones, regions, VPCs, etc.)
- An understanding of a variety of packaging formats, deployment strategies, and tooling for software services
- Working understanding of common authentication schemes, certificates, and securely managing secrets
- Capable of designing and implementing automated configuration management processes for repeatable and consistent service deployment
Education:
- BS or MS in Computer Science or equivalent industry experience
Knowledge, Skills & Ability:
- Prior experience as an SRE, software engineer, DevOps engineer, or system administrator
- Experience in system automation technology, such as Ansible
- Experience in container technologies
- Experience using cloud services.