Machine Learning Engineer

RunPod • San Francisco, CA, United States

Position: ML Engineer - Full Time - Remote

Reports to: Head of Data

Salary Range:

Company Overview:

RunPod is a fast-growing start-up that empowers developer teams to deploy custom, full-stack AI apps simply and at scale. We seek a talented and experienced ML Engineer to join our dynamic team.

Job Summary:

As an ML Engineer, you will be responsible for building the next-generation, highly available, global GPU cloud computing service with open-source technologies to enable and accelerate RunPod’s rapid growth.

This system spans many diverse environments (containers, VMs, and bare-metal compute) and provides a cohesive, reliable abstraction for running AI workloads across them. You will get to be a technology thought leader, evangelize new, cutting-edge technologies, and solve complex problems. To be successful in this role, you will need experience practicing infrastructure as code, strong software development fundamentals, and strong systems knowledge and troubleshooting abilities.

Requirements:

2+ years of experience writing high-performance, well-tested, production-quality code
2+ years of software development experience and proficiency in Python
Excellent understanding of low-level operating system concepts, including multi-threading, memory management, networking, storage, performance, and scale
Experience working on applied ML/AI products in production
Knowledge of distributed systems and HPC
Experience with TensorFlow and JAX is a plus
Pragmatic, methodical, well-organized, detail-oriented, and self-starting
Experience with containerization, VPNs, and AI workloads is a plus
Knowledge of GPU programming, NCCL, and CUDA is a plus
Experience in at least one backend programming language is a plus
Familiarity with open-source inference and training stacks such as vLLM, TGI, TensorRT, Torchrun, etc. is a plus
Demonstrated experience with high-performance or distributed cloud microservice architectures, ideally including building and operating them at global scale, is a plus

Responsibilities:

Perform architecture and research work for AI workloads
Work on the core RunPod AI platform
Create services, tools, and developer documentation
Create testing frameworks for robustness and fault-tolerance

Compensation Package:

RunPod's compensation package comprises three elements: salary, equity, and benefits. We are committed to pay fairness and aim for these three elements to be highly competitive with market rates. On top of this position's salary, equity will be a component of total compensation. The exact amount will be communicated at the time of offer issuance.

Join Us:

At RunPod, you’ll have the opportunity to work on cutting-edge technology and significantly impact the AI and ML fields. We encourage you to apply if you’re driven by innovation and excellence and want to be part of a team that values bold ideas and professional growth. Let's shape the future of technology together!

Non-Discrimination in Hiring Practices:

RunPod is committed to maintaining a workplace free from discrimination and upholding the principles of equality and respect for all individuals. Our hiring practices are designed to ensure fairness, objectivity, and inclusiveness, adhering to all applicable laws and regulations regarding nondiscrimination.
