ML Infrastructure Engineer

Maven Robotics

Location: Onsite (San Francisco, California)
Employment: Full-time
Level: Senior Level
Posted 2 days ago

About the Role

Maven Robotics is developing general-purpose robots and physical AI solutions for industrial autonomy. They are seeking an Infrastructure Engineer to build and scale the backend systems powering their machine learning operations, managing data, compute, and artifacts.

Skills

Distributed Systems, Backend Engineering, GPU Compute Infrastructure, Kubernetes, Ray, ZenML, Python, Go, Rust, C++, Infrastructure as Code, Workload Orchestration, Observability Platforms, Storage Systems, Internal Developer Platforms, Capacity Planning

Full job details

Company Overview

Maven Robotics is building the world’s leading general-purpose robots and providing physical AI solutions for the most challenging industrial autonomy tasks.

Operating in stealth, we are assembling a team of world-class innovators who think from first principles. Our mission is to achieve human-level task success rates in complex environments, even when faced with limited fine-tuning data or evolving robotic hardware. We value unwavering truth-seeking, humility, and relentless determination.

Role Description

We are looking to recruit an exceptional Infrastructure Engineer to own and build the backend systems that power machine learning at Maven Robotics. In this role, you will design and scale the core infrastructure used by our AI and robotics teams to manage data, run compute workloads, store artifacts, monitor systems, and support rapidly growing engineering workflows.

You should be excited about distributed systems, backend services, data infrastructure, GPU compute, and high-reliability internal platforms. The ideal candidate has successfully built and operated similar systems before and can independently drive complex infrastructure projects from architecture through production operation. The underlying systems may be sophisticated, but the interfaces and workflows they expose should be reliable, intuitive, and easy for engineers to use.

In this role you will:

  • Own the architecture, implementation, reliability, and evolution of Maven's machine learning infrastructure.
  • Build backend services and platforms for managing data, artifacts, jobs, logs, metadata, and compute resources across cloud and on-premise environments.
  • Design scalable systems for workload orchestration, storage, observability, security, and infrastructure automation.
  • Build intuitive internal tools and abstractions that make complex infrastructure easy for engineers to use.
  • Lead technical and commercial discussions with cloud and ML compute providers, including capacity planning, performance, reliability, and cost.

Qualifications

Must-have:

  • Significant experience designing, building, and operating production backend, distributed, or compute infrastructure.
  • A track record of independently owning complex infrastructure projects from architecture through deployment and ongoing operation.
  • Strong programming ability in Python, Go, Rust, C++, or a similar backend or systems language.
  • Experience operating GPU compute infrastructure and orchestrating distributed workloads using Kubernetes, Ray, ZenML, or similar systems.
  • Experience designing and operating storage systems, observability platforms, infrastructure-as-code, and secure access controls.
  • Experience managing large-scale GPU fleets or hybrid cloud and on-premise compute environments.
  • Experience building internal developer platforms, CLIs, SDKs, or other self-service infrastructure tools.
  • Strong technical judgment, leadership, and communication skills, with the ability to drive decisions across teams and external partners.
  • Self-starter attitude with the ability to identify priorities and deliver durable solutions in a fast-paced startup environment.

Nice-to-have:

  • Familiarity with GPU architecture, accelerator-aware software design, and profiling compute-intensive workloads.
  • Exposure to infrastructure supporting large-scale robot learning workloads, including policy training, simulation, and multimodal data pipelines.
  • Familiarity with SOC 2 controls, security practices, and audit readiness.