AI Infrastructure Engineer

Jobgether

Location: Remote (US)
Employment: Full-time
Level: Senior Level
Posted 5 days ago

About the Role

This role offers an exciting opportunity to build and operate the foundational infrastructure powering large-scale AI training and inference environments. You will work on cutting-edge GPU platforms, distributed computing systems, and high-performance AI infrastructure designed to support advanced machine learning workloads at scale.

Skills

GPU Infrastructure, Distributed Computing, Python, Go, C++, Kubernetes, Slurm, Ray, PyTorch, Linux Internals, RDMA, InfiniBand, CI/CD, Cloud Computing, Performance Engineering, Platform Engineering

Benefits

  • Employee benefits
  • Support programs

Perks

  • Remote work

Full job details

This position is posted by Jobgether on behalf of a partner company. We are currently looking for an AI Infrastructure Engineer in the United States.

This role offers an exciting opportunity to build and operate the foundational infrastructure powering large-scale AI training and inference environments. You will work on cutting-edge GPU platforms, distributed computing systems, and high-performance AI infrastructure designed to support advanced machine learning workloads at scale. The position combines platform engineering, systems optimization, automation, and cloud technologies to create reliable, secure, and efficient environments for AI researchers and ML engineers. You will collaborate closely with cross-functional teams to improve developer experience, optimize infrastructure utilization, and enhance system reliability across complex distributed architectures. This opportunity is ideal for professionals passionate about AI infrastructure, distributed systems, and performance engineering in fast-paced and innovation-driven environments. Long-term projects, modern technologies, and strong career growth potential make this an exceptional opportunity for experienced infrastructure engineers looking to work at the forefront of AI operations.


Accountabilities:
  • Design, deploy, and operate GPU and accelerator infrastructure supporting large-scale AI training and inference workloads across cloud, on-premise, and hybrid environments.
  • Build and optimize scheduling, queueing, and resource-sharing systems to maximize accelerator utilization across multiple teams and workloads.
  • Integrate distributed training frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray Train into scalable platform environments.
  • Manage high-performance storage systems and data pipelines to ensure efficient delivery of training data to compute infrastructure.
  • Design and maintain networking architectures supporting RDMA, InfiniBand, NCCL, and high-bandwidth distributed communication protocols.
  • Develop observability, monitoring, and analytics solutions for AI workloads, including utilization metrics, throughput analysis, and training stability monitoring.
  • Implement checkpointing, fault tolerance, restart mechanisms, and resiliency strategies for long-running distributed training jobs.
  • Drive infrastructure cost optimization initiatives across compute, storage, networking, and cloud resource utilization.
  • Create developer tooling, automation workflows, and self-service platform capabilities to improve productivity for ML engineers and researchers.
  • Partner with AI research and applied ML teams to support capacity planning, infrastructure scaling, and workload forecasting.
  • Implement security controls, multi-tenant isolation strategies, and access management policies for enterprise AI infrastructure environments.
  • Automate provisioning, lifecycle management, configuration enforcement, and operational workflows across AI infrastructure platforms.
  • Maintain operational documentation, runbooks, dashboards, and platform standards to support scalability and long-term maintainability.
  • Stay current with emerging AI infrastructure technologies, accelerator hardware advancements, and open-source ecosystem developments.

Requirements:

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
  • Minimum of 6 years of experience in infrastructure engineering, platform engineering, high-performance computing, or related domains.
  • Hands-on experience operating GPU clusters and large-scale machine learning infrastructure in production environments.
  • Strong programming skills in Python and at least one systems programming language such as Go or C++.
  • Deep understanding of distributed training architectures, accelerator technologies, and collective communication frameworks.
  • Experience with Kubernetes, Slurm, Ray, or comparable orchestration and scheduling systems for ML workloads.
  • Strong expertise in Linux internals, networking, storage systems, and distributed systems operations.
  • Experience working with major cloud providers and cloud-native AI infrastructure services.
  • Solid understanding of software engineering best practices including CI/CD, testing, automation, and code review processes.
  • Excellent troubleshooting, analytical, documentation, and cross-functional collaboration skills.
  • Experience operating RDMA or InfiniBand networking environments is preferred.
  • Contributions to open-source ML infrastructure projects or experience with research-scale AI platforms are considered a plus.
  • Familiarity with AI infrastructure cost optimization and FinOps practices is advantageous.
  • Candidates must be authorized to work in the United States. No new H1B sponsorship is available, though H1B transfers may be supported for qualified candidates.
  • Ability to successfully complete technical coding assessments related to infrastructure engineering and distributed systems.

Benefits:

  • 100% remote work opportunity within the continental United States.
  • Competitive compensation package aligned with experience and technical expertise.
  • Full-time W2 employment with long-term project stability.
  • Opportunity to work on advanced AI infrastructure and cutting-edge machine learning platforms.
  • Exposure to large-scale distributed systems, GPU technologies, and enterprise AI operations.
  • Strong career growth and professional development opportunities.
  • Collaborative and innovation-focused engineering environment.
  • Equal opportunity workplace committed to diversity, inclusion, and accessibility.
  • Comprehensive employee benefits and support programs.


How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

 

 
