Job Title: Senior Site Reliability Engineer

Company Name: Nscale

Job Url: https://www.simplyhired.com/job/iZHsvOlJBeBoyxTBhfWvAzmU5ajmsPzqbN_oW-_3OlcQ0D-nwUpsIg

Job Description: Senior Site Reliability Engineer - AI Infrastructure Operations
Nscale
Remote

Job Details
$100,000 - $200,000 a year
Qualifications
Network troubleshooting
Automation
Kubernetes
5 years
IT system monitoring
System design
Scalable systems
Improving operational efficiency
Incident response
Cloud-based systems
SRE
Mentoring
Systems engineering
Incident investigation
Software development
Scalability
Linux
Technical troubleshooting support
Root cause analysis
Distributed computing
Senior level
AI
High availability
System performance monitoring
Full Job Description
About Nscale
Nscale is the GPU cloud engineered for AI—purpose-built to deliver high-performance, cost-efficient infrastructure for AI-native startups and global enterprises. We enable organizations to accelerate innovation, reduce the complexity of AI development, and achieve meaningful business outcomes through scalable, sustainable compute.

Our culture is defined by ownership, accountability, and rapid innovation. We operate with urgency and transparency, and every team member contributes to building the infrastructure powering the future of AI.

The Opportunity
Nscale's AI Infrastructure Operations team supports one of the most demanding AI platforms in the industry. We are looking for a Senior Site Reliability Engineer to help design, build, and operate reliable, scalable infrastructure across our GPU cloud.

This role is focused on hands-on engineering, system reliability, and operational excellence. You will work across software, systems, and infrastructure to improve performance, automate operations, and ensure platform stability at scale.

What You'll Be Doing
Design, build, and improve automation, tooling, and infrastructure systems supporting AI and HPC workloads
Contribute to the development of control-plane systems and operational frameworks
Define and implement SLOs, SLIs, and monitoring strategies to ensure system reliability
Participate in incident response and root cause analysis, driving improvements to reduce recurrence
Identify and address reliability and performance bottlenecks across systems
Collaborate with Engineering, Network, and Fleet teams to improve system design and operational processes
Drive improvements in availability, scalability, and operational efficiency
Mentor junior engineers and contribute to a strong engineering and reliability culture
What You Bring
5–8+ years of experience in SRE, Systems Engineering, or Software Engineering in production environments
Strong software engineering skills with experience building automation and infrastructure tooling
Solid understanding of Linux systems, networking, and distributed systems
Experience troubleshooting issues across infrastructure, OS, networking, and application layers
Familiarity with monitoring, alerting, and observability tools
Ability to balance reliability, performance, and delivery speed
Preferred Experience
Experience with AI or HPC environments, including GPUs or high-performance systems
Exposure to high-speed networking (InfiniBand/RDMA)
Familiarity with Kubernetes, cloud platforms, or bare-metal environments
Experience with observability systems in high-scale environments

For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice.