Job Title: Senior Site Reliability Engineer

Company Name: VAE, Inc.

Job Url: https://vaeit.hua.hrsmart.com/hr/ats/Posting/view/284

Job Description: Job Details
Senior Site Reliability Engineer — Infrastructure & Architecture - (284)

OVERVIEW
VAE, Inc. is a full-service IT Infrastructure Solutions Company focused on building, securing, and supporting our clients’ mission-critical enterprises. We provide a distinctive array of design, integration, and implementation services as well as fully managed service offerings. VAE is at the forefront of leveraging multi-tenant-capable technologies and shared IT services to create secure, reliable, and cost-effective end-to-end services and solutions. We deliver exceptional infrastructure solutions with extremely talented employees using a client-focused partnering approach.

Job Type: Full-time

Location: VA US (Primary)

Job Description
VAEIT’s Infrastructure & Architecture team is the engineering backbone of a 40-person organization delivering network engineering consultancy and network management solutions for Department of Defense customers.

We own the entire infrastructure surface area — AWS accounts, CI/CD pipelines, identity management, security compliance, and production deployments. Demand from the engineering organization is rapidly outpacing current capacity.

This is a force-multiplier hire. The right person will reduce operational risk, accelerate delivery across all engineering teams, and help us transition from reactive support to proactive platform engineering.

What You’ll Own:

AWS Infrastructure & Production Operations

Operate and evolve a multi-account AWS environment (5 accounts) including ECS Fargate, RDS, Lambda, CloudFront, and multi-VPC architectures
Manage production ECS services across multiple accounts and clusters
Define and maintain all infrastructure as code using Terraform — no manual configuration
Design and manage networking, IAM, security groups, and cross-account access patterns

CI/CD & Developer Experience

Own and improve GitHub Actions pipelines used across the engineering organization
Build and maintain workflows for multiple teams and tech stacks (Go, C#, Python)
Reduce build times and increase deployment reliability to accelerate delivery

Identity, Access & IT Systems

Administer Microsoft Entra ID (Azure AD) for identity and SSO
Manage user provisioning, groups, and access policies
Own GitHub Enterprise configuration, permissions, and security controls

Security & Compliance

Strengthen software supply chain security, including SBOM generation
Support and improve FedRAMP compliance posture
Enforce least-privilege IAM and conduct security audits across environments

Observability & Incident Response

Operate and evolve monitoring systems (CloudWatch, Prometheus, Alertmanager)
Improve signal-to-noise ratio in alerting and detection

Physical Infrastructure & MLOps

Manage on-premises GPU servers for ML training and inference
Bridge cloud and on-prem infrastructure, enabling scalable ML workloads in AWS
Support data science teams with reproducible environments and deployment pipelines
Maintain GPU tooling (NVIDIA drivers, CUDA) and containerized workloads

Automation & Tooling

Build internal tools (Go, Python, Bash, PowerShell) to eliminate manual work
Extend Ansible-based configuration management
Treat operations as software: automate everything possible

Platform Standards & Engineering Excellence

Define standards for logging, health checks, configuration, and deployment
Build “golden paths” (templates, starter repos, shared workflows)
Champion observability practices (structured logging, tracing, SLOs)
Review infrastructure and deployments to catch reliability and security issues early

Enabling Agentic LLM Systems

Build infrastructure for LLM-powered products (GPU compute, model serving, vector databases)
Design deployment pipelines for model versioning, evaluation, and inference routing
Enable rapid experimentation with self-service GPU-backed environments
Ensure AI systems meet DoD security requirements (audit logging, isolation, provenance tracking)

Qualifications
What We’re Looking For:

5+ years in SRE, DevOps, or Platform Engineering
Deep AWS experience (ECS/Fargate, RDS, VPCs, IAM, Lambda, CloudFront, multi-account setups)
Strong Terraform skills (modules, state management, code reviews)
Experience building and scaling CI/CD systems (GitHub Actions preferred)
Hands-on Linux administration (including physical or bare-metal systems)
Experience with GPU/ML infrastructure (CUDA, containerized workloads)
Proficiency in at least two: Go, Python, Bash, PowerShell
Strong networking fundamentals and debugging skills
Experience with identity systems (Entra ID / Azure AD, SSO/SAML)
Excellent written communication in a remote, async environment
Highly self-directed and execution-oriented
U.S. Citizenship (required for DoD work)

Strongly Preferred:

Experience in DoD or FedRAMP-regulated environments
MLOps tooling (MLflow, Weights & Biases, pipeline orchestration)
Migrating GPU workloads from on-prem to AWS (EC2 GPU, SageMaker, ECS)
Ansible for configuration management
Software supply chain security (SBOMs, signing, vulnerability scanning)
Prometheus/Alertmanager experience
GitHub Enterprise administration at scale
Container security and ECS optimization
Familiarity with .NET ecosystems
Exposure to Rust

Who You Are:

You apply software engineering rigor to operational challenges
You instinctively automate rather than rely on manual processes
You thrive in high-autonomy, high-impact environments
You communicate clearly and proactively in async workflows
You’ve operated as a senior IC in a small team with broad ownership

Clearance Level
Ability to obtain and maintain a U.S. Security Clearance

VAE, Inc. is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to sex, race, color, religion, national origin, disability, protected Veteran status, age, or any other characteristic protected by law.