Job Title: Senior Site Reliability Engineer - Infrastructure & Architecture
Company Name: VAE
Job Details: Remote, Full Time
Job Url: https://hiring.cafe/viewjob/87b2s8b7d8aqn350
Job Description:

Posted 2d ago
Location: United States

Responsibilities: Operate AWS, manage pipelines, define infrastructure

Requirements Summary: 5+ years in SRE/DevOps; strong AWS (ECS/Fargate, RDS, VPCs, IAM, Lambda, CloudFront); Terraform; CI/CD; Linux; GPU/ML infra; Entra ID/Azure AD; DoD security clearance eligibility.

Technical Tools Mentioned: Terraform, Amazon Web Services, GitHub Actions, Ansible, Linux, Go, Python, Bash, PowerShell, Prometheus, CloudWatch, Azure Active Directory, Entra ID, SAML

VAEIT’s Infrastructure & Architecture team is the engineering backbone of a 40-person organization delivering network engineering consultancy and network management solutions for Department of Defense customers.

We own the entire infrastructure surface area: AWS accounts, CI/CD pipelines, identity management, security compliance, and production deployments. Demand from the engineering organization is rapidly outpacing current capacity.

This is a force-multiplier hire.
The right person will reduce operational risk, accelerate delivery across all engineering teams, and help us transition from reactive support to proactive platform engineering.

What You’ll Own:

AWS Infrastructure & Production Operations
- Operate and evolve a multi-account AWS environment (5 accounts) including ECS Fargate, RDS, Lambda, CloudFront, and multi-VPC architectures
- Manage production ECS services across multiple accounts and clusters
- Define and maintain all infrastructure as code using Terraform, with no manual configuration
- Design and manage networking, IAM, security groups, and cross-account access patterns

CI/CD & Developer Experience
- Own and improve GitHub Actions pipelines used across the engineering organization
- Build and maintain workflows for multiple teams and tech stacks (Go, C#, Python)
- Reduce build times and increase deployment reliability to accelerate delivery

Identity, Access & IT Systems
- Administer Microsoft Entra ID (Azure AD) for identity and SSO
- Manage user provisioning, groups, and access policies
- Own GitHub Enterprise configuration, permissions, and security controls

Security & Compliance
- Strengthen software supply chain security, including SBOM generation
- Support and improve FedRAMP compliance posture
- Enforce least-privilege IAM and conduct security audits across environments

Observability & Incident Response
- Operate and evolve monitoring systems (CloudWatch, Prometheus, Alertmanager)
- Improve signal-to-noise ratio in alerting and detection

Physical Infrastructure & MLOps
- Manage on-premise GPU servers for ML training and inference
- Bridge cloud and on-prem infrastructure, enabling scalable ML workloads in AWS
- Support data science teams with reproducible environments and deployment pipelines
- Maintain GPU tooling (NVIDIA drivers, CUDA) and containerized workloads

Automation & Tooling
- Build internal tools (Go, Python, Bash, PowerShell) to eliminate manual work
- Extend Ansible-based configuration management
- Treat operations as software: automate everything possible

Platform Standards & Engineering Excellence
- Define standards for logging, health checks, configuration, and deployment
- Build “golden paths” (templates, starter repos, shared workflows)
- Champion observability practices (structured logging, tracing, SLOs)
- Review infrastructure and deployments to catch reliability and security issues early

Enabling Agentic LLM Systems
- Build infrastructure for LLM-powered products (GPU compute, model serving, vector databases)
- Design deployment pipelines for model versioning, evaluation, and inference routing
- Enable rapid experimentation with self-service GPU-backed environments
- Ensure AI systems meet DoD security requirements (audit logging, isolation, provenance tracking)

What We’re Looking For:
- 5+ years in SRE, DevOps, or Platform Engineering
- Deep AWS experience (ECS/Fargate, RDS, VPCs, IAM, Lambda, CloudFront, multi-account setups)
- Strong Terraform skills (modules, state management, code reviews)
- Experience building and scaling CI/CD systems (GitHub Actions preferred)
- Hands-on Linux administration (including physical or bare-metal systems)
- Experience with GPU/ML infrastructure (CUDA, containerized workloads)
- Proficiency in at least two: Go, Python, Bash, PowerShell
- Strong networking fundamentals and debugging skills
- Experience with identity systems (Entra ID / Azure AD, SSO/SAML)
- Excellent written communication in a remote, async environment
- Highly self-directed and execution-oriented
- U.S. Citizenship (required for DoD work)

Strongly Preferred:
- Experience in DoD or FedRAMP-regulated environments
- MLOps tooling (MLflow, Weights & Biases, pipeline orchestration)
- Migrating GPU workloads from on-prem to AWS (EC2 GPU, SageMaker, ECS)
- Ansible for configuration management
- Software supply chain security (SBOMs, signing, vulnerability scanning)
- Prometheus/Alertmanager experience
- GitHub Enterprise administration at scale
- Container security and ECS optimization
- Familiarity with .NET ecosystems
- Exposure to Rust

Who You Are:
- You apply software engineering rigor to operational challenges
- You instinctively automate rather than rely on manual processes
- You thrive in high-autonomy, high-impact environments
- You communicate clearly and proactively in async workflows
- You’ve operated as a senior IC in a small team with broad ownership

VAE, Inc. is a full service IT Infrastructure Solutions Company focused on building, securing and supporting our clients’ mission critical enterprises. We provide a distinctive array of design, integration and implementation services as well as fully managed service offerings. VAE is at the forefront of leveraging multi-tenant capable technologies and shared IT services to create secure, reliable and cost-effective end-to-end services and solutions. We deliver exceptional infrastructure solutions with extremely talented employees using a client-focused partnering approach.

VAE, Inc. is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to sex, race, color, religion, national origin, disability, protected Veteran status, age, or any other characteristic protected by law.