Job Title: Lead Platform Engineer

Company Name: Embark Labs

Job Url: https://www.linkedin.com/jobs/search/?currentJobId=4370471943&f_AL=true&f_TPR=r86400&f_WT=2&keywords=software%20engineer&origin=JOB_SEARCH_PAGE_JOB_FILTER&start=150

Job Description: Lead Platform Engineer
Embark Labs · United States (Remote)

About the job
About Embark Labs

Embark Labs builds open-source, AI-powered tools for clinical and translational research. We grew out of the team behind XNAT, an open-source medical imaging platform used by thousands of research studies worldwide and cited in over 3,500 research papers.


Our flagship product, Scout, ingests medical imaging and clinical data at scale and organizes it in a unified data lake for analysis. We work with large academic medical centers and are funded through customer revenue and an ARPA-H grant (ADAPT), not venture capital.


We are a small, focused team (5 engineers plus our CTO and CEO) with deep experience in medical informatics. Platform reliability, reproducibility, and operational rigor are central to our mission: our systems support real scientific work that directly impacts patient and population health.


Role Summary

We're hiring a Lead Platform Engineer to be the hands-on technical leader for our platform as we scale reliability, migrate core workloads into the cloud, and prepare the platform for AI-heavy operations. This is a player-coach role: you will design, build, and operate critical platform systems while directly managing and mentoring platform engineers (one to start).


You will own the platform end-to-end, including infrastructure, deployment, observability, security posture, and the developer experience that enables the rest of the team to move quickly and safely. This role reports to the CTO and has meaningful decision-making authority over platform standards and architecture.


Initially, the role is expected to be majority hands-on (roughly 60–70%), with management responsibilities growing as the platform team expands. We're looking for someone who has led engineers through ambiguous technical work — setting direction and mentoring, not just assigning tickets.


As the platform matures, AI Operations will become an increasing focus area: you'll help shape our approach to model inference, GPU-backed infrastructure, and ML lifecycle management in regulated research settings.


What You'll Do

Lead platform direction

Set platform technical direction: reliability, security posture, upgrade strategy, scaling, and operational excellence.
Establish standards for production readiness, runbooks, incident response, and post-incident learning.
Treat the platform as a product with internal users: define clear "golden paths" that make secure, reliable patterns easy to adopt.
Manage, mentor, and grow platform engineers; build a team culture of ownership, learning, and sustainable operational practices.


Take the platform into the cloud

Define and execute a phased cloud reference architecture (AWS/EKS preferred), balancing security, compliance, cost, operability, and hybrid compatibility with existing on-prem deployments.
Enhance Terraform for cloud infrastructure provisioning, including module design, state strategy, policy guardrails, and drift detection.
Improve deployment automation and reproducibility through CI-driven deployments and safe rollback patterns.
Evaluate and introduce event or streaming infrastructure to support real-time HL7 ingestion alongside existing batch pipelines.


Improve reliability and developer velocity

Own Kubernetes operations across stateless and stateful workloads; improve cluster lifecycle, upgrades, and troubleshooting.
Mature observability using Prometheus, Loki, and Grafana; improve alert quality, define service health standards, and reduce MTTR.
Own CI/CD and workflow automation across build, test, release, environment promotion, and deployment safety (evolving existing systems rather than rewriting unnecessarily).
Define and validate HA, backup, and DR approaches; ensure environment reproducibility.
Drive performance and cost optimization through capacity planning, cost guardrails, and operational efficiency.


Security and governance (built-in, not bolted-on)

Establish secure defaults for networking, IAM patterns, secrets handling, encryption/TLS, and auditability, with particular attention to healthcare data sensitivity.
Maintain a compliance-ready security posture appropriate for regulated research environments (formal certifications may be pursued over time, but are not the immediate focus).
Improve software supply-chain hygiene: artifact and image management, scanning, provenance, and signing integrated into platform workflows.


Our Current Stack

Scout and XNAT are deployed on Kubernetes (K3s and EKS) using Ansible and Helm, with Terraform for cloud infrastructure provisioning. We use Keycloak for authentication, Prometheus/Loki/Grafana for observability, and operate PostgreSQL, Cassandra, and Elasticsearch via Kubernetes operators. The Scout data layer uses Delta Lake on MinIO, queried through Trino, with Temporal for workflow orchestration.


Some areas are mature and stable; others are intentionally evolving as we scale and move further into the cloud.


Required Qualifications

7+ years in platform, SRE, or infrastructure roles with significant production ownership.
1+ years of people management or sustained technical leadership experience, including direct reports, coaching, and delivery accountability.
BA or BS degree in Computer Science or a related field, or equivalent experience.
Strong Kubernetes experience in production, including operating stateful workloads.
Terraform expertise: modules, state management, multi-environment patterns.
Production experience with at least one major cloud provider (AWS preferred).
Experience with configuration management in production environments (Ansible is in our stack today; experience with AAP or AWX is a plus).
Strong Linux and networking fundamentals; security-minded operations.
Strong scripting or development skills (Python, Go, or similar).
Experience with or enthusiasm for AI-assisted development workflows (e.g., Claude Code).


Preferred Qualifications

We do not expect experience in all of the following; these represent areas the platform touches today or may evolve toward over time.


Healthcare data experience (HIPAA, compliance frameworks, data governance in regulated environments).
Hybrid (on-prem plus cloud) deployments, including air-gapped or network-restricted environments.
Service mesh, ingress, or API gateway experience; certificate management and zero-trust patterns.
Experience with data/analytics infrastructure: lakehouse architectures, Spark, Trino, or similar distributed query engines.
Temporal, Argo Workflows, or workflow orchestration platforms.
Kafka or event streaming experience in production.
Vector database experience (pgvector, Milvus, Qdrant, or similar) for embedding-based search and retrieval.
Experience deploying and operating LLM inference infrastructure (vLLM, Triton, KServe, or similar) on GPU-equipped clusters.
Experience building or operating MLOps platforms: experiment tracking, model registries, GPU scheduling, or model lifecycle management.


What Success Looks Like

90 days: You understand the current platform deeply. You've produced a clear cloud migration plan and roadmap. Terraform foundations are in place for cloud environments. Operational visibility and runbooks are improved. You've built trust with the team and established working relationships across engineering.


6 months: Cloud environments are repeatable and provisioned through Terraform. Deployments and rollbacks are safer and more automated. SLOs are defined with actionable alerts. You've begun migrating workloads to the cloud reference architecture. The platform team has clear ownership areas and is operating well.


12 months: Cloud production is stable, scalable, and cost-efficient. Platform delivery is standardized with self-service patterns for product teams. Operational toil is meaningfully reduced. The platform team is staffed, healthy, and operating with clear ownership and sustainable on-call practices.


Compensation

Salary at Embark Labs is determined by various factors, including but not limited to location, the individual’s particular combination of education, knowledge, skills, competencies, and experience, as well as contract-specific affordability and organizational requirements. The projected compensation range for this position is $180,000 to $220,000 (annualized USD).