Job Title: Senior Machine Learning Engineer
Company Name: Kellton
Job Url: https://www.linkedin.com/jobs/view/4375967571/

Job Description:

About the Opportunity

Our client, a well-funded, advanced AI research company, is building next-generation foundation models at massive scale. They are hiring a deeply technical, hands-on Senior Machine Learning Engineer to join their core model engineering team.

This is a high-impact role focused on building, scaling, and optimizing large language models (10B–100B+ parameters). The ideal candidate thrives at the intersection of applied research and production-grade engineering and has direct experience working with large-scale training systems.

What You’ll Own

🔹 Foundation Model Engineering
• Lead end-to-end engineering of large language models (10B–100B+ parameters).
• Implement large-scale pre-training, SFT, and alignment pipelines.
• Optimize model architectures and training strategies based on scaling laws and product objectives.
• Drive measurable improvements in performance, reasoning capability, and training efficiency.

🔹 Distributed Training & Performance Optimization
• Architect and optimize multi-node GPU distributed training systems (A100 / H100 / B200 environments).
• Implement advanced parallelism strategies: data, tensor, pipeline, and sequence parallelism.
• Maximize Model FLOPs Utilization (MFU) and overall cluster efficiency.
• Improve training stability, fault tolerance, and monitoring.

🔹 High-Throughput Data Systems
• Build and maintain TB–PB scale data pipelines.
• Implement ingestion, cleaning, deduplication (MinHash/LSH), safety filtering, and PII removal.
• Support multimodal data strategies, synthetic data generation, and curriculum learning.

🔹 Applied LLM Research Implementation
• Productionize alignment techniques (RLHF, DPO, KTO).
• Work with Mixture-of-Experts (MoE) architectures and routing optimization.
• Improve model reasoning, math, and coding performance.
• Build and enhance agent and tool-calling systems.

🔹 Engineering Excellence
• Uphold strong coding and system design standards.
• Identify and eliminate performance bottlenecks.
• Take end-to-end ownership of major system components.

What They’re Looking For

Technical Depth
• MS/PhD in Computer Science, AI, Mathematics, or equivalent practical experience.
• Strong hands-on experience engineering and optimizing large-scale deep learning systems.
• Deep understanding of Transformer architectures (RoPE, FlashAttention, SwiGLU).
• Experience working with modern open-source or proprietary LLMs.

Distributed Systems Expertise
• Advanced proficiency in PyTorch or JAX.
• Experience with Megatron-LM, DeepSpeed, FSDP, or equivalent frameworks.
• Strong understanding of 3D parallelism and ZeRO optimization strategies.

Infrastructure at Scale
• Hands-on experience training on large GPU clusters (100+ GPUs preferred).
• Familiarity with InfiniBand, RDMA, and storage I/O optimization.
• Experience debugging large distributed training runs.

Core Traits
• Highly self-driven and execution-focused.
• Strong system ownership mindset.
• Comfortable operating in fast-moving R&D environments.

Nice to Have
• Open-source contributions in the LLM ecosystem.
• Experience building agentic systems or multi-step reasoning frameworks.
• CUDA or Triton kernel optimization experience.
• Published research or major production LLM deployments.

Featured benefits
Medical insurance, vision insurance, dental insurance, 401(k), paid paternity leave, disability insurance, paid maternity leave.

Requirements added by the job poster
• Authorized to work in the United States.