Job Title: Senior Software Engineer, Generative AI Cloud Infrastructure
Company Name: iOPEX Technologies
Job Url: https://arc.dev/remote-jobs/details/senior-software-engineer-generative-ai-cloud-infrastructure-perm-us-uk-europe-o3m34as264
Job Description:

About Us
We are building a Gen AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle. Our focus is to deliver blazing-fast LLM inference, scalable fine-tuning, and modern AI cloud infrastructure that spans GPUs, SmartNICs/DPUs, and ultra-fast networking fabrics.

Our platform powers mission-critical workloads with:
● On-demand & managed Kubernetes clusters
● Slurm-based training clusters
● High-performance inference services
● Distributed fine-tuning and eval pipelines
● Global data centers & heterogeneous GPU fleets

We are looking for a Senior Software Engineer to design, build, and scale the core systems behind our AI cloud.

What You’ll Work On

High-Performance AI Cloud Infrastructure
● Design and maintain fault-tolerant, high-availability backend services running across global data centers.
● Build operators and automation systems for:
  ○ GPU management
  ○ InfiniBand partitioning
  ○ VM provisioning
  ○ High-throughput storage provisioning

LLM & GPU Virtualization Platform
● Build the IaaS software layer for new GPU clusters with thousands of next-gen accelerators (H100, GB200, GB300).
● Work on scalable GPU virtualization (PCIe passthrough, MIG, SR-IOV, VFIO).

Massive-Scale Storage & Data Systems
● Contribute to a global multi-exabyte, high-performance object store optimized for pretraining datasets.
● Build distributed data loaders, caching layers, metadata services, and throughput-optimized pipelines.

Observability, Reliability & Automation
● Develop advanced observability stacks (Prometheus, Grafana, OpenTelemetry).
● Design automated node lifecycle management for large-scale distributed training and inference.
● Build robust testing frameworks for resiliency, failover, and fault tolerance.
Core Platform Engineering
● Contribute to the core internal and open-source platform components.
● Write tooling, SDKs, and documentation for developer-facing services.
● Research decentralized AI workloads and build reference architectures.

Requirements

Fundamentals
● 5+ years of production software engineering experience.
● Strong proficiency in one or more backend languages (Golang highly preferred; Rust/Python also valued).
● 5+ years building high-performance, well-tested, production-grade distributed services.

Cloud & Systems Experience
● Experience with distributed microservices across AWS/GCP/Azure.
● Deep understanding of systems fundamentals:
  ○ Concurrency
  ○ Memory management
  ○ High-performance I/O
  ○ Distributed consensus
  ○ Large-scale system design

Kubernetes / Infrastructure Expertise (Big Plus)
● Kubernetes internals: custom operators, CRDs, schedulers, or networking/storage plugins.
● Experience with Cluster API, KubeVirt, or similar orchestration tooling.

Virtualization / Compute (Big Plus)
● Experience with hypervisors (QEMU/KVM, cloud-hypervisor).
● PCIe passthrough, SR-IOV, GPU virtualization, MIG, NVLink topologies.
● Experience with DPUs/SmartNICs.

Networking (Big Plus)
● InfiniBand / RDMA
● VLAN/VXLAN/VPC
● OVS/OVN
● High-performance DC networking

High-Performance Compute (Plus)
● CUDA, NCCL, GPU drivers, parallel training stacks
● Experience with GPU scheduling, workloads, and distributed ML

Infrastructure Automation & Tooling (Expected)
● Terraform, Ansible, CI/CD
● GitHub Actions, ArgoCD
● Prometheus, Grafana, ELK, OpenTelemetry

Preferred Experience
● Built or operated IaaS/PaaS systems
● Experience with large-scale storage systems (Ceph, Lustre, or custom object stores)
● Knowledge of vLLM, TensorRT-LLM, TGI, or other LLM-serving frameworks
● Experience building infra for ML, training, inference, or fine-tuning

Responsibilities
● Perform architecture work and research for distributed and decentralized AI workloads.
● Build and maintain foundational infrastructure powering training, inference, and fine-tuning.
● Contribute to core, open-source platform components.
● Own end-to-end services from design → implementation → operations.
● Create testing frameworks for robustness, failover, and performance.
● Collaborate across hardware, product, and ML teams to design next-gen infra.

Who You Are
● A deeply technical engineer who thrives in complex systems work.
● Strong communicator who writes clear design docs.
● Curious, low-ego, and great at collaborating with cross-functional teams.
● Motivated by building world-class AI infrastructure from the ground up.
● Thrives in zero-to-one, fast-moving startup environments.

Compensation
● Competitive salary
● Meaningful early equity
● Benefits
● Salary determined by experience and location.