Job Title: Staff Software Engineer
Company Name: Tubi
Job Url: https://boards.greenhouse.io/embed/job_app?token=7255621&utm_source=jobright&jr_id=6961654ae7ed9a5731ba2bd1

Job Description:

About the Role:
As a Staff Software Engineer on the ML Infrastructure team, you will collaborate closely with the Machine Learning and Product teams to build world-class machine learning inference platforms. These platforms power essential services such as personalized recommendations, search, and content understanding across Tubi. A core responsibility of this team is developing and maintaining low-latency ML model serving systems that support Deep Learning, LLM, and Search models. This involves building self-service infrastructure and critical components such as the inference engine, feature store, vector store, and experimentation engine. You will improve the way we deploy and operate our services, and may even contribute to open-source projects. This role grants the architectural freedom to explore new frameworks, lead critical cross-functional projects, and transform the capabilities of our ML and Product teams.

Responsibilities:
- Design and build scalable, high-throughput, low-latency distributed systems using Scala
- Build reusable components and services that serve ML applications such as Personalization, Search, Ads, and Exploration
- Partner closely with ML engineers to understand their challenges and limitations, and develop scalable solutions to address them
- Proactively recommend solutions to keep our ML inference stack state of the art
- Take a data-driven approach to identifying and optimizing the latency, cost, and efficiency of our infrastructure
- Lead large-scale cross-functional refactorings when necessary
- Mentor other engineers on the team in system design, effective incident management, interviewing, leveraging LLMs for work, etc.
- Collaborate with ML, Product, and cross-functional engineering teams to define the long-term vision and architecture for ML Infrastructure at Tubi
Your Background:
- 8+ years of experience designing and building scalable, distributed systems in any modern backend language (e.g., Scala, Java, Python, Go, C++); experience with Scala or another JVM-based language is a plus
- Strong experience with AWS or an equivalent cloud platform
- Experience building online microservices at scale with low-latency serving
- Experience with both SQL databases (e.g., Postgres) and NoSQL databases (e.g., Cassandra), message brokers (e.g., Kafka), and caches (e.g., Redis)
- Experience with containerization and orchestration technologies, such as Docker or Kubernetes
- Led the response and resolution efforts for multiple major, large-scale incidents

Bonus:
- Familiarity with ML infrastructure components such as inference engines (e.g., TorchServe, Triton, vLLM), vector stores (e.g., LanceDB, FAISS), feature stores (e.g., Feast), ElastiCache, model training orchestration, etc.
- Understanding of ML model training pipelines and model internals
- Experience with Recommender Systems, Search, Autocomplete, and Ads ML is a plus
- Previous experience with Akka, Erlang, Elixir, or Go
- Proficiency in data-driven analysis of complex A/B testing results