Job URL: https://job-boards.greenhouse.io/upwork/jobs/6587736003

Lead Machine Learning Engineer - Applied Scientist
Remote

Upwork ($UPWK) is the world’s largest work marketplace, connecting businesses with highly skilled professionals worldwide. From entrepreneurs to Fortune 100 enterprises, companies trust Upwork’s platform to access expert talent, leverage AI-powered work solutions, and drive meaningful business outcomes.

Upwork’s AI-powered platform has facilitated over $20 billion in economic opportunity for professionals worldwide. With professionals spanning 10,000+ skills, including AI and machine learning, software development, sales and marketing, customer support, finance and accounting, and more, Upwork empowers businesses of all sizes to scale, innovate, and build agile teams.

We’re looking for a Lead Machine Learning Engineer / Applied Scientist with a passion for rigorously evaluating and improving the performance of LLMs and AI agents. In this role, you will focus on building feedback loops, defining success metrics, and driving measurable improvements to the quality and reliability of our intelligent systems. This is a rare opportunity to shape how evaluation and iteration are embedded in the product lifecycle for AI at scale.

You’ll partner closely with research, engineering, and product teams to embed your insights into Upwork’s AI infrastructure. This includes designing testbeds for agentic workflows, refining prompts and orchestration strategies, and guiding the iteration of ML-powered features to deliver better outcomes for our users. Your work will directly influence the success of our most advanced AI initiatives.

Responsibilities

- Develop and own evaluation pipelines for agentic LLM systems, enabling consistent measurement across simulation, benchmark, and live-user scenarios.
- Define and iterate on quality metrics that guide the training, tuning, and deployment of LLMs and agents.
- Lead experiments to assess and improve system behaviors across dimensions such as correctness, safety, latency, and helpfulness.
- Collaborate with cross-functional partners to integrate insights from evaluation into product development and deployment pipelines.
- Build automated testing and monitoring tools that scale with the complexity of agent behaviors and LLM responses.
- Share findings and improvements through documentation, dashboards, and internal demos, contributing to a culture of continuous learning and excellence.

What it takes to catch our eye

- Deep familiarity with evaluation methodologies for LLMs or autonomous agents, including benchmark selection, prompt sensitivity analysis, and human-in-the-loop review processes.
- Hands-on experience in Python and ML frameworks such as PyTorch, with the ability to analyze outputs and drive performance iteration.
- Proven ability to work across teams and disciplines to align ML evaluation work with product and business goals.
- Comfort operating in ambiguous problem spaces and taking initiative to define structure, priorities, and impact.
- Passion for continuous improvement, strong documentation habits, and a collaborative, inclusive working style.