Evaluations Engineer

Engineering + Research
Full-time
San Francisco, CA
125K - 150K USD a year

About the Role

We are looking for strong engineers to join our team and own the leaderboards that appear on Vals AI.

You will be responsible for testing and benchmarking new models as they are released on tasks in law, tax, coding, finance, and more. You will analyze error modes of models, evaluate their strengths and weaknesses, and work with our communications team to release results.

Our results are used by startups, enterprises, and research labs alike. We work with all the major foundation model labs, as well as some of the largest financial institutions and hospital systems in the world. Our work has been featured by the Wall Street Journal, Washington Post, and Bloomberg.

We are building the standard for evaluating the ability of LLMs to perform real-world tasks. You will contribute directly to the leaderboards that make this possible.

What You’ll Do

Evaluate new LLM releases across the Vals AI suite of benchmarks
Work directly with both open-source and closed-source foundation model labs in evaluating model performance
Use tools like Docent to analyze common failure modes and patterns in model performance
Work directly with our social media team to post interesting findings and results
Add new models and maintain integrations in our model library
Help improve and maintain the infrastructure we use to run benchmarks (agentic and non-agentic)
Collaborate closely with our research team on the creation of new benchmarks

This role follows the rhythm of model releases. Expect intense sprints in the days following a major launch, and calmer stretches in between releases.

Requirements

Familiarity with LLMs: You should already know the space, including the current leading models, their relative performance, and how to use large language models in practice.
Strong engineering fundamentals: You can build and ship quickly with high quality. You should have a track record of building things of significant scope (at jobs, side projects, open source, etc.).
Python expertise: Significant experience in Python, especially in a professional setting.
Team collaboration: Experience working in development sprints, Git workflows, and pull request reviews.
Location: We are an in-person team based in San Francisco. We will support your relocation or transportation as needed.

Nice-to-Haves

Previous experience with benchmarking large language models, or creating benchmarks
Previous experience working at a startup or starting your own company
Technical writing experience and ability
Machine learning research experience

What We Offer

Highly competitive salary and meaningful ownership. Excellence is well rewarded.
Relocation and transportation support
Health/dental insurance coverage
Lunch and dinner provided, free snacks/coffee/drinks
401K plan
Unlimited PTO

About Us

Founding team: The core methodology behind this platform comes from NLP evaluation research we did at Stanford. We raised a $5M seed from some of the top institutional and angel investors in the valley. Our team has prior work experience at NVIDIA, Meta, Microsoft, Palantir and HRT. Collectively, we have over 300 citations in our published work. Our early team includes Stanford PhDs, ex-Jane Street quants, and the first designer at Snorkel.

Tech stack: We use Python for most things at Vals. Our platform is built on Django, with a React frontend. All of the infra is on AWS using CDK for IaC.

What We're Looking For

Learning velocity: The role encompasses a wide variety of tasks. Rather than expecting you to be an expert on Day 1, we are looking for someone who can learn new skills and technologies extremely quickly.
Ownership: We are a small, talent-dense team, and we expect everyone to show initiative to build where it's needed, not just where it's asked. We strive for autonomy over consensus. This is especially true for this role.
Intensity: The LLM landscape is constantly changing. Foundation model labs are continuously pushing the frontier. The unicorn companies that will emerge from this technology shift are being built now. Those that win will have an incredibly high speed of execution.
Solution-oriented mindset: We're looking for people who see opportunities to craft solutions at each juncture, not those who pass hard problems to others or admit defeat.

Referral Bonus

Know someone who would be a good fit? Connect them with [email protected]. If we hire them and they stay on for 90 days, you’ll get a $10,000 referral bonus and Vals AI merch! Please mention the bonus in your email.
