Why I Stopped Evaluating AI Quality Alone

(and Designed a System So the Right People Could)

April 15, 2026.

How a structural shift turned AI evaluation from a solo engineering effort into a team-wide discipline. The work sat with engineering not because other expertise wasn't needed, but because the process was never built to bring it in.

By Zachary Goldstein, Solutions Architect

Image generated by Sara Jaye.

When only engineering was familiar with operating AI systems, other disciplines didn't feel comfortable contributing. Teams knew quality mattered, but didn't have a shared place to define it or a clear way to act on it. So the work defaulted to whoever was closest to the system.

The fix wasn't better engineering. It was designing a structure where the right expertise had a clear way into the process. That’s what led to a framework that gives every discipline a defined role in AI quality:

  • The Success Metric: Business requirements define the “what” (the standards).

  • The Testing Narrative: Quality assurance defines the “who, when and where” (the context).

  • The Enablers: Engineering defines the “how” (the tools).

This structure doesn't redistribute burden; it brings in expertise that was always needed but had no way to contribute.

We put this framework into practice evaluating NBCU's OLI, an AI assistant built to help U.S. viewers navigate the thousands of hours of competition at the 2026 Milan Cortina Winter Games.

For OLI, we built a custom service named Osiris on top of Google’s Vertex AI Evaluation API. Hosted on Google Cloud Platform, Osiris made the API's powerful features accessible to the entire team.

End-to-end architecture of the evaluation pipeline, from query ingestion to scored outputs based on business requirements.

Here is how the team contributed to the framework.

1. The “What”: Metrics for measurement

This exercise leverages Product's insight into the feature's purpose and formats it so the AI can be evaluated programmatically.

The system we built makes this shift explicit: Product translates business requirements directly into evaluation metrics. No guesswork or interpretation gap.

We created a Metric Manager that lets anyone take their business requirements and apply them to AI evaluations. Here’s how we structured it for OLI:

  • Criteria: What specific behavior are we measuring? For OLI, this meant rules like:

    • “the response speaks to dates and times naturally”

    • “the chatbot does not compare athletes”

  • Metrics: Groups of related criteria that let the team see patterns in performance. OLI’s metrics included categories that allowed the team to say, for instance, “OLI is strong on scheduling equality, but weak on athlete disambiguation.” Examples of these categories included:

    • “Tone and Voice”

    • “Programming Time”

    • “Athlete Topics”

  • Rubric: A guide defining how quality is scored. The key here is clarity: everyone should understand what “good” and “bad” look like. For OLI, we used a pass/fail scoring rubric to find failures quickly:

    • 1 - (Fail) The response shows no awareness of our programming rules.

    • 2 - (Pass) The response accurately and clearly speaks to our athlete rules.

This way the AI is measured against the actual business goal, not just what’s technically possible.
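
To make that concrete, here is a minimal sketch of what a Metric Manager entry could look like, using the OLI examples above. The dataclasses and field names are illustrative assumptions for this post, not the actual Osiris schema.

```python
from dataclasses import dataclass, field


@dataclass
class RubricLevel:
    score: int        # the score the judge assigns
    label: str        # "Pass" or "Fail"
    description: str  # what a response at this level looks like


@dataclass
class Metric:
    name: str                  # category, e.g. "Athlete Topics"
    criteria: list[str]        # specific behaviors to measure
    rubric: list[RubricLevel] = field(default_factory=list)


# Illustrative OLI metric, built from the examples above.
athlete_topics = Metric(
    name="Athlete Topics",
    criteria=[
        "The response speaks to dates and times naturally.",
        "The chatbot does not compare athletes.",
    ],
    rubric=[
        RubricLevel(2, "Pass", "The response accurately and clearly speaks to our athlete rules."),
        RubricLevel(1, "Fail", "The response shows no awareness of our programming rules."),
    ],
)
```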

2. The "Who, When and Where": Building the testing narrative

The role of quality assurance is to structure the test plan around real-world usage and measure against the metrics and rubric already defined by the business. That means creating a centralized place to collaborate and track scenarios. 

The shared artifact we devised is called “Query Store,” and it’s where we can see exactly what the AI was asked, under what conditions, and what it should have known.

For OLI, QA constructed testing narratives across categories such as Programming, Athletes, Sports, and so on. Each entry in the store included:

  • The User Prompt: What is the user trying to do? What entities are involved? In the case of OLI, potential queries were:

    • “When is Chloe Kim on?”

    • “When can I watch skiing tomorrow?”

  • The Context: Under what conditions is the question being asked? Time, location, page, user state — all of it matters. In the case of OLI, this could have been:

    • User is asking during a live broadcast window

    • Primetime is on in this user’s timezone now.

  • Reference Data: What should the AI have known? In the case of OLI, possible reference data was:

    • The Winter Olympics figure skating schedule.

    • NBC's broadcast times.

By applying traditional QA rigor—coverage reports, edge cases, adversarial scenarios and regression testing—we moved from “it feels better” to measurable coverage across every intent OLI was expected to handle, allowing us to say: “OLI speaks to programming equality X% of the time.”
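
As a rough sketch, each Query Store entry can be represented as one flat CSV row carrying the prompt, its context, and the reference data, so QA can author scenarios in a spreadsheet. The file name and column headers below are assumptions for illustration, not the real store's schema.

```python
import csv

# Hypothetical Query Store columns: one row per test scenario.
FIELDNAMES = ["category", "user_prompt", "context", "reference_data"]

rows = [
    {
        "category": "Athletes",
        "user_prompt": "When is Chloe Kim on?",
        "context": "User is asking during a live broadcast window.",
        "reference_data": "NBC's broadcast times.",
    },
    {
        "category": "Programming",
        "user_prompt": "When can I watch skiing tomorrow?",
        "context": "Primetime is on in this user's timezone now.",
        "reference_data": "NBC's broadcast times; the relevant event schedule.",
    },
]

# Write the scenarios so any later step can pick them up as a plain CSV.
with open("query_store.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
```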

3. The "How": Building the enabler

Engineering's role shifts from running evaluations to building Osiris, the tool that gives the team a clear path into the process. 

We leveraged “Custom Metrics” to focus on the requirements we cared about, and “Bring Your Own Response” to allow re-evaluations without re-running the AI. That keeps costs predictable and responses versioned, so you can watch your AI improve over time.
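
Here is a minimal sketch of how those two features can be wired together, assuming the `vertexai.evaluation` SDK surface (`EvalTask`, `PointwiseMetric`, `PointwiseMetricPromptTemplate`). The project ID, metric wording, and captured response are placeholders; this illustrates the pattern, not Osiris itself.

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, PointwiseMetric, PointwiseMetricPromptTemplate

vertexai.init(project="your-gcp-project", location="us-central1")  # placeholders

# "Bring Your Own Response": the dataset already holds the assistant's captured
# answers, so re-scoring does not re-run the AI.
dataset = pd.DataFrame(
    {
        "prompt": ["When is Chloe Kim on?"],
        "response": ["She's up during tonight's primetime broadcast on NBC."],
    }
)

# A custom metric assembled from Product's criteria and pass/fail rubric.
athlete_topics = PointwiseMetric(
    metric="athlete_topics",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "natural_dates": "The response speaks to dates and times naturally.",
            "no_comparisons": "The chatbot does not compare athletes.",
        },
        rating_rubric={
            "2": "(Pass) The response accurately and clearly speaks to our athlete rules.",
            "1": "(Fail) The response shows no awareness of our programming rules.",
        },
        input_variables=["prompt"],
    ),
)

eval_result = EvalTask(dataset=dataset, metrics=[athlete_topics]).evaluate()
print(eval_result.metrics_table)  # per-row scores with the judge's explanations
```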

Osiris abstracted away that complexity while keeping each step decoupled and intuitive:

  • Product inputs their criteria and rubrics into a simple interface to create metrics. No code or prompt engineering.

  • QA uploads scenarios from the Query Store and runs them against Product’s metrics. Each step accepts a CSV and returns a CSV, so evaluation layers stay independent and portable (a sketch of this contract follows the list).

  • Osiris automates the heavy lifting: batch-querying the AI, scoring responses, generating explanations for every pass and fail, and producing reports.
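
Below is a minimal sketch of that CSV-in, CSV-out contract. The file names, column headers, and the injected judge callable are assumptions for illustration; in practice the judge could be backed by the evaluation call sketched above.

```python
from typing import Callable, Tuple

import pandas as pd

# A judge takes (prompt, response) and returns (score, explanation).
# Injecting it keeps this step decoupled from any single evaluation backend.
Judge = Callable[[str, str], Tuple[int, str]]


def score_responses(queries_csv: str, responses_csv: str, scored_csv: str, judge: Judge) -> None:
    """One pipeline step: accept CSVs, emit a CSV, share nothing else."""
    queries = pd.read_csv(queries_csv)      # user_prompt, context, reference_data
    responses = pd.read_csv(responses_csv)  # user_prompt, response (captured earlier)

    dataset = queries.merge(responses, on="user_prompt")
    results = [judge(row.user_prompt, row.response) for row in dataset.itertuples()]
    dataset["score"] = [score for score, _ in results]
    dataset["explanation"] = [explanation for _, explanation in results]

    dataset.to_csv(scored_csv, index=False)
```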

By building this layer we enabled each expert to focus on what they do best! From building AI pipelines, to defining requirements, to measuring outputs, the entire team now owns the quality story.

The Result: Building the Habit for the Future

The frameworks we're building now (the metrics, query stores, and evaluation pipelines) are the foundation, not the final product. The teams best positioned to navigate Generative AI's future are the ones who built the right habits early on.

To prepare for what’s next, from self-flagging evaluation systems to adaptive rubrics, you must first master this organizational shift. We achieved this by:

  • Learning to deconstruct AI quality: We broke down the immense complexity of AI evaluation into three manageable and distinct roles: Success Metric, Testing Narrative, and Enablers.

  • Achieving accessibility: We built simple interfaces and CSV pipelines to abstract away the technical “how,” ensuring non-technical team members could own the evaluation process without needing to code.

  • Establishing shared language: We designed a process where the team could speak about AI quality easily, using clear metrics and rubrics that connect business intent directly to measurable outcomes.

What’s coming next is more interesting: evaluation systems that flag their own blind spots, rubrics that adapt as user behavior shifts, and quality benchmarks that feed directly back into model fine-tuning. The loop between “define quality” and “improve the AI” is closing fast.

The teams that will navigate this well aren’t the ones with the most sophisticated tooling. They’re the ones who built the habit early. The ones speaking the same language about what “good” means.

This is the difference between reacting to AI complexity and strategically organizing for it. As AI systems become more central to customer experience, quality can no longer live only in prompts, pipelines, or model settings. It must live in a structured framework that connects business intent, evaluation rigor, and technical enablement.
