Mastering AI Benchmarking: How Reliant AI Ensures Relevant and Reliable Evaluation for Life Sciences

September 10, 2025
Reliant AI Team

What is Benchmarking for AI-Powered Systems? 

Benchmarking is a fundamental process in AI-powered systems: we compare a system’s performance against correct or desired outcomes. Benchmarks take the form of evaluation datasets, and there are many different types depending on what you’re trying to measure and why. They can be used to evaluate entire systems end-to-end or to determine whether minor algorithmic changes are genuine improvements. Benchmarking our system performance lets us identify our current standing, the types of errors we’re making, how other systems perform, and whether our changes have led to improvements.
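
To make this concrete, here is a minimal sketch of what a benchmark run boils down to: scoring a system’s outputs against gold-standard answers and keeping the misses for error analysis. The example questions, answers, and exact-match scoring rule are illustrative assumptions, not a specific Reliant benchmark.

```python
# Minimal sketch: score predictions against a gold-standard evaluation set.
# The questions, answers, and exact-match rule are illustrative, not a real benchmark.

def evaluate(predictions: dict, gold: dict) -> dict:
    """Compare system outputs to gold answers; keep the misses for error analysis."""
    errors = {
        qid: {"predicted": predictions.get(qid), "expected": answer}
        for qid, answer in gold.items()
        if predictions.get(qid) != answer
    }
    accuracy = round(1 - len(errors) / len(gold), 3)
    return {"accuracy": accuracy, "errors": errors}

# Hypothetical gold data and two versions of a system.
gold = {"q1": "EGFR", "q2": "Phase III", "q3": "oral"}
baseline = {"q1": "EGFR", "q2": "Phase II", "q3": "oral"}
improved = {"q1": "EGFR", "q2": "Phase III", "q3": "oral"}

# Re-running the same benchmark before and after a change shows whether it helped.
print(evaluate(baseline, gold)["accuracy"])  # 0.667
print(evaluate(improved, gold)["accuracy"])  # 1.0
```

Running the same evaluation before and after a change is what turns a static dataset into a tool for tracking whether the system is actually improving.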

Many benchmark datasets are publicly available for different AI tasks and model types, including large language models (LLMs). When a new LLM foundation model comes out, the release notes often report how well the model scores on well-known evaluation datasets.

Public leaderboards, such as LMArena, compile different LLMs’ performance across large, general-purpose task datasets. While these are good at covering a broad range of capabilities you would want an LLM to handle (math, reasoning, instruction-following), they are often not sufficient to capture domain-specific abilities or to evaluate functionality that’s unique to a commercial application. As a result, companies also build their own internal benchmark datasets.

At Reliant AI, we have very domain-specific applications in the life sciences. Most publicly available benchmarks are insufficient to capture the nuances of what constitutes a gold-standard answer for our commercial pharma users. Additionally, our product Reliant Tabular isn’t a single AI model but a combination of different models with different purposes, whose capabilities we also want to benchmark independently of the system as a whole. As a result, we take a very conscientious approach to creating high-quality benchmarks that we can trust to guide our product development.

Exploring Benchmarking Datasets: Public vs. Internal Benchmarking

There are two key questions to ask when considering which benchmarking datasets to employ: 

  • How do we make sure the benchmark is relevant?
  • How do we make sure the benchmark is reliable?

Ensuring Relevant Benchmarks

At Reliant AI, we use a combination of internal and external datasets. To ensure these datasets are fit for purpose, we focus on alignment between our machine learning and commercial teams. This means that we need to construct or select benchmarks that are useful to both:

  • They need to be commercially relevant so we can improve our systems in the ways that matter most to users.
  • They need to be actionable for our machine learning engineers, reflecting the different ways they work.

This alignment is critical, and maintaining it requires continuous re-alignment. Our commercial team helps define the types of tasks we want to monitor performance on, tied explicitly to actions a user would take in Reliant Tabular. From there, we iterate with the machine learning team on what type of data is needed to get a sufficient signal about performance.

Once you’ve decided what you want to benchmark, you then have to create it. There are many ways to do this, including human-annotated benchmarks, machine-annotated benchmarks, and various combinations of the two. We use a mix of approaches based on what we want to evaluate, often relying on human annotators to ensure a high level of alignment with user activities.

Ensuring Reliable Benchmarks

Annotating datasets with humans is a notoriously laborious and tricky process, and errors are unavoidable. However, by diligently assessing annotator performance, both among external annotators and against our internal annotators, we can spot difficult tasks where agreement is low. If the annotation task itself is poorly defined, humans will interpret it in different ways, and you end up with a benchmark that is no longer representative of what you had planned.
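
As an illustration of what such an agreement check can look like, the sketch below flags tasks where two annotators’ labels diverge, using Cohen’s kappa as the agreement measure. The task names, labels, and threshold are hypothetical assumptions, not Reliant’s actual annotation tooling.

```python
# Illustrative sketch: flag annotation tasks with low inter-annotator agreement.
# Task names, labels, and the 0.6 threshold are hypothetical.
from sklearn.metrics import cohen_kappa_score

def flag_low_agreement(tasks: dict, threshold: float = 0.6) -> list:
    """Return (task_id, kappa) pairs where two annotators agree less than `threshold`."""
    flagged = []
    for task_id, (labels_a, labels_b) in tasks.items():
        kappa = cohen_kappa_score(labels_a, labels_b)
        if kappa < threshold:
            flagged.append((task_id, round(float(kappa), 2)))
    return flagged

# Hypothetical per-item labels from an external annotator and an internal annotator.
tasks = {
    "drug-target-extraction": (["yes", "no", "yes", "yes"], ["yes", "no", "no", "yes"]),
    "trial-phase-labelling": (["II", "III", "I", "II"], ["II", "III", "I", "II"]),
}

print(flag_low_agreement(tasks))  # tasks that may need clearer annotation guidelines
```

A task that keeps showing up in this list is usually a sign that the guidelines, not the annotators, need attention.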

We spend a lot of time on internal alignment (where do our experts agree, and where do they disagree?) as well as communicating this to the external domain experts we use as our annotators. Being able to trust external annotation means engaging in constant quality control and oversight, work that many organizations prefer to outsource entirely. However, large data annotation providers usually lack the domain expertise in their workforce to create our benchmarks accurately.

This high level of control also allows us to iterate more seamlessly. Often, we will go back to both teams during the creation of a benchmark to ensure everyone is happy with how the data collection is shaping up. This lets the commercial team confirm that the tasks they envisioned are being properly captured, and lets us check that the data quality itself is high. We can then make adjustments to the annotation process as needed.

Once a benchmark has been implemented, the ML team looks at where the system disagrees with the benchmark. Identifying the source of these errors often requires sitting down with our internal domain experts. Having domain expertise in both AI and the life sciences allows us to extract the most benefit from this error analysis, as making the necessary improvements requires knowledge of both.
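
One lightweight way to structure that review is to export each disagreement into a sheet that a domain expert can triage. The sketch below assumes an `errors` mapping like the one in the earlier evaluation sketch; the column names and the empty `error_category` field are illustrative assumptions, not Reliant’s internal process or taxonomy.

```python
# Illustrative sketch: turn benchmark disagreements into a review sheet for domain experts.
# Assumes an `errors` mapping of {question_id: {"predicted": ..., "expected": ...}}.
import csv

def export_for_review(errors: dict, path: str = "disagreements_for_review.csv") -> None:
    """Write each system-vs-benchmark disagreement to a CSV for expert triage."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["question_id", "predicted", "expected", "error_category"]
        )
        writer.writeheader()
        for qid, diff in errors.items():
            writer.writerow({
                "question_id": qid,
                "predicted": diff["predicted"],
                "expected": diff["expected"],
                "error_category": "",  # to be filled in by a domain expert during review
            })
```

A filled-in sheet like this can make it easier to spot recurring error types and decide where improvements are most needed.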

How Reliant AI Builds High-Quality Benchmarks

Benchmarking can become an extremely useful tool in guiding smart product development, ultimately becoming part of your edge. It can also be very costly and labour-intensive to do well, and so requires the support of leadership that understands its strategic importance. At Reliant AI, we’re always looking for ways to improve this process, knowing that our end users will reap the benefits.

Why Reliant AI Invests in High-Quality Benchmarks

Investing in robust benchmarking practices gives us clarity and confidence when making product decisions. High-quality benchmarks serve as a foundation for measuring progress and identifying specific areas for improvement, ensuring that our enhancements have a meaningful impact. For leadership, having accurate and actionable data underscores the value of strategic investments in AI development, supporting long-term vision and innovation.

For our customers, this commitment translates to reliable, high-performing products that address real needs. By prioritizing benchmarks that are relevant and trustworthy, we can better align our capabilities with user goals and expectations. Ultimately, our focus on benchmarking empowers us to deliver tangible value, helping our clients stay ahead in their fields and trust that the solutions they rely on are rigorously validated.

Download Reliant AI's Readiness and Benchmarking Checklist for Life Sciences