AI Benchmarks Just Got Cool: LayerLens Beta is Here
A new platform for the evaluation and benchmarking of frontier models.
Last week, we unveiled a project we’ve been building over the past eight months, aimed at one of the most pressing challenges in generative AI: evaluation and benchmarking. LayerLens is a new platform that lets teams run benchmarks (evals) against frontier models and agents, and also create entirely new evals.
If you’re a data scientist or AI engineer working on generative AI solutions, you’ve likely experienced the difficulty of accurately evaluating and benchmarking models and agents. This problem is becoming increasingly central as evaluation has emerged as the primary method for understanding the capabilities of foundation models. Unlike earlier machine learning systems, which benefited from interpretability tools, frontier models offer limited transparency — making evals our best available tool. And yet, the process remains incredibly challenging.
Some Challenges with AI Benchmarks
If you’ve been following the generative AI space, you’ve probably noticed a new benchmark announced nearly every week. Despite this flurry of activity, evaluating frontier models remains a core unsolved problem. That contradiction is what initially drew me to this space and led to the founding of LayerLens. From first principles, we’ve identified three core challenges:
The Industry Needs More Benchmarks
The rate of innovation in frontier models has vastly outpaced the creation of meaningful benchmarks. As a result, only a handful of benchmarks remain informative as models continue to scale. The lack of robust, quantitative understanding of foundation models’ capabilities is a significant bottleneck to unlocking the next wave of growth in AI.
Benchmarks Need to Get More Practical
Most existing AI benchmarks are optimized for academic scenarios and don’t reflect real-world use cases. If you’re building an agent to interact with your CRM, creative writing capabilities may be irrelevant. The creation of more practical evals is essential to support reliable model evaluation.
Evaluations Must Be Holistic and Simple
No single evaluation fully captures a model’s capabilities. Combining multiple evals offers broader coverage, but requires a consistent framework to run them, review prompt-level results, and stay updated as new evals are released. Right now, that workflow is extremely fragmented and difficult to manage.
While there are other challenges in this domain, most of them are derivatives of the three above. Put simply, the AI industry is overdue for better, more holistic benchmarking tools.
Enter LayerLens
LayerLens is designed to bring transparency to frontier models, starting with evaluation and benchmarking. Our first product, Atlas, enables AI scientists and engineers to comprehensively evaluate models and agents across a wide range of benchmarks. At its core, Atlas provides two major capabilities:
- Benchmark Execution — AI teams can run both public and private benchmarks against public and private models and agents. For example, an enterprise can apply state-of-the-art evals to its own fine-tuned models.
- Eval Creation — The LayerLens platform supports the creation of new evals via synthetic data generation, crowdsourcing, and other techniques. This capability enables practical, use-case-specific evaluations that deliver actionable insights.
Eval Execution
The beta version of Atlas introduces a novel approach to benchmark execution that streamlines how AI teams derive insights from evaluations. Key features include:
1) Evaluation Spaces
Instead of a single leaderboard, Atlas introduces Evaluation Spaces, each of which captures a specific evaluation criterion. For example, one space could compare U.S. versus Chinese frontier models; another might focus on a single benchmark like Humanity’s Last Exam. These spaces offer tremendous flexibility, and the initial release includes over 30 of them — covering themes such as vendor-specific benchmarks, small models, and topic-specific evals like math or code generation.
2) Evaluation Execution
LayerLens users can execute any evaluation against any model registered in the platform. This allows any AI team to rapidly reproduce reported results and validate the performance of target models.
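To make the workflow concrete, here is a minimal sketch of what executing an eval against a model generally involves: iterate over a benchmark’s prompts, query the model, and score the responses. It assumes an OpenAI-compatible chat endpoint and a tiny made-up exact-match benchmark; it illustrates the general pattern, not the Atlas API.

```python
# Minimal sketch of eval execution against an OpenAI-compatible endpoint.
# The benchmark items and exact-match scoring rule are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY; base_url can point at any compatible endpoint

benchmark = [
    {"prompt": "What is 17 * 24?", "reference": "408"},
    {"prompt": "Name the capital of Australia.", "reference": "Canberra"},
]

def run_eval(model: str) -> float:
    """Query the model on every prompt and return the fraction answered correctly."""
    correct = 0
    for item in benchmark:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item["prompt"]}],
            temperature=0,
        )
        answer = response.choices[0].message.content or ""
        correct += int(item["reference"].lower() in answer.lower())
    return correct / len(benchmark)

print(run_eval("gpt-4o-mini"))
```

Rerunning the same eval against a different model is then mostly a matter of swapping the model name or pointing the client’s base_url at another endpoint.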
3) Side-by-Side Model Comparison
Comparing two models remains one of the most essential yet difficult evaluation tasks. Atlas makes this a native feature, enabling users to compare models head-to-head across any supported benchmark.
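For intuition, a head-to-head comparison boils down to a per-benchmark diff of two models’ scores. The snippet below uses made-up numbers purely to illustrate the kind of delta view a side-by-side comparison surfaces; it is not Atlas output.

```python
# Toy side-by-side comparison over per-benchmark scores (numbers are made up).
scores = {
    "model_a": {"GSM8K": 0.91, "MMLU": 0.82, "HumanEval": 0.74},
    "model_b": {"GSM8K": 0.88, "MMLU": 0.85, "HumanEval": 0.79},
}

for benchmark, a in scores["model_a"].items():
    b = scores["model_b"][benchmark]
    leader = "model_a" if a > b else "model_b"
    print(f"{benchmark:>9}: A={a:.2f}  B={b:.2f}  delta={a - b:+.2f}  leader={leader}")
```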
4) Prompt-by-Prompt Results
Most benchmarks only show a final score, leaving out critical insights. With Atlas, users get access to full prompt-by-prompt results, including granular analyses that reveal how models behave across individual test cases.
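One way to picture prompt-by-prompt results: each test case produces a record with the prompt, the model’s output, the reference answer, and a verdict, and the headline score is just an aggregation over those records. The dataclass below is a hypothetical shape for such records, not the Atlas schema.

```python
# Hypothetical shape for prompt-level results; the headline score is derived from them.
from dataclasses import dataclass

@dataclass
class PromptResult:
    prompt: str
    model_output: str
    reference: str
    passed: bool
    category: str  # e.g. "algebra", "sql", "summarization"

def summarize(results: list[PromptResult]) -> dict[str, float]:
    """Overall accuracy plus a per-category breakdown."""
    summary = {"overall": sum(r.passed for r in results) / len(results)}
    for category in {r.category for r in results}:
        subset = [r for r in results if r.category == category]
        summary[category] = sum(r.passed for r in subset) / len(subset)
    return summary
```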
5) Custom Evals
There’s often a gap between academic benchmarks and enterprise needs. Atlas supports the creation of custom evals using user-defined datasets, helping AI teams design evaluations that mirror their practical use cases.
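As a concrete, hypothetical example of a user-defined dataset, a JSONL file of prompt/reference pairs is often enough to bootstrap a custom eval. The loader below sketches that format; the file name and field names are assumptions, not the Atlas ingestion format.

```python
# Sketch: load a custom eval from a JSONL file of {"prompt": ..., "reference": ...}
# records. The file name and field names are illustrative assumptions.
import json
from pathlib import Path

def load_custom_eval(path: str) -> list[dict]:
    items = []
    for line in Path(path).read_text().splitlines():
        record = json.loads(line)
        if not {"prompt", "reference"} <= record.keys():
            raise ValueError(f"malformed eval item: {line!r}")
        items.append(record)
    return items

# Example: an eval that mirrors a CRM ticket-triage workflow.
# items = load_custom_eval("crm_ticket_triage.jsonl")
```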
6) Custom Model Endpoints
In addition to using custom datasets, users can evaluate fine-tuned models or agents via custom endpoints. This unifies public and private model evaluations into a single stack.
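Conceptually, a custom endpoint is just an HTTP interface the evaluation harness can call like any hosted model. The adapter below sketches one possible shape, a POST that returns generated text; the URL, auth header, and JSON fields are illustrative assumptions rather than the LayerLens endpoint contract.

```python
# Hypothetical adapter for a private, fine-tuned model exposed over HTTP.
# The URL, auth header, and JSON fields are assumptions for illustration only.
import requests

def query_custom_endpoint(prompt: str) -> str:
    response = requests.post(
        "https://models.internal.example.com/v1/generate",
        headers={"Authorization": "Bearer <token>"},
        json={"prompt": prompt, "max_tokens": 256},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["text"]
```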
Creating Evaluations
The second core pillar of Atlas is enabling the creation of new evals that close the gap between theoretical benchmarks and real-world use cases. Our first release includes a powerful feature that lets teams create evals from custom document sets. For example, a company can build practical evals based on internal documentation that reflects its business processes.
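To make the document-to-eval idea concrete, one common recipe is synthetic Q&A generation: give a strong model a passage from internal documentation and ask it for a question/answer pair, which then becomes an eval item. The sketch below shows that recipe with the OpenAI chat API; it is one possible approach, not how LayerLens builds evals internally.

```python
# Sketch of synthetic Q&A generation from a chunk of internal documentation.
# The prompt wording and JSON output contract are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

def qa_pair_from_passage(passage: str) -> dict:
    """Ask a strong model for one question/answer pair grounded in the passage."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Write one factual question that can be answered only from the "
                "passage below, then its answer. Respond as JSON with keys "
                "'prompt' and 'reference'.\n\n" + passage
            ),
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```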
Roadmap
In the coming months, we’ll be rolling out an ambitious roadmap that includes:
- Launching new, domain-specific evals
- Delivering a full-stack enterprise evaluation suite
- Expanding into new areas such as agentic evaluation
The Company
LayerLens brings together a team of AI and distributed systems experts from companies like Microsoft, Google, Cisco, Oracle, OpenZeppelin, ConsenSys, and others. We raised a $6.5 million pre-seed round from an amazing group of investors who share our vision of bringing transparency to AI. We’re actively hiring — if you’re passionate about evaluation, benchmarking, and frontier models, come join us.