

Galileo Technologies Inc., which makes tools for observing and evaluating artificial intelligence models, today unveiled Agentic Evaluations, a platform aimed at evaluating the performance of AI agents powered by large language models.
The company said it’s addressing the additional complexity created by agents: software programs with decision-making capabilities that let them plan, reason and execute multistep tasks, adapting to changing environments and contexts with little or no human oversight.
Because agent behavior is situational, developers can struggle to understand when and why failures occur. That hasn’t dampened interest in the technology’s workflow productivity potential. Gartner Inc. expects 33% of enterprise software applications to include agentic AI by 2028, up from less than 1% in 2024.
Agents challenge existing development and testing techniques in new ways. One is that they can choose multiple action sequences in response to a user request, making them unpredictable. Complex agentic workflows are difficult to model and require more complex evaluation. Agents may also work with multiple LLMs, making performance and costs harder to pin down. The risk of errors grows with the size and complexity of the workflow.
Galileo said its Agentic Evaluations provide a full lifecycle framework for system-level and step-by-step evaluation. It gives developers a view of an entire multistep agent process, from input to completion, with tracing and simple visualizations that help developers quickly pinpoint inefficiencies and errors. The platform uses a set of proprietary LLM-as-a-Judge metrics, an evaluation technique that uses LLMs to check and adjudicate tasks, built specifically for developers creating agents.
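In its simplest form, LLM-as-a-Judge means prompting a separate model with a rubric and asking it to grade what an agent did. The sketch below illustrates the general technique only, not Galileo's proprietary metrics; the OpenAI client, model name and rubric are assumptions made for the example.

```python
# Minimal LLM-as-a-Judge sketch (illustrative, not Galileo's implementation).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the judge model and rubric below are placeholders.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading one step of an AI agent. Given the user goal, the tools "
    "available and the step the agent took, answer PASS if the step advances "
    "the goal and FAIL otherwise, followed by a one-sentence reason."
)

def judge_step(goal: str, tools: list[str], step: str) -> str:
    """Ask a judge LLM to grade a single agent step against a rubric."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Goal: {goal}\nTools: {tools}\nStep: {step}"},
        ],
    )
    return response.choices[0].message.content

print(judge_step("Refund order #123", ["lookup_order", "issue_refund"],
                 "Called lookup_order with order_id=123"))
```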
Metrics include an assessment of whether the LLM planner selected the correct tool and arguments, an assessment of errors by individual tools, traces reflecting progress toward the ultimate goal, and whether the final action aligns with the agent’s original instructions. The metrics are between 93% and 97% accurate, the company wrote in a blog post.
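To make the tool-selection metric concrete, here is a hedged sketch of the kind of check it implies, comparing the planner's chosen tool and arguments against a reference. The data model and scoring scheme are assumptions for illustration, not Galileo's method.

```python
# Illustrative tool-selection check: did the planner pick the expected tool
# with the expected arguments? The scoring values below are assumptions.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    arguments: dict

def tool_selection_score(expected: ToolCall, actual: ToolCall) -> float:
    """Return 1.0 for the right tool with the right arguments,
    0.5 for the right tool with wrong arguments, 0.0 otherwise."""
    if actual.name != expected.name:
        return 0.0
    return 1.0 if actual.arguments == expected.arguments else 0.5

expected = ToolCall("issue_refund", {"order_id": 123, "amount": 19.99})
actual = ToolCall("issue_refund", {"order_id": 123, "amount": 19.99})
print(tool_selection_score(expected, actual))  # 1.0
```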
Performance is measured using proprietary, research-based metrics at multiple levels. Developers can choose which LLMs are involved in planning and assess errors in individual tasks.
Aggregate tracking of cost, latency and errors across sessions and spans helps developers keep both spending and response times in check. Alerts and dashboards help identify systemic issues such as failed tool calls or misalignment between actions and instructions, supporting continuous improvement. The platform supports the popular open-source AI frameworks LangGraph and CrewAI.
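The sketch below shows what rolling up cost, latency and errors from per-span measurements to session-level totals can look like in plain Python; the span fields and example values are illustrative assumptions, not Galileo's trace format.

```python
# Sketch of aggregate cost/latency/error tracking across spans in a session.
# Field names and values are illustrative only.
from dataclasses import dataclass

@dataclass
class Span:
    name: str
    latency_ms: float
    cost_usd: float
    error: bool

def summarize(session: list[Span]) -> dict:
    """Roll up per-span measurements into session-level totals."""
    return {
        "total_latency_ms": sum(s.latency_ms for s in session),
        "total_cost_usd": round(sum(s.cost_usd for s in session), 4),
        "error_rate": sum(s.error for s in session) / len(session),
    }

session = [
    Span("plan", 420.0, 0.0031, False),
    Span("search_tool", 880.0, 0.0002, False),
    Span("summarize", 610.0, 0.0045, True),
]
print(summarize(session))
```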
Agentic Evaluations is now available to all Galileo users. The company has raised $68 million, including a $45 million funding round last October.