Advancing methods for agentic evaluations across domains

International Network of Advanced AI Measurement, Evaluation and Science joint testing exercise

Date published:

17 July 2025

Date updated:

1 June 2026

Topics

Publisher

AI Safety Institute

Introduction

The International Network of Advanced AI Measurement, Evaluation and Science conducted this joint testing exercise of AI ‘agents’ to build better, shared ways to test them across different problem areas.

Agents are AI systems that can plan a course of action, use tools and carry out tasks. The problem areas include leakage of sensitive information, fraud and cybersecurity.

Australia contributed to the exercise. The United Kingdom, Singapore and Japan led the exercise alongside AI Safety Institutes and government mandated offices from Canada, the European Union, France, Kenya, South Korea, and the United Kingdom.

We drew on technical expertise from researchers in:

CSIRO’s Data61
Gradient Institute
Harmony Intelligence
Mileva Security Labs
UNSW’s AI Institute.

This exercise aimed to improve how we test AI agents. These systems can take actions on their own in the real world, by planning, choosing steps and using tools. This can create new risks if testing only checks the final answer, instead of checking how the system got there. The main focus was improving testing methods, not ranking models.

These exercises aim to improve our ability to accurately measure AI capabilities and risks. This will help us better identify, understand and manage the risks of AI systems before they cause harm.

Read the report and key learnings

Advancing methods for agentic evaluations across domains: leakage of sensitive information, fraud and cybersecurity threats (AISI.gov.uk)

More information

Read about the Australian AI Safety Institute

Advancing methods for agentic evaluations across domains

Topics

Publisher

Introduction

Read the report and key learnings

More information

Contact us at the department

Connect with us at the department

Acknowledgement of Country

Advancing methods for agentic evaluations across domains

Share

Topics

Publisher

Introduction

Read the report and key learnings

More information

Acknowledgement of Country