Short description: Rethinking AI safety metrics and evaluation: The critical need for more rigorous and robust evaluation
Date: 04.11.2024 | 14.00 - 15.30
Location: Saturn room
Description:
The rapid development of large language models (LLMs) has highlighted the need for more rigorous and standardized benchmarking practices. Current benchmarks often suffer from limitations such as prompt sensitivity and lack of consistency. For instance, Sclar et al. (2023) found that LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points. Alzahrani et al. (2024) further showed that this prompt sensitivity varies across models and that changing the prompt can shuffle benchmark leaderboards. If AI model evaluations are to have meaningful real-world impact, we need a "Science of Evals" that provides greater confidence in evaluation methodology and results.
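As a rough illustration of the prompt-sensitivity problem the session addresses, the sketch below varies only the cosmetic formatting of a few-shot prompt (separators and Q/A labels) and measures the accuracy spread across variants. It is a minimal sketch, not the methodology of any cited work; `query_model` is a hypothetical placeholder for whatever LLM call an evaluation harness would make.

```python
import itertools

def build_prompt(examples, question, sep, qa_labels):
    """Assemble a few-shot prompt using one purely cosmetic formatting variant."""
    q_label, a_label = qa_labels
    shots = sep.join(f"{q_label}{q}\n{a_label}{a}" for q, a in examples)
    return f"{shots}{sep}{q_label}{question}\n{a_label}"

def query_model(prompt):
    """Hypothetical placeholder: swap in a real LLM call here."""
    raise NotImplementedError

def accuracy_per_format(examples, eval_set):
    """Score the same eval set under several formatting variants of the prompt."""
    separators = ["\n\n", "\n---\n"]
    qa_label_sets = [("Q: ", "A: "), ("Question: ", "Answer: ")]
    results = {}
    for sep, labels in itertools.product(separators, qa_label_sets):
        correct = 0
        for question, gold in eval_set:
            prediction = query_model(build_prompt(examples, question, sep, labels))
            correct += int(prediction.strip() == gold)
        results[(sep, labels)] = correct / len(eval_set)
    return results

# The spread (max - min) over results.values() is one simple sensitivity
# metric: a large spread means the benchmark score depends heavily on
# formatting choices rather than on model capability.
```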
Agenda: TBC
Organiser: Tom David - PRISM Eval
Speakers:
- Tom David - PRISM Eval
- Pierre Peigné - PRISM Eval
- Peter Matson - MLCommons