Short description: Rethinking AI safety metrics and evaluation: The critical need for more rigorous and robust evaluation
Date: 04.11.2024 | 14.00 - 15.30
Location: Saturn room
Description:
The rapid development of large language models (LLMs) has highlighted the need for more rigorous and standardized benchmarking practices. Current benchmarks often suffer from limitations such as prompt sensitivity and a lack of consistency. For instance, Sclar et al. (2023) found that LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points. Alzahrani et al. (2024) further showed that this prompt sensitivity varies across models and that changing the prompt can reshuffle benchmark leaderboards. If AI model evaluations are to have meaningful real-world impact, we need a "Science of Evals" that provides greater confidence in evaluation methodology and results. A minimal, hypothetical sketch of how such prompt-format sensitivity might be measured is shown below.
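The sketch below is purely illustrative and not part of the session materials: it runs the same toy evaluation set under several semantically equivalent prompt formats and reports the accuracy spread, in the spirit of the prompt-sensitivity findings cited above. The `query_model` function is a hypothetical stand-in for a real LLM API call.

```python
from statistics import mean

# Hypothetical stand-in for a real model call; replace with an actual LLM API.
def query_model(prompt: str) -> str:
    return "A"  # placeholder answer

# The same few-shot question rendered with different surface formats.
FORMATS = [
    "Question: {q}\nAnswer:",
    "Q: {q}\nA:",
    "{q}\nThe answer is",
]

# Tiny toy eval set: (question, gold answer) pairs.
EVAL_SET = [
    ("Which option is correct? (A) 2+2=4 (B) 2+2=5", "A"),
    ("Which option is correct? (A) The sun is cold (B) The sun is hot", "B"),
]

def accuracy(fmt: str) -> float:
    # Score the model on the eval set using one specific prompt format.
    hits = [query_model(fmt.format(q=q)).strip().startswith(gold) for q, gold in EVAL_SET]
    return mean(hits)

scores = {fmt: accuracy(fmt) for fmt in FORMATS}
spread = max(scores.values()) - min(scores.values())
print(scores)
print(f"Accuracy spread across prompt formats: {spread:.2%}")
```

A large spread across formats would indicate that a reported benchmark score reflects the prompt template as much as the model itself, which is exactly the kind of methodological concern the session addresses.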
Organiser: Tom David - PRISM Eval
Speakers:
- Tom David - PRISM Eval
- Pierre Peigné - PRISM Eval
- Peter Mattson - President of MLCommons