Limitations in AI Safety Evaluations Highlight Need for Better Methods

Despite growing demand for AI safety and accountability, current tests and benchmarks may fall short, as outlined in a recent report. Generative AI models, capable of analyzing and producing text, images, music, and videos, face heightened scrutiny for their tendency to make mistakes and behave unpredictably. Organizations from public sector agencies to tech firms are proposing new benchmarks to test these models’ safety.

Towards the end of last year, startup Scale AI formed a lab dedicated to evaluating model alignment with safety guidelines. This month, NIST and the U.K. AI Safety Institute released tools designed to assess model risk. However, these tests and methods might not be adequate.

Findings from the Ada Lovelace Institute

The Ada Lovelace Institute (ALI), a U.K.-based nonprofit AI research organization, conducted a study involving experts from academic labs, civil society, and model vendors. The study revealed that while current evaluations can be useful, they are non-exhaustive, can be easily manipulated, and may not accurately reflect real-world model behavior.

Elliot Jones, senior researcher at ALI and co-author of the report, emphasized the importance of rigorous testing, comparing it to safety expectations for products like smartphones, prescription drugs, and cars. “Our research aimed to examine the limitations of current approaches to AI safety evaluation, assess how evaluations are currently being used and explore their use as a tool for policymakers and regulators,” Jones said.

Issues with Benchmarks and Red Teaming

The study’s co-authors surveyed academic literature to understand the risks posed by AI models and the current state of AI evaluations. They interviewed 16 experts, including employees at unnamed tech companies developing generative AI systems. The study found significant disagreements within the AI industry on the best methods and taxonomy for evaluating models.

Some evaluations tested only how well models aligned with benchmarks in the lab, not how they would affect real-world users. Others relied on tests developed for research purposes, which are poorly suited to evaluating production models. Experts noted the difficulty of extrapolating a model’s real performance from benchmark results and questioned whether benchmarks can indicate that a model possesses a specific capability at all. For example, a model that performs well on a state bar exam might not handle more open-ended legal challenges effectively.

Data contamination was another concern: if a model has been trained on the same data later used to test it, benchmark results can overestimate its performance. Organizations also often chose benchmarks for convenience rather than suitability, potentially skewing evaluation results.
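To make the contamination concern concrete, the sketch below is purely illustrative and is not drawn from the ALI report; the function names, the word-level n-gram check, and the n-gram length are assumptions. It flags benchmark items whose n-grams already appear in a hypothetical training corpus, which is one crude way such overlap is sometimes screened for.

```python
# Illustrative sketch (not from the ALI report): a crude data-contamination
# check that flags benchmark items whose word n-grams already appear in a
# model's training corpus.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams appearing in a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated_items(benchmark: list[str], training_corpus: list[str],
                       n: int = 8) -> list[str]:
    """Return benchmark items sharing at least one n-gram with the corpus."""
    corpus_ngrams: set[tuple[str, ...]] = set()
    for doc in training_corpus:
        corpus_ngrams |= ngrams(doc, n)
    return [item for item in benchmark if ngrams(item, n) & corpus_ngrams]

if __name__ == "__main__":
    # Hypothetical data: the benchmark question repeats a passage from training.
    train = ["The defendant may appeal the ruling within thirty days of the "
             "judgment being entered in the trial court."]
    bench = ["The defendant may appeal the ruling within thirty days of the "
             "judgment being entered, unless the parties agree otherwise."]
    print(contaminated_items(bench, train))  # flags the overlapping item
```

A real screen would be far more involved (deduplication at scale, fuzzy matching, paraphrase detection), but even this simple overlap test shows why results on a leaked benchmark say little about genuine capability.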

Mahi Hardalupas, a researcher at ALI and study co-author, highlighted the risk of manipulation, whether by developers training models on the same dataset used to evaluate them or by strategically choosing which benchmarks to report. “It also matters which version of a model is being evaluated. Small changes can cause unpredictable changes in behavior and may override built-in safety features,” Hardalupas said.

The study also identified problems with “red-teaming,” in which individuals or groups attack a model to identify vulnerabilities. While companies such as OpenAI and Anthropic use red-teaming, the lack of agreed-upon standards makes it difficult to assess how effective those efforts are. Experts also pointed to the difficulty of finding skilled red-teamers and to the costly, labor-intensive nature of the work, which poses a barrier for smaller organizations.

Potential Solutions

Pressure to release models quickly, along with a reluctance to run tests that could surface problems before launch, is holding back improvements in AI evaluation. “A person we spoke with working for a company developing foundation models felt there was more pressure within companies to release models quickly, making it harder to push back and take conducting evaluations seriously,” Jones said. Major AI labs are releasing models faster than they, or society, can ensure those models are safe and reliable.

Despite one interviewee calling AI safety evaluation an “intractable” problem, Hardalupas believes a path forward exists but requires more public-sector engagement. “Regulators and policymakers must clearly articulate what it is that they want from evaluations,” he said. “Simultaneously, the evaluation community must be transparent about the current limitations and potential of evaluations.”

Hardalupas suggests that governments mandate public participation in the development of evaluations and implement measures to support an ecosystem of third-party tests, including regular access to the models and datasets those tests require. Jones advocates “context-specific” evaluations that consider the kinds of users a model might affect and the ways attacks on the model could play out.

Both point to the need to invest in the underlying science of evaluations, so that more robust and repeatable evaluations can be built on an understanding of how AI models actually operate. Even then, there may never be a guarantee of model safety. “Determining if a model is ‘safe’ requires understanding the contexts in which it is used, who it is sold or made accessible to, and whether the safeguards that are in place are adequate and robust to reduce those risks,” Hardalupas said. Evaluations can identify potential risks, but they cannot guarantee that a model is safe, let alone perfectly safe.
