However, perhaps because it believes that simple benchmarking cannot capture the diversity of today's LLMs, Anthropic recently launched a new program to fund third-party organizations to develop benchmarks that can evaluate the performance of AI models. Anthropic hopes that a more complete form of AI evaluation will become the new industry standard, and the program is now open for applications.
Anthropic's benchmark funding focuses on three main areas: AI safety level assessments, advanced capabilities, and tools for developing model evaluations.
Anthropic hopes the new AI safety benchmarks will help define AI safety levels, including assessing a model's ability to conduct cyberattacks, enhance weapons of mass destruction, and manipulate or deceive humans. For national security-related AI risks, Anthropic is committed to developing an "early warning system" for identifying and assessing them.
In addition to AI safety assessments, Anthropic will also develop ways to evaluate model capabilities, such as designing new end-to-end tasks that test a model's potential to assist scientific research. These benchmarks will focus on AI's ability to combine knowledge across domains, generate novel hypotheses, and carry out long-horizon tasks involving a large number of decisions. Anthropic also hopes the benchmarks will assess capabilities across multiple languages.