Standard LLM benchmarks fail at science, so OpenAI built LifeSciBench
6 min read
AI Benchmarks
OpenAI launched LifeSciBench, a rigorous, expert-reviewed evaluation framework designed to test whether large language models can handle complex, real-world life science and biology research tasks rather than just pass standardized tests....