Evaluation & Testing

Frameworks for evaluating, benchmarking, and testing AI systems

40 repos

ValueByte-AI/Awesome-LLM-in-Social-Science

Awesome papers involving LLMs in Social Science.

darkrishabh/agent-skills-eval

A test runner for agentskills.io-style AI agent skills

★ 620 ◇ 34 TypeScript

Pacific-AI-Corp/langtest

Deliver safe & effective language models

★ 563 ◇ 51 Python

PacificAI/langtest

Deliver safe & effective language models

★ 563 ◇ 51 Python

relari-ai/continuous-eval

Data-Driven Evaluation for LLM-Powered Applications

★ 516 ◇ 38 Python

faiscadev/fakecloud

Free, open-source AWS emulator. LocalStack alternative: 26 services, 1,924 operations, 100% conformance. No account, no auth token, no paid tier.

★ 481 ◇ 33 Rust

rhesis-ai/rhesis

The testing platform for AI teams. Bring engineers, PMs, and domain experts together to generate tests, simulate (adversarial) conversations, and trace every failure to its root cause.

★ 381 ◇ 29 Python

JonathanChavezTamales/llm-leaderboard

A comprehensive set of LLM benchmark scores and provider prices. (deprecated, read more in README)

★ 360 ◇ 40 JavaScript

palico-ai/palico-ai

Build, Improve Performance, and Productionize your LLM Application with an Integrated Framework

★ 342 ◇ 28 TypeScript

ai-dashboad/flutter-skill

AI-powered E2E testing for 10 platforms. 253 MCP tools. Zero config. Works with Claude, Cursor, Windsurf, Copilot. Test Flutter, React Native, iOS, Android, Web, Electron, Tauri, KMP, .NET MAUI — all from natural language.

★ 325 ◇ 45 Dart

PetroIvaniuk/llms-tools

A list of LLMs Tools & Projects

athina-ai/athina-evals

Python SDK for running evaluations on LLM generated responses

★ 301 ◇ 22 Python

testdriverai/testdriverai

Computer-Use SDK for E2E QA Testing

★ 226 ◇ 32 JavaScript

PramodDutta/qaskills

QA Skills is a curated directory of testing-specific skills for AI coding agents (Claude Code, Cursor, Copilot, etc.).

★ 177 ◇ 17 TypeScript

vostride/agent-qa

The self-improving Agentic QA harness with Memory. Write tests in natural language. Catch regressions before releases ship.

★ 161 ◇ 9 TypeScript

blackhaiyu-sudo/spec2case

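Spec2Case is a production-grade AI agent for test case generation. It supports understanding requirements from images and text, human confirmation, LangGraph workflow orchestration, and Excel export of test cases.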

★ 109 ◇ 3 Python