Evaluation & Testing

Frameworks for evaluating, benchmarking, and testing AI systems

40 repos

confident-ai/deepeval

8.2

The LLM Evaluation Framework

★ 16.9k ◇ 1.7k Python

Ragas

7.4

Ragas: an open-source toolkit for evaluating LLM applications, including retrieval-augmented generation (RAG) pipelines.

★ 14.9k ◇ 1.6k Python

NVIDIA/garak

7.7

The LLM vulnerability scanner

★ 8.5k ◇ 1.1k Python

jeinlee1991/chinese-llm-benchmark

6.4

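ReLE Evaluation: capability evaluation of Chinese AI large models (continuously updated). It currently covers 335 large models, including commercial models such as chatgpt, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.5, Wenxin ERNIE-X1.1, ERNIE-5.0-Thinking, qwen3-max, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as kimi-k2, ernie4.5, minimax-M2, deepseek-v3.2, qwen3-2507, llama4, Zhipu GLM-4.6, gemma3, and mistral. Besides the leaderboard, it provides a library of more than 2 million large-model defect cases, so the community can study, analyze, and improve large models.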

★ 6.3k ◇ 257

PacktPublishing/LLM-Engineers-Handbook

6.6

The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices

★ 5.2k ◇ 1.3k Python

EvolvingLMMs-Lab/lmms-eval

7.9

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

★ 4.3k ◇ 623 Python

Agenta-AI/agenta

7.6

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

★ 4.3k ◇ 573 TypeScript

Tencent/AI-Infra-Guard

7.7

A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evaluation.

★ 4.1k ◇ 397 Python

truera/trulens

7.2

Evaluation and Tracking for LLM Experiments and AI Agents

★ 3.4k ◇ 311 Python

lmnr-ai/lmnr

7.0

Laminar - open-source observability platform purpose-built for AI agents. YC S24.

★ 3.1k ◇ 219 TypeScript

BlazeUp-AI/Observal

6.2

Observal is an AI agent registry with a first-in-class observability and eval framework.

★ 2.2k ◇ 456 Python

huggingface/aisheets

6.1

Build, enrich, and transform datasets using AI models with no code

★ 1.6k ◇ 140 TypeScript

cyberark/FuzzyAI

5.4

A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.

★ 1.5k ◇ 212 Jupyter Notebook

ifixai-ai/iFixAi

6.7

The open-source diagnostic for AI misalignment. 32 tests across fabrication, manipulation, deception, unpredictability, and opacity. Provider-agnostic. Runs against OpenAI, Anthropic, Bedrock, Azure, Gemini, and more. Letter grade in under 5 minutes, content-addressed manifest for bit-identical replay. Built by iMe.

★ 1.5k ◇ 193 Python

microsoft/prompty

6.7

Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.

★ 1.2k ◇ 119 TypeScript

bug0inc/passmark

5.9

The open-source Playwright library for AI browser regression testing with intelligent caching, auto-healing, and multi-model verification.

★ 1.2k ◇ 182 TypeScript

cvs-health/uqlm

6.7

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.

★ 1.2k ◇ 127 Python

juanjuandog/FinSight-AI

5.0

AI equity research agent with resilient workflows, Redis Lua single-flight, pgvector RAG, versioned reports, evidence tracing, and RAG evaluation.

★ 1.1k ◇ 60 Java

JudgmentLabs/judgeval

6.6

The open source post-building layer for agents. Our environment data and evals power agent post-training (RL, SFT) and monitoring.

★ 1.0k ◇ 94 Python

MGdaasLab/WHartTest

6.5

WHartTest is an AI-driven test automation platform that automates the generation and management of executable test cases from requirements, helping testing teams improve efficiency and coverage.

★ 959 ◇ 142 Python

langwatch/scenario

6.2

Agentic testing for agentic codebases

★ 920 ◇ 67 TypeScript

benchflow-ai/awesome-evals

4.6

A curated, non-BS library of the best resources for building and evaluating AI agents — papers, blogs, talks, tools, benchmarks. Maintained by BenchFlow.

★ 734 ◇ 63

onejune2018/Awesome-LLM-Eval

4.6

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of foundation LLMs, aimed at exploring the technical boundaries of generative AI.

★ 651 ◇ 80

CopilotKit/aimock

6.7

Mock everything your AI app talks to — LLM APIs, MCP, A2A, vector DBs, search. One package, one port, zero dependencies.

★ 650 ◇ 46 TypeScript
