Evaluation & Testing
Frameworks for evaluating, benchmarking, and testing AI systems
confident-ai/deepeval | 8.0 | ★ 15.9k | ◇ 1.5k | Python
Ragas | 7.5 | ★ 14.2k | ◇ 1.5k | Python
NVIDIA/garak | 7.5 | ★ 8.0k | ◇ 989 | Python
jeinlee1991/chinese-llm-benchmark | 6.3 | ★ 6.1k | ◇ 247
PacktPublishing/LLM-Engineers-Handbook | 6.7 | ★ 5.1k | ◇ 1.2k | Python
EvolvingLMMs-Lab/lmms-eval | 7.5 | ★ 4.2k | ◇ 597 | Python
Agenta-AI/agenta | 7.8 | ★ 4.2k | ◇ 531 | TypeScript
Tencent/AI-Infra-Guard | 7.4 | ★ 3.8k | ◇ 375 | Python
truera/trulens | 7.3 | ★ 3.4k | ◇ 284 | Python
lmnr-ai/lmnr | 6.9 | ★ 3.0k | ◇ 203 | TypeScript
BlazeUp-AI/Observal | 5.9 | ★ 1.9k | ◇ 327 | Python
huggingface/aisheets | 6.2 | ★ 1.6k | ◇ 141 | TypeScript
cyberark/FuzzyAI | 5.5 | ★ 1.5k | ◇ 203 | Jupyter Notebook
microsoft/prompty | 6.7 | ★ 1.2k | ◇ 116 | TypeScript
cvs-health/uqlm | 6.6 | ★ 1.2k | ◇ 125 | Python
JudgmentLabs/judgeval | 6.7 | ★ 1.0k | ◇ 92 | Python
MGdaasLab/WHartTest | 6.2 | ★ 925 | ◇ 123 | Python
bug0inc/passmark | 5.7 | ★ 897 | ◇ 163 | TypeScript
langwatch/scenario | 5.9 | ★ 893 | ◇ 65 | Python
juanjuandog/FinSight-AI | 5.1 | ★ 842 | ◇ 56 | Java
onejune2018/Awesome-LLM-Eval | 4.8 | ★ 639 | ◇ 70
ValueByte-AI/Awesome-LLM-in-Social-Science | 4.8 | ★ 624 | ◇ 47
CopilotKit/aimock | 6.3 | ★ 614 | ◇ 40 | TypeScript
darkrishabh/agent-skills-eval | 5.3 | ★ 560 | ◇ 28 | TypeScript