Evaluation & Testing
Frameworks for evaluating, benchmarking, and testing AI systems
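As a rough illustration of the kind of scoring loop these frameworks automate, here is a minimal exact-match evaluation harness. All names (`EvalCase`, `exact_match_score`, `toy_model`) are hypothetical, not taken from any listed project; real frameworks such as those below add richer metrics, LLM-as-judge scoring, and reporting on top of this basic pattern.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One test case: an input prompt and the expected answer."""
    prompt: str
    expected: str


def exact_match_score(model_fn: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases where the model output exactly
    matches the expected answer (after stripping whitespace)."""
    hits = sum(
        1 for c in cases if model_fn(c.prompt).strip() == c.expected.strip()
    )
    return hits / len(cases)


# Stub "model" standing in for a real LLM call.
def toy_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


cases = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
]
print(exact_match_score(toy_model, cases))  # → 1.0
```

Exact match is the simplest possible metric; the projects in this list generalize it to semantic similarity, faithfulness, relevancy, and safety checks.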
| Repository | Score | Stars | Forks | Language |
|---|---|---|---|---|
| confident-ai/deepeval | 7.8 | 14.8k | 1.4k | Python |
| Ragas | 7.7 | 13.4k | 1.4k | Python |
| NVIDIA/garak | 7.3 | 7.5k | 877 | HTML |
| jeinlee1991/chinese-llm-benchmark | 6.2 | 5.9k | 236 | |
| PacktPublishing/LLM-Engineers-Handbook | 6.7 | 4.9k | 1.2k | Python |
| Agenta-AI/agenta | 7.7 | 4.0k | 508 | TypeScript |
| EvolvingLMMs-Lab/lmms-eval | 7.5 | 4.0k | 560 | Python |
| Tencent/AI-Infra-Guard | 7.3 | 3.5k | 345 | Python |
| truera/trulens | 7.3 | 3.2k | 262 | Python |
| lmnr-ai/lmnr | 6.9 | 2.8k | 191 | TypeScript |
| huggingface/aisheets | 6.2 | 1.6k | 136 | TypeScript |
| cyberark/FuzzyAI | 5.6 | 1.3k | 188 | Jupyter Notebook |
| microsoft/prompty | 6.8 | 1.2k | 114 | TypeScript |
| cvs-health/uqlm | 6.6 | 1.1k | 119 | Python |
| JudgmentLabs/judgeval | 6.7 | 1.0k | 90 | Python |
| langwatch/scenario | 5.9 | 834 | 58 | TypeScript |
| onejune2018/Awesome-LLM-Eval | 5.0 | 631 | 55 | |
| ValueByte-AI/Awesome-LLM-in-Social-Science | 5.0 | 609 | 46 | |
| PacificAI/langtest | 6.1 | 555 | 49 | Python |
| Pacific-AI-Corp/langtest | 6.1 | 555 | 49 | Python |
| relari-ai/continuous-eval | 4.7 | 516 | 38 | Python |
| JonathanChavezTamales/llm-leaderboard | 4.8 | 361 | 40 | JavaScript |
| CopilotKit/aimock | 5.7 | 343 | 21 | TypeScript |
| palico-ai/palico-ai | 4.5 | 342 | 28 | TypeScript |