STACKQUADRANT

Inference Engines

High-performance model inference and serving runtimes

37 repos

llama.cpp

8.0

LLM inference in C/C++, with quantized GGUF models and broad CPU/GPU backend support.

103.7k stars · 16.9k forks · C++

nomic-ai/gpt4all

7.2

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.

77.3k stars · 8.3k forks · C++

vLLM

8.6

A high-throughput and memory-efficient inference and serving engine for LLMs.

76.6k stars · 15.6k forks · Python

ray-project/ray

8.6

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

42.1k stars · 7.4k forks · Python

gitleaks/gitleaks

8.2

Find secrets with Gitleaks 🔑

26.0k stars · 2.0k forks · Go

liguodongiot/llm-action

6.7

This project shares the technical principles behind large language models along with hands-on experience (LLM engineering and putting LLM applications into production).

24.0k stars · 2.8k forks · HTML

Lightning-AI/litgpt

7.9

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.

13.3k stars · 1.4k forks · Python

bentoml/OpenLLM

7.4

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

12.3k stars · 805 forks · Python
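Several servers in this list (OpenLLM, vLLM, shimmy) expose an "OpenAI-compatible" API, meaning clients talk to them with the same request shape as OpenAI's Chat Completions endpoint. A minimal sketch of that request body, using only the standard library — the model name and port are hypothetical examples, and actually sending the request would require a running server:

```python
import json

# Request body an OpenAI-compatible server accepts at POST /v1/chat/completions.
# "llama-3.1-8b-instruct" is a hypothetical model name, not a guaranteed deployment.
payload = {
    "model": "llama-3.1-8b-instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what an inference engine does."},
    ],
    "temperature": 0.7,
    "max_tokens": 128,
}

body = json.dumps(payload)
# Sending it would look roughly like:
#   requests.post("http://localhost:3000/v1/chat/completions", data=body,
#                 headers={"Content-Type": "application/json"})
# which needs a live server, so it is omitted here.
```

Because the wire format is shared, the same client code can be pointed at any of these backends by changing only the base URL.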

mistralai/mistral-inference

6.9

Official inference library for Mistral models

10.8k stars · 1.0k forks · Jupyter Notebook

openvinotoolkit/openvino

8.2

OpenVINO™ is an open source toolkit for optimizing and deploying AI inference

10.1k stars · 3.2k forks · C++

Tiiny-AI/PowerInfer

6.8

High-speed Large Language Model Serving for Local Deployment

9.3k stars · 561 forks · C++

bentoml/BentoML

8.0

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

8.6k stars · 950 forks · Python

InternLM/lmdeploy

7.5

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

7.8k stars · 684 forks · Python

katanemo/plano

7.4

Plano is an AI-native proxy server and data plane for agentic apps - centralizing orchestration, safety, observability, and smart LLM routing so you can deliver agents faster.

6.3k stars · 399 forks · Rust

algorithmicsuperintelligence/openevolve

6.8

Open-source implementation of AlphaEvolve

6.0k stars · 950 forks · Python

flashinfer-ai/flashinfer

7.5

FlashInfer: Kernel Library for LLM Serving

5.4k stars · 896 forks · Python

kserve/kserve

7.7

Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes

5.3k stars · 1.4k forks · Go

xlite-dev/Awesome-LLM-Inference

6.6

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

5.1k stars · 360 forks · Python

FellouAI/eko

7.3

Eko (Eko Keeps Operating) - Build Production-ready Agentic Workflow with Natural Language - eko.fellou.ai

4.9k stars · 436 forks · TypeScript

gpustack/gpustack

7.0

Performance-optimized AI inference on your GPUs. Unlock superior throughput by selecting and tuning engines like vLLM or SGLang.

4.8k stars · 497 forks · Python

Michael-A-Kuykendall/shimmy

6.2

⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary. FREE now, FREE forever.

4.0k stars · 343 forks · Rust

ruvnet/RuVector

6.7

RuVector is a high-performance, real-time, self-learning vector graph neural network and database built in Rust.

3.8k stars · 464 forks · Rust

predibase/lorax

6.1

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

3.7k stars · 312 forks · Python
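The multi-LoRA pattern LoRAX implements is worth making concrete: one base model stays resident, and each request names the fine-tuned adapter to apply. A sketch of what two such requests might look like, assuming a TGI-style /generate endpoint with an adapter selection parameter — the endpoint details and adapter ids below are illustrative assumptions, not verified API:

```python
import json

# Two requests to the same hypothetical multi-LoRA server: the base model is
# shared, and each request selects a different fine-tuned adapter.
# The "acme/..." adapter ids are made-up examples.
request_a = {
    "inputs": "Classify the sentiment: great product!",
    "parameters": {"adapter_id": "acme/sentiment-lora", "max_new_tokens": 16},
}
request_b = {
    "inputs": "Translate to French: good morning",
    "parameters": {"adapter_id": "acme/translate-lora", "max_new_tokens": 32},
}

# Only the adapter differs between the two payloads, which is what lets one
# deployment serve thousands of fine-tunes without loading a full model each.
bodies = [json.dumps(r) for r in (request_a, request_b)]
```

Since LoRA adapters are small relative to the base weights, swapping them per request is cheap, which is the core of the "scales to 1000s of fine-tuned LLMs" claim.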