Inference Engines

High-performance model inference and serving runtimes

37 repos

Stars Score Name

algorithmicsuperintelligence/optillm

6.6

Optimizing inference proxy for LLMs

★ 4.1k ◇ 355 Python

predibase/lorax

6.9

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

★ 3.8k ◇ 316 Python

neuralmagic/deepsparse

5.9

Sparsity-aware deep learning inference runtime for CPUs

★ 3.2k ◇ 192 Python

spiceai/spiceai

7.0

A portable accelerated SQL query, search, and LLM-inference engine, written in Rust, for data-grounded AI apps and agents.

★ 2.9k ◇ 197 Rust

b4rtaz/distributed-llama

6.2

Distributed LLM inference. Connect home devices into a powerful cluster to accelerate LLM inference. More devices means faster inference.

★ 2.9k ◇ 232 C++

FasterDecoding/Medusa

5.4

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

★ 2.7k ◇ 201 Jupyter Notebook

ovg-project/kvcached

5.8

Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond

★ 1.1k ◇ 118 Python

nobodywho-ooo/nobodywho

6.2

NobodyWho is an inference engine that lets you run LLMs locally and efficiently on any device.

★ 944 ◇ 66 Rust

zhihu/ZhiLight

5.3

A highly optimized LLM inference acceleration engine for Llama and its variants.

★ 905 ◇ 102 C++

jjang-ai/mlxstudio

5.3

MLX Studio - Home of JANG_Q - Image Gen/Edit + Chat/Code all in one + OpenClaw (Anthropic API)

★ 763 ◇ 49

andrewkchan/yalm

3.7

Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O

★ 584 ◇ 62 C++

zjhellofss/KuiperLLama

4.0

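A good project for campus recruiting (autumn and spring hiring rounds) and internships: build, hands-on and from scratch, a large-model inference framework that supports LLama2/3 and Qwen2.5.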

★ 548 ◇ 142 C++

interestingLSY/swiftLLM

3.8

A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).

★ 329 ◇ 38 Python
