STACKQUADRANT

Inference Engines

High-performance model inference and serving runtimes

37 repos

algorithmicsuperintelligence/optillm

6.6

Optimizing inference proxy for LLMs

4.1k · 355 · Python

predibase/lorax

6.9

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

3.8k · 316 · Python

neuralmagic/deepsparse

5.9

Sparsity-aware deep learning inference runtime for CPUs

3.2k · 192 · Python

spiceai/spiceai

7.0

A portable accelerated SQL query, search, and LLM-inference engine, written in Rust, for data-grounded AI apps and agents.

2.9k · 197 · Rust

b4rtaz/distributed-llama

6.2

Distributed LLM inference. Connect home devices into a powerful cluster to accelerate LLM inference. More devices means faster inference.

2.9k · 232 · C++

FasterDecoding/Medusa

5.4

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

2.7k · 201 · Jupyter Notebook

ovg-project/kvcached

5.8

Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond

1.1k · 118 · Python

nobodywho-ooo/nobodywho

6.2

NobodyWho is an inference engine that lets you run LLMs locally and efficiently on any device.

944 · 66 · Rust

zhihu/ZhiLight

5.3

A highly optimized LLM inference acceleration engine for Llama and its variants.

905 · 102 · C++

jjang-ai/mlxstudio

5.3

MLX Studio: home of JANG_Q. Image generation/editing plus chat/code, all in one, with OpenClaw (Anthropic API)

763 · 49

andrewkchan/yalm

3.7

Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O

584 · 62 · C++

zjhellofss/KuiperLLama

4.0

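A good project for campus, autumn, and spring recruiting and for internships: it walks you through building, from scratch, an LLM inference framework that supports LLama2/3 and Qwen2.5.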

548 · 142 · C++

interestingLSY/swiftLLM

3.8

A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).

329 · 38 · Python