STACKQUADRANT

Inference Engines

High-performance model inference and serving runtimes

37 repos

lemonade-sdk/lemonade

7.0

Lemonade helps users discover and run local AI apps by serving optimized LLMs right from their own GPUs and NPUs. Join our Discord: https://discord.gg/5xXzkMu8Zk

3.5k stars · 261 forks · C++

algorithmicsuperintelligence/optillm

6.5

Optimizing inference proxy for LLMs

3.4k stars · 268 forks · Python

neuralmagic/deepsparse

6.1

Sparsity-aware deep learning inference runtime for CPUs

3.2k stars · 190 forks · Python
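The speedup behind sparsity-aware runtimes like DeepSparse comes from skipping zero weights entirely. A toy sketch of that idea (not DeepSparse's actual kernels): store the weight matrix in CSR form so the matrix-vector product only touches nonzeros.

```python
# Toy illustration of sparsity-aware inference: compress a weight matrix
# to CSR (values, column indices, row pointers) and compute y = A @ x
# while visiting only the nonzero entries.

def to_csr(dense):
    """Compress a dense row-major matrix into CSR arrays."""
    values, cols, row_ptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0:
                values.append(w)
                cols.append(j)
        row_ptr.append(len(values))
    return values, cols, row_ptr

def csr_matvec(values, cols, row_ptr, x):
    """Matrix-vector product that skips all zero weights."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for i in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[i] * x[cols[i]]
        y.append(acc)
    return y

A = [[0, 2, 0], [1, 0, 0], [0, 0, 3]]  # two thirds of the weights are zero
y = csr_matvec(*to_csr(A), [1, 2, 3])
print(y)  # [4, 1, 9]
```

With a highly pruned model, the inner loop runs over a small fraction of the original weights, which is why sparsity can pay off on CPUs where memory bandwidth dominates.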

b4rtaz/distributed-llama

6.3

Distributed LLM inference. Connect home devices into a powerful cluster to accelerate LLM inference; more devices mean faster inference.

2.9k stars · 225 forks · C++

spiceai/spiceai

6.9

A portable accelerated SQL query, search, and LLM-inference engine, written in Rust, for data-grounded AI apps and agents.

2.9k stars · 185 forks · Rust

FasterDecoding/Medusa

5.4

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

2.7k stars · 197 forks · Jupyter Notebook

zhihu/ZhiLight

5.5

A highly optimized LLM inference acceleration engine for Llama and its variants.

904 stars · 102 forks · C++

ovg-project/kvcached

5.6

Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond

852 stars · 98 forks · Python
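The core idea of a virtualized, elastic KV cache can be sketched as a shared block pool that sequences borrow from as they grow and return on completion, letting GPU memory be shared dynamically. A toy Python sketch; the pool size, block size, and class names are illustrative, not kvcached's API:

```python
# Toy elastic KV cache: fixed-size blocks come from a shared pool, a
# sequence allocates a new block every BLOCK_TOKENS tokens, and all of
# its blocks return to the pool when the sequence finishes.

BLOCK_TOKENS = 4

class BlockPool:
    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class SeqCache:
    """Per-sequence KV cache that grows one block at a time."""
    def __init__(self, pool):
        self.pool, self.blocks, self.tokens = pool, [], 0

    def append_token(self):
        if self.tokens % BLOCK_TOKENS == 0:  # current block is full (or none yet)
            self.blocks.append(self.pool.alloc())
        self.tokens += 1

    def finish(self):
        self.pool.release(self.blocks)
        self.blocks = []

pool = BlockPool(n_blocks=8)
seq = SeqCache(pool)
for _ in range(9):               # 9 tokens -> ceil(9/4) = 3 blocks
    seq.append_token()
in_use = len(seq.blocks)
print(in_use, len(pool.free))    # 3 blocks in use, 5 free
seq.finish()
print(len(pool.free))            # all 8 blocks back in the pool
```

The point of block-level allocation is that no sequence reserves memory for its maximum length up front, so many sequences of different lengths can share one fixed budget.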

nobodywho-ooo/nobodywho

6.2

NobodyWho is an inference engine that lets you run LLMs locally and efficiently on any device.

790 stars · 55 forks · Rust

andrewkchan/yalm

3.8

Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O

570 stars · 59 forks · C++

zjhellofss/KuiperLLama

4.1

A great project for campus, autumn, and spring recruiting and internships: build an LLM inference framework from scratch, with support for Llama2/3 and Qwen2.5.

527 stars · 137 forks · C++

jjang-ai/mlxstudio

4.8

MLX Studio: home of JANG_Q. Image generation/editing plus chat/code in one app, with OpenClaw (Anthropic API).

477 stars · 32 forks

interestingLSY/swiftLLM

3.9

A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).

323 stars · 37 forks · Python