STACKQUADRANT

omlx

jundot/omlx
7.8

LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar

Model Serving
15.7k · 1.3k · Python · Apache-2.0 · today

TensorRT-LLM

NVIDIA/TensorRT-LLM
7.3

TensorRT-LLM — a leading open-source project in the AI/LLM ecosystem.

Model Serving
13.8k · 2.4k · Python · NOASSERTION · today

vllm-omni

vllm-project/vllm-omni
7.3

A framework for efficient model inference with omni-modality models

Model Serving
4.9k · 1.0k · Python · Apache-2.0 · today

Olares

beclab/Olares
7.0

Olares: An Open-Source Personal Cloud to Reclaim Your Data

Model Serving
4.6k · 266 · Go · AGPL-3.0 · today

Deep-Learning-in-Production

ahkarami/Deep-Learning-in-Production
4.5

In this repository, I will share some useful notes and references about deploying deep learning-based models in production.

Model Serving
4.4k · 687 · 1y ago

LightLLM

ModelTC/LightLLM
6.5

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Model Serving
4.1k · 332 · Python · Apache-2.0 · today

AI-Infra-from-Zero-to-Hero

HuaizhengZhang/AI-Infra-from-Zero-to-Hero
6.2

🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSys, etc. 🗃️ Llama3, Mistral, etc. 🧑‍💻 Video Tutorials.

Model Serving
4.1k · 393 · MIT · 10mo ago

chitu

thu-pacman/chitu
6.8

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

Model Serving
3.1k · 266 · Python · Apache-2.0 · today

ramalama

containers/ramalama
7.5

RamaLama is an open-source developer tool that simplifies the local serving of AI models from any source and facilitates their use for inference in production, all through the familiar language of containers.

Model Serving
2.9k · 340 · Python · MIT · 1d ago

inference

roboflow/inference
7.3

Turn any computer or edge device into a command center for your computer vision projects.

Model Serving
2.3k · 269 · Python · NOASSERTION · today

envd

tensorchord/envd
6.9

🏕️ Reproducible development environment for humans and agents

Model Serving
2.2k · 168 · Go · Apache-2.0 · 12d ago

vllm-ascend

vllm-project/vllm-ascend
7.2

Community maintained hardware plugin for vLLM on Ascend

Model Serving
2.2k · 1.3k · C++ · Apache-2.0 · today

aici

microsoft/aici
4.9

AICI: Prompts as (Wasm) Programs

Model Serving
2.1k · 84 · Rust · MIT · 1y ago

sie

superlinked/sie
6.6

Superlinked Inference Engine is an open-source inference server and production cluster for embeddings, reranking, and extraction.

Model Serving
2.0k · 177 · Python · Apache-2.0 · 4d ago

mlrun

mlrun/mlrun
7.2

MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.

Model Serving
1.7k · 305 · Python · Apache-2.0 · today

kitops

kitops-ml/kitops
6.9

An open source DevOps tool from the CNCF for packaging and versioning AI/ML models, datasets, code, and configuration into an OCI Artifact.

Model Serving
1.3k · 170 · Go · Apache-2.0 · today

hopsworks

logicalclocks/hopsworks
5.8

Hopsworks - Data-Intensive AI platform with a Feature Store

Model Serving
1.3k · 158 · Java · AGPL-3.0 · 1y ago

rtp-llm

alibaba/rtp-llm
6.0

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

Model Serving
1.2k · 204 · Cuda · Apache-2.0 · today

truss

basetenlabs/truss
6.7

The simplest way to serve AI/ML models in production

Model Serving
1.2k · 107 · Python · MIT · today

Nanoflow

efeslab/Nanoflow
4.7

A throughput-oriented high-performance serving framework for LLMs

Model Serving
962 · 49 · Jupyter Notebook · 2mo ago

mosec

mosecorg/mosec
6.5

A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine

Model Serving
900 · 72 · Python · Apache-2.0 · 2d ago

model_server

openvinotoolkit/model_server
6.5

A scalable inference server for models optimized with OpenVINO™

Model Serving
880 · 253 · C++ · Apache-2.0 · today

pipeless

pipeless-ai/pipeless
4.9

An open-source computer vision framework to build and deploy apps in minutes

Model Serving
850 · 52 · Rust · Apache-2.0 · 2y ago

Yatai

bentoml/Yatai
6.2

Model Deployment at Scale on Kubernetes 🦄️

Model Serving
845 · 76 · TypeScript · NOASSERTION · 4d ago

ServerlessLLM

ServerlessLLM/ServerlessLLM
5.9

Serverless LLM Serving for Everyone.

Model Serving
685 · 73 · Python · Apache-2.0 · 29d ago

timber

kossisoroyce/timber
5.5

Ollama for classical ML models. AOT compiler that turns XGBoost, LightGBM, scikit-learn, CatBoost & ONNX models into native C99 inference code. One command to load, one command to serve. 336x faster than Python inference.

Model Serving
682 · 23 · Python · NOASSERTION · 1mo ago

fastapi-ml-skeleton

eightBEC/fastapi-ml-skeleton
4.6

FastAPI Skeleton App to serve machine learning models production-ready.

Model Serving
601 · 91 · Python · Apache-2.0 · 4mo ago

pinferencia

underneathall/pinferencia
4.7

Python + Inference - Model Deployment library in Python. Simplest model inference server ever.

Model Serving
544 · 83 · Python · Apache-2.0 · 3y ago

ome

ome-projects/ome
6.0

Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, TensorRT-LLM, and Triton

Model Serving
461 · 81 · Go · Apache-2.0 · today

JetStream

AI-Hypercomputer/JetStream
4.9

JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).

Model Serving
442 · 65 · Python · Apache-2.0 · 4mo ago

xFasterTransformer

intel/xFasterTransformer
4.4

xFasterTransformer — open-source AI/LLM project.

Model Serving
435 · 75 · C++ · Apache-2.0 · 8mo ago

gpu-rest-engine

NVIDIA/gpu-rest-engine
3.7

A REST API for Caffe using Docker and Go

Model Serving
423 · 93 · C++ · BSD-3-Clause · 7y ago

stable-diffusion-deploy

Lightning-Universe/stable-diffusion-deploy
4.6

Learn to serve Stable Diffusion models on cloud infrastructure at scale. This Lightning App shows load-balancing, orchestrating, pre-provisioning, dynamic batching, GPU-inference, micro-services working together via the Lightning Apps framework.

Model Serving
391 · 39 · Python · Apache-2.0 · 2y ago

pmetal

Epistates/pmetal
5.1

PMetal: high-performance Apple Silicon framework for local LLM inference, LoRA/QLoRA fine-tuning, serving, quantization, and MLX/Metal acceleration.

Model Serving
293 · 20 · Rust · NOASSERTION · 26d ago

podman-desktop-extension-ai-lab

containers/podman-desktop-extension-ai-lab
5.9

Work with LLMs on a local environment using containers

Model Serving
291 · 82 · TypeScript · Apache-2.0 · today

TurboOCR

aiptimizer/TurboOCR
5.1

Fast GPU OCR server. 270 img/s on FUNSD. TensorRT FP16, PP-OCRv5, HTTP + gRPC.

Model Serving
284 · 35 · C++ · MIT · 9d ago

BMW-YOLOv4-Inference-API-GPU

BMW-InnovationLab/BMW-YOLOv4-Inference-API-GPU
4.1

This is a repository for a no-code object detection inference API using the YOLOv3 and YOLOv4 Darknet framework.

Model Serving
279 · 67 · Python · BSD-3-Clause · 3y ago

BMW-YOLOv4-Inference-API-CPU

BMW-InnovationLab/BMW-YOLOv4-Inference-API-CPU
3.9

This is a repository for a no-code object detection inference API using YOLOv4 and YOLOv3 with OpenCV.

Model Serving
219 · 58 · Python · NOASSERTION · 3y ago