inference-optimization

ksm26 / Efficiently-Serving-LLMs

Learn how to serve Large Language Models (LLMs) efficiently. Covers optimization techniques including KV caching and Low-Rank Adaptation (LoRA), with hands-on practice using LoRAX, Predibase’s inference server framework.

text-generation batch-processing server-optimization model-serving model-acceleration inference-optimization optimization-techniques machine-learning-operations deep-learning-techniques model-inference-service performance-enhancement scalability-strategies serving-infrastructure large-scale-deployment

Updated Apr 12, 2024
Jupyter Notebook

Harly-1506 / Faster-Inference-yolov8

Faster inference YOLOv8: Optimize and export YOLOv8 models for faster inference using OpenVINO and NumPy 🔢

opencv image-processing torch segmentation object-detection numpy-arrays openvino inference-optimization openvino-toolkit numpy-implementation ultralytics yolov8

Updated Dec 8, 2024
Python

grazder / template.cpp

[WIP] A template for getting started writing code using GGML

deep-learning cpp inference-optimization ggml

Updated May 1, 2024
C++

amazon-science / llm-rank-pruning

LLM-Rank: A graph-theoretical approach to structured pruning of large language models based on weighted PageRank centrality, as introduced in the accompanying paper.

pagerank graph-theory pruning inference-optimization weighted-pagerank large-language-models llm llms

Updated Nov 29, 2024
Python

Bisonai / ncnn

Modified inference engine for quantized convolution using product quantization

quantization product-quantization edge-machine-learning inference-optimization mobile-deep-learning inference-acceleration

Updated Jul 1, 2022
C++

effrosyni-papanastasiou / constrained-em

A constrained expectation-maximization algorithm for feasible graph inference.

network-inference expectation-maximization feasibility expectation-maximisation-algorithm inference-optimization

Updated Jun 10, 2021
Jupyter Notebook

sjlee25 / batch-partitioning

Batch Partitioning for Multi-PE Inference with TVM (2020)

deep-learning data-parallelism tvm inference-optimization dl-optimization dl-compiler

Updated Dec 17, 2022
Python

EZ-Optimium / Optimium

Your AI Catalyst: inference backend to maximize your model's inference performance

raspberry-pi arm deep-learning neural-network runtime amd intel inference inference-engine tensorflow-lite inference-optimization mediapipe ai-compiler

Updated Dec 10, 2024
C++

ccs96307 / fast-llm-inference

Accelerating LLM inference with techniques like speculative decoding, quantization, and kernel fusion, focusing on implementing state-of-the-art research papers.

acceleration inference-optimization large-language-models speculative-decoding

Updated Dec 9, 2024
Python

zhliuworks / Fast-MobileNetV2

🤖️ Optimized CUDA Kernels for Fast MobileNetV2 Inference

cuda-kernels mobilenet-v2 inference-optimization

Updated Dec 28, 2021
Cuda

Here are 36 public repositories matching this topic...

google / XNNPACK

alibaba / BladeDISC

jiazhihao / TASO

imedslab / pytorch_bn_fusion

mit-han-lab / inter-operator-scheduler

ZFTurbo / Keras-inference-time-optimizer

Rapternmn / PyTorch-Onnx-Tensorrt

BaiTheBest / SparseLLM

keli-wen / AGI-Study

lmaxwell / Armednn

ksm26 / Efficiently-Serving-LLMs

Harly-1506 / Faster-Inference-yolov8

grazder / template.cpp

amazon-science / llm-rank-pruning

Bisonai / ncnn

effrosyni-papanastasiou / constrained-em

sjlee25 / batch-partitioning

EZ-Optimium / Optimium

ccs96307 / fast-llm-inference

zhliuworks / Fast-MobileNetV2
