CS20B1060
Suraj Rajendran
Meghana Manvitha Venna
Valeo India Private Limited, Chennai
Dr. Sivaselvan B
Dept. of CSE, IIITDM Kancheepuram
Table of Contents
• Introduction
• Problem Definition
• Literature Survey
• Contributions/ Work Done
• Analysis/ Results
• Conclusion
• Future Work
Weekly Review Report
• Approval letter from the Internal Guide for doing the final project under the guidance of industry personnel.
Introduction
• Valeo specializes in developing Advanced Driver Assistance Systems (ADAS) such as automated parking systems and enhanced automated driving systems.
• The current algorithm framework builds on the OpenMMLab framework [1], extending it to incorporate multiple decoder heads as required by the client (a minimal sketch follows at the end of this list).
• These objectives include the detection of pedestrians, identification of free space, recognition of hatched markings, detection of kerbs, Bird's Eye View (BEV) space polygon object detection, slot detection, generation of height maps, etc.
• During my internship with the Model Deployment and Optimization team, I worked on deep learning model deployment and optimization using the QNN SDK, LIDAR output visualization, and research on quantization techniques for improved inference of LLMs.
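To make the multi-head design concrete, here is a minimal sketch of a shared backbone feeding multiple decoder heads (shapes, layer sizes, and head names are illustrative assumptions, not the actual Valeo/OpenMMLab code):

import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    # Shared backbone with one decoder head per task, as in a multi-task setup
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        # Illustrative heads; the real network also has pedestrian,
        # kerb, hatched-marking, slot-detection heads, etc.
        self.heads = nn.ModuleDict({
            "freespace": nn.Conv2d(32, 1, 1),  # per-pixel free-space mask
            "heightmap": nn.Conv2d(32, 1, 1),  # per-pixel height estimate
        })

    def forward(self, x):
        feats = self.backbone(x)  # run the shared backbone once
        return {name: head(feats) for name, head in self.heads.items()}

outs = MultiHeadNet()(torch.randn(1, 3, 64, 64))
print({k: tuple(v.shape) for k, v in outs.items()})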
Problem Definition
• Utilizing the Qualcomm Neural SDK for model optimization and edge inference
• Performed on-device inference of multi-task neural networks. This involved model quantization, conversion, and optimization. Conducted profiling and runtime analysis of the MTL network using QNN SDK tools to analyze Qualcomm hardware workload and resource utilization.
• Texas Instruments - LIDAR live demo setup
• As part of this project, I worked on real-time visualization of hardware outputs from the LIDAR setup, interpreting the 3D point-cloud inputs loaded on the hardware and running the algorithmic binary to produce synchronized outputs.
• Optimized edge inference for LLMs - quantization techniques
• Conducted research on quantization techniques for deploying LLMs on edge devices and optimizing LLM inference, exploring different techniques along with current industrial standards.
• Contributed to building an automation pipeline to run inference and generate evaluation KPIs for the various decoders in the current multi-task learning network.
Literature Survey
• Explored several libraries suitable for visualizing 3D LIDAR data, including Mayavi, Open3D, and PyVista. After a comprehensive assessment, opted for Open3D for its robust point-cloud visualization features and flexibility for customization.
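A minimal sketch of the kind of usage this choice enables, assuming a placeholder file name and an illustrative height-based coloring (not the actual Valeo setup):

import numpy as np
import open3d as o3d

# Load a LIDAR point cloud; "scan.pcd" is a placeholder path
pcd = o3d.io.read_point_cloud("scan.pcd")  # also reads .ply, .xyz, ...

# Color points by height (z) so ground vs. obstacles are easy to tell apart
z = np.asarray(pcd.points)[:, 2]
z_norm = (z - z.min()) / (np.ptp(z) + 1e-9)
colors = np.stack([z_norm, 1.0 - z_norm, np.zeros_like(z_norm)], axis=1)
pcd.colors = o3d.utility.Vector3dVector(colors)

# Open an interactive viewer window
o3d.visualization.draw_geometries([pcd])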
• Investigated various quantization techniques outlined in the AIMET paper to
understand their applicability and effectiveness.
• Explored the functionalities offered by the Qualcomm Neural SDK, analyzing
their implications for runtime analysis and model optimization.
• Examined open-source frameworks like llama.cpp, Hugging Face libraries like
Quanto, and popular repositories such as TheBloke to gain insights into the
functionality of quantized Large Language Models (LLMs).
• Researched industrial quantization methods for LLMs from NVIDIA and Hugging Face documentation.
Contributions/ Work Done
Quantization of LLMs
● Models like GPT-3 175B occupy a staggering 326 GB of memory [2] even when stored in the more compact float16 format. This imposes challenges not only during training but also during inference.
● There is growing interest in efficient quantization techniques that represent weights and activations in lower-precision formats.
● It is important to achieve this without sacrificing accuracy significantly.
● Quantization helps in running LLMs on edge devices: reduced latency, enhanced user experience, and improved user privacy, since data processing occurs locally.
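A quick back-of-the-envelope check of the figure above, counting weights only (activations and the KV cache are ignored):

# 175e9 parameters at different precisions; float16 reproduces ~326 GiB
PARAMS = 175e9
for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>7}: {PARAMS * bits / 8 / 2**30:6.0f} GiB")
# prints: float32: 652 GiB, float16: 326 GiB, int8: 163 GiB, int4: 81 GiB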
Contributions/ Work Done
Quantization strategies

Affine Quantization
• Scale and zero point are dependent on the bit width (b); see the sketch after this list
• Calculated post-training of the LLM

Activation-aware Weight Quantization (AWQ)
• Not all weights in the model carry equal significance; a small fraction of weights disproportionately influences the output.
• By selectively preserving as little as 1% of these critical weights, AWQ effectively minimizes quantization error.
• AWQ's innovation lies in basing its search for optimal per-channel scaling on activation distributions, not weights.
Contributions/ Work Done
Building the HTP Eval Automation pipeline
• Model JSON configuration
  • Selection of backbone architectures
  • Enabled decoder heads
  • Chosen loss function
• Data JSON configuration
  • Images marked for inference
  • Annotations
  • Calibration files
• Send images to the Qualcomm hardware
• The QNN SDK is used to execute inference on the hardware with the model binary; results are sent back to the server
• Extraction of key performance indicators (KPIs) by comparing outputs (see the sketch after this list)
  • Torch outputs
  • Hardware outputs
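A hypothetical sketch of the KPI-extraction step; the decoder names, file paths, and choice of metrics are illustrative, not the actual pipeline API:

import numpy as np

def kpis(ref, hw):
    # Compare the reference Torch output with the hardware output
    diff = np.abs(ref - hw)
    cos = float(ref.ravel() @ hw.ravel() /
                (np.linalg.norm(ref) * np.linalg.norm(hw) + 1e-12))
    return {"max_abs_err": float(diff.max()),
            "mean_abs_err": float(diff.mean()),
            "cosine_sim": cos}

# One comparison per enabled decoder head (names are placeholders)
for head in ["pedestrian", "freespace", "kerb"]:
    ref = np.load(f"torch_outputs/{head}.npy")
    hw = np.load(f"hw_outputs/{head}.npy")
    print(head, kpis(ref, hw))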
Analysis/ Results
• Texas Instruments - LIDAR live demo setup
Conclusion
• Throughout my internship, I have explored different aspects of Advanced Driver Assistance Systems (ADAS). This experience has involved conducting runtime analyses of numerous deep learning networks and addressing challenges associated with various decoder heads. My tasks have ranged from LIDAR visualization to network conversion, deployment, and analysis using the QNN SDK.
• Additionally, I engaged in valuable research on LLM quantization strategies, exploring different types and considering factors such as when to use each technique, their advantages and disadvantages, and how data and computational constraints influence the choice of quantization method.
• This internship has been instrumental in introducing me to a previously unfamiliar aspect of deep learning: deploying inference onto hardware, optimizing it, and conducting comprehensive analyses.
Future Work
• The Qualcomm Neural SDK (QNN SDK) offers diverse options like
intermediate output generation, layer-wise and block-wise
quantization, custom quantization override configurations, and custom
op package design. This presents extensive avenues for exploration
and optimization within the SDK.
• In the field of LLM quantization, tools such as llama.cpp, LM Studio, and the Quanto library by Hugging Face, along with repositories like TheBloke, offer significant potential for further research, innovation, and performance optimization in the realm of quantized LLMs (a minimal example follows below).
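As one concrete starting point, a minimal sketch of weight-only quantization with the Quanto library mentioned above (the model choice and 4-bit setting are illustrative assumptions):

from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint4

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
quantize(model, weights=qint4)  # replace Linear weights with 4-bit versions
freeze(model)                   # materialize the quantized weights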
References
1. K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, "MMDetection: OpenMMLab detection toolbox and benchmark," 2019.
2. W. Huang, Y. Liu, H. Qin, Y. Li, S. Zhang, X. Liu, M. Magno, and X. Qi, "BiLLM: Pushing the limit of post-training quantization for LLMs," 2024.
3. E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, "GPTQ: Accurate post-training quantization for generative pre-trained transformers," 2023.
4. J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, "AWQ: Activation-aware weight quantization for LLM compression and acceleration," 2024.
Thank You
Any Questions?