On Approximate Computing Techniques


Approximate

Computing
Techniques

Sparsh Mittal

IIT Hyderabad, India


Acronyms

• AC = approximate computing
• ACT = AC technique
• FP = floating point
• NN = neural network
• NPU = neural processing unit
• QoR = quality of result
• FU = functional unit
• SIMD = single-instruction, multiple-data
• NVM = non-volatile memory
• MLC = multi-level cell
• ISA = instruction set architecture
• SSIM = structural similarity

2
Introduction to AC

• Exact computation or peak-level service demands => require large amounts of resources
• Selective approximation or occasional violation of specifications => provides disproportionate gains in efficiency

• Examples: 50X energy saving for 5% accuracy loss in k-means clustering
• NN-based approximation can accelerate an inverse kinematics app by 26X for 5% QoR loss

3
Introduction to AC

• AC exploits the gap between the level of accuracy required by the apps/users and that provided by the computing system

• Building acceptable or “good-enough” systems out of unreliable/inaccurate hardware and software

4
Terminology

Ideas synonymous with (or similar to) AC:
Dynamic effort-scaling, quality programmability/configurability, variable accuracy

Approximable portions are called ...
Soft slices/computations, best-effort computations, non-critical (vs. crucial), error-resilient/tolerant, error-acceptable, tunable, having a ‘forgiving nature’, and being ‘optional’ (versus guaranteed/mandatory)

5
Scope and motivation for AC
Let us see where the scope for AC arises and
what the motivations for using it are.

6
1. Inherent scope/need for AC

• AC is unavoidable with noisy inputs, limited data precision (floating point), HW defects, sudden additional load, or hard real-time constraints

• No unique answer may exist, but many are acceptable, e.g., in a search engine

• For a complex problem whose exact solution is not known, an approximate solution may be efficient and sufficient

7
2. Error-resilience of programs and users

• Perceptual limitations of humans
– PSNR > 30 dB is considered acceptable
• Non-critical portions in apps
– 98% of FP operations are approximable in a 3D raytracer
– In clustering, only relative distances matter, not absolute distances
– In document analysis, processing of common words such as “a”, “of”, “the” can be avoided

• Redundancy due to spatial/temporal correlation

8
2. Error-resilience of programs and users

• In iterative refinement algorithms, running extra iterations with reduced precision of intermediate computations can still provide the same QoR
– On an FPGA, this approach allows greater parallelism under a fixed area constraint

9
3. Efficiency optimization and quality configurability

• AC can improve performance and energy by relaxing accuracy
– Reducing DRAM refresh rate or SRAM supply voltage => reduced memory energy
• AC can alleviate scalability bottlenecks, allow early loop termination and memory access skipping, and improve yield

• AC provides knobs to trade off quality against efficiency
• Spend only as much effort (e.g., energy) as the QoR requirement demands
– E.g., program versions of different quality

10
Example of quality-effort tradeoff

(Figures: PSNR loss due to errors in a single bit at various positions; maximum % errors in neural approximators for different nodes in the hidden layer. Li et al. TCAD'15)

11
Challenges of AC

12
Limited application domain and gains of AC

• Many apps can't be approximated, e.g., cryptography
• Many approximations hold only in a limited range
• E.g., approximating the sin function with an NN over an unbounded range is intractably hard

Function     Approximate function    Range where applicable
y = 1/x      y = 2.823 – 1.882x      0.5 ≤ x ≤ 1
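A quick numeric check of the table entry (a minimal C sketch; it just sweeps the valid range and one out-of-range point):

#include <stdio.h>
#include <math.h>

/* Linear approximation of y = 1/x, valid only on 0.5 <= x <= 1 */
static double approx_recip(double x) { return 2.823 - 1.882 * x; }

int main(void) {
    double max_err = 0.0;
    for (double x = 0.5; x <= 1.0; x += 0.001) {
        double err = fabs(approx_recip(x) - 1.0 / x);
        if (err > max_err) max_err = err;
    }
    printf("max abs error on [0.5, 1]: %f\n", max_err);
    /* Outside the range the error grows quickly, e.g. at x = 2 */
    printf("error at x = 2: %f\n", fabs(approx_recip(2.0) - 0.5));
    return 0;
}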

13
Correctness issues

• Approximating control flow => the program may not terminate (e.g., an unsolvable equation) or may crash
• ACTs may produce corrupt output that is undetectable by the quality metric
– E.g., using output size as the metric for approximate compression

• AC may interfere with synchronization and memory ordering and lead to non-deterministic output
• These errors are difficult to debug; avoiding them may require comparison with a single-threaded exact execution

14
Tradeoff between quality and energy saving

(Figure: errors vs. energy for four approximation settings, one marked acceptable. Credit: A. Sampson and others)


15
Blind approximation can lead to…

Credit: A. Sampson and others


16
Finding right strategy and quality metric

• Need to carefully choose approximable portions and the approximation strategy
• The AC strategy needs to be found on a per-app basis
• Uniform approximation => unacceptable loss

• Need to choose an accurate and lightweight quality metric
• Carefully monitor output to ensure QoR is acceptable; otherwise, repeat execution with precise parameters
• Apart from the average error, the worst-case error needs to be bounded to maintain high QoR

17
Large overheads

• Voltage scaling requires voltage shifters
• Analog NN implementations require conversion of signals between the digital and analog domains

• Need to write multiple approximate versions of a program or annotate source code by hand => not scalable

• Some ACTs require ISA extensions => not applicable to existing platforms

18
ACT must work across the stack

User
Application
Algorithm
Language
Compiler
Architecture
Circuits
Devices

Working at only one level provides limited benefits

Credit: A. Sampson and others


19
The first task in any ACT is

Finding Approximable Portions

20
1. Error-injection (automatic)

Introduce errors in a variable, then check QoR:
– QoR acceptable => the variable is approximable
– QoR unacceptable => the variable is non-approximable

• Similarly, a compiler can explore multiple algorithms with different accuracy levels (a minimal sketch of error injection follows)
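A minimal sketch of this flow in C (the kernel, the injected variable, and the 0.01 error threshold are all illustrative assumptions):

#include <stdio.h>
#include <math.h>
#include <string.h>

/* Hypothetical kernel: scales an array by a gain factor */
static void kernel(float gain, const float *in, float *out, int n) {
    for (int i = 0; i < n; i++) out[i] = gain * in[i];
}

/* Flip one bit of a float to model a single-bit error */
static float flip_bit(float v, int bit) {
    unsigned u; memcpy(&u, &v, sizeof u);
    u ^= 1u << bit;
    memcpy(&v, &u, sizeof v);
    return v;
}

int main(void) {
    enum { N = 1024 };
    float in[N], exact[N], approx[N];
    for (int i = 0; i < N; i++) in[i] = (float)i / N;

    kernel(1.5f, in, exact, N);

    /* Inject an error into the 'gain' variable and measure QoR loss */
    for (int bit = 0; bit < 23; bit++) {       /* mantissa bits only */
        kernel(flip_bit(1.5f, bit), in, approx, N);
        double err = 0;
        for (int i = 0; i < N; i++) err += fabs(approx[i] - exact[i]);
        err /= N;
        printf("bit %2d: mean error %g -> %s\n", bit, err,
               err < 0.01 ? "approximable" : "non-approximable");
    }
    return 0;
}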
21
2. User tells (manual)

• For some apps, the user may have insights into which variables/portions are approximable
– LSBs of FP numbers
– Background portions of an image
– Variables having no impact on the output

22
Monitoring and ensuring
output quality

is essential to guarantee that the result is acceptable

23
Output quality monitoring

• Monitoring QoR is important to ensure acceptable results

• For several problems, finding a solution is hard, but checking solution quality is easy, e.g., Boolean satisfiability
• Modules for checking QoR can be lightweight
– e.g., simple prediction approaches such as linear estimation, moving average, and decision trees
– Is the solution within bounds? Is the equation solvable?

24
Strategies for monitoring and ensuring QoR

• Some strategies for monitoring QoR:
1. Check precise and approximate app versions on random inputs
2. Interpolate past executions with similar inputs to estimate the output of the current execution
3. Use programmer-supplied checking modules
• Some strategies for ensuring QoR:
1. Compute QoR; if QoR is unacceptable, invoke precise re-execution or increase precision in the next iteration
2. Estimate QoR under an ACT; if QoR is unacceptable, invoke the precise code (see the sketch below)
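A minimal C sketch of ensuring-strategy 2, reusing the 1/x approximation from earlier (the residual-based checker and the 0.1 threshold are illustrative assumptions):

#include <stdio.h>
#include <math.h>

/* Hypothetical precise and approximate versions of the same kernel */
static double solve_precise(double x) { return 1.0 / x; }
static double solve_approx(double x)  { return 2.823 - 1.882 * x; }

/* Lightweight checker: accept if the residual of y ~= 1/x is small */
static int qor_acceptable(double x, double y) {
    return fabs(x * y - 1.0) < 0.1;
}

double solve(double x) {
    double y = solve_approx(x);
    if (qor_acceptable(x, y))
        return y;               /* cheap path, quality verified */
    return solve_precise(x);    /* fall back to precise re-execution */
}

int main(void) {
    printf("%f %f\n", solve(0.7), solve(3.0)); /* approx path, precise path */
    return 0;
}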

25
Some quality metrics used for different
approximable applications/kernels

26
After identifying approximable portions, the next step is

Conveying info about approximable portions to the software/compiler

27
1. Using type qualifiers
@approx int a = ...;
int p;     // precise by default
p = a;     // not allowed (illegal): approximate cannot flow into precise
a = p;     // allowed: precise can flow into approximate

Example: Sobel filter

@approx float grad(@approx float[3][3] p) { .. }
@approx float dst[][];

void edgeDetection(aImage &src, aImage &dst)
{
  for (int y = …) {
    for (int x = …) {
      dst[x][y] = grad(window(src, x, y));
}}}
Credit: A. Sampson et al.
28
2. Using OpenMP-style code annotations
Example: Gaussian filter
#pragma omp parallel
{
#pragma omp accurate                        // the kernel is accurate ...
#pragma omp for
for (i = K/2; i < (IMG_M - K/2); ++i) {     // iterate over image
  for (j = K/2; j < (IMG_N - K/2); ++j) {
    float sum = 0; int ii, jj;
    for (ii = -K/2; ii <= K/2; ++ii) {      // iterate over kernel
      for (jj = -K/2; jj <= K/2; ++jj) {
        float data = in[i+ii][j+jj]; float result;
        float coef = coeffs[ii+K/2][jj+K/2];
        #pragma omp approximate error_significance_threshold(20)
        {                                   // ... except this portion:
          result = data * coef;             // ignore errors in 20 LSBs
          sum += result;
        }
      }
    }
    out[i][j] = sum / scale;
  }
}
}
Rahimi et al. CODES'13
29
We are all set! Now, let's actually use

Approximation Strategies

30
Some approximation strategies

1. Precision scaling
2. Loop perforation
3. Load-value approximation
4. Memoization
5. Task dropping
6. Memory access skipping
7. Data sampling
8. Program versions of different accuracy
9. Inexact hardware
10. Voltage scaling
11. (e)DRAM refresh rate reduction
12. Inexact reads/writes
13. Reducing divergence in GPUs
14. Lossy compression
15. Use of NNs
31
Explanation of some strategies

• Memory access skipping: skip some memory accesses or fetch only the MSBs
• Memoization: reuse previously stored results for similar (but not identical) functions/inputs (see the sketch after this list)
– Reuse the result of an instruction across different parallel lanes of a SIMD architecture
– Used in scatter/gather and map computations
• Program versions of different accuracy may be supplied by the user
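A minimal C sketch of approximate memoization: inputs are quantized so that similar (not identical) inputs hit the same cached entry (the quantization step, table size, and example function are illustrative assumptions):

#include <stdio.h>
#include <math.h>

#define TBL 256

static double cache_val[TBL];
static int    cache_set[TBL];

/* Quantize the input to a table index; nearby inputs map to one slot */
static int slot(double x) {
    int i = (int)(x * 32.0);        /* 1/32 input resolution */
    return ((unsigned)i) % TBL;
}

static double expensive(double x) { return sin(x) * exp(-x); }

static double memoized(double x) {
    int s = slot(x);
    if (!cache_set[s]) {            /* miss: compute and remember */
        cache_val[s] = expensive(x);
        cache_set[s] = 1;
    }
    return cache_val[s];            /* hit: reuse a similar input's result */
}

int main(void) {
    printf("%f\n", memoized(1.000));
    printf("%f\n", memoized(1.010)); /* reuses the 1.000 result */
    return 0;
}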

32
Load-value approximation (LVA)

(Diagram) Core, data cache, LVA module, and main memory:
1. A load for address K misses in the data cache
2. The LVA module estimates K_approx and sends it to the core
3. K_exact arrives from main memory
4. K_exact is used to train the LVA module

• Only predict loads causing the largest fraction of misses
• Can offset both latency and bandwidth constraints
• Avoids pipeline flush and recovery overheads
• For GPUs, leverage value locality to predict for only a few threads in a warp and reuse this value for the remaining threads (a minimal model follows)
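A minimal software model of LVA with a last-value predictor (a real LVA module sits in hardware beside the cache; the table size and names here are illustrative):

#include <stdio.h>

/* Per-load-PC predictor: remember the last value the load returned */
#define ENTRIES 64
static float last_value[ENTRIES];

/* On a cache miss, return the predicted value immediately ... */
static float lva_predict(unsigned pc) {
    return last_value[pc % ENTRIES];
}

/* ... and train the predictor once the real value arrives from memory */
static void lva_train(unsigned pc, float actual) {
    last_value[pc % ENTRIES] = actual;
}

int main(void) {
    unsigned pc = 0x400b12;         /* address of some hot load */
    lva_train(pc, 3.14f);           /* earlier miss resolved to 3.14 */
    printf("predicted: %f\n", lva_predict(pc)); /* next miss: use 3.14 */
    return 0;
}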

33
Loop perforation

• Skipping some iterations of a loop
• Original:    for (i = 0; i < N; i++) {...}
• Perforated:  for (i = 0; i < N; i += p) {...}

• Applicable to several global computational patterns, e.g., Monte Carlo simulation, iterative refinement, and search-space enumeration (a fuller sketch follows)
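A minimal C sketch of perforating a reduction, with the result rescaled to compensate for the skipped iterations (rescaling is one common fix-up, not part of every perforation scheme):

#include <stdio.h>

int main(void) {
    enum { N = 1000, P = 4 };       /* run every P-th iteration */
    double data[N], sum = 0;
    for (int i = 0; i < N; i++) data[i] = i * 0.001;

    for (int i = 0; i < N; i += P)  /* perforated loop */
        sum += data[i];
    sum *= P;                       /* extrapolate to the full loop */

    printf("approx sum: %f (exact: %f)\n", sum, 0.001 * N * (N - 1) / 2.0);
    return 0;
}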

34
Precision scaling (changing bit-width)

• Leaves some bit slices in the data path idle => power-gate them
• Allows packing multiple data items into the same word and processing them with a single FU
• Allows using smaller and faster FUs for some operations (a minimal sketch follows)
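A minimal C sketch of precision scaling in software: masking off mantissa LSBs of a float (hardware schemes instead power-gate the corresponding bit slices):

#include <stdio.h>
#include <string.h>

/* Keep only the top 'keep' mantissa bits of a float (23 max) */
static float truncate_mantissa(float v, int keep) {
    unsigned u, mask = ~((1u << (23 - keep)) - 1);
    memcpy(&u, &v, sizeof u);
    u &= mask;
    memcpy(&v, &u, sizeof v);
    return v;
}

int main(void) {
    float x = 3.14159265f;
    for (int keep = 23; keep >= 4; keep -= 4)
        printf("%2d mantissa bits: %.7f\n", keep, truncate_mantissa(x, keep));
    return 0;
}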

35
Precision Scaling

• Increases spatial and temporal locality => increases coverage of memoization; allows using a lookup table for FP add/multiply
• Turns some FP operations into trivial ones => no FU needed

(Figures: examples of trivial cases; factors increasing trivialization in an image-processing app. Yeh et al. MICRO'07)


36
Sampling

(Figures: use of sampling in a MapReduce job, where only 4 map tasks (1, 2, 4, and 6) are executed, processing 10 input data items (Goiri et al. ASPLOS'15); quality increasing with increasing data size (Chippa et al. DAC'13))

37
Algorithm-level approximation strategies

• Early termination: instead of running to full convergence, stop when the improvement between successive iterations is small (see the sketch below)

• Convergence-based pruning: tasks/inputs for which no change/improvement is seen in the last K iterations are eliminated from further iterations
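A minimal C sketch of early termination, using Newton's method for square root; the tolerance is the quality knob:

#include <stdio.h>
#include <math.h>

/* Stop when successive iterates improve by less than 'tol' */
static double approx_sqrt(double a, double tol) {
    double x = a, prev;
    do {
        prev = x;
        x = 0.5 * (x + a / x);      /* Newton step */
    } while (fabs(x - prev) > tol); /* early termination knob */
    return x;
}

int main(void) {
    printf("loose : %f\n", approx_sqrt(2.0, 1e-2)); /* fewer iterations */
    printf("tight : %f\n", approx_sqrt(2.0, 1e-12));
    return 0;
}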

38
Creating approximate program versions

• Skip atomic operations that cause frequent conflicts and thread serialization; keep those that don't cause conflicts
• Fuse adjacent threads not sharing data into one and replicate that thread's output to the other threads

(Figure: the exact version runs iterations i, i+1, i+2, ... sequentially; relaxing dependencies runs iterations i through i+k concurrently, then i+k+1 through i+2k, and so on.)

39
Scalable-effort design

• Observation: only some tasks are hard => spending equal effort on all tasks is wasteful

• Handwriting recognition: only 5-30% of instances are hard

• Spend only as much effort as each task requires

Venkataramani et al. DAC'15


40
Approximate approach: input tasks → approx accelerator 1 → QoR check → approx accelerator 2 → QoR check → … → approx accelerator k → QoR check → precise accelerator → commit
Most tasks can be done by the approximate accelerators; only a few tasks require the precise accelerator

Traditional approach: input tasks → precise accelerator → commit

41
DRAM refresh rate reduction

Programmer: critical/non-critical declaration (object level)
Runtime system: critical/non-critical allocation (page level)
OS: virtual-to-physical address mapping, separating critical and non-critical pages
Approximate DRAM: normal refresh for critical pages, low refresh for non-critical pages

Liu et al. ASPLOS'11
42


Value similarity based on spatial locality

Stencil pattern (a), (b), (c):
(a): the value at the center of a tile can approximate all neighboring values
(b) and (c): the values of one column/row can approximate those of the other columns/rows

Samadi et al. ASPLOS'14


43
Voltage scaling (in SRAM)

• Reducing SRAM supply voltage saves leakage energy but increases the probability of read disturbance and incorrect writes

• Store critical data in fault-free blocks and non-critical data in faulty blocks

• Too many faults in a block => power-gate it

44
Inexact reads/writes in NVMs

1. Low read current => erroneous reads
2. Short sensing duration and large read current => read disturbance (i.e., bit flips during reads)
3. Lowering the write duration, the write current, or both => unsuccessful writes
• Use these knobs during reads/writes to control quality

(Figure: energy vs. error probability for an STT-RAM bit-cell. Ranjan et al. DAC'15)
45
Inexact writes in NVMs

• In PCM (phase change memory), if the existing data are almost the same as the new data => cancel the write operation and keep the existing data as the new data (a minimal model follows)
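A minimal software model of this write-cancellation policy (the tolerance and data type are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>

/* Skip the (expensive) NVM write if old and new values are close enough */
static void approx_write(int *cell, int new_val, int tol) {
    if (abs(*cell - new_val) <= tol)
        return;                     /* cancel write; keep existing data */
    *cell = new_val;                /* values differ too much: real write */
}

int main(void) {
    int cell = 100;
    approx_write(&cell, 101, 2);    /* skipped: 101 is close to 100 */
    printf("%d\n", cell);           /* still 100 */
    approx_write(&cell, 150, 2);    /* performed */
    printf("%d\n", cell);           /* now 150 */
    return 0;
}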

46
Inexact reads/writes in MLC NVMs

• Perform approximate writes by reducing the number of programming pulses when writing MLC NVM

(Figure: an approximate MLC NVM configuration that reduces energy in iterative MLC writes)

• Store approximate data in blocks that have exhausted their error-correction resources
• Preferentially correct MSBs, then LSBs

Sampson et al. MICRO'13
47
Using faulty/inexact hardware

• On each write, circularly shift the data so that the LSBs fall in the faulty cells of the memory
• This reduces the error magnitude, unlike ECC, which corrects errors

• Use inexact adders (e.g., ignoring the carry), multipliers, cosine transforms, etc.

Ganapathy et al. DAC'15


48
Using novel approximation-aware ISAs

• Allows specifying the desired accuracy in the ISA itself => allows applying AC to a larger portion of the app

• Provide processing elements of different accuracy

• Divide processor components into two groups:
– Data movement/processing (e.g., cache, register file, FU, load-store queue) => approximable
– Instruction control (instruction fetching, decoding) => precise

49
Neural-network based acceleration

• Observation: NNs expose parallelism and allow efficient hardware implementation
• They can be trained to mimic several computations
• They are error-resilient

• Replace a kernel with approximate NN-based code (see the sketch below)
50
Reducing branch divergence in SIMD

• Use of approximate NN-based code:
1. Characterize the data-dependent control flow of the app
2. Identify the kernels responsible for the largest performance loss due to branch divergence
3. Train NNs to approximate these kernels
4. Inject the NNs into the code itself by substituting the kernels

• Branch herding: find the outcome taken by the majority of threads and force all threads to follow it (a minimal model follows)

Branch divergence: threads in a SIMD group diverging due to if/else. Grigorian et al. TACO'15, Sartori et al. TMM'13
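A minimal software model of branch herding over one warp (warp size and data are illustrative):

#include <stdio.h>

enum { WARP = 8 };

int main(void) {
    float x[WARP] = {1, 2, -1, 3, 4, 5, -2, 6};
    int votes = 0;

    /* Count how many lanes would take the branch ... */
    for (int i = 0; i < WARP; i++)
        if (x[i] > 0) votes++;

    /* ... and force all lanes onto the majority path (herding) */
    int taken = (votes * 2 > WARP);
    for (int i = 0; i < WARP; i++)
        printf("lane %d follows %s path\n", i, taken ? "taken" : "not-taken");
    return 0;
}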
51
Neural-network based acceleration

(Diagram: a program with its approximable portion highlighted)
1. Find an approximable program portion
2. Train an NN to mimic it
3. Implement the NN on a neural processing unit (NPU)
Credit: Sampson et al.
52
Neural Algorithmic Transformation

(Diagram: source codes Code1 … Code6 are lowered to a common intermediate representation, then to a neural representation; acceleration targets are CPU and NPU.)

Esmaeilzadeh et al. MICRO'12


53
Implementing NN-based code

NPU implementation options:
Digital: CPU, GPU, FPGA, digital ASIC => lower implementation cost, higher communication/computation overhead
Analog: FPAA, analog ASIC => lower energy, higher performance, lower accuracy

FPAA = field programmable analog array
Esmaeilzadeh et al. MICRO'12


54
Approximating NNs

• Quantify the impact of approximating each neuron on overall quality using back-propagation
• Replace the least sensitive neurons with their approximate versions
• Different approximate versions can be created by adapting the input precision and neuron weights
• Training involves error-healing => retraining allows further approximation for the same QoR

Venkataramani et al. ISLPED'14


55
Reducing error-correction overhead using AC

• Use a weak error-correction approach for resilient apps
• For critical code, use redundant execution for soft-error protection
• For non-critical code, use only low-overhead resilience schemes, e.g., checking store operations by redundant address calculation
• A soft error in a non-critical portion does not affect correctness
• In multicores, control the resilience of each core independently

Shi et al. CAL'15, Xu et al. TR'15


56
Some application areas of AC

• Image processing or multimedia


• Signal processing
• Machine learning
• Scientific computing
• Financial analysis
• Database search
• MapReduce
•…

57
References

• S. Mittal, “A Survey of Techniques for Approximate Computing”, ACM Computing Surveys, 2016

58
