On Approximate Computing Techniques


Approximate

Computing
Techniques

Sparsh Mittal

IIT Hyderabad, India


Acronyms

• AC = approximate computing
• ACT = AC technique
• FP = floating point
• NN = neural network
• NPU = neural processing unit
• QoR = quality of result
• FU = functional unit
• SIMD = single-instruction, multiple-data
• NVM = non-volatile memory
• MLC = multi-level cell
• ISA = instruction set architecture
• SSIM = structural similarity

2
Introduction to AC

• Exact computation or peak-level service demands => require large amounts of resources
• Selective approximation or occasional violation of specifications => provides disproportionate gains in efficiency

• Examples: 50X energy saving for 5% accuracy loss in k-means clustering
• NN-based approximation can accelerate an inverse kinematics app by 26X for 5% QoR loss

3
Introduction to AC

• AC exploits the gap between the level of accuracy required by the apps/users and that provided by the computing system

• Building acceptable or “good-enough” systems out of unreliable/inaccurate hardware and software

4
Terminology

Ideas synonymous with (or similar to) AC:
Dynamic effort-scaling, quality programmability/configurability, variable accuracy

Approximable portions are called ...
Soft slices/computations, best-effort computations, non-critical (vs. crucial), error-resilient/tolerant, error-acceptable, tunable, having a ‘forgiving nature’, and being ‘optional’ (versus guaranteed/mandatory)

5
Scope and motivation for AC
Let us see where the scope for AC arises and
what the motivations for using it are.

6
1. Inherent scope/need for AC

• AC is unavoidable with noisy inputs, limited data precision (floating point), HW defects, sudden additional load, or hard real-time constraints

• No unique answer may exist, but many are acceptable, e.g., in a search engine

• For a complex problem whose exact solution is not known, an approximate solution may be efficient and sufficient

7
2. Error-resilience of programs and users

• Perceptual limitations of humans
– PSNR > 30 dB is considered acceptable
• Non-critical portions in apps
– 98% of FP operations are approximable in a 3D raytracer
– In clustering, only relative distances matter, not absolute distances
– In document analysis, processing of common words such as “a”, “of”, “the” can be avoided

• Redundancy due to spatial/temporal correlation

8
2. Error-resilience of programs and users

• In iterative refinement algorithms, running extra iterations with reduced precision of intermediate computations can still provide the same QoR
– On an FPGA, this approach allows greater parallelism under a fixed area constraint

9
3. Efficiency optimization and quality configurability

• AC can improve performance and energy by relaxing accuracy
– Reducing DRAM refresh rate or SRAM supply voltage => reduced memory energy
• AC can alleviate scalability bottlenecks, allow early loop termination and memory access skipping, and improve yield

• AC provides knobs to trade off quality against efficiency
• Spend only as much effort (e.g., energy) as the QoR requirement demands
– E.g., program versions of different quality

10
Example of quality-effort tradeoff

(Figures: PSNR loss due to errors in a single bit at various positions; maximum % errors in neural approximators for different nodes in the hidden layer. Li et al. TCAD'15)

11
Challenges of AC

12
Limited application domain and gains of AC

• Many apps can't be approximated, e.g., cryptography
• Many approximations hold only in a limited range
• E.g., approximating the sin function with an NN over an unbounded range is intractably hard

Function     Approximate function    Range where applicable
y = 1/x      y = 2.823 – 1.882x      0.5 ≤ x ≤ 1
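A quick numeric check of the table entry (a minimal C sketch; it just sweeps the valid range and one out-of-range point):

#include <stdio.h>
#include <math.h>

/* Linear approximation of y = 1/x, valid only on 0.5 <= x <= 1 */
static double approx_recip(double x) { return 2.823 - 1.882 * x; }

int main(void) {
    double max_err = 0.0;
    for (double x = 0.5; x <= 1.0; x += 0.001) {
        double err = fabs(approx_recip(x) - 1.0 / x);
        if (err > max_err) max_err = err;
    }
    printf("max abs error on [0.5, 1]: %f\n", max_err);
    /* Outside the range the error grows quickly, e.g. at x = 2 */
    printf("error at x = 2: %f\n", fabs(approx_recip(2.0) - 0.5));
    return 0;
}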

13
Correctness issues

• Approximating control flow => the program may not terminate (e.g., an unsolvable equation) or may crash
• ACTs may produce corrupt output that is undetectable by the quality metric
– E.g., using output size as the metric for approximate compression

• AC may interfere with synchronization and memory ordering and lead to non-deterministic output
• These errors are difficult to debug; avoiding them may require comparison with a single-threaded exact execution

14
Tradeoff between quality and energy saving

(Figure: errors vs. energy for four approximation settings, one marked acceptable. Credit: A. Sampson and others)


15
Blind approximation can lead to…

Credit: A. Sampson and others


16
Finding right strategy and quality metric

• Need to carefully choose approximable portions and the approximation strategy
• The AC strategy needs to be found on a per-app basis
• Uniform approximation => unacceptable loss

• Need to choose an accurate and lightweight quality metric
• Carefully monitor output to ensure QoR is acceptable; otherwise, repeat execution with precise parameters
• Apart from the average error, the worst-case error needs to be bounded to maintain high QoR

17
Large overheads

• Voltage scaling requires voltage shifters
• Analog NN implementations require conversion of signals between the digital and analog domains

• Need to write multiple approximate versions of a program or annotate source code by hand => not scalable

• Some ACTs require ISA extensions => not applicable to existing platforms

18
ACT must work across the stack

User
Application
Algorithm
Language
Compiler
Architecture
Circuits
Devices

Working at only one level provides limited benefits

Credit: A. Sampson and others


19
The first task in any ACT is

Finding Approximable Portions

20
1. Error-injection (automatic)

Introduce errors in a variable, then check QoR:
– QoR acceptable => the variable is approximable
– QoR unacceptable => the variable is non-approximable

• Similarly, a compiler can explore multiple algorithms with different accuracy levels (a minimal sketch of error injection follows)
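A minimal sketch of this flow in C (the kernel, the injected variable, and the 0.01 error threshold are all illustrative assumptions):

#include <stdio.h>
#include <math.h>
#include <string.h>

/* Hypothetical kernel: scales an array by a gain factor */
static void kernel(float gain, const float *in, float *out, int n) {
    for (int i = 0; i < n; i++) out[i] = gain * in[i];
}

/* Flip one bit of a float to model a single-bit error */
static float flip_bit(float v, int bit) {
    unsigned u; memcpy(&u, &v, sizeof u);
    u ^= 1u << bit;
    memcpy(&v, &u, sizeof v);
    return v;
}

int main(void) {
    enum { N = 1024 };
    float in[N], exact[N], approx[N];
    for (int i = 0; i < N; i++) in[i] = (float)i / N;

    kernel(1.5f, in, exact, N);

    /* Inject an error into the 'gain' variable and measure QoR loss */
    for (int bit = 0; bit < 23; bit++) {       /* mantissa bits only */
        kernel(flip_bit(1.5f, bit), in, approx, N);
        double err = 0;
        for (int i = 0; i < N; i++) err += fabs(approx[i] - exact[i]);
        err /= N;
        printf("bit %2d: mean error %g -> %s\n", bit, err,
               err < 0.01 ? "approximable" : "non-approximable");
    }
    return 0;
}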
21
2. User tells (manual)

• For some apps, the user may have insights into which variables/portions are approximable
– LSBs of FP numbers
– Background portions of an image
– Variables having no impact on the output

22
Monitoring and ensuring
output quality

is essential to guarantee that the result is acceptable

23
Output quality monitoring

• Monitoring QoR is important to ensure acceptable results

• For several problems, finding a solution is hard, but checking solution quality is easy, e.g., Boolean satisfiability
• Modules for checking QoR can be lightweight
– e.g., simple prediction approaches such as linear estimation, moving average, and decision trees
– Is the solution within bounds? Is the equation solvable?

24
Strategies for monitoring and ensuring QoR

• Some strategies for monitoring QoR:
1. Check precise and approximate app versions on random inputs
2. Interpolate past executions with similar inputs to estimate the output of the current execution
3. Use programmer-supplied checking modules
• Some strategies for ensuring QoR:
1. Compute QoR; if QoR is unacceptable, invoke precise re-execution or increase precision in the next iteration
2. Estimate QoR under an ACT; if QoR is unacceptable, invoke the precise code (see the sketch below)
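A minimal C sketch of ensuring-strategy 2, reusing the 1/x approximation from earlier (the residual-based checker and the 0.1 threshold are illustrative assumptions):

#include <stdio.h>
#include <math.h>

/* Hypothetical precise and approximate versions of the same kernel */
static double solve_precise(double x) { return 1.0 / x; }
static double solve_approx(double x)  { return 2.823 - 1.882 * x; }

/* Lightweight checker: accept if the residual of y ~= 1/x is small */
static int qor_acceptable(double x, double y) {
    return fabs(x * y - 1.0) < 0.1;
}

double solve(double x) {
    double y = solve_approx(x);
    if (qor_acceptable(x, y))
        return y;               /* cheap path, quality verified */
    return solve_precise(x);    /* fall back to precise re-execution */
}

int main(void) {
    printf("%f %f\n", solve(0.7), solve(3.0)); /* approx path, precise path */
    return 0;
}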

25
Some quality metrics used for different
approximable applications/kernels

26
After identifying approximable portions, the next step is

Conveying info about approximable portions to the software/compiler

27
1. Using type qualifiers
@approx int a = ...;
int p;     // precise by default
p = a;     // not allowed (illegal): approximate cannot flow into precise
a = p;     // allowed: precise can flow into approximate

Example: Sobel filter

@approx float grad(@approx float[3][3] p) { .. }
@approx float dst[][];

void edgeDetection(aImage &src, aImage &dst)
{
  for (int y = …) {
    for (int x = …) {
      dst[x][y] = grad(window(src, x, y));
}}}
Credit: A. Sampson et al.
28
2. Using OpenMP-style code annotations
Example: Gaussian filter
#pragma omp parallel
{
#pragma omp accurate                        // the kernel is accurate ...
#pragma omp for
for (i = K/2; i < (IMG_M - K/2); ++i) {     // iterate over image
  for (j = K/2; j < (IMG_N - K/2); ++j) {
    float sum = 0; int ii, jj;
    for (ii = -K/2; ii <= K/2; ++ii) {      // iterate over kernel
      for (jj = -K/2; jj <= K/2; ++jj) {
        float data = in[i+ii][j+jj]; float result;
        float coef = coeffs[ii+K/2][jj+K/2];
        #pragma omp approximate error_significance_threshold(20)
        {                                   // ... except this portion:
          result = data * coef;             // ignore errors in 20 LSBs
          sum += result;
        }
      }
    }
    out[i][j] = sum / scale;
  }
}
}
Rahimi et al. CODES'13
29
We are all set! Now, let's actually use

Approximation Strategies

30
Some approximation strategies

1. Precision scaling
2. Loop perforation
3. Load-value approximation
4. Memoization
5. Task dropping
6. Memory access skipping
7. Data sampling
8. Program versions of different accuracy
9. Inexact hardware
10. Voltage scaling
11. (e)DRAM refresh rate reduction
12. Inexact reads/writes
13. Reducing divergence in GPUs
14. Lossy compression
15. Use of NNs
31
Explanation of some strategies

• Memory access skipping: skip some memory accesses or fetch only the MSBs
• Memoization: reuse previously stored results for similar (but not identical) functions/inputs (see the sketch after this list)
– Reuse the result of an instruction across different parallel lanes of a SIMD architecture
– Used in scatter/gather and map computations
• Program versions of different accuracy may be supplied by the user
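A minimal C sketch of approximate memoization: inputs are quantized so that similar (not identical) inputs hit the same cached entry (the quantization step, table size, and example function are illustrative assumptions):

#include <stdio.h>
#include <math.h>

#define TBL 256

static double cache_val[TBL];
static int    cache_set[TBL];

/* Quantize the input to a table index; nearby inputs map to one slot */
static int slot(double x) {
    int i = (int)(x * 32.0);        /* 1/32 input resolution */
    return ((unsigned)i) % TBL;
}

static double expensive(double x) { return sin(x) * exp(-x); }

static double memoized(double x) {
    int s = slot(x);
    if (!cache_set[s]) {            /* miss: compute and remember */
        cache_val[s] = expensive(x);
        cache_set[s] = 1;
    }
    return cache_val[s];            /* hit: reuse a similar input's result */
}

int main(void) {
    printf("%f\n", memoized(1.000));
    printf("%f\n", memoized(1.010)); /* reuses the 1.000 result */
    return 0;
}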

32
Load-value approximation (LVA)

(Diagram) Core, data cache, LVA module, and main memory:
1. A load for address K misses in the data cache
2. The LVA module estimates K_approx and sends it to the core
3. K_exact arrives from main memory
4. K_exact is used to train the LVA module

• Only predict loads causing the largest fraction of misses
• Can offset both latency and bandwidth constraints
• Avoids pipeline flush and recovery overheads
• For GPUs, leverage value locality to predict for only a few threads in a warp and reuse this value for the remaining threads (a minimal model follows)
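A minimal software model of LVA with a last-value predictor (a real LVA module sits in hardware beside the cache; the table size and names here are illustrative):

#include <stdio.h>

/* Per-load-PC predictor: remember the last value the load returned */
#define ENTRIES 64
static float last_value[ENTRIES];

/* On a cache miss, return the predicted value immediately ... */
static float lva_predict(unsigned pc) {
    return last_value[pc % ENTRIES];
}

/* ... and train the predictor once the real value arrives from memory */
static void lva_train(unsigned pc, float actual) {
    last_value[pc % ENTRIES] = actual;
}

int main(void) {
    unsigned pc = 0x400b12;         /* address of some hot load */
    lva_train(pc, 3.14f);           /* earlier miss resolved to 3.14 */
    printf("predicted: %f\n", lva_predict(pc)); /* next miss: use 3.14 */
    return 0;
}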

33
Loop perforation

• Skipping some iterations of a loop
• Original:    for (i = 0; i < N; i++) {...}
• Perforated:  for (i = 0; i < N; i += p) {...}

• Applicable to several global computational patterns, e.g., Monte Carlo simulation, iterative refinement, and search-space enumeration (a fuller sketch follows)
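A minimal C sketch of perforating a reduction, with the result rescaled to compensate for the skipped iterations (rescaling is one common fix-up, not part of every perforation scheme):

#include <stdio.h>

int main(void) {
    enum { N = 1000, P = 4 };       /* run every P-th iteration */
    double data[N], sum = 0;
    for (int i = 0; i < N; i++) data[i] = i * 0.001;

    for (int i = 0; i < N; i += P)  /* perforated loop */
        sum += data[i];
    sum *= P;                       /* extrapolate to the full loop */

    printf("approx sum: %f (exact: %f)\n", sum, 0.001 * N * (N - 1) / 2.0);
    return 0;
}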

34
Precision scaling (changing bit-width)

• Leaves some bit slices in the data path idle => power-gate them
• Allows packing multiple data items into the same word and processing them with a single FU
• Allows using smaller and faster FUs for some operations (a minimal sketch follows)
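A minimal C sketch of precision scaling in software: masking off mantissa LSBs of a float (hardware schemes instead power-gate the corresponding bit slices):

#include <stdio.h>
#include <string.h>

/* Keep only the top 'keep' mantissa bits of a float (23 max) */
static float truncate_mantissa(float v, int keep) {
    unsigned u, mask = ~((1u << (23 - keep)) - 1);
    memcpy(&u, &v, sizeof u);
    u &= mask;
    memcpy(&v, &u, sizeof v);
    return v;
}

int main(void) {
    float x = 3.14159265f;
    for (int keep = 23; keep >= 4; keep -= 4)
        printf("%2d mantissa bits: %.7f\n", keep, truncate_mantissa(x, keep));
    return 0;
}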

35
Precision Scaling

• Increases spatial and temporal locality => increases coverage of memoization; allows using a lookup table for FP add/multiply
• Turns some FP operations into trivial ones => no FU needed

(Figures: examples of trivial cases; factors increasing trivialization in an image-processing app. Yeh et al. MICRO'07)


36
Sampling

(Figures: use of sampling in a MapReduce job, where only 4 map tasks (1, 2, 4, and 6) are executed, processing 10 input data items (Goiri et al. ASPLOS'15); quality increasing with increasing data size (Chippa et al. DAC'13))

37
Algorithm-level approximation strategies

• Early termination: instead of running to full convergence, stop when the improvement between successive iterations is small (see the sketch below)

• Convergence-based pruning: tasks/inputs for which no change/improvement is seen in the last K iterations are eliminated from further iterations
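A minimal C sketch of early termination, using Newton's method for square root; the tolerance is the quality knob:

#include <stdio.h>
#include <math.h>

/* Stop when successive iterates improve by less than 'tol' */
static double approx_sqrt(double a, double tol) {
    double x = a, prev;
    do {
        prev = x;
        x = 0.5 * (x + a / x);      /* Newton step */
    } while (fabs(x - prev) > tol); /* early termination knob */
    return x;
}

int main(void) {
    printf("loose : %f\n", approx_sqrt(2.0, 1e-2)); /* fewer iterations */
    printf("tight : %f\n", approx_sqrt(2.0, 1e-12));
    return 0;
}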

38
Creating approximate program versions

• Skip atomic operations that cause frequent conflicts and thread serialization; keep those that don't cause conflicts
• Fuse adjacent threads not sharing data into one and replicate that thread's output to the other threads

(Figure: the exact version runs iterations i, i+1, i+2, ... sequentially; relaxing dependencies runs iterations i through i+k concurrently, then i+k+1 through i+2k, and so on.)

39
Scalable-effort design

• Observation: only some tasks are hard => spending equal effort on all tasks is wasteful

• Handwriting recognition: only 5-30% of instances are hard

• Spend only as much effort as each task requires

Venkataramani et al. DAC'15


40
Approximate approach: input tasks → approx accelerator 1 → QoR check → approx accelerator 2 → QoR check → … → approx accelerator k → QoR check → precise accelerator → commit
Most tasks can be done by the approximate accelerators; only a few tasks require the precise accelerator

Traditional approach: input tasks → precise accelerator → commit

41
DRAM refresh rate reduction

Programmer: critical/non-critical declaration (object level)
Runtime system: critical/non-critical allocation (page level)
OS: virtual-to-physical address mapping, separating critical and non-critical pages
Approximate DRAM: normal refresh for critical pages, low refresh for non-critical pages

Liu et al. ASPLOS'11
42


Value similarity based on spatial locality

Stencil pattern (a), (b), (c):
(a): the value at the center of a tile can approximate all neighboring values
(b) and (c): the values of one column/row can approximate those of the other columns/rows

Samadi et al. ASPLOS'14


43
Voltage scaling (in SRAM)

• Reducing SRAM supply voltage saves leakage energy but increases the probability of read disturbance and incorrect writes

• Store critical data in fault-free blocks and non-critical data in faulty blocks

• Too many faults in a block => power-gate it

44
Inexact reads/writes in NVMs

1. Low read current => erroneous reads
2. Short sensing duration and large read current => read disturbance (i.e., bit flips during reads)
3. Lowering the write duration, the write current, or both => unsuccessful writes
• Use these knobs during reads/writes to control quality

(Figure: energy vs. error probability for an STT-RAM bit-cell. Ranjan et al. DAC'15)
45
Inexact writes in NVMs

• In PCM (phase change memory), if the existing data are almost the same as the new data => cancel the write operation and keep the existing data as the new data (a minimal model follows)
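A minimal software model of this write-cancellation policy (the tolerance and data type are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>

/* Skip the (expensive) NVM write if old and new values are close enough */
static void approx_write(int *cell, int new_val, int tol) {
    if (abs(*cell - new_val) <= tol)
        return;                     /* cancel write; keep existing data */
    *cell = new_val;                /* values differ too much: real write */
}

int main(void) {
    int cell = 100;
    approx_write(&cell, 101, 2);    /* skipped: 101 is close to 100 */
    printf("%d\n", cell);           /* still 100 */
    approx_write(&cell, 150, 2);    /* performed */
    printf("%d\n", cell);           /* now 150 */
    return 0;
}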

46
Inexact reads/writes in MLC NVMs

• Perform approximate writes by reducing the number of programming pulses when writing MLC NVM

(Figure: an approximate MLC NVM configuration that reduces energy in iterative MLC writes)

• Store approximate data in blocks that have exhausted their error-correction resources
• Preferentially correct MSBs, then LSBs

Sampson et al. MICRO'13
47
Using faulty/inexact hardware

• On each write, circularly shift the data so that the LSBs fall in the faulty cells of the memory
• This reduces the error magnitude, unlike ECC, which corrects errors

• Use inexact adders (e.g., ignoring the carry), multipliers, cosine transforms, etc.

Ganapathy et al. DAC'15


48
Using novel approximation-aware ISAs

• Allows specifying the desired accuracy in the ISA itself => allows applying AC to a larger portion of the app

• Provide processing elements of different accuracy

• Divide processor components into two groups:
– Data movement/processing (e.g., cache, register file, FU, load-store queue) => approximable
– Instruction control (instruction fetching, decoding) => precise

49
Neural-network based acceleration

• Observation: NNs expose parallelism and allow efficient hardware implementation
• They can be trained to mimic several computations
• They are error-resilient

• Replace a kernel with approximate NN-based code (see the sketch below)
50
Reducing branch divergence in SIMD

• Use of approximate NN-based code:
1. Characterize the data-dependent control flow of the app
2. Identify the kernels responsible for the largest performance loss due to branch divergence
3. Train NNs to approximate these kernels
4. Inject the NNs into the code itself by substituting the kernels

• Branch herding: find the outcome taken by the majority of threads and force all threads to follow it (a minimal model follows)

Branch divergence: threads in a SIMD group diverging due to if/else. Grigorian et al. TACO'15, Sartori et al. TMM'13
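A minimal software model of branch herding over one warp (warp size and data are illustrative):

#include <stdio.h>

enum { WARP = 8 };

int main(void) {
    float x[WARP] = {1, 2, -1, 3, 4, 5, -2, 6};
    int votes = 0;

    /* Count how many lanes would take the branch ... */
    for (int i = 0; i < WARP; i++)
        if (x[i] > 0) votes++;

    /* ... and force all lanes onto the majority path (herding) */
    int taken = (votes * 2 > WARP);
    for (int i = 0; i < WARP; i++)
        printf("lane %d follows %s path\n", i, taken ? "taken" : "not-taken");
    return 0;
}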
51
Neural-network based acceleration

(Diagram: a program with its approximable portion highlighted)
1. Find an approximable program portion
2. Train an NN to mimic it
3. Implement the NN on a neural processing unit (NPU)
Credit: Sampson et al.
52
Neural Algorithmic Transformation

(Diagram: source codes Code1 … Code6 are lowered to a common intermediate representation, then to a neural representation; acceleration targets are CPU and NPU.)

Esmaeilzadeh et al. MICRO'12


53
Implementing NN-based code

NPU implementation options:
Digital: CPU, GPU, FPGA, digital ASIC => lower implementation cost, higher communication/computation overhead
Analog: FPAA, analog ASIC => lower energy, higher performance, lower accuracy

FPAA = field programmable analog array
Esmaeilzadeh et al. MICRO'12


54
Approximating NNs

• Quantify the impact of approximating each neuron on overall quality using back-propagation
• Replace the least sensitive neurons with their approximate versions
• Different approximate versions can be created by adapting the input precision and neuron weights
• Training involves error-healing => retraining allows further approximation for the same QoR

Venkataramani et al. ISLPED'14


55
Reducing error-correction overhead using AC

• Use a weak error-correction approach for resilient apps
• For critical code, use redundant execution for soft-error protection
• For non-critical code, use only low-overhead resilience schemes, e.g., checking store operations by redundant address calculation
• A soft error in a non-critical portion does not affect correctness
• In multicores, control the resilience of each core independently

Shi et al. CAL'15, Xu et al. TR'15


56
Some application areas of AC

• Image processing or multimedia


• Signal processing
• Machine learning
• Scientific computing
• Financial analysis
• Database search
• MapReduce
•…

57
References

• S. Mittal, “A Survey of Techniques for Approximate Computing”, ACM Computing Surveys, 2016

58
