The Significance of SIMD, SSE and AVX - Intel - Slides (3a - SIMD)
Stephen Blair-Chappell
Intel Compiler Labs
• 1. Auto-Vectorisation
• 2. CPU Dispatch
• 3. Manual Processor Dispatch
• 4. A Case Study
A customer when moving from version 9.1 to version 10 of the Intel compiler
Scalar processing: add.d r3, r1, r2 (one add: r3 = r1 + r2)
Vector processing: addvec.d v3, v1, v2 (one add over whole vectors: v3 = v1 + v2)
VL = vector length
Instruction set growth (instructions added at each step, with focus areas):
• 70 instr: single-precision vectors; streaming operations
• 144 instr: double-precision vectors; 8/16/32/64/128-bit vector integer
• 13 instr: complex data
• 32 instr: decode
• 47 instr: video; graphics building blocks; advanced vector instr
• 8 instr: string/XML processing; POP-Count; CRC
• 7 instr: encryption and decryption; key generation
• ~100 new instr and ~300 legacy sse instr updated: 256-bit vector; 3 and 4-operand instructions
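Intel® SSE
Vector size: 128 bit
Data types: 8, 16, 32, 64 bit integers; 32 and 64 bit floats
VL: 2, 4, 8, 16
Sample: Xi, Yi 32 bit int / float
[Figure: one 128-bit register (bits 128 to 0) holds X4 X3 X2 X1 and a second holds Y4 Y3 Y2 Y1. A vector operation yields X4opY4 X3opY3 X2opY2 X1opY1; a scalar add changes only the lowest element: X4 X3 X2 X1addY1]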
Now: performance / core
Key Intel® Advanced Vector Extensions (Intel® AVX) Features

• Wider Vectors
  – Increased from 128 to 256 bit
  – Two 128-bit load ports
  Benefit: Up to 2x peak FLOPs (floating point operations per second) output with good power efficiency
• Enhanced Data Rearrangement
  – Use the new 256 bit primitives to broadcast, mask loads and permute data
  Benefit: Organize, access and pull only necessary data more quickly and efficiently
• Three and four Operands: Non Destructive Syntax for both AVX 128 and AVX 256
  Benefit: Fewer register copies, better register use for both vector and scalar code
• 1-per-cycle 256-bit multiply, add, and shuffle
• Load double the data with Intel microarchitecture (Sandy Bridge)
[Figure labels: Memory Control; L1 Data Cache; 48 bytes/cycle]
for (i=0;i<MAX;i++)
c[i]=a[i]+b[i];
• -vec-report
Adding Portability
ssse3    May generate Intel® SSSE3, SSE3, SSE2, and SSE instructions.
sse      This option has been deprecated; it is now the same as specifying ia32.
ia32     Generates x86/x87 generic code that is compatible with IA-32 architecture.
This option tells the compiler to generate code specialized for the processor that
executes your program.
Code generated with these options should execute on any compatible, non-Intel
processor with support for the corresponding instruction set.
SSE4.1       SSE4 Vectorizing Compiler and Media Accelerator, SSSE3, SSE3, SSE2, and SSE. May optimize for Intel® 45nm Hi-k next generation Intel® Core™ microarchitecture.
SSE3_ATOM    MOVBE (depending on -minstruction), SSSE3, SSE3, SSE2, and SSE. Optimizes for the Intel® Atom™ processor and Intel® Centrino® Atom™ Processor Technology.
SSSE3        SSSE3, SSE3, SSE2, and SSE. Optimizes for the Intel® Core™ microarchitecture.
SSE3         SSE3, SSE2, and SSE. Optimizes for the enhanced Pentium® M processor microarchitecture and Intel NetBurst® microarchitecture.
SSE2         SSE2 and SSE. Optimizes for the Intel NetBurst® microarchitecture.
-xAVX
for(i=0;i<NUM;i++)
{
    j[i] = h[i] + i + 3;
}
[Diagram: the loop is compiled to a single AVX code path, running on a CPU supporting AVX]
-axAVX
for(i=0;i<NUM;i++)
{
    j[i] = h[i] + i + 3;
}
[Diagram: two code paths are generated for the loop, an AVX path and an SSE2 path. SSE2 is the base line (set with -m or -x option)]
http://developer.amd.com/Assets/CompilerOptQuickRef-61004100.pdf
Code
• still works on older processors
• Use __declspec(cpu_dispatch(cpuid, cpuid…))
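#include <stdio.h>

/* Dispatch stub: lists the cpuids that have a specific version below. */
__declspec(cpu_dispatch(generic, future_cpu_16))
void dispatch_func() {}

/* Fallback for non-Intel processors and generic Intel processors. */
__declspec(cpu_specific(generic))
void dispatch_func() {
    printf("Code for non-Intel processors and generic Intel\n");
}

/* Version for 2nd generation Intel Core processors. */
__declspec(cpu_specific(future_cpu_16))
void dispatch_func() {
    printf("Code for 2nd generation Intel Core processors goes here\n");
}

int main() {
    dispatch_func();
    printf("Return from dispatch_func\n");
    return 0;
}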
www.pishurlok.com
[Figure: case study timing diagram. Events: Tick, ADC Complete, Interrupt Request. Tasks: Model, Logger, Script. Markers: T2, T3, T4 and a, b, c. Timeline: Frame 1, Frame 2, Frame 3]
Intel® compilers, associated libraries and associated development tools may include or
utilize options that optimize for instruction sets that are available in both Intel® and non-
Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for
non-Intel microprocessors. In addition, certain compiler options for Intel compilers,
including some that are not specific to Intel micro-architecture, are reserved for Intel
microprocessors. For a detailed description of Intel compiler options, including the
instruction sets and specific microprocessors they implicate, please refer to the “Intel®
Compiler User and Reference Guides” under “Compiler Options.” Many library routines that
are part of Intel® compiler products are more highly optimized for Intel microprocessors
than for other microprocessors. While the compilers and libraries in Intel® compiler
products offer optimizations for both Intel and Intel-compatible microprocessors, depending
on the options you select, your code and other factors, you likely will get extra performance
on Intel microprocessors.
Intel® compilers, associated libraries and associated development tools may or may not
optimize to the same degree for non-Intel microprocessors for optimizations that are not
unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD
Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and
Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or effectiveness of
any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent
optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining
the best performance on Intel® and non-Intel microprocessors, Intel recommends that you
evaluate other compilers and libraries to determine which best meet your requirements. We
hope to win your business by striving to offer the best performance of any compiler or
library; please let us know if you find we do not.
Notice revision #20101101