Liang Cunming (梁存铭), Intel - Core - Efficiency PDF

Practices for Building Core/Efficient Applications
Liang Cunming
2015.04.21
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING
TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION
CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS
COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH
MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel
reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize
a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware
or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more
information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Celeron, Intel, Intel logo, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel SpeedStep, Intel XScale, Itanium, Pentium, Pentium Inside, VTune, Xeon, and Xeon Inside are
trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Intel Active Management Technology requires the platform to have an Intel AMT-enabled chipset, network hardware and software, as well as connection with a power source and a corporate network connection. With
regard to notebooks, Intel AMT may not be available or certain capabilities may be limited over a host OS-based VPN or when connecting wirelessly, on battery power, sleeping, hibernating or powered off. For more
information, see http://www.intel.com/technology/iamt.
64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel 64 architecture. Performance will vary depending on
your hardware and software configurations. Consult with your system vendor for more information.
No computer system can provide absolute security under all conditions. Intel Trusted Execution Technology is a security technology under development by Intel and requires for operation a computer system with Intel
Virtualization Technology, an Intel Trusted Execution Technology-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel or other compatible measured virtual machine monitor. In addition, Intel Trusted
Execution Technology requires the system to contain a TPMv1.2 as defined by the Trusted Computing Group and specific software for some uses. See http://www.intel.com/technology/security/ for more information.
Hyper-Threading Technology (HT Technology) requires a computer system with an Intel Pentium 4 Processor supporting HT Technology and an HT Technology-enabled chipset, BIOS, and operating system. Performance
will vary depending on the specific hardware and software you use. See www.intel.com/products/ht/hyperthreading_more.htm for more information including details on which processors support HT Technology.
Intel Virtualization Technology requires a computer system with an enabled Intel processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or
other benefits will vary depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application
vendor.
* Other names and brands may be claimed as the property of others.
Other vendors are listed by Intel as a convenience to Intel's general customer base, but Intel does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of these
devices. This list and/or these devices may be subject to change without notice.
Copyright 2014, Intel Corporation. All rights reserved.
What this talk is about

[Radar chart, scale 0 to 10, with axes: Throughput, Platform, Efficiency, Latency, Virtualization]
5-dimension assessment of user experience
The talk focuses on efficiency
CPU Efficiency
1. Reducing stalled cycles
2. Managing idle loop
3. Leveraging HW offload
Practice Sharing
SIMD in Packet IO
AVX2 memcpy
Dynamic frequency adaptation
Preemptive task switch
Interrupt Packet IO
Gain from csum offload
1. Stalled Cycles
Reducing stalled cycles

Instruction parallelism
Improving cache bandwidth
Hiding cache latency
1. Stalled Cycles
Instruction Parallelism and Cache BW

[Figure: A Sample of the Superscalar Execution Engine. A Unified Reservation Station feeds Port 0 to Port 7. Units shown: Integer ALU & Shift (x2), Integer ALU & Fast LEA (x2), Slow Integer, Divide, Branch (x2), FMA / FP Multiply, FMA / FP MUL / FP Add, FP & INT Shuffle, Vector Integer Multiply, Vector Integer ALU (x2), Vector Logical (x3), Vector Shifts, Load/STA (x2), Store Data, Store Address.]
Metric                       Nehalem   SNB      HSW      Comments
Instruction Cache            32K       32K      32K
L1 Data Cache (DCU)          32K       32K      32K
  Hit Latency (cycle)        4/5/7     4/5/7    4/5/7    No index / nominal / non-flat seg
  Bandwidth (bytes/cycle)    16+16     32+16    64+32    2 loads + 1 store
L2 Unified Cache (MLC)       256K      256K     256K
  Hit Latency (cycle)        10        12       12       Nominal load
  BW (bytes/cycle)           32        32       64       HSW doubled MLC hit BW

From Nehalem and SNB to HSW, L1 cache BW was greatly improved.
How to expose this capability in DPDK?
https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
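As a rough illustration of what the L1 bandwidth figures above mean for a copy, the sketch below computes a lower bound on the cycles needed to move a block through L1. It assumes the "a+b" figures are load plus store bytes per cycle and that a copy is limited by the slower of the two; the function name and the model itself are illustrative assumptions, not measurements.

```python
import math

# L1 bandwidth in bytes/cycle as (load, store), read from the cache metrics above.
L1_BW = {"Nehalem": (16, 16), "SNB": (32, 16), "HSW": (64, 32)}

def min_copy_cycles(nbytes, arch):
    """Lower bound on cycles to copy nbytes when every byte is loaded once and stored once."""
    load_bw, store_bw = L1_BW[arch]
    return max(math.ceil(nbytes / load_bw), math.ceil(nbytes / store_bw))

for arch in L1_BW:
    print(arch, min_copy_cycles(64, arch), "cycles for one 64-byte cache line")
```

Under this model the store side is the limit on SNB and HSW, so wider stores matter as much as wider loads.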
1. Stalled Cycles
Hide Cache Latency

[Figure: execution timelines along a Time axis, built from Load, ALU, Store and Stall slots. Rows: Base line (Load, ALU, Store repeated); +1 cycle cache latency (one Stall after each Load); +4 cycle cache latency (four Stalls after each Load); bulk variants that issue the Loads together, then the ALU operations, then the Stores, to hide latency. Callouts: "save half time", "save ~60% time".]
Note: It assumes each instruction in the sample costs the same number of cycles.
Hide cache latency by bulk processing.
Sounds great in theory, but how to realize this performance?
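The idea can be checked with a toy cycle model. This is a minimal sketch under the slide's own assumption that every instruction costs one cycle, plus two further assumptions of mine: one instruction issues per cycle, and a load's data is usable `latency` cycles after the load issues. It is not meant to reproduce the exact percentages in the figure.

```python
def serial_cycles(n, latency):
    """Process n items one at a time: Load, wait out the latency, ALU, Store."""
    return n * (1 + latency + 1 + 1)

def bulk_cycles(n, latency):
    """Issue all n Loads first, then n ALU ops, then n Stores."""
    t = 0
    ready = []
    for _ in range(n):
        t += 1                      # one Load issues per cycle
        ready.append(t + latency)   # cycle after which its data is usable
    for i in range(n):
        t = max(t, ready[i]) + 1    # ALU waits only if its data is still in flight
    return t + n                    # finally one Store per cycle

for lat in (0, 1, 4):
    s, b = serial_cycles(4, lat), bulk_cycles(4, lat)
    print(f"latency +{lat}: serial={s} bulk={b} saved={1 - b / s:.0%}")
```

The later loads overlap with the stall of the first one, so most of the latency disappears from the critical path as the batch grows.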
1. Stalled Cycles
Practice: vector packet IO (1)

Straightforward implementation
  One big while loop
  Things happen one at a time (except for amortizing the tail pointer update)
  Linear execution
Check multiple descriptors at a time?
Vector copies?
Allocate buffers in bulk?
What happens when you poll much faster than the rate at which packets are coming in?
Every received packet results in modification of a descriptor cache line (to write the new buffer address), likely the same cache line that the NIC is reading. These conflicts should be avoided.
[Figure: Desc A, Desc B, Desc C and Desc D share one CACHE LINE]
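To make the conflict concrete, here is a small model. It assumes 16-byte descriptors and 64-byte cache lines (so four descriptors per line), a NIC that fills descriptors in ring order, and software that rewrites descriptors once `batch` of them have accumulated. A "conflict" is counted when a rewritten descriptor sits in the line the NIC touches next. All names are hypothetical and ring wrap-around is ignored.

```python
DESC_SIZE = 16    # bytes per descriptor (assumed)
CACHE_LINE = 64   # bytes per cache line (assumed)

def line_of(desc_index):
    """Cache line number that holds a given descriptor."""
    return (desc_index * DESC_SIZE) // CACHE_LINE

def conflicts(num_packets, batch):
    """Count descriptor rewrites that land in the line the NIC accesses next."""
    hits = 0
    pending = []
    for n in range(num_packets):
        pending.append(n)                 # packet n received, descriptor n needs a new buffer
        if len(pending) == batch:
            nic_next = n + 1              # descriptor the NIC will use next
            hits += sum(1 for d in pending if line_of(d) == line_of(nic_next))
            pending.clear()
    return hits

print("rewrite per packet:", conflicts(64, 1))
print("rewrite per 32    :", conflicts(64, 32))
```

Rewriting one descriptor at a time collides on three of every four packets in this model, while batches aligned to whole cache lines never touch the NIC's current line.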
1. Stalled Cycles
Practice: vector packet IO (2)

Things happen in chunks
  Defer descriptor writes until multiple have accumulated
  Loops unroll
  Easy to do bulk copies
  Easier to vectorize
  Reduce the probability of modifying a cache line being written by the NIC
Remove linear dependency on DD bit check
  Always copy 4 descriptors' data and 4 buffer pointers
  Hide load latency and fully consume the doubled cache load bandwidth: 16-byte descriptor <-> 128-bit XMM register <-> 2 buffer pointers
  Shuffle the 128-bit descriptor into the 16-byte buffer header, issuing shuffles in parallel
  Use popcnt to count the number of available descriptors
[Figure: 128 bit desc format; 128 bit partial mbuf data]
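The popcnt step can be sketched in scalar form. The example assumes the Descriptor Done (DD) flag is bit 0 of each descriptor's status word and that the NIC completes descriptors in ring order, so set bits are contiguous from the first slot; `DD_BIT` and `count_done` are illustrative names, and a real vector PMD would do this on packed SIMD registers.

```python
DD_BIT = 0x1  # assumed position of the Descriptor Done flag in a status word

def count_done(status_words):
    """Count completed descriptors in a group with one population count
    instead of a branch per descriptor."""
    mask = 0
    for i, status in enumerate(status_words):
        mask |= int(bool(status & DD_BIT)) << i   # gather one DD bit per descriptor
    return bin(mask).count("1")                   # popcount of the gathered bits

# Four descriptors are always copied; only the first `done` are valid packets.
done = count_done([0x1, 0x3, 0x0, 0x0])
print(done)
```

Copying all four descriptors unconditionally and trimming afterwards removes the load, test, branch chain between consecutive descriptors.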
1. Stalled Cycles
Practice: vector packet IO (3)

[Chart: Packet IO throughput optimization. Series: Base, Bulk, Vector. Axes: Throughput and Growth (%).]
[Chart: Vector L3fwd throughput for IPV4-COUNT-BURST and IPV6-COUNT-BURST. Series: S-OLD, V-OLD, V-NEW. Axis unit as printed: Mbps.]
SNB Server 2.7GHz, no Hyper-Threading, no Turbo Boost, 1 x core, 4 x Niantic card, one port/card
Disclaimer: Software and workloads used in performance tests may have been optimized for performance only
on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific
computer systems, components, software, operations and functions. Any change to any of those factors may
cause the results to vary. You should consult other information and performance tests to assist you in fully
evaluating your contemplated purchases, including the performance of that product when combined with other
S-OLD: original L3FWD with scalar PMD.
V-OLD: original L3FWD with vector PMD.
V-NEW: modified L3FWD with vector PMD.
Vector-based IO reduces cycle count @ 54 cycles/packet
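For a sense of scale, a cycle budget converts to a packet rate as follows. This sketch assumes the 54 is a per-packet cost on the 2.7GHz core from the test setup (the slide's wording leaves that reading open) and uses the standard 20 bytes of per-frame Ethernet overhead; the function names are illustrative.

```python
def mpps_from_cycles(core_hz, cycles_per_packet):
    """Packets per second (in millions) one core sustains at a given per-packet cost."""
    return core_hz / cycles_per_packet / 1e6

def line_rate_mpps(link_bps, frame_bytes, overhead_bytes=20):
    """Ethernet line rate in Mpps; 20 bytes = 7 preamble + 1 SFD + 12 inter-frame gap."""
    return link_bps / ((frame_bytes + overhead_bytes) * 8) / 1e6

print(mpps_from_cycles(2.7e9, 54))          # budget of one 2.7GHz core
print(round(line_rate_mpps(10e9, 64), 2))   # one 10GE port with 64-byte frames
```

Four 10GE ports of 64-byte frames need roughly four times the single-port rate, which shows how tight the per-packet budget is on one core.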
1. Stalled Cycles
Practice: AVX2 memory copy

Utilized 256-bit load/store
Forced 32-byte aligned store to improve performance
Improved control flow to reduce copy bytes (eliminate unnecessary MOVs)
Resolved performance issue at certain odd sizes
[Chart callouts: 2X faster by 256-bit load/store; fixed performance issue at certain odd sizes in current memcpy]

Average throughput speedup on selected sizes, non-constant, 32B aligned
(no weight applied in calculation)
        new     current   glibc
C2C     1.85    1.24      1.00
C2M     4.57    4.06      1.00
M2C     1.26    1.14      1.00
M2M     2.62    2.41      1.00
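One way to picture the forced 32-byte aligned store is as a split of the copy into an unaligned head, an aligned body and a tail. The sketch below only computes that split for a destination address; it is an assumption-laden illustration of the idea, not the memcpy routine measured above, and `split_copy` is a hypothetical name.

```python
ALIGN = 32  # bytes in one 256-bit store

def split_copy(dst_addr, nbytes):
    """Return (head, body, tail) byte counts so that every store in `body`
    starts on a 32-byte boundary of the destination."""
    head = min(nbytes, (-dst_addr) % ALIGN)     # bytes needed to reach alignment
    body = (nbytes - head) // ALIGN * ALIGN     # whole aligned 32-byte stores
    tail = nbytes - head - body                 # leftover bytes
    return head, body, tail

head, body, tail = split_copy(0x1007, 100)
print(head, body, tail)
```

Only the head and tail need special handling; the body runs as a fixed loop of full-width stores.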
2. Idle Loop
Managing Idle Loop

Problem Statement
  Always busy-looping, even when no packets come in
  Can we do something else on the packet IO core?
Effective Ways
  Frequency scaling and turbo
  Limit the IO thread to some quota
  On-demand yield
  Turn to sleep
2. Idle Loop
Practice: Power Optimization(1)

                                       L3fwd
Platform Power (idle)                  123W
Platform Power (L3fwd w/o traffic)     245W
CPU Utilization (L3fwd w/o traffic)    100%
Frequency (L3fwd w/o traffic)          2701000 KHz (Turbo Boost)

[Chart: platform power with traffic; callout: best perf/watt]
Idle Scenario
  L3fwd busy-wait loop consumes unnecessary cycles and power
  Linux power saving mechanism is not utilized at all!
Active Scenario
  Manually set P-state at different frequencies
  I/O-intensive DPDK peak performance is insensitive to frequency, but power consumption is sensitive to it
  Negligible peak performance degradation at lower frequencies
  On SNB, 1.7/1.8 GHz achieves the best perf/watt (considering 1C/2T for 2 ports)
2. Idle Loop
Practice: Power Optimization(2)

Idle Power Comparison
                                       L3fwd                        L3fwd_opt
Platform Power (idle)                  123W                         123W
Platform Power (L3fwd w/o traffic)     245W                         135W
CPU Utilization (L3fwd w/o traffic)    100%                         0.3%
Frequency (L3fwd w/o traffic)          2701000 KHz (Turbo Boost)    1200000 KHz

Idle Scenario
  Sleep till incoming traffic
  Lowest core frequency
  Power saving for tidal effect
Active Scenario
  Peak performance degradation for 64B only
  ~90W platform power reduction in most cases (different packet sizes)
[Chart: Active Power and Performance Comparison; callout: 1% perf. down]
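The idle comparison reduces to simple arithmetic. The sketch below uses only the platform power figures from the idle power comparison above; the variable names are illustrative.

```python
IDLE_W = 123        # platform power, idle
L3FWD_W = 245       # platform power, L3fwd without traffic
L3FWD_OPT_W = 135   # platform power, L3fwd_opt without traffic

saved_w = L3FWD_W - L3FWD_OPT_W            # absolute saving with no traffic
saved_pct = saved_w / L3FWD_W * 100        # relative to the busy-wait platform power
over_idle_before = L3FWD_W - IDLE_W        # cost of busy-waiting above plain idle
over_idle_after = L3FWD_OPT_W - IDLE_W     # remaining cost after optimization

print(saved_w, round(saved_pct, 1), over_idle_before, over_idle_after)
```

Measured against plain idle, the optimized loop leaves only a small fraction of the busy-wait overhead.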
2. Idle Loop
Practice: Multi-pThreads per Core(1)

[Figure: affinity pthread model under Linux scheduling. lcore pthreads and plain pthreads (A to D) are mapped onto CPU Core 0, Core 1, Core 2 and CPU set Core 3,4. Affinity modes shown: 1:1 affinity, with or without core isolation; EAL thread n:1 affinity; EAL/nonEAL thread n:1 affinity; EAL/nonEAL thread n:m affinity. CPU shares shown per thread: 10%, 20%, 30%, 40%, and 40%/60%. Labels: vertical grouping, horizontal grouping, Group with Profile (CQM), cgroup pre-emptive multitasking.]
Many operations don't require 100% of a CPU, so share it smartly
Cgroups allow prioritization, where groups may get different shares of CPU resources
Split thread model against packet IO
Cgroup manages CPU cycle accounting efficiently but what about other
2. Idle Loop
Cgroup and Cache QoS

Cache Monitoring Technology (CMT)
  Identify misbehaving or cache-starved applications and reschedule according to priority
  Cache occupancy reported on a per Resource Monitoring ID (RMID) basis
Cache Allocation Technology (CAT)
  Available on Communications SKUs only
  Last Level Cache partitioning mechanism enabling the separation of applications, threads, VMs, etc.
  Misbehaving threads can be isolated to increase determinism
[Figures: Apps on Core 0, Core 1, ... Core n sharing the Last Level Cache]
Cache Monitoring and Allocation Improve Visibility and Runtime Determinism
Cgroup can be used to control both of them.
https://www-ssl.intel.com/content/www/us/en/communications/cache-monitoring-cache-allocation-
2. Idle Loop
Practice: Multi-pThreads per Core(2)

Testpmd 64Bytes iofwd
Scheduling task switch latency averages 4~6us
Impact & penalty on IO throughput
[Chart: 2x10GE packet IO. Throughput (%) for 1c1pt, 1c2pt, 1c2pt w/ yield.]
[Chart: testpmd w/ yield, rxd=512. Throughput (%) for 1c1pt, 1c2pt, 1c4pt.]
SNB Server 2.7GHz, no Hyper-Threading, no Turbo Boost, 1 x core, 4 x Niantic card, one port/card
Make use of RX idle
  On-demand yield
  Queuing more descriptors
With rxd 512, 1c with 4 pthreads achieves ~90% line rate
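On-demand yield can be sketched as a poll loop that gives up the CPU only when a poll returns nothing. This is a minimal illustration: `rx_burst` and `handle` are hypothetical callables standing in for the receive and processing steps, and the yield uses `os.sched_yield()` where the platform provides it.

```python
import os
import time

def cpu_yield():
    """Let another runnable thread on this core run."""
    if hasattr(os, "sched_yield"):
        os.sched_yield()
    else:
        time.sleep(0)   # fallback where sched_yield is unavailable

def poll_loop(rx_burst, handle, max_polls):
    """Poll for packets; yield the core only when the RX queue was empty."""
    yields = 0
    for _ in range(max_polls):
        pkts = rx_burst()
        if pkts:
            for pkt in pkts:
                handle(pkt)
        else:
            cpu_yield()
            yields += 1
    return yields

# Demo with a scripted receive function: two of five polls carry packets.
bursts = iter([[1, 2], [], [3], [], []])
seen = []
yield_count = poll_loop(lambda: next(bursts), seen.append, 5)
print(seen, yield_count)
```

A deeper descriptor ring gives the thread more time to be scheduled back before the queue overflows, which is why the yield is paired with a larger rxd.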
2. Idle Loop
Practice: Interrupt Mode Packet IO

[Figure: interrupt mode packet IO across User Space and Kernel Space. Labels: DPDK main thread, pthread_create, DPDK polling thread, epoll_wait(), epoll_wait() return, FD, Polling Rx packet, Turn ON or OFF Rx interrupt, igb_uio.ko/vfio-pci.ko, ISR, uio_event_notify(), vfio/uio.ko, Rx interrupt.]
RX -> interrupt -> Process -> TX -> sleep
Wake-up latency averages ~9us
Packet burst during wake-up: ~150 pkts (14.8Mpps * 10us) on 10GE
SNB Server 2.7GHz, 1x Niantic port
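The burst estimate is a rate multiplied by a latency. The sketch below repeats the slide's arithmetic; the helper names are illustrative, and the "descriptors needed" figure is simply the burst rounded up, an assumption about how one might size the ring.

```python
import math

def packets_during(rate_mpps, latency_us):
    """Packets arriving while the thread wakes up (Mpps * us = packets)."""
    return rate_mpps * latency_us

def min_descriptors(rate_mpps, latency_us):
    """Smallest ring that absorbs the wake-up burst without drops."""
    return math.ceil(packets_during(rate_mpps, latency_us))

print(round(packets_during(14.8, 10)))   # the slide's rounded figures
print(min_descriptors(14.8, 9))          # using the measured ~9us average
```

The RX ring has to absorb at least this burst, or packets arriving during wake-up are dropped.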
3. HW offload
Leverage HW offload
Reduce CPU utilization by HW
Well-known offload capabilities
RSS
FDIR
CSUM offload
Tunnel Encap/Decap
TSO
3. HW offload
Practice: CSUM offload on VXLAN

Packet layout: Outer MAC header | Outer IP header | UDP header | VxLAN header | Inner MAC header | Inner IP header | L4 packet
Checksum combinations shown: Outer IP + Inner IP; Outer IP + Inner UDP; Outer IP + Inner IP + Inner UDP
[Chart: VXLAN Inner CSUM offload. Series: SOFTWARE ALL, HW IP, HW UDP, HW IP&UDP. Axes: 1S/1C/1T Mpps and cycles/packet.]
1x40GE FVL
128Bytes packet size
Tunneling packet, VxLAN as sample
Offload does help reduce CPU cycles
HSW 2.3GHz, Turbo Boost enabled
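For reference, this is the work that checksum offload takes off the CPU: the 16-bit one's-complement Internet checksum (RFC 1071), computed here in plain software over an IPv4 header. The sample header bytes are an illustrative assumption, not data from the test above.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as used by IPv4, UDP and TCP."""
    if len(data) % 2:
        data += b"\x00"                              # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]        # sum big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)     # fold carries back in
    return ~total & 0xFFFF

# 20-byte IPv4 header with its checksum field (bytes 10 and 11) set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
csum = internet_checksum(header)
print(hex(csum))

# A receiver recomputes over the header with the checksum filled in and expects 0.
filled = header[:10] + csum.to_bytes(2, "big") + header[12:]
print(internet_checksum(filled))
```

For a tunneled packet this loop runs over the outer IP header, the inner IP header and the whole inner L4 payload, so the software cost grows with packet size.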
Envision/Future
Light weight thread (Co-operative multitask)
AVX2 vector packet IO
Interrupt mode packet IO on virtual ethdev (virtio/vmxnet3)
Interrupt latency optimization
Thanks
