ARM Cortex-M4 in a few words
Introduction
Cortex-M processors are binary compatible
Cortex-M feature set comparison
Feature                         Cortex-M0                           Cortex-M3       Cortex-M4
Architecture version            ARMv6-M                             ARMv7-M         ARMv7E-M
Instruction set architecture    Thumb, Thumb-2 system instructions  Thumb + Thumb-2 Thumb + Thumb-2, DSP, SIMD, FP
DMIPS/MHz                       0.9                                 1.25            1.25
Bus interfaces
Integrated NVIC                 Yes                                 Yes             Yes
Number of interrupts            1-32 + NMI                          1-240 + NMI     1-240 + NMI
Interrupt priorities                                                8-256           8-256
Breakpoints, watchpoints        4/2/0, 2/1/0                        8/4/0, 2/1/0    8/4/0, 2/1/0
Memory Protection Unit (MPU)    No                                  Yes (Option)    Yes (Option)
Integrated trace option (ETM)   No                                  Yes (Option)    Yes (Option)
Fault Robust Interface          No                                  Yes (Option)    No
Single-cycle multiply           Yes (Option)                        Yes             Yes
Hardware divide                 No                                  Yes             Yes
WIC support                     Yes                                 Yes             Yes
Bit-banding support             No                                  Yes             Yes
Single-cycle DSP/SIMD           No                                  No              Yes
Floating-point hardware         No                                  No              Yes
Bus protocol                    AHB Lite                            AHB Lite, APB   AHB Lite, APB
CMSIS support                   Yes                                 Yes             Yes
Cortex-M4
DSP features
Cortex-M4 processor architecture
ARMv7E-M Architecture
Thumb-2 Technology
DSP and SIMD extensions
Single cycle MAC (Up to 32 x 32 + 64 -> 64)
Optional single precision FPU
Integrated configurable NVIC
Compatible with Cortex-M3
Microarchitecture
3-stage pipeline with branch speculation
3x AHB-Lite Bus Interfaces
Configurable for ultra low power
Deep Sleep Mode, Wakeup Interrupt Controller
Power down features for Floating Point Unit
Flexible configurations for wider applicability
Configurable Interrupt Controller (1-240 Interrupts and Priorities)
Optional Memory Protection Unit
Optional Debug & Trace
Cortex-M4 overview
Main Cortex-M4 processor features
ARMv7E-M architecture revision
Fully compatible with Cortex-M3 instruction set
Single-cycle multiply-accumulate (MAC) unit
Optimized single instruction multiple data (SIMD) instructions
Saturating arithmetic instructions
Optional single precision Floating-Point Unit (FPU)
Hardware divide (2-12 cycles), same as Cortex-M3
Barrel shifter (same as Cortex-M3)
Single-cycle multiply-accumulate unit
The multiplier unit executes any MUL or MAC instruction in a single cycle
Signed/Unsigned Multiply
Signed/Unsigned Multiply-Accumulate
Signed/Unsigned Multiply-Accumulate Long (64-bit)
Benefits: speed improvement vs. Cortex-M3
4x for 16-bit MAC (dual 16-bit MAC)
2x for 32-bit MAC
up to 7x for 64-bit MAC
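As a rough illustration of the 64-bit MAC case, the sketch below is a plain C dot product with a 64-bit accumulator. Each loop iteration is one 32 x 32 + 64 -> 64 multiply-accumulate. The function name dot_q31 is made up for this example, and whether a given compiler actually emits a single long multiply-accumulate instruction for it depends on the toolchain and optimization settings.

```c
#include <stdint.h>
#include <stddef.h>

/* Dot product of two signed 32-bit vectors with a 64-bit accumulator.
 * Each iteration performs one 32 x 32 + 64 -> 64 multiply-accumulate.
 * dot_q31 is a hypothetical name used only for this sketch. */
int64_t dot_q31(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* Widen before multiplying so the product keeps all 64 bits. */
        acc += (int64_t)a[i] * (int64_t)b[i];
    }
    return acc;
}
```

On a core where this operation takes 5-7 cycles, the multiply dominates the loop body, which is where the quoted speed-up for 64-bit MACs would come from.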
Cortex-M4 extended single cycle MAC
OPERATION                         INSTRUCTIONS                         CM3 cycles  CM4 cycles
16 x 16 = 32                      SMULBB, SMULBT, SMULTB, SMULTT       n/a         1
16 x 16 + 32 = 32                 SMLABB, SMLABT, SMLATB, SMLATT       n/a         1
16 x 16 + 64 = 64                 SMLALBB, SMLALBT, SMLALTB, SMLALTT   n/a         1
16 x 32 = 32                      SMULWB, SMULWT                       n/a         1
(16 x 32) + 32 = 32               SMLAWB, SMLAWT                       n/a         1
(16 x 16) ± (16 x 16) = 32        SMUAD, SMUADX, SMUSD, SMUSDX         n/a         1
(16 x 16) ± (16 x 16) + 32 = 32   SMLAD, SMLADX, SMLSD, SMLSDX         n/a         1
(16 x 16) ± (16 x 16) + 64 = 64   SMLALD, SMLALDX, SMLSLD, SMLSLDX     n/a         1
32 x 32 = 32                      MUL                                  1           1
32 ± (32 x 32) = 32               MLA, MLS                             2           1
32 x 32 = 64                      SMULL, UMULL                         5-7         1
(32 x 32) + 64 = 64               SMLAL, UMLAL                         5-7         1
(32 x 32) + 32 + 32 = 64          UMAAL                                n/a         1
32 ± (32 x 32) = 32 (upper)       SMMLA, SMMLAR, SMMLS, SMMLSR         n/a         1
(32 x 32) = 32 (upper)            SMMUL, SMMULR                        n/a         1
All the above operations are single cycle on the Cortex-M4 processor
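To make the dual 16-bit operations concrete, here is a portable C model of what SMUAD, SMLAD and SMLALD compute on packed halfwords. It is a sketch for host-side experimentation, not the instructions themselves: the names model_smuad, model_smlad, model_smlald and pack16 are invented for this example, and the Q flag that the hardware sets on overflow is not modeled.

```c
#include <stdint.h>

/* Build a packed word from two signed 16-bit lanes (example helper). */
uint32_t pack16(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Sign-extend the low or high halfword of a packed word. */
static int32_t lane_lo(uint32_t w) { return (int16_t)(w & 0xFFFFu); }
static int32_t lane_hi(uint32_t w) { return (int16_t)(w >> 16); }

/* Sum of both lane products, computed in 64 bits to avoid overflow. */
static int64_t dual_product(uint32_t x, uint32_t y)
{
    return (int64_t)lane_lo(x) * lane_lo(y) + (int64_t)lane_hi(x) * lane_hi(y);
}

/* Model of SMUAD: (16 x 16) + (16 x 16) = 32, wrapping to 32 bits. */
int32_t model_smuad(uint32_t x, uint32_t y)
{
    return (int32_t)(uint32_t)dual_product(x, y);
}

/* Model of SMLAD: (16 x 16) + (16 x 16) + 32 = 32, wrapping to 32 bits. */
int32_t model_smlad(uint32_t x, uint32_t y, int32_t acc)
{
    return (int32_t)(uint32_t)(dual_product(x, y) + acc);
}

/* Model of SMLALD: (16 x 16) + (16 x 16) + 64 = 64.
 * Wraps modulo 2^64 like the hardware accumulator. */
int64_t model_smlald(uint32_t x, uint32_t y, int64_t acc)
{
    return (int64_t)((uint64_t)acc + (uint64_t)dual_product(x, y));
}
```

For example, with lanes (3, -4) and (5, 6) the dual product is 3*5 + (-4)*6 = -9, so model_smlad with an accumulator of 100 yields 91.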
Saturated arithmetic
Intrinsically prevents variables from overflowing by clipping results to the min/max boundaries, which removes the CPU burden of software range checks
Benefits
Audio applications
[Figure: signal without saturation vs. with saturation; vertical axis from -1.5 to 1.5]
Control applications
The PID controller's integral term is continuously accumulated over time. Saturation automatically limits its value and saves several CPU cycles per regulator.
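The difference between wrapping and saturating accumulation can be sketched in portable C. The model below shows what a saturating signed 32-bit add computes, next to an ordinary wrapping add for comparison. The names sat_add32, wrap_add32 and pid_integrate are hypothetical, and the explicit range checks are exactly the software work that a hardware saturating instruction would make unnecessary.

```c
#include <stdint.h>

/* Model of a saturating signed 32-bit add: the result is clipped to
 * [INT32_MIN, INT32_MAX] instead of wrapping around. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Ordinary two's-complement add for comparison: overflow wraps. */
int32_t wrap_add32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Hypothetical PID integrator step: accumulate the error term and let
 * saturation bound the integral instead of a separate range check. */
int32_t pid_integrate(int32_t integral, int32_t error)
{
    return sat_add32(integral, error);
}
```

With wrapping arithmetic, an integral sitting at the positive limit flips to a large negative value on the next positive error. With saturation it simply stays at the limit.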
Single-cycle SIMD instructions
SIMD stands for Single Instruction, Multiple Data
It operates on packed data
Performs several operations simultaneously on 8-bit or 16-bit data
e.g. dual 16-bit MAC (Result = 16x16 + 16x16 + 32)
Benefits
Parallelizes operations (2x to 4x speed gain)
Minimizes the number of load/store instructions for exchanges between memory and the register file (2 or 4 values transferred at once) when full 32-bit values are not needed
Maximizes register file use (1 register holds 2 or 4 values)
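The idea of one register holding two independent values can be illustrated with a portable C model of a packed 16-bit add, the operation that SADD16 performs in a single instruction. The function name model_sadd16 is invented for this sketch. The point it demonstrates is lane isolation: a carry out of the low halfword must not leak into the high halfword.

```c
#include <stdint.h>

/* Model of a packed add on two 16-bit lanes held in one 32-bit word.
 * Each lane wraps independently; no carry crosses the lane boundary. */
uint32_t model_sadd16(uint32_t x, uint32_t y)
{
    /* Low lane: only the low 16 bits of the sum are kept. */
    uint32_t lo = (x + y) & 0x0000FFFFu;
    /* High lane: add the upper halfwords on their own, then shift back.
     * Any carry out of bit 15 of this sum is discarded by the shift. */
    uint32_t hi = ((x >> 16) + (y >> 16)) << 16;
    return hi | lo;
}
```

A plain 32-bit add of 0x0001FFFF and 0x00010001 would give 0x00030000, because the low-lane carry spills upward. The packed model gives 0x00020000.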
Packed data types
Byte or halfword quantities packed into words
Allows more efficient access to packed structure types
SIMD instructions can act on packed data
Instructions to extract and pack data
[Figure: Extract expands packed fields A and B into separate words (00...00 A and 00...00 B); Pack combines them into one word]
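Packing and extracting can be modeled in a few lines of portable C. This is a sketch of the data movement only: the helper names are made up for the example, and on the Cortex-M4 these steps correspond to dedicated instructions such as PKHBT for packing and SXTH/UXTH for extraction rather than to shift-and-mask sequences.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word:
 * lo goes to bits 15..0, hi goes to bits 31..16. */
uint32_t pack_halfwords(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Extract a lane with sign extension to 32 bits. */
int32_t extract_lo(uint32_t w) { return (int16_t)(w & 0xFFFFu); }
int32_t extract_hi(uint32_t w) { return (int16_t)(w >> 16); }

/* Extract the low lane with zero extension (upper bits become 00...00). */
uint32_t extract_lo_unsigned(uint32_t w) { return w & 0xFFFFu; }
```

Packing -2 and 7 gives 0x0007FFFE. Extracting with sign extension returns the original -2 and 7, while the zero-extended low lane reads as 0xFFFE.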
Further optimization strategies
Circular addressing alternatives
Loop unrolling
Caching of intermediate variables
Extensive use of SIMD and intrinsics
Cortex-M4 - Final FIR Code
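    /* Assumed types: statePtr, stateBasePtr and pDst are q15_t *,
       coeffPtr is q31_t * (two packed q15 coefficients per word),
       x0..x3 and c0 are q31_t, sum0..sum3 are q63_t. */
    sample = blockSize / 4;                 /* four outputs per outer pass */
    do
    {
        sum0 = sum1 = sum2 = sum3 = 0;
        statePtr = stateBasePtr;
        coeffPtr = (q31_t *)(S->coeffs);
        /* Each read fetches two adjacent samples; the pointer advances by one. */
        x0 = *(q31_t *)(statePtr++);
        x1 = *(q31_t *)(statePtr++);
        i = numTaps >> 2;                   /* four taps per inner pass */
        do
        {
            c0 = *(coeffPtr++);             /* first coefficient pair */
            x2 = *(q31_t *)(statePtr++);
            x3 = *(q31_t *)(statePtr++);
            sum0 = __SMLALD(x0, c0, sum0);
            sum1 = __SMLALD(x1, c0, sum1);
            sum2 = __SMLALD(x2, c0, sum2);
            sum3 = __SMLALD(x3, c0, sum3);
            c0 = *(coeffPtr++);             /* second coefficient pair */
            x0 = *(q31_t *)(statePtr++);
            x1 = *(q31_t *)(statePtr++);
            /* The second pair applies to samples two positions later. */
            sum0 = __SMLALD(x2, c0, sum0);
            sum1 = __SMLALD(x3, c0, sum1);
            sum2 = __SMLALD(x0, c0, sum2);
            sum3 = __SMLALD(x1, c0, sum3);
        } while (--i);
        /* Convert the 64-bit accumulators back to q15 (no saturation here). */
        *pDst++ = (q15_t)(sum0 >> 15);
        *pDst++ = (q15_t)(sum1 >> 15);
        *pDst++ = (q15_t)(sum2 >> 15);
        *pDst++ = (q15_t)(sum3 >> 15);
        stateBasePtr = stateBasePtr + 4;
    } while (--sample);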
Uses loop unrolling, SIMD intrinsics, and caching of states and coefficients, and works around circular addressing by using a large state buffer.
The inner loop takes 26 cycles for a total of sixteen 16-bit MACs.
Only 1.625 cycles per filter tap!
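When experimenting with an optimized kernel like this, a straightforward scalar version is useful for checking its output. The sketch below is such a reference, written in portable C. It assumes the same layout the optimized loop relies on: a linear state buffer of block_size + num_taps - 1 samples (the "large state buffer" that replaces circular addressing), with coeffs[k] applied to state[n + k]. The name fir_ref_q15 and its parameter list are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference FIR over a linear state buffer.
 * state must hold block_size + num_taps - 1 samples (assumed layout).
 * Output n is the sum over k of coeffs[k] * state[n + k], scaled from
 * the 64-bit accumulator back to q15 by an arithmetic shift of 15. */
void fir_ref_q15(const int16_t *state, const int16_t *coeffs,
                 size_t num_taps, int16_t *dst, size_t block_size)
{
    for (size_t n = 0; n < block_size; n++) {
        int64_t sum = 0;
        for (size_t k = 0; k < num_taps; k++) {
            sum += (int32_t)state[n + k] * (int32_t)coeffs[k];
        }
        /* Same conversion as the optimized loop: no saturation. */
        dst[n] = (int16_t)(sum >> 15);
    }
}
```

Because the window for output n simply starts at state[n], no wrap-around index arithmetic is needed. The cost is the extra num_taps - 1 samples of buffer space.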
Cortex-M4 - FIR performance
DSP assembly code: 1 cycle per tap
Cortex-M4 standard C code: 12 cycles per tap
Using the circular addressing alternative: 8 cycles per tap
After loop unrolling: < 6 cycles per tap
After using SIMD instructions: < 2.5 cycles per tap
After caching intermediate values: ~1.6 cycles per tap
Cortex-M4 C code now comparable in
performance