ARM Cortex-M4 Architecture Overview

The document compares features of the Cortex-M0, Cortex-M3, and Cortex-M4 microcontroller processors. The Cortex-M4 offers improvements over prior versions including a single-cycle multiply-accumulate unit, single-cycle SIMD instructions, saturated arithmetic, and an optional floating-point unit. Code examples show how these features can be used to optimize a FIR filter algorithm to achieve high performance comparable to assembly code.

Uploaded by

Anis Billie Jeans

ARM Cortex-M4 in a few words

Introduction

Cortex-M processors binary compatible

Cortex-M feature set comparison

Feature | Cortex-M0 | Cortex-M3 | Cortex-M4
--- | --- | --- | ---
Architecture version | v6M | v7M | v7ME
Instruction set architecture | Thumb, Thumb-2 System Instructions | Thumb + Thumb-2 | Thumb + Thumb-2, DSP, SIMD, FP
DMIPS/MHz | 0.9 | 1.25 | 1.25
Bus interfaces | 1 | 3 | 3
Integrated NVIC | Yes | Yes | Yes
Number of interrupts | 1-32 + NMI | 1-240 + NMI | 1-240 + NMI
Interrupt priorities | 4 | 8-256 | 8-256
Breakpoints, Watchpoints | 4/2/0, 2/1/0 | 8/4/0, 2/1/0 | 8/4/0, 2/1/0
Memory Protection Unit (MPU) | No | Yes (Option) | Yes (Option)
Integrated trace option (ETM) | No | Yes (Option) | Yes (Option)
Fault Robust Interface | No | Yes (Option) | No
Single Cycle Multiply | Yes (Option) | Yes | Yes
Hardware Divide | No | Yes | Yes
WIC Support | Yes | Yes | Yes
Bit banding support | No | Yes | Yes
Single cycle DSP/SIMD | No | No | Yes
Floating point hardware | No | No | Yes
Bus protocol | AHB Lite | AHB Lite, APB | AHB Lite, APB
CMSIS Support | Yes | Yes | Yes
Cortex-M4 DSP features

Cortex-M4 processor architecture

ARMv7ME Architecture

Thumb-2 Technology
DSP and SIMD extensions
Single cycle MAC (Up to 32 x 32 + 64 -> 64)
Optional single precision FPU
Integrated configurable NVIC
Compatible with Cortex-M3

Microarchitecture
3-stage pipeline with branch speculation
3x AHB-Lite Bus Interfaces

Configurable for ultra low power

Deep Sleep Mode, Wakeup Interrupt Controller
Power down features for Floating Point Unit

Flexible configurations for wider applicability

Configurable Interrupt Controller (1-240 Interrupts and Priorities)
Optional Memory Protection Unit
Optional Debug & Trace

Cortex-M4 overview
Main Cortex-M4 processor features
ARMv7-ME architecture revision
Fully compatible with Cortex-M3 instruction set

Single-cycle multiply-accumulate (MAC) unit
Optimized single instruction multiple data (SIMD) instructions
Saturating arithmetic instructions
Optional single precision Floating-Point Unit (FPU)
Hardware divide (2-12 cycles), same as Cortex-M3
Barrel shifter (same as Cortex-M3)

Single-cycle multiply-accumulate unit

The multiplier unit allows any MUL or MAC instruction to be executed in a single cycle
Signed/Unsigned Multiply
Signed/Unsigned Multiply-Accumulate
Signed/Unsigned Multiply-Accumulate Long (64-bit)

Benefits: speed improvement vs. Cortex-M3

4x for 16-bit MAC (dual 16-bit MAC)
2x for 32-bit MAC
up to 7x for 64-bit MAC
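
As a rough illustration of the 64-bit MAC case listed above, the following sketch accumulates 32 x 32 products into a 64-bit sum in plain C. The function name `mac64_dot` is hypothetical, and whether a given compiler emits a long multiply-accumulate instruction such as SMLAL for the loop body is an assumption that depends on the toolchain and optimization level.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: dot product with a 64-bit accumulator.
 * Each loop step is a (32 x 32) + 64 -> 64 multiply-accumulate,
 * the operation class the slide reports as single cycle on Cortex-M4. */
int64_t mac64_dot(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* Widen before multiplying so the product keeps all 64 bits. */
        acc += (int64_t)a[i] * b[i];
    }
    return acc;
}
```

Widening one operand before the multiplication is the important detail: without the cast the product would be computed in 32 bits and could overflow before it reaches the accumulator.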

Cortex-M4 extended single cycle MAC

OPERATION | INSTRUCTIONS | CM3 (cycles) | CM4 (cycles)
--- | --- | --- | ---
16 x 16 = 32 | SMULBB, SMULBT, SMULTB, SMULTT | n/a | 1
16 x 16 + 32 = 32 | SMLABB, SMLABT, SMLATB, SMLATT | n/a | 1
16 x 16 + 64 = 64 | SMLALBB, SMLALBT, SMLALTB, SMLALTT | n/a | 1
16 x 32 = 32 | SMULWB, SMULWT | n/a | 1
(16 x 32) + 32 = 32 | SMLAWB, SMLAWT | n/a | 1
(16 x 16) ± (16 x 16) = 32 | SMUAD, SMUADX, SMUSD, SMUSDX | n/a | 1
(16 x 16) ± (16 x 16) + 32 = 32 | SMLAD, SMLADX, SMLSD, SMLSDX | n/a | 1
(16 x 16) ± (16 x 16) + 64 = 64 | SMLALD, SMLALDX, SMLSLD, SMLSLDX | n/a | 1
32 x 32 = 32 | MUL | 1 | 1
32 ± (32 x 32) = 32 | MLA, MLS | 2 | 1
32 x 32 = 64 | SMULL, UMULL | 5-7 | 1
(32 x 32) + 64 = 64 | SMLAL, UMLAL | 5-7 | 1
(32 x 32) + 32 + 32 = 64 | UMAAL | n/a | 1
32 ± (32 x 32) = 32 (upper) | SMMLA, SMMLAR, SMMLS, SMMLSR | n/a | 1
(32 x 32) = 32 (upper) | SMMUL, SMMULR | n/a | 1

All the above operations are single cycle on the Cortex-M4 processor
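
To make the "(32 x 32) = 32 (upper)" rows more concrete, here is a minimal portable model of what SMMUL and its rounding variant SMMULR compute. The function names are hypothetical, and the sketch assumes the compiler implements right shift of negative signed values as an arithmetic shift, which is the usual behaviour on ARM toolchains but is implementation-defined in C.

```c
#include <stdint.h>

/* Hypothetical model of SMMUL: upper 32 bits of the signed 64-bit product. */
int32_t mul_high32(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Hypothetical model of SMMULR: same, but 0x80000000 is added to the
 * product first so the discarded lower half is rounded, not truncated. */
int32_t mul_high32_round(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b + 0x80000000LL) >> 32);
}
```

Keeping only the upper word is a common way to multiply fixed-point fractions without a separate 64-bit result register pair.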

Saturated arithmetic
Intrinsically prevents variable overflow by clipping results to the min/max boundaries, which removes the CPU burden of software range checks
Benefits
Audio applications
[Figure: waveform on a -1.5 to 1.5 scale, shown without saturation and with saturation]
Control applications
The PID controller's integral term is continuously accumulated over time. Saturation automatically limits its value and saves several CPU cycles per regulator
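
The software range check that saturation replaces can be sketched as follows. This is only an illustrative model: `sat_add32` and `pid_integrate` are hypothetical names, and on a Cortex-M4 the same clipping would normally come from a saturating instruction such as QADD rather than from explicit comparisons.

```c
#include <stdint.h>

/* Hypothetical software model of a saturating 32-bit add:
 * the result is clipped to [INT32_MIN, INT32_MAX] instead of wrapping. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;          /* exact sum in 64 bits */
    if (s > INT32_MAX) return INT32_MAX; /* clip at the upper boundary */
    if (s < INT32_MIN) return INT32_MIN; /* clip at the lower boundary */
    return (int32_t)s;
}

/* Hypothetical PID integral update: the accumulated error can never
 * wrap around, however long the error keeps the same sign. */
int32_t pid_integrate(int32_t integral, int32_t error)
{
    return sat_add32(integral, error);
}
```

The two comparisons per update are the cycles a hardware saturating add removes.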

Single-cycle SIMD instructions

SIMD stands for Single Instruction Multiple Data

It operates on packed data
Performs several operations simultaneously on 8-bit or 16-bit data
e.g. dual 16-bit MAC (Result = 16x16 + 16x16 + 32)

Benefits
Parallelizes operations (2x to 4x speed gain)
Minimizes the number of load/store instructions for exchanges between memory and the register file (2 or 4 data items transferred at once) when 32-bit precision is not necessary
Maximizes register file use (1 register holds 2 or 4 values)
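
The dual 16-bit MAC mentioned above (Result = 16x16 + 16x16 + 32) can be modelled in portable C as shown below. The name `dual_mac16` is hypothetical; on a Cortex-M4 the whole function body corresponds to a single SMLAD instruction, available in CMSIS as the `__SMLAD` intrinsic. The sketch assumes the usual wrap-around conversion from unsigned to `int16_t`, which is implementation-defined in C.

```c
#include <stdint.h>

/* Hypothetical portable model of a dual 16-bit MAC on packed operands:
 * result = acc + lo16(x) * lo16(y) + hi16(x) * hi16(y).
 * The sum wraps modulo 2^32; it does not saturate. */
int32_t dual_mac16(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t xl = (int16_t)(x & 0xFFFFu);  /* low halfword of x  */
    int16_t xh = (int16_t)(x >> 16);      /* high halfword of x */
    int16_t yl = (int16_t)(y & 0xFFFFu);
    int16_t yh = (int16_t)(y >> 16);

    /* Add in unsigned arithmetic so an overflow wraps instead of being
     * undefined behaviour. */
    uint32_t sum = (uint32_t)acc
                 + (uint32_t)((int32_t)xl * yl)
                 + (uint32_t)((int32_t)xh * yh);
    return (int32_t)sum;
}
```

For example, with x holding the halfwords (3, 2) and y holding (5, 4), the call `dual_mac16(0x00030002u, 0x00050004u, 10)` evaluates 2*4 + 3*5 + 10.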

Packed data types

Byte or halfword quantities packed into words

Allows more efficient access to packed structure types
SIMD instructions can act on packed data
Instructions to extract and pack data
[Figure: a word packing halfwords A and B; Extract produces 00......00 A and 00......00 B, Pack recombines them into one word]
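
A minimal sketch of the extract and pack operations on halfwords, written in portable C. The helper names are hypothetical; the zero-extending extract mirrors what a UXTH instruction does, and the pack step mirrors PKHBT, though whether a compiler selects those instructions is an assumption.

```c
#include <stdint.h>

/* Hypothetical helper: extract the low halfword, zero-extended to 32 bits. */
uint32_t extract_low(uint32_t word)
{
    return word & 0xFFFFu;
}

/* Hypothetical helper: extract the high halfword, zero-extended to 32 bits. */
uint32_t extract_high(uint32_t word)
{
    return word >> 16;
}

/* Hypothetical helper: pack two halfwords into one word,
 * `low` in bits 15:0 and `high` in bits 31:16. */
uint32_t pack_halfwords(uint32_t low, uint32_t high)
{
    return (low & 0xFFFFu) | (high << 16);
}
```

Packing and then extracting returns the original halfwords, so data can stay packed in registers between SIMD operations and only be unpacked when a full 32-bit value is needed.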

Further optimization strategies

Circular addressing alternatives
Loop unrolling
Caching of intermediate variables
Extensive use of SIMD and intrinsics
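
One common alternative to circular addressing, sketched here under stated assumptions, is a linear state buffer that is larger than the filter: it keeps `num_taps - 1` history samples followed by room for one block of new input, so the inner loop reads consecutive samples with no wrap-around test. The function name, argument order and the time-reversed coefficient layout are illustrative assumptions, not the code from the slides.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical block FIR on 16-bit fixed-point data with a linear state buffer.
 * state must hold (num_taps - 1 + block_size) samples.
 * coeffs are assumed stored oldest-sample-first, so coeffs[num_taps - 1]
 * multiplies the newest input sample. */
void fir_block_linear(const int16_t *coeffs, int num_taps,
                      int16_t *state, const int16_t *src,
                      int16_t *dst, int block_size)
{
    /* Append the new input block after the saved history. */
    memcpy(&state[num_taps - 1], src, (size_t)block_size * sizeof(int16_t));

    for (int n = 0; n < block_size; n++) {
        int64_t acc = 0;
        for (int k = 0; k < num_taps; k++) {
            /* Linear access: no modulo or pointer wrap inside the loop. */
            acc += (int32_t)coeffs[k] * state[n + k];
        }
        dst[n] = (int16_t)(acc >> 15);   /* scale the Q15 product back */
    }

    /* Move the last num_taps - 1 samples to the front as history. */
    memmove(state, &state[block_size],
            (size_t)(num_taps - 1) * sizeof(int16_t));
}
```

The cost of the wrap-around logic is moved out of the inner loop and paid once per block by the final copy, which is why this trade is attractive when the block size is large relative to the filter length.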

Cortex-M4 - Final FIR Code

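/* Assumes statePtr, stateBasePtr and pDst are q15_t pointers, coeffPtr is a */
/* q31_t pointer, and sum0..sum3 are 64-bit accumulators.                   */
sample = blockSize/4;                       /* four outputs per outer pass */
do
{
    sum0 = sum1 = sum2 = sum3 = 0;
    statePtr = stateBasePtr;
    coeffPtr = (q31_t *)(S->coeffs);        /* one read = two q15 coefficients */
    x0 = *(q31_t *)(statePtr++);            /* x0 = {x[n],   x[n+1]} */
    x1 = *(q31_t *)(statePtr++);            /* x1 = {x[n+1], x[n+2]} */
    i = numTaps>>2;                         /* four taps per inner pass */
    do
    {
        c0 = *(coeffPtr++);
        x2 = *(q31_t *)(statePtr++);
        x3 = *(q31_t *)(statePtr++);
        sum0 = __SMLALD(x0, c0, sum0);
        sum1 = __SMLALD(x1, c0, sum1);
        sum2 = __SMLALD(x2, c0, sum2);
        sum3 = __SMLALD(x3, c0, sum3);
        c0 = *(coeffPtr++);
        x0 = *(q31_t *)(statePtr++);
        x1 = *(q31_t *)(statePtr++);
        /* Second coefficient pair: each output needs the states two */
        /* samples further on, so the cached registers rotate.       */
        sum0 = __SMLALD(x2, c0, sum0);
        sum1 = __SMLALD(x3, c0, sum1);
        sum2 = __SMLALD(x0, c0, sum2);
        sum3 = __SMLALD(x1, c0, sum3);
    } while(--i);
    *pDst++ = (q15_t) (sum0>>15);
    *pDst++ = (q15_t) (sum1>>15);
    *pDst++ = (q15_t) (sum2>>15);
    *pDst++ = (q15_t) (sum3>>15);
    stateBasePtr = stateBasePtr + 4;
} while(--sample);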

Uses loop unrolling, SIMD intrinsics, caching of states and coefficients, and works around circular addressing by using a large state buffer.
Inner loop is 26 cycles for a total of 16 16-bit MACs.
Only 1.625 cycles per filter tap!

Cortex-M4 - FIR performance

DSP assembly code = 1 cycle per filter tap
Cortex-M4 standard C code takes 12 cycles
Using circular addressing alternative = 8 cycles
After loop unrolling < 6 cycles
After using SIMD instructions < 2.5 cycles
After caching intermediate values ~ 1.6 cycles
Cortex-M4 C code is now comparable in performance to DSP assembly code
