System-On-Chip Design Book 2019 200dpi Aw
System-On-Chip Design Book 2019 200dpi Aw
System-On-Chip Design Book 2019 200dpi Aw
System-on-Chip
Design
with Arm® Cortex®-M Processors
Reference Book
JOSEPH YIU
1
System-on-Chip Design with Arm® Cortex®-M processors
2
System-on-Chip Design with Arm® Cortex®-M processors
Arm Education Media is an imprint of Arm Limited, 110 Fulbourn Road, Cambridge, CBI 9NJ, UK
Copyright © 2019 Arm Limited (or its affiliates). All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic
or mechanical, including photocopying, recording or any other information storage and retrieval
system, without permission in writing from the publisher, except under the following conditions:
Permissions
You may download this book in PDF format from the Arm.com website for personal, non-
commercial use only.
You may reprint or republish portions of the text for non-commercial, educational or research
purposes but only if there is an attribution to Arm Education.
This book and the individual contributions contained in it are protected under copyright by the
Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden
our understanding, changes in research methods and professional practices may become necessary.
Readers must always rely on their own experience and knowledge in evaluating and using any
information, methods, project work, or experiments described herein. In using such information or
methods, they should be mindful of their safety and the safety of others, including parties for whom
they have a professional responsibility.
To the fullest extent permitted by law, the publisher and the authors, contributors, and editors shall
not have any responsibility or liability for any losses, liabilities, claims, damages, costs or expenses resulting
from or suffered in connection with the use of the information and materials set out in this textbook.
Such information and materials are protected by intellectual property rights around the world and are
copyright © Arm Limited (or its affiliates). All rights are reserved. Any source code, models or other materials
set out in this textbook should only be used for non-commercial, educational purposes (and/or subject to
the terms of any license that is specified or otherwise provided by Arm). In no event shall purchasing this
textbook be construed as granting a license to use any other Arm technology or know-how.
ISBN: 978-1-911531-19-7
For information on all Arm Education Media publications, visit our website at www.armedumedia.com
3
System-on-Chip Design with Arm® Cortex®-M processors
Foreword
The most surprising thing is that Arm does not produce chips. It just designs the technology and
enables its partners to manufacture differentiated devices that integrate them.
Many more of those chips, also called SoCs (system-on-chip), are expected to be produced in the
coming years. We even start talking about trillions of devices for the Internet of Things (IoT). Of the
total number of SoCs currently out in the market, the great majority use the smallest processors in the
Arm product range: the Cortex-M series. Small, very energy efficient and powerful enough for many
applications, they are at the heart of many of today’s electronic devices.
This book is here to explain how SoCs based on the Arm Cortex-M processor portfolio cores are
designed, detail the different elements that compose such a system, explain the different design
issues, describe the integration into systems, and discuss how these SoCs are programmed.
At the time, RISC processors were confined to high-end computers, where cost was less of an issue,
since no existing RISC processors were exactly suitable. That led the team to embark on the journey
to develop their own piece of silicon.
This secret project was named “Acorn RISC Machine” (ARM, in short). The first processor, ARM1, was
launched in 1985. It was produced by VLSI Technology in a 3µm technology (almost 500 times larger
than the most advanced designs now) and could run at 6 MHz. One of the side-benefits of this simple
processor architecture was its lower power consumption (compared to contemporaneous CPUs),
which allowed the component to use a lower-cost plastic package without melting it.
At the heart of the processor design was the Arm instruction set, which progressively evolved to
optimize the performance and efficiency of new generations of processors. This is a key element of
what is called the ‘architecture.’
4
System-on-Chip Design with Arm® Cortex®-M processors
The Arm processors powered several models of Acorn computers, but a major change happened when
VLSI Technology, which was manufacturing the components in its factories, signed an agreement with
Acorn to re-sell the chips to other companies. This was the first ‘Arm license.’
In 1990, after discussions with Apple Computer, who needed a new processor for the Newton
project, Acorn decided to spin-off its processor division and form a joint venture with Apple and VLSI
Technology. The team then changed the meaning of Arm to ‘Advanced RISC Machines’, which became
Arm Ltd later on.
This evolution came at the same time as a great change in the new company’s business model. On
the one hand, Arm had unique assets: great expertise in processor design and an original architecture.
However, producing chips required caring about fabrication, yield, quality, logistics, sales channels,
complex application-specific marketing, or any other tasks that a silicon manufacturer should do to be
successful. This was not optimal.
On the other hand, silicon manufacturers had a hard time staying competitive, because they had to
excel at these activities while simultaneously investing in design and innovation around processors,
at an increasingly fast pace. This was not great either.
The revolutionary idea for the newly-formed company was to become a specialist in R&D and focus
on the processor design only. Instead of selling components, Arm would license ‘Intellectual Property’
(IP in short) to semiconductor manufacturers, who would then use this IP to design their chips, in
combination with other elements that would be more application-specific.
Arm Ecosystem
The IP model selected from the start by Arm required a very tight relationship with the other
companies using the IP. As the company did not manufacture products, its success was entirely
dependent on the success of chip manufacturers embedding the Arm IP into their chips. Conversely,
to make sure that they always get the best performance and efficiency for their products, silicon
manufacturers had to make sure that the success of their products also benefited Arm, so that part of
the increasing revenues would be invested in improved and competitive IP. Together, Arm and partners
solidified the symbiosis using a royalty-based model: Arm revenues were largely dependent on the
success of the chips containing its IP. This resulted in a strong partnership between the company and
its customers, and a great sign of this very special relationship is that customers were called ‘partners’
(This is still the case more than 25 years after the foundation of the company).
Another great benefit from these partnerships was that each semiconductor ‘partner’ could focus
on a different set of applications, on different market segments, and integrate its own expertise and
‘secret sauce’ into the design of their products. This business model allowed the creation of a rich
variety of products that no single company (even the largest ones) would have been able to put into
their product catalog. It also made it increasingly difficult for processor manufacturers using other
architectures to compete with Arm because they had to compete with a whole ‘ecosystem.’ Many
of them progressively decided to stop wasting money on processor architecture development and
realized that it was much less expensive just to license state-of-the-art IP from Arm.
5
System-on-Chip Design with Arm® Cortex®-M processors
Another consequence of having several companies using the same processor IP cores was that tools,
software, and expertise could be reused from one chip to another. Indeed, a processor requires many
tools like code compilers or debuggers: having a larger market for these tools encouraged several
companies to start supporting the Arm architecture. Similarly, having a family of processors that could
execute the same instructions enabled the software developers to propose many operating systems,
libraries, frameworks or various elements that could easily run or be adapted to several components.
Finally, this allowed engineers to avoid having to learn about a new processor every time they changed
their chip, which allowed them to build strong expertise and become more efficient.
All of these factors meant that Arm could add several additional partners in the ecosystem, bringing
even greater value to every participant and making Arm-based solutions even more attractive. This
virtuous circle has significantly contributed to the success of the Arm ecosystem.
Softbank Acquisition
Even if the IP model has been duplicated many times, no other company has managed to be as
successful. This propelled Arm into a very special position in the industry. Its long-term success
required fairness with each member of the industry, and careful management to keep the balance
between all partners of the ecosystem.
2016 marked a significant milestone in Arm’s history: Softbank group agreed with Arm management
to acquire the company with the promise to continue promoting the same values of fairness and
partnership while accelerating its development.
However, central processing units are not the only IP offered by Arm: a diverse range of IP has been
developed or acquired by the company to address the needs of many applications. This is the case of
what is called ‘System IP’: all the elements that enable processors to connect to the rest of the system,
transfer or store data between those elements, manage security, enable the debug of the software,
and manage power. Another very important line of products relates to media processing, and the Arm
Mali series is now the world’s ‘most shipped’ commercial GPU IP.
6
System-on-Chip Design with Arm® Cortex®-M processors
An entire division in Arm is now focusing on building this embedded software foundation, and also
creating a Cloud platform, called Pelion, to connect and manage to all these embedded devices, and
to integrate the associated data into enterprise systems.
From providing the IP for the chip to delivering the Cloud services that allow organizations to manage
the deployment of products throughout their lifecycle securely, Arm delivers a pre-integrated IoT
solution for its partners, rooted in its deep understanding of the future of compute and security.
Arm technologies continuously evolve to ensure that intelligence is at the core of a secure and
connected digital world. With a range of licensing options, such as Arm DesignStart and Arm Flexible
Access, it’s now never been easier or faster to start working with Arm IP. Developed to facilitate the
design of modern innovations—from the sensor to the smartphone to the supercomputer—Arm
technologies are making smart possible.
Mike Eftimakis
Director of Business Innovation Strategy, Arm
7
System-on-Chip Design with Arm® Cortex®-M processors
Contents
Foreword 4
Preface 16
Example Codes and Projects / Disclaimer / A note about the scope of this book 17
Acknowledgments 19
1.3.5 Documentation 28
8
System-on-Chip Design with Arm® Cortex®-M processors
2.13 SysTick 49
9
System-on-Chip Design with Arm® Cortex®-M processors
5.2.7 Debug components discovery (ROM table and component IDs) 124
10
System-on-Chip Design with Arm® Cortex®-M processors
6. Low-power support
6.1 Overview of low-power Cortex-M features 140
6.5.7 Sleep mode that completely powers down the processor 157
11
System-on-Chip Design with Arm® Cortex®-M processors
12
System-on-Chip Design with Arm® Cortex®-M processors
10.3.5 Connecting ADC and DAC IPs into a Cortex-M system 262
10.4 Bring an SoC to life – Beetle test chip case study 263
13
System-on-Chip Design with Arm® Cortex®-M processors
11.3 Introduction of the Arm Development Studio featuring Arm Keil Microcontroller
Development Kit (MDK) 281
14
System-on-Chip Design with Arm® Cortex®-M processors
15
System-on-Chip Design with Arm® Cortex®-M processors
Preface
In the past, apart from microprocessors and microcontrollers, not many chip designs had internal
embedded processors. This has changed significantly since Arm Cortex-M processors were released,
and many more device types have emerged that are part of the rapidly growing Internet of Things (IoT).
Today, Arm processors are being used in smart sensors, smart batteries (e.g., for battery health monitor
systems), wireless communication chipsets, power electronics controllers, etc. This trend is driven by
the need for tighter system integration, additional functional features, better system reliability, and
reduction of supply chain dependency.
SoC design is an exciting industry with plenty of opportunities – the applications of Cortex-M based
SoCs ranges from consumer products, industrial and automotive applications, communications,
agriculture, transportation, healthcare/medical, etc. With the expanding IoT device market, the need
for embedding processors into SoC designs continues to increase.
Cortex-M processors, like Cortex-M0, Cortex-M0+, and Cortex-M3, are very small and can integrate
into a range of SoC designs easily. With Arm DesignStart lowering the cost barrier, many small
businesses and start-ups are taking advantage of this to develop their own SoC solutions to offer
better product differentiation. All of these developments have resulted in significant demand for SoC
designers with Arm DesignStart. Arm DesignStart has also received strong interest from academia,
where we see some universities interested in introducing SoC design topics into their courses.
In addition to the popular Armv6-M and Armv7-M processors, newly available SoCs/microcontrollers
based on the Armv8-M processors such as Cortex-M23 and Cortex-M33 processors, deliver enhanced
security solution with Arm TrustZone technology. In February 2019, Arm announced the new
Armv8.1-M architecture with Arm Helium technology, which brings vector processing capability to
Arm Cortex-M devices. These technology enhancements continue to enable the Cortex-M processors
to be used in an even wider range of applications.
While there are many technical resources on the internet on Arm software development, very limited
information was available for Arm-based SoC design, particularly on topics about integrating Arm
processors and on-chip bus protocols. This book is written to fill this gap to enable beginners in the
field to understand a range of technical concepts on SoC design, and also provide detailed descriptions
of design integration with several of the Arm Cortex-M processors. A range of other topics, including
system component design, SoC design flow, and software development, are also covered.
If you are a beginner in SoC design, I hope that this book will enable you to gain SoC design
knowledge and help you to kickstart your SoC or FPGA design projects. For those of you who are
experienced chip designers, I hope that you find this a useful reference source. Enjoy the book -
and let your SoC design creativity go wild! There are always opportunities for new and fascinating
Arm-based SoCs on the market.
16
System-on-Chip Design with Arm® Cortex®-M processors
For readers of this book, Joseph Yiu has prepared a package of example codes and projects
to download that includes:
An example Cortex-M3 system design based on Arm Cortex-M3 DesignStart Eval.
An FPGA project setup for the example system, for Digilent Arty-S7-50T FPGA board
and Xilinx Vivado 2019.1.
The package can be downloaded from the book section of Arm Education Media’s website at
www.armedumedia.com
Disclaimer
The Verilog design examples and related software files included in this book are created for
educational purposes and are not validated to the same quality level as Arm IP products.
Arm Education Media and the author do not make any warranties of these designs.
This book focuses on the concepts of system designs based on Cortex-M0 and Cortex-M3 processors.
Since the product offering DesignStart and DesignStart FPGA will change over time, the full details of
using those packages will not be covered here. However, the system design concepts and some of the
technical details in this document are relevant to most of the Cortex-M system designs.
17
System-on-Chip Design with Arm® Cortex®-M processors
Joseph Yiu
Distinguished Engineer, Embedded Technology at Arm
Joseph is a distinguished engineer in the Arm IoT/Embedded processors product marketing team.
His role is focused on technologies and products for embedded applications, including areas such as:
Technical marketing
Technical advisory for various internal and external projects, as well as Arm’s product support team
He also works with EEMBC (www.eembc.org) on benchmark development – for example, ULPMark.
Joseph started as an IP designer on accelerated 8-bit processors in 1998 before joining Arm in 2001,
where he worked on some of the first Arm-based SoC projects in the emerging System-on-Chip
group. In 2005, he moved to the processor division and worked on a range of Cortex-M processor
and design kit projects. After over 10 years in various senior engineering roles, he moved into the
product management team, while continuing his involvement in Arm embedded technology projects.
His technical specialisms include microcontroller and SoC system-level design with Arm Cortex-M
processors, applications and programming, ASIC/SoC designs, verifications, FPGA prototyping and
implementation areas such as low-power design and production tests (DFT), and RF circuit design.
Authorship
Joseph’s previous book titles include:
The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, 1st to 3rd edition
(Elsevier, October 2013)
The Definitive Guide to the ARM Cortex-M3, 1st and 2nd edition
(Elsevier, January 2010)
18
System-on-Chip Design with Arm® Cortex®-M processors
Acknowledgments
A big thank you to the editor, Michael Shuff, for his efforts in proofreading and various useful
suggestions. I would also like to thank Christopher Seidl, Chris Shore, and Jon Marsh for contributing
materials, and the Arm marketing team for their support on this project.
19
System-on-Chip Design with Arm® Cortex®-M processors
20
System-on-Chip Design with Arm® Cortex®-M processors
CHAPTER
Introduction to 1
Arm Cortex-M
21
System-on-Chip Design with Arm® Cortex®-M processors
You may wonder, ‘Why not use a state machine to handle the control function?’ In the simplest digital
applications, a finite state machine (FSM) implemented in Verilog or VHDL could handle all the required
control functions, and in those cases, there is indeed no need to have a processor in the system. However,
when the application gets more complex, the number of states in the control function FSM increases,
or when the system’s behavior needs to be more flexible, the inclusion of a processor in the system
is unavoidable. To enable better flexibility, complex control flows are handled by a processor running
control software, which can be easily modified and debugged. As a result, embedded processors are
being increasingly embedded in FPGA designs. Although it is possible to use a separate microcontroller to
control an FPGA-based digital system, this will result in an increased component count in the completed
system, as well as potential issues with signal routing between the processor and the FPGA-like timing,
PCB signals routing, noise, and reliability problems.
Ability to handle complex tasks like Graphical User Interface (GUI) and data storage management
(e.g., file system);
Application programs can be developed and updated separately from the hardware design, allowing
better flexibility in product development;
Reduces the total number of components in the system because there is no need for a separated
processor chip;
Signal routing between the processor and the functional logic is handled automatically by FPGA
design tools;
Debugging software on a well-established processor is much easier than debugging a complex state
machine;
Little limitation on the interface between the processor and the user-defined logic blocks;
In comparison, the use of separated processor chips can have limitations on the interface like the
number of pins, selection of protocol and electrical characteristics;
Program code can be stored on configuration flash for the FPGA, allowing firmware update to the
hardware design and the application code to be carried out at the same time;
22
System-on-Chip Design with Arm® Cortex®-M processors
Processor implementation features are now becoming part of the FPGA development tools, making
integration of the processor into FPGA easier than using separate processor chip.
There are other intellectual property (IP) products available in the market, of course. However, the
designs of the Cortex-M processors provide:
Well-proven technology.
Products based on Arm Cortex-M processors have been around since 2005. In recent years, Arm has
made Cortex processor IP more accessible to cost-constrained companies through easy to arrange,
fast, no/low-cost licensing. For example, Arm Flexible Access introduced in 2019 offers a simple way
to evaluate and fully design system-on-chip (SoC) solutions with a wide-ranging mix of Arm IP before
committing to production, paying only for what is used at manufacture. There are also Arm DesignStart
programs that assist designers who are new to Cortex-M technology with a range of Arm IP to help
them get started on their designs instantly and risk-free. You can source various FPGA development
solutions, like affordable FPGA development boards, that can save you both time and money. Through
partnerships with FPGA vendors, Arm also offers DesignStart FPGA, which includes instant and free
access to Cortex-M1 and Cortex-M3 soft CPU IP Cortex-M processors for use on selected FPGA
platforms. Together with an industry-leading ecosystem of tools, software, and services, the Arm
Cortex-M processor portfolio offers some of the best embedded processors for digital system designs.
Education – for many universities teaching digital system design, FPGAs are perfect platforms.
Universities had been interested in using Arm processors in their teaching of digital design courses,
like how to create a typical SoC design with a processor and develop applications for it. However,
doing real chip design is costly and takes a long time, making the FPGA platform much more suitable.
Cortex-M3 soft CPU IP are now easy to use, low cost, and can be used with FPGA devices from
various FPGA vendors. DesignStart FPGA program, the Cortex-M1 and Cortex-M3 soft CPU IP
are integrated with FPGA design tools to make them even easier to use. In addition, since these
processors share many architectural characteristics with other Arm processors, the knowledge and
skills gained by the students through the experiments will prove valuable for them when they use
Arm processors again in the future.
Commercial product development – many digital designers are creating custom digital systems with
FPGA and need a processor to control the operations of the digital systems they design. In some
other applications, the digital functions needed are not available in off-the-shelf microcontroller
products, and therefore using the Cortex-M processors in FPGA enables alternate solutions.
23
System-on-Chip Design with Arm® Cortex®-M processors
Prototyping for chip/SoC designs – many ASIC designers use FPGA for prototyping their designs
and their chip/SoC designs that contain the Cortex-M processors. It is also a useful way to
prototype new product ideas, and to provide demonstrations/proof of concepts.
While there have been several FPGA vendor-specific processors available, most of those architectures
are proprietary and could be restricted to certain FPGA architectures. In contrast, the Cortex-M
processors are much more generic. Most of the Cortex-M processors (e.g., Cortex-M0 and Cortex-M3)
are optimized for ASIC/SoC applications. The Cortex-M1 processor was designed to be optimized for
most of the FPGA devices (it is small and allows high operation frequency), and at the same time can
be portable between different FPGA types and is upward-compatible to other Cortex-M processors.
For example, from a software point of view, the architecture used in Cortex-M1 is based on the same
instruction set used by the popular Cortex-M0, Cortex-M0+ processors. Designers can also upgrade
to a Cortex-M3 or other Cortex-M processor if more instruction features are needed.
Since the recent availability of the Cortex-M processor IP in FPGA design tools, Cortex-M system
designs are no longer restricted to SoC design professionals. Even students, academic researchers,
and electronics enthusiasts now have access to the world of Cortex-M system design.
The Cortex-A portfolio – Application processors for complex systems. An example of the processors
in this class is the Cortex-A53. It is developed to support applications like smartphones, PDAs, set-
top boxes, which need high-performance processing and require OS support like Linux, Android,
Microsoft Windows, etc.
24
System-on-Chip Design with Arm® Cortex®-M processors
Architecture type Support both 64 and 32-bit from Support both 64 and 32-bit from 32-bit only
Armv8-A, 32-bit in Armv7-A and Armv8-R, 32-bit in Armv7-R and
older architecture older architecture
Clock frequency range Longer pipeline optimized for high Medium-length pipeline Short to medium length pipeline
and pipeline clock frequency range (e.g., 8-stage in Cortex-R5) (2 to 6 stages) for low-power
systems
Virtual memory support Yes No (it is permitted in Armv8-R, No
(required for Linux) but not supported in current
Cortex-R processors)
Virtualization support Yes Yes, from Armv8-R No
(e.g., Cortex-R52)
Arm TrustZone security Yes No Yes, from Armv8-M, but not
extension in Armv6-M and Armv7-M
architectures
Interrupt handling Based on Generic Interrupt Based on Generic Interrupt Based on Nested Vectored
Controller (GIC) with multi-core Controller with multi-core Interrupt Controller (NVIC)
and virtualization support. and virtualization support, internal to the processor.
Non-deterministic interrupt or Vectored Interrupt Controller Low interrupt latency and easy
response speed. in older Cortex-R. to use.
Fast interrupt response.
ISA for DSP acceleration Neon Advanced SIMD Neon Advanced SIMD support Support legacy SIMD (32-bit
(128-bit vectored processing). on Armv8-R. Also, support legacy vector processing) in Cortex-M4,
Latest architecture from SIMD (32-bit vector processing). Cortex-M7, Cortex-M33, and
Armv8.3-A supports Scalable Cortex-M35P
Vector Extension (SVE).
If you are planning to use Linux in your applications, a Cortex-A processor would be needed. Both
Xilinx and Intel (previously Altera) have FPGA products with built-in Cortex-A processor subsystems.
On the other hand, the Cortex-M processors are ideal for smaller embedded systems, often with real-
time requirements.
There are different types of the Cortex-M processors, too. We can classify them into three product ranges:
Mainstream processor Cortex-M3 and Cortex-M4 processors (Armv7-M) Cortex-M33 and Cortex-M35P processors
25
System-on-Chip Design with Arm® Cortex®-M processors
For general data processing and control applications, Armv6-M processors are more than capable of
handling these requirements:
Cortex-M0 processor: the smallest Arm processor (only 12K gates in minimum configuration) with
a simple 3-stage pipeline, based on Von-Neumann bus architecture. No privilege level separation
and no memory protection unit (MPU).
Cortex-M1 processor: similar to the Cortex-M0 processor, but optimized for FPGA applications.
It provides Tightly-Coupled-Memory (TCM) interface to simplify memory integration on FPGA and
delivers higher clock frequency for FPGA implementations.
Cortex-M0+ processor: also based on Armv6-M architecture, with privilege level separation and
an optional memory protection unit (MPU). It also has an optional single-cycle I/O interface for
connecting peripheral registers that need low latency accesses, and a low-cost instruction trace
feature called Micro Trace Buffer (MTB).
Cortex-M23 processor: For constrained embedded systems that need advanced security, the
Cortex-M23 processor with the Arm TrustZone security extension is more suitable. In addition
to TrustZone support, the Cortex-M23 processor has many other enhancements compared to
Armv6-M processors:
Cortex-M3 processor: For applications that need more complicated data processing, Armv7-M
processors could be more suitable. The instruction set in Armv7-M provides support for more
addressing modes, conditional execution, bit field processing, multiply, and accumulate (MAC).
So even with a relatively small Cortex-M3 processor, you can have a relatively high-performance
system.
Cortex-M7 processor: the highest performance Cortex-M processor today with a six-stage
pipeline and superscalar design, allowing execution of up to two instructions per cycle. Similar to
the Cortex-M4, it supports 32-bit SIMD operations and an optional FPU. The FPU in Cortex-M7
can be configured to support single-precision or both single and double-precision floating-point
operations. It is also designed to work with high performance and complex memory system by
supporting instruction and data caches and TCM.
26
System-on-Chip Design with Arm® Cortex®-M processors
Cortex-M35P processor: similar to the Cortex-M33 processor, but with the enhancement of
anti-tampering features to prevent physical security attacks (e.g., side-channel and fault injection
attacks). It also includes an optional instruction cache.
For beginners, Cortex-M0, Cortex-M1, and Cortex-M3 are good starting points for most projects.
Arm DesignStart
Cortex-M0 and Cortex-M3 processors are available via DesignStart program (Note: The Cortex-A5
processor is also available, but this book is not intended to cover this).
Cortex-M1 and Cortex-M3 processors are available at no cost as soft CPU IP optimized for easy
integration with FPGA partners.
There are different types of deliverables for each of these DesignStart programs. Currently, Cortex-M
DesignStart is divided into several types:
DesignStart Eval(ulation) – delivered as obfuscated Verilog with fixed configuration. Instant access
and free. Suitable for evaluation, research, and teaching.
DesignStart Pro – delivered as full RTL source, configurable and requires a simple license;
Zero license fee and success–based royalty model.
DesignStart for University - delivered as full RTL source, configurable and requires a simple license.
Zero license fee.
DesignStart FPGA – delivered as packages for FPGA development tools. Instant access and free.
Suitable for evaluation, research, teaching, and commercial use.
27
System-on-Chip Design with Arm® Cortex®-M processors
For the latest information and details of DesignStart (including licensing conditions, please visit the
Arm website: https://developer.arm.com/products/designstart
Cortex-M0 and Cortex-M3 DesignStart Eval and Pro contains the following offerings:
Cortex-M0 DesignStart Eval Cortex-M3 DesignStart Eval Cortex-M0 DesignStart Pro Cortex-M3 DesignStart Pro
Cortex-M0 obfuscated model Cortex-M3 obfuscated model Full version of Cortex-M0 Full version of Cortex-M3
deliverable deliverable
Cortex-M0 System Design Kit Corstone-100 foundation IP Cortex-M0 System Design Kit Cortex-M System Design
(CM0SDK) including SSE-050 subsystem (CM0SDK) Kit (CMSDK), Corstone-100
foundation IP including
SSE-050 subsystem and
several IP blocks including
a full-blown TRNG (True
Random Number Generator)
for security
Cortex-M3 Cycle Model Cortex-M3 Cycle Model
(1-year license) (1-year license)
FPGA project for MPS2 FPGA FPGA project for MPS2 FPGA FPGA project for MPS2 FPGA FPGA project for MPS2 FPGA
board board board board
Trial license of Keil MDK Trial license of Keil MDK Trial license of Keil MDK Trial license of Keil MDK
(time-limited license) (time-limited license) (time-limited license) (time-limited license)
DesignStart RTL Review DesignStart RTL Review
Table 1.3: Offerings from Arm Cortex-M DesignStart Eval and Pro.
Trial license for IAR Embedded Workbench for Arm is also available from IAR Systems (https://www.
iar.com/designstart).
You can find out more about Flexible Access and DesignStart on the Arm website and request more
information: https://arm.com/why-arm/how-licensing-works
Disclaimer: The IP offering and commercial terms available through Arm DesignStart and Flexible
Access above are accurate as of July 2019 and are subject to change.
The Cortex-M0 DesignStart Eval includes an example system based on the Cortex-M System Design
Kit (CMSDK) product. The example system is delivered as RTL sources, with example test codes and
simulation scripts. A FPGA prototyping project for MPS2 (Microcontroller Prototyping System 2) is
also included.
28
System-on-Chip Design with Arm® Cortex®-M processors
The Cortex-M3 DesignStart Eval includes a system design based on the CoreLink System Design Kit SDK-
100 (a successor of CMSDK). It also has examples, simulation scripts, and FPGA projects for MPS2.
The DesignStart Pro also includes the deliverable for the full CoreLink subsystem products.
1.3.5 Documentation
There are several types of documents that you will come across when working on Arm system designs:
Architecture reference manuals: these documents specify the behavior of the architecture (e.g.,
instruction set, programmer’s model) but not the processor-specific implementation details (e.g.,
pipeline and interface). There are separated architecture reference manuals for Armv6-M, Armv7-M,
and Armv8-M, and you can download them from https://developer.arm.com (Please refer to Table 1.2
to see which architecture is for which processors).
Technical reference manuals: Often known as TRM, they describe the specification of the processors
or other system IPs. These documents are public and can be found on https://developer.arm.com
Integration and Implementation manuals: Also known as IIM, they describe the interface,
configuration options and explain how to use the deliverables like the execution testbenches. These
documents are confidential and are inside product bundles.
Use guides: The details of the FPGA examples are documented in user guides notes.
Release notes: All of the deliverables from ARM are provided with a release note which identifies
the versions of parts within a bundle, any known issues and any changes since a previous release.
The release note will also describe how to install and test the deliverable. These documents are
confidential and are inside product bundles.
Errata: The errata document describes known issues with ARM products, together with workarounds
if applicable.
29
System-on-Chip Design with Arm® Cortex®-M processors
30
System-on-Chip Design with Arm® Cortex®-M processors
CHAPTER
Introduction to system 2
design with Cortex-M
processors
31
System-on-Chip Design with Arm® Cortex®-M processors
2.1 Overview
One of the key advantages of using the Cortex-M processor is that, for small system designs, in
particular, it is not that difficult to get the system to work in a Verilog simulation or on FPGA. You will,
of course, need to acquire some knowledge beforehand, like a basic overview of the architecture used
in the Cortex-M processors. Also, if you are using a Verilog RTL version of the design, you will need an
understanding of the bus protocols used in the Cortex-M processors, such as AHB and APB protocols.
The first step of the project is to understand the requirements of the applications. For example, you
will need to know:
For ASIC designs, many additional areas should be investigated. For example, the following are
generic chip design considerations:
What types of memory technologies are available (e.g., embedded flash memories are not available
for many small geometry process nodes)?
What type of Design-for-Test (DFT) features are needed for device manufacturing testing?
For the era of IoT, designers should also investigate security aspects and many other challenging areas
of integrating wireless communication interfaces inside SoC designs.
To keep this document manageable, let us look into the processor system design areas only. To get
a simple Cortex-M processor system to work, typically we need to consider and, where appropriate,
define, the following (this is not a definitive list):
Memory blocks – what type of memories are needed, and memory sizes?
32
System-on-Chip Design with Arm® Cortex®-M processors
Memory map.
Debug integration.
In the rest of this chapter, you can read an overview of some of these areas.
Non-volatile memory (NVM), typically using embedded flash technologies or masked ROM, for
program storage;
In some systems, there can be additional memories for bootloader and other preloaded firmware.
Some low-power devices also have special retention static RAM (SRAM) for holding small amounts
of data while the rest of the device is shut down during sleep modes.
Most of the Cortex-M processors use 32-bit AHB for memory interfacing (except Cortex-M1 which
uses Tightly Coupled Memory (TCM) interfaces for connecting memories, and Cortex-M7 which
supports both Tightly-Coupled-Memory (TCM) and AXI bus interfaces). Therefore, the memory system
designs are normally 32-bit wide, but they also need to be byte-addressable – it means the RAM must
support byte (8-bit), half-word (16-bit) and word (32-bit) write operations.
For FPGA-based projects, the SRAM inside the FPGA can be used for both program storage (most
FPGA initialization sequences can initialize SRAM contents at the same time) and read-write data.
33
System-on-Chip Design with Arm® Cortex®-M processors
Therefore, in theory, you could use just one SRAM block for a Cortex-M based FPGA system design.
SRAM in FPGA
FPGA image
Use as
Initial content program
storage
Figure 2.1: SRAM in FPGA can have initial values so that a single SRAM block can be used as both program ROM and RAM.
However, such an arrangement differs from ASIC/SoC system designs where SRAM cannot be
initialized in the same way. Also, doing so will impact performance on a Cortex-M3/M4-based system
as it will no longer be using a Harvard bus architecture. To avoid confusion, the rest of the examples in
this book use two memory blocks for separating program storage and data read-writes.
A long time ago, FPGA tools could not generate RAM blocks using behavioral Verilog codes and
declaration of memories in FPGA projects required instantiation of memory macros manually. This was
changed a few years ago, but such a capability might require the RAM declarations to be written in
a specific way to allow the FPGA design tools to recognize it correctly.
SRAM
AHB interface
interface
cmsdk_fpga_rom /
cmsdk_ahb_to_sram
cmsdk_fpga_ram
This arrangement allows you to swap over the FPGA ROM/RAM with other memories easily
(e.g., when migrating to ASIC).
34
System-on-Chip Design with Arm® Cortex®-M processors
If you would like to simplify the design, it is possible to use a simple AHB block SRAM design (from
my paper in Embedded World 2014 – “Arm Cortex-M Processor-based System Prototyping on FPGA”
https://community.arm.com/processors/b/blog/posts/embedded-world-2014---arm-cortex--m-
processor-based-system-prototyping-on-fpga
module AHBBlockRam #(
// --------------------------------------
// Parameter Declarations
// --------------------------------------
parameter AWIDTH = 12
)
(
// --------------------------------------
// Port Definitions
// --------------------------------------
input HCLK, // system bus clock
input HRESETn, // system bus reset
input HSEL, // AHB peripheral select
input HREADY, // AHB ready input
input [1:0] HTRANS, // AHB transfer type
input [1:0] HSIZE, // AHB hsize
input HWRITE, // AHB hwrite
input [AWIDTH-1:0] HADDR, // AHB address bus
input [31:0] HWDATA, // AHB write data bus
output HREADYOUT, // AHB ready output to S->M mux
output HRESP, // AHB response
output [31:0] HRDATA // AHB read data bus
);
parameter AWT = ((1<<(AWIDTH-2))-1); // index max value
// --- Memory Array ---
reg [7:0] BRAM0 [0:AWT];
reg [7:0] BRAM1 [0:AWT];
reg [7:0] BRAM2 [0:AWT];
reg [7:0] BRAM3 [0:AWT];
// --- Internal signals ---
reg [AWIDTH-2:0] haddrQ;
wire Valid;
reg [3:0] WrEnQ;
wire [3:0] WrEnD;
wire WrEn;
// --------------------------------------
// Main body of code
// --------------------------------------
assign Valid = HSEL & HREADY & HTRANS[1];
// --- RAM Write Interface ---
assign WrEn = (Valid & HWRITE) | (|WrEnQ);
assign WrEnD[0] = (((HADDR[1:0]==2’b00) && (HSIZE[1:0]==2’b00)) ||
((HADDR[1]==1’b0) && (HSIZE[1:0]==2’b01)) ||
((HSIZE[1:0]==2’b10))) ? Valid & HWRITE : 1’b0;
assign WrEnD[1] = (((HADDR[1:0]==2’b01) && (HSIZE[1:0]==2’b00)) ||
((HADDR[1]==1’b0) && (HSIZE[1:0]==2’b01)) ||
((HSIZE[1:0]==2’b10))) ? Valid & HWRITE : 1’b0;
assign WrEnD[2] = (((HADDR[1:0]==2’b10) && (HSIZE[1:0]==2’b00)) ||
((HADDR[1]==1’b1) && (HSIZE[1:0]==2’b01)) ||
((HSIZE[1:0]==2’b10))) ? Valid & HWRITE : 1’b0;
assign WrEnD[3] = (((HADDR[1:0]==2’b11) && (HSIZE[1:0]==2’b00)) ||
((HADDR[1]==1’b1) && (HSIZE[1:0]==2’b01)) ||
((HSIZE[1:0]==2’b10))) ? Valid & HWRITE : 1’b0;
35
System-on-Chip Design with Arm® Cortex®-M processors
Using an Arm toolchain such as Keil MDK (Microcontroller Development Kit) or DS-5, we can create
a hex file that can be read by $readmemh for SRAM initialization, using the fromelf utility with the
following command-line:
This generates four hex files (itcm0, itcm1, itcm2 and itcm3), one for each byte lane, which need to
be available during the FPGA synthesis. The tool merges the data into the FPGA bitstream so that the
SRAM content can be set up during FPGA configuration stage.
In most cases, to connect a SRAM block to AHB, you can use the “cmsdk_ahb_to_sram” block, possibly
with a little bit of glue logic for signal protocol conversion. Additional considerations apply when low-
power support is a requirement, as SRAM macros usually have some low-power modes or even state
retention modes.
36
System-on-Chip Design with Arm® Cortex®-M processors
To connect embedded flash macros to AHB, you need a flash interface controller. The interface on the
flash macros is vendor and process node-specific. However, Arm has worked with multiple embedded
flash vendors to define a Generic Flash Bus protocol (GFB, https://developer.arm.com/docs/ihi0083/a),
so most parts of the flash controller are generic; only a smaller part of the interface is process-dependent.
Arm provides generic flash controller IP, which is licensable as a part of the Corstone-101 product.
Since embedded flash macros are often relatively slow (e.g., around 30MHz to 50MHz access speed)
and many Cortex-M designs run at over 100MHz, cache systems are often required to reach desired
performance levels. To address this need, Arm also offers cache units such as the AHB flash cache,
which is part of the Cortex-M3 DesignStart Pro.
Bits [31:24] [23:16] [15:8] [7:0] Bits [31:24] [23:16] [15:8] [7:0]
0x00000008 Byte 0xB Byte 0xA Byte 9 Byte 8 0x00000008 Byte 8 Byte 9 Byte 0xA Byte 0xB
0x00000004 Byte 7 Byte 6 Byte 5 Byte 4 0x00000004 Byte 4 Byte 5 Byte 6 Byte 7
0x00000000 Byte 3 Byte 2 Byte 1 Byte 0 0x00000000 Byte 0 Byte 1 Byte 2 Byte 3
Figure 2.3: Data arrangement in a Little-Endian system. Figure 2.4: Data arrangement in a Big-Endian system.
Please note that the endiann configuration only affect data accesses (including read-only data). Instructions
are always encoding as little endian. Also, access to the Private Peripheral Bus (PPB) is always in little endian.
Timers;
Pulse Width Modulator (PWM) – usually for motor or power electronic system control;
SPI (Serial Peripheral Interface) for external hardware modules such as LCDs;
37
System-on-Chip Design with Arm® Cortex®-M processors
In addition to these basic peripherals, a simple system might also integrate a group of registers for
various system control functions (e.g., clock source control, selection of low-power modes). This could
be integrated as part of the peripheral system, but additional care must be taken for system security
reasons. Typically, system management functions need to be restricted to privilege accesses only.
Microcontrollers also have analog interfaces like ADC (Analog to Digital Converter) and DAC (Digital
to Analog Converter). However, many FPGA devices do not support such peripherals. For ASIC
designs, typically the ADC and DAC IP need to be sourced from specialist IP providers.
The general layout of the memory map is shown in the diagram below (Figure 2.5).
0xE00FFFFF 0xE000EFFF
0xFFFFFFFF
Reserved Private Peripheral System Control
Bus (PPB) Space (SCS)
Private peripherals including
building interrupt controller 0xE0000000 Private Peripheral Bus
(NVIC) and debug components
0xDFFFFFFF 0xE0000000 0xE000E000
Mainly used as external
External Device 1GB
peripherals.
0xA0000000
0x9FFFFFFF
Mainly used as external
External RAM 1GB
memory.
0x60000000
0x5FFFFFFF
Mainly used as peripherals. Peripherals 0.5GB
0x40000000
0x3FFFFFFF
Mainly used as static RAM. SRAM 0.5GB
0x20000000
Mainly used for program 0x1FFFFFFF
code. Also, provides CODE 0.5GB
exception vector table after 0x00000000
power-up
38
System-on-Chip Design with Arm® Cortex®-M processors
The top 512Mb of the System Level Memory contains a region for system control and reserved areas.
This bus provides access to the built-in interrupt controller and various debug components. Within the
PPB memory range, a special range of memory is defined as System Control Space (SCS). It contains
the interrupt control registers, system control registers, debug control registers, and so on. The
remaining system-level memory space from address 0xE0100000 is reserved.
By having a predefined memory map, it makes porting of applications easier as all of the Cortex-M
systems have a similar look and feel, and an identical address range for NVIC and SysTick timer, etc.
It also simplifies the boot code as there is no need to program the system to define the memory
attributes for different memory/device types.
There are some restrictions concerning what the memory maps look like:
1. In many Cortex-M processors, including Cortex-M0, Cortex-M0+, Cortex-M1, Cortex-M3, and
Cortex-M4, the initial vector table address must be zero after reset.
2. In Cortex-M3 and Cortex-M4 processors, there is an optional bit band feature that allows the first
1MB of SRAM and the first 1MB of Peripheral region to be bit addressable. When this feature is
enabled, the bit-band alias region is remapped to bit band address range, and therefore the bit-band
alias address range cannot be used for data memory or peripherals.
3. In Cortex-M1 and Cortex-M7 processors, the instruction TCM and data TCM has fixed memory
addresses (TCM sizes are configurable). Both of these TCMs are optional.
External to processor
(System bus)
0x3FFFFFFF
DTCM (1MB maximum)
SRAM 0x20000000
0.5GB
0x20000000
0x1FFFFFFF
CODE
External to processor
0.5GB
0x00000000 (System bus)
ITCM upper alias
(1MB maximum)
0x10000000
For example, in the Cortex-M1 processor, there are two TCM interfaces: the ITCM interface is
primarily for instruction memory (including literal data access inside a program), and the DTCM is
primarily for data transfers. If the TCM size is set to 0, the TCM interface is not used, and the transfers
are carried out on the system bus. The maximum size of the TCM supported on Cortex-M1 is 1MB for
each TCM interface.
39
System-on-Chip Design with Arm® Cortex®-M processors
The TCM interfaces on the Cortex-M1 processor are designed to be used with typical RAM blocks in
modern FPGA architecture. The accesses are single-cycles (i.e., have no wait state) and are limited to a
maximum size of 1MB each.
It is possible to add additional memory blocks on the system bus of the Cortex-M1 processor. The
original design of the Cortex-M1 system bus is based on AHB Lite protocol (AMBA version 3), it is
generic and allows wait state and error response. Please note that the Cortex-M1 design integrated
with the FPGA design tool might have been customized for a specific FPGA design environment and
might, therefore, have some cycle timing differences as a result.
The TCM interfaces on the Cortex-M7 processor are designed to be used with RAM blocks for ASIC
designs and support wait states. The maximum TCM size is 16MB each, but in practice, the TCM sizes
used in Cortex-M7 based microcontrollers are likely to be in the range of 64KB to 512KB. A large TCM
can increase the cost of the silicon due to the size of the area used and can have an impact on the
maximum clock frequency that can be achieved. For Cortex-M7, you can also add additional memories
on the AXI master interface.
Peripherals are typically placed in the Peripheral region of the memory map (0x40000000 to
0x5FFFFFFF). In most designs, peripherals are grouped into address ranges based on the bus segment
that they are placed in. For example, a Cortex-M based system can have multiple AHB and APB
peripheral buses. Bus bridges can be used to allow these buses to run at different clock frequencies.
When using Cortex-M3 and Cortex-M4 processors, if the designer would like to take advantage of the
bit-band feature which allows peripheral registers to be bit addressable (using bit-band alias), then the
peripherals that use this feature need to be in the first 1MB of the peripheral region. Similarly, when
supporting the bit-band feature for SRAM, the SRAM must be placed in the 1MB of the SRAM region.
When using Cortex-M23 and Cortex-M33 processors with TrustZone security extension enabled, the
memory map design needs to divide memory spaces into Secure and Non-secure ranges. More details
on this topic are covered in Section 3.5 AHB5 TrustZone support.
The bus interface on the Cortex-M processor being used – different Cortex-M processors can have
different bus interfaces (e.g., Harvard versus Von Neumann bus architecture).
The performance of memory blocks (e.g., if embedded flash memories are used for program storage
and the design need to provide high performance, then a cache unit should be considered).
The bus bandwidth of other bus masters in the system. For example, a USB controller is likely to
have a bus master interface and needs high data bandwidth to SRAM. In such cases, you might
need to have multiple blocks of SRAM and design the bus system to allow the processor and the
40
System-on-Chip Design with Arm® Cortex®-M processors
USB controller to have concurrent access to SRAM blocks. Another type of common bus master is
DMA controller – DMA operations enable high-performance data transfers and device-driven data
transfers without software intervention.
The clock speed of peripheral buses – your designs might have multiple peripheral buses with
multiple clock speeds to enable low-power operations for some peripherals and higher performance
for peripherals that can benefit from lower access latency.
Security – with TrustZone based systems for Cortex-M23 and Cortex-M33 processors, security
management in bus system design is an important area to ensure that security measures cannot
be compromised. For some of the other Cortex-M systems without TrustZone, you might still
want to have some levels of security level management to handle the separation of privileged and
unprivileged software components.
Later on in this book, we cover some of the processor-specific bus system design concepts in Chapter 4.
For microcontroller designs with the Cortex-M7 processor, it is unlikely that you will connect slow
memory blocks like an embedded flash to instruction TCM because accesses to TCM memories
bypass the caches. Therefore, for Cortex-M7 system designs, slow program memories are expected to
be connected via the AXI master interface.
For details of TCM integration, please refers to the Integration and Implementation Manual (IIM) in the
product bundle.
The Cortex-M7 processor supports optional built-in instructions and data caches (they are
optional).
The Cortex-M35P processor supports an optional built-in program cache (sometimes referred to as
instruction cache but technically it is a unified cache that can cache both instruction and read-only
data).
For details of cache RAM integration on these processors, please refers to the IIM in the product
bundles.
41
System-on-Chip Design with Arm® Cortex®-M processors
System designers using the Cortex-M processor source code need to study the configuration options
documented in the Integration and Implementation Manual (IIM) carefully to select the right options
for their applications. Some other parts of the product bundles also need to be configured with
matching options. If the options of some parts of the deliverable are not configured correctly, items
like the execution testbench might not work correctly.
The maximum number of interrupts supported by the Cortex-M processors are listed in Table 2.1:
If the number of interrupt signals exceeds the maximum number support, it is possible to merge
multiple interrupt lines and share one interrupt service routine (ISR) and determine which interrupt to
be serviced in the ISR by software.
Are active high and must be synchronous to the processor’s system clock signal;
Can be level triggered or pulse triggered. If using pulse triggered, the duration of the pulse must be
at least one clock cycle.
The unused or not implemented interrupt input pins should be tied to 0 and must not be allowed
to enter unknown state ‘X.’ (e.g., if a peripheral outputs X in its interrupt line when the peripheral is
powered down, the signal level must be clamped to 0 before the power down happened). Issues with
unknown or ‘X’ signal values generally affect simulation but represent possible unexpected values
when using ASIC or FPGA.
42
System-on-Chip Design with Arm® Cortex®-M processors
If the peripheral interrupt is generated at a different clock domain, a synchronization circuit (such as
the example in Figure 2.7) is needed to remove potential metastability issue and to prevent transients
from forming unexpected pulses.
Interrupt
DFF DFF DFF
Output
Interrupt input D Q D Q D Q
(Different clock
domain)
CLK CLK CLK
Double flip-flop
Remove pulses form by
synchronization to
glitches
prevent metastability
Figure 2.7: Interrupt synchronizer to convert interrupt signal from one clock domain to processor’s clock domain.
Cortex-M processors have a Non-Maskable Interrupt (NMI) input. In common embedded systems the
NMI could be connected to:
Voltage monitoring logic (also known as brownout detector) to ensure that the system is shut down
correctly when support voltage drops to a certain value or
The NMI could be connected to a watchdog timer to carry out remedial actions if the system has
stopped normal operation.
NMI is unlikely to be used as the interrupt for normal peripherals. This is because the built-in
interrupt controller NVIC already provides interrupt prioritization, so each peripheral can already be
programmed as the highest priority by just using the normal IRQ connection. Also, a fault generated
within the NMI handler can cause the processor to enter lockup state, which can be problematic
for some applications. Faults in normal interrupt handlers allow the Hard Fault handler (or other
configurable fault handlers) to be triggered and executed.
Another characteristic of the NVIC is that it can handle interrupt requests in the form of pulse as well
as level signal. If a peripheral generates an interrupt request in the form of a pulse signal, the request is
held by pending status within the NVIC until the interrupt request is processed, or when the pending
status is cleared manually. If a peripheral generates an interrupt request in the form of a level signal,
the interrupt handler must clear the request at the peripheral.
The key advantage of a pulsed interrupt is that it saves a few clock cycles in the ISR that there is no
need to clear the interrupt requests at the peripherals. However, in many cases, a level-triggered
interrupt is preferred because:
Cross clock domain synchronization of level-triggered interrupts is simpler than pulsed interrupts.
In the case where pulse interrupt synchronization logic is used, two successive interrupt request
pulses could be merged into one after the synchronizer due to the latency of the synchronization,
which can be confusing.
43
System-on-Chip Design with Arm® Cortex®-M processors
If the interrupt event occurred when the processor is reset, the interrupt event could be lost.
Level trigger interrupts can remain at a high level to indicate an additional service is needed by the
peripheral (e.g., when additional data is available in a receiver’s FIFO).
Easier for debugging (e.g., in Verilog simulation, where it is hard to tell if there has been an interrupt
event unless the event information is kept by, for example, a waveform database).
The peripheral design can be reused for other processors that do not support pulsed trigger
interrupts.
In addition to the number of interrupts, there are other configuration options related to interrupt handling:
Number of interrupt priority levels – In Armv7-M and Armv8-M Mainline processors, the
programmable interrupt priority level registers has configurable width from 3-bits to 8-bits. Typically,
the options of 3-bit to 4-bit are used, and some devices do support 5-bit. Most applications do not
need many interrupt priority levels, so eight levels (3-bit) is likely to be sufficient.
Wakeup Interrupt Controller (WIC) – An optional block to handle interrupt detection while the
processor is in-state retention power gating (SRPG) or when the processor’s clock is completely
stopped. If the WIC feature is implemented and enabled, the interrupt masking information is
transferred from NVIC to WIC automatically before entering sleep mode. The WIC then takes
over the role of interrupt event detection and can generate a wakeup request to power management
blocks in the system when an enabled interrupt event is detected. The interrupt pending status
is held in the WIC when the processor is waking up and transfers the interrupt request to NVIC
when the processor is back up. At the same time, the masking information inside the WIC is cleared
automatically by hardware as the NVIC is back running.
The event interface is typically used in multi-core systems to allow one processor to wake up another
during spinlocks. In RTOS semaphores, if a processor is waiting for a spinlock, it can enter sleep
mode using WFE to save power and wakes up if there is an interrupt to serve or if there is an
event from another processor. By crossing over the event interface signal (as shown in Figure 2.8),
processors in a dual-core system can wake up each other from WFE sleeps using the SEV (send event)
instruction.
44
System-on-Chip Design with Arm® Cortex®-M processors
Cortex-M Cortex-M
TXEV TXEV
RXEV RXEV
Events could also be generated from peripherals or DMA controller, but normally interrupts are more
suitable for that purpose as we need software to react to those hardware events vis ISRs.
For single-processor systems, it is fine to tie RXEV to 0 and leave TXEV unconnected.
Please note: The event interface on the Cortex-M processor is unrelated to the definition of events in
RTOS. In RTOS, an application thread waiting for a certain operation X to be carried out can call an OS
API that waits for an event Y. This API call also takes the thread out of the ready task queue. When the
specified operation X has been carried out (e.g., in another thread or an ISR), the other thread or ISR that
carried out the operation X can call another OS API to set the OS event Y. This puts all the waiting threads
that were waiting for the operation X to be put back in the ready task queue to resume operation.
Free-running clock (if gated, all logic in the processor stopped and needs external logic blocks such
as WIC to handle interrupt detection and wakeup);
Debug clock(s) – this includes the JTAG or Serial Wire debug clock signals for debug interface, and
also a clock signal for internal debug components which can be gated if there is no active debug
connection.
The free running clock, system clock and debug clock (except the clock for the debug interface and
DAP interface on Cortex-M3/M4 processors) must be synchronous and in the same phase. The
separation of clock signals is to allow the system power to be reduced by gating off some of the clock
signals when they are not needed.
In Cortex-M0 and Cortex-M3 processors, the design exported GATEHCLK signal is asserted when
the processor is in sleep mode, and there is no debug connection. This signal can be used to gate off
the system clock.
In some of the Cortex-M processors, the clock gating logic is done internally and so might not have
all these clock signals visible on the top-level.
45
System-on-Chip Design with Arm® Cortex®-M processors
It is important not to gate off the system clock when the processor is running. In system-level designs,
there can be multiple clock sources, and a glitch-less clock switching circuit would be needed. The clock
switching circuit is outside of the processor and is normally application and process node dependent. In
FPGA designs, you can design an FSM that controls the PLLs (Phase-Locked Loops) and gate-off the clock
signals to the processor subsystem during PLL configuration changes.
PLL config
Lock
Enable
status Control
Depending on the FPGA design tool being used, the system clock generation/control logic might be
generated by the tools. In this case, there is no need to develop your own clock generation/control logic.
External crystal oscillator for medium speed (e.g., 1MHz to 12MHz) – this might be turned off by
default after a reset to save power. Instead of using a higher frequency crystal to generate higher
frequency clocks, it is more common to use a PLL to generate a high clock speed when needed to
avoid having a high-frequency clock running all the time to save power.
Internal RC oscillator for medium speed (e.g., 1MHz to 12MHz). This will use less power than a crystal
oscillator, but will not provide an accurate frequency reference for timing or peripheral interfaces.
External 32KHz oscillator for real-time clock (might also be used for system management).
PLL config
Clock
PLL
Internal R-C switch
oscillator
On/off
Clock buffer
control Glitch Free running
free clock output
Clock
Fast switch
crystal Clock buffer
oscillator Clock Gated
On/off gate system clock
control
output
Real-time Power
clock management
32KHz (RTC) control
crystal
oscillator
46
System-on-Chip Design with Arm® Cortex®-M processors
In ASIC/SoC implementations, the system can boot-up from the internal RC oscillator and switch
over to external crystal oscillator or PLL for clock source when needed. PLL can provide higher clock
frequency for high-performance operations.
System reset;
Debug reset;
Optionally you might find a power-on reset, which resets both the system and debug logic.
If power-on reset is present, it resets both the system and debug system. The reason that we separate
the reset into two signals is to allow the processor to be reset without affecting the debug system.
Otherwise, the debug settings like breakpoints, watchpoints, and the debug connection from the
debugger to the core, would be lost each time the processor core is reset.
The processor also outputs a reset request signal called SYSRESETREQ. This is controlled by a register
bit in the Application Interrupt and Reset Control Register (AIRCR) inside System Control Space. This
allows:
Software to request a system reset, for example, in the case of fault error handling;
Debugger to request a system reset. This is essential to allow the debugger to request a reset of the
targeted processor.
SYSRESETREQ only generates a system reset but not debug reset or power-on reset;
SYSRESETREQ does not generate a system reset in a combinatorial path (in other words – it must
be registered by registers that are not affected by the system reset), as the SYSRESETREQ output
is affected by a system reset and the use of a combinatorial path for reset generation causes a reset
glitch.
All of the Cortex-M processors use an asynchronous active-low reset signal and must be de-asserted
synchronously to the system and debug clock to prevent timing violations. This ensures that most of
the registers can be reset when the clock is not running. However, most of the Cortex-M processors
require the reset to last at least two clock cycles. This arrangement has the following benefits:
47
System-on-Chip Design with Arm® Cortex®-M processors
In the case where the assertion of reset causes timing violations and leads to metastability, the
multi-cycle nature of reset ensures the metastability is cleared up. To ensure reset de-assert occurs
at the correct time, a simple reset generator could be used for a Cortex-M0/M0+/M1/M3/M4
processor. Figure 2.11 shows such an example.
1 D Q D Q DBGRESETn
buffer
External power- clr clr
on-reset
(active low)
Registers to hold reset
for 2 cycles after
HCLK SYSRESETREQ
SYSRESETREQ D Q D Q SYSRESETn
buffer
clr clr
Assuming the Cortex-M1 is used (some other Cortex-M processors have different signal names for
reset signals): The Cortex-M1 processor generates the SYSRESETREQ signal. Since the Cortex-M1
processor can be reset by SYSRESETn, the SYSRESETREQ signal must not drive SYSRESETn in a
combinatorial path. Otherwise, it could result in a race condition where SYSRESETREQ gets cleared
in a very short time after assert, as it gets cleared by its output. This could result in some parts of
the processor getting reset and other parts not. For this reason, the SYSRESETREQ signal must be
registered by a separated flip-flop that is not affected by SYSRESETn before being used to generate
SYSRESETn. In the example above (Figure 2.11), the reset request from the SYSRESETREQ is held in
two registers that are reset by DBGRESETn, or if using Cortex-M3/M4, you can use power-on reset in
Cortex-M3/M4 processor.
We can also design the reset generator so that it can optionally reset the system if it enters lock-up
state. To make this behavior controllable, a programmable register would be needed in your FPGA/
system design to specify if a lock-up state can cause a reset. This register is not provided in the
Cortex-M processor core as such requirement is application dependent. During software development,
the control signal at this external reset control register can be set to 0 to disable the automatic reset.
In a production system, the reset control register can be set to 1 so that when the system enters lock-
up state, the SYSRESETn is activated automatically.
48
System-on-Chip Design with Arm® Cortex®-M processors
1 D Q D Q DBGRESETn
buffer
External power- clr clr
on-reset
(active low)
Registers to hold reset
for 2 cycles after
HCLK SYSRESETREQ
D Q D Q SYSRESETn
buffer
clr clr
SYSRESETREQ
LOCKUP
Programmable register to
Reset control register allow the system to be
reset on a lock-up
Depending on the FPGA design tool being used, the system reset controller might already be included.
In this case, there is no need to develop your own reset controller.
2.13 SysTick
The SysTick timers in the Cortex-M processors support external reference “clock.” Technically the reference
“clock” is not a clock signal, as it is sampled by D-flip flops inside the SysTick at the processor’s clock speed.
The SysTick interface also provides a calibration input, which is fed to the SysTick calibration value
register:
The support for SysTick reference clock and calibration value are optional.
If TENMS is not used, STCALIB[23:0] should be tied low, and STCALIB[24] needs to tied high.
49
System-on-Chip Design with Arm® Cortex®-M processors
In CMSIS-CORE, an alternate way for software to determine system clock speed is provided that uses
a software approach: the SystemCoreClock variable should provide the clock frequency information,
and that is initialized and updated by the software when the clock settings are updated.
Interface for debug connection (JTAG or Serial Wire Debug) – for connecting a debugger to the
hardware target to carry out halting, stepping, restart, resume, setting breakpoints/watchpoints,
access to memories and peripherals. Debug connection is also used for downloading code and flash
programming.
Interface for trace data (connecting ATB from the processor and ETM to trace port) – enables the
debugger to obtain real-time trace information, either using trace port protocol which contains
multiple data bits (usually 4-bit) and a clock signal or using a single pin trace output protocol for
trace with lower bandwidth (e.g., instrumentation trace, event trace). The trace interface is optional
and is not available on Cortex-M1, Cortex-M0 and Cortex-M0+ processors.
Debug system clock and reset generation, and power management – depending on which
processor is used, the debug system can have its own clock and reset signals, and in some designs,
debug logic can be powered down or clock gated if not being used.
More details on the debug interface are covered in Chapter 5. Please note, here we only cover single-
core designs. In the case of multi-core designs, the debug integration should be handled by Arm
CoreSight SoC-400/600 products.
50
System-on-Chip Design with Arm® Cortex®-M processors
Sleep modes – architecturally, the processor can have sleep and deep sleep, but these sleep modes
could be extended with additional system-specific registers to have addition granularity of sleep
characteristics. The processors have sleep mode status output signals so that system designers can
use these signals to control clock gating and other power management hardware.
Sleep hold interface – in the case where a system designer utilizes sleep mode signals to turn off
hardware resources (e.g., program ROM), the wake-up process can take a while (e.g., hundreds
to thousands of clock cycles). In such cases, it is essential to be able to hold off the processor’s
program execution, and the sleep hold interface is designed exactly for this purpose. To use this
feature, the system designer needs to design a simple Finite State Machine (FSM) to handle the
handshaking with the sleep hold interface.
Wakeup Interrupt Controller (WIC) – explained in this chapter earlier, the WIC is an optional
feature that allows interrupts or other wakeup events to be detected when the processor is in
a powered-down state, retention state, or if the clock to the processor is gated off. The system
designer can customize the example WIC design if needed.
Debug power management – the debug interface modules provides handshaking signals to
indicate whether there is a debugger connection, which allows system designers to implement
power management for the debug system of the processors if needed. For example, in Cortex-M0,
Cortex-M0+, Cortex-M7, Cortex-M23, Cortex-M33, and Cortex-M35P processors, there is a
separate debug power domain that can be powered down if there is no debug connection.
System designers are also likely to integrate additional power management features for memory
blocks, clock generation and distribution systems, and some of the peripherals.
Apart from the debug and trace signals, normally there is no need to expose other interfaces of the
Cortex-M processors directly to the top-level of the devices. For external interrupt generation, usually,
that is handled by GPIO blocks so that external hardware can trigger interrupts via GPIO. In some
cases, chip designers can also implement a signal path to allow off-chip hardware to generate an
event pulse to the Cortex-M processor so that it can wake up from WFE (Wait-for-event) instruction;
however, this is not essential for many systems.
51
System-on-Chip Design with Arm® Cortex®-M processors
When designing top-level pins, several areas related to the Cortex-M processors should be
considered:
In most cases, the debug interface pins (JTAG or Serial Wire Debug) need to be accessible at the
device’s top-level by default. For Cortex-M3, Cortex-M4 and Cortex-M33 processors, the debug
interface module supports dynamic protocol switching, so it is possible to expose just two pins
of the SWD debug by default. If there is a need to switch over to JTAG, then you can program a
device-specific pin multiplexer (mux) control register to expose the other pins for JTAG, and then
apply a switchover sequence to start JTAG operations.
The SWD interface requires a tristate pin for the data connection (SWDIO), which is enabled when
SWDIEN is high.
If the debug interface is multiplexed with other peripheral I/O pins, the peripheral I/O operations
can cause a debug connection to be disconnected.
The debug and trace interface provides a range of status signals to allow some of the signals to be
multiplexed with functional pins. It is also possible to use device-specific programmable registers
to help control the pin multiplexing. However, in such cases, the device vendor needs to provide
the details of the setup sequence for various debug tools to allow them to work correctly with
the device.
When creating systems using Armv8-M processors with TrustZone, the debug connection might
contain Secure information, and therefore the pin multiplexing logic needs to prevent Non-secure
software from seeing activities in the debug connection.
Newer Cortex-M processors support a CPUWAIT signal. This is used to delay the start-up of the
processor after releasing from reset. In most single-core systems, this pin can be tied low. In multi-core
SoC designs when the Cortex-M subsystem is running a program in SRAM, the CPUWAIT signal can
be used to delay the boot-up so that a different bus master can transfer the program image into the
SRAM. After the program image is loaded, the CPUWAIT signal can be released, and the Cortex-M
processor can start executing the program.
52
System-on-Chip Design with Arm® Cortex®-M processors
53
System-on-Chip Design with Arm® Cortex®-M processors
54
System-on-Chip Design with Arm® Cortex®-M processors
CHAPTER
AMBA, AHB, and APB 3
55
System-on-Chip Design with Arm® Cortex®-M processors
Unlike most other bus protocols such as PCI, AMBA bus protocols are designed for on-chip
communications. To enable easier system integration inside chip designs, almost all the AMBA
specifications have the following characteristics:
Synchronous operations – use only clock rising edge for flip-flops, friendly to common synthesis flow;
No on-chip bi-directional signals – avoiding the need for tri-state buffers.
AHB (Advanced High-performance Bus) – a lightweight pipelined bus protocol used in the majority
of the Arm Cortex-M processors.
APB (Advanced Peripheral Bus) – a simple bus protocol for connecting general simple peripherals
with low data bandwidth requirements.
AXI (Advanced eXtensible Interface) – a high-performance bus protocol for efficient, high-
performance processors including the Cortex-M7 processor, Cortex-R processors and the majority
of the Arm Cortex-A processors. The AXI protocol:
Allows new transfers to be issued and take place even with previous transfers still outstanding.
Supports unaligned data and provides data security based on TrustZone technology.
You might wonder why the AMBA standard was developed? Although there are many bus standards in
existence, they mainly focus on circuit board level connections. These types of interface standards contain
overhead for supporting signal multiplexing, configuration detection, and electrical characteristics handling
56
System-on-Chip Design with Arm® Cortex®-M processors
(e.g., turn-around time when signals switch direction). Overheads of this kind do not apply to system-on-chip
environments. At the same time, there is a need to have an open standard to allow better design reusability
and enable Intellectual Properties (IP) providers to develop peripherals for the Arm platform. As a result,
Arm published the AMBA 2 specification as an open standard (royalty-free), and this has become the most
popular processor interface standard for embedded 32-bit processors. Due to its open nature and simplicity,
AMBA has been for some time a de-facto standard for bus interface in system-on-chip architectures.
The low overhead and low latency characteristics are often necessary for high speed embedded systems.
The AHB protocol also supports pipelined operation, which is important to most designs.
In this chapter, we will only cover the AHB and APB protocols as they are the most commonly used
in simple embedded microcontrollers and SoC designs. For other protocols, the specifications can be
downloaded from the Arm website. Additional reference materials are also available:
AMBA https://developer.arm.com/architectures/system-architectures/amba
ACE https://www.arm.com/files/pdf/CacheCoherencyWhitepaper_6June2011.pdf
CHI https://community.arm.com/processors/b/blog/posts/what-is-amba-5-chi-and-how-does-it-help
Table 3.2: Reference sources for some of the bus protocols that are out of scope for this document.
57
System-on-Chip Design with Arm® Cortex®-M processors
The AHB standard has gone through multiple releases and phases:
AMBA 2 AHB – first release. This specification release uses a pair of handshaking signals (Bus
Request and Bus Granted) for arbitration between multiple bus masters.
Multi-layer AHB designs – The AMBA Design Kit product from Arm introduced an AHB
interconnect component called AHB Bus Matrix. This enables concurrent bus transfers in multi-
master systems for higher bandwidth, avoids the need for the Bus Request and Bus Granted signals,
and was unofficially called ‘AHB Lite.’
In the AMBA 3 Specification, AHB Lite became the official name. It removed Bus Request and Bus
Granted signals and simplified several other aspects of the AHB protocol. It is used on many Arm
processor systems.
In the AMBA 5 specification, AHB has been updated to support TrustZone for Armv8-M and add
official support for exclusive access sideband signals. It also introduced several improvements
including additional cache attribute support and clarifications.
AMBA 5 AHB (also known as AHB5) is the most recent release of the AHB specification. In many
ways it is highly compatible with its predecessor and existing bus slaves designed for AHB Lite can be
reused in AHB5 systems.
58
System-on-Chip Design with Arm® Cortex®-M processors
For a typical AHB system, you can find most of the following signals:
1
AHB3 does not have HMASTER though most Cortex-M processors using AHB-Lite have an HMASTER signal provided.
59
System-on-Chip Design with Arm® Cortex®-M processors
For older AHB systems (AMBA 2) that use AHB Arbiter to handle multiple master accesses, you can
also see the following signals:
Many of the signals are optional. For a minimal AHB system with a single Arm Cortex-M0 processor,
it is possible to create a working system using:
HCLK, HRESETn, HADDR, HTRANS, HSIZE, HWRITE, HSEL, HWDATA, HRDATA, HRESP, and
HREADY signals.
HSEL1
Address decoder
HSEL2
Address
AHB slave #1
Address &
HSEL
HADDR, HTRANS, controls HADDR, HTRANS,
HSIZE, HWRITE, etc HSIZE, HWRITE, etc
HWDATA
Write data HWDATA
HREADY
HREADYOUT
Data phase
HREADY
HRESP
AHB master MUX control HRDATA
AHB slave #2
HREADY HSEL
HRESP HADDR, HTRANS, etc
HRDATA HWDATA
HREADY
HREADYOUT
HRESP
HRDATA
Figure 3.1: Simple AHB system with one AHB master and two AHB slaves.
The connections between AHB masters and AHB slaves based on the AMBA 5 Specification can be
viewed in the following diagram: multiple bus masters share the same bus using a master multiplexer,
controlled by a bus arbiter. The return data and responses from bus slaves are also multiplexed using
a slave multiplexer and feedback to the bus masters.
The signals are grouped into “address phase” signals and “data phase” signals.
60
System-on-Chip Design with Arm® Cortex®-M processors
Each transfer is composed of an address phase and data phase. The transfers are pipelined. The
address phase of a transfer can be overlapped with the data phase of the previous transfer.
HCLK
Address
Address phase
phase Address Phase N Address Phase (N+1) Address Phase (N+2)
(N-1)
signals
Data
Data phase
phase Data Phase (N-1) Data phase N Data Phase (N+1)
(N-2)
signals
HREADY
Figure 3.2: Splitting of a transfer into address phase and data phase.
Each phase is terminated by the assertion of HREADYOUT (HREADY) from the currently activated
AHB slave in the data phase. The HREADYOUT from the AHB slaves are multiplexed by the slave
multiplexer, forming the system-wide HREADY signal. The multiplexer is operating at the data phase
of each transfer. Control of the multiplexer can be generated from the AHB decoder, or the HSEL
signals and the HREADY signal.
Address phase
select control AHB slave #1
HADDR HSEL
Address
HREADY
Decoder
AHB slave #2
HSEL
HREADY
HREADY
HREADYOUT
AHB Masters HRESP
HREADY, HRESP, HRDATA
HRDATA
Figure 3.3: HREADYOUT route from AHB slave output to HREADY inputs of AHB slaves and AHB masters.
61
System-on-Chip Design with Arm® Cortex®-M processors
If an AHB slave is not currently selected, its HREADYOUT should be high to indicate it is ready.
However, it can only accept a transfer from the bus master when the previous transfer is completed,
indicated by a high level in the system-wide HREADY. For example, if transfer N selects AHB slave A,
transfer N+1 selects AHB slave B, and transfer N+2 selects AHB slave C, the waveform can be like:
HCLK
Address
Address
Address Phase N Address Phase N+1 phase N+2
phase Address Phase (N+3)
(AHB Slave A selected) (AHB Slave B selected) (Slave C
signals
selected)
HSELA
HSELB
HSELC
HREADYOUTA
HREADYOUTB
Figure 3.4: AHB transfer can only be accepted when the previous transfer is completed, indicated by HREADY being high.
Components Descriptions
Address decoder Based on HADDR input, generates HSEL signals to bus slaves and AHB slave multiplexer
AHB slave multiplexer Connect multiple bus slaves to a single AHB segment
Default slave This is a special type of AHB bus slave, which is selected when the transfer address (HADDR) does not
match the address range of other AHB slaves. This only happens when something has gone wrong (e.g., the
software attempts to access an invalid memory location due to an error in C pointer processing). This bus
slave only returns an ERROR response when accessed, write data is ignored by the default slave, and it
returns 0 for read accesses. Default slave is optional - If the address space is fully-utilized by other bus
slaves, then there is no default slave.
62
System-on-Chip Design with Arm® Cortex®-M processors
The design of the address decoder is system-specific – each system has its memory map, and the chip
designers need to create the address decoder for each bus slave accordingly. The HSEL signal is an
address phase signal generated by decoding the HADDR signal. Since the HSEL generation process is
combinatorial, the design of memory maps needs to avoid complex decoding of HADDR. Otherwise,
the synthesis timing could be affected.
The design of AHB slave multiplexer can be much more generic and is available in various Arm AHB based
system IP bundles. Since each AHB slave has their own read data output and response outputs, an AHB
slave multiplexor is needed to select the return data and response from the current active slave.
The slave multiplexer is controlled by a data phase version of the HSEL signal (delayed by one pipeline
stage). This can be generated by registering the HSEL with system-wide HREADY as the enable signal,
as shown in Figure 3.3. In most cases, the functionality of delaying the HSEL is included within the
AHB slave multiplexor, so the system integrator only needs to connect the address phase HSEL to the
AHB slave multiplexor.
There can be more than one address decoder and slave multiplexer in an AHB system. For example,
an AHB system design can be divided into two or more subsystems, and each AHB subsystem can
contain only a part of the memory space. In this case, the address decoders in the subsystems also
need to take account of the HSEL signals from the top-level address decoder.
AHB
Decoder
Slave multiplexor Top
HSEL_B
HSEL_A
Sub-system Sub-system
AHB AHB
Slave Decoder A Slave Decoder B
multiplexor multiplexor
HSEL_A1 HSEL_B1
HSEL_A2 HSEL_B2
Figure 3.5: Multiple AHB decoders could be needed if the AHB system is divided into multiple subsystems.
63
System-on-Chip Design with Arm® Cortex®-M processors
Address
decoder
HBUSREQ
Arbiter AHB master AHB slave
HGRANT #0 #0
Shared
AHB
HBUSREQ
AHB master AHB slave
HGRANT #1 #1
Master Slave
multiplexer multiplexer
Figure 3.6: AMBA 2 AHB system with two AHB masters and two AHB slaves.
The HBUSREQ (Bus Request) and HGRANT (Bus Granted) handshaking take place before the bus
master is allowed to issue any transfers:
A bus master must first assert a bus request (HBUSREQ) to the arbiter, then
After arbitration, the arbiter returns the bus granted signal (HGRANT) to one of the bus masters, then
If the HGRANT signal is de-asserted, the bus master must stop issuing new transfers.
This arrangement works for simple systems, but the maximum bus bandwidth is limited. When the
AMBA Design Kit (ADK) product was launched, a new approach called multi-layer AHB was used to
support multiple bus masters.
The AHB Bus Matrix component in the ADK is an AHB interconnect component with multiple bus
ports for connections to bus masters and multiple bus ports for connection to AHB slaves. Potentially,
each bus port for bus slave connection could have multiple bus slaves connected to it by having
additional bus slave multiplexer.
Address Arbiter
decoder
Address
decoder Arbiter
AHB master Input AHB slave
#1 stage #1
Figure 3.7: System with multiple bus masters using AHB Bus Matrix.
64
System-on-Chip Design with Arm® Cortex®-M processors
To resolve the bus access conflict when multiple bus masters try to access to the same bus slave at the
same time, each master port (the port that connects to AHB slaves) has an arbiter. If a transfer from a
bus master is targeting a bus slave, which is accessing by a different bus master, the incoming transfer
is held in the input stage with a buffer. With this arrangement, the HBUSREQ and HGRANT signals
are no longer needed.
The bus matrix design allows different bus masters to access to different bus slaves at the same time,
hence enhance the system bandwidth.
The bus matrix component in AMBA Design Kit is configurable and was enhanced when the Arm
Cortex-M System Design Kit (CMSDK) was developed. In newer IP product development, the AHB Bus
Matrix is now part of the Corstone foundation IP / CoreLink SDK (System Design Kit).
For systems that do not require high data bandwidth, a simplified arrangement similar to the AHB
in AMBA 2 can be used with the AHB Master Multiplexer. The redesigned AHB master multiplexer
component has its internal input stage and arbiter, just like the AHB bus matrix. However, since there
is one downstream AHB, there is no need for an internal address decoder.
Shared
AHB
Figure 3.8: System with multiple bus masters using AHB Master Multiplexer.
The HTRANS signal is used to indicate transfer types. Most AHB systems do not need to handle data
transfers 100% of the time. When a bus master does not need to start another transfer immediately,
it can issue an idle transfer. The HTRANS signal in the AHB is used to indicate if the current transfer is
an active transfer or in idle state.
65
System-on-Chip Design with Arm® Cortex®-M processors
HTRANS[1:0] Descriptions
00 IDLE (non-active)
01 BUSY (non-active)
10 Non-Sequential (active)
11 Sequential (active)
When a data transfer is needed, the AHB master generates a Non-sequential (NSEQ) or a Sequential
(SEQ) transfer. NSEQ is used in normal transfers or at the beginning of a burst transfer, and SEQ is
used for the remaining part of a burst transfer, indicating that the transfer is a continuation of the
previous one.
Both IDLE and BUSY are non-active transfers. It means that no real data transfer takes place. BUSY
is only used when the bus master has started a burst transfer but is not ready to handle the next
data transfer. In this case, it issues the BUSY transfer between the burst transfers to keep the burst
sequence going, and continues with the next sequential transfer when it is ready.
The HADDR signal is usually 32-bit and specifies the address of the transfer. HWRITE signal indicates
that the transfer is a write transfer if it is set to 1, or if it is set to 0, then the transfer is a read operation.
HWRITE Descriptions
0 Read operation
1 Write operation
The HSIZE signal is used to indicate the data size to be transferred. Typically, the HSIZE signal is 3-bit
wide, but in most AHB systems only the lowest two bits are used so that you might find some systems
or AHB components with HSIZE of only two bits.
000 Byte
001 Half-word
010 Word
011 Double word (64-bit)
100 128-bit
101 256-bit
110 512-bit
111 1024-bit
66
System-on-Chip Design with Arm® Cortex®-M processors
When a bus master generates a transfer, the bus master should ensure that the data being transferred
is aligned. In other words, a half-word transfer should only take place in even memory addresses, and
a word transfer should only take place in addresses divisible by 4. The AHB interface does not support
unaligned transfers; if a bus master needs to access unaligned data, it should split the transfer into
multiple aligned AHB transfers of smaller size.
The Cortex-M processors can generate read and write transfers of byte, half-word, or word size. In
some of the Cortex-M processors like the Cortex-M3 and Cortex-M4, Instruction fetches are always
in word size. For some others like the Cortex-M0+ and Cortex-M23 processors, instruction fetch of a
branch target could be in word or half-word size, depending on the instruction address alignment.
Besides the crucial AHB control signals, there are also optional sideband signals in the AHB interface.
They are helpful for processor systems, for example, providing privilege level information and
supporting burst transfers. However, they might not be present in some AHB systems.
Signals Descriptions
HPROT[3:0]/[6:0] Protection information (AHB5 has 7 bits of HPROT, and previous versions of AHB has 4-bits)
HNONSEC Security attribute (available in AHB5 only. This is needed for TrustZone security extension)
HBURST[2:0] Burst transfer information
HMASTLOCK Indicate the transfer sequence is atomic, so bus ownership is locked until this signal is released
HMASTER[3:0] Indicates which bus master issued the current transfer.
In some Cortex-M processors, this signal is used to indicate the transfer types (e.g., whether the transfer is
generated by the debugger). The width of this signal can be customized to fit the system requirement.
HEXCL Exclusive access indication signal. This signal is introduced in AHB5 to support exclusive accesses in Arm
processors. The bus slave response to exclusive access with HEXOKAY, a data phase signal.
HAUSER[x-1:0] This is a user-defined address phase signal introduced in AHB5. Potentially it could be used for the
following: propagation of additional information about the transfer;
parity bits for address phase control signals.
In AMBA 2 AHB and AHB Lite, the HPROT signal contains 4 bits, each of them has a different
function:
67
System-on-Chip Design with Arm® Cortex®-M processors
When accessing to normal memories (not peripherals), the encoding of the HPROT[3:2] can be used
to indicate cache types:
HPROT[3:2]
Table 3.12: Cache type indication with AMBA 2 AHB / AHB Lite.
In AMBA 5 AHB, the cache attribute information is extended, and as a result, the HPROT signal becomes:
68
System-on-Chip Design with Arm® Cortex®-M processors
The Cortex-M0 processor does not have a user access level, so HPROT[1] is always 1. Since
Cortex-M0/M0+/M3/M4/M23/M33 processors do not support internal cache, the cacheability
information is often not used.
Another address phase signal is HBURST, which indicates a burst transfer type. Burst transfer can
often improve system performance if the memory device can access data quicker when the accesses
are in sequential orders. If a burst transfer is used, the HBURST signal will indicate the burst transfer
type. AHB supports several types of burst transfers:
Single. (Not burst transfer. Each transfer is separated from each other.)
Incrementing burst transfer. (The address is incremented by the size of the transfer.)
Wrapping burst transfer. For each transfer, the address increments as in an incrementing burst
except when the address reaches the block size boundary of the burst. In this case, the address
wraps round to the beginning of the block size boundary. The block size of the burst can be
determined from the number of beats times the size of each transfer.
Burst transfer sequence is composed of multiple “beats.” Each beat is an AHB transfer with addresses
linked to others inside the burst. Within a burst, the transfer size, direction, and control information of
each transfer must be the same-. Both incrementing and wrapping bursts are supported for 4-beats,
8-beats, and 16-beats transfers. Incremental bursts can also be of unspecified length.
Wrapping burst is useful in cache controller designs. For example, when a processor requests to
read word data in address 0x1008, a cache controller for cache line size of 4 words might want to
fetch address 0x1000, 0x1004, 0x1008 and 0x100C into the cache memory. In this case, it can use
a 4-beats incrementing burst from address 0x1000, or a wrapping burst from address 0x1008. If
wrapping burst is used, the address wraps around in a block size boundary, which is four times word
size, or four words. Therefore, the address wraps around to 0x1000 after the transfer to 0x100C.
69
System-on-Chip Design with Arm® Cortex®-M processors
4 beats incrementing
burst from 0x1000
4 beats incrementing
burst from 0x1008
The wrapping burst transfers the same data as the incrementing burst, but it has the additional
advantage that the data needed by the processor can be transferred first, hence, reducing the waiting
time in the processor. Wrapping burst is commonly used by cache memory line fill. It allows the
processor core to access the required data as soon as possible while allowing the rest of the data in
the cache line to be cached. Unlike wrapping burst, an incrementing burst from address 0x1008 will
not fetch the data in 0x1000 and 0x1004, so it is not suitable for caching because the data for the
cache line is not complete.
Burst transfers can be carried out in different data sizes (e.g., byte, halfword, word, etc.). For each beat,
the address calculation should be adjusted by the size of the transfer data. There is a restriction in
burst generation: An AHB burst must not cross a 1K byte address boundary. This is because:
Easier design of AHB device: to optimize the performance for burst transfers, both AHB masters
and AHB slaves might need to have internal counters to monitor the burst operation and possibly
for advance address generation. By limiting the burst transfers to within a 1K byte address
boundary, a 10-bit counter will be sufficient for all possible burst transfers, even if third parties
develop some of the blocks in the system.
It prevents a burst from going across multiple AHB slaves. If a burst crosses a device memory
boundary, the second device receiving the burst will have a SEQ transfer as its first transfer, which
violates the AHB protocol.
HMASTLOCK is used to indicate the bus ownership should be locked for an atomic access sequence.
When HMASTLOCK is set, the bus infrastructure (e.g., arbiter) must not switch the bus ownership
until the HMASTLOCK is released. This is commonly used for semaphore operations, where a memory
location (lock flag) is used to indicate a resource is locked by a process or processor. When a processor
needs to lock a resource, it carries out a locked transfer sequence that reads the lock flag and then
updates it. Since the read-modify-write transfers are locked (atomic), another bus master cannot
change the lock flag between the two transfers, and hence this prevents race conditions.
Apart from the Cortex-M3 and Cortex-M4 processors, most of the Cortex-M processors do not use
HMASTLOCK signal. In Cortex-M3 and Cortex-M4 processors, HMASTLOCK is used for atomic
read-modify-write when a bit-band write operation take place. For other Cortex-M processors, if the
processor top-level does not have HMASTLOCK and connects to a standard AHB component that has
HMASTLOCK. The unused signal can be tied to 0.
70
System-on-Chip Design with Arm® Cortex®-M processors
In a system with multiple bus masters, the arbiter or master multiplexer can output a signal called
HMASTER. This is used as an ID value for AHB slaves. In most cases, the AHB slave does not need
to know which bus master is accessing it. However, in a few cases, a peripheral might need to have
different behaviors when accessed by different bus masters. In the Cortex-M processors, often the
HMASTER signal is used to indicate if the transfer is generated by software running on the processor
or by debugger connected to the processor.
Signals Descriptions
HWDATA[n-1:0] Write data. Data bus width “n” is typically 32, but can also be 64-bit.
HWUSER[x-1:0] This is a user-defined data phase signal introduced in AHB5 connecting from the bus master to bus slaves.
Potentially this can be used for: Parity bits for write data.
Table 3.16: Additional AHB data phase signals from bus masters to bus slaves.
Signals Descriptions
HRDATA[n-1:0] Read data. Data bus width “n” is typically 32, but can also be 64-bit
HRUSER[x-1:0] This is a user-defined data phase signal introduced in AHB5 connect from bus slaves to bus masters.
Potentially this can be used for:
- Parity bits for read data.
HRESP / HRESP[1:0] Bus response type. In AHB Lite and AHB5, this signal is single bit. In AMBA 2 AHB, this is two bits
HREADY / HREADYOUT HREADYOUT signals are generated from bus slaves. After the AHB slave multiplexer merges the responses
from bus slaves, the result HREADY is returned to all the bus slaves in the same AHB segment to indicate
the end of the current bus phase.
HEXOKAY Exclusive access is okay. This was introduced in AHB5. If the bus transfer is indicated as exclusive access.
Table 3.17: Additional AHB data phase signals from bus slaves to bus masters.
0x00000007 Data
0x00000006 Data
0x00000005 Data
0x00000004 Data
0x00000003 Data
0x00000002 Data
0x00000001 Data
0x00000000 Data
71
System-on-Chip Design with Arm® Cortex®-M processors
Similarly, if the transfer is in half-word size, the data appears on the data bus as:
HRDATA / HWDATA
0x0000000E Data
0x0000000C Data
0x0000000A Data
0x00000008 Data
0x00000006 Data
0x00000004 Data
0x00000002 Data
0x00000000 Data
AMBA 2 AHB, AHB LITE and AHB 5 support aligned transfers only. Unaligned transfers are not
supported (except for the ARM1136 processor, which has extra sideband signals to support unaligned
transfers). Aligned transfers have address values which are a multiple of the transfer size. For example,
word transfer addresses have address values that are multiples of 4, and half-word transfers have
address values that are multiples of 2. Byte transfers are always aligned.
HRESP
The HRESP signals are responses from the AHB slaves. In AHB for AMBA 2, these can be OKAY,
ERROR, RETRY, or SPLIT. Hence the HRESP is two-bit wide. For AHB LITE and AHB5, they can only
be OKAY or ERROR and are one bit wide.
When an AHB slave receives a transfer, it should carry out the transfer as requested by the AHB
master. If needed, it can insert wait states during the data phase. In normal circumstances, an AHB
slave should generate an OKAY response status, indicated by zeros in HRESP[1:0].
The OKAY response can be a single-cycle (no wait state is needed), but other responses are two
cycles, with the possibility of additional wait states before the response is asserted. For a simple case
of an AHB slave read, followed by a write, and then idle, it can look like:
72
System-on-Chip Design with Arm® Cortex®-M processors
HCLK
HSEL
Master
to slave
HTRANS NSEQ NSEQ IDLE IDLE
HWRITE
Other address
Controls #1 Controls #2
phase controls
HREADY
Slave to
master
HRESP OKAY OKAY OKAY
Figure 3.12: AHB Slave interface – a read with 2 wait states, followed by a write with 1 wait state.
Note that the AHB slave should not start to process a transfer on the AHB until it sees that HREADY is
high. In the Figure 3.12, the AHB slave does not start to process the transfer request of a second transfer
until HREADY is high (fourth cycle), and starts the data phase of the second transfer from the fifth cycle.
In case the slave cannot handle the transfer due to an error condition, it can respond with ERROR on
the HRESP signal. The error response must be two cycles wide. Using the previous example, if both
transfers reply with error response, the waveform will look like:
HCLK
HSEL
Master
to slave
HTRANS NSEQ NSEQ IDLE IDLE
HWRITE
Other address
Controls #1 Controls #2
phase controls
HREADY
Slave to
master
HRESP OKAY ERROR ERROR OKAY
Figure 3.13: AHB Slave interface – a read with 2 wait states with error response, followed by a write with 1 wait state and error response.
73
System-on-Chip Design with Arm® Cortex®-M processors
An error response must be two cycles, but additional wait states with OKAY responses can be added
before the error response. The minimum data phase for an error response is two cycles. Having a
single-cycle error response is illegal.
Upon receiving the first cycle of the error responses (when HREADY is still low), the bus master can
choose to continue the announced transfer in the next cycle, or cancel the announced transfer by
putting HTRANS as IDLE. The behavior might depend on which processor you are using.
An AHB slave must respond OKAY in the next cycle with no wait state if the HTRANS is IDLE or
BUSY, or if HSEL is not active.
RETRY and SPLIT responses have the same waveform as an ERROR response. However, they are
used when the AHB slave is not ready to complete the transfer in a short time. They ask the bus
master to drop the current transfer and retry it later. By doing this, and if the system contains multiple
bus masters, they provide a chance for other bus masters to take ownership of the bus to prevent
bandwidth from being wasted.
On the bus master side, RETRY and SPLIT are handled differently. When a bus master receives the
first cycle RETRY response, it will cancel the current announced transfer by replacing it with an IDLE
transfer, and then retry the transfer that the slave could not process from the last attempt.
On the other hand, when a bus master receives the first cycle of a SPLIT response, it cancels the
currently announced transfer by replacing it with an IDLE transfer, at the same time dropping the
HBUSREQ to bus arbiter to allow the bus ownership to be switched over. When the AHB slave is
ready to receive the transfer, it uses a separate sideband signal called HSPLIT to alert the arbiter to
restore bus ownership to the bus master, which can then issue the transfer again.
Due to the complexity of the operation, SPLIT response and HSPLIT signals are rarely used. Starting from
2001, the multi-layer AHB approach was developed and effectively made SPLIT and RETRY mechanisms
obsolete because the new solution is easier to use and can prevent a single transfer from slowing down
the rest of the system, while at the same time allowing a higher system bandwidth to be achieved.
Some Cortex-M processors like Cortex-M3 and Cortex-M4 have two-bits wide HRESP but are designed
for AHB Lite. This is because the processor was released before the AHB Lite specification was
officialized. When the AHB LITE system is connected to such a bus master, for example, the Cortex-M3
processor, it is possible that the HRESP from the AHB slave is only 1-bit wide (the HRESP[1] signal is
not implemented). In this case, only the HRESP[0] needs to be connected, and HRESP[1] can be tied 0
since HRESP[1] should never be asserted in an AHB LITE system.
HEXOKAY
HEXOKAY is designed to support exclusive access operations. It was introduced in AMBA 5 AHB
and is generated by a global exclusive access monitor in the bus system. Exclusive access sequences
contain an exclusive load and an exclusive store of the same data. The exclusive access monitor
detects if the same data might have been modified by another bus master between the load and store.
If there is a potential access conflict, the monitor blocks the store operation and returns an exclusive
fail status using HEXOKAY.
74
System-on-Chip Design with Arm® Cortex®-M processors
When an exclusive store occurs, HEXOKAY is asserted at the same cycle as HREADY if the global
exclusive access monitor does not detect an access conflict. Otherwise, the HEXOKAY remains low
(an exclusive access conflict might result if the bus slave does not support exclusive accesses).
HEXOKAY must not be asserted in the same cycle as HRESP is asserted (i.e., ERROR response).
More information on exclusive accesses is contained in Section 3.4.
The bus arbiter solution requires each bus master to have a HBUSREQ (Bus Request) output and a
HGRANT (Bus Granted) signal. Optionally the bus masters can also provide a HLOCK signal to indicate
they need to carry out a locked transfer. All these signals are connected to the bus arbiter.
Mux control
HBUSREQ HBUSREQ
Arbitration
phase
HLOCK Arbiter HLOCK
HGRANT HGRANT
Address phase
control
Bus master #1 Bus master #2
Register
Register
Address phase en
Multiplexer
control
Data phase
HREADY
HWDATA
AHB slaves
Basically, the arbitration phase is a step ahead of the address phase. If the address phase is multi-
cycled, the arbiter updates the arbitration continuously. At the end of a transfer phase, the arbitration
result is captured by the register and becomes the HMASTER signal, which indicates which bus master
is the current owner of the bus. Since the bus masters will know the arbitration result one cycle earlier
based on HGRANT, they can prepare for the switching and start outputting their transfers on the AHB
as soon as HMASTER switch to route the bus master’s outputs to the AHB slaves.
75
System-on-Chip Design with Arm® Cortex®-M processors
HCLK
HBUSREQ#A
HGRANT#A
Bus Master B de-asserted bus
request and bus grant switch back
HBUSREQ#B Bus Master B assert bus request to bus master A
and is granted by arbiter
HGRANT#B
When current transfer phase is
completed, HMASTER is updated to
new value
HREADY
HMASTER A A B A
HADDR A1 A2 B1 A
HWDATA A1 A2 B1
In the example in Figure 3.15, two bus masters share one AHB. In the waveform, bus master #A
continuously requests bus accesses, and during transfer accesses from master A, master B requests
the bus as well. In this example, bus master B has higher priority, so the arbiter switches the bus grant
signal to allow bus master B to gain access to the bus when the current transfer phase is completed.
When HREADY goes high, and HGRANT #B is high, this indicates that the current transfer phase has
been completed and the bus master is given bus ownership; bus master B can then output its transfer
control signals to the bus in the next cycle.
The control signals generated from the bus masters can be multiplexed to AHB slaves using the
HMASTER signal. The only signal that must be handled differently is the HWDATA. This is a data phase
signal; as a result, a further registration of HMASTER is needed to control the multiplexing of HWDATA.
Various arbitration schemes can be used for the arbiter. The most common arrangements are ‘fix
priority’ or ‘round robin.’ More complex arrangements which support a mixture of multiple schemes
can also be found. The arbitration process should usually also consider the type of transfer currently
being applied to the AHB slave. If a burst transfer is taking place, the arbiter should wait until the
burst is completed before switching the bus ownership. In addition, if one of the bus masters supports
locked transfers (e.g., the ARM7TDMI processor generates lock transfers when a SWAP instruction
is executed), the bus arbiter must also support the HLOCK signals from the bus master and must not
switch the bus ownership if the processor is generating a locked transfer.
With bus arbiter solutions, if a bus slave needs a longer time to process a transfer, the bandwidth of
the system can be badly impacted if the bus slave generates a long wait state. To solve this problem,
the AHB slave can generate a RETRY response. When the bus master receives such response, it will
retry the same transfer again. The retry process can repeat several times before the bus slave is finally
able to complete the transfer. During this process, if another bus master requires bus access, the bus
arbiter can switch the bus ownership so that the other transfer can be carried out.
76
System-on-Chip Design with Arm® Cortex®-M processors
The SPLIT response is very similar to the RETRY response, except that the bus master should drop the
bus request and wait until the AHB slave responds with another signal HSPLIT.
A processor system is running an OS, and a context switch happened at the middle of a RMW
sequence;
One of the processors in a multiple processor system executes the RMW sequence, and another
processor accesses the same memory location.
1. Exclusive access instructions - In Cortex processors (except Armv6-M processors), exclusive access
instructions (e.g., LDREX, STREX) are available to support exclusive accesses.
2. Exclusive access signals on the bus interface – exclusive access signals are introduced in AHB5.
Previous Cortex-M processors (Cortex-M3/M4/M7) use non-standardized exclusive access
sideband signals to support exclusive accesses.
3. System-level – for multi-processor systems, a bus-level global exclusive access monitor is needed to
detect access conflicts between multiple bus masters. Potentially, multiple global exclusive access
monitors could be used to detect access conflicts for different address ranges.
First, let us look at the access conflicts issue. Semaphores are needed in resource management in OS,
for example, to ensure that different application threads/tasks won’t try to access the same hardware
resource (e.g., a DMA channel) at the same time. The OS uses data variables (e.g., a data variable can
be used to indicate if a resource is locked) in memory to keep track of resource allocation, and the
application thread can request the resource by calling APIs which have access to the semaphore data.
Take the case of semaphore data (P), which indicates that a DMA channel is allocated. An application
task X calls an API to set this data to 1:
Read P to see if the value is 0. If the value is 1, then the DMA channel is already allocated and
return with fail status;
If the OS context switch happens between the read and the write operation, then another task Y can
execute the same RMW sequence to set P to 1, and when the OS returns to task X, it resumes the
RMW operation and writes 1 to the semaphore data. Now both task X and Y think that they have
allocated the DMA channel resource.
77
System-on-Chip Design with Arm® Cortex®-M processors
Read
semaphore
data
Hardware
resource
Between the read and
Exit. No avaliable?
write operations, a
Retry
Context switch /
later
Interrupt / other bus
Yes
master updated
semaphore data during
Modify
RMW sequence
semaphore
data
The same problem can happen between an application thread and an interrupt handler, or between
two different processors. In the case of two processors, the following sequence can occur:
Processor X write 1 to P
Processor Y write 1 to P
Now both threads running on processor X and Y believe that they have secured the DMA channel resource.
To solve this problem, legacy processors like Arm7TDMI use locked transfers to ensure that bus
interconnect does not switch bus ownership in the middle of RMW. However, lock transfers cannot be
implemented in high-speed bus protocols that use separated read and write bus channels. As a result,
exclusive accesses were introduced in Armv6 architecture.
The exclusive store instruction returns a success or fail status. If the status is success, then the
application task can continue the operation. If the status is fail, then it means there is a potential
access conflict, and it needs to restart the RMW sequence. The store operation is blocked if the
exclusive store returns fail status.
78
System-on-Chip Design with Arm® Cortex®-M processors
Read
semaphore
data with
exclusive read
Hardware
resource
Between the read and
Exit. No avaliable?
write operations, a
Retry
Context switch /
later
Interrupt / other bus
Yes
master updated
semaphore data during
Modify
RMW sequence
semaphore
data
Write
Exclusive store is blocked if
semaphore
there has been a potential
data with
access conflict. This is detected
exclusive store
either by local or global
Exclusive store exclusive access monitor
succeed?
No
To determine the return status, the system contains two exclusive access monitors:
Local exclusive access monitor – This hardware is inside the processor, and it triggers exclusive fail if
there has been a context switch (including exception entry/exit);
Global exclusive access monitor – This hardware is at the bus level, and it triggers exclusive fail if
another bus master has access to the address which is marked for exclusive access (when exclusive
load is made).
Local monitor
Detect context switches
(exception events)
Processor #1 Processor #2
Bus interconnect
SRAM
Semaphore data
79
System-on-Chip Design with Arm® Cortex®-M processors
If the local exclusive access monitor returns fail status – the exclusive store is blocked before it gets
to the bus interface level, so it won’t be carried out.
If the global exclusive access monitor returns fail status - the exclusive store is blocked by the global
exclusive store monitor, so the memory won’t get updated.
If either local or global exclusive access monitor return fail status, the write operation will not be
carried out, and the processor should retry the RMW sequence.
HEXCL
HEXOKAY
HMASTER
To generate exclusive access responses, the global exclusive access monitor contains at least one Finite
State Machine (FSM) and one tag register (it can contain multiples of these components).
Exclusive load
Reset (HEXCL==1), update
Write to same ERG stored HMASTER and
Exclusive store
by other bus masters HADDR
(HEXCL==1), response
with !HEXOKAY
(exclusive failed) and
open state exclusive state
block transfer from
reaching downstream
AHB
Read operation with HMASTER and partial
HEXCL asserted HADDR stored in
monitor
The address tag in the monitor does not necessarily record all bits of the address value. Typically, the
monitor can drop the lowest bits of the address as the exclusive fail information can be speculative:
80
System-on-Chip Design with Arm® Cortex®-M processors
There is no harm (except the cost of extra execution cycles and power) if the global exclusive access monitor
returns exclusive fail status in these cases as the software just needs to repeat the RMW sequence.
The minimum region that can be tagged for exclusive access is called the Exclusive Reservation
Granule (ERG). There can be different ERG sizes in different platforms, ranging from 8-bytes to
2048-bytes. It is common for Arm systems to use a 128 bytes ERG, it means bit 6 down to 0 are not
stored in the address tags in the global exclusive access monitors.
EXREQ – same as HEXCL, an address phase signal to indicate exclusive load/store accesses.
EXRESP – exclusive fail status in the data phase (assert at the end of the data phase - opposite
polarity compared to HEXOKAY).
To use Cortex-M3/M4/M7 processor with AHB5 system, simple glue logic is needed to convert
between EXRESP and HEXOKAY:
Cortex-M3/M4/M7
EXREQx HEXCL
D Q
En
D type
flip-flop HREADY
HREADY
EXRESPx
HEXOKAY
Figure 3.20: Glue logic for using Cortex-M3/M4/M7 processors exclusive access signals with AHB5.
Similarly, glue logic is needed when connecting Cortex-M23/M33 processors to bus slaves that use
EXREQ and EXRESP.
Cortex-M23/M33
HEXCL EXREQ
D Q
En
D type
flip-flop HREADY
HREADY
EXRESP
HEXOKAY HRESP
Figure 3.21: Glue logic for using bus slaves with legacy exclusive access signals in AHB5 systems.
1
Note: The Cortex-M7 processor uses AHB Lite for an optional peripheral AHB part of the peripheral region.
81
System-on-Chip Design with Arm® Cortex®-M processors
If HNONSEC is 1 (Non-secure), the bus interconnect must block the transfer if the address of the
transfer is pointing to a Secure location.
If HNONSEC is 0 (Secure), the bus master has Secure access privilege.
Please note that in Armv8-M, when a processor is in Secure state, access to Non-secure address is
indicated as Non-secure (HNONSEC==1).
The definitions of Secure and Non-secure address ranges are handled by a Security Attribution Unit
(SAU) inside the processor and Implementation Defined Attribution Unit (IDAU) which is tightly
coupled to the processor(s). SAU is programmable, and IDAU is system-specific, in some cases, even
the IDAU could be programmable.
To enable a high level of flexibility, Arm Corstone foundation IP / CoreLink SDK-200 included several
bus components for TrustZone security management:
Memory Protection Controller (MPC): for the partitioning of a memory block into Secure and Non-
secure address spaces.
Peripheral Protection Controller (PPC): for assigning bus peripherals into Secure and Non-secure domains.
Master security controller: An AHB5 bus wrapper for legacy bus masters that do not support
TrustZone. This handles the blocking of Non-secure transfers to Secure addresses and generation of
correct HNONSEC signals.
The details of the TrustZone system design is beyond the scope of this book. To help system designers,
Arm provides a document called Trusted Based System Architecture for Armv8-M (TBSA-M) which
include guidelines on best practices, including many areas beyond bus system designs. This document
is a part of the Arm Platform Security Architecture (PSA). For more information, please visit: https://
developer.arm.com/products/architecture/platform-security-architecture
82
System-on-Chip Design with Arm® Cortex®-M processors
Please note that TrustZone support is optional on the Cortex-M23 and Cortex-M33 processors. So, it
is perfectly fine to use Cortex-M23 and Cortex-M33 on a system without TrustZone.
Although it is possible to directly connect a peripheral to the AHB, separating peripheral connections
using APB has various advantages:
1. Many system-on-chip designs contain large numbers of peripherals. If they are connected to the
AHB system bus, they could reduce the maximum frequency of the system due to high signal fan
out and complex address decoding logic. Grouping peripheral connections in the APB can reduce
the performance impact on the AHB.
2. A peripheral subsystem can run at a different clock frequency, or be powered down without
affecting AHB.
3. APB interfaces use a simpler bus protocol, which simplifies the peripheral designs as well as
reducing the verification effort.
4. Most peripherals designed for traditional processors can be connected to APB easily as APB
transfers are not pipelined.
An APB system operates with a clock signal called PCLK. This signal is common to bus master (usually
an AHB to APB Bridge), bus slaves and the bus infrastructure blocks. All registers on the APB trigger
at rising edges of PCLK. There is also an active-low reset signal called PRESETn. When this signal is
low, it resets the APB system immediately (asynchronous reset). This allows a system to be reset even
if the clock is stopped. Like the reset signal in AHB (HRESETn), the PRESETn signal itself should be
synchronized to PCLK so that race conditions can be avoided.
In most simple systems with both AHB and APB, PCLK is from the same clock source as HCLK, and
PRESETn is from the same reset source as HRESETn. However, there are also systems that use separate
HCLK and PCLK frequencies. In that case, the AHB to APB bus bridge design will need to be able to
handle the data transfers across different clock frequencies or different clock domains.
83
System-on-Chip Design with Arm® Cortex®-M processors
Normally, the APB only occupies a small part of the memory space. As a result, the address bus of
the APB system is normally less than 32-bit. There is no transfer size control on APB. All transfers are
assumed to be 32-bit, and usually, the two LSB (bit 1 and bit 0) of the PADDR are not used because a
word transfer on APB must be aligned.
In most cases, an APB system has a bus bridge as the bus master that connects the APB to the main
processor bus. In addition, an APB slave multiplexer and an address decoder are needed, which is
sometimes a combined unit.
84
System-on-Chip Design with Arm® Cortex®-M processors
Combined APB
salve MUX with
address decoder
PSEL (for each
PSEL slave)
Address
Decoder APB Slave
#1
HSEL Address
APB Slave
MUX APB Slave
#3
Read data & optional
Response signals
(AMBA 3.0 or later)
Unlike AHB, APB operations are not pipelined. In AMBA 2, APB transfers must be two cycles. For read
operations, the read data needs to be valid - at least at the end of the second clock cycle.
Read transfer
PCLK
PSEL
PADDR Address
PENABLE
PWRITE
PWDATA
PRDATA Valid
85
System-on-Chip Design with Arm® Cortex®-M processors
For write transfers in APB for AMBA 2, the write data from APB master must be valid for the two
clock cycles off the transfer.
Write transfer
PCLK
PSEL
PADDR Address
PENABLE
PWRITE
PWDATA Valid
PRDATA
During a write transfer on APB, the actual write operation in the slave could happen either in the
first clock cycle or in the second cycle. This is implementation-defined. Therefore, the APB master
must ensure that the write data is valid for both clock cycles. There can be any number of clock cycles
between two transfers on APB.
With AMBA 3, each APB slave can extend a transfer by de-asserting the PREADY output signal or feedback
with an error response by the PSLVERR (Peripheral Slave Error) signal. For example, if an APB slave needs 4
clock cycles to complete a read transfer (3 wait states), the read operation waveform will be like:
Read transfer
PCLK
PSEL
PADDR Address
PENABLE
PWRITE
PWDATA
PRDATA Valid
PREADY
PSLVERR
The end of the read transfer is indicated by the assertion of PREADY. The minimum number of cycles
for the APB transfer is two cycles (same as AMBA 2), and the value of PREADY is ignored in the first
86
System-on-Chip Design with Arm® Cortex®-M processors
cycle of the transfer so that even if its value is logic 1, the transfer still takes at least two cycles. In
the previous example, the AHB slave responds with OKAY (indicated by logic zero on PSLVERR when
PREADY is one). If an APB slave response with an error response, the waveform would be:
Read transfer
PCLK
PSEL
PADDR Address
PENABLE
PWRITE
PWDATA
PRDATA
PREADY
PSLVERR Error
Figure 3.26: APB read for AMBA 3 with 3 wait states and error response.
When an error response is generated, the read data from the AHB slave might not contain any useful
information and could be discarded. The value of PSLVERR is only valid when PREADY is high and not in
the first cycle of the transfer. For example, if PSLVERR and PREADY are both high in the first cycle of the
transfer, it is not considered as an error response as an APB transfer must be at least two clock cycles.
The waveform for APB writes in AMBA 3 are very similar. For write operations with an okay response,
the waveform would be like:
Write transfer
PCLK
PSEL
PADDR Address
PENABLE
PWRITE
PWDATA Valid
PRDATA
PREADY
PSLVERR
87
System-on-Chip Design with Arm® Cortex®-M processors
And for write with error response, the waveform would be like:
Write transfer
PCLK
PSEL
PADDR Address
PENABLE
PWRITE
PWDATA Valid
PRDATA
PREADY
PSLVERR Error
Figure 3.28: APB write for AMBA 3 with 3 wait states and error response.
PPROT is similar to HPROT in AHB and is asserted throughout the whole transfer. However, it is only
3-bits wide.
Please note, unlike AHB5, the TrustZone security attribute is part of PPROT instead of a separate signal.
AMBA 4 APBv2 also introduces byte strobe signals for write operations. For a 32-bit APB, the PSTRB
signal is 4 bits – one bit per byte lane. It is also asserted for the whole transfer.
32-bit
31 24 23 16 15 8 7 0
PSTRB[3] PSTRB[2] PSTRB[1] PSTRB[0]
The PSTRB signal is active high and is used for write-operations only. During read operations, the
PSTRB signal is ignored by the bus slave, and the whole 32-bit word is read.
88
System-on-Chip Design with Arm® Cortex®-M processors
During write transfers, an APB slave can sample and register the write data at any cycle within
the transfer. It is common for APB slaves to sample the write data at the last cycle of the transfer,
especially in APB devices for AMBA 2. However, it is also perfectly acceptable to sample the write
data at the first cycle of the write transfer because the APB master must provide valid write data to
APB slaves in even in the first clock cycle.
For read transfer, the APB master should only read the return read data value at the last cycle or (when
PREADY is 1). If an APB slave returns an error response, the bus master should discard the read data.
If an AMBA 4 APB master (APBv2) is used, then the suitability of using AMBA 2/3 slaves is dependent
on application – if there is a need to support protection information or byte strobe, then AMBA 2/3
APB slaves must be modified to support AMBA 4 APB functionality.
An AMBA 2 APB master (e.g., an AHB to APB bridge) cannot support APB slaves designed for AMBA
3/4 as it cannot handle wait states.
An AMBA 3 APB master might be able to support an APB slave designed for AMBA 4 (APBv2),
providing that the bus slave only needs 32-bit write operations (no need to support byte strobe
signals). In this case, the PSTRB signals can be tied to PWRITE so that all byte strobes are asserted for
all write operations.
89
System-on-Chip Design with Arm® Cortex®-M processors
90
System-on-Chip Design with Arm® Cortex®-M processors
CHAPTER
Building simple bus 4
systems for Cortex-M
processors
91
System-on-Chip Design with Arm® Cortex®-M processors
For processors that support the Harvard bus architecture, design the bus system to enable
concurrent instruction and data accesses.
Use default slaves to detect access to invalid addresses – this enables bus error to be triggered, and
software to handle it when something has gone wrong.
In most of the earlier Cortex-M processor designs, the initial address for vector tables is fixed in
address 0x00000000. Therefore, the program image needs to be visible in this address at startup.
Minimize the number of wait states in the memories – in processors that don’t have caches, having
wait states in the memory system directly impacts the performance, energy efficiency, and interrupt
latency. In general, wait states in peripheral accesses are less of a problem as those accesses
happen less frequently.
Try to keep to a minimum the number of bus slaves on the main system bus.
Separating the peripheral bus from the system bus has a number of advantages:
A high number of bus slaves in the main system bus could reduce the maximum clock frequency
and can also increase the area and power of the bus interconnect. By separating peripheral
connections in different buses, address decoding and bus switching logic on the system bus can be
optimized for speed because most peripherals are grouped as one item via the bus bridge.
By using a bus bridge to separate the peripheral bus from the system bus, it is possible to provide
timing isolation between the two bus segments. This allows the peripherals to be operated
at different clock speeds, as well as providing a better chance to get a higher maximum clock
frequency on the system bus.
Bus protocol for the peripheral bus is simpler. This reduces the time for peripheral development and
testing, as well as reducing complexity and gate counts.
As a result, most of the peripherals that do not need low latency accesses (e.g., SPI, I2C, UART) can be
placed in separated peripheral buses. Some peripherals like GPIO can gain the benefit of lower access
latency, so some GPIO blocks are placed system AHB or single-cycle I/O port interface (available on
Cortex-M0+ and Cortex-M23 processors).
92
System-on-Chip Design with Arm® Cortex®-M processors
AHB to
Program Default AHB
RAM AHB APB
ROM slave peripheral(s)
peripheral(s) bridge
Interrupts
Interrupts
The “ROM” (could be embedded flash, or other NVM for holding program image) is placed in
address 0x00000000 as the initial vector table address is fixed to this location. For FPGA designs,
you can use on-chip SRAM with the initial image. Ideally, use zero wait state for “ROM.”
The RAM is normally synchronous static RAM with zero wait state to provide the best performance.
Usually, we put the RAM in address 0x20000000, the starting address of the SRAM region.
Some of the peripherals can be designed with an AHB interface to lower access latency (e.g., GPIO
could be designed as an AHB peripheral as some control applications can be I/O intensive).
General peripherals that do not need fast access can be designed with an APB interface and
connected via an AHB to APB bridge. Potentially, the peripheral bus can run at a lower clock
frequency.
The address ranges of AHB and APB peripherals are usually within 0x40000000 to 0x5FFFFFFF.
The exact arrangement is up to the system designers.
93
System-on-Chip Design with Arm® Cortex®-M processors
The default slave is selected when the AHB transaction is targeting an invalid address.
For minimum requirements, the top-level of the FPGA/SoC design only needs to expose the clock
and reset connections, the peripheral interface, and the debug connection.
1. Optional single-cycle I/O port (IOP) interface for low latency peripheral register accesses;
If a designer decided to use the single-cycle I/O port interface for a peripheral:
The peripheral might need to be modified to support the single-cycle I/O port interface;
The system would need to include a simple IOP address decoder to tell the processor which address
range should route to the IOP and which should not. This decoder contains simple combinatorial
logic that decodes the 32-bit address value, and feedback to the processor that the address belongs
to either the IOP or AHB interface.
The MTB feature is used to provide a low-cost instruction trace solution. The MTB is placed between
the AHB and SRAM, working as an AHB to SRAM bridge in normal operations. When used for
instruction trace, the debugger programs the MTB to allocate a small portion of the SRAM for storing
instruction trace information. The MTB has a trace interface to receive instruction trace information
from the processor and can also generate a debug event (halting request) to the processor.
Typically, the MTB would be configured in circular buffer mode so that only the recent history is kept.
While it doesn’t provide the full software execution history, it is still a useful feature in debugging
software issues like providing program flow details just before fault exceptions.
The 32-bit SRAM interface can work with most synchronous on-chip SRAM and FPGA block RAM.
Please note, it supports zero wait state SRAM only.
94
System-on-Chip Design with Arm® Cortex®-M processors
Clock and
reset
generation
Debug interface Cortex-M0+ Integration SYSRESETREQ
JTAG/Serial wire Debug
debug IOP address
interface
decode
module IOP
Cortex-M0+
peripheral(s)
Interrupts
MTB AHB to
Program Default AHB
AHB APB
ROM slave peripheral(s)
peripheral(s) bridge
RAM
Interrupts
Interrupts
Since the Cortex-M0+ processor supports the separation of privileged and unprivileged execution
levels, you should consider system-level security if this feature is used. To support the separation
of privileged and unprivileged levels, a designer should also consider adding the MPU (Memory
Protection Unit) option, which can prevent unprivileged codes from accessing privileged memories.
Peripheral registers for system control (e.g., clock, power management, flash programming) should
be privileged access only.
If an AHB access is unprivileged and the address targets a privileged only device, then the address
decoder in the system can select the default slave instead of the targeted device to generate a fault
exception (bus error).
Similar to the Cortex-M0 processor, the initial vector table address is fixed at address 0x00000000.
Therefore, the ROM needs to be visible at the beginning of the memory map after reset.
95
System-on-Chip Design with Arm® Cortex®-M processors
In many aspects, the system-level integration for the Cortex-M1 system is similar to the Cortex-M0 system:
The processor does not have separation of privileged and unprivileged operations;
The Cortex-M1 processor supports optional Instruction TCM (Tightly Coupled Memory) and Data TCM;
Use of Tightly Coupled Memory (TCM) is common in processor systems implemented in FPGA. If this
option is implemented, the Cortex-M1 processor provides two TCM interfaces, one for Instruction
memory (I-TCM) and the second one for Data (D-TCM). When TCMs are used, the Cortex-M1
processor can execute a program in its best performance. If executing a program from memory blocks
connected via AHB, the performance/MHz would be lowered because the AHB interface on the
Cortex-M1 has an additional pipeline stage to allow it to reach high clock frequency.
TCM can be implemented with RAM blocks inside the FPGA. The details of implementing RAM
blocks inside the FPGA depend on the FPGA type and the FPGA design tools you use. An example
is shown in Section 2.2, but you might need to refer to the FPGA vendor’s documentation and tools
documentation for the correct implementation of the TCMs.
Since the Program “ROM” (it is actually RAMs in the FPGA) and RAM can be connected via the TCM
interface, the system AHB connections can be simplified.
96
System-on-Chip Design with Arm® Cortex®-M processors
D-TCM Interrupts
RAM
AHB to
Default AHB
AHB APB
slave peripheral(s)
peripheral(s) bridge
Interrupts
Interrupts
The source code of the Cortex-M1 top-level files are available into two versions:
When the debug features are included, the debug interface has a separate set of TCM interfaces
(using the block RAM as dual-port RAM). The reason for having a separated interface for the debugger
to access the TCM at maximum speed. In most modern FPGA architectures, the memory blocks can
be used as dual-port memory. Therefore, each TCM block can be simultaneously connected to the
processor core’s TCM interface and the debug TCM interface:
Cortex-M1
Address,
Address, control, control, write
write data data Processor core
I-TCM
I-TCM interface
Read data
Address,
Address, control, control, write
write data data Debug
D-TCM
D-TCM interface
Read data
Figure 4.4: The Cortex-M1 TCM connections when the debug option is used.
97
System-on-Chip Design with Arm® Cortex®-M processors
In most of the modern FPGA architectures, dual-port RAMs are widely supported.
If the debug option is not used in your FPGA design, there is no debug TCM interfaces. Only one
interface per TCM is needed. In this case, it does not matter whether the FPGA memory supports
dual-port operation or not.
Where the SRAM memory on the FPGA is insufficient for your application, you can add an SRAM
interface on the System bus as an AHB bus slave and add external SRAM to extend the total
memory size of the system. You can also use SRAM connected to the AHB instead of using I-TCM
and D-TCM.
I-CODE Instruction fetches and vector fetches for CODE region (0x00000000 to Read transfers only
0x1FFFFFFF)
D-CODE Data and debug read/write for CODE region (0x00000000 to 0x1FFFFFFF)
System All accesses not targeting at CODE region, PPB or internal components
(SRAM, Peripheral, RAM Devices, and System/Vendor specific address range
excluding PPB)
Private Peripheral All accesses are in external PPB range (0xE0040000 to 0xE00FFFFF) excluding Privileged accesses only
Bus (PPB) internal components (e.g., ETM, TPIU, ROM table)
Table 4.1: Bus master interface on the Cortex-M3 and Cortex-M4 processors.
The multiple bus interface allows instruction fetches and data accesses to take place at the same time
(i.e., Harvard bus architecture) to get better performance. This requires that the program image and
data are on different buses.
The program image is placed in the CODE region. Similar to the Cortex-M0 processor, the initial
vector table address is fixed at address 0x00000000. Therefore, the ROM (which contains the
vector table) needs to be visible in this address after a reset.
SRAM and peripherals are connected via the system bus. Normally in address 0x20000000 (for
SRAM) and address 0x40000000 (for peripherals). This arrangement allows software developers
to utilize the bit-band feature on SRAM and peripherals.
98
System-on-Chip Design with Arm® Cortex®-M processors
PPB
Debug interface
SYSRESETREQ
JTAG/Serial wire Debug
debug Cortex-M3 / Cortex-M4 ROM
interface
table
module
Interrupts
AHB to
Default AHB
System bus (AHB Lite) RAM AHB APB
slave peripheral(s)
peripheral(s) bridge
Interrupts
Program Default
Peripheral bus (APB)
ROM slave
Interrupts
Since there are two main AHB bus segments, both having some invalid address ranges, we will need
default slaves on each of these buses.
The reason for separating I-CODE and D-CODE in the system is to add a literal data cache (on the
D-CODE bus) so that literal data can be read even if the instruction fetch is stalled due to a wait
state on flash memory. Typically, flash memories are quite slow (in the range of 30MHz to 50MHz)
in comparison to modern microcontrollers, which can run at over 100MHz. When the Cortex-M3
processor was designed, a common approach to overcoming flash performance issues was to use flash
memories with a wider bus (e.g., 128-bit) with a prefetch buffer so that sequential instructions could
be prefetched while the processor consumed the remaining instructions in the prefetch buffer.
Figure 4.6: Flash prefetching can help eliminate wait states in sequential flash accesses.
However, program operations contain many constant data reads, and these read operations would
result in non-sequential accesses, which would be stalled as the data are not available in the buffer.
To make matters worse, if the literal access occurred just after the prefetcher started a prefetch, the
flash interface needed to wait until the flash read is completed before reading the literal from flash
memory. For example, in Figure 4.7, the processor pipeline needs to stall after the literal data read
(address 0x1048) until the flash returns the data (end of read operation to 0x1040-0x104F).
99
System-on-Chip Design with Arm® Cortex®-M processors
Figure 4.7: Literal data access can reduce the performance of a system with prefetcher.
To help reduce the performance penalty, one solution is to separate the data accesses on a different
AHB and put a small literal data cache on it so that literal data used in small loops will not cause
latency after the first loop.
Cortex-M3 / Cortex-M4
I-CODE D-CODE
Bus arbiter
Embedded flash
Figure 4.8: Separating I-CODE and D-CODE allows literal data cache to operate in parallel with the prefetcher buffer.
For many simple designs or systems that use system-level caches, there is no need to have such flash
access acceleration arrangements. System designers can simply merge the I-CODE and D-CODE
buses. The Cortex-M3 and Cortex-M4 product bundles provide two components for this purpose:
This is a simple bus multiplexer with minimal gate count. To use this component, the DNOTITRANS
input of the Cortex-M3/Cortex-M4 must be set to 1. This prevents the processor I-CODE interface
from generating bus transfers at the same time when D-CODE is used.
100
System-on-Chip Design with Arm® Cortex®-M processors
Cortex-M3 / Cortex-M4
1 DNOTITRAN
cm3_code_mux /
cm4_code_mux
Program Default
ROM slave
Figure 4.9: Using code mux component to merge I-CODE and D-CODE.
This component has internal bus arbitration and a register slice to hold I-CODE transfers in a buffer
if both I-CODE and D-CODE are active. This can be useful if there are other bus slaves in the CODE
region that could be accessed at the same time as instruction fetches.
Cortex-M3 / Cortex-M4
0 DNOTITRAN
cm3_flash_mux /
cm4_flash_mux Other slave
Program Default
ROM slave
Figure 4.10: Using flash mux component to merge I-CODE and D-CODE.
101
System-on-Chip Design with Arm® Cortex®-M processors
In newer microcontroller designs, the use of system-level cache for embedded flash is increasingly
common. Arm provides the AHB flash cache which can be used with various Cortex-M processors
with an AHB interface.
Cortex-M3 / Cortex-M4
1 DNOTITRAN
cm3_code_mux /
cm4_code_mux
AHB flash
Default
cache
slave
Flash
interface
Embedded
flash
Figure 4.11: Using AHB flash cache in the Cortex-M3/M4 system design.
For such systems, the chance for both instruction accesses and literal data having a cache miss is
relatively low (except the first time the code sequence is executed of course), so separating the
CODE bus into I-CODE and D-CODE does not bring a lot of benefits. In newer Cortex-M processors
like Cortex-M33 and Cortex-M35P, the I-CODE and D-CODE have been merged to reduce system
integration complexity and to enable lower power.
Instead of having the cache module closely coupling to the processor, the AHB flash cache
arrangement has the following advantages:
The interface between the flash cache and flash interface can be designed as wide bus such as
a 128-bit AHB. This enables faster data transfers from flash to the cache, and next flash access
(e.g., if there is a cache miss) can start earlier.
If the code bus has another bus master, the flash cache can provide caching to the other bus master.
Please note:
For Cortex-M3 and Cortex-M4 processors, the internal bus interconnect has a registering stage
between the instruction fetch interface and the system bus. Therefore, the performance of the
system is reduced if the software image is executed from the system bus.
102
System-on-Chip Design with Arm® Cortex®-M processors
Peripherals are expected to be connected via the System AHB (or on a peripheral APB via a bus
bridge) instead of the Private Peripheral Bus (PPB). The PPB intended for debug components has
some limitations; namely:
Accesses behave as Strongly Ordered (no other data memory access can start until the current
data access finished);
It is accessible from the Debug Port and the local processor, but not from any other agent
(processor) in the system.
(Source: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka14334.html)
If needed, it is possible to have an SRAM shared between code and SRAM regions by having bus
accesses from both code and system buses (i.e., memory address aliasing). This allows the software
to use a single SRAM block and execute code from SRAM without performance loss:
Cortex-M3 / Cortex-M4
1 DNOTITRAN
Flash_mux
Default
ROM slave
SRAM
Figure 4.12: Placing a single SRAM into both CODE and SRAM region.
103
System-on-Chip Design with Arm® Cortex®-M processors
However, from a security point of view, this needs to be handled carefully to prevent vulnerabilities. For
example, if a memory region is privileged access only, then the access permission needs to be privileged
for both address locations (Alternatively, you can make the RAM visible on only one bus at a time).
Peripherals that need high data bandwidth; for example, USB controllers, ethernet interfaces.
In both cases, these units have bus master interfaces to initiate transfers, as well as bus slave interfaces
for configuration. To enable multiple bus masters to access the AHB bus system, Arm provides:
Simple AHB master multiplexers to support two or three bus masters accessing a single AHB bus
segment (shared bandwidth);
The concepts of these components were covered in Chapter 3. For a simple Cortex-M0 based system
with a DMA controller, the system design can look like this (Figure 4.13):
AHB master
multiplexer
AHB to
Program Default AHB
RAM AHB APB
ROM slave peripheral(s)
peripheral(s) bridge
Interrupts
DMA
interrupt(s)
APB APB APB APB
peripheral(s) peripheral(s) peripheral(s) peripheral(s)
Interrupts
Figure 4.13: Simple Cortex-M0 system design with DMA controller and AHB master multiplexer.
104
System-on-Chip Design with Arm® Cortex®-M processors
Both the processor and the DMA controller can have a full view of the memory system. The design is
simple to create, but the bus bandwidth is shared. As a result, the number of bus masters supported
by the AHB master multiplexer component is often limited to 2 or 3. In such systems, it is common to
give the DMA controller higher priority as the processor can access the bus very frequently (due to
both instruction-fetches and data accesses).
For systems with higher performance needs, the AHB bus matrix component is generally used. In
addition, you would also often need multiple banks of SRAM to enable processor and other bus
masters to access different banks of SRAM at the same time. Otherwise, the bandwidth of SRAM
accesses could become the bottleneck.
Cortex-M3 /
DMA controller USB controller
Cortex-M4
I-CODE
D-CODE
System
DMA
USB
Figure 4.14: A simplified Cortex-M3/Cortex-M4 processor system design with an AHB bus matrix.
In addition, to provide higher total data bandwidth, having multiple banks of SRAM can also allow some
banks of SRAM to be powered down when not in use, resulting in lower power consumption in some
situations. However, when all banks of SRAM are used, the maximum system power is higher than a single
SRAM bank. Of course, the use of multiple SRAM banks has the advantage of higher data bandwidth,
which might mean the overall system-level energy efficiency is still better than one SRAM bank.
AHB bus matrix designs have a concept called sparse connectivity, which means some of the AHB bus
masters connected to the bus matrix do not need to have access to all of the downstream AHB bus
segments. For example:
A USB controller does not need to access to flash program area and peripherals;
The I-CODE and D-CODE bus of the Cortex-M3/Cortex-M4 do not need to access SRAM and
peripherals because transfers on these buses are limited to the CODE region (unless the SRAM
is mapped into CODE region).
The configurable AHB bus matrix from Arm supports sparse connectivity, which reduces the bus
matrix area and potentially helps to improve timing and speed.
105
System-on-Chip Design with Arm® Cortex®-M processors
Another supported feature in Arm’s AHB bus matrix is the internal default slave. Since the AHB bus
matrix has an internal address decoder to select which downstream AHB bus segment should be used,
it can also detect accesses to invalid address ranges and route them to internal bus matrix, which
means that there is no need to add another system-level default slave.
The AHB bus matrix is highly flexible and can bring many advantages to system designs. However,
please note that it can also introduce latency cycles when switching a bus segment from one master
to another. It is possible to optimize a bus matrix to reduce the chance of unnecessary bus arbiter
switching by customizing the logic that defines the default selected bus or forcing the address of the
bus to a specific value when the bus is idle.
When designing systems with multiple bus masters, from a security point of view, it is common to
make the configuration interface of the bus masters (e.g., DMA controllers) privileged access only.
Otherwise, if an unprivileged software component can program a DMA controller, it can use the DMA
controller to access privileged-only memories, which means bypassing the memory protection.
Other bus
Processor Processor
master(s)
SRAM SRAM
Figure 4.15: Correct location for placing global exclusive access monitors.
In Figure 4.15, there are two banks of SRAM, and each of them needs a global exclusive access
monitor because they are on separate AHB bus segments. In system-level designs, you need a global
exclusive access monitor for each AHB segment after bus arbitration if:
106
System-on-Chip Design with Arm® Cortex®-M processors
1. The bus slaves connected to the bus segment can contain semaphore data, and
2. The bus slaves connected to the bus segment can be accessed by more than one bus master that
generates access to semaphore data.
Bus segments that only contain general peripherals or flash memories (or other types of NVM) do not
require an exclusive access monitor as there is no semaphore data in these buses.
In single-processor systems, it is possible to omit the global exclusive access monitor because even
with other bus masters present (e.g., DMA controller, USB controller), the software can ensure that
these other bus masters do not access the semaphore data. Therefore, normally, global exclusive
access monitors are present only on multi-processor systems.
In single-core systems with Cortex-M3, Cortex-M4 and Cortex-M7 processors, which use proprietary
exclusive access handshaking signals (EXREQ and EXRESP), if an AHB bus segment does not have an
exclusive access monitor:
Where the bus contains SRAM, you can tie EXRESP low (do not tie EXRESP high as OS semaphore
functions using exclusive accesses will always fail).
Where the bus segment only contains NVM or peripherals that do not contain semaphore data, it is
valid to tie EXRESP high to indicate exclusive access to such address range is not supported.
In single-core systems with Cortex-M23, Cortex-M33 and Cortex-M35P processors, which use AHB5
bus protocol with exclusive access support (HEXCL and HEXOKAY):
If the bus contains SRAM, you can use a simple glue logic to assert HEXOKAY in data phases of
exclusive accesses (do not tie HEXOKAY high as AHB5 protocol requires that HEXOKAY is asserted
only when HREADY is asserted and must not be high when HRESP is high).
If the bus segment only contains NVM or peripherals which do not contain semaphore data, it is
valid to tie HEXOKAY low to indicate exclusive access to such address ranges is not supported.
Additional information related to exclusive access support on Cortex-M3 and Cortex-M4 processor
can be found in this knowledge base article:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka16180.html
107
System-on-Chip Design with Arm® Cortex®-M processors
To use the address remap function, the system design needs to include a program register to control
the behavior of the address decoder. For the use case, we mentioned, this control register only needs
1-bit to switch between two memory maps. However, some other devices support multiple boot
arrangements, and this register might have multiple bits.
An example of an address map design with remap is shown in Figure 4.16 below:
Address Address
After boot loader
execution, just before
starting program in
Boot loader Reset handler in embedded flash, Boot loader
boot loader REMAP is disabled
0x00100000 @0x00100100 0x00100000
0x00100101 Vector
0x00080000 table
0x00080000
Embedded flash 0x20001000 Embedded flash
Reset vector points Vector table in
Boot loader to boot loader’s embedded flash
alias reset handler is in effect
0x00000000 0x00000000
Vector table in Start of application
Boot at startup,
boot loader used in embedded flash,
REMAP active
for startup REMAP disabled
Figure 4.16: An example address map of address remap to support boot loader.
With the arrangement shown in Figure 4.16, the vector table in the boot loader is used for booting
up the system. The execution of the boot loader is based on its real address 0x001000000. However,
during this period, the vector table in the boot loader alias is still being used. After the boot loader
has finished its work, it switches the REMAP off so that the vector table of the program image in
embedded flash is used. It can then read the vector table of the application, set up the MSP value,
and branch into the reset handler.
Please note the embedded flash might also have an alias address range to allow the boot loader to
handle flash programming. Otherwise, the beginning of the embedded flash address range will not be
visible as the bootloader alias is placed there during start-up.
108
System-on-Chip Design with Arm® Cortex®-M processors
In the design of the remap control register, there are several considerations:
In many system designs, the remap control register needs to be privileged access only for security
reasons.
In some systems, it is desirable to make the remap control register reset by power-on reset so
that the bootloader only executes once, and does not get executed again during debugging (the
debugger normally resets the target using a system reset, with SYSRESETREQ field in AIRCR).
In some systems, the remap control register could be designed to only be switched off but cannot
be switched back on by software. This arrangement is used by some secure boot systems where the
information associated with the security checks are hidden inside the boot loader and are masked
out (non-accessible) after REMAP is switched off.
In addition to bootloader use cases, a remap arrangement is also used to allow part of the SRAM to
be used as a vector table in systems with Cortex-M0 processors because Cortex-M0 does not have a
programmable Vector Table Offset Register (VTOR). In such usage scenarios, a REMAP control register
bit is needed and defaults to off (no REMAP). When set to 1, a portion of system SRAM is aliased to
the first 192 bytes (maximum vector table size in Cortex-M0) of system memory. Before setting the
REMAP control register, software should copy the original vector table to the SRAM that will then be
remapped so that exceptions can still work afterward.
The remap feature is supported by the AHB bus matrix designed by Arm. However, for processors
with VTOR, there is no need to use REMAP to allow runtime updates of vector tables because
you can program VTOR to point to the SRAM area. In newer Cortex-M processors like Cortex-M7,
Cortex-M23, Cortex-M33, and Cortex-M35P, the initial address of the vector table for boot-up is
configurable and, therefore, there is no need to use REMAP to enable multiple boot options.
In terms of performance, at the interface level, TCM and AHB provide the same read access latency.
Write access timings are different, but at the processor pipeline level, the write could still be a single-
cycle, even when using an AHB interface (e.g., when the processor has a write buffer, or when the
AHB pipeline is mapped into two stages of the processor’s pipeline).
109
System-on-Chip Design with Arm® Cortex®-M processors
Clock
Address
Read address Write address
TCM
Read data
Read data
Write data
Write data
Figure 4.17: Timing characteristics comparison between TCM interface and AHB interface.
Some designers suggest that a dedicated TCM interface could be beneficial if the bus is often
occupied by other transactions from other bus masters. In that situation, access to TCM will not
be delayed by other bus traffic. However, if using a multi-layer AHB approach, processor access to
memories can still be carried out immediately providing that the bus slave segment accessed is not
being used by another bus master. Even if a processor supports TCM unless its bus interface supports
multiple outstanding transfers, it is impossible to start a new data access while the current memory
read/write is on-going.
While having TCM reduces the complexity at the system-level interconnect, the merging of read data
from the system bus and TCMs is placed inside the processor, so there is no area saving. Potentially,
the TCM design might restrict the address range and size of the memories while connecting memories
on the AHB instead could be more flexible, as designers can customize address ranges and memory
sizes based on application needs. It is also possible to optimize the AHB bus structure to minimize
timing delays between the processor and the memory blocks.
Cortex-M
Other AHB to
Default
AHB APB
slave
slaves bridge
Figure 4.18: AHB path optimized to minimize timing delay between processor and memories.
110
System-on-Chip Design with Arm® Cortex®-M processors
In some processor designs, the use of TCM is required to allow deterministic interrupt responses.
For example, in the Cortex-M7 processor, access to memories on the AXI bus system can have non-
deterministic timing due to cache hit/miss scenarios. Having TCM enables interrupt services to be
carried out quickly in deterministic manners. But in small processors like Cortex-M0 to Cortex-M33,
the omission of a TCM feature is not a real issue.
In addition to the embedded flash memories, you need an embedded flash memory controller IP
that links the embedded flash to AMBA buses, and potentially system cache IP. The flash memory
controller IP can be flash technology-specific; however, in 2018, Arm announced the Generic Flash
Bus (GFB) standard, making it possible to create generic embedded flash controllers and allow
embedded flash macros to be connected to those controllers via simple glue logic, which is process
technology-specific. Arm also offers embedded flash controllers based on GFB interface. Information
about AMBA GFB can be found here: https://developer.arm.com/products/architecture/system-
architectures/amba/amba-gfb
Embedded flash memories are usually quite slow (e.g., 30MHz to 50MHz access speed for most of
the low-power embedded flash macros). Typically, caches in some form are needed to enable the
processor system to run at higher clock speeds. Having caches also enables better energy efficiency
by reducing the memory accesses on the embedded flash (which could be relatively power-hungry).
Such cache components are available from Arm and other IP suppliers. For example, the AHB flash
cache is part of the Arm Corstone-100 foundation IP (https://developer.arm.com/products/system-
design/corstone-foundation-ip/corstone-100).
When doing flash programming, instead of using the debug connection to access the flash controller
directly, the common approach is to:
Download a small piece of code to SRAM called a flash programming algorithm, and
111
System-on-Chip Design with Arm® Cortex®-M processors
Download additional configuration information and set the PC (program counter) to the flash
programming algorithm, before executing the code to program the flash page.
Each time a flash page is programmed, optionally, the flash programming algorithm can verify the
contents of the page. The debug host can then download another page of data and repeat the process
until all the pages are programmed.
If a device contains TrustZone security extensions and the on-chip secure firmware is already loaded
to the device, the flash programming algorithm might already be present within the on-chip firmware.
In such cases, the flash programming sequence only needs to load the new flash contents and
configurations before triggering the flash programming steps.
When the device starts up for the first time, since the flash does not contain a valid program image,
it will quickly enter fault exception and eventually go into LOCKUP state.
Even if the device is in LOCKUP state, the debugger can still establish a debug connection via
JTAG/Serial Wire.
The debugger can then enable a reset vector catch (a debug feature in the Cortex-M processors),
and use System Reset Request (by programming Application Interrupt and Reset Control Register,
AIRCR) to reset the system. When the processor comes out from system reset, it enters halt state
immediately because the reset vector catch is enabled.
The debugger can download the flash programming algorithm and pages of program image into
SRAM and set the PC (program counter) to launch the flash programming algorithm.
When all the required flash pages are programmed, it can reset the system again to start the
application or to debug it.
The same concept can also be applied to devices that run code from external flash (e.g., QSPI flash).
112
System-on-Chip Design with Arm® Cortex®-M processors
113
System-on-Chip Design with Arm® Cortex®-M processors
114
System-on-Chip Design with Arm® Cortex®-M processors
CHAPTER
Debug integration 5
with Cortex-M
processor systems
115
System-on-Chip Design with Arm® Cortex®-M processors
Debug and trace features are configurable in most of the Cortex-M processors (except DesignStart
Eval versions, which provide the processor designs with fixed configurations). In general, the debug
features include the following:
Access to the memory space (including download of program code into SRAM for the FPGA
platform. Flash programming is a bit more complex and will be explained later). Memory access can
be carried out while the processor is running.
Breakpoint events can be used to halt the processor, or if using a software debug agent (i.e., using
the debug monitor in Armv7-M or Armv8-M Mainline), then the debug monitor exception
is triggered. There are two types of breakpoint mechanisms:
Hardware breakpoints that use hardware comparators to compare the program counter with
breakpoint addresses. The number of breakpoint comparators is limited. For example, the
Cortex-M0 processor has only up to 4 breakpoint comparators.
Software breakpoints that use the BKPT (breakpoint) instruction to trigger breakpoint event.
There is no limit on the number of breakpoints. It is suitable for creating breakpoints during
software development on devices with reprogrammable memory.
Watchpoints are debug events that are triggered when a specific data address is accessed. They
use hardware comparators to compare data read/write addresses with specific values and emit
watchpoint events when a match is detected. When a watchpoint event is triggered, the processor
can be halted, or a debug monitor exception can be triggered (Armv7-M or Armv8-M Mainline).
Halt / Resume – software developers can send a command to halt the processor when it is running,
and when debug operations are done, they can also send a command to resume the operations.
Access to processor’s registers – registers in the processor’s register banks and special registers can
be accessed and modified when the processor is halted. Memory-mapped registers can be accessed
at any time.
Reset – debugger can request a reset of the target board, typically a system reset through the
SYSRESETREQ feature. It is also possible to have a debugger to trigger the whole device reset
if a separated reset connection is provided and the debug probe can support this.
116
System-on-Chip Design with Arm® Cortex®-M processors
Debug authentication – from an IP protection and system security point of view, there is a need
to disable debug and trace in some parts of the product life cycle. Cortex-M processors provide
interface signals for debug authentication hardware so that debug and trace features can be
disabled. For Armv8-M processors, the interface can also restrict debug and trace visibility to Non-
secure side only.
Multi-code debug and trace – the debug and trace systems in Arm Cortex processors are based on
an architecture developed by Arm called CoreSight Debug Architecture. This debug architecture
supports multi-core debug and allows multiple processors to be connected to the debugger with
a single debug connection. In addition, debug events (e.g., breakpoint, watchpoint) can propagate
between various cores to allow the whole system to halt or resume at the same time. The trace
interface also allows trace information from multiple processors to be merged and collected by a
single debug probe with a single connection, and then be decoded and separated back into multiple
trace streams on the debug host.
Micro Trace Buffer (MTB) – a low-cost instruction trace solution that enables the system to allocate
a small part of system SRAM for instruction trace. The trace result can be collected through the
debug connection.
Embedded Trace Macrocell (ETM) – a real-time instruction trace solution that streams instruction
execution information to debug host via a trace connection.
Event trace – Real-time trace of exception events generated by the DWT (Data Watchpoint and
Trace unit).
Profiling trace – trace generation for system performance analysis generated by the DWT.
Selective data trace – a real-time trace of a small selection of data generated by the DWT. The
comparators for data watchpoints are used to detect accesses to the specific address location,
and if a transfer to such a monitored location is made, information about it can be exported on the
trace interface.
Full data trace – the ETM on Cortex-M7 processors has an implementation option to support full
data trace to provide maximum visibility of the program’s operations. However, this requires a much
higher trace bandwidth and hardware cost, and is therefore only used in specialized SoC designs.
Instrumentation trace – a real-time trace that is generated by the software to provide debug messages
and OS awareness support. For example, it is possible to direct printf(“Hello World”) messages to the
Instrumentation Trace Macrocell (ITM) to display the message on a debug console on the debug host.
Timestamp support – most trace sources (except MTB) support timestamp and that allows the
debug host to reconstruct the timing of various events.
117
System-on-Chip Design with Arm® Cortex®-M processors
Due to area and power constraints, not all of the Cortex-M processors support all these features. Table
5.1 lists the debug and trace features supported by current Cortex-M processors.
Breakpoint Watchpoint MTB ETM DWT’s ITM
comparators comparators selective
data trace
Cortex-M0 Up to 4 Up to 2
Cortex-M1 Up to 4 Up to 2
Cortex-M0+ Up to 4 Up to 2 Y
Cortex-M3 Up to 6 Up to 4 Y Y Y
(instruction) + 2(literal)
Cortex-M4 Up to 6 Up to 4 Y Y Y
(instruction) + 2(literal)
Cortex-M7 Up to 8 Up to 4 Y Y Y
Cortex-M23 Up to 4 Up to 4 Y Y
Cortex-M33 Up to 8 Up to 4 Y Y Y Y
Cortex-M35P Up to 8 Up to 4 Y Y Y Y
Table 5.1: Debug and trace features support in the Cortex-M processors.
Since most of the debug features are optional, system designers might need to define the options
optimized for their requirements (except when using Cortex-M0/Cortex-M3 DesignStart Eval with
fixed configurations).
Based on this architecture, the debug system is scalable and is consistent with debug systems on
other ARM Cortex processors, making it easy for tools developers to adapt their tools to the various
Cortex-M products.
118
System-on-Chip Design with Arm® Cortex®-M processors
The full CoreSight debug architecture specification document is available on the Arm website. The
Cortex-M processors listed in Table 5.1 were developed with CoreSight Architecture version 2.0.
Please note: The Cortex-M processors utilize a subset of the features in CoreSight.
1. JTAG protocol, created by Joint Test Action Group and originally used for various chip level and PCB
level testing. This protocol uses 4 or 5 pins for the debug connection: TDI (test data in), TDO (test
data out), TCK (test clock), TMS (test mode select), and optional TRST (test reset).
2. Serial Wire Debug (SWD) protocol, created by Arm, uses only two pins: SWDIO (Serial Wire Data
I/O, bidirectional), and SWCLK (Serial Wire Clock).
Since the SWD protocol only needs two pins, it is very popular in microcontrollers. JTAG and SWD can
co-exist in a microcontroller device and share the same pins: TMS shares a pin with SWDIO, and TCK
and SWCLK share the same pin.
Table 5.2: Pin sharing arrangement between JTAG and Serial Wire Debug.
SWDv2 supports an optional feature called multi-drop serial wire debug. When this feature is enabled,
it allows multiple multi-drop SWD devices to share an SWD connection in parallel. Not all devices
implementing SWDv2 support multi-drop features.
If a debug interface supports both protocols, in most case the device would support dynamic protocol
switching, which uses a special sequence of bit patterns on SWDTMS to switch between JTAG
119
System-on-Chip Design with Arm® Cortex®-M processors
and SWD modes. The details of the sequence are documented in the Arm Debug Interface (ADI)
specification, which is available on the Arm website. Existing Cortex-M designs are based on ADIv5.
Please note that there are standardized connector arrangements. More details can be found here:
http://infocenter.arm.com/help/topic/com.arm.doc.faqs/attached/13634/cortex_debug_connectors.pdf
Optional power
request control
Debug
connection
protocol (e.g.
JTAG / Serial Debug
Debug Port Access Port
Wire Debug) AHB components /
(e.g. JTAG/ (e.g. AHB-AP)
Memory
Serial Wire
Debug) Debug
Access Port
APB components /
(e.g. APB-AP)
Memory
Asynchronous clock
domain boundary
Optional asynchronous
Debug Access Port
clock domain boundary
In a SoC design with a single Cortex-A processor, potentially the DAP can contain two Access Port
modules: one for accessing the debug components and one for accessing the memory space. A bus
multiplexer is needed if some of the debug components require software accesses.
Figure 5.2: Concept of a Debug Access Port (DAP) arrangement for a single-core Cortex-A processor system.
120
System-on-Chip Design with Arm® Cortex®-M processors
With Cortex-M processors, the debug components are part of the memory map. As a result, the debug
connection can be simplified significantly to enable lower power and area. The Cortex-M processor’s
internal bus interconnect routes the debug accesses to the debug components and memory interfaces,
so there is no need to have the bus multiplexer in the DAP.
Optional power
request control Cortex-M
processor
Debug
connection Debug
protocol (e.g. components
JTAG / Serial Access Port AHB Debug AHB
Debug Port
Wire Debug) (e.g. AHB-AP) (bus slave)
(e.g. JTAG/
Serial Wire AHB/AXI (bus
Debug) Optional Asynchronous master)
clock domain boundary
Asynchronous clock
domain boundary Main
Internal debug bus memory
(APB like)
Debug Access Port
Figure 5.3: Concept of a Debug Access Port (DAP) arrangement for a single-core Cortex-M processor system.
To further reduce silicon area and power consumption, the structure of the DAP can be reduced by
removing the optional asynchronous clock domain crossing in the AHB-AP and simplifying the internal
debug bus. The result is a very small area in the optimized debug interface design, but the Cortex-M
processors can still be connected to a full CoreSight debug system if they are used in a complex
SoC design.
Early Cortex-M processors including Cortex-M3, Cortex-M4, and Cortex-M1, provide an APB-like
debug bus interface, and a module called SWJ-DP (Debug Port) that is connected to this bus interface
to provide the JTAG or Serial interface. Inside the processor, there is another hardware module called
AHB-AP (Access Port) to convert the transfer commands into AHB transactions.
Memory system
121
System-on-Chip Design with Arm® Cortex®-M processors
Recent Cortex-M processors deliver a Debug Access Port (DAP) module that is optimized for a smaller
silicon area, which combines the functionalities of the Debug Port and Access Port. The processor
exposes a debug AHB interface that allows AHB transactions to be routed into the processor’s
memory system directly.
Debug
Core logic
blocks
JTAG/Serial Wire
Debug connection Mini- Internal bus
DAP interconnect
Memory system
To simplify integration, in the Cortex-M0 and Cortex-M3 DesignStart Eval versions, the debug
interface module is pre-integrated, exposing only the JTAG/SWD interface but not the internal debug
buses. However, the designs still support the power request control interface to enable low-power
designs.
Trace port
Trace Port interface
ATB Interface
Trace source Unit (TPIU)
122
System-on-Chip Design with Arm® Cortex®-M processors
With various trace bus infrastructure components in CoreSight SoC products, trace sources can be
merged, converted between different bus widths, and transferred across different clock / power
domains. Most of the trace components are configurable and need to be set up by the debug host
through a debug connection (via the DAP that we have already introduced).
Trace data can then be formatted and exported to the trace port interface via a Trace Port Interface
Unit (TPIU). The trace port interface (trace data + trace clock) needs to be available at the top-level of
the chip (it can be multiplexed with function pins and hidden when trace is not used). The software
developer can collect the trace data using a debug probe that supports trace.
In Cortex-A and Cortex-R systems, the trace bandwidth requirements are usually higher (mostly due to
the higher clock speed, but potentially with more trace sources when multiple processors are present)
so the trace port could have up to 16-bit or 32-bit trace data (plus clock). In some cases, for example,
when the trace port does not have sufficient bandwidth, trace data can be stored in a trace buffer
instead of streaming out to the trace port immediately.
In most Cortex-M systems, the TPIU included in the IP bundle can support a parallel trace port of up
to 4 trace data bits (plus clock), or use an asynchronous single pin trace protocol called SWO (Serial
Wire Output) for a low-cost trace arrangement, but with a much lower trace bandwidth. The TPIU
used in Cortex-M systems also further reduces the area by merging ATB trace funnel functionality into
it, so it can accept trace from the processor (trace generated by DWT and ITM)) as well as from the
optional ETM simultaneously.
If needed, the trace buses from a Cortex-M processor can be connected to other CoreSight trace bus
infrastructure components and CoreSight TPIU. This is common for complex SoC designs where
a number of processors are present including Cortex-A and the Cortex-M processors.
Please note that the MTB instruction trace solutions in Cortex-M0+, Cortex-M23, Cortex-M33,
and Cortex-M35P do not use ATB at all. With MTB, trace information is directly stored in SRAM
connected to the MTB unit and therefore trace information cannot be collected in real-time. Instead,
the debugger can use a standard debug connection to extract trace results when the content of the
SRAM is in the memory map. This enables a very low-cost trace solution as software developers do
not need to use an expensive trace probe to collect instruction trace information.
123
System-on-Chip Design with Arm® Cortex®-M processors
5.2.6 Timestamp
In order to allow the debug host to reconstruct event timing from trace data, many trace components
support a global timestamp mechanism. To ensure various trace units have the same timestamp
source, in small processor systems a single timestamp generator is used. This unit can be a simple
binary counter, with the counter value connected to various trace sources. Using this information,
trace components can output timestamp packets periodically to provide timing information.
In Cortex-M3 and Cortex-M4 processors, the timestamp interface is 48-bit (input signal). In
Cortex-M7 and Armv8-M processors, a 64-bit timestamp interface is used. You can use a simple
binary up counter to generate timestamp and enable this counter only when the TRCENA output
signal (Trace Enable) from the Cortex-M processor is set.
The timestamp interface also contains an input signal call TSCLKCHANGE. The intention of this signal
is to help trace reconstruction software to be aware of clock frequency changes. Since processor
systems can switch between different clock sources, and this can affect the timing reconstruction,
TSCLKCHANGE was introduced with the intention that it could be pulsed when clock/frequency
changes were made. This was to enable the trace components to output new timestamp packets
immediately to resynchronize timing information. However, it was found that in some system designs
that it was difficult to implement this feature accurately. TSCLKCHANGE has now been removed from
the ETMv4.0 specification and can be tied low in system designs if needed. More information on this
topic can be found in https://developer.arm.com/docs/300818048/latest/what-is-tsclkchange
To get the component discovery process working, the AHB-AP component contains a register BASE
(address offset 0xF8) that listed the base address – the address of the top-level ROM table component
in the AHB memory map. The ROM tables are 4KB in size, and the end address range has ID values
that indicate that it is a ROM table.
AHB-AP
Entries for debug
components or
secondary ROM table(s)
BASEADDR
Subsystem ROM table (optional)
IDs
124
System-on-Chip Design with Arm® Cortex®-M processors
The ROM table entries contain address offsets of the debug components/additional ROM tables.
There can be multiple ROM tables in a system, which are arranged in a hierarchical way. The addresses
used as table entries are relative - in this way a subsystem can contain its ROM table and does not
need to know the absolute address of the debug component inside.
SoC designers should consider customizing the system-level ROM table. The ID values of the ROM
table contains a JEP106 Identity Code, and this represents the company’s identification. You can
register for a JEP106 from www.jedec.org. For more information on this topic, please see: https://
developer.arm.com/docs/103489663/latest/peripheralid-values-for-the-coresight-rom-table, and
https://www.jedec.org/standards-documents/id-codes-order-form
If additional debug components are added/removed, the ROM table entries need to be updated. Each
entry is 32-bit with the following format:
31 12 8 4 2 1 0
Address offset - -
The power domain ID field is optional. For most single Cortex-M systems with a single power domain,
it is perfectly fine to set power domain ID and power domain ID valid to 0.
The last entry in the ROM table must have the value of 0x00000000, and subsequent locations are
also read as zero.
There is an additional read-only register in the ROM table called MEMTYPE, which is in address
0xFCC of the ROM table. If bit 0 (SYSMEM) of this register is 1, it means the system memory is visible
on the bus that the ROM table is connected. Otherwise, the bus is for debug components only. For
Cortex-M systems, the MEMTYPE of ROM tables is set to 1.
For more information on the ROM table format, please refer to CoreSight Architecture Specification v2.0
section D5: https://static.docs.arm.com/ihi0029/d/IHI0029D_coresight_architecture_spec_v2_0.pdf
Since the current Cortex-M processors listed in Table 5.1 are based on CoreSight architecture
specification v2.0, the designs use class 0x1 ROM tables.
125
System-on-Chip Design with Arm® Cortex®-M processors
In simplified terms, DBGEN and NIDEN control the Non-secure debug and trace permissions, while
SPIDEN and SPNIDEN control the Secure debug and trace permissions. Not all combinations of these
signals are valid – if a debug action is allowed for Secure state, it must also be allowed for Non-secure
state (i.e., you cannot have SPIDEN set to 1 while DBGEN set to 0, or SPNIDEN set to 1 and NIDEN
set to 0).
There is another enable control signal on AHB-AP component which will enable/disable memory
accesses of AHB-AP (when disabled, the debugger can still access the AHB-AP registers, but cannot
initiate an AHB transfer).
In newer Cortex-M DAP, this is named DEVICEEN (not available in Cortex-M0 DesignStart Eval as it
is obfuscated within the module).
In simple systems that do not need debug authentication support, these signals can be tied high. This
enables all the debug functionalities.
In systems that need debug authentication support, the CoreSight debug authentication signals
are connected to a debug authentication control unit (not a part of the Cortex-M processor) that
authenticates debug connections. The authentication process typically based on the product’s life
cycle state and user’s input such as debug certificate or password. Based on guidelines from Platform
Security Architecture (PSA), generally certificated based debug authentication is preferred over
password-based authentication for products that can contain sensitive information.
The behavior of debug access control has changed between different versions of Armv7-M
architectures. In older Arm Cortex-M processors like Cortex-M3 and Cortex-M4, and Armv6-M
processors, the AHB-AP can access to the memory space when DBGEN and NIDEN are 0. Since
Cortex-M7 processor (from version E of Armv7-M architecture), DBGEN and NIDEN signals affect
the debug access permission.
126
System-on-Chip Design with Arm® Cortex®-M processors
Although Cortex-M3 and Cortex-M4 have internal trace sources (ITM, DWT), it does not have
a NIDEN signal. It is still possible to disable trace output by disabling the ATB path from the
processor’s ATB to TPIU, but the masking control has to be a static control signal and cannot be
changed during trace operations.
For all Armv8-M processors, permission for debug to access the memory system depends on the
debug authentication status, as describes in the Armv8-M Architecture Reference Manual.
Some of the DAP modules only have the debug power up handshaking.
The handshaking interface is used at the beginning of a debug connection – the debug host requests
the debug system to be powered up and wait for the acknowledgment before proceeding to access
debug and system components.
In simple FPGA designs, you can handle these signals with a loopback via a double D flip-flop
synchronizer. The synchronized version of the signals could be used for clock gating control.
DFF DFF
DAP / SWJ-DP CDBGPWRUPREQ
CDBGPWRUPACK
JTAG / SWD
DFF DFF
connection
CSYSPWRUPREQ
CSYSPWRUPACK
In ASIC designs where multiple power domains are used, the acknowledge signals need to be handled
by the power management logic to ensure that the acknowledge is sent only after the power domain
is up and running.
In a few cases, such debug reset also resets functional logic if there is no easy way to isolate the reset.
If that is the case, chip designers should document this clearly to avoid confusing tool vendors.
For single-processor systems, CTI option can be removed, and its signals can be tied off (for inputs) or
unused (for outputs).
128
System-on-Chip Design with Arm® Cortex®-M processors
The top debug connections for JTAG / SWD is summarised in Figure 5.11.
nTRST nTRST
TDI TDI
SWCLKTCK SWCLKTCK
SWDIOTMS SWDITMS
tristate
SWDO
buffer
SWDOEN
JTAGNSW
TDO/SWO
tristate
1 TDO
buffer
0 SWO (SWV)
enable (alternate function, optional)
1 nTDOEN
0
Tie to 1 or SWOACTIVE
(alternate function, optional)
Many parts of the logic in Figure 5.11 are optional. For example,
Some of the JTAG signals like nTRST, TDI, and TDO are not needed if the processor system is
configured to support Serial Wire Debug (SWD) only.
Multiplexing of debug pins and other functional pins this optional – be careful when multiplexing
debug pins with functional pins as this could lock out debug connections.
129
System-on-Chip Design with Arm® Cortex®-M processors
The use of nTDOEN is optional. Potentially disabling the TDO tristate buffer when it is not used can
save power.
For Armv7-M or Armv8-M Mainline Cortex-M systems that support trace, one integration task
required is to multiplex TDO (JTAG’s test data out) with SWO (Serial Wire Output). This enables
debug tools to support low bandwidth trace operations. Using this arrangement, the SWO is
available only when using SWD debug protocol.
When using the SWO/SWV feature, instead of always enabling the output tristate buffer, using
SWOACTIVE from the Cortex-M TPIU is also a suitable arrangement. Note: SWOACTIVE signal is
not available in Cortex-M3 DesignStart Eval.
In low-power designs, if we need to allow debug connections to be established when the processor
is sleeping, the debug interface block (either DAP or SWJ-DP), and the associated logic (including I/O
pads and pin multiplexer) should not be powered down.
In addition, the debug power request signals need to be connected as covered in Section 5.2.9 and do not
forget to set up the clock and reset timing constraints for JTAG/SWD’s signals including clock and reset.
SWO (or SWV in Cortex-M3/M4) – already covered in 5.3.1 as it is normally multiplexed with TDO pin;
Please note that MTB instruction trace does not require top-level pins because the trace data is
extracted via debug connection.
In single-core Cortex-M systems that support trace, the Cortex-M TPIU bundled supports up to 4-bit
trace data and a trace clock (all of them are output signals). Optionally, you can multiplex the trace
pins with functional inputs/outputs, and use the TRCENA (trace enable) output from the processor
and optionally ETMPWRUP (ETM power-up) from the ETM to switch the pins to trace function
when trace system is enabled. Note: If using Cortex-M3 DesignStart Eval, SWOACTIVE signal is not
available and therefore SWO cannot be multiplexed with TRACEDATA[0]. However, you can still
multiplex SWO with TDO as explained in section 5.3.1.
130
System-on-Chip Design with Arm® Cortex®-M processors
TRACEDATA[0]
TRACEDATA[0]
SWO (SWV)
1
SWOACTIVE
(from TPIU)
TRACEDATA[1] TRACEDATA[1]
1
TRACEDATA[2] TRACEDATA[2]
1
TRACEDATA[3] TRACEDATA[3]
1
TRACECLK TRACECLK
1
TRCENA and
ETMPWRUP / other
programmable pin mux control
control register
Using TRACENA for pin multiplexing control is easy as it is always set when any of the trace
components are enabled, and ETMPWRUP can be used to indicate TPIU need to operate in trace port
mode in order to provide enough bandwidth to output ETM trace. Alternatively, it is also possible to
use a programmable control register to enable the trace pin functions. This allows software developers
to use just SWO trace without forcing other trace pins to trace functions. If you are doing this, you
will need debug tools to program this register before using ETM. In most tools, you can set up a debug
script to enable this action to take place before the debug session started.
JTAG / SWD clock DAPCLK frequency System clock frequency Trace port clock
CM3TPIU /
CM4TPIU
ATB
SWJ-DP AHB-AP Cortex-M3 / Trace
JTAG/SWD Debug bus Cortex-M4 outputs
ATB
ETM
SWCLKTCK
Figure 5.13: Asynchronous clock boundaries in Cortex-M3 and Cortex-M4 processor systems.
131
System-on-Chip Design with Arm® Cortex®-M processors
Most of the newer Cortex-M processors do not have internal asynchronous clock boundaries apart
from the DAP interface modules:
ETM
SWCLKTCK
DCLK
(synchronous to
system clocks) SCLK, HCLK TRACECLKIN
In the case of single-core Cortex-M3 / Cortex-M4 systems, system designers can use the system clock
to generate the DAPCLK, with clock gating controlled by synchronized CDBGPWRUPREQ. For multi-
core systems, the DAPCLK needs to be the same as the debug APB of the DAP (Debug Access Port).
While technically TRCECLKIN for the TPIU can be completely asynchronous to system clocks, there
are some additional considerations when designing the clock systems for TRACECLKIN. From the
debug tools point of view, when using SWO (Serial Wire Output), the debug probe expects the data
rate to be constant during a debug session.
In an MCU debug use case, potentially you can connect TRACECLKIN to the processor’s clock. You
can also clock gate TRACECLKIN using TRCENA from the processor. (Please note: the trace system
can be enabled using software without a debug connection). For designs using Cortex-M23 and
Cortex-M33 processors, the clock gate control should be TRCENA|TPIU_PSEL (clock enabled when
either TRCENA or TPIU’s PSEL is set).
Considering Serial Wire Output/Serial Wire Viewer user cases, although the processor clock’s
frequency might change at boot time (as it switches from crystal to PLL), the software trace (i.e., printf
via the ITM) normally does not happen until after entering main(), so that will be okay. However, if the
application switches clock frequency speed during operations, then connecting TRACECLKIN to the
processor’s clock would be problematic as the debug probe does not know what the current clock
frequency is and therefore cannot extract the serial data.
When using Trace port mode (4-bit data + clock), even if the clock speed changed, the debug probe
could recover the data as the reference trace clock is available. Therefore, clock frequency change is
not an issue normally. However, for a high-speed trace (if your chip is going to run at >100MHz), some
trace probes use PLL/DLL to handle clock recovery and clock frequency changes can cause the PLL/
DLL in the debug probe to lose synchronization for a short time.
At the same time, if using a constant but low-frequency clock (e.g., direct from a low-frequency
crystal), it might not have sufficient data bandwidth to output an ETM trace when the processor
132
System-on-Chip Design with Arm® Cortex®-M processors
switches to its high-speed clock from PLL. Therefore, in designing of TRACECLKIN source, ideally, use
a constant high-speed clock if available. (Note: This can be completely asynchronous to processor’s
clock). In generic MCU designs, it could be desirable to have multiple options (controlled by a
programmable register and set up by means of debug configuration script in the debugger). The
default setup could use the processor’s clock and be changed to other clock sources if needed.
Modern commercial debug and trace probes for Cortex-M support debug connections of maximum
20MHz to 50MHz, and trace port operations at up to 200 to 300MHz. Obviously, connecting debug
and trace at lower frequencies is allowed. Please note that the TPIU has an internal programmable
pre-scaler to allow the clock speed to be reduced if needed. The maximum speed of debug and TPIU
operations also depends on:
PCB designs;
Cable connection between the board and the debug & trace probes;
Many debug and trace probes might also have their own specific requirements for signal voltage
levels. If the debug and trace connection is unstable, it worth trying to reduce the frequency in the
connection settings to see if this helps.
TARGET INSTANCE
ID ID
SWDIO SWD
Cortex-M processor
DAP
SWCLK
TARGET INSTANCE
ID ID
SWDIO
SWD
Cortex-M processor
DAP
SWCLK
133
System-on-Chip Design with Arm® Cortex®-M processors
If using multi-drop SWD, the system designer must provide a target ID and instance ID to the DAP.
These ID values are device-specific, where Target ID is unique to a device. Instance ID is needed if a
circuit board contains multiple identical devices; in that case, their instance ID needs to be unique.
The use of multi-drop serial wire debug is uncommon in microcontrollers because there is a need to
set up target ID and instance ID to ensure that the ID values are unique, and the debug tools know
which DAP, is which. Multi-drop Serial Wire is more popular for complex SoC designs that are end-
product-specific, as all the target and instance IDs of the end-products can be controlled (e.g., by the
OEM). Please also note that some microcontroller debug tools do not have support for multi-drop
operations.
To be secure, typically a debug authentication system requires Non-Volatile Memory (NVM) for
storage of life cycle state information, secure information for decryption of debug certificates, a
communication interface for the authentication process, and, potentially, a crypto accelerator for
decryption of the certificates. Often, the debug authentication mechanism is software controlled and
integrated into the firmware of the device.
transaction
filter
DAP
Bus
Cortex-M processor
Control
registers
Interconnect
Debug
authentication Crypto
software accelerator
Transfer of debug
authentication
information (e.g. Communication
certificate) interface
NVM
Firmware
134
System-on-Chip Design with Arm® Cortex®-M processors
Note: The bus filtering arrangement in Figure 5.16 cannot be used for Cortex-M3 and Cortex-M4
- in the case of Cortex-M3 and Cortex-M4 processors, the bus connection between AHB-AP and
the processor’s internal interconnect is not exposed. So, the DAPEN of the AHB-AP (input of
Cortex-M3/Cortex-M4 processor) could be used to block all debug accesses until debug connection
is authenticated.
The Arm document TBSA-M (Trusted Base System Architecture for Cortex-M) provides
comprehensive requirements information about debug authentication. However, occasionally system
designers would like to use some simpler designs; for example, by having a passcode mechanism to
protect the system, and to allow a full hardware-based approach (since the firmware is not ready).
While this is possible, a brute force approach could be used to obtain the passcode, or a replay
mechanism employed to reverse engineer the passcode if the attacker has gained access to a working
debug session. Unless a system is developed only as a prototype (not for commercial deployment),
then a simple passcode approach is not recommended.
If a very simple debug authentication approach is needed, a possible arrangement is to use e-fuse
or emulated e-fuse in embedded flash to store debug authentication control information. The
information might contain several pieces of information; namely:
One or more NVM locations to hold passcode(s) – this information must not be readable from
software or debugger. If there is a need for multiple levels of debug authentication controls,
multiple NVM passcodes could be used.
A separate passcode register (which can be implemented as a memory-mapped register) is needed to allow
software development tools to program the passcode, and the value is compared against the passwords
stored in NVM by hardware. If using Cortex-M3 or Cortex-M4, this can be implemented as a mirror of the
AHB-AP and activates the comparison if the debug tool is writing to a specific address (needed to test the
condition of TAR (Target Address Register) and CSW (Control and Status Word) registers):
Cortex-M3/Cortex-M4
AHB-AP
JTAG/SWD Debug APB
CSW
connection
SWJ-DP TAR
DRW
Mirrored AHB-AP
CSW Compare
conditioning
TAR
DRW Comparator latch
Figure 5.17: Simple debug authentication passcode mechanism for Cortex-M3/Cortex-M4 processor systems (not recommended for general product
development).
135
System-on-Chip Design with Arm® Cortex®-M processors
For other Cortex-M processors, the connection between the DAP and the processor exposes the
debug accesses, and therefore, the passcode registers can be memory-mapped registers placed on the
debug AHB there.
Please note that such an example scheme only provides limited protection. For stronger protection,
debug authentication solutions such as CryptoCell/CryptoIsland could be used. These solutions also
offer other security features such as cryptographic accelerators and life cycle state management.
EDBGRQ – external debug request is usually used for systems with multiple processors, to allow the
processor to enter halt if there is a debug event in another processor. In single-processor systems, or if
multiple core debug events are transferred by built-in CTI, this signal can be tied low.
DBGRESTART and DBGRESTARTED – this is used for systems with multiple processors, to allow
multiple processors to be taken out from halted state at the same time. In single-processor systems, or
if multi-core debug events are transferred by built-in CTI, the DBGRESTART signal can be tied low and
DBGRESTARTED can be left unused.
FIXMASTERTYPE – this signal is available in the Cortex-M3 and Cortex-M4 processors. When this
signal is low, it allows the debugger to generate a debug access with the same bus master information
(HMASTER) as transfers generated from software running on the processor (this behavior is
programmable in the AHB-AP). When this pin is high, an HMASTER signal always indicates the true
source of the transfer. This pin should be set to 1 if there are firmware protection mechanisms in the
system that needs to know the true generation source of bus transfers. In other Cortex-M processors,
the DAP interface does not allow the HMASTER value of debug accesses to be controlled by debug
host software and so do not need this pin.
136
System-on-Chip Design with Arm® Cortex®-M processors
137
System-on-Chip Design with Arm® Cortex®-M processors
138
System-on-Chip Design with Arm® Cortex®-M processors
CHAPTER
Low-power support 6
139
System-on-Chip Design with Arm® Cortex®-M processors
To enable low-power capabilities, most of the Cortex-M processors provide various low-power
features, including:
Multiple clock signals for block-level clock gating, and optional support for sub-block level clock
gating (sometimes referred to as architectural clock gating);
Sleep-on-exit – allows interrupt driven applications to enter sleep mode automatically when no
interrupt requests are pending.
Newer Cortex-M processors also support multi-power domains and additional pipeline optimizations
to enable ultra-low-power designs.
In addition to dedicated low-power features, some of the characteristics of the Cortex-M processors
are also helpful for enabling low-power designs:
High code density – allows applications to fit into smaller program memories;
Small area – Cortex-M0, Cortex-M0+, and Cortex-M23 processors are designed to meet the power
constraints of the most demanding low-power applications;
High performance – processing tasks can be completed faster, reducing overall energy
consumption;
Low interrupt latency and interrupt handling optimizations – reduce the overhead of interrupt handling.
Additional low-power optimizations can be done at the system-level. For example, in Chapter 4, Figure
4.14, we mentioned the approach of having multiple banks of SRAM. In this chapter, we will also cover
some other methods to reduce system-level power.
140
System-on-Chip Design with Arm® Cortex®-M processors
D Q D Q
EN
EN
CLK CG
CLK
Clock gating cell
Since all the Cortex-M processors are written in generic Verilog RTL, synthesis tools can handle the
clock gating insertion easily.
Another level of clock gating is done inside the processor design where a number of functional units
contain instantiations of clock gating cell wrappers (which can be modified to process node-specific
clock gating cells). These clock gating wrappers are placed in optimum locations to enable more
aggressive clock gating where synthesis tools cannot determine the clock gating opportunities. This
technique is referred to as architectural clock gating (ACG) and is optional (enable/disable by a Verilog
parameter, usually called ACG).
Sub-block
enable signal
Hardware
sub-block
CLK CG
Many Cortex-M processors also have multiple top-level clock signals that enable system designers
to place additional clock gating at the clock domain level (Figure 6.3). In some newer processors, this
arrangement is handled internally to the processor design.
141
System-on-Chip Design with Arm® Cortex®-M processors
Debug clock
gating control Debug domain
Debug
CG
CLK
Cortex-M processor
System clock
gating control System domain
System
CG
CLK
In Chapter 5, we have already covered the debug power control signals CDBGPWRUPREQ and
CDBGPWRUPACK, which can be used for clock gate control for the debug clock domain. For system
clock domain, some of the Cortex-M processors have GATEHCLK signal which can serve the purpose
of gating the system clock on/off.
In many cases, just clock gating is not enough, and so power gating is needed. Simple power gating
requires power switch transistors and also additional special cells for signal isolation and clamping.
Vdd
FET
Logic
GND
Figure 6.4: Simple power gating.
The most significant disadvantage of simple power gating is that the logic state would be lost and
would need to be reset. A newer form of power gating technology is called State Retention Power
Gating (SRPG), which introduces state retention elements inside registers and uses separate power
rails for state retention. Since the rest of the logic can be powered down, it is still a significant saving
for sleep modes. However, the area of state retention registers and their dynamic power are higher
than for the standard registers, so such a mechanism is not optimized for systems that are active most
of the time.
Vdd Vddret
Vddret
Vdd
FET D Q
State
Logic
GND CLK
GND
Figure 6.5: State retention power gating.
142
System-on-Chip Design with Arm® Cortex®-M processors
In addition to digital logic, there are also low-power features in memories (e.g., many SRAM macros
support a range of low-power states). A range of peripheral components like ADC (Analog to Digital
Converters) can also have their own low-power features.
Architecturally, the processor supports two sleep modes: sleep and deep sleep. The mechanisms for
entering these sleep modes are the same.
The sleep-on-exit feature is enabled, and the processor is returning from an exception handler to
thread level.
The selection of entering sleep and deep sleep is defined by SLEEPDEEP bit in the System Control
Register (SCR). Inside the processor, there is not much difference between the two types of sleep.
System designers can make use of the SLEEPING and SLEEPDEEP signals to define power-saving
measures on the chip level, and optionally extend the sleep modes with device-specific programmable
registers if preferred.
(Please note in Armv7-M and Armv8-M Mainline processors, the WIC is used only in deep sleep).
When the processor is sleeping, it is still possible to have bus transactions on the AHB interface of the
processor due to debug accesses. Therefore, an extra signal GATEHCLK is provided to indicate that
the processor is sleeping, and there are no on-going bus transactions like debug accesses.
How these signals can be used is device-specific. For example, bus clocks can be gated off, SRAM can
enter a lower power state, and some peripherals could be stopped if GATEHCLK is high.
143
System-on-Chip Design with Arm® Cortex®-M processors
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0068c/index.html
The new interface protocol enables system designers to create reusable IP for power management.
This Low-power Interface Specification describes both Q-channel and P-channel protocols. The use of
Q-channels in the Cortex-M processors only started recently with the introduction of Cortex-M23 and
Cortex-M33 processors. P-channels are used on recent Cortex-A processors for most complex power
control scenarios.
The Q-channel connects between a design unit (e.g., a processor) and a power management unit
(device-specific). The interface operates with 4 signals, as shown in Table 6.2. The polarity of these
signals must be chosen so that if the interface signals get clamped to 0 in a low-power state, it will not
affect the signal levels
Signal Description
A processor can have multiple Q-channels for different power domains, and there can be separated
Q-channels for clock gating and power-down operations. In simple designs, the system-level power
management hardware can use a Q-channel to control a power domain of a processor.
QACTIVE QACTIVE
QREQn QREQn
QACCEPTn QACCEPTn
QDENY QDENY
144
System-on-Chip Design with Arm® Cortex®-M processors
The following example shows a power management scenario for a processor’s system power domain.
At the starting stage of the power-up sequence, the power gating control applies power to the
processor. In this scenario, both the QACTIVE and QREQn signals are high before the processor starts
(Figure 6.7).
Processor exit from
reset and start
executing code
Power on
(power gating)
QACTIVE
QREQn
QACCEPTn
QDENY
Reset_n
SLEEPING
(processor status)
Figure 6.7: Example Q-channel activity for system domain when a processor boots up.
When the processor enters sleep mode, the power management unit detects the sleep operation and
then requests to change the processor to low-power state (e.g., SRPG) by asserting QREQn to 0. The
processor can then drive QACCEPTn low to indicate that the low-power state request is accepted.
After the processor accepts the low-power mode request, it then puts the processor in the targeted
low-power state (i.e., SRPG), as shown in Figure 6.8.
Interrupt
event
PMU received a
Processor entered Processor entered state wake-up request
sleep retention power gating from WIC and
restored power to
the processor
Power on
(power gating) QACTIVE de-asserted
to indicate that the
processor is idle
QACTIVE
Low-power state
PMU requested request removed
QREQn low-power state
Processor accepted
low-power state
QACCEPTn
QDENY
Reset_n
Processor exited sleep mode
and started execute ISR
SLEEPING
(processor Processor entered a sleep mode that
status) triggered SRPG operation
Figure 6.8: Example of Q-channel activity for the system domain when a processor enters a sleep mode that uses state retention power gating.
145
System-on-Chip Design with Arm® Cortex®-M processors
Assuming that the processor system has a Wakeup Interrupt Controller (WIC) in an always-on
power domain or similar hardware features, then a peripheral activity triggering an interrupt request
can wake up the system via a separate connection between the WIC and the power management
controller. In this scenario, the power management controller can restore the power and clock signal
activity to the processor system, and then the Q-channel can complete its handshaking sequence
(right-hand side of Figure 6.8).
If an interrupt request arrives just after the processor has entered sleep mode, then it is possible for the
processor to reject the low-power state request using the QDENY signal. In such cases, the PMU must
not power down the processor in order to allow it to continue its operations, as shown in Figure 6.9.
Interrupt
Processor request arrived
entered Processor started ISR
sleep execution
Power on
(power gating) QACTIVE de-asserted
to indicate that the QACTIVE re-asserted to
processor is idle indicate that the
processor is busy
QACTIVE
PMU requested low
QREQn power state
QACCEPTn
Processor rejected
low power state
QDENY
Reset_n
Processor exited sleep
mode due to interrupt
SLEEPING
(processor status)
Processor entered a sleep mode that
triggered SRPG operation
Figure 6.9: Example Q-channel activity for system domain when a processor rejects a low-power state request.
It is possible that some power domains of a processor can start up in a low-power state. For example,
the debug power domain of a processor can be in an OFF state when the system starts and turned on
only when a debugger is connected. In such cases, the QACTIVE and QREQn signals will have a start-
up level of zero instead of one.
In addition to processor systems, the Q-channel can also be used in other system components. Since
the handshake protocol is fairly simple and is very generic, it can be deployed for many components of
a low-power microcontroller or system-on-chip design.
SLEEPHOLDREQn Input of processor. When using this feature, set this signal to 1 after entering sleep mode
SLEEPHOLDACKn Output from processor to indicate sleep hold request is accepted
These two signals are active-low and interface with a power management unit (PMU) or the system
controller developed by silicon vendors.
The operation of the sleep hold interface is very simple: When the PMU or System Controller detects
that the processor core has entered sleep (SLEEPING or SLEEPDEEP signals), it can then assert
the SLEEPHOLDREQn signal to the processor. If the processor core responds with the assertion of
SLEEPHOLDACKn (pulled low), then the PMU or system controller can then reduce the power by
turning off the flash memories, peripherals, PLL, etc. If the processor core does not respond with
SLEEPHOLDACKn, then it means the processor core might have received an interrupt or a debug
request, so it is going to wake up. In this case, the PMU or system controller should not carry out
any further action and de-assert SLEEPHOLDREQn when sleep signals (SLEEPING or SLEEPDEEP)
is de-asserted.
If the sleep hold request has been accepted after the system has entered sleep, and an interrupt arrives,
the processor core will de-assert the sleep signals. However, the processor core will not resume program
execution until SLEEPHOLDREQn from the PMU or system controller is de-asserted (pulled high).
When the flash memory voltage supply is resumed, and all the logic is ready, the SLEEPHOLDREQn can
be de-asserted, and the execution of the interrupt service routine can be started.
1) Processor enters
sleep mode after
5) Processor deasserts
executing WFI/WFE 4) An IRQ SLEEPING, indicating it
arrives wants to wakeup
SLEEPING
SLEEPHOLDACKn
(from Cortex-M)
2) System controller 7)
asserts SLEEPHOLDACKn is
SLEEPHOLDREQ to 3) Cortex-M responds by deasserted,
keep the core asleep asserting SLEEPHOLDACKn indicating processor
to indicate now core is held in starts running
sleep mode
During the extended sleep, it is possible to stop the HCLK if GATEHCLK is high.
If the sleep hold feature is not needed, the SLEEPHOLDREQn input signal can be tied high.
147
System-on-Chip Design with Arm® Cortex®-M processors
The WIC is an optional block that is in a separated always on the power domain that will take the role
of interrupt and wake up event detection when the NVIC is stopped or powered down. The exact
interface and integration details can be processor-specific. In general, the interface between the WIC
and the processor contains:
WICMASKxxx[n:0] Processor to WIC Wakeup event mask. Contain mask status for NMI, RXEV, EDBGRQ (for
Armv7-M, Armv8-M Mainline) and IRQ signals. The signal width is configurable.
WICLOAD Processor to WIC Indicate to WIC that the WICMASK is valid and needs to be captured
WICCLEAR Processor to WIC Clear the wake-up event mask inside WIC
WICINT[n:0] / IRQ + Input Wakeup events: NMI, RXEV, EDBGRQ (for Armv7-M, Armv8-M Mainline) and
other wakeup events IRQ signals. The signal width is configurable.
WAKEUP Output Wakeup request to the system controller to indicate that the processor needs
to be woken up to serve an interrupt request or other event.
WICPENDxxx[n:0] Output Latched version of an interrupt request. Since the incoming interrupt event
could be single-cycle, the WIC holds the request status until the processor is
back operating and WICCLEAR is asserted. This can be fed to NVIC’s interrupt
inputs via an OR logic.
Finally, there can be an additional interface on the Cortex-M processor to enable/disable the WIC
operation, which we will cover later in this section.
The WIC delivered in the Cortex-M product bundle is an example of this small interrupt detection
logic, and it is modifiable. In some cases, designers have modified the WIC to enable a latch-based
operation so that wake-up events can be detected and captured without any active clocks.
An overview of the WIC operations is as follows:
When entering sleep mode, the wakeup event mask is transferred from NVIC to WIC using
a dedicated hardware interface (WICMASK[] and WICLOAD).
When a wake-up event is detected, the WIC sends a wake-up request to the system power
management control.
148
System-on-Chip Design with Arm® Cortex®-M processors
The power management control then restores the power to the processor and resumes clocking.
The processor can pick up the interrupt request (or other wakeup events) and resume operation.
The wake-up masking information and pending wake-up event held inside the WIC is cleared by
hardware automatically when the processor wakes up from sleep mode (WICCLEAR).
Processor
clock
SLEEPING/
(powered down)
SLEEPDEEP
System waking up
WICLOAD (powered down)
power down
WAKEUP
WICINT[ ]
WICPEND[ ]
Wake-up Requesting
event wake-up
IRQ, NMI,
RXEV,
EDBGRQ
Cortex-M3 /
Cortex-M4
WIC
WICINT[] WICPEND[]
Cortex-M0/
IRQ, NMI,
Cortex-M0+/
RXEV,
EDBGRQ Cortex-M7/
WIC Cortex-M23
149
System-on-Chip Design with Arm® Cortex®-M processors
In Armv7-M or Armv8-M Mainline processor systems, the EDBGRQ signal (external debug request)
is included as one of the wake-up events that the WIC monitors. This is because the external debug
request can trigger a Debug Monitor exception if it is enabled. In Armv6-M or Armv8-M baseline
systems (i.e., Cortex-M23 processor system), the EDBGRQ is not considered to be a wake-up event
as the Debug Monitor exception is not available.
The WIC feature can be enabled or disabled with a handshaking signal interface. In the Cortex-M3
and Cortex-M4, this interface involves a pair of handshaking signals between the processor and the
WIC, and another pair of handshaking signals between the WIC and the system-level control registers
(device-specific).
WICENREQ WICDSREQn
Cortex-M3 /
WIC
WICENACK WICDSACKn Cortex-M4
At the start of the application, the software can write to a register in the system power management
unit (outside of the processor, device-specific) to enable the WIC feature. This enables the power
management unit that handles state retention power gating. The WIC is then enabled with the
following handshaking sequence:
WICENACK
WICDSREQn
WICDSACKn
2) WIC request to NVIC
that DEEPSLEEP imply 3) NVIC acknowledge to WIC
WIC mode that DEEPSLEEP will imply
WIC mode
Figure 6.14: The WIC enables handshaking waveform in Cortex-M3 and Cortex-M4.
In later Cortex-M processor designs, the WIC enable/disable interface is simplified so that it only
needs the WICENREQ and WICENACK signals. There is no need for additional handshaking between
the WIC and the processor.
150
System-on-Chip Design with Arm® Cortex®-M processors
When using State Retention Power Gating (SRPG), the system designer will need to handle a number
of control signals. The control sequences of these signals are process node-specific. In a simple
example, you might see the following signals:
Signal Description
System designers need to create a state machine to control the sequence for entering the power-
down state and exiting from the power-down state. To support these operations, the power
management design also needs to include a status signal to indicate if the power-up has been done
(let us assume that this signal is called PWRUPREADY in the following state machine diagrams). When
in the power-down state, the WAKEUP signal from the WIC is used to switch the state machine into
the wake-up sequence. A simple state transition diagram is shown in Figure 6.15.
WIC operation is
Establish sleep Clock Powering
enabled, and Isolate Retain
hold req/ack off down
SLEEPDEEP is high
ISOLATEn=0 RETAINn=0 POWERDOWN=1
SLEEPDEEP = 0
(interrupt taken place)
Powered up Powered down
PWRUPREADY = 0
WAKEUP = 1
Clock on and sleep Remove Powering (from WIC)
Restore
hold req removed Isolation up
ISOLATEn=1 RETAINn=1 POWERDOWN=0
(for controlling
(for controlling (for controlling power gating in
isolation between state retention in state retention
power domains) state retention logic)
flip-flops)
Silicon vendors might choose different approaches to develop their power control FSM. For example,
the FSM can optionally allow the power down sequence to be canceled if the WAKEUP signal is
asserted before being powered down to reduce the interrupt latency (Figure 6.16).
151
System-on-Chip Design with Arm® Cortex®-M processors
PWRUPREADY = 0
WAKEUP = 1
(from WIC)
Clock on and sleep Remove Powering
Restore
hold req removed Isolation up
ISOLATEn=1 RETAINn=1 POWERDOWN=0
Figure 6.16: State machine for SRPG sequence control that allows the power-down sequence to be canceled.
The design of the FSM is also heavily dependent on other system-level factors (e.g., power control for
memories).
1. The SYSTICK timer will be stopped during power down. As the processor is powered down, the
SYSTICK timer inside the processor would be stopped. Embedded applications that use OS will
need to use a timer external to the processor core to wake it up for task and event scheduling.
Sleep modes that still allow free running processor clocks should not be affected – embedded
application programmers should check the sleep mode details from chip manufacturers.
2. The interrupt latency is increased when WIC or SRPG is used. Since it will take a certain duration
to power-up the processor, memories, and get the system ready, the interrupt latency can be
increased substantially. Please check the datasheet from the silicon vendor for details.
3. Power down is normally disabled when a debugger is attached. This is because debuggers require
access to the processor even when the processor core is in sleep modes. In many such cases, the
power down FSM is automatically disabled by the WIC interface inside the processor. As a result,
testing of deep sleep can show a different set of behaviors and interrupt latency when a debugger
is connected.
Unfortunately, there is no golden rule. If the oscillators use large amounts of power, running them
slowly might be a good way to reduce power consumption. However, this also has the effect of
increasing interrupt latency and overall leakage current.
152
System-on-Chip Design with Arm® Cortex®-M processors
On the other hand, if the flash memories have a high leakage current, run fast and sleep (while also
turning off the flash memory), this could be a good way to reduce overall energy consumption.
However, it means that the peak power will be higher.
In addition to that, the peripheral control requirement can also affect the whole picture. Software
developers might need to run a number of trials to determine what is the best approach for them.
Figure 6.17: Thumb instruction set enables high code density for microcontrollers.
High code density can have various advantages. In addition to opportunities to reduce power by using
a smaller program ROM/flash, it can also help to:
Reduce cost;
153
System-on-Chip Design with Arm® Cortex®-M processors
The Cortex-M0+ and Cortex-M23 processors also support halfword instruction fetches if a branch
target is not word-aligned (bit 1 of address is 1). It means half of the byte lanes for that access can be
inactive and could save quite a bit of power in short loops.
Figure 6.20: Half-word instruction-fetch in non-word-aligned branch target accesses (Cortex-M0+ and Cortex-M23 only).
154
System-on-Chip Design with Arm® Cortex®-M processors
Selection of crystal operation range is also important. While a microcontroller product might be
designed to run at 100MHz, having a 100MHz crystal in the design means the product will have
a 100MHz clock running all the time and can burn a lot of power. Therefore, it is common to use
a relatively slow crystal (4 to 12MHz) and use PLL to generate higher clock frequencies only if they
are needed.
For embedded flash, it is also possible to power down the flash completely during sleep as there is
no issue of data loss. However, when doing this, be aware of the in-rush current (current spike) when
the flash macro is turned on, which potentially can cause a voltage drop in power rails and result in
problems affecting other parts of the chip.
Some devices might allow the software to write to flash before brown-out so that crucial data in
SRAM can be restored later. This operation might also be needed when the battery of the product is
being replaced. If a design needs to support such features, the minimum flash programming voltage
and flash power during programming can become a critical issue.
6.5.4 Caches
While adding a cache unit can increase the silicon area and the leakage current, and hence increase
the power requirement of a system, sometimes it can help reduce the overall energy efficiency
because it reduces the access to the main memories, especially embedded flash, which can be power-
hungry. It also has the benefit of enabling higher performance because flash memories are often quite
slow (e.g., 30MHz to 50MHz). The AHB flash cache from Arm (in CoreLink SDK-100/101) is available
in Cortex-M3 DesignStart Pro, and additional system cache designs are available in other system
design kit products (licensable IP).
155
System-on-Chip Design with Arm® Cortex®-M processors
Many I/O pads have configurable power modes that can reduce power by adjusting drive
strengths and skew rates. System designers can make these options programmable by introducing
programmable registers to control these configuration signals.
PCLK
I/O I/O I/O
HCLK, PCLK
(not gated)
Data phase
AHB
Address phase
APBACTIVE
PCLKG (gated)
Time
Figure 6.21: AHB to APB bridge in CMSDK has a clock gating control output.
To take advantage of this feature, a peripheral might need to be modified so that it has separate clock
signals for the bus interface and peripheral operations.
In some cases, the peripheral buses might be clock gated and software introduced to enable the clock
on the peripheral buses before accessing the bus slaves on it.
156
System-on-Chip Design with Arm® Cortex®-M processors
The processor states will be lost. Hence, software using this power down arrangement must save
critical information to state retention SRAM beforehand.
The system design needs to have additional hardware logic to handle the hardware wake-up event
detection.
If using such an approach, system designers need to add a custom-defined wake-up unit to detect
wakeup events, which then:
Resets the processor system (without resetting the state retention memories and registers);
Releases resets and the processor can then boot-up and execute software.
Clamp
Wakeup event Wakeup
detection events
Interrupts cells
Power (programmable)
ctrl
Reset
Cortex-M
Power Info
domain register
isolation
bridge
State retention
ROM RAM Peripherals Always on SRAM
power
domain
Figure 6.22: Additional hardware needed for sleep mode with full power-down.
Power ctrl (control) – allows the software to select which sleep mode is used (e.g., whether to enter
power down when in deep sleep).
Reset info register – allows the software to decide if it is a cold boot or wakeup after a power-down
‘sleep.’
State retention SRAM (optional) for holding various program state information.
157
System-on-Chip Design with Arm® Cortex®-M processors
Wakeup event detection – enables the generation of wake-up events from peripherals or I/O. This
is likely to be programmable to allow the software to decide if this is enabled or not.
Also, since the wake-up process is going to take time, the design must also hold the wakeup event
information so that software can have time to enable NVIC.
Debug access port (DAP) – You might optionally move the SWJ-DP to the always-on power domain to
allow the debugger to wake up the system with a debug connection. An alternative solution is to use
another hardware mechanism to wake up the system so that debugger can connect to the processor
to start the debug sessions.
The downside of this approach is that the processor must boot-up first and will, as a result, take
longer time to service the interrupt. It is possible to reduce the boot time by storing most of the key
processor’s state into retention SRAM before powering down and restoring this after waking up. Using
this approach, the C runtime startup could be skipped, and hence, the time needed to setup NVIC
could be reduced.
If using this method, the sleep procedure should be handled in privileged thread mode, and if
TrustZone is implemented, the sleep procedure should be in Secure privileged thread mode so that all
the register states can be accessed easily.
NVIC settings;
MPU settings;
Banked SP (both MSP and PSP might be needed), and if TrustZone is implemented, all four stack
pointers and corresponding stack limit registers should be stored;
Special registers (PRIMASK, FAULTMASK, BASEPRI, etc.), - and beware that if TrustZone is present,
these registers are banked and both versions will need to be saved;
FPU settings if FPU is present, and optionally FPU registers if the FPU was used and active;
Note: Depending on the handling of resuming execution, some registers in the register banks might
not need to be restored.
Another thing to bear in mind is that if TrustZone is implemented, the security management of
retention SRAM is important as Secure information is stored in it when using this power down
approach.
158
System-on-Chip Design with Arm® Cortex®-M processors
159
System-on-Chip Design with Arm® Cortex®-M processors
160
System-on-Chip Design with Arm® Cortex®-M processors
CHAPTER
Design of 7
bus infrastructure
components
161
System-on-Chip Design with Arm® Cortex®-M processors
7.1 Overview
In this chapter, we will go through the basic steps needed to develop a simple AMBA system with a
Cortex-M3 processor using the AMBA 5 AHB (AHB5) and APB (AMBA 3) architectures. While the
Cortex-M3 processor was designed using the AHB Lite version of AHB protocols in AMBA, here
in the featured examples, AHB5 protocol is used because it is the latest protocol and more future
proof. However, TrustZone security management with AHB5 is not going to be covered here as the
Cortex-M3 processor does not support TrustZone, and it is too complex a topic for beginners. You can
use AHB Lite bus masters with AHB5 interconnect, but additional bus wrappers might be needed, for
example, to convert exclusive access signals in Cortex-M3/Cortex-M4 to AHB5 equivalents.
For peripheral connections, APB is used in the example, and the APB bus segment is connected via an
AHB to APB bridge. As explained in Section 4.5, if you are using Cortex-M3/Cortex-M4/Cortex-M7/
Cortex-M33 processors, the PPB (Private Peripheral Bus) interface is primarily for debug components
and should not be used for general peripherals.
The example system that we are going to build contains a behavioral model of two memory blocks
(Program ROM and RAM) and a number of simple peripherals on the APB including two parallel I/O
interface ports, a UART and two simple timers. In addition, a number of basic AMBA infrastructure
blocks including an AHB bus bridge to APB and bus slave multiplexers will also be created.
Cortex-M3
ETM
TPIU
(optional)
ROM table
I-CODE D-CODE System
cm3_code_mux
HSEL HSEL
Two default slaves are needed as there are two AHB bus segments, each of them containing invalid
address ranges.
162
System-on-Chip Design with Arm® Cortex®-M processors
DNOTITRAN input (applicable for Cortex-M3 and Cortex-M4 processors only) is set to 1 because
we are using a code mux module to merge I-CODE bus and D-CODE.
In this design, the APB slave multiplexer and AHB to APB bridge are combined. It is also absolutely
fine to separate the two functions into two modules.
The PPB bus connections are handled inside the integration layer, and there is no need for them to
be handled at a higher level.
There is no need to do any work on the memory space for NVIC and debug components within the
Cortex-M processors. Transfers accessing these components will be routed internally inside the
processor and will not be visible from the bus system.
One of the important parts of designing an AMBA system is the determination of the required memory
map. In this example, the memory map is based on the one supported in the Cortex-M3 processor, with
64kB for program ROM and data memory, and 64kB of memory space allocated to the APB.
Each peripheral block in this example takes 4kB of memory space. Since the transfer size on the APB is
limited to word size, we can have up to 1024 hardware registers for each peripheral. However, in normal
applications, the required number of registers for each peripheral is likely to be far less than that.
The use of 4kB memory size for peripherals is a common practice, which allows us to create a simple
APB slave multiplexer which multiplexes responses using bit fields of PADDR (e.g., bit[15:12] for 16
APB slaves). However, it is fine to use other memory sizes for APB peripherals, although potentially
these will require a slightly more complex APB slave multiplexer.
For designs using the Cortex-M3 or Cortex-M4 processors, in cases where the bit-band feature is to
be used, then the allocation of an address in the memory map must avoid conflict with the bit-band
alias regions.
163
System-on-Chip Design with Arm® Cortex®-M processors
1. An AHB slave must respond with OKAY without a wait state for IDLE or BUSY transfers: When
HTRANS is IDLE (0x0) or BUSY (0x01), HSEL is 1 and HREADY (or HREADYIN) is 1, in the next
clock cycle HREADYOUT must be 1 and HRESP must be OKAY (0x0).
2. An AHB slave must respond with OKAY without wait state if it is not selected: When HSEL is 0 and
HREADY (HREADYIN) is 1, in the next clock cycle HREADYOUT must be 1 and HRESP must be
OKAY (0x0).
3. At reset, HREADY output from AHB slaves must be 1 (ready), and HRESP must be OKAY (0x0). This
is needed to ensure the AHB system is reset correctly.
4. There should not be any combinatorial path from inputs of the AHB interface to the output of the
AHB interface on an AHB slave. The inputs and outputs must be pipelined (separated by register
stage) to prevent combinatorial loops.
5. Error signals on HRESP must be two cycles, with HREADY output (HREADYOUT) low in the first
cycle and high in the second cycle. An additional wait state(s) before the error response is allowed.
Multiple back-to-back transfers can result in multiple back to back error responses, but each of the
error responses must still contain the two-cycle waveform.
6. Although the HRESP input in Cortex-M3 and Cortex-M4 is 2-bit wide, the Cortex-M processors
and the AHB infrastructure components that we are designing here do not support RETRY and
SPLIT responses; therefore, the AHB slaves must not generate these two responses. SPLIT and
RETRY responses are not supported in AHB Lite and AHB5.
7. Ideally, the AHB slave should only be able to insert a limited number of wait states to ensure that
it will not lock up the whole system. The common recommendation for the maximum number of
wait states for a transfer is 16 cycles - but system designers can increase the limit if necessary. Note
that this is only a recommendation. In some cases, it is unavoidable to have longer wait states in
AHB transfers if the AHB interconnect components or slave has to deal with data transfer across
asynchronous clock domains.
8. The minimum memory size of an AHB slave for an ARM system should be 1k bytes. Even if the
slave does not need this amount of memory, the remaining memory space should not be used
for another AHB slave. Not only does it reduce the complexity of AHB decoder design, but it can
also prevent a burst transfer from going across two AHB slaves, which can cause an AHB protocol
violation. The starting address of the AHB slave should be aligned to its memory size in order to
reduce the complexity of the AHB decoder design.
164
System-on-Chip Design with Arm® Cortex®-M processors
9. HEXOKAY can only be asserted in the data phase of an exclusive transfer if there is no error
response. If a bus slave does not support exclusive transfer, HEXOKAY can be tied low.
With these AHB design rules defined, most AHB slaves can be designed with a simple pipeline logic
block, as shown here:
HSEL
HTRANS[1] Write
HREADY request
(HREADYIN)
HREADYOUT
HWRITE Finite HRESP
Read State
request Machine (FSM)
R/W
controls
enable
Address Data path
HADDR Register slice
Size
HSIZE
Write data Read data
HWDATA HRDATA
In the simplest AHB slaves, the Finite State Machine (FSM) can be implemented as a simple register
stage if no wait state is required. If multiple cycles are required, the FSM can be implemented as shown:
No read request or
write request
Read request or
write request
No new read Idle / reset
request or write
request HREADYOUT = 1
Figure 7.4: Simple Finite State Machine (FSM) for AHB slaves with wait-states.
Additional states will be needed if the device supports error responses on the AHB. The AHB to APB
bridge that we will cover in the later chapter of this book is an example of such a design.
165
System-on-Chip Design with Arm® Cortex®-M processors
HADDR
AHB Decoder HSEL_ROM
(CODE bus)
HSEL_DefSlave
HADDR HSEL_RAM
AHB Decoder HSEL_APB
(System bus)
HSEL_DefSlave
The default slave is an AHB slave that is selected when the address is invalid (i.e., no valid slave is
selected). This subject will be covered in the next section of this chapter. The outputs of the address
decoder are combinatorial outputs generated using the address value.
Based on the memory map that we defined earlier, the AHB decoders can be designed as shown here:
HADDR[31:16] =
HADDR HSEL_ROM
16'h0000
HSEL_DefSlave
HADDR[31:16] =
HADDR HSEL_RAM
16'h2000
(HADDR[31:16] =
16'h4000) & HSEL_APB
~HADDR[15]
HSEL_DefSlave
Figure 7.6: Design of the AHB decoders for the example Cortex-M3 system.
166
System-on-Chip Design with Arm® Cortex®-M processors
In some situations, an AHB decoder might also take an HSEL input to enable the decode operation if
the decoder is only for an AHB subsystem, which is part of a larger AHB system. In the case where an
optional HSEL input is implemented, the HSEL outputs would be AND together with the HSEL input,
so that the output can only be high if the HSEL input is high. This is required in more complex systems
where multiple AHB subsystems are developed, and the address decoder of each AHB subsystem
requires a HSEL input from a global address decoder.
HSEL
HREADYOUT
HTRANS
HCLK
HRESETn
The default slave is selected by the AHB address decoder when an invalid address range is output
from the processor. If the default slave is selected and the processor (or bus master) issues an active
transfer (HTRANS equals NSEQ or SEQ), then the default slave sends out an error response in the
data phase of the transfer.
This behavior is different from most 8-bit or 16-bit microcontrollers. In these products, accesses to
an invalid address will not normally cause any fault exception. The advantage of using a default slave
to respond to invalid accesses is that the processor can remedy this if an error is detected, and thus
increase the robustness of the system.
In some designs, the default slave is combined with the AHB slave multiplexer. In this example, we
will design it as a separate unit. A simple finite state machine is used to generate the 2-cycle error
response as required by the AHB protocol. Since we generate an error response for every transfer with
invalid accesses, we do not need to worry about the HWRITE control, HSIZE signals, and data values.
HSEL
HTRANS[1] Finite HREADYOUT
State
HREADY Machine HRESP
167
System-on-Chip Design with Arm® Cortex®-M processors
module ahb_defslave (
input wire HCLK, // Clock
input wire HRESETn, // Reset
input wire HSEL, // connect to HSEL_DefSlave from AHB decoder
input wire [1:0] HTRANS, // Transfer command
input wire HREADY, // System-wide HREADY
output wire HREADYOUT, // Slave ready output
output wire HRESP // Slave response output
);
// Internal signals
wire TransReq; // Transfer Request
reg [1:0] RespState; // FSM for two cycle error response
wire [1:0] NextState; // next state for RespState
// Connect to output
assign HREADYOUT = RespState[0];
assign HRESP = RespState[1];
endmodule
The default slave does not generate any read data and exclusive responses. When connecting the
default slave to an AHB system, the unused HRDATA[31:0] signal and HEXOKAY signal (available in
AHB5) can be connected to zero.
168
System-on-Chip Design with Arm® Cortex®-M processors
HSEL0
HREADYOUT0
HRESP0
HEXOKAY0
HRDATA0[31:0]
HSEL1
HREADYOUT1
HREADY HRESP1
HRESP HEXOKAY1
HEXOKAY HRDATA1[31:0]
HRDATA[31:0]
AHB5 Slave Multiplexer
HSEL2
HREADYOUT2
HRESP2
HCLK HEXOKAY2
HRESETn HRDATA2[31:0]
HSEL3
HREADYOUT3
HRESP3
HEXOKAY3
HRDATA3[31:0]
Figure 7.9: Simple AHB5 slave multiplexer with up to four AHB slave connection.
The AHB slave multiplexer takes outputs from each of the AHB slaves, as well as the HSEL outputs
from the AHB decoder. Using the multiplexed HREADY signal and the HSEL inputs, the AHB decoder
internally generates the pipelined multiplexer control to select the correct data phase output. The
design for the four-port AHB Slave multiplexer can be implemented as shown here:
Address phase
HSEL signals
HSEL0
Data phase MUX
HSEL1 HSEL to MUX control
control register
HSEL2 mapping
HSEL3 enable
HREADYOUT0
HREADYOUT1
HREADY
HREADYOUT2
HREADYOUT3
HRESP0
HRESP1
HRESP
HRESP2
HRESP3
HEXOKAY0
HEXOKAY1
HEXOKAY
HEXOKAY2
HEXOKAY3
HRDATA0
HRDATA1
HRDATA
HRDATA2
HRDATA3
Figure 7.10: Design of a simple AHB5 slave multiplexer with up to four AHB slave connection.
169
System-on-Chip Design with Arm® Cortex®-M processors
module ahb_slavemux (
input wire HCLK, // Clock
input wire HRESETn, // Reset
input wire HREADY, // Bus system level HREADY
input wire HSEL0, // HSEL for AHB Slave #0
input wire HREADYOUT0, // HREADY for Slave connection #0
input wire HRESP0, // HRESP for slave connection #0
input wire [31:0] HRDATA0, // HRDATA for slave connection #0
input wire HEXOKAY0, // HEXOKAY for slave connection#0
input wire HSEL1, // HSEL for AHB Slave #1
input wire HREADYOUT1, // HREADY for Slave connection #1
input wire HRESP1, // HRESP for slave connection #1
input wire [31:0] HRDATA1, // HRDATA for slave connection #1
input wire HEXOKAY1, // HEXOKAY for slave connection#1
input wire HSEL2, // HSEL for AHB Slave #2
input wire HREADYOUT2, // HREADY for Slave connection #2
input wire HRESP2, // HRESP for slave connection #2
input wire [31:0] HRDATA2, // HRDATA for slave connection #2
input wire HEXOKAY2, // HEXOKAY for slave connection#2
input wire HSEL3, // HSEL for AHB Slave #3
input wire HREADYOUT3, // HREADY for Slave connection #3
input wire HRESP3, // HRESP for slave connection #3
input wire [31:0] HRDATA3, // HRDATA for slave connection #3
input wire HEXOKAY3, // HEXOKAY for slave connection#3
output wire HREADYOUT, // HREADY output to AHB master and AHB slaves
output wire HRESP, // HRESP to AHB master
output wire [31:0] HRDATA, // Read data to AHB master
output wire HEXOKAY // Exclusive okay
);
// Internal signals
reg [3:0] SampledHselReg;
// Registering select
always @(posedge HCLK or negedge HRESETn)
begin
if (~HRESETn)
SampledHselReg <= {4{1’b0}};
else if (HREADY) // advance pipeline if multiplexed HREADY is 1
SampledHselReg <= {HSEL3, HSEL2, HSEL1, HSEL0};
end
assign HREADYOUT =
(SampledHselReg[0] & HREADYOUT0)|
(SampledHselReg[1] & HREADYOUT1)|
(SampledHselReg[2] & HREADYOUT2)|
(SampledHselReg[3] & HREADYOUT3)|
(SampledHselReg ==4’b0000);
assign HRDATA =
({32{SampledHselReg[0]}} & HRDATA0)|
({32{SampledHselReg[1]}} & HRDATA1)|
({32{SampledHselReg[2]}} & HRDATA2)|
({32{SampledHselReg[3]}} & HRDATA3);
assign HRESP =
(SampledHselReg[0] & HRESP0)|
(SampledHselReg[1] & HRESP1)|
(SampledHselReg[2] & HRESP2)|
(SampledHselReg[3] & HRESP3);
170
System-on-Chip Design with Arm® Cortex®-M processors
assign HEXOKAY =
(SampledHselReg[0] & HEXOKAY0)|
(SampledHselReg[1] & HEXOKAY1)|
(SampledHselReg[2] & HEXOKAY2)|
(SampledHselReg[3] & HEXOKAY3);
endmodule
Two memory models are being developed: a ROM model for program memory and a RAM model for
the data memory. Normally, the ROM model is used for program memory, and RAM is used for data
memory; however, the RAM model can also be used for program memory if memory initialization is
carried out. In this way, we can allow the program to be self-modified or allow an external debugger
to change the program code during debugging.
The ROM model that we have illustrated here is a read-only simulation memory model. One of the
requirements for this program memory simulation model is that it must allow us to define the program
data inside when the simulation starts. Inside the Verilog code of the ROM model, we use the Verilog
system function “$readmemh” to initialize the program data array with data from a file called “image.
dat”. This file contains the hexadecimal values of a compiled binary image for the Cortex-M processor.
The RAM model, however, will only initialize the data array to zero values. It is possible to add the
“$readmemh” function to initialize the RAM content to other values if it is to be used as program memory.
HSEL HSEL
HADDR HADDR
HTRANS HRDATA HTRANS HRDATA
HSIZE HREADYOUT HSIZE HREADYOUT
HREADY HRESP HREADY HRESP
ahb_rom HWRITE ahb_ram HEXOKAY
HEXCL
image.dat HWDATA
HCLK HCLK
HRESETn HRESETn
Figure 7.11: Simple memory simulation models with the AHB interface.
In order to simplify the design, the memory models do not insert a wait state and treat burst transfers
just like single transfers. Also, these example models only support little-endian.
171
System-on-Chip Design with Arm® Cortex®-M processors
There are different ways to develop the required AHB memory simulation models. Since the AHB
protocol is pipelined, a registering stage is needed inside the memory. For illustration purposes, we will
design the ROM model with a registering stage for the data output, while for the RAM, we will use the
registering stage for control signal processing.
Byte strobe
generation
HADDR[1:0]
HSIZE
HSEL ReadEnable
ReadValid
HTRANS[1]
HREADY 4
Registered
Masking output
Byte 3 HRDATA
Memory
Byte 2 (Read Data)
HADDR Byte 1
image.dat
Byte 0
HCLK
For the ROM design, we assumed that all accesses to the ROM are always read transfers. Therefore
the HWRITE signal was not used. In the ROM design, we masked the unused read data output for
half-word and byte transfers. Note: This is not required in real systems. ARM processors like the
Cortex-M3 just ignore the unused data. However, masking the unused data bytes can make the bus
activities easier to see during debugging.
// Internal signals
reg [7:0] RomData[0:65535]; // 64k byte of ROM data
integer i; // Loop counter for ROM initialization
wire ReadValid; // Address phase read valid
wire [15:0] WordAddr; // Word aligned address(addr phase)
reg [3:0] ReadEnable; // Read enable for each byte(addr phase)
reg [7:0] RDataOut0; // Read Data Output byte#0(data phase)
172
System-on-Chip Design with Arm® Cortex®-M processors
// Read operation
assign WordAddr = {HADDR[15:2], 2’b00}; // Get word aligned address
// Registered read
always @(posedge HCLK or negedge HRESETn)
begin
if (~HRESETn)
begin
RDataOut0 <= 8’h00;
RDataOut1 <= 8’h00;
RDataOut2 <= 8’h00;
RDataOut3 <= 8’h00;
end
else
begin // Read when read enable is high
RDataOut0 <= (ReadEnable[0]) ? RomData[WordAddr ] : 8’h00;
RDataOut1 <= (ReadEnable[1]) ? RomData[WordAddr+1] : 8’h00;
173
System-on-Chip Design with Arm® Cortex®-M processors
endmodule
Unlike the AHB ROM, the pipeline stage of the AHB RAM design takes place at the time of control signal
generation. All the actual read and write operations take place during the data phase of the AHB transfer.
This ensures that if the data written are read out in the next clock cycle, the updated value will be used
for output.
To make it even more interesting, the AHB SRAM example design also adds support for exclusive
accesses by having exclusive response generation logic and tag registers (for address and bus master
ID) for the exclusive access sequence. Semaphore data are normally placed in SRAM, and if another
bus master writes to the same address location during semaphore read-modify-write operations, the
access conflict can be detected by the logic added in this model.
HSEL
Registering
HTRANS[1]
stage
HREADY
ReadEnable
WriteEnable
WriteValid NextByteLane
ByteLane
4 4
ReadValid
HWRITE
Byte strobe
generation
HADDR[1:0]
HSIZE
HADDR
HCLK
Address
Byte 3
RAM
Byte 2
HWDATA Byte 1 HRDATA
Masking
Byte 0
Address
Tag
Exclusive
response
generation
Master ID logic
HMASTER tag HEXOKAY
HCLK HCLK
174
System-on-Chip Design with Arm® Cortex®-M processors
The Verilog RTL code of the AHB SRAM (for simulation) is as follows:
// Internal signals
reg [7:0] RamData[0:65535]; // 64k byte of RAM data
integer i; // Loop counter for zero initialization
wire ReadValid; // Address phase read valid
wire WriteValid; // Address phase write valid
reg ReadEnable; // Data phase read enable
reg WriteEnable; // Data phase write enable
reg [3:0] RegByteLane; // Data phase byte lane
reg [3:0] NextByteLane; // Next state of RegByteLane
175
System-on-Chip Design with Arm® Cortex®-M processors
if (ReadValid | WriteValid)
begin
case (HSIZE)
0 : // Byte
begin
case (HADDR[1:0])
0: NextByteLane = 4’b0001; // Byte 0
1: NextByteLane = 4’b0010; // Byte 1
2: NextByteLane = 4’b0100; // Byte 2
3: NextByteLane = 4’b1000; // Byte 3
default:NextByteLane = 4’b0000; // Address not valid
endcase
end
1 : // Halfword
begin
if (HADDR[1])
NextByteLane = 4’b1100; // Upper halfword
else
NextByteLane = 4’b0011; // Lower halfword
end
default : // Word
NextByteLane = 4’b1111; // Whole word
endcase
end
else
NextByteLane = 4’b0000; // Not reading
end
// Read operation
assign RDataOut0 = (ReadEnable & RegByteLane[0]) ? RamData[WordAddr ] : 8’h00;
assign RDataOut1 = (ReadEnable & RegByteLane[1]) ? RamData[WordAddr+1] : 8’h00;
assign RDataOut2 = (ReadEnable & RegByteLane[2]) ? RamData[WordAddr+2] : 8’h00;
assign RDataOut3 = (ReadEnable & RegByteLane[3]) ? RamData[WordAddr+3] : 8’h00;
// Registered write
always @(posedge HCLK)
begin
if (WriteEnable & RegByteLane[0] & ~ExclStoreFail)
begin
RamData[WordAddr ] = HWDATA[ 7: 0];
end
if (WriteEnable & RegByteLane[1] & ~ExclStoreFail)
begin
RamData[WordAddr+1] = HWDATA[15: 8];
end
176
System-on-Chip Design with Arm® Cortex®-M processors
// Exclusive state
always @(posedge HCLK or negedge HRESETn)
begin
if (~HRESETn)
Excl_State <= 1’b0;
else
if (ReadValid & HEXCL) // Exclusive read
Excl_State <= 1’b1;
else if (WriteValid & (HMASTER!=Excl_Tag_MID[3:0]) & (HADDR[15:4]==Excl_Tag_
Addr[15:4]))
Excl_State <= 1’b0; // Another bus master write to same location
else if (WriteValid & HEXCL) // Another bus master performed an exclusive write
Excl_State <= 1’b0;
end
177
System-on-Chip Design with Arm® Cortex®-M processors
endmodule
For FPGA designs, instead of using the behavioral model for ROM and RAM, the memory will be
replaced by either:
In cases where the memories are to be implemented inside the FPGA, the design flow can involve:
Using the memory generator in your FPGA development tool to generate the required memory
block design file, or
Instantiating the required memory component directly in the FPGA design library.
If the synthesis tool supports the synthesis of memory models, we can create a synthesizable memory
model and synthesize the design together with other Verilog files.
In all cases, the implementation details depend on the FPGA product that you are using as well as the
development tools. You will need to refer to the documentation or application notes of your FPGA
development tool to determine which is the best arrangement for your project.
For example, for users of Synplify, they can choose to synthesize a behavior model of memory, with
the possibility of using a standard Verilog function like “$readmemh” to specify the initial content of
the memory. The tools will then create the memory from memory blocks in the FPGA. A similar feature
is also available from various FPGA tools from different vendors.
For ASIC/SoC designs, the ROM and RAM are normally created using vendor-specific memory
compilers and with AHB bus wrappers. An example of an SRAM bus wrapper for AHB Lite can be
found in the Cortex-M0 and Cortex-M3 DesignStart bundle called cmsdk_ahb_to_sram.v. This block
provides essential write buffering to enable typical SRAM blocks generated from memory compilers
178
System-on-Chip Design with Arm® Cortex®-M processors
to be connected to AHB Lite bus without any wait states. While there is no exclusive monitor
functionality in this block, bus level exclusive access monitoring is not normally required in single-
processor systems, as described in Section 4.7.
In AHB, the address value and control information are output from the bus master during the address
phase. Since the duration of the address phase is not fixed and the write data is not available until data
phase, the bus bridge registers the address and read/write control signal at the end of the address
phase and outputs them to the APB during the data phase. In order to generate the required PSEL
and PENABLE signals, the bus bridge contains a simple Finite State Machine (FSM) to handle the
APB control signals, as well as generating an error response on the AHB when an APB slave error
(PSLVERR) is asserted.
The example design also includes an APB slave multiplexer and eight interface ports to APB slaves.
The selection of which slave to access is determined by bit-14 to bit-12 of the address value
(HADDR[14:12]). It is possible to design the APB Bridge and the APB slave multiplexer as two
separated units. However, in this example, they have been designed as a single unit to simplify
integration of the AMBA system.
HSEL PADDR[11:0]
HADDR[14:0] PENABLE
HTRANS[1:0] PWRITE
HWRITE PWDATA[31:0]
HWDATA[31:0] PPROT[2:0]
AHB to APB
HREADY PSTRB[3:0]
Bridge
HNONSEC PSEL[7:0]
HPROT[6:0] PREADY0
PRDATA0[31:0]
HREADYOUT PSLVERR0
HRDATA[31:0]
HRESP
PREADY7
HRESETn PRDATA7[31:0]
HCLK PSLVERR7
Figure 7.14: AHB5 to APBv2 bus bridge with 8 APB slave interface ports.
The example bus bridge design assumes that HCLK is the same as PCLK. If the AHB system and APB
system have different clock frequencies, or if the clock signals are asynchronous, the bus bridge will
have to include an extra handshaking mechanism to support the data transfers across different clock
domains. This is not supported on the example bridge discussed here.
179
System-on-Chip Design with Arm® Cortex®-M processors
To keep the design simple, we have also omitted exclusive access support, as semaphore data are
normally placed in SRAM rather than in peripherals.
For a simple read operation, the bridge outputs the address and APB control signal at the beginning of
the data phase. When the read data is obtained, the read value is sampled in a register and output to
the AHB system in the next clock cycle. Since the bridge supports multiple APB slave interfaces, there
are multiple PSEL, PRDATA, PSLVERR, and PREADY signals, which have a number as their suffix.
HCLK
HADDR Address
HSEL
HWRITE
Other address
Read IDLE IDLE
phase controls
HREADY
PADDR Valid
PSEL(n)
PENABLE
PWRITE
PREADY(n)
PSLVERR(n)
It is possible to design the bridge so that as soon as the read data is ready, it is passed on to the AHB
system without the sampling cycle. However, this could result in poor timing performance if the
output delay of the peripheral system is high, or if the processor core requires a longer setup time
for the read data. Registering the read data signal and read response provides better synthesis timing
performance for the ASIC/FPGA design. The disadvantage is that it slightly increases the area of the
design and adds an extra clock cycle for APB operations.
The write transfer bridging is similar to read transfers. For write data, in most cases, it is not a problem
to connect from HWDATA directly to PWDATA because there is no need to add a multiplexer in the
write data path (only buffers are needed as the HWDATA signals are connected to a large number of bus
slaves). But the PREADY and PSLVERR signals are registered before feeding back to the AHB system.
180
System-on-Chip Design with Arm® Cortex®-M processors
HCLK
HADDR Address
HSEL
HWRITE
Other address
Write IDLE IDLE
phase controls
HREADY
PADDR Valid
PSEL(n)
PENABLE
PWRITE
PREADY(n)
PSLVERR(n)
In order to generate the read and write control signals, a simple finite state machine (FSM) is used.
The FSM is also used to generate the two-cycle error response on the AHB if an error response on the
APB is detected.
APB selected
APB not
APB not selected selected
PREADY = 1,
1st cycle of error Other cycles of the APB
PSLVERR = 0
response on AHB transfer
State unchange
PREADY = 1, if PREADY = 0
PSLVERR = 1
181
System-on-Chip Design with Arm® Cortex®-M processors
If a slave error is detected on the APB bus, the APB bridge needs to generate an error response to
the AHB master. The error response on the AHB must be two cycles in length. Therefore, two states
are assigned in the FSM for this purpose. For example, if an APB receives an error from the peripheral
(PSLVERR = 1), the waveform would look like:
HCLK
HADDR Address
HSEL
HWRITE
Other address
Read IDLE IDLE
phase controls
HREADY
PADDR Valid
PSEL(n)
PENABLE
PWRITE
PREADY(n)
PSLVERR(n)
Figure 7.18: AHB5 to APBv2 bus bridge read with error response.
The same applies to the bridging write transfers. If an APB write transfer receives an error response,
the bus bridge generates the error response using the same mechanism.
182
System-on-Chip Design with Arm® Cortex®-M processors
HCLK
HADDR Address
HSEL
HWRITE
Other address
Write IDLE IDLE
phase controls
HREADY
PADDR Valid
PSEL(n)
PENABLE
PWRITE
PREADY(n)
PSLVERR(n)
Figure 7.19: AHB5 to APBv2 bus bridge write with error response.
Based on these waveforms, we could design the AHB to APB Bridge, as shown in the following
block diagram:
HSEL APB selected
HTRANS[1]
APB
HREADY
Interface
Finite
HREADYOUT State for up to 8
HRESP
Machine (FSM)
slaves
PENABLE
HWDATA PWDATA
HWRITE PWRITE
HADDR PADDR
[18:0] [15:2]
HADDR
register
[15:2] Data phase MUX control
PSEL[7:0]
HADDR enable
PREADY0
[18:16]
PREADY1
Binary to PREADY7
one hot
register PSLVERR0
PSLVERR1
AHB enable
Interface PSLVERR7
HRDATA PRDATA0
register PRDATA1
enable
PRDATA7
183
System-on-Chip Design with Arm® Cortex®-M processors
184
System-on-Chip Design with Arm® Cortex®-M processors
// Internal signals
reg [15:2] AddrReg; // Address sample register
reg [7:0] SelReg; // One-hot PSEL output register
reg WrReg; // Write control sample register
reg [2:0] StateReg; // State for finite state machine
185
System-on-Chip Design with Arm® Cortex®-M processors
if (~HRESETn)
begin
AddrReg <= {10{1’b0}};
WrReg <= 1’b0;
PProtReg<= {3{1’b0}};
end
else if (ApbSelect) // Only change at beginning of each APB transfer
begin
AddrReg <= HADDR[11:2]; // Note that lowest two bits are not used
WrReg <= HWRITE;
PProtReg<= {(~HPROT[0]),HNONSEC,(HPROT[1])};
end
end
186
System-on-Chip Design with Arm® Cortex®-M processors
// Slave Multiplexer
assign muxPRDATA = ({32{SelReg[0]}} & PRDATA0) |
({32{SelReg[1]}} & PRDATA1) |
({32{SelReg[2]}} & PRDATA2) |
({32{SelReg[3]}} & PRDATA3) |
({32{SelReg[4]}} & PRDATA4) |
({32{SelReg[5]}} & PRDATA5) |
({32{SelReg[6]}} & PRDATA6) |
({32{SelReg[7]}} & PRDATA7) ;
assign muxPREADY = (SelReg[0] & PREADY0) |
(SelReg[1] & PREADY1) |
(SelReg[2] & PREADY2) |
(SelReg[3] & PREADY3) |
(SelReg[4] & PREADY4) |
(SelReg[5] & PREADY5) |
(SelReg[6] & PREADY6) |
(SelReg[7] & PREADY7) ;
assign muxPSLVERR = (SelReg[0] & PSLVERR0) |
(SelReg[1] & PSLVERR1) |
(SelReg[2] & PSLVERR2) |
(SelReg[3] & PSLVERR3) |
(SelReg[4] & PSLVERR4) |
(SelReg[5] & PSLVERR5) |
(SelReg[6] & PSLVERR6) |
(SelReg[7] & PSLVERR7) ;
// Sample PRDATA
always @(posedge HCLK or negedge HRESETn)
begin
if (~HRESETn)
RDataReg <= {32{1’b0}};
else if (ApbTranEnd|AhbTranEnd)
RDataReg <= muxPRDATA;
end
187
System-on-Chip Design with Arm® Cortex®-M processors
endmodule
In this design, up to eight APB slaves can be connected to the bridge. The design can be modified
easily to support more or fewer APB slaves. This can be done by changing the binary to one-hot logic,
the PSEL register, and the multiplexers.
The current design allows each APB to take up 4kB of memory. If the memory range for each APB
slave needs to be increased or decreased, the address signal lines connected to the slave multiplexer
need to be changed. In most cases, 4kB should be enough for most simple APB slaves.
Please note that there is also a mismatch between the AHB Lite specification and the Cortex-M3
and Cortex-M4’s bus interface design. Since the Cortex-M3 was designed before the AHB Lite
specification was finalized, the HRESP input signals on the Cortex-M3 processor are 2-bit wide as they
are in AHB (AMBA 2), although the RETRY and SPILT responses are not supported. When connecting
a Cortex-M3 processor to AHB Lite infrastructure, the bit 1 of the HRESP input in the processor can
be tied to 0. Since Cortex-M4 was designed to enable easy migration from Cortex-M3, it kept the
same arrangement, and therefore its HRESP inputs are also 2-bit wide.
module cm3ahb_to_ahb5 (
input wire HCLK, // Clock
input wire HRESETn, // Reset
188
System-on-Chip Design with Arm® Cortex®-M processors
endmodule
In this bus wrapper, we omitted the HNONSEC (Security Attribute for TrustZone support). If a
Cortex-M processor without TrustZone support is used in a TrustZone enabled system, you can use
one of two arrangements:
1. The Cortex-M processor is treated as always Non-secure. In this case, the HNONSEC signal(s)
of the Cortex-M processor is tied high. The bus system needs to handle permission checking to
prevent the processor from accessing Secure memories.
2. The Cortex-M processor is treated as always Secure. In this case, the HNONSEC signal(s) must be
generated based on the memory address partitioning (1 for Non-secure addresses and 0 for Secure
addresses).
The Corstone system design kits from Arm provide a component called Master Security Controller to
handle security management of legacy bus masters in a TrustZone based system.
189
System-on-Chip Design with Arm® Cortex®-M processors
190
System-on-Chip Design with Arm® Cortex®-M processors
CHAPTER
Design of 8
simple peripherals
191
System-on-Chip Design with Arm® Cortex®-M processors
Make sure that the peripheral’s registers are word-aligned unless you are creating an AHB peripheral
with byte-addressable registers. In most cases, peripherals will be connected to the APB bus
system. Since there is no transfer size information on this bus and the bus size is always 32-bit wide,
it is best to make every register word size and align to word addresses. At the peripheral interface,
the address bit [1:0] is unused and can be ignored.
When creating peripheral registers, it is important to avoid a status bit that can be cleared by
writing a zero to it. For example, if the application needs to perform a read-modify-write operation
to change a one-bit bitfield value in a peripheral register, and if another status bit of the same
register changed state between the read and the write access, the information of the status bit
change would be lost when the write-back takes place. Normally, status bits that indicate events
can be implemented as write 1 to clear. In this way, the status bit will not be cleared accidentally.
When creating status registers in peripherals, it is best to avoid a status register that changes
its value upon read accesses (e.g., clear on read). This is because you might want to read the
peripheral memory map through a debugger at the same time as the program is running. Of course,
sometimes, this cannot be avoided. In this case, you might want to create a separate address for
the debugger accesses to the peripheral status register so that reading of status information by the
debugger will not change the behavior of the device.
Be aware that the status of the peripheral interface might be in an undefined state during the
starting up of an FPGA. Internally to the FPGA, the peripheral can be reset after the FPGA is ready.
But the external circuit interface to the peripheral will need to be aware that the FPGA needs a
certain period of time to get ready. Most FPGA products have status output to indicate that the
starting sequence is completed. This can be used to externally disable the circuit from activation.
In most cases, interrupt signals from peripherals are designed as level trigger interrupts. Compared
to pulse triggered interrupts, level trigger interrupts can propagate through clock domains with
simple synchronizers. Whereas synchronization for pulse trigger interrupt signals is more complex.
When developing a complex design, it is often impractical to design all of the required peripherals
by yourself. However, there are various offerings available from Intellectual Property providers that
can be used to create peripheral solutions that work with Arm processors. Arm also has a range of
peripheral products and subsystem products for accelerating time to market.
In many cases, a peripheral could operate at a much slower clock speed than the processor. Instead
of using bus bridges to convert the bus clock speed to match the peripheral speed, it is often
best to have a separated clock for the bus interface of the peripheral and a clock for peripheral
functions. The complexity of the peripheral would increase as a result, as there is a need for
192
System-on-Chip Design with Arm® Cortex®-M processors
additional synchronization logic between the clock domains. However, this arrangement avoids
high access latency when reading/writing to the peripheral registers, which reduces the energy
efficiency of the system and has an impact on interrupt latency.
Exclusive accesses, or
Instances in which the peripheral needs to behave differently when other bus-masters have access
to it (Note: This requires the HMASTER signal).
In this section, we will cover the design of simple APB peripherals, including a simple parallel I/O
interface and a simple timer. First of all, though, we will take a look at APB interface design in general.
APB transfers take two cycles in AMBA 2, or a minimum of two cycles in AMBA 3 and later versions.
For read operations, an APB slave must provide valid read data at the last cycle of the transfer, except
when it is responding with an error signal (applicable only in AMBA 3 or after). For slower frequency
systems, we can generate the read data as soon as possible without any pipeline stage.
PSEL ReadEnable
PWRITE
PADDR
PRDATA
MUX
Hardware
registers
With the arrangement shown in Figure 8.1, the PRDATA will be valid for all the cycles during the APB
transfer. This solution is good enough for some APB systems. However, in systems that require higher
operating frequency, this design might not be suitable because of the propagation delay in the read
193
System-on-Chip Design with Arm® Cortex®-M processors
data generation. For example, if an APB slave contains a large number of hardware registers, the read
data multiplexer will have a long delay. In addition to that, the APB slave multiplexer at the system-
level (in our example the APB slave multiplexer is built-in to the AHB to APB Bridge) also requires
some time to multiplex the read data from different APB slaves. As a result, the total propagation
delay can be quite significant and can be worse when we include wire connection delays in the
calculation. This can reduce the maximum clock frequency of the system, or cause signal routing
problems within the FPGA/ASIC designs.
In order to solve this problem, we can insert a register stage in the APB slave read data interface
to improve the synthesis timing of the data path. The pipeline stage could be inserted in different
positions, depending on your design. One of the possible arrangements is to register the read data
multiplexer output.
PSEL ReadEnable
PWRITE
PADDR
MUX D Q PRDATA
En
Hardware
registers
By using this pipeline stage, we can prevent the signal path from hardware registers within the APB
slaves to the AHB to APB bridge from getting too long. In addition, an extra registering stage can be
placed within the AHB to APB bridge, so in total, a data read takes two clock cycles to get the data
value from the peripheral register to the processor. This seems to have increased the read operation
latency, but the first registering cycle within the APB slave overlapped with the APB operations (APB
transfers take a minimum of two cycles), so the only extra latency cycle is at the registering stage of
the AHB to APB Bridge.
Processor internal
APB slave AHB to APB bridge AHB connection
delay
Optional
Registering Register bank
registering
stage (R0 - R15)
stage
Hardware
registers
APB AHB Processor data
MUX
infrastructure infrastructure path
Figure 8.3: The signal path of pipelined APB read data generation.
194
System-on-Chip Design with Arm® Cortex®-M processors
In the example designs, the register slices for read data only activate if there is a read to the slave.
Otherwise, the register is held unchanged to reduce power consumption. The read data output
(PRDATA) is masked by a ReadEnable signal to block data output when the APB slave is not being
read. Such a blocking mechanism could potentially simplify merging of PRDATA from multiple slaves
by using just OR logic if all bus slaves in the APB segment has the same behavior.
The write operation is easier to implement. For example, if each writable hardware register has
a corresponding write enable signals, the write enable signal can be generated as:
PSEL
Enable
WriteEnable0 Hardware Register
PWRITE
#0
Hardware Register
WriteEnable(N) #N
PWDATA
When designing APB peripherals, we should assign each hardware register to word-aligned addresses
and reserve the whole word for the register even if only part of the space is used. This is because
the APB interface does not provide transfer size information, so each access is assumed to be the
maximum transfer size of the bus (i.e., word size). The common practice is that bit 1 and bit 0 of the
PADDR bus are not used, and registers are assigned with word aligned addresses like (0xXXXX0000,
0xXXXX0004, 0xXXXX0008, etc.). Even if the register does not require the whole word, it occupies
the whole word address.
Unused
Registers
Address Offset space
0x0010 Register #4
0x000C Register #3
0x0008 Register #2
0x0004 Register #1
Peripheral base 0x0000 Register #0
address
Figure 8.5: Keep peripheral registers word-aligned so that the lowest two bits of an address can be tied off.
For many designers, they might have legacy peripherals that were designed for simple 8-bit or 16-bit
microcontrollers and want to reuse them on their Cortex-M processor-based design. In most cases,
these peripherals have three read and write control signals: Chip-Select, Read-Enable, and Write-
195
System-on-Chip Design with Arm® Cortex®-M processors
Enable. If the peripheral has a synchronous (clocked) interface and does not require a tristate data bus,
it is normally an easy task to connect this type of peripheral to APB.
PADDR[n-1:2] Address
PSEL ChipEnable
PWRITE
ReadEnable
WriteEnable
PENABLE
PWDATA WriteData
PREADY 1
PRDATA ReadData
PSLVERR 0
Figure 8.6: Simple APB wrapper for legacy peripherals with a simple synchronous interface.
However, if the peripheral requires a tristate bus interface, or uses an asynchronous interface, the
wrapper will have to handle the transfers in multiple clock cycles and create a turn-around cycle if
a tristate bus is used (a turn-around cycle is used to prevent current spikes on the data bus when the
direction of the data changes, caused by the bus master and bus slave tristate buffers being turned
on simultaneously for a very short period of time during the transition). In order to allow multiple
cycle operations, the APB system must support the PREADY signal, so the APB for AMBA 3 or later is
needed in this situation.
The simplest way to develop a wrapper for such a peripheral is to create a finite state machine and
generate the read enable, write enable, and tristate buffer output enable, using the state value. (Note:
We assume that the OutputEnable signal is used to enable the tristate buffer for write data, and
the ReadEnable is used to enable the tristate buffer for data output on the slaves). If the peripheral
accesses take longer, the finite state machine design can be extended easily by adding extra states.
ReadEnable = 0,
WriteEnable = 0, PSEL = 1, PENABLE = 0
OutputEnable = 0 and PWRITE= 1
Idle
PSEL = 1, PENABLE = 0
and PWRITE= 0
WriteEnable = 1,
Write cycle
OutputEnable = 1
Read and sampling ReadEnable = 1,
Cycle OutputEnable = 0
196
System-on-Chip Design with Arm® Cortex®-M processors
In the example shown in Figure 8.8, a sampling register is used to hold the read data during the
turn-around cycle for read. This allows for better synthesis timing performance. Usually, a tristate
bus operates slower than a unidirectional bus. Without the sampling register, the delay of the read
operation, together with the delay caused by the APB slave multiplexer can become a critical path of
the design and limits the maximum clock frequency.
Using the state machine diagram, the wrapper logic to interface an asynchronous peripheral can be
developed as follows:
PADDR[n-1:2] Address
PSEL ChipEnable
PENABLE ReadEnable
PREADY OutputEnable
PWDATA WriteData
Sampling
PRDATA ReadData
register
PSLVERR 0
Figure 8.8: APB to simple asynchronous interface wrapper with tristate bus support.
module apb_to_async_wrapper (
input wire PCLK, // Clock
input wire PRESETn, // Reset
// APB interface inputs
input wire PSEL, // Device select
input wire [7:2] PADDR, // Address
input wire PENABLE, // Transfer control
input wire PWRITE, // Write control
input wire [31:0] PWDATA, // Write data
// APB interface outputs
output wire [31:0] PRDATA, // Read data
output wire PREADY, // Device ready
output wire PSLVERR, // Device error response
// Cycles for read & write, values are example only. Modify if needed.
localparam RD_CYCLE=4’h3; // 3 cycles for read
localparam WR_CYCLE=4’h2; // 2 cycles for write
// Encoding of FSM states
197
System-on-Chip Design with Arm® Cortex®-M processors
// FSM
always @(*)
begin
case (reg_fsm_state[2:0])
FSM_IDLE,FSM_READ_2,FSM_WRITE_2:
begin
if (RdStart)
nxt_fsm_state = FSM_READ_1;
else if (WrStart)
nxt_fsm_state = FSM_WRITE_1;
else
nxt_fsm_state = FSM_IDLE;
end
FSM_READ_1:
nxt_fsm_state = OpDone? FSM_READ_2: FSM_READ_1;
FSM_WRITE_1:
nxt_fsm_state = OpDone? FSM_WRITE_2: FSM_WRITE_1;
default: // should not be here
nxt_fsm_state = FSM_IDLE;
endcase
198
System-on-Chip Design with Arm® Cortex®-M processors
end
endmodule
199
System-on-Chip Design with Arm® Cortex®-M processors
The first stage of the design process is to determine the interface signals required. For basic
functionality, we need to have input signals, output signals, and enable control signals for output
tristate buffers. We can make the design more flexible by setting the width of the I/O as a Verilog
parameter, with a default width of 8-bit. In this way, the block can be reused easily on other designs
that need a different I/O width.
The GPIO block will also allow the generation of interrupts to a processor core. In order to make
the design more flexible, for each I/O pin, we will provide an interrupt output, as well as a combined
interrupt output.
The APB interface will include APB signals for AMBA 3. However, as the design does not require wait
state, the PREADY output will be tied to high and PSLVERR signal will be tied to low. This design can be
used on APB systems for AMBA 2. In this case, the PREADY and PSLVERR outputs can be ignored.
The next step of the design process is to determine the programmer’s model for the GPIO block. The
programmer’s model is fairly simple, containing only six registers.
200
System-on-Chip Design with Arm® Cortex®-M processors
With these details in place, we can begin to develop the design for the example GPIO block.
OutEn
PORTEN
8
I/O Pad
DataOut
PORTOUT
8 8
Synchroniser
DataIn Q D Q D
PORTIN
8
APB
Interface
IntPolarity
8
Polarity
adjustment D Q
Edge detection
IntType 1 0
Type
8 multiplexer
IntEnable
8
IntState
COMBINT
GPIOINT
8
GPIO
The GPIO design itself does not contain the tristate buffers. These have to be added externally
because tristate buffers can be technology-specific, and designers might want to define them
manually to match the electrical characteristics needed by the applications. Additionally, in some
designs, the I/O pins might need to be shared with other peripherals. In such cases, the tristate buffers
would have to be added after the pin multiplexor stage.
The design also contains a dual flip-flop synchronizer. This is used to prevent metastability issues
caused by toggling of asynchronous external inputs. The interrupt generation circuit is connected to
the synchronizer output. Please note that, with this arrangement, the interrupt generation will not
work if the clock is stopped.
For interrupt generation on the Cortex-M processors with this GPIO unit, please note that, aside from
enabling the interrupt enable at the GPIO block, it is also necessary to program the interrupt enable
register in the NVIC in the Cortex-M processor.
The APB interface design for the example GPIO block is quite simple. First, we use PSEL, PWRITE, and
PENABLE to create the enable signals for read and write operations. Then, we combine these signals
201
System-on-Chip Design with Arm® Cortex®-M processors
with the output from address decoding logic to enable write control for each register and to control
the read multiplexer for the generation of read data.
PSEL
ReadEnable
PWRITE
WriteEnable
PENABLE
Synchronized
DATAIN
0
DATAOUT
WrEn
=0x4 Q 1
D
OUTENABLE
WrEn
=0x8 Q 2
D
INTEN
WrEn D Q
=0xC Q 3 PRDATA
D MUX En
INTYPE
1 PREADY
WrEn
=0x10 Q 4
D
0 PSLVERR
INTPOLARITY
WrEn
=0x14 Q 5
D
INTSTATE
Clear
=0x18 Q 6
Set
Hardware registers
PADDR[4:2]
PWDATA
Interrupts
// Simple GPIO
//
//-------------------------------------
// Programmer’s model
// 0x00 RO DataIn
// 0x04 RW Data Output
// 0x08 RW Output Enable
// 0x0C RW Interrupt Enable
// 0x10 RW Interrupt Type
// 0x14 RW Interrupt Polarity
// 0x18 RWc Interrupt State
//-------------------------------------
module apb_gpio #(
parameter PortWidth = 8
) (
input wire PCLK, // Clock
input wire PRESETn, // Reset
// APB interface inputs
input wire PSEL, // Device select
202
System-on-Chip Design with Arm® Cortex®-M processors
// Write operations
// Data Output register
always @(posedge PCLK or negedge PRESETn)
begin
203
System-on-Chip Design with Arm® Cortex®-M processors
if (~PRESETn)
RegDOUT <= {PortWidth{1’b0}};
else if (WriteEnable04)
RegDOUT <= PWDATA[(PortWidth-1):0];
end
// Read operation
always @(PADDR or DataInSync2 or RegDOUT or RegDOUTEN or
RegINTEN or RegINTTYPE or RegINTPOL or RegINTState)
begin
case (PADDR[7:2])
0: ReadMux = DataInSync2;
1: ReadMux = RegDOUT;
2: ReadMux = RegDOUTEN;
3: ReadMux = RegINTEN;
4: ReadMux = RegINTTYPE;
5: ReadMux = RegINTPOL;
6: ReadMux = RegINTState;
default : ReadMux = {PortWidth{1’b0}}; // Read as 0 if address is out of range
endcase
end
204
System-on-Chip Design with Arm® Cortex®-M processors
// Output to external
assign PORTEN = RegDOUTEN;
assign PORTOUT = RegDOUT;
// Synchronize input
always @(posedge PCLK or negedge PRESETn)
begin
if (~PRESETn)
begin
DataInSync1 <= {PortWidth{1’b0}};
DataInSync2 <= {PortWidth{1’b0}};
end
else
begin
DataInSync1 <= PORTIN;
DataInSync2 <= DataInSync1;
end
end
// Interrupt state
always @(posedge PCLK or negedge PRESETn)
begin
if (~PRESETn)
RegINTState <= {PortWidth{1’b0}};
else
RegINTState <= MaskedInt|(RegINTState & ~(PWDATA[PortWidth-1:0] &
{PortWidth{WriteEnable18}}));
end
endmodule
205
System-on-Chip Design with Arm® Cortex®-M processors
The next step of the design process is to determine the programmer’s model for the timer block,
which is fairly simple as it contains only four registers.
Write 1 to
INTSTATE[0]
Synchroniser Edge
EXTIN detect Reload
(external D Q D Q D Q
Value
input)
CTRL[2]
1 Down Counter =1
TIMEINT
D Q
(interrupt)
1 1 0
Decrement
1 0
The APB interface design for the timer is almost the same as the GPIO, hence, it is not covered in
detail again here. The Verilog RTL code of the design is as follows:
206
System-on-Chip Design with Arm® Cortex®-M processors
// Simple Timer
//
//-------------------------------------
// Programmer’s model
// 0x00 RW CTRL[3:0]
// [3] Timer Interrupt Enable
// [2] Select External input as Clock
// [1] Select External input as Enable
// [0] Enable
// 0x04 RW Current Value[31:0]
// 0x08 RW Reload Value[31:0]
// 0x0C RWc Interrupt state
// [0] IntState, write 1 to clear
//-------------------------------------
module apb_timer (
input wire PCLK, // Clock
input wire PRESETn, // Reset
// APB interface inputs
input wire PSEL, // Device select
input wire [7:2] PADDR, // Address
input wire PENABLE, // Transfer control
input wire PWRITE, // Write control
input wire [31:0] PWDATA, // Write data
// APB interface outputs
output wire [31:0] PRDATA, // Read data
output wire PREADY, // Device ready
output wire PSLVERR, // Device error response
// Internal signals
reg ExtInSync1; // Synchronization registers for external input
reg ExtInSync2;
reg ExtInDelay; // Delay register for edge detection
wire DecCtrl; // Decrement control
wire ClkCtrl; // Clk select result
wire EnCtrl; // Enable select result
wire EdgeDetect; // Edge detection
reg RegTimerInt; // Timer interrupt output register
wire NxtTimerInt;
207
System-on-Chip Design with Arm® Cortex®-M processors
// Write operations
// Control register
always @(posedge PCLK or negedge PRESETn)
begin
if (~PRESETn)
RegCTRL <= {4{1’b0}};
else if (WriteEnable00)
RegCTRL <= PWDATA[3:0];
end
// Read operation
always @(PADDR or RegCTRL or RegCurrVal or RegReloadVal or RegTimerInt)
begin
case (PADDR[7:2])
0: ReadMux = {{28{1’b0}}, RegCTRL};
1: ReadMux = RegCurrVal;
2: ReadMux = RegReloadVal;
3: ReadMux = {{31{1’b0}}, RegTimerInt};
default : ReadMux = {32{1’b0}}; // Read as 0 if address is out of range
endcase
end
208
System-on-Chip Design with Arm® Cortex®-M processors
// Edge detection
assign EdgeDetect = ExtInSync2 & ~ExtInDelay;
// Clock selection
assign ClkCtrl = RegCTRL[2] ? EdgeDetect : 1’b1;
// Enable selection
assign EnCtrl = RegCTRL[1] ? ExtInSync2 : 1’b1;
// Decrement counter
always @(WriteEnable04 or PWDATA or DecCtrl or RegCurrVal or
RegReloadVal)
begin
if (WriteEnable04)
NxtCurrVal = PWDATA[31:0];
else if (DecCtrl)
begin
if (RegCurrVal == 32’h0)
NxtCurrVal = RegReloadVal;
else
NxtCurrVal = RegCurrVal - 1;
end
else
NxtCurrVal = RegCurrVal;
end
// Interrupt generation
// Trigger an interrupt when decrement to 0 and interrupt enabled
assign NxtTimerInt = (DecCtrl & RegCTRL[3] &
(RegCurrVal==32’h00000001));
// Connect to external
assign TIMERINT = RegTimerInt;
endmodule
209
System-on-Chip Design with Arm® Cortex®-M processors
This simple UART design supports 8-bit data transfers with 1 start bit, 1 stop bit, and without either
hardware flow control or parity support. Despite being less than 500 lines of Verilog code, it contains
a built-in baud rate generator, and the design supports 16 times oversampling on the serial input for
better receive reliability. It also supports interrupt for transmit (when write buffer is emptied) and
interrupt for receive (when data is received). For simulation purposes, a UART monitor module is also
included in the example testbench to capture the transmitted serial data.
APB interface
TXD (transmit data)
TXINT
(transmit UART TXEN (transmit enable)
interrupt)
RXD (receive data)
RXINT
BAUDTICK (baud rate x 16)
(receive interrupt)
210
System-on-Chip Design with Arm® Cortex®-M processors
// Simple UART
//
//-------------------------------------
// Programmer’s model
// 0x00 RW CTRL[3:0] TxIntEn, RxIntEn, TxEn, RxEn
// [3] RX Interrupt Enable
// [2] TX Interrupt Enable
// [1] RX Enable
// [0] TX Enable
// 0x04 RW STAT[3:0]
// [3] RX buffer overrun (write 1 to clear)
// [2] TX buffer overrun (write 1 to clear)
// [1] RX buffer full (Read only)
// [0] TX buffer full (Read only)
// 0x08 W TXD[7:0] Output Buffer Data
// R TX buffer full - bit[0]
// 0x0C RO RXD[7:0] Received Data
// 0x10 RW BAUDDIV[19:0] Baud divider
// (minimum value is 32)
//-------------------------------------
module apb_uart (
input wire PCLK, // Clock
input wire PRESETn, // Reset
// APB interface inputs
input wire PSEL, // Device select
input wire [7:2] PADDR, // Address
input wire PENABLE, // Transfer control
input wire PWRITE, // Write control
input wire [31:0] PWDATA, // Write data
// APB interface outputs
output wire [31:0] PRDATA, // Read data
output wire PREADY, // Device ready
output wire PSLVERR, // Device error response
211
System-on-Chip Design with Arm® Cortex®-M processors
// Internal signals
// Baud rate divider
reg [15:0] RegBaudCntrI;
wire [15:0] NxtBaudCntrI;
reg [3:0] RegBaudCntrF;
wire [3:0] NxtBaudCntrF;
wire [3:0] MappedCntrF;
reg RegBaudTick;
reg BaudUpdated;
wire ReloadI;
wire ReloadF;
wire BaudDivEn;
// Status
wire [3:0] UartStatus;
reg RegRxOverrun;
wire RxOverrun;
reg RegTxOverrun;
wire TxOverrun;
wire NxtRxOverrun;
wire NxtTxOverrun;
// Interrupts
reg RegTXINT;
wire NxtTXINT;
reg RegRXINT;
wire NxtRXINT;
// transmit
reg [3:0] TxState; // Transmit FSM state
reg [3:0] NxtTxState;
wire TxStateInc; // Bit pulse
reg [3:0] TxTickCnt; // Transmit Tick counter
wire [3:0] NxtTxTickCnt;
reg [7:0] TxShiftBuf; // Transmit shift register
wire [7:0] NxtTxShiftBuf;
reg TxBufFull; // TX Buffer full
wire NxtTxBufFull;
reg RegTxD; // Tx Data
wire NxtTxD;
wire TxBufClear;
// Receiver
reg [3:0] RxState; // Receiver FSM state
reg [3:0] NxtRxState;
reg [3:0] RxTickCnt; // Receiver Tick counter
wire [3:0] NxtRxTickCnt;
wire RxStateInc;// Bit pulse
reg [6:0] RxShiftBuf;// Receiver shift data register
wire [6:0] NxtRxShiftBuf;
reg RxBufFull;
wire NxtRxBufFull;
wire RxBufSample;
wire RxDataRead;
wire [7:0] NxtRxBuf;
212
System-on-Chip Design with Arm® Cortex®-M processors
assign WriteEnable = PSEL & (~PENABLE) & PWRITE; // assert for 1st cycle of write
transfer
assign WriteEnable00 = WriteEnable & (PADDR[7:2] == 6’b000000);
assign WriteEnable04 = WriteEnable & (PADDR[7:2] == 6’b000001);
assign WriteEnable08 = WriteEnable & (PADDR[7:2] == 6’b000010);
assign WriteEnable10 = WriteEnable & (PADDR[7:2] == 6’b000100);
assign WriteEnable14 = WriteEnable & (PADDR[7:2] == 6’b000101);
assign ReadEnable10 = PSEL & (~PWRITE) & (~PENABLE) & (PADDR[7:2] == 6’b000100);
// Write operations
// Control register
always @(posedge PCLK or negedge PRESETn)
begin
if (~PRESETn)
RegCTRL <= {4{1’b0}};
else if (WriteEnable00)
RegCTRL <= PWDATA[3:0];
end
// Status register
assign NxtRxOverrun = (RegRxOverrun & ~(WriteEnable04 & PWDATA[3])) | RxOverrun;
assign NxtTxOverrun = (RegTxOverrun & ~(WriteEnable04 & PWDATA[2])) | TxOverrun;
// Read operation
assign UartStatus = {RegRxOverrun, RegTxOverrun, RxBufFull, TxBufFull};
213
System-on-Chip Design with Arm® Cortex®-M processors
// --------------------------------------------
// Baud rate generator
// Baud rate generator enable
assign BaudDivEn = (RegCTRL[1:0] != 2’b00);
assign MappedCntrF = {RegBaudCntrF[0],RegBaudCntrF[1],
RegBaudCntrF[2],RegBaudCntrF[3]};
// Reload Integer divider
// when UART enabled and (RegBaudCntrF < RegBaudDiv[3:0])
// then count to 1, or
// when UART enabled then count to 0
assign ReloadI = (BaudDivEn &
(((MappedCntrF >= RegBaudDiv[3:0]) &
(RegBaudCntrI[15:1] == {15{1’b0}})) |
(RegBaudCntrI[15:0] == {16{1’b0}})));
// Next state for Baud rate divider
assign NxtBaudCntrI = (BaudUpdated | ReloadI) ? RegBaudDiv[19:4] :
(RegBaudCntrI - {{15{1’b0}},BaudDivEn});
assign ReloadF = BaudDivEn & (RegBaudCntrF==4’h0) &
ReloadI;
assign NxtBaudCntrF = (BaudUpdated) ? RegBaudDiv[3:0] :
(ReloadF) ? 4’b1111 :
(RegBaudCntrF - {{3{1’b0}},ReloadI});
always @(posedge PCLK or negedge PRESETn)
begin
if (~PRESETn)
begin
RegBaudCntrI <= {16{1’b0}};
RegBaudCntrF <= {4{1’b0}};
BaudUpdated <= 1’b0;
RegBaudTick <= 1’b0;
end
else
begin
RegBaudCntrI <= NxtBaudCntrI;
214
System-on-Chip Design with Arm® Cortex®-M processors
// Connect to external
assign BAUDTICK = RegBaudTick;
// --------------------------------------------
// Transmit
// Increment TickCounter
assign NxtTxTickCnt = ((TxState==4’h1) & RegBaudTick) ? 4’h0 :
TxTickCnt + {{3{1’b0}},RegBaudTick};
// TxState machine
// 0 = Idle, 1 = Wait for Tick,
// 2 = Start bit, 3 = D0 .... 10 = D7
// 11 = Stop bit
always @(TxState or TxBufFull or TxTickCnt or TxStateInc or RegCTRL)
begin
case (TxState)
0: begin
NxtTxState = (TxBufFull & RegCTRL[0]) ? 1 : 0; // New data is written to buffer
end
1: begin // Wait for next Tick
NxtTxState = TxState + {3’b0,TxStateInc};
end
2,3,4,5,6,7,8,9,10: begin // Start bit, D0 - D7
NxtTxState = TxState + {3’b0,TxStateInc};
end
11: begin // Stop bit , goto next start bit or Idle
NxtTxState = (TxStateInc) ? ( TxBufFull ? 4’h2:4’h0) : TxState;
end
default: // Illegal state
NxtTxState = 4’h0;
endcase
end
// Load/shift TX register
assign NxtTxShiftBuf = (((TxState==4’h0) & TxBufFull) |
((TxState==4’hB) & TxBufFull &
TxStateInc)) ? RegTxBuf :
(((TxState>4’h2) & TxStateInc) ?
{1’b1,TxShiftBuf[7:1]} : TxShiftBuf[7:0]);
// Data output
assign NxtTxD = (TxState==2) ? 1’b0 :
(TxState>4’h2) ? TxShiftBuf[0] : 1’b1;
// Registering outputs
always @(posedge PCLK or negedge PRESETn)
begin
if (~PRESETn)
215
System-on-Chip Design with Arm® Cortex®-M processors
begin
TxBufFull <= 1’b0;
TxShiftBuf <= {8{1’b0}};
TxState <= {4{1’b0}};
TxTickCnt <= {4{1’b0}};
RegTxD <= 1’b1;
end
else
begin
TxBufFull <= NxtTxBufFull;
TxShiftBuf <= NxtTxShiftBuf;
TxState <= NxtTxState;
TxTickCnt <= NxtTxTickCnt;
RegTxD <= NxtTxD;
end
end
// Connect to external
assign TXD = RegTxD;
assign TXEN = RegCTRL[0];
// --------------------------------------------
// Receive synchronizer and low pass filter
// Averaging values
assign RxShiftIn = (RxDLPF[1] & RxDLPF[0]) |
(RxDLPF[1] & RxDLPF[2]) |
(RxDLPF[0] & RxDLPF[2]);
// --------------------------------------------
// Receive
// Increment TickCounter
assign NxtRxTickCnt = ((RxState==4’h0) & ~RxShiftIn) ? 4’h8 :
RxTickCnt + {{3{1’b0}},RegBaudTick};
// Increment state
assign RxStateInc = ((&RxTickCnt) & RegBaudTick);
216
System-on-Chip Design with Arm® Cortex®-M processors
// Shift register
assign NxtRxShiftBuf= (RxStateInc) ? {RxShiftIn, RxShiftBuf[6:1]} : RxShiftBuf;
// Buffer full status
assign NxtRxBufFull = RxBufSample | (RxBufFull & ~RxDataRead);
// RxState machine
// 0 = Idle, 1 = Start of Start bit detected
// 2 = Sample Start bit, 3 = D0 .... 10 = D7
// 11 = Stop bit
always @(RxState or RxShiftIn or RxTickCnt or RxStateInc or RegCTRL)
begin
case (RxState)
0: begin
NxtRxState = ((~RxShiftIn) & RegCTRL[1]) ? 1 : 0; // Wait for Start bit
end
1: begin // Wait for middle of start bit
NxtRxState = RxState + {3’b0,RxStateInc};
end
2,3,4,5,6,7,8,9: begin // D0 - D7
NxtRxState = RxState + {3’b0,RxStateInc};
end
10: begin // Stop bit , goto back to Idle
NxtRxState = (RxStateInc) ? 0 : 10;
end
default: // Illegal state
NxtRxState = 4’h0;
endcase
end
// Registering
always @(posedge PCLK or negedge PRESETn)
begin
if (~PRESETn)
begin
RxBufFull <= 1’b0;
RxShiftBuf <= {7{1’b0}};
RxState <= {4{1’b0}};
RxTickCnt <= {4{1’b0}};
RegRxBuf <= {8{1’b0}};
end
else
begin
RxBufFull <= NxtRxBufFull;
RxShiftBuf <= NxtRxShiftBuf;
RxState <= NxtRxState;
RxTickCnt <= NxtRxTickCnt;
RegRxBuf <= NxtRxBuf;
end
end
// --------------------------------------------
// Interrupts
assign NxtTXINT = RegCTRL[2] & TxBufFull & TxBufClear; // Falling edge of buffer full
217
System-on-Chip Design with Arm® Cortex®-M processors
// Registering outputs
always @(posedge PCLK or negedge PRESETn)
begin
if (~PRESETn)
begin
RegTXINT <= 1’b0;
RegRXINT <= 1’b0;
end
else
begin
RegTXINT <= NxtTXINT|(RegTXINT & ~(WriteEnable14 & PWDATA[1]));
RegRXINT <= NxtRXINT|(RegRXINT & ~(WriteEnable14 & PWDATA[0]));
end
end
// Connect to external
assign TXINT = RegTXINT;
assign RXINT = RegRXINT;
endmodule
218
System-on-Chip Design with Arm® Cortex®-M processors
8.3 ID registers
Many Arm peripherals and CoreSight debug components contain a range of read-only ID registers at
the end of the 4KB memory spaces. These ID values enable debug tools to identify debug components
automatically (as described in Section 5.2.7 Debug components discovery), and they allow the
software to determine the revision of the design so that it knows what features are available. In some
cases, such information is also useful for software to implement a workaround if certain defects in the
peripheral design can be overcome by using software measures.
One of the ID registers, PID3, contains an Engineering Change Order (ECO) bit field which can be
generated from input signals. The ECO operation enables you to carry out minor design changes in
the late stage of a chip design process; for example, at the silicon mask level. By providing an external
input to provide ECO input field, it is possible to use tie-off cells on those peripheral inputs to reflect
ECO maintenance revisions.
The ID registers are not strictly required for peripheral operation. In ultra-low-power designs, you can
remove these ID registers to reduce gate count and power consumption.
219
System-on-Chip Design with Arm® Cortex®-M processors
When you modify a peripheral from Arm’s product range, it is recommended that you alter the JEDEC
ID value and the part number in the ID registers to indicate that the peripheral is no longer identical to
the original version from Arm. Alternatively, you can remove these ID registers.
Include a 64-bit sampling register to allow a 64-bit counter value to be sampled in one go, and then
read out using two accesses;
Include a 64-bit transfer register to allow new 64-bit values to be set up using multiple accesses,
and then be transferred into the counter using a separated control register.
220
System-on-Chip Design with Arm® Cortex®-M processors
221
System-on-Chip Design with Arm® Cortex®-M processors
222
System-on-Chip Design with Arm® Cortex®-M processors
CHAPTER
Putting the 9
system together
223
System-on-Chip Design with Arm® Cortex®-M processors
cm3_mcu
cm3_system_top
cm3_processor_subsystem Debug
Debug interface interface
CORTEXM3INTEGRATIONDS
Cortex-M3 DesignStart apb_subsystem
(cortexm3ds_logic, GPIO #0
Pin multiplexing
(apb_gpio)
cm3_code_mux Timer #0
(apb_timer) PORT1
cm3ahb_to_ahb5 cm3ahb_to_ahb5 [7:0]
Timer #1
(apb_timer)
ahb_slave_mux ahb_slave_mux
UART #0
(apb_uart)
Default slave Default slave
(ahb_defslave) (ahb_defslave) Sys ctrl reg CLKIN
clk_reset_ctrl
Behavioural Behavioural nSRSTIN
Sys ctrl
ROM model RAM model
(ahb_rom) (ahb_ram)
The processor subsystem contains the processor and the bus infrastructure components, as well as
the APB subsystem where digital peripherals are located.
The APB subsystem contains the AHB to APB bridge, as well as peripherals (digital parts only).
In this example, we have:
Two timers;
One UART;
224
System-on-Chip Design with Arm® Cortex®-M processors
The behavior memory models are one level up in the design hierarchy (cm3_system_top). This level
also contains the pin multiplexing.
The top-level of the microcontroller (cm3_mcu) contains the top-level of the Cortex-M3 system the
clock reset control (e.g., clock gating, reset synchronizers) and I/O pads.
This is a simple design just for illustration and educational purposes. In the real world, microcontroller
designs are likely to be much more complex; for example:
Most commercial microcontrollers have a lot more peripherals, including analog peripherals.
In real microcontrollers, there might be additional bus masters such as DMA controllers.
In SoC design, RAM, ROM (or embedded flash macros) would be likely to have power management/
control features.
If embedded flash is used, flash programming support requires additional control registers and a
voltage booster (e.g., DC-DC converter).
Many microcontrollers also have on-chip DC-DC converters to provide lower voltage (~ 1 to 1.2
volts) for digital circuitries. The supply voltage for the chip normally ranges from 1.8 volts to 3.6 volts.
Additional circuits are needed for chip manufacturing testing. This topic is commonly known as DFT
(Design for Testing) and will be covered later.
Power management in modern microcontrollers can be quite complex. For example, there can be
multiple power domains and many clock domains and runtime power mode options. In addition, some
of them have separate retention SRAM for holding crucial data while in very low-power sleep modes.
A peripheral/APB subsystem is designed as one unit – this allows it to be reused in multiple
designs. The system control function is split into two halves: one part of it is in the programmable
registers inside the APB subsystem, while the other half is located at a higher level. The reason is
that in many SoC designs, the system control function might involve non-synthesizable IP such as
225
System-on-Chip Design with Arm® Cortex®-M processors
voltage control or clock control logic. If a system contains analog peripherals, then it is also possible
to use the same arrangement to separate synthesizable and non-synthesizable parts of any analog
peripherals in the design hierarchy.
Pin multiplexing is also placed outside of the processor subsystem – this enables the same
peripheral subsystem to be used with different projects, which can have different chip packages and
hence different pin multiplexing arrangements. Another reason for this is that the pin multiplexing
might also need to handle pins from analog components at this level.
I/O pads are instantiated in the design (in this example, behavioral models of I/O pads are used
for simulations). In SoC design, it is often essential to instantiate I/O pads based on electrical
characteristics of the pin functions. Usually, various types of input, output, and tristate pads are
available for each semiconductor process node with different drive strengths and different speeds.
The behavioral models for I/O pads can be used directly in FPGA synthesis, as FPGA development
tools can help you to define I/O characteristics using a project’s configuration files.
Clock gating is handled at a high level in the design hierarchy. This can help simplify the setup of clock
tree synthesis. In this case, the clock gating is done in a design unit called clk_reset_ctrl at the top-level.
If your SoC design needs to support multiple power domains, it is also important to partition
designs based on the power domains. In design units where multiple blocks of different power
domains are present; it is best not to have any logic functions inside those hierarchical files to
simplify power domain handling in implementation flow.
To simulate the simple MCU design, we also need a testbench. A testbench is a simulated environment
in which the microcontroller system will be working. In addition to the processor system, which we
will call DUT (Device-Under-Test), the simulation environment typically contains a number of other
components:
Trickbox(es) that provide input stimulus to the DUT and might also interface with outputs from the
DUT. This is optional. In the case of testing a microcontroller-like system, it is possible to use some
form of loopback signal connections as a trickbox to test peripheral interfaces.
In some cases, simulation models of external memories or external peripherals might also be
needed to test some of the interfaces on the DUT. For example, if your system design supports
external memories, then you will need to add a model of the external memory in the testbench for
testing external memory accesses.
226
System-on-Chip Design with Arm® Cortex®-M processors
For a processor testbench, it is also common to add some mechanism to allow message display
under the control of software in the processor. For example, when the processor executes a
“printf(“Hello world\n”)” statement, the “Hello world” message can be displayed in the simulator’s
console.
Other verification components – one of the techniques for verification is to add a range of
verification components like bus protocol checkers in the simulation (some of these components
can be inside the DUT). If something goes wrong, for example, illegal bus behavior is observed, the
verification component can stop the simulation and report the errors.
In our example testbench, a UART monitor (uart_mon.v) is used to display text messages generated by
software (e.g., using printf function). This unit is also used to end the simulation when it receives
a special character.
Testbench
(tb_cm3_mcu)
To run a system-level simulation, the program memory of the processor system also needs to contain
a valid program. Therefore, we need to prepare some minimal software code and a compilation setup
to enable us to do a basic simulation.
File Descriptions
cm3_mcu.h Device-specific header based on CMSIS-CORE. This contains the peripheral register definitions and
interrupt assignments.
startup_cm3_mcu.s Assembly startup file – this contains the reset handler, default handlers, and the vector table
uart_util.c Simple UART functions to configure the UART and basic UART transmit and receive function. This is used
for supporting message display during simulation.
system_cm3_mcu.c This provides SystemInit(void) usually used for system clock initialization
system_cm3_mcu.h Header file that declares functions available in system_cm3_mcu.c
hello.c Simple hello world message display and demo
Table 9.1: Source file created for minimal software support based on CMSIS-CORE.
227
System-on-Chip Design with Arm® Cortex®-M processors
The assembly startup code (startup_cm3_mcu.s) is toolchain-specific. For this demo, we use Arm
assembler from Arm Compiler 5 (this is available as a part of Arm Development Studio as well Keil
Microcontroller Development Kit, MDK-ARM).
To support the printf function with redirection to the UART, the Arm software development toolchain
Keil MDK-ARM provides the following files:
File Descriptions
Table 9.2: Source file created for minimal software support based on CMSIS-CORE.
For applications that do not use I/O functions like printf, puts, gets, etc., these two files are not
required.
One part of the header is Interrupt Number Definition. In our simple MCU example, it has only 6
interrupts, the numbers of which are defined as follows:
/*
* ==========================================================================
* ---------- Interrupt Number Definition -----------------------------------
* ==========================================================================
*/
228
System-on-Chip Design with Arm® Cortex®-M processors
Another part of this header file that needs some effort to create is the register definitions. In CMSIS-
CORE, peripheral registers are defined as C struct and use a pointer declaration to create the
peripheral definitions. An example of the C struct for the simple APB timer is as follows:
You can also add C macros to declare the bit fields in the registers.
229
System-on-Chip Design with Arm® Cortex®-M processors
The final part of this file that you need to add is the memory map and peripheral definitions:
/******************************************************************************/
/* Peripheral memory map */
/******************************************************************************/
/** @addtogroup CM3MCU_MemoryMap CM3MCU Memory Mapping
@{
*/
/* APB peripherals */
#define CM3MCU_GPIO0_BASE (CM3MCU_PERIPH_BASE + 0x0000UL)
#define CM3MCU_GPIO1_BASE (CM3MCU_PERIPH_BASE + 0x1000UL)
#define CM3MCU_TIMER0_BASE (CM3MCU_PERIPH_BASE + 0x2000UL)
#define CM3MCU_TIMER1_BASE (CM3MCU_PERIPH_BASE + 0x3000UL)
#define CM3MCU_UART0_BASE (CM3MCU_PERIPH_BASE + 0x4000UL)
/******************************************************************************/
/* Peripheral declaration */
/******************************************************************************/
/** @addtogroup CM3MCU_PeripheralDecl CM3MCU_CM3 Peripheral Declaration
@{
*/
230
System-on-Chip Design with Arm® Cortex®-M processors
Just like the interrupt number assignment in the device-specific header file (cm3_mcu.h), the vector
assignment needs to match the IRQ assignments in the Verilog RTL file. The names of the interrupt
handlers must also match the name definitions of the default handlers in the startup code, which is
the second part of modification as shown below:
231
System-on-Chip Design with Arm® Cortex®-M processors
Default_Handler PROC
EXPORT GPIO0_Handler [WEAK]
EXPORT GPIO1_Handler [WEAK]
EXPORT TIMER0_Handler [WEAK]
EXPORT TIMER1_Handler [WEAK]
EXPORT UARTTX0_Handler [WEAK]
EXPORT UARTRX0_Handler [WEAK]
UARTRX0_Handler
UARTTX0_Handler
GPIO0_Handler
GPIO1_Handler
TIMER0_Handler
TIMER1_Handler
B . ; dead loop
ENDP
#include “cm3_mcu.h”
void uart_config(void);
void uart_putc(char c);
char uart_getc(void);
int stdout_putchar (int ch);
void uart_config(void)
{
CM3MCU_UART0->BAUDDIV = 32;
CM3MCU_UART0->CTRL = 1; // Enable TX
return;
}
void uart_putc(char c)
{
while (CM3MCU_UART0->STATE & 1); // wait if TX FIFO full
CM3MCU_UART0->TXD = (uint32_t) c;
return;
}
char uart_getc(void)
{
while ((CM3MCU_UART0->STATE & 2)==0); // wait if RX FIFO empty
return ((char) CM3MCU_UART0->RXD);
}
Combining this utility file and the UART monitor in the testbench, we can output text messages and
display them in the simulation console.
232
System-on-Chip Design with Arm® Cortex®-M processors
In our implementation of system_cm3_mcu.c, the only step carried out by SystemInit is setting up
variables which define the system clock speed. These variables can then be used by the application
codes - see below:
#include “cm3_mcu.h”
void uart_config(void);
void uart_putc(char c);
char uart_getc(void);
int stdout_putchar (int ch);
void uart_config(void)
{
CM3MCU_UART0->BAUDDIV = 32;
CM3MCU_UART0->CTRL = 1; // Enable TX
return;
}
void uart_putc(char c)
{
while (CM3MCU_UART0->STATE & 1); // wait if TX FIFO full
CM3MCU_UART0->TXD = (uint32_t) c;
return;
}
char uart_getc(void)
{
while ((CM3MCU_UART0->STATE & 2)==0); // wait if RX FIFO empty
return ((char) CM3MCU_UART0->RXD);
}
The file system_cm3_mcu.h is only used to provide function prototypes of functions available in
system_cm3_mcu.c. Application code might call those functions if the application needs to update
clock configurations during runtime.
233
System-on-Chip Design with Arm® Cortex®-M processors
9.4.6 Retargeting
With all these files prepared, you can now create an application that boots up and utilizes the peripherals
to do some basic control operations. You can also create a hello world application as follows:
#include “cm3_mcu.h”
#include <stdio.h>
int main(void)
{
uart_config();
printf (“Hello world\n”);
uart_putc(4);// end simulation
while(1);
}
When using the printf function, the compilation needs some way of knowing where the message
needs to be directed to. The redirection of these messages could be handled by:
Semi-hosting support in the debugger (note: this is not supported in Keil MDK-ARM);
Instrumentation Trace Macrocell (ITM) in Armv7-M or Armv8-M Mainline processors via a trace
connection (e.g., Serial Wire Output).
The file retarget_io.c supporting multiple redirection methods are listed above. To select UART as standard
output (stdout), the file RTE_Components.h defines the following C macros used by retarget_io.h:
When these C macros are defined, the retarget_io.c selects a user-defined stdout_puchar(int ch)
function for standard outputs.
/**
Put a character to the stdout
234
System-on-Chip Design with Arm® Cortex®-M processors
By defining this stdout_puchar(int ch) function in uart_util.c to output the character to UART, we will
be able to collect and display printf messages in simulations.
Device drivers – Typically, chip vendors provide a range of device drivers to access peripheral
functions and hence help software developers to create applications quickly. One of the products in
the CMSIS is CMSIS-Diver, which provides device driver API definitions for a range of communication
peripherals. This can help software developers to port their applications to your device.
Device and Board support packages – Typically, chip vendors need to bundle together the relevant
software packages to enable users to download all the essential files quickly and easily. To make
it even better, the CMSIS team has created a CMSIS-PACK framework that is integrated with
development toolchains (an Eclipse plug-in for Eclipse-based IDE is also available) to enable software
developers to download software packages and dependent packages (if needed). The CMSIS-PACK
standard also specifies XML description files that provide essential information about the packages
such as the devices supported. Utilities for creating CMSIS-PACK are available from Arm at: https://
arm-software.github.io/CMSIS_5/Pack/html/createPackUtil.html. For more information, please visit
https://arm-software.github.io/CMSIS_5/Pack/html/index.html
CMSIS-SVD – Another useful part of the CMSIS is the System View Description (SVD), which is an
XML-based file to describe a programmer’s model of peripherals in the chip. Using this file, debug
tools can visualize peripheral register states to allow software developers to analyze the status of the
system more easily.
Flash programming – If your chip contains embedded flash memories, you need to prepare flash
programming support for your customers. This is provided as part of the CMSIS-PACK, and you can
find more information about creating CMSIS-PACK compatible flash programming algorithms at
https://arm-software.github.io/CMSIS_5/Pack/html/flashAlgorithm.html
235
System-on-Chip Design with Arm® Cortex®-M processors
system_cm3_mcu.o: system_cm3_mcu.c
armcc $(ARM_CC_OPTS) $< -o $@
uart_util.o: uart_util.c
armcc $(ARM_CC_OPTS) $< -o $@
startup_cm3_mcu.o: startup_cm3_mcu.s
armasm $(ARM_ASM_OPTS) $< -o $@
hello.hex : hello.elf
fromelf --vhx --8x1 $< --output $@
hello.lst : hello.elf
fromelf -c -d -e -s $< --output $@
clean:
rm *.o
rm *.elf
rm *.lst
rm *.hex
To allow the behavior ROM model to load the program image into it, we need to generate a hex file
which contains a list of 8-bit hexadecimal values. This is generated with the fromelf utility. If you are
using Keil MDK, you can add this in your project options to execute fromelf after compilation is done.
236
System-on-Chip Design with Arm® Cortex®-M processors
Figure 9.3: Specify additional step to be carried out after compilation in Keil MDK-ARM.
../cortex_m3/cortexm3integration_ds/verilog/cm3_code_mux.v
tbench_cm3.v
237
System-on-Chip Design with Arm® Cortex®-M processors
This Verilog command file contains the search path, and any additional design files need to be included
in the compilation. With this file, we can launch the compilation with the following commands:
The option “-incr” means incremental compilation, so if a file hasn’t been changed since the last
compilation, it can be skipped.
“-lint” is to enable lint checking, which flags up various possible design errors.
“-novopt” is used to help maintain all the internal information of the design which can help analysis
of its behavior in waveform windows. Without this option, the optimization will remove a lot of
internal signal details to speed up the simulation but can be very hard to debug. This option should
not be used for regression testing.
After the compilation is carried out successfully, and assuming that the hex file of the software image
is prepared and is in the current directory as image.dat, we can start the simulation with:
The use of “-gui” launches the GUI to allow us to add a waveform window. This is optional. If
everything works, you can start the simulation with “run -all.” The “Hello world” message should be
displayed, and the simulation stop automatically when the program sends a special character (0x4) to
the UART monitor.
# Reading pref.tcl
# // Questa Sim
# // Version 10.6e linux Jun 22, 2018
# //
# // Copyright 1991-2018 Mentor Graphics Corporation
# // All Rights Reserved.
# //
# // QuestaSim and its associated documentation contain trade
# // secrets and commercial or financial information that are the property of
# // Mentor Graphics Corporation and are privileged, confidential,
# // and exempt from disclosure under the Freedom of Information Act,
# // 5 U.S.C. Section 552. Furthermore, this information
# // is prohibited from disclosure under the Trade Secrets Act,
# // 18 U.S.C. Section 1905.
# //
# vsim -novopt -gui tbench_cm3
# Start time: 22:31:20 on Apr 12,2019
238
System-on-Chip Design with Arm® Cortex®-M processors
# ** Warning: (vsim-8891) All optimizations are turned off because the -novopt switch is in
effect. This will cause your simulation to run very slowly. If you are using this switch
to preserve visibility for Debug or PLI features please see the User’s Manual section on
Preserving Object Visibility with vopt.
# Loading work.tbench_cm3
# Loading work.tb_clk_reset_gen
# Loading work.cm3_mcu
# Loading work.clk_reset_ctrl
# Loading work.behavioral_clk_gate
# Loading work.cm3_system_top
# Loading work.cm3_processor_subsystem
# Loading work.CORTEXM3INTEGRATIONDS
# Loading work.cortexm3ds_logic
# Loading work.cm3_code_mux
# Loading work.cm3ahb_to_ahb5
# Loading work.ahb_decoder_code
# Loading work.ahb_slavemux
# Loading work.ahb_defslave
# Loading work.ahb_decoder_system
# Loading work.apb_subsystem
# Loading work.ahb_to_apb
# Loading work.apb_gpio
# Loading work.apb_timer
# Loading work.apb_uart
# Loading work.ahb_rom
# Loading work.ahb_ram
# Loading work.sys_ctrl
# Loading work.behavioral_input_pad
# Loading work.behavioral_input_pullup_pad
# Loading work.behavioral_tristate_pullup_pad
# Loading work.behavioral_output_pad
# Loading work.behavioral_tristate_pad
# Loading work.uart_mon
run -all
# ** Warning: (vsim-8233) ../mcu_system/ahb_ram.v(111): Index 1zzzzzzzzzzzzzzzz into array
dimension [0:65535] is out of bounds.
# Time: 0 ns Iteration: 0 Instance: /tbench_cm3/u_cm3_mcu/cm3_system_top/u_ahb_ram
# ** Warning: (vsim-8233) ../mcu_system/ahb_ram.v(112): Index 1zzzzzzzzzzzzzzzx into array
dimension [0:65535] is out of bounds.
# Time: 0 ns Iteration: 0 Instance: /tbench_cm3/u_cm3_mcu/cm3_system_top/u_ahb_ram
# ** Warning: (vsim-8233) ../mcu_system/ahb_ram.v(113): Index 1zzzzzzzzzzzzzzxz into array
dimension [0:65535] is out of bounds.
# Time: 0 ns Iteration: 0 Instance: /tbench_cm3/u_cm3_mcu/cm3_system_top/u_ahb_ram
# 563700 UART: Hello world
# 579700 UART:
# 579700 UART: Simulation End
# ** Note: $finish : ./uart_mon.v(119)
# Time: 579700 ns Iteration: 1 Instance: /tbench_cm3/uart_mon
# 1
# Break in Module uart_mon at ./uart_mon.v line 119
# End time: 22:32:37 on Apr 12,2019, Elapsed time: 0:01:17
# Errors: 0, Warnings: 4
239
System-on-Chip Design with Arm® Cortex®-M processors
The bus system might need to support multiple bus masters including a DMA controller, high-speed
communication interfaces (e.g., peripherals like USB controller and Ethernet controller can have a
bus master interface).
Additional power management schemes might be needed to enable longer battery life. For example,
multiple bus systems can be used, and each of them can be running at different clock speeds to
reduce active power.
Additional security features are often needed, especially when dealing in IoT applications. For
example, security control features might be added in peripheral bus bridges to allow peripherals
to be allocated to specific tasks, and to make some of the peripherals privileged access only. If a
system design is based on Armv8-M processors, the use of the TrustZone security extensions also
requires additional system IP to partition memories into Secure and Non-secure spaces.
To help reduce time-to-market, Arm has produced and delivered a range of system components and
system design products. The Corstone Foundation IP are system design packages containing:
Subsystems based on Arm Cortex processor, which can be used as a standalone system or as
a subsystem in complex SoC designs.
A wide collection of bus infrastructure components including bus bridges, TrustZone security
management components (for Corstone-20x and Corstone-7xx).
Software support.
These IPs are verified, and system designers can integrate these subsystems into their design to
accelerate their projects. The Corstone Foundation IP series contains multiple products:
Corstone-050, 100, 102 – IoT subsystem for Cortex-M3 processor and including deliverables in the
previous Cortex-M System Design Kit (CMSDK).
Corstone-200, 201 – IoT subsystem for Cortex-M33 with TrustZone security, also including CMSDK.
Corstone-700 – scalable IoT subsystem for small Cortex-A processors and optional Cortex-M
subsystems.
240
System-on-Chip Design with Arm® Cortex®-M processors
For designers that prefer a simpler system design, the previous generation of CMSDK is included
in full versions of Corstone Foundation IP, which provides example system designs for Cortex-M0,
Cortex-M0+, Cortex-M3, and Cortex-M4 processors.
9.7 Verification
System-level simulation is great for testing the integration of the system (connections between units)
and testing of the basic functionality of the design units. In commercial projects, getting a system-level
simulation running is only one part of the verification task. There are other verification methodologies
needed for SoC designs; notably:
LINT checking
A lint checking tool can be used to analyze the design source code (Verilog / VHDL) to flag up coding
errors (e.g., a missing signal in the sensitivity list). The term “lint” originates from the name of a tool for
C code checking (https://en.wikipedia.org/wiki/Lint_(software)).
Some Verilog/VHDL simulation tools include LINT checking capability. For example, in the simulation
example Section 9.5, we added the “-lint” option in the vlog command to enable lint checking in
Modelsim/Questasim.
Formal verification
Since system-level simulation cannot reach corner cases, formal verification is usually needed for
component level (smaller design units) verifications. With formal verification, you need to define input
constraints for the DUT (Device Under Test) and rules for expected outputs. The formal verification
tool then works out all the possible inputs and scenarios to see if the DUT truly follows the rules
for the expected output. Take an example in which the DUT is a bus bridge component; a formal
verification environment might look like the one shown in Figure 9.4:
Due to the long duration for the formal verification runs, usually formal verification is not used at
system-level.
241
System-on-Chip Design with Arm® Cortex®-M processors
Netlist simulations
After the design is synthesized and has potentially gone through placement and routing flow, it is
common to back annotate the netlist and timing to double-check the design using netlist simulations
with a subset of tests used in RTL level simulation. Due to additional timing details, typically in the
form of SDF (standard delay format) files, netlist simulations are much slower than RTL simulations
and therefore re-running all verifications on netlist is typically unfeasible.
While static timing analysis can detect timing violations, netlist simulation is still useful to detect
missing or errors in timing constraints. Netlist simulation is also needed to verify scan patterns
generated by ATPG (Automatic Test Pattern Generation, see Section 9.9) tools, as scan patterns often
contain user-defined setup patterns at the start of the scan test that need to be verified at netlist level.
FPGA prototype
In addition to demonstration purpose, FPGA prototyping is very useful for validating debug
connections and related aspects such as pin multiplexing. Since software developers can create
applications and execute them on FPGA platforms as in real applications (potentially at a reduced
speed), application developers can use FPGA prototypes to develop application-level testing and run
the test much quicker than in RTL simulations.
Power-aware simulations – entering of sleep modes and waking up can be simulated, with some of
the power domains powered down during sleep.
Low-power formal verification – if a logic operation has moved from one power domain to another
due to synthesis optimizations, potentially this can cause incorrect behavior during power down.
Low-power equivalent checking can identify such mismatch in behavior.
242
System-on-Chip Design with Arm® Cortex®-M processors
Placement and
Verilog / Synthesis, routing, clock tree
Netlist Layout Signoff
VHDL scan insertion synthesis (CTS), post
route optimization
Power Power
analysis analysis
Brief descriptions of some of the key steps in the implementation flow are explained below.
Some of the checking (e.g., static timing analysis) might be carried out multiple times during the
implementation flow.
Synthesis – Conversion of RTL to netlist. In addition to the RTL, ASIC synthesis tools also need the cell
library and various timing constraints for the clock, reset, and interface signals. Synthesis processes
might include automatic clock gate generation to help low-power optimization.
Scan insertion – adding of scan chains to the netlist for chip manufacturing testing. See Section 9.9
for more information.
Static timing analysis (STA) – STA tools calculate the timing of the netlist based on timing models
of the cell library and check if the design can meet the timing constraint requirements. The analysis
involves multiple “corners” to detect potential failures like setup timing violation (circuit running too
slow) and hold time violation (a signal changes so fast that a register capturing it can end up with an
incorrect value).
Placement and routing – The tool places the logic gates in the chip layout and connects the signals
between logic gates. Potentially, the placement can be divided into two stages: initial placement
243
System-on-Chip Design with Arm® Cortex®-M processors
provides a rough location of logic gates to enable better timing optimizations during synthesis, and
then a second stage finalizes the locations of the gates.
Clock tree synthesis (CTS) – CTS inserts clock tree buffers to ensure that clock signals reach the
different registers at the right time. In a SoC design, two registers receiving the same clock signal
might see the clock edge at different times if they are not in the same area on the chip (due to the
propagation delay of the clock signal). CTS can balance the clock tree accordingly so that signal paths
between the registers still work after placement.
Logic equivalence checking (LEC) – this ensures that the functionality of the design units matches the
RTL code. If the LEC check fails, the reason could be an error in the design, missing constraints in the
synthesis or LEC setup, or potentially something has gone wrong in the synthesis.
Automatic Test Pattern Generation (ATPG) – ATPG analyzes the netlist and generates scan patterns for
chip manufacturing testing. (There can be multiple types of scan patterns for different test purposes).
On-chip signal integrity (SI) analysis – This can help prevent chip failures caused by cross-talk between
wires, and static and dynamic voltage drops (known as IR drop) in power rails during circuit activities.
Power analysis – this enables designers to confirm that the chip can operate within the power budget.
There can be additional steps required, for example, when dealing with embedded memory macros.
Designers need to use memory compilers to generate the memory macros.
Earlier, we mentioned about scan insertion and ATPG in the implementation flow. This is one of the
most common approaches used today for testing of digital circuitries. To use scan tests, the flip-flops
in the circuit design include additional ports for scan test operations:
Reset / nReset
Reset / nReset
D Q D
SI SO D Q Q
SE SI
CLK SO
SE CLK
A scan flip-flop
CLK
Figure 9.6: Scan D flip-flop.
244
System-on-Chip Design with Arm® Cortex®-M processors
After synthesis with scan enable (i.e., scan registers are used), we can then create scan chains using
scan insertion in synthesis tools. There can be multiple scan chains in a design (but not too many, as
there are restrictions in the chip testers), and the more scan chains you have, the shorter each chain
would be and can shorten the time required for running scan tests.
Reset /
nReset
D Q D Q D Q
SCANIN[0] SI SO SI SO SI SO SCANOUT[0]
SE SE SE
CLK CLK CLK
D Q D Q D Q
SCANIN[1] SI SO SI SO SI SO SCANOUT[1]
SE SE SE
CLK CLK CLK
SCANEN
CLK
Figure 9.7: Two scan chains inserted (functional logic not shown).
With the scan chain in place, we can then use a tester hardware call Automatic Test Equipment (ATE)
to shift in any test patterns to the logic by applying a series of clock pulses with scan enable asserted,
and clocked the design with scan enable de-asserted to exercise the functional logic (capture cycle).
Capture Capture
cycle cycle
nReset
SCANEN
CLK
SCANOUT[n:0] Shift out reset states Shift out result state #1 Shift out result state #2
245
System-on-Chip Design with Arm® Cortex®-M processors
During a scan test, it is often essential to bypass internal clock gating and internal reset generation
circuits to allow ATE to have direct control of the clock and reset. Therefore, in Arm IP designs, you
might see signals like this:
SCANMODE – Scan mode indication/control to force components to work in certain ways to help
test coverage. For example, wrappers of memory macros can route write data to read data so that
data paths can be tested easily.
These signals should be high during the whole duration of the scan test, including capture cycles.
In some cases, a setup pattern needs to be added to the beginning of test patterns to enable scan
tests. For example, the scan test pins might be multiplexed with other function pins and need a special
signal sequence to enable the scan pin accesses. Such a setup pattern is defined by the chip designers.
The ATPG tool can generate different types of test patterns. The typical scan test is targeted at
detecting stuck-at faults, which means that it checks the inputs and outputs of logic gates are not
stuck with a value of 0 or 1. It is also possible to generate scan test patterns for IDDQ (Idd quiescent)
testing, which detects unexpected supply current when a certain logic state is reached, which can be
an indication of manufacturing faults.
Scan tests can also be used for at-speed testing of some degree. However, in modern ASICs that run
at over 100MHz, many ATE (chip testers) might not be able to support at-speed testing at such a high
clock speed and in those cases, traditional functional tests might be more suitable.
Another type of manufacturing test is the memory built-in self-test (BIST). Memory BIST controllers
can be inserted by EDA tools and controlled via JTAG or other test interfaces. When memory BIST is
enabled, a memory BIST controller automatically creates test patterns to access the memory macros
to verify their functions. There can be more than one memory BIST controller in a chip when there is
more than one memory block. To help the integration of memory BIST, Arm processors with internal
SRAM (e.g., caches) provide memory BIST support. The exact details are processor-specific, so please
refer to processor integration manuals for more information.
There are also manufacturing tests that focus on electrical characteristics of input and output pins.
Common examples of these tests include:
Output driving voltage (VOL, VOH) (can also cover output drive current test).
246
System-on-Chip Design with Arm® Cortex®-M processors
While output pins can be accessed, and their electrical characteristic measured easily, the output
signals of input pads are inside the chip and creating a test for checking each of the signals can be
tricky. To make it easier, we can add a simple logic to link the input pads’ outputs together and test
them at the same time.
Functional signal
Functional signal
Adjustable
voltage source
Functional signal
Functional signal
Assuming that all the input pins have the same electrical characteristics, they should also have the
same outputs when the input voltage is in a valid range. Since we know how many input pins are in
the XOR tree, we can determine the expected test result easily. By adjusting the input voltage closer
to the input threshold voltage level slowly, and if any of the input pins fail to deliver the correct signal,
the VIL/VIH test result pin will toggle. Of course, if there are two input pads that fail at the same time
at a certain threshold voltage (the chance is low - but it is possible), then the VIL/VIH test result signal
will remain unchanged and will not be able to detect the issue.
Depending on other components in the chip design, there can be additional DFT integration
requirements. For example, an on-chip Phase-Locked Loop (PLL) module usually has some test pins
to enable the PLL to be tested.
247
System-on-Chip Design with Arm® Cortex®-M processors
248
System-on-Chip Design with Arm® Cortex®-M processors
CHAPTER
Beyond the 10
processor system
249
System-on-Chip Design with Arm® Cortex®-M processors
Accuracy – Many peripherals and external communication interfaces require fairly high accuracy for
the clock frequency. For external crystals, the accuracy is often expressed as PPM (Part Per Million),
and commercial products usually require 40ppm or better accuracy for the crystal oscillators.
Internal R-C oscillators typically are not accurate (some can have up to 20% error).
Duty cycle – Typically, the clock source should provide a square wave with a 50% duty cycle. For
Cortex-M based systems, since all registers in the processor and the bus system trigger at the
system clock’s rising edge, a small inaccuracy in the duty cycle is unlikely to cause any problems.
However, when dealing with an interface with DDR (double data rate) operations, accuracy in clock
duty cycle can be very important.
Low-power – A crystal oscillator running at a high clock speed can consume a lot of power. So if high
clock speeds are required, a slower crystal oscillator is often used to generate a reference clock and a
PLL to generate a high-frequency clock out of it. When the system does not need the high clock speed,
the PLL can be turned off to save power. Another common requirement for low-power systems is to
provide an ultra-low-power clock source for Real-Time-Clock (RTC) and a periodic interrupt source for
RTOS (Real-Time Operating System). It is therefore relatively common for microcontrollers to have
a crystal oscillator for the system clock and a 32kHz crystal oscillator for the RTC clock.
Clock distribution inside the chip – In Chapter 9, we mentioned the topic of clock tree synthesis
during clock design flow. In many system designs, we need to make a clock edge that arrives at
different parts of the processor system at the same time. Clock tree balancing during clock tree
synthesis is required to achieve this goal. In addition to the clock signal propagation delay, clock
tree balancing must also take account of additional delays caused by clock gating cells, and,
potentially, additional clock skews and clock jitters (uncertainties). What is more, if all this wasn’t
enough to make your design flow complicated, the clock tree can use up a considerable amount
of power consumption!
In some cases, application-specific IC designs can have additional clock source requirements. For
example, chip designs with USB interfaces might require accurate 12MHz or 48MHz clock sources.
1
Note: There are processors designed with asynchronous logic, but their bus systems still need clock signals to allow them to interface with the external world and peripherals.
250
System-on-Chip Design with Arm® Cortex®-M processors
When the processor system is running background code, the processor’s clock may be driven from
the crystal oscillator.
When the processor system receives a certain processing workload, it needs higher performance
and therefore switches from the crystal oscillator’s clock to a PLL-generated high-frequency clock.
When there is nothing to process, the processor enters sleep mode during which only the RTC is
running (PLL and other oscillator turned off to minimize power consumption). The processor system
might in this state need to use the RTC clock to wake up the memory controllers after an interrupt
request is received.
As a result, the clock system design needs to support switching over from one clock source to another,
- and this process needs to be glitch-less. If a clock glitch enters the processor system, some of the
registers can be affected by metastability issues and enter an undefined state, and the system will not
be able to resume normal operations.
One way to provide glitch-free clock switching is to implement a clock switching FSM in each of
the clock domains and to enable their clock output only if none of the other clock output FSMs are
outputting. Figure 10.1 shows a clock switcher that supports three asynchronous clock sources:
Sync DFF
CLKSEL[0]
Reg
CLK_b_ON
CLK_a_ON
Clock
CLK_c_ON CLK_A gate
CLK_A
Sync DFF
CLKSEL[1]
Reg
CLK_a_ON
CLK_b_ON
Clock Clock
CLK_c_ON CLK_B gate output
CLK_B
Sync DFF
CLKSEL[2]
Reg
CLK_a_ON
CLK_c_ON
Clock
CLK_b_ON CLK_C gate
CLK_C
251
System-on-Chip Design with Arm® Cortex®-M processors
The clock gating cells and the OR gate that merge clock sources need to be instantiated and cannot
be generated by synthesis. This prevents the synthesis process from reordering logic gates that can
result in glitches. The reset values of the flip-flop might need to be customized if you want to allow the
system to be started up with the clock running.
A problem with powering down oscillators and clock sources during sleep is that, in many cases, we
still need to have a clock source for various power management hardware, which requires clocks (they
might have internal state machines to power up memories). Therefore, chip designers must ensure
that either one of the active clock sources is kept active, or that the oscillator’s control is designed so
that after a wake-up event is triggered, an oscillator could start automatically to support the power-up
sequence.
Be careful about the wake-up time required for the oscillator and PLL, including the effect of various
operating temperatures and voltages. Typically, a finite state machine (FSM) is needed to ensure that
the clock signal to the processor system is gated off until it is fully stabilized. This is why the 32kHz
clock is usually left running during sleep mode so that the FSM can rely on this clock to operate.
In addition, some PLL designs provide test modes that can help chip designers to analyze the
performance of the PLL in the implemented chips. To use such features, you might need to implement
test modes that allow certain PLL signals to be observable at the top-level of the chip.
Unlike clock gating (which simply reduces dynamic leakage), power gating also removes static leakage
currents. However, as system states may be lost by removing power, there are some trade-offs to be
made – it may take time to save and recover the necessary state information and remove and restore
power. A special form of power gating called State Retention Power Gating (SRPG) is available, but
SRPG flip-flops are larger than standard flip-flops, have a slightly higher dynamic current, and need an
additional power rail to support retention.
252
System-on-Chip Design with Arm® Cortex®-M processors
To handle power gating and state retention power gating, we need special physical IP for power gating:
Header cells;
Footer cells;
Isolation cells;
Power on domain
Power on
Iso
domain Power gated domain
Isolation
Clamp
cell
GND
Footer cells (N-MOS)
Footer cells and footer cells - On-chip power gating typically involves inserting large, high- Vt
(threshold voltage), low-leakage PMOS (on supply connection side) and/or NMOS (on ground
connection) sleep transistors into the chip’s power network. This means that the device’s power
network consists of:
A number of power islands, to which the power supply may be turned off.
Isolation cells - Isolation cells are special cells that isolate a power-gated block from other parts of
the design which remain powered up. These are required to prevent (1) short-circuit currents and (2)
the inputs of other blocks from floating (a “clamp” will force such inputs to a logical 0 or 1 when the
output of the powered down block has been isolated.)
In some cases, when multiple power domains are used, it is possible to have different voltage levels,
and in these situations, level shifters are also needed.
253
System-on-Chip Design with Arm® Cortex®-M processors
To support power gating design, Arm physical IP products have Power Management Kits (PMK) that
contain a range of these special components for multiple power domain designs. Different process
nodes need different PMKs libraries. For more information about PMKs, please visit Arm’s website:
https://www.arm.com/products/silicon-ip-physical/logic-ip
The inclusion of such power gates brings a range of complexity to the physical design:
IR-drop - The voltage supplied to individual functional blocks will be lower than that provided to the
chip, due to losses within the on-chip power distribution network. Power gate transistors need to
be large in order to reduce the IR drop. In many cases, a number of header cells and footer cells are
required to avoid this.
In-rush current – If a voltage island is large and all the header and footer cells are turning on
simultaneously, it causes a large ‘in-rush current’ that can result in problems to the integrity of the
device’s power grid. In order to prevent this problem, the header and footer cells often have buffered
control output to allow-power switching sequences to be chained. This makes the time needed for
switching on a power domain longer but can prevent large ‘in-rush current’ that can cause physical
damage to the chip.
Power gating control skew rate - Another potential issue is that power gates themselves can have
leakage current, especially if the gating control signal is not well buffered and not reaching optimal
voltage. Therefore, the timing of the power gate operation needs to be carefully considered to ensure
that the slew rate of the power gate control signal is not too large.
Power-gating also has cost implications in terms of die area and silicon routing. It requires specialist
EDA tools and may make timing closure more difficult.
As explained in Chapter 6, there are a number of schemes to preserve the state of a block before it is
shut down. While state retention power gating (SRPG) enables the state in the processor system to
be retained during sleep, it won’t be useful if the states in peripherals and other system components
are lost during sleep. So if state retention power gating is used, the components in the system that
need to be powered down during sleep mode should also have state-retention power gating (SRPG)
implemented to ensure that software can make the most out of SRPG capability.
An alternative method is to store the status of the parts of the applications that need to be preserved
to a state retention RAM, before removing power. When power is restored, the software needs to
restore the application’s state. The state that needs saving might include the processor’s registers and
peripheral configurations.
Of course, it is also possible to power down and simply restart the system when it is needed.
254
System-on-Chip Design with Arm® Cortex®-M processors
ADCs;
DACs;
Voltage references;
Brown-out detector;
Analog comparators;
LCD driver;
At the same time, analog components are getting more intelligent. For example, traditional sensor ICs
integrate more and more digital logic including, processors and are becoming “smart” sensors. This can
bring many advantages by enabling:
Better calibrations;
Since the Cortex-M processors are small, energy-efficient, and easy to use, they are widely used in
microcontrollers and smart analog IC designs. With various sleep mode features and intelligence in
software control, a range of mixed-signal designs can reach better levels of energy efficiency and, at
the same time, provide more features than before.
When dealing with mixed-signal designs, additional complexities are added to the projects; for example:
Design flow of analog components can be very different from digital. In some cases, Verilog-AMS
can be used, and in others, some analog components are designed with manual chip layout. There is
often a need for system-level mixed-signal simulation and verification.
255
System-on-Chip Design with Arm® Cortex®-M processors
In some cases, CMOS manufacturing technology is not suitable for certain analog applications
(e.g., power electronics, radio frequency circuits). In many cases, BiCMOS is used for mixed-signal
designs that combine CMOS, and bipolar transistor technologies on the same silicon die.
Many analog components need separated power domains (separated from the digital logic).
Additional considerations in a chip’s power rail design and floor-planning are needed to reduce
noise to sensitive analog components. For example, layout implementation techniques such as
guard rings and well isolation are common techniques that can be used to prevent switching noise
from a digital circuit reaching an analog circuit through silicon substrate coupling.
Many EDA tool vendors have specific offerings to help when dealing with mixed-signal design flows.
Conversion speed and sampling rate – When dealing with an input signal that has a frequency
range of X Hz, the minimum sampling rate needed is 2*X Hz. And in many cases, even 2*X Hz is not
enough for quality and reliability reasons. For example, imagine that a 4 kHz sine wave is sampled
at 8 kHz: there is a chance that we will sample the input signal at 0 level all the time. Therefore, it
is often necessary to have the sample rate more than 4 times that of the input signals. Please note
that the input bandwidth of an ADC can be much lower than half of the sampling rate.
Resolution – This is expressed as the number of bits in the conversion results. For an on-chip ADC,
typical resolutions range from 8-bit to 14-bits. The difference between the real value and the
measured value is often referred to as the quantization error and is ½ of the LSB of the ADC in ideal
cases, as shown in the diagram below:
Result
11111111
11111110
11111101
11111100
Quantization
error
00000011
00000010
00000001
00000000
0v Vcc
Input
Valid voltage range for measurement
256
System-on-Chip Design with Arm® Cortex®-M processors
(Please note that many on-chip ADCs have a measurement range less than the supply voltage range.)
Signal-to-noise (SNR) ratio – this is often calculated using the number of bits in the results, by
assuming the noise level is +/- LSB. SNR is often expressed in decibels, i.e.
Given that the ADC results are in the form of voltage values, we need to convert the formula by
squaring the input values (or 2x after log10), thus:
Assuming that the ADC is 8-bit (256 levels), the SNR calculation can be formulated as shown:
In some cases, you can use oversampling and filtering techniques to reduce noise. But even if the ADC
noise level is low, in many cases, noises from other parts of the integrated circuit and on the circuit
board, could reduce the signal-to-noise ratio.
Suitability for the target process node – One of the challenges for mixed-signal design is that analog
circuit design does not scale well to small transistor geometries.
Area and power – Silicon area and the type of ADC directly contribute to cost and power. Based on
the project requirements, some of the ADC types might not be suitable for these reasons.
Operating conditions – If you are designing a chip for industrial (or even automotive) applications,
you will need to pay attention to the operating temperature range of ALL the components that you
use for the design (not just ADC, DAC, but also oscillators, memories, etc.). Some of the ADCs and
DACs might not work at high temperature.
Amongst various types of ADC, successive approximation ADCs are very popular in microcontrollers.
Successive approximation contains several parts, as shown in the diagram below:
capacitor
Voltage
reference Digital to
analog
conversion
Start request (result reg) Result
Clock Conversion done
Finite State Machine (FSM)
257
System-on-Chip Design with Arm® Cortex®-M processors
Using bisection, a successive approximation ADC determines input voltage bit by bit, as illustrated below:
Start
Result reg = 0.
Current bit = MSB
Conversion
done
With a 14-bit ADC, the finite state machine iterates the operation loop 14 times. While it is not the
fastest type of ADC, the conversion speed is acceptable for most applications and delivers relatively
high accuracy.
For designs that require very fast conversion speed, a flash ADC can be used. A flash ADC uses an
array of voltage comparators to detect the voltage and then converts the results to binary values.
Buffer
Input voltage Sampling
Binary value
Reference voltage generation
R Comparators
+
-
R
+
-
R Result
+
-
R
+
-
R
+
-
R
258
System-on-Chip Design with Arm® Cortex®-M processors
The illustration in Figure 10.6 is a conceptual one. In ASIC design, the resistor network is likely to be
replaced by switching capacitor networks as the implementation of resistors in ICs can be challenging
in terms of accuracy.
Due to their nature, flash ADCs are usually larger in silicon area, power-hungry, and can offer only
limited resolution (e.g., 8-bit). They are commonly used for video signal processing because other ADC
methods cannot reach the required speed, and 8-bit is sufficient for video processing needs.
For audio processing, a delta-sigma ADC could be used. Delta-sigma ADCs contains several stages;
namely:
Decimation filter.
Input
Delta sigma Digital low pass Result
signal Decimation filter
modulator filter
The delta-sigma modulator runs at several MHz and can generate a bitstream with a feedback loop.
The 1-bit DAC in the feedback is simply a switch that switches between +Vref and -Vref based on the
result of the 1-bit ADC input. The differential amplifier compares the input signal with the 1-bit ADC,
and the integrator works as a lowpass filter of the result.
Integrator
Input signal Comparator
+ 1-bit ADC
+
- output
-
Differential
Amp
+
-
1-bit DAC
With a delta-sigma ADC, the output is based on the density of ones in the output stream. Due to its
nature that high-frequency input will suffer higher quantization errors. For audio processing, this is
less of an issue as the human ear is less sensitive to higher frequency sounds, so we can apply a low
pass filter to suppress quantization noises.
259
System-on-Chip Design with Arm® Cortex®-M processors
Since the output is in the form of a bitstream, the application code cannot use this result directly. With
a decimation filter, we can convert the bitstream information into multi-bit binary values at a lower
sampling rate.
The usable bandwidth of a delta-sigma ADC is a bit lower than the other ADCs mentioned previously.
In addition, delta-sigma ADCs are designed to be used with periodic sampling, whereas successive
approximation and flash ADC can be turned on and off at any time to skip samples when it is not
needed or perform conversions on an ad-hoc basis.
For applications where the sampling rate is very low, for example, in smart sensors when measurement
might be needed only for a few times a second, or even a sample every hour, we can use a slower ADC
like a dual-slope ADC.
Conceptually, dual slope ADCs can be implemented using an op-amp, a voltage comparator, a binary
counter, and a state machine, as illustrated in the diagram below:
Va
Input
Vin (Input signal)
buffer
Zero comparator
-
- Enable Result
Reference + Binary
voltage + counter
Integrator
Clock Clear
When in operation, a dual-slope ADC applies the input voltage to an integrator circuit and integrates
the voltage for a fixed duration. A reference voltage of opposite polarity is then applied to the
integrator and allowed to ramp until the integrator output returns to zero. The input voltage can
then be calculated as a function of the reference voltage, the fixed-length period, and the measured
discharge period; which is expressed as follows:
Longer integration times permit higher resolution measurements, and this kind of ADC is particularly
suited for very accurate measurement of slowly varying signals.
260
System-on-Chip Design with Arm® Cortex®-M processors
Va
Constant discharge rate
based on voltage
reference
Discharge
Clear
Counter’s
Enable Counter stop
here if Vin=0
Start of Counter stop Counter stop here
Counter stop here
measurement here if Vin=Vref if Vin=2*Vref
if Vin=Vref/2
MSB
R*2
R*4
R
R*8
R*16
- Output
Input +
R*32
Op-amp
R*64
R*128
LSB
R*256
-Vref
In some cases, when the quality of the output is not critical, we could consider simpler mechanisms for
analog output. For example, a simple PWM (Pulse Width Modulation) output could be used to drive
a small speaker for audio tone generation.
261
System-on-Chip Design with Arm® Cortex®-M processors
With a simple analog integrator (either on-chip or off-chip), we can also deploy the delta-sigma
technique to produce audio output, as shown here:
Digital comparator
Input 1-bit DAC Analog Integrator
A
value signal
A>B
C
B
Vref
- Output
Digital integrator +
+1
Op amp
-1
I2S interface - For microcontroller-based products, using an external I2S audio codec IC is a common
choice. I2S are digital serial interfaces and therefore can be handled as digital peripherals. They can
also be implemented on FPGA development boards.
I2C/I3C/SPI interface – External ADC and DAC chips often use serial communication protocols such
as I2C/I3C (Inter-IC bus) or SPI (Serial Peripheral Interface). Such an arrangement is suitable for sensor
and control applications. I2C and SPI interface peripherals are widely available from various peripheral
IP providers. APB interface support is commonly available for these peripheral solutions.
PDM (Pulse Density Modulation) – In recent years, digital microphones with PDM interfaces are
becoming more common due to their low cost and simple interfaces. Similar to delta-sigma, a
microphone’s output is in serial bitstream where the density of ‘1’ indicates a higher analog value.
The PDM signals can be converted to analog values using digital filters and be implemented as digital
peripheral interfaces.
Adding an APB bus wrapper for various data and control registers (please note that, potentially,
additional level shifters might be needed in the bus wrapper design).
Adding interrupt handling logic (and registers) to handle completion of conversion or error cases.
Creation of software driver code and create test codes for system-level simulations.
262
System-on-Chip Design with Arm® Cortex®-M processors
To help system-level verification, a number of EDA tool vendors offer solutions for mixed-signal
verification, including simulators that can handle Verilog and Verilog-AMS co-simulation. For example,
Cadence Virtuoso can handle co-simulation of RTL, Verilog-AMS, and wreal:
Figure 10.13: A Cadence Virtuoso demo containing a Cortex-M0 based system and mixed-signal IP (diagram in courtesy of Cadence Design Systems).
https://community.cadence.com/cadence_blogs_8/b/ms/posts/arm-based-micro-controllers-using-
cadence-s-mixed-signal-solution
https://community.cadence.com/cadence_blogs_8/b/ii/posts/easing-mixed-signal-design-with-the-
arm-cortex-m0
Test chip projects can be both exciting and challenging at the same time. One of the test chip projects
that I have been involved in is called ‘Beetle.’ It uses a Cortex-M3 processor and the CoreLink SDK-
100 IoT subsystem (rebranded Corstone-100). In addition, it includes a range of peripherals, a true
random number generator (TRNG) needed for IoT security, embedded flash, an on-chip PLL, and an
integrated Bluetooth interface (Arm Cordio) - see the block diagram below:
263
System-on-Chip Design with Arm® Cortex®-M processors
The chip package that was chosen for the project is a QFN 80 pin package (7mm x 7mm). Several
factors were considered in the selection; notably, it needs to be:
While ‘80 pins’ sounds like quite a lot of pins are available in the Beetle chip design, in practice, we are
quite tight on pin usage due to the following factors:
Additional pins for multiple voltage supply – since we did not integrate an on-chip DC-DC voltage
converter, we need to allocate multiple voltage supply pins for the different voltage levels used by
different parts of the chips.
The Cordio IP requires a number of dedicated pins for power, oscillators, and of course antenna and
RF interface.
Since we would like to enable all debug features to be used at any time, we did not multiplex debug
pins with functional pins.
Several pins are required to control test modes that expose internal signals for testing.
264
System-on-Chip Design with Arm® Cortex®-M processors
On the upside, the limited pin situation is not such a problem as it may first seem because:
The bottom side of the chip provides a large ground connection, so we do not need to have many
ground pins; and,
Oscillators for the Cordio macros are also used by the processor system (shared).
Embedded flash;
The success of the chip was a really rewarding experience for the project team and everyone involved
in the Beetle.
The team was quite small, with only a few designers working on each step of the project. The planning
started Q1-2015, and the project officially kicked off in March-2015 with the system design phase.
Physical design started in the middle of May-2015, and the Beetle taped out at the beginning of
August. We had the chip working and demonstrated it at Arm TechCon 2015. (Here we would like to
thank the TSMC team for their help and a fast turnaround achieved for this project).
At the design level, one of the biggest challenges was the issue of handling supply voltage. In this
design, various supply voltages were needed for different parts of the Beetle chip; for example:
Making life a little more difficult for us, perhaps, the embedded flash macro had a strict requirement
on power-down procedures. Otherwise, it could cause permanent damage to the chip. This led to
additional design requirements for the PCB as it had to provide an input signal that would indicate
that the supply voltage was dropping. This signal is connected to the on-chip power management
control unit, which triggers an internal power control FSM to start the shutdown process. Since the
32kHz oscillator was designed to be always on, the FSM can rely on this to safely shut down the chip
before the supply voltage cuts out.
265
System-on-Chip Design with Arm® Cortex®-M processors
An AHB flash cache for optimizing system performance when running code from flash.
The system design was also created with IoT security requirement in mind. With the TRNG IP, a secure
communication protocol stack like mbedTLS can utilize it for entropy generation, which enables better
security in session keys generation for encrypted IoT traffic. Another important aspect designed to
address IoT security requirements is the Secure firmware update and two banks of embedded flash
(128KB each) are included in the package to help support that.
To enable our software support development to start as early as possible, most parts of the beetle
test chip were also ported to FPGA to enable our software team to create software support, including
software packages and mbedOS support.
While the availability of CoreLink SDK-100 did a great deal to help the design process, there were still
several system design tasks to be accomplished at this stage, including the following:
Cordio IP integration;
DFT.
Thanks to great support and teamwork from engineering groups within Arm across a number of
sites (Cambridge and Sheffield in the UK, Budapest in Hungary, Hsinchu in Taiwan, Austin in the US),
system design tasks progressed quickly and were actioned as we had planned from the start.
While we were waiting for the test chip to be delivered, another project team was already busy on
the PCB design, and a third team had also started working on the software support. So by the time
the chip was back from manufacturing, it could be soldered on the PCB so that software testing could
begin. At the same time, a few of our engineers took the test chip samples to our ATE (Automatic
Test Equipment) laboratory to test the chip with the DFT supports (e.g., scan testing with ATE)
implemented in the chip to see if the design was working properly, while another small engineering
team was busy testing the Cordio wireless interface and Bluetooth software support.
267
System-on-Chip Design with Arm® Cortex®-M processors
When the Beetle test chip was back from the manufacturer, we couldn’t get it to work at first,
despite the fact that the test chips had all worked during the ATE scan testing. After a few days of
head-scratching, we found that the floating pin for scan enable (for scan testing) was the cause of
the problem. While the test mode pins were tied to certain logic levels to suppress the scan testing
logic, the scan enable pin, - which was shared with a GPIO pin - , was floating and therefore causing
a metastability issue. By adding a pull-low resistor on the PCB, the chip booted up! (That was a very
nerve-wracking week for all of us in the different teams!)
The Beetle test chip is not the only Cortex-M test chip that Arm has created. Since that time, we have
designed and fabricated a series of Cortex-M test chips called Musca, using the Cortex-M33 processor
and a more recent CoreLink subsystem. These test chips and development boards have been proving
very useful for Trusted Firmware-M development projects. They are also an important way for Arm
engineers to learn about the design challenges that other chip designers face so that Arm products
can be further enhanced in the process.
268
System-on-Chip Design with Arm® Cortex®-M processors
269
System-on-Chip Design with Arm® Cortex®-M processors
270
System-on-Chip Design with Arm® Cortex®-M processors
CHAPTER
Software development 11
With contribution from Christopher
Seidl, senior marketing manager Keil MDK
and CMSIS, Arm
271
System-on-Chip Design with Arm® Cortex®-M processors
A range of access functions for the processor’s hardware blocks including NVIC, SysTick timer,
Instrumentation Trace Macrocell (ITM).
A range of functions for accessing special registers inside the processor.
The CMSIS-CORE project also provides a template for a microcontroller’s device driver design,
including a standardized way to define interrupt assignments, names for system exceptions and
a device’s system initialization code (e.g., SystemInit() in system_<device>.c).
Like other projects in CMSIS (Cortex Microcontroller Software Interface Standard), CMSIS-CORE is an
open-source project driven by Arm. CMSIS is a collection of multiple projects and is widely adopted by
the embedded software industry.
The CMSIS project started shortly after the arrival of Cortex-M3 based microcontrollers (Note:
Cortex-M3 was the first the Cortex-M processor). The success of Arm Cortex-M based devices
created a demand for standardization in the software development area. Engineers don’t want to cope
with different software development guidelines every time they change from one silicon vendor to
the other.
To answer this industry demand, Arm created the Cortex Microcontroller Software Interface Standard
(CMSIS). It enables consistent software layers and device support across a wide range of development
tools and microcontrollers. CMSIS is a lean software layer with little overhead and provides flexibility
to the device manufacturer in defining standard peripherals. The SoC designer can, therefore, support
the wide variations of the Cortex-M processor-based devices with this common standard.
Reduces the learning curve, development costs, and time-to-market for developers. They can write
software faster through a variety of easy-to-use, standardized software interfaces.
Improves the software portability and re-usability by defining consistent software interfaces. With
its generic software libraries and interfaces, CMSIS provides a consistent software framework.
Provides interfaces for debug connectivity, debug peripheral views, software delivery, and device support.
272
System-on-Chip Design with Arm® Cortex®-M processors
Allows the usage of different compilers (such as Arm, GCC, and IAR) via its compiler independent
software layer.
CMSIS is defined in close cooperation with silicon and software vendors and provides a common
approach to interfacing to peripherals, real-time operating systems, and middleware components.
Over the years, the CMSIS project has expanded into multiple areas:
SoC designers must be aware of the following CMSIS components, to create basic support for their
device in various toolchains:
CMSIS-CORE is a standardized API for the Cortex-M processor core and peripherals.
CMSIS-SVD (System View Description) describes how to display peripherals in IDEs. Create this
XML-based file to let tools display the peripherals in debuggers and to create header files with
peripheral register and interrupt definitions.
CMSIS-Pack is a delivery mechanism for device support and software components. Development
tools and web infrastructures use the information included in packs to extract device parameters,
software components, and evaluation board configurations.
CMSIS-DAP is standardized firmware for a debug probe unit (usually on a separated chip) that
connects to the Arm CoreSight Debug Access Port of your SoC design. CMSIS-DAP is well suited
for integration on low-cost evaluation boards.
273
System-on-Chip Design with Arm® Cortex®-M processors
CMSIS-Driver, a definition of peripheral driver interfaces for common peripherals such as USB,
Ethernet, SPI, and other types. Using this driver interface, middleware becomes reusable across
supported devices.
CMSIS-NN, a collection of efficient neural network kernels developed to maximize the performance
and minimize the memory footprint of neural networks on the Cortex-M processor cores.
CMSIS-Zone, a tool that helps to describe system resources and to partition these resources into
multiple projects and execution areas. This is required for complex microcontrollers that contain
multiple cores, memory protection units, or TrustZone for Armv8-M.
An in-depth tutorial is available online that explains how to create device support for a custom device
using CMSIS and its pack delivery mechanism. You can find them here: https://arm-software.github.io/
CMSIS_5/Pack/html/createPack_DFP.html
Creating startup code for various toolchains – many toolchains use assembly startup code, and the
assembly language syntax is toolchain specific.
Creating compilation setup for various toolchains – this could be in the form of project files
for Integrated Development Environments (IDEs) or makefile for Arm Compiler / gcc in Linux
environment.
In creating portable source codes, avoid using compiler/toolchain specific features such as compiler
specific intrinsic/attributes. CMSIS-CORE already includes a range of intrinsic functions that is
supported across multiple toolchains, and that can be used instead. If a compiler-specific feature is
needed, you can add pre-processing macros to make the inclusion of such a feature optional so that
the source code can still be compiled by other toolchains.
For chip design project environments, Linux would often be used, and the use of a toolchain in the
command-line interface or with makefile is very common. We will look into examples of a simple
makefile for Arm Compiler 6 and gcc next.
274
System-on-Chip Design with Arm® Cortex®-M processors
Cortex-M23 and Cortex-M33, Arm Compiler 6 is needed (Note: Arm Compiler 5 does not support
Armv8-M processors).
Optimization options might need to be updated (e.g., Otime in Arm Compiler 5 is not valid in Arm
Compiler 6).
system_cm3_mcu.o: system_cm3_mcu.c
armclang $(ARM_CC_OPTS) $< -o $@
uart_util.o: uart_util.c
armclang $(ARM_CC_OPTS) $< -o $@
startup_cm3_mcu.o: startup_cm3_mcu.s
armasm $(ARM_ASM_OPTS) $< -o $@
hello.hex : hello.elf
fromelf --vhx --8x1 $< --output $@
hello.lst : hello.elf
fromelf -c -d -e -s $< --output $@
clean:
rm *.o
rm *.elf
rm *.lst
rm *.hex
275
System-on-Chip Design with Arm® Cortex®-M processors
The assembly startup code, the command-line for assembling and the linker command-lines can
remain unchanged when migrating from Arm Compiler 5 to Arm Compiler 6. However, when using
new optimization features in Arm Compiler 6, such as LTO (Link Time Optimization), both compilation
and linking options need to be updated.
https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/
gnu-rm
This toolchain only provides command-line tools but is sufficient for chip designers when compiling
projects for simulations. Chip vendors can also create a toolchain package for their customers by
providing an IDE built around gcc toolchain.
Unlike when using Arm Compiler 5/6, it is common to merge the compilation and link stages
when using gcc or use gcc for both compilation and linking because gcc can invoke the linker (ld)
automatically with the correct link options. Linking compiled objects using GNU linker (ld) directly
is less common as it is error-prone.
Instead of using the command-line to control memory layout as in previous Arm Compiler 5/6
examples, with gcc, we need to use a linker script to specify memory layout for the linking stage.
An example linker script can be written as follows:
MEMORY
{
FLASH (rx) : ORIGIN = 0x0, LENGTH = 0x10000 /* 64KB */
RAM (rwx) : ORIGIN = 0x20000000, LENGTH = 0x8000 /* 32KB */
}
INCLUDE “sections.ld”
This linker script also pulls in a default linker script sections.ld, which defines the memory layout inside
the program image.
276
System-on-Chip Design with Arm® Cortex®-M processors
Since the assembly language syntax is different between Arm toolchain and GNU assembler, we also
need to create a startup code for gcc:
startup_cm3_mcu.S
.syntax unified
.arch armv7-m
.section .stack
.align 3
/*
// <h> Stack Configuration
// <o> Stack Size (in Bytes) <0x0-0xFFFFFFFF:8>
// </h>
*/
.section .stack
.align 3
#ifdef __STACK_SIZE
.equ Stack_Size, __STACK_SIZE
#else
.equ Stack_Size, 0x200
#endif
.globl __StackTop
.globl __StackLimit
__StackLimit:
.space Stack_Size
.size __StackLimit, . - __StackLimit
__StackTop:
.size __StackTop, . - __StackTop
/*
// <h> Heap Configuration
// <o> Heap Size (in Bytes) <0x0-0xFFFFFFFF:8>
// </h>
*/
.section .heap
.align 3
#ifdef __HEAP_SIZE
.equ Heap_Size, __HEAP_SIZE
#else
.equ Heap_Size, 0
#endif
.globl __HeapBase
.globl __HeapLimit
__HeapBase:
.if Heap_Size
.space Heap_Size
.endif
.size __HeapBase, . - __HeapBase
__HeapLimit:
.size __HeapLimit, . - __HeapLimit
/* Vector Table */
.section .isr_vector
.align 2
.globl __isr_vector
277
System-on-Chip Design with Arm® Cortex®-M processors
__isr_vector:
.long __StackTop /* Top of Stack */
.long Reset_Handler /* Reset Handler */
.long NMI_Handler /* NMI Handler */
.long HardFault_Handler /* Hard Fault Handler */
.long MemManage_Handler /* MPU Fault Handler */
.long BusFault_Handler /* Bus Fault Handler */
.long UsageFault_Handler /* Usage Fault Handler */
.long 0 /* Reserved */
.long 0 /* Reserved */
.long 0 /* Reserved */
.long 0 /* Reserved */
.long SVC_Handler /* SVCall Handler */
.long DebugMon_Handler /* Debug Monitor Handler */
.long 0 /* Reserved */
.long PendSV_Handler /* PendSV Handler */
.long SysTick_Handler /* SysTick Handler */
/* External Interrupts */
.long GPIO0_Handler /* 16+ 0: GPIO 0 Handler */
.long GPIO1_Handler /* 16+ 1: GPIO 1 Handler */
.long TIMER0_Handler /* 16+ 2: Timer 0 Handler */
.long TIMER1_Handler /* 16+ 3: Timer 1 Handler */
.long UARTTX0_Handler /* 16+ 4: UART 0 TX Handler */
.long UARTRX0_Handler /* 16+ 5: UART 0 RX Handler */
/* Reset Handler */
.text
.thumb
.thumb_func
.align 2
.globl Reset_Handler
.type Reset_Handler, %function
Reset_Handler:
/* Loop to copy data from read only memory to RAM. The ranges
* of copy from/to are specified by following symbols evaluated in
* linker script.
* __etext: End of code section, i.e., begin of data sections to copy from.
* __data_start__/__data_end__: RAM address range that data should be
* copied to. Both must be aligned to 4 bytes boundary. */
subs r3, r2
ble .LC1
.LC0:
subs r3, #4
ldr r0, [r1, r3]
str r0, [r2, r3]
bgt .LC0
.LC1:
#ifdef __STARTUP_CLEAR_BSS
/* This part of work usually is done in C library startup code. Otherwise,
* define this macro to enable it in this startup.
*
* Loop to zero out BSS section, which uses following symbols
* in linker script:
* __bss_start__: start of BSS section. Must align to 4
* __bss_end__: end of BSS section. Must align to 4
278
System-on-Chip Design with Arm® Cortex®-M processors
*/
ldr r1, =__bss_start__
ldr r2, =__bss_end__
movs r0, 0
.LC2:
cmp r1, r2
itt lt
strlt r0, [r1], #4
blt .LC2
#endif /* __STARTUP_CLEAR_BSS */
#ifndef __NO_SYSTEM_INIT
/* bl SystemInit */
ldr r0,=SystemInit
blx r0
#endif
bl _start
.pool
.size Reset_Handler, . - Reset_Handler
def_default_handler NMI_Handler
def_default_handler HardFault_Handler
def_default_handler MemManage_Handler
def_default_handler BusFault_Handler
def_default_handler UsageFault_Handler
def_default_handler SVC_Handler
def_default_handler DebugMon_Handler
def_default_handler PendSV_Handler
def_default_handler SysTick_Handler
/* IRQ Handlers */
def_default_handler GPIO0_Handler
def_default_handler GPIO1_Handler
def_default_handler TIMER0_Handler
def_default_handler TIMER1_Handler
def_default_handler UARTRX0_Handler
def_default_handler UARTTX0_Handler
/*
def_default_handler Default_Handler
.weak DEF_IRQHandler
.set DEF_IRQHandler, Default_Handler
*/
.end
279
System-on-Chip Design with Arm® Cortex®-M processors
Please note that with GNU toolchain, there is a different between filename extension “.S” (upper case)
and “.s” (lower case). To enable preprocessing, the filename extension needs to be “.S” (upper case).
The retarget support code is different between gcc and Arm toolchain as well:
retarget.c
#include <stdio.h>
#include <sys/stat.h>
__attribute__ ((used)) int _write (int fd, char *ptr, int len)
{
size_t i;
for (i=0; i<len;i++)
{
stdout_putchar((int) ptr[i]); // call character output function
}
return len;
}
To compile the simple hello world project with gcc, the makefile can be made quite simple by merging
compilation and linking into one step:
hello.hex : hello.elf
arm-none-eabi-objcopy -S hello.elf -O verilog $@
hello.lst : hello.elf
arm-none-eabi-objdump -S hello.elf > $@
clean:
rm *.o
rm *.elf
rm *.lst
rm *.hex
280
System-on-Chip Design with Arm® Cortex®-M processors
Keil MDK is an integrated development environment (IDE) that contains compiler, editor, debugger, as
well as various utilities such as flash programming tools. Using MDK, software developers can create
embedded applications for Arm Cortex-M processor-based devices, program the application images
into the flash memories and verify their correct operations using the integrated debugger. The full
version of the Keil MDK also includes a set of middleware and two choices of IDEs.
For microcontroller software developers, the user interface they are most familiar with is the µVision
IDE. From there, you can create/modify projects, edit source codes, compile codes, program the
devices, and debug the applications. The rest of this section is focused on µVision IDE, which most
microcontroller software developers use.
CMSIS support is closely integrated within the µVision IDE. For example, when creating a project,
the project wizard can utilize CMSIS-PACK to download the appropriate software packages required.
Software packs contain device support, CMSIS libraries, middleware, board support, code templates,
and example projects. These can be added any time to the toolchain. The IDE manages the provided
software components that are available for the application as building blocks.
Note: While you need to start with the IDE to set up a project, you can use the command-line
afterward to automatically build, flash, and debug your application.
281
System-on-Chip Design with Arm® Cortex®-M processors
After installing the Keil MDK, the CMSIS-PACK installer will start, and you need to download software
packs for the microcontroller devices you want to use. If you are a chip designer and creating your own
Cortex-M based device, then you only need to install the base CMSIS-CORE packages.
Debug probe driver - Depending on the debug probe that you are going to use, you might need to
install the correct device driver for it. The Keil MDK installation by default includes a number of
these drivers in <installation_dir>\ARM\<Segger/SiLabs/STLink/TI_XDS/ULINK>. Please double-
check the documentation that comes with your debug probe hardware to see the driver installation
requirements.
UART Terminal (or virtual COM port terminal) – If you want to redirect printf text message into a
UART connection and display that on your PC, you need a UART terminal program such as TeraTerm
or Putty. Since most modern PCs do not have RS232 ports anymore, you will need to get a USB-
UART adaptor, and that also has its hardware device driver to install. (Note: printf message display
using Cortex-M’s instrumentation Trace Macrocell (ITM) does not require such drivers as Keil MDK
can display the prinrf message in the debug IDE. But remember: the ITM feature is not available in
Armv6-M and Armv8-M Baseline processors).
282
System-on-Chip Design with Arm® Cortex®-M processors
Adaptors that provide DB9 connectors and use RS232(C) signal protocol
Adaptors that provide jumper wire connections and use logic level signaling (usually 3.3v but can
also be TTL compatible).
Silicon designs normally use 3.3v signaling for top-level I/O, and if RS232 signaling is needed, a
separate signal converter chip is required. Make sure that you have the right kind of USB-UART
adaptor when connecting your boards, as a direct connection between digital logic and RS232 can
result in permanent damage to the circuits.
1. Create a project and select the device along with the relevant CMSIS components.
4. Compile and link the application for downloading it to the on-chip Flash memory.
1. The main.c file contains the main() function that initializes essential hardware, the peripherals, and
starts the LED blinky execution.
2. The LED.c file contains functions to initialize and control the GPIO port. The LED_Initialize()
function initializes the GPIO port pin. The functions LED_On() and LED_Off() control the port pin
that interfaces to the LED.
3. The LED.h header file contains the function prototypes for the functions in LED.c and is included in
the file main.c.
283
System-on-Chip Design with Arm® Cortex®-M processors
It will then ask you where the project should be placed. For this example, we selected an empty folder
called “C:\work\CM3_Blinky_1”.
The project wizard then asks which device this project will be based on. For this project, which is
starting from scratch, the generic Cortex-M3 processor (ARMCM3) is selected.
284
System-on-Chip Design with Arm® Cortex®-M processors
The project wizard then opens the Run-Time Environment window, which allows us to include a range
of software components in the project. For the basic project that we are creating, we need the CMSIS-
CORE support and a device startup file. However, since we are creating our own startup code with
our specific vector table, only the CMSIS-CORE software component is selected, and we will add the
startup code to the project manually.
If we want to include printf support in our example code, we should also include Compiler → I/O →
STDOUT [e.g., ITM/User]. The first project we demonstrate here does not require printf, so this is
not selected.
285
System-on-Chip Design with Arm® Cortex®-M processors
You can add/modify the project’s source groups and add the source file to the project:
Rename a source group – single click on the name of the source group.
Add a new source group – right-click on the Target name (Target 1) and select “Add Group …”.
For this example, I have modified the project to have two source groups, as shown in Figure 11.8, and
added some files to it:
Create the source file in µVision IDE using File→New, write the code and “Save as” the file type
you need, and then add to the project.
Right click on a source group, select “Add New item to Group ‘<group name>’”
286
System-on-Chip Design with Arm® Cortex®-M processors
When using the second method, the following window will appear and allow you to define the file
type and the filename.
Figure 11.10: Define file type and filename when adding a new item to a source group.
We can reuse some of the previously created projects such as startup code and system initialization
code, create the new source files (main.c, LED.c, and LED.h), and add these to the project.
287
System-on-Chip Design with Arm® Cortex®-M processors
We start with the last file LED.c and its associated header file LED.h. Based on the system-level header
file (cm3_mcu.h) created in previous chapters, the source code for LED utilities can be written using the
peripheral definitions defined in the cm3_mcu.h. The example code for LED.h is written as follows:
LED.h
Here, we define the three functions that will be available to users. The actual content of the functions
is available in LED.c. Open this file and add the following code (Assuming the LED pin is GPIO 0 - pin 0):
LED.c
#include “LED.h”
#include “cm3_mcu.h”
The file main.c contains an endless loop that use the LED functions to toggle the LED. The toggling
delay is implemented using SysTick, which increments an integer variable SysTickCntr at 1KHz.
288
System-on-Chip Design with Arm® Cortex®-M processors
main.c
#include “cm3_mcu.h”
#include “LED.h”
int main(void)
{
LED_Initialize();
SysTick_Config((SystemCoreClock/1000)-1); // 1KHz Ticks
while(1){
LED_On();
TickDelay(500);
LED_Off();
TickDelay(500);
}; // end while
}
void SysTick_Handler(void)
{ // Trigger at 1KHz
SysTickCntr++;
return;
}
289
System-on-Chip Design with Arm® Cortex®-M processors
Since this project is for our own system design, we need to tell the tool what the memory map looks
like. Therefore, in the target tab of the project option, we define the memory size for our system, as
shown in the example value in Figure 11.13:
We also need to tell the linker to use this memory layout for linking operations. This option (“Use Memory
Layout from Target Dialog”) is available in the linker option tap. (This option is not set by default).
Figure 11.14: Linker options, including the option to force the linker to use memory layout in the target dialog.
290
System-on-Chip Design with Arm® Cortex®-M processors
We can optionally define the compiler options in the C/C++ compiler options, as shown in Figure 11.15:
Figure 11.15: Compiler options, including optimization options, preprocessing flags, include paths and C/C++ coding rules.
If you are using an FPGA prototype, then there is no flash programming algorithm, and the download
option needs to be set up accordingly, as shown in Figure 11.16. To gain access to this dialog, either:
Click on debug option tab, then click on the “Settings” button on the right of the selected debug
probe choice, or
291
System-on-Chip Design with Arm® Cortex®-M processors
292
System-on-Chip Design with Arm® Cortex®-M processors
Please note that in the debug session, the buttons in the toolbar are different from the coding screen.
You can see a description of the button by moving the mouse cursor on top of it.
Show/hide windows
Find
Incremental find
Start/Stop debug sessions
Insert/remove breakpoint (F9)
Show next statement
Enable/disable breakpoint (Ctrl+F9)
Run to cursor line (Ctrl+F10)
Step Out (Ctrl+F11) Disable all breakpoints
Step over (F10) in current target
Step 1 line (F11) Kill all breakpoints in
Stop current target
Run (F5) Hide/show project window
Reset the CPU
Configure µVision
The µVision debugger connects to various debug/trace adapters and supports traditional features
like simple and complex breakpoints, watch windows, and execution control. Using trace, additional
features like event/exception viewers, system analyzer, execution profiler, and code coverage are
supported. Component viewer and event recorder help you to gain insight into the operation of third-
party software, such as Keil RTX.
In the Registers window, you see the content of the registers of the Cortex-M processor. The
Disassembly window shows the program execution in assembly code intermixed with the source
code (when available). When this is the active window, all debug stepping commands will then work
at the assembly level. The Call Stack + Locals window shows the function nesting and variables of the
current program location. You can use the Command window to enter debug commands.
The application has run to main and is ready to be run. If we do that now, we will see the LED toggles
at 2Hz (i.e., blinks at 1Hz). If you want to look at the toggling operations in detail, you can add a
breakpoint to the loop, such as line 12 of main.c (LED_on()). This can be done by simply left-clicking
the grey area next to this line, and you will see a red dot that shows the breakpoint set. Go to Debug
– Run or press F5 to run to this breakpoint. Use the Step function (F11) to step into the code. The
next line will be highlighted by two arrows. Step again, into the LED_On function. The scope changes,
and the LED.c file comes into the foreground. Step twice to see how the LED is lit and to go back to
main.c.
As you don’t want to step many times in the TickDelay function, use Step Over (F10) to step over this
function and stop at LED_Off. If we step into it and step out again, we will see that the LED is off.
293
System-on-Chip Design with Arm® Cortex®-M processors
Instead of using a UART (since your chip might have only one usable UART, which you could, need
that for your application), you can use ITM (Instrumentation Trace Macrocell) to handle the printf text
message. The trace message output can pass through a trace connection (e.g., either SWO pin or trace
data pins) and be collected by the debug host, and then be displayed in real-time.
A trace connection needs to be available (e.g., when using Serial-Wire debug, the TDO pin can be
switched into SWO for low-cost trace connection).
The debug probe and the debug environment must support ITM trace (e.g., Using Keil MDK and
ULINK2 debug probe allows us to collect trace message via the TDO pin).
To do that, several small changes are needed in the project we demonstrated before. The first
step is to include the STDOUT support in the Run-Time Environment. You can open the Run-Time
environment dialog by clicking on the button on the toolbar, as shown in Figure 11.20:
294
System-on-Chip Design with Arm® Cortex®-M processors
Inside the manage Run-time environment dialog, enable the STDOUT redirection to ITM, as shown in
Figure 11.21:
If using a SWO trace connection, you will need to make sure the clock frequency setup is correct.
295
System-on-Chip Design with Arm® Cortex®-M processors
If using the SWO that is multiplexed with the TDO pin, we must select Serial-Wire Debug mode in the
debug connection setting:
Figure 11.23: If using SWO signal for trace, select Serial-Wire debug mode so that TDO pin can be used for SWO.
Now click on the ‘Trace’ tab of the debug probe settings and enable trace. Please also double-check
that the clock frequency setting is correct.
296
System-on-Chip Design with Arm® Cortex®-M processors
Notes:
Starting from MDK version 5.28, there are separated clock settings for Core and Trace clocks.
If you find you are losing some of the trace messages, you can disable timestamp package generation
to reduce the bandwidth of the trace output. That might help avoid some of the trace data being lost.
#include “cm3_mcu.h”
#include “LED.h”
#include <stdio.h>
int main(void)
{
uint32_t counter=0;
LED_Initialize();
printf (“Hello world\n”);
SysTick_Config((SystemCoreClock/1000)-1); // 1KHz Ticks
while(1){
LED_On();
TickDelay(500);
LED_Off();
TickDelay(500);
counter++;
printf(“%d\n”, counter);
}; // end while
}
void SysTick_Handler(void)
{ // Trigger at 1KHz
SysTickCntr++;
return;
}
297
System-on-Chip Design with Arm® Cortex®-M processors
We can now compile the project and start the debug session as before. In the debug session, before
we start the program execution, we need to open the printf display console: this is accessible from the
pull-down menu View→Serial Windows→Debug (printf) viewer:
You should then be able to see the debug (printf) viewer in the bottom right-hand corner of the IDE
screen. When the program starts, you can then see the printf message being displayed in this window:
Figure 11.26: Debug (printf) viewer showing the printf message received via a trace connection.
298
System-on-Chip Design with Arm® Cortex®-M processors
Development tools, like Keil MDK, have interfaces to Git so that developers can efficiently submit
their code to the repository and update the code base with contributions from other team members.
Git is so popular today that there are a lot of excellent tutorials out there to help you get started, and
since an in-depth introduction to Git is beyond the scope of this introductory chapter, we recommend
that you select one suitable for your needs and study the subject. It is worth bearing in mind though
before you are tempted to find other ways to collaborate that using Git is a good idea even for very
small development teams as each member has the complete repository available, in case the original
one fails for any reason and has to be recreated
Simple embedded applications can safely be run in an endless loop. Time-critical functions, typically
triggered by hardware interrupts, are executed in an ISR that also performs the required data processing.
The main loop contains only basic operations that are not time-critical and run in the background.
In an RTOS-based design, multiple threads run in the multitasking environment provided by the real-
time operating system (RTOS). The RTOS provides inter-thread communication and time management
functions. A pre-emptive RTOS reduces the complexity of interrupt functions because high-priority
threads can perform time-critical data processing.
RTOS kernels are based on the idea of parallel execution threads (tasks). Just like in the real world,
an application usually must fulfill multiple different tasks. An RTOS-based application recreates this
model in software and makes sure that:
Thread priority and run-time scheduling are handled by the RTOS kernel, using a proven code base.
Larger teams can safely work on various aspects of the software. A pre-emptive multi-tasking
concept simplifies the progressive enhancement of an application as new functionality can be
added without risking the response time of more critical threads.
299
System-on-Chip Design with Arm® Cortex®-M processors
Polling for interrupts is not required. Infinite loop software concepts often poll for occurred
interrupts. In contrast, RTOS kernels themselves are interrupt driven and can largely eliminate
polling. This allows the CPU to sleep or process threads more often.
Hard real-time requirements can be met because the RTOS kernel is usually transparent to the
interrupt system. Communication facilities can be used for IRQ-to-task communication and allow
top-half/bottom-half handling of your interrupts.
While RTX RTOS is not part of CMSIS, it is an open-source project of its own, and it builds into Keil
MDK library support. Note: Software developers can include RTX in their software project for free!
Once the execution reaches main(), we use the recommended order to initialize the hardware and
start the kernel. We should implement in main() at least the following in the given order:
1. Initialization and configuration of hardware, including peripheral, memory, pin, clock, and interrupt
system.
2. Update of the system clock frequency using the CMSIS-CORE function SystemCoreClock.
4. Optionally, we can create a new thread using osThreadNew, for example, app_main. In the following
example, this is used as the main thread. Alternatively, threads can also be directly created in main().
We will use this approach later in our little example project.
300
System-on-Chip Design with Arm® Cortex®-M processors
To add RTX into the project, we can enable the RTX library in the Manage Run-Time environment dialog:
When using Keil RTX5, the project wizard specified that default device startup code and system
initialization must be used. Therefore, we enabled the option, and the project wizard added the default
Cortex-M3 startup code and system initialization code into the project. Since we cannot have two
versions of startup code in the project, the original startup code is removed, and the custom-defined
vector table is then transferred into the new default startup code. We also do the same for the system
initialization code.
Next, we modify the main.c to utilize the RTX kernel. In this example, we have only one thread for
toggling the LED:
main.c
#include “cm3_mcu.h”
#include “LED.h”
#include “cmsis_os2.h”
int main(void)
{
LED_Initialize();
osKernelInitialize(); // Initialize CMSIS-RTOS
osThreadNew(thread_led, NULL, NULL); // create thead for thread_led
osKernelStart(); // Start thread execution
for (;;) {}
301
System-on-Chip Design with Arm® Cortex®-M processors
Now we can compile the code and test it, and it should toggle the LED in the same way as the first example.
For more information on using RTX and CMSIS-RTOSv2 API, please visit:
https://www.keil.com/pack/doc/CMSIS/RTOS2/html/index.html
If the code calls a library function and the library does not provide stack usage, C usage reports will
not be able to analysis the stack usage needed for the library calls.
If the code contains function calls to function pointers which are dynamically assigned, the
toolchain cannot determine a static call tree for stack usage analysis.
Even if the toolchain can report the RAM usage, it might not be clear how the RAM is used per thread.
The last build output should have shown a RW memory usage of roughly 3 kB. This is already quite a
lot for some devices that have a small amount of RAM. How can we reduce this? Let’s use some more
enhanced debug features to check our code.
The file RTX_Config.h contains a number of markups in its code comments to enable easy
configuration via a Configuration Wizard. Underneath the code window, you can see a Configuration
Wizard tab. Click on it, and you can then browse and edit each option in the file easily. In Figure 11.28,
we enable the Stack Usage Watermark feature – make sure you save this file before you recompile
the project!
302
System-on-Chip Design with Arm® Cortex®-M processors
Figure 11.28: Enabling stack usage watermark for stack size measurement in RTX.
After the project is compiled, start the debug session as usual, and the program should stop at the
beginning of main().
The RTX RTOS viewer can be enabled in a debug session via pull-down menu View→Watch
windows→RTX RTOS. This shows some of the configuration information about the RTX kernel,
including memory configuration parameters (e.g., default stack size for threads) and states of stack
overflow detection features.
When the RTX RTOS viewer is open at the start of the application (before the OS start), this window
does not show any thread information. But once the program has run for a bit and then halted, the
RTX RTOS viewer will show the information about the threads as well as the OS kernel. Let the
program run for a while and then stop. In the RTX RTOS window, expand Threads, and then the
thread that is marked as thread_led. Observe the Stack usage:
303
System-on-Chip Design with Arm® Cortex®-M processors
Figure 11.29: Stack usage of each thread is shown in the RTX RTOS viewer.
Both stack values, the (currently) used one, and the maximum used one are shown in bytes. Observe
that the actual stack usage is very low. Also, the timer and the idle thread require very little stack. The
Global Dynamic Memory shows that only 360 bytes are consumed.
Previously, we have specified in the RTX_Config.h file that all objects can consume up to 4096 bytes.
304
System-on-Chip Design with Arm® Cortex®-M processors
With the new information, we can now reduce the memory sizes of the RTX by editing various options
in RTX_Config.h. For example, we can:
Reduce the stack size for Timer thread and Idle thread.
Save the file and recompile. You should see a reduction of RAM usage in your project. Do not forget to
disable the stack usage watermark feature afterward – as this can increase OS operation overhead.
305
System-on-Chip Design with Arm® Cortex®-M processors
A commercial alternative to MDK is the IAR Embedded Workbench for Arm (EWARM), available at:
www.iar.com/arm
If you are looking for an open-source implementation of an Eclipse-based IDE, you might want to take
a look at GNU MCU Eclipse (https://gnu-mcu-eclipse.github.io/). This enables you to develop your
software using Linux or even Mac OS machines.
306
System-on-Chip Design with Arm® Cortex®-M processors
307
System-on-Chip Design with Arm® Cortex®-M processors
Glossary of Terms
AHB-AP (AHB Access Port) - An optional component of the DAP that provides
an AHB interface to a SoC. CoreSight supports access to a system
bus infrastructure using the AHB-AP in the Debug Access Port
(DAP). The AHB-AP provides an AHB master port for direct access
to system memory. Other bus protocols can use AHB bridges to map
transactions. For example, you can use AHB to AXI bridges to provide
AHB access to an AXI bus matrix. See also Debug Access Port (DAP).
308
System-on-Chip Design with Arm® Cortex®-M processors
Arm Compiler for DS-5 Arm Compiler for DS-5 is a suite of tools, together with supporting
documentation and examples, that you can use to write and build
applications for the Arm family of processors. Arm Compiler for
DS-5 supersedes RealView Compilation Tools. DS-5 has now been
superseded. See Arm Development Studio entry for information; See
also armasm, armcc, fromelf.
armasm The Arm assembler. This converts Arm assembly language into
machine code.
armcc The Arm compiler for C and C++ code in Arm Compiler 5. See
Development Studio 5 (DS-5) and Keil MDK.
armclang The Arm compiler for C and C++ code in Arm Compiler 6. See
Development Studio 5 (DS-5) and Keil MDK.
ATB (Advanced Trace Bus) - An AMBA bus protocol for trace data. A trace
device can use an ATB to share CoreSight capture resources. Use
AMBA ATB on first use, and ATB thereafter.
309
System-on-Chip Design with Arm® Cortex®-M processors
Bus master multiplexer A bus interconnect component that allows several bus masters to
(master MUX) connect to a single bus slave. Typically, it has its own arbitration
logic to decide which bus master can drive the downstream bus. The
output signals are then multiplexed into a single bus by forwarding
the address and control signals from the highest priority master to
downstream AHB slaves.
Bus matrix An on-chip bus interconnect component that allows multiple bus
masters to communicate with multiple bus slaves simultaneously.
Bus slave multiplexer A bus interconnect component that allows a bus master to connect
(slave MUX) to multiple bus slaves by multiplexing their return read data and
response signal into a single bus for feedback to the bus master.
310
System-on-Chip Design with Arm® Cortex®-M processors
CoreSight ECT A modular system that supports the interaction and synchronization
of multiple triggering events with an SoC. It comprises Cross Trigger
Interface (CTI), and Cross Trigger Matrix (CTM).
CoreSight ETB A Logic block that extends the information capture functionality of
a trace macrocell.
CPSR (Current Program Status Register) – A register that holds the APSR
(Application Program Status Register) flags, the current processor
mode, interrupt disable flags, current processor state, endianness
state (on ARMv4T and later), execution state bits for the IT block (on
ARMv6T2 and later).
311
System-on-Chip Design with Arm® Cortex®-M processors
CTM (Cross Trigger Matrix) - A block that controls the distribution of trigger
requests.
Default Slave A bus slave in an AHB system that is used to return a bus error
response to the bus master when a transfer to an illegal address is
detected. If the bus master is a processor, a fault exception is then
triggered, and it can deal with the error accordingly.
Arm Development Studio A one suite tool that provides a comprehensive embedded C/C++
dedicated software development solution. Arm Development Studio
includes (1) Arm debugger and Keil µVision debugger; (2) Embedded
C/C++ Arm Compiler 6; (3) Streamline performance analyzer for
system-wide optimization on Linux, Android or bare-metal; (4)
Royalty-free CMSIS-compliant middleware blocks for MCUs; (5)
Armv7 and Armv8 Fixed Virtual Platforms for software development
without a hardware target; and (6) Graphics debugger compatible
with OpenGL ES, Vulkan and OpenCL.
DS-5 Debugger An Arm software development tool that enables you to make use
of a debug agent to examine and control the execution of software
running on a debug target. Note: DS-5 has been superseded by the
Arm Development Studio.
312
System-on-Chip Design with Arm® Cortex®-M processors
Eclipse for DS-5 Eclipse for DS-5 is based around the Eclipse IDE and provides
additional features to support the Arm development tools provided
in DS-5. See Development Studio 5 (DS-5). Note: DS-5 has been
superseded by Arm Development Studio (see entry).
endianness The scheme that determines the order of the successive bytes of data
in a larger data structure when that structure is stored in memory.
ETB (Embedded Trace Buffer) - A logic block that extends the information
capture functionality of a trace macrocell.
exception vector When an exception occurs, the processor must execute the handler
code that corresponds to the exception. The location in memory
where the handler is stored is called the exception vector. In Arm
architectures, exception vectors are stored in a table, called the
exception vector table.
FPB (Flash Patch and Breakpoint) - A hardware unit in the Cortex-M processor
that provides hardware comparators for breakpoint functionality (see
breakpoint). In Cortex-M3 and Cortex-M4 processors, these comparators
can also be used for code patching. See also BPU.
313
System-on-Chip Design with Arm® Cortex®-M processors
FPU (Floating Point Unit) - A hardware unit inside a processor for handling
floating-point data.
fromelf The Arm image conversion utility. This accepts ELF format input files and
converts them to a variety of output formats. fromelf can also generate
text information about the input image, such as code and data size.
314
System-on-Chip Design with Arm® Cortex®-M processors
JTAG (Joint Test Action Group) - An IEEE group focused on silicon chip
testing methods. Many debug and programming tools use a Joint Test
Action Group (JTAG) interface port to communicate with processors.
See IEEE Std 1149.1-1990 IEEE Standard Test Access Port and
Boundary-Scan Architecture specification (available from the IEEE
Standards Association).
315
System-on-Chip Design with Arm® Cortex®-M processors
MTB (Micro Trace Buffer) - The MTB provides a simple execution trace
capability to M-series processors. It has a low-cost option for
instruction trace requirements for software development. Unlike the
Embedded Trace Macrocell (ETM) or the Program Trace Macrocell
(PTM) trace solutions, the MTB does not require a dedicated trace
connection and trace data can be collected using a JTAG or Serial Wire
Debug connection. However, the amount of trace history provided by
the MTB is limited by the size of SRAM allocated for trace operations.
nTRST Abbreviation of TAP Reset. nTRST is the electronic signal that causes
the target system TAP controller to be reset. This signal is known
as nICERST in some documentation. See also nSRST and Joint Test
Action Group (JTAG).
316
System-on-Chip Design with Arm® Cortex®-M processors
SDF back-annotation Using the timing delay values extracted from post-layout data (stored
in a SDF file) to update the netlist (‘back-annotate’ it) during a netlist
simulation.
317
System-on-Chip Design with Arm® Cortex®-M processors
STA (Static Timing Analysis) – A type of analysis used to verify that the
output design from synthesis or placement & routing can meet the
timing requirements.
SP (Stack Pointer) On Arm cores, SP refers to the stack pointer for the
hardware-managed stack.: In AArch32 state, the SP is register R13
in the general-purpose register file. In AArch64 state, there is a
dedicated SP for each implemented Exception level.
SW-DP (Serial Wire Debug Port) - The interface for Serial Wire Debug
318
System-on-Chip Design with Arm® Cortex®-M processors
SysTick timer (System Tick timer) - A hardware unit in Cortex-M processor that
provides periodic interrupts for OS operations. The SysTick timer
is controlled by software, and CMSIS-Core support for Cortex-M
processor-based devices provide APIs that generate interrupt requests
on a regular basis. Example use of the SysTick timer and its interrupt
include allowing an OS to carry out context switching to support
multiple tasking. For applications where OS is not required, SysTick can
be used for timekeeping, time measurement, or as an interrupt source
for tasks that need to be executed regularly.
TAP Controller Logic on a device that enables access to some or all of that device
for debug or test purposes. The circuit functionality is defined in
IEEE1149.1. See also Joint Test Action Group (JTAG).
TCK (Test Clock) - The electronic clock signal that times data on the TAP
data lines TMS, TDI, and TDO. See also Test Data Input (TDI) and Test
Data Output (TDO).
TPIU (Trace Port Interface Unit) - A hardware block to convert trace data
from an ETM or other trace sources into parallel trace protocol for
trace probe to collect the data via top-level pins.
319
System-on-Chip Design with Arm® Cortex®-M processors
TrustZone technology The hardware and software that enable the integration of enhanced
security features throughout a SoC. It is widely used in Cortex-A
processors and has been introduced to the latest Cortex-M processors
such as Cortex-M23, Cortex-M33, and Cortex-M35P.
320
System-on-Chip Design with Arm® Cortex®-M processors
References
The designs in this book are based on the following AMBA specifications:
Specifications Url
Other AMBA specifications (including older versions of AHB and APB specifications) were also mentioned:
Specifications Url
Cortex-M Access
The Keil Microcontroller Development Kit (MDK-ARM) introduction materials in Chapter 11 are based
on version MDK-ARM 5.27. For evaluation and education, you can use Keil MDK Lite for free (code
size limited to 32KB). You can find the latest version here: http://www2.keil.com/mdk5
321
System-on-Chip Design with Arm® Cortex®-M processors
Index
Acknowledgments 19
APB 55-89
Arm Ecosystem 5
Arm processors 24
Breakpoints 116
322
System-on-Chip Design with Arm® Cortex®-M processors
Bus design 92
infrastructure components 161-189
masters 104
Cache integration 41
CDC 238
Configuration options 42
Cortex-M0 93-94
Cortex-M0+ 94-95
Cortex-M1 96-98
Cortex-M3 98-104
Cortex-M4 98-104
DAP 120
323
System-on-Chip Design with Arm® Cortex®-M processors
DesignStart 27
DFT 244-247
Disclaimer 17
Foreword 4
FPGA 23
GATECLK 143
Halting 220
ID registers 215-216
324
System-on-Chip Design with Arm® Cortex®-M processors
Modelsim 237-239
Power management 51
Preface 16
QACTIVE 144-146
QREQn 144-146
QuestaSim 237-238
References 321
325
System-on-Chip Design with Arm® Cortex®-M processors
Security 220
STA 243
SWO 130
SWV 130
SysTick 49-50
TBSA-M 135
TCM 109-111
TRACECLK 130
TRACEDATA 130
326
System-on-Chip Design with Arm® Cortex®-M processors
UART 210-218
Watchpoints 116
327
System-on-Chip Design with Arm® Cortex®-M processors
Available now:
Internet of Things
System-on-Chip Design
Embedded Linux
Contact: edumedia@arm.com
328
System-on-Chip Design with Arm® Cortex®-M processors
329
System-on-Chip Design with Arm® Cortex®-M processors
Available now:
Coming soon:
Contact: edumedia@arm.com
330
System-on-Chip Design with Arm® Cortex®-M processors
331
System-on-Chip Design with Arm® Cortex®-M processors
332
System-on-Chip Design with Arm® Cortex®-M processors
333
System-on-Chip Design with Arm® Cortex®-M processors
Design
with Arm® Cortex®-M Processors
Reference Book
The Arm® Cortex®-M processors are already one of the most popular choices for loT
and embedded applications. With Arm Flexible Access and DesignStart™, accessing Arm
Cortex-M processor IP is fast, affordable, and easy. This book introduces all the key topics
that system-on-chip (SoC) and FPGA designers need to know when integrating a Cortex-M
processor into their design, including bus protocols, bus interconnect, and peripheral designs.
Arm Education Media is a publishing operation with Arm Ltd, providing a range of educational materials for aspiring and practicing engineers.
For more information, visit: armedumedia.com
334