ARM has introduced a new processor in the Cortex-M series: the Cortex-M7. The Cortex-M7 is the most recent and highest-performance member of the energy-efficient Cortex-M processor family. ARM says, “The versatility and new memory features of the Cortex-M7 enable more powerful, smarter and reliable microcontrollers that can be used across a multitude of embedded applications.”
The primary focus of the Cortex-M7 is improved performance. ARM’s goal was to raise M series performance to a level previously unseen, while keeping the M series’ signature traits: small die size, low power consumption, excellent responsiveness, and the ease of use of the ARMv7-M architecture. There are at least two reasons ARM focused on performance for the M7. First, ARM wants to drive a further wedge between the Cortex-M family and traditional 8- and 16-bit microcontrollers, giving itself a more differentiated market position. Second, the M7 will help support the Internet of Things and wearable device markets. With its enhanced DSP capabilities, the Cortex-M7 is better suited to audio and visual sensor hub processing than any previous M series design.
The Cortex-M7 has about twice the DSP performance of the M4, because it can execute two instructions at a time where the M4 executes one. It also helps that the M7 can run at a higher clock frequency than the M4. It is backed by the Keil CMSIS DSP library and can be built with a single- and double-precision FPU.
It was developed to provide a low-cost platform that meets the needs of MCU implementations, with a reduced pin count and low power consumption, while delivering outstanding computational performance and low interrupt latency. You can also run two M7 cores in lock step on the same code, one following two cycles behind the other, so that external electronics can detect a glitch if the two CPUs suddenly behave slightly differently.
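To make the lock-step idea concrete, here is a minimal host-side sketch in C of the kind of comparison the external checker performs. It is only an illustration: the function name `lockstep_first_mismatch`, the trace arrays, and the idea of comparing one 32-bit output word per cycle are assumptions made for this example, not a description of ARM's actual comparator hardware.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical checker model: lead[] holds one output word per cycle from
 * the leading core, follow[] holds the outputs of the core running `delay`
 * cycles behind. Returns the cycle index (in the follower's trace) of the
 * first disagreement, or -1 if the two cores agree everywhere.
 */
long lockstep_first_mismatch(const uint32_t *lead, const uint32_t *follow,
                             size_t n, size_t delay)
{
    for (size_t i = delay; i < n; i++) {
        /* The follower at cycle i should repeat the leader at cycle i - delay. */
        if (follow[i] != lead[i - delay]) {
            return (long)i;
        }
    }
    return -1;
}
```

With a delay of 2, a follower trace that simply repeats the leader two cycles later returns -1, and a single corrupted word is reported at the cycle where it appears.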
The optional Floating Point Unit (FPU) provides:
Automated stacking of floating-point state is delayed until the ISR attempts to execute a floating-point instruction. This reduces the latency to enter the ISR and removes floating-point context save for ISRs that do not use floating-point.
It provides instructions for single-precision data-processing operations, and optional instructions for double-precision data-processing operations.
The FPU also provides combined multiply-and-accumulate instructions for increased precision, and hardware support for conversion, addition, subtraction, multiplication with optional accumulate, division, and square root.
The NVIC is closely integrated with the core to achieve low-latency interrupt processing.
The NVIC has 1 to 240 configurable external interrupts. The number is fixed at implementation.
It also has a configurable number of interrupt priority levels, from 8 to 256, again fixed at implementation. Interrupts can also be reprioritized dynamically.
The NVIC supports tail-chaining and late arrival of interrupts. This enables back-to-back interrupt processing without the overhead of state saving and restoration between interrupts.
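The 8 to 256 priority levels correspond to 3 to 8 implemented priority bits, and on ARMv7-M those bits sit at the top of each 8-bit priority field. The sketch below shows that arithmetic on a host machine. The helper names `priority_levels` and `encode_priority` are made up for this example; on a real part you would normally go through the CMSIS `NVIC_SetPriority` call rather than encode the field by hand.

```c
#include <stdint.h>

/* Number of distinct priority levels for a given count of implemented bits
 * (3 bits -> 8 levels, 8 bits -> 256 levels). */
unsigned priority_levels(unsigned prio_bits)
{
    return 1u << prio_bits;
}

/* Hypothetical helper: place a priority level into the upper bits of the
 * 8-bit priority field. Assumes 3 <= prio_bits <= 8. */
uint8_t encode_priority(unsigned level, unsigned prio_bits)
{
    return (uint8_t)((level << (8u - prio_bits)) & 0xFFu);
}
```

For example, with 3 implemented bits, level 5 is stored as 0xA0, because the low five bits of the field are not implemented.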
The memory protection unit (MPU) manages CPU accesses to memory, to prevent one task from accidentally corrupting the memory or resources used by any other active task. Memory is organized into 8 or 16 protected regions, depending on the implementation, and each region can in turn be divided into 8 subregions. Region sizes range from 32 bytes to the whole 4 gigabytes of addressable memory.
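The 32-byte and 4-gigabyte limits follow from how the ARMv7-M MPU encodes a region size: sizes are powers of two, and the size field holds a value SIZE such that the region covers 2^(SIZE+1) bytes, with SIZE from 4 to 31. Here is a rough host-side sketch of that calculation. The function names `mpu_size_field` and `mpu_subregion_bytes` are assumptions for this example and are not part of any vendor header.

```c
#include <stdint.h>

/* Hypothetical helper: return the SIZE field value for a region of `bytes`
 * bytes, or -1 if the size is not a power of two between 32 bytes and 4 GB. */
int mpu_size_field(uint64_t bytes)
{
    if (bytes < 32u || bytes > (1ULL << 32)) {
        return -1;                      /* outside the supported range */
    }
    if ((bytes & (bytes - 1u)) != 0) {
        return -1;                      /* not a power of two */
    }
    int log2_bytes = 0;
    while ((bytes >> log2_bytes) != 1u) {
        log2_bytes++;
    }
    return log2_bytes - 1;              /* region covers 2^(SIZE+1) bytes */
}

/* Each region splits into 8 equally sized subregions. */
uint64_t mpu_subregion_bytes(uint64_t region_bytes)
{
    return region_bytes / 8u;
}
```

So a 32-byte region encodes as 4 and a full 4 GB region encodes as 31. A 256-byte region, for instance, would split into eight 32-byte subregions.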
Tightly coupled memory (TCM) is a technology that ARM’s partners can use to extend the effective caching of a single M7 processor. It was previously seen only in A and R series designs. In use, it can have the performance of a cache but, unlike a cache, its contents are directly controlled by the developer. Developers can place critical code and data inside TCM so that routines such as interrupt service routines can access them deterministically and with high performance. The M7 supports up to 16 MB of tightly coupled memory.
The AHB-Lite peripheral (AHBP) interface provides access suitable for low-latency system peripherals. It supports unaligned memory accesses, a write buffer for buffering write data, and exclusive access transfers for multiprocessor systems.
The ATB interfaces output trace information used for debugging. The ATB interface is compatible with the CoreSight architecture.
The ARM Cortex-M7 features a six-stage, dual-issue superscalar pipeline with single- and double-precision floating point units, and it can execute two instructions at a time, whereas the Cortex-M4 can execute just one. This is where most of the speed-up comes from. The M7 can also run at a higher clock frequency than the M4, and together these give, on average, a twofold uplift in DSP performance for the M7 over the M4.
By doubling the performance, ARM reckons that appliances and gadgets using the M7 can more quickly perform the complex mathematics required to finely control motor movement in robots and to analyse data from microphones, touchscreens, and other sensors.
The Data Processing Unit (DPU) provides: Parallelized integer register file with six read ports and four write ports for large-scale dual-issue.
It also uses extensive forwarding logic to minimise interlocks.
It has two ALUs, with one ALU capable of executing SIMD operations.
It has a single MAC pipeline capable of a 32×32-bit + 64-bit → 64-bit operation, with two-cycle result latency and a throughput of one MAC per cycle.
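In C terms, the 32×32-bit + 64-bit → 64-bit operation is a signed 32-bit multiply whose full 64-bit product is added to a 64-bit accumulator. The following is a small portable sketch of that operation and of a dot product built on it. The function names are made up for this example, and whether a compiler turns this into a single multiply-accumulate instruction (such as SMLAL on ARMv7-M) depends on the toolchain and options.

```c
#include <stddef.h>
#include <stdint.h>

/* One multiply-accumulate step: 32x32-bit product added to a 64-bit sum. */
int64_t mac_32x32_64(int64_t acc, int32_t a, int32_t b)
{
    /* Widen before multiplying so the full 64-bit product is kept. */
    return acc + (int64_t)a * (int64_t)b;
}

/* Dot product of two 32-bit integer vectors using the step above.
 * The caller is assumed to keep the running sum within 64 bits. */
int64_t dot_product(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc = mac_32x32_64(acc, a[i], b[i]);
    }
    return acc;
}
```

A loop like `dot_product` is the core of most filters, which is why one MAC per cycle matters so much for DSP work.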
The Prefetch Unit (PFU) provides: 64-bit instruction fetch bandwidth.
4×64-bit pre-fetch queue to decouple instruction pre-fetch from DPU pipeline operation.
It also provides a Branch Target Address Cache (BTAC) for single-cycle turn-around of branch predictor state and target address, and a static branch predictor when no BTAC is specified.
It also forwards flags for early resolution of direct branches in the decoder and first execution stages of the processor pipeline.
The Load Store Unit (LSU) provides: Dual 32-bit load channels to TCM, data cache, and AXI master interface for 64-bit load bandwidth and dual 32-bit load capability.
It also has a single 32-bit load channel to the AHB interface and a single 64-bit store channel.
The LSU also provides store buffering to increase store throughput and minimize RAM contention with data and instruction reads. It has separate store buffering for TCM, AHBP and AXIM, for Quality of Service and interface-specific optimizations.
The instruction and data caches, branch prediction, and tightly coupled memory are the features that set the M7 apart from previous M series processors. By providing high-performance instruction and data caches, the M7 approaches a more typical high-performance processor design.
Adding branch prediction allows ARM to target dedicated DSP devices with its Cortex-M7 microcontroller. DSP code often filters data streams for applications such as audio input keyword detection, audio output equalization, and frequency domain amplitude peak searching. When running on an always-on microcontroller, these tasks are almost always looped. Without a branch predictor, the code must continually evaluate a loop condition that 99.9% of the time results in the same outcome. Branch predictors cost extra die space, but when DSP is your target, they are an obvious design benefit.
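To see where a figure like 99.9% can come from, here is a rough host-side sketch of an amplitude peak search that also counts how often its loop condition says "keep going" versus "stop". The `loop_stats` type and the counters are assumptions added purely for illustration; real DSP code would not carry them, and this says nothing about how the M7's predictor is actually built.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative counters for the outcome of the loop condition. */
typedef struct {
    size_t continued;   /* condition said "another iteration" */
    size_t exited;      /* condition said "done" */
} loop_stats;

/* Return the index of the largest sample, recording loop outcomes. */
size_t peak_index(const int32_t *x, size_t n, loop_stats *stats)
{
    size_t best = 0;
    stats->continued = 0;
    stats->exited = 0;
    for (size_t i = 1; ; i++) {
        if (i >= n) {               /* the loop condition being predicted */
            stats->exited++;
            break;
        }
        stats->continued++;
        if (x[i] > x[best]) {
            best = i;
        }
    }
    return best;
}
```

Over a 1,000-sample buffer the condition is evaluated 1,000 times and goes the same way 999 times, which is 99.9%. A predictor that simply guesses "keep looping" would be wrong only once per buffer.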
According to ARM’s benchmarking, the M7 achieves five CoreMark per MHz, or a 2,000 CoreMark score at 400MHz in a 40nm process at low power, if you run the code in tightly coupled memory. The M4 can hit 3.4 CoreMark per MHz, according to previous ARM figures, and runs at a lower clock speed. The M7 can scale up to 800MHz at 28nm.
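The arithmetic behind those numbers is just the per-MHz figure multiplied by the clock frequency. Here it is as a tiny C sketch. The function name `coremark_score` is made up for this example, and applying the same five CoreMark per MHz at 800MHz is an assumption, since the per-MHz figure quoted above is for code running from tightly coupled memory at 400MHz.

```c
/* Total CoreMark score from a per-MHz figure and a clock in MHz. */
double coremark_score(double coremark_per_mhz, double clock_mhz)
{
    return coremark_per_mhz * clock_mhz;
}
```

With the quoted figures, `coremark_score(5.0, 400.0)` gives 2,000, and clock for clock the M7's per-MHz figure is roughly 1.47 times the M4's (5 divided by 3.4). If the per-MHz figure held at 800MHz, the same calculation would give 4,000.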
Atmel, Freescale and STMicroelectronics have already snapped up licenses to pump out chips with M7 cores in the 90nm to 40nm process range; each core taking up a 0.1mm square of silicon. So let’s hope new ARM Cortex-M7 based development boards come to market very soon. Thanks for watching. If you liked it, please give it a thumbs up and stay tuned for the next video.