A Survey on ARM Cortex A Processors - cs.virginia.edu

1. A Survey on ARM Cortex A Processors
  Wei Wang, Tanima Dey

2. Overview of ARM Processors
  - Focusing on Cortex A9 & Cortex A15
  - ARM ships no processors, only IP cores
    - For SoC integration
  - Targeting markets:
    - Netbooks, tablets, smart phones, game consoles
    - Digital home entertainment
    - Home and web servers
    - Wireless infrastructure
  - Design goals:
    - Performance, power, easy synthesis

3. ARM Cortex A9/A15
  - 1-4 cores
  - Out-of-order superscalar
  - Branch predictor
  - 32KB L1 I/D caches
  - ~4MB L2 caches with coherency
  - NEON (SIMD) & FPU
  - 32/28nm (A15), 45nm (A9)

4. Texas Instruments OMAP5

5. Comparison of ARM, Atom, i7

                   | Cortex A15 (no L2, 32nm) | Cortex A9 (no L2, 40nm)                 | Atom N270 (45nm)     | i7 960 (45nm)
  Number of cores  | 2 (4 maximum)            | 2 (4 maximum)                           | 1 core, 2 HT threads | 4 cores, 8 HT threads
  Frequency        | 1 GHz, [missing] GHz     | 800 MHz (Po), 2 GHz (Per)               | [missing] GHz        | [missing]
  Out-of-order?    | Yes                      | Yes                                     | No                   | Yes
  L1 cache size    | 32KB I/D                 | 32KB I/D                                | 32KB I/D             | 32KB I/D
  L2 cache size    | N/A                      | N/A                                     | 512KB                | 1MB + 8MB L3
  Issue width      | 4                        | 4                                       | 2                    | 4?
  Pipeline stages  | ?                        | 8                                       | 16                   | 14 ~ 24 (?)
  Supply voltage   | ?                        | [missing] (Per)                         | [missing] V          | [missing]
  Transistor count | ?                        | 26,00,000?                              | 47,000,000           | 731,000,000
  Die size         | ?                        | [missing] mm² (Po), [missing] mm² (Per) | 26 mm²               | 263 mm²
  Power consumption| ?                        | [missing] W (Po), [missing] W (Per)     | [missing] (TDP)      | 130 W (TDP)

6. Comparison of ARM SoC, Atom, i7

                             | TI OMAP5 (28nm)                           | Nvidia Tegra 2 (40nm)       | Atom N450 (45nm)     | i7 2600S (32nm)
  CPU cores                  | 2 x A15, 2 x M4                           | 2 x A9                      | 1 core, 2 HT threads | 4 cores, 8 HT threads
  CPU (row label incomplete) | [missing] (A15)                           | [missing]                   | [missing]            | [missing]
  ASICs                      | Video, Audio, Encryption, Display, 2D/3D  | 8x GPUs, Audio, Video, ISP  | 1 GPU                | 1 GPU
  L2                         | ?                                         | 1MB                         | 512KB                | 1MB + 8MB
  Die size                   | ?                                         | 49 mm²                      | 66 mm²               | ?
  Transistors                | ?                                         | 260,000,000                 | 123,000,000          | ?
  Package size               | 17 x 17 mm²                               | 23 x 23 mm²                 | 22 x 22              | [missing] x [missing] mm²
  Power consumption          | ?                                         | 150~500 mW ?                | [missing] (TDP)      | 65 W (TDP)

7. Power/Performance Optimization as a SoC
  - Application-specific SoC design
  - Integrate different ASICs
  - Customize Cortex processors
  - Reduced memory bandwidth & frequency
  - Mixing high-Vt / low-Vt transistors
  - Twisting floorplan, routing, clock tree design
  - Power gating / clock gating / DVFS
  - Four modes: Run, Standby, Dormant, Shutdown
  - Fine-grained pipeline shutdown
  - Faster register save and restore (state save/restore)
  - Power domains & voltage domains

8. Power Saving as SoC

  - Power gating
  - Different power domains:
    - Cores
    - NEON/VFP
    - Debug interface
    - L2 cache tags (per bank)
    - L2 cache control
    - Interrupt controllers
  - Impact of power gating:
    - 3% reduction in performance
    - 2% increase in area
    - 4% increase in dynamic power
    - 95% decrease in power when turned off

9. Power/Performance as a CPU
  - Performance enhancement (power-hungry techniques):
    - Dynamic issue design
    - 4-way superscalar
    - Complex branch predictor
    - Large L1/L2 caches
  - Power savings:
    - Accurate branch prediction
    - Micro TLB
    - RISC
    - SIMD, Jazelle RCT

10. Instruction Set Architecture
  - ARM processor architecture supports 32-bit ARM and 16-bit Thumb ISAs
  - ARM architecture: RISC architecture
    - Large uniform register file
    - Load/store architecture
    - Simple addressing modes
    - Auto-increment and auto-decrement addressing modes
    - Load and store multiple instructions
  - Instructions can also be "conditionalised" based on the condition code in the Application Program Status Register

11. ARM Instruction Set Architecture
  - Thumb
    - Extension to the 32-bit ARM architecture
    - Features a subset of the most commonly used 32-bit ARM instructions compressed into 16-bit opcodes
    - Excellent code density for minimal system memory size

    - Reduced cost and power efficiency
    - Designers have the flexibility to emphasize performance or code size
    - A "Thumb-aware" core is a standard ARM processor fitted with a Thumb decompressor in the instruction pipeline
    - ARM uses the Unified Assembler Language (UAL)

12. DSP ISA extension
  - Features: new instructions to load and store pairs of registers
  - 2-3x DSP performance improvement over ARM7
  - Eliminates the need for additional hardware accelerators
  - Provides a high-performance solution with low power consumption
  - Reuses existing OS and application code
  - Supports applications including servo motor control, Voice over IP (VoIP), and video & audio codecs

13. SIMD
  - 75% higher performance for multimedia processing in embedded devices
  - "Near zero" increase in power consumption
  - Simultaneous computation of 2x16-bit or 4x8-bit operands
  - Offers a single tool-chain and processing device, transparent to the OS

14. NEON
  - Cleanly architected; works seamlessly with its own independent pipeline and register file
  - Large NEON register file with dual 128-bit/64-bit views enables efficient handling of data
  - Minimizes access to memory

  - Enhancing data throughput
  - Designed for autovectorizing compilers and hand coding
  - Provides flexible and powerful acceleration for consumer multimedia applications
  - Supports the widest range of multimedia codecs used for internet applications

15. NEON

16. Vector Floating Point Architecture
  - Coprocessor extension to the ARM architecture
  - Supports floating-point operations in half-, single-, and double-precision arithmetic
  - Fully IEEE 754 compliant with full software library support
  - Supports execution of short vector instructions, but these operate on each vector element sequentially
  - Three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications

17. Jazelle
  - Combined hardware and software solution for accelerating execution
    - Software: fully featured multi-tasking JVM
    - Hardware: coprocessor CP14 provides support for the hardware acceleration
  - Jazelle DBX technology for direct bytecode execution
    - Direct interpretation of bytecode to machine code
  - Jazelle RCT technology supports efficient AOT and JIT compilation with and beyond Java

18. Jazelle
  - Jazelle DBX and RCT are cache and memory efficient, maintaining low power
  - Jazelle DBX is a robust and proven solution and easy to integrate
  - Jazelle RCT provides an excellent target for any run-time compilation technology
  - Developers' flexibility:
    - On resource-constrained devices, Jazelle DBX only
    - On high-end platforms, Jazelle RCT alone with JIT and AOT

19. Conclusion
  - Aggressive, power-hungry design targeting high single-thread performance
    - Out-of-order execution
    - Wide superscalar
    - Large caches with coherency protocols
  - Power saving techniques for ARM CPUs
    - RISC ISA optimization: Thumb, Thumb2, ThumbEE
    - Application-specific components: SIMD, DSP, VFPUs, Jazelle
  - Power saving techniques for SoC chips
    - Fine-grained power gating & clock gating & DVFS
    - Fine-grained pipeline shutdown
    - Fast register saving/restoring
    - Customizable CPU components
    - Mixing high-Vt and low-Vt transistors

20. Reading materials
  - ARM Cortex-A9 Technical Reference Manual
  - ARM Cortex-A9 MPCore Technical Reference Manual
  - Keys to Silicon Realization of Gigahertz Performance and Low Power ARM Cortex-A15, Lamber A. et al., ARM Technology Conference 2010
  - 2 GHz Capable Cortex-A9 Dual Core Processor Implementation
  - Circuit Design: High performance AND low power, the ARM way
  - ARM MPCore Architecture Performance Enhancement
  - Cortex-A9 Processor Microarchitecture
  - Details of a New Cortex Processor, Revealed
  - ARM Cortex-A9 Performance
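
Worked example: power-gating break-even. The power-gating figures quoted in the deck (4% more dynamic power while running, 95% less power when a domain is turned off) imply that gating only pays off once a domain is idle for a large enough share of the time. The Python sketch below estimates that share under simplifying assumptions that are not in the slides: the 4% overhead is applied to the domain's whole active power, the 3% performance loss and any wake-up cost are ignored, and the idle power of 0.3 (relative to an active power of 1.0) is a made-up illustrative value. The function names are hypothetical.

```python
# Figures quoted in the deck for the impact of power gating.
DYNAMIC_POWER_OVERHEAD = 0.04  # +4% dynamic power while the domain runs
OFF_POWER_REDUCTION = 0.95     # -95% power while the domain is turned off


def average_power(idle_fraction, active_power, idle_power, gated):
    """Time-averaged power of one domain, in the same units as the inputs.

    idle_fraction is the share of time the domain has no work to do.
    Assumption: when gated, the overhead scales the whole active power and
    the idle power drops by OFF_POWER_REDUCTION; transitions are free.
    """
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be in [0, 1]")
    if gated:
        active_power = active_power * (1.0 + DYNAMIC_POWER_OVERHEAD)
        idle_power = idle_power * (1.0 - OFF_POWER_REDUCTION)
    return (1.0 - idle_fraction) * active_power + idle_fraction * idle_power


def break_even_idle_fraction(active_power, idle_power):
    """Idle share above which the gated design uses less average power."""
    extra_when_active = DYNAMIC_POWER_OVERHEAD * active_power
    saved_when_idle = OFF_POWER_REDUCTION * idle_power
    return extra_when_active / (extra_when_active + saved_when_idle)


if __name__ == "__main__":
    # Hypothetical domain: active power 1.0, ungated idle power 0.3.
    threshold = break_even_idle_fraction(1.0, 0.3)
    print(f"break-even idle fraction: {threshold:.3f}")
    for idle in (0.0, 0.25, 0.5, 0.9):
        ungated = average_power(idle, 1.0, 0.3, gated=False)
        gated = average_power(idle, 1.0, 0.3, gated=True)
        print(f"idle={idle:.2f}  ungated={ungated:.3f}  gated={gated:.3f}")
```

Under these assumptions the break-even point depends only on the ratio of idle to active power: the leakier the idle domain, the sooner gating wins.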

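Worked example: packed SIMD arithmetic. The SIMD slide describes simultaneous computation on 2x16-bit or 4x8-bit operands held in one 32-bit register. The Python sketch below emulates that lane-wise behaviour in software for a wrapping unsigned add, so that a carry out of one lane never leaks into its neighbour. It illustrates the semantics only and is not ARM's implementation; `pack`, `unpack`, and `packed_add` are hypothetical helper names.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit general-purpose register


def _lane_masks(lane_bits):
    """Return (high, low): the top bit of every lane, and all other bits."""
    if lane_bits not in (8, 16):
        raise ValueError("lane_bits must be 8 or 16")
    high = 0
    for shift in range(lane_bits - 1, 32, lane_bits):
        high |= 1 << shift
    return high, MASK32 ^ high


def pack(lanes, lane_bits):
    """Pack unsigned lane values (lowest lane first) into one 32-bit word."""
    if len(lanes) != 32 // lane_bits:
        raise ValueError("wrong number of lanes for this lane width")
    lane_mask = (1 << lane_bits) - 1
    word = 0
    for index, value in enumerate(lanes):
        word |= (value & lane_mask) << (index * lane_bits)
    return word


def unpack(word, lane_bits):
    """Split a 32-bit word back into its unsigned lanes (lowest first)."""
    lane_mask = (1 << lane_bits) - 1
    return [(word >> shift) & lane_mask for shift in range(0, 32, lane_bits)]


def packed_add(a, b, lane_bits):
    """Add every lane of a and b at once, wrapping inside each lane."""
    high, low = _lane_masks(lane_bits)
    a &= MASK32
    b &= MASK32
    # Add everything except each lane's top bit: any carry lands in that
    # lane's own top bit and cannot spill into the next lane.
    partial = (a & low) + (b & low)
    # Fold the top bits back in with XOR, which adds them without a carry out.
    return partial ^ ((a ^ b) & high)


if __name__ == "__main__":
    x = pack([250, 1, 128, 255], 8)
    y = pack([10, 2, 128, 1], 8)
    print(unpack(packed_add(x, y, 8), 8))
```

A single word-wide addition plus two masking steps handles all lanes together, which is the idea behind packing several narrow operands into one register.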
