Lecture 9: Digital Signal Processors: Applications and ...

Slide 1: Lecture 9: Digital Signal Processors: Applications and Architectures
Prepared by: Professor Kurt Keutzer
Computer Science 252, Spring 2000
With contributions from: Dr. Jeff Bier, BDTI; Dr. Brock Barton, TI; Prof. Bob Brodersen; Prof. David Patterson

Slide 2: Processor Applications
- General Purpose - high performance
  - Pentiums, Alphas, SPARC
  - Used for general purpose software
  - Heavy weight OS - UNIX, NT
  - Workstations, PCs
- Embedded processors and processor cores
  - ARM, 486SX, Hitachi SH7000, NEC V800
  - Single program
  - Lightweight, often realtime OS
  - DSP support
  - Cellular phones, consumer electronics (e.g., CD players)
- Microcontrollers
  - Extremely cost sensitive
  - Small word size - 8 bit common
  - Highest volume processors by far
  - Automobiles, toasters, thermostats, ...
[Figure arrows: increasing cost; increasing volume]

Slide 3: Processor Markets
[Figure: processor market chart; recoverable labels: $30B, $10B/33%, 8-bit micro, 16-bit micro, 32-bit micro, DSP]

Slide 4: The Processor Design Space
[Figure: cost vs. performance]
- Microprocessors: performance is everything & software rules
- Embedded processors
- Microcontrollers: cost is everything
- Application specific architectures for performance

Slide 5: Market for DSP Products
[Figure: chart with segments Mixed/Signal, Analog, DSP]
- DSP is the fastest growing segment of the semiconductor market

Slide 6: DSP Applications
- Audio applications
  - MPEG Audio
  - Portable audio
- Digital cameras
- Wireless
  - Cellular telephones
  - Base station
- Networking
  - Cable modems
  - ADSL
  - VDSL

Slide 7: Another Look at DSP Applications
- High-end
  - Wireless Base Station - TMS320C6000
  - Cable modem gateways
- Mid-end
  - Cellular phone - TMS320C540
  - Fax/voice server
- Low end
  - Storage products - TMS320C27
  - Digital camera - TMS320C5000
  - Portable phones
  - Wireless headsets
  - Consumer audio
  - Automobiles, toasters, thermostats.

[Figure arrows: increasing cost; increasing volume]

Slide 8: Serving a range of applications

Slide 9: World's Cellular Subscribers
[Figure: subscribers in millions (0 to 700) by year, 1993 to 2001, digital vs. analog. Source: Ericsson Radio Systems]
- ... provide a ubiquitous infrastructure for wireless data as well as voice

Slide 10: Cellular Telephone System
[Figure: block diagram; labels: physical layer processing, RF modem, controller, keypad (digits 1 to 9 and 0, example number 415-555-1212), speech decode, speech encode, A/D, baseband converter, DAC]

Slide 11: HW/SW/IC Partitioning
[Figure: the same block diagram, partitioned into analog IC, DSP, ASIC, and microcontroller]

Slide 12: Mapping onto a system on a chip
[Figure: chip floorplan; hardware labels: RAM, µC, RAM, DSP core, ASIC logic, S/P, DMA; mapped functions: phone book, protocol, keypad intfc, control, speech quality enhancement, de-intl & decoder, voice recognition, RPE-LTP speech decoder, demodulator and synchronizer, Viterbi equalizer]

Slide 13: Example Wireless Phone Organization
[Figure labels: C540, ARM7]

Slide 14: Multimedia I/O Architecture
[Figure labels: low power bus, radio modem, embedded processor, FIFO, video decomp, video, audio, FB, FIFO, graphics, pen, sched, ECC, pact interface, data flow, SRAM]

Slide 15: Multimedia System on a Chip
- Future chips will be a mix of processors, memory and dedicated hardware for specific algorithms and I/O
[Figure: multimedia terminal electronics; labels: µP, DSP, Coms, video unit, custom, memory, uplink radio, downlink radio, graphics out, video I/O, voice I/O, pen]

Slide 16: Requirements of the Embedded Processors
- Optimized for a single program - code often in on-chip ROM or off-chip EPROM
- Minimum code size (one of the motivations initially for Java)
- Performance obtained by optimizing datapath
- Low cost
  - Lowest possible area
  - Technology behind the leading edge
  - High level of integration of peripherals (reduces system cost)
- Fast time to market
  - Compatible architectures (e.g., ARM) allow reusable code
  - Customizable core
- Low power if application requires portability

Slide 17: Area of processor cores = Cost
[Figure labels: Nintendo processor, cellular phones]

Slide 18: Another figure of merit
- Computation per unit area
[Figure labels: Nintendo processor, cellular phones, ???]

Slide 19: Code size
- If a majority of the chip is the program stored in ROM, then code size is a critical issue
- The Piranha has 3 instruction sizes - basic 2 byte, and 2 byte plus 16 or 32 bit immediate

Slide 20: Benchmarks - DSPstone
- Zivojnovic, Velarde, Schlager: University of Aachen
- Application benchmarks
  - ADPCM TRANSCODER - CCITT
  - REAL_UPDATE
  - COMPLEX_UPDATES
  - DOT_PRODUCT
  - MATRIX_1X3
  - CONVOLUTION
  - FIR
  - FIR2DIM
  - HR_ONE_BIQUAD
  - LMS
  - FFT_INPUT_SCALED

Slide 21: Evolution of GP and DSP
- General Purpose Microprocessor traces roots back to Eckert, Mauchly, Von Neumann (ENIAC)
- DSP evolved from Analog Signal Processors, using analog hardware to transform physical signals (classical electrical engineering)
- ASP to DSP because
  - DSP insensitive to environment (e.g., same response in snow or desert if it works at all)
  - DSP performance identical even with variations in components.

  - 2 analog systems' behavior varies even if built with the same components with 1% variation
- Different history and different applications led to different terms, different metrics, some new inventions
- Convergence of markets will lead to architectural showdown

Slide 22: Embedded Systems vs. General Purpose Computing - 1
- Embedded system
  - Runs a few applications often known at design time
  - Not end-user programmable
  - Operates in fixed run-time constraints; additional performance may not be useful/valuable
- General purpose computing
  - Intended to run a fully general set of applications
  - End-user programmable
  - Faster is always better

Slide 23: Embedded Systems vs. General Purpose Computing - 2
- Embedded system
  - Differentiating features:
    - power
    - cost
    - speed (must be predictable)
- General purpose computing
  - Differentiating features:
    - speed (need not be fully predictable)
    - speed
    - did we mention speed?
    - cost (largest component power)

Slide 24: DSP vs. General Purpose MPU
- DSPs tend to be written for 1 program, not many programs.

- Hence OSes are much simpler: there is no virtual memory or protection, ...
- DSPs sometimes run hard real-time apps
  - You must account for anything that could happen in a time slot
  - All possible interrupts or exceptions must be accounted for and their collective time subtracted from the time interval.
  - Therefore, exceptions are BAD!
- DSPs have an infinite continuous data stream

Slide 25: DSP vs. General Purpose MPU
- The "MIPS/MFLOPS" of DSPs is speed of Multiply-Accumulate (MAC).
  - DSPs are judged by whether they can keep the multipliers busy 100% of the time.
- The "SPEC" of DSPs is 4 algorithms:
  - Infinite Impulse Response (IIR) filters
  - Finite Impulse Response (FIR) filters
  - FFT, and
  - convolvers
- In DSPs, algorithms are king!
  - Binary compatibility not an issue
- Software is not (yet) king in DSPs.
  - People still write in assembly language for a product to minimize the die area for ROM in the DSP

Slide 26: Types of DSP Processors
- DSP multiprocessors on a die
  - TMS320C80
  - TMS320C6000
- 32-bit floating point
  - TI TMS320C4X
  - Motorola 96000
  - AT&T DSP32C
  - Analog Devices ADSP21000
- 16-bit fixed point
  - TI TMS320C2X
  - Motorola 56000
  - AT&T DSP16
  - Analog Devices ADSP2100

Slide 27: Note of Caution on DSP Architectures
- Successful DSP architectures have two aspects:
  - Key architectural and micro-architectural features that enabled product success in key parameters
    - Speed
    - Code density
    - Low power
  - Architectural and micro-architectural features that are artifacts of the era in which they were designed
- We will focus on the former!
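
As a rough illustration of the multiply-accumulate (MAC) metric and the FIR filter named above, the following Python sketch computes one FIR output sample as a chain of MACs. The function name `fir_output` and the sample values are assumptions made for illustration and do not come from the lecture. Plain Python integers stand in for the accumulator, so word width and overflow are not modeled here.

```python
def fir_output(coeffs, delay_line):
    """Return one FIR output: the sum of coeffs[m] * delay_line[m].

    delay_line[0] is the newest sample x(n); delay_line[m] is x(n - m).
    Each loop iteration is one multiply-accumulate (MAC).
    """
    if len(coeffs) != len(delay_line):
        raise ValueError("need one sample per coefficient")
    acc = 0  # accumulator
    for h, x in zip(coeffs, delay_line):
        acc += h * x  # one MAC
    return acc


# Hypothetical 3-tap example with integer samples.
print(fir_output([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

Keeping the multiplier busy on every cycle corresponds to retiring one iteration of this loop per instruction cycle, which is why the two operand fetches per iteration matter so much.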

Slide 28: Architectural Features of DSPs
- Data path configured for DSP
  - Fixed-point arithmetic
  - MAC - Multiply-accumulate
- Multiple memory banks and buses
  - Harvard Architecture
  - Multiple data memories
- Specialized addressing modes
  - Bit-reversed addressing
  - Circular buffers
- Specialized instruction set and execution control
  - Zero-overhead loops
  - Support for MAC
- Specialized peripherals for DSP
- THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN!!!

Slide 29: DSP Data Path: Arithmetic
- DSPs dealing with numbers representing real world => Want "reals"/fractions
- DSPs dealing with numbers for addresses => Want integers
- Support "fixed point" as well as integers
  - Fractions (radix point right after the sign bit): $-1 \le x < 1$
  - Integers (radix point after the last bit): $-2^{N-1} \le x < 2^{N-1}$

Slide 30: DSP Data Path: Precision
- Word size affects precision of fixed point numbers
- DSPs have 16-bit, 20-bit, or 24-bit data words
- Floating point DSPs cost 2X - 4X vs. fixed point, slower than fixed point
- DSP programmers will scale values inside code
  - SW libraries
  - Separate explicit exponent
- "Blocked Floating Point": single exponent for a group of fractions
- Floating point support simplifies development

Slide 31: DSP Data Path: Overflow?

- DSPs are descended from analog: what should happen to the output when you "peg" an input? (e.g., turn up volume control knob on stereo)
  - Modulo arithmetic???
- Set to most positive ($2^{N-1}-1$) or most negative value ($-2^{N-1}$): "saturation"
- Many algorithms were developed in this model

Slide 32: DSP Data Path: Multiplier
- Specialized hardware performs all key arithmetic operations in 1 cycle
- 50% of instructions can involve multiplier => single cycle latency multiplier
- Need to perform multiply-accumulate (MAC)
- n-bit multiplier => 2n-bit product

Slide 33: DSP Data Path: Accumulator
- Don't want overflow or have to scale accumulator
- Option 1: accumulator wider than product: "guard bits"
  - Motorola DSP: 24b x 24b => 48b product, 56b accumulator
- Option 2: shift right and round product before adder
[Figure: two datapath diagrams; labels: Multiplier, ALU, Accumulator (in both), Shift, G]

Slide 34: DSP Data Path: Rounding
- Even with guard bits, will need to round when storing accumulator into memory
- 3 DSP standard options
  - Truncation: chop results => biases results down (chopping a two's complement value always rounds toward negative infinity)
  - Round to nearest: < 1/2 round down, >= 1/2 round up (more positive) => smaller bias
  - Convergent: < 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make lsb a zero (+1 if 1, +0 if 0) => no bias
    - IEEE 754 calls this round to nearest even

Slide 35: Data Path
- DSP Processor
  - Specialized hardware performs all key arithmetic operations in 1 cycle.
  - Hardware support for managing numeric fidelity:
    - Shifters
    - Guard bits
    - Saturation
- General-Purpose Processor
  - Multiplies often take > 1 cycle
  - Shifts often take > 1 cycle
  - Other operations (e.g., saturation, rounding) typically take multiple cycles.

Slide 36: 320C54x DSP Functional Block Diagram

Slide 37: FIR Filtering: A Motivating Problem
- M most recent samples in the delay line (Xi)
- New sample moves data down delay line
- "Tap" is a multiply-add
- Each tap (M+1 taps total) nominally requires:
  - Two data fetches
  - Multiply
  - Accumulate
  - Memory write-back to update delay line
- Goal: 1 FIR tap / DSP instruction cycle

Slide 38: Benchmarks - FIR Filter
- Finite-impulse response filter
[Figure: tapped delay line with delay elements $Z^{-1}$ and coefficients $C_1, C_2, \ldots, C_{N-1}, C_N$]

Slide 39: Micro-architectural impact - MAC
- $y(n) = \sum_{m=0}^{N-1} h(m)\,x(n-m)$: element of finite-impulse response filter computation
[Figure: MAC datapath; labels: X, Y, MPY, ADD/SUB, ACC REG]

Slide 40
- The critical hardware unit in a DSP is the multiplier - much of the architecture is organized around allowing use of the multiplier on every cycle
- This means providing two operands on every cycle, through multiple data and address busses, multiple address units and local accumulator feedback
[Figure: Mapping of the filter onto a DSP execution unit]

Slide 41: MAC Eg.

- 320C54x DSP Functional Block Diagram

Slide 42: DSP Memory
- FIR tap implies multiple memory accesses
- DSPs want multiple data ports
- Some DSPs have ad hoc techniques to reduce memory bandwidth demand
  - Instruction repeat buffer: do 1 instruction 256 times
  - Often disables interrupts, thereby increasing interrupt response time
- Some recent DSPs have instruction caches
  - Even then may allow programmer to "lock in" instructions into cache
  - Option to turn cache into fast program memory
- No DSPs have data caches
- May have multiple data memories

Slide 43: Conventional "Von Neumann" memory

Slide 44: Harvard Architecture in DSP
[Figure labels: program memory, X memory, Y memory, global, P data, X data, Y data]

Slide 45: Memory Architecture
- DSP Processor
  - Harvard architecture
  - 2-4 memory accesses/cycle
  - No caches - on-chip SRAM
- General-Purpose Processor
  - Von Neumann architecture
  - Typically 1 access/cycle
  - May use caches
[Figure labels: Processor, Program Memory, Data Memory (DSP side); Processor, Memory (general-purpose side)]

Slide 46: Eg. TMS320C3x Memory Block Diagram - Harvard Architecture

Slide 47: Eg.

320C62x/67x DSP

Slide 48: DSP Addressing
- Have standard addressing modes: immediate, displacement, register indirect
- Want to keep MAC datapath busy
- Assumption: any extra instructions imply clock cycles of overhead in inner loop
  => complex addressing is good
  => don't use datapath to calculate fancy address
- Autoincrement/Autodecrement register indirect
  - lw r1,0(r2)+ => r1 <- M[r2]; r2 <- r2+1
  - Option to do it before addressing, positive or negative

Slide 49: DSP Addressing: FFT
- FFTs start or end with data in weird butterfly order
    0 (000) => 0 (000)
    1 (001) => 4 (100)
    2 (010) => 2 (010)
    3 (011) => 6 (110)
    4 (100) => 1 (001)
    5 (101) => 5 (101)
    6 (110) => 3 (011)
    7 (111) => 7 (111)
- What can we do to avoid overhead of address checking instructions for FFT?
- Have an optional "bit reverse" addressing mode for use with autoincrement addressing
- Many DSPs have "bit reverse" addressing for radix-2 FFT

Slide 50: Bit Reversed Addressing
[Figure: Data flow in the radix-2 decimation-in-time FFT algorithm. Inputs x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7) (addresses 000, 100, 010, 110, 001, 101, 011, 111) pass through four 2-point DFTs, two 4-point DFTs, and one 8-point DFT to give outputs F(0) through F(7)]

Slide 51: DSP Addressing: Buffers
- DSPs dealing with continuous I/O
- Often interact with an I/O buffer (delay lines)
- To save memory, buffer often organized as circular buffer
- What can we do to avoid overhead of address checking instructions for circular buffer?
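
One way to read the question above: if the address pointer wraps around at the end of the buffer on its own, the inner loop needs no explicit bounds check and no samples are ever copied. The Python sketch below imitates that wraparound in software. The class name `CircularDelayLine` and its methods are hypothetical, and the `%` operation only stands in for what a DSP's address unit would do as part of the addressing mode.

```python
class CircularDelayLine:
    """Fixed-size delay line whose write pointer wraps around (hypothetical helper)."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        self.buf = [0] * size
        self.head = 0  # index of the newest sample

    def push(self, sample):
        # Step back one slot with wraparound; the modulo plays the role of
        # the automatic address wrap, so existing samples are never moved.
        self.head = (self.head - 1) % len(self.buf)
        self.buf[self.head] = sample

    def newest_first(self):
        # Read x(n), x(n-1), ... by stepping forward with wraparound.
        n = len(self.buf)
        return [self.buf[(self.head + k) % n] for k in range(n)]


line = CircularDelayLine(4)
for s in [1, 2, 3, 4, 5]:
    line.push(s)
print(line.newest_first())  # [5, 4, 3, 2]
```

Each new sample overwrites the oldest one in place, so updating the delay line costs a single memory write instead of shifting the whole buffer.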

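
The bit-reversed ordering tabulated under "DSP Addressing: FFT" can be reproduced in software as follows. This is only a sketch of the index mapping: the function name `bit_reverse` is an assumption, and a DSP with a bit-reverse addressing mode would generate these addresses as part of addressing rather than with a loop like this one.

```python
def bit_reverse(index, num_bits):
    """Reverse the low num_bits bits of index (hypothetical helper)."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)  # append the lowest bit of index to result
        index >>= 1
    return result


# 8-point FFT: 3 address bits.
print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the mapping twice returns the original index, which is why the same addressing mode serves whether the FFT starts or ends with data in bit-reversed order.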