## C12. VLIW DSPs

#### **Outline:**

- DSP architecture roadmap
- TMS 320C6000 OVERVIEW
- C6X Architecture
- C6X Instructions
- C6x Programming
- Trends: C66x, C67x
- Reference: SPRU731.pdf

<u>https://slideplayer.com/slide/7344585/!!!!</u>

-ee213a\_lec18\_VLIW\_V2.pdf

### **DSP EVOLUTION**





#### Performance

- Low : ~ 25 to 50 MHz clock, low cost / power consumption.
- Mid : ~ 150 MHz clock, multiprocessing support.
- High : Enhanced architectures VLIW (Very Long Instruction Word) or SIMD (Single Instruction, Multiple Data).



## **Texas Instruments' TMS320 family**

Different families and sub-families exist to support different markets

## **C2000**





#### **Lowest Cost**

#### **Control Systems**

- Motor Control
- Storage
- Digital Ctrl Systems

#### **Efficiency**

#### Best MIPS per Watt / Dollar / Size

- Wireless phones
- Internet audio players
- Digital still cameras
- Modems
- Telephony
- VoIP

Examples of VLIW/SIMD DSP families:

- TI TMS320 C62x, C64x, ..., C66x multicore DSP
- ADI TigerSHARC ADSP-TS20x
- Freescale (Motorola) MSC71xx and MSC81xx
- StarCore SC1400, Agere/Motorola (DSP core)

#### **Performance & Best Ease-of-Use**

- Multi Channel and Multi Function App's
- Comm Infrastructure
- Wireless Base-stations
- DSL
- Imaging
- Multi-media Servers
- Video etc



## C6000 Roadmap\*

#### **Object Code Software Compatibility**



Figure: architecture evolution over time (accumulator architecture, then memory-register architecture, then load-store architecture).



## TMS 320C6000 VLIW DSP OVERVIEW

- Different from the conventional DSP architecture
- MIMD type architecture
- HLL programming
- Code optimization performed at compile time

#### **REMEMBER!**

### **Conventional DSP Architecture**

- Multiply-accumulate (MAC) in 1 instruction cycle
- Harvard architecture for fast on-chip I/O
  - Data memory/bus separate from program memory/bus
  - One read from program memory per instruction cycle
  - Two reads/writes from/to data memory per inst. cycle
- Instructions to keep pipeline (3-6 stages) full
  - Zero-overhead looping (one pipeline flush to set up)
  - Delayed branches
  - Special addressing modes supported in hardware
    - Bit-reversed addressing (e.g. fast Fourier transforms)
    - Modulo addressing for circular buffers (e.g. filters)
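
The two special addressing modes listed above can be illustrated in software. This is only a minimal Python model of what the address-generation hardware does, not DSP code; the names `bit_reverse`, `CircularBuffer`, and `fir_step` are made up for this sketch.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (bit-reversed addressing)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the current LSB into result
        index >>= 1
    return result


class CircularBuffer:
    """Fixed-size delay line whose write pointer wraps modulo the size."""

    def __init__(self, size):
        self.data = [0] * size
        self.pos = 0  # index of the next write

    def push(self, sample):
        self.data[self.pos] = sample
        self.pos = (self.pos + 1) % len(self.data)  # modulo address update

    def newest_first(self):
        """Return the samples ordered x(n), x(n-1), ..., oldest last."""
        n = len(self.data)
        return [self.data[(self.pos - 1 - k) % n] for k in range(n)]


def fir_step(buf, coeffs, sample):
    """Push one sample and return sum of coeffs[k] * x(n-k)."""
    buf.push(sample)
    return sum(c * x for c, x in zip(coeffs, buf.newest_first()))


# Order in which an 8-point radix-2 FFT visits its data with bit-reversed addressing
fft_order = [bit_reverse(i, 3) for i in range(8)]
```

On a conventional DSP both index updates are done by the address unit in parallel with the MAC, so they cost no extra instruction cycles; in this model they are ordinary arithmetic.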

### **'C6x DSP Block Diagram**



## **C6x Internal Buses**






TMS320C62x/C64x/C67x Block Diagram



#### Features

- More instructions per cycle, packed in a "super-long instruction"
- Regular architecture, more orthogonal, RISC-like
- Uniform instruction set, more instructions

#### Very long instruction word (VLIW) size of 256 bits

- Eight 32-bit functional units with single cycle throughput
- One instruction cycle per clock cycle
- Data word size is 32 bits
  - 16 (32 on C6400) 32-bit registers in each of 2 data paths
  - 40 bits can be stored in adjacent even/odd registers
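
As a rough illustration of the numbers above, the following Python sketch shows how a 256-bit fetch packet divides into 32-bit instructions and how a 40-bit value could be split across an even/odd register pair. It assumes the 32 least significant bits go to the even register and the remaining 8 bits to the low byte of the odd register; the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK40 = 0xFFFFFFFFFF

INSTRUCTION_BITS = 32
FETCH_PACKET_BITS = 256
# Number of 32-bit instructions that fit in one 256-bit very long instruction word
instructions_per_fetch_packet = FETCH_PACKET_BITS // INSTRUCTION_BITS


def split_40bit(value):
    """Split a 40-bit pattern into (odd_reg, even_reg) register images."""
    value &= MASK40
    even_reg = value & MASK32        # 32 LSBs (assumed to live in the even register)
    odd_reg = (value >> 32) & 0xFF   # 8 MSBs (assumed to live in the odd register)
    return odd_reg, even_reg


def join_40bit(odd_reg, even_reg):
    """Rebuild the 40-bit pattern from the register pair."""
    return ((odd_reg & 0xFF) << 32) | (even_reg & MASK32)
```

For example, `split_40bit(0x123456789A)` yields the pair `(0x12, 0x3456789A)`, and `join_40bit` reverses the split.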

#### Two parallel data paths

- Data unit: 32-bit address calculations (modulo, linear)
- Multiplier unit: 16-bit × 16-bit with 32-bit result
- Logical unit: 40-bit (saturation) arithmetic and compares
- Shifter unit: 32-bit integer ALU and 40-bit shifter
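
The word sizes above can be made concrete with a small Python model. It is only a sketch of the arithmetic, assuming signed two's-complement operands; `to_signed`, `mpy16`, `saturate`, and `sat_add40` are illustrative names, not instruction mnemonics.

```python
def to_signed(value, bits):
    """Interpret the lowest `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mpy16(a, b):
    """Signed 16 x 16 multiply of the low halves; the product fits in 32 bits."""
    return to_signed(a, 16) * to_signed(b, 16)


def saturate(value, bits):
    """Clamp `value` to the signed range representable in `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def sat_add40(a, b):
    """40-bit saturating addition: overflow sticks at the extreme value."""
    return saturate(a + b, 40)
```

The largest magnitude product, `mpy16(0x8000, 0x8000)`, equals 2^30, which is why a 32-bit result register is enough. Saturation matters in accumulation loops: without it an overflow would wrap around and change the sign of the result.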



#### Functional Units and Operations Performed

| Functional Unit    | Fixed-Point Operations                                        |
|--------------------|---------------------------------------------------------------|
| .L unit (.L1, .L2) | 32/40-bit arithmetic and compare operations                   |
|                    | 32-bit logical operations                                     |
|                    | Leftmost 1 or 0 counting for 32 bits                          |
|                    | Normalization count for 32 and 40 bits                        |
| .S unit (.S1, .S2) | 32-bit arithmetic operations                                  |
|                    | 32/40-bit shifts and 32-bit bit-field operations              |
|                    | 32-bit logical operations                                     |
|                    | Branches                                                      |
|                    | Constant generation                                           |
|                    | Register transfers to/from control register file (.S2 only)   |
| .M unit (.M1, .M2) | 16 × 16-bit multiply operations                               |
| .D unit (.D1, .D2) | 32-bit add, subtract, linear and circular address calculation |
|                    | Loads and stores with 5-bit constant offset                   |
|                    | Loads and stores with 15-bit constant offset (.D2 only)       |

#### In the best case, all units operate in parallel, and in one instruction cycle the processor performs:

- four arithmetic operations,
- two multiplications,
- two address calculations.
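
The best case above translates directly into a peak instruction rate. The sketch below is a back-of-the-envelope calculation only; the 200 MHz clock in the usage note is a hypothetical value chosen for illustration, not a figure from these notes.

```python
# Best-case operations issued in one instruction cycle, as listed above
OPS_PER_CYCLE = {"arithmetic": 4, "multiply": 2, "address": 2}


def peak_instructions_per_cycle(ops=OPS_PER_CYCLE):
    """Total operations issued per cycle when all units are busy."""
    return sum(ops.values())


def peak_mips(clock_mhz, ops=OPS_PER_CYCLE):
    """Peak millions of instructions per second at the given clock."""
    return clock_mhz * peak_instructions_per_cycle(ops)
```

With an assumed 200 MHz clock, `peak_mips(200)` gives 1600. Real code rarely keeps all eight units busy, so this is an upper bound rather than an expected rate.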



## **Pipelining**




- Process the instruction stream in stages (like the stages of assembly on a manufacturing line)
- Increases throughput
- Managing pipelines
  - By the compiler or programmer
  - By pipeline interlocking

- 1 instruction every machine cycle
- Pipeline depth
  - C62x: 7-11 stages (fetch 4; decode 2; execute 1-5)
  - C67x: 7-16 stages (fetch 4; decode 2; execute 1-10)
  - a loop in the pipeline disables interrupts
  - avoid loops by using conditional execution!
- No hardware protection against pipeline hazards!
  - the compiler/assembler must warn about pipeline hazards
- Instruction dispatching
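
The stage counts quoted above follow from adding the fetch, decode, and execute phases. A minimal Python check of that arithmetic (the function and constant names are illustrative):

```python
FETCH_STAGES = 4    # fetch phases
DECODE_STAGES = 2   # decode phases
C62X_EXECUTE = (1, 5)    # shortest and longest execute phase count on C62x
C67X_EXECUTE = (1, 10)   # shortest and longest execute phase count on C67x


def pipeline_depth(execute_stages):
    """Total pipeline stages an instruction passes through."""
    return FETCH_STAGES + DECODE_STAGES + execute_stages


c62x_range = tuple(pipeline_depth(e) for e in C62X_EXECUTE)
c67x_range = tuple(pipeline_depth(e) for e in C67X_EXECUTE)
```

The depth depends on the instruction: a single-cycle operation leaves the pipeline after one execute phase, while longer operations occupy more execute phases.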




#### Fetch

The fetch phases of the pipeline are:

- PG: Program address generate
- PS: Program address send
- PW: Program access ready wait
- PR: Program fetch packet receive







<sup>†</sup>NOP is not dispatched to a functional unit.

Execute Phases of the Pipeline







#### FETCH PACKET

| F | DP | DC | E1 | E2 | E3 | E4 | E5 | E6 |
|---|----|----|----|----|----|----|----|----|
| MVK<br>LDH<br>LDH<br>MPY<br>ADD<br>SUB<br>B<br>STH<br>(F<sub>1-4</sub>) |  |  |  |  |  |  |  |  |

Time (t) = 4 clock cycles

#### DISPATCHING

| F | DP | DC | E1 | E2 | E3 | E4 | E5 | E6 |
|---|----|----|----|----|----|----|----|----|
| (F<sub>2-5</sub>) | MVK<br>LDH<br>LDH<br>MPY<br>ADD<br>SUB<br>B<br>STH |  |  |  |  |  |  |  |

Time (t) = 5 clock cycles

#### DECODING

| F | DP | DC | E1 | E2 | E3 | E4 | E5 | E6 |
|---|----|----|----|----|----|----|----|----|
| (F<sub>2-5</sub>) | LDH<br>LDH<br>MPY<br>ADD<br>SUB<br>B<br>STH | MVK |  |  |  |  |  |  |

Time (t) = 6 clock cycles

#### EXECUTE-1

| F | DP | DC | E1 | E2 | E3 | E4 | E5 | E6 |
|---|----|----|----|----|----|----|----|----|
| (F<sub>2-5</sub>) | LDH<br>MPY<br>ADD<br>SUB<br>B<br>STH | LDH | MVK |  |  |  |  |  |

Time (t) = 7 clock cycles

#### **Execute (MVK done, LDH in E1)**

| F | DP | DC | E1 | E2 | E3 | E4 | E5 | E6 |
|---|----|----|----|----|----|----|----|----|
| (F<sub>2-5</sub>) | MPY<br>ADD<br>SUB<br>B<br>STH | LDH | LDH |  | MVK done |  |  |  |

Time (t) = 8 clock cycles
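
The sequence of tables above can be reproduced with a small Python model. It is a sketch under the same simplification the tables use: the eight instructions are issued one per cycle (no parallel execute packets), fetch completes at cycle 4, and only the DP, DC, and E1 columns are tracked. The names `stage_cycles` and `snapshot` are made up for illustration.

```python
PROGRAM = ["MVK", "LDH", "LDH", "MPY", "ADD", "SUB", "B", "STH"]
FETCH_DONE = 4  # cycle in which the fetch packet has been received


def stage_cycles(position):
    """Cycles in which the instruction at `position` occupies DC and E1."""
    dc = FETCH_DONE + 2 + position  # first instruction is decoded at cycle 6
    return {"DC": dc, "E1": dc + 1}


def snapshot(t):
    """Pipeline contents at clock cycle t, in the layout of the tables."""
    state = {"DP": [], "DC": [], "E1": []}
    for pos, name in enumerate(PROGRAM):
        cycles = stage_cycles(pos)
        if t == cycles["DC"]:
            state["DC"].append(name)
        elif t == cycles["E1"]:
            state["E1"].append(name)
        elif FETCH_DONE < t < cycles["DC"]:
            state["DP"].append(name)  # still waiting to be dispatched
    return state
```

Calling `snapshot(8)` leaves MVK out of every column because it finished after its single execute phase at cycle 7, while the first LDH is in E1. Loads continue into later execute phases, which this model does not track.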

#### Vector Dot Product with pipeline effects

    ; clear A4 and initialize pointers A5, A6, and A7
           MVK  .S1  40,A2       ; A2 = 40 (loop counter)
    loop:  LDH  .D1  *A5++,A0    ; A0 = a(n)
           LDH  .D1  *A6++,A1    ; A1 = x(n)
           NOP  4                ; wait for the load delay slots
           MPY  .M1  A0,A1,A3    ; A3 = a(n) * x(n)
           NOP                   ; wait for the multiply delay slot
           ADD  .L1  A3,A4,A4    ; Y = Y + A3
           SUB  .L1  A2,1,A2     ; decrement loop counter
     [A2]  B    .S1  loop        ; if A2 != 0, then branch
           NOP  5                ; wait for the branch delay slots
           STH  .D1  A4,*A7      ; *A7 = Y

The assembler will insert NOPs automatically.

The assembler may transform sequential code into parallel code.
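
To see how much the NOPs cost in the dot-product listing above, the loop can be tallied in Python. This is a simple cycle-count sketch that assumes every instruction issues in one cycle and `NOP n` takes n cycles, with nothing executed in parallel; `LOOP_BODY`, `total_cycles`, and `dot_product` are illustrative names.

```python
# (mnemonic, cycles) for one pass through the loop body of the listing
LOOP_BODY = [
    ("LDH", 1), ("LDH", 1), ("NOP", 4),
    ("MPY", 1), ("NOP", 1),
    ("ADD", 1), ("SUB", 1),
    ("B", 1), ("NOP", 5),
]

cycles_per_iteration = sum(c for _, c in LOOP_BODY)
nop_cycles_per_iteration = sum(c for name, c in LOOP_BODY if name == "NOP")


def total_cycles(iterations):
    """MVK before the loop, the loop itself, and STH after it."""
    return 1 + iterations * cycles_per_iteration + 1


def dot_product(a, x):
    """Reference result: Y = sum of a(n) * x(n)."""
    acc = 0
    for an, xn in zip(a, x):
        acc += an * xn
    return acc
```

Under these assumptions 10 of the 16 cycles in each iteration are NOPs, which is the motivation for letting the tools fill delay slots with useful work and run instructions in parallel.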

## **'C62x Instruction Set (by category)**



Note: Refer to the 'C6000 CPU Reference Guide for more details

## **'C62x Instruction Set (by unit)**

| .S Unit |       |  | .L Unit |      |
|---------|-------|--|---------|------|
| ADD     | MVKLH |  | ABS     | NOT  |
| ADDK    | NEG   |  | ADD     | OR   |
| ADD2    | NOT   |  | AND     | SADD |
| AND     | OR    |  | CMPEQ   | SAT  |
| B       | SET   |  | CMPGT   | SSUB |
| CLR     | SHL   |  | CMPLT   | SUB  |
| EXT     | SHR   |  | LMBD    | SUBC |
| MV      | SSHL  |  | MV      | XOR  |
| MVC     | SUB   |  | NEG     | ZERO |
| MVK     | SUB2  |  | NORM    |      |
| MVKL    | XOR   |  |         |      |
| MVKH    | ZERO  |  |         |      |

| .M Unit |       |
|---------|-------|
| MPY     | SMPY  |
| MPYH    | SMPYH |

| Other |      |
|-------|------|
| NOP   | IDLE |

| .D Unit |         |
|---------|---------|
| ADD     | STB/H/W |
| ADDA    | SUB     |
| LDB/H/W | SUBA    |
| MV      | ZERO    |
| NEG     |         |

Note: Refer to the 'C6000 CPU Reference Guide for more details.

## C6700: Superset of Fixed-Point (by unit)



Note: Refer to the 'C6000 CPU Reference Guide for more details.

## 'C64x: Superset of 'C62x

| Unit | Group           | Instructions |
|------|-----------------|--------------|
| .S   | Dual/Quad Arith | SADD2, SADDUS2, SADD4 |
|      | Bitwise Logical | ANDN |
|      | Shifts & Merge  | SHR2, SHRU2, SHLMB, SHRMB |
|      | Data Pack/Un    | PACK2, PACKH2, PACKLH2, PACKHL2, UNPKHU4, UNPKLU4, SWAP2, SPACK2, SPACKU4 |
|      | Compares        | CMPEQ2, CMPEQ4, CMPGT2, CMPGT4 |
|      | Branches/PC     | BDEC, BPOS, BNOP, ADDKPC |
| .L   | Dual/Quad Arith | ABS2, ADD2, ADD4, MAX, MIN, SUB2, SUB4, SUBABS4 |
|      | Bitwise Logical | ANDN |
|      | Shift & Merge   | SHLMB, SHRMB |
|      | Load Constant   | MVK (5-bit) |
|      | Data Pack/Un    | PACK2, PACKH2, PACKLH2, PACKHL2, PACKH4, PACKL4, UNPKHU4, UNPKLU4, SWAP2/4 |
| .D   | Dual Arithmetic | ADD2, SUB2 |
|      | Bitwise Logical | AND, ANDN, OR, XOR |
|      | Address Calc.   | ADDAD |
|      | Mem Access      | LDDW, LDNW, LDNDW, STDW, STNW, STNDW |
|      | Load Constant   | MVK (5-bit) |
| .M   | Average         | AVG2, AVG4 |
|      | Shifts          | ROTL, SSHVL, SSHVR |
|      | Bit Operation   | BITC4, BITR, DEAL, SHFL |
|      | Move            | MVD |
|      | Multiplies      | MPYHI, MPYLI, MPYHIR, MPYLIR, MPY2, SMPY2, DOTP2, DOTPN2, DOTPRSU2, DOTPNRSU2, DOTPU4, DOTPSU4, GMPY4, XPND2/4 |



\* Typical efficiency vs. hand optimized assembly

### **Software Tool Flow**



The C6x compiler shell (cl6x) runs all the code generation tools.

## **Debug Tools Flow**





### **C66x Multicore DSP**

C66x – world's fastest floating-point DSP core with devices ranging from single core C6654 to octal core C6678 and supporting core speeds up to 1.4GHz

#### **Main Features**

- Up to 1.4GHz of fixed and floating-point performance per DSP core
- Single core to eight core scalability
- KeyStone<sup>™</sup> architecture for enhanced multicore performance
- Large embedded memory and high bandwidth DDR3/DDR3L interface
- Network Coprocessor (NetCP) option including security and packet acceleration
- High-speed I/O including PCIe, Serial RapidIO, Gigabit Ethernet, HyperLink

http://www.ti.com/lsds/ti/processors/dsp/c6000\_dsp/c66x/products.page



Packaging: 21mm x 21mm

TMS320C665x

#### **APPLICATIONS**

- Avionics and defense
- Communications systems
- Machine vision
- Embedded and cloud analytics
- High performance computing
- Multimedia infrastructure
- Medical imaging
- Test and measurement
- Surveillance and security
- Software defined radio (SDR)

- TI's new TMS320C66x (aka C66x) series is a multicore chip *designed for 4G cellular base stations and radio network controllers*. The C66x is a 40nm chip that comes in single-core, dual-core, quad-core and eight-core variations. Its most distinguishing feature is *the addition of floating-point instructions*, which were incorporated to support the more complex processing required for 4G wireless communications. The previous-generation C64x series DSPs supported only fixed-point math.
- The C66x is implemented with TI's new *KeyStone architecture*, which incorporates an *eight-way VLIW architecture*, a high-speed switch fabric called TeraNet, and a multicore navigator and DMA system that manages packet sending to other cores and peripherals.
  All the C66x products come with 512 KB of L2 cache per core, along with 32 KB of L1 cache each for instructions and data.
- In its eight-core 1.25 GHz implementation, the C66x delivers 160 single-precision (SP) Gflops while drawing just 10 W of power. That works out to 16 SP Gflops/watt. Energy efficiency is a hallmark of DSPs in general, since they typically populate systems (like the aforementioned cellular base station towers and radio network controllers) where power and cooling are in short supply.
- The first HPC (High Performance Computing)-friendly C66x-based device is a PCIe card, which carries four of the eight-core DSPs running at 1.0 GHz. Built by Advantech, a TI partner, the half-length PCIe card delivers 512 SP Gflops at a modest 50 W.
- On-board memory consists of 4 GB of DDR3 RAM (1333 MHz) with full ECC support. They are also working on a full-length card with eight DSPs, twice as much memory, and twice the performance.
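
The performance figures quoted above are mutually consistent, which a short Python calculation can confirm. The per-core rate of 16 SP flops per cycle is not stated in the text; it is derived here from the 160 Gflops figure, and the function names are illustrative.

```python
def flops_per_core_cycle(gflops, cores, clock_ghz):
    """SP floating-point operations per core per cycle implied by a peak rating."""
    return gflops / (cores * clock_ghz)


def peak_gflops(dsps, cores_per_dsp, clock_ghz, flops_per_cycle):
    """Peak SP Gflops for a board with `dsps` identical multicore DSPs."""
    return dsps * cores_per_dsp * clock_ghz * flops_per_cycle


# Eight-core device at 1.25 GHz rated at 160 SP Gflops and 10 W
per_cycle = flops_per_core_cycle(160, 8, 1.25)
chip_gflops_per_watt = 160 / 10

# PCIe card: four eight-core DSPs at 1.0 GHz, 50 W
card_gflops = peak_gflops(4, 8, 1.0, per_cycle)
card_gflops_per_watt = card_gflops / 50
```

The card works out to 512 SP Gflops, matching the quoted figure, at roughly 10 Gflops/watt. That is lower than the single chip's 16 Gflops/watt because the card's 50 W budget covers four DSPs running at a lower clock plus the rest of the board.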

http://processors.wiki.ti.com/index.php/Keystone\_Device\_Architecture

http://www.hpcwire.com/2011/10/27/texas\_instruments\_makes\_hpc\_play\_with\_new\_multicore\_dsp\_chips/

TMS320C66x KeyStone™ Multicore DSP


#### **TI embedded processors**



https://www.yumpu.com/en/document/view/18654009/digital-media-overview-texas-instruments

#### Exam:

- Multiple-choice test with 25-30 questions
- 3-4 problems, one of which involves a C2x table