## ELE 375 / COS 471 Final Exam Fall, 2001 Prof. Martonosi

| Question | Score |
|----------|-------|
| 1        | /10   |
| 2        | /20   |
| 3        | /15   |
| 4        | /15   |
| 5        | /10   |
| 6        | /20   |
| 7        | /20   |
| 8        | /25   |
| 9        | /30   |
| 10       | /30   |
| 11       | /30   |
| 12       | /15   |
| 13       | /10   |
| Total    | / 250 |

Please write your answers clearly in the space provided. For full credit and/or to get partial credit, show your work.

Name: \_\_\_\_\_

Honor code:

1. (10 points) TLB's are typically built to be fully-associative or highly set-associative. In contrast, firstlevel data caches are more likely to be direct-mapped or 2 or 4-way set associative. Give two good reasons why this is so.

2. (20 points) The design team for a simple, single-issue processor is choosing between a pipelined or non-pipelined implementation. Here are some design parameters for the two possibilities:

| Parameter                    | Pipelined Version | Non-Pipelined Version |
|------------------------------|-------------------|-----------------------|
| Clock Rate                   | 500MHz            | 350 MHz               |
| CPI for ALU instructions     | 1                 | 1                     |
| CPI for Control instructions | 2                 | 1                     |
| CPI for Memory instructions  | 2.7               | 1                     |

a. (10 points) For a program with 20% ALU instructions, 10% control instructions and 75% memory instructions, which design will be faster? Give a quantitative CPI average for each case.

b. (10 points) For a program with 80% ALU instructions, 10% control instructions and 10% memory instructions, which design will be faster? Give a quantitative CPI average for each case.

3. (15 points) For each of the characteristics below, complete the table by checking the appropriate entry (architecture or implementation) to indicate whether each feature is most often a characteristic of a computer architecture or of a particular CPU implementation.

| Characteristic                     | Architecture? | Implementation? |
|------------------------------------|---------------|-----------------|
| # of instructions issued per cycle |               |                 |
| size of data cache                 |               |                 |
| # of registers visible to          |               |                 |
| programmers                        |               |                 |
| miss penalty for L1 data cache     |               |                 |
| Presence/absence of a left shift   |               |                 |
| instruction                        |               |                 |
| Branch delay slots                 |               |                 |

4. (10 points) Imagine an instruction whose function is to read four adjacent 32-bit words from memory and places them into four specified 32-bit architectural registers. Assuming the 5-stage pipeline is filled with these instructions and these instructions ONLY, what is the minimum number of register file read and write ports that would be required?

5. (25 points) How many total SRAM bits will be required to implement a 256KB four-way setassociative cache. The cache is physically-indexed cache, and has 64-byte blocks. Assume that there are 4 extra bits per entry: 1 valid bit, 1 dirty bit, and 2 LRU bits for the replacement policy. Assume that the physical address is 50 bits wide. 6. (30 points) Consider a 64 byte, direct-mapped cache with 32-byte lines. A data reference stream is presented to the cache in the order shown below. (Each reference is a read of a single 32-bit (aka 4-byte) word starting at the given address.) For each reference, indicate whether it is a hit or a miss. Compute the miss rate and write it below.

| Address<br>Referenced | Hit or Miss? |
|-----------------------|--------------|
| 0                     |              |
| 4                     |              |
| 72                    |              |
| 136                   |              |
| 68                    |              |
| 140                   |              |
| 76                    |              |
| 80                    |              |
| 88                    |              |
| 92                    |              |
| 96                    |              |
| 204                   |              |
| 0                     |              |
| 36                    |              |
| 40                    |              |

- 7. (25 points) Consider a memory system with the following parameters:
- Translation Lookaside Buffer has 256 total entries and is 2-way set associative
- 64Kbyte L1 Data Cache has 64 byte lines and is also 2-way set associative
- <u>Virtual addresses</u> are 32-bits and <u>physical addresses</u> are 24 bits
- 8KB page size

The figures below are labeled diagrams of the cache and TLB. Please fill in the appropriate information in the boxes below:

|     | L1 Cache |     | TLB  |
|-----|----------|-----|------|
| A = | bits     | F = | bits |
| B = | bits     | G = | bits |
| C = | bits     | H = | bits |
| D = | bits     | I = | bits |
| E = | bits     |     |      |



## 8. (30 points) Pipelining

a. (15 points) First consider a standard 5-stage pipeline of the type discussed in class: IF-ID-EX-M-WB. The only difference is that the pipeline here implements NO BYPASSING. All data dependences are handled by having the pipeline stall until the register fetch will result in the correct data being fetched. For the following pairs of instructions, indicate the number of stall cycles required between instruction i and instruction i+1:

| Instruction i  | Instruction I+1 | # Stall cycles |
|----------------|-----------------|----------------|
| add r1, r2, r3 | add r4, r1, r5  |                |
| add r1, r2, r3 | sw r1, 42(r8)   |                |
| lw r1, 10(r7   | add r11, r1, r9 |                |

- b. (15 points) Next consider a non-standard 6-stage pipeline as described below. Once again, the pipeline implements NO BYPASSING as in part (a). Please again complete the stall cycles for each of the cases given.
- **IF** Instruction Fetch
- **ID** Instruction Decode
- **RF** Register fetch during second half of cycle
- EX ALU execution, memory address calculation
- M Memory operation
- **WB** Writeback to register file during first half of cycle

| Instruction i  | Instruction I+1 | # Stall cycles |
|----------------|-----------------|----------------|
| add r1, r2, r3 | add r4, r1, r5  |                |
| add r1, r2, r3 | sw r1, 42(r8)   |                |
| lw r1, 10(r7   | add r11, r1, r9 |                |

- 9. (15 points) Bandwidth: Dr. Andrew Tanenbaum is often attributed for the quote "Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway."
  - a. The specs for my Subaru station wagon say it has an interior volume of 95.9 cubic feet. Current magnetic tapes hold about 500 Gbits per cubic inch. So if someone "hurtles" at 65 mph down the NJ Turnpike with my car full of these tapes, what bandwidth (in bits per second) are they achieving? You can assume, for the purposes of this problem, that the driver takes up 10 cubic feet.

b. Strands of optical fiber used in long-haul data networks have bandwidths of about 2 terabits per second. (That's 2\*10<sup>12</sup> bits per second.) How does this bandwidth compare to the station wagon? Give 5 good reasons why optical fiber used more often than a station wagon for delivering email.

10. (10 points) Memory

a. (5 points) Draw the rough timing diagram for an SRAM write operation. Include the data, address, CE, and WE signals.

b. (5 points) How many address pins are needed for a 1Mbit x 1 DRAM?