## DESIGN OF POWER AND PERFORMANCE OPTIMAL 3D-NoC ARCHITECTURES

Thesis

Submitted in partial fulfillment of the requirements for the degree of **DOCTOR OF PHILOSOPHY**

by Bheemappa Halavar (148004 CS14F06)



DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA, SURATHKAL, MANGALORE - 575025

June 2020

DECLARATION

by the Ph.D. Research Scholar

I hereby declare that the Research Thesis entitled **Design of Power and Performance Optimal 3D-NoC Architectures** which is being submitted to the **National Institute of Technology Karnataka**, **Surathkal** in partial fulfilment of the requirements for the award of the Degree of **Doctor of Philosophy** in **Computer Science and Engineering** is a **bonafide report of the research work carried out by me**. The material contained in this Research Thesis has not been submitted to any University or Institution for the award of any degree.

(**148004 CS14F06**, **Bheemappa Halavar**) Department of Computer Science and Engineering

Place: NITK, Surathkal. Date: June 02, 2020

#### CERTIFICATE

This is to *certify* that the Research Thesis entitled **Design of Power and Performance Optimal 3D-NoC Architectures** submitted by **BHEEMAPPA HALAVAR**, (Register Number: 148004 CS14F06) as the record of the research work carried out by him, is *accepted as the Research Thesis submission* in partial fulfilment of the requirements for the award of degree of **Doctor of Philosophy**.

Dr. Basavaraj Talawar Research Guide

Dr. Alwyn Roshan Pais

Chairman - DRPC

## ACKNOWLEDGMENT

I would like to express my sincere gratitude to my research guide Dr. Basavaraj Talawar, for his guidance, support and encouragement throughout my research work at the Department of Computer Science and Engineering NITK, Surathkal.

I express heartfelt thanks to my Research Progress Assessment Committee (RPAC) members Dr. Ramesh Kini M, Associate Professor, E and C department and Dr. Mohit P. Tahiliani, Asst. Professor, Dept. of CSE, for their valuable suggestions and constant encouragement to improve my research work.

I sincerely thank all teaching, technical and administrative staff of the Computer Science and Engineering Department for their help during my research work. I also want to take the opportunity to thank all my teachers throughout the years. There are many to whom I owe thanks.

I wish to express my love and gratitude to all my family members, especially my father and my uncle Ganapati Halawar, EE HESCOM, for their encouragement and support throughout all my studies, from primary school to the current level.

Finally, I would like to express my gratitude to my friends for proofreading the papers submitted to conferences, journals and also the thesis.

Place: Surathkal Date: June 02, 2020 Bheemappa Halavar

### ABSTRACT

A highly structured and efficient on-chip communication network is required to achieve high performance and scalability in current Chip Multiprocessors (CMPs) and Systems-on-Chip (SoCs). Network-on-Chip (NoC) has emerged as a reliable communication framework in CMPs and SoCs. Many 2D NoC architectures have been proposed for the efficient design of on-chip communication. As the number of cores in SoCs and CMPs increases, 2D NoC architectures suffer from high latency and high energy consumption in buffer reads and writes, Virtual Channels, switch traversal, and links (wires). 3-Dimensional Integrated Chips (3D-ICs) serve emerging applications that demand tailored accelerators for high performance and improved energy efficiency. The component redistribution in 3D ICs enables higher performance at competitive energy budgets by allowing greater integration capabilities while lowering the overall wire area and providing greater communication bandwidth, high flexibility, higher throughput and lower overall communication latencies.

Cycle-accurate simulators model the functionality and behaviour of NoCs by considering microarchitectural parameters of the underlying components to estimate performance, power and energy characteristics. Employing NoCs in 3D-ICs can further improve the performance, energy efficiency, and scalability characteristics of 3D SoCs and CMPs. Minimal error in the estimation of energy and performance of NoC components is crucial in architectural trade-off studies. Exploring the design space of 3D NoCs can lead to highly energy-efficient, reduced-area interconnect architectures for modern SoCs. Accurate modeling of horizontal and vertical links by considering microarchitectural and physical characteristics reduces the error in power and performance estimation of 3D NoCs. Additionally, mapping the temperature distribution in a 3D NoC reduces estimation error. Effective extraction of the heat between layers is a significant challenge in 3D NoCs.

In this thesis, the power and performance trade-offs of two 2-layer 3D Butterfly Fat Tree (BFT) variants are explored using a floorplan-driven approach. The first 3D BFT variant analyzed is a standard stacked BFT (3DBFT) derived from a 2D BFT topology. A power-performance optimal 3D BFT (OP3DBFT) is evolved from the standard 3DBFT using overall performance, link and TSV minimization, and power-performance trade-offs. The 3D NoC modeling capabilities are extended in two existing state-of-the-art simulators, viz., the 2D NoC simulator BookSim2.0 and the thermal behaviour simulator HotSpot6.0. The major extensions incorporated in BookSim2.0 are: Through Silicon Via (TSV) power and performance models, 3D topology construction modules, 3D Mesh topology construction using variable X, Y, Z radix, and tailored routing modules for 3D NoCs. The major extensions incorporated in HotSpot6.0 are: a parameterized 2D router floorplan, a 3D router floorplan including TSVs, and power and thermal distribution models of 2D and 3D routers. Using the extended 3D modules, performance (average network latency) and energy efficiency metrics (Joules per Flit, Energy-Delay Product) of variants of 2D and 3D Mesh and Butterfly Fat Tree (BFT) topologies have been evaluated under synthetic traffic patterns. The thermal behaviour of 3D NoC architectures has been analyzed for the ideal arrangement, as well as a proposed thermally aware design of the router-TSV element. Accurate power estimation models of routers and TSVs were used for the thermal evaluation of 3D NoCs.

The OP3DBFT with round-robin deflection routing delivers up to 44% higher performance and consumes up to 23% less power compared to the 3DBFT. From the energy perspective, OP3DBFT has an average 23% decrease in Flits-per-Joule, and up to 46% improvement in Energy-Delay-Product when compared to the 3DBFT. The 3DBFT and OP3DBFT have been synthesized on Xilinx Artix-7 FPGAs for resource comparison. OP3DBFT consumes 12% less area compared to 3DBFT. Using extended models in a 4x4x4 3D NoC Mesh topology, it has been observed that the total average link power consumed is lower than that of a 2D Mesh by 13%. Additionally, the average network latency in the 3D Mesh topology is roughly 60%-82% lower than in the 2D Mesh. 4-layer 3D Mesh with uniform traffic exhibits a performance improvement of up to $2.3 \times$ compared to other Mesh variants. 4-layer 3D BFT with transpose traffic shows an improvement in performance of up to $1.3 \times$ over all other BFT variants. BFT with the transpose traffic pattern has a $1.5 \times$ improvement in performance compared to the uniform traffic pattern. 4-layer 3D Mesh has on-chip communication performance up to $4.5 \times$ that of 4-layer 3D BFT. The on-chip communication performance improved by up to $2.2 \times$ and $3.1 \times$ in 4-layer 3D Mesh in comparison to 2D Mesh with uniform and transpose traffic patterns respectively. 3D Mesh variants have the lowest Energy Delay Product (EDP) compared to 3D BFT variants as there is an 80% reduction in link lengths and up to $3 \times$ more TSVs.

**Keywords:** 3D Network-on-chip (NoC), BFT topology, Mesh topology, Through-silicon via (TSV), Design space exploration, Performance analysis, Energy Delay Product.


## Contents

| Section | Title |
|---------|-------|
|         | Abstract |
|         | Table of Contents |
|         | List of Figures |
|         | List of Abbreviations |
| 1       | Introduction |
| 1.1     | Introduction |
| 1.2     | Problem Statement and Objectives |
| 1.3     | Objectives |
| 1.4     | Contributions |
| 1.5     | Organization of the Thesis |
| 1.6     | Thesis Outline |
| 2       | Literature Review |
| 2.1     | Network-on-chips |
| 2.2     | 3D Network on chips |
| 2.3     | NoC Simulators |
| 2.3.1   | BookSim2.0 NoC Simulator |
| 2.3.2   | Thermal Simulation of 3D ICs |
| 2.3.3   | HotSpot6.0 Temperature Modelling Tool |
| 2.4     | Summary |
| 3       | Floorplan based 2D and 3D NoC Architectural Design Space Exploration of Mesh and BFT Topologies |
| 3.1     | Floorplan and Delay Estimation |
| 3.1.1   | 2D and 3D Mesh Topology |
| 3.1.2   | BFT topology |
| 3.2     | Horizontal and Vertical Link Delay Estimation |
| 3.3     | Buffer Space Analysis |
| 3.3.1   | Buffer Space Equalisation (BSE) |
| 3.4     | Experimental Setup |
| 3.5     | Results and Discussion |
| 3.5.1   | Average Network Latency |
| 3.5.2   | BSE based Mesh and BFT Topology |
| 3.5.3   | BFT vs Mesh Topology |
| 3.5.4   | Flit Energy Analysis |
| 3.5.5   | Energy Delay Product (EDP) |
| 3.6     | Summary |
| 4       | 3D NoC Modelling in BookSim and HotSpot for Power, Performance and Thermal Evaluation |
| 4.1     | TSV Delay and Power Models |
| 4.1.1   | TSV Delay Models |
| 4.1.2   | TSVs Power Model |
| 4.2     | 3D NoC Modelling in BookSim2.0 and HotSpot |
| 4.2.1   | Variable radix at X, Y, Z in Mesh topology |
| 4.2.2   | TSV based Delay Model and Power Module in BookSim2.0 |
| 4.2.3   | TSV-Router in HotSpot for 3D NoC Architecture |
| 4.3     | Analysis of 3D NoC Topology Variants |
| 4.3.1   | General Procedure to Add New Topologies |
| 4.3.2   | Mesh topology |
| 4.5.3   | Average Energy per Flit (EPF) |
| 4.5.4   | Energy Delay Product (EDP) |
| 4.5.5   | Thermal behaviour |
| 4.5.6   | Thermal behaviour of a 3D Mesh and 3D BFT topology |
| 4.6     | Summary |
| 5       | Area, Power and Performance analysis of Optimal 3D BFT NoC |
| 5.1     | Analysis of 2D and 3DBFT Topology |
| 5.1.1   | Floorplanning |
| 5.1.2   | Through Silicon Via Link Delay Model |
| 5.1.3   | Data Serialization over TSVs |
| 5.1.4   | Nearest Common Ancestor (NCA) Routing in BFT Topology |
| 5.2     | Power and Performance Optimal OP3DBFT |
| 5.2.1   | TSV Count Minimisation |
| 5.2.2   | OP3DBFT - Topology and Floorplan |
| 5.3     | Experimental Setup |
| 5.4     | Results and Discussion |
| 5.4.1   | Performance Analysis |
| 5.4.2   | Energy Analysis |
| 5.4.3   | Energy Delay Product (EDP) |
| 5.4.4   | Area Utilization |
| 5.5     | Summary |
| 6       | Conclusion and Future Scopes |
|         | Appendices |
| A       |  |
| A.1     | Routers Thermal Behaviour for 2D BFT and 2D Mesh Topologies |
| A.2     | Thermal Behaviour Analysis of 2 Layer 3D CMesh Network-on-Chip Architecture |
| A.3     | The Effect of Varying TSV Parameters (Length, Diameter, TSV Pitch and Bump Height) on Latency and Power |
| A.4     | Through-Silicon Via Electrical Model Parameters Details and Essential Equations for Calculating Power Consumption |
|         | Bibliography |
|         | List of Publications |

## List of Figures

| 1.1        | (a) Mesh Topology. (b) BFT topology. (c) Floorplan of 2D Mesh                                  |          |
|------------|------------------------------------------------------------------------------------------------|----------|
|            | Topology. (d) Floorplan of 2D BFT topology.                                                    | 3        |
|            |                                                                                                |          |
| 2.1        | 4 x4 NoC Mesh topology. Each PE connects to a router. One router                               |          |
|            | connects to North, East, South and West neighbours using links                                 | 9        |
| 2.2        | Generic K <sub>input</sub> , K <sub>output</sub> router microarchitecture. Each input port has |          |
|            | n Virtual Channels. Output port for data is chosen by the Router                               |          |
|            | Logic. Switching mechanism is implemented by SA and VC- Allocator                              |          |
|            | block(Pande et al., 2005).                                                                     | 10       |
| 2.3        | The overall simulation flow between Modules during the simulation in                           |          |
|            | BookSim2.0.                                                                                    | 15       |
| 2.4        | HotSpot6.0 thermal simulation flow chart for thermal analysis                                  | 18       |
|            |                                                                                                |          |
| 3.1        | $8 \times 8$ 2D mesh with 64 PEs.                                                              | 20       |
| 3.2        | $8\times4\times2$ 3D Mesh with four stacked layers connected using TSVs. $$ .                  | 21       |
| 3.3        | $4\times4\times4$ 3D Mesh with four stacked layers connected using TSVs                        | 22       |
| 3.4        | 64 node BFT topology with three levels. Level 1 is of 4 router, level                          |          |
|            | 2 of 8 routers, level 3 of 16 routers. The leaves are the PEs which are                        |          |
|            | connected to level-3 routers and 4 PE's per router.                                            | 23       |
| 3.5        | Floorplan of 2D BFT with 64 PEs and each PEs are connected routers                             |          |
|            | for inter PEs communication.                                                                   | 23       |
| 3.6        | (b1) $8 \times 4 \times 2$ 3D BFT with four stacked layers connected using TSVs.               |          |
|            | (b2) Inter-layer connections.                                                                  | 24       |
| 3.7        | (c1) $4 \times 4 \times 4$ 3D BFT with four stacked layers connected using TSVs.               |          |
| ( <u> </u> | (c2) Inter-layer connections                                                                   | 24       |
|            |                                                                                                | <u> </u> |

| 3.8  | Average network latency comparison with accurate link delay modelling    |    |
|------|--------------------------------------------------------------------------|----|
|      | of 2D Mesh(default link delay) and 2D Mesh with accurate link delay.     | 32 |
| 3.9  | Average network latency comparison with accurate link delay modelling    |    |
|      | of 2D BFT(default link delay) and 2D BFT with accurate link delay.       | 33 |
| 3.10 | Average network latency comparison after BSE (varying VC and D ) (a)     |    |
|      | 3D 2-layer Mesh uniform traffic (b) 3D 4-layer Mesh uniform traffic (c)  |    |
|      | 2D Mesh transpose (d) 3D 2-layer Mesh transpose traffic(f) 3D 4-layer    |    |
|      | Mesh transpose traffic.                                                  | 34 |
| 3.11 | Average network latency comparison after BSE (varying VC and D )         |    |
|      | (a) 2D BFT uniform traffic (b) 3D 2-layer BFT uniform traffic (c) 3D     |    |
|      | 4-layer BFT uniform traffic (d) 2D BFT transpose (e) 3D 2-layer BFT      |    |
|      | transpose traffic<br>(f) 3D 4-layer BFT transpose traffic                | 35 |
| 3.12 | Normalized performance between 2D Mesh and 3D Mesh and BFT               |    |
|      | variants for (a) Uniform traffic (b) Transpose traffic.                  | 36 |
| 3.13 | Mesh and BFT (2D, 3D variants) topologies normalized Flits per Joules    |    |
|      | for (a) Uniform traffic (b) Transpose traffic.                           | 38 |
| 3.14 | Normalized EDP of Mesh and BFT (2D, 3D variants) for (a) Uniform         |    |
|      | traffic (b) Transpose traffic.                                           | 39 |
| 4.1  | Structure of a signal TSV and a ground TSV with bumps with the           |    |
|      | via-last process and their structural parameters (Kim et al., 2011).     | 43 |
| 4.2  | TSVs electrical model with labelled components (Kim et al., 2011).       | 44 |
| 4.3  | Logical Layout of the TSV electrical model considered in the dynamic     |    |
|      | power model (Kim et al., 2010).                                          | 46 |
| 4.4  | Simulation framework for evaluating power and performance. BookSim       |    |
|      | was extended with 3D TSV delay, power and link delay modules.            | 47 |
| 4.5  | Logical representations of (a) Default Alpha Ev6 processor layout in     |    |
|      | HotSpot6.0. (b) Modified layout with router next to the Data cache (Mesh |    |
|      | topology). (c) Modified layout with router shifted away from the Data    |    |
|      | cache(Thermal Aware Mesh architecture). (d) One router shared be-        |    |
|      | tween 4 cores (not to scale) for BFT architecture.                       | 49 |
| 4.6  | The automated floorplans generation in HotSpot for 3D NoC architecture   | 50 |

| 4.7 The over all modified simulation framework for power, performance,                   |    |
|------------------------------------------------------------------------------------------|----|
| thermal behaviour of 3D NoC architecture.                                                | 51 |
| 4.8 $8 \times 8$ 2D Mesh with 4 PEs $\ldots$                                             | 53 |
| 4.9 Floorplan of $8 \times 4 \times 2$ 3D Mesh with two stacked layers connected using   |    |
| TSVs.                                                                                    | 54 |
| 4.10 Floorplan of $4 \times 4 \times 4$ 3D Mesh with four stacked layers connected using |    |
| TSVs                                                                                     | 55 |
| 4.11 64 node BFT topology with three levels. Level 1 is of 4 router, level               |    |
| 2 of 8 routers, level 3 of 16 routers. The leaves are the PEs which are                  |    |
| connected to level-3 routers and 4 PE's per router.                                      | 55 |
| 4.12 Floorplan of 2D Mesh with 64 PEs and each PEs are connected routers                 |    |
| for inter PEs communication.                                                             | 56 |
| 4.13 (b1) $8 \times 4 \times 2$ 3D Mesh with two stacked layers connected using TSVs.    |    |
| (b2) Inter-layer connections.                                                            | 57 |
| 4.14 (c1) $4 \times 4 \times 4$ 3D Mesh with four stacked layers connected using TSVs.   |    |
| (c2) Inter-layer connections. $\ldots$                                                   | 57 |
| 4.15 Average network latency comparison with accurate link delay mod-                    |    |
| elling. (a) 2D Mesh(default link delay) and 2D Mesh with accurate                        |    |
| link delay and (b) 2D BFT(default link delay) and 2D BFT with ac-                        |    |
| curate link delay.                                                                       | 60 |
| 4.16 Average network latency comparison for 2-layer 3D BFT for RROD                      |    |
| and ROD routing                                                                          | 61 |
| 4.17 Average network latency comparison between uniform and transpose                    |    |
| traffic pattern for 2D and 3D variants of (a) Mesh topology and (b)                      |    |
| BFT topology                                                                             | 62 |
| 4.18 (a) Average EPF for 2D and 3D Mesh topology variants (b) Average                    |    |
| EPF for 2D and 3D BFT topology variants.                                                 | 63 |
| 4.19 Normalized EDP of 2D and 3D Mesh and BFT topology for (a) Uniform                   |    |
| traffic pattern (b) Transpose traffic pattern.                                           | 64 |

| 4.20 Heatmaps in a 64-core (a) 3D Mesh architecture with the router next                     |    |
|----------------------------------------------------------------------------------------------|----|
| to the data cache, (b) 3D Mesh architecture with the router shifted                          |    |
| away from the data cache and Temperature distribution across routers                         |    |
| in (c) 3D Mesh architecture with the router next to the data cache, (d)                      |    |
| 3D Mesh architecture with the router shifted away from the data cache.                       | 65 |
| 4.21 Thermal behaviour of (a) 2D mesh Topology (b) 2-layer 3D Mesh                           |    |
| Topology (c) 4-layer Mesh topology (d) Average thermal behaviour                             |    |
| comparison of 2D and 3D Mesh variants.                                                       | 66 |
| 4.22 Thermal behaviour of (a) 2D BFT topology (b) 2-layer 3D BFT topol-                      |    |
| ogy (c) 4-layer 3D BFT topology (d) Average thermal behaviour com-                           |    |
| parison of 2D and 3D Mesh variants.                                                          | 67 |
|                                                                                              |    |
| 5.1 64 node BFT topology with three levels. Level 1 is of 4 router, level                    |    |
| 2 of 8 routers, level 3 of 16 routers. The leaves are the PEs which are                      |    |
| connected to level-3 routers and 4 PE's per router.                                          | 72 |
| 5.2 Floorpan of 2D BFT topology.                                                             | 73 |
| 5.3 (a) Floorplan of 3DBFT (two-stacked layer) BFT connected using $TSVs(8)$                 | X  |
| $4 \times 2$ ). (b) Inter-layer connections.                                                 | 73 |
| 5.4 2D BFT topology with two path flows from <b>node 0</b> to <b>node 32</b> .               | 75 |
| 5.5 ROD routing for from <b>node 0</b> to <b>node 32</b> and <b>node 3</b> to <b>node 35</b> |    |
| for 2D BFT topology                                                                          | 76 |
| 5.6 RROD - from <b>node 0</b> to <b>node 32</b> and <b>node 3</b> to <b>node 35</b> for 2D   |    |
| BFT topology.                                                                                | 76 |
| 5.7 NCA Routing flowchart of BFT topology with ROD and RROD                                  | 78 |
| 5.8 2D BFT links shown in red and blue (L1 to L8) are the vertical links                     |    |
| of the 3DBFT topology.                                                                       | 80 |
| 5.9 Modified BFT topology (OP3DBFT)                                                          | 81 |
| 5.10 (i) 8 x 4 x 2 OP3DBFT with two stacked layers. (ii) Inter-layer                         |    |
| (TSV) connections.                                                                           | 82 |
| 5.11 Latency comparison of 2-layer and OP3DBFT topology for uniform,                         |    |
| transpose and bit-reversal traffic.                                                          | 83 |
| 5.12 Energy per flit comparison of 2-layer and OP3DBFT topology for                          |    |
| uniform, transpose and bit-reversal traffic.                                                 | 84 |
| 5.13 Normalised EDP of regular 3DBFT and OP3DBFT for uniform,                                |    |
| transpose and bit-reversal traffic.                                                          | 85 |
|                                                                                              |    |
| A.1 3D CMesh NoC architecture                                                                | 92 |
| A.2 (a) Temperature distribution across routers (b) Heatmaps of 3D                           |    |
| Concentrated Mesh Architecture.                                                              | 93 |
| A.3 Effect of varying TSV (a) length, (b) diameter and (c) pitch on latency                  |    |
| for a single via, at operating frequency = 2.5 GHz and voltage = 1.1 V.                      | 94 |
| A.4 Effect of varying TSV (a) length and (b) diameter on Power Consumption                   |    |
| for a single TSV, at operating frequency = 2.5 GHz and voltage = 1.1 V                       |    |
| with activity factor (AF) = 0.15                                                             | 95 |
| A.5 Effect of varying (a) TSV pitch (b) Bump Height on Power Consumption                     |    |
| for a single TSV, at operating frequency = 2.5 GHz and voltage = 1.1 V                       |    |
| with activity factor (AF) = 0.15                                                             | 95 |
| A.6 Effect of varying (a) IMD Layer Height and (b) Oxide Layer Thickness                     |    |
| on Power Consumption for a single TSV, at operating frequency = 2.5                          |    |
| GHz and voltage = 1.1 V with activity factor (AF) = 0.15                                     | 96 |
| UI UI          | 112  and  1010080  1.1  10101  activity 100001 (111) = 0.10  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1 | 50       |


## List of Tables

| No. | Title | Page |
|-----|-------|------|
| 1.1 | Comparison of State-of-the-art simulators and the modified 3D NoC simulators in this work. | 5 |
| 2.1 | Comparison of State-of-the-art simulators based on the design space exploration in both 2D and 3D NoC architecture. | 14 |
| 2.2 | Configuration parameters in BookSim2.0. | 16 |
| 3.1 | Floorplan parameter details. | 20 |
| 3.2 | Horizontal link (HL) length and delay (cc) details of 2D and 3D variants of Mesh, BFT. These delays are considered for the simulation. | 27 |
| 3.3 | Vertical Link details of 3D variants of Mesh, BFT; each link has 64 TSVs. | 27 |
| 3.4 | Buffer Space utilization of 2D and 3D Mesh and BFT variants. | 29 |
| 3.5 | Mesh and BFT (2D, 3D variants) buffer Space Equalisation for varying virtual channel and buffer depth. | 30 |
| 3.6 | Simulated Network Configuration. | 31 |
| 4.1 | Variable radix Mesh topology details. | 44 |
| 4.2 | Important Physical Parameters for TSVs (Khalil et al., 2008) and Safe limits values for each parameter. All the parameters are from the Electrical TSV model shown in 4.1 (a). | 45 |
| 4.3 | Reduced Model Parameters (Figure 4.3 (b)) | 46 |
| 4.4 | The detail of Modification to BookSim2.0 for 3D NoC topology | 52 |
| 4.5 | Parameters used in designing the floorplan | 53 |
| 4.6 | Parameters used in the design of the floorplan | 54 |
| 4.7 | 2-layer and 4-layer 3D Mesh topology details. | 55 |
| 4.8 | Floorplan based 2D BFT and 3D BFT variants topology details | 58 |
| 4.9 | Total resources in the 2D and 3D variants of Mesh and BFT considered in this work. Network size is 64 PEs. Links are horizontal (HL) and vertical (VL). VC is the number of virtual channels and D is buffer depth per VC. | 58 |
| 4.10 | NoC BookSim2.0 parameter for 2D and 3D Mesh and BFT variants. | 59 |
| 4.11 | Thermal evaluation Simulation Environment. | 59 |
| 4.12 | Average of router's power of each level of 2D BFT topology with interval of 2000cc. | 69 |
| 4.13 | Average of router's power utilisation of peripheral routers (each side) and middle routers of Mesh topology. | 69 |
| 5.1 | Parallel and serial case with the TSVs design parameters and TSV count | 74 |
| 5.2 | Link length and delay details of BFT topologies variants. | 79 |
| 5.3 | Links utilisation (Injection rate = 0.018) of 3DBFT (8-vertical links (TSVs)) for uniform, transpose and bit-reversal. Average utilisation of 3DBFT is 40% and OP3DBFT is 80%. | 79 |
| 5.4 | Links utilisation (Injection rate = 0.018) of OP3DBFT (2-vertical links (TSVs)) for uniform, transpose and bit-reversal. | 80 |
| 5.5 | Synthesis results of 3DBFT Topology variants | 86 |
| A.1 | Each Router Power consumption and average router power of each level of 2D BFT topology with interval of 500CC | 90 |
| A.2 | Router Power consumption of 2D Mesh topology with interval of 2000CC | 91 |
| A.3 | Average of router's power utilisation of peripheral routers (each side) and middle routers of Mesh topology | 92 |
| A.4 | Electrical Model Parameters (Figure 4.1(b)) | 97 |
| A.5 | Reduced Model Parameters | 98 |

## List of Abbreviations

| Abbreviation | Expansion |
|----------|------------------------------------------------------|
| 3D-HiCIT | Hierarchical Crossbar-based Interconnection Topology |
| 3D-ICs   | 3-Dimensional Integrated Circuits                    |
| BFT      | Butterfly Fat Tree                                   |
| BSE      | Buffer Space Equalisation                            |
| CMesh    | Concentrated Mesh                                    |
| CMPs     | Chip Multiprocessors                                 |
| EDP      | Energy-Delay Product                                 |
| JpF      | Joules per Flit                                      |
| LT       | Link Traversal                                       |
| NoC      | Network-on-Chip                                      |
| PE       | Processing Element                                   |
| RC       | Route Computation                                    |
| SA       | Switch Allocation                                    |
| SoC      | System on a Chip                                     |
| ST       | Switch Traversal                                     |
| TSVs     | Through Silicon Vias                                 |
| VA       | Virtual-Channel Allocation                           |

# Chapter 1 Introduction

This chapter indicates the scope and direction of the thesis. It begins by introducing the background and classification of NoCs. The state-of-the-art simulators are then discussed and compared with respect to their support for 3D NoC architectures. The chapter then gives an overview of our work on 3D NoC architectures and outlines the author's contributions. Finally, the organization of the rest of the thesis is given.

#### 1.1 Introduction

The interconnection network significantly affects the performance of SoCs and Chip Multiprocessors (CMPs). High-performance SoCs cannot rely on the traditional bus-based communication infrastructure because of the high power consumption and performance bottlenecks introduced by buses. NoCs have emerged as a reliable, high-performance, energy-efficient communication framework in SoCs and CMPs. The basic building blocks of an NoC are the network topology, router microarchitecture (switches), routing, flow control and link architecture. The network topology determines the physical layout and how switches and nodes are connected. The functionality of a router is divided into a number of stages (Pande et al., 2005).

Cycle-accurate simulators reduce the time spent in the NoC design cycle by providing early-stage estimates of metrics such as latency, power and energy. At the router level, existing simulators have explored various architectural and microarchitectural design parameters such as router pipeline depth, arbitration techniques, number and size of virtual channels, number and size of input/output buffers, bufferless implementations, and switching techniques. At the network level, simulators allow restructuring of the NoC topology, network partitions and node concentrations, as well as the redesign and evaluation of routing algorithms, flow control mechanisms, deadlock detection and avoidance strategies, and adaptive and fault tolerance mechanisms. At the link level, the simulators enable designers to evaluate wire width, wire delays, interfaces, deployment of express physical links and serialization strategies (Psathakis et al., 2015).

Fast and accurate approaches for analyzing critical metrics such as performance, power consumption or system fault-tolerance are important to guide the design process. The state-of-the-art NoC simulators (Agarwal et al., 2009; Jiang et al., 2013a; Tran and Baas, 2012), while having complete support for 2D NoC architectures, do not support configurable, parameterized design and implementation of 3D NoCs. Further, floorplan-based designs of NoCs show that links of multiple lengths are deployed at different levels of irregular, tree-based NoC topologies. Table 2.1 compares the 2D and 3D NoC design space approaches of the state-of-the-art simulators with this work. The NoC simulators can be compared based on their coverage, configuration parameters and the metrics measured. Most on-chip simulators consider a static wire length or a constant delay during communication. However, the length and delay of NoC links vary according to the floorplan of the NoC.

The topology selection process is a significant NoC design decision as it influences key NoC attributes. The choice of topology relies on the application bandwidth requirements, chip power and performance constraints, area budget and available resources. The Mesh topology is symmetric and has short wire lengths between routers; hence it is the most widely used topology (Kumar et al., 2002). Butterfly Fat Tree (BFT) is another topology, which has fewer routers and links compared to the Mesh topology.

Link latency plays a significant role in the overall performance of an NoC. Our experiments on a 64-node 2D Mesh with uniform random traffic at an injection rate of 0.1 and the default constant link latency show that 15-20% of the flit traversal time is spent in the links of the NoC. However, the wires estimated using floorplans of the topology have non-identical lengths. The wire delays depend on the R and C values of the wire, which are in turn dependent on the length of the links (Section 4). Accurate modeling of link delay is necessary in early-stage design trade-off studies. The link lengths on the chip depend on the area of the components. In order to estimate the exact length of a link, it is essential to consider the physical dimensions of the components. Variable link lengths, which have variable delays, are a function of the PE area and the topology of the NoC. The interconnect length, operating frequency and voltage must be considered to estimate the exact link delay.
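The dependence of link delay on link length can be illustrated with a minimal sketch. It assumes an unrepeated wire modelled as a distributed RC line, whose 50% delay is approximately 0.38 times the product of the total resistance and total capacitance. The per-millimetre R and C values and the function name are hypothetical placeholders, not parameters taken from this thesis; the 2.5 GHz clock is used only as an example operating frequency.

```python
import math

def wire_delay_cycles(length_mm, r_ohm_per_mm, c_farad_per_mm, freq_hz):
    """Return (delay in seconds, delay in whole clock cycles) for one link.

    Assumption: unrepeated wire treated as a distributed RC line.
    """
    r_total = r_ohm_per_mm * length_mm      # total wire resistance
    c_total = c_farad_per_mm * length_mm    # total wire capacitance
    # 50% delay of a distributed RC line; grows with the square of the length.
    delay_s = 0.38 * r_total * c_total
    # A link always costs at least one clock cycle in a synchronous NoC.
    cycles = max(1, math.ceil(delay_s * freq_hz))
    return delay_s, cycles

# Hypothetical wire parameters, for illustration only.
R_PER_MM = 200.0      # ohm per mm
C_PER_MM = 0.2e-12    # farad per mm
FREQ_HZ = 2.5e9       # example operating frequency

for length in (1.0, 4.0, 8.0):
    delay, cycles = wire_delay_cycles(length, R_PER_MM, C_PER_MM, FREQ_HZ)
    print(f"{length:.0f} mm link: {delay * 1e12:.1f} ps, {cycles} cycle(s)")
```

Under these assumptions a short link fits in one cycle while a long link needs several, which is why a single constant link latency can misrepresent topologies with non-uniform links.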



Figure 1.1: (a) Mesh Topology. (b) BFT topology. (c) Floorplan of 2D Mesh Topology. (d) Floorplan of 2D BFT topology.

Figure 1.1 (a) and (b) show the 64-node Mesh and BFT topology NoC architectures. In the Mesh topology, each Processing Element (PE) is attached to a local router, which in turn connects to the neighbouring nodes via interconnects. The BFT topology connects 64 nodes using 28 routers. Link lengths are driven by the physical dimensions of the components on the chip. The floorplan determines the physical placement of the PEs and routers, and therefore influences the overall area and the length of the physical links. Figure 1.1 (d) shows the BFT topology floorplan, in which non-uniform links are present. Compared to the BFT topology, the Mesh topology (Figure 1.1 (a)) has the shortest and uniform wire lengths between routers.
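The router counts quoted above can be reproduced with a short sketch. It assumes the usual BFT construction in which every first-level router serves four PEs and the number of routers halves at each higher level, and it assumes the number of PEs is a power of four. The function names are illustrative only.

```python
def mesh_router_count(num_pes):
    """A Mesh attaches one router to every PE."""
    return num_pes

def bft_router_count(num_pes):
    """Routers in a Butterfly Fat Tree with num_pes leaves.

    Assumption: num_pes is a power of four; level 1 has num_pes / 4
    routers and each higher level has half as many as the level below.
    """
    total = 0
    routers_at_level = num_pes // 4
    pes_covered = 4               # PEs reachable below one router at this level
    while pes_covered <= num_pes:
        total += routers_at_level
        routers_at_level //= 2
        pes_covered *= 4
    return total

print(mesh_router_count(64), bft_router_count(64))
```

For 64 PEs the BFT levels contain 16, 8 and 4 routers, giving the 28 routers stated in the text against 64 routers for the Mesh.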

Three-dimensional integrated circuits (3D ICs) are an attractive solution for scalable CMPs and SoCs, with the potential to achieve high performance and low power usage. 3D ICs distribute logic and memory in stacked layers and use Through Silicon Vias (TSVs) as vertical interconnects (Kim et al., 2009). TSVs provide a communication link between dies in the vertical direction to achieve 3D integration. TSVs are made of copper or tungsten and are used for signal communication and power delivery. The propagation delay depends on the dimensions (length, radius and pitch) of the TSVs. Khalil et al. (2008) use an analytical model of TSVs to obtain the TSV delay, power and valid TSV configurations.
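To show how TSV dimensions enter the electrical parameters, the following first-order sketch computes the resistance of a cylindrical copper via and the capacitance of its oxide liner using textbook formulas. This is not the analytical model of Khalil et al. (2008); it ignores pitch, coupling to neighbouring TSVs and substrate effects, and the dimensions in the example are hypothetical.

```python
import math

COPPER_RESISTIVITY = 1.68e-8    # ohm * m, bulk copper
EPS0 = 8.854e-12                # F / m, vacuum permittivity
SIO2_REL_PERMITTIVITY = 3.9     # relative permittivity of the oxide liner

def tsv_resistance(length_m, radius_m, resistivity=COPPER_RESISTIVITY):
    """DC resistance of a solid cylindrical via: rho * L / (pi * r^2)."""
    return resistivity * length_m / (math.pi * radius_m ** 2)

def tsv_oxide_capacitance(length_m, radius_m, oxide_thickness_m,
                          eps_r=SIO2_REL_PERMITTIVITY):
    """Coaxial capacitance of the oxide liner around the via."""
    outer = radius_m + oxide_thickness_m
    return 2 * math.pi * eps_r * EPS0 * length_m / math.log(outer / radius_m)

# Hypothetical TSV: 50 um long, 2.5 um radius, 0.1 um oxide liner.
r = tsv_resistance(50e-6, 2.5e-6)
c = tsv_oxide_capacitance(50e-6, 2.5e-6, 0.1e-6)
print(f"R = {r * 1e3:.1f} mOhm, C = {c * 1e15:.1f} fF")
```

In this approximation both R and C scale linearly with TSV length, while a larger radius lowers the resistance and raises the capacitance.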

3D NoCs have a lower aggregate wire length, resulting in improved communication latency and power compared to their 2D counterparts (Pavlidis and Friedman, 2007). It has been estimated that 3D architectures reduce wiring length by a factor of the square root of the number of layers used (Joyner et al., 2001).
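The square-root estimate can be made concrete with a small worked example. The sketch below simply applies the stated scaling rule; the 2D wire length used is a hypothetical value, and the function names are illustrative.

```python
import math

def wire_length_reduction_factor(num_layers):
    """Estimated reduction factor: square root of the number of layers."""
    return math.sqrt(num_layers)

def estimated_3d_wire_length(length_2d_mm, num_layers):
    """Scale a 2D wire length by the square-root rule."""
    return length_2d_mm / wire_length_reduction_factor(num_layers)

# Hypothetical 2D wire length of 10 mm, for illustration only.
for layers in (1, 2, 4):
    print(layers, "layer(s):", round(estimated_3d_wire_length(10.0, layers), 2), "mm")
```

Under this rule a 2-layer stack shortens wiring by about 1.41 times and a 4-layer stack by 2 times.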

In 3D NoCs, links are divided into horizontal and vertical links, and the delay of a vertical link depends on the type of vertical connection. Most simulators model the link as a fixed-delay component and support only 2D NoC topologies. There is a need for 3D NoC simulators that provide accurate early-stage performance evaluation by considering accurate delays and physical dimensions. The BFT topology uses fewer resources than the Mesh topology; hence, the BFT topology is considered for the experiments.

The thermal effect in 3D NoCs has become a major concern because the total power generated per unit area and the thermal gradient across layers increase if cooling is not adequately provided (Swarup et al., 2012). State-of-the-art 3D IC simulators (Sridhar et al., 2014; Huang et al., 2009) incorporate better thermal models for analysing the thermal effect, but support for thermal analysis of 3D NoC architectures is lacking.

Accurate modeling of link delay is necessary in early-stage design trade-off studies. The configuration parameters of current 2D NoC simulators do not allow links of varying lengths and delays to be included in delay calculations. The state-of-the-art NoC simulator of Jiang et al. (2013a) models the link as a fixed-delay component. There is a need for 3D NoC simulators that provide accurate early-stage performance evaluation by considering accurate delays and physical dimensions. Support for detailed microarchitectural parameters of TSVs for 3D NoCs is yet to be incorporated into cycle-accurate simulators such as BookSim.

| Tools | 2D Topology | Network | Router | Buffer | Link analysis | 3D Topology | Vertical link models | Data (TSVs) serialization | TSVs model | Thermal Floorplan Design | Latency | Power | Area |
|-------|-------------|---------|--------|--------|---------------|-------------|----------------------|---------------------------|------------|--------------------------|---------|-------|------|
| Access Noxim (Catania et al., 2016) | Yes | Yes | Yes |     | Yes | Yes | Yes | No  | No  | No  | Yes | Yes | No  |
| Orion (Kahng et al., 2009)          | No  | No  | Yes | Yes | Yes | No  | No  | No  | No  | No  | No  | Yes | Yes |
| Nirgam (Jain et al., 2007)          | Yes | Yes | Yes | Yes | No  | Yes | No  | No  | No  | No  | Yes | Yes | No  |
| Dsent (Sun et al., 2012)            | No  | No  | Yes | Yes | Yes | No  | No  | No  | No  | No  | No  | Yes | Yes |
| Garnet (Agarwal et al., 2009)       | Yes | Yes | Yes | Yes | No  | No  | No  | No  | No  | No  | Yes | Yes | No  |
| WormSim (Ogras and Marculescu, 2006) | Yes | Yes | Yes | Yes | No  | No  | No  | No  | No  | No  | No  | Yes | No  |
| NoCTweak (Tran and Baas, 2012)      | Yes | Yes | Yes | Yes | No  | No  | No  | No  | No  | No  | Yes | Yes | No  |
| Booksim (Jiang et al., 2013b)       | Yes | Yes | Yes | Yes | No  | Yes | No  | No  | No  | No  | Yes | Yes | No  |
| Proposed (Modified booksim)         | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |

The columns 2D Topology to Link analysis are general NoC design features, the columns 3D Topology to Thermal Floorplan Design are 3D NoC design space features, and Latency, Power and Area are the reported results.

Table 1.1: Comparison of State-of-the-art simulators and the modified 3D NoC simulators in this work.

Incorporating delay and power models of TSVs and including varying delays in horizontal links improves the accuracy of the power and performance measurements of 3D NoC simulators.

Accurate simulation of 3D NoCs requires incorporating power and performance models of TSVs into existing 2D NoC simulators. Incorporating microarchitectural link delay and TSV delay models will enable accurate performance evaluation of 3D NoC topologies. Inter-layer communication is achieved using TSVs, and the number of TSVs affects the overall communication energy and performance. The challenge is to minimize the number of TSVs used in the network while maintaining a low average network latency and a better overall thermal distribution. Temperature variations across the physical layers can significantly affect the performance and reliability of the overall NoC. Thus, it is necessary to analyse the thermal impact on the 3D-NoC interconnect in order to meet design needs.

#### 1.2 Problem Statement and Objectives

There is a need for design space exploration of 3D-NoC architectures that considers architectural parameters to reduce power and improve performance. A framework is needed to evaluate the performance and thermal behaviour of 3D NoC architectures. Exploring architectural and router microarchitectural design parameters can lead to 3D NoC architectures optimized for power, performance, thermal behaviour and cost. This problem is further elaborated into the following objectives.

#### 1.3 Objectives

1. Floorplan based 2D and 3D NoC architectural design space exploration of Mesh topology and BFT topology.
2. 3D NoC Modelling in BookSim and Hotspot for power, performance and thermal evaluation of 3D Mesh and BFT NoC architectures.
3. Design of a power and performance optimal 3D-NoC architecture with router micro-architecture, routing and flow control mechanism, and TSV placement.

#### 1.4 Contributions

This thesis makes four contributions towards developing a framework for the power, performance and thermal evaluation of 3D NoC architectures and the design of power and performance optimal 3D NoC architectures.

- Design space exploration of 3D NoCs using floorplan-driven link lengths and link delay estimation for accurate evaluation. A variable radix Mesh NoC topology generation mechanism, which uses unequal values for X, Y and Z, has been implemented in BookSim.
- Accurate latency and power estimation of TSVs in 3D NoC architectures by incorporating the TSV models into 2D BookSim to support 3D NoC topologies. The power and performance of 2D and 3D Mesh topologies have been analysed. Thermal behaviour is evaluated in 3D NoC architectures by (a) adding configurable router-TSV elements into the core layer floorplan and (b) augmenting the router and TSV power and thermal models in HotSpot6.0.
- A framework for the power, performance and thermal evaluation of 3D NoC architectures, designed by incorporating state-of-the-art TSV power and delay models into the BookSim2.0 and HotSpot6.0 simulators.
- A low-cost, performance optimal 3D BFT (OP3DBFT, Optimal Power and Performance 3DBFT) is proposed as a power-performance optimal 2-layer 3D NoC architecture. It is evolved from the standard 3DBFT using overall performance, link and TSV minimization, and power-performance trade-offs, and it uses round-robin deflection routing (RROD).
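The idea of a variable radix Mesh, in which the X, Y and Z dimensions need not be equal, can be sketched as follows. This is an illustrative stand-alone generator and not the BookSim implementation described in this thesis; the function name and the tuple-based node labelling are assumptions made for the example.

```python
from itertools import product

def build_3d_mesh(x_dim, y_dim, z_dim):
    """Return (nodes, horizontal_links, vertical_links) for an X x Y x Z Mesh.

    Nodes are (x, y, z) tuples. Each link is listed once, as a pair of nodes.
    Vertical links join adjacent layers and stand for TSV-based connections.
    """
    nodes = list(product(range(x_dim), range(y_dim), range(z_dim)))
    horizontal, vertical = [], []
    for (x, y, z) in nodes:
        if x + 1 < x_dim:                      # neighbour in +X, same layer
            horizontal.append(((x, y, z), (x + 1, y, z)))
        if y + 1 < y_dim:                      # neighbour in +Y, same layer
            horizontal.append(((x, y, z), (x, y + 1, z)))
        if z + 1 < z_dim:                      # neighbour in the layer above
            vertical.append(((x, y, z), (x, y, z + 1)))
    return nodes, horizontal, vertical

# Example: 64 nodes arranged as an 8 x 4 mesh on each of 2 layers.
nodes, h_links, v_links = build_3d_mesh(8, 4, 2)
print(len(nodes), len(h_links), len(v_links))
```

Keeping horizontal and vertical links in separate lists makes it straightforward to assign them different delays, for example floorplan-derived delays for horizontal links and TSV delays for vertical links.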

#### 1.5 Organization of the Thesis

#### 1.6 Thesis Outline

The remaining parts of the thesis are organized as follows:

- Chapter 2: Literature review: This chapter is structured into three sections. The first section provides background on 2D NoC architectures. The second section highlights several works from the current literature related to 3D NoC power and performance evaluation. The third section surveys the available simulators and their functionality for power, performance and thermal simulation.
- Chapter 3: Floorplan based 2D and 3D NoC architectural design space exploration of Mesh topology and BFT topology: discusses the floorplan-based design space exploration and power-performance evaluation of 3D variants of Mesh and BFT Network-on-Chip architectures. The results are presented with appropriate conclusions.
- Chapter 4: 3D NoC Modelling in BookSim and Hotspot for 3D NoC power, performance and thermal evaluation: presents the modelling of 3D NoCs in BookSim and Hotspot for power, performance and thermal evaluation.
- Chapter 5: Area, Power and Performance analysis of Optimal 3D BFT NoC Architecture: presents the data serialisation and TSV minimization technique in 3D NoC architectures. A novel, low-cost, power-performance optimal 3D BFT topology is proposed. The area, power and performance results of the proposed 3D NoC are presented at the end of the chapter.
- Chapter 6: Summary and conclusions: summarizes the contributions of this thesis, along with some important conclusions and an outline of future research directions.

# Chapter 2 Literature Review

This chapter is organized into three sections. The first section introduces the network-on-chip and the performance evaluation metrics for 2D NoCs. The second section introduces 3D NoC architecture details. The final section introduces the basics of NoC simulators and discusses the simulators available for power and performance evaluation.

#### 2.1 Network-on-chips

Networks-on-Chip (NoCs) have emerged as a highly structured, efficient and reliable on-chip communication framework in CMPs and SoCs to achieve high performance and scalability.



Figure 2.1: 4 x 4 NoC Mesh topology. Each PE connects to a router. One router connects to North, East, South and West neighbours using links.

An NoC interconnects the Processing Elements (PEs) with the routers and links in a scheme called a topology. The basic building blocks of interconnection networks are topology, routing, flow control, and router. Routing defines how packets are routed from source to destination without congestion. Flow control manages resources such as buffers and bandwidth during traffic flow. Figure 2.1 depicts a 4 x 4 2D Mesh NoC topology, where the PEs generate and consume data and the routers are responsible for forwarding data between the PEs. A router is composed of a set of input and output ports, buffers to store the incoming flits, a switching matrix connecting the input ports to the output ports, and a designated local port to connect to its local PE (Dally and Towles, 2001). The generic microarchitecture of the router is shown in Figure 2.2.



Figure 2.2: Generic K<sub>input</sub>, K<sub>output</sub> router microarchitecture. Each input port has n Virtual Channels. The output port for data is chosen by the Router Logic. The switching mechanism is implemented by the SA and VC-Allocator blocks (Pande et al., 2005).

The data from the PEs are divided into packets, which are further subdivided into flow control units called flits. Flits are the units of a packet on which flow control policies are applied in an NoC. Flits pass through the Route Computation (RC), Virtual-Channel Allocation (VA), Switch Allocation (SA), Switch Traversal (ST) and Link Traversal (LT) stages of the router pipeline. The output channel for the head flit is identified in the RC stage. The VA stage allocates a VC to the head flit. The SA stage allocates the output physical channel, and the data is transferred onto that channel in the ST stage, followed by LT. The router pipeline has been extended using techniques such as bypassing (Psarras et al., 2016) to improve NoC performance (Dimitrakopoulos et al., 2015).
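The pipeline stages described above translate into a simple zero-load latency estimate. The sketch below assumes one clock cycle per router stage, no contention, and body flits that follow the head flit one cycle apart; these are simplifying assumptions for illustration rather than the configuration used in this thesis, and the function names are hypothetical.

```python
# Router stages traversed by the head flit at every hop; Link Traversal (LT)
# is accounted for separately so that its delay can vary per link.
ROUTER_STAGES = ("RC", "VA", "SA", "ST")

def head_flit_latency(hops, link_delay_cycles=1, cycles_per_stage=1):
    """Zero-load latency of the head flit over `hops` router-plus-link hops."""
    per_hop = len(ROUTER_STAGES) * cycles_per_stage + link_delay_cycles
    return hops * per_hop

def packet_latency(hops, flits_per_packet, link_delay_cycles=1,
                   cycles_per_stage=1):
    """Zero-load latency until the tail flit arrives.

    Assumption: body and tail flits trail the head flit by one cycle each.
    """
    head = head_flit_latency(hops, link_delay_cycles, cycles_per_stage)
    return head + (flits_per_packet - 1)

# Three hops, four-flit packet, comparing 1-cycle and 3-cycle links.
print(packet_latency(3, 4, link_delay_cycles=1))
print(packet_latency(3, 4, link_delay_cycles=3))
```

Changing only `link_delay_cycles` shows how strongly the per-hop link delay contributes to end-to-end latency, which motivates modelling link delays individually rather than as a constant.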

The evaluation methodology to compare the performance and characteristics of 2D NoC architectures such as SPIN, Torus, Folded Torus, Octagon and BFT has been discussed in Pande et al. (2005). Balfour and Dally (2006) and Psathakis et al. (2015) have conducted comprehensive studies on the internal elements of a router in order to optimize the Energy-Throughput Ratio of NoC architectures. Most of the state-of-the-art studies discuss performance and area optimisation by considering details of the microarchitectural elements. As the number of cores increases, the 2D plane incurs communication overhead due to interconnect length and area overhead (Pavlidis and Friedman, 2007). 3D IC technologies, which stack multiple dies on a single chip, can be adopted for NoCs to avoid long interconnects. The 3D NoC combines NoC and 3D IC technologies to support the scaling of cores (Qian et al., 2009).

#### 2.2 3D Network on chips

Feero and Pande (2009) have extensively evaluated and analyzed the throughput, latency, and energy dissipation of a variety of 3D NoC architectures. Grot et al. (2009) have also studied how network topologies (such as CMesh) scale with regard to cost, performance, and energy, considering the advantages and limitations afforded on a die. The thermal effect of TSVs on these architectures is yet to be investigated. The work by Kumar et al. (2009) has explored the Concentrated Mesh (CMesh) 3D NoC as a low-cost alternative to the naive 3D Mesh. It efficiently shares on-chip network resources such as buffers and wire bandwidth. This architecture has been considered as a part of the analysis to verify whether there is an improvement in power consumption and thermal behaviour.

Debora et al. (2015) have proposed the 3D-HiCIT (Hierarchical Crossbar-based Interconnection Topology) NoC, whose scalability and performance are compared with other hierarchical topologies. 3D-HiCIT reduces the average latency by up to 50% and 45% compared to 3D-BFT and 3D-SPIN respectively.

To address the power and performance issues in the 3D NoC-Bus Hybrid architecture, Rahmani et al. (2011) have proposed an ultra-optimized hybridization scheme called LastZ, which allows optimized inter-layer communication. Area, power, and performance improvements of 10% compared to the 3D NoC-Bus Hybrid Mesh architecture have been observed.

Based on 3D NoC partitioning, two different variants are generated, namely the 3D Stacked Mesh NoC and the 3D Stacked Hexagonal NoC. The performance of these two NoC topologies is analyzed by comparing them with the Stacked 2D Mesh and 3D Mesh NoC. Due to the significance of the wire delay effect in 3D NoC architectures, better performance is observed using partitioning than with the regular 3D Mesh NoC (Jabbar et al., 2013). Most of the state-of-the-art 3D NoC topologies are evaluated with in-house simulators. None of these works uses TSV models as the basis for evaluating 3D NoC topologies in order to obtain accurate performance figures. The next section discusses the state-of-the-art NoC simulators.

### 2.3 NoC Simulators

Jiang et al. (2013a) presented BookSim2.0, a cycle-accurate simulator for NoCs. BookSim2.0 offers a large set of configurable NoC parameters such as topology, routing algorithm, flow control, traffic and injection rate. BookSim2.0 results are validated against the RTL implementation of the NoC router for accuracy. NIRGAM (Jain et al., 2007) is a modular SystemC based simulator supporting 2D mesh and torus NoC architectures. Access Noxim (Catania et al., 2016) is another open source, SystemC based, configurable, cycle-accurate NoC simulator which allows analysing the performance and power of conventional wires. It also simulates the power, performance and thermal behaviour of the 3D Mesh NoC topology. GARNET (Agarwal et al., 2009) is an NoC simulator incorporated in the GEM5 full system simulator. GARNET models the microarchitectural details of the router and buffers. WormSim (Ogras and Marculescu, 2006) is a cycle-accurate simulator for evaluating the performance and communication energy of 2D NoC architectures. NoCTweak (Tran and Baas, 2012) is a highly parametrizable NoC simulator for early exploration of performance and energy efficient NoCs. The simulator has been developed in SystemC, which allows a wide range of configurations to be applied to the NoC platform under simulation. ORION2.0 (Kahng et al., 2009) includes power and area models to accurately estimate the power and area of interconnection network routers. These results can be used for effective NoC design space exploration in the early design phases. DSENT (Sun et al., 2012) is an area and power modelling tool which also considers the same microarchitectural models from ORION3.0 and is used for rapid design space exploration of electrical and opto-electrical networks.

The NoC simulators can be compared based on their coverage, configuration parameters and the metrics measured. Most on-chip simulators consider a static wire length or a constant delay during communication. However, the length and the delay of NoC links vary according to the floorplan of the NoC. Table 2.1 compares the 2D and 3D NoC design space approaches of the state-of-the-art simulators with this work.

BookSim2.0 contains detailed models of all network elements, including the router microarchitecture. BookSim2.0 is a widely used NoC simulator owing to its flexibility and accuracy. However, BookSim2.0 does not support variable configuration of link delays for irregular topologies, and it also lacks support for TSV delay based 3D NoC simulation. Hence, BookSim2.0 is chosen as the base for the 3D NoC architecture extension.

#### 2.3.1 BookSim2.0 NoC Simulator

BookSim2.0 is a flexible and detailed cycle-accurate simulator designed for NoCs. BookSim2.0 is a parametrized simulator in which the internal organization of the routers, including the buffers, crossbar and allocators, is supplied as input in order to observe the behaviour of the desired NoC topology. The simulator offers a large degree of network customization and numerous network component designs. BookSim2.0 provides detailed modelling of networks and routers and has been a standard for NoC simulation since its release. It provides a large degree of flexibility and can be used for the evaluation of novel network designs (Jiang et al., 2013a).

Figure 2.3 depicts the simulation flow and the models involved in each stage. There are three stages in the simulation of a network topology in BookSim2.0: (1) initialization, (2) building the network (*network*), and (3) setting up traffic and simulation (*trafficmanager*). In the initialization stage, the user configuration is read and assigned to the individual simulator parameters, and all previous statistics are cleared.

The network is built by instantiating and interconnecting routers and channels in a

| Tools | Design space exploration: general NoC design features | | | | | Design space exploration: 3D NoC design space features | | | | | Results | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 2D topology | Network | Router | Buffer space | Link analysis | 3D topology | Vertical link models | Data (TSV) serialization | TSV model | Thermal floorplan design | Latency | Power | Area |
| Access Noxim (Catania et al., 2016) | Yes | Yes | Yes | | Yes | Yes | Yes | No | No | No | Yes | Yes | No |
| ORION (Kahng et al., 2009) | No | No | Yes | Yes | Yes | No | No | No | No | No | No | Yes | Yes |
| NIRGAM (Jain et al., 2007) | Yes | Yes | Yes | Yes | No | Yes | No | No | No | No | Yes | Yes | No |
| DSENT (Sun et al., 2012) | No | No | Yes | Yes | Yes | No | No | No | No | No | No | Yes | Yes |
| GARNET (Agarwal et al., 2009) | Yes | Yes | Yes | Yes | No | No | No | No | No | No | Yes | Yes | No |
| WormSim (Ogras and Marculescu, 2006) | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | Yes | No |
| NoCTweak (Tran and Baas, 2012) | Yes | Yes | Yes | Yes | No | No | No | No | No | No | Yes | Yes | No |
| BookSim (Jiang et al., 2013b) | Yes | Yes | Yes | Yes | No | Yes | No | No | No | No | Yes | Yes | No |



Figure 2.3: The overall simulation flow between Modules during the simulation in BookSim2.0.

topology defining how these modules are interconnected. All communication between routers occurs through send and receive functions. The trafficmanager module is responsible for the flow of flits over the network module from source to destination. Based on the supplied configuration parameters such as traffic pattern, packet size, injection rate, etc., the packets are injected into the network and latency measurements are taken. trafficmanager collects appropriate statistics, and terminates the simulation based on the user input simulation time.

The **router** module in BookSim2.0 is an input-queued virtual-channel router with a canonical four-stage pipeline for packet header flits. Pipeline delays are configurable: arbitrary delays can be assigned to each pipeline stage, and the entire router can be configured to mimic the behaviour of a single-cycle router. Multiple subnetworks, in which different traffic classes are transported on separate physical networks, can be simulated. NoC traffic can be fed using synthetic traffic models, by interfacing with a full-system simulator, or by replaying traffic traces. Table 2.2 lists the input parameters for the

| Input Parameter       | Description                                                                |
|-----------------------|----------------------------------------------------------------------------|
| Network               |                                                                            |
| topology              | Name of the topology (Mesh, CMesh)                                         |
| k                     | Topology radix                                                             |
| n                     | Network dimension                                                          |
| c                     | Concentration (number of PEs per router)                                   |
| Router options        |                                                                            |
| router                | Router type (input queue)                                                  |
| in_ports              | Number of input ports in the router                                        |
| out_ports             | Number of output ports in the router                                       |
| num_vcs               | Total number of Virtual Channels (VCs) per port                            |
| vc_buf_size           | Buffer size per VC                                                         |
| routing_function      | Name of the routing function (XY, DOR)                                     |
| Simulation parameters |                                                                            |
| traffic patterns      | Type of traffic in the network (uniform, transpose, bit-complement, tornado) |
| packet_size           | Size of packets in flits                                                   |
| sim_type              | Simulation type (latency or throughput)                                    |
| warmup_periods        | Number of sample periods used to warm up the simulation                    |
| sample_period         | Total number of measurement cycles                                         |
| sim_count             | Number of simulations to perform                                           |

Table 2.2: Configuration parameters in BookSim2.0.

network, router and simulation with a description of each parameter in BookSim2.0.
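As an illustration of how the parameters in Table 2.2 might be collected for a run, the following Python sketch renders a parameter dictionary as text. It assumes a `name = value;` line syntax for the configuration file; the helper name `render_config` and all values shown are hypothetical and chosen only for the example.

```python
def render_config(params):
    """Render parameters as 'name = value;' lines, one per parameter."""
    return "\n".join(f"{name} = {value};" for name, value in params.items())


# Illustrative values only; the parameter names follow Table 2.2.
example_params = {
    "topology": "mesh",
    "k": 8,
    "n": 2,
    "num_vcs": 8,
    "vc_buf_size": 12,
    "routing_function": "dor",
    "packet_size": 8,
    "sim_type": "latency",
    "sim_count": 1,
}

if __name__ == "__main__":
    print(render_config(example_params))
```

Keeping the parameters in a dictionary makes it straightforward to sweep one of them, for example `num_vcs`, while holding the others fixed.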

The BookSim2.0 simulator supports Mesh, CMesh, Torus and Tree based 2D NoC topologies. The capabilities of BookSim2.0 are extended by (a) incorporating TSV delay models, (b) supporting the creation of custom 3D wire lengths in standard topologies, and (c) adding a 3D Mesh topology with a variable radix in each dimension. This extension enables designers to simulate a variable radix 3D Mesh NoC in BookSim2.0. The TSV models that have been incorporated in BookSim2.0 are presented in Section 4.2.2.

#### 2.3.2 Thermal Simulation of 3D ICs

Sridhar et al. (2014) have developed a new simulator for the compact modelling of liquid-cooled 3D ICs. Kinoshita et al. (2015) have used ADVENTURECluster, a large-scale parallel computing simulator based on Finite Element Modelling, to study the thermal elevation in stacked 3D chips and the TSV structure stress in 3D System in Packages based on CAD models. Tain et al. (2012) have developed a simulation model using Flotherm and equivalent thermal conductivity correlations to measure the performance of doubly stacked 3D IC structures assembled in Quad Flat Packages. Fourmigue et al. (2014) have proposed a novel Finite Difference method based algorithm for efficiently computing transient temperatures in liquid-cooled 3D ICs with high accuracy. Lu et al. (2014) have arrived at simple empirical formulas to model the heat generated in TSVs using classical equations of heat conduction. Most of the state-of-the-art work (Sridhar et al., 2014) uses HotSpot (Skadron et al., 2003) based models. In this work, HotSpot6.0 has been extended for 3D NoC architectures.

#### 2.3.3 HotSpot6.0 Temperature Modelling Tool

HotSpot6.0 is a widely used model for studying thermal behaviour at the architecture level (Skadron et al., 2003). Within a thermal package, HotSpot6.0 models the microarchitecture blocks as an equivalent circuit of thermal resistances and capacitances. The sample chip floorplans provided with HotSpot6.0 are based on the Alpha EV6 processor architecture. The sample 3D test case provided in the unmodified version of HotSpot6.0 specifies the dimensions of the TSVs used and the number of TSVs that make up one logical TSV unit. Currently, it does not consider the router as a significant element in the floorplan of the processor. Figure 2.4 shows the overall simulation flow of the HotSpot tool. This work extends HotSpot6.0 to simulate 2D and 3D NoC architectures by adding router and TSV elements based on the floorplan details.
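The idea of treating a block as thermal resistances and capacitances can be illustrated with a single lumped RC node. The Python sketch below is a textbook step response under constant power, not HotSpot's multi-node model; the function name, the default ambient temperature and the example values are all assumptions.

```python
import math


def lumped_temperature(power_w, r_th, c_th, t_s, t_ambient=45.0):
    """Temperature (deg C) of one thermal RC node after t_s seconds.

    power_w: constant dissipated power in W
    r_th:    thermal resistance to ambient in K/W
    c_th:    thermal capacitance in J/K
    """
    tau = r_th * c_th  # thermal time constant in seconds
    rise = power_w * r_th * (1.0 - math.exp(-t_s / tau))
    return t_ambient + rise


if __name__ == "__main__":
    # Hypothetical block: 2 W, 1.5 K/W, 0.1 J/K
    for t in (0.0, 0.15, 1.0):
        print(t, round(lumped_temperature(2.0, 1.5, 0.1, t), 3))
```

The steady-state rise is simply power times thermal resistance, and the capacitance only determines how quickly that value is approached.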



Figure 2.4: HotSpot6.0 thermal simulation flow chart for thermal analysis

## 2.4 Summary

This chapter presents different horizons of research in the field of NoC architectures. The pros and cons of the state-of-the-art 2D simulators are discussed. Most of the existing work is carried out on 2D NoCs, and the simulators lack accurate microarchitectural models for the evaluation of TSVs in 3D NoC architectures. As this thesis focuses on accurate power and performance evaluation of 3D NoC architectures, Chapter 3 emphasizes the accurate evaluation of 3D NoCs based on the microarchitectural details of routers and links.

## Chapter 3

# Floorplan based 2D and 3D NoC Architectural Design Space Exploration of Mesh and BFT Topologies

In this chapter, we explore the design space of 3D Mesh and Butterfly Fat Tree (BFT) NoC architectures using floorplan driven wire lengths and link delay estimation. The buffer space of each topology has been analysed for various virtual channel and buffer depth configurations, and the buffer space has then been equalised for a fair performance comparison between the topologies. Performance, Flits per Joule (FpJ) and Energy Delay Product (EDP) of six variants (2D, 2-layer 3D and 4-layer 3D) of the Mesh and BFT topologies are analysed.

## 3.1 Floorplan and Delay Estimation

#### 3.1.1 2D and 3D Mesh Topology

Mesh is a direct network topology which allows the integration of more PEs in a regular structure (Pasricha and Dutt, 2008), where every router, except the ones at the edges, is connected to all its neighbouring routers. The $8 \times 8$ Mesh topology is an 8-ary, 2-cube 2D Mesh with a total of $k^n = 64$ routers, as shown in Figure 3.1, where $k$ is the radix (number of elements in each dimension) and $n$ is the number of dimensions.

The floorplan consists of a tiled Chip Multiprocessor with 64 Sun SPARC cores (Xu et al., 2012). Each PE has an area of 3.4 mm<sup>2</sup>. The 3D Mesh topology consists of 5-port and 6-port routers. The router area is estimated from the ORION2.0 (Kahng et al., 2009) area model. The area of each router and the other micro-architectural parameters used are shown in Table 3.1.

Figure 3.1:  $8 \times 8$  2D mesh with 64 PEs.

| Parameter       | Router type | Value                      |
|-----------------|-------------|----------------------------|
| Clock frequency |             | 2.5 GHz                    |
| PE area         |             | $3.4\,\mathrm{mm}^2$       |
| Router area     | 4-port      | $0.47098\,\mathrm{mm}^2$   |
|                 | 5-port      | $0.598509\,\mathrm{mm}^2$  |
|                 | 6-port      | $0.729954\,\mathrm{mm}^2$  |
|                 | 7-port      | $0.865314\,\mathrm{mm}^2$  |
| Channel size    |             | 64 bits                    |

Table 3.1: Floorplan parameter details.

3D Mesh aims to reduce latency by redistributing nodes vertically. The 64-node $8 \times 8$ Mesh is converted into a 2-layer ($8 \times 4 \times 2$) and a 4-layer ($4 \times 4 \times 4$) 3D Mesh. These two 3D NoC topologies are designed from the existing 2D Mesh topology to analyse the performance of going vertical. Figures 3.2 and 3.3 show the floorplan based architecture of the $8 \times 4 \times 2$ 2-layer 3D Mesh topology and the $4 \times 4 \times 4$ 4-layer 3D Mesh topology, which have 32 and 16 routers per layer respectively. TSVs are used to connect the inter-layer routers. The delay of a TSV is considered to be one clock cycle (Yaghini et al., 2016).



Figure 3.2:  $8 \times 4 \times 2$  3D Mesh with two stacked layers connected using TSVs.

| Algorithm 1: Routing algorithm for both 2D and 3D (XY and ZXY). |
|-----------------------------------------------------------------|
| <b>Input:</b> <i>cur</i> node and <i>dest</i> node |
| <b>Output:</b> Output port from <i>cur</i> node to <i>dest</i> |
| 1 <b>if</b> $cur \neq dest$ <b>then</b> |
| 2 &nbsp;&nbsp;<b>if</b> <i>cur</i> and <i>dest</i> are at different planes (layers) <b>then</b> |
| 3 &nbsp;&nbsp;&nbsp;&nbsp;Find output port in Z direction (left or right) |
| 4 &nbsp;&nbsp;<b>else if</b> <i>cur</i> and <i>dest</i> are at the same offset of the X dimension and are at different Y dimension <b>then</b> |
| 5 &nbsp;&nbsp;&nbsp;&nbsp;Find output port in Y direction (left or right) |
| 6 &nbsp;&nbsp;<b>else</b> |
| 7 &nbsp;&nbsp;&nbsp;&nbsp;Find output port in X direction (left or right) |

The ZXY routing algorithm (Algorithm 1) is used for the 3D Mesh topology. The ZXY algorithm first routes the packet to the layer (Z) of the destination node and then performs XY routing (Dally and Towles, 2004) within the destination layer.
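The decision order of Algorithm 1 can be sketched in Python as follows. Nodes are represented as `(x, y, z)` tuples and the port names such as `"z+"` are hypothetical labels introduced for the example; the `trace` helper is only there to show the resulting hop sequence.

```python
# Hypothetical port names mapped to coordinate steps.
MOVES = {
    "x+": (1, 0, 0), "x-": (-1, 0, 0),
    "y+": (0, 1, 0), "y-": (0, -1, 0),
    "z+": (0, 0, 1), "z-": (0, 0, -1),
}


def zxy_route(cur, dest):
    """Return the output port at `cur` for a packet destined to `dest`."""
    if cur == dest:
        return "local"
    cx, cy, cz = cur
    dx, dy, dz = dest
    if cz != dz:                      # different layer: move along Z first
        return "z+" if dz > cz else "z-"
    if cx == dx and cy != dy:         # X offset resolved: move along Y
        return "y+" if dy > cy else "y-"
    return "x+" if dx > cx else "x-"  # otherwise move along X


def trace(cur, dest):
    """List the ports taken hop by hop from `cur` to `dest`."""
    hops = []
    while cur != dest:
        port = zxy_route(cur, dest)
        hops.append(port)
        cur = tuple(c + s for c, s in zip(cur, MOVES[port]))
    return hops


if __name__ == "__main__":
    print(trace((0, 0, 0), (2, 1, 1)))
```

Because the dimensions are always resolved in the same order, every source and destination pair has exactly one path in this sketch.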



Figure 3.3:  $4 \times 4 \times 4$  3D Mesh with four stacked layers connected using TSVs.

#### 3.1.2 BFT topology

In the Butterfly Fat Tree (BFT) topology, PEs are placed at the leaves and routers are placed at the top and intermediate levels. Each node is labelled with a pair of coordinates (L, P), where L denotes the node's level and P denotes its position within that level. BFT has non-uniform link lengths at each level (Grecu et al., 2004).

#### A 2D BFT

The 2D BFT topology shown in Figure 3.4 consists of 64 PEs and 28 routers. Except for the top level routers (which have four ports), all routers have six ports. In a 6-port router, one port is connected to each of the four child nodes and the remaining two ports are connected to the parent nodes. Figure 3.5 shows the floorplan of the 2D BFT topology. The micro-architectural parameters used for our experiments are shown in Table 3.6. From the floorplan, it is observed that there are five different link lengths. Table 3.2 shows the lengths and their respective delays.



Figure 3.4: 64-node BFT topology with three levels. Level 1 has 4 routers, level 2 has 8 routers and level 3 has 16 routers. The leaves are the PEs, which are connected to the level 3 routers with 4 PEs per router.



Figure 3.5: Floorplan of the 2D BFT with 64 PEs. The PEs are connected to routers for inter-PE communication.



Figure 3.6: (b1)  $8 \times 4 \times 2$  3D BFT with two stacked layers connected using TSVs. (b2) Inter-layer connections.



Figure 3.7: (c1)  $4 \times 4 \times 4$  3D BFT with four stacked layers connected using TSVs. (c2) Inter-layer connections.

#### B 2-layer 3D BFT

The level 2 routers are moved towards level 1 to reduce the link length and to connect the links from the level 1 routers to level 2 between the layers. Figure 3.6 (b) shows the 2-layer 3D BFT, which is extended from the 2D BFT. The 2-layer 3D BFT has two stacked layers connected through vertical TSVs. In the 3D topology, the wire length is reduced compared to the 2D BFT because of the vertical connections, as shown in Table 3.2.

Figure 3.6 (b1) shows the vertical connections of the 2-layer 3D BFT; there are eight links which are connected vertically. The level 1 and level 2 routers at the layer boundary are hybrid and use both TSVs for vertical connections and wires for horizontal connections.

#### C 4-layer 3D BFT

Figure 3.7 shows the BFT modified into a 4-layer 3D BFT topology, with each layer consisting of 16 PEs. Level 1 routers are placed between the layers to reduce the length of the TSVs. Figure 3.7(c1) shows the vertical connections between the routers from level 1 to level 2 (Kim et al., 2009).

Nearest Common Ancestor (NCA) routing algorithm is employed in 2D and 3D BFT variants(Algorithm 2). The algorithm identifies the nearest common ancestor between source and destination. At each router, it finds the minimum and maximum reachable nodes and then based on reachability, packets are routed to appropriate ports.
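The reachability test at the heart of NCA routing can be sketched in Python for a 64-PE BFT. The numbering used here is an assumption made for the example: PEs are numbered 0 to 63, router levels are numbered 0 (top, 4 routers), 1 (8 routers) and 2 (16 routers, next to the PEs), and routers that share the same set of reachable PEs occupy consecutive positions. The function names and the returned `("down", child)` and `("up", None)` tuples are hypothetical.

```python
LEVELS = 3  # router levels 0 (top) .. 2 (adjacent to the PEs)


def reachable_range(level, position):
    """Lowest and highest PE reachable below the router at (level, position)."""
    coverage = 4 ** (LEVELS - level)               # PEs below this router
    routers_per_group = 2 ** (LEVELS - 1 - level)  # routers sharing that set
    group = position // routers_per_group
    low = group * coverage
    return low, low + coverage - 1


def nca_route(level, position, dest):
    """Route down to a child if `dest` is reachable, otherwise up to a parent."""
    low, high = reachable_range(level, position)
    if low <= dest <= high:
        child_span = (high - low + 1) // 4         # PEs under each child port
        return ("down", (dest - low) // child_span)
    return ("up", None)


if __name__ == "__main__":
    print(reachable_range(2, 5))   # PEs under the sixth leaf-level router
    print(nca_route(2, 5, 22))     # destination below this router
    print(nca_route(2, 5, 40))     # destination elsewhere: go up
    print(nca_route(0, 1, 40))     # top-level routers always route down
```

Which of the two parent ports is taken on the way up is left open in this sketch, since either parent can reach the common ancestor.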

### 3.2 Horizontal and Vertical Link Delay Estimation

Link lengths are extracted from the floorplans of the topologies, and RC delay models from ORION3.0 (Kahng et al., 2009) are used for estimating the horizontal link delay (ns) at a voltage of 1.1 V (32 nm technology). The delays (ns) are then converted to clock cycles for the 2.5 GHz operating frequency. Table 3.2 shows the link length, delay and horizontal link counts of the 2D and 3D variants of the Mesh and BFT topologies based on the floorplan. The Delay column in Table 3.2 gives the delay of the respective wire lengths in clock cycles (cc).
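The conversion from a wire delay in nanoseconds to clock cycles can be sketched as below. Rounding up to a whole number of cycles, with a minimum of one cycle, is an assumption of this sketch, and the delay values in the usage lines are illustrative rather than taken from the delay model.

```python
import math


def delay_in_cycles(delay_ns, freq_ghz=2.5):
    """Convert a link delay in ns to whole clock cycles (rounded up)."""
    cycles = delay_ns * freq_ghz           # ns * cycles per ns
    # The small epsilon guards against floating-point noise on exact multiples.
    return max(1, math.ceil(cycles - 1e-9))


if __name__ == "__main__":
    for delay_ns in (0.1, 1.6, 1.7):       # illustrative delays
        print(delay_ns, "ns ->", delay_in_cycles(delay_ns), "cycles")
```

At 2.5 GHz one clock cycle lasts 0.4 ns, so any link faster than that still occupies a full cycle in this sketch.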

| Algorithm 2: Routing algorithm for BFT topology. |
|--------------------------------------------------|
| <b>Input:</b> <i>cur</i> node and <i>dest</i> node |
| <b>Output:</b> output port from <i>cur</i> node to <i>dest</i> |
| 1 <b>if</b> $cur \neq dest$ <b>then</b> |
| 2 &nbsp;&nbsp;Find the <i>cur</i> node level (<i>nl</i>) and position of the node (<i>rp</i>) in the level; |
| 3 &nbsp;&nbsp;<b>if</b> $nl == 0$ <b>then</b> |
| 4 |
| 5 &nbsp;&nbsp;<b>else if</b> $nl == 1$ <b>then</b> |
| 6 &nbsp;&nbsp;&nbsp;&nbsp;Find the lowest (<i>min</i>) node and maximum (<i>max</i>) node which can be reached from <i>cur</i>; |
| 7 &nbsp;&nbsp;&nbsp;&nbsp;<b>if</b> <i>dest</i> is in between <i>max</i> and <i>min</i> <b>then</b> |
| 8 |
| 9 &nbsp;&nbsp;&nbsp;&nbsp;<b>else</b> |
| 10 |
| 11 &nbsp;&nbsp;<b>else if</b> $nl == 2$ <b>then</b> |
| 12 &nbsp;&nbsp;&nbsp;&nbsp;Find the lowest (<i>min</i>) node and maximum (<i>max</i>) node which can be reached from <i>cur</i>; |
| 13 &nbsp;&nbsp;&nbsp;&nbsp;<b>if</b> <i>dest</i> is in between <i>max</i> and <i>min</i> <b>then</b> |
| 14 |
| 15 &nbsp;&nbsp;&nbsp;&nbsp;<b>else</b> |
| 16 |

The Mesh topology has uniform horizontal link lengths in all the variants (2D, 2-layer 3D and 4-layer 3D). The floorplans of the BFT topology indicate that the horizontal link lengths depend on the level, the router placement, and the PE size. In the 3D BFT, the lengths of the links connecting level 1 to level 2 (Figure 3.4) are reduced by 40-80% when compared to the 2D BFT.

The vertical link counts of both the Mesh and BFT variants are evaluated. Table 3.3 shows the vertical link delay (cc), the number of vertical connections and the number of TSVs for both the BFT and Mesh variants.
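The vertical link and TSV counts of the 3D Mesh variants can be cross-checked with a few lines of Python. The sketch assumes one vertical connection per router for every pair of adjacent layers. It also assumes that each vertical connection carries two 64-bit channels (128 TSVs), which is our reading of the TSV totals in Table 3.3 and not something the table states explicitly; the function names are hypothetical.

```python
def vertical_links(kx, ky, layers):
    """Vertical connections in a kx x ky x layers 3D Mesh."""
    return kx * ky * (layers - 1)


def tsv_count(links, channel_bits=64, channels_per_link=2):
    """Total TSVs, assuming `channels_per_link` channels of `channel_bits`."""
    return links * channel_bits * channels_per_link


if __name__ == "__main__":
    for name, dims in (("2-layer 3D Mesh", (8, 4, 2)),
                       ("4-layer 3D Mesh", (4, 4, 4))):
        links = vertical_links(*dims)
        print(name, links, tsv_count(links))
```

Under the same 128-TSV assumption, the 16 and 8 vertical links of the 2-layer and 4-layer BFT correspond to 2048 and 1024 TSVs.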

## **3.3 Buffer Space Analysis**

Buffers in the router I/O ports are expensive resources, as they consume as much as 30% of the router area. Buffers improve the overall throughput of the NoC. The total buffer space (B) used by a topology is given in Equation 3.1. The total buffer space depends on the number of routers, input/output ports, virtual channels per

| Topology        | Wire length (mm) | Delay (clock cycles) | HL (wire) count |
|-----------------|------------------|----------------------|-----------------|
| 2D Mesh         | 1.844            | 4                    | 112             |
| 2-layer 3D Mesh | 1.844            | 4                    | 108             |
| 4-layer 3D Mesh | 1.844            | 4                    | 96              |
| 2D BFT          | 8.825            | 73                   | 8               |
|                 | 8.342            | 73                   | 8               |
|                 | 7.859            | 64                   | 8               |
|                 | 4.654            | 23                   | 8               |
|                 | 4.171            | 19                   | 16              |
| 2-layer 3D BFT  | 8.342            | 73                   | 8               |
|                 | 7.859            | 64                   | 8               |
|                 | 4.654            | 23                   | 8               |
|                 | 4.171            | 19                   | 8               |
| 4-layer 3D BFT  | 4.654            | 23                   | 14              |
|                 | 4.171            | 19                   | 14              |
|                 | less than 1      | 1                    | 12              |

Table 3.2: Horizontal link(HL) length and delay(cc) details of 2D and 3D variants of Mesh, BFT. These delays are considered for the simulation.

Table 3.3: Vertical link (VL) details of the 3D variants of Mesh and BFT. Each link has 64 TSVs.

|                     | 2-layer 3D Mesh | 4-layer 3D Mesh | 2-layer 3D BFT | 4-layer 3D BFT |
|---------------------|-----------------|-----------------|----------------|----------------|
| VL count            | 32              | 48              | 16             | 8              |
| number of TSVs      | 4096            | 6144            | 2048           | 1024           |
| Delay (clock cycle) | 1               | 1               | 1              | 1              |

port, and buffer depth per virtual channel.

$$B = \sum_{i=1}^{n} (R_i * P_i) * V * D$$
(3.1)

where $R_i$ is the total number of routers in the $i^{th}$ class and $P_i$ is the total number of ports in each router of class $i$. Routers having identical $P_i$ belong to the same class. The number of router classes is $n$. $V$ is the number of VCs per port, $D$ is the buffer depth per virtual channel, and $B$ is the overall buffer space of the topology.
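Equation 3.1 can be evaluated directly, as in the Python sketch below. The router classes are the $(R_i, P_i)$ pairs listed for the Mesh variants in Table 3.4; the function name and the constant names are hypothetical.

```python
def buffer_space(router_classes, num_vcs, depth):
    """Equation 3.1: B = sum(R_i * P_i) * V * D, in flits."""
    total_ports = sum(routers * ports for routers, ports in router_classes)
    return total_ports * num_vcs * depth


# (R_i, P_i) pairs for the Mesh variants, as listed in Table 3.4.
MESH_2D = [(64, 5)]
MESH_3D_2_LAYER = [(64, 6)]
MESH_3D_4_LAYER = [(32, 6), (32, 7)]

if __name__ == "__main__":
    for name, classes in (("2D Mesh", MESH_2D),
                          ("3D 2-layer Mesh", MESH_3D_2_LAYER),
                          ("3D 4-layer Mesh", MESH_3D_4_LAYER)):
        print(name, buffer_space(classes, num_vcs=8, depth=12))
```

With 8 VCs and a depth of 12, the three calls give 30720, 36864 and 39936 flits, matching the corresponding entries of Table 3.4.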

The total buffer spaces used in the Mesh and BFT (2D and 3D variants) topologies for various numbers of virtual channels and buffer depths are presented in Table 3.4. The first two columns are the VC parameters, while the other columns show the amount of buffer space used in the Mesh and BFT (2D and 3D variants) topologies. The last two rows of Table 3.4 give the number of routers and the number of ports per router (Rn, Pn). There are two router classes (6-port and 7-port) in the 3D 4-layer Mesh; one class has 32 routers with six ports, and the other has 32 routers with seven ports. With 8 VCs per port and a buffer depth of 12 per VC, the total buffer space of the 3D 4-layer Mesh is 39936 flits. The total buffer space depends on $R_i$, $P_i$, $V$ and $D$. Therefore, each topology's buffer space is different. In the 3D 4-layer Mesh, the buffer space has increased by 9% and 30% for the 6-port and 7-port routers (additional TSV ports) respectively. The 2D and 3D BFT variants have the same buffer space, as no additional ports are required.

#### 3.3.1 Buffer Space Equalisation (BSE)

From Table 3.4, it can be seen that the buffer space varies between topologies. For a fair performance comparison between the topologies, the buffer space has been equalised. The number of VCs and the buffer depth are modified to equalise the buffer space. Equation 3.2 represents the BSE, where $C_1$ and $C_2$ are the equalisation factors (positive or negative).

$$B_E = \sum_{i=1}^{n} (R_i * P_i) * (V + C_1) * (D + C_2)$$
(3.2)
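Equation 3.2 can be checked against the 2D Mesh baseline with the short Python sketch below. The $(C_1, C_2)$ pairs used are the ones implied by the Mesh entries of Table 3.5 for the V = 8, D = 12 configuration; the function names are hypothetical, and the sketch only evaluates given factors rather than searching for them.

```python
def equalised_buffer_space(router_classes, num_vcs, depth, c1, c2):
    """Equation 3.2: B_E = sum(R_i * P_i) * (V + C1) * (D + C2), in flits."""
    total_ports = sum(routers * ports for routers, ports in router_classes)
    return total_ports * (num_vcs + c1) * (depth + c2)


def deviation(value, baseline):
    """Relative deviation of `value` from `baseline`."""
    return (value - baseline) / baseline


BASELINE_2D_MESH = 64 * 5 * 8 * 12  # 2D Mesh with V = 8, D = 12

if __name__ == "__main__":
    two_layer = equalised_buffer_space([(64, 6)], 8, 12, c1=-1, c2=0)
    four_layer = equalised_buffer_space([(32, 6), (32, 7)], 8, 12, c1=-1, c2=-1)
    for name, b_e in (("3D 2-layer Mesh", two_layer),
                      ("3D 4-layer Mesh", four_layer)):
        print(name, b_e, f"{deviation(b_e, BASELINE_2D_MESH):+.1%}")
```

Both equalised values stay within the ±10% band around the 2D Mesh baseline of 30720 flits.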

The 2D Mesh NoC is considered as the baseline buffer space. Table 3.5 shows the buffer space of all the topologies approximately ($\pm 10\%$) equalised to the 2D Mesh buffer space. For the 3D 2-layer Mesh, the buffer space has changed from 36864 to 32256 after BSE because of the change in V from 8 to 7. The error in equalisation varies by 5% in the case of D=4 and 12% in the case of D=12, with an 8% decrease in buffer space. For the 3D 4-layer Mesh, the buffer space has changed from 39936 to 32032 after BSE because of the change in V from 8 to 7, and the buffer space has decreased by 20% after BSE. In the BFT topology, the buffer space changed from 14208 to 30720 after

| No. of VCs (V) | Buffer depth (D) | 2D Mesh | 3D 2-layer Mesh | 3D 4-layer Mesh | 2D BFT | 3D 2-layer BFT | 3D 4-layer BFT |
|----------------|------------------|---------|-----------------|-----------------|--------|----------------|----------------|
| 4              | 4                | 5120    | 6144            | 6656            | 2368   | 2368           | 2368           |
| 4              | 8                | 10240   | 12288           | 13312           | 4736   | 4736           | 4736           |
| 4              | 12               | 15360   | 18432           | 19968           | 7104   | 7104           | 7104           |
| 6              | 4                | 7680    | 9216            | 9984            | 3552   | 3552           | 3552           |
| 6              | 8                | 15360   | 18432           | 19968           | 7104   | 7104           | 7104           |
| 6              | 12               | 23040   | 27648           | 29952           | 10656  | 10656          | 10656          |
| 8              | 4                | 10240   | 12288           | 13312           | 4736   | 4736           | 4736           |
| 8              | 8                | 20480   | 24576           | 26624           | 9472   | 9472           | 9472           |
| 8              | 12               | 30720   | 36864           | 39936           | 14208  | 14208          | 14208          |
| (Rn, Pn)       |                  | (64,5)  | (64,6)          | (32,6)          | (4,4)  | (4,4)          | (4,4)          |
|                |                  |         |                 | (32,7)          | (22,6) | (22,6)         | (22,6)         |

| 2D Mesh V | 2D Mesh D | 2D Mesh buffer space | 3D 2-layer Mesh V | 3D 2-layer Mesh D | 3D 2-layer Mesh buffer space | 3D 4-layer Mesh V | 3D 4-layer Mesh D | 3D 4-layer Mesh buffer space | BFT (2D, 3D 2-layer, 3D 4-layer) V | BFT D | BFT buffer space |
|---|----|-------|---|----|-------|---|----|-------|----|----|-------|
| 4 | 4  | 5120  | 4 | 3  | 4608  | 4 | 3  | 4992  | 6  | 6  | 5760  |
| 4 | 8  | 10240 | 4 | 7  | 10752 | 4 | 7  | 11648 | 6  | 11 | 10560 |
| 4 | 12 | 15360 | 4 | 10 | 15360 | 4 | 10 | 16640 | 6  | 16 | 15360 |
| 6 | 4  | 7680  | 5 | 4  | 7680  | 5 | 4  | 8320  | 8  | 6  | 7680  |
| 6 | 8  | 15360 | 5 | 8  | 15360 | 5 | 8  | 16640 | 8  | 12 | 15360 |
| 6 | 12 | 23040 | 5 | 12 | 23040 | 5 | 12 | 24960 | 8  | 18 | 23040 |
| 8 | 4  | 10240 | 7 | 4  | 10752 | 7 | 4  | 11648 | 12 | 5  | 9600  |
| 8 | 8  | 20480 | 7 | 8  | 21504 | 7 | 8  | 23296 | 12 | 11 | 21120 |
| 8 | 12 | 30720 | 7 | 12 | 32256 | 7 | 11 | 32032 | 12 | 16 | 30720 |

Table 3.5: Mesh and BFT (2D, 3D variants) buffer space equalisation for varying virtual channel and buffer depth.
BSE because of the change in V from 8 to 12 for the BFT variants, which is a $2 \times$ increase in buffer space. In the case of BFT topology, V and D are the same in all BFT variants because there is no change in resources (links and routers) in BFT when switching from 2D to 3D.
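As a cross-check of the buffer space figures quoted above, Equation 3.2 can be evaluated directly. The following is a minimal sketch and not part of the thesis toolchain: the function `buffer_space` and the dictionary `ROUTER_GROUPS` are hypothetical names, and the (routers, ports) pairs are taken from the last rows of Table 3.4 on the assumption that they give $(R_i, P_i)$ for each router group.

```python
def buffer_space(router_groups, v, d, c1=0, c2=0):
    """Evaluate Equation 3.2: sum(R_i * P_i) * (V + C1) * (D + C2), in flits."""
    total_ports = sum(routers * ports for routers, ports in router_groups)
    return total_ports * (v + c1) * (d + c2)


# (routers, ports per router) groups, assumed from Table 3.4.
ROUTER_GROUPS = {
    "2D Mesh": [(64, 5)],
    "3D 2-layer Mesh": [(64, 6)],
    "3D 4-layer Mesh": [(32, 6), (32, 7)],
    "2D BFT": [(4, 4), (22, 6)],
}

V, D = 8, 12  # baseline number of VCs and buffer depth
baseline = buffer_space(ROUTER_GROUPS["2D Mesh"], V, D)

# Equalisation factors (C1, C2) for the Mesh variants discussed in the text.
factors = {"3D 2-layer Mesh": (-1, 0), "3D 4-layer Mesh": (-1, -1)}

for name, (c1, c2) in factors.items():
    before = buffer_space(ROUTER_GROUPS[name], V, D)
    after = buffer_space(ROUTER_GROUPS[name], V, D, c1, c2)
    deviation = 100.0 * (after - baseline) / baseline
    print(f"{name}: {before} -> {after} flits ({deviation:+.1f}% vs 2D Mesh)")
```

With these inputs the sketch reproduces the changes from 36864 to 32256 flits (3D 2-layer Mesh) and from 39936 to 32032 flits (3D 4-layer Mesh), both within the $\pm 10\%$ band around the 2D Mesh baseline of 30720 flits.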

## 3.4 Experimental Setup

The cycle-accurate on-chip network simulator BookSim2.0 is modified to support 2-layer and 4-layer 3D NoCs with accurate delays. XY and XYZ routing are used for the 2D and 3D Mesh topologies respectively. Horizontal wire delays have been modelled, and the TSV delays have been modelled from existing works (Yaghini et al., 2016). Delays are estimated based on the floorplans shown in Figures 3.1, 3.2, 3.3, 3.5, 3.6 and 3.7 to obtain accurate performance metrics. The BFT topology shown in Figure 3.4 is implemented in the simulator, with router degrees of 4, 6 and 6 at the three levels respectively. The nearest common ancestor algorithm is implemented for 64 nodes. Table 3.6 shows the network configuration parameters of the 2D and 3D variants of the Mesh and BFT topologies.

| BookSim Parameter | Value                                                                                |
|-------------------|--------------------------------------------------------------------------------------|
| Topology          | 2D Mesh, 2-layer 3D Mesh, 4-layer 3D Mesh, 2D BFT, 2-layer 3D BFT and 4-layer 3D BFT |
| Network Size      | 64 Nodes                                                                             |
| Switches          | 28 - 64                                                                              |
| Traffic           | Uniform Random                                                                       |
| Number of VCs     | 8                                                                                    |
| VC buffer size    | 16                                                                                   |
| Simulation time   | $10^5$ cycles                                                                        |

Table 3.6: Simulated Network Configuration.

## 3.5 Results and Discussion

#### 3.5.1 Average Network Latency

The average network latencies obtained from BookSim2.0 using default link latencies and floorplan-based link latencies are plotted in Figure 3.8 and Figure 3.9 for 2D Mesh and 2D BFT respectively. Using floorplan-based link lengths and the corresponding delays in simulation, an increase in average network latency of 19% to 43% is observed. With floorplan-based, accurate link delays, the average network latency increases by up to $1.45 \times$ in Mesh and up to $8 \times$ in the BFT topology. The BFT topology link length is $2.5 \times$ greater than that of the Mesh, resulting in a larger increase in the average network latency.



Figure 3.8: Average network latency comparison of 2D Mesh with default link delay and 2D Mesh with accurate link delay modelling.

#### 3.5.2 BSE based Mesh and BFT Topology

Mesh and BFT topologies (2D, 3D variants) are evaluated based on equalised buffer space for a fair comparison. The buffer space in all topology variants is equalised to within 10% to keep the resource cost similar, and Table 3.5 shows the equalised buffer space. Figures 3.10 and 3.11 show the average network latency of 3D 2-layer and 3D 4-layer Mesh and BFT for both uniform and transpose traffic. The results are based on the equalised buffer space for V=8 and D=12.

Figure 3.9: Average network latency comparison of 2D BFT with default link delay and 2D BFT with accurate link delay modelling.

In a NoC, latencies can be compared meaningfully only prior to the saturation point, because the network becomes unstable after it. The saturation point is the amount of traffic that a network can support before it reaches the latency wall. Figures 3.10 (a) and (b) depict the average network latency of 3D 2-layer and 3D 4-layer Mesh with and without BSE. Injection rates of 0.08 and 0.16 are the saturation points for 3D 2-layer Mesh and 3D 4-layer Mesh under uniform traffic respectively. In Figure 3.10 (a), there is an 8% variation in network latency until the 0.08 injection rate, and a reduction of 10% to 12% thereafter. In Figure 3.10 (b), there is up to 5% variation in network latency until the 0.16 injection rate, and a reduction of up to 12% thereafter. Figures 3.10 (c) and (d) depict the average network latency of 3D 2-layer and 3D 4-layer Mesh for the transpose traffic pattern. A 30% reduction in the average network latency is observed for 3D 2-layer Mesh after BSE.
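The notion of a saturation point can be illustrated with a small sketch that scans a latency sweep for the latency wall. Everything here is an assumption made for illustration: the function name `saturation_point`, the choice of a wall at three times the latency at the lowest injection rate, and the sample sweep values, which are not measured data from this work.

```python
def saturation_point(rates, latencies, factor=3.0):
    """Return the highest injection rate reached before the latency wall.

    The wall is taken as `factor` times the latency at the lowest
    injection rate. This threshold is an assumed heuristic.
    Rates must be sorted in ascending order.
    """
    if not rates or len(rates) != len(latencies):
        raise ValueError("rates and latencies must be non-empty and equal length")
    threshold = factor * latencies[0]
    saturation = None
    for rate, latency in zip(rates, latencies):
        if latency > threshold:
            break  # latency wall reached
        saturation = rate
    return saturation


# Hypothetical latency sweep, for illustration only.
rates = [0.02, 0.04, 0.06, 0.08, 0.10, 0.12]
latencies = [20.0, 21.5, 24.0, 31.0, 95.0, 400.0]
print(saturation_point(rates, latencies))
```

Comparisons between configurations would then be restricted to injection rates at or below the returned value.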

With buffer equalisation, the buffer space of the 3D 2-layer and 3D 4-layer Mesh variants decreases by up to 15% for each V and D, as shown in Table 3.5. The 3D Mesh variants show a small variation in the network latency until the saturation point.

In the BFT variants, buffer space increases up to $2 \times$ in both variants after BSE, as shown in Table 3.5. Figure 3.11 depicts the comparison of average network latency for uniform and transpose traffic patterns for BFT variants.

Figures 3.11 (a), (b) and (c) depict the average network latency of 2D, 3D 2-layer and 3D 4-layer BFT for the uniform traffic pattern. In Figure 3.11 (a), the average network latency is reduced by up to 60% until the 0.04 injection rate. This is due to the increase in the buffer space for 2D BFT compared to the topology without Buffer Space Equalisation. After the 0.04 injection rate, the network latency increases by up to 20% because more data flits are transferred than in the topology without BSE, and the saturation point of 2D BFT with BSE increases from 0.02 to 0.04. In Figure 3.11 (b), the average network latency is reduced by 10% until the 0.02 injection rate. After the 0.04 injection rate, the network latency increases by up to 45% because more data flits are transferred than in the topology without BSE. In Figure 3.11 (c), the average network latency is reduced by 53% until the 0.04 injection rate. After the 0.04 injection rate, the network latency increases by up to 55% because more data flits are transferred than in the topology without BSE.

Figure 3.10: Average network latency comparison after BSE (varying VC and D) (a) 3D 2-layer Mesh uniform traffic (b) 3D 4-layer Mesh uniform traffic (c) 2D Mesh transpose (d) 3D 2-layer Mesh transpose traffic (f) 3D 4-layer Mesh transpose traffic.

Figure 3.11: Average network latency comparison after BSE (varying VC and D) (a) 2D BFT uniform traffic (b) 3D 2-layer BFT uniform traffic (c) 3D 4-layer BFT uniform traffic (d) 2D BFT transpose (e) 3D 2-layer BFT transpose traffic (f) 3D 4-layer BFT transpose traffic.

The transpose traffic pattern (Figures 3.11 (d), (e) and (f)) has up to $2 \times$ the saturation point and a 10% reduction in average network latency compared to uniform traffic in all variants of the BFT topology. The network latency in BFT is reduced until the saturation point.

#### 3.5.3 BFT vs Mesh Topology

The normalised performance of Mesh and BFT are depicted in Figure 3.12 for uniform and transpose traffic patterns. The performance is normalised to 2D BFT topology.

Figure 3.12 (a) shows the normalised performance of all variants for the uniform traffic pattern. The performance of 3D 4-layer Mesh is up to $5 \times$ that of 2D Mesh, up to $4 \times$ that of 3D 2-layer Mesh, $12 \times$ that of 2D BFT, $14 \times$ that of 3D 2-layer BFT and $12 \times$ that of 3D 4-layer BFT. Link length in Mesh is up to 80% shorter than in the BFT topology, and vertical links reduce delay by up to 75%. Hence the 3D 4-layer Mesh outperforms all other variants.

The 3D 2-layer and 3D 4-layer BFT show 40% and $2.6 \times$ improvements in normalised performance until the 0.02 injection rate. After that, the 3D 2-layer and 3D 4-layer BFT lose 19% and 21% of normalised performance respectively compared to 2D BFT.



Figure 3.12: Normalized performance between 2D Mesh and 3D Mesh and BFT variants for (a) Uniform traffic (b) Transpose traffic.

Figure 3.12 (b) shows the normalised performance of all variants for the transpose traffic pattern. The 3D 4-layer Mesh achieves better performance than the other topologies. Its normalised performance is $1.4 \times$ to $3.4 \times$ that of 2D Mesh, $1.2 \times$ to $3.1 \times$ that of 3D 2-layer Mesh, $4.5 \times$ that of 2D BFT, $3.4 \times$ that of 3D 2-layer BFT and $2.5 \times$ that of 3D 4-layer BFT.

The 3D 2-layer and 3D 4-layer BFT show $2 \times$ and $3.8 \times$ improvements in normalised performance until the 0.02 injection rate. After that, 3D 2-layer BFT loses 6% of normalised performance on average compared to 2D BFT. The 3D BFT variants lose performance because the level 2 to level 3 link lengths have increased by $2 \times$ compared to 2D BFT.

The 3D 4-layer Mesh has $5\times$ the normalised performance of 3D 4-layer BFT. Hence, 3D 4-layer Mesh has the least average network latency among the Mesh and BFT variants. There are 36 additional routers and $2\times$ additional links in Mesh compared to BFT, which distributes traffic over the network and reduces the waiting time at the buffers.

It is observed that the maximum normalised performance drops from 17 in Figure 3.12 (a) to 6.25 in Figure 3.12 (b). Due to the better distribution of traffic in the network, the Mesh topology performs better under uniform traffic than under transpose traffic. In BFT, the transpose traffic pattern gives a $1.5 \times$ improvement in normalised performance over the uniform traffic pattern, as BFT is suited to localised traffic rather than uniformly distributed traffic.

#### 3.5.4 Flit Energy Analysis

Total power consumption is calculated as the sum of the powers of the routers, links and TSVs, as given by Equation 3.3, where $P_t$ is the total power and $P_r$, $P_l$ and $P_{tsv}$ are the powers of the routers, links and TSVs respectively.

$$P_t = P_r + P_l + P_{tsv} \tag{3.3}$$

$$P_{\rm TSV} = AF * C_{\rm TSV} * V^2 * f \tag{3.4}$$

$P_{tsv}$ has been obtained using Equation 3.4, where AF is the switching activity factor (the probability of the output switching from 0 to 1), $C_{TSV}$ is the TSV capacitance, V is the voltage and f is the operating frequency (Kim et al., 2010).
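Equations 3.3 and 3.4 translate directly into code. The sketch below is illustrative only: the function names and the operating point (activity factor, capacitance, voltage and frequency) are hypothetical values chosen to exercise the formulas, not parameters used in this work.

```python
def tsv_dynamic_power(activity_factor, c_tsv, voltage, frequency):
    """Equation 3.4: P_tsv = AF * C_TSV * V^2 * f, in watts."""
    return activity_factor * c_tsv * voltage ** 2 * frequency


def total_power(p_router, p_link, p_tsv):
    """Equation 3.3: P_t = P_r + P_l + P_tsv, in watts."""
    return p_router + p_link + p_tsv


# Hypothetical operating point, for illustration only.
AF = 0.5        # switching activity factor
C_TSV = 50e-15  # TSV capacitance in farads
VDD = 1.0       # supply voltage in volts
FREQ = 1e9      # operating frequency in hertz

p_one_tsv = tsv_dynamic_power(AF, C_TSV, VDD, FREQ)
print(f"dynamic power of one TSV: {p_one_tsv * 1e6:.1f} uW")
```

Because the voltage enters quadratically, doubling the supply voltage quadruples the TSV dynamic power, while the activity factor, capacitance and frequency each scale it linearly.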

Flits per Joule (FpJ) is calculated using Equation 3.5, where $F_t$ is the total number of flits delivered throughout the simulation and T is the total simulation time in cycles. Figures 3.13 (a) and (b) show the average FpJ of 2D and 3D Mesh and BFT variants for uniform and transpose traffic patterns respectively.



$$FpJ = \frac{F_t}{(P_t * T)} \tag{3.5}$$

Figure 3.13: Mesh and BFT (2D, 3D variants) topologies normalized Flits per Joules for (a) Uniform traffic (b) Transpose traffic.
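Equation 3.5 can be sketched as follows. The function name, the `clock_hz` argument and the sample values are assumptions: the conversion of T from cycles to seconds is added here so that the result carries units of flits per joule, and it is not stated in the text.

```python
def flits_per_joule(flits_delivered, total_power_w, sim_cycles, clock_hz):
    """Equation 3.5: FpJ = F_t / (P_t * T), with T converted to seconds."""
    sim_time_s = sim_cycles / clock_hz       # assumed cycle-to-time conversion
    energy_j = total_power_w * sim_time_s    # energy = power * time
    return flits_delivered / energy_j


# Hypothetical values, for illustration only.
fpj = flits_per_joule(flits_delivered=200_000, total_power_w=2.0,
                      sim_cycles=100_000, clock_hz=1e9)
print(f"{fpj:.3e} flits per joule")
```

Since Figure 3.13 reports normalised FpJ, a constant clock frequency cancels out in the ratio between two topologies simulated for the same number of cycles.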

Figure 3.13 (a) plots the normalised FpJ of all variants for the uniform traffic pattern. The 3D 4-layer Mesh topology delivers up to $1.5 \times$ the FpJ of 2D Mesh and $1.2 \times$ that of 3D 2-layer Mesh. It delivers up to $4.5 \times$ the FpJ of 2D BFT, $3.2 \times$ that of 3D 2-layer BFT and $1.15 \times$ that of 3D 4-layer BFT.

Figure 3.13 (b) plots the normalised FpJ of all variants for the transpose traffic pattern. The 3D 4-layer Mesh topology delivers up to $1.15 \times$ the FpJ of 2D Mesh and $1.1 \times$ that of 3D 2-layer Mesh. It delivers up to $4.5 \times$ the FpJ of 2D BFT, $2.9 \times$ that of 3D 2-layer BFT and $1.1 \times$ that of 3D 4-layer BFT.

The BFT topology has a lower FpJ than the Mesh topology because of the longer link lengths in the BFT, which has up to $3 \times$ longer horizontal links than the Mesh topology. Moving from 2D to 3D, FpJ increases by up to $1.5 \times$ in 3D 2-layer BFT and up to $3.9 \times$ in 3D 4-layer BFT compared to the 2D BFT topology. A higher FpJ is seen in the 3D variants as eight horizontal links in 3D 2-layer and 12 horizontal links in 3D 4-layer are converted to TSVs.

#### 3.5.5 Energy Delay Product (EDP)

Figures 3.14 (a) and (b) depict the EDP comparison of the Mesh and BFT variants (2D, 3D 2-layer and 3D 4-layer) for uniform and transpose traffic respectively. EDP is compared between a minimum injection rate of 0.02 and a maximum injection rate of 0.1. It is observed that 3D 4-layer Mesh has the lowest EDP compared to 3D 2-layer Mesh, 2D Mesh, 2D BFT, 3D 2-layer BFT and 3D 4-layer BFT for both traffic patterns. In comparison with the 3D BFT variants, the 3D Mesh variants have the lowest EDP since the link lengths for 3D Mesh decrease by 80% and the TSV count increases up to $3\times$.

Figure 3.14: Normalized EDP of Mesh and BFT (2D, 3D variants) for (a) Uniform traffic (b) Transpose traffic.

## 3.6 Summary

In this work, the microarchitectural design space of 2D and 3D variants of the Mesh and BFT topologies has been explored. Accurate wire delays have been derived from link delay and TSV delay models. The lengths of horizontal links and TSVs have been estimated using the floorplans of the respective topologies. We evaluate the conventional 2D Mesh against the 3D 2-layer Mesh, 3D 4-layer Mesh, 2D BFT, 3D 2-layer BFT and 3D 4-layer BFT topologies for performance and energy trade-offs. All the variants have been compared, and the trade-offs analysed, on the basis of equalised buffer space for a fair evaluation. Results of the experiments show that 3D 4-layer Mesh exhibits a performance improvement of $2 \times$ to $2.3 \times$ compared to 2D Mesh under uniform traffic, 3D 2-layer and 3D 4-layer Mesh under transpose traffic. The Mesh topology shows better performance with the uniform random traffic pattern than with transpose traffic, due to the uniform distribution of packets. 3D 4-layer BFT with transpose traffic shows a performance improvement of $1.1 \times$ to $1.3 \times$ over 3D 2-layer BFT under transpose traffic, 2D BFT with transpose traffic and 3D 4-layer BFT with the uniform traffic pattern. BFT with the transpose traffic pattern has a $1.5 \times$ improvement in performance compared to the uniform traffic pattern, showing that BFT is suitable for localised traffic rather than uniformly distributed traffic.

## Chapter 4

# 3D NoC Modelling in BookSim and Hotspot for Power, Performance and Thermal Evaluation

In this chapter, 3D NoC modelling capabilities are added to two existing state-of-the-art simulators, viz., the 2D NoC simulator BookSim2.0 and the thermal behaviour simulator HotSpot6.0. With the extended 3D NoC modules, the simulators can be used for power, performance and thermal measurements through micro-architectural and physical parameters. The major extensions incorporated in BookSim2.0 are: Through Silicon Via (TSV) power and performance models, 3D topology construction modules, 3D Mesh topology construction using variable X, Y, Z radix, and tailored routing modules for 3D NoCs. The thermal behaviour of the 3D Mesh has been analysed for the regular arrangement, and a thermally aware design of the router-TSV element is proposed.

## 4.1 TSV Delay and Power Models

#### 4.1.1 TSV Delay Models

Weerasekera et al. (2009) have modelled TSVs in a compact manner by deriving reduced electrical circuit models for isolated and bundled structures. Their model took into account the coupling capacitance, resistance and inductance between the vias and their effect on the overall delay. However, the numerical data furnished only provides values for the self and coupling capacitance of a 3 x 3 TSV bundle. You et al. (2013) have characterized TSVs using an approximate ring oscillator model where the driver resistance and TSV capacitance are the main contributors to the propagation delay. Ahmed et al. (2016) have recently proposed delay-aware floorplanning which considers the coupling capacitance between adjacent TSVs $(C_{TT})$ and between the horizontal wire and TSV $(C_{TW})$.

For power, Jueping et al. (2010) have proposed a simple model to estimate TSV capacitance using few microarchitectural parameters and approximation techniques. Bamberg and Garcia-Ortiz (2017) have described a regression method for energy estimation based on the probability of bits passing through a 3x3 submicrometric TSV array.

Kim and Lim (2010) have modelled the RC parasitics of a TSV as a 3D interconnect along with buffers, which add a non-trivial area overhead. However, their model does not consider physical parameters of the TSV such as its length, diameter and separation from other TSVs. Also, the delay of the buffer element is equal to 70 ps, which is too high to be acceptable in a 3D NoC. Khalil et al. (2008) have made use of dimensional analysis to create a light-weight, high-fidelity model that takes three parameters, namely TSV length, radius and pitch. Khalil's model has been incorporated in this work due to its simplicity and its agreement with simulations using electromagnetic field solvers and the lossy transmission line circuit model.

#### 4.1.2 TSV Power Model

Kim et al. (2011) proposed an RLGC model considering multiple physical parameters of the via to determine the capacitance of a single TSV, which is necessary in order to derive the power it consumes under a given operating frequency and voltage. Their work has been extended in Kim et al. (2010) by devising an equation to calculate the power consumed using the TSV values from the previous work, and by considering the Activity Factor (AF), which is a measure of the amount of work done by the underlying chip interconnects. This model considers more microarchitectural details, hence it is used for accurate estimation of the dynamic power of TSVs as vertical interconnects in 3D NoCs. Figure 4.1 shows that each TSV pair is made up of a signal TSV and a ground TSV. The TSV and bump provide a vertical interconnect through the silicon substrate, i.e., joining the stacked chips. The underfill is the separation between the TSV bumps. The Inter Metal Dielectric layer is the separation between TSVs. Figure 4.2 shows the electrical model of the TSVs with labelled components. The influence of these parasitic components increases as the operating frequency increases.



Figure 4.1: Structure of a signal TSV and a ground TSV with bumps with the via-last process and their structural parameters (Kim et al., 2011).

## 4.2 3D NoC Modelling in BookSim2.0 and HotSpot

#### 4.2.1 Variable radix at X, Y, Z in Mesh topology

BookSim2.0 can construct and simulate a k-ary n-cube Mesh, where k is the radix (number of elements in each dimension) and n is the number of dimensions. The radix is fixed for all dimensions. For example, an 8 x 8 Mesh is an 8-ary 2-cube 2D Mesh topology with a total number of routers $(k^n)=64$. An 8 x 8 Mesh can be rearranged into other configurations, e.g. 4 x 4 x 4 and 8 x 4 x 2, by using a variable radix in each dimension. Creating such topologies requires support for a custom number of PEs in the X, Y and Z dimensions. Additionally, the routing algorithm has to be modified to support variable radix topologies. A variable radix Mesh has been added to the network module. ZXY routing has been implemented in the routing module, as Z is added as a new parameter. The configuration file is modified to receive the X, Y and Z values. A 3D NoC is simulated by supplying n=3 (three dimensions) in the configuration file. 2D NoCs are simulated with n=2 (two dimensions) and Z=1. Table 4.1 shows the list of new parameters and values. To evaluate the variable radix Mesh topology, varmesh is used as the topology name during the simulation.



Figure 4.2: TSVs electrical model with labelled components (Kim et al., 2011).

| Changes made   | Description |
|----------------|-------------|
| Parameters     | X, Y, Z     |
| Topology name  | varmesh     |
| Route Function | MeshZXY     |

Table 4.1: Variable radix Mesh topology details.
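The idea of dimension-order ZXY routing on a variable radix Mesh can be sketched as below. This is not the `MeshZXY` implementation added to BookSim2.0: the function names and the node numbering convention (X varying fastest, then Y, then Z) are assumptions made for illustration.

```python
def node_to_coords(node, x_dim, y_dim):
    """Map a linear node id to (x, y, z), with x varying fastest (assumed)."""
    x = node % x_dim
    y = (node // x_dim) % y_dim
    z = node // (x_dim * y_dim)
    return x, y, z


def coords_to_node(x, y, z, x_dim, y_dim):
    """Inverse of node_to_coords."""
    return x + y * x_dim + z * x_dim * y_dim


def next_hop_zxy(current, dest, x_dim, y_dim):
    """Move one hop towards dest, resolving Z first, then X, then Y."""
    cx, cy, cz = node_to_coords(current, x_dim, y_dim)
    dx, dy, dz = node_to_coords(dest, x_dim, y_dim)
    if cz != dz:
        cz += 1 if dz > cz else -1
    elif cx != dx:
        cx += 1 if dx > cx else -1
    elif cy != dy:
        cy += 1 if dy > cy else -1
    return coords_to_node(cx, cy, cz, x_dim, y_dim)


def route(src, dest, x_dim, y_dim):
    """Return the list of routers visited from src to dest."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop_zxy(path[-1], dest, x_dim, y_dim))
    return path


# 8 x 4 x 2 Mesh (64 routers): route from node 0 to node 63.
X, Y, Z = 8, 4, 2
path = route(0, X * Y * Z - 1, X, Y)
print(len(path) - 1, "hops:", path)
```

The same 64 routers arranged as 4 x 4 x 4 give a shorter corner-to-corner route (9 hops instead of 11), which is the kind of trade-off a variable radix topology makes it possible to explore.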

#### 4.2.2 TSV based Delay Model and Power Module in BookSim2.0

Using the model of Khalil et al. (2008), this work generates an ideal TSV configuration by combining these models with safe limits for each parameter, to avoid manufacturing complexity during the fabrication process. The safe limits (Table 4.2) are taken from Weerasekera et al. (2009) and Lee et al. (2014).

An analytical model of the propagation delay for TSVs is shown in Algorithm 3.

Table 4.2: Important Physical Parameters for TSVs (Khalil et al., 2008) and safe limit values for each parameter. All the parameters are from the Electrical TSV model shown in Figure 4.1 (a).

| Parameter Name     | Inference                           | Values (Lee et al., 2014; Weerasekera et al., 2009) |
|--------------------|-------------------------------------|-----------------------------------------------------|
| $d_{TSV}$          | TSV Diameter                        | $[20, \ldots, 80]\ \mu\mathrm{m}$                   |
| $p_{TSV}$          | TSV Pitch                           | $[40, \ldots, 180]\ \mu\mathrm{m}$                  |
| $h_{bump}$         | Bump Height                         | $[5, \ldots, 50]\ \mu\mathrm{m}$                    |
| $d_{bump}$         | Bump Diameter                       | $[5, \ldots, 30]\ \mu\mathrm{m}$                    |
| $t_{ox}$           | Oxide Layer Thickness               | $[0.1, \ldots, 1.0]\ \mu\mathrm{m}$                 |
| $t_{ox\_bot}$      | Bottom Oxide Layer Thickness        | $1\ \mu\mathrm{m}$                                  |
| $h_{imd}$          | Inter-metal dielectric Layer Height | $[20, \ldots, 100]\ \mu\mathrm{m}$                  |
| $\sigma_{Cu}$      | Conductivity of Copper              | $5.96 * 10^{7}\ \mathrm{S/m}$                       |
| $\varepsilon_{Si}$ | Permittivity of Silicon             | $1.05315 * 10^{-10}\ \mathrm{F/m}$                  |
| $\mu_o$            | Permeability in free space          | $1.25663706 * 10^{-6}\ \mathrm{H/m}$                |
| $\omega$           | $2\pi f$                            |                                                     |

TSV delay is estimated from the Height/Length (l), Diameter (d) and Pitch/Separation (s). These parameters (l, d and s) are swept by brute force within the safe limits, and the least-power configuration is used in this work. The TSV is configured with length 20 $\mu$m, diameter 20 $\mu$m and pitch 60 $\mu$m. These are the default TSV values for the simulator.

**Algorithm 3:** TSV Delay Estimation.

**Input:** Length ($l$), Diameter ($d_{TSV}$) and Pitch/Separation ($s$) of TSV

**Output:** TSV delay

1. START
2. $r = d_{TSV}/2$
3. $l_o = \dfrac{\sigma_{Cu} * r^2 * \sqrt{(\mu_o/\varepsilon_{si}) * acosh(s/d_{TSV})}}{0.693 * (1 + 0.617 * (r/s))}$
4. if ($l \ge l_o$)
5. $\quad delay = \sqrt{(\mu_o/\varepsilon_{si})} * l * l / l_o$
6. else
7. $\quad delay = \sqrt{(\mu_o/\varepsilon_{si})} * l$
8. END

The overall power consumption of the NoC can be calculated as the sum of the power consumption of the links and routers. Power consumption is given by Equation 3.3.

The router $(P_r)$ and link $(P_l)$ powers are calculated dynamically from the ORION3.0 (Kahng et al., 2009) power models. The TSV power $(P_{tsv})$ for each configuration has been obtained using Equation 3.4, where AF is the activity factor and $C_{TSV}$ is the TSV capacitance calculated from Equation 4.1 (Kim et al., 2010).



Figure 4.3: Logical Layout of the TSV electrical model considered in the dynamic power model (Kim et al., 2010).

Table 4.3: Reduced Model Parameters (Figure 4.3(b)).

| Parameter Name | Inference                               |
|----------------|-----------------------------------------|
| $C_{b1}$       | $C_{ins} + C_{Bump1}$                   |
| $C_{b2}$       | $C_{ins} + C_{Bump2}$                   |
| $C_1$          | $(C_{b1} * C_{b2})/(C_{b1} + C_{b2})$   |
| $C_2$          | $C_{Sisub}$                             |
| $C_3$          | $C_{Underfill} + C_{Bottom}$            |

$$C_{TSV} = C_3 + \frac{C_1 * C_2 * (1 + \sigma_{Cu} / (\varepsilon_{si} * \omega))}{C_1 + 2 * C_2 * (1 + \sigma_{Cu} / (\varepsilon_{si} * \omega))} \tag{4.1}$$

The electrical model in Figure 4.3 highlights the capacitances of the TSV elements. Table 4.3 lists the capacitance of each of the TSV elements. $C_1$ is the insulator capacitance of a TSV, $C_2$ is the silicon-substrate capacitance, and $C_3$ denotes the combined capacitance of the bottom and underfill sections of the TSV.
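Equation 4.1 and the reduced parameters of Table 4.3 can be evaluated as in the sketch below, which transcribes the equation as printed. The function names are hypothetical and the element capacitances passed in are illustrative placeholders, not values extracted in this work; only $\sigma_{Cu}$ and $\varepsilon_{Si}$ are taken from Table 4.2.

```python
import math

SIGMA_CU = 5.96e7     # conductivity of copper in S/m (Table 4.2)
EPS_SI = 1.05315e-10  # permittivity of silicon in F/m (Table 4.2)


def series(c_a, c_b):
    """Series combination of two capacitances."""
    return (c_a * c_b) / (c_a + c_b)


def tsv_capacitance(c_ins, c_bump1, c_bump2, c_sisub,
                    c_underfill, c_bottom, freq_hz):
    """Evaluate Equation 4.1 using the reduced parameters of Table 4.3."""
    omega = 2.0 * math.pi * freq_hz
    c_b1 = c_ins + c_bump1
    c_b2 = c_ins + c_bump2
    c1 = series(c_b1, c_b2)
    c2 = c_sisub
    c3 = c_underfill + c_bottom
    k = 1.0 + SIGMA_CU / (EPS_SI * omega)  # frequency-dependent factor
    return c3 + (c1 * c2 * k) / (c1 + 2.0 * c2 * k)


# Placeholder element capacitances in farads, for illustration only.
c_tsv = tsv_capacitance(c_ins=40e-15, c_bump1=10e-15, c_bump2=10e-15,
                        c_sisub=20e-15, c_underfill=2e-15, c_bottom=3e-15,
                        freq_hz=1e9)
print(f"C_TSV = {c_tsv * 1e15:.2f} fF")
```

With the constants of Table 4.2 the factor $1 + \sigma_{Cu}/(\varepsilon_{si} * \omega)$ is very large at gigahertz frequencies, so the expression as printed approaches $C_3 + C_1/2$. The resulting $C_{TSV}$ is the value that enters Equation 3.4.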

Figure 4.4 shows the simulation framework used to evaluate the Mesh and BFT topology variants. The BookSim simulator has been modified to simulate 3D Mesh and BFT topologies with configurable dimensions. The floorplan module takes the topology, PE size and router area to generate the link lengths. These parameters are passed to the link delay and power modules. The delays of the horizontal and vertical links are calculated by the link delay module: the horizontal link delay ($T_{D\_H}$) is calculated using ORION, and the vertical link delay ($T_{D\_TSV}$) is calculated by the TSV delay module. The delays of the individual links are passed to the simulator to create the topology (build network). The power module takes the link lengths and router details to calculate accurate power values. The vertical link power ($P_{D\_TSV}$) is calculated using the TSV power module, and the router power ($P_r$) and horizontal link power ($P_{D\_H}$) are calculated using ORION.



Figure 4.4: Simulation framework for evaluating power and performance. BookSim was extended with 3D TSV delay, power and link delay modules.

#### 4.2.3 TSV-Router in HotSpot for 3D NoC Architecture

The Alpha Ev6 processor architecture (Figure 4.5(a)) is the base configuration provided by HotSpot6.0. A router-TSV element of standard dimensions has been incorporated into the Ev6 floorplan.

The position of the router in the floorplan of the chip affects the heat distribution across its neighbours. The Data cache and Integer Registers are the primary regions where the temperature is high due to their relatively higher power consumption in each time step.

Since the router is responsible for pushing packets across the network, an ideal placement is a region where most of the data resides. Two separate configurations were considered under the 3D Mesh architecture: i) the router at the bottom-right corner (Naive 3D Mesh), next to the Data cache (Figure 4.5(b)), and ii) the router shifted to the centre (Thermal Aware Mesh architecture), between the Instruction and Data caches (Figure 4.5(c)). The data cache tends to be one of the hottest parts of the core. This high temperature from a neighbouring element can affect the long-term performance of the router and, effectively, the chip as a whole. On the other hand, shifting the router away from the data cache reduces the interfacial contact between the two elements, which lowers the effect of thermal conduction from the hotter data cache. The configuration with the router shifted away from the Data cache (Figure 4.5(c)) is considered as the Thermal-Aware Mesh architecture.

The Router-TSV based floorplans for the 3D NoC Mesh and BFT topologies are shown in Figure 4.5 (c) and (d). As a thermally aware design, the router is placed between the Instruction and Data caches (Figure 4.5(c)) to keep it equidistant from both. For the BFT topology, one router is shared by 4 leaf PEs as shown in Figure 4.11, and Figure 4.5 (d) presents the layout. HotSpot6.0 was extended to provide support for the two mentioned 3D NoC architectures (Mesh and BFT) and to analyse the thermal effects.

Router-TSV elements have been added to the floorplans in HotSpot. Adding this router element results in the final floorplan not being a perfect closed figure (rectangle or square). To correct this, filler elements have been added as dead space (Figure 4.5 (b) and (c)).

The process of generating floorplan and power trace files for both 3D Mesh and 3D BFT NoC architectures has been automated using scripts which take a small number of parameters (Figure 4.6). Power trace files contain the power consumed by every element in the NoC floorplan at each time step. In these experiments, power values for 100 time steps have been generated from the existing power trace files in

[Figure 4.5, panels (a) to (d): block-level floorplans of the Alpha EV6 core units (FPMap, IntMap, IntQ, IntReg, FPMul, FPReg, LdStQ, IntExec, FPAdd, FPQ, ITB, Bpred, DTB, Icache, Dcache) with the added Filler and Router/TSV blocks.]

Figure 4.5: Logical representations of (a) the default Alpha EV6 processor layout in HotSpot6.0, (b) the modified layout with the router next to the Data cache (Mesh topology), (c) the modified layout with the router shifted away from the Data cache (Thermal-Aware Mesh architecture), and (d) one router shared between 4 cores (not to scale) for the BFT architecture.



Figure 4.6: Automated floorplan generation in HotSpot for 3D NoC architectures.

HotSpot6.0 based on the SPEC CPU2000 Benchmark (Henning, 2000). Corresponding components on all cores consume the same power in each time step. All layers are arranged in the layer configuration file.
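The replication of per-component power across identical cores can be sketched as follows. This is a minimal illustration and not the scripts used in this work: the function name `replicate_ptrace`, the `_c<i>` suffix appended to unit names, and the sample values are assumptions. Only the general layout of a power trace (a header row of unit names followed by one row of power values per time step) is relied upon.

```python
def replicate_ptrace(single_core_trace, n_cores):
    """Expand a single-core power trace to n_cores identical cores.

    single_core_trace: text with a header row of unit names followed by
    one whitespace-separated row of power values per time step.
    Unit names receive a hypothetical "_c<i>" suffix per core.
    """
    lines = [ln.split() for ln in single_core_trace.strip().splitlines()]
    names, rows = lines[0], lines[1:]
    header = [f"{name}_c{core}" for core in range(n_cores) for name in names]
    out = ["\t".join(header)]
    for row in rows:
        if len(row) != len(names):
            raise ValueError("row length does not match header")
        # Corresponding components on every core get the same power.
        out.append("\t".join(row * n_cores))
    return "\n".join(out) + "\n"


# Placeholder values for illustration only.
sample = "Icache\tDcache\tRouter\n1.5\t2.0\t0.3\n1.4\t2.1\t0.3\n"
trace = replicate_ptrace(sample, n_cores=2)
print(trace)
```

Keeping the per-core column order identical to the single-core header makes it straightforward to match each column to the corresponding floorplan element.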

The 3D NoC power models have been added to both the BookSim2.0 and HotSpot6.0 simulators. Figure 4.7 shows the overall modified framework, which has been used to analyse the power, performance and thermal behaviour of the 3D NoC architecture with accurate interconnect delay and power models.

# 4.3 Analysis of 3D NoC Topology Variants

The extended version of BookSim2.0 has been used to evaluate 2-layer and 4-layer 3D Mesh and BFT topology variants. This section presents the details of simulating and analysing these topologies.

## 4.3.1 General Procedure to Add New Topologies

New topologies are added into BookSim2.0 through these steps:

- Create the topology in topo.cpp and topo.hpp in the network directory, and add the topology name to network.cpp so that the topology can be selected in the configuration file during simulation.
- Add new routing methods to routefunc.cpp, naming the route function topology\_rout\_name, and add rout\_name at the beginning of the file so that it can be selected in the configuration file during the topology simulation.
- Track the traffic and flow of the simulation through trafficmanager.cpp to make any necessary changes.

Table 4.4 lists the files, functions and parameters added to the simulator for the 3D Mesh and BFT NoC architectures.



Figure 4.7: The overall modified simulation framework for the power, performance and thermal behaviour of 3D NoC architectures.

| Item                        | Description                                                                      |
|-----------------------------|----------------------------------------------------------------------------------|
| Files added/modified        |                                                                                  |
| TSV.hpp, TSV.cpp            | Detailed TSV power and delay model, which checks for a valid TSV configuration   |
| varmesh.hpp and varmesh.cpp | Mesh topology with variable radix in the X, Y and Z dimensions                   |
| flbft.hpp and flbft.cpp     | To evaluate the 2D BFT topology                                                  |
| bft2l.hpp and bft2l.cpp     | To evaluate the 2-layer 3D BFT topology                                          |
| bft4l.hpp and bft4l.cpp     | To evaluate the 4-layer 3D BFT topology                                          |
| New functions               |                                                                                  |
| valid\_tsv(h, r, p)         | Checks that the TSV configuration is within safe limits                          |
| get_lpTSV()                 | Returns the low-power TSV configuration within safe limits                       |
| get_least_area_TSV          | Returns the TSV configuration with the lowest area                               |
| _xLeftNode(node, dim)       | Returns the left node in the X dimension                                         |
| _xRightNode(node, dim)      | Returns the right node in the X dimension                                        |
| _yLeftNode(node, dim)       | Returns the left node in the Y dimension                                         |
| _yRightNode(node, dim)      | Returns the right node in the Y dimension                                        |
| _zLeftNode(node, dim)       | Returns the left node in the Z dimension                                         |
| _zRightNode(node, dim)      | Returns the right node in the Z dimension                                        |
| link_trace()                | Returns details of link utilization (horizontal and vertical) in the topologies  |
| New parameters              |                                                                                  |
| TSV Type                    | Type of TSV (currently signal and ground TSVs)                                   |
| TSV Diameter                | Diameter of the TSV ($\mu m$)                                                    |
| TSV Pitch                   | Distance between two TSVs ($\mu m$)                                              |
| TSV Height                  | Height of the TSV ($\mu m$)                                                      |
| X                           | Radix in the X dimension (number of routers in the X dimension)                  |
| y                           | Radix in the Y dimension (number of routers in the Y dimension)                  |
| _Z                          | Radix in the Z dimension (number of routers in the Z dimension)                  |
| HL1-HL4                     | Four different horizontal link latencies                                         |
| VL1, VL2                    | Two different vertical link latencies                                            |

Table 4.4: Details of the modifications made to BookSim2.0 for 3D NoC topologies.
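The neighbour look-ups listed in Table 4.4 can be illustrated with a short sketch. It is not the BookSim2.0 code: the helper names `to_coords`, `to_node` and `neighbour` are hypothetical, router IDs are assumed to be numbered with X varying fastest, then Y, then Z, and a return value of -1 is assumed to mark a mesh boundary.

```python
def to_coords(node, dims):
    """Map a router ID to (x, y, z), assuming X varies fastest."""
    kx, ky, _ = dims
    return node % kx, (node // kx) % ky, node // (kx * ky)


def to_node(coords, dims):
    """Inverse of to_coords."""
    kx, ky, _ = dims
    x, y, z = coords
    return x + kx * (y + ky * z)


def neighbour(node, axis, step, dims):
    """Return the neighbour of `node` along `axis` (0=X, 1=Y, 2=Z).

    step is -1 for the "left" neighbour and +1 for the "right" one.
    Returns -1 when the neighbour would fall outside the mesh.
    """
    coords = list(to_coords(node, dims))
    coords[axis] += step
    if not 0 <= coords[axis] < dims[axis]:
        return -1
    return to_node(coords, dims)


dims = (8, 4, 2)                      # radices of the 2-layer 3D Mesh
right_x = neighbour(0, 0, +1, dims)   # right neighbour in X
up_z = neighbour(0, 2, +1, dims)      # same position, next layer
off_edge = neighbour(0, 1, -1, dims)  # no neighbour past the boundary
print(right_x, up_z, off_edge)
```

Passing the three radices explicitly lets the same helper serve the 8 x 4 x 2 and 4 x 4 x 4 variants without change.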

## 4.3.2 Mesh topology

The floorplan consists of a tiled CMP with 64 Sun-SPARC cores (Xu et al., 2012), each with a core area of 3.4mm<sup>2</sup>. Router area is estimated using ORION3.0 (Kahng et al., 2009). The values of the micro-architectural parameters used are shown in Table 4.5.

| Clock frequency (GHz) | PE area ($mm^2$) | 4-port router area ($mm^2$) | 5-port router area ($mm^2$) | 6-port router area ($mm^2$) | 7-port router area ($mm^2$) | Channel size (bit) | TSV delay (clock cycles) |
|-----------------------|------------------|-----------------------------|-----------------------------|-----------------------------|-----------------------------|--------------------|--------------------------|
| 2.5                   | 3.4              | 0.47098                     | 0.598509                    | 0.729954                    | 0.865314                    | 64                 | 1                        |

Table 4.5: Parameters used in designing the floorplan.

Figure 4.8 shows the floorplan of an $8 \times 8$ 2D Mesh network consisting of 64 routers. This configuration is simulated in the BookSim2.0 simulator with an accurate wire delay (HL1=4) for the horizontal link length of 1.8mm.



Figure 4.8:  $8 \times 8$  2D Mesh with 64 PEs

 $8 \times 4 \times 2$  (Figure 4.9) and  $4 \times 4 \times 4$  (Figure 4.10) are the two 3D Mesh variants used as case studies for the proposed BookSim2.0 3D extensions. Table 4.7 shows the changes made for the 2-layer and 4-layer 3D Mesh topologies. The horizontal and vertical delays, HL1 and VL1, are set in the configuration file before simulation. Based on the floorplan, the link delays are HL1=4 and VL1=1. mesh2l and mesh4l are the topology names used in the configuration file during the simulation.

The ZXY routing in Algorithm is implemented in routefunc.cpp for 2D and 3D Mesh topologies. ZXY routes flits intralayer and then interlayer based on the

| Parameter          | Value                   |
|--------------------|-------------------------|
| Clock Frequency    | 2.5GHz                  |
| PEs area           | $3.4 \mathrm{mm}^2$     |
| 4-port router area | $0.47098 \mathrm{mm}^2$ |
| 5-port router area | $0.598509 \text{mm}^2$  |
| 6-port router area | $0.729954 \text{mm}^2$  |
| 7-port router area | $0.865314 \text{ mm}^2$ |
| Channel size       | 64 bit                  |

Table 4.6: Parameters used in the design of the floorplan



Figure 4.9: Floorplan of  $8\times4\times2$  3D Mesh with two stacked layers connected using TSVs.

destination coordinates.
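Deterministic dimension-order routing of this kind can be sketched as below. This is an illustrative sketch rather than the routine in the simulator: the function names are hypothetical, and the order in which the dimensions are resolved is left as a parameter (defaulting to Z, then X, then Y as an assumption) so that either an interlayer-first or an intralayer-first policy can be tried.

```python
def dimension_order_route(cur, dest, order=(2, 0, 1)):
    """Pick the next hop for deterministic dimension-order routing.

    cur, dest: (x, y, z) router coordinates.
    order: axes in the order they are resolved; (2, 0, 1) resolves
    Z, then X, then Y. The ordering is an assumption of this sketch.
    Returns (axis, step), or None when the flit has arrived.
    """
    for axis in order:
        if cur[axis] != dest[axis]:
            step = 1 if dest[axis] > cur[axis] else -1
            return axis, step
    return None


def route_path(src, dest, order=(2, 0, 1)):
    """List every router visited from src to dest, inclusive."""
    path, cur = [tuple(src)], list(src)
    while True:
        hop = dimension_order_route(cur, dest, order)
        if hop is None:
            return path
        axis, step = hop
        cur[axis] += step
        path.append(tuple(cur))


path = route_path((0, 0, 0), (2, 1, 1))
print(path)
```

Whatever the order, the hop count equals the sum of the per-dimension distances; only the sequence of links used changes, which matters when horizontal and vertical links have different delays.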

## 4.3.3 3D BFT Topology

In the Butterfly Fat Tree (BFT) topology, PEs are placed at the leaves, and routers are placed at the top and intermediate levels, as shown in Figure 4.11. A pair of coordinates is used to label each node, (L, P) where L denotes a node's level, and P indicates its position within that level. The PEs have addresses ranging from 0 to (N



Figure 4.10: Floorplan of 4  $\times$  4  $\times$  4 3D Mesh with four stacked layers connected using TSVs.

Table 4.7: 2-layer and 4-layer 3D Mesh topology details.

| Changes made   | Description                                                   |
|----------------|---------------------------------------------------------------|
| New parameters | HL1 and VL1 (link delays as horizontal (HL) and vertical (VL)) |
| Topology name  | mesh2l (8 x 4 x 2), mesh4l (4 x 4 x 4)                        |
| Route Function | MeshZXY                                                       |

-1), and BFT has different non-uniform links in each level (Grecu et al., 2004).



Figure 4.11: 64-node BFT topology with three levels. Level 1 has 4 routers, level 2 has 8 routers, and level 3 has 16 routers. The leaves are the PEs, which are connected to the level-3 routers with 4 PEs per router.
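The router counts in Figure 4.11 follow a simple pattern that can be checked with a few lines of code. The sketch below assumes a network size that is a power of four, four PEs per leaf-side router, and half as many routers at each level above; the function name is hypothetical and the list is ordered from the leaf-side level upward.

```python
def bft_routers_per_level(n_pes):
    """Router count per BFT level, listed from the leaf side upward.

    Assumes n_pes is a power of four, four PEs per leaf-side router,
    and half as many routers at each successive level.
    """
    levels, n = 0, n_pes
    while n > 1:
        if n % 4:
            raise ValueError("n_pes must be a power of 4")
        n //= 4
        levels += 1
    return [n_pes // 2 ** (i + 2) for i in range(levels)]


counts = bft_routers_per_level(64)
print(counts, sum(counts))
```

For 64 PEs this gives 16, 8 and 4 routers, 28 in total, which agrees with the router count listed for the BFT variants in Table 4.9.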

Figure 4.12 (a) shows the 2D floorplan of the BFT topology. The micro-architectural parameters used for these experiments are shown in Table 4.5.



Figure 4.12: Floorplan of 2D Mesh with 64 PEs; each PE is connected to a router for inter-PE communication.

From the floorplan, there are five different link lengths. Table 4.8 shows the changes made to the simulator to simulate a functional 3D BFT topology.

2-layer 3D BFT: Starting from the 2D BFT floorplan, level 2 routers are moved towards level 1 to reduce the overall link length. Figure 4.13 shows the 2-layer 3D BFT, which is extended from the 2D BFT. The 2-layer 3D BFT has two stacked layers connected through vertical TSVs. The overall link length is reduced by up to 50% in the 3D BFT compared to its 2D counterpart. Table 4.8 shows the changes made to the simulator for the 2-layer 3D BFT topology.

4-layer 3D BFT: Figure 4.14 shows the 2-layer 3D BFT modified into a 4-layer 3D BFT topology, with each layer consisting of 16 PEs. Level 1 routers are placed in between layers to reduce the length of the TSVs. Figure 4.14(c1) shows the vertical connections between the routers from level 1 to level 2. Table 4.8 shows the changes made to the simulator for the 4-layer 3D BFT topology.

Table 4.9 lists the resources of the 2D and 3D variants of Mesh and BFT: network dimensions, number of routers, and link details. From Table 4.9, the 4-layer 3D Mesh has a reduction in horizontal links from 112 to 96, and there are 48 extra



Figure 4.13: (b1)  $8 \times 4 \times 2$  3D Mesh with two stacked layers connected using TSVs. (b2) Inter-layer connections.



Figure 4.14: (c1)  $4 \times 4 \times 4$  3D Mesh with four stacked layers connected using TSVs. (c2) Inter-layer connections.

VLs compared to 2D NoC mesh topology.
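The link counts quoted above for the 2D Mesh and the 4-layer 3D Mesh can be reproduced by counting neighbour pairs in a regular mesh. The sketch below is a sanity check only; it assumes a full mesh with no wrap-around links and one vertical link per router between adjacent layers, and the function name is hypothetical.

```python
def mesh_link_counts(kx, ky, kz):
    """Count bidirectional router-to-router links in a kx x ky x kz mesh.

    Returns (horizontal, vertical): links within a layer and links
    between adjacent layers. Assumes no wrap-around links and one
    vertical link per router per pair of adjacent layers.
    """
    horizontal = kz * ((kx - 1) * ky + kx * (ky - 1))
    vertical = kx * ky * (kz - 1)
    return horizontal, vertical


mesh_2d = mesh_link_counts(8, 8, 1)
mesh_4l = mesh_link_counts(4, 4, 4)
print(mesh_2d, mesh_4l)
```

Stacking shrinks each layer's footprint, so the horizontal link count drops while every added layer contributes a full plane of vertical links.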

# 4.4 Experimental Setup

The BookSim2.0 simulator has been extended to support 3D NoCs by adding (a) TSV delay and power modules, and (b) Orion power and delay modules for horizontal links. The

| Changes made   | 2D BFT                        | 2-layer 3D BFT                | 4-layer 3D BFT                |
|----------------|-------------------------------|-------------------------------|-------------------------------|
| New parameters | HL1-HL4                       | HL1-HL3, VL1                  | HL1-HL3, VL1, VL2             |
| Topology name  | flbft                         | bft2l                         | bft4l                         |
| Route Function | Nearest Common Ancestor (NCA) | Nearest Common Ancestor (NCA) | Nearest Common Ancestor (NCA) |

Table 4.8: Floorplan based 2D BFT and 3D BFT variants topology details.

Table 4.9: Total resources in the 2D and 3D variants of Mesh and BFT considered in this work. Network size is 64 PEs. Links are horizontal (HL) and vertical (VL). VC is the number of virtual channels and D is the buffer depth per VC.

| NoC Topology    | X/k | Y/n | Z | No. of routers | In/Out ports | VC | D  | HL  | VL |
|-----------------|-----|-----|---|----------------|--------------|----|----|-----|----|
| 2-D Mesh        | 8   | 8   | 1 | 64             | 5/5          | 8  | 7  | 112 | 0  |
| 2-layer 3D Mesh | 8   | 4   | 2 | 64             | 6/6          | 8  | 6  | 108 | 32 |
| 4-layer 3D Mesh | 4   | 4   | 4 | 64             | 6/6, 7/7     | 8  | 6  | 96  | 48 |
| 2-D BFT         | 4   | 3   | 1 | 28             | 4/4, 8/8     | 8  | 16 | 48  | 0  |
| 2-layer 3D BFT  | 4   | 3   | 2 | 28             | 4/4, 8/8     | 8  | 16 | 32  | 16 |
| 4-layer 3D BFT  | 4   | 3   | 4 | 28             | 4/4, 8/8     | 8  | 16 | 40  | 8  |

HotSpot6.0 thermal analysis tool has been extended to support 3D NoCs by adding (a) a TSV area and power model, and (b) a router area and power model. Figure 4.7 shows the overall modified simulation framework for the power, performance and thermal behaviour of 3D NoC architectures. In all, four 3D variants of the Mesh and BFT topologies are implemented in BookSim2.0 as mesh2l, mesh4l, bft2l, and bft4l. Link delays are provided as the horizontal delay (HL) and vertical delay (VL), together with the TSV details (height, diameter and pitch).

The link delays are configurable and can be varied based on the floorplans of the NoCs. By default, HL and VL are one clock cycle. Table 4.10 shows the experimental setup in the BookSim2.0 simulator. All temperatures are in Kelvin (K), and the HotSpot6.0 simulations are run in the environment specified in Table 4.11.

| Parameter           | Input values                              |
|---------------------|-------------------------------------------|
| Topology            | mesh, mesh2l, mesh4l, bft2d, bft2l, bft4l |
| Network size (k, n) | k=4, 8 and n=2, 3                         |
|                     | 8, 12                                     |
| Routing function    | xy, xyz, nca, nca_RROD, nca_ROD           |
|                     | 5                                         |
| Traffic pattern     | Uniform, Transpose                        |
| Injection rate      | 0.02 to 0.20                              |
|                     | 10000 (cycles)                            |
|                     | 10                                        |

Table 4.10: NoC BookSim2.0 parameter for 2D and 3D Mesh and BFT variants.

Table 4.11: Thermal evaluation Simulation Environment.

| Environment variable            | Specification                                       |
|---------------------------------|-----------------------------------------------------|
| Simulator                       | HotSpot6.0                                          |
| Operating Frequency             | 2.5 GHz                                             |
| Activity Factor                 | 0.15                                                |
| Clock Cycles (cc)               | 100000                                              |
| Topology Simulated              | 2D, 2-layer and 4-layer 3D Mesh and 3D BFT variants |
| TSVs per Vertical Link (TSVL)   | 64                                                  |
| Power Consumed per TSV (PCT)    | $4.2 \ \mu W$                                       |
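One way to read the last two rows of Table 4.11 is as the inputs to a per-link TSV power figure. The sketch below assumes that every TSV in a vertical link draws the listed per-TSV power, so that the link power is simply TSVL multiplied by PCT. That relation and the function name are assumptions of this illustration, not a statement of the power model used in the simulator.

```python
TSVS_PER_VERTICAL_LINK = 64   # TSVL, from Table 4.11
POWER_PER_TSV_UW = 4.2        # PCT in microwatts, from Table 4.11


def vertical_link_power_uw(n_links=1):
    """Power of n_links vertical links, in microwatts.

    Assumes link power = TSVL x PCT, i.e. every TSV in the link
    draws the per-TSV power (an assumption of this sketch).
    """
    return n_links * TSVS_PER_VERTICAL_LINK * POWER_PER_TSV_UW


per_link = vertical_link_power_uw()
print(f"{per_link:.1f} uW per vertical link")
```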

## 4.5 Results and Discussion

### 4.5.1 Average Network Latency

Average network latencies obtained from BookSim2.0 using the default link latencies and using floorplan-based link latencies are plotted in Figure 4.15 (a) and (b) for the 2D Mesh and 2D BFT respectively. Using floorplan-based link lengths and the corresponding delays in simulation, an increase in average network latency of 19% to 43% is observed. Floorplanning and accurate delay estimation therefore play a significant role in observing the true performance of the NoC architecture. An increase in average network latency of up to  $1.45 \times$  is observed in Mesh, and of up to  $8 \times$  in the BFT topology, when floorplan-derived, accurate link delays are used in the simulation. The BFT topology link length is  $2.5 \times$  greater than that of the Mesh, resulting in a larger increase in the average network latency.



Figure 4.15: Average network latency comparison with accurate link delay modelling. (a) 2D Mesh(default link delay) and 2D Mesh with accurate link delay and (b) 2D BFT(default link delay) and 2D BFT with accurate link delay.

### A Evaluation of Random Output Deflection(ROD) and Round Robin Output Deflection (RROD) routing (NCA) in BFT topology

Figure 4.16 compares the latencies of the 2-layer 3D BFT topology under random and round-robin output deflection routing for the uniform and transpose traffic patterns.

In Figure 4.16, there is a 10 to 13% increase in the overall network latency for RROD compared to ROD. With RROD, the flow of flits is balanced between links, which leads to the transfer of more flits compared to ROD. RROD routing is selected as the



Figure 4.16: Average network latency comparison for 2-layer 3D BFT for RROD and ROD routing.

best output path selection for the BFT topology variants compared to ROD routing. In further evaluation, RROD is used for output path selection in the 2D and 3D BFT NoC variants.
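The difference between the two output selection policies can be illustrated with a small sketch. It is not the routing code added to the simulator: the class names are hypothetical, each router is assumed to choose between two candidate upward ports, and the random selector is seeded only to make the run repeatable.

```python
import random
from collections import Counter


class RoundRobinSelector:
    """Cycle through candidate output ports in a fixed order (RROD-style)."""

    def __init__(self, n_ports):
        self.n_ports = n_ports
        self._next = 0

    def select(self):
        port = self._next
        self._next = (self._next + 1) % self.n_ports
        return port


class RandomSelector:
    """Pick a candidate output port uniformly at random (ROD-style)."""

    def __init__(self, n_ports, seed=0):
        self.n_ports = n_ports
        self._rng = random.Random(seed)

    def select(self):
        return self._rng.randrange(self.n_ports)


def port_usage(selector, n_flits):
    """Count how many flits each candidate port receives."""
    return Counter(selector.select() for _ in range(n_flits))


rr_usage = port_usage(RoundRobinSelector(2), 1000)
rnd_usage = port_usage(RandomSelector(2), 1000)
print(dict(rr_usage), dict(rnd_usage))
```

Round-robin selection splits the flits exactly evenly between the candidate ports, whereas random selection is only even on average, which is the balancing effect described above.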

### 4.5.2 Performance Evaluation of 3D Mesh and BFT

Figure 4.17 compares the average network latency of the 2D and 3D variants of Mesh and BFT under uniform and transpose traffic patterns. Results are shown for VC=8 and buffer depth (D)=12.

### A Mesh Topology

The Mesh topology performs better under the uniform traffic pattern than under the transpose traffic pattern for all variants. In Figure 4.17 (a), the 4-layer 3D Mesh with uniform traffic shows up to  $2.3 \times$  improvement over the 2D Mesh with uniform traffic and up to  $2 \times$  improvement over the 2-layer 3D Mesh. The 2-layer 3D Mesh with uniform traffic shows up to  $1.11 \times$  improvement over the 2D Mesh for the uniform traffic pattern.

The 4-layer 3D Mesh with transpose traffic shows up to  $3 \times$  improvement over the 2D Mesh with transpose traffic and up to  $3.1 \times$  improvement over the 2-layer 3D Mesh with transpose traffic. The 2-layer 3D Mesh with transpose traffic shows up to  $1.1 \times$  improvement over the 2D Mesh for the transpose traffic pattern. The improved performance in the 3D Mesh is due to the replacement of horizontal wires with TSVs (the wire delay is  $4 \times$  greater than the TSV delay) and the additional vertical connections (Table 3.3).

The 4-layer 3D Mesh with uniform traffic shows up to  $2.3 \times$  improvement over



Figure 4.17: Average network latency comparison between uniform and transpose traffic pattern for 2D and 3D variants of (a) Mesh topology and (b) BFT topology

4-layer 3D Mesh transpose traffic pattern. The 2-layer 3D Mesh with uniform traffic shows up to  $3.4 \times$  improvement over the 2-layer 3D Mesh with the transpose traffic pattern. The even distribution of packets in uniform traffic results in lower contention in the links compared to the transpose traffic pattern.

### B BFT Topology

The BFT topology performs better under the transpose traffic pattern than under the uniform traffic pattern for all variants. In Figure 4.17 (b), the 4-layer 3D BFT with uniform traffic shows up to  $1.2 \times$  improvement over the 2D BFT with uniform traffic and up to  $1.3 \times$  improvement over the 2-layer 3D BFT with uniform traffic.

The 4-layer 3D BFT with transpose traffic shows up to  $1.2 \times$  improvement over the 2D BFT with transpose traffic and up to  $1.3 \times$  improvement over the 2-layer 3D BFT with transpose traffic. In the 3D BFT topology, the average wire delay is  $21 \times$  the TSV delay, as shown in Table 3.3, and eight horizontal wire links are converted to TSV links.

The 4-layer 3D BFT with transpose traffic shows up to  $1.1 \times$  improvement over the 4-layer 3D BFT with the uniform traffic pattern. The 2-layer 3D BFT with transpose traffic shows up to  $1.2 \times$  improvement over the 2-layer 3D BFT with the uniform traffic pattern. The transpose traffic results in better localised traffic compared to the uniform traffic pattern.

### 4.5.3 Average Energy per Flit (EPF)

Figures 4.18 (a) and (b) depict the average energy consumption per flit of the 2D and 3D variants of the Mesh and BFT topologies.

Figure 4.18 (a) depicts the EPF of the 2D and 3D Mesh topology variants for both uniform and transpose traffic. The 4-layer Mesh topology achieves an average 15% reduction in EPF compared to the 2-layer 3D Mesh and a 35% reduction compared to the 2D Mesh for both uniform and transpose traffic. The 2D and 3D Mesh variants with uniform traffic have a 10% lower EPF than with transpose traffic.



Figure 4.18: (a) Average EPF for 2D and 3D Mesh topology variants (b) Average EPF for 2D and 3D BFT topology variants.

Figure 4.18 (b) shows the EPF of the 2D and 3D BFT topology variants for both uniform and transpose traffic. The 4-layer BFT topology achieves an average 40% reduction in EPF compared to the 2-layer 3D BFT and a 75% reduction compared to the 2D BFT for both uniform and transpose traffic. The 2D and 3D BFT variants with transpose traffic have a 15% lower EPF than with uniform traffic.

### 4.5.4 Energy Delay Product (EDP)

Figure 4.19 shows the EDP of 2D and 3D variants of Mesh and BFT topology for uniform and transpose traffic patterns. The EDP is the product of average network latency and average Energy per flit.
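The EDP definition and the normalisation used for plotting can be written out as a short sketch. The helper names are hypothetical, the choice of baseline is an assumption, and the sample values are placeholders rather than measured results.

```python
def energy_delay_product(latency_cycles, energy_per_flit):
    """EDP = average network latency x average energy per flit."""
    return latency_cycles * energy_per_flit


def normalise(edp_by_variant, baseline):
    """Express every EDP relative to the chosen baseline variant."""
    ref = edp_by_variant[baseline]
    return {name: edp / ref for name, edp in edp_by_variant.items()}


# Placeholder (latency, energy-per-flit) pairs, not measured results.
samples = {"variant_a": (40.0, 2.0), "variant_b": (20.0, 1.0)}
edp = {name: energy_delay_product(*pair) for name, pair in samples.items()}
normalised = normalise(edp, baseline="variant_a")
print(normalised)
```

Because EDP multiplies the two metrics, a variant that halves both latency and energy per flit ends up at a quarter of the baseline EDP.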

Figure 4.19 (a) depicts the EDP of all variants for the uniform traffic pattern. The 4-layer 3D Mesh has the lowest EDP, and the 2-layer 3D Mesh has the second lowest EDP among the variants. Figure 4.19 (b) shows the EDP of all variants for the transpose traffic pattern.

3D Mesh variants have a lower EDP than the 3D BFT variants, as there is an 80% reduction in link lengths and up to  $3\times$  larger TSVs in 3D Mesh. The 4-layer 3D



Figure 4.19: Normalized EDP of 2D and 3D Mesh and BFT topology for (a) Uniform traffic pattern (b) Transpose traffic pattern.

BFT transpose has the nearest EDP ($1.5\times$) to the 4-layer 3D Mesh transpose, and optimised 2-layer and 4-layer 3D BFT designs could achieve a lower EDP than the 4-layer 3D Mesh. The 2D BFT has the largest EDP because of its link delays: its horizontal links are up to  $2.5\times$  longer than those of the other variants, as discussed in the link delay analysis.

### 4.5.5 Thermal behaviour

A general observation in all of the evaluated architectures is that the outer routers show lower temperature values than those located at the centre. This can be observed as a sharp increase in the router temperatures along the x-axis after the first few routers. This is because heat dissipation is good at the boundary, but the elements in the middle of the die are surrounded from all sides. The only way heat can escape is through the Thermal Interface Material below. Hence the elements at the centre get hotter than the boundary elements over the duration of the simulation.

The position of the router in the floorplan of the chip affects the heat distribution across its neighbours. The Data cache and Integer Registers are the primary regions where the temperature is high, due to their relatively higher power consumption in each time step. After $R_{31}$ (in Figure 4.20 (d) and (e)), the remaining routers correspond to the lower core floorplan layer, closer to the heatsink.

#### A Router next to Data cache

The router is sandwiched between the hot Data cache and Integer Registers (Figure 4.20(a)). As a consequence of thermal conduction, the heat generated by the Data cache directly affects the router, whose temperature at the end of the simulation is also high (Figure 4.20(d)).



Figure 4.20: Heatmaps in a 64-core (a) 3D Mesh architecture with the router next to the data cache, (b) 3D Mesh architecture with the router shifted away from the data cache and Temperature distribution across routers in (c) 3D Mesh architecture with the router next to the data cache, (d) 3D Mesh architecture with the router shifted away from the data cache.

#### B Router shifted away from Data cache

The thermal-aware architecture layout overcomes the thermal shortcomings of the naive 3D Mesh. By shifting the router towards the centre, contact with the Data cache is reduced and contact with the Integer Registers is avoided completely. As a result, Figure 4.20(b) shows a lower temperature in the router. Hence, by placing the router away from the elements that tend to heat up due to their higher power consumption, the router stays cooler than in the naive 3D Mesh. However, this also means that the Data cache has fewer paths through which to dissipate heat and hence heats up more than it would if the router were directly next to it. The average of the maximum temperatures of all the routers in a 4 x 8 x 2 3D NoC is lowered by 3% in the thermal-aware Mesh architecture (Figure 4.20(d)).

### 4.5.6 Thermal behaviour of 3D Mesh and 3D BFT topologies

### A Mesh topology

Figures 4.21 (a), (b), (c), and (d) show the thermal behaviour of the 2D, 2-layer 3D and 4-layer 3D Mesh topologies, and a comparison of the 2D and 3D Mesh variants, respectively. The temperature variation in the 2D Mesh topology follows a pattern of higher temperature in the middle routers than in the corner routers: in the 8 x 8 Mesh topology, the boundary routers (two routers and the last router at every X radix) show a drop in temperature of up to 7 degrees Celsius compared to the centre routers. The 2-layer 3D Mesh also follows the pattern of cooler corner routers. The layer 1 routers (routers 0 to 31) are on average 8 degrees Celsius hotter than the layer 0 routers (routers 32 to 63). Within each layer, the corner routers show lower temperatures than the centre routers.



Figure 4.21: Thermal behaviour of (a) 2D mesh Topology (b) 2-layer 3D Mesh Topology (c) 4-layer Mesh topology (d) Average thermal behaviour comparison of 2D and 3D Mesh variants.

Similarly, the 4-layer 3D Mesh also follows the pattern of cooler corner routers, and the layers nearer to the heat sink have lower temperatures. The layer four routers (routers 0 to 15) are on average 20 degrees Celsius hotter than the layer 0 routers (routers 48 to 63). Within each layer, the corner routers show lower temperatures than the centre routers. The layers farther from the heat sink have higher temperatures.

### B BFT topology

Figures 4.22 (a), (b), (c), and (d) show the thermal behaviour of the 2D, 2-layer 3D and 4-layer 3D BFT topologies, and a comparison of the 2D and 3D BFT variants, respectively. The temperature variation in the 2D BFT topology follows a different pattern from that of the Mesh topology because of the organization of routers in BFT. Routers that connect to cores have higher temperatures because these routers are concentrated (C=4). The Level Zero routers have lower temperatures than the higher levels in the BFT topology.



Figure 4.22: Thermal behaviour of (a) 2D BFT topology (b) 2-layer 3D BFT topology (c) 4-layer 3D BFT topology (d) Average thermal behaviour comparison of 2D and 3D BFT variants.

The 2-layer 3D BFT also follows the pattern of the 2D BFT topology, with lower-level routers at lower temperatures. The layer 1 routers are on average 5 degrees Celsius hotter than the layer 0 routers. Similarly, the 4-layer 3D BFT follows the pattern of cooler corner routers and cooler layers near the heat sink. The layers farther from the heat sink have higher temperatures, and good packaging helps in controlling the thermal effect on the NoC architecture.

A general observation in all of the evaluated architectures is that the outer cores show lower temperature values than those located at the centre. This is because heat dissipation is better at the edge of the floorplan, but the elements in the middle of the die are surrounded from all sides. The only way heat can sink is through the Thermal Interface Material below. Hence the elements at the centre get hotter than the boundary elements over the duration of the simulation. It is evident that the routers in the naive setup register higher maximum and minimum temperatures as compared to the proposed thermally aware floorplan.

### C Router Power Analysis

The power dissipation of routers is observed under the uniform traffic pattern for the 2D BFT and 2D Mesh topologies at an injection rate of 0.01. The runtime power trace generated by the BookSim2.0 simulator is fed to HotSpot6.0 to generate the heat dissipation of the NoC architecture. Router power utilisation (mW) is generated from 1000cc to 10000cc at intervals of 2000cc for the Mesh topology (64 routers) and the BFT topology (28 routers).

Table 4.12 shows the average router power at each level of the BFT topology. From Table 4.12, it is observed that the level-1 routers have the highest power usage, the level-2 routers the lowest, and the level-0 routers lie in between. Considering nodes 0-3 as sources, the level-1 routers transfer packets destined for nodes 4 to 63 (up to 90%). The level-0 routers transfer packets destined for nodes 32 to 63 (up to 50%). The level-2 routers transfer packets destined for nodes 0-3 (only 10%).

Table 4.13 depicts the average power utilisation of peripheral routers on each(left, top, right, bottom) side and middle routers of Mesh topology.

The evolution of overall power that has been consumed by the routers throughout the simulation is observed. The overall thermal dissipation is identical for all routers in each time interval. The thermal dissipation of the routers is influenced by the adjoining PE temperature. The power and temperature of the PEs overshadow the small variation in router power and thermal behaviour. Further, the router power values indicate that the traffic is uniformly distributed in the Mesh (unlike the BFT).

Table 4.12: Average router power (mW) at each level of the 2D BFT topology with interval of 2000cc.

| Time (cc) | Level 0 (R0 to R3) | Level 1 (R4 to R11) | Level 2 (R12 to R27) |
|-----------|--------------------|---------------------|----------------------|
| 2000      | 30.6977            | 31.6374             | 30.1445              |
| 4000      | 29.1810            | 29.9401             | 29.1623              |
| 6000      | 28.6611            | 29.3543             | 28.8281              |
| 8000      | 28.4031            | 29.0668             | 28.6686              |
| 10000     | 28.2285            | 28.8741             | 28.5811              |

Table 4.13: Average router power (mW) of the peripheral routers (each side) and middle routers of the Mesh topology.

| Time (cc) | Left column | Top row    | Right column | Bottom row | Middle routers |
|-----------|-------------|------------|--------------|------------|----------------|
| 2000      | 10.2467725  | 10.1711225 | 10.2551875   | 10.3467875 | 10.4728        |
| 4000      | 9.91973375  | 9.90385875 | 9.94620125   | 9.96809875 | 10.0679        |
| 6000      | 9.82164625  | 9.81374    | 9.82349875   | 9.8513875  | 9.9296         |
| 8000      | 9.768175    | 9.76879    | 9.77123      | 9.78861125 | 9.8547         |
| 10000     | 9.745785    | 9.7327425  | 9.73685625   | 9.75603625 | 9.8098         |

# 4.6 Summary

The presented simulator can be used to observe the power, performance and thermal behaviour of 3D NoC architectures. A TSV-based power and delay model is incorporated into BookSim2.0 to estimate the power and performance of 3D NoC architectures accurately. Support has been implemented for irregular 3D Mesh topologies (variable radix in each of the X, Y and Z dimensions) with ZXY routing and for BFT topology variants with nearest common ancestor routing. The Mesh performs better than the BFT topology due to its topology structure, i.e. links, router input ports, and buffers are larger in the 2D and 3D Mesh compared to the 2D and 3D BFT. The 3D Mesh variants have a lower EDP than the 3D BFT variants, as there is an 80% reduction in link lengths and up to $3 \times$ larger TSVs in the 3D Mesh. Under transpose traffic, the 4-layer 3D BFT comes closest, with an EDP $1.5 \times$ that of the 4-layer 3D Mesh. Optimising the 2-layer and 4-layer 3D BFT designs can lead to a lower EDP than the 4-layer 3D Mesh. The thermal aware 3D Mesh architecture layout overcomes the thermal shortcomings of the naive 3D Mesh. The average of the maximum temperatures of all the routers in a 4 x 8 x 2 3D NoC is lowered by 3% in the thermal aware Mesh architecture. The thermal behaviour of the 4-layer Mesh and BFT topologies shows that the corner routers have lower temperatures. In both topology variants, the layers closer to the heat sink are cooler by 20 degrees Celsius compared to the other layers.

# Chapter 5

# Area, Power and Performance analysis of Optimal 3D BFT NoC Architecture

In this chapter, the work explores power and performance trade-offs in two variants of the 2-layer 3D Butterfly Fat Tree (BFT) topology using a floorplan-driven approach. The first 3D BFT variant analyzed is a standard stacked BFT (3DBFT) derived from a 2D BFT topology. The performance of the output flow control is analyzed using random and round-robin output based deflection routing for the 3D BFT variants. A power-performance optimal 3D BFT (OP3DBFT) is evolved from the standard 3DBFT using overall performance, link and TSV minimization, and power-performance trade-offs. The new OP3DBFT (Optimal Power and Performance 3DBFT) architecture with round-robin output deflection routing (RROD) is proposed as a power and performance optimal 2-layer 3D NoC architecture.

# 5.1 Analysis of 2D and 3DBFT Topology

An equivalent 3D BFT topology is constructed from the 2D BFT topology. Floorplans of both the 2D and 3D BFT topologies are shown. TSV serialization and routing options in the BFT topologies are analyzed.

### 5.1.1 Floorplanning

The conventional 64 PE BFT topology is shown in Figure 5.1. Except for the top level routers (which have four ports each), all routers contain six ports. The ports of each 6-port router are connected as follows: four ports connect to the four child nodes and the remaining two ports connect to the two parent nodes. Figure 5.2 shows the floorplan of the 2D BFT topology. The floorplan consists of a system with a tiled Chip Multiprocessor containing 64 Sun-SPARC cores (the area of each core is 3.4mm<sup>2</sup>) (Xu et al., 2012). Router area is estimated from the ORION area module. Table 4.6 lists some of the microarchitectural parameters used to derive the floorplan. Based on the floorplan, the 2D BFT has five different link lengths. The link lengths are used to estimate the link delays for performance evaluation (shown in Table 5.2).



Figure 5.1: 64 node BFT topology with three levels. Level 1 has 4 routers, level 2 has 8 routers, and level 3 has 16 routers. The leaves are the PEs, which are connected to the level-3 routers with 4 PEs per router.

The 3D BFT floorplan is derived from the 2D BFT floorplan by equally distributing the PEs and the associated routers in 2-layers (Figure 5.3). The level 2 routers are moved closer to level 1 routers to reduce the link length. Eight vertical links, made up of TSV bundles, are shown in the 3DBFT floorplan.

### 5.1.2 Through Silicon Via Link Delay Model

The propagation delay of a TSV depends on its dimensions (length, radius and pitch). Khalil et al. (2008) use an analytical model of TSVs to obtain the TSV delay, power and valid TSV configurations. An analytical model of the propagation delay of the TSV is shown in Algorithm 1. TSV delay depends on Height/Length (l), Diameter (d), and Pitch/Separation (s). The safe limits of each parameter (taken from Lee et al. (2014); Weerasekera et al. (2009)) are considered to avoid manufacturing complexity during the fabrication process, and they are applied to the generated set of TSV configurations. The TSV configuration with $20\mu$m height, $20\mu$m diameter, and $40\mu$m pitch yields the lowest TSV power.

Figure 5.2: Floorplan of 2D BFT topology.

Figure 5.3: (a) Floorplan of 3DBFT (two stacked layers) connected using TSVs ($8 \times 4 \times 2$). (b) Inter-layer connections.

### 5.1.3 Data Serialization over TSVs

Data serialization is used to reduce the area footprint of TSVs at the cost of an additional power overhead. The channel size used in this work is 64 bits. The area footprint of a 64-bit TSV connection adversely affects the overall area of the chip. The lowest-power TSV configuration (h=$20\mu$m, d=$20\mu$m and p=$40\mu$m) has a higher pitch, and the area of the TSVs is directly proportional to the pitch size. 2:1 data serialization is therefore considered, which halves the footprint area of the TSVs (64 TSVs to 32 TSVs). Table 5.1 shows the TSV design parameters, such as TSV count, Keep Out Zone (KOZ) and dimensions, for non-serialized and 2:1 serialized TSVs. The TSV array dimension reduces to half in the case of 2:1 serialized TSVs. The delay of the TSV is 50ps, which is much smaller than the 0.4ns clock period (2.5GHz). The TSV delay is therefore considered as one clock cycle throughout this work.

Table 5.1: Parallel and serial case with the TSVs design parameters and TSV count

|                                        | 1:1 TSVs (per channel) | 2:1 Serialisation TSVs (per channel) |
|----------------------------------------|------------------------|--------------------------------------|
| TSVs count                             | 64                     | 32                                   |
| KOZ ($\mu$m)                           | 5                      | 5                                    |
| TSV array dimension ($\mu$m) (TSV+KOZ) | 270 x 640              | 135 x 640                            |
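
The serialisation arithmetic in this section can be checked with a short calculation. The following is a minimal sketch in Python that uses only the figures quoted here (64-bit channel, 2:1 serialisation, 50ps TSV delay, 2.5GHz clock, and the array dimensions of Table 5.1). The function and constant names are illustrative assumptions and are not part of BookSim.

```python
import math

CHANNEL_BITS = 64      # channel width used in this work
CLOCK_HZ = 2.5e9       # 2.5GHz clock, i.e. a 0.4ns period
TSV_DELAY_S = 50e-12   # 50ps TSV propagation delay


def tsv_count(channel_bits: int, serialisation: int) -> int:
    """Number of signal TSVs per channel for an N:1 serialisation ratio."""
    return channel_bits // serialisation


def delay_in_cycles(delay_s: float, clock_hz: float) -> int:
    """Round a physical delay up to whole clock cycles (at least one cycle)."""
    return max(1, math.ceil(delay_s * clock_hz))


def array_area_mm2(width_um: float, height_um: float) -> float:
    """Area of a TSV array in square millimetres from its sides in micrometres."""
    return width_um * height_um / 1e6


if __name__ == "__main__":
    print("TSVs per channel (1:1):", tsv_count(CHANNEL_BITS, 1))
    print("TSVs per channel (2:1):", tsv_count(CHANNEL_BITS, 2))
    print("TSV delay in cycles:", delay_in_cycles(TSV_DELAY_S, CLOCK_HZ))
    print("Array area 1:1 (mm^2):", array_area_mm2(270, 640))
    print("Array area 2:1 (mm^2):", array_area_mm2(135, 640))
```

Since 50ps is one eighth of the 0.4ns clock period, the TSV delay rounds up to a single cycle, and halving the array width from 270 to 135 halves the array area.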

### 5.1.4 Nearest Common Ancestor (NCA) Routing in BFT Topology

Figure 5.4 shows an example of the possible routing paths from the source (node 0) to the destination (node 32). The destination can be reached from the source through two different output paths. Similarly, there are always two paths for each packet injected by any source. Two different paths are available at Level 3 and Level 2, so alternative paths can be chosen. Random output path selection and round-robin output path selection are analysed in the NCA algorithm flowchart (Figure 5.7).



Figure 5.4: 2D BFT topology with two path flows from node 0 to node 32.

### A Random and Round Robin Output Based Deflection Routing

The Nearest Common Ancestor (NCA) routing (Algorithm 4) is implemented in routfunction.cpp for the 2D and 3D BFT variants. NCA identifies the minimum and maximum reachable destinations at each router and then forwards packets to the appropriate output port of the router.

In the BFT topology, there are two upward paths per level for each source and destination pair (for example, from node 0 to node 32 in Figure 5.4), with an equal number of hops in each path. This allows alternative paths to be chosen to avoid congestion and improve on-chip communication. The Random Output Deflection (ROD) and Round Robin Output Deflection (RROD) path selection mechanisms are analysed in the NCA algorithm (Algorithms 4 and 5).

ROD routing arbitrarily selects one of the two output ports at each level for each packet. Figure 5.5 depicts the scenario for node 0 to node 32 and node 3 to node 35. The ROD routing algorithm is presented in Algorithm 5. With random output deflection, the same output port may be selected for two different packets, which can lead to congestion and increase the communication latency.

RROD selects the output port in a round-robin manner while routing a packet (Algorithms 4 and 5). Figure 5.6 shows the selection of alternate ports for routing packets. The RROD routing function is shown in Algorithm 5. This experiment shows that RROD leads to less congestion and better communication latency compared to ROD.

Figure 5.5: ROD routing from **node 0** to **node 32** and **node 3** to **node 35** for 2D BFT topology.



Figure 5.6: RROD - from **node 0** to **node 32** and **node 3** to **node 35** for 2D BFT topology.

Random Output Deflection (ROD) routing is illustrated in Figure 5.5. ROD selects a random output path while sending flits from source to destination. In ROD, selecting the same output port for different packets leads to additional latency due to contention. In Round Robin Output Deflection (RROD), the output path is selected in round-robin order (Figure 5.6). Alternating the output path helps balance the traffic on the links. Figure 5.7 depicts the NCA algorithm flowchart with both the ROD and RROD path selection mechanisms.

| Algorithm 4: NCA Routing algorithm for both 2D and 3D BFT.                                       |
|--------------------------------------------------------------------------------------------------|
| <b>Input:</b> <i>cur</i> node, <i>flow</i> and <i>dest</i> node                                  |
| <b>Output:</b> <i>output_port</i> from <i>cur</i> node to <i>dest</i>                            |
| 1 if <i>cur</i> != <i>dest</i> then                                                              |
| 2 Find the <i>cur</i> node level (<i>nl</i>) and position of node (<i>rp</i>) in the level;      |
| 3 if <i>nl</i> == zero then                                                                      |
| 4                                                                                                |
| 5 else if <i>nl</i> == 1 then                                                                    |
| 6 Find lowest (<i>min</i>) node and maximum (<i>max</i>) node which can be reached from <i>cur</i>; |
| 7 if <i>dest</i> is in between <i>max</i> and <i>min</i> then                                    |
| 8                                                                                                |
| 9 else                                                                                           |
| 10                                                                                               |
| 11 else                                                                                          |
| 12 Find lowest (<i>min</i>) node and maximum (<i>max</i>) node which can be reached from <i>cur</i>; |
| 13 if <i>dest</i> is in between <i>max</i> and <i>min</i> then                                   |
| 14                                                                                               |
| 15 else                                                                                          |
| 16                                                                                               |
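
The range test at the heart of Algorithm 4 can be illustrated with a simplified model. The sketch below is written in Python and assumes that the 64 PEs are numbered consecutively, so that the PEs below any router form a contiguous range of 4, 16 or 64 nodes. This numbering, the function names, and the notion of a router's height above the PEs are assumptions made for illustration; the sketch does not reproduce the BookSim implementation or its port numbering.

```python
ARITY = 4  # assumption: every BFT router has four children


def reachable_range(node: int, height: int, arity: int = ARITY) -> tuple:
    """(min, max) PE ids below the router that sits `height` levels above `node`."""
    span = arity ** height
    low = (node // span) * span
    return low, low + span - 1


def must_go_up(node: int, dest: int, height: int, arity: int = ARITY) -> bool:
    """True if `dest` lies outside the range reachable at this height,
    so the packet has to be forwarded to a parent router."""
    low, high = reachable_range(node, height, arity)
    return not (low <= dest <= high)


def nca_height(src: int, dest: int, arity: int = ARITY) -> int:
    """Number of router levels a packet climbs before it can turn downward."""
    height = 1
    while must_go_up(src, dest, height, arity):
        height += 1
    return height


if __name__ == "__main__":
    for src, dest in [(0, 3), (0, 5), (0, 32), (3, 35)]:
        print(f"node {src} -> node {dest}: climb {nca_height(src, dest)} level(s)")
```

Under these assumptions, a packet from node 0 to node 32 has to climb through all three router levels before a common ancestor is reached, which is the case where an upward port has to be chosen at more than one level.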

**Algorithm 5:** ROD and RROD routing function in NCA for 2D and 3D BFT topology.

```
Input: Output flow (flow) (ROD or RROD)
Output: out_port
Function getoutport(flow):
    /* Random Output Deflection (ROD) */
    if flow == ROD then
        out_port = rand() % 2 + 4;
    /* Round-Robin Output Deflection (RROD) */
    else
        if RB == 1 then
            out_port = 4; RB = 0;
        else
            out_port = 5; RB = 1;
    return out_port;
```
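
The difference between the two selection schemes can be seen in a small experiment. The following Python sketch mirrors Algorithm 5, with ports 4 and 5 as the two upward ports and `rb` as the round-robin bit. The class name `UpPortSelector`, the seeded random generator and the load-counting helper are illustrative assumptions and are not taken from the BookSim code.

```python
import random
from collections import Counter

UP_PORTS = (4, 5)  # the two parent ports used in Algorithm 5


class UpPortSelector:
    """Chooses an upward output port with ROD (random) or RROD (round-robin)."""

    def __init__(self, flow: str, seed=None):
        if flow not in ("ROD", "RROD"):
            raise ValueError("flow must be 'ROD' or 'RROD'")
        self.flow = flow
        self.rb = 1                      # round-robin bit, as RB in Algorithm 5
        self.rng = random.Random(seed)   # private generator for repeatable runs

    def get_out_port(self) -> int:
        if self.flow == "ROD":
            # Random Output Deflection: either port with equal probability
            return self.rng.randrange(2) + 4
        # Round-Robin Output Deflection: alternate between the two ports
        if self.rb == 1:
            self.rb = 0
            return 4
        self.rb = 1
        return 5


def port_loads(selector: UpPortSelector, packets: int) -> Counter:
    """Count how many of `packets` consecutive packets are sent to each port."""
    return Counter(selector.get_out_port() for _ in range(packets))


if __name__ == "__main__":
    print("RROD loads:", dict(port_loads(UpPortSelector("RROD"), 8)))
    print("ROD loads: ", dict(port_loads(UpPortSelector("ROD", seed=1), 8)))
```

With RROD the two ports receive the same number of packets to within one, whereas ROD balances the load only on average, so short bursts can land on the same port.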

### 5.1.5 Link Delay Estimation

The floorplan-based link delays are estimated using the ORION RC delay models (2.5GHz frequency) for wires. The TSV delay, power, and valid TSV configurations are obtained using the Khalil et al. (2008) model, which takes the TSV length, radius and pitch as input parameters. An ideal TSV configuration is generated using the safe limits of each parameter. TSV delay depends on the Height/Length (l), Diameter (d) and Pitch/Separation (s).

Figure 5.7: NCA Routing flowchart of BFT topology with ROD and RROD.

Table 5.2 shows the delays of the horizontal wires and the TSVs in the 2D BFT and 3DBFT based on the floorplans. Each vertical link contains 32 TSVs (2:1 serialisation). The TSV count and the TSV delay of each vertical link are given in the last two columns of Table 5.2.

| Topology | Wire (mm) | Delay (clock cycles) | Number of TSVs (32 TSVs per link) | Delay (clock cycles) |
|----------|-----------|----------------------|-----------------------------------|----------------------|
| 2D BFT   | 9.376     | 92                   | -                                 | -                    |
|          | 8.976     | 85                   | -                                 | -                    |
|          | 4.4889    | 21                   | -                                 | -                    |
|          | 4.088     | 18                   | -                                 | -                    |
| 3DBFT    | 8.176     | 68                   | 256                               | 1                    |
|          | 7.776     | 63                   |                                   |                      |
|          | 4.088     | 18                   |                                   |                      |
|          | 3.688     | 14                   |                                   |                      |
|          | 1         | 1                    |                                   |                      |

Table 5.2: Link length and delay details of BFT topologies variants.

# 5.2 Power and Performance Optimal OP3DBFT

The utilization of the TSVs in a conventional 3DBFT is analyzed under synthetic traffic patterns. The TSVs with the least utilization are removed under performance constraints. The optimal 3DBFT (OP3DBFT) topology and floorplan are presented.

### 5.2.1 TSV Count Minimisation

Figure 5.8 depicts the vertical links (L1 to L8) of the 3DBFT topology in red and blue. The channel width of each link is 64 bits. A 64-bit TSV channel, with 64 pairs of signal and ground TSVs, results in a prohibitive area of $0.1728mm^2$ (the $270\mu m \times 640\mu m$ array of Table 5.1). The reduction in the TSV area through 2:1 serialisation is presented in Section 5.1.3. Table 5.3 lists the utilisation of the vertical links (L1-L8) of the 3DBFT topology for uniform, transpose and bit-reversal traffic.

Table 5.3: Links utilisation (Injection rate=0.018) of 3DBFT (8-vertical links (TSVs)) for uniform, transpose and bit-reversal. Average utilisation of 3DBFT is 40% and OP3DBFT is 80%.

| Traffic pattern | L1     | L2     | L3     | L4     | L5     | L6     | L7     | L8     |
|-----------------|--------|--------|--------|--------|--------|--------|--------|--------|
| Uniform         | 0      | 43.062 | 21.619 | 21.463 | 43.673 | 43.032 | 43.679 | 43.742 |
| Transpose       | 65.128 | 65.092 | 43.402 | 43.524 | 43.648 | 43.509 | 43.242 | 43.737 |
| bit-reversal    | 65.128 | 65.092 | 43.402 | 43.524 | 43.648 | 43.509 | 43.242 | 43.737 |



Figure 5.8: 2D BFT topology. The links marked in red and blue (L1 to L8) are the vertical links of the 3DBFT topology.

Table 5.4: Links utilisation (Injection rate=0.018) of OP3DBFT (2-vertical links (TSVs)) for uniform, transpose and bit-reversal.

| Traffic pattern | L1     | L2     |
|-----------------|--------|--------|
| Uniform         | 83.759 | 84.419 |
| Transpose       | 84.303 | 84.581 |
| bit-reversal    | 83.968 | 83.968 |

The average utilisation of the 3DBFT vertical links is 32%, 47%, and 48% for uniform, transpose and bit-reversal traffic respectively. On average, 50% of the links (TSVs) are under-utilised in the 3DBFT topology. This work therefore attempts to reduce the number of vertical links without affecting the overall performance of the BFT.
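
The per-link values can be averaged directly as a check. The sketch below, in Python, does this for the uniform-traffic rows of Tables 5.3 and 5.4. The function names are illustrative, and the 40% threshold used to flag lightly loaded links is a hypothetical value chosen only for this example.

```python
from statistics import mean

# Uniform-traffic rows copied from Table 5.3 (3DBFT) and Table 5.4 (OP3DBFT)
UNIFORM_3DBFT = [0.0, 43.062, 21.619, 21.463, 43.673, 43.032, 43.679, 43.742]
UNIFORM_OP3DBFT = [83.759, 84.419]


def average_utilisation(links):
    """Mean utilisation over all vertical links of a topology."""
    return mean(links)


def lightly_loaded(links, threshold):
    """Names (L1, L2, ...) of links whose utilisation is below `threshold`."""
    return [f"L{i}" for i, u in enumerate(links, start=1) if u < threshold]


if __name__ == "__main__":
    print("3DBFT average:  ", round(average_utilisation(UNIFORM_3DBFT), 2))
    print("OP3DBFT average:", round(average_utilisation(UNIFORM_OP3DBFT), 2))
    # hypothetical 40% threshold, for illustration only
    print("3DBFT links below 40:", lightly_loaded(UNIFORM_3DBFT, 40.0))
```

For uniform traffic the averages come out at roughly 32.5 for the 3DBFT and 84.1 for the OP3DBFT, in line with the 32% and 84% quoted in this section.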

Figure 5.9 shows the modified 3DBFT topology, with only 2 vertical links (reduced from 8). One TSV link is removed from the Level-1 routers, thereby reducing the degree of the routers from 4 to 3. The overall connectivity has not been altered. By applying TSV serialisation and TSV count minimisation, the OP3DBFT is proposed. The topology and floorplans are shown in Figure 4.8. Table 5.4 lists the link utilisation of the L1 and L2 links for uniform, transpose and bit-reversal traffic patterns. The average link utilisation for OP3DBFT is 84%, 85%, and 85% for uniform, transpose, and bit-reversal traffic respectively.

### 5.2.2 OP3DBFT - Topology and Floorplan

Figure 5.9 shows the modified 3DBFT topology, where each layer consists of 32 PEs. Level-1 routers have a degree of 3: one output port of each router connects to the next odd router in level-0 (vertical interconnection) and two output ports are connected to level-1 routers (horizontal links). Figure 5.10 (i) shows the floorplan of the OP3DBFT.



In the modified 3DBFT, the eight vertical links (vertical interconnects) of the 3DBFT are reduced to two, i.e. six links are removed (Figure 5.10). There are only two different link lengths (4.088mm and 3.688mm) in the OP3DBFT floorplan. The delay of both links is 13 clock cycles. The OP3DBFT has up to 80% fewer TSVs and 75% less TSV area than the regular 3DBFT.



Figure 5.10: (i) 8 x 4 x 2 2-layer OP3DBFT with two stacked layers. (ii) Inter-layer(TSVs) connections.

# 5.3 Experimental Setup

The BookSim simulator has been extended to support 3D NoCs by adding (a) TSV delay and power modules for vertical links, and (b) ORION power and delay modules for horizontal links. The floorplan module takes the topology, PE size, and router area as input and outputs the lengths of the links. These parameters are passed to the link delay and power modules. The link delay module calculates the delay of the individual horizontal and vertical links. The horizontal link delay ($T_{D\_H}$) is calculated using ORION, and the vertical link delay ($T_{D\_TSV}$) is calculated by the TSV delay module.

The delay of each individual link is passed to the simulator to create the topology (build network). The delays of the links (horizontal wires and vertical links) are modelled as described in Section 5.1.5. The 3DBFT topologies shown in Figures 5.3 and 5.10 are implemented in the simulator. The nearest common ancestor algorithm with ROD and RROD routing is implemented for 64 nodes. The routing functionality and the BFT network topology are implemented and tested in the BookSim simulator. The power module takes the link lengths and router details to calculate accurate power values. The vertical link power ($P_{D\_TSV}$) is calculated using the TSV power module. The router power ($P_r$) and horizontal link power ($P_{D\_H}$) are calculated using ORION. These power values are used once the transfer of flits starts. The topologies simulated in BookSim are the 2D BFT, 3DBFT, and OP3DBFT with a network size of 64 nodes. There are 28 routers with 8 VCs per port and a VC buffer depth of 16. The simulation time is $10^5$ cycles.

# 5.4 Results and Discussion

### 5.4.1 Performance Analysis

The performance comparison of 3DBFT and OP3DBFT for uniform, transpose and bit-reversal traffic is shown in Figure 5.11. The OP3DBFT shows a performance improvement of up to $1.54 \times$, $1.38 \times$, and $1.37 \times$ over 3DBFT for uniform, transpose and bit-reversal traffic respectively. The performance improves because the TSV count is reduced by up to 75%, i.e. 6 vertical links are removed compared to the regular 3DBFT.



Figure 5.11: Latency comparison of 2-layer and OP3DBFT topology for uniform, transpose and bit-reversal traffic.

### 5.4.2 Energy Analysis

Energy (Joules) per flit (JPF) for 3DBFT and OP3DBFT is calculated using Equation 1. Figure 5.12 shows the Joules per flit of the OP3DBFT and regular 3DBFT variants for uniform, transpose and bit-reversal traffic. From the results, OP3DBFT shows an average decrease of 23% in JPF compared to the regular 3DBFT. The JPF of OP3DBFT decreases by up to 23%, 22% and 21% for uniform, transpose and bit-reversal traffic respectively.



Figure 5.12: Energy per flit comparison of 2-layer and OP3DBFT topology for uniform, transpose and bit-reversal traffic.

### 5.4.3 Energy Delay Product (EDP)

Figure 5.13 shows the normalised EDP of the OP3DBFT and regular 3DBFT variants for uniform, transpose and bit-reversal traffic patterns. In the 3DBFT topology, transpose and bit-reversal traffic show an average EDP reduction of up to 10% and 11% respectively compared to the uniform traffic pattern throughout the simulation (Figure 5.13), as BFT is better suited to localised traffic than to uniformly distributed traffic. The EDP is the product of the average network latency and the average energy per flit. The OP3DBFT shows a 46%, 44% and 44% reduction in EDP compared to 3DBFT for uniform, transpose and bit-reversal traffic respectively. Overall, the EDP of OP3DBFT is lower than that of 3DBFT because both its latency (Figure 5.11) and its energy (Figure 5.12) are lower.
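
Because the EDP is a product, relative reductions in latency and in energy compound. The following Python sketch shows the calculation with purely hypothetical latency and energy values, which are not results from this work. The function names are illustrative.

```python
def edp(avg_latency_cycles: float, avg_energy_per_flit: float) -> float:
    """Energy delay product: average network latency times average energy per flit."""
    return avg_latency_cycles * avg_energy_per_flit


def edp_reduction(baseline: tuple, improved: tuple) -> float:
    """Fractional EDP reduction of `improved` relative to `baseline`.
    Each argument is a (latency, energy_per_flit) pair."""
    return 1.0 - edp(*improved) / edp(*baseline)


if __name__ == "__main__":
    # Hypothetical values for illustration only
    baseline = (100.0, 1.0)   # latency in cycles, energy per flit in arbitrary units
    improved = (70.0, 0.8)    # 30% lower latency, 20% lower energy per flit
    print(f"EDP reduction: {edp_reduction(baseline, improved):.0%}")
```

In this hypothetical case a 30% lower latency combined with a 20% lower energy per flit gives a 44% lower EDP, since 0.7 x 0.8 = 0.56.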



Figure 5.13: Normalised EDP of regular 3DBFT and OP3DBFT for uniform, transpose, and bit reversal traffic.

### 5.4.4 Area Utilization

The BFT topologies have been implemented using CONNECT, a web-based NoC generator tool (Papamichael, 2011). The HDL models of OP3DBFT have been obtained by modifying the BFT HDL models. The synthesis results have been obtained using Xilinx Vivado. A Xilinx Artix-7 XC7A200T FPGA board has been used to analyse the FPGA resource utilization, and Table 5.5 shows the detailed synthesis results. From Table 5.5, it can be seen that the regular 3DBFT topology consumes 1.12% more LUTs than the Optimal Power and Performance 3DBFT topology. The proposed topology has a 12% reduction in area compared to the regular BFT topology without compromising its performance.

| H/W utilisation (%) | 3DBFT   | OP3DBFT |
|---------------------|---------|---------|
| LUTs                | 54.6    | 50.04   |
| FFs                 | 10.47   | 9.72    |
| Freq                | 100 MHz | 100 MHz |

Table 5.5: Synthesis results of 3DBFT Topology variants

# 5.5 Summary

A novel, low cost, power-performance optimal 3D BFT topology (OP3DBFT) is proposed. OP3DBFT is evolved from the standard 3D BFT by eliminating extraneous TSV links under a performance constraint. The utilization of the links in 3DBFT is analyzed under uniform, transpose and bit-reversal traffic. The regular 3D BFT and the OP3DBFT employ 2:1 serialization to reduce the area footprint of the TSV links. Two path selection schemes, round-robin output (RROD) and random output (ROD) selection, based on Nearest Common Ancestor routing are used to evaluate the performance of the BFT topologies. State-of-the-art TSV delay and power models have been incorporated into the BookSim simulator. The delays of the horizontal wires are derived from the ORION delay models. The OP3DBFT shows up to $1.44 \times$, $1.38 \times$ and $1.37 \times$ performance improvement compared to 3DBFT for uniform, transpose and bit-reversal traffic respectively. OP3DBFT performs better due to its modified structure (75% reduction in TSV count). The Joules Per Flit (JPF) of OP3DBFT decreases by up to 23%, 22%, and 21% for uniform, transpose, and bit-reversal traffic respectively. The EDP of OP3DBFT shows a 46%, 44%, and 44% reduction compared to 3DBFT for uniform, transpose, and bit-reversal traffic respectively. Based on the synthesis results, OP3DBFT consumes 12% less area than the regular 3D BFT topology.

# Chapter 6

# Conclusion and Future Scopes

The microarchitectural design space of the 2D and 3D variants of the Mesh and BFT topologies has been explored. Accurate wire delays have been derived from link delay and TSV delay models. The lengths of the horizontal links and TSVs have been estimated using the floorplans of the respective topologies. The 3D NoC modelling capabilities have been extended in two existing state-of-the-art simulators, viz., the 2D NoC simulator BookSim2.0 and the thermal behaviour simulator HotSpot6.0. With the extended 3D NoC modules, the simulators can be used for power, performance and thermal measurements through micro-architectural and physical parameters. Wire delays have been obtained using the ORION delay models. The 2D and 3D variants of the Mesh and BFT topologies are characterised using uniform and transpose traffic patterns through cycle-accurate simulation. Among the Mesh and BFT topologies, Mesh shows better on-chip communication performance than BFT because of its topology structure, i.e. links, router input ports, and buffers are larger in the 2D and 3D Mesh compared to the 2D and 3D BFT.

The effect of varying multiple TSV parameters (length, diameter, and pitch) on power and performance metrics has been investigated. The current work provides better insights into the optimal design of the 3D TSV design space. The thermal behaviour of 3D NoC architectures has also been evaluated and a thermal aware 3D Mesh NoC architecture has been proposed. This work has made use of accurate power estimation models for the fundamental 3D NoC elements, namely the router and the TSV. The router power has been obtained using ORION. The thermal effect of the TSV and router positions on the chip floorplan has been analyzed by modifying HotSpot for 3D Mesh and 3D BFT NoC architectures. A novel, low cost, power-performance optimal 3D BFT topology (OP3DBFT) has been proposed. OP3DBFT is evolved from the standard 3D BFT by eliminating extraneous TSV links under a performance constraint. OP3DBFT employs 2:1 serialization to reduce the area footprint of the TSV links. Two path selection schemes, round-robin output (RROD) and random output (ROD) selection, based on Nearest Common Ancestor routing are used to evaluate the performance of the BFT topologies. Based on the synthesis results, the optimal power and performance 2-layer 3D topology consumes 12% less area than the regular 3D BFT topology.

In future, a thermal aware routing algorithm that takes into account the dynamic power of the TSVs can be developed to overcome reliability issues such as TSV faults. Examining high performance computing and communication-intensive applications using 3D NoCs can be another direction of research. Several open challenges in 3D NoCs, with proper support of other emerging technologies, could be of research interest for high performance computing systems and applications.

# Appendix A

### A.1 Routers Thermal Behaviour for 2D BFT and 2D Mesh Topologies

The runtime power trace generated by the BookSim2.0 simulator is fed to HotSpot6.0 to generate the heat dissipation of the NoC architecture. The details are discussed below.

Table A.1 (A) and (B) show each router's power (mW) from 1000cc to 10000cc at intervals of 2000cc for the BFT topology (28 routers). Table A.1 (C) shows the average router power at each level of the BFT topology. From Table A.1 (C), it is observed that the level-1 routers have the highest power usage, the level-2 routers the lowest, and the level-0 routers lie in between. Considering nodes 0-3 as sources, the level-1 routers transfer packets from 4 to 63 (up to 90%). The level-0 routers transfer packets for 0-3 (only 10%).

Table A.2 depicts each router's power usage of the Mesh topology(64-router) from 1000cc to 10000cc with an interval of 2000. Table A.2 (A), (B), (C) and (D) depicts the average router power utilisation of peripheral routers on each side of Mesh topology i.e, left, top, right and bottom sides respectively. Table A.3 depicts the average power utilisation of peripheral routers on each(left, top, right, bottom) side and middle routers of Mesh topology.

The evolution of the overall power consumed by the routers throughout the simulation was observed. The overall thermal dissipation is identical for all routers in each time interval. The thermal dissipation of the routers is influenced by the adjoining PE temperature. The power and temperature of the PEs overshadow the small variation in router power and thermal behaviour. Further, the router power values indicate that the traffic is uniformly distributed over all the routers in the Mesh (unlike the BFT).

# Table A.1: Each Router Power consumption and average router power of each levelof 2D BFT topology with interval of 500CC

| Time (CC) | R-0     | R-1     | R-2     | R-3     | R-4     | R-5     | R-6     | R-7     | R-8     | R-9     | R-10    | R-11    | R-12    | R-13    |
|-----------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| 2000     | 30.3131 | 31.1016 | 31.0873 | 30.2886 | 31.6062 | 32.0577 | 31.8174   | 31.5484  | 31.8095 | 31.6097 | 31.3401 | 31.3099 | 30.2286 | 30.2323 |
| 4000     | 28.9344 | 29.4302 | 29.4231 | 28.9362 | 29.8928 | 30.1772 | 30.071    | 29.9067  | 30.0382 | 29.9251 | 29.7597 | 29.7455 | 29.204  | 29.2216 |
| 6000     | 28.4935 | 28.8352 | 28.8304 | 28.4853 | 29.3223 | 29.5019 | 29.4399   | 29.3216  | 29.4192 | 29.3636 | 29.2423 | 29.2229 | 28.8432 | 28.8649 |
| 8000     | 28.259  | 28.5525 | 28.548  | 28.2529 | 29.0291 | 29.1783 | 29.1322   | 29.0435  | 29.132  | 29.0811 | 28.9835 | 28.9546 | 28.6628 | 28.686  |
| 10000    | 28.1273 | 28.3344 | 28.33   | 28.1222 | 28.8377 | 28.9681 | 28.9108   | 28.8541  | 28.9267 | 28.8889 | 28.8253 | 28.7808 | 28.5679 | 28.5696 |

Table A: Routers Power (mW) from R0 to R13

Table B: Routers Power (mW) from R14 to R27

| Time (CC) | R-14    | R-15    | R-16    | R-17    | R-18    | R-19    | R-20    | R-21    | R-22    | R-23    | R-24    | R-25    | R-26    | R-27    |
|-----------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| 2000     | 30.1462 | 30.4861            | 30.3353 | 29.8608 | 30.1268 | 30.185  | 30.3371 | 30.3914 | 30.0487 | 29.9279 | 29.9518 | 29.7929 | 30.1551 | 30.1051 |
| 4000     | 29.1488 | 29.3327            | 29.2573 | 29.0787 | 29.1531 | 29.1962 | 29.2582 | 29.3012 | 29.114  | 29.0536 | 29.0507 | 28.9573 | 29.1673 | 29.1012 |
| 6000     | 28.8064 | 28.9483            | 28.8887 | 28.7597 | 28.8192 | 28.8473 | 28.8893 | 28.9372 | 28.7832 | 28.7721 | 28.741  | 28.6974 | 28.8187 | 28.8327 |
| 8000     | 28.6352 | 28.7639            | 28.7109 | 28.6076 | 28.6448 | 28.6874 | 28.7197 | 28.7477 | 28.6327 | 28.6458 | 28.5932 | 28.5757 | 28.6444 | 28.7388 |
| 10000    | 28.5308 | 28.6177            | 28.5776 | 28.5043 | 28.5431 | 28.5649 | 28.6007 | 28.6281 | 28.5232 | 28.5388 | 28.4986 | 28.4964 | 28.5375 | 28.9974 |

Table C: Average Routers Power (mW) of each level in BFT topology.

| Time (CC) | Level 0 (R0 to R3) | Level 1 (R4 to R11) | Level 2 (R12 to R27) |
|-----------|--------------------|---------------------|----------------------|
| 2000  | 30.6977            | 31.6374                   | 30.1445              |
| 4000  | 29.1810            | 29.9401                   | 29.1623              |
| 6000  | 28.6611            | 29.3543                   | 28.8281              |
| 8000  | 28.4031            | 29.0668                   | 28.6686              |
| 10000 | 28.2285            | 28.8741                   | 28.5811              |

Table A.2: Router Power consumption of 2D Mesh topology with interval of 2000CC

| Time(CC) |         |         |         |         |         | 1       | Routers Po | ower (mW      | .)      |         |         |         |         |         |         |         |
|----------|---------|---------|---------|---------|---------|---------|------------|---------------|---------|---------|---------|---------|---------|---------|---------|---------|
| 0        | R0      | R1      | R2      | R3      | R4      | R5      | R6         | $\mathbf{R7}$ | R8      | R9      | R10     | R11     | R12     | R13     | R14     | R15     |
| 2000     | 9.95538 | 10.106  | 10.3734 | 10.2745 | 10.2851 | 10.253  | 10.0778    | 10.0438       | 10.258  | 10.3468 | 10.6039 | 10.4886 | 10.4586 | 10.4638 | 10.3132 | 10.3483 |
| 4000     | 9.74992 | 9.84027 | 9.98905 | 9.95379 | 9.98832 | 9.97052 | 9.88552    | 9.85348       | 9.91628 | 9.9598  | 10.1184 | 10.0458 | 10.0892 | 10.0192 | 9.94299 | 9.99151 |
| 6000     | 9.68143 | 9.77179 | 9.8704  | 9.87642 | 9.87995 | 9.86748 | 9.80195    | 9.7605        | 9.80179 | 9.85089 | 9.94719 | 9.91823 | 9.93714 | 9.88043 | 9.82018 | 9.8425  |
| 8000     | 9.6618  | 9.73002 | 9.81861 | 9.81603 | 9.83286 | 9.82436 | 9.75263    | 9.71401       | 9.74498 | 9.77428 | 9.88325 | 9.83985 | 9.86909 | 9.82609 | 9.75878 | 9.78304 |
| 10000    | 9.644   | 9.70496 | 9.77548 | 9.77342 | 9.78723 | 9.77441 | 9.711      | 9.69144       | 9.73501 | 9.76941 | 9.86899 | 9.81018 | 9.84562 | 9.81192 | 9.74569 | 9.75304 |

Table A: Routers Power (mW) from R0 to R15

Table B: Routers Power (mW) from R16 to R31

| $\operatorname{Time}(\operatorname{CC})$ |         |         |         |         |         | 1       | Routers P | ower (mW | )       |         |         |         |         |         |         |         |
|------------------------------------------|---------|---------|---------|---------|---------|---------|-----------|----------|---------|---------|---------|---------|---------|---------|---------|---------|
| 0                                        | R16     | R17     | R18     | R19     | R20     | R21     | R22       | R23      | R24     | R25     | R26     | R27     | R28     | R29     | R30     | R31     |
| 2000                                     | 10.3697 | 10.5237 | 10.7559 | 10.2283 | 10.3451 | 10.4919 | 10.3449   | 10.3679  | 10.4317 | 10.5786 | 10.669  | 10.3751 | 10.6142 | 10.6356 | 10.4991 | 10.4422 |
| 4000                                     | 9.98627 | 10.0793 | 10.2556 | 9.99089 | 10.0803 | 10.0917 | 10.004    | 10.0164  | 10.0182 | 10.1067 | 10.1652 | 10.0041 | 10.1679 | 10.1794 | 10.0519 | 10.0527 |
| 6000                                     | 9.87913 | 9.95116 | 10.0587 | 9.90171 | 9.95013 | 9.9483  | 9.88094   | 9.85907  | 9.90984 | 9.9783  | 10.0073 | 9.93886 | 10.0292 | 10.0375 | 9.92174 | 9.91336 |
| 8000                                     | 9.80299 | 9.86411 | 9.97398 | 9.84164 | 9.90052 | 9.87744 | 9.80391   | 9.80256  | 9.84108 | 9.88489 | 9.92923 | 9.87038 | 9.94517 | 9.94345 | 9.82742 | 9.83619 |
| 10000                                    | 9.79236 | 9.81222 | 9.91818 | 9.7939  | 9.85304 | 9.82289 | 9.75769   | 9.75696  | 9.82353 | 9.83417 | 9.88204 | 9.82328 | 9.88913 | 9.88102 | 9.77083 | 9.78351 |

Table C: Routers Power (mW) from R32 to R47

| Time(CC) |         |         |         |         |         | 1       | Routers P | ower (mW | )       |         |         |         |         |         |         |         |
|----------|---------|---------|---------|---------|---------|---------|-----------|----------|---------|---------|---------|---------|---------|---------|---------|---------|
| 0        | R32     | R33     | R34     | R35     | R36     | R37     | R38       | R39      | R40     | R41     | R42     | R43     | R44     | R45     | R46     | R47     |
| 2000     | 10.5218 | 10.5254 | 10.5755 | 10.5508 | 10.7328 | 10.5452 | 10.2476   | 10.2562  | 10.2562 | 10.5254 | 10.5805 | 10.6126 | 10.6959 | 10.6675 | 10.499  | 10.3731 |
| 4000     | 10.0632 | 10.065  | 10.1034 | 10.0777 | 10.2112 | 10.1041 | 9.91023   | 9.96052  | 9.92953 | 10.0784 | 10.1067 | 10.1087 | 10.1954 | 10.1653 | 10.0515 | 10.0172 |
| 6000     | 9.92983 | 9.90097 | 9.93716 | 9.94958 | 10.038  | 9.97487 | 9.82842   | 9.83129  | 9.83069 | 9.90039 | 9.92936 | 9.95069 | 10.008  | 9.99851 | 9.91225 | 9.87971 |
| 8000     | 9.86358 | 9.83398 | 9.87531 | 9.86292 | 9.94426 | 9.89737 | 9.76495   | 9.78084  | 9.7817  | 9.83311 | 9.86323 | 9.85666 | 9.90714 | 9.90716 | 9.83535 | 9.8176  |
| 10000    | 9.82382 | 9.78209 | 9.83286 | 9.81092 | 9.88201 | 9.83882 | 9.72085   | 9.73924  | 9.74628 | 9.78139 | 9.81753 | 9.80626 | 9.84629 | 9.84666 | 9.78284 | 9.77466 |

Table D: Routers Power (mW) from R48 to R63

| $\operatorname{Time}(\operatorname{CC})$ |         |         |         |         |         | I       | Routers Po | ower (mW | )       |         |         |         |         |         |         |         |
|------------------------------------------|---------|---------|---------|---------|---------|---------|------------|----------|---------|---------|---------|---------|---------|---------|---------|---------|
| 0                                        | R48     | R49     | R50     | R51     | R52     | R53     | R54        | R55      | R56     | R57     | R58     | R59     | R60     | R61     | R62     | R63     |
| 2000                                     | 10.1642 | 10.3768 | 10.1927 | 10.2813 | 10.1608 | 10.3168 | 10.1944    | 10.1359  | 10.0172 | 10.4923 | 10.487  | 10.5279 | 10.549  | 10.342  | 10.2848 | 10.0741 |
| 4000                                     | 9.89859 | 10.0191 | 9.97311 | 10.0024 | 9.98637 | 10.021  | 9.91373    | 9.86852  | 9.79588 | 10.0485 | 10.0609 | 10.0663 | 10.076  | 9.95827 | 9.92966 | 9.80928 |
| 6000                                     | 9.8089  | 9.87032 | 9.85031 | 9.89814 | 9.86859 | 9.88222 | 9.81012    | 9.77053  | 9.73156 | 9.90053 | 9.90766 | 9.9225  | 9.93659 | 9.84986 | 9.83137 | 9.73103 |
| 8000                                     | 9.74988 | 9.80347 | 9.79555 | 9.82434 | 9.80217 | 9.8053  | 9.74371    | 9.73615  | 9.69939 | 9.82613 | 9.83945 | 9.85014 | 9.8461  | 9.78105 | 9.76718 | 9.69945 |
| 10000                                    | 9.72084 | 9.75769 | 9.75136 | 9.7797  | 9.75665 | 9.75915 | 9.70989    | 9.70985  | 9.68044 | 9.77547 | 9.79215 | 9.81275 | 9.80952 | 9.75146 | 9.74035 | 9.68615 |

| Time (CC) | Leftmost column | Top row    | Rightmost column | Bottom row | Middle routers |
|-------|-----------------|------------|------------------|------------|----------------|
| 2000  | 10.2467725      | 10.1711225 | 10.2551875       | 10.3467875 | 10.4728        |
| 4000  | 9.91973375      | 9.90385875 | 9.94620125       | 9.96809875 | 10.0679        |
| 6000  | 9.82164625      | 9.81374    | 9.82349875       | 9.8513875  | 9.9296         |
| 8000  | 9.768175        | 9.76879    | 9.77123          | 9.78861125 | 9.8547         |
| 10000 | 9.745785        | 9.7327425  | 9.73685625       | 9.75603625 | 9.8098         |

 

Table A.3: Average power utilisation (mW) of the peripheral routers (each side) and the middle routers of the Mesh topology

### A.2 Thermal Behaviour Analysis of 2 Layer 3D CMesh Network-on-Chip Architecture

#### A Concentrated Mesh

The Concentrated Mesh (CMesh) is a low-cost extension of the mesh architecture (Figure A.1) in which one router is shared between a set of processing elements. The number of processing elements sharing a router is referred to as the concentration (e.g., C=2, C=4, ...). The CMesh reduces the total number of horizontal links at the expense of a slight degradation in performance. A router in a 3D CMesh setup is usually larger than a router used in a 3D Mesh, since it is responsible for routing packets from multiple cores. To handle this added complexity, CMesh routers have 9 ports. In a 64-core 4x8x2 NoC setup, only 16 such routers are present, which is 48 fewer routers than in the naive 3D Mesh architecture.
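The router count follows directly from the concentration. The sketch below is illustrative; in particular, the decomposition of the 9 ports into 4 local ports, 4 planar neighbour ports and 1 vertical port is an assumption made here for a two-layer stack with C=4, and the function names are hypothetical.

```python
def cmesh_router_count(cores, concentration):
    """Number of routers when `concentration` cores share one router."""
    if cores % concentration != 0:
        raise ValueError("cores must be a multiple of the concentration")
    return cores // concentration


def cmesh_router_ports(concentration, planar_neighbours=4,
                       vertical_neighbours=1):
    """Port count under the assumed split: local + planar + vertical ports."""
    return concentration + planar_neighbours + vertical_neighbours


cores = 64
routers = cmesh_router_count(cores, concentration=4)
saved = cores - routers  # a plain 3D Mesh uses one router per core
ports = cmesh_router_ports(concentration=4)
print(routers, saved, ports)
```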



Figure A.1: 3D CMesh NoC architecture

In a 3D CMesh architecture, one router is shared between 4 cores (Figure A.2).

After R\_7, the remaining routers correspond to the lower core floorplan layer, closer to the heatsink. It can be observed that the only major hot regions in proximity to the router-TSV module are the upper and lower left cores. This results in the left half of the module having a higher temperature.



Figure A.2: (a) Temperature distribution across routers (b) Heatmaps of 3D Concentrated Mesh Architecture.

### A.3 The Effect of Varying TSV Parameters (Length, Diameter, Pitch and Bump Height) on Latency and Power

An energy-efficient TSV should be short, so that less energy is spent transmitting a bit across it; narrow, since less power is required to maintain the current through a cylinder of smaller diameter; and widely pitched, since a larger pitch reduces the effective coupling capacitance from neighbouring TSVs.

From Figure A.3 it is clear that, for different fixed values of TSV diameter and pitch, the length of the TSV is the major contributor to latency. Varying the pitch and diameter, however, does not affect the latency. This confirms the intuition that the time to send a bit across a line is simply the propagation delay along the line, which depends solely on its length. The TSV latency can be generalized to one clock cycle, and a lower TSV length implies a lower latency.



Figure A.3: Effect of varying TSV (a)length, (b)diameter and (c)pitch on latency for a single via, at operating frequency=2.5 GHz and voltage=1.1 V

The effect on power consumption of varying other TSV parameters, namely TSV length (Figure A.4(a)), diameter (Figure A.4(b)), Bump Height (Figure A.5(b)), IMD Layer Height (Figure A.6(a)) and Oxide Layer Thickness ($t_{ox}$ in Figure 4.1; Figure A.6(b)), was also investigated. Bump Height, IMD Layer Height and Oxide Layer Thickness were studied keeping TSV Length, Diameter and Pitch constant. While one of the three parameters was varied, the other two were set to the default of 1 $\mu$m.



Figure A.4: Effect of varying TSV (a)length, and (b)diameter on Power Consumption for a single TSV, at operating frequency=2.5 GHz and voltage=1.1 V with activity factor(AF) = 0.15



Figure A.5: Effect of varying (a)TSV pitch (b)Bump Height on Power Consumption for a single TSV, at operating frequency=2.5 GHz and voltage=1.1 V with activity factor(AF) = 0.15

Bump Height is a negligible component in power consumption, since the Underfill capacitance does not greatly affect the overall TSV capacitance. Insulator capacitance, however, has a negative dependence on the IMD Layer Height, resulting in a linear decrease in power. A higher oxide layer thickness also lowers the power drawn by the TSV, due to its logarithmic relationship with the insulator capacitance.

Figure A.6: Effect of varying (a)IMD Layer Height and (b)Oxide Layer Thickness on Power Consumption for a single TSV, at operating frequency=2.5 GHz and voltage=1.1 V with activity factor(AF) = 0.15

From Figure A.4(a), shorter TSVs consume less power due to the smaller TSV capacitance values. The TSV capacitances drop off significantly at larger lengths ($\geq 90\mu m$) due to the significant decrease in insulator and substrate capacitances. For a given pitch value, larger diameters cause higher insulator capacitance ($C_{ins}$) and bump capacitances ($C_{Bump1}$ and $C_{Bump2}$). These result in higher power consumption for a larger-diameter TSV. From Figure A.4(b), the combined effect of diameter and TSV capacitance influences the power consumed significantly for longer TSVs.

Smaller pitch values result in higher power consumption in TSVs. Decreasing the pitch values increases the capacitances of the Underfill, IMD, and the bottom oxide layer. These capacitances in turn increase the capacitance between the signal and the ground TSV, thereby increasing the total power consumed at lower pitch values. From Figure A.5(a), a larger reduction in power is seen at smaller pitch values ( $\leq 120\mu$ m). Pitch values above 130  $\mu$ m do not affect the capacitance between the signal and the ground TSV significantly.
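The diminishing effect of pitch can be illustrated with the $acosh(p_{tsv}/d_{tsv})$ term that appears in the IMD capacitance expression of Table A.5. The sketch below uses hypothetical values (a 30 $\mu$m TSV diameter, a 1 $\mu$m IMD height and a relative permittivity of 4) chosen only for illustration; they are not the settings used to produce Figure A.5.

```python
import math

EPS0 = 8.854e-12  # vacuum permittivity, F/m


def imd_capacitance(pitch, d_tsv, h_imd, eps_r):
    """C_imd = pi * eps * h_imd / acosh(pitch / d_tsv), as in Table A.5."""
    return math.pi * eps_r * EPS0 * h_imd / math.acosh(pitch / d_tsv)


# Hypothetical geometry and material values, for illustration only.
D_TSV = 30e-6   # TSV diameter, m
H_IMD = 1e-6    # IMD layer height, m
EPS_R = 4.0     # relative permittivity of the IMD layer

pitches_um = [60, 70, 120, 130]
caps = {p: imd_capacitance(p * 1e-6, D_TSV, H_IMD, EPS_R) for p in pitches_um}

# Capacitance change for the same 10 um pitch step at small and large pitch.
drop_small_pitch = caps[60] - caps[70]
drop_large_pitch = caps[120] - caps[130]
for p in pitches_um:
    print(f"pitch = {p} um -> C_imd = {caps[p]:.3e} F")
```

Under these assumed values the capacitance falls monotonically with pitch, and the same 10 $\mu$m step changes it far less at large pitch than at small pitch, which matches the trend described above.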

### A.4 Through-Silicon Via Electrical Model Parameter Details and Essential Equations for Calculating Power Consumption

| Parameter Name                           | Description                              |
|------------------------------------------|------------------------------------------|
| $C_{\mathrm{Underfill}}$                 | Underfill capacitance                    |
| $C_{\mathrm{Bump1}}, C_{\mathrm{Bump2}}$ | Bump capacitance                         |
| $C_{\mathrm{Insulator}}$                 | Insulator capacitance                    |
| $C_{\mathrm{Si\ sub}}$                   | Silicon substrate capacitance            |
| $C_{\mathrm{Bottom}}$                    | Bottom Oxide Layer capacitance           |
| $C_{\mathrm{IMD}}$                       | Inter-metal dielectric Layer capacitance |
| $R_{\mathrm{TSV}}$                       | TSV Resistance                           |
| $R_{\mathrm{Bump}}$                      | Bump Resistance                          |
| $L_{\mathrm{TSV}}$                       | TSV Inductance                           |
| $L_{\mathrm{Bump}}$                      | Bump Inductance                          |
| $G_{\mathrm{Si\ sub}}$                   | Silicon substrate conductance            |

Table A.4: Electrical Model Parameters(Figure 4.1(b))

$L_{TSV}+L_{Bump}$ in series with $R_{TSV}+R_{Bump}$, in parallel with $C_{Insulator}+C_{Bump2}$ and $C_{Insulator}+C_{Bump1}$, is the electrical equivalent of the Signal TSV along with its top and bottom bumps. The same equivalence applies to the Ground TSV. $C_{Underfill}+C_{IMD}$ is the equivalent for the Underfill and Inter Metal Dielectric layers at the top of the TSV. $C_{Underfill}+C_{Bottom}$ is the equivalent for the Underfill and Bottom Oxide layers at the bottom of the TSV. $C_{Si \ sub}$ and $G_{Si \ sub}$ are the capacitance and conductance of the silicon substrate respectively.

$$C_{\mathrm{ins}} = \pi \cdot \varepsilon_{\mathrm{ins}} \cdot \frac{h_{\mathrm{TSV}} - h_{\mathrm{imd}}}{\log\left(1 + \frac{2 \cdot t_{\mathrm{ox}}}{d_{\mathrm{TSV}}}\right)}$$
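The two dependences of the insulator capacitance, linear in the IMD Layer Height and logarithmic in the oxide thickness, can be checked numerically. The following is a minimal sketch that assumes the logarithm is the natural logarithm and uses hypothetical geometry values (100 $\mu$m TSV height, 30 $\mu$m diameter, SiO2-like relative permittivity of 3.9); none of these values are taken from the thesis experiments.

```python
import math

EPS0 = 8.854e-12  # vacuum permittivity, F/m


def insulator_capacitance(eps_ins, h_tsv, h_imd, t_ox, d_tsv):
    """C_ins = pi * eps_ins * (h_tsv - h_imd) / log(1 + 2 * t_ox / d_tsv).

    The natural logarithm is assumed here.
    """
    return math.pi * eps_ins * (h_tsv - h_imd) / math.log(1.0 + 2.0 * t_ox / d_tsv)


# Hypothetical values, for illustration only.
EPS_INS = 3.9 * EPS0   # assumed SiO2-like insulator
H_TSV = 100e-6         # TSV height, m
D_TSV = 30e-6          # TSV diameter, m

# Sweep the IMD layer height at fixed oxide thickness (1 um).
c_imd = [insulator_capacitance(EPS_INS, H_TSV, h, 1e-6, D_TSV)
         for h in (1e-6, 2e-6, 3e-6)]

# Sweep the oxide thickness at fixed IMD layer height (1 um).
c_tox = [insulator_capacitance(EPS_INS, H_TSV, 1e-6, t, D_TSV)
         for t in (0.5e-6, 1e-6, 2e-6)]

print("C_ins vs h_imd:", c_imd)
print("C_ins vs t_ox: ", c_tox)
```

Equal steps in IMD Layer Height give equal decrements in $C_{ins}$, while each doubling of the oxide thickness reduces $C_{ins}$ by a progressively different amount governed by the logarithm.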

| Parameter Name           | Expression |
|--------------------------|------------|
| $C_{b1}$                 | $C_{\mathrm{ins}} + C_{\mathrm{Bump1}}$ |
| $C_{b2}$                 | $C_{\mathrm{ins}} + C_{\mathrm{Bump2}}$ |
| $C_1$                    | $\dfrac{C_{b1} \cdot C_{b2}}{C_{b1} + C_{b2}}$ |
| $C_2$                    | $C_{\mathrm{Si\ sub}}$ |
| $C_3$                    | $C_{\mathrm{Underfill}} + C_{\mathrm{Bottom}}$ |
| $C_{\mathrm{Bump1}}$     | $\varepsilon_{\mathrm{imd}} \cdot \pi \cdot \dfrac{\left(\frac{d_{\mathrm{bump}}}{2}\right)^2 - \left(\frac{d_{\mathrm{TSV}}}{2} + t_{\mathrm{ox}}\right)^2}{h_{\mathrm{imd}}}$ |
| $C_{\mathrm{Bump2}}$     | $\varepsilon_{\mathrm{tox}} \cdot \pi \cdot \dfrac{\left(\frac{d_{\mathrm{bump}}}{2}\right)^2 - \left(\frac{d_{\mathrm{TSV}}}{2} + t_{\mathrm{ox\_bot}}\right)^2}{t_{\mathrm{ox\_bot}}}$ |
| $C_{\mathrm{Underfill}}$ | $\dfrac{\pi \cdot \varepsilon_{\mathrm{Underfill}} \cdot h_{\mathrm{bump}}}{\operatorname{acosh}\left(\frac{p_{\mathrm{TSV}}}{d_{\mathrm{bump}}}\right)}$ |
| $C_{\mathrm{IMD}}$       | $\dfrac{\pi \cdot \varepsilon_{\mathrm{imd}} \cdot h_{\mathrm{imd}}}{\operatorname{acosh}\left(\frac{p_{\mathrm{TSV}}}{d_{\mathrm{TSV}}}\right)}$ |
| $C_{\mathrm{Bottom}}$    | $\dfrac{\pi \cdot \varepsilon_{\mathrm{ox\_bot}} \cdot t_{\mathrm{ox\_bot}}}{\operatorname{acosh}\left(\frac{p_{\mathrm{TSV}}}{d_{\mathrm{TSV}}}\right)}$ |
| $G_{\mathrm{Si\ sub}}$   | $\pi \cdot \sigma_{\mathrm{Si}} \cdot \dfrac{h_{\mathrm{TSV}} - h_{\mathrm{imd}}}{\operatorname{acosh}\left(\frac{p_{\mathrm{TSV}}}{d_{\mathrm{TSV}}}\right)}$ |
| $C_{\mathrm{Si\ sub}}$   | $\left(\dfrac{\varepsilon_{\mathrm{Si}}}{\sigma_{\mathrm{Si}}}\right) \cdot G_{\mathrm{Si\ sub}}$ |

 Table A.5: Reduced Model Parameters
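The reduced parameters $C_{b1}$, $C_{b2}$ and $C_1$ can be evaluated as sketched below. The capacitance values are hypothetical placeholders. The final step uses the standard switching-power expression $P = AF \cdot C \cdot V^2 \cdot f$ as an assumption for illustration; it is not necessarily the exact power equation used in this work, and how $C_1$, $C_2$ and $C_3$ combine into a total capacitance is deliberately left out. The operating point (2.5 GHz, 1.1 V, AF = 0.15) is the one quoted in the captions of Figures A.4 to A.6.

```python
def series(c_a, c_b):
    """Equivalent capacitance of two capacitors in series."""
    return c_a * c_b / (c_a + c_b)


def reduced_c1(c_ins, c_bump1, c_bump2):
    """C_1 of Table A.5: C_b1 and C_b2 combined in series."""
    c_b1 = c_ins + c_bump1
    c_b2 = c_ins + c_bump2
    return series(c_b1, c_b2)


def switching_power(c_total, voltage, frequency, activity_factor):
    """Assumed standard expression: P = AF * C * V^2 * f (watts)."""
    return activity_factor * c_total * voltage ** 2 * frequency


# Hypothetical capacitances (farads), for illustration only.
C_INS, C_BUMP1, C_BUMP2 = 40e-15, 5e-15, 8e-15
c1 = reduced_c1(C_INS, C_BUMP1, C_BUMP2)

# Operating point from the figure captions; 100 fF is a placeholder load.
power = switching_power(c_total=100e-15, voltage=1.1,
                        frequency=2.5e9, activity_factor=0.15)
print(f"C_1 = {c1:.3e} F, P = {power:.3e} W")
```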

# Bibliography

- Agarwal, N., Krishna, T., Peh, L., and Jha, N. K. (2009). GARNET: A detailed on-chip network model inside a full-system simulator. In *IEEE International Sym*posium on Performance Analysis of Systems and Software, ISPASS 2009, April 26-28, 2009, Boston, Massachusetts, USA, Proceedings, pages 33–42.
- Ahmed, M. A., Mohapatra, S., and Chrzanowska-Jeske, M. (2016). Tsv- and delayaware 3d-ic floorplanning. Analog Integr. Circuits Signal Process., 87(2):235–248.
- Balfour, J. D. and Dally, W. J. (2006). Design tradeoffs for tiled CMP on-chip networks. In Proceedings of the 20th Annual International Conference on Supercomputing, ICS 2006, Cairns, Queensland, Australia, June 28 - July 01, 2006, pages 187–198.
- Bamberg, L. and Garcia-Ortiz, A. (2017). High-level energy estimation for submicrometric tsv arrays. *IEEE Transactions on Very Large Scale Integration (VLSI)* Systems, 25(10):2856–2866.
- Catania, V., Mineo, A., Monteleone, S., Palesi, M., and Patti, D. (2016). Cycleaccurate network on chip simulation with noxim. ACM Trans. Model. Comput. Simul., 27(1):4:1–4:25.
- Dally, W. and Towles, B. (2004). Principles and Practices of Interconnection Networks. Morgan kaufmann, San Francisco.
- Dally, W. J. and Towles, B. (2001). Route packets, not wires: On-chip interconnection networks. In *Design Automation Conference*, 2001. Proceedings, pages 684–689. IEEE.
- Debora, M., Max, P., Marcio, K., Luigi, C., and Altamiro, S. (2015). Performance

evaluation of hierarchical NoC topologies for stacked 3D ICs. *Proc. - IEEE Int.* Symp. Circuits Syst., 2015-July:1961–1964.

- Dimitrakopoulos, G., Psarras, A., and Seitanidis, I. (2015). Microarchitecture of Network-on-chip Routers, volume 1025. Springer.
- Feero, B. S. and Pande, P. P. (2009). Networks-on-chip in a three-dimensional environment: A performance evaluation. *IEEE Transactions on Computers*, 58(1):32–45.
- Fourmigue, A., Beltrame, G., and Nicolescu, G. (2014). Efficient transient thermal simulation of 3d ics with liquid-cooling and through silicon vias. In 2014 Design, Automation Test in Europe Conference Exhibition (DATE), pages 1–6.
- Grecu, C., Pande, P. P., Ivanov, A., and Saleh, R. (2004). A scalable communicationcentric soc interconnect architecture. In *International Symposium on Signals, Cir*cuits and Systems. Proc., SCS 2003. (Cat. No.03EX720), pages 343–348.
- Grot, B., Hestness, J., Keckler, S. W., and Mutlu, O. (2009). Express cube topologies for on-chip interconnects. In 2009 IEEE 15th International Symposium on High Performance Computer Architecture, pages 163–174.
- Henning, J. L. (2000). Spec cpu2000: Measuring cpu performance in the new millennium. Computer, 33(7):28–35.
- Huang, W., Skadron, K., Gurumurthi, S., Ribando, R. J., and Stan, M. R. (2009). Differentiating the roles of ir measurement and simulation for power and temperatureaware design. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software, pages 1–10.
- Jabbar, M. H., Houzet, D., and Hammami, O. (2013). Impact of 3d ic on noc topologies: A wire delay consideration. In 2013 Euromicro Conference on Digital System Design, pages 68–72.
- Jain, L., Al-Hashimi, B., Gaur, M., and et al. (2007). Nirgam: a simulator for noc interconnect routing and application modeling. Workshop on Diagnostic Services in Network-on-Chips, DATE, 2007,, (2):16–20.

- Jiang, N., Becker, D. U., Michelogiannakis, G., Balfour, J., Towles, B., Shaw, D. E., Kim, J.-H., and Dally, W. J. (2013a). A detailed and flexible cycle-accurate networkon-chip simulator. In *Performance Analysis of Systems and Software (ISPASS)*, 2013 IEEE International Symposium on, pages 86–96. IEEE.
- Jiang, N. et al. (2013b). A detailed and flexible cycle-accurate network-on-chip simulator. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 86–96.
- Joyner, J. W., Zarkesh-Ha, P., and Meindl, J. D. (2001). A stochastic global net-length distribution for a three-dimensional system-on-a-chip (3d-soc). In ASIC/SOC Conference, 2001. Proceedings. 14th Annual IEEE International, pages 147–151. IEEE.
- Jueping, C., Peng, J., Lei, Y., Yue, H., and Zan, L. (2010). Through-silicon via (tsv) capacitance modeling for 3d noc energy consumption estimation. In 2010 10th IEEE International Conference on Solid-State and Integrated Circuit Technology.
- Kahng, A. B., Li, B., Peh, L.-S., and Samadi, K. (2009). Orion 2.0: a fast and accurate noc power and area model for early-stage design space exploration. In *Proceedings of the conference on Design, Automation and Test in Europe*, pages 423–428. European Design and Automation Association.
- Khalil, D., Ismail, Y., Khellah, M., Karnik, T., and De, V. (2008). Analytical model for the propagation delay of through silicon vias. In 9th International Symposium on Quality Electronic Design (isqed 2008), pages 553–556.
- Kim, D. H., Athikulwongse, K., and Lim, S. K. (2009). A study of through-siliconvia impact on the 3d stacked ic layout. In *Proceedings of the 2009 International Conference on Computer-Aided Design*, pages 674–680. ACM.
- Kim, D. H. and Lim, S. K. (2010). Through-silicon-via-aware delay and power prediction model for buffered interconnects in 3d ics. In *Proceedings of the 12th* ACM/IEEE International Workshop on System Level Interconnect Prediction, SLIP '10, pages 25–32, New York, NY, USA. ACM.
- Kim, J., Cho, J., Pak, J. S., Song, T., Kim, J., Lee, H., Lee, J., and Park, K. (2010). I/o power estimation and analysis of high-speed channels in through-silicon via

(tsv)-based 3d ic. In 19th Topical Meeting on Electrical Performance of Electronic Packaging and Systems, pages 41–44.

- Kim, J., Pak, J. S., Cho, J., Song, E., Cho, J., Kim, H., Song, T., Lee, J., Lee, H., Park, K., Yang, S., Suh, M. S., Byun, K. Y., and Kim, J. (2011). Highfrequency scalable electrical model and analysis of a through silicon via (tsv). *IEEE Transactions on Components, Packaging and Manufacturing Technology*, 1(2):181– 195.
- Kinoshita, T., Kawakami, T., Sugiura, T., Matsumoto, K., Kohara, S., and Orii, Y. (2015). Thermal stress simulation for 3d sip with tsv structure under unsteady thermal loads. In *International Electronic Packaging Technical Conference and Exhibition*, volume 2.
- Kumar, P., Pan, Y., Kim, J., Memik, G., and Choudhary, A. (2009). Exploring concentration and channel slicing in on-chip network router. In *Proceedings - 2009* 3rd ACM/IEEE International Symposium on Networks-on-Chip, NoCS 2009, pages 276–285.
- Kumar, S., Jantsch, A., Soininen, J. P., Forsell, M., Millberg, M., Oberg, J., Tiensyrja,
  K., and Hemani, A. (2002). A network on chip architecture and design methodology.
  In Proc. IEEE Computer Society Annual Symp. on VLSI. ISVLSI 2002, pages 105–112.
- Lee, M., Pak, J. S., and Kim, J. (2014). Electrical Design of Through Silicon Via. Springer Publishing Company, Incorporated.
- Lu, B., Hou, L., Fu, J., and Wang, J. (2014). Simplified empirical formula on tsv thermal analysis for 3d ic eda. In 2014 12th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), pages 1–3.
- Ogras, U. Y. and Marculescu, R. (2006). "it's a small world after all": Noc performance optimization via long-range link insertion. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 14(7):693–706.

- Pande, P. P., Grecu, C., Jones, M., Ivanov, A., and Saleh, R. (2005). Performance evaluation and design trade-offs for network-on-chip interconnect architectures. *IEEE Transactions on Computers*, 54(8):1025–1040.
- Papamichael, M. K. (2011). Fast scalable fpga-based network-on-chip simulation models. In Ninth ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMPCODE2011), pages 77–82.
- Pasricha, S. and Dutt, N. (2008). Chapter 12 Networks-On-Chip. In Pasricha, S. and Dutt, N., editors, On-Chip Communication Architectures, Systems on Silicon, pages 439–471. Morgan Kaufmann, Burlington.
- Pavlidis, V. F. and Friedman, E. G. (2007). 3d topologies for networks-on-chip. IEEE Trans. Very Large Scale Integr. Syst., 15(10):1081–1090.
- Psarras, A., Seitanidis, I., Nicopoulos, C., and Dimitrakopoulos, G. (2016). Shortpath: A network-on-chip router with fine-grained pipeline bypassing. *IEEE Transactions* on Computers, 65(10):3136–3147.
- Psathakis, A., Papaefstathiou, V., Chrysos, N., Chaix, F., Vasilakis, E., Pnevmatikatos, D., and Katevenis, M. (2015). A systematic evaluation of emerging mesh-like CMP NoCs. In Architectures for Networking and Communications Systems (ANCS), 2015 ACM/IEEE Symposium on, pages 159–170. IEEE.
- Qian, Y., Lu, Z., and Dou, W. (2009). From 2d to 3d nocs: A case study on worst-case communication performance. In *Proceedings of the 2009 International Conference* on Computer-Aided Design, ICCAD '09, pages 555–562, New York, NY, USA. ACM.
- Rahmani, A. M., Liljeberg, P., Plosila, J., and Tenhunen, H. (2011). Lastz: An ultra optimized 3d networks-on-chip architecture. In 2011 14th Euromicro Conference on Digital System Design, pages 173–180.
- Skadron, K., Stan, M. R., Huang, W., Velusamy, S., Sankaranarayanan, K., and Tarjan, D. (2003). Temperature-aware microarchitecture. In 30th Annual International Symposium on Computer Architecture, 2003. Proceedings., pages 2–13.

- Sridhar, A., Vincenzi, A., Atienza, D., and Brunschwiler, T. (2014). 3d-ice: A compact thermal model for early-stage design of liquid-cooled ics. *IEEE Transactions on Computers*, 63(10):2576–2589.
- Sun, C., Chen, C. H. O., Kurian, G., Wei, L., Miller, J., Agarwal, A., Peh, L. S., and Stojanovic, V. (2012). Dsent - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip, pages 201–210.
- Swarup, S., Tan, S. X. ., and Liu, Z. (2012). Thermal characterization of tsv based 3d stacked ics. In 2012 IEEE 21st Conference on Electrical Performance of Electronic Packaging and Systems, pages 335–338.
- Tain, R.-M., Dai, M.-J., Chao, Y.-L., Li, S.-L., Chien, H.-C., Wu, S.-T., Li, W., and Lo, W.-C. (2012). Thermal performance of 3d ic package with embedded tsvs. *Transactions of The Japan Institute of Electronics Packaging*, 5(1):75–84.
- Tran, A. T. and Baas, B. (2012). Noctweak: a highly parameterizable simulator for early exploration of performance and energy of networks on-chip. Technical Report ECE-VCL-2012-2, VLSI Computation Lab, ECE Department, University of California, Davis.
- Weerasekera, R., Grange, M., Pamunuwa, D., Tenhunen, H., and Zheng, L. (2009). Compact modelling of through-silicon vias (tsvs) in three-dimensional (3-d) integrated circuits. In 2009 IEEE International Conference on 3D System Integration, pages 1–8.
- Xu, T. C., Liljeberg, P., and Tenhunen, H. (2012). Euro-Par 2011: Parallel Processing Workshops: CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29 – September 2, 2011, Revised Selected Papers, Part I, chapter A Greedy Heuristic Approximation Scheduling Algorithm for 3D Multicore Processors, pages 281–291. Springer Berlin Heidelberg, Berlin, Heidelberg.
- Yaghini, P. M., Eghbal, A., Yazdi, S. S., Bagherzadeh, N., and Green, M. M. (2016).

Capacitive and inductive tsv-to-tsv resilient approaches for 3d ics. *IEEE Transactions on Computers*, 65(3):693–705.

- You, J., Huang, S., Lin, Y., Tsai, M., Kwai, D., Chou, Y., and Wu, C. (2013). In-situ method for tsv delay testing and characterization using input sensitivity analysis. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 21(3):443–453.

### List of Publications

### Journal Publications

- Bheemappa Halavar, Ujjwal Pasupulety, Basavaraj Talawar, Extending BookSim2.0 and HotSpot6.0 for power, performance and thermal evaluation of 3D NoC architectures, Simulation Modelling Practice and Theory, Volume 96, 2019, Elsevier, https://doi.org/10.1016/j.simpat.2019.101929.
- Bheemappa Halavar and Basavaraj Talawar, Power and Performance Analysis of TSV based 3D Network on Chip Architectures, Computers & Electrical Engineering, Volume 83, May 2020, Elsevier, https://doi.org/10.1016/j.compeleceng.2020.106592.
- Bheemappa Halavar and Basavaraj Talawar, Area, Power and Performance Analysis of Optimal 3D BFT NoC Architecture, Journal of Circuits, Systems and Computers, World Scientific [under review].

### Conference Publications

- Bheemappa Halavar and Basavaraj Talawar, Accurate Performance Analysis of 3D Mesh Network on Chip Architectures, 2018 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), March-2018, art. no. 8724488, pp. 282-286.
- Bheemappa Halavar and Basavaraj Talawar, Floorplan Based Performance Evaluation of 3D Variants of Mesh and BFT Networks-on-Chip, International Conference on Signal Processing and Communication (SPCOM), July-2018.
- Bheemappa Halavar and Basavaraj Talawar, OP3DBFT: A Power and Performance Optimal 3D BFT NoC Architecture, 18th International Conference on Intelligent Systems Design and Applications (ISDA), India, Dec-2018.
- Ujjwal Pasupulety, Bheemappa Halavar, Basavaraj Talawar, Accurate Power and Latency Analysis of a Through-Silicon Via (TSV), 7th International Conference on Advances in Computing, Communications and Informatics (ICACCI), Sep-2018.

- Ujjwal Pasupulety, Bheemappa Halavar, Basavaraj Talawar, Thermal Aware Design for Through-Silicon Via (TSV) based 3D Network-on-Chip (NoC) Architectures, 8th Int'l Symp. on Embedded Computing & System Design (ISED), Dec-2018.

# Brief Bio-Data

### Personal Details

- Name: Bheemappa Halavar
- Date of Birth: 30 Sep 1989

### Work Address

Bheemappa Halavar, Research Scholar, Department of CSE, NITK Surathkal, Mangalore, Karnataka, 575 025. Email: bheemhh@gmail.com

### Permanent Address

Bheemappa Halavar, Mirajkar Building (V), Hubli, Dharawad (Dist.), Karnataka 580020. Phone No: +91 (953) 853 9429

### Qualification

- M.Tech in Computer Network and Engineering from VTU University, Belgavi, Karnataka, India (2011-2013)
- B.E in Information Science and Engineering from VTU University, Belgavi, Karnataka, India (2007-2011)

### Previous Work Experience

1. Assistant Professor in SECAB I E T Bijapur, Karnataka from Sep-2013 to Dec-2014