

# **Maximizing Thread-Level Parallelism on GPUs**

Yoon, Myung Kuk (윤명국), Department of Computer Science and Engineering, Ewha Womans University



# **Purpose of GPUs**



# **History of GPUs**



- Improving image quality
- Programmable stages
- General-purpose applications



# **New Era of Big Data and AI**



# **GPU Architecture**



[Figure: CUDA cores and FP32 TFLOPS]


## **High Thread Level Parallelism (TLP)**

- Warp (NVIDIA) / wavefront (AMD): a group of 32 / 64 threads from the same thread block that execute together
- Advantage of high TLP: while one warp is stalled, the scheduler issues instructions from other warps, hiding the stall
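The latency-hiding effect can be illustrated with a toy model. This is only a sketch under simplifying assumptions that are not from the slides: one instruction issued per cycle, every instruction followed by a fixed stall, and a round-robin scheduler. The function name `issue_utilization` and its parameters are hypothetical.

```python
def issue_utilization(num_warps, stall_cycles, total_cycles=10_000):
    """Fraction of cycles in which the scheduler issues an instruction.

    Toy model (assumption): each warp issues one instruction and is then
    stalled for `stall_cycles` cycles before it can issue again.
    """
    ready_at = [0] * num_warps  # first cycle at which each warp may issue
    issued = 0
    next_warp = 0               # round-robin starting point
    for cycle in range(total_cycles):
        for offset in range(num_warps):
            w = (next_warp + offset) % num_warps
            if ready_at[w] <= cycle:
                # Issue from this warp; it then stalls while others run.
                ready_at[w] = cycle + 1 + stall_cycles
                issued += 1
                next_warp = (w + 1) % num_warps
                break
    return issued / total_cycles


if __name__ == "__main__":
    for warps in (1, 5, 10, 20):
        print(warps, "warps ->", issue_utilization(warps, stall_cycles=9))
```

In this model a single warp keeps the issue slot busy only one cycle in ten, while ten or more warps keep it busy every cycle. Real schedulers and stall patterns are far less regular.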







## **TLP Limiting Factors**

- How is the number of threads (CTAs, cooperative thread arrays) assigned to an SM (streaming multiprocessor) decided?
  - Scheduling limit: the maximum number of threads the SM can schedule
  - Capacity limit: register file size and shared memory size
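The two kinds of limit can be combined in a small calculation: the number of CTAs resident on an SM is the minimum over all limits. The sketch below is illustrative only. The `SMLimits` fields and the example numbers are assumptions rather than the parameters of any specific GPU, and real hardware also rounds allocations to fixed granularities, which is ignored here.

```python
from dataclasses import dataclass


@dataclass
class SMLimits:
    max_threads: int       # scheduling limit: resident threads per SM
    max_ctas: int          # scheduling limit: CTA slots per SM
    registers: int         # capacity limit: registers per SM
    shared_mem_bytes: int  # capacity limit: shared memory per SM


def resident_ctas(sm, threads_per_cta, regs_per_thread, smem_per_cta):
    """Return (CTAs per SM, name of the binding limit)."""
    limits = {
        "scheduling (threads)": sm.max_threads // threads_per_cta,
        "scheduling (CTA slots)": sm.max_ctas,
        "capacity (registers)": sm.registers // (regs_per_thread * threads_per_cta),
    }
    if smem_per_cta > 0:
        limits["capacity (shared memory)"] = sm.shared_mem_bytes // smem_per_cta
    binding = min(limits, key=limits.get)  # the tightest limit wins
    return limits[binding], binding


if __name__ == "__main__":
    sm = SMLimits(max_threads=2048, max_ctas=32,
                  registers=65536, shared_mem_bytes=98304)  # illustrative values
    print(resident_ctas(sm, 256, 64, 0))     # register-heavy kernel
    print(resident_ctas(sm, 256, 16, 4096))  # light kernel
```

With these assumed numbers, the register-heavy kernel is bounded by the capacity limit (4 CTAs), while the light kernel is bounded by the scheduling limit (8 CTAs) and leaves registers and shared memory unused.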







[Figure: scheduling limit]





## **No Scheduling & Capacity Limits**

- What if GPUs had no scheduling or capacity limits?
  - More warps would be available to hide the stalls of any one warp

[Figure: performance improvement]

- However, adding scheduling resources and enlarging the register file increases hardware complexity and cost





## **Virtual Thread Architecture**<sup>6</sup>

- Goal: dispatch more threads to the GPU, filling the register file and shared memory, without raising the scheduling limit
- Active CTAs: CTAs that can issue instructions
  - Up to the scheduling limit
- Inactive CTAs: CTAs that cannot issue instructions
  - Up to the capacity limit
- CTA scheduling (context) information
  - Warp ID (virtual warp ID)
  - SIMT stack (PC, reconvergence PC (RPC), and active mask)
  - CTA identifiers
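The split between active and inactive CTAs, together with the per-CTA context listed above, can be sketched as plain data structures. This is a minimal illustration, not the hardware design from the paper: the class names, the `dispatch_ctas` helper, the initial SIMT stack entry, and the use of `-1` for "no reconvergence point" are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SIMTStackEntry:
    pc: int           # program counter
    rpc: int          # reconvergence PC (-1: none, an assumption of this sketch)
    active_mask: int  # one bit per thread in the warp


@dataclass
class WarpContext:
    virtual_warp_id: int
    simt_stack: list = field(default_factory=list)


@dataclass
class CTAContext:
    cta_id: int
    warps: list = field(default_factory=list)
    active: bool = False  # True: issuable; False: resident but not issuable


def dispatch_ctas(num_ctas, scheduling_limit, capacity_limit, warps_per_cta):
    """Dispatch CTAs up to the capacity limit; mark the first ones active."""
    resident = min(num_ctas, capacity_limit)
    num_active = min(resident, scheduling_limit)
    ctas = []
    for cta_id in range(resident):
        warps = [
            WarpContext(
                virtual_warp_id=cta_id * warps_per_cta + w,
                simt_stack=[SIMTStackEntry(pc=0, rpc=-1, active_mask=0xFFFFFFFF)],
            )
            for w in range(warps_per_cta)
        ]
        ctas.append(CTAContext(cta_id, warps, active=cta_id < num_active))
    return ctas


if __name__ == "__main__":
    ctas = dispatch_ctas(num_ctas=20, scheduling_limit=8,
                         capacity_limit=12, warps_per_cta=4)
    print(sum(c.active for c in ctas), "active,",
          sum(not c.active for c in ctas), "inactive")
```

With a scheduling limit of 8 and a capacity limit of 12, the sketch keeps 8 CTAs active and 4 inactive. A baseline GPU would dispatch only the 8.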







## **FineReg Management**<sup>7</sup>

- Goal: dispatch more threads to the GPU by keeping only the live registers, a small portion of the total, for pending CTAs

[Figure: ACRF and PCRF organization]

- ACRF: Active CTA Register File
  - Same as the original GPU register file
  - Keeps all registers of active CTAs
- PCRF: Pending CTA Register File
  - Backup register storage for CTAs
  - Keeps only the live registers of pending CTAs



- Register Liveness
  - The compiler generates the list of live registers for every instruction
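Per-instruction liveness can be computed with a standard backward pass. The sketch below is not the compiler pass used by FineReg. It handles only straight-line code (no branches), assumes no register is live after the last instruction, and uses a hypothetical `(dest, sources)` instruction encoding.

```python
def live_registers(instructions):
    """Return, for each instruction, the set of registers live just before it.

    instructions: list of (dest, sources) tuples; dest is None for
    instructions that write no register (for example, a store).
    """
    live = set()  # assumption: nothing is live after the last instruction
    live_before = [None] * len(instructions)
    for i in range(len(instructions) - 1, -1, -1):
        dest, sources = instructions[i]
        if dest is not None:
            live.discard(dest)  # a definition ends the previous value's life
        live.update(sources)    # a use makes the register live
        live_before[i] = set(live)
    return live_before


if __name__ == "__main__":
    program = [
        ("r0", ["r5"]),        # r0 = load [r5]
        ("r1", ["r6"]),        # r1 = load [r6]
        ("r2", ["r0", "r1"]),  # r2 = r0 + r1
        ("r3", ["r2"]),        # r3 = r2 * r2
        (None, ["r7", "r3"]),  # store [r7] = r3
    ]
    for i, regs in enumerate(live_registers(program)):
        print(i, sorted(regs))
```

In this example only `r2` and `r7` are live before instruction 3, so a CTA that becomes pending at that point would need to keep two of its eight registers.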





## **Evaluation**

- Impact on Thread-Level Parallelism
  - FineReg: 2.42x more threads than the baseline
- Performance Impact
  - FineReg: 32.8% IPC (Instructions Per Cycle) improvement









## **Conclusion**

#### Advantage of high TLP

- Stalls in one warp are hidden by executing other warps

#### TLP Limiting Factors

- Scheduling limit: thread counts
- Capacity limit: register file size and shared memory size

#### Virtual Thread Architecture<sup>6</sup>

- Dispatches more threads to the GPU, filling the register file and shared memory, without raising the scheduling limit

#### FineReg Management<sup>7</sup>

- Dispatches more threads to the GPU by keeping only the live registers, a small portion of the total, for pending CTAs









## Thank You!

- Yoon, Myung Kuk (윤명국)
- Department of Computer Science and Engineering
- E-Mail: myungkuk.yoon@ewha.ac.kr
- Homepage: http://ip-cal.ewha.ac.kr

#### References

1. Old Super Mario, https://www.youtube.com/watch?v=GlwX5q_Y1-0
2. 3D Super Mario, https://www.bbc.com/news/technology-53402067
3. "NVIDIA Tensor Cores," https://www.nvidia.com/en-us/data-center/tensor-cores/
4. The Digitization of the World From Edge to Core, https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf
5. Up to Speed on Deep Learning in Medical Imaging, https://medium.com/the-mission/up-to-speed-on-deep-learning-in-medical-imaging-7ff1e91f6d71
6. Yoon et al., Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit, ISCA 2016
7. Oh et al., FineReg: Fine-Grained Register File Management for Augmenting GPU Throughput, MICRO 2018



