Methods to Improve GPGPU Multi-thread Computing
Abhishek, CS41004

I. ABSTRACT
Modern GPGPUs support the concurrent execution of hundreds or thousands of threads. However, this massive multi-threading often results in serious cache contention, as the cache lines brought in by one thread can easily be evicted by other threads in the small shared cache. The many threads compete for the small first-level data (L1D) cache, which leads to severe cache thrashing and throttles GPGPU performance. The first reference paper, 'Improving GPGPU Performance via Cache Locality Aware Thread Block Scheduling' by Li-Jhan Chen, Hsiang-Yun Cheng, Po-Han Wang, and Chia-Lin Yang, presents a software-hardware cooperative approach that makes use of the spatial locality among different thread blocks to better utilize the valuable cache capacity. The second, 'Applying Victim Cache in High Performance GPGPU Computing' by Fengfeng Fan, Jianfei Wang, Li Jiang, Xiaoyao Liang, and Naifeng Jing, applies a victim cache design to GPGPUs to reduce the L1D cache thrashing problem and improve data locality and system performance.
II. INTRODUCTION
General Purpose Graphics Processing Units (GPGPUs) are becoming increasingly popular for various applications, such as physical simulation, image processing, and cloud computing, due to their great computing capability. Modern GPGPUs allow thousands of threads to be executed simultaneously. With vast multi-threading capabilities, GPGPUs can offer better performance and power efficiency than CPUs. To execute an application on a GPGPU device, the application is divided into kernels that are executed by a large number of concurrent threads. These threads are grouped into cooperative thread arrays (CTAs), which are the basic units for resource allocation.
After the CTAs are dispatched onto streaming multiprocessors (SMs), every 32 threads within a CTA are bundled together into a warp, the basic unit for execution. At runtime, multiple warps are often available for scheduling. The warp schedulers are responsible for picking the warp most suitable for execution, in order to hide the latency caused by data dependencies and long memory accesses. The high thread-level parallelism (TLP) of GPGPU applications enables faster computation than conventional CPUs. Despite the high theoretically achievable thread-level parallelism, GPGPU cores suffer from serious cache contention. In each streaming multiprocessor (SM) of a modern GPGPU, thousands of threads share the same small L1 cache, resulting in an extremely small per-thread cache capacity. For example, the per-thread L1 cache capacity in NVIDIA's Fermi architecture is only about 32 bytes.
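To make the CTA and warp terminology concrete, the following is a minimal CUDA sketch (a hypothetical element-wise kernel, not taken from either paper): the launch configuration groups threads into CTAs, and the hardware further splits each 256-thread CTA into eight 32-thread warps.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one array element.
__global__ void scale(float *data, float alpha, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (tid < n)
        data[tid] *= alpha;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 256 threads per CTA (thread block); each CTA is split by the
    // hardware into warps of 32 threads, i.e. 8 warps per CTA here.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```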
Thus, with such a limited cache size, data in the cache can easily be evicted before being reused, leading to serious performance loss. The mapping between thread blocks (TBs) and SMs determines how much locality can be exploited inside each SM, while the warp scheduler within a single SM can only leverage the locality distributed to it by the thread block scheduler. With its larger design space, a locality-aware thread block scheduler therefore provides tremendous opportunities to increase the cache hit rate and improve performance. A naive heuristic assumes that scheduling each pair of consecutive TBs to the same SM is enough to exploit spatial locality.
However, this solution only works for row-major applications. The first paper describes a generic thread block scheduler that covers different types of access patterns. The key challenge is how to accurately quantify block-level spatial locality. To tackle this challenge, a software-hardware cooperative approach is adopted to capture the number of cache lines shared between different TBs. By dispatching TBs with spatial locality to the same SM, the cache hit rate of each SM can be improved. A possible solution explained in the second paper is to change the traditional victim cache structure to fit the massive-thread execution model of GPGPUs. Then, given the limited area and power budget of GPGPUs, the unallocated registers are further leveraged to meet the storage requirement of the victim cache.
The unallocated registers can be identified at compile time and are generally sufficient for this purpose. The experimental results demonstrate that the victim cache can greatly reduce conflict misses and that the data cache hit rate can be increased substantially. The proposed victim cache design achieves an impressive performance improvement with relatively small hardware cost.
III. LITERATURE SURVEY

Cache Locality Aware Thread Block Scheduling
The first paper presents a software-hardware cooperative approach to increase the cache hit rate and improve performance by exploiting the spatial locality between different TBs. The compiler is modified to extract the address calculation code from kernel programs and generate a separate address calculation binary. The small, in-order CPU in a modern GPU's thread block scheduler is used to run the locality-aware block scheduling algorithm, which is composed of two modules. The address range calculation module analyzes the memory access range of each TB when it is inserted into the block queue.
To maximize the inter-block spatial locality on each SM, the thread block dispatching module selects the candidate TB to be scheduled next, and the selection process runs concurrently with thread block execution on the SMs to hide the timing overhead. To capture the working set of each TB, the compiler is modified to extract the address calculation code, so that the thread block scheduler can compute the memory access range from the address calculation binary. A GPGPU program is made up of one or more kernels, and each kernel is an array of threads that run the same program code on different data.
The mapping between thread IDs and data can be obtained through simple arithmetic, since in regular GPGPU programs threads often operate on structured data such as 1D or 2D arrays. The compiler can easily extract the address calculation code, i.e., the code that computes the index of the data array (lines 4-7 of the kernel listing in the paper), from a kernel function, and use this code together with the base address of the data array to generate the address calculation binary. At run-time, the thread block scheduler can use the binary and a thread ID to compute the memory addresses accessed by an arbitrary thread.
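As an illustration only (the kernel and the host-side helper below are hypothetical, not the authors' actual compiler pass), the index computation in a 2D kernel is exactly the kind of code that would be extracted, and the extracted logic can be re-run on the scheduler's CPU to predict which address a given thread will touch:

```cuda
#include <cstdint>

// Hypothetical 2D kernel: the three lines computing col, row and idx are
// the "address calculation code" that the modified compiler would extract.
__global__ void addOne(float *A, int width) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = row * width + col;      // index into the data array A
    A[idx] += 1.0f;
}

// Host-side analogue of the address calculation binary: given a block and
// thread coordinate plus the base address of A, it returns the byte
// address that thread will access, without executing the kernel.
uintptr_t threadAddress(uintptr_t baseA, int width,
                        int bx, int by, int tx, int ty,
                        int blockDimX, int blockDimY) {
    int col = bx * blockDimX + tx;
    int row = by * blockDimY + ty;
    int idx = row * width + col;
    return baseA + (uintptr_t)idx * sizeof(float);
}
```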
The address of the start point, i.e., the upper-left address, can be computed from the memory address of the first thread in the block, and the width/height can be computed from the address difference between the first and the last thread in the block. These access range parameters, including the start point, width, and height, are stored in the block queue when the block is inserted. To compute the inter-block locality, the access range rectangles are represented in cache-line coordinates: the memory addresses of the data arrays are transformed into the corresponding cache line addresses.
For example, if a cache line is 128 bytes, memory addresses from (0,0) to (127,0) belong to cache line (0,0), and memory addresses (128,1) to (255,1) belong to cache line (1,1). Through this address transformation, the start point, width, and height of each TB can be represented in cache-line coordinates.
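A small sketch of this transformation, assuming 128-byte cache lines and the (x, y) addressing used in the example above (the struct layout and field names are illustrative, not the paper's exact encoding):

```cuda
#include <cstdint>

// Access range of one TB for one data array, kept in the block queue entry,
// expressed in cache-line coordinates.
struct AccessRange {
    int startX, startY;   // cache-line coordinate of the upper-left address
    int width, height;    // extent of the access rectangle, in cache lines
};

// Map an (x, y) byte address of a 2D data array to its cache-line
// coordinate, assuming 128-byte cache lines: bytes 0-127 of a row fall in
// line x = 0, bytes 128-255 in line x = 1, and so on.
constexpr int LINE_BYTES = 128;

void toLineCoord(int byteX, int rowY, int &lineX, int &lineY) {
    lineX = byteX / LINE_BYTES;
    lineY = rowY;
}
```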
Once a TB completes on an SM, the thread block scheduler allocates the candidate TB to that SM. The selection process for the next candidate TB is then triggered and runs concurrently with TB execution to hide the timing overhead. The candidate TB is selected based on the overall inter-block locality (Lall), i.e., the sum of the inter-block locality between the candidate TB and all TBs currently running on the SM. The inter-block locality of any two TBs (Lpair) is defined as the sum of their overlapped data access ranges over all data arrays. For each data array, the overlapped area is computed by the following steps (see the sketch after this list):
1) distance_x and distance_y are the differences along the x-axis and y-axis between the start points of the two TBs.
2) If distance_x > width or distance_y > height, there is no overlapped area and hence no spatial locality between the two TBs.
3) Otherwise, the overlapped area is (width - distance_x) × (height - distance_y), which equals the number of cache lines shared by the two TBs.
Based on this estimate of inter-block locality, the TB with the maximum Lall is selected. If no TB in the queue has inter-block locality on this SM, the TB that would have the minimum Lall if scheduled on the other SMs is selected, so that degradation of inter-block locality on those SMs is avoided. The locality-aware TB scheduler only incurs a small hardware and timing overhead.
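The sketch below implements the three steps and the Lall summation on the host side, assuming that all TBs of a kernel have access rectangles of the same width and height (as in the formulation above); the function and struct names are illustrative:

```cuda
#include <cstdlib>
#include <vector>

// Access rectangle of a TB for one data array, in cache-line coordinates.
struct Range { int x, y, width, height; };

// Steps 1-3: number of cache lines shared by two TBs for one data array
// (the Lpair contribution of this array); rectangles are assumed to have
// the same width and height.
int sharedLines(const Range &a, const Range &b) {
    int dx = std::abs(a.x - b.x);              // distance_x
    int dy = std::abs(a.y - b.y);              // distance_y
    if (dx > a.width || dy > a.height)         // no overlap: no locality
        return 0;
    return (a.width - dx) * (a.height - dy);   // overlapped cache lines
}

// L_all of a candidate TB on one SM: sum of its pairwise locality with
// every TB currently running on that SM (one data array per TB here).
int overallLocality(const Range &candidate,
                    const std::vector<Range> &runningOnSM) {
    int lAll = 0;
    for (const Range &r : runningOnSM)
        lAll += sharedLines(candidate, r);
    return lAll;
}
```

The scheduler would then pick the queued TB with the maximum L_all for this SM, falling back to the TB with the minimum L_all on the other SMs when no queued TB shows locality here.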
To avoid duplicate calculation of the memory access range, 4 bytes per block queue entry are added to store the related parameters. In the TB scheduler, one register per SM is added to store the candidate TB, so that the candidate TB can be dispatched immediately once a TB completes on that SM. Two methods are used to reduce the timing overhead of selecting candidate TBs.
First, during the initial phase, when fewer than 8 TBs are executing on an SM, consecutive TBs are grouped together and dispatched directly to the same SM. After the initial phase, the TB selection process is executed concurrently with TB execution to hide the timing overhead.

Victim Cache
A victim cache with a structure equivalent to the L1D cache is effective at alleviating the conflict misses caused by the massive number of threads in GPGPU applications. However, it requires storage of the same capacity, and given the area and power limitations of GPGPUs, it is not wise to introduce such a victim cache directly. To provide the victim cache storage, instead of adding a new data array, the unallocated registers in the register file (RF) are used to hold the victim cache data.
The RF is not fully used due to the resource allocation scheme in GPGPUs. In fact, since the TLP in GPGPUs is limited by several factors, e.g. the maximum number of threads and thread blocks allowed per SM and the RF and shared memory capacities, it is very likely that the RF is not fully used, with many unallocated registers that can be identified before execution. In addition, because the registers and cache blocks in GPGPUs are both organized on a per-warp basis with a width of 1024 bits, it is a natural fit to use the unallocated registers to hold the victim cache data. To find these unallocated registers, a bitmap that marks the status of each register in the RF is employed per SM.
For example, only a 1024-bit bitmap is needed, assuming a 128 KB RF with 1024-bit warp operands. Status 1 means the corresponding register is allocated, and 0 means it is unallocated throughout the execution and can therefore be used as a victim cache line. The bitmap can be initialized before each kernel launch, because the register requirement does not change during kernel execution. The data flow of the improved victim cache is similar to that of a regular victim cache. One difference is that when the victim cache hits, the RF must be accessed to obtain the victim data.
Instead of 1024 bits of storage per entry, only a 10-bit index is required to record the register position, and this index is used to access the RF to retrieve the victim data. Another difference is that the bitmap must be checked for unallocated registers; if none is available, some of the victim cache blocks have to be marked invalid because there is no storage to back them.
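A compact sketch of these structures, under the example parameters above (1024 registers per SM tracked by a 1024-bit bitmap, and a 10-bit register index per victim entry); the class layout and method names are illustrative, not the actual hardware design:

```cuda
#include <bitset>
#include <cstdint>
#include <vector>

// One victim cache entry: a tag plus a 10-bit index naming the unallocated
// register that holds the evicted 1024-bit L1D line.
struct VictimEntry {
    bool     valid  = false;
    uint32_t tag    = 0;
    uint16_t regIdx = 0;      // 10-bit index into the register file
};

struct VictimCache {
    std::bitset<1024> allocated;      // set before kernel launch from the
                                      // compile-time register requirement
    std::bitset<1024> backingVictim;  // free registers already holding a line
    std::vector<VictimEntry> entries; // tag array of the victim cache

    // Claim an unallocated register to back a newly evicted L1D line; if
    // none is free, the caller marks the victim entry invalid instead.
    int claimFreeRegister() {
        for (int r = 0; r < 1024; ++r) {
            if (!allocated.test(r) && !backingVictim.test(r)) {
                backingVictim.set(r);
                return r;
            }
        }
        return -1;  // no storage available for this victim line
    }
};
```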
However, as the number of unallocated registers is usually large enough, invalidation of victim cache blocks is rare. In summary, the proposed victim cache design can effectively alleviate the conflict misses that the GPGPU L1D cache suffers from, while introducing only limited hardware support. For example, a 16 KB victim cache needs 2.5 Kbit of tag storage. For an RF of 1024 registers, a 1024-bit bitmap is needed to record register availability and 1280 bits to record the register indexes for victim storage. In addition, some associated logic is needed to support the pipeline changes and manage the data flow.
However, given the performance gained from the victim cache design, the investment is justified.

IV. CONCLUSION
The second paper proposes a victim cache design for GPGPUs.
It works with the on-chip L1D cache to keep more data on chip for fast access. To reduce the storage requirement for victim data, the unallocated registers found in the RF are leveraged to serve as the data array of the victim cache, so the hardware cost can be greatly reduced. The experimental results show that with this approach the on-chip cache hit rate is greatly increased, which leads to a large performance improvement with only small changes to the baseline GPGPU design. The first paper proposes a locality-aware thread block scheduler that improves GPGPU performance by reducing cache misses. It uses a software-hardware cooperative method to capture the spatial locality among different TBs at runtime, and increases cache hit rates by dispatching TBs with many shared cache lines to the same SM. Results show that this approach can effectively reduce the L1 miss rate (average 3%, up to 13%) and significantly improve performance (average 9%, up to 34%) over the state-of-the-art approach. This work opens up new research directions for leveraging GPGPU software and hardware characteristics to precisely capture cache locality and optimize cache utilization.
Both of these methods can be used to significantly improve the existing multi-thread performance of GPGPUs. Depending on the specific application, the performance of these methods may vary, but Cache Locality Aware Thread Block Scheduling is a slightly better approach than the victim cache solution.

V. REFERENCES
1. 'Improving GPGPU Performance via Cache Locality Aware Thread Block Scheduling' by Li-Jhan Chen, Hsiang-Yun Cheng, Po-Han Wang, and Chia-Lin Yang. http://ieeexplore.ieee.org/document/7898501/
2. 'Applying Victim Cache in High Performance GPGPU Computing' by Fengfeng Fan, Jianfei Wang, Li Jiang, Xiaoyao Liang, and Naifeng Jing. http://ieeexplore.ieee.org/document/7904265/