Deep Learning: A Survey of Optimization Techniques for Deep Neural Network Training and Inference on GPU Architectures — Performance Acceleration Methods for Single-Machine and Distributed Systems (PDF download)


Excerpt from the document:

 

3.3. Optimizing Winograd CONV
 
Park et al. [10] present two techniques for improving the performance of Winograd CONV on GPUs. They note that a large fraction of the weights in a CNN are zero, especially after applying pruning techniques. If any element of U is zero, their first technique avoids loading it and also skips the multiplication between the elements of U and V. The corresponding code is shown in Fig. 3(a). Due to the lockstep execution of threads in a warp, this approach reduces latency only if all the threads of a warp operate on zero U values. Their technique therefore groups threads performing multiplication on the same input fmap into the same warp; since their U[i] value is the same, the zero-result multiplication can be skipped in all the threads of the warp.
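To make the idea concrete, the following is a minimal NumPy sketch (not the authors' code; the function name is ours) of the element-wise multiplication stage of Winograd CONV with zero-valued U entries skipped. On a real GPU the skip only pays off when all threads of a warp see the same zero U[i], which is why the authors group threads working on the same input fmap into one warp; the sketch only models the arithmetic that is avoided.

```python
import numpy as np

def elementwise_multiply_skip_zeros(U, V):
    """Element-wise (Hadamard) product M = U * V over the tile positions,
    skipping positions where the transformed weight U[i] is zero
    (e.g., after pruning). Illustrative sketch only."""
    M = np.zeros_like(V)
    skipped = 0
    for i in range(U.size):
        u = U.flat[i]
        if u == 0.0:          # zero weight: skip the load of V[i] and the multiply
            skipped += 1
            continue
        M.flat[i] = u * V.flat[i]
    return M, skipped

# Toy usage: a 6x6 transformed-weight tile with ~70% zeros and a 6x6 input tile
rng = np.random.default_rng(0)
U = rng.standard_normal((6, 6)) * (rng.random((6, 6)) > 0.7)
V = rng.standard_normal((6, 6))
M, skipped = elementwise_multiply_skip_zeros(U, V)
print(f"skipped {skipped} of {U.size} multiplications")
```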
They further note that since each iteration has only a few instructions, the relative overhead of the condition-checking instructions becomes large. Hence, despite avoiding multiplications, the above technique degrades performance due to the overhead of the instructions added for performing the condition check. To mitigate this overhead, they add a bit-vector to every GPU core, such that the ith bit of the vector is 0 if U[i] is 0, and vice versa. In the normal case, the PC (program counter) is incremented by one in each cycle, as shown on the left side of Fig. 3(b). In their technique, after every iteration, the bit-vector is scanned to find the next non-zero bit. The PC is then incremented such that it jumps directly to the instruction of the corresponding iteration, as shown on the right side of Fig. 3(b). Here, SkipCount is obtained by subtracting the present index from the next non-zero index, and IterationLength is the number of instructions in an iteration. Thus, without using condition-check instructions, their technique executes only those iterations whose U[i] is non-zero.
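The arithmetic of the jump can be illustrated with a short software model. The real mechanism is implemented in hardware; the function and parameter names below simply mirror the description above and are not from the paper.

```python
def next_pc(pc, bit_vector, current_index, iteration_length):
    """Model of the PC update: scan the bit-vector for the next set bit
    (i.e., the next non-zero U[i]) and jump straight to that iteration's
    first instruction. Returns (new_pc, new_index), or (None, None) if
    no non-zero iteration remains."""
    for next_index in range(current_index + 1, len(bit_vector)):
        if bit_vector[next_index] == 1:              # 1 means U[next_index] != 0
            skip_count = next_index - current_index  # SkipCount in Fig. 3(b)
            return pc + skip_count * iteration_length, next_index
    return None, None

# U = [0.5, 0, 0, 0.3, 0], so the bit-vector is [1, 0, 0, 1, 0]
pc, idx = next_pc(0, [1, 0, 0, 1, 0], 0, iteration_length=4)
print(pc, idx)  # 12, 3 -> iteration 3 starts 3*4 instructions ahead
```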
The above technique is effective for small tile sizes such as 2 × 2, since in this case phase 2 contributes the largest portion of the execution time. However, for large tiles such as 6 × 6, phases 1 and 3 are the largest contributors to the execution time, and hence the overhead of the addition operations becomes large. For such cases, they propose a second technique, which increases the reuse of the operands of the add operations during their residency in the RF (register file). This reduces accesses to the on-chip cache. Since a GPU runs a huge number of threads, the per-thread RF capacity is small; hence, maximally reusing the operands present in the RF is important.
They propose the following optimization for phase 1; the optimization proposed for phase 3 is similar. The input access pattern in the computation of matrix V is shown in Fig. 4. When F(4×4, 3×3) operates on a 6 × 6 input tile (i.e., calculating 4 × 4 outputs with a 3 × 3 kernel), there are nine distinct access patterns in the computation of V[0] to V[35]. As an example, computing V[11] requires accessing the 12 elements shown with the shaded square in Fig. 4 and marked as Pattern6. They make two observations and propose corresponding optimizations based on them, which are shown in Table 6.
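For reference, phase 1 computes V = BᵀdB for each 6 × 6 input tile d, which is why each V[i] is a fixed linear combination of a subset of the 36 input elements. The sketch below uses the commonly cited input-transform matrix Bᵀ for F(4×4, 3×3) from Lavin and Gray's Winograd formulation; that matrix is an assumption on our part, since the excerpt does not reproduce it.

```python
import numpy as np

# Standard input-transform matrix B^T for F(4x4, 3x3) (Lavin & Gray);
# assumed here for illustration, not taken from the surveyed paper.
BT = np.array([
    [4,  0, -5,  0, 1, 0],
    [0, -4, -4,  1, 1, 0],
    [0,  4, -4, -1, 1, 0],
    [0, -2, -1,  2, 1, 0],
    [0,  2, -1, -2, 1, 0],
    [0,  4,  0, -5, 0, 1],
], dtype=np.float32)

def input_transform(d):
    """Phase 1 of Winograd CONV: V = B^T d B for one 6x6 input tile d.
    Each of the 36 entries V[0..35] depends on a fixed subset of the
    input elements, giving rise to the nine access patterns in Fig. 4."""
    return BT @ d @ BT.T

d = np.arange(36, dtype=np.float32).reshape(6, 6)  # toy 6x6 input tile
V = input_transform(d)
print(V.shape)  # (6, 6), i.e., V[0] .. V[35]
```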
After applying these optimizations, the final order of computation is (Pattern1 and Pattern2 and Pattern3), then Pattern6, then Pattern5, then Pattern4, then (Pattern7 and Pattern8 and Pattern9). Overall, their techniques improve CONV performance significantly, and the two techniques provide improvements for different tile sizes.
3.4. Optimizing data-layouts
Let Fh and Fw denote the dimensions of the CONV filter, Co the number of filters (i.e., output fmaps), Ci the number of input fmaps, Hi and Wi the height and width of an fmap, and Ni the batch size. Then, Eq. (1) shows the computation performed in CONV (unit stride and no padding assumed):

$$\mathrm{Out}[n][o][x][y] = \sum_{c=0}^{C_i-1} \sum_{i=0}^{F_h-1} \sum_{j=0}^{F_w-1} \mathrm{In}[n][c][x+i][y+j] \times \mathrm{Wt}[o][c][i][j], \qquad 0 \le n < N_i,\; 0 \le o < C_o \tag{1}$$
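A direct, unoptimized NumPy realization of Eq. (1) in NCHW order can serve as a reference when comparing layouts. The function name and the stride-1/no-padding choice are assumptions for illustration, not the survey's code.

```python
import numpy as np

def conv_nchw(inp, wt):
    """Naive CONV of Eq. (1), stride 1, no padding (assumed).
    inp: (Ni, Ci, Hi, Wi), wt: (Co, Ci, Fh, Fw) -> out: (Ni, Co, Ho, Wo)."""
    Ni, Ci, Hi, Wi = inp.shape
    Co, _, Fh, Fw = wt.shape
    Ho, Wo = Hi - Fh + 1, Wi - Fw + 1
    out = np.zeros((Ni, Co, Ho, Wo), dtype=inp.dtype)
    for n in range(Ni):
        for co in range(Co):
            for h in range(Ho):
                for w in range(Wo):
                    # Sum over Ci x Fh x Fw products, as in Eq. (1)
                    out[n, co, h, w] = np.sum(
                        inp[n, :, h:h + Fh, w:w + Fw] * wt[co])
    return out

out = conv_nchw(np.ones((1, 3, 8, 8), dtype=np.float32),
                np.ones((4, 3, 3, 3), dtype=np.float32))
print(out.shape)  # (1, 4, 6, 6)
```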
The 4D arrays appearing in Eq. (1) can be stored in different layouts. Assume that the following symbols are used: image height/width (H/W), number of images (N), and number of feature maps (C). Evidently, Eq. (1) assumes the NCHW layout, where the items along the lowest dimension W are stored contiguously in memory. Successive elements in the H dimension are stored at a distance of W, and those in the C dimension at a distance of HW. Placing the data into the dimensions in different orders leads to multiple ways of storing it in memory; for example, with four-dimensional data, there are 4! = 24 possible layouts. Li et al. [11] study the performance impact of the data-layout on different CNN layers, since the data-layout decides the grid and block dimensions on the GPU. The layouts used in various frameworks are shown in Table 7(a).
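To see what the layout choice means for addressing, the sketch below computes the flat memory offset of element (n, c, h, w) under the NCHW and NHWC layouts, two of the 24 possible orderings. The helper names are ours for illustration, not from Li et al. [11].

```python
def offset_nchw(n, c, h, w, N, C, H, W):
    """NCHW: W is contiguous; neighbours in H are W apart, in C are H*W apart."""
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, N, C, H, W):
    """NHWC: C is contiguous; neighbours in W are C apart, in H are W*C apart."""
    return ((n * H + h) * W + w) * C + c

N, C, H, W = 2, 3, 4, 5
print(offset_nchw(0, 1, 2, 3, N, C, H, W))  # 33
print(offset_nhwc(0, 1, 2, 3, N, C, H, W))  # 40
```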