Deep Learning: A Survey of Optimization Techniques for Deep Neural Network Training and Inference on GPU Architectures — Performance Acceleration Methods for Single-Machine and Distributed Systems (PDF download)


Excerpt from the document:

 

3.3. Optimizing Winograd CONV
 
Park et al. [10] present two techniques for improving the performance of Winograd CONV on GPUs. They note that a large fraction of the weights in a CNN are zero, especially after applying pruning techniques. If any element of U is zero, their first technique avoids loading it and also skips the multiplication between the elements of U and V. The corresponding code is shown in Fig. 3(a). Due to the lockstep execution of threads in a warp, this approach reduces latency only if all the threads of a warp operate on zero U values. Their technique therefore groups threads performing multiplication on the same input fmap into the same warp; since their U[i] value is the same, the zero-result multiplication can be skipped in all the threads of the warp.
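To make the idea concrete, the following is a minimal NumPy sketch (not the authors' code; the function name is ours) of the element-wise multiplication stage of Winograd CONV with zero-valued U entries skipped. On a real GPU the skip only pays off when all threads of a warp see the same zero U[i], which is why the authors group threads working on the same input fmap into one warp; the sketch only models the arithmetic that is avoided.

```python
import numpy as np

def elementwise_multiply_skip_zeros(U, V):
    """Element-wise (Hadamard) product M = U * V over the tile positions,
    skipping positions where the transformed weight U[i] is zero
    (e.g., after pruning). Illustrative sketch only."""
    M = np.zeros_like(V)
    skipped = 0
    for i in range(U.size):
        u = U.flat[i]
        if u == 0.0:          # zero weight: skip the load of V[i] and the multiply
            skipped += 1
            continue
        M.flat[i] = u * V.flat[i]
    return M, skipped

# Toy usage: a 6x6 transformed-weight tile with ~70% zeros and a 6x6 input tile
rng = np.random.default_rng(0)
U = rng.standard_normal((6, 6)) * (rng.random((6, 6)) > 0.7)
V = rng.standard_normal((6, 6))
M, skipped = elementwise_multiply_skip_zeros(U, V)
print(f"skipped {skipped} of {U.size} multiplications")
```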
They further note that since each iteration has only a few instructions, the relative overhead of the condition-checking instructions becomes large. Hence, despite avoiding multiplications, the above technique degrades performance due to the overhead of the instructions added for performing the condition check. To mitigate this overhead, they add a bit-vector to every GPU core, such that the ith bit of the vector is 0 if U[i] is 0, and vice versa. In the normal case, the PC (program counter) is incremented by one in each cycle, as shown on the left side of Fig. 3(b). In their technique, after every iteration, the bit-vector is scanned to find the next non-zero bit. The PC is then incremented such that it jumps directly to the instruction of the corresponding iteration, as shown on the right side of Fig. 3(b). Here, SkipCount is obtained by subtracting the present index from the next non-zero index, and IterationLength is the number of instructions in an iteration. Thus, without using condition-check instructions, their technique executes only those iterations whose U[i] is non-zero.
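The arithmetic of the jump can be illustrated with a short software model. The real mechanism is implemented in hardware; the function and parameter names below simply mirror the description above and are not from the paper.

```python
def next_pc(pc, bit_vector, current_index, iteration_length):
    """Model of the PC update: scan the bit-vector for the next set bit
    (i.e., the next non-zero U[i]) and jump straight to that iteration's
    first instruction. Returns (new_pc, new_index), or (None, None) if
    no non-zero iteration remains."""
    for next_index in range(current_index + 1, len(bit_vector)):
        if bit_vector[next_index] == 1:              # 1 means U[next_index] != 0
            skip_count = next_index - current_index  # SkipCount in Fig. 3(b)
            return pc + skip_count * iteration_length, next_index
    return None, None

# U = [0.5, 0, 0, 0.3, 0], so the bit-vector is [1, 0, 0, 1, 0]
pc, idx = next_pc(0, [1, 0, 0, 1, 0], 0, iteration_length=4)
print(pc, idx)  # 12, 3 -> iteration 3 starts 3*4 instructions ahead
```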
The above technique is effective for small tile sizes such as 2 × 2, since in this case phase 2 contributes the largest portion of the execution time. However, for large tiles such as 6 × 6, phases 1 and 3 are the largest contributors to the execution time, and hence the overhead of the addition operations becomes large. For such cases, they propose a second technique, which increases the reuse of the operands of the add operations during their residency in the RF (register file). This reduces accesses to the on-chip cache. Since a GPU runs a huge number of threads, the per-thread RF capacity is small; hence, maximally reusing the operands present in the RF is important.
They propose the following optimization for phase 1; the optimization proposed for phase 3 is similar. The input access pattern in the computation of matrix V is shown in Fig. 4. When F(4×4, 3×3) operates on a 6 × 6 input tile (i.e., calculating 4 × 4 outputs with a 3 × 3 kernel), there are nine distinct access patterns in the computation of V[0] to V[35]. As an example, computing V[11] requires accessing the 12 elements shown with the shaded square in Fig. 4 and marked as Pattern6. They make two observations and propose corresponding optimizations based on them, which are shown in Table 6.
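For reference, phase 1 computes V = BᵀdB for each 6 × 6 input tile d, which is why each V[i] is a fixed linear combination of a subset of the 36 input elements. The sketch below uses the commonly cited input-transform matrix Bᵀ for F(4×4, 3×3) from Lavin and Gray's Winograd formulation; that matrix is an assumption on our part, since the excerpt does not reproduce it.

```python
import numpy as np

# Standard input-transform matrix B^T for F(4x4, 3x3) (Lavin & Gray);
# assumed here for illustration, not taken from the surveyed paper.
BT = np.array([
    [4,  0, -5,  0, 1, 0],
    [0, -4, -4,  1, 1, 0],
    [0,  4, -4, -1, 1, 0],
    [0, -2, -1,  2, 1, 0],
    [0,  2, -1, -2, 1, 0],
    [0,  4,  0, -5, 0, 1],
], dtype=np.float32)

def input_transform(d):
    """Phase 1 of Winograd CONV: V = B^T d B for one 6x6 input tile d.
    Each of the 36 entries V[0..35] depends on a fixed subset of the
    input elements, giving rise to the nine access patterns in Fig. 4."""
    return BT @ d @ BT.T

d = np.arange(36, dtype=np.float32).reshape(6, 6)  # toy 6x6 input tile
V = input_transform(d)
print(V.shape)  # (6, 6), i.e., V[0] .. V[35]
```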
After applying these optimizations, the final order of computation is (Pattern1 and Pattern2 and Pattern3), then Pattern6, then Pattern5, then Pattern4, then (Pattern7 and Pattern8 and Pattern9). Overall, their techniques improve CONV performance significantly, and the two techniques provide improvements for different tile sizes.
3.4. Optimizing data-layouts
Let Fh and Fw denote the dimensions of the CONV filter, Co the number of filters (i.e., output fmaps), Ci the number of input fmaps, Hi and Wi the height and width of an fmap, and Ni the batch size. Then, Eq. (1) shows the computation performed in CONV (unit stride and no padding assumed):

$$\mathrm{Out}[n][o][x][y] = \sum_{c=0}^{C_i-1} \sum_{i=0}^{F_h-1} \sum_{j=0}^{F_w-1} \mathrm{In}[n][c][x+i][y+j] \times \mathrm{Wt}[o][c][i][j], \qquad 0 \le n < N_i,\; 0 \le o < C_o \tag{1}$$
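A direct, unoptimized NumPy realization of Eq. (1) in NCHW order can serve as a reference when comparing layouts. The function name and the stride-1/no-padding choice are assumptions for illustration, not the survey's code.

```python
import numpy as np

def conv_nchw(inp, wt):
    """Naive CONV of Eq. (1), stride 1, no padding (assumed).
    inp: (Ni, Ci, Hi, Wi), wt: (Co, Ci, Fh, Fw) -> out: (Ni, Co, Ho, Wo)."""
    Ni, Ci, Hi, Wi = inp.shape
    Co, _, Fh, Fw = wt.shape
    Ho, Wo = Hi - Fh + 1, Wi - Fw + 1
    out = np.zeros((Ni, Co, Ho, Wo), dtype=inp.dtype)
    for n in range(Ni):
        for co in range(Co):
            for h in range(Ho):
                for w in range(Wo):
                    # Sum over Ci x Fh x Fw products, as in Eq. (1)
                    out[n, co, h, w] = np.sum(
                        inp[n, :, h:h + Fh, w:w + Fw] * wt[co])
    return out

out = conv_nchw(np.ones((1, 3, 8, 8), dtype=np.float32),
                np.ones((4, 3, 3, 3), dtype=np.float32))
print(out.shape)  # (1, 4, 6, 6)
```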
The 4D arrays appearing in Eq. (1) can be stored in different layouts. Assume that the following symbols are used: image height/width (H/W), number of images (N), and number of feature maps (C). Evidently, Eq. (1) assumes the NCHW layout, where the items along the lowest dimension W are stored contiguously in memory. Successive elements in the H dimension are stored at a distance of W, and those in the C dimension at a distance of HW. Placing the data into the dimensions in different orders leads to multiple ways of storing it in memory; for example, with four-dimensional data, there are 4! = 24 possible layouts. Li et al. [11] study the performance impact of the data-layout on different CNN layers, since the data-layout decides the grid and block dimensions on the GPU. The layouts used in various frameworks are shown in Table 7(a).
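To see what the layout choice means for addressing, the sketch below computes the flat memory offset of element (n, c, h, w) under the NCHW and NHWC layouts, two of the 24 possible orderings. The helper names are ours for illustration, not from Li et al. [11].

```python
def offset_nchw(n, c, h, w, N, C, H, W):
    """NCHW: W is contiguous; neighbours in H are W apart, in C are H*W apart."""
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, N, C, H, W):
    """NHWC: C is contiguous; neighbours in W are C apart, in H are W*C apart."""
    return ((n * H + h) * W + w) * C + c

N, C, H, W = 2, 3, 4, 5
print(offset_nchw(0, 1, 2, 3, N, C, H, W))  # 33
print(offset_nhwc(0, 1, 2, 3, N, C, H, W))  # 40
```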