Divergence Optimizations

We call divergence optimizations the techniques that use information from a divergence analysis to improve SIMD code. Although divergence analysis is new, the literature already contains examples of divergence optimizations; however, previous works have relied on profiling or visual inspection to identify optimization opportunities. In this section we discuss four divergence-aware automatic optimizations, namely thread reallocation, variable sharing, peephole optimizations, and branch fusion.

Thread Reallocation

This optimization applies to execution models that group PEs into independent SIMD front-lines, such as Nvidia’s GPUs, which consist of a number of warps of 32 PEs that execute in lock-step. In the face of divergences, one can re-group these PEs so that only one front-line contains divergent PEs. Fung et al. have proposed a way to perform this reallocation at the hardware level, inferring, via simulation, speed-ups of up to 20%. Zhang et al. have implemented this reallocation by manually moving data around before sending it to the GPU, obtaining performance gains of up to 37%.

As an example, consider the CUDA kernel below, which simply decrements each position of an input array until all positions are zero or less:

__global__ void dec2zero(int* v, int N) {
  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  if (xIndex < N) {
    while (v[xIndex] > 0) {
      v[xIndex]--;
    }
  }
}

This kernel is arguably very naive, and unlikely to exist outside the frontiers of an educational example; however, it represents a pattern rather common in GPU applications: the amount of processing performed by each thread is proportional to some input element that is unique to each PE. A typical application that behaves in this way is ray-tracing. Going back to our example, let’s assume that the input array is initialized with either random numbers or constants, by the C procedures below:

void vecRandomInit(int* data, int size) {
  for (int i = 0; i < size; ++i) {
    data[i] = random() % size;
  }
}

void vecConsInit(int* data, int size) {
  int cons = size / 2;
  for (int i = 0; i < size; ++i) {
    data[i] = cons;
  }
}

The sum of the array elements, in both cases, will be approximately size * size / 2; however, on virtually any Nvidia GPU, our example kernel will run almost twice as fast when paired with the second initializer as when paired with the first. A simple way to improve the running time of our kernel would be to sort the input array before sending it to the GPU. The initializer below, which produces sorted input data, yields performance numbers very similar to vecConsInit:

void vecIncInit(int* data, int size) {
  for (int i = 0; i < size; ++i) {
    data[i] = i;
  }
}
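
Any of these initializers can be evaluated by timing dec2zero with a small host-side driver, like the sketch below. This driver is our own illustration: the array size, the block size of 256 and the use of CUDA events for timing are arbitrary choices, not part of the original experiment; swapping vecRandomInit for vecConsInit or vecIncInit lets one compare the initialization strategies.

#include <cstdio>
#include <cstdlib>

int main() {
  const int N = 1 << 20;
  const int BLOCK = 256;
  int* h = (int*) malloc(N * sizeof(int));
  vecRandomInit(h, N);   // swap for vecConsInit or vecIncInit to compare
  int* d;
  cudaMalloc(&d, N * sizeof(int));
  cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);
  // Time the kernel with CUDA events:
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  dec2zero<<<(N + BLOCK - 1) / BLOCK, BLOCK>>>(d, N);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0F;
  cudaEventElapsedTime(&ms, start, stop);
  printf("dec2zero: %f ms\n", ms);
  cudaFree(d);
  free(h);
  return 0;
}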

However, sorting the input array also takes time, and may not be strictly necessary. Instead, we can “quasi-sort” the array: we re-arrange the data so that, although the array is not exactly sorted, the absolute difference between consecutive elements becomes smaller. The kernel below implements this quasi-sorting procedure in two steps. First, we divide the data among the available PEs and let each PE sort its small chunk of data. Once this sorting is done, we scatter the data along the input array, so that each thread inserts the i-th element of its sorted chunk into the i-th partition of the array:

__global__ static void maxSort1(int * values, int N) {
  __shared__ int shared[THREAD_WORK_SIZE * 256];
  // Copy data from the values vector into shared memory:
  // Notice that a thread brings to shared memory data that is not in its own
  // sorting partition; we do it like this to get coalesced memory reads.
  for (unsigned k = 0; k < THREAD_WORK_SIZE; k++) {
    unsigned loc = k * blockDim.x + threadIdx.x;
    if (loc < N) {
      shared[loc] = values[loc + blockIdx.x * blockDim.x];
    }
  }
  __syncthreads();
  // SORT: each thread orders its four elements with a small network of
  // compare-and-swaps (six here; five would already suffice for four elements).
  int index0 = threadIdx.x * THREAD_WORK_SIZE;
  int index1 = threadIdx.x * THREAD_WORK_SIZE + 1;
  int index2 = threadIdx.x * THREAD_WORK_SIZE + 2;
  int index3 = threadIdx.x * THREAD_WORK_SIZE + 3;
  if (index0 < N) {
    swapIfNecessary(shared, index0, index1);
    swapIfNecessary(shared, index2, index3);
    swapIfNecessary(shared, index0, index2);
    swapIfNecessary(shared, index1, index3);
    swapIfNecessary(shared, index1, index2);
    swapIfNecessary(shared, index0, index3);
  }
  __syncthreads();
  // SCATTER: the threads distribute their data along the array.
  __shared__ int scattered[THREAD_WORK_SIZE * 256];
  unsigned int nextLoc = threadIdx.x;
  for (unsigned i = 0; i < THREAD_WORK_SIZE; i++) {
    scattered[nextLoc] = shared[threadIdx.x * THREAD_WORK_SIZE + i];
    nextLoc += blockDim.x;
  }
  __syncthreads();
  // Copy the data back from the shared memory into the values vector:
  for (unsigned k = 0; k < THREAD_WORK_SIZE; k++) {
    unsigned loc = k * blockDim.x + threadIdx.x;
    if (loc < N) {
      values[loc + blockIdx.x * blockDim.x] = scattered[loc];
    }
  }
}
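
The kernel above relies on two names that are not defined in the listing: the macro THREAD_WORK_SIZE and the device function swapIfNecessary. The sketch below shows one plausible way to define them, assuming that each thread sorts four elements in ascending order; it illustrates what the kernel expects, not necessarily the original definitions, and these declarations would have to precede the kernel.

// Number of elements that each thread sorts; the six compare-and-swaps in
// maxSort1 form a sorting network for four elements.
#define THREAD_WORK_SIZE 4

// Compare-and-swap: orders positions i and j of the shared array so that the
// smaller value ends up at the lower index.
__device__ void swapIfNecessary(int* data, int i, int j) {
  if (data[i] > data[j]) {
    int tmp = data[i];
    data[i] = data[j];
    data[j] = tmp;
  }
}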

Variable Sharing

This optimization consists of moving data proved to be non-divergent from registers to shared memory. Some architectures, such as CUDA, divide a fixed number of registers among the co-existing PEs; thus, the demand for local storage limits the number of active processing elements. Variable sharing reduces the register pressure at each PE, hence increasing hardware occupancy. For instance, the Nvidia GTX 280 GPU can run up to 768 threads, which have access to 8,192 registers. This gives each thread at most 10 registers. An application that requires more than 10 registers per thread will not be able to use the hardware at its full capacity. There are two main ways to share data and work among PEs, which we call internal and external sharing. We start this discussion by illustrating internal sharing, leaving the external variation for later.

Internal Variable Sharing

Let’s consider the kernel below, an artificial example built around a computation-heavy loop. This kernel fills the positions of an array Out with values computed from the columns of a matrix In, plus a set of four variables. The kernel uses 11 registers when compiled with nvcc -O3; hence, it can occupy only 2/3 of the 768 available thread slots.

__global__ void nonShared1(float* In, float* Out, int Width) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < Width) {
    Out[tid] = 0.0;
    float a = 2.0F, b = 3.0F, c = 5.0F, d = 7.0F;
    for (int k = tid; k < Width * Width; k += Width) {
      Out[tid] += In[k] / (a - b);
      Out[tid] -= In[k] / (c - d);
      float aux = a;
      a = b;
      b = c;
      c = d;
      d = aux;
    }
  }
}

The divergence analysis tells us that variables a, b, c and d are non-divergent. Thus, some of these variables can be kept in shared memory, as we illustrate in the kernel regPress2 below. In this case, we have chosen to map two of the variables, a and b, to the shared locations common0 and common1. We have also reduced the number of accesses to shared memory by ensuring that only PE zero updates the common locations.

__global__ void regPress2(float* In, float* Out, int Width) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < Width) {
    __shared__ float common0, common1;
    float c = 5.0F;
    float d = 7.0F;
    if (threadIdx.x == 0) {
      common0 = 2.0F;
      common1 = 3.0F;
    }
    __syncthreads();
    Out[tid] = 0.0;
    for (int k = tid; k < Width * Width; k += Width) {
      Out[tid] += In[k] / (common0 - common1);
      Out[tid] -= In[k] / (c - d);
      float aux = common0;
      if (threadIdx.x == 0) {
        common0 = common1;
        common1 = c;
      }
      __syncthreads();
      c = d;
      d = aux;
    }
  }
}

Variable sharing must be used with discretion, for reading and writing data in shared memory is slower than using registers: each read is implemented by a memory load, and each write by a memory store. Nevertheless, in this case we have been able to reduce the register usage to 10, obtaining full hardware occupancy.

External Sharing

In the previous example we used PE 0 not only to update the shared memory, but also to do all the useful work on the common data. There are situations in which it is possible to move this work to the CPU, which is generally much faster than a single GPU thread. For instance, let’s change our running example slightly, obtaining the kernel below. In this new kernel, the intermediate values of the common computation are not used by the individual computations; that is, the shared variables are only combined with the per-thread data once they have been fully computed. Hence, we can move the computation of the common values to the CPU.

__global__ void nonShared2(float* In, float* Out, int Width) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < Width) {
    Out[tid] = 0.0;
    float a = 2.0F, b = 3.0F, c = 5.0F, d = 7.0F;
    for (int k = 0; k < Width; k++) {
      int index = tid + (k * Width);
      Out[tid] += In[index] / k;
      Out[tid] -= In[index] / (k + 1);
      float aux = a;
      a = b;
      b = c;
      c = d;
      d = aux;
    }
    Out[tid] /= (a - b) * (c - d);
  }
}

In the previous kernel, variables a, b, c and d are non-divergent. Furthermore, the intermediate values produced during the computation of these variables are used only to produce the final value generated by the kernel. Therefore, we can extract the slice of the program responsible for computing these shared variables and move it to the CPU, as we show in the program below, which contains both CPU and GPU code. The advantage, in this case, is not only a reduction in the register pressure inside the kernel, but also the elimination of redundant work.

__global__ void sharedExternally(float* In, float* Out, int Width, float alpha) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < Width) {
    Out[tid] = 0.0;
    for (int k = 0; k < Width; k++) {
      int index = tid + (k * Width);
      Out[tid] += In[index] / k;
      Out[tid] -= In[index] / (k + 1);
    }
    Out[tid] /= alpha;
  }
}
// Part of the program that runs on the CPU:
float a = 2.0F, b = 3.0F, c = 5.0F, d = 7.0F;
for (int k = 0; k < matrix_side_size; k++) {
  float aux = a;
  a = b;
  b = c;
  c = d;
  d = aux;
}
float alpha = (a - b) * (c - d);
// Launch configuration (grid and block sizes) omitted in the original listing:
sharedExternally<<<gridSize, blockSize>>>(In, Out, Width, alpha);

Peephole Optimizations

This optimization consists of replacing instructions that are proved to be non-divergent with more efficient operations. For instance, the implementation of u-SIMD conditional branch instructions is complicated by the need to handle divergences, as one can verify in the rules that describe conditional branches and synchronization barriers. However, this complication should not be imposed on non-divergent branches. Thus, many SIMD architectures provide two conditional jumps: one that, similarly to the u-SIMD branch and sync instructions, handles divergences, and another that does not. For instance, PTX, the CUDA intermediate representation, provides bra.uni, a conditional jump that can be used in the absence of divergences. The Nvidia programming manual says the following about this type of conditional test:

“All control constructs are assumed to be divergent points unless the control-flow instruction is marked as uniform, using the uni suffix. For divergent control flow, the optimizing code generator automatically determines points of re-convergence. Therefore, a compiler or code author targeting PTX can ignore the issue of divergent threads, but has the opportunity to improve performance by marking branch points as uniform when the compiler or author can guarantee that the branch point is non-divergent.”

Similarly, PTX offers uniform versions of other control-flow instructions. It also offers optimized versions of instructions that do not necessarily influence the program’s control flow, but that might be faster when their parameters are known to be non-divergent (see the sketch after the list below):

  1. ldu: a uniform load instruction, which assumes that the memory address to load from is non-divergent, that is, every thread will load from the same address.
  2. call.uni: a function call that assumes it will be executed by every thread in the warp. In PTX, function calls can be guarded by predicates, so that only the PEs whose predicate evaluates to true perform the call. The call.uni instruction assumes that all the PEs see the same predicate value.
  3. ret.uni: a return instruction that terminates the execution of a device function, giving control back to the calling function. A divergent return suspends the PEs that execute it until every other PE in the same front-line is done with that function call; this arrangement allows different PEs to leave a function call at different moments. On the other hand, if every thread is known to finish at the same time, then the uniform return can be used to improve the program’s performance.
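
As a CUDA-level illustration of when the uniform variants apply, consider the sketch below, which is our own hypothetical example. The first branch depends only on a kernel parameter, so a divergence analysis can prove it non-divergent and the code generator may emit bra.uni for it; the second depends on threadIdx.x and must use the divergence-aware branch.

__global__ void uniformVsDivergent(float* Out, int flag) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  // Non-divergent branch: 'flag' has the same value for every PE, so this
  // conditional is a candidate for PTX's uniform jump (bra.uni).
  if (flag > 0) {
    Out[tid] = 1.0F;
  }
  // Divergent branch: the condition depends on the thread identifier, so the
  // divergence-handling branch must be used.
  if (threadIdx.x % 2) {
    Out[tid] += 2.0F;
  }
}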

Branch Fusion

If-then-else branches may have many instructions in common. Ideally, we would move these redundant instruction sequences into a common program path, so that threads spend as little time as possible in the divergent parts of the program. Branch fusion is an extensive form of redundancy elimination whose purpose is to extract common code from divergent program paths. We will use the example below to explain this optimization.

__global__ void div1(float* In, float* Out, int Width) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < Width) {
    float t0 = tid * 6.28 + 1.0F;
    Out[tid] = 0.0;
    for (int k = tid; k < Width * Width; k += Width) {
      if (tid % 2) {
        float t1 = In[k];
        float t2 = t1 * t1;
        float t3 = t2 * t1;
        float t4 = t3 * 3.14;
        float t5 = t4 / t0;
        float t6 = t5 / t0;
        float t7 = t1 * 2.71;
        float t8 = t6 + t7;
        Out[tid] += t8;
      } else {
        float t9 = In[k];
        float t10 = t9 * t9;
        float t11 = t10 * 3.14;
        float t12 = t11 / 2.00;
        float t13 = t9 * 2.71;
        float t14 = t13 * t9;
        float t15 = t12 + t14;
        Out[tid] += t15;
      }
    }
  }
}

We can easily see that many of the instructions executed in the two branch paths are the same, albeit with different operands. An improvement on this code is to merge these common instructions, obtaining the kernel below.

__global__ void div2(float* In, float* Out, int Width) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < Width) {
    float t0 = tid * 6.28 + 1.0F;
    Out[tid] = 0.0;
    for (int k = tid; k < Width * Width; k += Width) {
      float t8_15;
      float t1_9 = In[k];
      float t2_10 = t1_9 * t1_9;
      bool p = tid % 2;
      float s1 = p ? t1_9 : 3.14;
      float t3_11 = t2_10 * s1;
      if (p) {
        float t4 = t3_11 * 3.14;
        float t5 = t4 / t0;
        float t6 = t5 / t0;
        float t7 = t1_9 * 2.71;
        t8_15 = t6 + t7;
      } else {
        float t12 = t3_11 / 2.00;
        float t13 = t1_9 * 2.71;
        float t14 = t13 * t1_9;
        t8_15 = t12 + t14;
      }
      Out[tid] += t8_15;
    }
  }
}

The kernel div2 is approximately 17% faster than the kernel div1 on an Nvidia 9400M GPU. However, branch fusion must be used with discretion, for we should not increase the register pressure in the kernel to the point of reducing the hardware occupancy. For instance, if we keep applying branch fusion to the kernel above, we end up with the kernel div3, shown below:

__global__ void div3(float* In, float* Out, int Width) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < Width) {
    float t0 = tid * 6.28 + 1.0F;
    Out[tid] = 0.0;
    for (int k = tid; k < Width * Width; k += Width) {
      float t1_9 = In[k];
      float t2_10 = t1_9 * t1_9;
      bool p = tid % 2;
      float s1 = p ? t1_9 : 3.14;
      float t3_11 = t2_10 * s1;
      float t6_12, t13;
      if (p) {
        float t4 = t3_11 * 3.14;
        float t5 = t4 / t0;
        t6_12 = t5 / t0;
      } else {
        t13 = t1_9 * 2.71;
        t6_12 = t3_11 / 2.00;
      }
      float s2 = p ? 2.71 : t13;
      float t7_14 = t1_9 * s2;
      float t8_15 = t6_12 + t7_14;
      Out[tid] += t8_15;
    }
  }
}

This last kernel maximizes the number of instructions that are executed together; however, to accomplish this, it must use more registers for the selectors. The maximum register pressure in div3 is 11, whereas it is 9 in div1 and 10 in div2. The 9400M has 8,192 registers to distribute among 768 threads, so to reach maximum hardware occupancy each thread must receive at most 10 registers. By allocating one register beyond this limit to its PEs, div3 can fill only 2/3 of the thread slots, and thus loses performance.
