GPU-2-3-CUDA组织并行线程

246 阅读7分钟

【CUDA 基础】组织并行线程

摘要

介绍CUDA模型中的线程组织模式。

前面我们学习了CUDA编程的几个关键点,包括内存,kernel,今天我们要学习线程组织形式。

每个线程的编号是依靠:

  • 块的坐标(blockIdx.x等)
  • 网格的大小(gridDim.x 等)
  • 线程编号(threadIdx.x等)
  • 线程的大小(tblockDim.x等)

这一篇我们就详细介绍每一个线程是怎么确定唯一的索引,然后建立并行计算,并且不同的线程组织形式是怎样影响性能的:

  • 二维网格二维线程块
  • 一维网格一维线程块
  • 二维网格一维线程块

关键字ThreadBlockGrid

使用块和线程建立矩阵索引

多线程的优点就是每个线程处理不同的数据计算,那么怎么分配好每个线程处理不同的数据,而不至于多个不同的线程处理同一个数据,或者避免不同的线程没有组织的乱访问内存。如果多线程不能按照组织合理的干活,那么就相当于一群没训练过的哈士奇拉雪橇,往不同的方向跑,那么是没办法前进的,必须有组织,有规则的计算才有意义。 我们的线程模型前文中已经有个大概的介绍,但是下图可以非常形象的反应线程模型,不过注意硬件实际的执行和存储不是按照图中的模型来的,大家注意区分:

cuda_thread.png

这里(ix,iy)就是整个线程模型中任意一个线程的索引,或者叫做全局地址,局部地址当然就是(threadIdx.x,threadIdx.y)了,当然这个局部地址目前还没有什么用处,他只能索引线程块内的线程,不同线程块中有相同的局部索引值,比如同一个小区,A栋有16楼,B栋也有16楼,A栋和B栋就是blockIdx,而16就是threadIdx。 图中的横坐标就是:

ix=threadIdx.x+blockIdx.x×blockDim.xix = threadIdx.x + blockIdx.x × blockDim.x

纵坐标是:

iy=threadIdx.y+blockIdx.y×blockDim.yiy = threadIdx.y + blockIdx.y × blockDim.y

这样我们就得到了每个线程的唯一标号,并且在运行时kernel是可以访问这个标号的。前面讲过CUDA每一个线程执行相同的代码,也就是异构计算中说的多线程单指令,如果每个不同的线程执行同样的代码,又处理同一组数据,将会得到多个相同的结果,显然这是没意义的,为了让不同线程处理不同的数据,CUDA常用的做法是让不同的线程对应不同的数据,也就是用线程的全局标号对应不同组的数据。

设备内存或者主机内存都是线性存在的,比如一个二维矩阵(8×6)(8×6),存储在内存中是这样的:

memory.png

我们要做管理的就是:

  • 线程和块索引(来计算线程的全局索引)
  • 矩阵中给定点的坐标ix,iy(ix,iy)
  • (ix,iy)(ix,iy)对应的线性内存的位置

线性位置的计算方法是:

idx=ix+iynxidx = ix + iy * nx

我们上面已经计算出了线程的全局坐标,用线程的全局坐标对应矩阵的坐标,也就是说,线程的坐标(ix,iy)(ix,iy)对应矩阵中(ix,iy)(ix,iy)的元素,这样就形成了一一对应,不同的线程处理矩阵中不同的数据,举个具体的例子,ix=10,iy=10ix=10,iy=10的线程去处理矩阵中(10,10)(10,10)的数据,当然你也可以设计别的对应模式,但是这种方法最简单,出错可能最低的。 我们接下来的代码来输出每个线程的标号信息:

#include <cuda_runtime.h>
#include <stdio.h>
#include "freshman.h"

__global__ void printThreadIndex(float *A,const int nx,const int ny)
{
  int ix=threadIdx.x+blockIdx.x*blockDim.x;
  int iy=threadIdx.y+blockIdx.y*blockDim.y;
  unsigned int idx=iy*nx+ix;
  printf("thread_id(%d,%d) block_id(%d,%d) coordinate(%d,%d)"
          "global index %2d ival %2d\n",threadIdx.x,threadIdx.y,
          blockIdx.x,blockIdx.y,ix,iy,idx,A[idx]);
}
int main(int argc,char** argv)
{
  initDevice(0);
  int nx=8,ny=6;
  int nxy=nx*ny;
  int nBytes=nxy*sizeof(float);

  //Malloc
  float* A_host=(float*)malloc(nBytes);
  initialData(A_host,nxy);
  printMatrix(A_host,nx,ny);

  //cudaMalloc
  float *A_dev=NULL;
  CHECK(cudaMalloc((void**)&A_dev,nBytes));

  cudaMemcpy(A_dev,A_host,nBytes,cudaMemcpyHostToDevice);

  dim3 block(4,2);
  dim3 grid((nx-1)/block.x+1,(ny-1)/block.y+1);

  printThreadIndex<<<grid,block>>>(A_dev,nx,ny);

  CHECK(cudaDeviceSynchronize());
  cudaFree(A_dev);
  free(A_host);

  cudaDeviceReset();
  return 0;
}

这段代码输出了一组我们随机生成的矩阵,并且核函数打印自己的线程标号,注意,核函数能调用printf()这个特性是CUDA后来加的,最早的版本里面不能printf(),输出结果:

(base) ut@UT:~/yangbin/cudaLearn/build/5_thread_index$ ./thread_index 
Using device 0: NVIDIA GeForce RTX 3080
Matrix<6,8>:
61.993000 60.047001 16.052000 23.415001 41.230000 28.368000 19.605000 15.928000 
9.852000 30.290001 27.114000 35.868999 3.315000 18.684999 57.742001 55.873001 
51.626999 64.167000 29.766001 16.954000 46.896999 36.125000 26.886999 7.829000 
35.944000 44.421001 19.870001 58.446999 37.860001 11.565000 12.371000 34.317001 
6.077000 28.423000 57.731998 47.306999 56.792000 11.801000 63.235001 1.108000 
42.091000 24.813000 36.977001 45.405998 43.499001 29.184000 35.743000 29.590000 
thread_id(0,0) block_id(0,2) coordinate(0,4)global index 32 ival 6.077000
thread_id(1,0) block_id(0,2) coordinate(1,4)global index 33 ival 28.423000
thread_id(2,0) block_id(0,2) coordinate(2,4)global index 34 ival 57.731998
thread_id(3,0) block_id(0,2) coordinate(3,4)global index 35 ival 47.306999
thread_id(0,1) block_id(0,2) coordinate(0,5)global index 40 ival 42.091000
thread_id(1,1) block_id(0,2) coordinate(1,5)global index 41 ival 24.813000
thread_id(2,1) block_id(0,2) coordinate(2,5)global index 42 ival 36.977001
thread_id(3,1) block_id(0,2) coordinate(3,5)global index 43 ival 45.405998
thread_id(0,0) block_id(0,0) coordinate(0,0)global index  0 ival 61.993000
thread_id(1,0) block_id(0,0) coordinate(1,0)global index  1 ival 60.047001
thread_id(2,0) block_id(0,0) coordinate(2,0)global index  2 ival 16.052000
thread_id(3,0) block_id(0,0) coordinate(3,0)global index  3 ival 23.415001
thread_id(0,1) block_id(0,0) coordinate(0,1)global index  8 ival 9.852000
thread_id(1,1) block_id(0,0) coordinate(1,1)global index  9 ival 30.290001
thread_id(2,1) block_id(0,0) coordinate(2,1)global index 10 ival 27.114000
thread_id(3,1) block_id(0,0) coordinate(3,1)global index 11 ival 35.868999
thread_id(0,0) block_id(1,0) coordinate(4,0)global index  4 ival 41.230000
thread_id(1,0) block_id(1,0) coordinate(5,0)global index  5 ival 28.368000
thread_id(2,0) block_id(1,0) coordinate(6,0)global index  6 ival 19.605000
thread_id(3,0) block_id(1,0) coordinate(7,0)global index  7 ival 15.928000
thread_id(0,1) block_id(1,0) coordinate(4,1)global index 12 ival 3.315000
thread_id(1,1) block_id(1,0) coordinate(5,1)global index 13 ival 18.684999
thread_id(2,1) block_id(1,0) coordinate(6,1)global index 14 ival 57.742001
thread_id(3,1) block_id(1,0) coordinate(7,1)global index 15 ival 55.873001
thread_id(0,0) block_id(0,1) coordinate(0,2)global index 16 ival 51.626999
thread_id(1,0) block_id(0,1) coordinate(1,2)global index 17 ival 64.167000
thread_id(2,0) block_id(0,1) coordinate(2,2)global index 18 ival 29.766001
thread_id(3,0) block_id(0,1) coordinate(3,2)global index 19 ival 16.954000
thread_id(0,1) block_id(0,1) coordinate(0,3)global index 24 ival 35.944000
thread_id(1,1) block_id(0,1) coordinate(1,3)global index 25 ival 44.421001
thread_id(2,1) block_id(0,1) coordinate(2,3)global index 26 ival 19.870001
thread_id(3,1) block_id(0,1) coordinate(3,3)global index 27 ival 58.446999
thread_id(0,0) block_id(1,2) coordinate(4,4)global index 36 ival 56.792000
thread_id(1,0) block_id(1,2) coordinate(5,4)global index 37 ival 11.801000
thread_id(2,0) block_id(1,2) coordinate(6,4)global index 38 ival 63.235001
thread_id(3,0) block_id(1,2) coordinate(7,4)global index 39 ival 1.108000
thread_id(0,1) block_id(1,2) coordinate(4,5)global index 44 ival 43.499001
thread_id(1,1) block_id(1,2) coordinate(5,5)global index 45 ival 29.184000
thread_id(2,1) block_id(1,2) coordinate(6,5)global index 46 ival 35.743000
thread_id(3,1) block_id(1,2) coordinate(7,5)global index 47 ival 29.590000
thread_id(0,0) block_id(1,1) coordinate(4,2)global index 20 ival 46.896999
thread_id(1,0) block_id(1,1) coordinate(5,2)global index 21 ival 36.125000
thread_id(2,0) block_id(1,1) coordinate(6,2)global index 22 ival 26.886999
thread_id(3,0) block_id(1,1) coordinate(7,2)global index 23 ival 7.829000
thread_id(0,1) block_id(1,1) coordinate(4,3)global index 28 ival 37.860001
thread_id(1,1) block_id(1,1) coordinate(5,3)global index 29 ival 11.565000
thread_id(2,1) block_id(1,1) coordinate(6,3)global index 30 ival 12.371000
thread_id(3,1) block_id(1,1) coordinate(7,3)global index 31 ival 34.317001

二维网格和二维块

首先来看二维网格二维块的代码:

// 2d block and 2d grid
dim3 block_0(dimx,dimy);
dim3 grid_0((nx-1)/block_0.x+1,(ny-1)/block_0.y+1);
iStart=cpuSecond();
sumMatrix<<<grid_0,block_0>>>(A_dev,B_dev,C_dev,nx,ny);
CHECK(cudaDeviceSynchronize());
iElaps=cpuSecond()-iStart;
printf("GPU Execution configuration<<<(%d,%d),(%d,%d)>>> Time elapsed %f sec\n",
      grid_0.x,grid_0.y,block_0.x,block_0.y,iElaps);
CHECK(cudaMemcpy(C_from_gpu,C_dev,nBytes,cudaMemcpyDeviceToHost));
checkResult(C_host,C_from_gpu,nxy);

运行结果:

(base) ut@UT:~/yangbin/cudaLearn/build/6_sum_matrix$ ./sum_matrix 
strating...
Using device 0: NVIDIA GeForce RTX 3080
CPU Execution Time elapsed 0.037380 sec
GPU Execution configuration<<<(128,128),(32,32)>>> Time elapsed 0.000711 sec
Check result success!
GPU Execution configuration<<<(524288,1),(32,1)>>> Time elapsed 0.001062 sec
Check result success!
GPU Execution configuration<<<(128,4096),(32,1)>>> Time elapsed 0.001060 sec
Check result success!

用cpu写一个矩阵计算,然后比对结果,发现我们的运算结果是正确的,用时0.000711秒。

一维网格和一维块

接着我们使用一维网格一维块:

// 1d block and 1d grid
dimx=32;
dim3 block_1(dimx);
dim3 grid_1((nxy-1)/block_1.x+1);
iStart=cpuSecond();
sumMatrix<<<grid_1,block_1>>>(A_dev,B_dev,C_dev,nx*ny ,1);
CHECK(cudaDeviceSynchronize());
iElaps=cpuSecond()-iStart;
printf("GPU Execution configuration<<<(%d,%d),(%d,%d)>>> Time elapsed %f sec\n",
      grid_1.x,grid_1.y,block_1.x,block_1.y,iElaps);
CHECK(cudaMemcpy(C_from_gpu,C_dev,nBytes,cudaMemcpyDeviceToHost));
checkResult(C_host,C_from_gpu,nxy);

运行结果:

(base) ut@UT:~/yangbin/cudaLearn/build/6_sum_matrix$ ./sum_matrix 
strating...
Using device 0: NVIDIA GeForce RTX 3080
CPU Execution Time elapsed 0.037380 sec
GPU Execution configuration<<<(128,128),(32,32)>>> Time elapsed 0.000711 sec
Check result success!
GPU Execution configuration<<<(524288,1),(32,1)>>> Time elapsed 0.001062 sec
Check result success!
GPU Execution configuration<<<(128,4096),(32,1)>>> Time elapsed 0.001060 sec
Check result success!

同样运行结果是正确的。用时0.001062秒。

二维网格和一维块

二维网格一维块:

// 2d block and 1d grid
dimx=32;
dim3 block_2(dimx);
dim3 grid_2((nx-1)/block_2.x+1,ny);
iStart=cpuSecond();
sumMatrix<<<grid_2,block_2>>>(A_dev,B_dev,C_dev,nx,ny);
CHECK(cudaDeviceSynchronize());
iElaps=cpuSecond()-iStart;
printf("GPU Execution configuration<<<(%d,%d),(%d,%d)>>> Time elapsed %f sec\n",
      grid_2.x,grid_2.y,block_2.x,block_2.y,iElaps);
CHECK(cudaMemcpy(C_from_gpu,C_dev,nBytes,cudaMemcpyDeviceToHost));
checkResult(C_host,C_from_gpu,nxy);

运行结果:

(base) ut@UT:~/yangbin/cudaLearn/build/6_sum_matrix$ ./sum_matrix 
strating...
Using device 0: NVIDIA GeForce RTX 3080
CPU Execution Time elapsed 0.037380 sec
GPU Execution configuration<<<(128,128),(32,32)>>> Time elapsed 0.000711 sec
Check result success!
GPU Execution configuration<<<(524288,1),(32,1)>>> Time elapsed 0.001062 sec
Check result success!
GPU Execution configuration<<<(128,4096),(32,1)>>> Time elapsed 0.001060 sec
Check result success!

同样运行结果是正确的。用时0.001060秒。

总结

用不同的线程组织形式会得到正确结果,但是效率有所区别:

线程配置执行时间
CPU单线程0.037380
(128,128),(32,32)0.000711
(524288,1),(32,1)0.001062
(128,4096),(32,1)0.001060

观察结果没有多大差距,但是明显比CPU快了很多,而且最主要的是我们用不同的线程组织模式都得到了正确结果,并且:

  • 改变执行配置(线程组织)能得到不同的性能
  • 传统的核函数可能不能得到最好的效果
  • 一个给定的核函数,通过调整网格和线程块大小可以得到更好的效果

后面学习执行模型,才会深入到硬件层面,追寻影响效率的根本原因。

参考资料