CUDA编程中的线程、线程块和网格持续创作，加速成长！这是我参与「掘金日新计划 · 10 月更文挑战」的第21天，点击查

持续创作，加速成长！这是我参与「掘金日新计划 · 10 月更文挑战」的第21天，点击查看活动详情

前面几篇博客里面由浅到深的逐步给出了pycuda的一个学习过程，包括上一篇博客里面提到的SourceModule，其实除了SourceModule，pycuda还提供了一些模板函数，之前也提过，但是没有进行详细的说明和使用的介绍，所以今天这篇博客，就pycuda提供的一些模板函数进行介绍，主要是学习核函数的编写，因为在pycuda里面，我们想要借助GPU帮我们干活，会将矩阵数据转移到GPU是没有用的，还得告诉GPU这个活怎么干，这也就是核函数的作用。建议学习pycuda的使用的话，还是需要有一些C语言的功底，不然即使学会了套用一些模板，但是针对自己的实际问题时，可能仍然不会编写自己的核函数。

下面从一组代码中进一步学习核函数

import numpy as np
import matplotlib.pyplot as plt
from pycuda import gpuarray
from pycuda.elementwise import ElementwiseKernel
import pycuda.autoinit as autoinit
import os


def iterator(c, r, max_iter):
    z = c
    for iter in range(0, max_iter, 1):
        if abs(z) > r:
            break
        z = z**2+c
    return iter

def mandelbrot(width, height, real_low, real_high, imag_low, imag_high, max_iters, max_bound):
    X = np.linspace(real_low, real_high, width)
    Y = np.linspace(imag_low, imag_high, height)
    real, image = np.meshgrid(X, Y)
    c = real+image*1j
    mandelbrot_set = np.frompyfunc(iterator, 3, 1)(c, 1.5, 256).astype(np.float)
    return mandelbrot_set


def gpu_manelbrot(width, height, real_low, real_high, imag_low, imag_high, max_iters, max_bound):
    realVals = np.matrix(np.linspace(real_low, real_high, width), dtype=np.complex64)
    imagVals = np.matrix(np.linspace(imag_low, imag_high, height), dtype=np.complex64)
    mandelbrotLattrice = np.array(realVals + imagVals.transpose(), dtype=np.complex64)

    gpu_mandelbrot_latttice = gpuarray.to_gpu(mandelbrotLattrice)
    mandelbrot_image = gpuarray.empty(shape=mandelbrotLattrice.shape, dtype=np.float32)
    mandel_kenal(gpu_mandelbrot_latttice, mandelbrot_image, np.int32(max_iters), np.float32(max_bound))
    mandelbrot_graph = mandelbrot_image.get()
    return mandelbrot_graph

# 编写内核函数
mandel_kenal = ElementwiseKernel(
    "pycuda::complex<float> *lattice, float *mandelbrot_image, int max_iters, float max_bound",
    """
    mandelbrot_image[i]=1;
    pycuda::complex<float> c=lattice[i];
    pycuda::complex<float> z(0,0);
    for(int j=0;j<max_iters;j++)
    {
        z=z*z+c;
        if(abs(z)>max_bound)
            {
                mandelbrot_image[i]=0;
                break;
            }
    }
    """,
    "mandel_kernal"
)

if __name__=="__main__":
    import time

    t1 = time.time()
    plt.figure(dpi=600)
    plt.imshow(mandelbrot(512, 512, -2, 2, -2, 2, 256, 3), extent=[-2, 2, -2, 2])
    plt.show()
    t2 = time.time()
    print(t2 - t1)

    t3 = time.time()
    image = gpu_manelbrot(512, 512, -2, 2, -2, 2, 256, 2)
    plt.figure(dpi=600)
    plt.imshow(image, extent=[-2, 2, -2, 2])
    plt.show()
    t4 = time.time()
    print(t4 - t3)

这其中ElementwiseKernel就是pycuda的一个模板函数，在这个模板函数中第一个参数，"pycuda::complex<float> *lattice, float *mandelbrot_image, int max_iters, float max_bound"表示共接收四个变量，第一个是一个复数向量，第二个是一个浮点数类型向量，第三个和第四个都是标量。

mandelbrot_image[i]=1;这个i表示的是线程索引，后面的博客陆续会讲到这个线程、线程块和网格的相关概念，这个地方只需要理解为一个线程索引就可以了，这个是由pycuda自动分配的，不用管。这行代码的意思就是将这个向量里面的值全部置为1.

代码运行结果如下

下面是两组代码分别在CPU和GPU上的运行时间。