Machine Learning for C++ developers - the hard way: DirectML


An introduction to machine learning, with C++ code for a trainable linear regression model.

Get ready for AI. It's not easy, it's not trivial, it's a tough nut to crack. But let's hope we will work it out together. Here's a Microsoft library, DirectML, a low-level way to get you started.

Source

github.com/WindowsNT/D…

Introduction

Many people discuss, especially after the evolution of ChatGPT and other models, the benefits of Machine Learning in Artificial Intelligence. This article tries, from my own amateur view of ML (but otherwise expert view of general low-level programming), to explain some basics and use DirectML for a real training scenario.

Background

This article aims to introduce the hardcore Windows C++ developer, familiar with COM and DirectX, to how machine learning can be done using DirectML. For (way more) simplicity, you would normally use TensorFlow, PyTorch and Python in general, but, sticking to C++ myself, I would like to explore the internals of DirectML.

DirectML is a low-level layer (PyTorch is a layer above it), so do prepare for lots of difficult, dirty stuff.

Machine Learning in general

I strongly recommend Machine Learning for Absolute Beginners. Machine learning in general is the ability of the computer to solve f(x) = y for a given x. X and Y may not be just single variables, but (big) sets of variables. There are three basic types of machine learning:

  • Supervised learning. This means that I have a given set of [x,y] pairs and train a model in order to "learn" from it and calculate y for an unknown x. A simple example is a set of prices for a car model from 1990 to 2024 (x = year, y = price), where we query the model for a possible price in 2025. Another example would be predicting whether a student will pass an exam based on the number of hours they studied, trained on a set of students that each studied a specific number of hours (x) and passed or not (y).
  • Unsupervised learning. This means that the [x,y] set we are trying to learn from is not complete; the outcomes are not fully labeled. An example is antivirus detection, where an AV tries to learn not from definite sets, but from similarity patterns.
  • Reinforcement learning. This is based on the reverse of f(x) = y; that is, for a given y, we are trying to find x's that work. An example is an autonomous driving system which takes for granted that it must not crash (y) and finds all x's that will result in that.

In our example, we will stick to the simplest form, supervised learning.

A "Hello world" of this mode is single-variable linear regression. That is, for a given set of [x,y] pairs, we are trying to create a line f(x) = a + bx such that the line is "close" to all of them, like the following picture:

 

(figure: a regression line fitted through scattered data points, via Spiceworks)

Another Wikipedia example:

(figure: linear regression example, via Wikipedia)

The formulas to calculate A and B for a given [x,y] set are as follows:

  • B = (nΣxy - ΣxΣy) / (nΣx^2 - (Σx)^2)
  • A = (Σy - BΣx) / n

So, the more [x,y] pairs we have, the more likely it is that our model will give a good answer for an unknown x. That's what "training" is.

Of course, this is very old; even my Casio FX 100C has an "LR" mode to input this. However, it's time to discuss GPU cores and how they can perform here.

GPU versus CPU

GPU cores are a lot of mini-CPUs; that is, they are capable of doing simple math stuff, like addition, multiplication, trigonometry, logarithms etc. If you take a look at HLSL/GLSL code, for example this ShaderToy Grayscale, you will see that a simple algorithm with pow, sqrt, dot etc. is executed, at a 1920x1080 resolution, 2073600 times for an image, or 62208000 times per second for a 30fps video. My RTX 4060 has 3072 such "CPUs".

Therefore, it is of great importance to let these cores execute simple but massive math operations, way faster than our CPU can.

Linear Regression in the CPU

It's of course trivial to compute the y = A + Bx formula in CPU C++ code. Given two arrays xs and ys which contain N elements of (x,y) pairs, then:

void LinearRegressionCPU(float* px,float* py,size_t n)
{
    auto beg = GetTickCount64();
    
    float Sx = 0, Sy = 0,Sxy = 0,Sx2 = 0;
    for (size_t i = 0; i < n; i++)
    {
        Sx += px[i];
        Sx2 += px[i] * px[i];
        Sy += py[i];
        Sxy += px[i] * py[i];
    }
    
    float B = (n * Sxy - Sx * Sy) / ((n * Sx2) - (Sx * Sx));
    float A = (Sy - (B * Sx)) / n;
    auto end = GetTickCount64();

    printf("Linear Regression CPU:\r\nSx = %f\r\nSy = %f\r\nSxy = %f\r\nSx2 = %f\r\nA = %f\r\nB = %f\r\n%zi ticks\r\n\r\n", Sx,Sy,Sxy,Sx2,A,B,end - beg);  
}

Now your hell starts. The same result will be achieved on the GPU using DirectML, with LOTS of additional and really complex code. Isn't machine learning wonderful?

Tensors

A tensor is a generalization of a matrix, which is a generalization of a vector, which is a generalization of a number. That is, a number is x, a vector is [x y z], a 2x2 matrix is a table that has 2 rows and 2 columns and a tensor can have any number of dimensions.

DirectML can "upload" tensor data to our GPU, "execute" a set of operators (math functions) and "return" the result to the CPU.

DirectML and DirectMLX

DirectML is a low-level DirectX 12 API capable of manipulating tensors on the GPU. Start with the MSDN docs. We will go through it step by step with three operations in our code: an abs, an add, and the linear regression.

DirectMLX is a header-only helper collection that allows you to build graphs easily. Remember DirectShow or Direct2D filter graphs? A graph describes inputs and outputs and which operator is applied between them. We will create three graphs with the aid of DirectMLX.

The list of DirectML structures indicates what operators you have to execute on tensors.

I've started with HelloDirectML and expanded it for a real linear regression training.

Starting our journey

Initialize DirectX 12. That is, enumerate the DXGI adapters, create a DirectX 12 device, and create its Command Allocator, Command Queue and Command List interfaces:


    HRESULT InitializeDirect3D12()
    {
        CComPtr<ID3D12Debug> d3D12Debug;

        // Throws if the D3D12 debug layer is missing - you must install the Graphics Tools optional feature
#if defined (_DEBUG)
        THROW_IF_FAILED(D3D12GetDebugInterface(IID_PPV_ARGS(&d3D12Debug)));
        d3D12Debug->EnableDebugLayer();
#endif

        CComPtr<IDXGIFactory4> dxgiFactory;
        CreateDXGIFactory1(IID_PPV_ARGS(&dxgiFactory));

        CComPtr<IDXGIAdapter> dxgiAdapter;
        UINT adapterIndex{};
        HRESULT hr{};
        do
        {
            dxgiAdapter = nullptr;
            THROW_IF_FAILED(dxgiFactory->EnumAdapters(adapterIndex, &dxgiAdapter));
            ++adapterIndex;

            d3D12Device = 0;
            hr = ::D3D12CreateDevice(
                dxgiAdapter,
                D3D_FEATURE_LEVEL_11_0,
                IID_PPV_ARGS(&d3D12Device));
            if (hr == DXGI_ERROR_UNSUPPORTED) continue;
            THROW_IF_FAILED(hr);
        } while (hr != S_OK);

        D3D12_COMMAND_QUEUE_DESC commandQueueDesc{};
        commandQueueDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
        commandQueueDesc.Flags = D3D12_COMMAND_QUEUE_FLAG_NONE;

        commandQueue = 0;
        THROW_IF_FAILED(d3D12Device->CreateCommandQueue(
            &commandQueueDesc,
            IID_PPV_ARGS(&commandQueue)));

        commandAllocator = 0;
        THROW_IF_FAILED(d3D12Device->CreateCommandAllocator(
            D3D12_COMMAND_LIST_TYPE_DIRECT,
            IID_PPV_ARGS(&commandAllocator)));

        commandList = 0;
        THROW_IF_FAILED(d3D12Device->CreateCommandList(
            0,
            D3D12_COMMAND_LIST_TYPE_DIRECT,
            commandAllocator,
            nullptr,
            IID_PPV_ARGS(&commandList)));

        return S_OK;
    }

Initialize DirectML with DMLCreateDevice. In debug mode, you can use DML_CREATE_DEVICE_FLAG_DEBUG.

 

Creating the DirectML operator graph

An operator graph describes which operator operates on which tensor. In our code, depending on the Method defined, we have three sets.

  1. The "abs" operator, which applies the abs() function to each element of the tensor:
auto CreateCompiledOperatorAbs(std::initializer_list<UINT32> j,UINT64* ts = 0)
{
    dml::Graph graph(dmlDevice);
    dml::TensorDesc desc = { DML_TENSOR_DATA_TYPE_FLOAT32, j };
    dml::Expression input1 = dml::InputTensor(graph, 0, desc);
    dml::Expression output = dml::Abs(input1);

    if (ts)
        *ts = desc.totalTensorSizeInBytes;
    return graph.Compile(DML_EXECUTION_FLAG_ALLOW_HALF_PRECISION_COMPUTATION, { output });
}

The 'j' variable is, for example, {2,2} to create a 2x2 tensor of float32, that is, 4 floats, 16 bytes. We have 1 input tensor with that description, and the output tensor is created by applying the dml::Abs() function. DirectMLX simplifies creating those operators.

In addition, we return the total input tensor size in bytes, so we know how big our buffer will later be. The last line compiles the graph and returns an IDMLCompiledOperator.

 

  2. The "add" operator takes 2 inputs and produces 1 output, so:
auto CreateCompiledOperatorAdd(std::initializer_list<UINT32> j, UINT64* ts = 0)
{
    dml::Graph graph(dmlDevice);

    auto desc1 = dml::TensorDesc(DML_TENSOR_DATA_TYPE_FLOAT32, j);
    auto input1 = dml::InputTensor(graph, 0, desc1);
    auto desc2 = dml::TensorDesc(DML_TENSOR_DATA_TYPE_FLOAT32, j);
    auto input2 = dml::InputTensor(graph, 1, desc2);

    auto output = dml::Add(input1,input2);
    if (ts)
        *ts = desc1.totalTensorSizeInBytes + desc2.totalTensorSizeInBytes;
    return graph.Compile(DML_EXECUTION_FLAG_ALLOW_HALF_PRECISION_COMPUTATION, { output });
}

I also call this with a {2,2} tensor, so we have two input tensors and now need 32 bytes (to be returned in *ts). We use dml::Add to create the output.

 

  3. The linear regression operator is more complex.


auto CreateCompiledOperatorLinearRegression(UINT32 N, UINT64* ts = 0)
{
    dml::Graph graph(dmlDevice);

    auto desc1 = dml::TensorDesc(DML_TENSOR_DATA_TYPE_FLOAT32, { 1,N });
    auto desc2 = dml::TensorDesc(DML_TENSOR_DATA_TYPE_FLOAT32, { 1,N });
    auto input1 = dml::InputTensor(graph, 0, desc1);
    auto input2 = dml::InputTensor(graph, 1, desc2);

    // Create first output tensor, calculate Sx by adding all first row of the tensor and going to the output tensor (in which , we will only take the last element as the sum)
    auto o1 = dml::CumulativeSummation(input1, 1, DML_AXIS_DIRECTION_INCREASING, false);

    // Sy, similarily
    auto o2 = dml::CumulativeSummation(input2, 1, DML_AXIS_DIRECTION_INCREASING, false);

    // xy, we calculate multiplication
    auto o3 = dml::Multiply(input1, input2);

    // Sxy
    auto o4 = dml::CumulativeSummation(o3, 1, DML_AXIS_DIRECTION_INCREASING, false);

    // x*x, we calculate multiplication
    auto o5 = dml::Multiply(input1, input1);

    // Sx2
    auto o6 = dml::CumulativeSummation(o5, 1, DML_AXIS_DIRECTION_INCREASING, false);

    auto d1 = desc1.totalTensorSizeInBytes;
    while (d1 % DML_MINIMUM_BUFFER_TENSOR_ALIGNMENT)
        d1++;
    auto d2 = desc2.totalTensorSizeInBytes;
    while (d2 % DML_MINIMUM_BUFFER_TENSOR_ALIGNMENT)
        d2++;

    if (ts)
        *ts = d1 + d2;
    return graph.Compile(DML_EXECUTION_FLAG_ALLOW_HALF_PRECISION_COMPUTATION, { o1,o2,o3,o4,o5,o6 });
}

We have 2 input tensors for the linear regression: a [1xN] tensor with the x values and a [1xN] tensor with the y values.

Now we have 6 output tensors:

  • One for the Σx
  • One for the Σy
  • One for the xy
  • One for the Σxy
  • One for the x^2
  • One for the Σx^2

For the sums, we use the CumulativeSummation operator which sums all the values in a tensor's axis to another tensor.

Also, we have to take care of alignment, because DirectML buffers have to be aligned to DML_MINIMUM_BUFFER_TENSOR_ALIGNMENT.

 

Loading our data

Tensors can have padding and stride, but in our examples, tensors are packed (no padding, no stride). So in the case of Abs or Add, we simply create an inputTensorElementArray vector of 4 floats. In the case of the linear regression, we load it from a six-pair [x,y] set:

std::vector<float> xs = { 10,15,20,25,30,35 };
std::vector<float> ys = { 1003,1005,1010,1008,1014,1022 };
size_t N = xs.size();

However, you can call RandomData() instead, and this will fill these buffers with 32MB of random floats.

Creating the initializer

In DirectML, an operator "initializer" must be called and configured once. 

IDMLCompiledOperator* dmlCompiledOperators[] = { dmlCompiledOperator };
THROW_IF_FAILED(dmlDevice->CreateOperatorInitializer(
    ARRAYSIZE(dmlCompiledOperators),
    dmlCompiledOperators,
    IID_PPV_ARGS(&dmlOperatorInitializer)));

Binding the initializer

Binding in DirectML simply selects which parts of the buffers are assigned to which tensors. For example, if you have a 32-byte input buffer, you may have 2 input tensors, one at bytes 0-15 and the other at bytes 16-31.

In our Abs example, an input tensor is 16 bytes ( 4 floats) and the output tensor is also 16 bytes (4 floats).

In our Add example, 32 bytes for input, 16 for the first tensor and 16 for the second, and 16 bytes for output.

In our Linear Regression example, if we have 5 sets of (x,y), we need 2 tensors 5 floats each (one for x, one for y) and 6 tensors 5 floats each to hold our sum results as discussed above. Mapping our input and output buffers to tensors is done with DirectML binding.

So first we create a heap:

void CreateHeap()
{
    // You need to initialize an operator exactly once before it can be executed, and
    // the two stages require different numbers of descriptors for binding. For simplicity,
    // we create a single descriptor heap that's large enough to satisfy them both.
    initializeBindingProperties = dmlOperatorInitializer->GetBindingProperties();
    executeBindingProperties = dmlCompiledOperator->GetBindingProperties();
    descriptorCount = std::max(
        initializeBindingProperties.RequiredDescriptorCount,
        executeBindingProperties.RequiredDescriptorCount);

    // Create descriptor heaps.

    D3D12_DESCRIPTOR_HEAP_DESC descriptorHeapDesc{};
    descriptorHeapDesc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
    descriptorHeapDesc.NumDescriptors = descriptorCount;
    descriptorHeapDesc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
    THROW_IF_FAILED(d3D12Device->CreateDescriptorHeap(
        &descriptorHeapDesc,
        IID_PPV_ARGS(&descriptorHeap)));

    // Set the descriptor heap(s).
    SetDescriptorHeaps();
}

Then we create  a binding table on it:

DML_BINDING_TABLE_DESC dmlBindingTableDesc{};
CComPtr<IDMLBindingTable> dmlBindingTable;
void CreateBindingTable()
{
    dmlBindingTableDesc.Dispatchable = dmlOperatorInitializer;
    dmlBindingTableDesc.CPUDescriptorHandle = descriptorHeap->GetCPUDescriptorHandleForHeapStart();
    dmlBindingTableDesc.GPUDescriptorHandle = descriptorHeap->GetGPUDescriptorHandleForHeapStart();
    dmlBindingTableDesc.SizeInDescriptors = descriptorCount;

    THROW_IF_FAILED(dmlDevice->CreateBindingTable(
        &dmlBindingTableDesc,
        IID_PPV_ARGS(&dmlBindingTable)));

}

Sometimes DirectML needs additional temporary or persistent memory. We check:

temporaryResourceSize = std::max(initializeBindingProperties.TemporaryResourceSize, executeBindingProperties.TemporaryResourceSize);

If this is not zero, we create more temporary memory for DirectML:

auto x1 = CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT);
auto x2 = CD3DX12_RESOURCE_DESC::Buffer(temporaryResourceSize, D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS);
THROW_IF_FAILED(d3D12Device->CreateCommittedResource(
    &x1,
    D3D12_HEAP_FLAG_NONE,
    &x2,
    D3D12_RESOURCE_STATE_COMMON,
    nullptr,
    IID_PPV_ARGS(&temporaryBuffer)));

RebindTemporary();

The same happens for "persistent resources":

persistentResourceSize = std::max(initializeBindingProperties.PersistentResourceSize, executeBindingProperties.PersistentResourceSize);

Now we need a command recorder:

dmlDevice->CreateCommandRecorder(
    IID_PPV_ARGS(&dmlCommandRecorder));

to "record" our initializer to a DirectX 12 command list:

dmlCommandRecorder->RecordDispatch(commandList, dmlOperatorInitializer, dmlBindingTable);

And then we close and "execute" the list:

void CloseExecuteResetWait()
{
    THROW_IF_FAILED(commandList->Close());

    ID3D12CommandList* commandLists[] = { commandList };
    commandQueue->ExecuteCommandLists(ARRAYSIZE(commandLists), commandLists);

    CComPtr<ID3D12Fence> d3D12Fence;
    THROW_IF_FAILED(d3D12Device->CreateFence(
        0,
        D3D12_FENCE_FLAG_NONE,
        IID_PPV_ARGS(&d3D12Fence)));

    auto hfenceEventHandle = ::CreateEvent(nullptr, true, false, nullptr);

    THROW_IF_FAILED(commandQueue->Signal(d3D12Fence, 1));
    THROW_IF_FAILED(d3D12Fence->SetEventOnCompletion(1, hfenceEventHandle));

    ::WaitForSingleObjectEx(hfenceEventHandle, INFINITE, FALSE);

    THROW_IF_FAILED(commandAllocator->Reset());
    THROW_IF_FAILED(commandList->Reset(commandAllocator, nullptr));
    CloseHandle(hfenceEventHandle);
}

After this function completes, our "Initializer" is done and need not be called again.

Binding the operator

We now "reset" the binding table with the operator instead of the initializer:

        dmlBindingTableDesc.Dispatchable = dmlCompiledOperator;
        THROW_IF_FAILED(dmlBindingTable->Reset(&dmlBindingTableDesc));

We will rebind the temporary and persistent memory, if needed:

ml.RebindTemporary();
ml.RebindPersistent();

 

Binding the input

We will bind only one input buffer, with an accumulated byte size 'tensorInputSize' covering all the input tensors:

CComPtr<ID3D12Resource> uploadBuffer;
CComPtr<ID3D12Resource> inputBuffer;

    auto x1 = CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_UPLOAD);
    auto x2 = CD3DX12_RESOURCE_DESC::Buffer(tensorInputSize);
    THROW_IF_FAILED(ml.d3D12Device->CreateCommittedResource(
        &x1,
        D3D12_HEAP_FLAG_NONE,
        &x2,
        D3D12_RESOURCE_STATE_GENERIC_READ,
        nullptr,
        IID_PPV_ARGS(&uploadBuffer)));
    auto x3 = CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT);
    auto x4 = CD3DX12_RESOURCE_DESC::Buffer(tensorInputSize, D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS);
    THROW_IF_FAILED(ml.d3D12Device->CreateCommittedResource(
        &x3,
        D3D12_HEAP_FLAG_NONE,
        &x4,
        D3D12_RESOURCE_STATE_COPY_DEST,
        nullptr,
        IID_PPV_ARGS(&inputBuffer)));

And now upload the data to the GPU:

    D3D12_SUBRESOURCE_DATA tensorSubresourceData{};
    tensorSubresourceData.pData = inputTensorElementArray.data();
    tensorSubresourceData.RowPitch = static_cast<LONG_PTR>(tensorInputSize);
    tensorSubresourceData.SlicePitch = tensorSubresourceData.RowPitch;
    ::UpdateSubresources(ml.commandList,inputBuffer,uploadBuffer,0,0,1,&tensorSubresourceData);
    auto x9 = CD3DX12_RESOURCE_BARRIER::Transition(inputBuffer,D3D12_RESOURCE_STATE_COPY_DEST,D3D12_RESOURCE_STATE_UNORDERED_ACCESS);
    ml.commandList->ResourceBarrier( 1,&x9);

For more on resource barriers, see this.

Binding the input tensors

For our "abs" method, there is only one input tensor to bind:

DML_BUFFER_BINDING inputBufferBinding{ inputBuffer, 0, tensorInputSize };
DML_BINDING_DESC inputBindingDesc{ DML_BINDING_TYPE_BUFFER, &inputBufferBinding };
ml.dmlBindingTable->BindInputs(1, &inputBindingDesc);

For the "add" and "linear regression" methods, there are two input tensors:

if (Method == 2 || Method == 3)
{
    // split the input buffer to half to add two tensors
    DML_BUFFER_BINDING inputBufferBinding[2] = {};
    inputBufferBinding[0].Buffer = inputBuffer;
    inputBufferBinding[0].Offset = 0;
    inputBufferBinding[0].SizeInBytes = tensorInputSize / 2;
    inputBufferBinding[1].Buffer = inputBuffer;
    inputBufferBinding[1].Offset = tensorInputSize /2;
    inputBufferBinding[1].SizeInBytes = tensorInputSize / 2;

    DML_BINDING_DESC inputBindingDesc[2] = {};
    inputBindingDesc[0].Type = DML_BINDING_TYPE_BUFFER;
    inputBindingDesc[0].Desc = &inputBufferBinding[0];
    inputBindingDesc[1].Type = DML_BINDING_TYPE_BUFFER;
    inputBindingDesc[1].Desc = &inputBufferBinding[1];

    ml.dmlBindingTable->BindInputs(2, inputBindingDesc);
}

As you see, we "split" the input buffer in half.

Binding the output tensors

For "Abs" or "Add", we only have one output buffer, equal in size to the input:

CComPtr<ID3D12Resource> outputBuffer;
if (Method == 1)
{
    auto x5 = CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT);
    auto x6 = CD3DX12_RESOURCE_DESC::Buffer(tensorOutputSize, D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS);
    THROW_IF_FAILED(ml.d3D12Device->CreateCommittedResource(
        &x5,
        D3D12_HEAP_FLAG_NONE,
        &x6,
        D3D12_RESOURCE_STATE_UNORDERED_ACCESS,
        nullptr,
        IID_PPV_ARGS(&outputBuffer)));

    DML_BUFFER_BINDING outputBufferBinding{ outputBuffer, 0, tensorOutputSize };
    DML_BINDING_DESC outputBindingDesc{ DML_BINDING_TYPE_BUFFER, &outputBufferBinding };
    ml.dmlBindingTable->BindOutputs(1, &outputBindingDesc);
}

For "Linear Regression", we have 6 output tensors, so we split the buffer into parts. We saved the tensor sizes earlier:


 auto x5 = CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT);
 auto x6 = CD3DX12_RESOURCE_DESC::Buffer(tensorOutputSize, D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS);
 THROW_IF_FAILED(ml.d3D12Device->CreateCommittedResource(
     &x5,
     D3D12_HEAP_FLAG_NONE,
     &x6,
     D3D12_RESOURCE_STATE_UNORDERED_ACCESS,
     nullptr,
     IID_PPV_ARGS(&outputBuffer)));

 DML_BUFFER_BINDING outputBufferBinding[6] = {};

 outputBufferBinding[0].Buffer = outputBuffer;
 outputBufferBinding[0].Offset = 0;
 outputBufferBinding[0].SizeInBytes = Method3TensorSizes[0]; // Buffer 1 is Sx, we want N floats (in which only the final we are interested in), also aligned to DML_MINIMUM_BUFFER_TENSOR_ALIGNMENT

 outputBufferBinding[1].Buffer = outputBuffer;
 outputBufferBinding[1].Offset = Method3TensorSizes[0];
 outputBufferBinding[1].SizeInBytes = Method3TensorSizes[1]; // Same for Sy

 outputBufferBinding[2].Buffer = outputBuffer;
 outputBufferBinding[2].Offset = Method3TensorSizes[0] + Method3TensorSizes[1];
 outputBufferBinding[2].SizeInBytes = Method3TensorSizes[2]; // Same for xy

 outputBufferBinding[3].Buffer = outputBuffer;
 outputBufferBinding[3].Offset = Method3TensorSizes[0] + Method3TensorSizes[1] + Method3TensorSizes[2];
 outputBufferBinding[3].SizeInBytes = Method3TensorSizes[3]; // Same for Sxy

 outputBufferBinding[4].Buffer = outputBuffer;
 outputBufferBinding[4].Offset = Method3TensorSizes[0] + Method3TensorSizes[1] + Method3TensorSizes[2] + Method3TensorSizes[3];
 outputBufferBinding[4].SizeInBytes = Method3TensorSizes[4]; // Same for xx

 outputBufferBinding[5].Buffer = outputBuffer;
 outputBufferBinding[5].Offset = Method3TensorSizes[0] + Method3TensorSizes[1] + Method3TensorSizes[2] + Method3TensorSizes[3] + Method3TensorSizes[4];
 outputBufferBinding[5].SizeInBytes = Method3TensorSizes[5]; // Same for Sxx

 DML_BINDING_DESC od[6] = {};
 od[0].Type = DML_BINDING_TYPE_BUFFER;
 od[0].Desc = &outputBufferBinding[0];
 od[1].Type = DML_BINDING_TYPE_BUFFER;
 od[1].Desc = &outputBufferBinding[1];
 od[2].Type = DML_BINDING_TYPE_BUFFER;
 od[2].Desc = &outputBufferBinding[2];
 od[3].Type = DML_BINDING_TYPE_BUFFER;
 od[3].Desc = &outputBufferBinding[3];
 od[4].Type = DML_BINDING_TYPE_BUFFER;
 od[4].Desc = &outputBufferBinding[4];
 od[5].Type = DML_BINDING_TYPE_BUFFER;
 od[5].Desc = &outputBufferBinding[5];

ml.dmlBindingTable->BindOutputs(6, od);

Ready!

We "record" as previously, but now not the initializer, but the compiled operator:

dmlCommandRecorder->RecordDispatch(commandList, dmlCompiledOperator, dmlBindingTable);

And then we close and execute the command list as earlier with the CloseExecuteResetWait() function.

Read it back

We want to get the data back from the GPU, so we'd use ID3D12Resource mapping to read it back:


// The output buffer now contains the result of the operator,
// so read it back if you want the CPU to access it.
CComPtr<ID3D12Resource> readbackBuffer;
auto x7 = CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_READBACK);
auto x8 = CD3DX12_RESOURCE_DESC::Buffer(tensorOutputSize);
THROW_IF_FAILED(ml.d3D12Device->CreateCommittedResource(
    &x7,
    D3D12_HEAP_FLAG_NONE,
    &x8,
    D3D12_RESOURCE_STATE_COPY_DEST,
    nullptr,
    IID_PPV_ARGS(&readbackBuffer)));

auto x10 = CD3DX12_RESOURCE_BARRIER::Transition(outputBuffer,D3D12_RESOURCE_STATE_UNORDERED_ACCESS,D3D12_RESOURCE_STATE_COPY_SOURCE);
ml.commandList->ResourceBarrier(1,&x10);

ml.commandList->CopyResource(readbackBuffer, outputBuffer);

ml.CloseExecuteResetWait();

D3D12_RANGE tensorBufferRange{ 0, static_cast<SIZE_T>(tensorOutputSize) };
FLOAT* outputBufferData{};
THROW_IF_FAILED(readbackBuffer->Map(0, &tensorBufferRange, reinterpret_cast<void**>(&outputBufferData)));

outputBufferData is now a pointer to the output buffer. For our linear regression, we know where to take the data from:


float Sx = 0, Sy = 0, Sxy = 0, Sx2 = 0;
if (Method == 3)
{
    // Output 1, 
    char* o = (char*)outputBufferData;
    Sx = outputBufferData[N - 1];

    o += Method3TensorSizes[0];
    outputBufferData = (float*)o;
    Sy = outputBufferData[N - 1];

    o += Method3TensorSizes[1];
    outputBufferData = (float*)o;

    o += Method3TensorSizes[2];
    outputBufferData = (float*)o;
    Sxy = outputBufferData[N - 1];

    o += Method3TensorSizes[3];
    outputBufferData = (float*)o;

    o += Method3TensorSizes[4];
    outputBufferData = (float*)o;
    Sx2 = outputBufferData[N - 1];
}

We need the last element of tensors 1, 2, 4 and 6 (tensors 3 and 5 were used for intermediate calculations).

And finally:

    float B = (N * Sxy - Sx * Sy) / ((N * Sx2) - (Sx * Sx));
    float A = (Sy - (B * Sx)) / N;
    
    // don't forget to unmap!
    D3D12_RANGE emptyRange{ 0, 0 };
    readbackBuffer->Unmap(0, &emptyRange);

 

Now if you run the app with method 3:

Linear Regression CPU:
Sx = 135.000000
Sy = 6062.000000
Sxy = 136695.000000
Sx2 = 3475.000000
A = 994.904785
B = 0.685714

Linear Regression GPU:
Sx = 135.000000
Sy = 6062.000000
Sxy = 136695.000000
Sx2 = 3475.000000
A = 994.904785
B = 0.685714

 

Phew!

Yes, it is hard. Yes, it's Machine Learning. Yes, it's AI. Don't fall for it; it's difficult. You need to study a lot to get it working.

And then of course, you have to define your own f(x) = y problem with LOTS of variables, to "train" models with lots of operators and tensors.

But I hope I've fired the starting pistol for you.

GOOD LUCK.