通用着色器内部结构


7 Common Shader Internals

7 通用着色器内部结构

Full details of the Shader models for each shader stage are provided in dedicated sections elsewhere in the spec. What follows is a discussion of a few general items (not an exhaustive list) that are common to all of the Shader models.

每个着色器阶段的着色器模型的完整细节都在规范的专门部分中提供。接下来是对一些通用项目(并非详尽无遗的列表)的讨论,这些项目是所有着色器模型的共有的。

 

7.1 Instruction Counts

7.1 指令数量

There are no limits on total shader program length or execution time (accounting for loops and subroutines), aside from any limitations in what may be expressed in the shader token format. Clearly longer programs will degrade in performance, but D3D11.3 currently does not specify how steeply performance will degrade relative to program length or execution time given that there are so many variables that might affect performance.
除了对着色器token 格式表示的内容有一些限制外,对整个着色器程序长度或执行时间(包括循环和子例程)没有限制。显然,较长的程序会降低性能,但 D3D11.3 目前没有指定性能下降有多严重,相对于程序长度或执行时间,因为有太多的变量可能会影响性能。

 

7.2 Common Instruction Set

7.2 通用指令集

Aside from a few exceptions, the instruction set is identical for all the shader stages. The exceptions are confined to instructions that only make sense in a given Shader unit.

除了少数例外情况外,所有着色器阶段的指令集都是相同的。例外情况仅限于只在给定的着色器单元中有意义的指令。

 

For example, the sample instruction computes LOD based on derivatives, so sample and sample_b (sample with LOD bias) are only relevant in the Pixel Shader where derivatives are present, while sample_l (sample at selected LOD) and sample_d (sample with application-provided derivatives) are available in all stages.

例如,sample 指令基于导数计算 LOD,因此sample 和sample_b(具有 LOD 偏差的sample)仅与存在导数的像素着色器中相关,而 sample_l(选定 LOD 的采样)和 sample_d(具有应用程序提供的导数的采样)在所有阶段都可用。

 

7.3 Temporary Storage

7.3 临时存储

Temporary storage is composed of a single Element type, which is a 4-tuple of untyped 32-bit quantities. Temporary storage consists of two classes of storage: registers, which are non-indexed single elements; and arrays, which are indexable 1D arrays of elements. Temporary storage is read/write, and is uninitialized at the start of a Shader execution instance.

临时存储由单个元素类型组成,该类型是非类型化 32 位数量的 4 元组。临时存储由两类存储组成:

寄存器,它们是非索引的单个元素; 

数组,它们是元素的可索引一维数组。

临时存储是读/写的,在着色器执行实例开始时未初始化。

 

Reads of temporary storage that has not been previously written within a Shader execution instance return undefined values, but cannot return data outside of the address space of the device context.

在着色器执行实例内部,对于尚未在临时存储中写入的数据,读取操作会返回未定义的值,但不能返回设备上下文地址空间之外的数据。

 

Temporary registers are declared(22.3.35) as r#, and can be used as a temporary operand in D3D11.3 instructions.

临时寄存器声明 (22.3.35) 为 r#,可用作 D3D11.3 指令中的临时操作数。

Here # is the register number; r0, r1 and r2 are all temporary registers. For example: mov o0.xyz, r0.xyzx

 

Temporary arrays are declared(22.3.36) as x#[n], where “n” is the array length (indexed with 0..n-1). Temporary arrays must be indexed by an r# scalar, a statically indexed x# scalar, and/or an optional immediate constant (literal), and can have only one level of index nesting (e.g. x0[x1[r0.x+1].x+1] is not legal, but x0[x1[1].x+1] is legal). A temporary array reference, x#[?], can be used as a temporary operand in D3D11.3 instructions (i.e. anywhere an r# can be used). Out of bounds access to x#[?] is undefined, except that data outside the GPU process context is never visible.

临时数组声明 (22.3.36) 为 x#[n],其中“n”是数组长度(以 0..n-1 为索引)。临时数组必须由 r# 标量、静态索引的 x# 标量和/或可选的立即数常量(文字)进行索引,并且只能有一个级别的索引嵌套(例如x0[x1[r0.x+1].x+1] 不合法,但 x0[x1[1].x+1] 合法)。临时数组引用 x#[?] 可用作 D3D11.3 指令中的临时操作数(即 r# 的任何地方)。对 x#[?] 的越界访问是未定义的,但 GPU 进程上下文之外的数据永远不会可见。

Here # is the array number and n is the array length; x0[3], for example, declares a temporary array of length 3. For example: mul r0.xz, x0[r0.w].xww, float4(0.5f,0,0.1f,0)

 

The total quantity of temporary storage per Shader execution instance is 4096 elements, which can be utilized in any combination of registers and arrays. i.e. the total number of r# and x# declared must be <= 4096.

每个 Shader 执行实例的临时存储总量为 4096 个元素,可用于寄存器和数组的任意组合。即声明的 r# 和 x# 总数必须 <= 4096。

 

Note that the namespace for r# and x# (the #) are independent. e.g. Suppose r2 and x2[5] are declared. They are independent, but together both count as 6 units of storage against the limit of 4096 temporary registers.

请注意,r# 和 x# (#) 的命名空间是独立的。例如,假设声明了 r2 和 x2[5]。它们是独立的,但两者加在一起算作 6 个存储单元,而临时寄存器的限制为 4096 个。
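The accounting rule above can be sketched on the CPU. This is only an illustration in C++: the `TempDeclarations` struct and both function names are hypothetical and not part of any D3D API; the 4096 limit and the "r2 plus x2[5] counts as 6" example come from the text.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Limit stated in the text: 4096 elements per shader execution instance.
constexpr std::uint32_t kMaxTempElements = 4096;

// Hypothetical summary of a shader's temporary storage declarations.
struct TempDeclarations {
    std::uint32_t registerCount = 0;          // number of r# registers declared
    std::vector<std::uint32_t> arrayLengths;  // n for each declared x#[n]
};

// Each r# costs one element; each x#[n] costs n elements.
std::uint32_t totalTempElements(const TempDeclarations& d) {
    return std::accumulate(d.arrayLengths.begin(), d.arrayLengths.end(),
                           d.registerCount);
}

// True when the declarations stay within the 4096-element limit.
bool fitsTempLimit(const TempDeclarations& d) {
    return totalTempElements(d) <= kMaxTempElements;
}
```

With one register and one array of length 5, `totalTempElements` returns 6, matching the r2 / x2[5] example.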

 

To provide a run-time stack, a program allocates a temporary array of a fixed size. The program should provide its own stack bounds checking, e.g., skip calls if the stack push would exceed the array bounds.

为了提供运行时堆栈,程序会分配一个固定大小的临时数组。程序应提供自己的堆栈边界检查,例如,如果堆栈推送超过数组边界,则跳过调用。
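A minimal sketch of such a bounds-checked stack, written as a C++ emulation rather than shader code. `TempArrayStack` is a hypothetical name; `slots` stands in for a temporary array x#[N] and `top` for a stack pointer kept in an r# component.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical emulation of a run-time stack stored in a temporary array.
template <std::size_t N>
struct TempArrayStack {
    std::array<std::uint32_t, N> slots{};  // stands in for x#[0..N-1]
    std::size_t top = 0;                   // stands in for the stack pointer

    // Returns false and leaves the stack untouched when the push would
    // exceed the array bounds, so the caller can skip the call.
    bool push(std::uint32_t value) {
        if (top >= N) return false;
        slots[top++] = value;
        return true;
    }

    // Returns false when there is nothing to pop.
    bool pop(std::uint32_t& value) {
        if (top == 0) return false;
        value = slots[--top];
        return true;
    }
};
```

Checking before the write matters here because out of bounds access to x#[?] is undefined, as noted earlier in this section.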

 

There is no limit on the total number of times temp registers (the same one or different ones) can appear in a single instruction or in a shader.

临时寄存器(相同的寄存器或不同的寄存器)可以在单个指令或着色器中出现的总次数没有限制。

 

7.4 Immediate Constants

7.4 立即数常量

For any instruction source argument that is capable of taking a temporary register, it is also permitted to supply a 32-bit immediate scalar or a 32-bit immediate 4-vector in the Shader code. At most one source operand per instruction may be specified using an immediate value (having up to 4 components). Immediate scalar values used in indexing of registers can only be used once per indexed operand in an instruction, but these immediate values do not count against the limit of one immediate as a raw source operand. e.g. "add r0, v[1 + r0.x], float4(1.0f,2.0f,3.0f,4.0f)" is valid, since there is only one immediate source operand present (the float4), with the value 1 in the indexing of v[] not counting against the limit.

对于任何能够采用临时寄存器的指令源参数,还允许在着色器代码中提供32位立即数标量或32位立即数4维向量。每条指令最多只能使用立即数值(最多 4 个组件)指定一个源操作数。寄存器索引中使用的立即数标量值只能在指令中每个索引操作数使用一次,但这些立即数值不计入原始源操作数的一个立即值的限制。例如:“add r0, v[1 + r0.x], float4(1.0f,2.0f,3.0f,4.0f)”是有效的,因为只有一个立即数的源操作数(float4),v[] 索引中的值 1 不计入限制。

 

If a source operand is a Constant Buffer reference (see Constant Buffers below), the reference to a Constant Buffer DOES count against the same limit as immediate values. This allows implementations to provide immediate values through the same hardware path as Constant Buffers if desired. e.g. "add r0, cb0[r1.x], float4(1.0f,2.0f,3.0f,4.0f)" is invalid, since both an immediate value and a Constant Buffer read are used in the same instruction.

如果源操作数是常量缓冲区引用(请参阅下面的常量缓冲区),则对常量缓冲区的引用与立即数值的限制相同。这允许实现在需要时通过与常量缓冲区相同的硬件路径提供立即数值。例如:“add r0,cb0[r1.x],float4(1.0f,2.0f,3.0f,4.0f)”是无效的,因为在同一条指令中同时使用了立即数值和常量缓冲区读取。
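The two examples above can be checked with a small validator. This is a sketch under assumptions: `SourceKind` and `satisfiesImmediateLimit` are hypothetical names, and the rule encoded is only what the examples state (at most one immediate source operand, and no immediate alongside a Constant Buffer read). Several Constant Buffer reads without an immediate are accepted here, on the assumption that this is what the "no limit on constant buffer reads" statement in 7.5 implies.

```cpp
#include <vector>

// Hypothetical classification of an instruction's source operands.
enum class SourceKind { Register, Immediate, ConstantBuffer };

// Immediates that appear only inside an index expression, such as the 1 in
// v[1 + r0.x], are not source operands and are therefore not listed.
bool satisfiesImmediateLimit(const std::vector<SourceKind>& sources) {
    int immediates = 0;
    int constantBuffers = 0;
    for (SourceKind kind : sources) {
        if (kind == SourceKind::Immediate) ++immediates;
        if (kind == SourceKind::ConstantBuffer) ++constantBuffers;
    }
    if (immediates > 1) return false;                          // one immediate at most
    if (immediates == 1 && constantBuffers > 0) return false;  // shared limit
    return true;
}
```

For "add r0, v[1 + r0.x], float4(...)" the sources are a register and an immediate, which passes; for "add r0, cb0[r1.x], float4(...)" they are a Constant Buffer and an immediate, which fails.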

 

There is no limit on the total number of times immediate constants can appear in a single instruction or in a shader.

立即数常量在单个指令或一个着色器中出现的总次数没有限制。

 

7.5 Constant Buffers

7.5 常量缓冲区

There are 15 slots for ConstantBuffers that can be active per Pipeline stage. Indexing across ConstantBuffers is not permitted. A given ConstantBuffer is accessed as an operand to any Shader operation as if it is an indexable read-only register in the Shader. Unlike other Buffer binding locations in the pipeline, Constant Buffers allow neither Buffer offsets nor custom strides. The stride of the Buffer is assumed to be the Element width of R32G32B32A32_TYPELESS; and the first Element in the Buffer (at Buffer offset zero) is assumed to be constant #[ 0 ], when referenced from the Shader.

每个管线阶段最多可以有15个活动的常量缓冲区槽位。不允许跨常量缓冲区建立索引。在着色器中,一个给定的常量缓冲区被视为可索引的只读寄存器,并作为一些着色器操作的操作数来访问。与管道中的其他缓冲区绑定位置不同,常量缓冲区不允许缓冲区偏移或自定义步幅。缓冲区的步幅假定为 R32G32B32A32_TYPELESS 的元素宽度;当从着色器引用时,缓冲区中的第一个元素(缓冲区偏移量为零)被假定为常量 #[ 0 ]。

cbuffer CData0 : register(b0) { float4 a; };   // a is read as cb0[0]

cbuffer CData1 : register(b1) { float4 b; };   // b is read as cb1[0]

不支持位段操作。

 

In Shader code, just as a t# register is a placeholder for a Texture, a cb# register is a placeholder for a ConstantBuffer at "slot" #. A ConstantBuffer is accessed in a Shader using: cb#[index] as an operand to Shader instructions, where 'index' can be either an r# or statically indexed x# containing a 32-bit unsigned integer, an immediate 32-bit unsigned integer constant, or a combination of the two, added together. e.g. "mov r0, cb3[x3[0].x+6]" represents moving Element 7 from the ConstantBuffer assigned to slot 3 into r0, assuming x3[0].x contains 1.

在着色器代码中,就像 t# 寄存器是纹理的占位符一样,cb# 寄存器是“槽”#中常量缓冲区(ConstantBuffer)的占位符。在着色器中使用以下命令访问常量缓冲区:cb#[index] 作为着色器指令的操作数,其中“index”可以是 r# 或静态索引的 x#,其中包含一个 32 位无符号整数、一个直接的 32 位无符号整数常量或两者的组合,相加在一起。例如,假设 x3[0].x 包含 1,“mov r0, cb3[x3[0].x+6]” 表示将元素 7 从分配给插槽 3 的 常量缓冲区移动到 r0。

 

There is no limit on the total number of times constant buffer reads (from any buffer and location in the buffer) can appear in a single instruction or in a shader.

常量缓冲区读取(从缓冲区中的任何缓冲区和缓冲区中的位置)的总次数没有限制,这些读取可以出现在单个指令或着色器中。

 

The declaration of a ConstantBuffer (cb# register) in a Shader includes the following information:

着色器中常量缓冲区 (cb# 寄存器) 的声明包括以下信息:

1. The size of the ConstantBuffer can be declared (a special flag will allow for unknown-length).

1. 可以声明常量缓冲区的大小(特殊标志将允许未知长度)。

2. The Shader must indicate whether the ConstantBuffer will be accessed via Shader-computed offset values or only by literal offsets.

2. 着色器必须指示常量缓冲区是通过着色器计算的偏移值访问的,还是仅通过文本偏移量访问的。

3. The order that the declaration of a cb# appears in a Shader, relative to other cb# declarations, defines the priority of that ConstantBuffer, starting at highest priority.

3. 相对于其他 cb# 声明,cb# 的声明在着色器中出现的顺序定义了该常量缓冲区的优先级,开始在最高优先级。

When several constant buffers are declared, the one declared first has the highest priority.

cbuffer cbPerFrame

{

DirectionalLight gDirLight;

PointLight gPointLight;

SpotLight gSpotLight;

float3 gEyePosW;

};

cbuffer cbPerObject

{

float4x4 gWorld;

float4x4 gWorldInvTranspose;

float4x4 gWorldViewProj;

Material gMaterial;

};

mfxWorldViewProj     = mFX->GetVariableByName("gWorldViewProj")->AsMatrix();

mfxWorld             = mFX->GetVariableByName("gWorld")->AsMatrix();

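cbPerFrame is declared first here, so it receives the highest priority. The priority is only a hint that helps hardware decide which buffers to place on a dedicated constant access path, if one exists; it does not change which buffer a given read comes from.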

 

Out of bounds access to ConstantBuffers returns 0 in all components. Out of bounds behavior is always with respect to the size of the buffer bound at that slot.

对常量缓冲区的越界访问在所有组件中都返回 0。越界行为始终与该槽绑定的缓冲区的大小有关。

 

If the constant buffer bound to a slot is larger than the size declared in the shader for that slot, implementations are allowed to return incorrect data (not necessarily 0) for indices that are larger than the declared size but smaller than the buffer size.

如果绑定到槽的常量缓冲区大于着色器中为该槽声明的大小,那么对于大于声明大小但小于缓冲区大小的索引,实现被允许返回不正确的数据(不一定是0)。

// Declare a constant buffer holding 2 elements in the shader
cbuffer LightConstants : register(b0)
{
    float4 LightDirection;        // 4-component direction vector
    float4 LightColor;            // 4-component color
};

// Pixel Shader
float4 PS_Main(float2 TexCoords : TEXCOORD0) : SV_Target
{
    // Use the data from the constant buffer
    float4 color = LightColor * saturate(dot(LightDirection.xy, TexCoords));
    return color;
}

If a constant buffer holding more than 2 elements is bound to slot cb0 at run time, a shader access beyond the declared size (that is, past LightColor) may read undefined data.

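ID3D11Buffer* pConstantBuffer; // your constant buffer

ID3D11DeviceContext* pContext; // your device context

// Bind the constant buffer to slot 0 (cb0) of the pixel shader stage

pContext->PSSetConstantBuffers(0, 1, &pConstantBuffer);

In this example, pConstantBuffer is the constant buffer and pContext is the device context. The first argument of PSSetConstantBuffers is the first slot to bind (0 here, corresponding to cb0), the second is the number of buffers to bind (1 here, because a single buffer pointer is passed), and the third points to an array of that many buffer pointers.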

 

Fetching from a ConstantBuffer slot with no Buffer present always returns 0 in all components for all indices.

从不存在 Buffer 的常量缓冲区槽中获取时,所有索引的所有组件中始终返回 0。
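The read rules stated so far (index = register value plus immediate, zeros for out of bounds reads, zeros for an empty slot) can be sketched as a CPU-side model. `Element` and `readConstant` are hypothetical names used only for illustration; a null pointer stands in for a slot with no Buffer present. The sketch deliberately ignores the declared-size case, where implementations may return incorrect data.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// One untyped 32-bit * 4-component element.
using Element = std::array<std::uint32_t, 4>;

// Hypothetical model of a read from one cb# slot.
// `bound` is the buffer bound at the slot, or nullptr when the slot is empty.
Element readConstant(const std::vector<Element>* bound, std::uint32_t index) {
    // No buffer present, or index past the bound buffer: 0 in all components.
    if (bound == nullptr || index >= bound->size()) {
        return Element{0, 0, 0, 0};
    }
    return (*bound)[index];
}
```

For "mov r0, cb3[x3[0].x+6]" with x3[0].x equal to 1, the call would be `readConstant(&slot3, 1 + 6)`, which fetches element 7.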

 

With this set of information, different hardware implementations sporting varying degrees of optimization for ConstantBuffer access may make informed decisions about how to compile access to the ConstantBuffer into Shader code. Compiled shaders must never have to recompile just because different ConstantBuffers get bound to the Shader, as the necessary characteristics have been statically declared. Runtime validation (at least in debug) will ensure that the Shader code and the sizes of bound ConstantBuffers satisfy the declarations.

有了这些信息,不同的硬件实现可以根据对常量缓冲区访问的优化程度做出明智的决策,如何将对常量缓冲区的访问编译到Shader代码中。编译后的着色器绝不应该因为不同的常量缓冲区被绑定到Shader而需要重新编译,因为必要的特性已经被静态声明了。运行时验证(至少在调试中)将确保着色器代码和绑定的常量缓冲区的大小满足声明。

XMFLOAT3 mEyePosW;

mfxEyePosW           = mFX->GetVariableByName("gEyePosW")->AsVector();

mfxEyePosW->SetRawValue(&mEyePosW, 0, sizeof(mEyePosW));

cbuffer cbPerFrame

{

DirectionalLight gDirLight;

PointLight gPointLight;

SpotLight gSpotLight;

float3 gEyePosW;

};

 

The priorities assigned to ConstantBuffers assist hardware in best utilizing any dedicated constant data access paths/mechanisms, if present. There is no guarantee, however, that accesses to ConstantBuffers with higher priority will always be faster than lower priority ConstantBuffers. It is possible that a higher priority ConstantBuffer could produce slower performance than a lower priority ConstantBuffer, depending on the declared characteristics of the buffers involved. For example, an implementation may have some arbitrary sized fast constant RAM not large enough for a couple of high priority ConstantBuffers that a Shader has declared, but large enough to fit a declared low priority ConstantBuffer. Such an implementation may have no choice but to use the standard (assumed slow) texture load path for large high priority ConstantBuffers (perhaps tweaking the cache behavior at least), while placing the lowest priority ConstantBuffer into the (assumed fast) constant RAM.

分配给常量缓冲区的优先级可帮助硬件最好地利用任何专用的常量数据访问路径/机制(如果存在)。但是,不能保证对具有较高优先级的常量缓冲区的访问始终比对较低优先级的常量缓冲区的访问更快。优先级较高的常量缓冲区可能比优先级较低的常量缓冲区产生更慢的性能,具体取决于所涉及的缓冲区的声明特征。例如,一个实现可能具有一些任意大小的快速常量 RAM,这些 RAM 不足以容纳 Shader 声明的几个高优先级常量缓冲区,但足够大以容纳声明的低优先级常量缓冲区。这样的实现可能别无选择,只能对大型高优先级常量缓冲区使用标准(假定慢速)纹理加载路径(至少可以调整缓存行为),同时将最低优先级的常量缓冲区放入(假定的快速)常量 RAM 中。

 

Applications are able to write Shader code that reads constants in whatever pattern and quantity desired, while still allowing different hardware to easily achieve the best performance possible.

应用程序能够编写读取所需模式和数量的常量的着色器代码,同时仍然允许不同的硬件轻松实现最佳性能。

 

7.5.1 Immediate Constant Buffer

7.5.1 立即数常量缓冲区

In addition to the aforementioned 15 slots for Constant Buffers, every shader program can declare(22.3.4) a single Immediate Constant Buffer with up to 4096 4-vector values. The data is tied to the shader program permanently, but otherwise behaves (gets accessed) by the shader exactly the same way as Constant Buffers.

除了前面提到的 15 个常量缓冲区插槽外,每个着色器程序还可以声明 (22.3.4) 一个具有最多 4096 个4-vector的立即数常量缓冲区。这些数据永久地与Shader程序绑定,但在其他方面(被Shader访问的方式)与常量缓冲区的行为完全相同。

 

There is no limit on the total number of times immediate constant buffer reads (from any location in the buffer) can appear in a single instruction or in a shader.

立即数常量缓冲区读取(从缓冲区的任何位置)可以在单个指令或着色器中出现的总次数没有限制。

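// Immediate constant buffer: the HLSL compiler typically emits one for a
// static const array that is indexed dynamically; the data is baked into
// the shader program rather than bound to a cb# slot.
static const float4 PreDefinedColors[2] =
{
    float4(1.0, 0.0, 0.0, 1.0),  // predefined color 0
    float4(0.0, 0.0, 1.0, 1.0)   // predefined color 1
};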

 

7.6 Shader Output Type Interpretation

7.6 着色器输出类型解释

D3D11_INPUT_ELEMENT_DESC vertexDesc[] =

{

{"POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0},

{"COLOR",    0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0}

};

// Create the input layout

D3DX11_PASS_DESC passDesc;

mTech->GetPassByIndex(0)->GetDesc(&passDesc);

HR(md3dDevice->CreateInputLayout(vertexDesc, 2, passDesc.pIAInputSignature, passDesc.IAInputSignatureSize, &mInputLayout));

Shader:

struct VertexOut

{

float4 PosH  : SV_POSITION;

    float4 Color : COLOR;

};

float4 PS(VertexOut pin) : SV_Target

{

    return pin.Color;

}

The application is given control over the data type interpretation for Shader outputs (i.e. writing raw integer values vs. writing normalized float values) by simply choosing an appropriate format to interpret the output resource's contents as. See the Formats(19.1) section for detail.

应用程序只需选择适当的格式来解释输出资源的内容,即可控制着色器输出的数据类型解释(即写入原始整数值与写入规范化浮点值)。有关详细信息,请参阅“格式 (19.1) ”部分。

格式列表

 

 

7.7 Shader Input/Output

7.7 着色器输入/输出

Details on Shader input/output registers (indeed all registers) are provided in the sections dedicated to each Shader unit elsewhere in the spec.

着色器输入/输出寄存器(实际上所有寄存器)的详细信息在规范中其他地方的每个着色器单元的章节中提供。

struct VertexIn

{

float3 PosL  : POSITION;

    float4 Color : COLOR;

};

 

struct VertexOut

{

float4 PosH  : SV_POSITION;

    float4 Color : COLOR;

};

 

VertexOut VS(VertexIn vin)

{

VertexOut vout;

// Transform to homogeneous clip space.

vout.PosH = mul(float4(vin.PosL, 1.0f), gWorldViewProj);

// Just pass vertex color into the pixel shader.

    vout.Color = vin.Color;

    

    return vout;

}

 

float4 PS(VertexOut pin) : SV_Target

{

    return pin.Color;

}

One thing in common about input/output registers for all shaders is that if they are declared(22.3.30) to be dynamically indexable from the shader, and the shader indexes them out of the declared range, results are undefined, although no data from outside the GPU process context is ever visible.

所有着色器的输入/输出寄存器的一个共同点是,如果它们被声明(22.3.30)为从着色器动态索引,并且着色器将它们索引到声明的范围之外,则结果是未定义的,尽管从GPU进程上下文之外永远不会看到任何数据。

Instruction:    dcl_indexRange minReg, maxReg

Stage(s):       All(22.1.1)

Description:    Declare a range of input or output registers that are to be indexed in the Shader code. The range is specified by indicating the minimum register and maximum register (minReg and maxReg).

声明一个在Shader代码中索引的输入或输出寄存器的范围。该范围通过指示最小寄存器和最大寄存器(minReg和maxReg)来指定。

dcl_indexRange v1, v3

dcl_indexRange v4, v9

dcl_indexRange o0, o4 // this line can't be used in PS

 

7.8 Integer Instructions

7.8 整数指令

7.8.1 Overview

7.8.1 概述

There is a collection of instructions available to Shaders which are dedicated to performing integer arithmetic and bitwise operations. Operands and output registers for integer instructions can be any of the register classes available to the floating point instructions. There is no data type associated with registers; Shader instructions determine how the data stored in registers is interpreted. Integer instructions simply assume that the data being read from operands and written to the destination are all 32-bit values (unsigned or signed 2's complement, depending on the instruction).

着色器(Shaders)中有一组指令专门用于执行整数算术和位运算。整数指令的操作数和输出寄存器可以是浮点指令可用的任何寄存器类别。寄存器没有关联的数据类型;Shader指令决定了寄存器中存储的数据如何被解释。整数指令简单地假设从操作数读取的数据和写入目的地的数据都是32位值(无符号或有符号的2的补码,具体取决于指令)。

iadd dest[.mask], [-]src0[.swizzle], [-]src1[.swizzle]

add[_sat] dest[.mask], [-]src0[_abs][.swizzle], [-]src1[_abs][.swizzle]
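The point that registers carry no type, and that the instruction alone decides how 32 bits are interpreted, can be illustrated on the CPU. This is a sketch for a single component only; `bitsOf`, `floatOf`, `iadd` and `add` are hypothetical helper names that mimic, but do not implement, the shader instructions (no modifiers, masks or swizzles).

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float's bit pattern as a 32-bit integer and back,
// without any numeric conversion.
std::uint32_t bitsOf(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

float floatOf(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// iadd-like: treats both register components as 32-bit integers.
std::uint32_t iadd(std::uint32_t a, std::uint32_t b) {
    return a + b;  // wraps modulo 2^32
}

// add-like: treats the same 32 bits as single-precision floats.
std::uint32_t add(std::uint32_t a, std::uint32_t b) {
    return bitsOf(floatOf(a) + floatOf(b));
}
```

Feeding the bit pattern of 1.0f (0x3F800000) to both functions gives different results: `add` produces the bits of 2.0f, while `iadd` simply sums the two integers.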

 

 

7.8.2 Implementation Notes

7.8.2 实施说明

Shader register storage is made up of 32-bit*4-component quantities, and integer arithmetic on these registers is required to be performed at full 32 bit in all cases.

着色器寄存器存储由 32 位 * 4 分量组成,在所有情况下,这些寄存器上的整数运算都需要以完整的 32 位执行。

 

7.8.3 Bitwise Operations

7.8.3 按位运算

The bitwise instructions are listed in the Bitwise Instructions(22.11) sub-section of the full instruction listing.

按位指令列在完整指令列表的“按位指令 (22.11) ”子部分中。

22.11 Bitwise Instructions

Section Contents

(back to chapter)

22.11.1 and

22.11.2 bfi

22.11.3 bfrev

22.11.4 countbits

22.11.5 firstbit

22.11.6 ibfe

22.11.7 ishl

22.11.8 ishr

22.11.9 not

22.11.10 or

22.11.11 ubfe

22.11.12 ushr

22.11.13 xor

 

7.8.4 Integer Arithmetic Operations

7.8.4 整数算术运算

See the Integer Arithmetic Instructions(22.12) sub-section of the full instruction listing.

请参阅完整说明列表的整数算术说明 (22.12) 子部分。

22.12 Integer Arithmetic Instructions

Section Contents

 

(back to chapter)

 

22.12.1 iadd

22.12.2 iaddcb

22.12.3 imad

22.12.4 imax

22.12.5 imin

22.12.6 imul

22.12.7 ineg

22.12.8 uaddc

22.12.9 udiv

22.12.10 umad

22.12.11 umax

22.12.12 umin

22.12.13 umul

22.12.14 usubb

22.12.15 msad

 

7.8.5 Integer/Float Conversion Operations

7.8.5 整数/浮点转换操作

There is no implicit conversion between floating-point and integer values. Contents of registers are interpreted as float or ints by the particular instruction being executed. Two instructions exist that allow explicit conversions to be performed, listed in the Type Conversion Instructions(22.13) sub-section of the full instruction listing.

浮点值和整数值之间没有隐式转换。寄存器的内容被正在执行的特定指令解释为浮点数或整数。存在两个允许执行显式转换的指令,列在完整指令列表的“类型转换指令 (22.13) ”子部分中。

22.13 Type Conversion Instructions

Section Contents
(back to chapter)
22.13.1 f16tof32
22.13.2 f32tof16
22.13.3 ftoi
22.13.4 ftou
22.13.5 itof
22.13.6 utof

 

7.8.6 Integer Addressing of Register Banks

7.8.6 寄存器库的整数寻址

Integer offsets for reads from register banks are available. These offsets must be scalar values (i.e. a select swizzle must be used to select one component of any vector-valued register used as an index) and are considered to be unsigned 32 bit values.

提供从寄存器组读取的整数偏移。这些偏移量必须是标量值(即,必须使用选择分量来选择用作索引的任何向量值寄存器的一个分量),并被视为无符号 32 位值。

 

This indexing mechanism applied to indexable x# registers allows compilers to generate stack-like behavior for Shader subroutines.

这种应用于可索引 x# 寄存器的索引机制允许编译器为 Shader 子例程生成类似堆栈的行为。

 

An example syntax for indexing is:

索引的示例语法如下:

mov r1, cb7[3+r2.x]

 

This instruction assumes that an unsigned 32-bit integer value exists in r2.x, and uses that value to offset into ConstantBuffer 7, starting from location 3 in the ConstantBuffer. Thus, if r2.x contains integer value 2, entry 5 of ConstantBuffer 7 would be referenced.

这条指令假设在r2.x中存在一个无符号的32位整数值,并使用该值偏移到ConstantBuffer 7中,从ConstantBuffer的位置3开始。因此,如果r2.x包含整数值2,那么将会引用ConstantBuffer 7的第5个条目。

 

7.9 Floating Point Instructions

7.9 浮点指令

Floating point instructions must follow the D3D11.3 Floating Point Rules(3.1).

浮点指令必须遵循 D3D11.3 浮点规则 (3.1) 。

 

A listing of all floating point instructions can be found here(22.10).

可以在此处 (22.10) 找到所有浮点指令的列表。

22.10 Floating Point Arithmetic Instructions

Section Contents

(back to chapter)

22.10.1 add

22.10.2 div

22.10.3 dp2

22.10.4 dp3

22.10.5 dp4

22.10.6 exp

22.10.7 frc

22.10.8 log

22.10.9 mad

22.10.10 max

22.10.11 min

22.10.12 mul

22.10.13 nop

22.10.14 round_ne

22.10.15 round_ni

22.10.16 round_pi

22.10.17 round_z

22.10.18 rcp

22.10.19 rsq

22.10.20 sincos

22.10.21 sqrt

 

7.9.1 Float Rounding

7.9.1 浮点四舍五入

Instructions are provided for rounding floating point values to integral floating point values:

提供了将浮点值舍入为整数浮点值的说明:

 

round_ne(22.10.14) (nearest-even)

round_ne (22.10.14) (最接近偶数)

1.5 rounds to 2 (even)

2.5 rounds to 2 (even)

 

round_ni(22.10.15) (negative-infinity)

round_ni (22.10.15) (负无穷大)

round_ni(2.5) yields 2.

round_ni(3.5) yields 3.

 

round_pi(22.10.16) (positive-infinity)

round_pi (22.10.16) (正无穷大)

round_pi(2.5) yields 3.

round_pi(3.5) yields 4.

 

round_z(22.10.17) (towards zero)

round_z (22.10.17) (趋向于零)

round_z(2.5) yields 2.

round_z(-2.5) yields -2.
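The four rounding modes have direct counterparts in the C++ standard library, which makes them easy to try on the CPU. The function names below mirror the shader instructions for readability only; this is a sketch of the rounding direction, not of the instructions' full floating point rules.

```cpp
#include <cmath>

// Round to nearest, ties to even. std::nearbyint follows the current
// floating point rounding mode; this assumes the default (FE_TONEAREST).
float round_ne(float x) { return std::nearbyint(x); }

// Round toward negative infinity.
float round_ni(float x) { return std::floor(x); }

// Round toward positive infinity.
float round_pi(float x) { return std::ceil(x); }

// Round toward zero.
float round_z(float x) { return std::trunc(x); }
```

Negative inputs are where the modes differ most visibly: for -2.5, `round_ni` gives -3 while `round_pi` and `round_z` both give -2.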

 

7.10 Vector Vs Scalar Instruction Set

7.10 向量与标量指令集

The D3D intermediate language (IL) and register model are 4-vec oriented. Since this does not constrain hardware implementation (vector vs scalar) too much, this convention will carry forward until a good reason to switch paradigms surfaces. It is known that many implementations actually happen to operate on scalars or combinations of layouts even now.

D3D中间语言(IL)和寄存器模型是面向4-vec的。由于这不会过多地约束硬件实现(向量vs标量),因此这种惯例将一直延续下去,直到有充分的理由切换范式。众所周知,即使现在,许多实现实际上都是在标量或布局组合上操作的。

 

One area where the vector assumption seems to materially impact data organization is the indexing of registers such as inputs or outputs – the indexing happens across registers. If it is important to be able to express cleanly how to index through an array of scalars, this could be an example of an argument for switching the IL to be completely scalar.

向量假设在数据组织上产生实质性影响的一个领域是寄存器(如输入或输出)的索引 - 索引是跨寄存器进行的。如果能够清晰地表达如何通过标量数组进行索引非常重要,那么这可能是一个将中间语言(IL)完全切换为标量的论据示例。

 

7.11 Uniform Indexing Of Resources And Samplers

7.11 资源和采样器的统一索引

7.11.1 Overview

7.11.1 概述

Shaders have bindpoint arrays for various classes of read-only input resources: Constant Buffers (cb), Texture/Buffers (t), Samplers (s).

Shader具有用于各种类别的只读输入资源的绑定点数组:常量缓冲区(cb)、纹理/缓冲区(t)、采样器(s)。

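Constant Buffers (cb): typically hold data that stays constant during a draw call, such as light positions and camera parameters.

Textures/Buffers (t): hold image or buffer data that the shader reads, for example to produce complex visual effects.

Samplers (s): define how data is sampled from a texture, including filtering and addressing (wrap) modes.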

 

D3D11 allows all of these to be dynamically but uniformly indexed from a shader, whereas previously none of them were indexable.

D3D11允许所有这些资源从Shader动态但统一地索引,而以前它们都是不可索引的。

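The two snippets below index within a single constant buffer, which has always been supported; they do not show the new uniform indexing across cb# bind points. Without an array, each value needs its own variable:

cbuffer Values : register(b0)
{
    float4 value0;
    float4 value1;
    // ...
    float4 value9;
}

The shader then has to select the right variable explicitly, which duplicates code and makes it hard to change the accessed value at run time.

Declaring the values as an array lets the shader select an element with a dynamic index:

cbuffer Values : register(b0)
{
    float4 values[10];
}

float4 main(uint valueIndex : VALUEINDEX) : SV_Target
{
    // Dynamically index the array inside the constant buffer
    return values[valueIndex];
}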

 

 

As with indexing of other types, such as indexable temps (x#), the dynamic index can be either an r# or statically indexed x# containing a 32-bit unsigned integer, an immediate 32-bit unsigned integer constant, or the combination of the two, added together.

与其他类型的索引(如可索引临时索引 (x#))一样,动态索引可以是 r# 或静态索引 x#,其中包含一个 32 位无符号整数、一个即时的 32 位无符号整数常量或两者的组合,相加在一起。

 

The constraint on the indexing of resources or samplers is that the index must be uniform. That is, the computed index must be the same at that point in the lockstep execution of the program for all invocations of the shader within the Draw*() call.

对资源或采样器进行索引的约束是索引必须是uniform。也就是说,在程序的锁步执行中,对于 Draw*() 调用中着色器的所有调用,计算索引必须相同。

In a shader, a uniformIndex parameter can be used to index a texture array dynamically:

Texture2D textures[10] : register(t0);
SamplerState sam : register(s0);
float4 main(uniform int uniformIndex : UNIFORMINDEX) : SV_Target
{
    // Dynamically but uniformly index the texture array
    return textures[uniformIndex].Sample(sam, float2(0.5, 0.5));
}

The statement that the index "must be uniform" is questionable in practice. The following variant passes the index as an ordinary shader input, without the uniform keyword:

Texture2D textures[10] : register(t0);
SamplerState sam : register(s0);
float4 main(int uniformIndex : UNIFORMIN) : SV_Target
{
    // Dynamically index the texture array
    return textures[uniformIndex].Sample(sam, float2(0.5, 0.5));
}

This compiles with Shader Model 5.1 and later; earlier shader models report the error: sampler array index must be a literal expression.

 

If due to flow control, some of the lockstep shader invocations are inactive, the computed index in those shaders is ignored and therefore cannot cause a violation of the uniform indexing constraint on all the active invocations.

如果由于流控制,一些同步着色器调用处于非活动状态,那么这些着色器中计算的索引将被忽略,因此不能导致所有活动调用的uniform索引约束被违反。
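The uniformity constraint, including the exemption for inactive invocations, can be written down as a small predicate. This is a CPU-side sketch; `Lane` and `isUniformIndex` are hypothetical names, and a vector of lanes stands in for the shader invocations executing in lockstep.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical view of one lockstep shader invocation at the indexing point.
struct Lane {
    bool active;          // false when flow control has disabled the lane
    std::uint32_t index;  // the resource or sampler index it computed
};

// True when every active lane computed the same index.
// Indices computed by inactive lanes are ignored.
bool isUniformIndex(const std::vector<Lane>& lanes) {
    bool haveFirst = false;
    std::uint32_t first = 0;
    for (const Lane& lane : lanes) {
        if (!lane.active) continue;
        if (!haveFirst) {
            first = lane.index;
            haveFirst = true;
        } else if (lane.index != first) {
            return false;
        }
    }
    return true;
}
```

A set of lanes where two active lanes hold index 3 and an inactive lane holds index 7 is still uniform; two active lanes holding 3 and 4 are not.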

 

The HLSL compiler will enforce this behavior and driver compilers must not break it either. Violations of the uniform indexing constraint would be a result of an HLSL compiler bug or a driver compiler bug only, and in such cases the indexing results are undefined.

HLSL编译器将强制执行这种行为,驱动程序编译器也不得破坏它。违反uniform索引约束只能是HLSL编译器错误或驱动程序编译器错误的结果,在这种情况下,索引结果是未定义的。

 

7.11.2 Index Range

7.11.2 索引范围

Out of bounds resource indexing produces the same result as if accessing a slot with no resource bound.

越界资源索引产生的结果与访问未绑定资源的插槽的结果相同。

 

In particular note that with Constant Buffers, there are 14 API-visible Constant Buffer slots (a couple of other slots are reserved for various purposes). The valid indexing range for Constant Buffers is therefore [0..13], and accesses out of that range behave as if accessing a slot with no Constant Buffer bound.

特别注意,对于常量缓冲区,有14个API可见的常量缓冲区插槽(其他几个插槽保留用于各种目的)。因此,常量缓冲区的有效索引范围是[0…13],超出该范围的访问行为就像访问没有绑定常量缓冲区的插槽一样。

 

Out of bounds indexing of the Samplers (s#) results in undefined behavior.

采样器 (s#) 的越界索引会导致未定义的行为。

 

7.11.3 Constant Buffer Indexing Example

7.11.3 常量缓冲区索引示例

Suppose x3[0].x contains 4 and x4[2].y contains 5. The following mov instruction:

假设 x3[0].x 包含 4,x4[2].y 包含 5。以下 mov 指令:

mov r0, cb[x3[0].x+6][x4[2].y+9]

 

is therefore equivalent to:

因此等同于:

mov r0, cb[10][14]

 

which means read a 32-bit * 4-vector from location [14] in the ConstantBuffer, at ConstantBuffer bind point [10] (0-based counting).

这意味着从常量缓冲区的位置[14]读取一个32位*4-vector,在常量缓冲区绑定点[10](从0开始计数)。

 

The uniform dynamic indexing of which Constant Buffer to read from is what was not supported previously. Dynamic indexing within the Constant Buffer itself has always been supported.

从哪个常量缓冲区读取的uniform动态索引以前是不支持的。常量缓冲区本身的动态索引一直都是支持的。
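The index arithmetic of this example can be reproduced in a few lines. `CbAddress`, `resolveCbAddress` and `slotInRange` are hypothetical names; the slot count of 14 is the API-visible range [0..13] quoted in 7.11.2.

```cpp
#include <cstdint>

// API-visible Constant Buffer slots, valid range [0..13] per 7.11.2.
constexpr std::uint32_t kApiVisibleCbSlots = 14;

// Resolved two-level address: which bind point, and which element in it.
struct CbAddress {
    std::uint32_t slot;
    std::uint32_t location;
};

// Each index is a register value plus an immediate, added together as
// unsigned 32-bit integers.
CbAddress resolveCbAddress(std::uint32_t slotReg, std::uint32_t slotImm,
                           std::uint32_t locReg, std::uint32_t locImm) {
    return CbAddress{slotReg + slotImm, locReg + locImm};
}

// Slots outside the valid range behave as if nothing were bound.
bool slotInRange(const CbAddress& address) {
    return address.slot < kApiVisibleCbSlots;
}
```

With x3[0].x = 4 and x4[2].y = 5, `resolveCbAddress(4, 6, 5, 9)` yields slot 10 and location 14, that is cb[10][14].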

 

7.11.4 Resource/Buffer Indexing Example

7.11.4 资源/缓冲区索引示例

Suppose x3[0].x contains 4. The following ld instruction:

假设 x3[0].x 包含 4。以下 ld 指令:

ld r0, r1, t[x3[0].x+6], texture2D

 

is equivalent to: 相当于:

ld r0, r1, t[10], texture2D

 

Note the "texture2D" at the end is also a new requirement, whereby all ld/sample instructions will indicate which Shader Resource View type is to be sampled.

请注意,末尾的“texture2D”也是一个新要求,所有 ld/sample 指令都将指示要采样的着色器资源视图类型。

 

7.11.5 Sampler Indexing Example

7.11.5 采样器索引示例

Suppose x3[0].x contains 4 and x4[2].y contains 5. The following sample instruction:

假设 x3[0].x 包含 4,x4[2].y 包含 5。以下示例说明:

sample r0, r1, t[x3[0].x+6], s[x4[2].y+9], textureCubeArray

 

is equivalent to: 相当于:

sample r0, r1, t[10], s[14], textureCubeArray

 

7.11.6 Resource Indexing Declarations

7.11.6 资源索引声明

Shader declarations from Shader Model 4.x for individual resources, constant buffers and samplers remain the same in Shader Model 5.0. These are particularly informative for parts of shader code that reference these objects directly, just as before.

Shader Model 4.x 中针对单个资源、常量缓冲区和采样器的着色器声明在 Shader Model 5.0 中保持不变。与以前一样,这些内容对于直接引用这些对象的着色器代码部分特别有用。

 

However, all instructions that reference texture objects (t#) now specify the view dimension (e.g. textureCubeArray) as a literal parameter. This is redundant when indexing is not used, since the up-front declaration of each t# has a view dimension, but useful when indexing is used.

但是,所有引用纹理对象 (t#) 的指令现在都指定了视图维度(例如textureCubeArray) 作为文本参数。当不使用索引时,这是多余的,因为每个 t# 的预先声明都有一个视图维度,但在使用索引时很有用。

 

7.12 Limitations On Flow Control And Subroutine Nesting

7.12 流程控制和子程序嵌套的限制

A flow control block is defined as an if(22.7.1) block, loop(22.7.4) block, or switch(22.7.18) block. Flow control blocks can nest up to 64 deep per subroutine (and main). Behavior of flow control instructions beyond this nesting limit is undefined.

流控制块被定义为if(22.7.1)块、loop(22.7.4)块或switch(22.7.18)块。流控制块可以在每个子程序(以及主程序)中嵌套最多64层。超出这个嵌套限制的流控制指令的行为是未定义的。

In testing, the compiler reported neither an error nor a warning for nesting beyond this limit.

 

Subroutines can nest up to 32 deep. If there are already 32 entries on the return address stack and a "call" is issued, the call is skipped over.

子程序可以嵌套最多32层。如果返回地址栈上已经有32个条目,并且发出了一个“调用”,则会跳过这个调用。
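The "call is skipped" rule can be modelled with a bounded return address stack. This is a CPU-side sketch with hypothetical names (`ReturnStack`, `call`, `ret`); only the depth of 32 and the skip behavior come from the text.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Maximum subroutine nesting depth stated in the text.
constexpr std::size_t kMaxCallDepth = 32;

// Hypothetical model of the return address stack.
struct ReturnStack {
    std::vector<std::uint32_t> addresses;

    // Returns true if the call is taken. When 32 entries are already on
    // the stack the call is skipped and the stack is left unchanged.
    bool call(std::uint32_t returnAddress) {
        if (addresses.size() >= kMaxCallDepth) return false;
        addresses.push_back(returnAddress);
        return true;
    }

    // Pops the most recent return address; false if the stack is empty.
    bool ret(std::uint32_t& returnAddress) {
        if (addresses.empty()) return false;
        returnAddress = addresses.back();
        addresses.pop_back();
        return true;
    }
};
```

After 32 successful calls a 33rd `call` returns false, and execution would continue with the instruction following the call.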

 

7.13 Memory Addressing And Alignment Issues

7.13 内存寻址和对齐问题

Terms used below: UAV (Unordered Access View), SRV (Shader Resource View), Thread Group Shared Memory.

For Typed memory views, the number of components in an address when accessed by a shader instruction is determined by the number of components in the resource dimension. Each address component is an unsigned 32-bit integer element index.

对于类型化内存视图,当Shader指令访问地址时,地址中的组件数量由资源维度中的组件数量决定。每个地址组件都是一个无符号的32位整数元素索引。

 

For Raw memory views, the address is a single component unsigned 32-bit integer byte offset from the beginning of the view. The addresses must be 32-bit aligned. If an unaligned address is specified for an operation involving a write, the entire contents of the UAV(5.3.9) being written, or all of Thread Group Shared Memory (in the Compute Shader(18)) - whichever is being accessed - becomes undefined. If an unaligned address is specified for an operation involving a read, an undefined result is returned to the shader. It is invalid for implementations to perform the access as if there were no 32-bit alignment constraints.

对于原始内存视图,地址是从视图开始的单个组件无符号32位整数字节偏移。地址必须是32位对齐的。

如果为涉及写入的操作指定了未对齐的地址,那么正在写入的UAV(5.3.9)的全部内容,或者所有的线程组共享内存(在计算着色器(18)中)——无论访问的是哪个——都会变得未定义。如果为涉及读操作的操作指定了未对齐的地址,则返回给着色器的结果是未定义的。实现执行访问操作时,如果没有32位对齐限制,则是无效的。

 

For Structured memory views, the address is two unsigned 32-bit integer values. The first value is the struct index, and the second value is a byte offset into the struct. The byte offset must be aligned to 32-bits, otherwise the same behavior described for misaligned raw memory access above applies.

对于结构化内存视图,地址是两个无符号的 32 位整数值。第一个值是结构索引,第二个值是结构的字节偏移量。字节偏移量必须 32 位对齐,否则将适用上述未对齐的原始内存访问所描述的相同行为。
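The 32-bit alignment requirement for raw and structured addresses amounts to a check on the low two bits of the byte offset. A minimal sketch, with hypothetical names (`isAligned32`, `StructuredAddress`, `structuredOffsetIsAligned`); it covers only the alignment test, not the out of bounds rules, which each instruction defines separately.

```cpp
#include <cstdint>

// A byte offset is 32-bit aligned when it is a multiple of 4 bytes.
bool isAligned32(std::uint32_t byteOffset) {
    return (byteOffset & 3u) == 0;
}

// Two-component address used with structured views.
struct StructuredAddress {
    std::uint32_t structIndex;  // which struct in the view
    std::uint32_t byteOffset;   // byte offset inside that struct
};

// Only the byte offset carries the alignment requirement.
bool structuredOffsetIsAligned(const StructuredAddress& address) {
    return isAligned32(address.byteOffset);
}
```

Offsets 0, 4 and 8 pass the check; offsets such as 2 or 7 do not, and a shader using them would get the undefined results described above.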

 

Each memory access instruction defines its behavior for out of bounds accesses, with distinctions for the memory location being accessed (UAV vs SRV vs Thread Group Shared Memory), and the layout (raw vs structured vs typed). See the documentation of individual instructions for details. The behaviors are similar for similar classes of instructions – e.g. all atomics have the same out of bounds behavior, all immediate atomics (which return a value to a shader) have their own consistent out of bounds access behavior, etc.

每个内存访问指令都定义了其越界访问的行为,并区分了所访问的内存位置(UAV vs SRV vs 线程组共享内存)和布局(原始 vs 结构化 vs 类型化)。有关详细信息,请参阅各个说明的文档。对于类似指令的类别,行为是相似的,例如,所有 Atomics 都具有相同的越界行为,所有立即原子操作(将值返回给着色器)都有自己一致的越界访问行为等。

 

7.14 Shader Memory Consistency Model

7.14 着色器内存一致性模型

7.14.1 Intro

7.14.1 简介

The types of memory accesses included in the scope of this chapter are: to Unordered Access Views(5.3.9) (UAVs, u#), available to the Compute Shader(18) and Pixel Shader(16), as well as Thread Group Shared Memory (g#), available to the Compute Shader.

本章范围中包含的内存访问类型包括:

无序访问视图 (5.3.9) (UAV,u#),可用于计算着色器和像素着色器,

以及线程组共享内存(g#),可用于计算着色器。

 

The D3D11 Shader Memory Consistency Model is weak/relaxed, as generally understood in existing architectures and literature. Loosely, this means the program author and/or compiler are responsible for identifying all memory and thread synchronization points via some appropriately expressive labeling.

D3D11着色器内存一致性模型是弱/松散的,这在现有的架构和文献中通常是这样理解的。简而言之,这意味着程序作者和/或编译器负责通过一些适当的表达方式来标识所有的内存和线程同步点。

 

This section outlines how this weak/relaxed Memory Consistency Model appears to function from the point of view of D3D software.

本节概述了从 D3D 软件的角度来看,这种弱/松弛内存一致性模型的功能。

 

7.14.2 Atomicity

7.14.2 原子性

An atomic operation may involve both reading from and then writing to a memory location. Atomic operations apply only to either u# (Unordered Access Views) or g# (Thread Group Shared Memory).

原子操作可能涉及读取内存位置,然后写入内存位置。原子操作仅适用于 u#(无序访问视图)或 g#(线程组共享内存)。

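Note: such an operation guarantees that, with many threads running, no other thread's write can land between the read and the write to the same memory location, which avoids inconsistent data.

Unordered Access Views give shaders random read/write access to a resource, while Thread Group Shared Memory is memory shared within a single thread group. Atomic operations can target either.

Atomic operations to the same address are serialized, so heavy contention on one address typically reduces parallelism. Whether an implementation uses locking to achieve this is hardware-specific. Where performance matters, keep contended atomics to a minimum.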

 

It is guaranteed that when a thread issues an atomic operation on a memory address, no write to the same address from outside the current atomic operation by any thread can occur between the atomic read and write.

可以保证,当线程对内存地址发出原子操作时,在原子读取和写入之间不会发生任何线程从当前原子操作外部写入同一地址的情况。

 

If multiple atomic operations from different threads target the same address, the operations are serialized in an undefined order.

如果来自不同线程的多个原子操作以同一地址为目标,则这些操作将按未定义的顺序序列化。
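A minimal Python sketch of what "serialized in an undefined order" implies, assuming a 32-bit atomic add whose immediate form returns the pre-operation value. The function name `run_atomic_adds` and the thread ids are hypothetical. It enumerates every possible serialization order for three threads.

```python
from itertools import permutations

MASK32 = 0xFFFFFFFF


def run_atomic_adds(initial, operands, order):
    """Apply one atomic add per thread, one at a time, in the given order.

    Returns (final_value, {thread_id: value seen just before its own add}).
    """
    value = initial
    returned = {}
    for tid in order:
        returned[tid] = value                     # atomic read
        value = (value + operands[tid]) & MASK32  # atomic write
    return value, returned


operands = {0: 1, 1: 10, 2: 100}  # thread id -> amount to add
results = [run_atomic_adds(0, operands, order)
           for order in permutations(operands)]

finals = {final for final, _ in results}            # final memory contents
seen_by_thread0 = {ret[0] for _, ret in results}    # what thread 0 got back
```

Because addition is commutative, `finals` contains a single value, but `seen_by_thread0` contains several: a shader must not rely on the value an immediate atomic returns reflecting any particular ordering.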

 

Atomic operations do not imply a memory or thread fence. Fence operations (dubbed "sync") are introduced below. If the program author/compiler does not make appropriate use of fences, it is not guaranteed that all threads see the result of any given memory operation at the same time, or in any particular order with respect to updates to other memory addresses.

原子操作并不意味着内存或线程围栏(fence)。下面介绍了围栏(fence)操作(称为“同步”)。如果程序作者/编译器没有适当地使用围栏(fence),则不能保证所有线程都同时看到任何给定内存操作的结果,或者以任何特定顺序看到其他内存地址的更新。

 

Atomicity is implemented at 32-bit granularity. If a load or store operation spans more than 32-bits, the individual 32-bit operations are atomic, but not the whole.

原子性以 32 位粒度实现。如果加载或存储操作跨度超过 32 位,则单个 32 位操作是原子操作,但不是全部操作。
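The 32-bit granularity rule can be illustrated with a deterministic Python sketch. It models a 64-bit store as two separate 32-bit stores and lets a reader observe memory between them. The helper names and the chosen values are assumptions for illustration only.

```python
MASK32 = 0xFFFFFFFF


def split64(value):
    """Split a 64-bit value into (low 32 bits, high 32 bits)."""
    return value & MASK32, (value >> 32) & MASK32


def join64(lo, hi):
    """Rebuild a 64-bit value from its two 32-bit halves."""
    return (hi << 32) | lo


old_value = 0x00000000FFFFFFFF
new_value = 0x0000000100000000

memory = list(split64(old_value))      # [lo, hi], two 32-bit locations
new_lo, new_hi = split64(new_value)

memory[0] = new_lo                     # first 32-bit store lands (atomic)
torn = join64(memory[0], memory[1])    # another thread reads both halves now
memory[1] = new_hi                     # second 32-bit store lands (atomic)
final = join64(memory[0], memory[1])
```

Here `torn` is neither `old_value` nor `new_value`, even though each individual 32-bit access was atomic.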

 

Limitation: Atomic operations on Thread Group Shared Memory are atomic with respect to other atomic operations, as well as operations that only perform reads ("load"s). However atomic operations on Thread Group Shared Memory are NOT atomic with respect to operations that perform only writes ("store"s) to memory. Mixing of atomics and stores on the same Thread Group Shared Memory address without thread synchronization and memory fencing between them produces undefined results at the address involved. This limitation arises because some implementations of loads and stores do not honor the locking semantics for implementing atomics. It turns out this has no impact on loads, since they are guaranteed to retrieve a value either before or after an atomic (they will not retrieve partially updated values, given they are all defined at 32-bit quanta). However store operations could find their way into the middle of an atomic operation and thus have their effect possibly lost.

限制:线程组共享内存的原子操作相对于其他原子操作以及仅执行读取(“load”) 的操作来说是原子的。然而,对线程组共享内存的原子操作相对于仅执行写入(“store”) 到内存的操作并不是原子的。在同一线程组共享内存地址上混合原子操作和存储操作,而没有线程同步和内存围栏,会产生未定义的结果。这种限制是因为一些加载和存储的实现并不遵守实现原子操作的锁定语义。事实证明,这对加载没有影响,因为它们保证在原子操作之前或之后检索值(它们不会检索部分更新的值,因为它们都定义为32位量子)。然而,存储操作可能会找到进入原子操作中间的方式,从而可能丧失其效果。
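The lost-store hazard described above can be sketched in Python by modelling an atomic add as two observable steps and letting a plain store land between them. This is one possible interleaving under an assumed model, not a description of any specific hardware; `atomic_add_steps` is a hypothetical helper.

```python
MASK32 = 0xFFFFFFFF


def atomic_add_steps(memory, addr, operand):
    """Model a read-modify-write as a generator with two steps.

    Step 1 performs the read; step 2 performs the write.
    """
    old = memory[addr]                        # atomic read
    yield old
    memory[addr] = (old + operand) & MASK32   # atomic write
    yield old


shared = [5]                          # one Thread Group Shared Memory location
op = atomic_add_steps(shared, 0, 1)

next(op)                              # the atomic reads 5
shared[0] = 100                       # a plain store slips into the middle
next(op)                              # the atomic writes back 5 + 1

after = shared[0]                     # the store of 100 has been lost
```

A load issued at any point in this sequence would still see a whole 32-bit value (5, 100, or 6), which is why loads are unaffected by the limitation.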

 

Note that there is no such limitation on atomics to UAV memory; atomic operations on UAV memory are atomic both with respect to other atomic operations as well as loads and stores.

请注意,对于 UAV 内存的原子操作,不存在此类限制;UAV 内存上的原子操作既对其他原子操作具有原子性,也对加载和存储操作具有原子性。

 

7.14.3 Sync

7.14.3 同步

A sync(22.17.7) instruction is included in the Shader IL for the Pixel Shader and the Compute Shader.

同步 (22.17.7) 指令包含在像素着色器和计算着色器的着色器 IL 中。

 

This provides memory fence semantics at various scopes, and optional thread group synchronization semantics (the latter only applies to the Compute Shader). For details, including some discussion of the implications, see the description of the sync(22.17.7) instruction.

这提供了在不同范围内的内存屏障语义,以及可选的线程组同步语义(后者仅适用于计算着色器)。有关详细信息,包括一些关于影响的讨论,请参阅sync(22.17.7)指令的描述。
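As a loose CPU-side analogy for the thread group synchronization semantics mentioned above, the following Python sketch uses `threading.Barrier` so that every thread finishes writing a shared array before any thread reads a neighbor's slot. It is only an analogy under stated assumptions: Python threads stand in for the threads of one thread group, the list `shared` stands in for g# memory, and the barrier stands in for a sync with group-memory fence and thread sync. It does not model GPU memory fences or scopes.

```python
import threading

GROUP_SIZE = 4
shared = [0] * GROUP_SIZE                 # stand-in for thread group shared memory
results = [0] * GROUP_SIZE
barrier = threading.Barrier(GROUP_SIZE)   # stand-in for the group-wide sync


def thread_main(tid):
    shared[tid] = tid * tid               # each thread writes its own slot
    barrier.wait()                        # nobody proceeds until all have written
    results[tid] = shared[(tid + 1) % GROUP_SIZE]   # now read a neighbor's slot


threads = [threading.Thread(target=thread_main, args=(t,))
           for t in range(GROUP_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the barrier, a thread could read its neighbor's slot before the neighbor had written it, and `results` would depend on scheduling.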

 

7.14.4 Global vs Group/Local Coherency on Non-Atomic UAV Reads

7.14.4 全局与组/局部一致性在非原子UAV读取

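Note: "global coherency" here means loads from a UAV can observe stores and atomics performed by threads in any thread group. This requires loads to bypass any per-thread-group cache, which may cost performance.

"Group/local coherency" means loads are only guaranteed to observe stores and atomics from threads in the same thread group (or, in the Pixel Shader, the same invocation). This lets the hardware use its local cache for loads, but cross-group communication through plain loads then produces undefined results.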

Typical implementations will have a cache hierarchy to improve read access performance on UAV(5.3.9) accesses. A constraint that some implementations have with the first stage in this cache hierarchy is that, in addition to operating at per-thread-group scope only, the cache does not have an efficient way of being synchronized with writes or atomics that have happened by other thread groups. Such behavior only surfaces as an issue for applications when cross-thread-group communication needs to be performed involving data loads. In this case, the hardware basically needs to know that it must bypass the first stage of caches on loads, reaching out to a more global memory so that the cross thread-group communication can function. D3D allows applications to specify this cross-thread-group communication intent as follows.

典型的实现将具有缓存层次结构,以提高UAV(5.3.9)访问的读取性能。这个缓存层次结构的第一阶段在某些实现中存在的一个限制是,除了只在每个线程组范围内操作外,缓存没有有效的方式与其他线程组发生的写入或原子操作进行同步。只有当需要执行涉及数据加载的跨线程组通信时,此类行为才会成为应用程序的问题。在这种情况下,硬件基本上需要知道它必须绕过loads缓存的第一阶段,到达更全局的内存,以便跨线程组通信可以正常工作。D3D 允许应用程序指定此跨线程组通信意图,如下所示。

 

If a Compute Shader(18) thread in a given thread group needs to perform loads of data that was written by atomics or stores in another thread group, the UAV slot where the data resides must be tagged upon declaration in the shader as "globally coherent", so the implementation can ignore the local cache. Otherwise, this form of cross-thread group data sharing will produce undefined results.

如果给定线程组中的计算着色器(18)线程需要执行由其他线程组中的原子或存储写入的数据的加载,那么数据所在的UAV插槽必须在着色器中声明时标记为“全局一致”,这样实现就可以忽略本地缓存。否则,这种形式的跨线程组数据共享将产生未定义的结果。

 

Atomic read-modify-write operations do not have this constraint (even though a part of the operation is a read/load), because a byproduct of the hardware honoring atomicity is that the entire system sees the operation, whereas simple loads on some implementations may only go to a local cache that has no knowledge of external updates.

原子读-修改-写操作没有此约束(即使操作的一部分是读/加载),因为硬件尊重原子性的副作用是整个系统都可以看到该操作,而某些实现中,简单加载可能只会去到一个没有外部更新信息的本地缓存。

 

If a UAV is not declared as "globally coherent", it is only "group coherent", which means loads can only see data written by stores and atomics in other threads in the same thread group. The affected hardware knows it can make use of its thread-group specific caching for loads, since writes to the memory only came from the current thread group. A UAV tagged as "globally coherent" is also inherently "group coherent", although the affected hardware would not use its local cache. As such, the "globally coherent" flag should only be specified when necessary.

如果一个UAV没有被声明为“全局一致”,那么它只是“组一致”,这意味着加载只能看到由同一线程组中的其他线程的存储和原子写入的数据。受影响的硬件知道它可以利用其线程组特定的缓存进行加载,因为写入到内存的只来自当前的线程组。一个被标记为“全局一致”的UAV也显然是“组一致”的,尽管受影响的硬件不会使用其本地缓存。因此,只应在必要时指定“全局一致性”标志。
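The difference between the two declarations can be sketched with a toy cache model in Python. The class `GroupView`, its per-group dictionary cache, and the function `observe` are all assumptions made for illustration; the spec only says the cross-group result is undefined without the flag, and a stale read is just one possible outcome.

```python
class GroupView:
    """One thread group's view of a UAV, with an optional first-level cache."""

    def __init__(self, memory, globally_coherent):
        self.memory = memory                    # shared backing store
        self.globally_coherent = globally_coherent
        self.cache = {}                         # private to this thread group

    def load(self, addr):
        if self.globally_coherent:
            return self.memory[addr]            # bypass the local cache
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]                 # may be stale

    def store(self, addr, value):
        self.memory[addr] = value
        self.cache[addr] = value                # own writes stay visible locally


def observe(globally_coherent):
    """Group B loads, group A stores, then group B loads again."""
    memory = {0: 0}
    group_a = GroupView(memory, globally_coherent)
    group_b = GroupView(memory, globally_coherent)
    group_b.load(0)          # B's first load (fills its cache if it has one)
    group_a.store(0, 42)     # A writes new data
    return group_b.load(0)   # what does B see now?


stale = observe(globally_coherent=False)
fresh = observe(globally_coherent=True)
```

In this model the group-coherent view returns the old value while the globally coherent view returns the new one, at the price of never using the local cache.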

 

As a reminder though, to guarantee coherency on UAV accesses on all implementations, not only must shaders make the global vs group scope distinction discussed here upon UAV declaration, but they must also make appropriate use of memory and/or thread barriers ("sync_*" in the IL) as needed within the shader to enforce proper ordering of operations by individual threads as seen by others. In addition, the "sync" operation has options for memory barriers that also distinguish between global vs group scope, but that control is separate from the topic of this section, and may not be exposed until a later time, as discussed in the sync instruction definition.

作为提醒,为了在所有实现上保证UAV访问的一致性,着色器不仅必须在UAV声明时做出这里讨论的全局与组范围的区分,而且还必须在着色器内部根据需要适当地使用内存和/或线程屏障(barriers)(在IL中为"sync_"),以强制执行由其他线程看到的单个线程的操作的正确排序。

此外,“sync”操作还具有内存屏障(barriers)选项,这些选项也区分了全局范围与组范围,但该控制与本节的主题是分开的,并且可能要等到以后才会公开,如同步指令定义中所述。

Instruction:    sync[_uglobal|_ugroup][_g][_t]

Stage(s):       All(22.1.1)

Description:    Thread group sync and/or memory barrier.

 

Back to the issue of global vs group coherency on non-atomic UAV reads. Importantly, for many scenarios where cross thread-group communication or reduction (such as histograms) can be accomplished using only atomic operations (no cross thread-group loads involved), there is no problem since atomic operations are implemented by all hardware in a globally coherent way, regardless of whether the UAV has been tagged as "globally coherent" or not.

回到非原子UAV读取的全局与组一致性问题。重要的是,对于许多可以仅通过原子操作(不涉及跨线程组加载)实现的跨线程组通信或减少(如直方图)的场景,由于所有硬件都以全局一致的方式实现了原子操作,无论UAV是否被标记为“全局一致”,都不会有问题。

 

In the Pixel Shader(16), if a UAV is not declared as "globally coherent", it is only "locally coherent". "Local coherency" is the Pixel Shader’s equivalent of the Compute Shader’s "group coherency", except having scope limited only to a single Pixel Shader invocation. This indicates that the Pixel Shader is not doing any cross-PS-invocation communication involving simple load operations. Note, however, that in the Pixel Shader just like in the Compute Shader, atomic read-modify-write operations are always globally coherent. Indeed it is likely to be rare for a Pixel Shader or perhaps even the Compute Shader to need to declare a UAV as "globally coherent", given that atomic operations, which are always globally coherent, might provide the most practical mechanism for cross-PS-invocation or cross-group operations.

在像素着色器(16)中,如果一个UAV没有被声明为“全局一致”,那么它只是“局部一致”。 “局部一致性”是像素着色器的计算着色器的“组一致性”的等价物,只是范围仅限于单个像素着色器调用。这表示像素着色器未执行任何涉及简单加载操作的跨 PS 调用通信。但请注意,在像素着色器中,就像在计算着色器中一样,原子读-修改-写操作始终是全局一致的。事实上,像素着色器甚至计算着色器可能需要将UAV声明为“全局一致”的情况可能很少见,因为原子操作总是全局一致的,可能提供了跨PS调用或跨组操作最实用的机制。

 

7.15 Shader-Internal Cycle Counter (Debug Only)

7.15 着色器内部循环计数器(仅限调试)

7.15.1 Basic Semantics

7.15.1 基本语义

To assist comparisons of algorithms running on GPUs during application development, a cycle counter can be read into shaders. The cycle counter is a 64-bit unsigned integer.

为了帮助比较应用程序开发期间在 GPU 上运行的算法,可以将循环计数器读入着色器。循环计数器是一个 64 位无符号整数。

 

The cycle counter appears as an additional 2*32-bit (64 bit total) input register type that can be declared in any version 5.0+ shader. There are currently no native 64-bit integer arithmetic operations in shaders, although it is simple enough to emulate this. It may be fine for shaders to just look at the low 32-bits of the counter – this can be requested in the shader. Applications may also export the measurements using standard shader outputs for later analysis such as on the CPU.

循环计数器表现为一个额外的2*32位(总共64位)输入寄存器类型,可以在任何5.0+版本的着色器中声明。目前在着色器中还没有原生的64位整数算术操作,尽管模拟这个操作足够简单。对于着色器来说,只看计数器的低32位可能就足够了 - 这可以在着色器中请求。应用程序也可以使用标准的着色器输出来导出测量结果,以便稍后在CPU上进行分析。
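The 64-bit emulation mentioned above can be sketched as follows: a Python model of subtracting two 64-bit counter readings using only 32-bit pieces, as a shader without native 64-bit integers would have to. The function names and the (lo, hi) pairing are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF


def sub64(end_lo, end_hi, start_lo, start_hi):
    """Return (end - start) mod 2**64 as (lo, hi), using 32-bit arithmetic."""
    lo = (end_lo - start_lo) & MASK32
    borrow = 1 if end_lo < start_lo else 0        # low half wrapped around
    hi = (end_hi - start_hi - borrow) & MASK32
    return lo, hi


def low32_delta(end_lo, start_lo):
    """Delta from the low halves only; correct while the true delta < 2**32."""
    return (end_lo - start_lo) & MASK32


# Two readings that straddle a carry out of the low 32 bits.
start = (0xFFFFFFF0, 0x00000001)   # (lo, hi) of 0x00000001FFFFFFF0
end = (0x00000010, 0x00000002)     # (lo, hi) of 0x0000000200000010

delta = sub64(end[0], end[1], start[0], start[1])
short_delta = low32_delta(end[0], start[0])
```

The modular subtraction in `low32_delta` is why reading only the low 32 bits can be enough: the wraparound of the low half cancels out as long as fewer than 2**32 cycles elapsed between the two readings.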

 

The counter is an implementation-dependent measure of cycles in the GPU engine, requiring care to interpret it usefully.

计数器是 GPU 引擎中周期实现相关度量,需要小心的有效的解释它。

 

7.15.2 Interpreting Cycle Counts

7.15.2 解释周期计数

For this discussion, consider a shader "invocation" to be a single execution of one shader program from beginning to end. For the Compute Shader however, an "invocation" is a single thread-group’s execution – e.g. the lifespan of the contents of thread-group shared memory.

在这个讨论中,我们将一个着色器的"调用"视为一个着色器程序从开始到结束的单次执行。然而,对于计算着色器来说,一个"调用"是一个单个线程组的执行 - 例如,线程组共享内存内容的生命周期。

 

The initial value of the counter is undefined.

计数器的初始值未定义。

 

A single reading of the cycle counter is meaningless. But any shader invocation can poll the counter value any number of times.

单独读取周期计数器的值是没有意义的。但是,任何着色器调用都可以任意次数地轮询计数器的值。

 

Computing a delta from cycle counter readings within a shader invocation is meaningful.

在着色器调用中计算周期计数器读数的差值是有意义的。

 

Computing a delta from cycle counter readings across separate shader invocations is not meaningful on all hardware. Developers must obtain information directly from IHVs about whether this is meaningful.

在所有硬件上,跨不同着色器调用计算周期计数器读数的差值可能并没有意义。开发者必须直接从独立硬件供应商(IHVs)获取信息,以确定这是否有意义。

 

The only IHV agnostic approach to interpreting the counters is to limit calculation of deltas to within a given shader invocation, and only make comparisons of deltas within or between shader invocations.

解释计数器的唯一与 IHV 无关的方法是将增量的计算限制在给定的着色器调用内,并且仅对着色器调用内或着色器调用之间的增量进行比较。

 

There are plenty of reasons why test runs will execute differently. The obvious one is that execution of a shader can be interrupted by thread switching, so delta measurements will be arbitrarily larger than the number of cycles spent executing instructions in a given thread.

测试运行执行的不同有很多原因。显而易见的一个是,着色器的执行可以被线程切换中断,所以增量测量将会比在给定线程中执行指令所花费的周期数任意地大。

 

There is no supported way to find out the frequency of the counter. There is no way to correlate this shader internal counter with external timers such as asynchronous time queries.

没有支持的方法来找出计数器的频率。无法将此着色器内部计数器与外部计时器(如异步时间查询)相关联。

 

The counter measurements cannot be correlated with measurements on different hardware by other hardware vendors or even necessarily the same vendor.

计数器测量值不能与其他硬件供应商(甚至不一定是同一供应商)在不同硬件上的测量值相关联。

 

If a GPU’s speed changes, such as for power saving, there is no way to know this happened, or its effect on cycle measurements.

如果 GPU 的速度发生变化,例如为了省电,则无法知道发生了什么,也没有办法知道它对周期测量的影响。

 

Beyond these hints about the care needed to interpret the counter, the onus is on developers to research the properties of new hardware designs that may affect measurements.

除了这些关于解释计数器需要注意的提示之外,开发人员还有责任研究可能影响测量的新硬件设计的特性。

 

7.15.3 Shader Compiler Constraints

7.15.3 着色器编译器约束

The HLSL shader compiler and driver compilers must treat reads of the cycle counter as barriers. Instructions can’t be moved across a counter read, and counter reads can’t be merged.

HLSL 着色器编译器和驱动程序编译器必须将循环计数器的读取视为屏障。指令不能在计数器读取之间移动,计数器读取也不能合并。

Note: this is a constraint on instruction scheduling (IS).

 

7.15.4 Feature Availability

7.15.4 功能可用性

The runtime enforces that shaders using this feature can only be created on a system with debug layer enabled. The debug layer is not allowed to be redistributed to end-user machines. The point is that shaders that use this counter are not intended to be shipped.

运行时强制要求,着色器使用这个功能,只有在启用了调试层的系统。调试层不允许分发到最终用户的机器。关键是,使用此计数器的着色器并非用于发布。

 

7.15.5 Conformance

7.15.5 一致性

This feature will not be tested on hardware by WHQL, except perhaps simply checking that drivers do not crash. Microsoft will test that the HLSL compiler output is correct.

WHQL(Windows Hardware Quality Lab - Windows系统硬件质量实验室) 不会在硬件上测试此功能,除非只是检查驱动程序不会崩溃。Microsoft 将测试 HLSL 编译器输出是否正确。

 

7.15.6 Shader Bytecode Details

7.15.6 着色器字节码详细信息

A new input register, vCycleCounter(22.3.29), can be declared in any version 5_0 (and beyond) shader:

可以在任何版本 5_0(及更高版本)着色器中声明新的输入寄存器 vCycleCounter (22.3.29) :

dcl_input vCycleCounter.{x|xy}

 

Reading x yields the 32 LSBs of the 64-bit count, and reading y yields the 32 MSBs.

读取 x 得到 64 位计数的 32 LSB,读取 y 得到 32 个 MSB。

 

This register can only be used as the source to a mov instruction, e.g. mov r0.w, vCycleCounter.x.

此寄存器只能用作 mov 指令的源,e.g. mov r0.w、vCycleCounter.x。

 

7.16 Textures And Resource Loading

7.16 纹理和资源加载

Up to 128 Resources (e.g. Buffer, Texture1D/2D/3D/Cube) can be active per Pipeline stage. A Resource binding is a representation of a Resource's base pointer (and other data such as size and pixel layout) and is independent of the samplers.

每个 Pipeline 阶段最多可以激活 128 个资源(例如 Buffer、Texture1D/2D/3D/Cube)。资源绑定是资源基指针(以及其他数据,如大小和像素布局)的表示形式,并且独立于采样器。

 

A texture out of a set of bound textures cannot be selected via Shader indexing, however Texture1D/2D/3D resources with an Array dimension > 1, or TextureCube (which has an Array dimension of 6), allow indexing along the array axis from within Shader code.

一组绑定纹理中的纹理不能通过Shader索引来选择,但是具有大于1的数组维度的Texture1D/2D/3D资源,或者TextureCube(其数组维度为6),允许从Shader代码内沿数组轴进行索引。

 

Textures can only have a single Element format. Likewise, Buffers used as input to Shaders can also only have a single Element format, and have an implied data stride equal to the Element size. A single Buffer (or Texture) could be set to multiple input slots simultaneously, with different Element formats and/or offsets, however because Buffers bound as Shader inputs have their data stride implied by the Element format, it is not possible to describe "Array-of-Structures" style layouts in Buffers bound at Shader input. This is unlike the Input Assembler Stage, where multiple-Element Buffers are permitted, and Element offsets and strides can be defined freely.

纹理只能具有一种元素格式。同样,用作着色器输入的缓冲区也只能具有一种元素格式,并且隐含的数据步幅等于 Element 大小。

单个缓冲区(或纹理)可以同时设置为多个输入槽,具有不同的元素格式和/或偏移量,但是由于作为着色器输入的缓冲区具有元素格式隐含的数据步长,因此无法在着色器输入绑定的缓冲区中描述“结构数组”样式布局。这与输入汇编阶段不同,在输入汇编阶段,允许多元素缓冲区,并且可以自由定义元素偏移和步长。

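Suppose we have an RGBA texture and a buffer of floats. They could be declared and used in HLSL like this:

// Declare an RGBA texture (SRV slot t0)
Texture2D<float4> colorTex : register(t0);

// Declare a typed buffer of floats (SRV slot t1; b# registers are for constant buffers)
Buffer<float> values : register(t1);

// Declare a sampler (slot s0)
SamplerState linearSampler : register(s0);

// Sample the texture at the normalized coordinates uv
float4 pixel = colorTex.Sample(linearSampler, uv);

// Read one element of the buffer by integer index
float value = values[index];

In this example, uv is a 2D vector giving the texture coordinates to sample, index is an integer selecting the buffer element to read, and linearSampler is a sampler object that controls how the texture is sampled.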
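To see why an "Array-of-Structures" layout cannot be described by a shader-input Buffer view, here is a Python sketch using the standard `struct` module. The record layout ({float weight; uint32 id}, 8 bytes) and the helper `read_view` are hypothetical; the only rule taken from the text is that the stride of a view equals its Element size.

```python
import struct

# Hypothetical AoS data: each record is {float weight; uint32 id} = 8 bytes.
records = [(0.5, 7), (1.5, 8), (2.5, 9)]
data = b"".join(struct.pack("<fI", weight, ident) for weight, ident in records)


def read_view(buf, fmt, byte_offset, index):
    """Read element `index` of a single-format view.

    The stride is implied by the element format, so element i always
    starts at byte_offset + i * element_size.
    """
    size = struct.calcsize(fmt)
    return struct.unpack_from(fmt, buf, byte_offset + index * size)[0]


# A float view with 4-byte stride walks over weights AND ids alternately.
naive = [read_view(data, "<f", 0, i) for i in range(3)]

# The shader has to compensate by scaling the index itself.
weights = [read_view(data, "<f", 0, 2 * i) for i in range(3)]
ids = [read_view(data, "<I", 4, 2 * i) for i in range(3)]
```

The `naive` list gets the first weight right, then reinterprets an id as a float. Binding the same buffer twice with different formats and offsets (as in `weights` and `ids`) works only because the index arithmetic is done by hand.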

 

Data from textures is accessed in shaders via the load (ld) and sample instructions. The ld instruction provides a simple read and (optional) float32 conversion of texture data using integral addresses, while the sample instructions use normalized floating point addressing and perform filtering in addition to the format conversion.

在着色器中通过load(ld) 和sample指令访问来自纹理的数据。ld 指令使用整数地址提供纹理数据的简单读取和(可选)float32 转换,而sample指令使用规范化浮点寻址,并在格式转换之外执行过滤。
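The contrast between ld and sample can be sketched on a 1D texture in Python. This is a simplified model under stated assumptions: linear filtering only, clamp addressing, texel centers at (i + 0.5) / width, and no mipmaps or fixed-point precision effects. The function names are illustrative, not the instructions themselves.

```python
import math

texels = [0.0, 10.0, 20.0, 30.0]   # a 4-texel, single-channel 1D texture


def ld(texture, x):
    """Unfiltered fetch at an integer texel address."""
    return texture[x]


def sample_linear(texture, u):
    """Linearly filtered fetch at a normalized coordinate u in [0, 1]."""
    width = len(texture)
    pos = u * width - 0.5                        # texel space, center-aligned
    i0 = math.floor(pos)
    frac = pos - i0
    a = texture[min(max(i0, 0), width - 1)]      # clamp both taps to the edge
    b = texture[min(max(i0 + 1, 0), width - 1)]
    return a + (b - a) * frac


exact = ld(texels, 1)                    # texel 1, no filtering
blended = sample_linear(texels, 0.5)     # halfway between texels 1 and 2
at_center = sample_linear(texels, 0.375) # exactly on the center of texel 1
```

Sampling at the center of a texel returns the same value as ld at that texel; anywhere else the result is a blend of neighbors.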

7.17 Texture Load

7.17 纹理加载

The load operation performs a non-filtered read of resource data. See the ld(22.4.6) instruction definition for details.

加载操作对资源数据执行不筛选读取。有关详细信息,请参阅 ld (22.4.6) 指令定义。

ld[_aoffimmi(u,v,w)][_s]

                    dest[.mask],

                    srcAddress[.swizzle],

                    srcResource[.swizzle]

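ld: the instruction name, short for "load".

[_aoffimmi(u,v,w)]: an optional suffix giving an address offset. The texture coordinates are offset by the supplied immediate integer constants, in texel space.

dest[.mask]: the destination register that receives the result.

srcAddress[.swizzle]: the integer texel address to fetch from. No filtering is performed.

srcResource[.swizzle]: a texture register (t#), which must have been declared, identifying the texture or buffer to fetch from.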

 

7.17.1 Multisample Resource Load

7.17.1 多重采样资源加载

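A multisample resource load reads one individual sample from a multisample texture in a shader. The load itself is unfiltered; it returns the stored value of the selected sample.

In HLSL, the Load method reads from a multisample texture. It takes an integer coordinate and a sample index, and returns the value of that sample at that coordinate. For example:

// Declare a 2D multisample texture with 4 samples per pixel
// (an array version would be Texture2DMSArray<float4, 4>)
Texture2DMS<float4, 4> tex : register(t0);

// Read sample 0 of the pixel at (10, 20)
int2 coord = int2(10, 20);

float4 pixel = tex.Load(coord, 0);

Here coord is the integer coordinate of the pixel to read, and the second argument is the index of the sample to read. pixel then holds the value of that sample.

Note that a multisample resource stores several samples per pixel, so reading all of them costs more bandwidth than reading a single-sample texture. It is multisampling as a whole, not the load, that reduces edge aliasing.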

Multisample resources can be set as shader inputs, which allows individual samples to be read by the shader. Support for multisample shader reads has the following restrictions:

多重采样资源可以被设置为着色器输入,从而允许着色器读取单个样本。多重采样着色器读取的支持具有以下限制:

1. Pixel Shader only (not supported for other shader stages)

1. 仅限像素着色器(其他着色器阶段不支持)

2. load instruction only (no use of sample instructions)

2. 仅加载指令(不使用sample指令)

3. Texture2D and Texture2DArray resources only

3. 仅限 Texture2D 和 Texture2DArray 资源

4. number of samples in bound resource must be declared in shader

4. 绑定资源中的采样数必须在着色器中声明

5. sample index for load instruction must be a literal

5. 加载指令的采样索引必须是文本

See ld(22.4.6) and dcl_resource(22.3.12) definitions for details.

有关详细信息,请参阅 ld (22.4.6) 和 dcl_resource (22.3.12) 定义。
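Reading individual samples, as the restrictions above allow, can be modelled in Python with a texture stored as one list of samples per pixel. Everything here is an assumed toy model: the layout `texture[y][x][sample]`, the helper names, and the averaging "resolve" shown as one thing a shader might do with the samples it reads.

```python
SAMPLE_COUNT = 4   # must match the sample count declared for the resource


def make_ms_texture(width, height, sample_count, fill=0.0):
    """Create a multisample texture: texture[y][x] is a list of samples."""
    return [[[fill] * sample_count for _ in range(width)]
            for _ in range(height)]


def ld_ms(texture, x, y, sample_index):
    """Unfiltered read of one sample at an integer pixel address."""
    samples = texture[y][x]
    if not 0 <= sample_index < len(samples):
        raise IndexError("sample index out of range")
    return samples[sample_index]


def resolve_average(texture, x, y):
    """Average all samples of a pixel, one load per sample."""
    count = len(texture[y][x])
    return sum(ld_ms(texture, x, y, s) for s in range(count)) / count


tex = make_ms_texture(width=2, height=1, sample_count=SAMPLE_COUNT)
tex[0][1] = [1.0, 1.0, 0.0, 0.0]   # an edge covers half of pixel (1, 0)

third_sample = ld_ms(tex, 1, 0, 2)
edge_value = resolve_average(tex, 1, 0)
```

Each call to `ld_ms` uses a fixed sample index, mirroring the restriction that the index for the load instruction is a literal.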