Chapter 2 The MXMACA Programming Environment
In this chapter:
● The MXMACA cloud programming environment
● The MXMACA local programming environment
● Printing "Hello World" with MXMACA
This chapter introduces the MXMACA programming environment: the "stage" on which an MXMACA programmer works. Two options are covered:
first, using the MetaX (沐曦) cloud integrated development environment directly;
second, installing the MXMACA development tools on a local machine running Ubuntu to build a local MXMACA programming environment.
With a programming environment in place, we can quickly write a Hello World program to check that the basic development setup works and to get familiar with the MXMACA development workflow and tools.
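The workflow just described can be sketched as follows. This is a hedged sketch: the file name hello_world.cpp is arbitrary, and the `__global__` kernel with device-side printf assumes MXMACA's CUDA-like C++ extensions (consistent with the `__host__`/`__device__` lambda annotations that `mxcc --help` lists below), so verify the exact syntax against the MXMACA documentation. The compile step is only echoed, in case mxcc is not on the PATH.

```shell
# Write a minimal Hello World source file. The kernel syntax below is an
# assumption based on MXMACA's CUDA-like C++ extensions.
cat > hello_world.cpp <<'EOF'
#include <cstdio>

// Assumed CUDA-style device kernel; check MXMACA docs for exact syntax.
__global__ void helloFromGpu() {
    printf("Hello World from GPU!\n");
}

int main() {
    printf("Hello World from CPU!\n");
    // helloFromGpu<<<1, 1>>>();   // CUDA-style launch, assumed supported
    return 0;
}
EOF

# Compile and run (echoed only, in case mxcc is not installed here):
echo "mxcc hello_world.cpp -o hello_world && ./hello_world"
```

Once the environment check in the next sections passes, the echoed command can be run as-is.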
It is all quite simple and bundled in one place, unlike Ascend's bewildering pile of libraries; the actual chip is just one, and it is not obvious why such a complex software stack would be needed. The software stacks of MetaX and Iluvatar CoreX are likewise simple: one chip, one stack.
2.1 The MXMACA Cloud Programming Environment
The MetaX cloud Integrated Development Environment (IDE) is a comprehensive tool that bundles all the functionality and toolchains needed across the whole MXMACA C/C++ development process, giving MXMACA programmers an efficient, convenient, and complete service. To enter the MXMACA programming world quickly, register on the MetaX developer community for a cloud MXMACA programming account and begin your MXMACA programming journey.
Visit the "Developer Community" on the MetaX website, click the register button in the upper-right corner, and follow the prompts on the registration page to create a developer account.
MetaX developer community accounts fall into three permission tiers: ordinary users, registered users, and enterprise users.
● Ordinary users can browse the community directly without registering.
● Registered users sign up with a user name, phone number (with verification code), email address, and login password.
● Enterprise users must provide their name, location, company, industry, corporate email, phone number (with verification code), the SN of the product in use, the name of their sales or marketing contact, a description of the intended use case and application scenario, their main areas of interest, and the platforms they follow; registration completes after manual review. Once registered, a registered user can access the MetaX online compilation platform, write MXMACA code directly in the browser, run it, and view the results.
Enterprise users get an MXMACA cloud programming and execution environment. After enterprise registration succeeds, if this is your first time using cloud MXMACA programming, you may need to check whether the programming environment is correctly installed in your cloud account. On Linux, you can check with the following commands.
%%bash
which mxcc # /opt/maca/mxgpu_llvm/bin/mxcc
mxcc --help
/opt/maca/mxgpu_llvm/bin/mxcc
OVERVIEW: MXMACA C/C++ Compiler
USAGE: mxcc [options] file...
OPTIONS:
-arbin <value> Specify the path of the archiver tool used create static librarie. Can only be used when --emit-static-lib is defined
-clean Delete all the non-temporary files that the same mxcc command would generate without this option
--config <value> Specifies configuration file
-c Only run preprocess, compile, and assemble steps
-default-stream <value> Specify the stream that MXMACA commands from the compiled program will be sent to by default.
-device-arch=<value> set the device's arch in triple
-device-arch <value> set the device's arch in triple
-device-bc Compile input file into device-only bitcode file
-device-bin Compile input file into device-only binary file
-device-obj Compile input file into device-only object file
-disable_promote_alloca_to_bsm <value>
backend metaxgpu.
-disable_promote_alloca_to_vector <value>
backend metaxgpu.
-dlink-asm generate dlink phase asm
-dlink-cpp Compile input file into dlink phase cpp file
-dlink-obj Compile input file into dlink phase obj file
-D <macro>=<value> Define <macro> to <value> (or 1 if <value> omitted)
-emit-llvm Use the LLVM representation for assembler and object files
--emit-static-lib Enable linker job to emit a static library.
-enable_ldg_bsm_opt <value>
backend metaxgpu.
-expt-extended-lambda Alias for --extended-lambda: Allow __host__, __device__ annotations in lambda declarations
-extended-lambda Allow __host__, __device__ annotations in lambda declarations
-E Only run the preprocessor
-fatbc Compile input file into device-only mcfb(bc) file
-fatbin Compile input file into device-only mcfb(bin) file
-fexceptions Enable support for exception handling
-fgpu-rdc Generate relocatable device code, also known as separate compilation mode
-fixed_function_abi <value>
backend metaxgpu.
-fkeep-device-symbols keep unused device symbols
-fmaca-uuid <value> pass the unique id to mxcc frontend
-fmad=<value> Enables(disables) the contraction of floating-point multipiles and adds/subtracts into floating-point multiply-add operations.
-fmad <value> Enables(disables) the contraction of floating-point multipiles and adds/subtracts into floating-point multiply-add operations.
-fno-exceptions Disable support for exception handling
-fno-gpu-rdc
-fopenmp-targets=<value>
Specify comma-separated list of triples OpenMP offloading targets to be supported
-fopenmp Parse OpenMP pragmas and generate parallel code.
-foptimization-record-passes=<regex>
Only include passes which match a specified regular expression in the generated optimization record (by default, include all passes)
-forward-unknown-to-compiler
Forward unknown options to host and device compilation. An 'unknown option' is an option which is not recognized by mxcc but can be processed by clang
-fproc-stat-report=<value>
Save subprocess statistics to the given file
-fproc-stat-report<value>
Print subprocess statistics
-ftz=<value> Controls single-precision denormals support
-ftz <value> Controls single-precision denormals support
-gcc-toolchain=<value> Use the gcc toolchain at the given directory
-gcc-toolchain <value> Use the gcc toolchain at the given directory
-gcc-version=<value> Try to find and use the gcc toolchain of the given version
-gcc-version <value> Try to find and use the gcc toolchain of the given version
--generate-line-info Generate device-side line info. Disables asm optimizations.
--gpu-max-threads-per-block=<value>
Default max threads per block for kernel launch bounds for MXMACA
-g Generate source-level debug information
-help Display available options
-h Display available options
-include-device-pch <file>
Include device precompiled header file
-include-host-pch <file>
Include host precompiled header file
-include <file> Include file before parsing
-input-is-device treat all input file as device source file
-input-is-host treat all input file as host source file
-isystem <directory> Add directory to SYSTEM include search path
-I <dir> Add directory to the end of the list of include search paths
--list-gpu-arch List all gpu architectures supported by the tool and exit.
-L <dir> Add directory to library search path
-maca-device-input=<value>
input MXMACA device file
-maca-device-input <value>
input MXMACA device file
-maca-device-lib-path=<value>
MXMACA device library path. Alternative to maca-path.
-maca-device-lib=<value>
MXMACA device library
-maca-device-only Compile MXMACA code for device only
-maca-host-input=<value>
input MXMACA host file
-maca-host-input <value>
input MXMACA host file
-maca-host-lib-path=<value>
MXMACA host library path. Alternative to maca-path.
-maca-host-lib=<value> MXMACA host library
-maca-host-only Compile MXMACA code for host only. Has no effect on non-MXMACA compilations.
-maca-infer-ldg Infer that the LDG offset is greater than 128 bytes for MXMACA
-maca-link Link clang-offload-bundler bundles for MXMACA
--maca-noopt-device-debug
Enable device-side debug info generation. Disables asm optimizations.
-maca-path=<value> MXMACA installation path, used for finding and automatically linking required bitcode libraries.
-map-sched-select <value>
backend metaxgpu.
-map_use_pk_fma <value> backend metaxgpu.
-MD Write a depfile containing user and system headers
-metaxgpu-bsm-direct-address <value>
backend metaxgpu.
-metaxgpu-sched-select <value>
backend metaxgpu.
-MF <file> Write depfile output from -MMD, -MD, -MM, or -M to <file>
-mllvm <value> Additional arguments to forward to LLVM's option processing
-MMD Write a depfile containing user headers
-MM Like -MMD, but also implies -E and writes to stdout by default
-MP Create phony target for each dependency (other than main file)
-MT <value> Specify name of main file output in depfile
-M Like -MD, but also implies -E and writes to stdout by default
-nodefaultrpath Do not add default rpath when linking
-nogpuinc Do not add include paths for MXMACA and do not include the default MXMACA wrapper headers
-nogpulib Do not link device library for MXMACA device compilation
-objtemp for tempfile output dir
-odir <value> for output dir
-o <file> Write output to <file>
-pg Enable mcount instrumentation
-prec-div=<value> Controls single-precision division and reciprocals
-prec-div <value> Controls single-precision division and reciprocals
-prec-sqrt=<value> Controls single-precision square root
-prec-sqrt <value> Controls single-precision square root
-resource-usage Show resource usage such as registers and memory of the GPU code.
-run-args=<value> Specify command line arguments for the executable when used in conjunction with --run
-run-args <value> Specify command line arguments for the executable when used in conjunction with --run
-run Compile and link all input files into an executable and executes it
-scalarize_global_loads <value>
backend metaxgpu.
-shfl_combine <value> backend metaxgpu.
-std=<value> Language standard to compile for
-use-fast-math Make use of fast math library
-U <macro> Undefine macro <macro>
--version Print version information
-v Show commands to run and use verbose output
-Wdeprecated Enable warnings for deprecated constructs and define __DEPRECATED
-Wl,<arg> Pass the comma separated arguments in <arg> to the linker
-W<warning> Enable the specified warning
-w Suppress all warnings
-Xarchive <arg> Pass <arg> to the ar
-Xcompiler <arg> Pass <arg> directly to the compiler
-Xdevice=<arg> Pass <arg> to the MXMACA device compilation
-Xdevice <arg> Pass <arg> to the MXMACA device compilation
-Xhost=<arg> Pass <arg> to the MXMACA host compilation
-Xhost <arg> Pass <arg> to the MXMACA host compilation
-Xlinker <arg> Pass <arg> to the linker
-Xmaca-device-linker <arg>
Pass <arg> to the MXMACA device linker
-x <language> Treat subsequent input files as having type <language>
Analysis of the parameters in the mxcc --help output:
● Compilation flow control:
  -device-arch          # select the target GPU architecture (e.g. MX2200)
  -fgpu-rdc             # relocatable device code / separate compilation (like CUDA's RDC mode)
  -emit-llvm            # emit LLVM IR for cross-platform optimization
● Performance tuning:
  -fmad=enable          # fuse floating-point multiply-adds (higher compute density)
  -use-fast-math        # fast math library (trades accuracy for speed)
  -ftz=enable           # flush single-precision denormals to zero (speeds up corner cases)
● Debugging and analysis:
  -resource-usage       # report register/shared-memory usage (key for performance tuning)
  --generate-line-info  # keep device-side line info (for debug and profiling tools)
● Heterogeneous programming support:
  -default-stream       # set the default compute stream (basis of multi-stream programming)
  -Xdevice / -Xhost     # pass flags to the device/host compiler respectively (key to heterogeneous builds)
● Advanced features:
  -fopenmp-targets      # OpenMP offloading targets (CPU+GPU cooperation)
  -extended-lambda      # allow __host__/__device__ annotations in lambdas (functional style)
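Putting a few of these flags together, a typical compile command might look as follows. This is a dry-run sketch: the command line is only printed, not executed, since mxcc may not be installed here. The flag spellings come verbatim from the `mxcc --help` output above, while the source file name and the `xcore1020` architecture value (after the XCORE1020 device name that macainfo reports) are assumptions.

```shell
# Assemble a representative mxcc command line from the flags above.
SRC=hello_world.cpp          # hypothetical input file
ARCH=xcore1020               # assumed -device-arch value, after the XCORE1020 device

CMD="mxcc -device-arch=$ARCH -fgpu-rdc -fmad=enable -resource-usage $SRC -o hello_world"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

On a machine with the toolchain installed, replacing `echo "$CMD"` with `eval "$CMD"` would perform the actual build.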
%%bash
macainfo # inspect MXMACA device information
======================
MXC System Attributes
======================
Runtime Version: 1.0
System Timestamp Freq: 1000MHz
Signal Max Wait Time: 18446744073709551615(0xffffffffffffffff)
Machine Model: LARGE
System Endianess: LITTLE
***********
Agent 1
***********
Name: INTEL(R) XEON(R) GOLD 6330 CPU @ 2.00GHZ
Uuid: 4350552d-5858-0000-0000-000000000000
Market Name: Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
Vendor Name: CPU
Feature: Not Specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0
Queue Max Size: 0
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq(MHz): 0
BDFID: 0
Internal Node ID: 0
Accelerator Processors: 56
PEUs per AP: 0
Data Processor Clustes(DPCs): 0
DPC Arrays per DPC.: 0
Watch Pointers on Address Ranges: 1
Pool Info:
Pool 1
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 1626668(0x18d22c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 2077004(0x1fb14c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 1626668(0x18d22c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 2077004(0x1fb14c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
ISA Info:
N/A
***********
Agent 2
***********
Name: INTEL(R) XEON(R) GOLD 6330 CPU @ 2.00GHZ
Uuid: 4350552d-5858-0000-0000-000000000000
Market Name: Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
Vendor Name: CPU
Feature: Not Specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0
Queue Max Size: 0
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq(MHz): 0
BDFID: 0
Internal Node ID: 1
Accelerator Processors: 56
PEUs per AP: 0
Data Processor Clustes(DPCs): 0
DPC Arrays per DPC.: 0
Watch Pointers on Address Ranges: 1
Pool Info:
Pool 1
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 1626668(0x18d22c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 2077004(0x1fb14c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 1626668(0x18d22c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 2077004(0x1fb14c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
ISA Info:
N/A
***********
Agent 3
***********
Name: XCORE1020
Uuid: 7302869f-215c-9fe9-4d96-3a19ab0419b3
Market Name: Device 4083
Vendor Name: METAX
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 12(0xc)
Queue Min Size: 1000
Queue Max Size: 20000
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 32 KB
L2: 8192 KB
Chip ID: 16515(0x4083)
Cacheline Size: 128(0x80)
Max Clock Freq(MHz): 1600
BDFID: 13312
Internal Node ID: 2
Accelerator Processors: 56
PEUs per AP: 4
Data Processor Clustes(DPCs): 8
DPC Arrays per DPC.: 1
Watch Pointers on Address Ranges: 4
Fast Float16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per AP: 32(0x20)
Max Work-items Per AP: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgroups: 32
Pool Info:
Pool 1
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 4
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 5
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 6
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 7
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 8
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 9
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 10
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 11
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 12
Segment: PRIVATE
Flags:
Size: 0(0x0) KB
Allocatable: FALSE
Alloc Granule: 0 KB
Alloc Alignment: 0 KB
Accessible by all: FALSE
Pool 13
Segment: GROUP
Flags:
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0 KB
Alloc Alignment: 0 KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: METAX-MXC-MXMACA--XCORE1020
Machine Models: LARGE
Profiles: BASE
Fast Float16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 18446744073709551615(0xffffffffffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgroups: 32
PEUs per AP: 4   # each AP contains 4 processing units
Question: only 4 per AP? Then why is the NVIDIA equivalent called a "streaming multiprocessor" (SM), and what does "streaming" mean? The CUDA diagrams online show far more than 4 cores per NVIDIA SM.
In GPU architecture, the streaming multiprocessor (SM) is a key compute unit, and PEUs (Processing Element Units) are the execution units inside it. A detailed explanation of the questions above:
● The meaning of SM (streaming multiprocessor). The SM is the core compute unit of a GPU, responsible for executing CUDA threads and compute tasks. It contains multiple CUDA cores (the analogue of PEUs), Tensor cores (for deep-learning acceleration), texture units, and so on. "Streaming" comes from the GPU's parallel computing model: efficiency is achieved by processing large streams of data in parallel.
● The number of PEUs. Different GPU architectures put different numbers of CUDA cores in each SM. For example:
  NVIDIA Turing (e.g. the T4): 64 CUDA cores per SM.
  NVIDIA Ampere (e.g. the A100): 64 CUDA cores per SM.
  NVIDIA Hopper: 128 CUDA cores per SM.
  The MetaX GPU reports 4 PEUs per AP, which may reflect design goals and target workloads that differ from NVIDIA's: MetaX may be optimizing for specific compute tasks rather than pursuing the widest possible parallelism.
● The relationship between DPC (Data Processing Cluster) and SM. A DPC typically manages and optimizes data transfer and processing, while the SM focuses on executing compute tasks. In a GPU architecture, the DPC can feed optimized data to the SMs, reducing transfer latency and improving overall efficiency; the relationship is roughly "data supplier" to "compute executor".
● Why SM and PEU counts differ. Design goals: NVIDIA's GPUs target high-performance computing and graphics rendering and therefore need many CUDA cores for high-throughput parallelism, whereas a MetaX GPU may focus on particular domains (such as AI inference or specific compute tasks) and weight its SM design toward energy efficiency and those tasks. Application scenarios: NVIDIA GPUs serve gaming, scientific computing, AI training, and other massively parallel workloads; a more focused workload allows a more streamlined design.
In short, the differences in SM and PEU counts reflect different design goals and application scenarios: NVIDIA achieves high-performance parallel computing through large numbers of CUDA cores, while MetaX may instead optimize the execution efficiency of specific tasks.
The wavefront concept: on AMD GPUs, the wavefront is the basic unit of scheduling and execution, equivalent to NVIDIA's warp, but it contains 64 threads where a warp contains 32. MetaX appears to have adopted an AMD-like design, hence 64 threads per wavefront.
Why don't all cores participate in computation at once?
● Hardware resource limits: although each SM or AP has many CUDA cores or PEUs, resources such as registers and shared memory are finite, so threads are scheduled in batches rather than all at once.
● Instruction pipelining: to raise efficiency, GPUs pipeline instructions; a core idle in one cycle may pick up other work in the next.
● Latency hiding: by keeping several warps or wavefronts resident, the GPU can switch to other threads while some wait on memory accesses, hiding the latency.
But with only 56 APs of 4 PEs each in total, wouldn't a wavefront of size 64 need 64/4 = 16 APs executing simultaneously?
(1) The single-instruction, multiple-data (SIMD) model. GPUs generally use SIMD to achieve efficient parallelism:
● a single PE can serve multiple threads, all of which execute the same instruction;
● on NVIDIA GPUs, a 32-thread warp is assigned to the CUDA cores of one SM (streaming multiprocessor), each core handling one of the warp's threads per clock cycle;
● on AMD GPUs, a 64-thread wavefront is assigned to the PEUs of one AP (accelerator processor), each PEU handling one of the wavefront's threads per cycle.
So a wavefront never spans multiple APs: it stays resident on a single AP, and its 64 threads are pipelined through that AP's 4 PEUs over roughly 64/4 = 16 cycles per instruction.
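These numbers can be cross-checked against the macainfo report for the XCORE1020 above (Wavefront Size 64, PEUs per AP 4, Max Work-items Per AP 2048, Max Waves Per AP 32) with a little shell arithmetic:

```shell
# Values copied from the macainfo output for Agent 3 (XCORE1020).
WAVEFRONT_SIZE=64          # threads per wavefront
PEUS_PER_AP=4              # execution units per AP
MAX_WORKITEMS_PER_AP=2048  # "Max Work-items Per AP"
MAX_WAVES_PER_AP=32        # "Max Waves Per AP"

# A wavefront stays on one AP; its 64 threads are pipelined through
# the AP's 4 PEUs, i.e. about 64/4 = 16 cycles per instruction.
echo "cycles per wavefront instruction: $((WAVEFRONT_SIZE / PEUS_PER_AP))"

# Resident wavefronts per AP available for latency hiding; this should
# reproduce the reported "Max Waves Per AP" value.
echo "resident waves per AP: $((MAX_WORKITEMS_PER_AP / WAVEFRONT_SIZE))"
```

The second figure matching the reported 32 confirms that occupancy here is bounded by resident work-items, with many wavefronts kept in flight to hide memory latency.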
%%bash
# If the macainfo command fails to run, the MXMACA cloud runtime environment needs to be configured correctly.
export   # with no arguments, `export` lists the currently exported environment variables
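If mxcc or macainfo is missing from the PATH, a minimal environment setup might look like the following sketch, assuming an installation under /opt/maca (the directory where `which mxcc` found the compiler earlier). The variable name MACA_PATH and the lib/ subdirectory are assumptions to check against the MXMACA installation guide.

```shell
# Hedged sketch of MXMACA environment setup; paths below are assumptions
# based on the /opt/maca/mxgpu_llvm/bin/mxcc location shown earlier.
export MACA_PATH=/opt/maca
export PATH="$MACA_PATH/mxgpu_llvm/bin:$PATH"             # so `which mxcc` succeeds
export LD_LIBRARY_PATH="$MACA_PATH/lib:$LD_LIBRARY_PATH"  # assumed runtime library dir

# Re-check; on a correctly configured cloud account this should now succeed.
which mxcc >/dev/null 2>&1 || echo "mxcc still not found; check the MXMACA installation"
```

Adding these lines to ~/.bashrc would make the setup persistent across shell sessions.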