Chapter 2 The MXMACA Programming Environment
In this chapter:
● The MXMACA cloud programming environment
● The MXMACA local programming environment
● Printing "Hello World" with MXMACA
This chapter introduces the MXMACA programming environment: the "stage" on which an MXMACA programmer works. Two options are covered:
first, using the MetaX (沐曦) cloud integrated development environment directly;
second, installing the MXMACA development tools on a local machine running Ubuntu to build a local MXMACA programming environment.
With a programming environment in place, we can quickly write a Hello World program to check that the basic development setup works and to get familiar with the MXMACA development workflow and tools.
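The workflow just described can be sketched as follows. This is a hedged sketch: the file name hello_world.cpp is arbitrary, and the `__global__` kernel with device-side printf assumes MXMACA's CUDA-like C++ extensions (consistent with the `__host__`/`__device__` lambda annotations that `mxcc --help` lists below), so verify the exact syntax against the MXMACA documentation. The compile step is only echoed, in case mxcc is not on the PATH.

```shell
# Write a minimal Hello World source file. The kernel syntax below is an
# assumption based on MXMACA's CUDA-like C++ extensions.
cat > hello_world.cpp <<'EOF'
#include <cstdio>

// Assumed CUDA-style device kernel; check MXMACA docs for exact syntax.
__global__ void helloFromGpu() {
    printf("Hello World from GPU!\n");
}

int main() {
    printf("Hello World from CPU!\n");
    // helloFromGpu<<<1, 1>>>();   // CUDA-style launch, assumed supported
    return 0;
}
EOF

# Compile and run (echoed only, in case mxcc is not installed here):
echo "mxcc hello_world.cpp -o hello_world && ./hello_world"
```

Once the environment check in the next sections passes, the echoed command can be run as-is.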
It is all quite simple and bundled in one place, unlike Ascend's bewildering pile of libraries; the actual chip is just one, and it is not obvious why such a complex software stack would be needed. The software stacks of MetaX and Iluvatar CoreX are likewise simple: one chip, one stack.
2.1 The MXMACA Cloud Programming Environment
The MetaX cloud Integrated Development Environment (IDE) is a comprehensive tool that bundles all the functionality and toolchains needed across the whole MXMACA C/C++ development process, giving MXMACA programmers an efficient, convenient, and complete service. To enter the MXMACA programming world quickly, register on the MetaX developer community for a cloud MXMACA programming account and begin your MXMACA programming journey.
Visit the "Developer Community" on the MetaX website, click the register button in the upper-right corner, and follow the prompts on the registration page to create a developer account.
MetaX developer community accounts fall into three permission tiers: ordinary users, registered users, and enterprise users.
● Ordinary users can browse the community directly without registering.
● Registered users sign up with a user name, phone number (with verification code), email address, and login password.
● Enterprise users must provide their name, location, company, industry, corporate email, phone number (with verification code), the SN of the product in use, the name of their sales or marketing contact, a description of the intended use case and application scenario, their main areas of interest, and the platforms they follow; registration completes after manual review. Once registered, a registered user can access the MetaX online compilation platform, write MXMACA code directly in the browser, run it, and view the results.
Enterprise users get an MXMACA cloud programming and execution environment. After enterprise registration succeeds, if this is your first time using cloud MXMACA programming, you may need to check whether the programming environment is correctly installed in your cloud account. On Linux, you can check with the following commands.
%%bash
which mxcc # /opt/maca/mxgpu_llvm/bin/mxcc
mxcc --help
/opt/maca/mxgpu_llvm/bin/mxcc
OVERVIEW: MXMACA C/C++ Compiler
USAGE: mxcc [options] file...
OPTIONS:
-arbin <value> Specify the path of the archiver tool used create static librarie. Can only be used when --emit-static-lib is defined
-clean Delete all the non-temporary files that the same mxcc command would generate without this option
--config <value> Specifies configuration file
-c Only run preprocess, compile, and assemble steps
-default-stream <value> Specify the stream that MXMACA commands from the compiled program will be sent to by default.
-device-arch=<value> set the device's arch in triple
-device-arch <value> set the device's arch in triple
-device-bc Compile input file into device-only bitcode file
-device-bin Compile input file into device-only binary file
-device-obj Compile input file into device-only object file
-disable_promote_alloca_to_bsm <value>
backend metaxgpu.
-disable_promote_alloca_to_vector <value>
backend metaxgpu.
-dlink-asm generate dlink phase asm
-dlink-cpp Compile input file into dlink phase cpp file
-dlink-obj Compile input file into dlink phase obj file
-D <macro>=<value> Define <macro> to <value> (or 1 if <value> omitted)
-emit-llvm Use the LLVM representation for assembler and object files
--emit-static-lib Enable linker job to emit a static library.
-enable_ldg_bsm_opt <value>
backend metaxgpu.
-expt-extended-lambda Alias for --extended-lambda: Allow __host__, __device__ annotations in lambda declarations
-extended-lambda Allow __host__, __device__ annotations in lambda declarations
-E Only run the preprocessor
-fatbc Compile input file into device-only mcfb(bc) file
-fatbin Compile input file into device-only mcfb(bin) file
-fexceptions Enable support for exception handling
-fgpu-rdc Generate relocatable device code, also known as separate compilation mode
-fixed_function_abi <value>
backend metaxgpu.
-fkeep-device-symbols keep unused device symbols
-fmaca-uuid <value> pass the unique id to mxcc frontend
-fmad=<value> Enables(disables) the contraction of floating-point multipiles and adds/subtracts into floating-point multiply-add operations.
-fmad <value> Enables(disables) the contraction of floating-point multipiles and adds/subtracts into floating-point multiply-add operations.
-fno-exceptions Disable support for exception handling
-fno-gpu-rdc
-fopenmp-targets=<value>
Specify comma-separated list of triples OpenMP offloading targets to be supported
-fopenmp Parse OpenMP pragmas and generate parallel code.
-foptimization-record-passes=<regex>
Only include passes which match a specified regular expression in the generated optimization record (by default, include all passes)
-forward-unknown-to-compiler
Forward unknown options to host and device compilation. An 'unknown option' is an option which is not recognized by mxcc but can be processed by clang
-fproc-stat-report=<value>
Save subprocess statistics to the given file
-fproc-stat-report<value>
Print subprocess statistics
-ftz=<value> Controls single-precision denormals support
-ftz <value> Controls single-precision denormals support
-gcc-toolchain=<value> Use the gcc toolchain at the given directory
-gcc-toolchain <value> Use the gcc toolchain at the given directory
-gcc-version=<value> Try to find and use the gcc toolchain of the given version
-gcc-version <value> Try to find and use the gcc toolchain of the given version
--generate-line-info Generate device-side line info. Disables asm optimizations.
--gpu-max-threads-per-block=<value>
Default max threads per block for kernel launch bounds for MXMACA
-g Generate source-level debug information
-help Display available options
-h Display available options
-include-device-pch <file>
Include device precompiled header file
-include-host-pch <file>
Include host precompiled header file
-include <file> Include file before parsing
-input-is-device treat all input file as device source file
-input-is-host treat all input file as host source file
-isystem <directory> Add directory to SYSTEM include search path
-I <dir> Add directory to the end of the list of include search paths
--list-gpu-arch List all gpu architectures supported by the tool and exit.
-L <dir> Add directory to library search path
-maca-device-input=<value>
input MXMACA device file
-maca-device-input <value>
input MXMACA device file
-maca-device-lib-path=<value>
MXMACA device library path. Alternative to maca-path.
-maca-device-lib=<value>
MXMACA device library
-maca-device-only Compile MXMACA code for device only
-maca-host-input=<value>
input MXMACA host file
-maca-host-input <value>
input MXMACA host file
-maca-host-lib-path=<value>
MXMACA host library path. Alternative to maca-path.
-maca-host-lib=<value> MXMACA host library
-maca-host-only Compile MXMACA code for host only. Has no effect on non-MXMACA compilations.
-maca-infer-ldg Infer that the LDG offset is greater than 128 bytes for MXMACA
-maca-link Link clang-offload-bundler bundles for MXMACA
--maca-noopt-device-debug
Enable device-side debug info generation. Disables asm optimizations.
-maca-path=<value> MXMACA installation path, used for finding and automatically linking required bitcode libraries.
-map-sched-select <value>
backend metaxgpu.
-map_use_pk_fma <value> backend metaxgpu.
-MD Write a depfile containing user and system headers
-metaxgpu-bsm-direct-address <value>
backend metaxgpu.
-metaxgpu-sched-select <value>
backend metaxgpu.
-MF <file> Write depfile output from -MMD, -MD, -MM, or -M to <file>
-mllvm <value> Additional arguments to forward to LLVM's option processing
-MMD Write a depfile containing user headers
-MM Like -MMD, but also implies -E and writes to stdout by default
-MP Create phony target for each dependency (other than main file)
-MT <value> Specify name of main file output in depfile
-M Like -MD, but also implies -E and writes to stdout by default
-nodefaultrpath Do not add default rpath when linking
-nogpuinc Do not add include paths for MXMACA and do not include the default MXMACA wrapper headers
-nogpulib Do not link device library for MXMACA device compilation
-objtemp for tempfile output dir
-odir <value> for output dir
-o <file> Write output to <file>
-pg Enable mcount instrumentation
-prec-div=<value> Controls single-precision division and reciprocals
-prec-div <value> Controls single-precision division and reciprocals
-prec-sqrt=<value> Controls single-precision square root
-prec-sqrt <value> Controls single-precision square root
-resource-usage Show resource usage such as registers and memory of the GPU code.
-run-args=<value> Specify command line arguments for the executable when used in conjunction with --run
-run-args <value> Specify command line arguments for the executable when used in conjunction with --run
-run Compile and link all input files into an executable and executes it
-scalarize_global_loads <value>
backend metaxgpu.
-shfl_combine <value> backend metaxgpu.
-std=<value> Language standard to compile for
-use-fast-math Make use of fast math library
-U <macro> Undefine macro <macro>
--version Print version information
-v Show commands to run and use verbose output
-Wdeprecated Enable warnings for deprecated constructs and define __DEPRECATED
-Wl,<arg> Pass the comma separated arguments in <arg> to the linker
-W<warning> Enable the specified warning
-w Suppress all warnings
-Xarchive <arg> Pass <arg> to the ar
-Xcompiler <arg> Pass <arg> directly to the compiler
-Xdevice=<arg> Pass <arg> to the MXMACA device compilation
-Xdevice <arg> Pass <arg> to the MXMACA device compilation
-Xhost=<arg> Pass <arg> to the MXMACA host compilation
-Xhost <arg> Pass <arg> to the MXMACA host compilation
-Xlinker <arg> Pass <arg> to the linker
-Xmaca-device-linker <arg>
Pass <arg> to the MXMACA device linker
-x <language> Treat subsequent input files as having type <language>
Analysis of the parameters in the mxcc --help output:
● Compilation flow control:
  -device-arch          # select the target GPU architecture (e.g. MX2200)
  -fgpu-rdc             # relocatable device code / separate compilation (like CUDA's RDC mode)
  -emit-llvm            # emit LLVM IR for cross-platform optimization
● Performance tuning:
  -fmad=enable          # fuse floating-point multiply-adds (higher compute density)
  -use-fast-math        # fast math library (trades accuracy for speed)
  -ftz=enable           # flush single-precision denormals to zero (speeds up corner cases)
● Debugging and analysis:
  -resource-usage       # report register/shared-memory usage (key for performance tuning)
  --generate-line-info  # keep device-side line info (for debug and profiling tools)
● Heterogeneous programming support:
  -default-stream       # set the default compute stream (basis of multi-stream programming)
  -Xdevice / -Xhost     # pass flags to the device/host compiler respectively (key to heterogeneous builds)
● Advanced features:
  -fopenmp-targets      # OpenMP offloading targets (CPU+GPU cooperation)
  -extended-lambda      # allow __host__/__device__ annotations in lambdas (functional style)
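Putting a few of these flags together, a typical compile command might look as follows. This is a dry-run sketch: the command line is only printed, not executed, since mxcc may not be installed here. The flag spellings come verbatim from the `mxcc --help` output above, while the source file name and the `xcore1020` architecture value (after the XCORE1020 device name that macainfo reports) are assumptions.

```shell
# Assemble a representative mxcc command line from the flags above.
SRC=hello_world.cpp          # hypothetical input file
ARCH=xcore1020               # assumed -device-arch value, after the XCORE1020 device

CMD="mxcc -device-arch=$ARCH -fgpu-rdc -fmad=enable -resource-usage $SRC -o hello_world"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

On a machine with the toolchain installed, replacing `echo "$CMD"` with `eval "$CMD"` would perform the actual build.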
%%bash
macainfo # inspect MXMACA device information
======================
MXC System Attributes
======================
Runtime Version: 1.0
System Timestamp Freq: 1000MHz
Signal Max Wait Time: 18446744073709551615(0xffffffffffffffff)
Machine Model: LARGE
System Endianess: LITTLE
***********
Agent 1
***********
Name: INTEL(R) XEON(R) GOLD 6330 CPU @ 2.00GHZ
Uuid: 4350552d-5858-0000-0000-000000000000
Market Name: Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
Vendor Name: CPU
Feature: Not Specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0
Queue Max Size: 0
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq(MHz): 0
BDFID: 0
Internal Node ID: 0
Accelerator Processors: 56
PEUs per AP: 0
Data Processor Clustes(DPCs): 0
DPC Arrays per DPC.: 0
Watch Pointers on Address Ranges: 1
Pool Info:
Pool 1
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 1626668(0x18d22c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 2077004(0x1fb14c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 1626668(0x18d22c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 2077004(0x1fb14c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
ISA Info:
N/A
***********
Agent 2
***********
Name: INTEL(R) XEON(R) GOLD 6330 CPU @ 2.00GHZ
Uuid: 4350552d-5858-0000-0000-000000000000
Market Name: Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
Vendor Name: CPU
Feature: Not Specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0
Queue Max Size: 0
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq(MHz): 0
BDFID: 0
Internal Node ID: 1
Accelerator Processors: 56
PEUs per AP: 0
Data Processor Clustes(DPCs): 0
DPC Arrays per DPC.: 0
Watch Pointers on Address Ranges: 1
Pool Info:
Pool 1
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 1626668(0x18d22c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 2077004(0x1fb14c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 1626668(0x18d22c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 2077004(0x1fb14c) KB
Allocatable: TRUE
Alloc Granule: 4 KB
Alloc Alignment: 4 KB
Accessible by all: TRUE
ISA Info:
N/A
***********
Agent 3
***********
Name: XCORE1020
Uuid: 7302869f-215c-9fe9-4d96-3a19ab0419b3
Market Name: Device 4083
Vendor Name: METAX
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 12(0xc)
Queue Min Size: 1000
Queue Max Size: 20000
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 32 KB
L2: 8192 KB
Chip ID: 16515(0x4083)
Cacheline Size: 128(0x80)
Max Clock Freq(MHz): 1600
BDFID: 13312
Internal Node ID: 2
Accelerator Processors: 56
PEUs per AP: 4
Data Processor Clustes(DPCs): 8
DPC Arrays per DPC.: 1
Watch Pointers on Address Ranges: 4
Fast Float16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per AP: 32(0x20)
Max Work-items Per AP: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgroups: 32
Pool Info:
Pool 1
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 4
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 5
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 6
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 7
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 8
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 9
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 10
Segment: GLOBAL
Flags: COARSE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 11
Segment: GLOBAL
Flags: FINE_GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 0 KB
Alloc Alignment: 65536 KB
Accessible by all: FALSE
Pool 12
Segment: PRIVATE
Flags:
Size: 0(0x0) KB
Allocatable: FALSE
Alloc Granule: 0 KB
Alloc Alignment: 0 KB
Accessible by all: FALSE
Pool 13
Segment: GROUP
Flags:
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0 KB
Alloc Alignment: 0 KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: METAX-MXC-MXMACA--XCORE1020
Machine Models: LARGE
Profiles: BASE
Fast Float16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 18446744073709551615(0xffffffffffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgroups: 32
PEUs per AP: 4   # each AP contains 4 processing units
Question: only 4 per AP? Then why is the NVIDIA equivalent called a "streaming multiprocessor" (SM), and what does "streaming" mean? The CUDA diagrams online show far more than 4 cores per NVIDIA SM.
In GPU architecture, the streaming multiprocessor (SM) is a key compute unit, and PEUs (Processing Element Units) are the execution units inside it. A detailed explanation of the questions above:
● The meaning of SM (streaming multiprocessor). The SM is the core compute unit of a GPU, responsible for executing CUDA threads and compute tasks. It contains multiple CUDA cores (the analogue of PEUs), Tensor cores (for deep-learning acceleration), texture units, and so on. "Streaming" comes from the GPU's parallel computing model: efficiency is achieved by processing large streams of data in parallel.
● The number of PEUs. Different GPU architectures put different numbers of CUDA cores in each SM. For example:
  NVIDIA Turing (e.g. the T4): 64 CUDA cores per SM.
  NVIDIA Ampere (e.g. the A100): 64 CUDA cores per SM.
  NVIDIA Hopper: 128 CUDA cores per SM.
  The MetaX GPU reports 4 PEUs per AP, which may reflect design goals and target workloads that differ from NVIDIA's: MetaX may be optimizing for specific compute tasks rather than pursuing the widest possible parallelism.
● The relationship between DPC (Data Processing Cluster) and SM. A DPC typically manages and optimizes data transfer and processing, while the SM focuses on executing compute tasks. In a GPU architecture, the DPC can feed optimized data to the SMs, reducing transfer latency and improving overall efficiency; the relationship is roughly "data supplier" to "compute executor".
● Why SM and PEU counts differ. Design goals: NVIDIA's GPUs target high-performance computing and graphics rendering and therefore need many CUDA cores for high-throughput parallelism, whereas a MetaX GPU may focus on particular domains (such as AI inference or specific compute tasks) and weight its SM design toward energy efficiency and those tasks. Application scenarios: NVIDIA GPUs serve gaming, scientific computing, AI training, and other massively parallel workloads; a more focused workload allows a more streamlined design.
In short, the differences in SM and PEU counts reflect different design goals and application scenarios: NVIDIA achieves high-performance parallel computing through large numbers of CUDA cores, while MetaX may instead optimize the execution efficiency of specific tasks.
The wavefront concept: on AMD GPUs, the wavefront is the basic unit of scheduling and execution, equivalent to NVIDIA's warp, but it contains 64 threads where a warp contains 32. MetaX appears to have adopted an AMD-like design, hence 64 threads per wavefront.
Why don't all cores participate in computation at once?
● Hardware resource limits: although each SM or AP has many CUDA cores or PEUs, resources such as registers and shared memory are finite, so threads are scheduled in batches rather than all at once.
● Instruction pipelining: to raise efficiency, GPUs pipeline instructions; a core idle in one cycle may pick up other work in the next.
● Latency hiding: by keeping several warps or wavefronts resident, the GPU can switch to other threads while some wait on memory accesses, hiding the latency.
But with only 56 APs of 4 PEs each in total, wouldn't a wavefront of size 64 need 64/4 = 16 APs executing simultaneously?
(1) The single-instruction, multiple-data (SIMD) model. GPUs generally use SIMD to achieve efficient parallelism:
● a single PE can serve multiple threads, all of which execute the same instruction;
● on NVIDIA GPUs, a 32-thread warp is assigned to the CUDA cores of one SM (streaming multiprocessor), each core handling one of the warp's threads per clock cycle;
● on AMD GPUs, a 64-thread wavefront is assigned to the PEUs of one AP (accelerator processor), each PEU handling one of the wavefront's threads per cycle.
So a wavefront never spans multiple APs: it stays resident on a single AP, and its 64 threads are pipelined through that AP's 4 PEUs over roughly 64/4 = 16 cycles per instruction.
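These numbers can be cross-checked against the macainfo report for the XCORE1020 above (Wavefront Size 64, PEUs per AP 4, Max Work-items Per AP 2048, Max Waves Per AP 32) with a little shell arithmetic:

```shell
# Values copied from the macainfo output for Agent 3 (XCORE1020).
WAVEFRONT_SIZE=64          # threads per wavefront
PEUS_PER_AP=4              # execution units per AP
MAX_WORKITEMS_PER_AP=2048  # "Max Work-items Per AP"
MAX_WAVES_PER_AP=32        # "Max Waves Per AP"

# A wavefront stays on one AP; its 64 threads are pipelined through
# the AP's 4 PEUs, i.e. about 64/4 = 16 cycles per instruction.
echo "cycles per wavefront instruction: $((WAVEFRONT_SIZE / PEUS_PER_AP))"

# Resident wavefronts per AP available for latency hiding; this should
# reproduce the reported "Max Waves Per AP" value.
echo "resident waves per AP: $((MAX_WORKITEMS_PER_AP / WAVEFRONT_SIZE))"
```

The second figure matching the reported 32 confirms that occupancy here is bounded by resident work-items, with many wavefronts kept in flight to hide memory latency.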
%%bash
# If the macainfo command fails to run, the MXMACA cloud runtime environment needs to be configured correctly.
export   # with no arguments, `export` lists the currently exported environment variables
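If mxcc or macainfo is missing from the PATH, a minimal environment setup might look like the following sketch, assuming an installation under /opt/maca (the directory where `which mxcc` found the compiler earlier). The variable name MACA_PATH and the lib/ subdirectory are assumptions to check against the MXMACA installation guide.

```shell
# Hedged sketch of MXMACA environment setup; paths below are assumptions
# based on the /opt/maca/mxgpu_llvm/bin/mxcc location shown earlier.
export MACA_PATH=/opt/maca
export PATH="$MACA_PATH/mxgpu_llvm/bin:$PATH"             # so `which mxcc` succeeds
export LD_LIBRARY_PATH="$MACA_PATH/lib:$LD_LIBRARY_PATH"  # assumed runtime library dir

# Re-check; on a correctly configured cloud account this should now succeed.
which mxcc >/dev/null 2>&1 || echo "mxcc still not found; check the MXMACA installation"
```

Adding these lines to ~/.bashrc would make the setup persistent across shell sessions.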