reference: ug1027
Software Acceleration
- use source-code libraries of existing hardware functions, such as the Xilinx xfOpenCV library
- modify your code to make better use of the PL device architecture
sds++ system compiler
- -c option: invokes HLS to compile a file into hardware IP
- translates #pragma SDS into pragmas understood by HLS
HLS
- scheduling
- pipelining
- dataflow
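As a rough illustration of the pipelining step, the sketch below annotates a loop with an HLS pipeline directive. The function name `vadd`, the length `N`, and the helper `vadd_checksum` are hypothetical and chosen only for this example; a standard host compiler ignores the pragma, so the same source also runs as plain C++.

```cpp
#include <cassert>

constexpr int N = 8;  // hypothetical array length, chosen for illustration

// Hypothetical hardware-function candidate: element-wise vector add.
void vadd(const int a[N], const int b[N], int out[N]) {
    for (int i = 0; i < N; ++i) {
        // Asks HLS to aim for one new loop iteration per clock cycle (II=1).
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];
    }
}

// Host-side check: fills the inputs, runs vadd, returns the output sum.
int vadd_checksum() {
    int a[N], b[N], out[N];
    for (int i = 0; i < N; ++i) {
        a[i] = i;
        b[i] = 10 * i;
    }
    vadd(a, b, out);
    int sum = 0;
    for (int i = 0; i < N; ++i) {
        assert(out[i] == 11 * i);
        sum += out[i];
    }
    return sum;
}
```

Whether the tool actually reaches the requested initiation interval depends on the loop body and the memory ports available, so the directive is a request rather than a guarantee.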
sds++ linker
- analyzes dataflow into/between hardware functions
- identifies operations that can be shared
- orchestrates accelerators and data transfers through data movers
- generates software control code (stubs)
- inserts wait barrier API calls into stubs
Execution Model
hardware functions -> hardware accelerators that are accessed as a task with the standard C runtime through calls into these functions.
- CPU & accelerators: through arguments, after task completion
- memory & accelerators: through data movers, e.g. a DMA engine
- data movers are automatically inserted into the system by the sds++ system compiler, taking into account user data mover pragmas such as zero_copy
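A minimal sketch of a user data mover pragma, assuming the pragma is placed immediately before the hardware function declaration. The function `scale2`, the length `N`, and the helper `scale2_last` are hypothetical. On a host compiler the pragma is ignored and the function runs in software.

```cpp
#include <cassert>

constexpr int N = 16;  // hypothetical buffer length

// Data mover hint: the accelerator accesses `in` and `out` directly in shared
// memory rather than through a copying transfer. Under sds++ this is
// expected to need physically contiguous buffers, e.g. from sds_alloc().
#pragma SDS data zero_copy(in[0:N], out[0:N])
void scale2(const int in[N], int out[N]);

// Hypothetical hardware function: doubles every element.
void scale2(const int in[N], int out[N]) {
    for (int i = 0; i < N; ++i) {
        out[i] = 2 * in[i];
    }
}

// Host-side check: returns the last output element for in[i] = i.
int scale2_last() {
    int in[N], out[N];
    for (int i = 0; i < N; ++i) {
        in[i] = i;
    }
    scale2(in, out);
    assert(out[0] == 0);
    return out[N - 1];
}
```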
system compiler
- intercepts each call to a hardware function, and
- replaces it with a call to a generated stub function that has an identical signature but with a derived name.
stub function
- synchronizes software and accelerator hardware at the exit of the hardware function call
- controls all accelerators and data movers through a set of send and receive APIs provided by the sds_lib library within the stub
one optimization: array arguments passed between hardware function calls, if the program does not access them after the calls (other than in destructors or `free()` calls), flow from one accelerator to the next through a direct stream
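The stream optimization applies to call patterns like the one sketched here, where an intermediate array is produced by one hardware function and consumed only by the next. The names `vmul`, `vacc`, `mul_then_add`, and `chain_last` are hypothetical; whether the compiler really creates a direct stream also depends on the accelerators being connectable that way.

```cpp
#include <cassert>

constexpr int N = 4;  // hypothetical array length

// Hypothetical hardware function 1: element-wise multiply.
void vmul(const int a[N], const int b[N], int out[N]) {
    for (int i = 0; i < N; ++i) out[i] = a[i] * b[i];
}

// Hypothetical hardware function 2: element-wise add.
void vacc(const int a[N], const int c[N], int out[N]) {
    for (int i = 0; i < N; ++i) out[i] = a[i] + c[i];
}

void mul_then_add(const int a[N], const int b[N], const int c[N], int out[N]) {
    int tmp[N];         // written by vmul, read only by vacc
    vmul(a, b, tmp);
    vacc(tmp, c, out);
    // The CPU never reads tmp after the calls, so it is a candidate for a
    // direct accelerator-to-accelerator stream instead of a memory round trip.
}

// Host-side check: returns the last output element.
int chain_last() {
    int a[N], b[N], c[N], out[N];
    for (int i = 0; i < N; ++i) {
        a[i] = i + 1;
        b[i] = 2;
        c[i] = 1;
    }
    mul_then_add(a, b, c, out);
    assert(out[0] == 3);
    return out[N - 1];
}
```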
SDSoC program execution steps
- Initialization of the sds_lib library occurs during the program constructor, before entering main().
- Every call to a hardware function is intercepted by a function call into a stub function with the same function signature (other than name) as the original function. Within the stub function, the following steps occur:
a. A synchronous accelerator task control command is sent to the hardware.
b. For each argument to the hardware function, an asynchronous data transfer request is sent to the appropriate data mover, with an associated wait() handle. A non-void return value is treated as an implicit output scalar argument.
c. A barrier wait() is issued for each transfer request. If a data transfer between accelerators is implemented as a direct hardware stream, the barrier wait() for this transfer occurs in the stub function for the last in the chain of accelerator functions for this argument (only the last one needs to wait).
- Clean up of the sds_lib library occurs during the program destructor, upon exiting main().
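The ordering of steps a to c inside a stub can be pictured with a small software-only model. Everything here (`TransferHandle`, `request_transfer`, `barrier_wait`, `stub_model`, the `trace` log) is invented for illustration and is not the sds_lib API or the code that sds++ generates; it only shows that all requests are issued before any barrier wait.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy wait handle: the transfer "completes" only when it is waited on.
struct TransferHandle {
    std::function<void()> complete;
};

std::vector<std::string> trace;  // records the order of events

// Models an asynchronous transfer request to a data mover.
TransferHandle request_transfer(const std::string& arg) {
    trace.push_back("request " + arg);
    return TransferHandle{[arg] { trace.push_back("done " + arg); }};
}

// Models the barrier wait() on one transfer.
void barrier_wait(TransferHandle& h) { h.complete(); }

// Models a stub for a hypothetical hardware function `int f(int* in, int* out)`.
void stub_model() {
    trace.clear();
    trace.push_back("task command");                    // step a
    std::vector<TransferHandle> handles;
    for (const char* arg : {"in", "out", "return"}) {   // step b
        handles.push_back(request_transfer(arg));       // return value = implicit output
    }
    for (TransferHandle& h : handles) {                 // step c
        barrier_wait(h);
    }
    assert(trace.size() == 1 + 2 * handles.size());
}
```

After `stub_model()` runs, `trace` holds the task command first, then the three requests, then the three completions.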
Customized concurrent task execution
#pragma SDS async(ID): generates a stub function without any barrier wait() calls for data transfers
the stub issues all data transfer requests -> returns control to the program
this enables concurrent execution of the program while the accelerator is running
your responsibility: insert #pragma SDS wait(ID) within the program at appropriate synchronization points; these are resolved into sds_wait(ID) API calls that correctly synchronize the hardware accelerators, their implicit data movers, and the CPU.
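A minimal sketch of the async/wait pattern, assuming ID 1 and a hypothetical hardware function `square_all`. On a host compiler both pragmas are ignored and the call is simply synchronous, which makes the sketch easy to check; under sds++ the CPU loop would be expected to overlap with the accelerator.

```cpp
#include <cassert>

constexpr int N = 8;  // hypothetical array length

// Hypothetical hardware function: squares every element.
void square_all(const int in[N], int out[N]) {
    for (int i = 0; i < N; ++i) out[i] = in[i] * in[i];
}

int overlap_example() {
    int in[N], out[N];
    for (int i = 0; i < N; ++i) in[i] = i;

    // The stub returns once all transfer requests have been issued.
#pragma SDS async(1)
    square_all(in, out);

    // Independent CPU work; it must not touch `in` or `out`.
    int cpu_sum = 0;
    for (int i = 0; i < 100; ++i) cpu_sum += i;

    // Synchronization point: `out` is only safe to read after this.
#pragma SDS wait(1)
    int hw_sum = 0;
    for (int i = 0; i < N; ++i) hw_sum += out[i];

    assert(hw_sum == 140);
    return cpu_sum + hw_sum;
}
```

Each async(ID) needs a matching wait(ID); reading `out` before the wait would be a race in the accelerated build even though it works on the host.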
Build Process
build: using sds++ system compiler
- compilation
- Compilation for the main application (on the Arm core) & each hardware accelerator
- Compiling the application code with standard GNU Arm compilation tools, producing an object (.o) file
- Running the hardware accelerated functions through HLS, producing an object (.o) file
- linking
- Modifying the hardware platform to accept the accelerators
- Implementing the hardware accelerators into PL: synthesis, implementation, bitstream generation (Vivado)
- Updating the software images with hardware access APIs to call hardware functions
- Producing an integrated SD card image that can boot the board with the application in an Executable and Linkable Format (ELF) file.
PS: The Data Motion Network report lists the accelerated functions and how their arguments were mapped and connected to platform interfaces
sds++ system compiler:
- HLS & Vivado: implement the generated hardware system
- Arm compiler & sds++ linker: create application binaries that run on the CPU and invoke the accelerators (through stubs) for each hardware function; the output is a complete bootable system for an SD card.
Best Practices
- streaming data
- Reuse data
- task-level parallelization
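Streaming and data reuse often go together: read each input element exactly once, in order, and keep recently used values in local registers. The sketch below is one hedged way to write that, with a hypothetical 3-tap running sum `window_sum3` and helper `window_last`; the access_pattern pragma is assumed to sit directly before the function declaration.

```cpp
#include <cassert>

constexpr int N = 8;  // hypothetical stream length

// Declares that both arrays are accessed once per element, in index order,
// so they can be implemented as streams.
#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
void window_sum3(const int in[N], int out[N]) {
    int w0 = 0, w1 = 0;  // previous two samples, kept locally for reuse
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        int cur = in[i];          // the only read of in[i]
        out[i] = w0 + w1 + cur;   // sum of the last three samples
        w0 = w1;
        w1 = cur;
    }
}

// Host-side check: returns the last output element for in = 1..N.
int window_last() {
    int in[N], out[N];
    for (int i = 0; i < N; ++i) in[i] = i + 1;
    window_sum3(in, out);
    assert(out[0] == 1);
    assert(out[2] == 6);
    return out[N - 1];
}
```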
software-centric approach
- good memory management techniques: eg: sds_alloc()/sds_free()
- system emulation: verify that the design is functionally correct
- Write/migrate hardware functions to separate C/C++ files so that incremental changes do not require recompiling the entire design
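A minimal sketch of the sds_alloc()/sds_free() pairing. It assumes the `__SDSCC__` macro is defined when building with sds++, and falls back to malloc/free otherwise so the snippet also builds on a host; the fallback wrappers and the helper `alloc_example` are assumptions of this sketch, not part of sds_lib.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#ifdef __SDSCC__
#include "sds_lib.h"  // sds_alloc() / sds_free() under sds++
#else
// Host-only stand-ins (assumption): ordinary heap memory, which is not
// guaranteed to be physically contiguous.
static void* sds_alloc(std::size_t size) { return std::malloc(size); }
static void sds_free(void* ptr) { std::free(ptr); }
#endif

// Allocates n ints, fills them with 0..n-1, and returns the last value.
int alloc_example(std::size_t n) {
    assert(n > 0);
    int* buf = static_cast<int*>(sds_alloc(n * sizeof(int)));
    if (buf == nullptr) {
        return -1;  // allocation failed
    }
    for (std::size_t i = 0; i < n; ++i) {
        buf[i] = static_cast<int>(i);
    }
    int last = buf[n - 1];
    sds_free(buf);  // always pair sds_alloc() with sds_free()
    return last;
}
```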
hardware-centric approach
- Keep track of the AXI4 interface offsets for the IP and the accelerator, and of which function definition parameters require which data types. The interfaces need to be byte aligned.
- Maintain the original Vivado IP project so that modifications to it can be quickly implemented
- Keep the static library (.a) file and corresponding header file together.