Deep Learning Framework: Memory Optimization Analysis with the Massif Tool


1 Problem Background

The runtime environment is x86. Inference with the business model has stringent memory requirements, so runtime memory consumption must be reduced.

2 Diagnosis

Capture a heap profile on the x86 server with Valgrind's Massif tool. Command:

export LD_LIBRARY_PATH=${PWD}/src
valgrind --tool=massif --time-unit=B ./tools/benchmark/benchmark --modelFile=/home/xxx/models/ASR/encoder_weight_fly.ms --enableFp16=true --warmUpLoopCount=0 --loopCount=1

Analyzing the large allocation events in the resulting log (Massif writes a massif.out.<pid> file, which can be rendered with ms_print), one site dominates: inner_allocator.cc:72 allocates about 90 MB.

99.14% (98,541,096B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->90.04% (89,498,712B) 0x54C9D58: mindspore::DefaultAllocator::Malloc(unsigned long) (inner_allocator.cc:72)
| ->89.61% (89,067,960B) 0x54F7DD6: mindspore::lite::Tensor::MallocData(std::shared_ptr<mindspore::Allocator>) (tensor.cc:432)
| | ->89.61% (89,067,960B) 0x5534713: mindspore::kernel::LiteKernel::PreProcess() (lite_kernel.cc:73)
| |   ->89.61% (89,067,960B) 0x55349F7: mindspore::kernel::LiteKernel::Execute() (lite_kernel.cc:105)
| |     ->89.61% (89,067,960B) 0x553777C: mindspore::kernel::KernelExec::DoExecute() (kernel_exec.cc:80)
| |       ->89.61% (89,067,960B) 0x55387D5: mindspore::kernel::KernelExec::Execute(std::function<bool (std::vector<mindspore::lite::Tensor*, std::allocator<mindspore::lite::Tensor*> >, std::vector<mindspore::lite::Tensor*, std::allocator<mindspore::lite::Tensor*> >, mindspore::MSCallBackParam const&)> const&, std::function<bool (std::vector<mindspore::lite::Tensor*, std::allocator<mindspore::lite::Tensor*> >, std::vector<mindspore::lite::Tensor*, std::allocator<mindspore::lite::Tensor*> >, mindspore::MSCallBackParam const&)> const&) (kernel_exec.h:104)

Massif tool documentation: Massif: a heap profiler

Root cause: the memory allocator pre-allocates one large block up front and, to enable reuse, tracks the state of each chunk (allocated or freed) with reference counts. However, the graph's (const node) topological order is breadth-first, so the tensors of all branches are live at the same time and the pre-allocated memory cannot actually be reused.

3 Solution

Optimization principle: re-sort the graph's topological order depth-first, so that each branch completes and frees its intermediate tensors before the next branch runs, allowing the allocator to actually reuse memory.