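These are personal notes, updated continuously, recording the problems I have run into while using PyTorch and how I solved them. If anyone happens to read this, comments are welcome so we can improve together.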

installation problems

conda Tsinghua mirror

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
# If you use packages from conda-forge, run the next line
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
# If you need pytorch, run the next line
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
conda config --set show_channel_urls yes
# Show the current config
conda config --show
# To remove a single mirror entry
conda config --remove channels https://pypi.doubanio.com/simple/
# Restore the defaults (not tested)
conda config --remove-key channels

To install PyTorch, first look up the install command on the PyTorch website.

conda install pytorch torchvision cudatoolkit=10.1 -c pytorch

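Because -c pytorch explicitly tells conda to download from the official pytorch channel instead of the configured mirror, simply drop -c pytorch to use the mirror.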

username is not in the sudoers file

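This can happen when running commands other than conda and pip from a non-administrator account. Of the fixes in the references, the one below worked and was the most convenient.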

# From an administrator account, switch to root
su root
# Replace username with your own account name
adduser username sudo

no space left on device

The most common cause is that /tmp has filled up because the server has been running for a long time.

# Temporary fix, on the command line
export TMPDIR=/home/username/tmp
# "Permanent" fix, not recommended, since /tmp is cleared on reboot anyway
# Add this to ~/.bashrc
export TMPDIR=/home/username/tmp
source ~/.bashrc
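For Python programs, the effect of TMPDIR can be checked with the standard tempfile module, which consults TMPDIR when choosing its default directory. This is a minimal sketch; the scratch directory is a throwaway one created only for illustration and stands in for the /home/username/tmp path above.

```python
import os
import tempfile

# Hypothetical scratch directory standing in for /home/username/tmp.
scratch = tempfile.mkdtemp(prefix="scratch_")

# Point TMPDIR at it, as the export line would do in the shell.
os.environ["TMPDIR"] = scratch

# tempfile caches its choice, so clear the cache to force a fresh lookup.
tempfile.tempdir = None

print(tempfile.gettempdir())

# New temporary files now land in the scratch directory instead of /tmp.
with tempfile.NamedTemporaryFile() as f:
    in_scratch = os.path.dirname(f.name) == tempfile.gettempdir()
print(in_scratch)
```

When TMPDIR is exported in the shell before Python starts, the cache reset is not needed; it is only there because this sketch changes the variable inside an already running process.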

conda install opencv-python failed

Installing opencv-python with conda often runs into environment conflicts or package problems (see the reference), for example: no local packages or working links found for opencv-python.

# After many attempts: when this happens, just use pip install
# To be safe, check that pip is the one from the conda env
which pip
pip install opencv-python

Possible cause: I am not sure yet and will look into it when I have time.

Version problems

CUDNN_STATUS_EXECUTION_FAILED

Problem: running the yolo GitHub code fails with RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED.

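Solution, and the general way to debug this kind of problem: first disable the GPU and run the code on the CPU to see whether the code itself is at fault. Once the code has been fixed on the CPU, or runs there without problems, you can be fairly sure the cause is a version mismatch among cuda, cudnn, VS, python, and pytorch.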
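One way to force a CPU-only run without touching the model code is to hide the GPUs from the process. The sketch below assumes the script selects its device from torch.cuda.is_available(); the Conv2d layer and the random input are hypothetical stand-ins for the real model and data.

```python
import os

# Hide all GPUs from this process. This has to happen before CUDA is
# initialized, so set it before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

# With no visible GPU, the usual device selection falls back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-ins for the real model and input (illustration only).
model = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
x = torch.randn(2, 3, 16, 16, device=device)

out = model(x)
print(device, tuple(out.shape))
```

If the forward pass succeeds here but fails on the GPU, the code is probably fine and the environment is the next thing to check.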

Run nvcc -V to check the CUDA version and make sure the PyTorch version matches it. A mirror can be used to speed up the installation.
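The PyTorch side of the comparison can be read from Python. A small sketch follows; note that nvcc -V reports the toolkit installed on the system, while torch.version.cuda reports the CUDA version PyTorch was built against, so the two can legitimately differ.

```python
import torch

report = {
    "torch": torch.__version__,
    # CUDA version this PyTorch build was compiled against; None on CPU-only builds.
    "cuda_build": torch.version.cuda,
    # cuDNN version PyTorch sees; None if cuDNN is unavailable.
    "cudnn": torch.backends.cudnn.version(),
    # Whether a usable GPU and driver are visible right now.
    "cuda_available": torch.cuda.is_available(),
}

for key, value in report.items():
    print(f"{key}: {value}")
```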

Code Snippet

BatchNorm initialization

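import torch

def weights_init_normal(m):
    # Choose the initialization by layer class name, e.g. Conv2d or BatchNorm2d.
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        # Conv weights: normal distribution with mean 0 and std 0.02.
        torch.nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm2d") != -1:
        # BatchNorm scale (gamma) close to 1, shift (beta) at 0.
        torch.nn.init.normal_(m.weight.data, 1.0, 0.02)
        torch.nn.init.constant_(m.bias.data, 0.0)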

References: PyTorch forum, PyTorch tutorial, GitHub issue

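To summarize: in older PyTorch releases BatchNorm defaulted to init.uniform_(self.weight) (more recent releases initialize the weight to ones), but the BatchNorm weight should start near 1. The weight is gamma, which only rescales the normalized activations, so its initial value should be close to 1.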
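Applying the initializer to a whole network is usually done with Module.apply, which visits every submodule. The sketch below repeats weights_init_normal so that it runs on its own; the three-layer network is a hypothetical example, not a model from these notes.

```python
import torch
import torch.nn as nn

def weights_init_normal(m):
    # Same initializer as in the snippet above.
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        torch.nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm2d") != -1:
        torch.nn.init.normal_(m.weight.data, 1.0, 0.02)
        torch.nn.init.constant_(m.bias.data, 0.0)

torch.manual_seed(0)

# Hypothetical small network used only to demonstrate the call.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# apply() calls the function on every submodule, then on net itself.
net.apply(weights_init_normal)

bn = net[1]
print("gamma mean:", bn.weight.mean().item())
print("beta max abs:", bn.bias.abs().max().item())
```

The printed gamma mean should sit close to 1 and beta should be exactly 0, which is the behavior the summary above argues for.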

Loading a checkpoint trained on GPU on a CPU-only machine

If a model was trained on the GPU and its checkpoint saved from there, calling torch.load directly on a machine that only has a CPU may fail with this error:

raise AssertionError("Torch not compiled with CUDA enabled")

Use:

net.load_state_dict(torch.load('model.pt', map_location=lambda storage, loc: storage))

Here map_location is passed a function that takes a CPU storage and its serialized location as input and returns some storage to replace it. Returning the storage unchanged keeps everything on the CPU.
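torch.load also accepts a device (or device string) for map_location, which reads more simply than the lambda when everything should go to the CPU. A round-trip sketch follows, with a hypothetical nn.Linear standing in for the real network and a temporary file standing in for model.pt.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the real network.
net = nn.Linear(4, 2)

# Save only the state_dict, as in the snippet above.
path = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.save(net.state_dict(), path)

# Remap every tensor in the checkpoint to the CPU while loading.
state = torch.load(path, map_location=torch.device("cpu"))

restored = nn.Linear(4, 2)
restored.load_state_dict(state)

print(all(t.device.type == "cpu" for t in state.values()))
```

The same call works unchanged for a checkpoint that was saved from GPU tensors, which is the case these notes are about; the sketch saves from the CPU only so that it runs anywhere.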

Advanced (rarely used) features

non_blocking

doc

As an exception, several functions such as to() and copy_() admit an explicit non_blocking argument, which lets the caller bypass synchronization when it is unnecessary.

pytorch discuss

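With non_blocking, copying data from the CPU to the GPU can run asynchronously with GPU kernel execution, but two conditions must hold: 1. the dataloader uses pin_memory; 2. the training loop does not move results back to the CPU, i.e. there is no other sync point.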
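Both conditions can be seen in a small training loop. This is a sketch under assumptions: the random dataset, the linear model, and the hyperparameters are all made up for illustration, and on a machine without a GPU the loop still runs but non_blocking has no effect.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Made-up data: 64 samples, 8 features, 2 classes.
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))

# Condition 1: pinned host memory, so host-to-device copies can be asynchronous.
loader = DataLoader(dataset, batch_size=16, pin_memory=use_cuda)

model = torch.nn.Linear(8, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

# Condition 2: accumulate the loss on the device; calling .item() inside
# the loop would force a synchronization on every step.
running_loss = torch.zeros((), device=device)
steps = 0
for x, y in loader:
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    running_loss += loss.detach()
    steps += 1

# Single sync point, after the loop.
mean_loss = (running_loss / steps).item()
print(steps, mean_loss)
```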

PyTorch troubleshooting notes