Command used:
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh ${CONFIG_FILE} 4 --resume-from ${CHECKPOINT_FILE}
Problem encountered:
- Training crashes after 1 epoch with the following error:
RuntimeError: replicas_[0].size() == rebuilt_param_indices_.size() INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/distributed/c10d/reducer.cpp":1326, please report a bug to PyTorch. rebuilt parameter indices size is not same as original model parameters size.321 versus 629160
- PyTorch GitHub issue: https://github.com/pytorch/pytorch/issues/47050
- mmcv issue: https://github.com/open-mmlab/mmcv/issues/636#issuecomment-722436575
- Solution: install PyTorch 1.6, then reinstall mmcv:
pip uninstall mmcv-full
pip install mmcv-full
python setup.py install
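
After reinstalling, it may help to confirm which versions the active environment actually picks up before relaunching training. The following is a minimal sketch, not part of the original fix: `is_torch_16` is a hypothetical helper name, and the script assumes `python` on the PATH is the interpreter used for training.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from mmcv or PyTorch): succeeds when the given
# version string belongs to the 1.6 series, e.g. "1.6.0" or "1.6.0+cu101".
is_torch_16() {
    case "$1" in
        1.6|1.6.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Ask the active Python environment for the installed versions.
# An empty string means the package could not be imported.
torch_version=$(python -c 'import torch; print(torch.__version__)' 2>/dev/null || true)
mmcv_version=$(python -c 'import mmcv; print(mmcv.__version__)' 2>/dev/null || true)

echo "torch: ${torch_version:-not importable}"
echo "mmcv:  ${mmcv_version:-not importable}"

if is_torch_16 "$torch_version"; then
    echo "PyTorch 1.6 detected"
else
    echo "PyTorch is not 1.6 in this environment"
fi
```

If the script reports a version other than 1.6, the reinstall most likely went into a different environment than the one used by `dist_train.sh`.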