MMDetection: a problem with single-machine multi-GPU training

Command used:

CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh ${CONFIG_FILE} 4 --resume-from ${CHECKPOINT_FILE}

Problem:

  • The model crashes after training for 1 epoch, with the following error:
RuntimeError: replicas_[0].size() == rebuilt_param_indices_.size() INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/distributed/c10d/reducer.cpp":1326, please report a bug to PyTorch. rebuilt parameter indices size is not same as original model parameters size.321 versus 629160
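
Fix: reinstall mmcv-full, then rerun the source install (presumably from the MMDetection repository root):

pip uninstall mmcv-full
pip install mmcv-full
python setup.py install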
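
Since the fix is a reinstall, it can help to record which package versions are installed before and after running it. The post does not state the root cause, so treating this as a version mismatch is only an assumption. The sketch below is not part of MMDetection. The helper name `installed_versions` is hypothetical, and the distribution names `torch`, `mmcv-full`, and `mmdet` are assumed to match how the packages were installed.

```python
from importlib import metadata


def installed_versions(packages=("torch", "mmcv-full", "mmdet")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one line per package so the output can be compared
    # before and after the reinstall.
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this before and after the reinstall shows whether the mmcv-full version actually changed. Whether a given combination is compatible still has to be checked against the MMCV installation notes.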
Licensed under CC BY-NC-SA 4.0