Hugo Evers
10/23/2023, 8:24 AM
`if __name__ == '__main__'` solves a lot of problems. So if you have problems with processes, try writing your code inside that context.
And indeed I ran into issues when running parallelformers on AWS Batch on a p3.16xlarge instance with 8 GPUs, so it's running in Kedro in a Docker container.
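(The `__main__` guard mentioned above can be sketched with a toy example, not from the thread; `square` is a made-up function:)

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == '__main__':
    # With the "spawn" start method (default on macOS/Windows, and the
    # CUDA-safe choice on Linux), each worker re-imports this module;
    # the guard keeps the pool creation from running again in children.
    with mp.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # prints [1, 4, 9]
```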
datajoely
10/23/2023, 9:31 AM
Are you using `ParallelRunner` as well? It may interfere.
Hugo Evers
10/23/2023, 9:48 AM
I'm using `SequentialRunner` and finding it causes the issues that were mentioned.

datajoely
10/23/2023, 9:51 AM

Hugo Evers
10/23/2023, 9:53 AM
Running

```python
import torch
from transformers import TrainingArguments

# get the number of GPUs
num_gpus = torch.cuda.device_count()
if num_gpus > 1:
    from parallelformers import parallelize
    parallelize(model, num_gpus=num_gpus, fp16=True, verbose="detail")
```

inside a Kedro node gives:

```
RuntimeError: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=9, timeout=0:30:00)
WARNING  No nodes ran. Repeat the previous command to attempt a new run.  runner.py:213
[10/15/23 12:57:26] ERROR  Node 'sort_using_baal: func[redacted]) -> [redacted]' failed with error: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=9, timeout=0:30:00)  node.py:356
```
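(One way to apply the `if __name__ == '__main__'` advice from earlier in the thread to the snippet above is to defer all GPU work into a function, so that processes re-importing the module do not touch CUDA or the process group at import time. A plausible cause of `worker_count=9` exceeding `world_size=8` is an extra process joining the group. Sketch only; `maybe_parallelize` is a hypothetical helper name:)

```python
def maybe_parallelize(model, fp16=True):
    # Lazy imports keep this module side-effect free on import, so
    # spawned workers that re-import the file don't initialize CUDA
    # or join the process group a second time.
    import torch
    num_gpus = torch.cuda.device_count()
    if num_gpus > 1:
        from parallelformers import parallelize
        parallelize(model, num_gpus=num_gpus, fp16=fp16, verbose="detail")
    return model
```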
## Environment
python 3.10.1
parallelformers: latest
os: Ubuntu

Juan Luis
10/23/2023, 11:32 AM