openai jukebox - fix for RuntimeError: Failed to initialize NCCL


I started to document a fix and share a patch diff for this, but evidently never saved my changes anywhere. All that is left of the effort is this document, which LibreOffice Writer offered to recover when I opened it a moment ago. Even unfinished, I figure this much may help someone struggling to get an OpenAI Jukebox Colab notebook working today (November 2022), long after most people have apparently abandoned such things.
I leave it here as I found it: incomplete, random red arrow image pointing at nothing included. I lack the motivation right now to go back and work out what I was going to publish when finished. Even so, it still offers an explanation and the simplest fix I found for the issue tracked at https://github.com/openai/jukebox/issues/18 (RuntimeError: Failed to initialize NCCL), which appears when running https://colab.research.google.com/github/sirbots/jukebox-the-continuator/blob/master/Jukebox_the_Continuator.ipynb or even, with enough motivation, github/openai/jukebox/blob/master/jukebox/Interacting_with_Jukebox.ipynb

From dist_utils.py:




This causes the Jupyter kernel to bind to TCP port 29500 and *listen* on the loopback interface (because torch.distributed.is_available() returns True), as if it were the MPI master, and it then starts making connections to itself.
If you interrupt the notebook or otherwise re-run it from the same Jupyter kernel, this TCP port will already be in use, and the call to _setup_dist_from_mpi() will retry n_attempts times and then bail out with the RuntimeError above.
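To confirm that this is what you are hitting, you can check from a notebook cell whether something is already listening on the port. This is only a minimal sketch using Python's standard socket module; the helper name port_in_use is my own invention and not part of jukebox, and the defaults (127.0.0.1, port 29500) are simply the loopback address and port described above.

```python
import socket


def port_in_use(port: int = 29500, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port.

    Hypothetical helper for illustration; it is not part of jukebox.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex() returns 0 when the connection succeeds,
        # which means a listener is already bound to the port.
        return probe.connect_ex((host, port)) == 0


# True here would suggest the kernel is still holding the port from an earlier run.
print("port 29500 in use:", port_in_use())
```

If this prints True before you have run the setup cell in a fresh session, something else is holding the port and the retries will most likely fail the same way.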
A simple fix for this behavior is to choose “Restart runtime” or “Restart and run all” from Colab’s Runtime menu, which frees up the port so that the setup_dist_from_mpi() call can succeed, just as it did the first time. The same condition will occur again, however, if you attempt to re-execute the cell in the same Jupyter runtime.
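The reason a restart helps can be reproduced with plain sockets. The sketch below is an analogy rather than jukebox code: it uses an OS-chosen port instead of 29500 and an ordinary listening socket as a stand-in for the one the kernel holds, and try_listen is a made-up helper name.

```python
import socket


def try_listen(port: int, host: str = "127.0.0.1"):
    """Try to bind and listen on host:port.

    Returns the listening socket, or None if the port is already taken.
    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError:
        # Typically "Address already in use".
        sock.close()
        return None


# First run: the port is free, so listening succeeds (port 0 = let the OS pick one).
first = try_listen(0)
port = first.getsockname()[1]

# Re-running in the same process: the first listener still holds the port.
second = try_listen(port)
print("second attempt succeeded:", second is not None)

# Releasing the listener (what restarting the runtime does for the kernel) frees the port.
first.close()
third = try_listen(port)
print("attempt after release succeeded:", third is not None)
```

Restarting the runtime plays the role of first.close() here: the old process goes away, its listener with it, and the next bind on the same port can succeed.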