Commit bd9bb36

H-Huang authored and facebook-github-bot committed
Allow ports to be reused in gloo (pytorch#97677)
Summary:
Pull Request resolved: pytorch#97677
X-link: pytorch/gloo#353

ProcessGroupGloo and gloo appear to open and close sockets without allowing their ports to be reused. In larger training jobs this surfaces as "Address already in use" errors, presumably because all ephemeral ports get exhausted. This diff allows ports to be reused; with it we see fewer ports left in the `TIME_WAIT` state.

Context: https://fb.workplace.com/groups/319878845696681/permalink/5988899781205532/
Another issue: https://fb.workplace.com/groups/319878845696681/permalink/958768178474408/

Test Plan: Add a gloo test that creates 4 groups of size 64 (256 ranks in total) using the multithreaded PG + gloo.

Differential Revision: D44029927

fbshipit-source-id: 9c31c38485333602c33e12c12813bea33ccb9438
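As an illustration of the fix (not part of the commit itself), here is a minimal Python sketch of what setting `SO_REUSEADDR` on a listening socket does: it tells the kernel that rebinding a recently used local port is acceptable, which is what avoids the "Address already in use" failures described above. The helper name `bind_with_reuse` is hypothetical, chosen just for this sketch.

```python
import socket

def bind_with_reuse(port=0):
    # Hypothetical helper for illustration: create a TCP listening socket
    # with SO_REUSEADDR set, analogous to the C++ change in this commit.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

# Bind an ephemeral port, remember it, then close the socket.
first = bind_with_reuse()
port = first.getsockname()[1]
first.close()

# Rebinding the same port immediately is allowed because SO_REUSEADDR
# relaxes the kernel's restriction on reusing a local address that may
# otherwise still be considered in use (e.g. a peer in TIME_WAIT).
second = bind_with_reuse(port)
assert second.getsockname()[1] == port
second.close()
```

Note this sketch only demonstrates the socket option in isolation; the commit applies the same option inside gloo's address-resolution path shown in the C++ diff below.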
1 parent 97fc8ea commit bd9bb36

File tree

2 files changed: +47 −0


test/distributed/test_multi_threaded_pg.py

Lines changed: 29 additions & 0 deletions

@@ -220,5 +220,34 @@ def test_gather(self):
         for i in range(self.world_size):
             self.assertEqual(gather_list[i], torch.ones(3, 3) * i)
 
+
+class TestLargeWorld(MultiThreadedTestCase):
+    @property
+    def world_size(self):
+        return 64
+
+    def setUp(self):
+        super().setUp()
+        self._spawn_threads()
+
+    def test_gloo_init(self):
+        groups = []
+        num_ports_used = 0
+        num_groups = 4
+        # create multiple gloo groups with 64 ranks
+        for i in range(num_groups):
+            group = dist.new_group(backend="gloo")
+            groups.append(group)
+
+        # tear down gloo groups
+        for i in range(num_groups):
+            dist.destroy_process_group(groups[i])
+        groups.clear()
+        self.assertEqual(len(groups), 0)
+
+        # create multiple gloo groups with 64 ranks
+        for i in range(num_groups):
+            group = dist.new_group(backend="gloo")
+            groups.append(group)
+
 if __name__ == "__main__":
     run_tests()

torch/csrc/distributed/c10d/ProcessGroupGloo.cpp

Lines changed: 18 additions & 0 deletions

@@ -638,6 +638,24 @@ bool doesHostnameResolveToUsableAddress(const std::string& hostname) {
   struct addrinfo* rp = nullptr;
   for (rp = result; rp != nullptr; rp = rp->ai_next) {
     auto fd = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
+
+    // Set SO_REUSEADDR to signal that reuse of the listening port is OK.
+    int on = 1;
+    rv = setsockopt(
+        fd,
+        SOL_SOCKET,
+        SO_REUSEADDR,
+        reinterpret_cast<const char*>(&on),
+        sizeof(on));
+    if (rv == -1) {
+#ifdef _WIN32
+      closesocket(fd);
+#else
+      close(fd);
+#endif
+      logAndThrow("setsockopt: ", strerror(errno));
+    }
+
     if (fd == -1) {
       continue;
     }
