NAME Data::DisjointSet::Shared - shared-memory union-find (disjoint-set) for Linux SYNOPSIS use Data::DisjointSet::Shared; # a universe of 1000 elements (0 .. 999), anonymous shared mapping my $d = Data::DisjointSet::Shared->new(undef, 1000); $d->union(0, 1); # merge the sets containing 0 and 1 $d->union(1, 2); # 0, 1, 2 are now one set $d->merge(3, 4); # merge is an alias for union $d->connected(0, 2); # true -- same set $d->same(0, 3); # false -- 'same' is an alias for connected $d->find(2); # canonical representative (root) of 2's set $d->set_size(0); # number of elements in 0's set (3) $d->num_sets; # how many disjoint sets remain $d->capacity; # the fixed element count (1000) # union many pairs in a single lock acquisition (flat [a0,b0,a1,b1,...]) my $merged = $d->union_many([ 5,6, 6,7, 8,9 ]); # returns how many merged $d->reset; # back to all singletons (num_sets == capacity) # share across processes via a backing file my $shared = Data::DisjointSet::Shared->new("/tmp/dsu.bin", 1000); DESCRIPTION A union-find (disjoint-set) structure in shared memory. It maintains a partition of a fixed universe of N integer elements (numbered "0 .. N-1") into disjoint sets. The three core operations are: * union(a, b) -- merge the set containing "a" with the set containing "b" into a single set. * find(x) -- return the *canonical representative* (the root) of the set containing "x". Two elements are in the same set if and only if they have the same root. * connected(a, b) -- test whether "a" and "b" are currently in the same set. The structure uses path compression (path halving on "find") together with union by size (the larger-sized root always becomes the parent), which gives near-constant amortized time per operation (the inverse-Ackermann bound). Each element is one "parent" slot and one "size" slot, both 32-bit, so the payload is "8 * N" bytes regardless of how many unions are performed; the mapping also carries a fixed ~16 KB cross-process reader table, so the "mmap_size" stat reports the true total. Because the "parent"/"size" arrays live in a shared mapping, several processes share one structure: any process that opens the same backing file, inherits the anonymous mapping across "fork", or reopens a passed memfd, sees the others' unions and contributes its own. A write-preferring futex rwlock with dead-process recovery guards every mutation, so unions from many processes serialize cleanly and the final partition is well defined. IMPORTANT: "find", "connected" and "set_size" perform path compression -- they mutate the structure to keep future queries fast -- and therefore acquire the write lock. They are *not* read-only operations. Only "num_sets" and "capacity" are cheap reads ("capacity" is lock-free; it is immutable after construction). There is no operation that unions one whole structure into another. A disjoint-set is a partition of one fixed universe; combining it with a separate structure is not meaningful. To combine elements, use "union" or "union_many" within a single structure. Linux-only. Requires 64-bit Perl. METHODS Constructors my $d = Data::DisjointSet::Shared->new($path, $n); my $d = Data::DisjointSet::Shared->new(undef, $n); # anonymous my $d = Data::DisjointSet::Shared->new_memfd($name, $n); my $d = Data::DisjointSet::Shared->new_from_fd($fd); $path is the backing file ("undef" or omitted for an anonymous mapping). $n is the number of elements: the universe is "0 .. $n-1". It must be ">= 1" and "<= 2**31"; "new" and "new_memfd" croak if $n is out of range. A freshly created structure starts with every element in its own singleton set, so "num_sets == $n" and "set_size($i) == 1" for every $i. When reopening an existing file or memfd, the stored "n" wins and the existing partition is preserved; the $n you pass to "new" on a reopen is only used when the file is brand new. "new_memfd" creates a Linux memfd (transferable via its "memfd" descriptor); "new_from_fd" reopens one in another process. Union and query my $merged = $d->union($a, $b); # merge; 1 if newly merged, 0 if already together my $merged = $d->merge($a, $b); # alias for union my $root = $d->find($x); # canonical representative of $x's set my $bool = $d->connected($a, $b); # true if $a and $b are in the same set my $bool = $d->same($a, $b); # alias for connected my $size = $d->set_size($x); # number of elements in $x's set "union" merges the set containing $a with the set containing $b and returns 1 if they were in different sets (a merge happened, and "num_sets" decreased by one) or 0 if they were already in the same set (nothing changed). "merge" is a short alias. "find" returns the root of $x's set -- a stable representative that is equal for all members of the same set. "connected" (aliased "same") returns true when $a and $b share a root. "set_size" returns the number of elements in $x's set. All four of "find", "connected", "set_size" and "union" take the write lock: "union" obviously mutates, and "find"/"connected"/"set_size" mutate via path compression. Every index argument must be in "0 .. capacity-1"; an out-of-range index croaks before any lock is taken (so a caught croak never leaks a lock). Bulk union my $merged = $d->union_many(\@pairs); "union_many" takes an array reference of even length holding a flat list of pairs, "[a0, b0, a1, b1, ...]", and unions each consecutive "(a, b)" pair under a single write lock. It returns the number of pairs that actually caused a merge (i.e. the count of "union" calls that would have returned 1). The batch is atomic with respect to validation: every index is resolved and range-checked before the lock is taken, so an odd-length array or any out-of-range index croaks without performing "any" union and without holding the lock. An empty array reference is a valid no-op that performs zero unions. Partition size my $sets = $d->num_sets; # current number of disjoint sets my $sets = $d->sets; # alias for num_sets my $n = $d->capacity; # the fixed element count (immutable) "num_sets" (aliased "sets") is the current count of disjoint sets: it starts equal to "capacity" and decreases by one on each successful "union". "capacity" is the number of elements the structure was created with and never changes. Both are read-only; "capacity" is lock-free. Lifecycle $d->reset; # back to all singletons $d->clear; # alias for reset $d->path; $d->memfd; $d->sync; $d->unlink; # or Class->unlink($path) "reset" (aliased "clear") returns every element to its own singleton set, so "num_sets" becomes "capacity" again. "sync" flushes the mapping to its backing store (a no-op for anonymous and memfd structures, which have none); "unlink" removes the backing file (also callable as "Class->unlink($path)"); "path" returns the backing path ("undef" for anonymous, memfd, or fd-reopened structures) and "memfd" the backing descriptor -- the memfd of a "new_memfd" structure or the dup'd fd of a "new_from_fd" structure, and -1 for file-backed or anonymous structures. STATS stats() returns a hashref describing the structure: * "capacity" -- the fixed number of elements ("0 .. capacity-1"). * "sets" -- the current number of disjoint sets ("num_sets"). * "ops" -- running count of operations that took the write lock ("union", "union_many", "find", "connected", "set_size", "reset"). * "mmap_size" -- bytes of the shared mapping. SHARING ACROSS PROCESSES The structure lives in a shared mapping, shared the same three ways as the rest of the family: a backing file (every process calls "new($path, $n)" on the same path), an anonymous mapping inherited across "fork", or a memfd whose descriptor is passed to an unrelated process (over a UNIX socket via "SCM_RIGHTS", or via "/proc/$pid/fd/$n") and reopened with new_from_fd($fd). Because the mapping is shared, every process unions into and queries the same partition. All mutation is serialized by the write lock, so the final partition is independent of how the processes interleave. # parent and children share one disjoint-set with no coordination my $d = Data::DisjointSet::Shared->new(undef, 100); # before fork unless (fork) { $d->union($_, $_ + 1) for 0 .. 49; exit } wait; print $d->num_sets, "\n"; # reflects the child's unions SECURITY The mmap region is writable by all processes that open it. Do not share backing files with untrusted processes. CRASH SAFETY Mutation is guarded by a futex-based write-preferring rwlock with PID-encoded ownership; if a holder dies, the next contender detects the dead owner and recovers. Because "union" updates a couple of words while holding the lock, a crash leaves the partition consistent up to the last completed "union". Limitation: PID reuse is not detected (very unlikely in practice). SEE ALSO Data::Histogram::Shared, Data::CountMinSketch::Shared, Data::HyperLogLog::Shared, Data::BloomFilter::Shared, Data::Intern::Shared, Data::SortedSet::Shared, Data::SpatialHash::Shared, and the rest of the "Data::*::Shared" family. AUTHOR vividsnow LICENSE This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.