feat: update workspace paths and enhance gitignore

- Updated stablediffusion crate path from "../stable-diffusion-burn" to "./crates/stable-diffusion-burn" for proper workspace resolution - Enhanced .gitignore to include generated model files (.mpk, .pt, .bin, .safetensors, .ckpt) and user_data directory - Added Cargo.lock to gitignore with appropriate comment - Reorganized IDE files section in gitignore for better clarity - Added newline at end of file for proper formatting
2026-03-05 19:39:14 +01:00
parent 4bb7ca9074
commit 3a67c0979c
1605 changed files with 537032 additions and 2 deletions
--- a/crates/stable-diffusion-burn/burn-crates/burn-collective/README.md
+++ b/crates/stable-diffusion-burn/burn-crates/burn-collective/README.md
@@ -0,0 +1,139 @@
+# burn-collective
+
+Collective operations on tensors
+
+The following collective operation are implemented:
+
+- `all-reduce`
+    Aggregates a tensor between all peers, and distributes the result to all peers.
+    Different strategies can be used on the local and global levels. The result can only be
+    returned when all peers have called the all-reduce.
+- `reduce`
+    Aggregates a tensor from all peers onto one peer, called the "root"
+- `broadcast`
+    Copies a tensor from one peer to all other peers in the collective.
+
+Peers must call `register` before calling any other operation.
+The total number of devices on the node, or nodes in the collective, must be known ahead of time.
+
+In many libraries like NCCL and PyTorch, participating units are called "ranks".
+This name is confusing in the context of tensors, so in burn-collective the participating units
+are called "peers".
+
+*`reduce` and `broadcast` are not yet implemented for multi-node contexts*
+
+## Local and Global
+
+Internally, there are two levels to the collective operations: local and global. Operations are done on the local level, then optionally on the global level.
+
+| Local                                      | Global                                        |
+|-----------------------------------------------|-----------------------------------------------|
+| Intra-node (typically within one machine)     | Inter-node (typically across machies)         |
+| Participants are threads (one per peer/GPU) | Participants are processes (one per node)     |
+| Communication depends on backend              | Network peer-to-peer communication            |
+| Local server is launched automatically      | Global coordinator must be launched manually  |
+| Local server does the aggregation          | Nodes do the operations themselves            |
+
+For global operations (ie. with multiple nodes), there must be a global orchestrator available.
+Start one easily with `burn_collective::start_global_orchestrator()`.
+
+On the global level, nodes use the `burn_communication::data_service::TensorDataService` to
+expose and download tensors in a peer-to-peer manner, in order to be independent.
+
+## Components
+
+The following are the important pieces of the collective operations system.
+
+| Term                           | One per...    | Meaning
+|--------------------------------|---------------|----------------------------------------------------------
+| Local Collective Client        | Peer/thread | Requests operations to the Local Collective Server
+| Local Collective Server        | Node/process  | Does local-level ops for threads in this process. In the case of global operations, passes operations on to the Global Collective Client.
+| Global Collective Client       | Node/process  | Does global-level ops for this node. Registers and requests strategies from the Global Collective Orchestrator.
+| Global Collective Orchestrator | Collective    | Responds to the Global Collective Client from each node. Responsible for aggregation strategies.
+
+## Strategies
+
+Different strategies can be used on the local and global level.
+
+### Centralized
+
+An arbitrary peer is designated as the "root", and all others are transferred to the root's device.
+The operation is done on that device.
+The resulting tensor then sent to each peer.
+
+### Tree
+
+Tensors in groups of N are aggregated together. This is done recursively until only one tensor
+remains. The strategy tries to put devices of the same type closer in the tree.
+When N=2, this is like a binary tree reduce.
+The resulting tensor then sent to each peer
+
+### Ring
+
+See this good explanation: <https://blog.dailydoseofds.com/p/all-reduce-and-ring-reduce-for-model>
+
+The tensors are sliced into N parts, where N is the number of tensors to aggregate.
+Then, the slices are sent around in a series of cycles and aggregated until every tensor's slices
+is a sum of the other corresponding slices.
+
+In the case where the tensors are too small to split into N slices, a fallback algorithm is used.
+For now, the fallback is a binary tree.
+
+(p=3, n=3)
+
+o->o  o  
+o  o->o  
+o  o  o->
+
+o  1->o  
+o  o  1->
+1->o  o  
+
+o  1  2->
+2->o  1  
+1  2->o  
+
+3  1  2
+2  3  1
+1  2  3
+
+(This is essentially a reduce-scatter)
+
+3->x  x  
+x  3->x  
+x  x  3->
+
+3  3->x  
+x  3  3->
+3->x  3  
+
+3  3  3->
+3->3  3  
+3  3->3  
+
+3  3  3
+3  3  3
+3  3  3
+
+(This is essentially an all-gather)
+
+This is done so that every peer is both sending and receiving data at any moment.
+This is an important part of this strategy's advantages.
+
+The ring strategy takes full advantage of the bandwidth available. The latency scales with the
+number of peers.
+
+So when the tensors are very small, or when the number of peers is very large, the latency is more
+important in the ring strategy, and a tree algorithm is better. Otherwise, the ring algorithm is
+the better.
+
+In multi-node contexts, use of the Ring strategy in the local level may be less
+advantageous. With multiple nodes, the global all-reduce step is enabled, and its result
+is redistributed to all devices.
+The Ring strategy inherently distributes the result, which in this context would not be necessary.
+
+It is recommended to use the Ring strategy at the global level
+
+### Double binary tree
+
+<https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/>