feat: update workspace paths and enhance gitignore
- Updated stablediffusion crate path from "../stable-diffusion-burn" to "./crates/stable-diffusion-burn" for proper workspace resolution - Enhanced .gitignore to include generated model files (.mpk, .pt, .bin, .safetensors, .ckpt) and user_data directory - Added Cargo.lock to gitignore with appropriate comment - Reorganized IDE files section in gitignore for better clarity - Added newline at end of file for proper formatting
This commit is contained in:
@@ -0,0 +1,139 @@
|
||||
# burn-collective
|
||||
|
||||
Collective operations on tensors
|
||||
|
||||
The following collective operation are implemented:
|
||||
|
||||
- `all-reduce`
|
||||
Aggregates a tensor between all peers, and distributes the result to all peers.
|
||||
Different strategies can be used on the local and global levels. The result can only be
|
||||
returned when all peers have called the all-reduce.
|
||||
- `reduce`
|
||||
Aggregates a tensor from all peers onto one peer, called the "root"
|
||||
- `broadcast`
|
||||
Copies a tensor from one peer to all other peers in the collective.
|
||||
|
||||
Peers must call `register` before calling any other operation.
|
||||
The total number of devices on the node, or nodes in the collective, must be known ahead of time.
|
||||
|
||||
In many libraries like NCCL and PyTorch, participating units are called "ranks".
|
||||
This name is confusing in the context of tensors, so in burn-collective the participating units
|
||||
are called "peers".
|
||||
|
||||
*`reduce` and `broadcast` are not yet implemented for multi-node contexts*
|
||||
|
||||
## Local and Global
|
||||
|
||||
Internally, there are two levels to the collective operations: local and global. Operations are done on the local level, then optionally on the global level.
|
||||
|
||||
| Local | Global |
|
||||
|-----------------------------------------------|-----------------------------------------------|
|
||||
| Intra-node (typically within one machine) | Inter-node (typically across machies) |
|
||||
| Participants are threads (one per peer/GPU) | Participants are processes (one per node) |
|
||||
| Communication depends on backend | Network peer-to-peer communication |
|
||||
| Local server is launched automatically | Global coordinator must be launched manually |
|
||||
| Local server does the aggregation | Nodes do the operations themselves |
|
||||
|
||||
For global operations (ie. with multiple nodes), there must be a global orchestrator available.
|
||||
Start one easily with `burn_collective::start_global_orchestrator()`.
|
||||
|
||||
On the global level, nodes use the `burn_communication::data_service::TensorDataService` to
|
||||
expose and download tensors in a peer-to-peer manner, in order to be independent.
|
||||
|
||||
## Components
|
||||
|
||||
The following are the important pieces of the collective operations system.
|
||||
|
||||
| Term | One per... | Meaning
|
||||
|--------------------------------|---------------|----------------------------------------------------------
|
||||
| Local Collective Client | Peer/thread | Requests operations to the Local Collective Server
|
||||
| Local Collective Server | Node/process | Does local-level ops for threads in this process. In the case of global operations, passes operations on to the Global Collective Client.
|
||||
| Global Collective Client | Node/process | Does global-level ops for this node. Registers and requests strategies from the Global Collective Orchestrator.
|
||||
| Global Collective Orchestrator | Collective | Responds to the Global Collective Client from each node. Responsible for aggregation strategies.
|
||||
|
||||
## Strategies
|
||||
|
||||
Different strategies can be used on the local and global level.
|
||||
|
||||
### Centralized
|
||||
|
||||
An arbitrary peer is designated as the "root", and all others are transferred to the root's device.
|
||||
The operation is done on that device.
|
||||
The resulting tensor then sent to each peer.
|
||||
|
||||
### Tree
|
||||
|
||||
Tensors in groups of N are aggregated together. This is done recursively until only one tensor
|
||||
remains. The strategy tries to put devices of the same type closer in the tree.
|
||||
When N=2, this is like a binary tree reduce.
|
||||
The resulting tensor then sent to each peer
|
||||
|
||||
### Ring
|
||||
|
||||
See this good explanation: <https://blog.dailydoseofds.com/p/all-reduce-and-ring-reduce-for-model>
|
||||
|
||||
The tensors are sliced into N parts, where N is the number of tensors to aggregate.
|
||||
Then, the slices are sent around in a series of cycles and aggregated until every tensor's slices
|
||||
is a sum of the other corresponding slices.
|
||||
|
||||
In the case where the tensors are too small to split into N slices, a fallback algorithm is used.
|
||||
For now, the fallback is a binary tree.
|
||||
|
||||
(p=3, n=3)
|
||||
|
||||
o->o o
|
||||
o o->o
|
||||
o o o->
|
||||
|
||||
o 1->o
|
||||
o o 1->
|
||||
1->o o
|
||||
|
||||
o 1 2->
|
||||
2->o 1
|
||||
1 2->o
|
||||
|
||||
3 1 2
|
||||
2 3 1
|
||||
1 2 3
|
||||
|
||||
(This is essentially a reduce-scatter)
|
||||
|
||||
3->x x
|
||||
x 3->x
|
||||
x x 3->
|
||||
|
||||
3 3->x
|
||||
x 3 3->
|
||||
3->x 3
|
||||
|
||||
3 3 3->
|
||||
3->3 3
|
||||
3 3->3
|
||||
|
||||
3 3 3
|
||||
3 3 3
|
||||
3 3 3
|
||||
|
||||
(This is essentially an all-gather)
|
||||
|
||||
This is done so that every peer is both sending and receiving data at any moment.
|
||||
This is an important part of this strategy's advantages.
|
||||
|
||||
The ring strategy takes full advantage of the bandwidth available. The latency scales with the
|
||||
number of peers.
|
||||
|
||||
So when the tensors are very small, or when the number of peers is very large, the latency is more
|
||||
important in the ring strategy, and a tree algorithm is better. Otherwise, the ring algorithm is
|
||||
the better.
|
||||
|
||||
In multi-node contexts, use of the Ring strategy in the local level may be less
|
||||
advantageous. With multiple nodes, the global all-reduce step is enabled, and its result
|
||||
is redistributed to all devices.
|
||||
The Ring strategy inherently distributes the result, which in this context would not be necessary.
|
||||
|
||||
It is recommended to use the Ring strategy at the global level
|
||||
|
||||
### Double binary tree
|
||||
|
||||
<https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/>
|
||||
Reference in New Issue
Block a user