Pipeline Scheduling
Try placing all forward tasks first (AFAB), then compare with the 1F1B preset.
Pipeline parallelism splits a model across GPUs.
Each microbatch flows forward then backward. The challenge: minimize the bubble — idle GPU time between tasks. 1F1B interleaves forward and backward passes to reduce peak activation memory.
schedule = Schedule([]) # drag tasks onto the timeline
1F1B emerges naturally from DFS tie-breaking in list scheduling.
No special logic needed — just prioritize finishing what you started. The schedule falls out of the tie-breaker, not the priority function.
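To make this concrete, here is a minimal sketch of such a scheduler. Everything here is illustrative, not the article's actual `Schedule` class: a hypothetical `list_schedule` helper with unit-time tasks, one device per stage, and the tie-break expressed as a sort key (backward before forward, oldest microbatch first).

```python
# Hypothetical list-scheduling sketch (unit-time tasks, one device per
# pipeline stage; not the article's Schedule class). The DFS tie-break —
# prefer backwards, then the oldest microbatch — reproduces a 1F1B-style
# schedule with no 1F1B-specific logic.

def list_schedule(p=4, m=8):
    done = set()
    timeline = {s: [] for s in range(p)}  # per-device task list; None = bubble

    def ready(task):
        kind, s, mb = task
        if kind == "F":
            return s == 0 or ("F", s - 1, mb) in done
        # backward needs this stage's forward and the next stage's backward
        return ("F", s, mb) in done and (s == p - 1 or ("B", s + 1, mb) in done)

    all_tasks = [(k, s, mb) for k in "FB" for s in range(p) for mb in range(m)]
    while len(done) < len(all_tasks):
        fired = []
        for s in range(p):
            cands = [t for t in all_tasks
                     if t[1] == s and t not in done and ready(t)]
            # DFS tie-break: backwards first, then lowest microbatch id
            cands.sort(key=lambda t: (t[0] != "B", t[2]))
            if cands:
                fired.append(cands[0])
            timeline[s].append(cands[0] if cands else None)
        done.update(fired)
    return timeline
```

The only 1F1B-relevant line is the `sort` key: swap it for "forwards first" and the same loop produces AFAB instead.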
HLS Scheduling
Both adders work every cycle — the question is which two additions to pair first.
Resource-constrained scheduling minimizes latency under functional unit limits.
This maps directly to list scheduling from HLS (high-level synthesis).
We have 1 multiplier and 2 adders to evaluate (a+b)*(c+d) + (e+f)*(g+h).
Pairing same-multiplier inputs (a+b with c+d) lets the multiplier start one
cycle earlier than pairing cross-multiplier inputs (a+b with e+f).
The tie-breaker matters more than the priority function.
This contradicts most textbook presentations of list scheduling, where ALAP/ASAP priorities are emphasized over tie-breaking strategy.
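The adder-pairing claim can be checked with a few lines of code. This is a toy resource-constrained list scheduler (hypothetical `schedule` helper, unit-latency ops), where the two pairings appear only as different tie-break orders over ready operations:

```python
# Toy resource-constrained list scheduler (unit-latency ops, hypothetical
# helper; not from any HLS tool). The tie-break order decides which two
# additions the adders pair first, and thus when the multiplier can start.

def schedule(dag, resources, order):
    # dag: op -> (kind, deps); order: tie-break rank for ready ops
    done, t = {}, 0
    while len(done) < len(dag):
        t += 1
        free = dict(resources)
        ready = [op for op, (kind, deps) in dag.items()
                 if op not in done and all(done.get(d, t) < t for d in deps)]
        for op in sorted(ready, key=order.index):
            kind = dag[op][0]
            if free.get(kind, 0) > 0:
                free[kind] -= 1
                done[op] = t
    return t  # makespan in cycles

# (a+b)*(c+d) + (e+f)*(g+h) with 2 adders, 1 multiplier
DAG = {
    "ab": ("add", []), "cd": ("add", []), "ef": ("add", []), "gh": ("add", []),
    "m1": ("mul", ["ab", "cd"]), "m2": ("mul", ["ef", "gh"]),
    "sum": ("add", ["m1", "m2"]),
}
res = {"add": 2, "mul": 1}
same  = ["ab", "cd", "ef", "gh", "m1", "m2", "sum"]  # pair a+b with c+d first
cross = ["ab", "ef", "cd", "gh", "m1", "m2", "sum"]  # pair a+b with e+f first
print(schedule(DAG, res, same), schedule(DAG, res, cross))  # 4 vs 5 cycles
```

With the same-multiplier pairing the multiplier starts in cycle 2 and the whole expression finishes in 4 cycles; the cross pairing leaves the multiplier idle in cycle 2 and finishes in 5.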
Interleaved 1F1B
Dark blocks = chunk 0 (layers 0-3), light blocks = chunk 1 (layers 4-7). Color = microbatch, label = layer number.
Megatron-LM v2 introduced interleaved 1F1B: each GPU holds v=2 non-contiguous layer groups. Data flows through all 4 devices twice per microbatch (0→1→2→3→0→1→2→3), halving the pipeline bubble from (p-1)/m to (1/v)·(p-1)/m.
This schedule uses 8 layers, 4 GPUs, 8 microbatches with fw=5, bw=10 per virtual stage.
This is the Megatron V2 interleaved 1F1B algorithm: warmup fills the pipeline with forwards, then steady-state alternates one forward and one backward per device. Unlike contiguous 1F1B, a greedy list scheduler can’t reproduce this — the wave-based queue ordering is specific to the interleaved topology. Can a solver match or beat it? The CP-SAT formulation in the next section finds out.
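The bubble arithmetic above is easy to verify for this configuration. A one-line sketch of the formula, assuming equal forward+backward time per chunk as in the Megatron-LM v2 analysis:

```python
# Bubble fraction for (interleaved) 1F1B, assuming every virtual stage has
# the same forward+backward time: per-GPU idle time is (p-1)*(t_f+t_b)/v,
# ideal busy time is m*(t_f+t_b), so the ratio is (p-1)/(m*v).
def bubble_fraction(p, m, v=1):
    return (p - 1) / (m * v)

print(bubble_fraction(4, 8))      # contiguous 1F1B: 3/8
print(bubble_fraction(4, 8, 2))   # interleaved, v=2: 3/16 — halved
```

For the 4-GPU, 8-microbatch schedule shown here, interleaving with v=2 cuts the bubble fraction from 3/8 to 3/16, at the cost of 2x the communication volume per microbatch.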
Scheduling as Constraint Satisfaction
Each task gets a start-time variable bounded by ASAP and ALAP. Constraints enforce dependencies, no overlap, and memory limits.
Google OR-Tools CP-SAT solves constraint satisfaction problems over integer variables. We model pipeline scheduling with four ingredients:
- Variables: each task gets a start time within [ASAP, ALAP]
- Dependencies: successor can’t start before predecessor ends
- NoOverlap: tasks on the same GPU can’t overlap in time
- Cumulative: peak activation memory per GPU stays within budget