Autopipe

Pipeline Scheduling

Try placing all forward tasks first (AFAB), then compare with the 1F1B preset.
Pipeline parallelism splits a model across GPUs.

Each microbatch flows forward then backward. The challenge: minimize the bubble — idle GPU time between tasks. 1F1B interleaves forward and backward passes to reduce peak activation memory.

Task Pool

[Interactive widget: drag the tasks F0.L0, F0.L1, B0.L1, B0.L0, F1.L0, F1.L1, B1.L1, B1.L0 from the pool onto the GPU 0 (Layer 0) and GPU 1 (Layer 1) lanes; the makespan, bubble percentage, and per-GPU activation tracks update live. Initial state: `schedule = Schedule([])`, Makespan: 0, Bubble: 0.0%.]
1F1B emerges naturally from DFS tie-breaking in list scheduling.

No special logic needed — just prioritize finishing what you started. The schedule falls out of the tie-breaker, not the priority function.
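To make "the schedule falls out of the tie-breaker" concrete, here is a minimal sketch of a greedy list scheduler for a toy 2-stage, unit-time pipeline (all names and the simulator itself are illustrative, not the article's actual implementation). All ready tasks have equal priority; only the tie-break differs: "dfs" prefers the deepest ready task (finish what you started), "bfs" prefers the shallowest (run all forwards first).

```python
def schedule(num_micro, tiebreak):
    """Greedy list scheduler for a 2-stage, unit-time pipeline (toy sketch)."""
    # task = (kind, microbatch, stage); depth = position along the F...B chain
    deps, gpu, depth = {}, {}, {}
    for m in range(num_micro):
        for task, pre, d in [(("F", m, 0), [], 0),
                             (("F", m, 1), [("F", m, 0)], 1),
                             (("B", m, 1), [("F", m, 1)], 2),
                             (("B", m, 0), [("B", m, 1)], 3)]:
            deps[task], gpu[task], depth[task] = pre, task[2], d
    done, busy = set(), {0: None, 1: None}   # busy[g] = (task, finish time)
    order, t = {0: [], 1: []}, 0
    while len(done) < len(deps):
        for g in (0, 1):                     # retire finished tasks
            if busy[g] and busy[g][1] <= t:
                done.add(busy[g][0])
                busy[g] = None
        for g in (0, 1):                     # dispatch one task per idle GPU
            if busy[g]:
                continue
            ready = [x for x in deps if gpu[x] == g and x not in done
                     and all(p in done for p in deps[x])]
            if ready:
                # "dfs": deepest ready task first -> 1F1B emerges
                # "bfs": shallowest first -> all-forward-all-backward (AFAB)
                sign = -1 if tiebreak == "dfs" else 1
                task = min(ready, key=lambda x: (sign * depth[x], x[1]))
                busy[g] = (task, t + 1)
                order[g].append(task)
        t += 1
    return order
```

With 3 microbatches, the DFS tie-break makes GPU 1 execute F, B, F, B, F, B (1F1B), while the BFS tie-break yields F, F, F, B, B, B (AFAB); the same loop produces both, with no 1F1B-specific logic anywhere.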

HLS Scheduling

Both adders work every cycle — the question is which two additions to pair first.
Resource-constrained scheduling minimizes latency under functional unit limits.

This maps directly to list scheduling from HLS (high-level synthesis). We have 1 multiplier and 2 adders to evaluate (a+b)*(c+d) + (e+f)*(g+h). Pairing same-multiplier inputs (a+b with c+d) lets the multiplier start one cycle earlier than pairing cross-multiplier inputs (a+b with e+f).
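The one-cycle difference can be checked with a small resource-constrained list scheduler (a toy sketch; the op names and unit-latency assumption are illustrative). It issues ready ops each cycle, up to 2 adds and 1 multiply, in the priority order given:

```python
def hls_makespan(add_priority):
    """Cycles to evaluate (a+b)*(c+d) + (e+f)*(g+h) with 2 adders, 1 multiplier.

    add_priority orders the four input adds; unit latency, results usable
    the cycle after they are computed.
    """
    deps = {"a+b": [], "c+d": [], "e+f": [], "g+h": [],
            "m1": ["a+b", "c+d"], "m2": ["e+f", "g+h"], "sum": ["m1", "m2"]}
    unit = {op: ("mul" if op in ("m1", "m2") else "add") for op in deps}
    rank = {op: (add_priority.index(op) if op in add_priority else 99)
            for op in deps}
    done, t = set(), 0
    while len(done) < len(deps):
        ready = sorted((op for op in deps if op not in done
                        and all(d in done for d in deps[op])),
                       key=lambda op: rank[op])
        used = {"add": 0, "mul": 0}
        issued = []
        for op in ready:                      # greedy issue under unit limits
            if used[unit[op]] < {"add": 2, "mul": 1}[unit[op]]:
                used[unit[op]] += 1
                issued.append(op)
        done.update(issued)
        t += 1
    return t
```

Pairing same-multiplier inputs first finishes in 4 cycles; pairing cross-multiplier inputs delays the first multiply by a cycle and finishes in 5.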

Task Pool

[Interactive widget: drag the adds +(a,b), +(c,d), +(e,f), +(g,h), the multiplies ×(a+b,c+d), ×(e+f,g+h), and the final +(L,R) onto the MUL, ADD 0, and ADD 1 lanes over cycles 0-8; a register track shows live values.]
The tie-breaker matters more than the priority function.

This contradicts most textbook presentations of list scheduling, where ALAP/ASAP priorities are emphasized over tie-breaking strategy.

Interleaved 1F1B

Dark blocks = chunk 0 (layers 0-3), light blocks = chunk 1 (layers 4-7). Color = microbatch, label = layer number.
Megatron-LM v2 introduced interleaved 1F1B: each GPU holds v=2 non-contiguous layer groups. Data flows through all 4 devices twice per microbatch (0→1→2→3→0→1→2→3), halving the pipeline bubble from (p-1)/m to (1/v)·(p-1)/m.

This schedule uses 8 layers, 4 GPUs, 8 microbatches with fw=5, bw=10 per virtual stage.
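These numbers are self-consistent, which is worth checking: each device runs 8 microbatches through 2 virtual chunks at 5 + 10 units each, and the bubble fraction is the idle share of the reported makespan of 285.

```python
# Sanity-check the reported interleaved-1F1B stats (parameters from the text).
p, m, v = 4, 8, 2                      # devices, microbatches, chunks per device
fw, bw = 5, 10                         # forward/backward time per virtual stage
work_per_device = m * v * (fw + bw)    # 8 * 2 * 15 = 240 busy units per device
makespan = 285                         # reported by the schedule chart
bubble = 1 - work_per_device / makespan
print(f"bubble = {bubble:.1%}")        # prints: bubble = 15.8%
```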

[Schedule chart: interleaved 1F1B on D0 (L0,L4) through D3 (L3,L7). Makespan: 285 | Bubble: 15.8%. Activation plot peaks: D0: 11, D1: 9, D2: 7, D3: 5.]
This is the Megatron V2 interleaved 1F1B algorithm: warmup fills the pipeline with forwards, then steady-state alternates one forward and one backward per device. Unlike contiguous 1F1B, a greedy list scheduler can’t reproduce this — the wave-based queue ordering is specific to the interleaved topology. Can a solver match or beat it? The CP-SAT formulation in the next section finds out.

Scheduling as Constraint Satisfaction

Each task gets a start-time variable bounded by ASAP and ALAP. Constraints enforce dependencies, no overlap, and memory limits.
Google OR-Tools CP-SAT solves constraint satisfaction problems over integer variables. We model pipeline scheduling with four ingredients:
  1. Variables: each task gets a start time within [ASAP, ALAP]
  2. Dependencies: successor can’t start before predecessor ends
  3. NoOverlap: tasks on the same GPU can’t overlap in time
  4. Cumulative: peak activation memory per GPU stays within budget

Variables: ASAP/ALAP Domains

Each task’s start time is bounded by its earliest possible start (ASAP, from forward topological order) and latest possible start (ALAP, from backward pass). The solver searches within these windows.
[Chart: start-time windows on a 0-45 time axis for F0.L0 and B0.L1, each bracketed by its ASAP and ALAP bounds.]
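Both bounds come from longest-path passes over the DAG. A minimal sketch, assuming integer durations and a fixed deadline (function and task names are illustrative):

```python
def asap_alap(deps, dur, deadline):
    """ASAP = longest path from sources; ALAP = deadline minus path to sinks."""
    succs = {n: [] for n in deps}
    for n, preds in deps.items():
        for p in preds:
            succs[p].append(n)
    # Kahn topological sort
    indeg = {n: len(deps[n]) for n in deps}
    queue = [n for n in deps if indeg[n] == 0]
    order = []
    while queue:
        n = queue.pop()
        order.append(n)
        for s in succs[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                queue.append(s)
    asap, alap = {}, {}
    for n in order:                       # forward pass: earliest starts
        asap[n] = max((asap[p] + dur[p] for p in deps[n]), default=0)
    for n in reversed(order):             # backward pass: latest starts
        alap[n] = min((alap[s] - dur[n] for s in succs[n]),
                      default=deadline - dur[n])
    return asap, alap
```

For the chain F0.L0 → F0.L1 → B0.L1 → B0.L0 with unit durations and deadline 6, ASAP is 0, 1, 2, 3 and ALAP is 2, 3, 4, 5: every task gets a 2-unit window of slack for the solver to search.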

Dependency Constraints

For every edge (pred → succ) in the DAG: start(succ) ≥ start(pred) + duration(pred), i.e. the successor cannot start before the predecessor ends. This encodes the data-flow ordering — backward can’t start before forward finishes.

NoOverlap Per GPU

Each GPU lane gets a constraint: no two interval variables assigned to the same GPU may overlap. The solver enforces this natively using interval propagation — no Big-M tricks needed.

Cumulative Memory

Each forward task creates an activation that lives until its backward completes. The Cumulative constraint bounds the sum of overlapping activations by the memory capacity. Phase 2 minimizes this peak.
The key insight: CP-SAT uses interval variables natively. NoOverlap and Cumulative are first-class constraints with dedicated propagators — much faster than encoding scheduling as generic integer programming.

Results from CP-SAT

Three schedules for the same DAG and interleaved allocation — greedy, makespan-optimal, and memory-optimal.
The greedy list scheduler leaves gaps that inflate the makespan. The CP-SAT solver finds the globally optimal schedule — matching the Megatron V2 paper’s result — without any hand-tuning. Phase 2 then reorders tasks to minimize peak memory while preserving the same makespan.

List Scheduler (Greedy)

[Schedule chart: greedy list schedule on D0 (L0,L4) through D3 (L3,L7). Makespan: 320 | Bubble: 25.0%. Activation plot peaks: D0: 12, D1: 11, D2: 8, D3: 5.]

CP-SAT: Minimize Makespan

[Schedule chart: CP-SAT makespan-optimal schedule. Makespan: 285 | Bubble: 15.8%. Activation plot peaks: D0: 11, D1: 9, D2: 7, D3: 5.]

CP-SAT: Minimize Memory Pressure

[Schedule chart: CP-SAT memory-optimal schedule. Makespan: 285 | Bubble: 15.8%. Activation plot peaks: D0: 8, D1: 8, D2: 8, D3: 5.]
Same DAG + constraints → the solver finds the answer automatically. No hand-tuning, no heuristic design. The formulation is the algorithm.