MLSYS ENGINEERING

3.6. Barrier

Another common way to coordinate threads is to insert a barrier, a technique widely used in GPU parallel computing.

Consider a simple example where each thread has two steps: first, copy a piece of data; second, do some computing. Each thread copies a different piece, but every thread needs all the data to be ready before it can start computing. Without coordination, a thread that finishes its copy early might start computing while another thread's data is still being loaded, causing incorrect results.

A barrier solves this. It is a single line of code placed between the two steps. Any thread that reaches the barrier is blocked there. Once every thread has arrived, the barrier releases them all at the same time, and they continue to the next step together.

Think of it like a race. All runners must reach the starting line before the race begins, and the starting line is the barrier. No runner is allowed to start until everyone is ready. Once they are all at the line, the starting signal fires and they all go at once.
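To make the blocking and releasing behavior concrete, here is a minimal sketch of how a barrier could be built from a counter and a condition variable. The class name SimpleBarrier, its fields, and the run_demo helper are hypothetical names chosen for illustration. This is not how any particular library implements its barrier, only one simple way to get the behavior described above.

```python
import threading


class SimpleBarrier:
    """Illustrative barrier built on a condition variable (hypothetical class)."""

    def __init__(self, parties):
        self.parties = parties      # number of threads that must arrive
        self.count = 0              # threads that have arrived in this round
        self.generation = 0         # incremented each time the barrier opens
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            my_generation = self.generation
            self.count += 1
            if self.count == self.parties:
                # Last arrival: open the barrier and wake everyone else.
                self.generation += 1
                self.count = 0
                self.cond.notify_all()
            else:
                # Early arrival: block until the last thread opens the barrier.
                self.cond.wait_for(lambda: self.generation != my_generation)


def run_demo(num_threads=4):
    barrier = SimpleBarrier(num_threads)
    events = []                     # ordered log of (phase, thread_id)
    log_lock = threading.Lock()

    def worker(thread_id):
        with log_lock:
            events.append(("before", thread_id))
        barrier.wait()
        with log_lock:
            events.append(("after", thread_id))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


events = run_demo()
phases = [phase for phase, _ in events]
# Every "before" entry is logged ahead of every "after" entry.
print(phases)
```

The generation counter lets a waiting thread tell "the barrier has opened" apart from a spurious wakeup, and it also makes the barrier safe to reuse for a second round. In real code, the standard library's threading.Barrier is the better choice.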

The following code shows how this looks in Python. It uses asyncio tasks to stand in for threads; the barrier behaves the same way.

Code 14. Barrier synchronization with two threads.
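import asyncio
import random

NUM_PARTICIPANTS = 2

async def worker_task(task_id, barrier):
    # Step 1: simulate copying this task's piece of data
    print(f"Task {task_id}: copy start")
    await asyncio.sleep(random.uniform(0.5, 2.0))
    print(f"Task {task_id}: copy complete")

    # Wait at the barrier until all participants have arrived
    await barrier.wait()

    # Step 2: simulate computing on the fully loaded data
    print(f"Task {task_id}: compute start")
    await asyncio.sleep(random.uniform(0.5, 1.5))
    print(f"Task {task_id}: compute complete")

async def main():
    # asyncio.Barrier requires Python 3.11 or later
    barrier = asyncio.Barrier(NUM_PARTICIPANTS)
    tasks = [worker_task(i, barrier) for i in range(1, NUM_PARTICIPANTS + 1)]
    await asyncio.gather(*tasks)

asyncio.run(main())

# Example output (sleep times are random, so the order can differ between runs):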
# Task 1: copy start
# Task 2: copy start
# Task 2: copy complete
# Task 1: copy complete
# Task 1: compute start
# Task 2: compute start
# Task 2: compute complete
# Task 1: compute complete

The sequence diagram in Figure 7 shows what happens. Thread 2 finishes its copy phase first and hits the barrier, entering a waiting state shown in gray. Thread 1 is still copying. Once Thread 1 also reaches the barrier, both threads are released simultaneously and their compute phases begin at the same point in time.

[Sequence diagram: lanes Thread 1 and Thread 2; phases copy, wait, barrier, compute.]
Figure 7. Barrier sequence diagram.
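The copy-then-compute scenario described at the start of this section can also be sketched with real OS threads, using threading.Barrier from the Python standard library. The buffer names, sizes, and the sum used as the "compute" step are assumptions made for illustration. Each thread copies its own slice into a shared buffer, waits at the barrier, and only then reads the whole buffer.

```python
import threading

NUM_THREADS = 4
CHUNK = 3                                   # elements copied by each thread

source = list(range(NUM_THREADS * CHUNK))   # data to be loaded: 0..11
shared = [None] * len(source)               # buffer that every thread will read
results = [None] * NUM_THREADS
barrier = threading.Barrier(NUM_THREADS)


def worker(thread_id):
    # Step 1: copy this thread's slice into the shared buffer.
    start = thread_id * CHUNK
    for i in range(start, start + CHUNK):
        shared[i] = source[i]

    # Block until all threads have finished copying.
    barrier.wait()

    # Step 2: compute over the whole buffer, which is now fully populated.
    results[thread_id] = sum(shared)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # each entry is sum(range(12)) = 66
```

If the barrier.wait() call were removed, a thread that finished its copy early could reach sum(shared) while some slots still hold None, and the computation would fail or give a wrong answer depending on timing.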

So far, we have assumed that compute is the bottleneck and that adding more threads will make things faster. In practice, this is not always true. Even with perfect parallelization, the cores may still sit idle most of the time, waiting for data to arrive from memory. In the next chapter, we will look at memory and understand why it can be the real bottleneck.