Atomic operations

MLSYS ENGINEERING

3.5. Atomic operations

Using += across multiple threads can produce incorrect results due to the race condition, as introduced in Race condition. For your convenience, we repeat the example that explains the bug here.

Code 10. Race condition example with two threads.

x = 0  # Initial value
temp1 = x  # Thread 1 Read
temp1 += 1  # Thread 1 Compute, temp1 is 1.
temp2 = x  # Thread 2 Read
temp2 += 1  # Thread 2 Compute, temp2 is 1.
x = temp1  # Thread 1 Write back, x is now 1.
x = temp2  # Thread 2 Write back, x is still 1.

The root cause is that the read-modify-write steps of x += 1 are three separate instructions that can be interrupted and interleaved by other threads. What we need is to make these steps atomic, meaning they are treated as a single indivisible unit that no other thread can interrupt. The tool we use to achieve this is a lock. You may notice that we used sync earlier to mean sequential, blocking execution, as opposed to async. Here, using a lock to protect shared data is sometimes called synchronizing the threads, which is a different but equally common meaning of the word in parallel computing. Both usages are standard in the field.

A lock ensures that only one thread can execute the protected code at a time, making the read-modify-write operation atomic. Here is how it looks in code.

Code 11. Thread-safe increment with lock.

x = 0
lock = asyncio.Lock()

async def thread_safe_increment():
    global x
    # Only one thread can enter this with block at a time.
    async with lock:
        x += 1

When a thread reaches the async with lock line, it tries to acquire the lock. If the lock is already held by another thread, it waits until the lock becomes available. Only one thread can hold the lock at a time, so only that thread is allowed to execute the code inside the with block. The lock is released automatically when the with block exits, at which point another waiting thread can acquire it and continue.

As shown in the following figure, Thread 2 and Thread 3 both wait in gray while Thread 1 holds the lock and runs x += 1. Once Thread 1 releases the lock, Thread 3 happens to acquire it next, followed by Thread 2. Notice that the order is not determined by thread number. Any waiting thread can acquire the lock once it is free.

Figure 6. Lock sequence diagram.

Think of it like a single-occupancy restroom. When someone enters, they lock the door behind them. Anyone else who needs to use it must wait outside until the door is unlocked. Once the person inside finishes and unlocks the door, the next person can enter and lock it again. No two people are ever inside at the same time.

Now let's apply this to parallel matmul. Recall that we can use m * n * k total threads, where each thread handles a single += operation. Here is what that looks like in code.

Code 12. Parallel matmul with race condition.

import asyncio
import numpy as np

async def compute_element(a, b, output, i, j, l):
    output[i, j] += a[i, l] * b[l, j]

async def parallel_matmul(a, b):
    assert a.shape[1] == b.shape[0], "Incompatible shapes."
    m, k = a.shape
    _, n = b.shape
    output = np.zeros((m, n))
    futures = [compute_element(a, b, output, i, j, l)
               for i in range(m)
               for j in range(n)
               for l in range(k)]
    await asyncio.gather(*futures)
    return output

This has a race condition. For each output cell output[i, j], there are k threads all doing += on the same cell concurrently. This is exactly the same bug as the x += 1 example. We can fix it by giving each output cell its own lock. Only compute_element needs to change; parallel_matmul stays the same except for passing the locks in.

Code 13. Fixed compute_element with lock.

async def compute_element(a, b, output, locks, i, j, l):
    async with locks[i][j]:
        output[i, j] += a[i, l] * b[l, j]

Each output cell has its own lock, so threads working on different cells do not block each other. Only threads competing to update the same cell are serialized. Code that correctly protects shared data from race conditions like this is called thread-safe.

You might wonder how the lock itself is implemented safely. After all, acquiring a lock involves checking whether the flag is free and then setting it to taken. If these were two separate instructions, two threads could both read the flag as free at the same time, both decide to acquire the lock, and both set the flag to taken, giving two threads the lock simultaneously, which is exactly the race condition we were trying to prevent in the first place. The lock would be broken by the very problem it was designed to fix.

The answer lies in the hardware. CPUs provide special instructions like compare-and-swap (CAS) that bundle the check and the write into a single uninterruptible step. This is an atomic operation implemented natively as a CPU instruction. The CPU physically locks the memory bus for that one instruction so no other core can read or write that memory location until it completes. Only one thread can win the CAS. The other sees that the flag is already taken and waits. Locks, barriers (which we will introduce in the next section), and all other synchronization tools are ultimately built on top of these hardware-level atomic operations.