3.5. Atomic operations
Recall from Race condition that using += across multiple
threads can produce incorrect results due to the race condition. For your
convenience, we repeat the example that explains the bug here.
x = 0 # Initial value
temp1 = x # Thread 1 Read
temp1 += 1 # Thread 1 Compute, temp1 is 1.
temp2 = x # Thread 2 Read
temp2 += 1 # Thread 2 Compute, temp2 is 1.
x = temp1 # Thread 1 Write back, x is now 1.
x = temp2 # Thread 2 Write back, x is still 1.
The root cause is that the read-modify-write steps of x += 1 are three
separate instructions that can be interrupted and interleaved by other threads.
What we need is to make these steps atomic, meaning they are treated as a
single indivisible unit that no other thread can interrupt. The tool we use to
achieve this is a lock. You may notice that we used sync earlier to
mean sequential, blocking execution, as opposed to async. Here, using a lock
to protect shared data is sometimes called synchronizing the threads, which is
a different but equally common meaning of the word in parallel computing. Both
usages are standard in the field.
A lock ensures that only one thread can execute the protected code at a time, making the read-modify-write operation atomic. Here is how it looks in code.
x = 0
lock = asyncio.Lock()
async def thread_safe_increment():
global x
# Only one thread can enter this with block at a time.
async with lock:
x += 1
When a thread reaches the async with lock line, it tries to acquire the lock.
If the lock is already held by another thread, it waits until the lock becomes
available. Only one thread can hold the lock at a time, so only that thread is
allowed to execute the code inside the with block. The lock is released
automatically when the with block exits, at which point another waiting
thread can acquire it and continue.
As shown in Figure 6,
Thread 2 and Thread 3 both wait in gray while Thread 1 holds the lock and runs
x += 1. Once Thread 1 releases the lock, Thread 3 happens to acquire it
next, followed by Thread 2. Notice that the order is not determined by thread
number. Any waiting thread can acquire the lock once it is free.
Think of it like a single-occupancy restroom. When someone enters, they lock the door behind them. Anyone else who needs to use it must wait outside until the door is unlocked. Once the person inside finishes and unlocks the door, the next person can enter and lock it again. No two people are ever inside at the same time.
Now let's apply this to parallel matmul. Recall that we can use m * n * k
total threads, where each thread handles a single += operation. Here is what
that looks like in code.
import asyncio
import numpy as np
async def compute_element(a, b, output, i, j, l):
output[i, j] += a[i, l] * b[l, j]
async def parallel_matmul(a, b):
assert a.shape[1] == b.shape[0], "Incompatible shapes."
m, k = a.shape
_, n = b.shape
output = np.zeros((m, n))
futures = [compute_element(a, b, output, i, j, l)
for i in range(m)
for j in range(n)
for l in range(k)]
await asyncio.gather(*futures)
return output
This has a race condition. For each output cell output[i, j], there are k
threads all doing += on the same cell concurrently. This is exactly the same
bug as the x += 1 example. We can fix it by giving each output cell its own
lock. Only compute_element needs to change; parallel_matmul stays the same
except for passing the locks in.
async def compute_element(a, b, output, locks, i, j, l):
async with locks[i][j]:
output[i, j] += a[i, l] * b[l, j]
Each output cell has its own lock, so threads working on different cells do not block each other. Only threads competing to update the same cell are serialized. Code that correctly protects shared data from race conditions like this is called thread-safe.
You might wonder how the lock itself is implemented safely. After all, acquiring a lock involves checking whether the flag is free and then setting it to taken. If these were two separate instructions, two threads could both read the flag as free at the same time, both decide to acquire the lock, and both set the flag to taken, giving two threads the lock simultaneously, which is exactly the race condition we were trying to prevent in the first place. The lock would be broken by the very problem it was designed to fix.
The answer lies in the hardware. CPUs provide special instructions like compare-and-swap (CAS) that bundle the check and the write into a single uninterruptible step. This is an atomic operation implemented natively as a CPU instruction. The CPU physically locks the memory bus for that one instruction so no other core can read or write that memory location until it completes. Only one thread can win the CAS. The other sees that the flag is already taken and waits. Locks, barriers (which we will introduce in the next section), and all other synchronization tools are ultimately built on top of these hardware-level atomic operations.