MLSYS ENGINEERING

3.2. Race condition

Multi-threading can speed up a program, but it can also introduce bugs. When threads run in parallel, their instructions can interleave in unpredictable ways, which can lead to incorrect results. As a simple example, x += 1 is not a single step. The CPU typically executes it as a read, a compute, and a write back, like this:

Code 5. Instruction breakdown of x += 1.
x = 0  # Initial value
temp1 = x  # Read
temp1 += 1  # Compute
x = temp1  # Write back
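
We can see a similar breakdown without looking at CPU instructions. The sketch below is only an analogy: it lists the bytecode that the CPython interpreter generates for x += 1, which is not the same thing as machine code, and the exact operation names vary between Python versions. The function name increment is made up for this illustration.

```python
import dis

x = 0  # Shared global variable


def increment():
    global x
    x += 1  # One line of source code, several bytecode operations


# Collect the names of the bytecode operations in the function body.
opnames = [ins.opname for ins in dis.get_instructions(increment)]
print(opnames)
```

The printed list should contain a load of x, an add, and a store back to x as separate operations, which mirrors the read, compute, and write back steps in Code 5.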

If two threads each run x += 1 concurrently, one possible execution looks like this, where Thread 1 finishes all three steps before Thread 2 starts:

Code 6. Two threads with no interleaving, correct result.
x = 0  # Initial value
temp1 = x  # Thread 1 Read
temp1 += 1  # Thread 1 Compute, temp1 is 1.
x = temp1  # Thread 1 Write back, x is now 1.
temp2 = x  # Thread 2 Read
temp2 += 1  # Thread 2 Compute, temp2 is 2.
x = temp2  # Thread 2 Write back, x is now 2.

This gives the correct result of 2. However, another possible execution has the instructions interleaved like this:

Code 7. Two threads with interleaving, incorrect result.
x = 0  # Initial value
temp1 = x  # Thread 1 Read
temp1 += 1  # Thread 1 Compute, temp1 is 1.
temp2 = x  # Thread 2 Read
temp2 += 1  # Thread 2 Compute, temp2 is 1.
x = temp1  # Thread 1 Write back, x is now 1.
x = temp2  # Thread 2 Write back, x is still 1.

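This gives the wrong result of 1. Both threads read x while it was still 0, so Thread 2's write back overwrites Thread 1's update instead of adding to it. Which execution actually happens depends on the exact timing of the threads, which can vary between runs. The code may work correctly most of the time and fail only occasionally, making the bug very hard to reproduce and debug. The same bug can occur in matmul, because += is used when computing the inner product: if several threads accumulate into the same output element, some of their updates can be lost.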
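
Because real thread timing is unpredictable, the two executions in Code 6 and Code 7 are easier to study if we replay them by hand. The following is a minimal sketch that does not use real threads. It steps through the read, compute, and write back of each thread in a fixed order. The function run_schedule and the way a schedule is encoded as a list of thread ids are assumptions made for this illustration.

```python
def run_schedule(schedule):
    """Replay x += 1 for each thread, one step at a time, in the given order.

    schedule is a list of thread ids. The k-th appearance of a thread id
    runs that thread's k-th step: read, then compute, then write back.
    """
    x = 0          # Shared variable, initial value
    temp = {}      # Private temporary of each thread (temp1, temp2, ...)
    progress = {}  # Number of steps each thread has completed so far
    for tid in schedule:
        step = progress.get(tid, 0)
        if step == 0:
            temp[tid] = x   # Read
        elif step == 1:
            temp[tid] += 1  # Compute
        elif step == 2:
            x = temp[tid]   # Write back
        else:
            raise ValueError(f"thread {tid} has more than three steps")
        progress[tid] = step + 1
    return x


no_interleaving = run_schedule([1, 1, 1, 2, 2, 2])  # Same order as Code 6
interleaved = run_schedule([1, 1, 2, 2, 1, 2])      # Same order as Code 7
print(no_interleaving, interleaved)  # Prints: 2 1
```

Both schedules run exactly the same six steps. Only the order differs, and that alone changes the final value of x.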

This situation, where multiple threads write to the same variable concurrently, or one thread writes while others read, is known as a race condition. It arises when instructions from different threads interleave without waiting for or blocking each other, which is known as asynchronous execution.

So, asynchronous execution can save time by running things in parallel, but it requires careful management to avoid race conditions. Whenever multiple threads modify the same variable asynchronously, updates can be lost.
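
One common form of such management, shown here only as an illustrative sketch, is to guard the shared variable with a lock so that the read, compute, and write back of one thread cannot interleave with those of another. The example uses threading.Lock from the Python standard library. The names counter and add_many, and the thread and iteration counts, are arbitrary choices for this illustration.

```python
import threading

counter = 0              # Shared variable
lock = threading.Lock()  # Guards every update to counter


def add_many(n):
    global counter
    for _ in range(n):
        # Only one thread at a time can hold the lock, so the read,
        # compute, and write back of counter += 1 run as one unit.
        with lock:
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # Prints: 40000
```

The price is that threads now wait for each other at the lock, so this part of the program no longer runs in parallel.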

Note that we often use async and sync as short forms of asynchronous and synchronous respectively.