Characterizing and Removing Branch Mispredictions (Thesis)
Abstract:
Control-flow mispredictions are a profound impediment to processor
performance, because each misprediction introduces a pipeline bubble
of many cycles' duration. For example, the minimum bubble in the recently
released Alpha 21264 is at least seven cycles, and often as much as
twenty cycles. With such long penalties, even small misprediction
rates harm performance substantially.Although a huge number of techniques have been proposed to combat this
problem, most have focused on only one type of misprediction: conflicts
in the predictor's state tables.This thesis describes a taxonomy of misprediction types and presents
data showing that several other types of mispredictions are just as
important as conflicts. Techniques to attack three of these
misprediction types are then described. Alloying makes both local
and global history available in a single branch predictor structure,
providing robust performance compared against both conventional
two-level predictors and against hybrid predictors. Speculative
update with fixup ensures that the predictor's state remains
up-to-date, while protecting the state against corruption from
mispredicted branches. Finally, multipath execution
simultaneously executes both sides of a branch, eliminating
mispredictions for those branches that are otherwise difficult to
predict.This work also explores tradeoffs among branch prediction,
instruction window size, data cache size, and instruction cache size,
and shows that branch prediction is the most powerful lever on
processor performance.This thesis makes one further contribution, demonstrating that long
benchmarks can be simulated efficiently by simulating only a small but
carefully chosen 50 million instruction segment of the program's
overall execution. This technique avoids unrepresentative initial
phases of a program and sets the simulation window to capture behavior
that is representative in terms of the program's branch prediction
accuracy, cache behavior, and overall execution speed.