Dynamic Branch Prediction
ECE 570 High Performance Computer Architecture
Electrical and Computer Engineering
Oregon State University
Professor Ben Lee


Alexey Malishevsky (malishal@research.cs.orst.edu
Douglas Beck (beckd@research.ece.orst.edu
Andreas Schmid (schmidan@ece.orst.edu
Eric Landry (landry@ece.orst.edu
 
 
 


Table of Contents

1. Abstract 
2. Introduction 
3. Dynamic Branch Prediction 
  1. One level branch predictor
  2. Various predictor features
  3. Two level branch predictor
  4. Hashing techniques
  5. Difficulties
  6. Hybrid branch predictor
  7. Multiple component Hybrid branch predictor
  8. Branch classification
  9. Industry branch prediction implementations
4. Implementation 
  1. Static branch predictors
  2. Decode History Table (DHT)
  3. Branch History Table (BHT)
  4. Combination BHT-DHT
  5. Correlation - Based Prediction
  6. Two-Level Adaptive Prediction
  7. Skewed branch predictor
  8. Gshare branch predictor
  9. 'Original' implementation
  10. Cache details
  11. Associativity
  12. Implementation issues
5. Simulation Platform 
  1. Simulator executable
  2. Command line options
  3. Spec95 benchmarks
6. Statistics 
  1. Cache size Cache Initialization Time
  2. Hashing Technique
  3. Prediction Bits
  4. Deterministic Finite Automata
  5. Associativity
  6. History Bits
  7. Prediction Scheme
  8. Software
7. Results 
  1. Overall simulation
8. Limitations & Future Work
9. Conclusion
10. Links
11. References 


Abstract

Branch prediction has become an area of interest due to its effect on the performance of pipelined and superscalar processors. Various methods have been proposed to speculate the path of an instruction stream after a branch. This paper presents our evaluation of the performance of several prediction schemes by both the accuracy of prediction and the amount of hardware the technique requires to reach such a level of accuracy.

Introduction

Branch prediction is used to overcome the fetch limitation imposed by control hazards in order to expose instruction-level parallelism (ILP), the key ingredient to pipelined and superscalar architectures that mask instruction execution latencies by exploiting (ILP). Branch prediction can be thought of as a sophisticated form of prefetch or a limited form of data prediction that attempts to predict the result of branch instructions so that a processor can speculatively fetch across basic-block boundaries. Once a prediction is made, the processor can speculatively execute instructions depending on the predicted outcome of the branch. If branch prediction rates are high enough to offset misprediction penalties, the processor will likely have a better overall performance. Without branch prediction, such a processor must stall whenever there are unresolved branch instructions. Since on average 20 percent of instructions are branches, this imposes a substantial penalty on the performance of such processors. Many branch prediction schemes have been proposed. This work uses selected benchmark programs from the SPEC95 to compare several well known branch prediction schemes.

Implementation Overview

The branch prediction schemes chosen for this comparison are statically taken/not-taken, bimodal, branch history table (BHT), combination, correlation, two-level adaptive (TLA), skewed, and gshare branch prediction schemes. Metrics for the various schemes were generated with the sim-inorder simulator from the SimpleScalar tool suite written by Todd M. Austin while at the University of Wisconsin at Madison. Implementation of these schemes included various modifications to the bpred.c, bpred.h and sim-inorder.c files The tool suite is available from http://www.cs.wisc.edu/~mscalar/simplescalar.html


Background Information

In order to explain Dynamic Branch Prediction, one has to differentiate it from Static Branch Prediction. Static Branch Prediction in general is a prediction that uses information that was gathered before the execution of the program. The simplest predictors are to predict that the branch is always taken (MIPS-X, Stanford) or to predict that the branch is always not taken (Motorola MC88000). Another way is to run the program with a profiler and some training input data predict the branch direction which was most frequently taken according to profiling. This prediction can be indicated by a hint bit in the branch opcode.

Figure 1: One-Level Branch Predictor

Dynamic Branch Prediction on the other hand uses information about taken or not taken branches gathered at run-time to predict the outcome of a branch. There are several dynamic branch predictor in use or being researched nowadays. Those include One-Level Branch Predictors, Two-Level Branch Predictors and Multiple Component Branch Predictors.

For One-Level Branch Prediction, in the simple case of direct mapping and just the branch address as information source, one takes k least significant bits of the branch instruction address in order to map it to an entry of a single column table with 2k entries. Each entry consists in general of a n-bit counter, which is designed as a saturated counter in most cases. A saturated counter works like this: each time the branch is taken the counter is incremented; if the maximum value is already reached, nothing is done. Each time the branch is not taken the counter is decremented; if the minimum value = 0 is reached it stays the same.

The prediction works now after this principle: If the value of the counter is below 2n-1 , the branch is predicted as not taken, otherwise the branch is predicted as taken.

Various Features

MAPPING

Direct mapping, as explained above, uses the least significant k bits of branch instruction’s address. This is the simplest and fastest mapping method and requires the least hardware cost. The table size, however, has to be chosen carefully. If it is too large, only a small part of the buffer will be used. If it is too small, many mappings to the same location will occur.

Fully associative mapping uses tags associated with each table entry such that with the help of the k least significant bits and this tag each entry can be uniquely associated with the same branch instruction. The search gets more complicated on the other hand, especially the comparisons of the tags take a lot of time and hardware, but the table can be used more efficiently as the entries do not have to be ordered in any way. Various replacement strategies in the case of a full table are possible, like "least recently used" or "first in first out". If the outcome of a branch is to be predicted and no table entry matches the branch instruction address, a default predictor has to be used.

Set associative mapping is a combination of the two mappings above. The branch address is directly mapped to sets of multiple table entries, and within those sets the entries are searched fully associative. This is a good compromise of the positive aspects of the different mappings.

Apart from that, the branch instruction address can also be mapped by using Hashing, i.e. a so called "hash function" is used to do the mapping. This function can be chosen pretty arbitrarily using i least significant bits of the branch address. The goal is to distribute the mappings as equally and efficiently over the whole table, avoiding overlappings as much as possible.

N-BIT COUNTER

1-bit counter: A 1-bit counter records the last outcome of the branch. This is the most simple version possible, though not very effective.

2-bit counter: This is the most commonly used counter. It combines affordable costs with good predictions. It can, for example, predict the outcome of loop branches effectively: During the loop, the loop branch is always predicted as taken, of course. At the last iteration of the loop, it predicts wrongly, but the 2-bit saturated counter is only set back from strongly taken to weakly taken, meaning that at the next entry of the loop the loop branch will again be predicted to be taken, which is then correct, and so on. So only one false prediction occurs.

BUFFERSIZE
 

The Buffersize varies according to the number of bits taken from the branch address for the mapping. Typical are buffersizes from 256 to 4K entries.

Figure 2: Two-Level Branch Predictor

The Two-Level Branch Predictor, also referred to as Correlation-Based Branch Predictor, uses a two-dimensional table of counters, also called "Pattern History Table". It was introduced by Yeh and Patt [1] who because of the fact that the outcome of the branch depends not only on the branch address but also on the outcome of other recent branches (inter branch correlation) and a longer history of the same branch itself (intra branch correlation). The table entries are two-bit counters. So in order to predict the branch outcome and choose a table entry, we have:

THREE INFORMATION SOURCES

The Branch Address was mentioned before. It uses k least significant bits of branch instruction address.

A Global Branch History is a shift register in which the outcome of any branch is stored. A "one" is stored for a taken branch and a "zero" for a non-taken one. The register is shifted through while storing the newest value. In order to address the table, the n last branch outcomes are considered.

The Local History Table is a table of shift registers of the sort of a global branch history. Each shift register, however, refers to the last outcomes of one single branch. Since this local history table is accessed as a one-level branch prediction table, it is not guaranteed that no overlapping of the branches occurs, and in one shift register may be stored the information of different branches.

Since the table has only two dimensions, two of the three information sources have to be selected to access rows and columns. Another method is to merge two sources to one, which will be covered later.

Alternatively to the two-dimensional table as shown in figure 2, it is also possible to employ a local history table and map the last k entries of the respective shift register to a one-dimensional Global Pattern History Table, that means that each same pattern of the last branch outcomes is mapped onto the same table entry, regardless of the actual branch address. As a matter-of-fact, this is the original correlation-based branch predictor suggested by Yeh and Patt[1], is also referred to as Yeh’s Algorithm and has been proven to be very successful.

In general it can be stated that a two-level branch predictor is more accurate than a one-level branch predictor, but this advantage is also associated with the disadvantage of a more costly implementation and the fact that the so called Warm Up Phase, i.e. the time the table entries contain usable values, is much longer.

 
 
Figure 3: The merging of sources
 

 

The merging of sources can happen in many ways, but two are most common: Concatenation and XORing. Considering figure 3, both are pretty self-explanatory. 

In the case of concatenation, simply some k bits of one register are taken and another m-k from the second one, assuming that the result is supposed to have m bits. Both are then concatenated into one.

For XORing, the bits of the two source registers are bitwise XORed with another to form the result.

Experimental results have shown that XORing yields the better results, which means that it reduces aliasing in respect to concatenation.

Problems

ALIASING

What was until now referred to as overlapping of the mappings of different branch addresses to the same branch history table entry is actually called Aliasing. Aliasing is due to a limited table size, such that there are more branch instruction addresses than table entries. Due to their nature, aliasing occurs only with direct mapping and hashing, but not with fully associative mapping. A trade-off has to be made between aliasing and affordable table size.

In general, aliasing does not necessarily mean that the number of correct branch predictions is reduced. It is even possible that aliasing improves the correct prediction rate. In this case this is called Constructive Aliasing. If it actually reduces the correct prediction rate one speaks of Destructive Aliasing, and if aliasing does not affect branch prediction at all, it is referred to as Harmless Aliasing.

CONTEXT SWITCHES

When several processes are using one processor, then the act of storing the variables of the currently running process, suspending it and switching over to the next process is called a Context Switch. The reason this effects branch prediction lies in the fact that all Pattern History Tables, Branch History Tables or whatever was used are likely to be flushed. This means that the next time the process is running again all this information has to be collected again in order to allow a reliable branch prediction. During this Warm Up Phase the performance of such history-based predictors is decreased significantly, and a simpler branch predictor that uses a smaller history or none at all (like Always Taken) usually yield better results.

 

 

Figure 4: Hybrid Branch Predictor

 

A Hybrid Branch Predictor as proposed by McFarling [2] is built by taking two branch predictors that work independently from each other and a selector that chooses which branch prediction to take. The selector consists of a table of 2-bit counters and is mapped to by the k least significant bits of the branch instruction address.

The updating of all the Branch History tables involved with the predictors and the selector works as follows. As both predictors operate independently they are also updated independently as mentioned in earlier sections. The selector however is updated according to the table shown below:

 

 
Predictor 1
Predictor 2
Update to Selector
correct
correct
no change
correct
incorrect
increment
incorrect
correct
decrement
incorrect
incorrect
no change
 

The idea is to store the success of the predictors. An evaluation can only be done if the predictors suggest different branch directions, though. If they predict the same, no change is made. The selection is made by choosing predictor 1 if the selector entry has a value of 2 or 3, and predictor 2 is taken of values 0 or 1.

The reason for coming up with such a hybrid branch predictor comes from the observation that some branch predictors perform well with one sort of programs, i.e. floating-point programs containing a lot of loops, but bad with others where different branch predictors are more successful. To avoid the negative effect of context switches it is also a good idea to combine a branch predictor with a large history table with another predictor with a small or no history table. Thus with the help of a hybrid branch predictor it is possible to exploit the different properties of the single-scheme components for a better overall performance.
 

 
Figure 5: Multiple Component Hybrid Branch Predictor


The Multiple Component Hybrid Branch Predictor is a generalization of the Hybrid Branch Predictor and was suggested by Patt [3]. Here N independent branch predictors are combined, thus allowing to combine general predictors that yield acceptable success rates for all programs with specialized predictors designed for special applications. The result is a branch predictor that performs well on most kinds of programs without having to change anything.

The tricky thing lies in the selection of one of the N predictors. One table with 
2-bit counters is obviously not sufficient any more. So to perform the selection, [3] adds N 2-bit counters to each entry in the Branch Target Buffer which are then called Predictor Selection Counters. For each new entry in the Branch Target Buffer, all Predictor Selection Counters are initialized to 3.

The predictor to be selected is the one with value 3 in its respective Predictor Selection Counter. If there are more than one predictor with value 3, as in the beginning, a Priority Encoder decides statically which of the predictors to take.

The Predictor Selection Counters are updated by decrementing the counters for all predictors that made an incorrect prediction if one of the predictors with value 3 made a correct prediction. Otherwise all predictors that predicted correctly are incremented. This system guarantees that always one Predictor Selection Counter will be 3. It is also more accurate than a saturated counter: A predictor which was correct for the last 5 times can be differentiated from one that was correct for the last 4 times.

Figure 6: Component Predictors involved
If the time needed for the selection seems to be a bottleneck of the system, it is also possible to compute the predictor to be chosen before the branch instruction is fetched and store it in the Branch Target Buffer, reducing the necessary selection time to the multiplexer operation time.

Figure 6 shows a brief overview of the component predictors suggested by [3]. As one can see there is a great variety involved. The loop predictor for example counts the number of iterations a loop takes and with the help of this predicts the number of iterations the loop will take the next time, while Always Taken is the most simple branch predictor but quite effective after a context switch.

In order to choose the priorities of the predictors thorough test runs are necessary: All predictors have to be tested with a variety of test programs, and the parameters for the predictors have to be varied, too.

This Multiple Component Hybrid Branch Predictor has yielded better results as all other known branch predictors, with the overall size of it being equal to that of the competitive branch predictors.

Branch Classification

Another approach to improve branch prediction rate is to classify branches statically [4]. This means that the program is run with a profiler and some input data to determine the probability of a branch to be taken or not taken. The branches are basically divided in two classes: Within those two classes subdivision are can be made to differentiate the branches further. The idea is then to find suitable branch predictors for each class of branches.

It turned out that the most successful approach is to apply a static profiler-guided branch predictor with the uni-direction branches and a dynamic hybrid branch predictor for the remaining mixed-direction branches.

Implementations

Finally an overview over the branch prediction schemes implemented in some current machines is provided.

The PowerPC604 has a 64 entry fully associative Branch Target Buffer for predicting the Branch Target Address and a decoupled direct mapped 512 entry Pattern History Table.

The PowerPC620 has a 256 entry two-way set associative Branch Target Buffer for predicting the Branch Target Address and a decoupled direct mapped branch prediction buffer. 

The UltraSPARC is reported to have a 2-bit branch prediction scheme.

The Intel Pentium contains a 256 entry 4-way set associative Branch Target Buffer. Coupled with each Branch Target Buffer entry is a 2-bit branch predictor that is responsible for the branch prediction. This is a simple One-Level Branch Predictor.

The Intel Pentium Pro works with a 512 entry 4-way set associative Branch Target Buffer. Coupled with each Branch Target Buffer entry is in this case a 4-bit local branch history. This is mapped in a second level onto a Global Pattern History Table, thus implementing Yeh’s Algorithm mentioned already earlier.

Implemented branch prediction methods

The Environment

In this project, we implemented a number of branch prediction methods, benchmarked them, and analyzed the results. 

The environment included ECE instructional machines running under Linux OS. In order to implement branch prediction schemes, we used SimpleScalar simulator. SimpleScalar simulator uses its own abstract instruction set (RISC ISA, similar to MIPS) to simulate programs in the superscalar environment. This simulator allows in-order and out-of-order instruction execution. It allows multiple instructions to be fetched, decoded, issued, executed, committed in one clock cycle. In other words, SimpleScalar simulator simulates truly superscalar execution model, similar to many common superscalar microprocessors. It is also able to incorporate a branch prediction. Its package includes C compiler (gcc) which allows to compile C programs into SimpleScalar ISA to be run on this simulator. Important observation is that each branch misprediction costs 2 cycles of penalty. However, we didn’t consider the influence of prediction rates on running time (# cycles to execute n instructions, or IPC). We just looked on prediction rates of different schemes. But, because of high average percentage of branch instructions (about 22% on average), misprediction costs can tremendously decrease IPC. 

Overview of Implemented Methods

We have implemented a couple of common static and dynamic branch prediction schemes.  One-level branch prediction schemes are:   Two-level branch prediction schemes are:  Detailed descriptions of implemented methods 

Always untaken and always taken 

Always untaken scheme just constantly returns the address of the next instruction. 

Always taken scheme uses branch target address buffer (BTA buffer) to store and re-use target addresses. Each time the branch is resolved, the corresponding target address is stored in the BTA cache (it will be described later). The next time, "always taken" scheme always predicts the branch outcome as taken, retrieves the corresponding branch target address from this cache, and uses this address to fetch the next instruction. 

Decode History Table (DHT) 



Direct history table (DHT), also called bimodal predictor, is the simplest branch prediction method which uses the address of the branch instruction to index the global table of prediction bits. Usually, low bits of the address are used to index directly into the table entry. Then branch prediction bits are used to predict branch outcome. When the branch outcome is actually known, the corresponding entry in this table is updated. This method is the cheapest in terms of hardware. 

number of bits(order) = 2 ´ 2(order) 

Branch History Table (BHT) 

 


Branch history table method (BHT) is similar to DHT, except that it uses the tagged structure. Each entry in the table, in addition to prediction bits, contains the address of the branch instruction. It introduces the concept of hit and miss. In this scheme, after an entry is referenced (indexed by the address of the branch instruction), its tag is compared against the address of the branch instruction. If it matches, hit occurs and the predictions bits from this entry are used to predict the branch outcome. Otherwise, miss happens, and the branch is predicted as not taken. This method effectively removes all aliasing, so any branch prediction bits, associated with one branch instruction, cannot be affected by another branch instruction (except that a branch instruction can force an entry accosiated with another branch instruction to be removed). In addition, we have implemented the fully associative array which allows to search the whole table for the matching tag. Least frequently used replacement algorithm has been used here. However, this implementation didn’t create any performance advantage over the directly mapped tagged table; as a result, it was abandoned. 

number of bits(order) = (2 + 32) ´ 2(order) 

Combination of DHT-BHT 

 


This scheme is a combination of the previous two schemes. This scheme accesses BHT, and, if there is a hit, it uses prediction bits from BHT to predict the branch outcome. Otherwise, DHT is indexed and prediction bits are retrieved from the corresponding entry of DHT. If hit occurs, DHT is not updated after the branch outcome is known. The main advantage of this scheme is that we can make DHT much larger than BHT because of its low hardware requirements. 

number of bits(order1, order2) = 2 ´ 2(order1) + (2 + 32) ´ 2(order2) 

Correlation - Based Prediction 

 


 
Correlation-based prediction scheme differs from previously described ones in the way that the branch outcome is thought to depend not only on this branch’s previous behavior, but also on the behavior of previously executed branches in the instruction stream. The behavior of branches in the instruction stream is recorded in the global shift register. If a branch is taken, 1 is inserted into the shift register and the shift register is shifted; if untaken - 0 is inserted into the shift register and the shift register is shifted. The table of prediction bits is two-dimensional. The row is selected by the low bits of the address of the branch instruction, and the column is selected by the value in the shift register. The underlining idea behind this scheme is that the behavior of a branch can correlate with the behavior of other branches. This scheme has large hardware requirements in comparison to one-level predictors. 

number of bits(order, history bits) = (2) ´ 2(order + history bits) 

Two-Level Adaptive Prediction 

 


Two-level adaptive prediction scheme (also referred as the Yeh algorithm), like the previously described method, also contains history information of the previous branch outcomes. But this information is local and kept in correlation registers. For each branch, there is an associated correlation register, and the pattern of the branch history is stored there. Based on this pattern, a particular entry in the global pattern table (GPT) is accessed. The corresponding correlation register and the corresponding entry in GPT are updated each time a branch is resolved. Any entry in GPT can be shared by several branches. In order to lookup, we index the table by the low bits of the address of the branch instruction to find the corresponding correlation register, and use its value to index GPT to get the appropriate prediction bits. 

This scheme has reasonable hardware requirements which are much lower that of the correlation-based predictor. 

number of bits(order, correlation bits) = (correlation bits) ´ 2(order) + (2) ´ 2(correlation bits) 

Skewed Branch Predictor 

 

Skewed branch predictor uses the global history, like correlation-based scheme. It takes global history bits and the address of the branch instruction, and uses this information to index three separate arrays of branch prediction bits. Three different hash functions are used to index these tables (each hash function indexes its own table). Then, these three sets of prediction bits are obtained and three branch outcomes are computed based on each set. Voting by majority is used to make the final prediction. During the update, this method updates all three tables. By (Michard, at al.), good choice of hash functions can eliminate almost all aliasing. 

The hardware requirements are small, just three times the hardware requirements for the bimodal predictor. 

number of bits(order, history bits) = 3 ´ (2) ´ 2(order after hashing) 

Gshare Scheme 

 


 


Gshare scheme is similar to bimodal predictor or branch history table. However, like correlation-based prediction, this method records history of branches into the shift register. Then, it uses the address of the branch instruction and branch history (shift register) to XOR’s them together. The result is used to index the table of prediction bits. Implementation can use either tagged or tagless tables. In the tagged table, we compare the indexed entry with a tag and, if they don’t match, we just predict as not taken. If they match, we use the corresponding prediction bits to predict the branch outcome. In the tagless implementation, the table is just directly indexed and the corresponding prediction bits are used to predict the branch outcome. We have tried both these methods, and found that tagless table performs better. We just index an entry and use its contents (prediction bits) to predict the branch outcome. 

The hardware requirements with tagless table are the same as for the bimodal predictor. 

number of bits(order, history) = (2) ´ 2(order after hashing) 

"Original" Implementation 

After we implemented all these methods, we tried to look on some new approaches and improvements. We didn’t reinvent the wheel and we just slightly modified the best scheme. The best scheme happened to be the two-level adaptive predictor. We modified it to shift both taken? bit and correct? bit (whether the prediction was correct) into the correlation register. 

number of bits(order, correlation bits) = (correlation bits) ´ 2(order) + (2) ´ 2(correlation bits) 


Storing the branch target addresses: BTA buffer 

In the previous section, we were concerned with the correct prediction of the branch outcome. If we predict the branch as untaken, we just fetch the next instruction. However, if we predict the branch as taken, we need somehow to figure out the branch target address before it is actually calculated. In order to solve this problem, we use the cache for branch target addresses (BTA). Each time the branch outcome and BTA are calculated, while updating predictor’s data (prediction bits and, sometimes, history bits), we can also store BTA in a cache, called BTA buffer. We will call it BTA instead of BTB (as some implementations call it) because some implementations even store the target instruction (used for branch folding). In our case, we just store BTA in the cache. We have separated two important issues: correctly predict the branch outcome and correctly obtain BTA. Branch predictors solve the first problem. Now we will discuss our implementation of the solution for the second problem. 

Direct-mapped cache implementation 

The first implementation of BTA was straightforward: use the table of pairs - tag (the address of branch instruction) and branch target address, and index these pairs by using low bits of the address of the branch instruction. If tag matches the address of the branch instruction, it is a hit, and we use the corresponding BTA. Otherwise, it is a miss, and we use zero for BTA. 

 


N-ways set-associative cache implementation 

However this direct mapped cache is not very efficient. It requires too large size to produce "reasonable" good hit rates. As a result, we have implemented n-ways set-associative cache for BTA buffer. 

The address of the branch instruction is used to index the table (low bits). Each entry contains a list of <tag, BTA> pairs. The cache simultaneously searches for the matching tag. If it finds it, it returns the corresponding BTA (hit). If not, it returns 0 (miss). Cyclic (LIFO) replacement policy was used to implement the replacement of the BTA. Counters are associated with each entry to implement the replacement strategy. 

BTA with associativity > 1 proved to be very effective, comparing to the direct mapped cache with the same number of entries (1-way cache uses k entries, and n-ways set-associative cache uses k/n entries for comparison). 

 


 


 
Implementation issues 
 
After implementation, we could see several implementation issues which are important to consider. 
    1. Concatenate
    2. XOR

    3. Mix?
Next sections will describe these issues in more detail. 

DFA’s for prediction bits 

In addition several different DFA’s have been implemented to update prediction bits. 

There are: 
Simulation Platform

The simulation platform for this project was a Pentium Pro 200 running Linux. This platform is acceptable for generating branch prediction statistics. The simulation can execute one million branch instructions between one and five minutes simulation time depending on the processor load and the percentage of branch instructions in a program relative to non-branch instructions.

The branch prediction software has been implemented for the sim-inorder executable. Sim-inorder was used because of our ability to modify the program without compromising program functionality or stability. Shown below are the command line options for the for the simulator:

 

As you can see, the simulator has a number of command line options. Will discuss them individually in more detail. I will also discuss some things that should be considered when using this executable. The command line options are as follows:
 
Option
Description
-Tn
Predictor Type. This option lets the user select the branch predictor type for the simulation. There are currently nine branch predictor types implemented within this software:

 

1 Skewed Branch Prediction

2 Decode History Table (DHT) Branch Prediction 

3 Branch History Table (BHT) Branch Prediction

4 Combination (DHT-BHT) Branch Prediction

5 Global Correlation Branch Prediction

6 Two Level Adaptive Prediction (TLA) Branch Prediction

7 Gshare Branch Prediction

0 Static Not Taken

100 Static Taken
-an
Associativity. This option allows the user to adjust the BTB associativity. Setting n to a value of one will make the BTB non-associative. Making the n value greater than one will make the cache n-way set associative. Keep in mind that making the cache associative can make the cache consume large amounts of memory during simulation. Increasing n beyond reasonable limits will cause the software to run out of memory during the simulation.
-cn
BTB Cache Size. This option sets the size of the Branch Table buffer. This number must be a power of 2. If this number is not a power of 2 the simulation will not execute.
-bn:m
BPB Cache Size and Automata Type. The variable n sets the automata type for the prediction bits. There are four automata type implemented on the simulator: 2B2SC, 2BTS, 2BTB, 2B. These are a follows:

 

2B2SC - 2 Bit with Dual Short Cut

2BTS - 2 Bit with Taken Short Cut

2BTB - 2 Bit with Taken Bias

2B - 2 Bit Saturating Counter

 

The m variable sets the size of the Branch Prediction Bits buffer size. This number must be a power of 2. If this number is not a power of 2 the simulation will not execute.
-xn
Intermediate Statistics. This option will force the simulator to output intermediate prediction rates every 100 branch instructions when n is non-zero. If this option is not input to the simulator or n is set to zero there will be no intermediate output.
-Pn
Prediction Bits. This option allows the user to set the number of prediction bits used per cache entry. If this option is used and n is set to any number other than 2 bits, an n-bit saturating counter will be used as the automata type.
-Hn
History Bits. This option allows the user to set the number of history bits (Global or Local depending on the predictor type). Keep in mind that large values of n will have large memory requirements for the simulation. If the simulator runs low on memory, it will terminate.
 

Additional things that should be considered when using this executable are as follows:
Once these points are kept in mind it is easy to generate scripts that will execute multiple simulations and generate output files. The following script will run a number of Spec95 benchmarks for a given predictor type and generate a Matlab compatible output file that can used to view initial results:

sim-inorder -T$1 -a4 -c16 -b16 gcc1 spec/test.c
sim-inorder -T$1 -a4 -c16 -b16 ijpeg -image_data specmun.ppm
sim-inorder -T$1 -a4 -c16 -b16 swim < swim.in
sim-inorder -T$1 -a4 -c16 -b16 spec/hydro2d < hydro2d.in
sim-inorder -T$1 -a4 -c16 -b16 spec/wave5 < wave5.in
sim-inorder -T$1 -a4 -c16 -b16 spec/turb3d < turb3d.in
sim-inorder -T$1 -a4 -c16 -b16 tomcatv < tomcatv.in
sim-inorder -T$1 -a4 -c16 -b16 compress95 < gcc1
sim-inorder -T$1 -a4 -c16 -b16 anagram words < input.txt 

This project uses the commercial software package Spec95 to provide a simulation environment that simulates all the environments that the branch prediction schemes will find themselves in the real world. Spec95 is a compilation of a number of benchmark programs. I have listed a the Spec95 benchmark programs used in this project as follows:
 
Spec95 Executable Processor Focus Description
Gcc Integer & Branch Intensive A C software language compiler. Arguably one of the most used software packages worldwide.
Ijpeg Integer & Branch Intensive Image compression and decompression
Swim Single Precision Intensive Shallow Water Simulation Model
Hydro2d Double Precision Intensive Astronomical Software Program. Solves Navier-Stokes equation.
Wave5 Double Precision Intensive 2D electromagnetic particle-in-a-cell. Simulated 750,000 particles.
Turb3d Double Precision Intensive Simulates isotropic homogeneous turbulence.
Tomcat Double Precision Intensive Double precision floating point Fortran benchmark.
Compress95 Integer & Branch Intensive Compression algorithm
Anagram Integer & Branch Intensive Text manipulation software
 
These benchmark programs stress the branch prediction algorithms in different ways. Some of these benchmarks have very predictable branch instructions (~99% accurate branch predictions). Whereas, other benchmarks are extremely hard to predict branch outcomes. 

Statistics

The branch instruction outcome prediction accuracy is affected by many different variables. This paper attempts to focus on those main components individually. The components that will be focused on in this section are as follows:  This paper will focus on each of these components individually in an attempt to increase the overall prediction accuracy.

Cache Size

The overall cache size has a large affect on the overall prediction rates. When the cache is small, there is a large amount of aliasing into the cache. As a result, there is a lower overall cache hit rate when the cache size is small because many cache entries need to be victomized before they can contribute toward branch prediction. In addition, when there are more cache entries, there will less aliasing and less victomizing as a result of the larger cache size. In this condition, the branch prediction schemes should work well because there is a large amount of information available in the cache and it has a high probability of being the correct data because of the lower aliasing rates. Therefore, when as the cache size is increased, the overall prediction accuracy should increase. The prediction accuracy will eventually saturate because other variables such as the predictability of the program and its data will come into play. This behavior is shown in the following simulation where the cache size is swept from a cache size of 2^6 to a cache size of 2^18.


 

Cache Initialization

The cache initialization time will also affect the overall prediction rates. Every prediction scheme has a unique cache initialization time. This initialization time depends on a number of variables. One of the variables it is most dependent on is the size of the cache. A 4-way associative 16 entry cache is going to take approximately 4 times the as long as a 16 entry non-associative cache. At first glance, both caches appear to have 16 entries. One of the cache sizes in the previous example has 64 entries because of its 4-way associativity. As a result, the algorithm is going to take longer to initialize than the simple direct mapped 16-entry cache.

This problem of initialization is compounded by issues such as context switching. When the processor switches to another task or program, the cache is flushed. This is because the processor is starting a new task. If the cache was to use it's current prediction bits, the prediction rates would be poor unless there is a correlation between the two tasks.

Shown below is a graph of the cache initialization algorithms for several prediction algorithms. As you can see Bimodal, the most simple algorithm, initializes itself within a few hundred cycles. Whereas, the Two Level Adaptive approach takes on the order of a few thousand cycles to initialize.


Hashing Technique

There are many different hashing techniques for branch prediction. One of the simplest techniques involves using the PC of the branch instruction. Then truncating off bits larger than the cache size. Therefore, if the design used a cache of 2^6, there would be 6 bits remaining after the truncation. After this truncation, the remaining 6 bits would be used to directly address the prediction bits. The truncation technique is shown here:


Here is another ,…


There is an unlimited number of hashing techniques. Creative hashing techniques use as much information as possible that is available at the time of the branch prediction for best results. In our simulation, the hashing technique is included into the prediction scheme. Therefore, the hashing technique used depends on the prediction type that used in the simulation. 

Prediction Bits


The number of prediction bits used per cache entry effects the branch prediction accuracy. The following graph depicts the prediction accuracy for different number of prediction bits. The state machine used for this simulation is an n-bit saturating counter. Some other standard state machines that can be include n-bit short cut, n-bit taken bias, n-bit taken short cut. These state machines will be discussed later in Deterministic Finite Automata. As you can see, increasing the number of prediction bits does not necessarily improve the branch prediction accuracy. In the simulation, the highest prediction rates were found by using 3 prediction bits. Although, this number can vary greatly depending on the type of state machine you are running on those bits and also what benchmark, cache sizes, etc. The graph depicts decreasing prediction rates as prediction bits are increased. This is a result of using an n-bit saturating counter. It is possible to continue to increase the prediction accuracy as the number of prediction bits are increased. The designer must use creativity when designing the state machine (DFA) to use in accordance with the number of prediction bits.

Associativity


Our simulations showed that using a smaller associative cache with approximately the same hardware requirements as a larger non-associative cache is a more effective branch prediction technique. The following graph shows how increasing the associativity will in turn increase the overall prediction accuracy. Below, I have shown a simulation, and varied the associativity of the cache. As you can see, as the associativity is increased, the prediction accuracy increases.

Deterministic Finite Automata

Deterministic Finite Automata (DFA) refers to the type of state machine architecture used on the prediction bits. There is an unlimited number of state machines that can be used with the prediction bits. It is important to obtain the best use of the limited number of prediction bits. Our group has implemented a number of DFA as listed below: These different state machines produce different prediction rates. Below I have shown a graph of DFA type vs. prediction accuracy. As you can see, the most common state machine 2-bit saturating counter predicts the outcome of the branches very well. It is also apparent that using a more creative scheme such as 2-bit taken short cut and 2-bit short cut can provide even better results. It's important to design the DFA scheme around the types of software that will be executed on the processor as well as the type of branch prediction method used.


History Bits

Some prediction schemes attempt to retain additional information about the branch history in order to achieve a higher prediction accuracy. Two Level Adaptive prediction does just that by retaining local history bits. We have found that the prediction accuracy varies as a result of the number of history bits. Below I have shown a graph of the prediction accuracy as a function of the number of local history bits for TLA. As you can see, the prediction accuracy increases as a result of increasing the number of local history bits. Unfortunately, increasing the number of local history bits for this implementation increases the overall size of the cache and the complexity of the branch prediction unit.


Prediction Scheme

Our group has implemented a number of branch prediction schemes. The schemes included into the modified sim-inorder software are: When designing branch prediction it is important to run many simulations. It is important to test prediction schemes is many environments. Some prediction schemes work very well in a given situation and then very poorly in another. Whereas, other prediction schemes have a well rounded behavior. Below, I have shown all the branch prediction techniques simulated with the gcc Spec95 benchmark. As you can see, the prediction results greatly vary from one scheme to the next.


The Program and its Data

One of the most important factors in branch prediction accuracy is the type of software executed on the processor. Some software contains very predictable branches. Whereas other software contains very unpredictable branches. Also, the prediction accuracy depends on the type of data used as input to the software. Below I have shown all the Spec95 benchmarks simulated against a single bimodal predictor. As you can see, the prediction accuracy varies dramatically from one benchmark to the next.


Results

I have run two overall simulations. All Spec95 benchmarks are simulated against all the prediction algorithms. In the first simulation, the branch prediction unit is assumed to be complicated and consume large die area. As a result, the branch predictors have large cache sizes. The second simulation, the branch prediction unit is restricted to a small space on the die. As a result, the cache sizes are much smaller. The caches are also associative to attempted to increase prediction accuracy without making the number of cache entries larger. Shown below is the first simulation. As you can see, the prediction output varies dramatically from one Spec95 benchmark to the next. In addition, it is easy to see that the branch prediction schemes perform differently under these circumstances.

In an effort to obtain overall results, all of the prediction accuracy points on the graph below are averaged. This gives us the ability to see the branch prediction units overall accuracy. Below I have attempted to summarize. As you can see, the Two Level Adaptive (TLA) prediction scheme worked very well in this case. In addition, our 'original' design has reported high prediction results as well. This is expected since our original design is a modified version of the Two Level Adaptive circuit. Also notice that the simplest predictor type the Bimodal prediction scheme is very close to the TLA. Also, notice that Gshare and Correlation branch prediction reported poor results in this situation.


 

Below the branch prediction schemes are summarized with respect to each of the Spec95 benchmarks. These overall results are the prediction schemes average prediction rate over all benchmarks.

Two Level Adaptive 75.30%
Original Scheme 75.20%
Bimodal 74.50%
Combination 73.80%
BHT 73.80%
Skewed 72.60%
Static Taken 71.90%
Gshare 66.50%
Correlation 66.40%
Static Not Taken 27.00%


The graph below shows the the prediction results for the second simulation. For this simulation, the cache was limited to 16 entries with a 4-way associative cache. I have attempted to summarize below. As you can see, our 'Original' prediction scheme worked very well in this case. This was to be expected because of out technique of implementing a modified version of TLA. In addition, the TLA design has reported high prediction results as well. Surprisingly enough, Gshare predicted very well under these circumstances. Also notice that the simplest predictor type the Bimodal prediction is still high on the list. 

Below the branch prediction schemes are summarized with respect to each of the Spec95 benchmarks. These overall results are the prediction schemes average prediction rate over all benchmarks.

Two Level Adaptive 58.50%
Gshare 57.47%
Bimodal 53.93%
Static Taken 53.76%
Correlation 53.38%
Skewed 49.60%
Combination 47.77%
BHT 47.77%
Static Not Taken 26.98%

Limitations & Future Work

This branch prediction software does not calculate actual performance measure. This software calculates prediction rates only. However, due to large percentage of branch instructions in benchmarks even small changes in branch prediction rates dramatically effect CPI.

In this project, ten branch prediction schemes were implemented. There are many additional prediction schemes possible that could be implemented.

Additional error checking could be implemented to check the command line parameter input for errors. There can be subroutines added to calculate additional statistical results. From these additional statistics it would be possible to show how CPI rate is affected by branch prediction rate.

Conclusion

In conclusion, We have researched a number of branch prediction methods. We looked at both static and dynamic branch prediction schemes.

We used the Simplescaler simulator to generate our branch prediction results. We made a number of changes to the source code in order to perform our branch prediction methods (available below).

We have found that the best method out of the schemes we've implemented overall is Two Level Adaptive Branch Prediction. Surprisingly enough, simple methods like Bimodal predictors perform better than complicated. This is different from other research work in the area. Some results shows that Gshare performs exceptionally well as a branch predictor. We found this not to be the case. In our simulations, Gshare gave average performance rates.

We have found that the best method to implement when limited in hardware requirements is a Bimodal or Gshare predictor. However, if there is room for a more complicated branch prediction method on the processor, we would chose to implement Two Level Adaptive Branch Prediction because of it's ability to predict branch outcomes with high accuracy.

We would like to point out that this research that we have done in this area is very limited. Our 'four' person group has spent a mere six weeks on this project. With the software we have implemented, there is so much more work that can be done. There are many improvements that can be made to the software. We encourage bright young energetic research students interested in branch prediction to continue this work and build on the software and branch prediction subroutines that we have created.


Links

Branch prediction presentation overheads
Shar file of Source code and executables for branch prediction simulations
Script for running all benchmarks against a given prediction technique
Script for sweeping the cache size against a given prediction technique
Script for sweeping all prediction schemes against any type of simulation

References


T.-Y. Yeh and Y.N. Patt, "Two-level Adaptive Branch Prediction", 24th ACM/IEEE International Symposium on Microarchitecture, Nov. 1991

S. McFarling, "Combining Branch Predictors", WRL Technical Note TN-36, Digital Equipment Corporation, June 1993

M. Evers, P.-Y. Chang and Y.N. Patt, "Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy in the Presence of Context Switches", Proceedings, 23th International Symposium on Computer Architecture, Philadelphia, May 1996

P.-Y. Chang, E. Hao, T.-Y. Yeh and Y.N. Patt, "Branch Classification: A New Mechanism for Improving Branch Predictor Performance," Proceedings of the 27th International Symposium on Microarchitecture, San Jose, California, Nov. 1994.

Michaud, Pierre; Andre Siznec, Richard Uhlig; "Skewed branch predictors," June 1996

Pino, Jose Luis; Singh Balraj; David E. Culler, "Performance Evaluation of One and Two-Level Dynamic Branch Predicton Schemes over Comparable Hardware Costs"