MVPsim

Introduction

The goal of MVPsim (Multi-threaded Virtual Processor Simulation) is to create a simulator that can study two main topics of interest: The actual processor to be modeled is a "standard" super-scalar processor that is able to use cache-misses to switch between up to four hardware contexts. In this case, a context is defined to be a thread, so the MVP processor can have up to four threads available for its use. To accomplish the above goals the group decided to construct a new simulator out of two pre-existing programs: SimpleScalar (the execution based simulator) and Pthread (thread library).

Implementation of MVPsim

The MVPsim project has two main stages to complete before the MVP simulator can be finished:
  1. Combining SimpleScalar and Pthread to create thread capable processor.
  2. Modifying both SimpleScalar and Pthread to create four thread multi-threading capable processor (MVPsim).

Stage 1: Combining SimpleScalar and Pthread to create thread capable processor

In order to understand the modifications required to use Pthread with SimpleScalar, it is important to understand how both operate.

SimpleScalar operation:
SimpleScalar is a small collection of execution based simulators where each one models a different type or level of detail with a processor and each utilizes the same tool set. The simulator that MVPsim is interested for the project is sim-outorder. The sim-outorder processor simulates a out-of-order processor based upon Gurindar Sohi's RUU design shown below.

To understand to understand the limitations in SimpleScalar 2's (SS2) sim-outorder, it is important to first understand how SS2 works. SS2 is an in-order execution based simulator followed by a trace-based out-of-order timing simulator as shown to the right. So in other words, the instructions are decoded and instantaneously executed (the register files and main memory are modified). After that, the instructions are placed into the timing simulator where the instructions are moved into execution units and hazards are resolved similarly to execution in a super-scalar processor. It is important to note that no data is handled by the timing simulator, all data and "actual" execution is taken care of by the earlier execution simulator. Another way to look at this is that sim-outorder models an on the fly trace generator that runs the program and passes the instructions (as each is decoded and executed) to a trace-driven timing simulator. In oder to correctly model wrong path execution, the in-order execution portion produces wrong path instructions called spec_mode instructions for the timing simulator that do not affect the state of the machine (i.e. except for cache valid/dirty bits, when the execution simulator enters spec_mode, the RF and memory is not modified). This "spec_mode" lasts until the timing simulator indicates that a branch instruction should have executed by now, turns "off" spec_mode, and forces the execution simulator to restart execution down the correct branch. Since the RF and memory has not been modified since the branch instruction (when spec_mode was initiated), the state of the simulated processor was preserved. This system allows sim-outorder to execute along speculative branch paths and have it not affect the correctness of the in-order simulation.

The sim-outorder simulator simulates a five stage pipeline (fetch, decode, issue, writeback, and commit) on the above architecture. The five stages are implemented with six functions driven by the sim_main function. These functions include:

Function Simulator Purpose

commit Timing This stage is where completed instructions are removed from the RUU and any store instructions at the head the the RUU are executed. Note, since no values are held in the RUU, none are saved when the instructions are removed.
ruu_release_fu Timing This small function releases the functional units once instructions are done with them.
ruu_writeback Timing This stage is where instructions are "completed" in the timing simulator and data dependencies are updated for instructions in the RUU. Also, this stage is the location where misspeculated branches are "discovered" by the timing simulator and the processor is reset to operation before spec_mode was declared.
ruu_issue Timing At this stage, ready instructions are issued from the RUU unit to the functional units. This stage is where instructions can be issued out-of-order based upon which instructions are ready (have all their operands) and which functional units are not busy. During this stage, the loads access the caches and the dirty/valid bit are updated. Note, only the RF and memory hold data values, the data cache access holds addresses for hit/miss timing calculations.
ruu_dispatch Execution This stage is where new instructions are decoded (executed by the execution based simulator) and placed into the RUU structures. If room exist in the RUU unit, instructions are removed from the queue filled during ruu_fetch, decoded (and executed), and then placed into the RUU unit. Note, at this point the instructions are executed and placed into the RUU in-order. Also, it at this function where instructions make the transition from the in-order execution based simulator to the out-of-order timing simulator. This function is where the RF and memory are updated by instructions and where miss-predicted branches (sim-outorder can tell a miss-predicted branch since the branch has already been executed) enable spec_mode.
ruu_fetch Execution This stage is responsible for fetching instructions out of the i-cache and place them into a queue for later decoding.

To get a better understanding of how an instruction moves within SS2, please observe the following figure:

By choosing the above method of execution, sim-outorder forces three main limitations that are significant for MVPsim:

  1. No precise interrupts. Using the RUU (essentially a ROB and reservation station combined), instructions are not allowed to modify the RF and memory until they reach the head of the RUU. In that way, when interrupts occur, the PC of the instruction that is at head of the RUU is saved along with the current state of the RF and memory. However, in sim-outorder, that is not the case, since every instruction in the RUU has modified the RF and memory (including some that aren't yet in the RUU). So, since the state of the machine is unknown, precise interrupts cannot occur.
  2. Current SimpleScalar ModelNo OS support. As a direct consequence of a lack of precise interrupts, sim-outorder is incapable of supporting an OS. In order to model system calls required by most programs, SimpleScalar implements an interesting model. When SS2 decodes a system call in a simulated program, SS2 passes the syscall to the host machine's OS for it to handle the syscall. As shown to the right, the model is of the SimpleScalar executable code (usually part of the SPEC95 benchmark) running on the SimpleScalar simulator, which in turn runs on the host processor/OS. The arrows represent the passing of the syscall to the host machine and back to the simulated code.
  3. No signals or timer support. Without precise interrupts, SS2 cannot support signals which generate interrupts to notify the processor of their generation. Also, OS timers are not supported. Not only do timers not work without signals (which are generated when a timer goes off), but are also limited by the syscall model shown to the right. If timers (created by syscalls) are sent out to host machine OS, they will run in regards to the host machines processor not the simulated machines processor. For example, imagine that a timer is set by simulated code to "ring" in approximately 1,000 simulated processor cycles. SS2 will set up that timer to run on the host processor for 1,000 cycles and return a signal (lets assume that signals work for the moment) 1,000 host cycles later. However, the host machine is a busy machine, it has to deal with an OS and several other tasks (besides the simulator) demanding processor attention. So after 1,000 host cycles, the signal comes back to the simulated processor only after 10 simulated cycles, 100 times faster then it should!

Pthread operation:
Pthread is a user library implementation of threads, or in other words, it is a collection of libraries linked to a user program that allows a user program to declare and run threads. To use Pthread, a programmer must first create the threads by using the pthread_create function. "pthread_create" essentially tells Pthread where to find the user defined threads for future use. Since the threads are user definable, any function the user designates can be a thread. Next, the programmer needs to start the threads running by using the pthread_join command. Once called, Pthread takes over the running of the threads until all are finished and returns control back to the main program. Pthread runs one thread at a time in a round-robin fashion and switches between threads in the following ways:

  1. When the current thread ends its execution, it calls the function pthread_exit. This function deletes the thread from operation and turns control to the pthread scheduler whereby either a different thread is started or control is returned to the main program if there exists no more threads to run.
  2. As each thread is executed, Pthread starts an OS timer. If the timer activates before the thread is finished executing, an interrupt is generated. At this time the state of the thread is saved, the pthread scheduler called, and a different thread is executed (providing, of course, that there are more threads to run).
  3. Since Pthread was created to hide I/O latency, Pthread also switches threads when it detects a thread making a long I/O request. The thread switching process works the same as a timer switch above.
The implementation of pthread is rather complicated and relies heavily on OS support. When a process runs, the OS sets up a location for the potential state of the process on the OS stack. As shown below, when an interrupt occurs, the state of the process is saved onto the stack, increments the stack pointer and the processor runs the interrupt handling routine. When the routine finishes and returns from the interrupt, the OS pops the process's state (the one above the stack pointer) back off the stack, decrements the stack pointer, and the process returns to execution.

In order to benefit from this aspect of the OS, pthread builds its threads in a creative way. When the pthread_create function is called, Pthread allocates space on the OS stack for each thread. The stack pointer for each thread and a pointer to the threads start is saved in a queue by Pthread (called prio_queue). Then, when a thread is to start execution, pthread sets the stack pointer to the start of the thread, sets up the OS timer, removes the thread from th queue, and calls the start routine of the thread. The new OS stack is shown to the right. Once a timer interrupt is generated the following steps occur.
  1. The OS recognizes the interrupt and saves the thread onto the space on the OS stack given by the stack pointer.
  2. The OS loads the interrupt handler (sig_handler_real in pthread)
  3. sig_handler_real eventually runs the pthread scheduling routines
  4. The currently running thread is returned to the queue and a new thread selected and removed from the queue (pthread supports a round-robin style of scheduling).
  5. To start the new thread running, the stack pointer is set to above the new thread to run and pthread returns from the interrupt.
  6. The OS restores the state of the processor using the values immediately below the stack pointer (the new thread) and the new thread begins executing.
Once all threads are finished running, pthread returns to the main program by acting similarly to a simple return from the pthread_join function call.

What pthread requires in order to run on a simulator, is three things:

  1. A OS stack to save and restore states of the threads.
  2. A OS timer to generate an interrupt to switch between threads.
  3. Precise timer interrupts to allow the correct interfacing between the hardware, OS, and Pthread.

Combining Pthread and SimpleScalar:
In order to port Pthread to SS, three functional capabilities must be added to SS2:

  1. Timers
  2. Precise interrupts
  3. OS stack
To implement timers, SS2 syscall.c was modified to detect a SETITIMER syscall. Upon detection of the syscall the timer value is removed and stored in a global timer variable. To decrement the timer, code was modified in the sim_main function to decrement the timer every clock cycle.

Since Timer interrupts are those that are asynchronous to the currently executing instruction stream, it was easy to modify SS2 to handle them. The timer interrupts are checked at the beginning of ruu_fetch in sim-outorder.c. This check is just a comparison with the global timer variable. It is important to note that interrupts cannot occur while the processor is in speculative mode because the entire interrupt handler would be run in speculative mode. To start executing the interrupt handler, ruu_fetch calls the SS2 interrupt handler (interrupt_fetch) which switches the current PC to that of the interrupt vector. This PC swapping method is acceptable since the RUU does not need to be drained before servicing the interrupt. This is mostly because the processor does not have a supervisor mode with multiple levels of memory protection.

To implement an OS stack into SS2, a separate file, signal.c, was created. Signal.c create and then maintains the OS stack. When interrupt_fetch is called, (to process a detected interrupt), it interfaces with the OS stack code to save the current state of the machine, increment the "stack pointer" and return the interrupt handler PC. Upon a return from and interrupt, the code drops back into interrupt fetch, which decrements the stack pointer and returns the state of the machine. Last, the interrupt fetch routine returns the saved PC back into the normal SS2 program.

After these changes, Pthread was tested on SS2 and found to work normally. Next, the MVPsim group moved on to creating the actual MVPsim.

Stage 2: Creating MVPsim

Essentially, MVPsim models the effect of multi-threading using multiple hardware contexts. In the MVP group case, the hardware contexts are simply register banks. When it is required to switch from one context to another, MVPsim deactivates one register bank and activates another. The processor then fetches on the new register banks PC. The processor also has to drain the RUU unit and flush the pipeline, incurring the same steps and penalties seen by branch prediction. To tell when to switch between contexts MVPsim uses to mechanisms:
  1. Cache miss event: Upon detection of a cache miss, in order to avoid the main memory latency, MVPsim will switch to a new context.
  2. Timer interrupt: Since it is necessary to be able to remove threads out of the hardware and replace them with other threads (in the case of barrier syncs and inter-thread communication), MVPsim expands on the Pthread timer. The timer interrupt was lengthened to allow all threads in the hardware contexts and chance to run. Then, when the timer interrupts, a thread is removed and another added in its place. Thus, fairness is promoted and deadlocks prevented.

To create MVPsim, the MVP group has to go through 3 sub stages. These stages include:

The Execution Model:
The execution model for MVPsim is shown to the left:
Thread Mode/Non-Thread Mode
The two large boxes on the execution model show the states contained within thread mode and non-thread mode. When thread Mode is set, the hardware allows a context switch on a cache miss event while non-thread mode disallows any such action. Essentially, non-thread mode is declared whenever the main program is running (no threads to run so why context switch?) and when threads are being created or scheduled (a context switch would result in data corruption). Any other time that the hardware has threads, a context switch is desired on a cache miss event. In this way, thread mode can alert the hardware that is has threads and doubles as a locking variable in the event that threads are being scheduled.

States

Transitions
  1. If the newly created threads have low priority
  2. Create new threads (pthread_create function call)
  3. If the newly created threads have high priority
  4. Start threads (pthread_join function call)
  5. No threads left to run
  6. Move scheduled thread into hardware
  7. Context empty and new threads available or timer interrupt and new threads available
  8. Cache miss event or timer interrupt
  9. No new threads or contexts are full and a context is ready
  10. Return from cache miss
  11. No new threads and no thread is ready or context is full and no thread is ready

The Hardware-Software Interactions:

To understand the interactions between pthread, the OS, and the hardware on the MVPsim simulation, refer to the above figure. The MVPsim processor contains two hardware contexts, in other words, two sets of register banks enabling the processor to support up to two threads at once (In actuality MVPsim will support more contexts, but has been simplified for the explanation). Each context has a valid bit to let the processor know if it is allowed to fetch instructions for the context and a ready bit to denote if the thread is ready to execute of waiting for a cache miss to return. Note, at least one context must be valid at all times or the processor will not fetch any instructions. The hardware also has a Thread Mode bit and a More Threads bit. When set, the Thread Mode bit allows the processor to switch between the contexts and prevents the switch when not set. The More Threads bit, indicates whether there are threads in the Pthread prio_queue.

The OS section consists solely of the OS stack that Pthread relies on to save the state of a thread on a timer interrupt as explained previously.

The Pthread section consists of a prio_queue and the pthread_run pointer. The prio_queue is where Pthread stores all non-completed threads that are not currently executing. It is from this queue that the Pthread scheduler chooses the next thread to run. The pthread_run variable holds a pointer to the thread that is currently being executed. In the MVPsim case, pthread_run holds a pointer to the thread executing in the context that needs to be swapped with a non-executing thread (the currently running context).

Before an example execution can be presented it is helpful to define how MVPsim accomplishes some processes:

How a context switch occurs:
A context switch is when the hardware moves from one context (H0) to another (H1). Since a context is just a register bank the process (assuming multiple register banks are available) is a simple process of turning off one bank and turning on another. The processor itself will have to flush the ROB (RUU) and the fetch queue making the entire process very similar to recovering from a miss-predicted branch and requires the same penalty.
How the OS saves the state:
From the fact that each context is a register bank, every context then knows its own OS stack location (the pointer is stored in a specific register). When an interrupt occurs, the OS reads stack pointer and places the state information at the appropriate location. The stack pointer is initially set up by Pthread during the OS stack creation and assigned to the thread during the pthread_create function call.
How Pthread knows what thread is being removed from the hardware context:
Pthread can tell what thread is being removed by reading a pointer stored in a register in a context. Upon calling the Pthread scheduler, the value of this register is assigned to the pthread_run variable. The pointer is to a pthread (the thread information that is stored in the prio_queue). Implemented this way, Pthread can find exactly which thread is in the current context.

To gain an understanding into the hardware, OS, and Pthread relationship of MVPsim, the following three thread example is presented:

The Implementation of MVPsim:
The implementation of MVPsim can be accomplished in two rather large steps:

  1. Implementing Cache miss events
  2. Implementing multiple hardware contexts
The addition of a instruction that causes a context switch on a cache switch is a rather difficult process for a variety of reasons, all related to the SimpleScalar2 implementation. First, there is only one state of the processor stored in SimpleScalar2 and no other state can be constructed without moving forward in the program. In other words, you can't undo anything unless you were in spec_mode when you did it, hence, you didn't really do it in the first place. So the bad news is that by the time you find out that the instruction is faulting, the instruction has completed (in the execution simulator) and you can't undo that instruction or redo the instruction (consider ADD $1,$2,$1).

With the limitations in mind, the solution to an context switch instruction is clear, if somewhat tricky to actually implement. Whenever an context switch causing instruction is decoded, the cache miss must also be detected, and the processor put into speculative mode until it can be determined (in the timing simulator) that the cache miss has actually occurred, so the interrupt vector can be called. To show how this can be done in practice, below is the process that you would need to go through to implement an event on a cache miss that drains the RUU before calling the interrupt vector. Remember that the RUU cannot be flushed, since instructions before and including the faulting instruction must be allowed to complete for the program to be correct.

There are four areas that need to be modified (along with the addition of the interrupt handlers) in sim-outorder.c to support this behavior. Also, an additional register needs to be allocated to control when the interrupt is enabled, and actually set the interrupt itself. Sim-outorder is divided into several distinct sections, each with a specific purpose that mirrors the sections of a superscalar processor. The sections are shown below, with their purpose, and a brief summary of what we added to each section.

Section Purpose Changes

commit This is where completed instructions are removed from the RUU. Also, stores are executed here as well. If the instruction was a faulting load, then leave spec_mode, and call the interrupt handler.
ruu_release_fu This is a small function to release the functional units No changes necessary
ruu_writeback This is where instructions are "completed" in the timing simulator. This is also the location where misspeculated branches are "discovered" by the timing simulator. Minor change to prevent a faulting load from being mistaken for a mispredicted branch. Also, disallow a misspredicted branch from leaving spec_mode when a faulting load has occurred.
ruu_issue This is where instructions are sent to functional units to be "executed". Timing for the instructions (including loads) are computed here. We modify the timing assignment for a faulting load to be the time it takes to discover that the L2 cache misses instead of the time to access main memory.
ruu_decode This is where new instructions are decoded (executed in the execution based simulator) and placed into the RUU structures. This function also uses the executed branches to determine if the simulator should enter spec_mode. For all loads (when cache miss interrupts are enabled) we check to see if the load will miss using the cache_probe() function. If it misses, we go into speculative mode.
ruu_fetch This stage is where instructions are fetched in from the I-caches or main memory and placed into a queue for later decoding by ruu_dispatch. Branches are predicted using a two-bit BTB, but the simulator does not enter spec_mode until ruu_dispatch, where the branches are executed. With this stratagem, the processor knows exactly when the predicted value does not equal the actual value and ruu_dispatch can move the processor into spec_mode. No changes necessary.

Implementing multiple hardware contexts is another surprisingly difficult procedure and it is this stage that the MVP group is currently trying to implement. Most of the problems here revolve around the fact that adding multiple hardware contexts require extensive modifications to both Pthread and SS2, and no way to test smaller implementation steps. This all or nothing approach has made this step a nightmare to debug.

Conclusions

In conclusion it was hoped that the project could be completed and results generated within a term. Alas, it was not to be.

However, the MVP group now has an extensive knowledge of both Pthread and SS2 so that the continuing of the project should go smoother. Also, the group obtained greater knowledge in the realm of multi-threading and has made several future recommendations:

All in all this was a rather difficult and time consuming project, but hopefully in the near future MVPsim can be used to collect and detail valuable data and results on multi-threading.