CSCI 210: Final Project
August 30th
For this final project, you will write a configurable cache simulator using the infrastructure provided. Your cache simulator will read an address trace (a chronological list of memory addresses referenced), simulate the cache, generate cache hit and miss data, and calculate the execution time for the executing program. The address traces have been generated by a simulator executing real programs. Your cache simulator will be graded for accuracy, but it is not the end product of this project; rather, it is a tool you will use to complete the project. In this project, you will experiment with various cache configurations and draw conclusions about the optimal cache organization for this set of programs.
Check out your repository from GitHub at the following link: It will contain a Java file you will use to write your cache simulator. You will also need a zip file of address traces, which you can download here.
You can work on this project with a partner if you choose. If you decide to work with a partner, you and your partner should check out a single repository. The first partner will create a team name, and the second partner should choose that team name. Please be careful choosing a team, as this cannot be undone. Please name your team something that makes it clear who you are.
If you choose to work with a partner, you and your partner must complete the entire project together. Dividing the project up into pieces and having each partner complete a part of it on their own will be considered a violation of the honor code. Both you and your partner are expected to fully understand all of the code you submit.
Using the address trace:
An address trace is simply a list of addresses produced by a program
running on a processor. These are the addresses resulting from load
and store instructions in the code as it is executed. Some address
traces would include both instruction fetch addresses and data (load
and store) addresses, but you will be simulating only a data cache, so
these traces only have data addresses. These traces were generated by
a simulator of a RISC processor running three programs, art, mcf, and
swim from the SPEC benchmarks. The files are art.trace.gz,
mcf.trace.gz, and swim.trace.gz. The number of loads/stores varies by
benchmark. They are all compressed with gzip, and since you are running
on a Unix machine, you never need to store the traces
uncompressed. Use the following command to decompress the trace and pipe it through your cache
simulator:
gunzip -c art.trace.gz | java cache [cache args]
Because your workload is three programs, you will run three
simulations for each architecture you simulate, and then combine the
results in some meaningful way. The simulator arguments should be
taken in as command line arguments, like so:
java cache -s 32 -a 4 -l 32 -mp 30
This would simulate a 32 KB, 4-way set-associative cache with 32-byte
blocks, and a 30-cycle miss penalty. Your code should support any
reasonable values for cache size, associativity, etc. (You may assume
the cache size will be less than 4MB.)
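As a rough illustration of reading the flags shown above, the sketch below loops over flag/value pairs. The class name CacheArgs, its field names, and the numSets helper are hypothetical and not part of the provided framework; the defaults are simply the default configuration and miss penalty described later in this handout.

```java
// Hypothetical helper: CacheArgs is NOT part of the provided framework.
final class CacheArgs {
    int sizeKB = 16;        // -s  : cache size in KB
    int associativity = 1;  // -a  : number of ways per set
    int blockBytes = 16;    // -l  : block (line) size in bytes
    int missPenalty = 30;   // -mp : miss penalty in cycles

    // Reads "-flag value" pairs; unknown flags are rejected.
    static CacheArgs parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("every flag needs a value");
        }
        CacheArgs c = new CacheArgs();
        for (int i = 0; i < args.length; i += 2) {
            int value = Integer.parseInt(args[i + 1]);
            switch (args[i]) {
                case "-s":  c.sizeKB = value; break;
                case "-a":  c.associativity = value; break;
                case "-l":  c.blockBytes = value; break;
                case "-mp": c.missPenalty = value; break;
                default:
                    throw new IllegalArgumentException("unknown flag: " + args[i]);
            }
        }
        return c;
    }

    // Number of sets = total bytes / (bytes per block * ways per set).
    int numSets() {
        return (sizeKB * 1024) / (blockBytes * associativity);
    }
}
```

With the arguments from the example above (-s 32 -a 4 -l 32 -mp 30), numSets works out to 32768 / (32 * 4) = 256 sets.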
Format of the address trace:
All lines of the address trace are of the format:
# LS ADDRESS IC
where LS is a 0 for a load and 1 for a store, ADDRESS is an
8-character hexadecimal number, and IC is the number of instructions
executed between the previous memory access and this one (including
the load or store instruction itself). There is a single space between
each field. The instruction count information will be used to
calculate execution time (or at least cycle count). A sample address
trace starts out like this:
# 0 7fffed80 1
# 0 10010000 10
# 0 10010060 3
# 1 10010030 4
# 0 10010004 6
# 0 10010064 3
# 1 10010034 4
You should assume that no access spans multiple cache blocks (e.g., assume all accesses are for 32 bits or less).
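The trace format described above could be parsed along the following lines. This is only a sketch: the class TraceEntry and its field names are hypothetical, and it assumes every line begins with the "#" field as in the sample. The address is parsed into a long because Integer.parseInt rejects 8-digit hexadecimal values above 7fffffff.

```java
// Hypothetical record of one trace line; not part of the provided framework.
final class TraceEntry {
    final boolean isStore;  // LS field: 0 = load, 1 = store
    final long address;     // ADDRESS field, parsed from hexadecimal
    final int instrCount;   // IC field: instructions since the previous access

    TraceEntry(boolean isStore, long address, int instrCount) {
        this.isStore = isStore;
        this.address = address;
        this.instrCount = instrCount;
    }

    // Parses a line such as "# 0 7fffed80 1" (fields separated by single spaces).
    static TraceEntry parse(String line) {
        String[] f = line.trim().split(" ");
        if (f.length != 4 || !f[0].equals("#")) {
            throw new IllegalArgumentException("bad trace line: " + line);
        }
        boolean isStore = f[1].equals("1");
        long address = Long.parseLong(f[2], 16);
        int instrCount = Integer.parseInt(f[3]);
        return new TraceEntry(isStore, address, instrCount);
    }
}
```

Summing instrCount over the seven sample lines gives 1 + 10 + 3 + 4 + 6 + 3 + 4 = 31 instructions.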
The simulator output:
Your program should produce miss rates for all accesses, miss rates for loads only, and execution time for the program, in cycles. It should also show total CPI and average memory access time (cycles per access, assuming 0 cycles for a hit and the miss penalty for a miss). For execution time, assume the following: every instruction takes one cycle, and each load or store additionally stalls for 0 cycles on a cache hit and for the miss penalty (30 cycles unless specified otherwise) on a cache miss.
You will simulate a write-allocate cache. In the trace shown, the first 31 instructions should take 151 cycles, assuming 4 cache misses and 3 cache hits for the 5 loads and 2 stores, and a 30-cycle miss penalty. You will be modeling a write-back cache, but we assume the write of a dirty line takes place mostly in the background. As such, we assume an extra 2-cycle delay to write a dirty line to a write buffer. So, using the parameters from above, a load or store miss that evicts a clean line takes 30 cycles, and one that evicts a dirty line takes 32 cycles. Cache replacement policy is always LRU for associative caches. If useful, assume a 2 GHz processor. Each trace contains the memory accesses of just over 5 million instructions. Your simulations should process all of them. Your output must follow the format provided in the lab framework exactly.
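The cycle accounting described above can be written as one formula: one cycle per instruction, plus the miss penalty per miss, plus 2 cycles per dirty eviction. The sketch below is one way to express it; the class and method names are hypothetical, and it assumes the counts have already been gathered by the simulator.

```java
// Hypothetical cycle accounting; names are illustrative only.
final class CycleModel {
    // Extra delay for moving an evicted dirty line to the write buffer.
    static final int DIRTY_WRITE_CYCLES = 2;

    // instructions   : sum of the IC fields in the trace
    // misses         : load misses + store misses
    // dirtyEvictions : misses whose victim line was dirty
    static long executionCycles(long instructions, long misses,
                                long dirtyEvictions, int missPenalty) {
        return instructions
                + misses * missPenalty
                + dirtyEvictions * DIRTY_WRITE_CYCLES;
    }
}
```

For the sample trace, assuming none of the 4 misses evicts a dirty line, this gives 31 + 4 * 30 = 151 cycles, matching the figure above.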
(40pts) Correctness of the simulator based on running your simulator.
The cache:
For the second part of this lab, you will use your cache simulator to test out different cache configurations and answer questions on Gradescope about your findings. You should try every configuration below on the art, swim, and mcf cache traces provided to you, and discuss all of them in your answers to the questions. Note that your question answers are worth just as much as the code for this lab: you are expected to give detailed answers that clearly back up your claims with your simulator results.
The default cache configuration will be 16-byte block size, direct-mapped, 16 KB cache size, write-back, and write-allocate. You will re-evaluate some of these parameters one at a time, in the following order. In each case, choose a best value for each parameter, then use that for all subsequent analyses.
- Look at 16 KB, 32 KB, and 128 KB cache sizes. Larger caches take longer to access, so assume that a processor with a 32 KB cache requires a 5% longer cycle time, and the 128 KB 15% longer. Choose the best size/cycle time combination and proceed to the next step.
- Look at cache associativity of direct-mapped, 2-way set-associative, and 8-way set-associative. Assume that 2-way associative adds 5% to the cycle time, and 8-way adds 10% to the cycle time. Choose the best associativity and cycle time, and proceed.
- Look at cache block sizes of 16, 32, and 64 bytes. Assume that it takes 2 extra cycles to load 32 bytes into the cache, and 6 extra cycles to load 64 bytes (i.e., raise the miss penalty accordingly). Choose the best size and miss penalty and proceed.
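Because some options lengthen the cycle time, cycle counts from different configurations are not directly comparable; one approach is to scale each cycle count by its relative cycle time before comparing. The sketch below assumes that approach. The class and method names are hypothetical, and the factor you pass in (for example 1.05 for a 5% longer cycle) is whatever combined cycle-time factor applies to that configuration.

```java
// Hypothetical comparison helper; names are illustrative only.
final class ConfigCompare {
    // Relative execution time = cycle count * relative cycle time,
    // where a factor of 1.00 is the baseline clock.
    static double relativeTime(long cycles, double cycleTimeFactor) {
        return cycles * cycleTimeFactor;
    }

    // Returns the index of the configuration with the smallest relative time.
    // cycles[i] and factors[i] describe the same configuration.
    static int best(long[] cycles, double[] factors) {
        int bestIndex = 0;
        for (int i = 1; i < cycles.length; i++) {
            if (relativeTime(cycles[i], factors[i])
                    < relativeTime(cycles[bestIndex], factors[bestIndex])) {
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```

With purely made-up cycle counts of 1000, 960, and 900 for factors 1.00, 1.05, and 1.15, the relative times are about 1000, 1008, and 1035, so the first configuration would be chosen even though it has the most cycles.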
Questions for the Final Project:
You will find the following questions on Gradescope in the assignment Final Project Questions. Answer them there.
- (10pts) What is the optimal cache size, associativity, and block size for a cache, given the parameters above?
- (10pts) Is cache miss rate a good indicator of performance? In what cases did the option with the lowest miss rate not have the lowest execution time? Why?
- (10pts) Were results uniform across the three programs? In what cases did different programs give different conclusions? Speculate as to why that may have been true.
- (10pts) What was the speedup of your final design over the default? You should use the definition of speedup we went over in class.
Think about how to intelligently debug and test your program. Running immediately on the entire input gives you little insight into whether it is working (unless it is way off). Instead, create separate small memory traces (you can see the text format above) to check that cache size, cache associativity, block size, and miss penalty are each functioning correctly. You do not need to turn them in, but they will help tremendously.
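As one possible small test, the sketch below prints a trace that alternates loads between two addresses exactly one cache size apart, which map to the same set. The class name ConflictTrace, the base address, and the repeat count are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical test-trace generator; not part of the provided framework.
final class ConflictTrace {
    // Alternates loads between base and base + cacheBytes.
    // Two addresses one cache size apart map to the same set.
    static List<String> lines(long base, int cacheBytes, int repeats) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < repeats; i++) {
            out.add(String.format("# 0 %08x 1", base));
            out.add(String.format("# 0 %08x 1", base + cacheBytes));
        }
        return out;
    }

    public static void main(String[] args) {
        // Eight loads targeting a 16 KB cache.
        for (String line : lines(0x10010000L, 16 * 1024, 4)) {
            System.out.println(line);
        }
    }
}
```

Piping this output into java cache -s 16 -a 1 -l 16 -mp 30 should, starting from an empty cache, miss on all 8 loads, because the two blocks keep evicting each other. With -a 2 only the first 2 loads should miss, since both blocks fit in one set.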
Speed matters. These simulations should take at most a couple of minutes (actually, much less) on an unloaded machine. If yours is taking much longer than that, do yourself a favor and think about what you are doing inefficiently.
Simulations are not the same as hardware. If your tag only takes 16 bits, feel free to use an integer for that value. Other time-saving optimizations along these lines might be useful.
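In that spirit, an address can be split into tag and set index with plain integer shifts and masks rather than by modeling individual bits. The sketch below is one way to do that; the class AddressParts is hypothetical, and it assumes the block size and the number of sets are both powers of two.

```java
// Hypothetical address-splitting helper; names are illustrative only.
final class AddressParts {
    final int offsetBits;  // log2(block size in bytes)
    final int indexBits;   // log2(number of sets)

    // Both arguments are assumed to be powers of two.
    AddressParts(int blockBytes, int numSets) {
        this.offsetBits = Integer.numberOfTrailingZeros(blockBytes);
        this.indexBits = Integer.numberOfTrailingZeros(numSets);
    }

    // Set index: the indexBits bits just above the block offset.
    int setIndex(long address) {
        return (int) ((address >>> offsetBits) & ((1L << indexBits) - 1));
    }

    // Tag: everything above the offset and index bits, kept as a plain integer.
    long tag(long address) {
        return address >>> (offsetBits + indexBits);
    }
}
```

For the default configuration (16-byte blocks, 1024 sets), address 10010030 has set index 3 and tag 0x4004, and 10010034 yields the same pair, so the two accesses fall in the same block.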
Give execution time in some reasonable and consistent form.
Determining Correctness
As the lab encourages, you should write very basic tests to ensure that the simulator is functioning properly. After getting to a state where you have convinced yourself that the simulator functions properly, you may use the following values to help verify that your program is correct. Note that different inputs will be used to test if your simulator is correct for grading.
> gunzip -c art.trace.gz | java cache -a 1 -s 16 -l 16 -mp 30
Cache parameters:
Cache Size (KB) 16
Cache Associativity 1
Cache Block Size (bytes) 16
Miss penalty (cyc) 30
Simulation results:
execution time 21857966 cycles
instructions 5136716
memory accesses 1957764
overall miss rate 0.28
read miss rate 0.30
memory CPI 3.26
total CPI 4.26
average memory access time 8.54 cycles
dirty evictions 60540
load_misses 523277
store_misses 30062
load_hits 1208606
store_hits 195819
> gunzip -c mcf.trace.gz | java cache -a 8 -s 64 -l 32 -mp 42
Cache parameters:
Cache Size (KB) 64
Cache Associativity 8
Cache Block Size (bytes) 32
Miss penalty (cyc) 42
Simulation results:
execution time 143963250 cycles
instructions 19999998
memory accesses 6943857
overall miss rate 0.42
read miss rate 0.36
memory CPI 6.20
total CPI 7.20
average memory access time 17.85 cycles
dirty evictions 995694
load_misses 2036666
store_misses 867426
load_hits 3552806
store_hits 486959
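The derived figures in the sample outputs above appear to follow directly from the raw counts, which gives you a quick consistency check on your own output. The sketch below shows formulas that reproduce the printed values; the class DerivedStats and its method names are hypothetical, and the inclusion of the 2-cycle dirty-eviction delay in the stall cycles is inferred from the sample numbers rather than stated by the framework.

```java
// Hypothetical helper for sanity-checking reported statistics.
final class DerivedStats {
    static final int DIRTY_WRITE_CYCLES = 2;

    // Stall cycles: miss penalty per miss plus the dirty-eviction delay.
    static long stallCycles(long misses, int missPenalty, long dirtyEvictions) {
        return misses * missPenalty + dirtyEvictions * DIRTY_WRITE_CYCLES;
    }

    // Miss rate over a group of accesses (all accesses, or loads only).
    static double missRate(long misses, long hits) {
        return (double) misses / (misses + hits);
    }

    // Memory CPI: stall cycles per instruction (total CPI adds 1).
    static double memoryCpi(long stallCycles, long instructions) {
        return (double) stallCycles / instructions;
    }

    // Average memory access time: stall cycles per memory access.
    static double amat(long stallCycles, long memoryAccesses) {
        return (double) stallCycles / memoryAccesses;
    }
}
```

For the art run, the 553339 misses and 60540 dirty evictions give 16721250 stall cycles; adding the 5136716 instructions yields the reported 21857966 cycles, and the same stall count gives a memory CPI of 3.26 and an average memory access time of 8.54 cycles after rounding.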
C. Taylor