Forward and merge exclusively with LSQs and MSHRs#618
Forward and merge exclusively with LSQs and MSHRs#618ngober merged 6 commits intoChampSim:developfrom
Conversation
ngober
left a comment
There was a problem hiding this comment.
This is a very thorough PR. I appreciate the research that you put into it, especially referencing the published designs.
I've long been bothered by the collision checking behavior in channels. It doesn't seem like something that would actually be built, and it's probably costly in terms of runtime. So, I'm pleased to see it go.
It looks to me like what you've done is take the collision check from the middle of the channel to the destination, in the miss handler. However, I wonder if it is more prudent instead to perform merging on the source end of the channel. The core should know if it is issuing a duplicate packet (based on what's in the LQ/SQ), and can avoid sending duplicates. Caches will probably still need to scan the MSHRs, so that duplicate prefetches can get squashed.
I think that source-side merging can eliminate a lot of logic that checks for duplicates and save bandwidth.
Let me know if I'm misreading what you've written. Thanks for working on this!
| if (mshr_entry == MSHR.end()) { | ||
| mshr_entry = std::find_if(inflight_writes.begin(), inflight_writes.end(), matches_address(handle_pkt.address)); | ||
| } | ||
|
|
||
| if (mshr_entry != inflight_writes.end()) // miss already inflight |
There was a problem hiding this comment.
Be very careful with this. You have changed the semantics of this code. Whereas previously mshr_entry was guaranteed to fall in the bounds of MSHR, it may now fall in either that or inflight_writes. The two ranges have the same type right now, but that is not enforced in the code.
You also run the risk of merging a non-write into a write below (line 354).
Maybe it is best to scan the MSHR and inflight_writes arrays separately and then call into a common handling function.
There was a problem hiding this comment.
Be very careful with this. You have changed the semantics of this code. Whereas previously
mshr_entrywas guaranteed to fall in the bounds ofMSHR, it may now fall in either that orinflight_writes. The two ranges have the same type right now, but that is not enforced in the code.
GCC and clang enforce that inflight_writes and MSHR have the same type by giving an error when assigning mshr_entry from inflight_writes otherwise so we don't need to spend additional effort.
You also run the risk of merging a non-write into a write below (line 354).
It is intentional. The intent of commit 26c2f9d is to forward a write to a non-write.
Maybe it is best to scan the MSHR and inflight_writes arrays separately and then call into a common handling function.
My analysis of the problem with the changed semantics is that the code keeps mentioning "MSHR" even though it now also scans inflight_writes. I thought that was fine considering that inflight_writes is typed as std::deque<mshr_type>, but mshr_type may be better to be renamed altogether to avoid confusion. Perhaps fill_type?
There was a problem hiding this comment.
You're correct that the type system causes compilation errors if, for example, inflight_writes changed types. I think I was unclear about what my concerns were. Semantically, this change moves from guaranteeing that the entry is in range A only to guaranteeing that it is in either range A or B, but we don't know which. I think the semantic of this, not merely the grammar, is worth being careful about.
You are also correct concerning the write/non-write merging. This is called forwarding and I should know that. 😞
I'm okay with renaming the type here. It's worth noting that the MSHR and the inflight_write are representing different things. The first represents a wait for data that has not arrived (and afterwards the pipeline of the data commit), the second represents the pipeline of data that has not yet been committed. So, perhaps, the inflight_write type is like an MSHR with a wait time of zero. Perhaps, and I'm mostly conjecturing here, what needs to happen is that MSHRs that have received their data need to be moved to the write pipe, so the MSHR array represents things that are waiting only, and not also those that are in a write pipe.
There was a problem hiding this comment.
I'm okay with renaming the type here. It's worth noting that the MSHR and the inflight_write are representing different things. The first represents a wait for data that has not arrived (and afterwards the pipeline of the data commit), the second represents the pipeline of data that has not yet been committed. So, perhaps, the inflight_write type is like an MSHR with a wait time of zero. Perhaps, and I'm mostly conjecturing here, what needs to happen is that MSHRs that have received their data need to be moved to the write pipe, so the MSHR array represents things that are waiting only, and not also those that are in a write pipe.
Moving resolves misses to the write pipe sounds good. I would probably also rename inflight_writes to inflight_fills or something to represent that it will not only consume writes from the upper layer but also reads from the lower layer to "fill" a cache line.
Thank you for the quick review. I agree that moving the merging and forwarding operations to the source end is better than moving them to the destination, and I attempted to enforce that by:
However, there is one missing piece: LSQ does not merge duplicate packets. My data shows it caused degradation in the simulated performance. For example, the following diff shows removing the merging operation in channel significantly increases L1D accesses though the lower levels do not see such a big change: akihikodaki/e@4f2c9cd#diff-07b66c639839c13ae44569bb1a116810918ff732a41eaa4f273ba395e75983cd By the way, I found my description of commit 2605acc mentioning the reduced bandwidth is misleading as the L1D has sufficient bandwidth to satisfy LQ requests: On the other hand, I'm worried that the savings may not be present in common microarchitecture designs, and having them may hurt the simulation accuracy because the published designs I cited do discuss store-to-load forwarding, but they do not mention merging of loads. I also have a suspicion that having a small, low-latency cache is a better option to improve the performance. It is necessary to search entries in LQ to merge requests early, which effectively makes LQ behave like a small cache. Having a real, small low-latency cache instead can achieve the same goal while reducing the latency of non-duplicate requests as well. LQ should already have a CAM search port to detect ordering failures between loads and stores, but even the port is not free to use if loads and stores can happen simultaneously so searching entries in LQ may not be cheaper than having a low-latency cache area-wise either. In summary, taking the cited designs and the potential cost of merging operations into account, I think it is better not to merge requests in LSQ for simulation accuracy. This is all theoretical though, so I'd like to hear if you have more insights on or know literature that discusses this topic. |
I believe that the failure here is not actually in the LQ, but in the OOO engine. Because load addresses in our traces are not associated with destination registers, we treat the loads as independent operations. A more proper model would do better renaming here. Maybe that's out of scope for this patch, and this patch serves to highlight the gap in the model. |
Destination registers may not help determine if loads are mergeable. Below is a part of Here, these five instructions look independent when looking at the destination registers that will be loaded from memory as they have different ones: channel used to merge these loads by looking at load addresses, but I think such merging operations should not happen when simulating a conventional microarchitecture due to reasons described in #618 (comment). |
Channel already implements forwarding, but it misses if a write packet is already moved from WQ. Forward in cache to ensure that no read request will be issued when a write is inflight.
Avoid merging in channels due to implications in accuracy, performance, and code complexity. Here, I classify channels into three categories: 1. Channels whose requesters are CACHEs or PTWs. 2. Channels between O3_CPUs and their L1Is. 3. Channels between O3_CPU and their L1Ds. Merging operations in categories 1 and 2 are unnecessary. For category 1, CACHEs or PTWs already merge requests with their MSHRs. For category 2, merging do not affect the simulation results because each request consumes the fetch bandwidth even when it was merged with another. Merging operations in category 3 do reduce the cycles consumed by memory requests because LSQ do not implement merging and requests merged into one only consumes one of the bandwidth. However, Intel® 64 and IA-32 Architectures Optimization Reference Manual: Volume 1 revision 050 and the Software Optimization Guide for the AMD Zen5 Microarchitecture revision 1.00 do not mention such merging though they discuss store-to-load forwarding. An open source core, BOOMv3 does not implement it either. Therefore, having such merging operations may not be appropriate when accurately simulating conventional microarchitecture designs. Remove the merging operations in channel in hope that it will improve accuracy for category 3 and performance for all categories. It also reduces the code complexity of the channel implementation.
LSQ and MSHRs already implement forwarding so remove forwarding in channel to reduce overheads and code complexity.
The occupancy of MSHRs is already tested with test/cpp/src/402-miss-latency.cc and there is nothing special with prefetch fills. This means we can safely remove the tests of MSHRs with prefetch fills to save the code to maintain without hurting the ability of the test suite to catch bugs.
"MSHR" stands for Miss Status Holding Register, but the word was too overloaded: - Writes that fill cache lines were also called MSHR. - Cache misses that were merged to existing MSHRs or inflight fills before they get a MSHR allocated were also called MSHR. Replace "MSHR" with "fill" or "miss" where appropriate.
|
I added commit 20ad50a to remove redundant tests which will be out-of-date after commit 38f8b01. |
ngober
left a comment
There was a problem hiding this comment.
I'm terribly sorry that it took me so long to review this. I got distracted when I was working on it a long time ago and never got back. This generally looks good and is a more proper model.
Author: Nathan Gober Committer: GitHub Merge pull request #618 from akihikodaki/cache4 Forward and merge exclusively with LSQs and MSHRs
A common memory subsystem design is to implement store-to-load forwarding with LSQs and to merge cache misses with MSHRs. On the other hand, ChampSim has a structure named "channel" which performs forwarding and merging while it also has LSQs and MSHRs.
My concern with channel is that the duplication of forwarding and merging operations makes it complicated to instrument for Kanata, an instruction pipeline visualizer (please also see #594). However, I also found a possibility to improve accuracy and performance and to reduce code complexity during the attempt to resolve the duplication so I open this pull request to propose these improvements.
This pull request comprises three commits (evaluated using the mean values of IPC computed for DPC-3 traces):
Please refer to each commit for details.
For detailed benchmark results, please refer to:
https://github.dev/akihikodaki/e/blob/c94ae27f53002aea55206cc1092018eb52f02ea5/stats.ipynb
The "timeline" column at the bottom left allows you to compare commits. Commit akihikodaki/e@4600ff3 is the evaluation results of the baseline.
I evaluated the changes with the DPC-3 traces and also checked that they do not contradict the following: