
Forward and merge exclusively with LSQs and MSHRs#618

Merged
ngober merged 6 commits into ChampSim:develop from akihikodaki:cache4
Sep 16, 2025

Conversation

@akihikodaki
Contributor

A common memory subsystem design is to implement store-to-load forwarding with LSQs and to merge cache misses with MSHRs. On the other hand, ChampSim has a structure named "channel" which performs forwarding and merging while it also has LSQs and MSHRs.

My concern with channel is that the duplication of forwarding and merging operations makes it complicated to instrument for Kanata, an instruction pipeline visualizer (please also see #594). However, while attempting to resolve the duplication, I also found opportunities to improve accuracy and performance and to reduce code complexity, so I open this pull request to propose these improvements.

This pull request comprises three commits (evaluated using the mean values of IPC computed for DPC-3 traces):

  • Commit 26c2f9d changes forwarding to be performed with MSHRs. This eliminates a time window during which forwarding does not happen in channel and prepares for the removal of forwarding from channel. +0.002 %
  • Commit 2605acc removes merging. This reduces the IPC because the bandwidth between LSQ and L1D becomes narrower. -5.932 %
  • Commit 694b84e removes forwarding. -0.016 %

Please refer to each commit for details.

For detailed benchmark results, please refer to:
https://github.dev/akihikodaki/e/blob/c94ae27f53002aea55206cc1092018eb52f02ea5/stats.ipynb
The "timeline" column at the bottom left allows you to compare commits. Commit akihikodaki/e@4600ff3 contains the evaluation results of the baseline.

I evaluated the changes with the DPC-3 traces and also checked that they do not contradict the following:

  • The implementation of BOOMv3
  • Descriptions in Intel® 64 and IA-32 Architectures Optimization Reference Manual: Volume 1 revision 050
  • Descriptions in the Software Optimization Guide for the AMD Zen5 Microarchitecture revision 1.00

Collaborator

@ngober ngober left a comment


This is a very thorough PR. I appreciate the research that you put into it, especially referencing the published designs.

I've long been bothered by the collision checking behavior in channels. It doesn't seem like something that would actually be built, and it's probably costly in terms of runtime. So, I'm pleased to see it go.

It looks to me like what you've done is take the collision check from the middle of the channel to the destination, in the miss handler. However, I wonder if it is more prudent instead to perform merging on the source end of the channel. The core should know if it is issuing a duplicate packet (based on what's in the LQ/SQ), and can avoid sending duplicates. Caches will probably still need to scan the MSHRs, so that duplicate prefetches can get squashed.

I think that source-side merging can eliminate a lot of logic that checks for duplicates and save bandwidth.

Let me know if I'm misreading what you've written. Thanks for working on this!

Comment thread src/cache.cc Outdated
Comment on lines +338 to +342
if (mshr_entry == MSHR.end()) {
mshr_entry = std::find_if(inflight_writes.begin(), inflight_writes.end(), matches_address(handle_pkt.address));
}

if (mshr_entry != inflight_writes.end()) // miss already inflight
Collaborator


Be very careful with this. You have changed the semantics of this code. Whereas previously mshr_entry was guaranteed to fall in the bounds of MSHR, it may now fall in either that or inflight_writes. The two ranges have the same type right now, but that is not enforced in the code.

You also run the risk of merging a non-write into a write below (line 354).

Maybe it is best to scan the MSHR and inflight_writes arrays separately and then call into a common handling function.

Contributor Author


Be very careful with this. You have changed the semantics of this code. Whereas previously mshr_entry was guaranteed to fall in the bounds of MSHR, it may now fall in either that or inflight_writes. The two ranges have the same type right now, but that is not enforced in the code.

GCC and Clang already enforce that inflight_writes and MSHR have the same type: they emit an error when assigning mshr_entry from inflight_writes otherwise, so we don't need to spend additional effort.

You also run the risk of merging a non-write into a write below (line 354).

It is intentional. The intent of commit 26c2f9d is to forward a write to a non-write.

Maybe it is best to scan the MSHR and inflight_writes arrays separately and then call into a common handling function.

My analysis of the problem with the changed semantics is that the code keeps mentioning "MSHR" even though it now also scans inflight_writes. I thought that was fine considering that inflight_writes is typed as std::deque<mshr_type>, but mshr_type may be better to be renamed altogether to avoid confusion. Perhaps fill_type?

Collaborator


You're correct that the type system causes compilation errors if, for example, inflight_writes changed types. I think I was unclear about what my concerns were. Semantically, this change moves from guaranteeing that the entry is in range A only to guaranteeing that it is in either range A or B, but we don't know which. I think the semantic of this, not merely the grammar, is worth being careful about.

You are also correct concerning the write/non-write merging. This is called forwarding and I should know that. 😞

I'm okay with renaming the type here. It's worth noting that the MSHR and the inflight_write are representing different things. The first represents a wait for data that has not arrived (and afterwards the pipeline of the data commit), the second represents the pipeline of data that has not yet been committed. So, perhaps, the inflight_write type is like an MSHR with a wait time of zero. Perhaps, and I'm mostly conjecturing here, what needs to happen is that MSHRs that have received their data need to be moved to the write pipe, so the MSHR array represents things that are waiting only, and not also those that are in a write pipe.

Contributor Author

@akihikodaki akihikodaki May 2, 2025


I'm okay with renaming the type here. It's worth noting that the MSHR and the inflight_write are representing different things. The first represents a wait for data that has not arrived (and afterwards the pipeline of the data commit), the second represents the pipeline of data that has not yet been committed. So, perhaps, the inflight_write type is like an MSHR with a wait time of zero. Perhaps, and I'm mostly conjecturing here, what needs to happen is that MSHRs that have received their data need to be moved to the write pipe, so the MSHR array represents things that are waiting only, and not also those that are in a write pipe.

Moving resolved misses to the write pipe sounds good. I would probably also rename inflight_writes to inflight_fills or something similar to represent that it will not only consume writes from the upper layer but also reads from the lower layer to "fill" a cache line.

Contributor Author


I added commits 38f8b01 and dd35287 to unify the fill pipe and to rename mshr_type, inflight_write, and similar, respectively.

@akihikodaki
Contributor Author

This is a very thorough PR. I appreciate the research that you put into it, especially referencing the published designs.

I've long been bothered by the collision checking behavior in channels. It doesn't seem like something that would actually be built, and it's probably costly in terms of runtime. So, I'm pleased to see it go.

It looks to me like what you've done is take the collision check from the middle of the channel to the destination, in the miss handler. However, I wonder if it is more prudent instead to perform merging on the source end of the channel. The core should know if it is issuing a duplicate packet (based on what's in the LQ/SQ), and can avoid sending duplicates. Caches will probably still need to scan the MSHRs, so that duplicate prefetches can get squashed.

I think that source-side merging can eliminate a lot of logic that checks for duplicates and save bandwidth.

Let me know if I'm misreading what you've written. Thanks for working on this!

Thank you for the quick review.

I agree that moving the merging and forwarding operations to the source end is better than moving them to the destination, and I attempted to enforce that by:

  • relying on LSQ to perform forwarding before issuing a request to L1D and
  • relying on MSHR and inflight_writes to perform merging and forwarding before issuing a request to a lower level.

However, there is one missing piece: LSQ does not merge duplicate packets. My data shows it caused degradation in the simulated performance. For example, the following diff shows removing the merging operation in channel significantly increases L1D accesses though the lower levels do not see such a big change: akihikodaki/e@4f2c9cd#diff-07b66c639839c13ae44569bb1a116810918ff732a41eaa4f273ba395e75983cd
This implies that the increase in L1D accesses is impacting the simulated performance and merging them before issuing to L1D can resolve the degradation.

By the way, I found my description of commit 2605acc mentioning the reduced bandwidth is misleading as the L1D has sufficient bandwidth to satisfy LQ requests: lq_width and L1D's max_tag_check are equal (both are 2) in my evaluation. I think the real problem is that accessing L1D adds 2 cycles of latency.

On the other hand, I'm worried that such merging may not be present in common microarchitecture designs, and modeling it may hurt simulation accuracy: the published designs I cited do discuss store-to-load forwarding, but they do not mention merging of loads.

I also have a suspicion that having a small, low-latency cache is a better option to improve the performance. It is necessary to search entries in LQ to merge requests early, which effectively makes LQ behave like a small cache. Having a real, small low-latency cache instead can achieve the same goal while reducing the latency of non-duplicate requests as well.

LQ should already have a CAM search port to detect ordering failures between loads and stores, but even that port is not free to use if loads and stores can happen simultaneously, so searching LQ entries may not be cheaper, area-wise, than having a low-latency cache either.

In summary, taking the cited designs and the potential cost of merging operations into account, I think it is better not to merge requests in LSQ for simulation accuracy. This is all theoretical though, so I'd like to hear if you have more insights on or know literature that discusses this topic.

@akihikodaki akihikodaki mentioned this pull request May 1, 2025
@ngober
Collaborator

ngober commented May 1, 2025

LQ should already have a CAM search port to detect ordering failures between loads and stores, but even the port is not free to use if loads and stores can happen simultaneously so searching entries in LQ may not be cheaper than having a low-latency cache area-wise either.

In summary, taking the cited designs and the potential cost of merging operations into account, I think it is better not to merge requests in LSQ for simulation accuracy. This is all theoretical though, so I'd like to hear if you have more insights on or know literature that discusses this topic.

I believe that the failure here is not actually in the LQ, but in the OOO engine. Because load addresses in our traces are not associated with destination registers, we treat the loads as independent operations. A more proper model would do better renaming here. Maybe that's out of scope for this patch, and this patch serves to highlight the gap in the model.

@akihikodaki
Contributor Author

LQ should already have a CAM search port to detect ordering failures between loads and stores, but even the port is not free to use if loads and stores can happen simultaneously so searching entries in LQ may not be cheaper than having a low-latency cache area-wise either.
In summary, taking the cited designs and the potential cost of merging operations into account, I think it is better not to merge requests in LSQ for simulation accuracy. This is all theoretical though, so I'd like to hear if you have more insights on or know literature that discusses this topic.

I believe that the failure here is not actually in the LQ, but in the OOO engine. Because load addresses in our traces are not associated with destination registers, we treat the loads as independent operations. A more proper model would do better renaming here. Maybe that's out of scope for this patch, and this patch serves to highlight the gap in the model.

Destination registers may not help determine whether loads are mergeable. Below is part of putc, depicting a frequent pattern where loads can be merged:

   0x0000748eeb9bfdfa <+330>:   pop    %rbx
   0x0000748eeb9bfdfb <+331>:   pop    %r12
   0x0000748eeb9bfdfd <+333>:   pop    %r13
   0x0000748eeb9bfdff <+335>:   pop    %rbp
   0x0000748eeb9bfe00 <+336>:   ret

Here, these five instructions look independent when judged by the destination registers that will be loaded from memory, as each has a different one: %rbx, %r12, %r13, %rbp, and %rip (written by ret).
However, these loads can actually be merged if they fall in one cache line, which is quite likely.

channel used to merge these loads by looking at load addresses, but I think such merging operations should not happen when simulating a conventional microarchitecture, for the reasons described in #618 (comment).

Channel already implements forwarding, but it misses cases where a write
packet has already been moved from the WQ. Forward in the cache to ensure
that no read request is issued while a write is inflight.
Avoid merging in channels due to its implications for accuracy,
performance, and code complexity.

Here, I classify channels into three categories:
1. Channels whose requesters are CACHEs or PTWs.
2. Channels between O3_CPUs and their L1Is.
3. Channels between O3_CPU and their L1Ds.

Merging operations in categories 1 and 2 are unnecessary. For category
1, CACHEs and PTWs already merge requests with their MSHRs. For category
2, merging does not affect the simulation results because each request
consumes fetch bandwidth even when it is merged with another.

Merging operations in category 3 do reduce the cycles consumed by memory
requests because the LSQ does not implement merging, and requests merged
into one consume only a single request's worth of bandwidth. However,
Intel® 64 and IA-32 Architectures Optimization Reference Manual: Volume 1
revision 050 and the Software Optimization Guide for the AMD Zen5
Microarchitecture revision 1.00 do not mention such merging even though
they discuss store-to-load forwarding. An open-source core, BOOMv3, does
not implement it either. Therefore, having such merging operations may
not be appropriate when accurately simulating conventional
microarchitecture designs.

Remove the merging operations in channel in the hope that it will improve
accuracy for category 3 and performance for all categories. It also
reduces the code complexity of the channel implementation.
LSQ and MSHRs already implement forwarding, so remove forwarding from
channel to reduce overhead and code complexity.
The occupancy of MSHRs is already tested with
test/cpp/src/402-miss-latency.cc, and there is nothing special about
prefetch fills. This means we can safely remove the tests of MSHRs with
prefetch fills, reducing the code to maintain without hurting the ability
of the test suite to catch bugs.
"MSHR" stands for Miss Status Holding Register, but the word was too
overloaded:
- Writes that fill cache lines were also called MSHR.
- Cache misses that were merged into existing MSHRs or inflight fills
  before they got an MSHR allocated were also called MSHR.

Replace "MSHR" with "fill" or "miss" where appropriate.
@coveralls

Coverage Status

coverage: 67.765% (-0.4%) from 68.139%
when pulling dd35287 on akihikodaki:cache4
into 9ad76ae on ChampSim:develop.

@akihikodaki
Contributor Author

I added commit 20ad50a to remove redundant tests which would be out-of-date after commit 38f8b01.
I also rebased this tree on the current develop, which makes the tests pass. Coveralls reports that the ratio of covered lines decreases, but this is because some lines are removed; there are no new uncovered lines.

Collaborator

@ngober ngober left a comment


I'm terribly sorry that it took me so long to review this. I got distracted when I was working on it a long time ago and never got back. This generally looks good and is a more proper model.

Comment thread src/cache.cc
@ngober ngober merged commit 9ccadb3 into ChampSim:develop Sep 16, 2025
32 checks passed
github-actions Bot added a commit that referenced this pull request Sep 16, 2025
Author: Nathan Gober
Committer: GitHub

Merge pull request #618 from akihikodaki/cache4

Forward and merge exclusively with LSQs and MSHRs
