
Forward and merge exclusively with LSQs and MSHRs#618

Merged
ngober merged 6 commits into ChampSim:develop from akihikodaki:cache4
Sep 16, 2025

Conversation

@akihikodaki
Contributor

A common memory subsystem design is to implement store-to-load forwarding with LSQs and to merge cache misses with MSHRs. On the other hand, ChampSim has a structure named "channel" which performs forwarding and merging while it also has LSQs and MSHRs.

My concern with channel is that the duplication of forwarding and merging operations makes it complicated to instrument for Kanata, an instruction pipeline visualizer (please also see #594). However, while attempting to resolve the duplication, I also found opportunities to improve accuracy and performance and to reduce code complexity, so I open this pull request to propose these improvements.

This pull request comprises three commits (evaluated using the mean values of IPC computed for DPC-3 traces):

  • Commit 26c2f9d changes forwarding to be performed with MSHRs. This eliminates a time window during which forwarding does not happen in channel and prepares for the removal of forwarding from channel. +0.002 %
  • Commit 2605acc removes merging. This reduces the IPC because the bandwidth between LSQ and L1D becomes narrower. -5.932 %
  • Commit 694b84e removes forwarding. -0.016 %

Please refer to each commit for details.

For detailed benchmark results, please refer to:
https://github.dev/akihikodaki/e/blob/c94ae27f53002aea55206cc1092018eb52f02ea5/stats.ipynb
The "timeline" column at the bottom left allows you to compare commits. Commit akihikodaki/e@4600ff3 contains the evaluation results of the baseline.

I evaluated the changes with the DPC-3 traces and also checked that they do not contradict the following:

  • The implementation of BOOMv3
  • Descriptions in Intel® 64 and IA-32 Architectures Optimization Reference Manual: Volume 1 revision 050
  • Descriptions in the Software Optimization Guide for the AMD Zen5 Microarchitecture revision 1.00

Collaborator

@ngober ngober left a comment


This is a very thorough PR. I appreciate the research that you put into it, especially referencing the published designs.

I've long been bothered by the collision checking behavior in channels. It doesn't seem like something that would actually be built, and it's probably costly in terms of runtime. So, I'm pleased to see it go.

It looks to me like what you've done is take the collision check from the middle of the channel to the destination, in the miss handler. However, I wonder if it is more prudent instead to perform merging on the source end of the channel. The core should know if it is issuing a duplicate packet (based on what's in the LQ/SQ), and can avoid sending duplicates. Caches will probably still need to scan the MSHRs, so that duplicate prefetches can get squashed.

I think that source-side merging can eliminate a lot of logic that checks for duplicates and save bandwidth.

Let me know if I'm misreading what you've written. Thanks for working on this!

Comment thread src/cache.cc Outdated
Comment on lines +338 to +342
if (mshr_entry == MSHR.end()) {
mshr_entry = std::find_if(inflight_writes.begin(), inflight_writes.end(), matches_address(handle_pkt.address));
}

if (mshr_entry != inflight_writes.end()) // miss already inflight
Collaborator


Be very careful with this. You have changed the semantics of this code. Whereas previously mshr_entry was guaranteed to fall in the bounds of MSHR, it may now fall in either that or inflight_writes. The two ranges have the same type right now, but that is not enforced in the code.

You also run the risk of merging a non-write into a write below (line 354).

Maybe it is best to scan the MSHR and inflight_writes arrays separately and then call into a common handling function.

Contributor Author


Be very careful with this. You have changed the semantics of this code. Whereas previously mshr_entry was guaranteed to fall in the bounds of MSHR, it may now fall in either that or inflight_writes. The two ranges have the same type right now, but that is not enforced in the code.

GCC and Clang already enforce that inflight_writes and MSHR have the same type: they emit an error when assigning mshr_entry from inflight_writes otherwise, so we don't need to spend additional effort.

You also run the risk of merging a non-write into a write below (line 354).

It is intentional. The intent of commit 26c2f9d is to forward a write to a non-write.

Maybe it is best to scan the MSHR and inflight_writes arrays separately and then call into a common handling function.

My analysis of the problem with the changed semantics is that the code keeps mentioning "MSHR" even though it now also scans inflight_writes. I thought that was fine considering that inflight_writes is typed as std::deque<mshr_type>, but mshr_type may be better to be renamed altogether to avoid confusion. Perhaps fill_type?

Collaborator


You're correct that the type system causes compilation errors if, for example, inflight_writes changed types. I think I was unclear about what my concerns were. Semantically, this change moves from guaranteeing that the entry is in range A only to guaranteeing that it is in either range A or B, but we don't know which. I think the semantic of this, not merely the grammar, is worth being careful about.

You are also correct concerning the write/non-write merging. This is called forwarding and I should know that. 😞

I'm okay with renaming the type here. It's worth noting that the MSHR and the inflight_write are representing different things. The first represents a wait for data that has not arrived (and afterwards the pipeline of the data commit), the second represents the pipeline of data that has not yet been committed. So, perhaps, the inflight_write type is like an MSHR with a wait time of zero. Perhaps, and I'm mostly conjecturing here, what needs to happen is that MSHRs that have received their data need to be moved to the write pipe, so the MSHR array represents things that are waiting only, and not also those that are in a write pipe.

Contributor Author

@akihikodaki akihikodaki May 2, 2025


I'm okay with renaming the type here. It's worth noting that the MSHR and the inflight_write are representing different things. The first represents a wait for data that has not arrived (and afterwards the pipeline of the data commit), the second represents the pipeline of data that has not yet been committed. So, perhaps, the inflight_write type is like an MSHR with a wait time of zero. Perhaps, and I'm mostly conjecturing here, what needs to happen is that MSHRs that have received their data need to be moved to the write pipe, so the MSHR array represents things that are waiting only, and not also those that are in a write pipe.

Moving resolved misses to the write pipe sounds good. I would probably also rename inflight_writes to inflight_fills or something similar to represent that it will not only consume writes from the upper layer but also reads from the lower layer to "fill" a cache line.

Contributor Author


I added commits 38f8b01 and dd35287 to unify the fill pipe and to rename mshr_type, inflight_write, and similar, respectively.

@akihikodaki
Contributor Author

This is a very thorough PR. I appreciate the research that you put into it, especially referencing the published designs.

I've long been bothered by the collision checking behavior in channels. It doesn't seem like something that would actually be built, and it's probably costly in terms of runtime. So, I'm pleased to see it go.

It looks to me like what you've done is take the collision check from the middle of the channel to the destination, in the miss handler. However, I wonder if it is more prudent instead to perform merging on the source end of the channel. The core should know if it is issuing a duplicate packet (based on what's in the LQ/SQ), and can avoid sending duplicates. Caches will probably still need to scan the MSHRs, so that duplicate prefetches can get squashed.

I think that source-side merging can eliminate a lot of logic that checks for duplicates and save bandwidth.

Let me know if I'm misreading what you've written. Thanks for working on this!

Thank you for the quick review.

I agree that moving the merging and forwarding operations to the source end is better than moving them to the destination, and I attempted to enforce that by:

  • relying on LSQ to perform forwarding before issuing a request to L1D and
  • relying on MSHR and inflight_writes to perform merging and forwarding before issuing a request to a lower level.

However, there is one missing piece: LSQ does not merge duplicate packets. My data shows it caused degradation in the simulated performance. For example, the following diff shows removing the merging operation in channel significantly increases L1D accesses though the lower levels do not see such a big change: akihikodaki/e@4f2c9cd#diff-07b66c639839c13ae44569bb1a116810918ff732a41eaa4f273ba395e75983cd
This implies that the increase in L1D accesses is impacting the simulated performance and merging them before issuing to L1D can resolve the degradation.

By the way, I found my description of commit 2605acc mentioning the reduced bandwidth is misleading as the L1D has sufficient bandwidth to satisfy LQ requests: lq_width and L1D's max_tag_check are equal (both are 2) in my evaluation. I think the real problem is that accessing L1D adds 2 cycles of latency.

On the other hand, I'm worried that such merging may not be present in common microarchitecture designs, and modeling it may hurt simulation accuracy: the published designs I cited do discuss store-to-load forwarding, but they do not mention merging of loads.

I also have a suspicion that having a small, low-latency cache is a better option to improve the performance. It is necessary to search entries in LQ to merge requests early, which effectively makes LQ behave like a small cache. Having a real, small low-latency cache instead can achieve the same goal while reducing the latency of non-duplicate requests as well.

LQ should already have a CAM search port to detect ordering failures between loads and stores, but even that port is not free to use if loads and stores can happen simultaneously, so searching LQ entries may not be cheaper, area-wise, than having a low-latency cache either.

In summary, taking the cited designs and the potential cost of merging operations into account, I think it is better not to merge requests in LSQ for simulation accuracy. This is all theoretical though, so I'd like to hear if you have more insights on or know literature that discusses this topic.

@akihikodaki akihikodaki mentioned this pull request May 1, 2025
@ngober
Collaborator

ngober commented May 1, 2025

LQ should already have a CAM search port to detect ordering failures between loads and stores, but even the port is not free to use if loads and stores can happen simultaneously so searching entries in LQ may not be cheaper than having a low-latency cache area-wise either.

In summary, taking the cited designs and the potential cost of merging operations into account, I think it is better not to merge requests in LSQ for simulation accuracy. This is all theoretical though, so I'd like to hear if you have more insights on or know literature that discusses this topic.

I believe that the failure here is not actually in the LQ, but in the OOO engine. Because load addresses in our traces are not associated with destination registers, we treat the loads as independent operations. A more proper model would do better renaming here. Maybe that's out of scope for this patch, and this patch serves to highlight the gap in the model.

@akihikodaki
Contributor Author

LQ should already have a CAM search port to detect ordering failures between loads and stores, but even the port is not free to use if loads and stores can happen simultaneously so searching entries in LQ may not be cheaper than having a low-latency cache area-wise either.
In summary, taking the cited designs and the potential cost of merging operations into account, I think it is better not to merge requests in LSQ for simulation accuracy. This is all theoretical though, so I'd like to hear if you have more insights on or know literature that discusses this topic.

I believe that the failure here is not actually in the LQ, but in the OOO engine. Because load addresses in our traces are not associated with destination registers, we treat the loads as independent operations. A more proper model would do better renaming here. Maybe that's out of scope for this patch, and this patch serves to highlight the gap in the model.

Destination registers may not help determine whether loads are mergeable. Below is part of putc, depicting a frequent pattern where loads can be merged:

   0x0000748eeb9bfdfa <+330>:   pop    %rbx
   0x0000748eeb9bfdfb <+331>:   pop    %r12
   0x0000748eeb9bfdfd <+333>:   pop    %r13
   0x0000748eeb9bfdff <+335>:   pop    %rbp
   0x0000748eeb9bfe00 <+336>:   ret

Here, these five instructions look independent when judged by the destination registers that will be loaded from memory, as each has a different one: %rbx, %r12, %r13, %rbp, and %rip (written by ret).
However, these loads can actually be merged if they fall in one cache line, which is quite likely.

channel used to merge these loads by looking at load addresses, but I think such merging operations should not happen when simulating a conventional microarchitecture, for the reasons described in #618 (comment).

Channel already implements forwarding, but it misses cases where a write
packet has already been moved from the WQ. Forward in the cache to ensure
that no read request is issued while a write is inflight.
Avoid merging in channels due to its implications for accuracy,
performance, and code complexity.

Here, I classify channels into three categories:
1. Channels whose requesters are CACHEs or PTWs.
2. Channels between O3_CPUs and their L1Is.
3. Channels between O3_CPU and their L1Ds.

Merging operations in categories 1 and 2 are unnecessary. For category
1, CACHEs and PTWs already merge requests with their MSHRs. For category
2, merging does not affect the simulation results because each request
consumes fetch bandwidth even when it is merged with another.

Merging operations in category 3 do reduce the cycles consumed by memory
requests because the LSQ does not implement merging, and requests merged
into one consume only a single request's worth of bandwidth. However,
Intel® 64 and IA-32 Architectures Optimization Reference Manual: Volume 1
revision 050 and the Software Optimization Guide for the AMD Zen5
Microarchitecture revision 1.00 do not mention such merging even though
they discuss store-to-load forwarding. An open-source core, BOOMv3, does
not implement it either. Therefore, having such merging operations may
not be appropriate when accurately simulating conventional
microarchitecture designs.

Remove the merging operations in channel in the hope that it will improve
accuracy for category 3 and performance for all categories. It also
reduces the code complexity of the channel implementation.
LSQ and MSHRs already implement forwarding, so remove forwarding from
channel to reduce overhead and code complexity.
The occupancy of MSHRs is already tested with
test/cpp/src/402-miss-latency.cc, and there is nothing special about
prefetch fills. This means we can safely remove the tests of MSHRs with
prefetch fills, reducing the code to maintain without hurting the ability
of the test suite to catch bugs.
"MSHR" stands for Miss Status Holding Register, but the word was too
overloaded:
- Writes that fill cache lines were also called MSHR.
- Cache misses that were merged into existing MSHRs or inflight fills
  before they got an MSHR allocated were also called MSHR.

Replace "MSHR" with "fill" or "miss" where appropriate.
@coveralls

Coverage Status

coverage: 67.765% (-0.4%) from 68.139%
when pulling dd35287 on akihikodaki:cache4
into 9ad76ae on ChampSim:develop.

@akihikodaki
Contributor Author

I added commit 20ad50a to remove redundant tests which would be out-of-date after commit 38f8b01.
I also rebased this tree on the current develop, which makes the tests pass. Coveralls reports that the ratio of covered lines decreases, but this is because some lines are removed; there are no new uncovered lines.

Collaborator

@ngober ngober left a comment


I'm terribly sorry that it took me so long to review this. I got distracted when I was working on it a long time ago and never got back. This generally looks good and is a more proper model.

Comment thread src/cache.cc
@ngober ngober merged commit 9ccadb3 into ChampSim:develop Sep 16, 2025
32 checks passed
github-actions Bot added a commit that referenced this pull request Sep 16, 2025
Author: Nathan Gober
Committer: GitHub

Merge pull request #618 from akihikodaki/cache4

Forward and merge exclusively with LSQs and MSHRs
