merge: perform exact rename detection in linear time #4202
Conversation
Hi @mitesch - this is good stuff. Apologies that the rename detection part of the merge code is a bit of a bear, so thanks for tackling this. I'm curious about a couple of things: how often do you find yourself with multiple oids, such that you have more than one in the vector? It would be nice to save on the small allocations that result from using a vector here. In your workflow, if you're generally only seeing a single delete for an oid (which seems likely to me), then it might make sense to just store an oid for that delete, and only switch to using a vector once there's already an oid for that delete. This would reduce the number of small allocations you're doing to initialize the vectors. Another thing that would reduce the number of small allocations is to use a pooling allocator (git_pool). Finally, our […]. Strictly speaking, casting between a […]. I realize this sounds rather language-lawyery. I don't know that this will actually be a problem in practice, but these things get a bit iffy on weird platforms that are very old, very large, or very small, and are quite nontrivial to debug when they do arise, so better to catch it early. Anyway, if I'm thinking about this correctly, then I think that a […]. Thanks for this - this is a really nice change. I'm curious about the improvements that you're seeing? I also assume that you're only doing […]
Thanks for taking a look. Before the change, we had a particular merge pattern that was taking >7 hours (on the order of 100k adds/deletes - crazy, but they have a tool to automate the resolutions); after this change it takes <10 minutes. In this case we are doing 100% similarity, but in general use I expect the inexact rename detection would hit git_merge_options->target_limit before the […]. In this particular case we had a lot of duplicate oids, but I think the general case will be closer to 1:1. I think all your suggestions make sense. I'll try to clean it up Monday.
Neat. Thanks again. <3 Drop into slack.libgit2.org if you want to chat about this; it's been a while since I've been in the bowels of this, so I may have given you bad advice.
Force-pushed d92dd95 to 9e454bb
I switched to git_array_t; it definitely is the better option. I'm still looking into avoiding the array allocation for the more common case of a single delete per oid. My initial approach is to add a size_t first_item to deletes_by_oid_queue that represents the 0th element; element n (for n >= 1) would then be stored in the array at index n-1. I'm not a C expert, though, so I don't know whether there's a cleaner way with unions or other features. Is the git_pool worth using if we only have one allocation?
What you're describing sounds reasonable to me... but let me outline an alternative that you might like better - or you might find horrible. Instead of having to know that the first element is a […]. If your struct was: […] then you could […]. (Note that this is written without testing.) This may look insane, but it does simplify the dequeue code. You can always just call […]. The not-so-nice thing about this, though, is that it's one of those things that's a bit, uh, "clever". So if you're disgusted by this, then I don't blame you at all and we can pretend that it never happened. 😛
I think so - if I'm reading this all correctly, you're […]
Updated to use the git_pool and a first_entry. |
@ethomson ping. should be ready for review. |
@carlosmn Do you want to put a quick set of 👀 on this before I merge it, or coordinate its inclusion into rugged? Or should I just go ahead? |
Force-pushed 4b9f10d to cee1e7a
The current exact rename detection has O(n^2) complexity. We can do better by using a map to first aggregate deletes, then using that map to match deletes to adds. This results in a substantial performance improvement for merges with a large number of adds and deletes.
Commits are squashed and I tweaked the message a bit. |
Thanks @mitesch - I'm going to merge this as soon as we get 0.26 out the door. 😀 |