merge: perform exact rename detection in linear time#4202

Merged
ethomson merged 1 commit into libgit2:master from mitesch:linear_exact_rename on Jun 21, 2017
Conversation

@mitesch
Contributor

@mitesch mitesch commented Apr 12, 2017

The current exact rename detection has order n^2 complexity.
We can do better by using a map to first aggregate deletes and
using that to match deletes to adds.

This results in a substantial performance improvement for merges
with a large quantity of adds and deletes.

@ethomson
Member

Hi @mitesch - this is good stuff. My apologies that the rename detection part of the merge code is a bit of a bear; thanks for tackling this. I'm curious about a couple of things:

How often do you find yourself with multiple oids, such that you have more than one in the vector? It would be nice to save on the small allocations that result from using a vector here. In your workflow, if you're generally only seeing a single delete for an oid (which seems likely to me), then it might make sense to just store an oid for that delete, and only switch to a vector when there's already an oid recorded. That would reduce the number of small allocations you're doing to initialize the vectors.

Another thing that would reduce the number of small allocations is to use a pooling allocator (git_pool) for the deletes_by_oid_queues. In fact, you could even reuse the existing pool in the diff_list: these objects share a similar lifecycle, so I wouldn't think you need to reclaim the memory before the whole diff list is freed.

Finally, our git_vectors store a void *. You're casting these to size_ts. There's no guarantee that a pointer and a size_t are the same size... We could use uintptr_t here, but I think the only guarantee then is that uintptr_t is at least as large as void *, not that they're the same.

Strictly speaking, casting between a void * and an integer type is getting into questionable space as well, but I feel better about that assumption.

I realize this sounds rather language-lawyerly. I don't know that this will actually be a problem in practice, but these things get a bit iffy on weird platforms that are very old, very large, or very small, and are quite nontrivial to debug when they do arise, so better to catch it early.

Anyway, if I'm thinking about this correctly then I think that a git_array(size_t) would be preferable here to a vector.

Thanks for this - this is a really nice change. I'm curious about the improvements that you're seeing? I also assume that you're only doing 100% similarity matching so you're not affected by the O(n^2) inexact rename detection.

@mitesch
Contributor Author

mitesch commented Apr 14, 2017

Thanks for taking a look.

Before the change, we had a particular merge pattern that was taking >7 hours (order of 100k adds/deletes, crazy but they have a tool to automate the resolutions) that after this change takes <10 minutes. In this case we are doing 100% similarity, but in general use, I expect the inexact rename detection would hit the git_merge_options->target_limit before the O(n^2) behavior becomes a real problem.

In this particular case, we had a lot with duplicate oids, but I think the general case will be closer to 1:1.

I think all your suggestions make sense. I'll try to clean it up Monday.

@ethomson
Member

Neat. Thanks again. <3 Drop into slack.libgit2.org if you want to chat about this; it's been a while since I've been in the bowels of this code, so I may have given you bad advice.

@mitesch mitesch changed the title from "merge: perform exact rename detection in linear complexity" to "merge: perform exact rename detection in linear time" on Apr 17, 2017
@mitesch mitesch force-pushed the linear_exact_rename branch from d92dd95 to 9e454bb on April 19, 2017
@mitesch
Contributor Author

mitesch commented Apr 19, 2017

I switched to git_array_t; it's definitely the better option.

I'm still looking into avoiding the array allocation for the more common use case of a single delete per object id. My initial idea is to add a size_t first_item to deletes_by_oid_queue that represents the 0th element; element n (for n >= 1) would then be stored in the array at position n-1. I'm not a C expert though, so I don't know if there's a cleaner way with unions or other features.

Is the git_pool worth using if we only have one allocation (queue = git__malloc(sizeof(deletes_by_oid_queue))) which could be added to the pool? It's an easy enough change and probably worth doing if the pool would make a perf difference.

@ethomson
Member

> I'm still looking into avoiding the array allocation for the more common use case of a single delete per object id. My initial idea is to add a size_t first_item to deletes_by_oid_queue that represents the 0th element; element n (for n >= 1) would then be stored in the array at position n-1. I'm not a C expert though, so I don't know if there's a cleaner way with unions or other features.

What you're describing sounds reasonable to me... but let me outline an alternative that you might like better or you might find horrible.

Instead of having to know that the first element is a size_t in the struct and the remaining elements are in the git_array, you could sneakily point to that size_t in the git_array (instead of mallocing the array) and then do the actual malloc only when you want to grow...

If your struct was:

```c
typedef struct {
    git_array_t(size_t) arr;
    size_t first;
    size_t next_pos;
} deletes_by_oid_queue;
```

then you could enqueue like:

```c
size_t *array_entry;

if (!git_oidmap_valid_index(map, pos)) {
    queue = git__calloc(1, sizeof(deletes_by_oid_queue));
    GITERR_CHECK_ALLOC(queue);

    /* initial oid - populate the array with a pointer to our own `size_t` */
    queue->first = idx;
    queue->arr.size = 1;
    queue->arr.ptr = &queue->first;
} else {
    queue = git_oidmap_value_at(map, pos);

    /* if we only have one entry in our array, this means we've hacked it up
     * like above and need to convert it to a real array.  we can't simply call
     * `git_array_alloc` because trying to grow this array would try to realloc
     * `queue->first`, which would definitely crash
     */
    if (queue->arr.size == 1 && queue->arr.asize == 0) {
        git_array_init(queue->arr);

        array_entry = git_array_alloc(queue->arr);
        GITERR_CHECK_ALLOC(array_entry);
        *array_entry = queue->first;
    }

    /* now insert the new entry */
    array_entry = git_array_alloc(queue->arr);
    GITERR_CHECK_ALLOC(array_entry);
    *array_entry = idx;
}
```

(Note that this is written without testing.)

This may look insane, but it does simplify the dequeue code. You can always just call git_array_get whether you have one element or 100 - you don't have to do the trick where you look at queue->first then look at queue->arr.

The not so nice thing about this, though, is that this is one of those things that's a bit, uh, "clever". So if you're disgusted by this, then I don't blame you at all and we can pretend that it never happened. 😛

> Is the git_pool worth using if we only have one allocation (queue = git__malloc(sizeof(deletes_by_oid_queue))) which could be added to the pool? It's an easy enough change and probably worth doing if the pool would make a perf difference.

I think so - if I'm reading this correctly, you're mallocing one deletes_by_oid_queue for every deleted oid... Using a git_pool will help amortize some of those mallocs. It's hard to know how much of a difference it will make, or even on which systems and in what circumstances. I suspect it will be an improvement, especially in the pathological cases you were looking at, but I'm not 100% sure...

@mitesch
Contributor Author

mitesch commented Apr 19, 2017

Updated to use the git_pool and a first_entry.
I ended up going with my original thought for the first_entry. The change in dequeue to support it seems simpler than playing with the array. The memory cleanup is slightly dirty as well. Pushed here for reference.

@mitesch
Contributor Author

mitesch commented Apr 21, 2017

@ethomson ping. should be ready for review.

@ethomson
Member

ethomson commented May 1, 2017

@carlosmn Do you want to put a quick set of 👀 on this before I merge it, or coordinate its inclusion into rugged? Or should I just go ahead?

@mitesch mitesch force-pushed the linear_exact_rename branch from 4b9f10d to cee1e7a on May 17, 2017
@mitesch
Contributor Author

mitesch commented May 17, 2017

Commits are squashed and I tweaked the message a bit.

@ethomson
Member

Thanks @mitesch - I'm going to merge this as soon as we get 0.26 out the door. 😀

@ethomson ethomson merged commit 40294f3 into libgit2:master Jun 21, 2017
@pks-t pks-t mentioned this pull request Jan 12, 2018