refdb_fs: enhance performance of globbing#4629
Conversation
d7f4689 to
68be55e
Compare
|
This is now ready for review. I did some preliminary measurements of cases with few references in the repository. The performance was tested by timing 10 consecutive runs of the test-suite: Obviously, the methodology is not quite what you'd want for a real measurement. The results with the changes: The results for the commit this PR is based on (d906a87): As you can see, the tests took a bit more time with the changes. This was to be somewhat expected, since in the tests, the globs usually select a significant portion of the references. Hence, the additional, albeit superficial, parse of the glob gains more weight. However, I also saw spurious segfaults for both the master and the feature-branch. Those hit once while running the test for the master and, I think, not for any of the runs for the feature-branch. So maybe the difference is just one run ending earlier. Anyways, I will repeat those tests again with a script which gives me more informative statistics. I will also do some tests with one of those repos I have with lots of refs (the target scenario for the optimization). Btw, you could have a look at the durations reported by travis. But I'm too lazy to compare them by hand. |
|
I did some measurements using a little test program. It queries the number of references matching a glob and prints some counters ("real" time, user and kernel ticks): I ran the program with three sets of repository and glob:
Sidenote: git-dit is a distributed issue manager of which I am a co-author. We store messages as commits and use references simply for keeping them alive. For each "issue", we have a sub-directory containing those references. The two scenarios involving git-dit fetch those refs for two separate issues. Each scenario was run 20 times with both the program linked against the feature-branch version of libgit2 ("enhanced") and the merge base ("master", d906a87). I found a slight improved performance even for the "libgit2" scenario, although the number of references omitted should be quite low (there are not tags and only a few remote references in addition to the two references I had in the repo). For my target scenarios ("gitdit1" and "gitdit2"), I found an improvement of factors >10. TL;DR: the patch-set makes globbing refs faster. For those interested, here are the numbers I got: "master-libgit2": "enhanced-libgit2": "master-gitdit1" enhanced-gitdit1: master-gitdit2: enhanced-gitdit2: |
pks-t
left a comment
There was a problem hiding this comment.
Thanks for your contribution!
This optimization makes a lot of sense to me, the benefits could be huge when there's a lot of deeply nested references. Code looks good to me, except for two small nits which should be fixed.
| } | ||
|
|
||
| if ((error = git_buf_printf(&path, "%s/", backend->commonpath)) < 0 || | ||
| (error = git_buf_put(&path, ref_prefix, ref_prefix_len)) || |
| git_iterator_options fsit_opts = GIT_ITERATOR_OPTIONS_INIT; | ||
| const git_index_entry *entry = NULL; | ||
| const char *ref_prefix = GIT_REFS_DIR; | ||
| int ref_prefix_len = strlen(ref_prefix); |
| break; | ||
| case '/': | ||
| last_sep = pos; | ||
| /* FALLTHROUGH */ |
There was a problem hiding this comment.
So we're looking for the first non-literal character here, using the previous literal directories as our base for looking for references. Makes a lot of sense
|
Those two things should have been obvious to me. Guess I'm too used to I'm ready for squashing as soon as I get an approval. |
Instead of a hardcoded "refs", we may choose a different directory within the git directory as the root from which we look for references.
A glob used for iteration may start with an entire path containing no special characters. If we start scanning for references within that path rather than in `refs/`, we may end up scanning only a small fraction of all references.
6b82bc1 to
20a2b02
Compare
|
I decided to squash without an additional review, in order to save a round-trip. |
|
Thanks, looks good to me. @carlosmn: you've got additional comments? |
|
Thanks again for your contribution! |
This patch-set addresses the performance of reference iteration with a glob specified. The enhancement is specific to FS-based repositories.
libgit2 provides facilities for iterating over references of a repository. The references may be filtered using a glob, e.g. using an iterator created via
git_reference_iterator_glob_new(). At least for FS-based repositories, all references are visited during the iteration and each of the references encountered is matches against the glob. For the general case, this is the only option.However, a glob may also start with a literal path. In this case, we may scan only the corresponding subdirectory of
refsrather than the entire reference space. The cost of this optimization is a single scan of the glob.Targets #4619.
TODO: