Patch ID calculation#4272
Conversation
|
Seems like we also don't parse patches with multiple diffs correctly. At least the following diff only invokes the file callback once: Will investigate later on. But other than multi-file patch IDs and patches of files with white space, this patch seems to correctly generate the stable IDs. |
|
You need to flip that around. In libgit2 terminology, a patch doesn't have multiple diffs, a diff has multiple patches. A "patch" is a per-file change. Multiple patches are aggregated into a diff. I sort of loathe this terminology, but here we are. (And please do correct me if I'm wrong, I always get confused by our naming here.) How we choose to talk about the "patch id" for compatibility with git and our own internal nomenclature woes has yet to be seen. We could call it a "diff id" instead. Or not. |
|
Ah, yeah. Have been bitten by that terminology multiple times now, thanks for reminding! Having switched over to the diff interface, all tests pass right now. So I chose to name the function While at it, I've now fixed parsing patches with unquoted file names containing whitespaces. I've also tried to tighten our diff parsing code to return an error whenever we parse a patch containing multiple files, but quickly noticed that we in fact rely on the ability to ignore trailing garbage in a patch. So this interface is potentially dangerous in that it allows a developer to parse a patch as diff where we simply truncate the diff to the first patch. While it can be argued that the developer is at fault for not reading documentation, I can sympathise due to our confusing terminology. But well, here we are. Meh... |
IIRC, this is necessary, in fact, since they can (legally) have trailing data, IIRC. (sigh.) |
| if (!quoted && git__isspace(ctx->line[len])) | ||
| if (!quoted && | ||
| (!git__strncmp(ctx->line + len, " b/", strlen(" b/")) || | ||
| !git__strncmp(ctx->line + len, " \"b/", strlen(" \"b/")) || |
There was a problem hiding this comment.
There's no guarantee that the prefixes are a/ and b/. You can set custom prefixes with --src-prefix and --dst-prefix, or you can have no prefix at all (with --no-prefix)...
This is a bit of a bummer as it means parsing the header line is completely ambiguous. What does:
diff --git a/b c/d/e f/g h/i j/k
mean?
I think we'll need to store the whole header buffer then parse it with the knowledge given to us by the --- and +++ lines.
:(
There was a problem hiding this comment.
Except that a patch needn't have +++ and --- lines.
diff --git foo bar foo bar
old mode 100755
new mode 100644
(Note: git cannot apply this patch.)
There was a problem hiding this comment.
God, that's awful. So at some point, I guess we'd just have to say "This is unsupported." I think it makes most sense to use the "+++" and "---" lines and bail out if they're missing, given that upstream git itself cannot use these kind of patches.
There was a problem hiding this comment.
Truly hideous, right? Why these aren't just quoted whenever there's a space in the filename, I don't know. 😢
Can we split the space in the filename commit out? I think this removes our ability to deal with custom prefixes. I doubt anybody is actually using that but it might be thinking about this a bit more before we pull it out. I think the patchid stuff is 👍 and ready to go.
There was a problem hiding this comment.
Yeah, wanted to propose the same. Will create a new PR for the other "fix" and keep it on hold for now
The upstream git project provides the ability to calculate a so-called
patch ID. Quoting from git-patch-id(1):
A "patch ID" is nothing but a sum of SHA-1 of the file diffs
associated with a patch, with whitespace and line numbers ignored."
Patch IDs can be used to identify two patches which are probably the
same thing, e.g. when a patch has been cherry-picked to another branch.
This commit implements a new function `git_diff_patchid`, which gets a
patch and derives an OID from the diff. Note the different terminology
here: a patch in libgit2 are the differences in a single file and a diff
can contain multiple patches for different files. The implementation
matches the upstream implementation and should derive the same OID for
the same diff. In fact, some code has been directly derived from the
upstream implementation.
The upstream implementation has two different modes to calculate patch
IDs, which is the stable and unstable mode. The old way of calculating
the patch IDs was unstable in a sense that a different ordering the
diffs was leading to different results. This oversight was fixed in git
1.9, but as git tries hard to never break existing workflows, the old
and unstable way is still default. The newer and stable way does not
care for ordering of the diff hunks, and in fact it is the mode that
should probably be used today. So right now, we only implement the
stable way of generating the patch ID.
|
🎉 |
This is a work in progress. It does produce the same patch IDs for many files, but I'm not quite there yet. I've also discovered an error in parsing patches where the file names have whitespaces inside and added a test, but there is no fix yet.