Common parser interface#4310
Conversation
|
I've rebased on master with #4250. As expected, the problem with recursive includes was solved by this. Only remaining problem now is improper indentation, which I'll fix later |
|
Fixed remaining issues. Next steps which I'll not handle inside of this PR:
|
cbb6343 to
1096c49
Compare
|
Rebased upon latest master to fix a single conflict due to the new "common.h" headers. I've also added commit messages for every commit. |
|
|
||
| /* Next patch */ | ||
| { "diff --git " , STATE_END, 0, NULL }, | ||
| { "@@ -" , STATE_END, 0, NULL }, |
There was a problem hiding this comment.
These 0s should be STATE_START, no? It looks like we do want to start again but using a number instead of the enum variant makes it look like we're after a different value.
| STATE_COPY, | ||
|
|
||
| STATE_END, | ||
| } parse_header_state; |
There was a problem hiding this comment.
It'd be nice to have some documentation of what each of these mean. I was shortly confused by STATE_END being the start start state for @@ which happens in the middle of a patch.
|
|
||
| if (memcmp(ctx->line, op->str, min(op_len, ctx->line_len)) != 0) | ||
| if (transition->expected_state != state || | ||
| memcmp(ctx->parse_ctx.line, transition->str, min(op_len, ctx->parse_ctx.line_len)) != 0) |
There was a problem hiding this comment.
I think git__prefixcmp would be more appropriate here. Otherwise you're letting this match when either string is the prefix of the other.
|
|
||
| typedef struct { | ||
| const char *str; | ||
| parse_header_state expected_state; |
There was a problem hiding this comment.
Is this called expected_state because it's what you expect to be in in order to match? If we're going to go for a state machine, I think it makes more sense for this to be called current_state.
| memcpy((char *)ctx->content, content, content_len); | ||
| } else { | ||
| ctx->content = NULL; | ||
| } |
There was a problem hiding this comment.
Must we copy the contents? We're already read whatever file or buffer into memory.
| return 0; | ||
| } | ||
|
|
||
| int git_parse_advance_digit(git_off_t *out, git_parse_ctx *ctx, int base) |
There was a problem hiding this comment.
This doesn't match the type in the header. We should have int64 here.
| return 0; | ||
| } | ||
|
|
||
| int git_parse_peek(char *out, git_parse_ctx *ctx, int flags) |
There was a problem hiding this comment.
A lot of the uses for this do just want to look at the next character. It might not be a huge deal but I worry about the performance hit when we have this loop in the function that we don't usually want.
There was a problem hiding this comment.
Well, if the skip-whitespace flag is not set we're not looping at all. Are you worried about bad code generated by the compiler or just about calling that function multiple times? I mean I could also split that up into two different functions git_parser_peek and git_parser_peek_skip_whitespace (naming should obviously be improved), but I don't think it's required right now.
| reader.read_ptr = reader.buffer.ptr; | ||
| reader.eof = 0; | ||
| if (result == 0 || result == GIT_ENOTFOUND) { | ||
| git_parse_ctx_init(&reader.ctx, contents.ptr, contents.size); |
There was a problem hiding this comment.
Do we guarantee that we won't touch the fields in contents if it's not found? It makes sense that we wouldn't touch them, but if that's not part of the contract, this could initialise the context with whatever random pointer.
There was a problem hiding this comment.
Do you mean the contents of the context?
The `git_patch_parse_ctx` encapsulates both parser state as well as options specific to patch parsing. To advance this state and keep it consistent, we provide a few functions which handle advancing the current position and accessing bytes of the patch contents. In fact, these functions are quite generic and not related to patch-parsing by themselves. Seeing that we have similar logic inside of other modules, it becomes quite enticing to extract this functionality into its own parser module. To do so, we create a new module `parse` with a central struct called `git_parse_ctx`. It encapsulates both the content that is to be parsed as well as its lengths and the current position. `git_patch_parse_ctx` now only contains this `parse_ctx` only, which is then accessed whenever we need to touch the current parser. This is the first step towards re-using this functionality across other modules which require parsing functionality and remove code-duplication.
Instead of manually checking the parsing context's remaining length and comparing the leading bytes with a specific string, we can simply re-use the function `git_parse_ctx_contains_s`. Do so to avoid code duplication and to further decouple patch parsing from the parsing context's struct members.
The patch parsing code has multiple recurring patterns where we want to parse an actual number. Create a new function `git_parse_advance_digit` and use it to avoid code duplication.
Some code parts need to inspect the next few bytes without actually consuming it yet, for example to examine what content it has to expect next. Create a new function `git_parse_peek` which returns the next byte without modifying the parsing context and use it at multiple call sites.
Upon initializing the parser context, we do not currently initialize the current line, line length and line number. Do so in order to make the interface easier to use and more obvious for future consumers of the parsing API.
The configuration file code grew quite big and intermingles both actual configuration logic as well as the parsing logic of the configuration syntax. This makes it hard to refactor the parsing logic on its own and convert it to make use of our new parsing context module. Refactor the code and split it up into two parts. The config file code will only handle actual handling of configuration files, includes and writing new files. The newly created config parser module is then only responsible for parsing the actual contents of a configuration file, leaving everything else to callbacks provided to its provided function `git_config_parse`.
As the config parser is now cleanly separated from the config file code, we can easily refactor the code and make use of the common parser module. This removes quite a lot of duplicated functionality previously used for handling the actual parser state and replaces it with the generic interface provided by the parser context.
1096c49 to
9e66590
Compare
This is a first attempt at pulling out the diff-header parser and unify its interface, such that it can be reused for other code parts. In this PR I did a PoC to replace the config file parser with this common parser code, which turns out to be quite promising in that it removes roughly 150 lines of code which are not required anymore.
There are two issues left with this PR, which is why I labeled it as WIP. First, there are some small-ish issues with proper indentation when writing config files, which I'll handle as soon as the second problem is solved. The second problem is with recursive include files, where the wrong structures are touched due to our faulty include handling. These things should be fixed by #4250.
As there are quite a lot of merge conflicts due to #4250, I'll stop working on this issue right now until the other PR is merged.