EN TN fixes for Issue #166#185
Conversation
Signed-off-by: Simon Zuberek <[email protected]>
for more information, see https://pre-commit.ci
|
|
||
| domain_graph_with_class_tags = ( | ||
| pynutil.insert("domain: \"") | ||
| pynutil.insert('domain: "') |
There was a problem hiding this comment.
Alias all strings as variables
There was a problem hiding this comment.
I aliased all string literals when appropriate. All variables were placed in graph_utils.py.
Signed-off-by: Simon Zuberek <[email protected]>
for more information, see https://pre-commit.ci
Signed-off-by: Simon Zuberek <[email protected]>
for more information, see https://pre-commit.ci
tbartley94
left a comment
There was a problem hiding this comment.
Just alias the variables, not the insertion operation. You're going to find cases in the future where you'll also want to delete, or run a cross mapping. Easier if you just keep it simple with aliasing the string itself.
| | (pynutil.delete(' field_order: "') + NEMO_NOT_QUOTE + pynutil.delete('"')) | ||
| ) | ||
|
|
||
| # String literal insertions |
There was a problem hiding this comment.
Hmm, do we have no where else to put these? I guess they need to accessible to both taggers and classifiers, but putting them in the graph_utils seems odd.
There was a problem hiding this comment.
Considering that instead of operations on the strings, we just want simple pointers AND that graph_utils.py already has a bunch of pointers to strings, I think it's as good a place as any other. It may also make it easier for imports.
| ) | ||
|
|
||
| # String literal insertions | ||
| insert_username = pynutil.insert('username: "') |
There was a problem hiding this comment.
I wouldn't create insert_double_quote. Just create a variable named double_quotes then call the insert function in the appropriate case. You don't want to limit yourself in case you need a deletion function as well.
| numbers = NEMO_DIGIT | ||
| else: | ||
| numbers = pynutil.insert(" ") + cardinal.long_numbers + pynutil.insert(" ") | ||
| numbers = insert_space + cardinal.long_numbers + insert_space |
There was a problem hiding this comment.
Yeah here it would be better as pynutil.insert(NEMO_SPACE).
| protocol_file_start = accept_file + insert_space + (accept_colon_triple_slash @ protocol_symbols) | ||
|
|
||
| protocol_end = pynutil.add_weight(pynini.cross("www", "WWW ") + pynini.accep(".") @ protocol_symbols, -1000) | ||
| protocol_end = pynutil.add_weight(pynini.cross("www", "WWW ") + accept_period @ protocol_symbols, -1000) |
There was a problem hiding this comment.
Why does the capitalized form have a space?
There was a problem hiding this comment.
It may be so that you don't get things like WWWdot nvidia dot com .
Signed-off-by: Simon Zuberek <[email protected]>
for more information, see https://pre-commit.ci
|
Jenkins fails bc Sparrowhawk fails on EN to begin with. This was the case before the current fix was implemented. |
This reverts commit 727f0fb. Signed-off-by: Simon Zuberek <[email protected]>
* EN TN fixes for Issue #166 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases string literals when appropriate Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes string variables from the class body Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases all commonly used strings in ELECTRONIC tagger Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Simon Zuberek <[email protected]> Co-authored-by: Simon Zuberek <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Alex Cui <[email protected]>
* EN TN fixes for Issue #166 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases string literals when appropriate Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes string variables from the class body Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases all commonly used strings in ELECTRONIC tagger Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Simon Zuberek <[email protected]> Co-authored-by: Simon Zuberek <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Alex Cui <[email protected]>
* EN TN fixes for Issue #166 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases string literals when appropriate Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes string variables from the class body Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases all commonly used strings in ELECTRONIC tagger Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Simon Zuberek <[email protected]> Co-authored-by: Simon Zuberek <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Alex Cui <[email protected]>
* EN TN fixes for Issue #166 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases string literals when appropriate Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes string variables from the class body Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases all commonly used strings in ELECTRONIC tagger Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Simon Zuberek <[email protected]> Co-authored-by: Simon Zuberek <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Alex Cui <[email protected]>
* EN TN fixes for Issue #166 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases string literals when appropriate Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes string variables from the class body Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases all commonly used strings in ELECTRONIC tagger Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Simon Zuberek <[email protected]> Co-authored-by: Simon Zuberek <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Alex Cui <[email protected]>
* EN TN fixes for Issue NVIDIA#166 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases string literals when appropriate Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes string variables from the class body Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases all commonly used strings in ELECTRONIC tagger Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Simon Zuberek <[email protected]> Co-authored-by: Simon Zuberek <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Namrata Gachchi <[email protected]>
* EN TN fixes for Issue NVIDIA#166 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases string literals when appropriate Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes string variables from the class body Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases all commonly used strings in ELECTRONIC tagger Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Simon Zuberek <[email protected]> Co-authored-by: Simon Zuberek <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* EN TN fixes for Issue #166 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases string literals when appropriate Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes string variables from the class body Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Aliases all commonly used strings in ELECTRONIC tagger Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Simon Zuberek <[email protected]> Co-authored-by: Simon Zuberek <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
What does this PR do ?
Fixes the issue where the sentence-final period in sentences ending with domain is incorrectly normalized as part of the domain. The PR also adds a handful of relevant tests and expands on the top-level domain mappings.
To note: A handful of Sparrowhawk tests failed for the current implementation of EN TN. This fix does not address those failures.
Before your PR is "Ready for review"
Pre checks:
git commit -sto sign.pytestor (if your machine does not have GPU)pytest --cpufrom the root folder (given you marked your test cases accordingly@pytest.mark.run_only_on('CPU')).bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...pytestand Sparrowhawk here.__init__.pyfor every folder and subfolder, includingdatafolder which has .TSV files?Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.to all newly added Python files?Copyright 2015 and onwards Google, Inc.. See an example here.try import: ... except: ...) if not already done.PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.