This repository was archived by the owner on Apr 23, 2026. It is now read-only.

chore: remove transformers cap and raise training floor#3264

Merged
mergify[bot] merged 2 commits into instructlab:main from jaideepr97:bump-transformers-training
Apr 16, 2025

Conversation

@jaideepr97
Contributor

Removes the temporary cap placed on the transformers library and bumps training to >=v0.8.1.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the
    conventional commits.
  • Changelog updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Functional tests have been added, if necessary.
  • E2E Workflow tests have been added, if necessary.

@mergify mergify Bot added the dependencies Relates to dependencies label Apr 8, 2025
@mergify mergify Bot added the one-approval PR has one approval from a maintainer label Apr 8, 2025
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed one-approval PR has one approval from a maintainer labels Apr 8, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 8f3e19e to 4b3cd03 Compare April 9, 2025 14:05
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 9, 2025
@booxter booxter self-requested a review April 9, 2025 20:23
Contributor

@booxter booxter left a comment


Please include transformers cap removal.

booxter
booxter previously requested changes Apr 9, 2025
Contributor

@booxter booxter left a comment


.

@mergify mergify Bot added the one-approval PR has one approval from a maintainer label Apr 9, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 4b3cd03 to b3beb76 Compare April 9, 2025 21:28
@mergify mergify Bot removed the ci-failure PR has at least one CI failure label Apr 9, 2025
@jaideepr97
Contributor Author

@booxter updated and also removed the comment above it
GH UI is slow for some reason and not showing that yet

@mergify mergify Bot added the ci-failure PR has at least one CI failure label Apr 9, 2025
@booxter
Contributor

booxter commented Apr 9, 2025

More issues in the training library with transformers?

  0%|          | 0/5 [00:00<?, ?it/s]/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/transformers/generation/configuration_utils.py:820: UserWarning: `return_dict_in_generate` is NOT set to `True`, but `output_logits` is. When `return_dict_in_generate` is not `True`, `output_logits` is ignored.
  warnings.warn(
'tuple' object has no attribute 'logits'

@RobotSail
Member

@booxter It looks like the issue is that the new transformers version changed its API. You can see in the current implementation of the full train pipeline that it gets the output by calling with return_dict=True, but that warning expects return_dict_in_generate=True (I'm guessing set at initialization).

output = model(**batch, use_cache=False, return_dict=True)

So the solution here is to either change the logic to just read the model output as a tuple or rename the parameter.

Long-term, we need to just move this function to live in the training library instead so we aren't duplicating this logic.

@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from a369140 to 842cb12 Compare April 10, 2025 15:08
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 10, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 842cb12 to fb5691c Compare April 10, 2025 16:55
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 10, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from fb5691c to bfbe624 Compare April 10, 2025 17:50
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 10, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from bfbe624 to dc95760 Compare April 11, 2025 13:58
@mergify mergify Bot removed the ci-failure PR has at least one CI failure label Apr 11, 2025
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 14, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 6dd265d to 6aef684 Compare April 15, 2025 12:56
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 15, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 6aef684 to 5abdb89 Compare April 15, 2025 13:43
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 15, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 5abdb89 to 9237ee4 Compare April 15, 2025 14:23
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 15, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 9237ee4 to 0199d37 Compare April 15, 2025 14:57
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 15, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 0199d37 to 19f7074 Compare April 15, 2025 16:33
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 15, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 19f7074 to 8b19f09 Compare April 16, 2025 13:44
@mergify mergify Bot removed the ci-failure PR has at least one CI failure label Apr 16, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 8b19f09 to af2995e Compare April 16, 2025 13:46
Comment thread src/instructlab/train/linux_train.py Outdated
Comment thread src/instructlab/train/linux_train.py Outdated
Comment thread tests/test_package.py Outdated
Comment thread src/instructlab/train/linux_train.py
Comment thread src/instructlab/model/full_train.py
Comment thread src/instructlab/model/full_train.py Outdated
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from bc9ec6a to fe4f715 Compare April 16, 2025 14:26
Currently, full training relies on some APIs from instructlab/training which aren't actually
helping with CPU legacy training. This commit resolves that by moving the function
needed for loss correction when training with gradient accumulation to live inside
the full train function, since it only uses a reduced amount of instructlab/instructlab
capability.

Signed-off-by: Oleg Silkin <[email protected]>
Comment thread src/instructlab/model/full_train.py
Comment thread requirements.txt Outdated
Comment thread requirements/cuda.txt
student_model_arch = get_model_arch(pathlib.Path(params["model_path"]))
if ctx.obj.config.general.use_legacy_tmpl:
    train_args.use_legacy_tmpl = True
else:
Contributor


does this change mean that legacy tokenizer config won't be detected anymore?

Member


@booxter We don't need it anymore

Contributor


We don't need the change or we don't need support for legacy auto-detection? And is this change related to transformers cap removal / training dependency update?

loss = None
if isinstance(output, tuple):
    loss = output[0]
    if len(output[0].shape) != 0:
Contributor


nit: if len(loss.shape) != 0? (Or even if loss.shape?)
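For context on this nit (illustrative only, not the PR's code): in PyTorch, `tensor.shape` is a `torch.Size`, which subclasses tuple, so a 0-dim scalar loss has an empty, falsy shape. That makes the reviewer's two suggested checks equivalent, as plain tuples show:

```python
# Illustrative only: torch.Size subclasses tuple, so a 0-dim (scalar)
# loss has shape () while a per-sample loss vector has shape (N,).
scalar_shape = ()       # shape of an already-reduced scalar loss
batched_shape = (8,)    # shape of a per-sample loss with batch size 8

def needs_reduction(shape):
    # Equivalent to the reviewer's `len(shape) != 0`; an empty tuple
    # is falsy, so plain truthiness gives the same answer.
    return bool(shape)

print(needs_reduction(scalar_shape))   # False
print(needs_reduction(batched_shape))  # True
```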

loss = output.loss
if loss is None:
    raise ValueError(
        "Loss is None. Ensure the model's output contains a valid loss."
Contributor


These errors will bubble up to users through CLI errors, and I am not sure users have the context to interpret them meaningfully. As a user, how do I "ensure the model's output contains a valid loss"? Are you asking the user to check their model inputs or config, perhaps? If so, that is what should be communicated.

Contributor


This applies to other errors here I think?

Member


@booxter This shouldn't be happening, so if it does, we want users to report it to us. They are also welcome to look at the internals of the CLI and potentially contribute back. We shouldn't assume that our users are incapable of being technical.

Contributor


I see. So it's always a bug in the code and not the inputs from the user? If so, the suggestion to "ensure the model's output" seems misplaced and we may instead want to direct the user to report the issue. (And maybe dump some more info to include with the report?)

Member


It could be a few things: the model implementation could have changed, or the transformers API itself could have changed (as in this case). So when that happens, we have this as a safeguard.

Comment thread src/instructlab/model/full_train.py

Labels

dependencies Relates to dependencies testing Relates to testing


4 participants