This repository was archived by the owner on Apr 23, 2026. It is now read-only.

chore: remove transformers cap and raise training floor#3264

Merged
mergify[bot] merged 2 commits into instructlab:main from jaideepr97:bump-transformers-training
Apr 16, 2025

Conversation

@jaideepr97
Contributor

Removes the temporary cap placed on the transformers library and bumps training to >=v0.8.1.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the
    conventional commits.
  • Changelog updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Functional tests have been added, if necessary.
  • E2E Workflow tests have been added, if necessary.

@mergify mergify Bot added the dependencies Relates to dependencies label Apr 8, 2025
@mergify mergify Bot added the one-approval PR has one approval from a maintainer label Apr 8, 2025
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed one-approval PR has one approval from a maintainer labels Apr 8, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 8f3e19e to 4b3cd03 Compare April 9, 2025 14:05
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 9, 2025
@booxter booxter self-requested a review April 9, 2025 20:23
Contributor

@booxter booxter left a comment


Please include transformers cap removal.

booxter
booxter previously requested changes Apr 9, 2025
Contributor

@booxter booxter left a comment


.

@mergify mergify Bot added the one-approval PR has one approval from a maintainer label Apr 9, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 4b3cd03 to b3beb76 Compare April 9, 2025 21:28
@mergify mergify Bot removed the ci-failure PR has at least one CI failure label Apr 9, 2025
@jaideepr97
Contributor Author

@booxter updated and also removed the comment above it
GH UI is slow for some reason and not showing that yet

@mergify mergify Bot added the ci-failure PR has at least one CI failure label Apr 9, 2025
@booxter
Contributor

booxter commented Apr 9, 2025

More issues in the training library with transformers?

  0%|          | 0/5 [00:00<?, ?it/s]/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/transformers/generation/configuration_utils.py:820: UserWarning: `return_dict_in_generate` is NOT set to `True`, but `output_logits` is. When `return_dict_in_generate` is not `True`, `output_logits` is ignored.
  warnings.warn(
'tuple' object has no attribute 'logits'

@RobotSail
Member

@booxter It looks like the issue is that the new transformers version changed its API. You can see in the current implementation of the full train pipeline that it gets the output by calling with return_dict=True, but that warning expects return_dict_in_generate=True (I'm guessing set at initialization).

output = model(**batch, use_cache=False, return_dict=True)

So the solution here is to either change the logic to just read the model output as a tuple or rename the parameter.

Long-term, we need to just move this function to live in the training library instead so we aren't duplicating this logic.

@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from a369140 to 842cb12 Compare April 10, 2025 15:08
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 10, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 842cb12 to fb5691c Compare April 10, 2025 16:55
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 10, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from fb5691c to bfbe624 Compare April 10, 2025 17:50
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 10, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from bfbe624 to dc95760 Compare April 11, 2025 13:58
@mergify mergify Bot removed the ci-failure PR has at least one CI failure label Apr 11, 2025
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 14, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 6dd265d to 6aef684 Compare April 15, 2025 12:56
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 15, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 6aef684 to 5abdb89 Compare April 15, 2025 13:43
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 15, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 5abdb89 to 9237ee4 Compare April 15, 2025 14:23
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 15, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 9237ee4 to 0199d37 Compare April 15, 2025 14:57
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 15, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 0199d37 to 19f7074 Compare April 15, 2025 16:33
@mergify mergify Bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 15, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 19f7074 to 8b19f09 Compare April 16, 2025 13:44
@mergify mergify Bot removed the ci-failure PR has at least one CI failure label Apr 16, 2025
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from 8b19f09 to af2995e Compare April 16, 2025 13:46
Comment thread src/instructlab/train/linux_train.py Outdated
Comment thread src/instructlab/train/linux_train.py Outdated
Comment thread tests/test_package.py Outdated
Comment thread src/instructlab/train/linux_train.py
Comment thread src/instructlab/model/full_train.py
Comment thread src/instructlab/model/full_train.py Outdated
@jaideepr97 jaideepr97 force-pushed the bump-transformers-training branch from bc9ec6a to fe4f715 Compare April 16, 2025 14:26
Currently, full training relies on some APIs from instructlab/training which aren't actually
helping with CPU legacy training. This commit resolves that by moving the function
needed for loss correction when training with gradient accumulation to live inside
the full train function, since it only uses a reduced amount of instructlab/instructlab
capability.

Signed-off-by: Oleg Silkin <[email protected]>
Comment thread src/instructlab/model/full_train.py
Comment thread requirements.txt Outdated
Comment thread requirements/cuda.txt
student_model_arch = get_model_arch(pathlib.Path(params["model_path"]))
if ctx.obj.config.general.use_legacy_tmpl:
    train_args.use_legacy_tmpl = True
else:
Contributor


does this change mean that legacy tokenizer config won't be detected anymore?

Member


@booxter We don't need it anymore

Contributor


We don't need the change or we don't need support for legacy auto-detection? And is this change related to transformers cap removal / training dependency update?

loss = None
if isinstance(output, tuple):
    loss = output[0]
    if len(output[0].shape) != 0:
Contributor


nit: if len(loss.shape) != 0? (Or even if loss.shape?)
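For context on this nit (illustrative only, not the PR's code): in PyTorch, `tensor.shape` is a `torch.Size`, which subclasses tuple, so a 0-dim scalar loss has an empty, falsy shape. That makes the reviewer's two suggested checks equivalent, as plain tuples show:

```python
# Illustrative only: torch.Size subclasses tuple, so a 0-dim (scalar)
# loss has shape () while a per-sample loss vector has shape (N,).
scalar_shape = ()       # shape of an already-reduced scalar loss
batched_shape = (8,)    # shape of a per-sample loss with batch size 8

def needs_reduction(shape):
    # Equivalent to the reviewer's `len(shape) != 0`; an empty tuple
    # is falsy, so plain truthiness gives the same answer.
    return bool(shape)

print(needs_reduction(scalar_shape))   # False
print(needs_reduction(batched_shape))  # True
```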

loss = output.loss
if loss is None:
    raise ValueError(
        "Loss is None. Ensure the model's output contains a valid loss."
Contributor


These errors will bubble up to users through CLI errors, and I am not sure users have the context to interpret them meaningfully. As a user, how do I "ensure the model's output contains a valid loss"? Are you asking the user to check their model inputs or config, perhaps? If so, that is what should be communicated.

Contributor


This applies to other errors here I think?

Member


@booxter This shouldn't be happening, so if it does, we want users to report it to us. They are also welcome to look at the internals of the CLI and potentially contribute back. We shouldn't assume that our users are incapable of being technical.

Contributor


I see. So it's always a bug in the code and not the inputs from the user? If so, the suggestion to "ensure the model's output" seems misplaced and we may instead want to direct the user to report the issue. (And maybe dump some more info to include with the report?)

Member


It could be a few things: the model implementation could have changed, or the transformers API itself could have changed (as in this case). So when that happens, we have this as a safeguard.

Comment thread src/instructlab/model/full_train.py

Labels

dependencies Relates to dependencies testing Relates to testing


4 participants