Support MXNet 1.3 with its training script format changes#446
Conversation
Codecov Report
@@ Coverage Diff @@
## master #446 +/- ##
==========================================
- Coverage 93.75% 93.63% -0.13%
==========================================
Files 55 55
Lines 4100 4114 +14
==========================================
+ Hits 3844 3852 +8
- Misses 256 262 +6
Continue to review full report at Codecov.
|
eslesar-aws
left a comment
There was a problem hiding this comment.
Finished an edit pass.
| ''''''''''''''''''''''''''' | ||
| Your MXNet training script must be a Python 2.7 or 3.5 compatible source file. | ||
|
|
||
| The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as |
There was a problem hiding this comment.
"...environment variables, including the following:"
Question: Is this an exhaustive list of the available environment variables? Does one exist?
There was a problem hiding this comment.
there's an exhaustive list at https://github.com/aws/sagemaker-containers#list-of-provided-environment-variables-by-sagemaker-containers - I'll add that somewhere
|
|
||
| The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as | ||
|
|
||
| * ``SM_MODEL_DIR``: A string representing the path to the directory to write model artifacts to. |
There was a problem hiding this comment.
"A string that represents the path where the training job writes the model artifacts to."
Comment--the next sentence makes this unclear. Is SM_MODEL_DIR a directory in the container? And model artifacts are uploaded to S3 from there?
There was a problem hiding this comment.
yes, that's correct. I reworded it to make it hopefully clearer - let me know if you think it needs more tweaks
| * ``SM_OUTPUT_DATA_DIR``: A string representing the filesystem path to write output artifacts to. Outut artifacts may include checkpoints, graphs, and other files to save, not including model artifacts. | ||
| These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts. | ||
|
|
||
| Supposing two input channels, 'train' and 'test', were used in the call to the MXNet estimator's ``fit`` method, the following will be set, following the format "SM_CHANNEL_[channel_name]": |
There was a problem hiding this comment.
Here's my attempt at restructuring this:
SM_CHANNEL_XXXX: A string that represents the path to the directory that contains the input data for the specified channel. For example, if two input channels, named 'train' and 'test', are specified in the call to the MXNet estimator'sfitmethod, the environment variablesSM_CHANNEL_TRAINandSM_CHANNEL_TESTare set.
There was a problem hiding this comment.
I like that a lot better. I changed the second sentence a little bit but added the extra bullet point and removed the paragraph/separate list
| from sagemaker.session import get_execution_role # noqa: F401 | ||
|
|
||
| __version__ = '1.12.0' | ||
| __version__ = '1.13.0' |
There was a problem hiding this comment.
should we bump it to 2.x since this is a breaking change?
There was a problem hiding this comment.
well, this PR doesn't technically have breaking changes because we're not bumping the default version of MXNet. I was going to wait until the PR that makes framework_version required.
| logger.warning(empty_framework_version_warning(MXNET_VERSION)) | ||
| self.framework_version = framework_version or MXNET_VERSION | ||
|
|
||
| if self._script_mode_version(): |
There was a problem hiding this comment.
Do we still launch the parameter server with single host training?
There was a problem hiding this comment.
yes, one can still use the kvstore needed even with only one host
|
|
||
| For versions 1.3 and higher | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
| Your MXNet training script must be a Python 2.7 or 3.5 compatible source file. |
There was a problem hiding this comment.
I am wondering if there is a good way to factor this part out since TensorFlow script mode will have this exact same document in the readme.
There was a problem hiding this comment.
the original splitting of the README deliberately chose to have repeated documentation - I'd love to revisit this idea later, but I don't think now is the time for another README overhaul :/
| def test_estimator_script_mode_launch_parameter_server(sagemaker_session): | ||
| mx = MXNet(entry_point=SCRIPT_PATH, role=ROLE, sagemaker_session=sagemaker_session, | ||
| train_instance_count=INSTANCE_COUNT, train_instance_type=INSTANCE_TYPE, | ||
| distributions=LAUNCH_PS_DISTRIBUTIONS_DICT, framework_version='1.3.0') |
There was a problem hiding this comment.
The default framework version is stored in a constant, right? Can we use that here?
There was a problem hiding this comment.
the default is being left at 1.2.1 so that this isn't a breaking change. Also the point of these three new unit tests is to be deliberate about the framework version
There was a problem hiding this comment.
I guess what i was thinking was do we need a latest_version constant. Are we going to change this framework version number here.
There was a problem hiding this comment.
this test just needs a version >= 1.3.0; no need to change it with later versions
Description of changes:
This adds support for MXNet 1.3, which will come with changes in the training script format.
A note on integ tests - because we're leaving the default MXNet version as 1.2.1, I left the tuning integ test using framework mode so there's at least one test (and it is included in the continuous testing) running that.
In other news, the underlying migration with our MXNet container code also means
requirements.txtwill be supported, which addresses #284.Merge Checklist
Put an
xin the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.