Please fill out the form below.
System Information
- Framework: PyTorch
- Framework Version: 1.0.0
- Python Version: 3
- CPU or GPU: CPU
- Python SDK Version: 1.18.4
- Are you using a custom image: No
Describe the problem
Model deployment fails with cryptic errors. See the logs below. The command issued to deploy the model is the following:
MODEL_PATH = 's3:///sagemaker-us-east-2-971148336196/improved-ner-training-0-25-0/output/model.tar.gz'
MODEL_NAME = 'improved-ner-model-model-' + os.environ['ENVIRONMENT']
ENDPOINT_NAME = 'improved-ner-model-sagemaker-endpoint-' + os.environ['ENVIRONMENT']
DEPLOY_INSTANCE = 'ml.m5.large'
model = PyTorchModel(model_data=MODEL_PATH, role=ROLE, entry_point='train_model.py',
                     sagemaker_session=sm_session, py_version='py3', framework_version='1.0.0',
                     name=ENDPOINT_NAME)
model.deploy(initial_instance_count=1, instance_type=DEPLOY_INSTANCE, endpoint_name=ENDPOINT_NAME)
The model is publicly available here:
https://s3.us-east-2.amazonaws.com/sagemaker-us-east-2-971148336196/improved-ner-training-0-25-0/output/model.tar.gz
It contains a directory called flair, which contains final_model.pt.
The (relevant) part of the train_model.py script is the following:
def model_fn(model_dir):
    f_out = os.path.join(model_dir, 'flair')
    m = SequenceTagger.load_from_file(os.path.join(f_out, 'final-model.pt'))
    return m

def input_fn(request_body, request_content_type):
    if request_content_type.lower() != 'application/json':
        raise ValueError('Content type must be application/json')
    if 'sentence' not in request_body:
        raise ValueError('Request must be JSON formatted with key: sentence')
    return request_body['sentence']

def predict_fn(input_data, model):
    return model.predict(input_data)

if __name__ == "__main__":
    args, _ = parse_args()
    flair_out = os.path.join(args.model_dir, 'flair')
    trainer(flair_out)  # This trains a model using flair.trainer.ModelTrainer
    model = SequenceTagger.load_from_file(os.path.join(flair_out, 'final-model.pt'))
    # create example sentence
    sentence = Sentence('I love Berlin')
    # predict tags and print
    model.predict(sentence)
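A side note on input_fn above: if request_body reaches the handler as a raw string or bytes (an assumption about the serving container, not something these logs confirm), then 'sentence' in request_body is only a substring check and request_body['sentence'] would raise a TypeError. A minimal sketch of a variant that parses the JSON first, under that assumption:

```python
import json


def input_fn(request_body, request_content_type):
    """Parse a JSON request body and return the value stored under 'sentence'."""
    if request_content_type.lower() != 'application/json':
        raise ValueError('Content type must be application/json')
    # Assumption: the body arrives serialized (str or bytes), not as a dict.
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode('utf-8')
    payload = json.loads(request_body)
    if not isinstance(payload, dict) or 'sentence' not in payload:
        raise ValueError('Request must be JSON formatted with key: sentence')
    return payload['sentence']
```

This is unrelated to the deployment failure itself; it would only matter once the endpoint starts and receives requests.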
Minimal repro / logs
The CloudWatch logs are very opaque. One of the errors is the following:
sagemaker_containers._errors.ClientError: [Errno 30] Read-only file system: '/opt/ml/model/flair/final-model.pt'
Then, much later, these errors pop up:
Processing /opt/ml/code
Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/pip-req-tracker-27gca9by/35241637574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3'
You are using pip version 18.1, however version 19.0.3 is available.
[2019-04-12 03:49:26 +0000] [25] [ERROR] Error handling request /ping
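Regarding the first error: Errno 30 is what Linux reports when a file is opened for writing on a read-only mount. One possible explanation (a guess, not confirmed by these logs) is that the model loader opens final-model.pt in a read-write mode while /opt/ml/model is mounted read-only in the serving container. If that is the case, copying the file to a writable location before loading it might work around it. A minimal sketch, where copy_to_writable_dir is a hypothetical helper and not part of any SDK:

```python
import os
import shutil
import tempfile


def copy_to_writable_dir(src_path):
    """Copy src_path into a fresh temporary directory and return the new path."""
    tmp_dir = tempfile.mkdtemp(prefix='model-')
    dst_path = os.path.join(tmp_dir, os.path.basename(src_path))
    # copyfile copies only the contents, so the copy gets default (writable) permissions.
    shutil.copyfile(src_path, dst_path)
    return dst_path


# Possible use inside model_fn (untested against the real container):
#     local_copy = copy_to_writable_dir(os.path.join(model_dir, 'flair', 'final-model.pt'))
#     m = SequenceTagger.load_from_file(local_copy)
```

The copy costs extra disk space and start-up time in proportion to the model size.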
Any ideas of what is actually causing the error, or some other steps to take to make it easier to debug?
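As one possible debugging step, the handlers can be exercised on a local machine before deploying, which separates errors in train_model.py from errors in the container setup. The sketch below assumes the serving container calls model_fn once and then input_fn followed by predict_fn for each request; run_handlers_locally and the fake_* functions are hypothetical stand-ins used only to show the call pattern:

```python
def run_handlers_locally(model_fn, input_fn, predict_fn, model_dir, body, content_type):
    """Call the three handlers in the assumed order and return the prediction."""
    model = model_fn(model_dir)          # load the model once
    data = input_fn(body, content_type)  # deserialize a single request
    return predict_fn(data, model)       # run inference on the parsed input


# Stand-in handlers; replace them with the functions imported from train_model.py.
def fake_model_fn(model_dir):
    return str.upper


def fake_input_fn(body, content_type):
    return body


def fake_predict_fn(data, model):
    return model(data)


result = run_handlers_locally(fake_model_fn, fake_input_fn, fake_predict_fn,
                              '/tmp/model', 'i love berlin', 'application/json')
print(result)
```

With the real handlers, model_dir would point at a local directory where model.tar.gz has been extracted, so a failure in model_fn shows up as an ordinary Python traceback instead of a failed /ping in CloudWatch.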