WaitForJobDone() should wait until all tasks fully finished#145
Conversation
Controller.WaitForJobDone() watches JobStatus in etcd to determine when the task job has finished. Now, JobStatus is set in Framework.ShutdownJob() -- a function called by task implementer to notify the framework its job has been done. After the notification, the framework will then call Framework.releaseResource() and Task.Exit() to do cleanup work. However, at this time point, the controler may have already stopped (since it has seen the JobStatus change), and therefore causes the task cleanup process to stop early. After this patch, we set JobStatus only after the task cleanup process has been done, in order to make sure it won't be wrongly stoped by controller.
|
Hi @5kg . Only the person calling ShutdownJob() is supposed to change the job status. However, your proposal is trying to let everyone to change it. Nonetheless you have raised a good point. I have noticed this for quite a while. Actually, I am trying to rectify it in my next PR that will also take epoch out. See: Previously, the behavior is:
In the future, I expect the behavior to be:
Let me know what you think. I always like to hear more suggestions. Thanks again for finding the issue around. |
|
Hi @fengjingchao , |
|
The CI error looks like the unknown race as in #140. |
After 955c459, the Controller.WaitForJobDone() will wait until tasks fully finished. This patch changes JobStatus to a atomic counter counting the number of finished tasks. Only after every task has exited, Controller.WaitForJobDone() will then return.
I fully agree this proposal. |
Controller.WaitForJobDone() watches JobStatus in etcd to determine when the task
job has finished.
Now, JobStatus is set in Framework.ShutdownJob() -- a function called by task
implementer to notify the framework its job has been done. After the notification,
the framework will then call Framework.releaseResource() and Task.Exit() to do
cleanup work. However, at this time point, the controler may have already
stopped (since it has seen the JobStatus change), and therefore causes the task
cleanup process to stop early.
After this patch, we set JobStatus only after the task cleanup process has been
done, in order to make sure it won't be wrongly stoped by controller.