Skip to content

WaitForJobDone() should wait until all tasks fully finished#145

Open
5kg wants to merge 2 commits into
taskgraph:masterfrom
5kg:fix_early_quit
Open

WaitForJobDone() should wait until all tasks fully finished#145
5kg wants to merge 2 commits into
taskgraph:masterfrom
5kg:fix_early_quit

Conversation

@5kg
Copy link
Copy Markdown
Contributor

@5kg 5kg commented Jun 28, 2015

Controller.WaitForJobDone() watches JobStatus in etcd to determine when the task
job has finished.

Now, JobStatus is set in Framework.ShutdownJob() -- a function called by task
implementer to notify the framework its job has been done. After the notification,
the framework will then call Framework.releaseResource() and Task.Exit() to do
cleanup work. However, at this time point, the controler may have already
stopped (since it has seen the JobStatus change), and therefore causes the task
cleanup process to stop early.

After this patch, we set JobStatus only after the task cleanup process has been
done, in order to make sure it won't be wrongly stoped by controller.

Controller.WaitForJobDone() watches JobStatus in etcd to determine when the task
job has finished.

Now, JobStatus is set in Framework.ShutdownJob() -- a function called by task
implementer to notify the framework its job has been done. After the notification,
the framework will then call Framework.releaseResource() and Task.Exit() to do
cleanup work. However, at this time point, the controler may have already
stopped (since it has seen the JobStatus change), and therefore causes the task
cleanup process to stop early.

After this patch, we set JobStatus only after the task cleanup process has been
done, in order to make sure it won't be wrongly stoped by controller.
@hongchaodeng
Copy link
Copy Markdown

Hi @5kg .

Only the person calling ShutdownJob() is supposed to change the job status. However, your proposal is trying to let everyone to change it.

Nonetheless you have raised a good point. I have noticed this for quite a while. Actually, I am trying to rectify it in my next PR that will also take epoch out. See:

https://github.com/fengjingchao/taskgraph/blob/a535cb4c3c613d8c0eb8046e33404a2f6f5e8afa/framework_interface.go#L29

Previously, the behavior is:

  • a master task changed the job status
  • everyone, including controller, noticed it. But they shut down themselves individually.

In the future, I expect the behavior to be:

  • Everyone stop itself after some point is met.
  • Controller keeps watching everyone's termination and shut down the job after everyone completed.

Let me know what you think. I always like to hear more suggestions. Thanks again for finding the issue around.

@5kg 5kg force-pushed the fix_early_quit branch from 4286747 to eb5f1e2 Compare June 29, 2015 05:39
@5kg 5kg changed the title WaitForJobDone() should wait until task fully finished WaitForJobDone() should wait until all tasks fully finished Jun 29, 2015
@plutoshe
Copy link
Copy Markdown
Contributor

Hi @fengjingchao ,
After a discussion of kevin, who is good at kubernetes. I have a question about controller.
As I know, the controller simulates kubernetes resource assignment.
But the kubernetes is stateless, therefore etcd initialization ,which need set etcd status, seems incompatible with the controller.
Should we move this to non-controller part?

@5kg 5kg force-pushed the fix_early_quit branch from eb5f1e2 to e186ffb Compare June 29, 2015 06:06
@5kg
Copy link
Copy Markdown
Contributor Author

5kg commented Jun 29, 2015

The CI error looks like the unknown race as in #140.

After 955c459, the Controller.WaitForJobDone() will wait until tasks
fully finished.

This patch changes JobStatus to a atomic counter counting the number of
finished tasks. Only after every task has exited, Controller.WaitForJobDone()
will then return.
@5kg 5kg force-pushed the fix_early_quit branch from e186ffb to 55216cf Compare June 29, 2015 06:55
@5kg
Copy link
Copy Markdown
Contributor Author

5kg commented Jun 29, 2015

Hi @fengjingchao

In the future, I expect the behavior to be:

  • Everyone stop itself after some point is met.
  • Controller keeps watching everyone's termination and shut down the job after everyone completed.

I fully agree this proposal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants