
Fix stale executions permanently blocking concurrency=forbid jobs #1947

Merged
vcastellm merged 3 commits into main from copilot/fix-skipping-concurrent-execution
Mar 29, 2026


Copilot AI commented Mar 14, 2026

Proposed changes

Jobs with concurrency=forbid get permanently blocked when an execution gets stuck as "running" in persistent storage (e.g., node crash, gRPC failure). The Busy tab shows nothing because it only checks in-memory state, while isRunnable() checks persistent storage — creating a deadlock where the job never runs again.

The prior fix (d207f79) added persistent storage checks to survive node restarts, but introduced this regression: no mechanism existed to detect or clean up orphaned executions.
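The mismatch described above can be illustrated with a minimal sketch. The `record` type and both helper functions are hypothetical stand-ins, not dkron's actual types or API; the sketch only assumes that the Busy tab reads in-memory state while the concurrency check reads persisted records.

```go
package main

import "fmt"

// record is a hypothetical stand-in for a persisted execution (not dkron's type).
type record struct {
	Job      string
	Finished bool
}

// busyCount mimics the Busy tab: it only sees executions tracked in memory.
// The map key is an execution ID and the value is the job name.
func busyCount(active map[string]string, job string) int {
	n := 0
	for _, j := range active {
		if j == job {
			n++
		}
	}
	return n
}

// storedRunning mimics the storage check: any unfinished record counts as running.
func storedRunning(stored []record, job string) int {
	n := 0
	for _, r := range stored {
		if r.Job == job && !r.Finished {
			n++
		}
	}
	return n
}

func main() {
	// After a crash: nothing is active in memory, but storage still holds
	// a record that was never marked finished.
	active := map[string]string{}
	stored := []record{{Job: "myjob", Finished: false}}

	fmt.Println("busy tab:", busyCount(active, "myjob"))
	fmt.Println("running in storage:", storedRunning(stored, "myjob"))
}
```

With these inputs the in-memory view reports zero executions while the storage view reports one, which matches the `running_count=1` log lines in the linked issue alongside an empty Busy tab.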

Changes:

  • Stale execution detection: isRunnable() now cross-references storage-based running executions against in-memory active executions. Executions only in storage and older than DefaultStaleExecutionThreshold (4 hours) are automatically marked as failed via Raft
  • Reordered checks: In-memory active executions are checked first (authoritative), then storage is checked for post-leader-change scenarios
  • Conservative fallback: Storage-only executions within the threshold still block, preserving correctness during leader failover
  • Tests: Added TestConcurrencyForbid_StaleExecutionCleanup and TestConcurrencyForbid_RecentExecutionNotCleaned

// Stale execution in storage but not active in memory → auto-cleanup
runningFor := time.Now().UTC().Sub(exec.StartedAt)
if runningFor > DefaultStaleExecutionThreshold {
    exec.FinishedAt = time.Now().UTC()
    exec.Success = false
    // Apply cleanup via Raft for cluster consistency
    j.Agent.RaftApply(cmd)
}

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules (packet block):

  • 127.0.0.10
  • 127.0.0.12
  • 127.0.0.13
  • 127.0.0.15
  • 127.0.0.16
  • 127.0.0.22
  • 127.0.0.23
  • 127.0.0.24
  • 127.0.0.26

Every attempt was triggered by the same command:

/tmp/go-build4003139311/b001/dkron.test /tmp/go-build4003139311/b001/dkron.test -test.testlogfile=/tmp/go-build4003139311/b001/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.run=TestJob|TestStore|TestConcurrency|TestAPI g_.a pimachinery@v0.2-ifaceassert ux_amd64/vet -p gzip -lang=go1.24 ux_amd64/vet -o fMM2-uc6t lient-go@v0.22.2-ifaceassert ux_amd64/vet -p e/govmomi/proper-atomic -lang=go1.23 ux_amd64/vet


Original prompt

This section details the original issue you should resolve.

<issue_title>Skipping concurrent execution is reported but job is not listed in Busy tab</issue_title>
<issue_description>Describe the bug
I've been investigating a few jobs not running on days they're supposed to run (and the next run timestamp is not correctly set).
Looking at the logs of the dkron server, I've found the following:

Mar 08 21:20:00 dkron-server dkron[x]: time="2026-03-08T21:20:00Z" level=info msg="job: Skipping concurrent execution (found running executions in storage)" concurrency=forbid job=myjob job_status=success node=dkron-server running_count=1
Mar 08 21:30:00 dkron-server dkron[x]: time="2026-03-08T21:30:00Z" level=info msg="job: Skipping concurrent execution (found running executions in storage)" concurrency=forbid job=myjob job_status=success node=dkron-server running_count=1

Scheduling seems to work (the job is triggered according to the cron schedule I've specified) but the job is not run because there is another execution found in the storage and the concurrency is forbidden.

When I look at the Busy tab in the UI, I cannot see an execution for that job and the Next tab in the UI shows a date in the past. See the screenshot below

I don't see an API call to directly query the storage and I think the storage is in-memory so I don't think I can attach to it to view the content. Need assistance to troubleshoot this issue.

To Reproduce
I don't have the steps to reproduce the problem unfortunately.

Expected behavior
Either an execution to be listed in Busy section or the job to run successfully without the concurrency error.

Screenshots
Image

Specifications:

  • OS: linux
  • Version: 4.0.9
  • Executor: shell

</issue_description>

<agent_instructions>Investigate the storage of executions that could cause the issue</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 2 commits March 14, 2026 09:30
When a running execution gets stuck in persistent storage (e.g., due to
node crash or gRPC failure), it permanently blocks jobs with
concurrency=forbid from running.

The fix cross-references storage-based running executions with in-memory
active executions. If an execution is in storage but not active in memory
and has been running longer than DefaultStaleExecutionThreshold (4 hours),
it is automatically cleaned up by marking it as failed via Raft.

Co-authored-by: vcastellm <47026+vcastellm@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix skipping concurrent execution issue in Busy tab" to "Fix stale executions permanently blocking concurrency=forbid jobs" Mar 14, 2026
Copilot AI requested a review from vcastellm March 14, 2026 09:34
@vcastellm vcastellm marked this pull request as ready for review March 29, 2026 15:32
@vcastellm vcastellm merged commit a666cf1 into main Mar 29, 2026
4 checks passed
@vcastellm vcastellm deleted the copilot/fix-skipping-concurrent-execution branch March 29, 2026 15:44


Development

Successfully merging this pull request may close these issues.

Skipping concurrent execution is reported but job is not listed in Busy tab

2 participants