Distributed locking for hosted agents

Most hosted agent systems start with a loop: poll for work, choose a task, call tools, write the result. That works until there are two workers. Then the same issue, customer ticket, evaluation, or deploy request can be picked up twice.

The fix does not have to be a new workflow platform. Often the missing primitive is ownership. Before an agent starts, it asks a shared service whether it can own the task for a bounded amount of time.

The lease model

OctoStore exposes ownership as HTTP locks. An agent sends POST /locks/:name/acquire with a TTL and optional metadata. Use a stable name such as issue-1842. If the response says acquired, this agent owns the work. If it says held, another worker owns it.

```shell
curl -X POST https://your-octostore.internal/locks/issue-1842/acquire \
  -H "Authorization: Bearer $OCTOSTORE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "ttl_seconds":120,
    "metadata":"runner=agent-7 repo=acme/app issue=1842"
  }'
```

Metadata is deliberately small and operator-readable. Put the runner ID, task URL, trace ID, branch, model, or runbook hint where a human can see it.
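An agent needs to branch on the acquire result before it touches the task. The sketch below wraps the acquire call in a function and maps the outcome to an exit code. The response shape is an assumption: it presumes a JSON body with a "status" field set to "acquired" or "held", which this article does not specify. OCTOSTORE_URL and try_acquire are illustrative names.

```shell
# Sketch only. ASSUMPTION: the acquire response is JSON containing
# "status":"acquired" or "status":"held". Check the real response format.
OCTOSTORE_URL="${OCTOSTORE_URL:-https://your-octostore.internal}"

# try_acquire NAME TTL_SECONDS METADATA
# Prints the response body. Returns 0 if acquired, 1 if held, 2 otherwise.
# METADATA is interpolated as-is, so it must not contain quotes or backslashes.
try_acquire() {
  local name="$1" ttl="$2" metadata="$3" body compact
  body=$(curl -sS -X POST "$OCTOSTORE_URL/locks/$name/acquire" \
    -H "Authorization: Bearer $OCTOSTORE_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"ttl_seconds\":$ttl,\"metadata\":\"$metadata\"}") || return 2
  printf '%s\n' "$body"
  # Strip whitespace so the match does not depend on JSON formatting.
  compact=$(printf '%s' "$body" | tr -d '[:space:]')
  case "$compact" in
    *'"status":"acquired"'*) return 0 ;;
    *'"status":"held"'*)     return 1 ;;
    *)                       return 2 ;;
  esac
}
```

A worker could then start the task only when try_acquire issue-1842 120 "runner=agent-7" returns 0, skip the task on 1, and treat 2 as an error to retry later.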

Renew while work is live

A lease is not a permanent claim. Long-running work should renew before expiry:

```shell
curl -X POST https://your-octostore.internal/locks/issue-1842/renew \
  -H "Authorization: Bearer $OCTOSTORE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"lease_id":"...","ttl_seconds":120}'
```

Renewal acts as a heartbeat. A healthy worker keeps the lease alive. A crashed worker stops renewing, so the task eventually becomes available again.
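One way to run that heartbeat is a background loop tied to the worker process. This is a minimal sketch, not the OctoStore client: it renews every third of the TTL so a missed call or two still leaves room before expiry, and it stops when the worker exits. The helper names, the OCTOSTORE_URL variable, and the assumption that a failed renewal returns a non-2xx status (which curl -f turns into a non-zero exit) are all illustrative.

```shell
# Sketch only. ASSUMPTION: a failed renewal returns a non-2xx HTTP status.
OCTOSTORE_URL="${OCTOSTORE_URL:-https://your-octostore.internal}"

# renew_once NAME LEASE_ID TTL_SECONDS
# Sends one renew request. Non-zero exit on network or HTTP error.
renew_once() {
  curl -fsS -X POST "$OCTOSTORE_URL/locks/$1/renew" \
    -H "Authorization: Bearer $OCTOSTORE_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"lease_id\":\"$2\",\"ttl_seconds\":$3}" >/dev/null
}

# heartbeat NAME LEASE_ID TTL_SECONDS WORKER_PID
# Renews every TTL/3 seconds while WORKER_PID is alive.
# Returns 0 once the worker has exited, 1 if a renewal fails.
heartbeat() {
  local name="$1" lease_id="$2" ttl="$3" worker_pid="$4"
  local interval=$(( $ttl / 3 ))
  [ "$interval" -ge 1 ] || interval=1
  # kill -0 sends no signal; it only checks that the process exists.
  while kill -0 "$worker_pid" 2>/dev/null; do
    renew_once "$name" "$lease_id" "$ttl" || return 1
    sleep "$interval"
  done
  return 0
}
```

A worker script might start its task in the background, capture the PID from $!, and run heartbeat issue-1842 "$LEASE_ID" 120 "$WORKER_PID" alongside it. A return value of 1 means the lease may be lost, so the worker should stop writing.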

Carry the fencing term

Every acquisition returns a fencing token: a term number that strictly increases each time the lock is acquired. Pass it to downstream systems when possible. Once a resource has accepted a write with term 1843, it can reject a late write from a stale worker still holding term 1842.

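This closes the gap that TTL alone cannot close. A worker that stalls past its TTL may resume and still believe it holds the lease. Expiry decides who should own the lock now. Fencing lets the resource reject writes from whoever owned it before.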
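The check on the resource side can be very small. The sketch below keeps the highest term seen so far in a file and refuses any write that carries a lower one. The accept_write function and the state file are hypothetical stand-ins for whatever the downstream system uses. A real resource would make the compare-and-record step atomic, for example inside a database transaction.

```shell
# Sketch only: a file stands in for the resource's own durable state.
# Not atomic. A real resource must compare and record in one atomic step.

# accept_write STATE_FILE TOKEN
# Returns 0 and records TOKEN if it is at least the highest token seen.
# Returns 1 if TOKEN is lower, meaning it comes from a stale holder.
accept_write() {
  local state_file="$1" token="$2" highest=0
  if [ -f "$state_file" ]; then
    highest=$(cat "$state_file")
  fi
  if [ "$token" -lt "$highest" ]; then
    return 1
  fi
  printf '%s\n' "$token" > "$state_file"
  return 0
}
```

With this check in place, a write tagged 1842 that arrives after one tagged 1843 is refused, while the current holder can keep writing with 1843.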

Release on completion, expire on failure

When work completes, release the lock with its lease ID. If the worker dies before release, expiry is the recovery path. Pick a TTL that bounds how long a task can look owned after a crash, then renew during normal execution.
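A wrapper can make the release step hard to forget. The sketch below runs a task and then releases the lock with the lease ID. The release path is an assumption: the article names only the acquire and renew endpoints, so /locks/:name/release is a guess that mirrors them, and the request body shape is assumed too. Releasing after a task that exits with an error is a design choice here, made so another worker can retry sooner. If the process is killed outright, nothing runs and the TTL takes over.

```shell
# Sketch only. ASSUMPTION: release is POST /locks/:name/release with a
# JSON body carrying lease_id. The article does not state the endpoint.
OCTOSTORE_URL="${OCTOSTORE_URL:-https://your-octostore.internal}"

# release_lock NAME LEASE_ID
release_lock() {
  curl -fsS -X POST "$OCTOSTORE_URL/locks/$1/release" \
    -H "Authorization: Bearer $OCTOSTORE_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"lease_id\":\"$2\"}" >/dev/null
}

# run_with_release NAME LEASE_ID COMMAND [ARGS...]
# Runs COMMAND, then releases the lock whether COMMAND succeeded or not.
# Returns COMMAND's exit status.
run_with_release() {
  local name="$1" lease_id="$2" status=0
  shift 2
  "$@" || status=$?
  # A failed release is not fatal: the lease still expires by TTL.
  release_lock "$name" "$lease_id" \
    || echo "release failed; waiting for TTL expiry" >&2
  return "$status"
}
```

For example, run_with_release issue-1842 "$LEASE_ID" ./handle-issue.sh 1842 would run a hypothetical task script and release the lock when it returns.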

What a lock does not solve

A distributed lock prevents ordinary duplicate starts. It does not make external side effects transactional. Hosted agents should still write idempotently, handle retries, and gate irreversible actions. Treat the lock as the coordination boundary, not a substitute for safe application logic.
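Idempotent writes can be sketched with a per-action key. The run_once helper below is hypothetical and uses a local directory as the record of completed actions. A real agent would keep that record in the system that receives the side effect. It relies on mkdir failing when the directory already exists, so only one caller claims a given key.

```shell
# Sketch only: a local directory stands in for a durable record of
# completed actions.

# run_once DONE_DIR KEY COMMAND [ARGS...]
# Runs COMMAND at most once per KEY. Returns 0 if the action ran or was
# already claimed, 1 if COMMAND failed (the key is freed for a retry).
run_once() {
  local done_dir="$1" key="$2"
  shift 2
  mkdir -p "$done_dir"
  # mkdir without -p fails if the marker exists, so one caller wins.
  if ! mkdir "$done_dir/$key" 2>/dev/null; then
    return 0
  fi
  if ! "$@"; then
    rmdir "$done_dir/$key"
    return 1
  fi
}
```

A retried task that calls run_once "$STATE_DIR" issue-1842-comment with the same command performs the side effect only the first time.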

Need a fleet leader instead?

OctoStore also provides account-free remote leader election. Open a room, campaign over HTTP, and let one process lead while the rest wait.
