I was managing a monolithic Clojure application. As you might guess from the term "monolith," it grew too large and I needed to break it into smaller components, for unsurprising reasons. NATS met all of my needs: it let me split the application into smaller components that were easier to monitor and maintain, gave me the separation of concerns I wanted, and decoupled the parts of the application from one another. It was easy to set up and required almost nothing from me to maintain.

After migrating the application to a new server, however, I started seeing strange failures related to NATS. A background job scheduler in one of my services would intermittently hang for exactly one hour, then succeed immediately on retry. The NATS connection was rock-solid: no disconnects, no errors in the logs. The requests just vanished. The source was a bug in my setup that had been there for years, but it wasn't uncovered until the migration, simply because the new server setup was slightly different from the old one.
The Setup
The service uses NATS for inter-service communication with a request-reply pattern for dispatching jobs. Scheduled workers fire off job requests and block until the job completes:
(defn send-start-job-request
  [ctx job-type params]
  (let [nats-conn (:nats-conn ctx)   ;; the shared connection
        data      (->RunJobReq job-type params)
        response  (deref
                    (.requestWithTimeout nats-conn
                                         "reqs.jobs.run" (serialize data)
                                         (Duration/ofHours 1)))]
    (deserialize response)))
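The one-hour Duration matters for the symptoms later: jnats returns a CompletableFuture for the reply, and if no response is registered before the deadline, the future is cancelled rather than completed. A minimal Java sketch of what the blocking deref sees in that case (the 100 ms timeout and the names here are illustrative, not jnats internals):

```java
import java.util.concurrent.*;

public class TimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the future returned by requestWithTimeout():
        // no responder ever completes it.
        CompletableFuture<String> pendingReply = new CompletableFuture<>();

        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // Model of the client cancelling the pending future when the
        // timeout elapses; 100 ms here instead of one hour.
        timer.schedule(() -> pendingReply.cancel(true), 100, TimeUnit.MILLISECONDS);

        try {
            pendingReply.get();   // what (deref ...) blocks on
        } catch (CancellationException e) {
            System.out.println("CancellationException after timeout");
        } finally {
            timer.shutdown();
        }
    }
}
```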
On the receiving end, a pool of worker threads subscribes to the same subject using a queue group, so NATS load-balances incoming requests across the pool:
(doto (.subscribe nats-conn "reqs.jobs.run" "job-runner")
  (start-handler-thread
    (fn [message]
      (let [result (run-job message)]
        (.publish nats-conn (.getReplyTo message)
                  (serialize result))))))
Both the requester (the scheduled worker) and the responder (the job runner pool) used the same NATS connection. That’s the bug.
The Symptoms
- CancellationException: "Future cancelled, response not registered in time" after exactly one hour (the configured timeout)
- Retries immediately succeeded
- NATS connection monitoring showed a healthy, stable connection — no disconnects or reconnects
- The job runner workers never received the request (no job IDs were created in the database)
The one-hour timeout sent me chasing the wrong cause for most of my debugging time. I kept looking for network issues, NATS server problems, queue backpressure — anything that might explain dropped messages. The connection was fine. The messages weren’t being dropped by NATS. They were being swallowed by the client itself.
The Root Cause
When you call .requestWithTimeout() on a jnats Connection, the client creates an internal inbox subscription (_INBOX.*) on that connection to receive the reply. It publishes your request to the target subject, and the responder is expected to publish the response back to the reply-to address.
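In other words, a request is just a publish carrying a generated reply-to inbox, plus a one-shot subscription on that inbox owned by the mux dispatcher. The Java below is a hypothetical outline of that mechanism, not the real jnats implementation; publish and subscribeOnce are stubs standing in for the client's primitives:

```java
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

public class RequestSketch {
    // Stubs standing in for the real client's primitives.
    static void publish(String subject, String replyTo, byte[] data) { /* network send */ }
    static void subscribeOnce(String subject, CompletableFuture<byte[]> sink) { /* mux owns this */ }

    // Roughly what a request does before the timeout is armed.
    static CompletableFuture<byte[]> request(String subject, byte[] data) {
        String inbox = "_INBOX." + UUID.randomUUID();      // unique reply subject
        CompletableFuture<byte[]> reply = new CompletableFuture<>();
        subscribeOnce(inbox, reply);    // an arriving reply completes the future
        publish(subject, inbox, data);  // responder publishes back to the inbox
        return reply;
    }

    public static void main(String[] args) {
        // With stubbed primitives no reply ever arrives, so the future stays pending.
        System.out.println(request("reqs.jobs.run", new byte[0]).isDone());
    }
}
```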
Here’s what happens when the responder is subscribed on the same connection:
graph LR
A["Scheduled Worker<br/>.requestWithTimeout()"] -->|"1. publish to reqs.jobs.run"| B["NATS Server"]
B -->|"2. route back to same connection"| A
A -.->|"3. mux dispatcher intercepts<br/>message never reaches<br/>job runner handler"| D["Job Runner Pool<br/>❌ never invoked"]
style A fill:#f96,stroke:#333
style D fill:#f96,stroke:#333
The jnats client’s internal mux dispatcher has to route the incoming message on the same connection that’s waiting for a reply. The request goes out, NATS routes it right back to the same connection, and the mux dispatcher — which is already managing the _INBOX subscription for the pending reply — processes the inbound message in a context where the queue-group subscription handler never fires. The message silently never arrives at the job runner’s handler. This is a known issue documented in jnats GitHub issues #980 and #996, but it’s easy to miss. While I encountered this with jnats specifically, the self-request anti-pattern is worth watching for in any NATS client library that multiplexes subscriptions on a single connection.
The Fix
Dedicate a separate NATS connection for the job runner pool’s subscriptions:
graph LR
A["Scheduled Worker<br/>Connection A"] -->|"1. .requestWithTimeout()"| B["NATS Server"]
B -->|"2. route to subscriber"| C["Job Runner Pool<br/>Connection B"]
C -->|"3. publish response<br/>via Connection B"| B
B -->|"4. _INBOX reply<br/>delivered to Connection A"| A
style A fill:#6b9,stroke:#333
style C fill:#6b9,stroke:#333
The request travels out on Connection A, NATS delivers it to Connection B’s subscription, the worker processes the job and publishes the response back through the NATS server, and it arrives at Connection A’s _INBOX subscription to complete the future. No self-routing.
In code, the change was minimal — create a second connection component and wire the job runner pool to use it for subscriptions:
;; Before: job runner subscribes on the shared connection
(let [nats-conn (:nats-conn ctx)]
  (subscribe-with-queue nats-conn "reqs.jobs.run" "job-runner" handler))

;; After: job runner gets its own dedicated connection component
(let [nats-conn (:job-runner-conn ctx)]
  (subscribe-with-queue nats-conn "reqs.jobs.run" "job-runner" handler))
Two files changed, five lines added, two removed. The one-hour timeouts stopped immediately.
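The routing behavior can be reproduced in a toy in-process model. Everything here (Broker, Conn, Msg, and the drop-while-pending rule in deliver) is invented for illustration and far cruder than the real jnats dispatcher, but it shows the same observable contrast: on a shared connection the request never reaches the handler, while on a dedicated connection it round-trips:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.*;

class Msg { final String subject, replyTo, body;
    Msg(String s, String r, String b) { subject = s; replyTo = r; body = b; } }

class Broker {
    final List<Conn> conns = new ArrayList<>();
    void publish(Msg m) { for (Conn c : conns) c.deliver(m); }
}

class Conn {
    final Broker broker;
    final Map<String, Consumer<Msg>> handlers = new HashMap<>();
    final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    Conn(Broker b) { broker = b; b.conns.add(this); }

    void subscribe(String subject, Consumer<Msg> h) { handlers.put(subject, h); }

    CompletableFuture<String> request(String subject, String body) {
        String inbox = "_INBOX." + UUID.randomUUID();
        CompletableFuture<String> f = new CompletableFuture<>();
        pending.put(inbox, f);
        broker.publish(new Msg(subject, inbox, body));
        return f;
    }

    void deliver(Msg m) {
        // Toy version of the bug: while a request is pending, the reply mux
        // sees every inbound message first. A matching inbox reply completes
        // the future; anything else is silently dropped instead of reaching
        // the subject handler.
        if (!pending.isEmpty()) {
            CompletableFuture<String> f = pending.remove(m.subject);
            if (f != null) f.complete(m.body);
            return; // a request to "reqs.jobs.run" is swallowed here
        }
        Consumer<Msg> h = handlers.get(m.subject);
        if (h != null) h.accept(m);
    }
}

public class SelfRequestDemo {
    public static void main(String[] args) {
        // One shared connection: the handler never fires, the future never completes.
        Broker broker = new Broker();
        Conn shared = new Conn(broker);
        shared.subscribe("reqs.jobs.run",
            m -> broker.publish(new Msg(m.replyTo, null, "done")));
        System.out.println("shared conn got reply: "
            + shared.request("reqs.jobs.run", "job").isDone());

        // Separate connections: the request round-trips normally.
        Broker broker2 = new Broker();
        Conn requester = new Conn(broker2);
        Conn responder = new Conn(broker2);
        responder.subscribe("reqs.jobs.run",
            m -> broker2.publish(new Msg(m.replyTo, null, "done")));
        System.out.println("dedicated conn got reply: "
            + requester.request("reqs.jobs.run", "job").isDone());
    }
}
```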
Takeaways
If you use NATS request-reply, never subscribe to the request subject on the same connection that sends the request. The jnats mux dispatcher doesn’t handle this case well, and the failure mode is completely silent.
The broader lesson: when a messaging system silently drops messages but the connection is healthy, look at whether you’ve accidentally created a routing loop within the client itself. The server might be doing its job perfectly — it’s the client-side routing that’s confused.