Ensure that when scaling down, the Service does not invalidate the route in the Router if other replicas are still running #93

Closed
opened 2025-11-02 07:36:13 +00:00 by stanislav.hejny · 1 comment

why: When testing the scalability of a service, after a successful test that verified the replica count had been reduced, further interaction with the service is not possible. All requests fail with an HTTP 503 (Service Unavailable) error, even though at least one replica of the service is still up and running. It appears that the REMOVE_ROUTE event, which the Service emits on shutdown, invalidates the route even if there are still active replicas.

what: The REMOVE_ROUTE event is a feature that allows proper client notification when the service exits because it ran out of its allocated time. It is meant to support time-limited services, as envisioned for the CleverData service, where a specific stack would be started based on the job description and would exist only for a limited time, after which the route to it should be removed.

RCA: The Route Database does not contain a tag identifying the instance associated with the route. Thus a REMOVE_ROUTE event invalidates the route for all replicas, not only for the one that exited.
Solution: Add SWARM_TASK_ID (a global instance ID) as a tag on the route in the Route Database, and upon a REMOVE_ROUTE event invalidate only the route for that particular replica.
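
The proposed fix could be sketched roughly as follows. This is only an illustration under assumptions: the Route Database is modelled as an in-memory table, and `RouteTable`, `add_route`, `remove_route` and `is_routable` are hypothetical names, not the AMQ Library API. The `task_id` argument stands for the replica's SWARM_TASK_ID value.

```python
from collections import defaultdict


class RouteTable:
    """Hypothetical in-memory stand-in for the Route Database."""

    def __init__(self):
        # service name -> set of task IDs, one per live replica
        self._routes = defaultdict(set)

    def add_route(self, service, task_id):
        """Register a route and tag it with the replica's task ID."""
        self._routes[service].add(task_id)

    def remove_route(self, service, task_id):
        """Handle REMOVE_ROUTE for a single replica.

        Returns True if the service is still routable afterwards.
        """
        replicas = self._routes.get(service)
        if replicas is None:
            return False
        # Drop only the replica that exited, not the whole route.
        replicas.discard(task_id)
        if not replicas:
            # Last replica gone: the route is now truly invalid.
            del self._routes[service]
            return False
        return True

    def is_routable(self, service):
        return bool(self._routes.get(service))


table = RouteTable()
table.add_route("demo-service", "task-a")
table.add_route("demo-service", "task-b")

# Scale down by one replica: the route must survive.
still_up = table.remove_route("demo-service", "task-a")
```

With the tag in place, the route is invalidated only when the last remaining replica emits REMOVE_ROUTE, which keeps the time-limited service use case working as before.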

stanislav.hejny added this to the V.01 milestone 2025-11-02 07:36:13 +00:00
Author
Member

Implemented in AMQ Library v0.2.71 as part of issue #81 (ensure that CI/CD pipeline operates)


Reference
clevermicro/amq-adapter-python#93