Integrate CleverMicro and CleverSwarm in a shared environment and test the Backpressure scale UP/DOWN #57

Closed

opened 2025-06-23 13:08:42 +00:00 by stanislav.hejny · 0 comments

Current situation.

CleverMicro defines an OVERLOAD condition, which automatically triggers a SCALE-UP event when either of the following holds:

1. The average CPU load exceeds OVERLOAD_THRESHOLD (configurable, say 90%) for the last continuous OVERLOAD_WINDOW seconds (configurable, currently set to 30 seconds).
2. The number of parallel requests served by the node equals or exceeds the predefined node capacity (PARALLEL_THRESHOLD).

Conversely, the IDLE condition is met when the average CPU usage stays below IDLE_THRESHOLD for a continuous IDLE_WINDOW seconds (initially set to 10% and 30 seconds), or when the number of parallel requests being served is 0 for the last continuous IDLE_WINDOW seconds.
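The two conditions above can be sketched as follows. This is only an illustration, not CleverMicro's actual code: the `Sample` type, the helper names, and the value of `PARALLEL_THRESHOLD` are hypothetical, and "exceeds the threshold for the last continuous window" is assumed to mean that every sample in the window is above (or below) the threshold and that the history covers the whole window.

```python
from dataclasses import dataclass
from typing import List

# Values quoted in the issue; PARALLEL_THRESHOLD is a made-up example value.
OVERLOAD_THRESHOLD = 90.0   # percent CPU
OVERLOAD_WINDOW = 30.0      # seconds
IDLE_THRESHOLD = 10.0       # percent CPU
IDLE_WINDOW = 30.0          # seconds
PARALLEL_THRESHOLD = 4      # hypothetical node capacity


@dataclass(frozen=True)
class Sample:
    """One monitoring sample (hypothetical structure)."""
    t: float        # sample time in seconds
    cpu: float      # average CPU load in percent
    parallel: int   # requests being served at time t


def _window(samples: List[Sample], now: float, window: float) -> List[Sample]:
    """Samples in [now - window, now]; empty if history is shorter than the window."""
    if not samples or samples[0].t > now - window:
        return []
    return [s for s in samples if now - window <= s.t <= now]


def is_overloaded(samples: List[Sample], now: float) -> bool:
    recent = _window(samples, now, OVERLOAD_WINDOW)
    # Condition 1: CPU above the threshold for the whole window.
    cpu_high = bool(recent) and all(s.cpu > OVERLOAD_THRESHOLD for s in recent)
    # Condition 2: the latest sample is at or above node capacity.
    at_capacity = bool(samples) and samples[-1].parallel >= PARALLEL_THRESHOLD
    return cpu_high or at_capacity


def is_idle(samples: List[Sample], now: float) -> bool:
    recent = _window(samples, now, IDLE_WINDOW)
    if not recent:
        return False
    cpu_low = all(s.cpu < IDLE_THRESHOLD for s in recent)
    no_requests = all(s.parallel == 0 for s in recent)
    return cpu_low or no_requests
```

With samples taken every 5 seconds, a node that reports 95% CPU for 30 seconds would be flagged by `is_overloaded`, while a node at 50% CPU serving one request at a time would be neither overloaded nor idle.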

The Challenge:

1. CleverSwarm is not designed for parallel execution: the methods that process uploaded files are not re-entrant, so CleverMicro cannot use even the pseudo-parallelization that Python's Thread class offers, and the code is hard-coded to use the single-threaded variant of the handler. The parallelisation level can therefore never exceed 1, and the OVERLOAD condition can never be satisfied based on the number of requests being handled concurrently.
2. There are a lot of I/O operations, which reduce CPU usage, so a consistently high CPU load may not be achievable and the CPU-based OVERLOAD condition may never be triggered.
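Both points can be illustrated with a small simulation. It is a sketch under assumptions, not CleverSwarm code: `SerialHandler` is a hypothetical stand-in for the single-threaded handler, `time.sleep` stands in for file I/O, and `PARALLEL_THRESHOLD` is assumed to be greater than 1.

```python
import time

PARALLEL_THRESHOLD = 4  # hypothetical node capacity, assumed to be > 1


class SerialHandler:
    """Hypothetical single-threaded handler that tracks in-flight requests."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.peak_in_flight = 0

    def handle(self, request: int, io_delay: float = 0.01) -> None:
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        try:
            # Stand-in for file I/O: the thread waits but burns almost no CPU.
            time.sleep(io_delay)
        finally:
            self.in_flight -= 1


handler = SerialHandler()
wall_start = time.perf_counter()
cpu_start = time.process_time()

# Requests are processed strictly one after another.
for request in range(20):
    handler.handle(request)

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
cpu_fraction = cpu_elapsed / wall_elapsed

print(f"peak in-flight requests: {handler.peak_in_flight}")
print(f"CPU time / wall time: {cpu_fraction:.2%}")
```

Because each call finishes before the next one starts, the peak in-flight count stays at 1 and never reaches `PARALLEL_THRESHOLD`, and the CPU share stays far below a 90% threshold regardless of how many requests are waiting.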

SUGGESTION:

I believe we may need to add a monitor of the inbound queue length and trigger an OVERLOAD condition when the length of the inbound queue exceeds a predefined threshold.
We have discussed this feature with @freemo before and rejected this monitoring option, because the Backpressure feature is meant to be pre-emptive, scaling up the instance before it runs out of resources.
However, in the context of CleverSwarm, I believe the inbound queue is the only measure we have for assessing how fast the service is consuming incoming requests.
We understand that this is a reactive measure that diverges from the original intent, but we believe that for a single-threaded, I/O-heavy consumer service it is the only measure able to assess the current back-pressure.
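A minimal sketch of the proposed check, assuming the inbound queue is (or can be mirrored by) a standard `queue.Queue`. The name `QUEUE_THRESHOLD`, its value, and the function `queue_overloaded` are hypothetical and do not exist in CleverMicro today.

```python
import queue

QUEUE_THRESHOLD = 5  # hypothetical: maximum tolerated backlog of requests


def queue_overloaded(inbound: queue.Queue, threshold: int = QUEUE_THRESHOLD) -> bool:
    """Report OVERLOAD when the backlog exceeds the threshold.

    qsize() is exact here because only one thread touches the queue;
    with concurrent producers it is only an approximation.
    """
    return inbound.qsize() > threshold


inbound: queue.Queue = queue.Queue()

# Eight requests arrive faster than the single-threaded consumer can serve them.
for request_id in range(8):
    inbound.put(request_id)
overloaded_before = queue_overloaded(inbound)  # backlog of 8 exceeds 5

# The consumer works through three requests, one at a time.
for _ in range(3):
    inbound.get()
    inbound.task_done()
overloaded_after = queue_overloaded(inbound)   # backlog of 5 no longer exceeds 5

print(overloaded_before, overloaded_after)
```

In this sketch the signal depends only on the backlog, so it still fires when parallelism is capped at 1 and CPU usage is low. It is reactive by construction: it can only report OVERLOAD once requests are already waiting.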

Context: now that there is a shared environment (which receives CleverMicro updates from the git merge requests), we are going to test the scalability / backpressure management. However, we first need an opinion on the actual implementation provided by CleverMicro, as described above.
