(Article published at https://medium.com/google-cloud/autoscaling-in-dataproc-e02bf446a509)

“Autoscaling” is a Dataproc API that automates the process of monitoring YARN memory utilisation and adding/removing capacity to achieve optimal usage. It removes both the need to over-provision the cluster for seasonal spikes and the risk of under-provisioning it, which could cause business-critical latencies.
Dataproc’s Autoscaling control plane (in true Forrest Gump fashion) is
Easy as Easy Gets
Dataproc attempts to keep configuration complexity to a minimum, which means there are only a few knobs and levers available to tune an autoscaling policy.
With so few configurations, when customers run into an issue with autoscaling, they are often left wondering: “Where could we have gone wrong?”
In this post, we will address three questions that customers ask while using autoscaling, all of which share one common answer:
- Why is my cluster not scaling up though I have a new burst of workloads?
- Why is my cluster not scaling down? Why am I paying for idle nodes though there is only a very small workload running?
- Why is the cluster in an “UPDATING” state for a long time and not allowing me to perform other updates to the cluster?
Good To Know
A Dataproc cluster has 1 or 3 (for HA) master nodes that run the name node daemon and a bunch of worker machines that do all the heavy-lifting (compute). There are 2 types of workers –
- Primary Workers — Workers that participate in HDFS storage and also contribute to compute
- Secondary Workers — Workers that DO NOT participate in HDFS and ONLY perform compute
With this setup, you can add or remove secondary workers as required without impacting any of your HDFS data. This adding/removing of secondary workers can be done manually or you can use the Autoscaling API.
Primary workers can be scaled up/down too, but this is not recommended. You’d usually want to right-size your base capacity (primary worker pool) based on your BAU workloads and use secondary workers to accommodate spiky workloads.
As of 2022, Dataproc only considers YARN memory when deciding whether to scale. The algorithm is simple:
- If there’s more “Available Memory” than “Pending Memory” in the cluster, try to remove nodes
- If there’s more “Pending Memory” than “Available Memory” in the cluster, try to add nodes
There is a bit of 7th grade mathematics behind the calculations, but we will skip those implementation details in this post. The Dataproc autoscaling documentation covers them in more detail.
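The decision rule above can be sketched in a few lines of Python. This is only an illustration of the comparison described in this post, not Dataproc's actual implementation; the function name, the return values, and the megabyte inputs are assumptions made for the example.

```python
def scaling_direction(pending_mb: float, available_mb: float) -> str:
    """Return which way the autoscaler would lean, per the simplified rule above."""
    if pending_mb > available_mb:
        # More work is waiting than there is free YARN memory.
        return "scale_up"
    if available_mb > pending_mb:
        # More free YARN memory than waiting work.
        return "scale_down"
    return "no_change"


# Hypothetical readings of YARN pending and available memory, in MB.
print(scaling_direction(pending_mb=51_200, available_mb=8_192))
print(scaling_direction(pending_mb=0, available_mb=16_384))
```

How many nodes are actually added or removed depends on the policy settings discussed later in this post.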
Solution
Troubleshooting
Dataproc doesn’t allow concurrent updates. This means that if a cluster is going through an update of any kind (labels being renamed, the cluster being resized, properties being modified, and so on), no other update operation can be performed during that time.
Resizing a cluster, i.e. adding or removing nodes, is a type of update operation that happens in Dataproc’s control plane.
It is possible that the cluster is undergoing a downscale operation as part of a recommendation made by the autoscaler engine and that is blocking further updates in spite of a new workload that demands additional memory.
A downscaling operation accounts for multiple factors. Put simply, once the autoscaler engine determines that the cluster has nodes that can be removed, it marks them for removal and the corresponding capacity is immediately taken away from YARN. However, the machines are not killed immediately. If the machines have applications already running on them or contain shuffle/scratch data waiting to be served to an application running on other nodes, then Dataproc will wait for the jobs to finish, for the shuffle data to be served, or for the graceful decommissioning timeout window to close, whichever is earlier.
Time taken to complete downscaling = MIN(
  time taken by existing applications to finish,
  time taken for shuffle data on the node to get served,
  graceful decommissioning timeout
)
This means that downscaling, on many occasions, is not an instantaneous activity and can cause the cluster to be in prolonged “UPDATING” periods. This is especially prevalent when not using Enhanced Flexibility Mode and with long Graceful Decommissioning Timeouts (GDT).
To verify this, go to Cloud Logging and use this query to filter down to the autoscaler logs:
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="<cluster-name>"
log_name="projects/<project>/logs/dataproc.googleapis.com%2Fautoscaler"
This query will pull up all the autoscaler logs for the cluster. A quick way to determine if your cluster has a pattern of longish downscaling windows is to view the histogram generated in Logging. To get a more definitive picture, increase the time window to a multiple of your graceful decommissioning timeout (at least 2x).
In this example, I’ve used a 6-hour window since the GDT was 3 hours.

Notice how the autoscaler shows bursts of logging activity, goes silent for 3 hours and then again wakes up before going silent again? Those 3 hours are the autoscaling policy’s GDT. This is an indication that the cluster is frequently running into a scenario where many idle nodes are waiting to serve shuffle data before getting removed.
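The burst-then-silence pattern described above can also be checked programmatically once the autoscaler log timestamps have been exported. The following is a minimal sketch, assuming the timestamps are already available as Python datetime objects; the function name, the 2.5-hour threshold, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def find_silent_gaps(timestamps, min_gap):
    """Return (start, end) pairs where consecutive log entries are at least min_gap apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a >= min_gap]


# Hypothetical autoscaler log timestamps: a burst, a long silence, another burst.
base = datetime(2022, 1, 1, 0, 0)
entries = [
    base,
    base + timedelta(minutes=2),
    base + timedelta(minutes=4),
    base + timedelta(hours=3, minutes=5),
    base + timedelta(hours=3, minutes=7),
]

# Look for silences close to a 3-hour GDT.
gaps = find_silent_gaps(entries, min_gap=timedelta(hours=2, minutes=30))
for start, end in gaps:
    print(f"silent from {start} to {end} ({end - start})")
```

If the reported gaps repeatedly match the policy's GDT, that is consistent with the cluster waiting out the full timeout on each downscale.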
There are a few repercussions:
– You are paying for idle nodes: their capacity has already been taken away from YARN, but the machines are still running.
– New workloads will not be given additional capacity until the pending downscale completes, which can take the full GDT.
So What Can Be Done?
- The idea is to bring the GDT window as close to zero as possible. If your applications are resilient enough and some latency incurred due to restarts is acceptable, then set the GDT to a low value (or even zero) so that nodes are removed immediately, irrespective of whether they are idle or not. Fail fast, recover sooner.
- Enhanced Flexibility Mode (EFM) is a Dataproc feature that allows shuffle data to be stored only on the primary worker nodes, thereby allowing secondary workers to downscale immediately. Of course, any applications running on them will need to restart on a different node.
- If modifying the GDT is not acceptable for your business needs, you can tune the following knobs in the autoscaling policy:
 – Scale up/down factor (0–1)
 This controls how aggressively the cluster scales up or down.
 Below is an illustration of how the scale-up factor works. The math is the same for the scale-down factor, except that the variable is the available memory instead of the pending memory.
Let’s say that the amount of memory to scale up by = f(x,y,z)
where
x = Number of pending containers, e.g. 100
y = Memory requested by each container, e.g. 512 MB
z = Scale up factor, e.g. 0.5
Therefore, the total pending YARN memory = x*y = 50 GB
Amount of memory to scale up by = Total pending YARN memory * Scale up factor = 25 GB
So if each node has 8 GB of memory and assuming 80% allocation to YARN,
For z=0.5, number of nodes = (100*512*0.5) / (8*1000*0.8) = 4
For z=1, number of nodes = (100*512*1) / (8*1000*0.8) = 8
 – Scale up/down min worker fraction
This controls how frequently the cluster scales up or down.
As with the previous knob, the illustration below applies to both the scale-up and scale-down fractions.
Let the amount of memory to scale up by = f(x,y,z)
where
x = Number of pending containers, e.g. 100
y = Memory requested by each container, e.g. 512 MB
z = Scale up factor, e.g. 0.5
N = Number of nodes in the cluster, e.g. 20
S = scaleUpMinWorkerFraction, e.g. 0.3
M = N*S = 6 nodes
Total pending YARN memory = x*y = 50 GB
Amount of memory to scale up by = Total pending YARN memory * Scale up factor = 25 GB
So if each node has 8 GB of memory and assuming 80% allocation to YARN,
For z=0.5:
Nodes = (100*512*0.5) / (8*1000*0.8) = 4
Since this is less than M = 6, no scale-up will occur.
For z=1:
Nodes = (100*512*1) / (8*1000*0.8) = 8
Since this is greater than M = 6, the cluster will scale up by 8 nodes.
- If the scale-down min worker fraction is set to a higher value, then the cluster will only scale down when there is a larger chunk of nodes to remove. This will reduce the frequency of downscaling and therefore reduce the probability of the cluster being in an UPDATING state when additional workloads are submitted. This will also help reduce frequent oscillations in cluster capacity.
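The two worked examples above can be reproduced with a short Python sketch. It follows the arithmetic used in this post (8 GB taken as 8*1000 MB, 80% of node memory given to YARN) and is not Dataproc's actual implementation; the function and parameter names are hypothetical, and rounding to the nearest whole node is an assumption.

```python
def nodes_to_add(pending_containers, container_mb, scale_up_factor,
                 node_mb, yarn_fraction, current_nodes, min_worker_fraction):
    """Estimate how many nodes a scale-up would add, or 0 if below the threshold."""
    pending_mb = pending_containers * container_mb    # x * y
    target_mb = pending_mb * scale_up_factor          # x * y * z
    yarn_mb_per_node = node_mb * yarn_fraction        # memory each node gives to YARN
    nodes = round(target_mb / yarn_mb_per_node)       # assumption: nearest whole node
    threshold = current_nodes * min_worker_fraction   # M = N * S
    # Changes smaller than the threshold are skipped entirely.
    return nodes if nodes >= threshold else 0


# Same inputs as the illustration: 100 containers of 512 MB, 20 nodes, S = 0.3.
for z in (0.5, 1.0):
    added = nodes_to_add(100, 512, z, node_mb=8 * 1000, yarn_fraction=0.8,
                         current_nodes=20, min_worker_fraction=0.3)
    print(f"scale-up factor {z}: add {added} nodes")
```

Passing `min_worker_fraction=0` reproduces the first illustration, where only the scale-up factor is in play.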
Summary
This post details one of the several reasons why an update operation is stuck or not working as expected. Most other reasons boil down to an infrastructure issue or a corrupt service. The troubleshooting detailed in this post explains a reason that arises out of application logic or cluster configurations. Dataproc’s autoscaling policy tries to keep the implementation as simple as possible for the end user but if not properly tuned, it can lead to erratic performance and increased costs.
 
          
        