Autoscaling In Dataproc

August 16, 2022

Scalability is one of “THE” most important reasons why customers choose to migrate to the cloud. And as with all enterprises, customers who find themselves at such a crossroads have most likely already done most of what they could with bare-metal servers and self-managed, over-provisioned data centres.

Before “Cloud” became a thing, distributed computing was the go-to model for enabling scalability, and Hadoop was at the forefront of distributed data processing technologies. But as businesses started becoming smarter (and thriftier), they realised that provisioning and maintaining a Hadoop cluster was not easy, and seasonal spikes in workloads meant that clusters were over-provisioned for most of the year, resulting in an unnecessary increase in annual costs.

This is where managed Hadoop offerings like Dataproc help customers optimise their investments. In this post, we will focus on “Autoscaling”, a feature that comes with all managed Hadoop offerings on the cloud and takes away the overhead of right-sizing the cluster for seasonal spikes at almost zero operational cost.

A Dataproc cluster has 1 (or 3) master nodes that run the NameNode daemon and a bunch of worker machines that do all the heavy lifting (compute). There are 2 types of workers –

  • Primary Workers – Workers that participate in HDFS storage and also contribute to compute
  • Secondary Workers – Workers that DO NOT participate in HDFS and ONLY perform compute
With this setup, you can add or remove secondary workers as required without impacting any of your HDFS data. This adding/removing of secondary workers can be done manually, or you can leave it to Dataproc to do it for you.
Note – Primary workers can be scaled up/down too, but that is not a recommended pattern. You’d usually want to right-size your base capacity (the primary worker pool) for your BAU workloads and use secondary workers to accommodate spiky workloads.
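
For reference, resizing the secondary worker pool by hand is a one-liner with gcloud or a short script against the Dataproc API. The snippet below is a minimal sketch using the google-cloud-dataproc Python client; the project, region and cluster values are placeholders, and the update-mask path should be double-checked against the current client library documentation.

    from google.cloud import dataproc_v1

    # Placeholder values - substitute your own
    project_id, region, cluster_name = "my-project", "us-central1", "my-cluster"

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Only the secondary (compute-only) worker count is changed;
    # primary workers and HDFS data are left untouched.
    operation = client.update_cluster(
        project_id=project_id,
        region=region,
        cluster_name=cluster_name,
        cluster={"config": {"secondary_worker_config": {"num_instances": 10}}},
        update_mask={"paths": ["config.secondary_worker_config.num_instances"]},
    )
    operation.result()  # blocks until the resize completes
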
Dataproc’s Autoscaling control plane (in true Forrest Gump fashion) is 

Easy as Easy Gets.  

As of today (August 2022), Dataproc only considers YARN memory when deciding whether to scale or not. The underlying algorithm used to arrive at the actual numbers involves a bit of 7th-grade math, but feel free to skip over the formula if you find it repulsive. You cannot control or change how these calculations are performed.

[Image: Dataproc autoscaling formula]

At a 50,000-foot level, all it does is the following (sketched in code right after this list) –

  • If there’s “Available Memory” in the cluster, try to remove nodes
  • If there’s “Pending Memory” in the cluster, try to add nodes
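
If you strip away the details, the per-evaluation decision can be sketched in a few lines of Python. This is an illustration only (the function and variable names are mine, and the real algorithm averages these metrics over the cooldown period), but it captures the intent:

    import math

    def scaling_recommendation(pending_mb, available_mb, worker_yarn_mb,
                               scale_up_factor, scale_down_factor):
        """Simplified sketch of the YARN-memory-driven scaling decision.

        Returns a positive number of workers to add, a negative number
        to remove, or 0 for no change.
        """
        if pending_mb > 0:
            # Containers are queued waiting for memory -> consider adding workers
            return math.ceil(pending_mb * scale_up_factor / worker_yarn_mb)
        if available_mb > 0:
            # Memory is sitting idle -> consider removing workers
            return -math.floor(available_mb * scale_down_factor / worker_yarn_mb)
        return 0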

Dataproc’s Autoscaling Policy does provide a few knobs and levers that you can use to – 

  • Set the minimum/maximum size of the cluster
  • Set how frequently to check if scaling is required or not (Cooldown Period)
  • Set how aggressively the cluster scales (scale up/down factors)
  • Set how frequently the cluster actually scales (scale up/down min worker fractions)
The last two are very important knobs that a lot of customers tend to leave at their OOTB default values without paying attention to their exact roles in tuning the policy. Let’s dive deeper with some sample numbers!
Scale Up Factor and Scale Down Factor
Let’s say that the amount of memory to scale up by = f(x, y, z)
where
x = number of pending containers, e.g. 100
y = memory requested by each container, e.g. 512 MB
z = scale up factor, e.g. 0.5
Therefore, the total pending YARN memory = x*y = 100*512 MB ≈ 50 GB
Amount of memory to scale up by = total pending YARN memory * scale up factor = 25 GB

So if each node has 8 GB of memory and assuming 80% of it is allocated to YARN (6.4 GB per node):
For z = 0.5, number of nodes = (100*512*0.5) / (8*1000*0.8) = 4
For z = 1, number of nodes = (100*512*1) / (8*1000*0.8) = 8
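
The same arithmetic in plain Python, using the sample numbers above (purely illustrative):

    import math

    pending_mb = 100 * 512           # 100 containers x 512 MB ≈ 50 GB pending
    worker_yarn_mb = 8 * 1000 * 0.8  # 8 GB node, ~80% of it handed to YARN

    for z in (0.5, 1.0):
        nodes = math.ceil(pending_mb * z / worker_yarn_mb)
        print(f"scale up factor {z}: add {nodes} nodes")
    # scale up factor 0.5: add 4 nodes
    # scale up factor 1.0: add 8 nodes
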

 
The same logic applies to the scale down factor, except that instead of the pending memory, the formula uses the available memory as the driving variable. As you can see, the higher the scale up/down factor, the more aggressive the node addition/removal. If the current pending memory demands adding 10 nodes, a scale up factor of 1 will add all 10; a factor of 0.5 will add only 50% of the actual ask.
Dataproc will then wait for the duration defined in the “Cooldown Period” and check again whether scaling is required.
 
Scale Up/Down Min Worker Fraction
Let the amount of memory to scale up by = f(x, y, z)
where
x = number of pending containers, e.g. 100
y = memory requested by each container, e.g. 512 MB
z = scale up factor, e.g. 0.5
N = number of nodes in the cluster, e.g. 20
S = scaleUpMinWorkerFraction, e.g. 0.3
M = N*S = 6 nodes
Total pending YARN memory = x*y = 100*512 MB ≈ 50 GB
Amount of memory to scale up by = total pending YARN memory * scale up factor = 25 GB
So if each node has 8 GB of memory and assuming 80% of it is allocated to YARN (6.4 GB per node):
For z = 0.5,
Nodes = (100*512*0.5) / (8*1000*0.8) = 4.
Since this is less than M = 6, no scale up will occur.
For z = 1,
Nodes = (100*512*1) / (8*1000*0.8) = 8.
Since this is greater than M = 6, the cluster will scale up by 8.
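
In code, the min worker fraction behaves like a simple threshold on top of the recommendation. Again a hypothetical helper, not the actual implementation:

    def apply_min_worker_fraction(recommended_nodes, cluster_size, min_worker_fraction):
        """Ignore the recommendation if it is smaller than fraction * cluster size."""
        threshold = cluster_size * min_worker_fraction  # e.g. 20 * 0.3 = 6 nodes
        return recommended_nodes if recommended_nodes >= threshold else 0

    print(apply_min_worker_fraction(4, 20, 0.3))  # 0 -> ask is too small, no scale up
    print(apply_min_worker_fraction(8, 20, 0.3))  # 8 -> exceeds threshold, scale up by 8
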
 
These knobs effectively let you control the scaling step size as a direct function of the cluster size. This is very useful for avoiding frequent oscillations in your total node count. Downscaling every time there’s a recommendation to remove 1-2 workers becomes an overhead, since a subsequent workload may need to spin up additional workers immediately after the downscaling operation.

Another issue is that workers marked for removal may hold shuffle data and/or running tasks. Unless you are using EFM (Enhanced Flexibility Mode, applicable and recommended only for Spark), forcefully downscaling such nodes will result in task failures and lead to an increase in execution time.

If you use YARN’s graceful decommissioning timeout, the cluster will wait in the UPDATING state for the duration of the timeout and will not allow any other update operations, including upscaling. This becomes a bottleneck for new workloads that are submitted with increased memory requirements.

Understanding how these levers and knobs function will help you create a well-tuned autoscaling policy that aligns with your SLA requirements and workload patterns.
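
To make that concrete, here is what a policy tuned for spiky workloads might look like, expressed as a Python dict that mirrors the autoscalingPolicies YAML you would import with 'gcloud dataproc autoscaling-policies import'. The policy name and values are examples, not recommendations; verify the field names against the current Dataproc documentation.

    # Illustrative autoscaling policy - values are examples, not recommendations
    autoscaling_policy = {
        "id": "spiky-workload-policy",       # hypothetical policy name
        "workerConfig": {                    # primary workers stay fixed (BAU capacity)
            "minInstances": 10,
            "maxInstances": 10,
        },
        "secondaryWorkerConfig": {           # secondary workers absorb the spikes
            "minInstances": 0,
            "maxInstances": 100,
        },
        "basicAlgorithm": {
            "cooldownPeriod": "120s",        # how often scaling is evaluated
            "yarnConfig": {
                "scaleUpFactor": 1.0,        # add the full ask when memory is pending
                "scaleDownFactor": 0.5,      # remove capacity more conservatively
                "scaleUpMinWorkerFraction": 0.0,
                "scaleDownMinWorkerFraction": 0.3,  # skip tiny downscale steps
                "gracefulDecommissionTimeout": "3600s",
            },
        },
    }
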

Drop a comment below and I will be happy to address any queries you may have.