Which Type of Cluster to use in Databricks?

What is a cluster in Databricks?

A Databricks cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics, ad hoc analysis, and machine learning.
You run these workloads as commands in a notebook or as automated jobs. Databricks distinguishes between all-purpose clusters and job clusters. You use all-purpose clusters to analyze data collaboratively in interactive notebooks. You use job clusters to run fast, robust automated jobs.

It is very important to know when to use which type of cluster.

So here is my story,

Over the last few months, the Databricks DBU cost in my project increased significantly beyond what was planned and budgeted. On investigation, the majority of the cost was attributed to jobs running on "All-purpose compute".

When a cluster is set up as "All-purpose compute", the DBU price is roughly 3x higher than when the same jobs run on a "Jobs compute" cluster.

Azure Databricks bills you for virtual machines (VMs) provisioned in clusters and Databricks Units (DBUs) based on the VM instance selected. A DBU is a unit of processing capability, billed on per-second usage. The DBU consumption depends on the size and type of instance running Azure Databricks.

| Workload | DBU price (standard tier) | DBU price (premium tier) |
| --- | --- | --- |
| All-Purpose Compute | $0.40/DBU-hour | $0.55/DBU-hour |
| Jobs Compute | $0.15/DBU-hour | $0.30/DBU-hour |

So if you just need to run a Databricks job, use Jobs compute to save on project cost; for debugging, developing, and analyzing notebook code interactively, all-purpose compute remains the right choice.
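In practice, the switch comes down to how the cluster is declared in the job definition. The sketch below shows a Databricks Jobs API 2.1 payload that requests a job cluster via `new_cluster` (billed at the Jobs Compute rate) rather than attaching to an all-purpose cluster via `existing_cluster_id`; the job name, notebook path, node type, and runtime version are illustrative placeholders, not values from my project:

```python
# Sketch of a Jobs API 2.1 job payload that runs on Jobs compute.
# "new_cluster" makes Databricks spin up a job cluster just for this run,
# billed at the Jobs Compute rate; "existing_cluster_id" (commented out)
# would attach the task to an all-purpose cluster instead. All concrete
# values (name, paths, node type, runtime version) are placeholders.

job_payload = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "run_etl_notebook",
            "notebook_task": {"notebook_path": "/Repos/etl/main"},
            # Jobs compute: a fresh cluster created and terminated per run.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
            # All-purpose compute (roughly 3x the DBU price) would instead use:
            # "existing_cluster_id": "<cluster-id>",
        }
    ],
}
```

A side benefit of the `new_cluster` form is that the job cluster terminates as soon as the run finishes, so you never pay for an idle interactive cluster that someone forgot to shut down.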

#databricks #clusters