Databricks ARC Compute Optimization
Checklist: Optimizing Azure Databricks Compute (with Serverless Focus)
Cluster & Compute Configuration
Jobs
- Use Serverless Compute by Default (if supported): Favor Azure Databricks Serverless Jobs Compute whenever possible. It requires no manual sizing or VM management – Databricks handles provisioning and scaling automatically. This simplicity reduces setup errors and delivers fast startup times (usually a few seconds). A cluster policy is set up for these jobs; please use it if available.
- Use Performance-Optimized Serverless Compute Where Necessary: If your workload requires more performance, use the performance-optimized tier of serverless compute. It is a higher-cost option, but it provides better performance and shorter spin-up times for workloads that need them.
- If serverless is not a valid option then prefer Jobs Clusters over All-Purpose Clusters: Run production ETL pipelines and ML training as Jobs (scheduled or manual runs) on dedicated job clusters. Jobs compute is priced lower (per DBU) than all-purpose interactive clusters and isolates workloads. Each job can use a fresh cluster to avoid interference, or a multi-task job can reuse one cluster so startup happens once for the whole pipeline. There are job cluster policies available to enforce best practices, such as auto-termination and autoscaling.
- Avoid All-Purpose Clusters for Job Workloads: This should almost never be done – it is both expensive and can cause resource contention with production jobs. Please talk to the Data Analytics team if you think you need an all-purpose cluster. If you must use one, ensure it has a short auto-termination timeout (e.g. 10-30 minutes) to avoid leaving it running idle.
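The job-cluster reuse described above can be sketched as a Jobs API payload. This is a minimal illustration, not a team standard: the job name, notebook paths, node type, and runtime version are all hypothetical placeholders.

```python
import json

# Sketch of a multi-task job that reuses one job cluster, so the cluster
# starts once for the whole pipeline. All names here are illustrative.
job_spec = {
    "name": "nightly_etl",
    "job_clusters": [
        {
            "job_cluster_key": "shared_etl_cluster",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",   # example runtime
                "node_type_id": "Standard_D8ds_v5",    # example VM type
                "autoscale": {"min_workers": 2, "max_workers": 8},
            },
        }
    ],
    "tasks": [
        # Both tasks point at the same job_cluster_key, so they share compute.
        {
            "task_key": "ingest",
            "job_cluster_key": "shared_etl_cluster",
            "notebook_task": {"notebook_path": "/Repos/team/etl/ingest"},
        },
        {
            "task_key": "transform",
            "job_cluster_key": "shared_etl_cluster",
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Repos/team/etl/transform"},
        },
    ],
}
print(json.dumps(job_spec, indent=2))
```

A payload like this would be submitted via the Jobs API or checked in as job configuration; the key point is the shared `job_cluster_key` across tasks.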
- Use SQL Warehouses: For SQL analytics or BI workloads, prefer SQL Warehouses – a shared warehouse should already be running, or a serverless option is available. SQL Warehouses are optimized for SQL workloads and handle interactive queries efficiently.
Development
- Use Serverless: For development work, prefer using Serverless Compute (Serverless Jobs Compute or Serverless SQL Warehouses) to avoid the overhead of managing clusters. This allows for quick experimentation without worrying about cluster management. It also matches the environment used in production jobs, ensuring consistency.
- Use Personal Compute Clusters for Development: For ad-hoc development or exploratory work, use small personal clusters that auto-terminate after a short idle period (e.g. 10 minutes). They give developers more control over compute configuration while remaining cost-efficient, and they are ideal for quick tests or small data-exploration tasks. Because they are not shared with other users, you can run your own jobs without impacting anyone else.
General Best Practices
- Leverage Serverless SQL Warehouses for Ad Hoc Analytics: For interactive SQL queries or BI tools, use Serverless SQL Warehouses instead of all-purpose clusters. They spin up in seconds and auto-pause when idle, enabling you to terminate idle compute without long restart delays. (Non-serverless SQL endpoints take minutes to start, so teams often leave them running; serverless avoids that by supporting nearly instant on-demand usage.)
- Enable Auto-Termination on All Clusters: Configure automatic termination for clusters to shut down after a short idle period. This prevents paying for idle time. For interactive clusters, 10-30 minutes of inactivity is a common setting. For serverless resources, you can go even lower – e.g. serverless SQL warehouses default to 10 minutes idle auto-stop (minimum 5 min via UI, or 1 min via API) – allowing aggressive shutdown since restart is fast.
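As a concrete illustration of the auto-termination setting, here is a minimal interactive cluster spec in the shape of the Clusters API; the cluster name, node type, and runtime are hypothetical examples.

```python
# Sketch of an interactive cluster spec with auto-termination.
# All concrete values (name, VM type, runtime) are illustrative.
interactive_cluster = {
    "cluster_name": "dev-adhoc",
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_D4ds_v5",
    "num_workers": 2,
    # Shut the cluster down after 20 idle minutes (within the 10-30 min
    # range suggested above for interactive clusters).
    "autotermination_minutes": 20,
}
print(interactive_cluster["autotermination_minutes"])
```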
- Consider Single Node Clusters for Small Jobs: For lightweight ETL or ad-hoc analysis, use single-node clusters (1 driver, 0 workers) to avoid unnecessary overhead. This is especially useful for small data processing tasks that don’t require distributed compute. Single-node clusters are cheaper and start up quickly, making them ideal for quick tests or small jobs.
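A single-node cluster can be expressed via the Clusters API by setting zero workers plus the single-node Spark profile; the sketch below follows that documented pattern, with an illustrative name, node type, and runtime.

```python
# Sketch of a single-node (driver-only) cluster spec.
single_node_cluster = {
    "cluster_name": "small-etl",               # illustrative name
    "spark_version": "15.4.x-scala2.12",       # example runtime
    "node_type_id": "Standard_D4ds_v5",        # example VM type
    "num_workers": 0,                          # driver only, no workers
    "spark_conf": {
        # These two settings mark the cluster as single-node.
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
print(single_node_cluster["num_workers"])
```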
- Only use multi-node clusters when needed: For larger workloads, use multi-node clusters with autoscaling enabled. This allows the cluster to scale up during peak loads and down when idle, optimizing resource usage. Avoid using large clusters for small jobs, as this leads to wasted resources and higher costs.
- Turn On Autoscaling: When using non-serverless clusters, enable autoscaling so the cluster can scale up to meet peak demand and scale down when load drops. This avoids over-provisioning a cluster for its entire lifetime. (For streaming workloads, consider Databricks Enhanced Autoscaling or Lakehouse pipelines for more dynamic scaling.)
- Use Cluster Pools for Faster Startup (if not Serverless): If you must use custom clusters (for unsupported workloads), configure a pool of idle instances to cut down startup and scaling times. Pools keep VMs ready without incurring DBU charges while idle, reducing cluster spin-up delay and saving cost on frequent job runs.
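A cluster drawing from a pool references the pool instead of a VM type; the pool ID below is a hypothetical placeholder.

```python
# Sketch of a cluster spec that draws workers from a warm instance pool.
# When instance_pool_id is set, the pool determines the VM type, so
# node_type_id is omitted. The pool ID is a hypothetical placeholder.
pooled_cluster = {
    "spark_version": "15.4.x-scala2.12",
    "instance_pool_id": "pool-etl-workers",
    "autoscale": {"min_workers": 1, "max_workers": 4},
}
print(pooled_cluster["instance_pool_id"])
```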
- Apply Cluster Policies for Consistency: Work with your admins to enforce cluster policies that embed these best practices. For example, policies can require autoscaling with a reasonable min/max, enforce auto-termination timeouts (e.g. 1 hour max), restrict extremely large or costly instance types, and disallow long-running all-purpose clusters for certain users. Predefined policy templates (personal compute, jobs compute, etc.) can simplify cluster setup and ensure cost-efficient choices.
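The policy examples above can be encoded as a cluster policy definition. The sketch below uses the standard policy attribute paths and rule types (`range`, `allowlist`); the specific limits and VM types are illustrative choices, not mandated values.

```python
# Sketch of a cluster policy definition enforcing the practices above:
# capped auto-termination, bounded autoscaling, and an allowlist of
# VM types. Concrete limits and types are illustrative.
policy_definition = {
    "autotermination_minutes": {
        "type": "range", "maxValue": 60, "defaultValue": 30,
    },
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_D4ds_v5", "Standard_D8ds_v5"],
    },
}
print(sorted(policy_definition))
```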
- Right-Size Instance Types: Choose the VM types based on workload profile for optimal price/performance. For example, use memory-optimized instances for heavy shuffle or join workloads and ML training (to reduce out-of-memory or spilling), storage-optimized instances with local SSD for data that benefits from caching (repeat data reads in ad-hoc analysis), and compute-optimized for CPU-intensive tasks like streaming or lightweight ETL maintenance jobs. Avoid oversizing clusters – prefer a few larger nodes vs. many small ones for shuffle-heavy jobs to minimize network I/O. Conversely, for embarrassingly parallel tasks, more smaller nodes might be beneficial – tailor to the job’s needs.
- Use Latest Databricks Runtimes (and Photon where apt): Always run jobs on the newest Databricks Runtime compatible with your code. Newer runtimes bring performance optimizations that can speed up your pipelines (reducing compute time and cost). If your workloads are SQL or DataFrame-heavy, take advantage of Photon, Databricks’s vectorized engine. Photon is enabled by default in serverless SQL warehouses (and can be turned on for clusters) to accelerate queries and lower cost per workload. Note: Photon incurs a higher DBU rate on jobs clusters, so evaluate its benefit for large ETL jobs (some teams see ~10% speedup for ~2× cost – your mileage may vary).
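Enabling Photon on a non-serverless cluster is a single field in the cluster spec; the runtime version and node type below are illustrative.

```python
# Sketch of a cluster spec with Photon enabled via runtime_engine.
# Runtime version and VM type are illustrative; memory-optimized
# instances are shown since Photon often pairs well with SQL/DataFrame
# workloads, as noted above.
photon_cluster = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_E8ds_v5",
    "runtime_engine": "PHOTON",
    "autoscale": {"min_workers": 2, "max_workers": 6},
}
print(photon_cluster["runtime_engine"])
```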
- Limit GPU Clusters to Specific ML Needs: Only use GPU instances if you are running GPU-accelerated libraries (e.g. deep learning training). GPU VMs are much more expensive, so they should be reserved for workloads that truly require them. If used, ensure the code leverages the GPU (e.g., TensorFlow/PyTorch). Otherwise, stick to CPU instances for cost-efficiency.
- Consider Spot Instances for Non-Critical Jobs: When not using serverless, you can configure clusters to use Azure spot instances for worker nodes to save on VM costs. This is suitable for batch jobs that can tolerate retries or slower execution if a spot VM is reclaimed. Always use an on-demand driver node (the Spark master) and evaluate spot for workers to cut costs. This strategy is best for fault-tolerant workloads without strict SLAs.
Job Authoring & Workflow Best Practices
- Schedule and Batch Workloads: There are no strict timing SLAs, but it’s still wise to schedule jobs during off-peak hours if appropriate (for example, heavy ETL at night). This won’t affect cost on serverless directly, but it can avoid interactive users competing for resources. If using cluster pools or spot instances, off-peak scheduling can also improve availability of cheaper compute.
- Favor Notebooks or Repo-based Jobs: Develop jobs as notebooks or packaged code in Repos following team conventions. This makes it easier for others to understand and optimize the code. When everyone uses a similar approach, it’s simpler to apply broad improvements (like switching the cluster type or tweaking configurations).
- ML Training Tips: When developing machine learning jobs, start experimenting on a single-node cluster (for example, a driver-only cluster with a large instance). This avoids the overhead of Spark shuffles for smaller datasets. When scaling up model training, prefer adding a modest number of workers (e.g. move from 1 -> 2 or 4 nodes) and monitor the speedup; do not assume linear scaling, as too many nodes can hurt training due to data shuffle costs. Use the Databricks ML Runtime for built-in optimized libraries and MLflow support. If the training job can use serverless compute and doesn’t require GPUs or custom native libraries, run it on Serverless Jobs Compute for simplicity – otherwise, use a job cluster with an appropriate instance type.
Code & Data Best Practices
- Enable Resource Tagging: Tag your Databricks resources – clusters, jobs, SQL warehouses, pools – with relevant metadata (e.g. Project:XYZ, Team:DataEng, Environment:Production). Azure Databricks allows custom tags, and these propagate to cloud billing and Databricks cost reports. Establish a team convention for tags so you can easily track costs per project or department in Azure Cost Management. We will have tagging policies in place to enforce this eventually.
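Tags are attached to compute via the `custom_tags` field; the sketch below uses the example tags from the bullet above, with illustrative cluster settings.

```python
# Sketch of a cluster spec carrying the team's cost-tracking tags.
# Tag keys/values follow the examples above; other fields are illustrative.
tagged_cluster = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_D4ds_v5",
    "num_workers": 2,
    "custom_tags": {
        "Project": "XYZ",
        "Team": "DataEng",
        "Environment": "Production",
    },
}
print(sorted(tagged_cluster["custom_tags"]))
```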
- Use Databricks Cost Reports and System Tables: Leverage Databricks’ built-in system tables and usage logs to monitor resource utilization and cost. The system.billing.usage table in each workspace (or the account console) can break down DBU usage by cluster, job, user, and more. Databricks provides a pre-built “usage dashboard” that you can import to track daily DBU spend, including separate views for job clusters, all-purpose clusters, serverless SQL, etc. Regularly review these reports to identify cost outliers (e.g., a job that suddenly used far more resources).
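A per-job DBU breakdown from system.billing.usage might look like the query below. Column names follow the published schema, but verify them against your workspace before relying on this; it is a sketch, not a vetted report.

```python
# Hypothetical Databricks SQL query (held as a string) that sums DBU
# usage per job over the last 30 days from system.billing.usage.
# Verify column names against your workspace's system table schema.
usage_by_job_sql = """
SELECT usage_date,
       sku_name,
       usage_metadata.job_id AS job_id,
       SUM(usage_quantity)   AS dbus
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date, sku_name, usage_metadata.job_id
ORDER BY dbus DESC
"""
print(usage_by_job_sql.strip().splitlines()[0])
```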
- Monitor Cluster Performance: Examine the Spark UI and Ganglia metrics for your jobs. Long task durations, frequent spilling to disk, or executor skew indicate inefficiencies that can be optimized (either in code or by adjusting cluster configuration). For jobs, you can attach an execution summary or Spark UI link in the run output for easy access. Over time, tuning these will reduce runtime and costs.
- Periodic Cost and Performance Reviews: Establish a routine (monthly or quarterly) to do a cost audit of Databricks usage. In these reviews, involve both the engineering team and FinOps (if available) to discuss: Are we using the right cluster types? Can some jobs be consolidated? Are there new features (like a newer Photon version or improved autoscaling) we should adopt? This helps ensure continuous improvement. For example, you might find an opportunity to switch an ETL job to a simpler design that runs faster, or determine that a job can safely use spot instances for more savings.