ARC Data Analytics Handbook

Version 0.0.2

All things data analytics at ARC Resources.

Databricks

Databricks is a unified data analytics platform that allows for:

  • Storage of data, both structured and unstructured
  • The creation of data pipelines to ingest and transform data
  • The creation of machine learning models
  • The creation of dashboards and reports
  • Tracking of data lineage and quality as well as the governance of data

We use Databricks as our primary data analytics platform. All data at ARC should be stored in, or connected to, Databricks. Everyone at ARC has been granted access to Databricks and should have the necessary permissions to access the data they need. If you do not have access to Databricks, please contact the Data Analytics team.

Getting Started

All users have access to the data catalog in Databricks (called Unity Catalog). The data catalog is a central repository of all data in Databricks. To access the data catalog, follow these steps:

  1. Log in to Databricks using your ARC credentials
  2. Click on the Catalog tab on the left-hand side of the screen
  3. You will see all the data in Databricks in a tree structure. The tree is generally organized into schemas (groupings of tables) and tables: clicking a schema shows all the tables in that schema, and clicking a table shows that table's details.
  4. Search for the data you need using the search bar at the top of the screen; tables, columns, and schemas are all searchable.
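In SQL, anything you find in the catalog is addressed by a three-level name: catalog.schema.table. A minimal sketch of building such a reference (the catalog, schema, and table names below are hypothetical examples, not real ARC objects):

```python
# Build a fully qualified Unity Catalog table reference: catalog.schema.table.
# The names used here are hypothetical, not real ARC catalogs.
def qualified_name(catalog: str, schema: str, table: str) -> str:
    return f"{catalog}.{schema}.{table}"

ref = qualified_name("team_geology", "wells", "daily_production")
query = f"SELECT * FROM {ref} LIMIT 10"
print(query)  # SELECT * FROM team_geology.wells.daily_production LIMIT 10
```

Using the full three-level name means the same query works from any workspace that can see the catalog.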

Unity Catalog

Unity Catalog is the central repository of all data in Databricks. See this blog post for a good overview: Unity Catalog.

We have an active project to imagine what the Unity Catalog could be in the future. If you want to see what we are thinking or contribute to the discussion: Unity Catalog Structure.

Databricks SQL

Databricks SQL is a SQL query editor that allows you to query data in Databricks. SQL stands for Structured Query Language and is a standard language for querying data in relational databases. For more information on SQL, see the W3Schools SQL Tutorial or the Databricks SQL Documentation.

To access Databricks SQL, follow these steps:

  1. Log in to Databricks using your ARC credentials
  2. Click on the SQL Editor tab on the left-hand side of the screen
  3. You will see a SQL editor where you can write and execute SQL queries. You can also save queries to the workspace for later use.
  4. Once your query returns results, you can view them in a table format or use the “+” button at the top of the results pane to create a new visualization.
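The kind of query you would run in the editor looks like the following. So that this sketch can run anywhere, it uses Python's built-in sqlite3 as a stand-in for a Databricks table; the table and data are made up:

```python
import sqlite3

# In-memory stand-in for a Databricks table; the data is made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE production (well TEXT, volume REAL)")
con.executemany(
    "INSERT INTO production VALUES (?, ?)",
    [("W-01", 120.0), ("W-01", 80.0), ("W-02", 50.0)],
)

# A typical aggregate query: total volume per well, largest first.
rows = con.execute(
    "SELECT well, SUM(volume) AS total FROM production "
    "GROUP BY well ORDER BY total DESC"
).fetchall()
print(rows)  # [('W-01', 200.0), ('W-02', 50.0)]
```

The same SELECT / GROUP BY / ORDER BY pattern works unchanged against real tables in the Databricks SQL editor.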

Databricks Dashboards

Databricks Dashboards are a way to visualize data in Databricks. Dashboards are built from SQL queries and can be shared with other users via the share button at the top of the dashboard. For more information, see the Databricks Dashboard Documentation.

To access Databricks Dashboards, follow these steps:

  1. Log in to Databricks using your ARC credentials
  2. Click on the Dashboards tab on the left-hand side of the screen

Databricks Notebooks

Databricks Notebooks are a way to write and execute code in Databricks. Notebooks can be written in multiple languages including Python, Scala, SQL, and R. Notebooks can be used to create data pipelines, machine learning models, and reports. For more information on Databricks Notebooks, see the Databricks Notebook Documentation.

To access Databricks Notebooks, follow these steps:

  1. Log in to Databricks using your ARC credentials
  2. Click on the + New button at the top of the screen
  3. Click on Notebook to create a new notebook
  4. You will see a notebook editor where you can write and execute code.
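A first notebook cell is ordinarily just Python. In Databricks you would typically use Spark to read tables, but a plain-Python cell like this (the readings are made-up numbers) also runs as-is:

```python
# A plain-Python notebook cell: summary statistics over made-up readings.
from statistics import mean

readings = [118.2, 121.5, 119.9, 124.3]

summary = {
    "count": len(readings),
    "mean": round(mean(readings), 2),
    "max": max(readings),
}
print(summary)
```

Each cell runs independently, so you can build up an analysis step by step and inspect intermediate results.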

If you don’t have access to Databricks notebooks, please contact the Data Analytics team.

Notebooks can be scheduled to automate the execution of code. Schedules can be created to run notebooks, JARs, or Python scripts, either at specific times or triggered by events. For more information on scheduling notebooks, see Schedule Notebook Jobs.
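Under the hood, a schedule is a cron expression attached to a job definition. A sketch of what a scheduled notebook job could look like as a payload for the Databricks Jobs API (the field names follow the Jobs 2.1 API as I understand it; the notebook path and cluster id are hypothetical placeholders):

```python
import json

# Hypothetical job definition for the Databricks Jobs API
# (POST /api/2.1/jobs/create). Paths and ids are made-up placeholders.
job = {
    "name": "nightly-refresh",
    "tasks": [
        {
            "task_key": "refresh",
            "notebook_task": {"notebook_path": "/Workspace/Team/example/refresh"},
            "existing_cluster_id": "0000-000000-example",
        }
    ],
    # Quartz cron: run every day at 02:30 in Mountain Time.
    "schedule": {
        "quartz_cron_expression": "0 30 2 * * ?",
        "timezone_id": "America/Edmonton",
    },
}
print(json.dumps(job, indent=2))
```

In practice you would create the schedule through the Databricks UI; the payload above is only to show what the schedule boils down to.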

Databricks Compute

Databricks is a cloud-based platform that allows for the creation of compute resources to run code.

When running SQL queries, the compute resources are already created and managed by the Data Analytics team.

If you are running notebooks, you can choose the type of compute you want to use. There are three general types of compute available in Databricks:

  • interactive
  • multi-node
  • serverless

An interactive cluster is a single virtual machine. These machines come in three sizes:

  • 32 GB of RAM and 4 cores
  • 64 GB of RAM and 8 cores
  • 128 GB of RAM and 16 cores

This is the simplest and cheapest option, and in general you should start here when running code in Databricks. You can always scale up if you need more resources.

Multi-node is a cluster of virtual machines. It is more expensive, but workloads that can run in parallel will often finish faster. A multi-node cluster can be created with a fixed number of nodes or with auto-scaling; an auto-scaled cluster grows and shrinks as the workload demands. In general, your code must be written specifically to take advantage of parallel processing; if you are not sure whether you need a multi-node cluster, you probably don’t. The nodes come in the same sizes as the interactive cluster. We have put limits on the cost of a multi-node cluster to prevent accidental overspending. If you need a multi-node cluster that exceeds the limits, please contact the Data Analytics team.

Serverless is a compute resource that is hosted by Databricks. This means it can be ready to go in seconds and you only pay for the time you use it. There are limits to the configuration of the serverless cluster, but in general, it is a good option for running code quickly.
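For concreteness, here is roughly what a fixed-size versus an auto-scaled cluster specification looks like in the Databricks clusters API. This is a sketch: the node type and Spark version strings are hypothetical placeholders, not our actual configuration:

```python
# Hypothetical cluster specs; node_type_id and spark_version are placeholders.
fixed_cluster = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_D8s_v3",  # e.g. 64 GB of RAM / 8 cores
    "num_workers": 2,                   # fixed number of worker nodes
    "autotermination_minutes": 60,      # shut down when idle to control cost
}

autoscaled_cluster = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_D8s_v3",
    # Scale between 2 and 8 workers depending on load.
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 60,
}
```

The `autoscale` block versus the fixed `num_workers` field is the whole difference between the two modes; everything else about the cluster is configured the same way.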

If you need help choosing the right compute resource for your code, please contact the Data Analytics team.

Databricks for Citizen Developers

We have provisioned areas in Databricks for teams to use as a sandbox. This is a place where you can experiment with code and write data into your own tables. This is a great place to learn how to use Databricks and to test out new ideas.

If you have been provisioned access to a team you should have access to the following:

  • A team catalog named team_<team_name>
  • A team group named Team - <team_name>
  • A team shared folder in Workspace/Team/Team - <team_name>
  • A secret scope named team-<team_name>
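The naming convention above is mechanical, so a small helper can derive all four resource names from a team name. This is only a sketch of the convention as documented here, with a made-up team name:

```python
# Derive the provisioned resource names for a team from the naming
# convention documented above. The team name "geology" is made up.
def team_resources(team_name: str) -> dict:
    return {
        "catalog": f"team_{team_name}",
        "group": f"Team - {team_name}",
        "folder": f"Workspace/Team/Team - {team_name}",
        "secret_scope": f"team-{team_name}",
    }

print(team_resources("geology"))
```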

These resources are for your team to use and cannot be accessed by other teams. The data stored in your team catalog is meant for internal analysis and should be treated as a development level data source. If you have data that you would like to share with other teams we can move the code and data to a production level data store that can be accessed by all teams.

If you do not have access to a team and would like to get access, please contact the Data Analytics team.

Databricks Developers

In the past, teams were segregated into separate workspaces, and developers generally had admin access to their workspace through a developer admin account. A by-product of this was siloing of artifacts, workflows, and data. To break down these silos, we have moved to a model where all teams share the same set of dev, test, and prod workspaces. This allows easier sharing of workflows and data between teams, better tracking of costs, less overhead, and better governance of the data. This is a better model for the company as a whole, and we are excited to move to it. As part of this move, we are changing the process for admin access to the workspaces and for unrestricted cluster creation.

Currently we are moving to the following workspaces:

  • arcDevDAL
  • arcUATDAL
  • arcPrdDAL

In general, all users in the company will have access to the arcPrdDAL workspace, where all production-level data will be stored. The arcUATDAL workspace is where workflows are tested before they go to production; it is accessible only by the Data Analytics team and the developers working on those workflows. The arcDevDAL workspace is where new workflows are developed, with the same access restrictions.

In these workspaces we will move away from using admin accounts as developer accounts. Admin accounts will be used only for workspace management, and developers will use their personal accounts for development. This gives us better control over who has workspace admin access.

Another implication of removing admin access is that developers will no longer have unrestricted ability to create clusters. We have implemented cluster policies that let a developer choose from a set of predefined cluster attributes when creating a new cluster. If you need a cluster that is not covered by the predefined attributes, please contact one of the Databricks admins.
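A cluster policy is a JSON document that constrains each cluster attribute. Here is a sketch of what one of our policies could look like; the specific values are hypothetical, and `fixed`, `allowlist`, and `range` are constraint types from the Databricks cluster policy definition language:

```python
import json

# Hypothetical cluster policy: pins some attributes, bounds others.
policy = {
    # Everyone gets the same runtime version.
    "spark_version": {"type": "fixed", "value": "15.4.x-scala2.12"},
    # Only these (made-up example) node types may be chosen.
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_D4s_v3", "Standard_D8s_v3"],
    },
    # Cap auto-scaling to control cost.
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    # Force idle clusters to shut down.
    "autotermination_minutes": {"type": "fixed", "value": 60},
}
print(json.dumps(policy, indent=2))
```

When you create a cluster under a policy like this, the fixed attributes are filled in for you and the others are limited to the allowed values.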

We are also moving to Infrastructure as Code (IaC) to manage the workspaces, meaning the workspaces will be created and managed using code. This gives us better control over the workspaces and lets us track changes to them. The code for the workspace configuration is stored Here. If you see something that needs to change in the workspace configuration, please submit a pull request to the repository.

If you have any questions about the new workspace structure or the new cluster policies, please contact the Manager of Data Analytics.

Last updated on 18 Aug 2025
Published on 18 Aug 2025