The purpose of this article is to highlight three significant considerations when architecting Databricks for use with an Analytics platform such as Sigma. The three areas we will cover are:
- The type of Databricks SQL warehouse that should be used.
- The type of tables that should be created in Databricks.
- Considerations when partitioning Databricks tables.
Classic vs. Serverless SQL Warehouse
Connecting Sigma to Databricks requires a Databricks SQL warehouse. When creating a Databricks SQL warehouse, customers can decide to use a Classic or a Serverless SQL warehouse as the compute resource for running SQL queries.
Databricks architecture consists of a control plane and a data plane. The control plane includes the backend services that are always managed by Databricks in its own cloud account. The data plane is where data is processed by clusters of compute resources. One of the most important differences between the Classic data plane and Serverless data plane is where the compute resources run. In the Classic data plane, Classic SQL warehouses exist in the customer’s cloud account whereas, in the Serverless data plane, Serverless SQL warehouses exist in Databricks’ cloud account.
We would like to investigate which type of warehouse is a better choice depending on the type of analytics workload. We will look at two types of analytics workloads:
- A workload where users are constantly using Sigma, and therefore, queries are expected to execute against the warehouse every 5 minutes during business hours. An example of such a workload would be a large number of users in an organization using Sigma for operational analytics.
A Classic warehouse can take several minutes to start. During startup, queries have a high chance of timing out. To avoid a degraded user experience, a classic warehouse for this scenario can be set with an auto-stop value of 10 minutes to avoid shutting the warehouse down during business hours. Assuming 8 business hours in a day, the warehouse will stay up for 8 hours and 10 minutes/day.
Serverless compute, on the other hand, brings a truly elastic, always-on environment that’s instantly available. Auto-stop for a Serverless warehouse for this scenario will also be set to 10 minutes. The warehouse will again stay up for 8 hrs and 10 minutes/day.
- A workload where usage is infrequent. Let’s assume once every hour during business hours. An example of such a workload would be periodic reporting or a small set of users in a small organization.
For this scenario, auto-stop for a Classic warehouse will be set to 1.5 hours to avoid queries from timing out during business hours. The warehouse will stay up for 9.5 hours/day.
Auto-stop for a Serverless warehouse for this scenario can be set to 10 minutes, due to the always-on nature of the compute environment. Assuming queries take 2 mins to run, the warehouse will stay up for 96 mins/day.
We will make the following assumptions about cost:
Classic SQL Compute at $0.22/DBU/hour
Serverless SQL Compute at $0.70/DBU/hour
On-Demand hourly rate for an AWS i3.2xlarge EC2 instance of $0.624
The following table lists the AWS costs incurred for each Classic cluster size. Note that these costs do not apply for a Serverless warehouse.
The following table lists the total cost incurred/day for both scenarios.
Serverless is significantly less expensive for scenario 2.
Even though Classic is slightly less expensive for scenario 1, consider the following:
- the additional benefits of using a Serverless SQL warehouse as summarized in this Databrick blog
- the continued investment being made in Serverless data plane by Databricks, and
- the fact that personnel cost associated with maintaining cloud infrastructure for the classic data plane is not included in this cost analysis
Therefore, it is our conclusion that using Serverless SQL warehouses, even in Scenario 1, is advisable due to the overall benefits it provides despite the slightly higher cost.
It is possible to create two kinds of tables in Databricks: Delta tables and Non-Delta tables.
A Delta table uses Delta Lake as the data source and is the default option when a table is created.
A Non-Delta table is either a table:
- whose data source is one of the following - TEXT, AVRO, CSV, JSON, JDBC, PARQUET, ORC, LIBSVM, a fully-qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister. When creating such non-Delta tables, the USING clause in the CREATE TABLE statement specifies the data source as one of the above values instead of DELTA for Delta tables.
- whose data source is one of the following external systems - Amazon Redshift, Azure Cosmos DB, Azure Synapse Analytics, Cassandra, Couchbase, ElasticSearch, Google BigQuery, MongoDB, Neo4j, Redis, Snowflake, SQL databases using JDBC.
Non-Delta tables are not backed by Delta Lake and do not provide the ACID transactions and optimized performance of Delta tables. For example, retrieving a count of all rows from a non-Delta CSV table with approximately 2-3 million records takes 2-3 minutes. A query to calculate a measure grouped by one column takes even longer.
For the same table created as a Delta table, a count of records takes less than a second, and a query to calculate a measure grouped by one column takes ~2 seconds with a small SQL warehouse.
Therefore, Sigma should only be used to query Delta tables in Databricks for any Analytics workload.
This is arguably the simplest of the three items discussed in this whitepaper. Similar to other warehouses, Databricks allows a table to be partitioned using a set of columns. To use partitions, a set of partitioning columns are defined when creating a table by including the PARTITIONED BY clause. When inserting or manipulating rows in a table Databricks SQL automatically dispatches rows into the appropriate partitions.
Using partitions can speed up queries against the table as well as data manipulation. However, partitioning columns need to be selected carefully. For example, assume a sales order table has a date column which stores the order date. If the order date column is used for partitioning, but analytics workloads use quarter or year as filters, partitioning can in fact slow queries down as multiple partitions will need to be scanned to aggregate data for the quarter or month. In this example, a partition on the quarter of the order date is prudent. The performance impact can be dramatic depending on the number of rows in each partition, going from many minutes to a few seconds.