May 2, 2022

A Case for Machine Generated SQL

We are settling into a future where companies like Snowflake and Databricks are at the core of every organization's data practice. This is a delight for the Data Engineers, Analytics Engineers, Data Scientists, and others who have chosen to invest time sharpening the technical skills required to get the most out of these new platforms. However, for every person who loves SQL (this author included) there are 99 others who simply want to use data to tell stories and make decisions.

To these folks, SQL is akin to a gatekeeping bridge troll. It blocks their path, requiring them to answer tedious riddles about DATEADD’s arguments, impeding their journey to report on last quarter’s customer acquisition costs. Don’t get me wrong. There are certainly folks who enjoy solving these riddles. However, their passion for unraveling the mysteries of a new data warehouse’s SQL dialect isn’t representative of the whole. If we believed that teaching everyone to write SQL was the future of analytics, we’d have pivoted to an education company a long time ago.

Instead, the team at Sigma firmly believes in the power of Machine Generated SQL to make data analysis 100x more accessible. We predict that by the end of this decade, the growth in folks deriving value directly from new data management systems — without downloads or extracts — will have vastly outpaced the growth of SQL programmers. This post makes the case for that belief by discussing our perspective, our product, and our approach to query generation.

A screenshot of a Sigma Workbook. A Workbook contains data visualization elements such as tables, charts, pivots, and maps. It also contains filters and control parameters that enable interactive data exploration.
Medium doesn’t support our live embedding, so we need to resort to a static screenshot of a Sigma Workbook. You can follow this link to the interactive Workbook, or sign up for a free trial and create your own.

Over the years we've watched folks with little or no SQL expertise generate over 1,000 queries in a single day, many of which contain enough joins, aggregations, windows, and CTEs to make even the most expert SQL programmer's head spin. When folks explore and analyze data with a Sigma Workbook, they don't engage with SQL directly. Instead, we interactively compile the user interface specification into SQL, submit those queries to their data warehouse, and return the results to the browser, where they're brought to life in a visualization.

Our case for Machine Generated SQL is rooted in its power to make data analysis accessible to as many folks as possible. We are not aiming to replace all SQL programmers with machines. Rather, we're building towards a future where anyone can leverage the power of data without first becoming a SQL programmer.

Building the Next Digital Camera

The skills gap present in the data space is not unlike the one that existed in photography over thirty years ago. Only a small number of experts developed the skills to reliably develop negatives and navigate a darkroom, while a far larger number of folks simply wanted to take pictures. It's no surprise that point-and-shoot cameras, disposable cameras, and 1-hour photo labs became such widespread and successful businesses.

Then came digital cameras. Folks no longer needed to buy film, stress over the contents of their 36 exposures, and drop the rolls off at the photo lab, just to find out the lab tech botched the development process. Instead, folks could take as many photos as they liked, upload them to their computer, and print them at home. Fast forward to now, and in less than a minute we can take a picture of a puppy, zhuzh it up with filters, text, and gifs, and share it on IG.

This image shows two photographs printed in a darkroom from the same negative. Although the contents are identical, the prints were made with different contrast and exposure settings.
The iteration cycle for producing just one photograph above can reach 5 minutes. In contrast, someone with a smartphone can — in less than a minute — take a few photos, pick and edit one, and share it on social media.

The most important takeaway is the way in which advances in technology made photography 100x more accessible. Sure, some things are inevitably lost moving from film to digital. But as the technology has gotten better, those differences have increasingly become matters of aesthetics, not gaps in functionality. Furthermore, new technologies have enabled novel and creative approaches to digital photography and editing that were impractical or impossible using film.

In the analytics space, we haven't replaced film photography yet. Most new analysis products are busy building photo labs, developing negatives, and navigating darkrooms. There's nothing wrong with this state of affairs. There's so much demand for improved tools and workflows that sticking with the same general approaches taken by the last generation of successful products will benefit many analysts and result in financially successful enterprises. At Sigma, we're more interested in building the next digital camera.

Workbooks: A Frontend for SQL Generation

A Sigma Workbook is a collaborative document that presents folks with an interactive, declarative programming environment. It draws its inspiration from various WYSIWYG systems: spreadsheets, website builders, notebooks, and traditional business intelligence tools. The core data model is a directed acyclic graph of elements, such as tables, pivots, charts, and parameters. This graph is reactive: a bar chart sourced from a filtered table element will automatically update when the table’s filter is changed.

The Workbook is built for the browser. When it needs to fetch the data for an element, it must communicate with the data warehouse. To do this, we package up the specification of the user interface and make an API request to our backend services. However, before we can issue a query we need to compile the Workbook specification into a query that will run against the target data warehouse. This is where Machine Generated SQL enters the chat.

Our query generation service is a compiler of sorts. It treats the Workbook specification like a program, which is parsed, analyzed, and optimized, before it finally generates a query in the SQL dialect of the target data warehouse. In many ways, this flow is no different from a traditional programming language compiler. When folks write code in a language like Python or Rust, they're significantly more productive than if they attempted to produce the same outputs writing machine code by hand. SQL provides the same benefits — writing an efficient, scalable, and correct join algorithm implementation in a general purpose language is a non-trivial endeavor for any programmer, and a waste of time for folks who are in it for the analytics.
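To make the idea concrete, here is a toy sketch of this kind of compilation — not Sigma's actual compiler, and the spec shape and field names are invented for the example. It treats a small element spec as a program and emits a SQL string:

```python
# A toy "spec-to-SQL" compiler: a minimal element specification is
# treated as a program and compiled into a SELECT statement.
# The spec layout and field names here are hypothetical.

def compile_spec(spec: dict) -> str:
    """Compile a minimal element spec into a SELECT statement."""
    select_parts = list(spec.get("dimensions", []))
    for alias, expr in spec.get("measures", {}).items():
        select_parts.append(f"{expr} AS {alias}")
    sql = f"SELECT {', '.join(select_parts)} FROM {spec['source']}"
    if spec.get("filters"):
        sql += " WHERE " + " AND ".join(spec["filters"])
    if spec.get("dimensions"):
        sql += " GROUP BY " + ", ".join(spec["dimensions"])
    return sql

spec = {
    "source": "orders",
    "dimensions": ["region"],
    "measures": {"total_revenue": "SUM(amount)"},
    "filters": ["order_date >= '2022-01-01'"],
}
print(compile_spec(spec))
```

A real compiler also has to resolve names, pick the target warehouse's dialect, and optimize the plan, but the shape is the same: a declarative spec goes in, SQL text comes out.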

The history of innovation in programming languages is, therefore, not so different from photography. Advances in technology enable programmers to provide better abstractions and products, which in turn make great outcomes more accessible and efficient for everyone at all skill levels.

The Hegemony of Dimensions and Measures

All business intelligence tools present folks with a user interface that translates calculations, filters, and other analytical specifications into a query against some database. Historically, popular systems have oriented their interfaces and abstractions around the metaphor of a Data Cube. While there isn’t an obvious consensus on what makes a cube a cube, every implementation has the same core constructs: (1) dimensions — columns that are often used for filtering and grouping; (2) measures — aggregate expressions that are calculated along a (sub)set of dimensions.

The CUBE operator is the N-dimensional generalization of simple aggregate functions. The 0D data cube is a point. The 1D data cube is a line with a point. The 2D data cube is a cross tab, a plane, two lines, and a point. The 3D data cube is a cube with three intersecting 2D cross tabs.
An illustrative figure from Jim Gray’s original Data Cube paper. The dimensions include the make, color, and year of a vehicle. The measures include a vague “sum” of some numeric attribute.
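In SQL terms, the dimension / measure model maps directly onto GROUP BY: dimensions become grouping keys, measures become aggregate expressions. A minimal sketch using Python's built-in sqlite3, with table and column names borrowed from the figure's vehicle example (the data itself is invented):

```python
# Dimensions (make, color, year) vs. a measure (SUM of sales),
# demonstrated on an in-memory SQLite database. The rows are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicles (make TEXT, color TEXT, year INT, sales INT)")
conn.executemany(
    "INSERT INTO vehicles VALUES (?, ?, ?, ?)",
    [("Ford", "red", 1994, 10), ("Ford", "blue", 1994, 5), ("Chevy", "red", 1995, 7)],
)

# Dimensions go in GROUP BY; measures are aggregates over each group.
rows = conn.execute(
    "SELECT make, SUM(sales) FROM vehicles GROUP BY make ORDER BY make"
).fetchall()
print(rows)  # [('Chevy', 7), ('Ford', 15)]
```

Every subset of dimensions yields a different "face" of the cube; the CUBE operator simply computes all of them at once.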

The dimension / measure model has found massive commercial success over the past few decades — it's the point-and-shoot camera of data tools. However, like a point-and-shoot it optimizes for a fixed workflow. When folks need to break out of those constraints, their path forward almost certainly requires writing SQL. For a person who is not yet skilled in the art, this is an easy cliff to fall off and a precarious one to climb down.

It's no wonder, then, that folks joke about how the "Download to Excel" button is the most requested feature for any business intelligence tool. Spreadsheets are perhaps the most accessible and adopted programming environment in the world. This is — in no small part — due to their visual interface, straightforward formulas, and calculation model.

An example of a straightforward spreadsheet formula is SUM(A10:A20). It’s easy to understand logically and visually which cells the aggregation operates on. This is a much more direct specification than the conceptually equivalent SQL — SELECT SUM(A) FROM TABLE WHERE ROW BETWEEN 10 AND 20. For someone who doesn’t know SQL, the calculation and filter are the only specifications they care about. Everything else the query requires them to specify is only a potential source of frustration.
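The equivalence is easy to see when both are run side by side. In the sketch below (using sqlite3, with an invented single-column table), SQLite's implicit rowid stands in for the spreadsheet's row index:

```python
# The spreadsheet formula SUM(A10:A20) and its rough SQL equivalent.
# Column A holds the row's own number, and rowid plays the role of
# the spreadsheet row index; the table and values are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sheet (A INT)")
conn.executemany("INSERT INTO sheet VALUES (?)", [(i,) for i in range(1, 31)])

# Conceptually: SELECT SUM(A) FROM sheet WHERE row BETWEEN 10 AND 20
total = conn.execute(
    "SELECT SUM(A) FROM sheet WHERE rowid BETWEEN 10 AND 20"
).fetchone()[0]
print(total)  # 165  (10 + 11 + ... + 20)
```

Notice how much of the SQL — the table, the row predicate, the fetch — is ceremony the spreadsheet user never has to think about.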

A SQL Generating Compiler

When folks’ mental model of a system matches its actual behavior, it greatly improves their productivity while reducing mistakes and frustration. Using this lens, the job of a programming language is to establish and communicate the mental model; the job of a compiler is to bring that model to life. This is how we view the relationship between our Workbook and our SQL query generation.

When we started building Sigma, the first analytical shape we focused on was percent of total calculations. These are great because they're arithmetically straightforward — folks using a spreadsheet can easily write a formula like A14 / SUM(A10:A20). They're also easy for folks to manually verify in a WYSIWYG. What makes them a great starting point is that they're surprisingly verbose to express in SQL. They require either a self-join to an aggregated subquery or a windowed aggregate calculation. Both of these require folks to have an accurate mental model and working knowledge of advanced SQL, as well as the capabilities of their specific data warehouses. This is a steep cliff.
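Both SQL shapes are sketched below, using sqlite3 (window functions need SQLite 3.25 or later); the sales table and values are invented for the example:

```python
# The two standard SQL shapes for percent of total:
# a windowed aggregate, and a self-join against an aggregated subquery.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 30.0), ("west", 50.0), ("south", 20.0)],
)

# Shape 1: windowed aggregate over the whole table.
windowed = conn.execute(
    "SELECT region, amount / SUM(amount) OVER () FROM sales ORDER BY region"
).fetchall()

# Shape 2: cross join to an aggregated subquery holding the grand total.
joined = conn.execute(
    "SELECT s.region, s.amount / t.total "
    "FROM sales s, (SELECT SUM(amount) AS total FROM sales) t "
    "ORDER BY s.region"
).fetchall()

assert windowed == joined
print(windowed)  # [('east', 0.3), ('south', 0.2), ('west', 0.5)]
```

Either way, the simple ratio the analyst has in mind is buried under subquery or window machinery — exactly the verbosity a compiler can absorb.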

Formulas in Sigma aspire to the clarity of direct specification found in spreadsheets. Therefore we saw only one viable approach: formulas must express percent of totals as a simple ratio. Our view then (and now) was that a language containing explicit sub-queries or window-like specifications is out of the question. However, this presented us with an exciting challenge. When folks write A / SUM(A), how do we specify the grouping keys?

This led us to Groupings: a primitive for establishing a nested relationship between entities in a data model. For example, a State contains one or more Counties; every County records its population. When State is used as the key for a Grouping, it logically and visually establishes a nested relationship. With this in hand, it's clear what to expect from calculations such as SUM(county_population) or county_population / state_population.
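One plausible compilation of that Grouping — illustrative only, not Sigma's actual output — turns the grouping key into a window partition. Sketched with sqlite3 over an invented counties table:

```python
# county_population / state_population, compiled as a window aggregate
# partitioned by the Grouping key (state). Data and SQL are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counties (state TEXT, county TEXT, population INT)")
conn.executemany(
    "INSERT INTO counties VALUES (?, ?, ?)",
    [("CA", "Alameda", 20), ("CA", "Marin", 5), ("NY", "Kings", 25)],
)

rows = conn.execute(
    "SELECT state, county, "
    "       CAST(population AS REAL) / SUM(population) OVER (PARTITION BY state) "
    "FROM counties ORDER BY state, county"
).fetchall()
print(rows)  # [('CA', 'Alameda', 0.8), ('CA', 'Marin', 0.2), ('NY', 'Kings', 1.0)]
```

The person writing the formula sees only a ratio between two levels of a nesting; the partition key falls out of the Grouping rather than being spelled out by hand.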

These are fairly rudimentary examples for someone who is exploring a census dataset in a spreadsheet. That's exactly the point. For most folks, expressing these same operations in SQL is a tall order — the purpose of our compiler is to bridge this gap. Rather than attempting to build our 4th (and the industry's 256th) iteration of a SQL JOIN user interface, we built a version of VLOOKUP into our language instead.

In Sigma, a single calculation that is straightforward for anyone to write and maintain can cause our compiler to produce SQL statements with numerous joins, aggregations, sub-queries, and windowed calculations. Sure, these queries are sometimes verbose or pedantic to a degree that makes experienced SQL programmers cringe. But they aren’t the folks we’re building Sigma for.

In Closing

The team at Sigma firmly believes that Machine Generated SQL is here to stay. Folks use our product to tell stories and ask questions of their data, without ever learning SQL. The adoption and success we’ve seen over the last few years is incredibly inspiring. It also seems like the industry is increasingly moving in the same direction: tools like dbt and Malloy use Machine Generated SQL to enable analytics engineers and data scientists to do more with the skills they already have; products like Hex and Canvas are exploring the intersection of spreadsheets, notebooks, and — to varying degrees — Machine Generated SQL. As such, it isn’t a stretch to view this trend as a rising tide that lifts all boats. If you find these challenges interesting, we’d love to have you join us on this exciting adventure.

Max Seiden
Software Engineer