January 27, 2023

The Potential of Predictive ML Models in Sigma

Fran Britschgi
Strategic Solutions Engineer

Here’s what happened when we combined Snowflake’s machine learning models with the flexibility and cloud-native design of Sigma. Spoiler alert: amazing things are suddenly possible.

People don’t always think of business intelligence and machine learning as technologies that work together. As a Solutions Engineer at Sigma, I work with the product every day, yet I recently discovered another way that Sigma pushes the analytics envelope beyond traditional dashboarding: applying predictive machine learning models.

The historical inability to fully integrate model-scoring capabilities into the BI platform has created a barrier between business analysts and the ML work done in the modeling arm of an organization. You may have experienced a consequence of this firsthand while waiting for the Data Science team to upload the newest scored version of the data before you could explore the results!

I started exploring ways to use the flexibility of the Sigma interface to call models directly through and from the cloud data warehouse (CDW). Fantastic work done at Snowflake in the last two months has made it possible to register open-source models as saved models within the CDW. Combine that with the strength and flexibility of Sigma’s connection to Snowflake, and we arrive at an exciting conclusion:

Business users run the models they need for their work, within the same tool they use for the rest of their analyses, live on the data that is relevant to them.

This makes robust, real-time model scoring, previously the exclusive domain of the data scientist, possible in a BI platform and available to the business person.

Score, Analyze, and Decide!

The Typical Machine-Learning Workflow Space Today

There are a few ways to present the results of ML models. One format is visualizations, which help communicate the performance and predictions of a model to stakeholders. Another is BI analytics, used to monitor the performance of a model over time and identify any issues that may require attention.

In discussions of this workflow with a number of our commercial partners, I learned that many organizations deploy their models in roughly the same way:

  • Model is written, trained, and saved on a local machine 
  • Data is run through this model and scored to create relevant output variables (e.g., a company might run a model to determine whether or not an applicant is approved for a credit card)
  • Output files generated by the model are then uploaded to the server side of the BI tool in order to be visualized

Often, this cumbersome workflow exists only because of technical limitations. And while functional in the most straightforward sense, it brings many weaknesses to an organization. I was able to identify five significant business consequences that were limiting our partners:

  1. Delayed decision-making: The delay caused by having to manually push model output files to a BI tool often leads to delayed decision-making, as the data isn’t immediately available for analysis.
  2. Lack of real-time monitoring: Without the ability to live-score data in a BI platform, organizations are unable to monitor their data in real-time, which can lead to missing important trends or issues as they occur.
  3. Limited automation: If model output has to be pushed to a BI tool, organizations may not be able to fully automate certain business processes, such as data-error detection or predictive maintenance, which can lead to increased costs and inefficiencies.
  4. Limited scalability: Without the ability to live-score data in a BI tool, organizations may have limited scalability when it comes to data analysis, as they may have to rely on manual processes to handle increasing amounts of data, such as partitioning the data.
  5. Limited data product offering: Without the ability to live-score data in a BI tool, organizations are further limited in their ability to provide modern, sophisticated data offerings like customer recommendations or live predictions in their customer-facing products.

Example Use Case

Imagine that you work for the technology retail company PLUGS, and you are interested in exploring the model that the company uses to approve customers for its exclusive Loyalty Program. You’ve been told that only the best of the best get approved for the Loyalty Program and its generous perks, and as an analyst charged with exploring the kind of impact this may have on PLUGS revenue, you are certainly interested in exactly who “the best of the best” are.

Thanks to Snowflake’s Snowpark, your Data Science team has been able to develop and register that model in Snowflake, alongside your customer database. 

Now, all you have to do to access that model is to make use of Sigma’s Pass-Through functions and call that model just like any other Sigma function! In my example, I wrap it in a Logical() function so that we get a True/False response.

Now that we have access to the Loyalty Program determination for all customers in the datasets, we can take the analysis in many directions. We can build visualizations that plot confusion matrices in order to thoroughly understand the accuracy of our model. For example, the False Positive column shows that our model incorrectly approved 5% of the customers for the Loyalty Program.
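To make the confusion matrix idea concrete, here is a minimal sketch in plain Python. It assumes we have, for each customer, a known correct outcome and the model’s True/False determination. The function names and the sample data are hypothetical and only illustrate the arithmetic behind a figure like the 5% above; they are not taken from the PLUGS workbook.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count TP, FP, TN, and FN for two equal-length lists of booleans."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p and a:
            counts["TP"] += 1  # approved, and should have been
        elif p and not a:
            counts["FP"] += 1  # approved, but should not have been
        elif not p and a:
            counts["FN"] += 1  # rejected, but should have been approved
        else:
            counts["TN"] += 1  # rejected, and should have been
    return {key: counts[key] for key in ("TP", "FP", "TN", "FN")}


def false_positive_share(counts):
    """Share of all scored customers that were incorrectly approved."""
    total = sum(counts.values())
    return counts["FP"] / total if total else 0.0


# Hypothetical sample: 20 customers, 8 deserving approval, 1 wrongly approved.
actual = [True] * 8 + [False] * 12
predicted = [True] * 8 + [True] * 1 + [False] * 11

counts = confusion_counts(actual, predicted)
share = false_positive_share(counts)
```

Note that this sketch divides false positives by all scored customers. Dividing by the number of customers who should have been rejected would instead give the false positive rate, which is a different metric.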

Or we may be interested in how our model assesses people around the country or across variables not included in the model, like gender. This is a critical aspect of validating models and ensuring they are ethical.
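A check like this boils down to comparing approval rates across groups. The sketch below shows one way to compute them in plain Python, assuming each scored customer is a dictionary holding the model’s determination plus the attribute of interest. The field names (`gender`, `approved`) and the sample rows are hypothetical.

```python
from collections import defaultdict


def approval_rate_by_group(rows, group_key, approved_key="approved"):
    """Return {group value: share of rows in that group that were approved}."""
    totals = defaultdict(int)
    approvals = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row[approved_key]:
            approvals[group] += 1
    return {group: approvals[group] / totals[group] for group in totals}


# Hypothetical scored customers; "gender" is not an input to the model.
scored_customers = [
    {"gender": "F", "approved": True},
    {"gender": "F", "approved": False},
    {"gender": "F", "approved": True},
    {"gender": "F", "approved": True},
    {"gender": "M", "approved": True},
    {"gender": "M", "approved": False},
    {"gender": "M", "approved": False},
    {"gender": "M", "approved": True},
]

rates = approval_rate_by_group(scored_customers, "gender")
```

A large gap between groups does not prove bias on its own, but it flags where to look more closely. The same function works for any other column, such as region or an age band.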

Or perhaps we want to explain to a high-level stakeholder how the model works, or how its effects play out across a single variable, like age. We can clearly see that age doesn’t have much of an effect until customers are elderly, at which point the model begins to exhibit a clear bias.

Or maybe we want to make use of Sigma’s easy-to-run Joins and pull in data on 4.5 million rows of retail sales to analyze the customer profile of each store that we’ve sold in. We may be interested in predicting how different stores and their product offerings support different Loyalty groups!

Finally, we can make use of Sigma’s unique Input Table functionality to let users enter data directly into the CDW. As a result, business users can instantly see the impact of the model on hypothetical data.

Method

Thanks to the direct connection between Sigma and Snowflake’s Snowpark, the framework already exists for establishing a live-scoring system within Sigma itself. Here’s how you can make it happen: 

  1. Use Python to define and register the model natively in Snowflake with the Snowpark developer framework, allowing for the use of the model within the Snowflake platform with built-in governance.
  2. Train the registered model directly within Snowflake’s Snowpark on the data, allowing for the model to be trained on the most recent and relevant data at the scale of the data.
  3. Create a UDF (User Defined Function) in Snowflake that allows organization members to use the model on new data, making the model accessible and usable for a wide range of users.
  4. Access the UDF through a Sigma Pass-Through Function, allowing the model to be used for creating Sigma Workbooks and Datasets.
  5. Build out your first workbook in Sigma to assess the model and its results, providing visibility into the performance of the model and making it easy for stakeholders to understand and interact with the data and the model. 

These steps balance flexibility, scalability, and governance while keeping the model accessible to the users who need it and its performance visible to stakeholders.
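As a rough illustration of the UDF step, the sketch below registers a Python scoring function as a permanent Snowflake UDF through Snowpark’s `session.udf.register`. Everything specific here is an assumption made for the example: the rule inside `is_loyalty_eligible` is a stand-in for a trained model, and the UDF name `LOYALTY_APPROVAL`, the stage `@ML_MODELS`, and the input columns are hypothetical. A real deployment would typically load a trained model artifact inside the UDF rather than hard-code a rule.

```python
def is_loyalty_eligible(annual_spend: float, orders_last_year: int) -> bool:
    """Hypothetical stand-in for the trained model's prediction."""
    return annual_spend >= 1000.0 and orders_last_year >= 5


def register_loyalty_udf(session):
    """Register the scoring function as a permanent UDF.

    `session` is assumed to be an open snowflake.snowpark.Session.
    """
    # Imported here so the scoring logic above can be run and tested
    # on a machine without Snowpark installed.
    from snowflake.snowpark.types import BooleanType, FloatType, IntegerType

    return session.udf.register(
        is_loyalty_eligible,
        return_type=BooleanType(),
        input_types=[FloatType(), IntegerType()],
        name="LOYALTY_APPROVAL",      # hypothetical UDF name
        is_permanent=True,            # persists beyond this session
        stage_location="@ML_MODELS",  # hypothetical stage for the UDF code
        replace=True,
    )
```

Once registered, the function could be called from SQL like any other function, for example `SELECT LOYALTY_APPROVAL(ANNUAL_SPEND, ORDERS_LAST_YEAR) FROM CUSTOMERS` (table and column names are again hypothetical), and that same call is what a Sigma Pass-Through function would pass to the warehouse.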

Conclusion

In short, we have formalized a methodology for real-time model scoring without ever having to leave the analytics platform, all within the governed framework of a Snowpark-deployed model in Snowflake. The extensions and applications of this methodology go far beyond the imagination of the author, and I eagerly await input from Sigma users across the world! As a start, I would like to revisit the limitations listed earlier in this article, adjusted to reflect the realities of a Sigma + Snowpark solution space.

  1. Real-time decision-making: The scored data is immediately available for analysis, eliminating the delayed decision-making caused by manual model push methods.
  2. Real-time monitoring: The ability to monitor the data and model outputs gives your business users a live view of important trends and issues. Alerting can happen the instant that new data enters the system.
  3. Enhanced automation: With direct model output, organizations are better able to automate certain business processes, such as data-error detection or predictive maintenance, giving your team more ways to reduce costs and time-heavy operations.
  4. Full scalability: When the model runs directly within the CDW alongside the data, organizations will not face the limited scalability inherent in a local-scoring methodology, where they may have to rely on manual processes, such as partitioning the data, to handle increasing amounts of data.
  5. Enhanced data product offering: With the ability to live-score data in a BI tool, organizations are further empowered by Sigma to provide modern, sophisticated data offerings like customer recommendations or live predictions in their customer-facing products.

We are Sigma.

Sigma is a cloud-native analytics platform that uses a familiar spreadsheet interface to give business users instant access to explore and get insights from their cloud data warehouse. It requires no code or special training to explore billions of rows, augment with new data, or perform “what if” analysis on all data in real time.