Turning Everyday Business Files Into Trusted, Queryable Data With Unstructured AI
January 22, 2026
Jeff Carpenter
Senior Solution Engineer
About 80%-90% of the data organizations collect is unstructured, created as part of everyday business operations. PDFs, images, slide decks, call recordings, and documents are just a few of the unstructured data sources businesses generate but aren’t tapping into.
This is valuable information that carries context business teams care about deeply. So, why is it not in use? The answer is fairly simple—because it doesn’t live in a data warehouse, it’s difficult to work with this data in a governed, repeatable way. Just think about photos that reveal why a claim looks suspicious, or Word documents that explain certain business decisions. When you’re able to connect that unstructured information back to real warehouse data, the picture of your business gets much clearer.
But until recently, most teams didn’t realize this kind of analysis was even possible inside their governed data environment. Combine that with the fact that 33% of executives say they often don’t even get around to using the data they receive, and you start to see just how much information businesses aren’t taking advantage of. Now, Sigma and Snowflake are bridging that gap together by making it easy to explore unstructured data where it already lives, using the same permissions, controls, and context as the rest of the business.
Hi, my name is Jeff Carpenter. I'm a solution engineer with Sigma Computing. Today I'm going to focus on using AI in Sigma to ask questions and gain insight into what I'd historically call your "dark" data: PDF files, Word documents, PowerPoints, audio and video files, and images. But to tell the story in this particular use case, I'll highlight another great feature along the way: the ability to use APIs in Sigma. In this case, I'll do reads and writes into Salesforce, and I'll show you that in a few minutes as we go.

The reason I'm starting here is that I want to walk through a supply chain "ghost hunt" scenario. Let's say some product has gone missing, and we need to find it. What you'll see here in red is an opportunity, a renewal for one of our customers, ABC, and they're upset because they're missing a particular order number. I'm going to key in on this order number and see if I can find out where the missing inventory is. Remember, the reason issues like this cost companies time and money is that people don't always have time to enter things into the proper system in real time. But by gathering insight from unstructured data, and allowing people to load in audio files, pictures, and the like along the way, even if they're in the warehouse, I can get immediate insight via Sigma.

Let me show you how that can work. I'll open up my other workbook, which I give my warehouse staff access to. What they can do is load things like bills of lading, highlighted here. Sometimes they print them out in the morning and it looks like product is all set to go. Maybe when ABC called in, they were told, "Oh yeah, you should have it." But maybe there was an incident. Maybe something happened. And in this case, there was exactly that.
I'll play a snippet of this audio: "There was an incident in zone B at 4 PM." I won't play the whole thing, but I'm using AI_TRANSCRIBE in Snowflake, which is what I'm connected to, to transcribe that audio into full text. It's also allowing me to use AI_PARSE_DOCUMENT to do a similar thing with my bills of lading. So you see here: we're XYZ, we have product set to ship out to ABC, and this order number, the key one in question, has been tagged in this bill of lading. It has seemingly been packaged up onto a pallet, but we just can't find it.

Now, when I'm loading this, AI_PARSE_DOCUMENT and AI_TRANSCRIBE are actually pulling in the full contents, everything in the document, but I'm just showing a summary view so it's very easy for end users and business users to consume, to look at, to see what's here. I also give the ability to translate this. Maybe we're a global operation and I want people all over the world to be able to access this in their language of choice and flip around to whatever they need. So we give you the ability to load files, review the files in raw format, run AI against them, and then do what I would call multimodal inquiries: asking questions across different unstructured data types. For example, I can ask, "How long is the audio?" It will search across all of these files and answer, "I found this new recording 34, and it looks like it's 11.24 seconds long, based on your question." I can also ask questions against all the documents, and I'll show you that in one second. Now, though, let me take the role of somebody on the warehouse floor, where maybe I'm allowing them to load images.
This is connected to my iPhone; you see here I have the old iPhone mini. I can connect to this same workbook via my mobile device and even interact with it. I can choose a photo; I won't bore you with all my pictures, I just pulled one down into my downloads. What this allows me to do is load from a mobile device as well. Say I'm a warehouse worker snapping a picture because I just moved a pallet around, and in this case I even put a handwritten note on the pallet. Using AI_COMPLETE, it will look at that image and even recognize what's on the handwritten note. Think of how powerful this is for so many different types of use cases. You can see the preview of the image on the mobile device. It's running the AI right now: it loaded the file, pointed it to an S3 bucket, and Snowflake is able to read it and run the AI against it.

If I come back to this view as my desktop worker and simply refresh, we'll see the image that my worker on the floor just loaded. And look at that: that's the bill of lading I need, that's the missing pallet, and somebody even put a handwritten note saying, "We moved it by the side gate." So now I ask a question: "Missing inventory for this order number from the customer. Where could it be?" Notice that number isn't used in the file name or anywhere obvious, but with AI_PARSE_DOCUMENT, it found that customer listed on this bill of lading, order 3411000. So now, not only does it find the number on the right bill of lading, represented right here, and know how many packages and that there's one pallet, but also this: 4 PM, there was an incident in zone B.
It can also see the photograph here, by the side gate. So after looking through all this, it says, "Check the overflow area. I found your product." Now I can come back to the system, click on the ABC renewal, and say: I'm calling the customer, I get their buy-in, I tell them we found it and we'll ship overnight. I can confidently put this back into the negotiation phase, maybe at 95%. I'll change the note here; maybe I'd leave the old one, that's fine. I'm updating. Now what's happening? It's using the API: it wrote to Salesforce, did the update, then did an immediate read to show the update in my view here. And when I click to open this up in Salesforce, I can see it's changed to Negotiation at 95%, and down here it's resolved. These things are live. If I change that probability and say it should really just be 80%, "let's talk more," all of this is live. When I make these changes, I update back to Salesforce immediately. So whether people are looking here in Sigma or over in Salesforce, all of this information is passing through. There's the new 80%, and it's resolved: "Let's talk more."

Now, to dissect this for a moment for the technical folks and show you how this works, it's really three things. When I load a file, I look at the metadata to see the type. If it's an image type, I use that metadata with TO_FILE to grab the guts of that file, then call AI_COMPLETE in Snowflake to give a detailed description. For the LLM, I have an input box hidden on my admin page, and in this case it says to use Claude Sonnet 4.5.
For audio and video, it does AI_TRANSCRIBE on the file, and then uses AI_COMPLETE to give a detailed description. For documents, it does AI_PARSE_DOCUMENT, then a GET on the content field within the JSON of that full parse, and then again it uses AI_COMPLETE. That's how I'm getting full descriptions of all the different media and file types.

And if I were to show you from complete scratch, let me make a brand-new page for a minute and show you an input table. When we use the browse button to load, I'm actually feeding that through an action into an input table, something like this. So we have this new file type. I can come in here and grab that image we were looking at a minute ago, and now I can start dissecting. In other words, if I create a calculated field and just wrap, for example, Json() around this new file type, it shows me all the metadata: the name of the file, what type it is, and so on. That then lets me ask those questions: are you an image, are you text, et cetera. But then I can start getting into some really sophisticated things, like putting the AI_COMPLETE call in another calculation. If I come back to my summary for a minute and grab the one for images, I can paste that into a calculation and make one small change: when I do it through an input table itself versus the file upload, I just need to add a quick little array expression to grab the first object. Now it's running that AI_COMPLETE, analyzing the picture, and giving a full-blown description. I could ask other questions, like what colors are represented here,
or what would you classify this type of object as, and it would come back and say a pallet. But in this case, I'm just saying: give me a full dump of everything you've got on this picture. That's it.

We're happy to dive deeper with any sort of explanation or demo. We love talking about AI and all its capabilities within Sigma. There are lots of other ways to use AI in Sigma, whether it's helping you build out your workbook pages, dashboards, forms, workflows, and formulas, or just asking questions of your data, like "Across my organization, what were sales today?" Happy to talk to folks about all that great stuff. I hope this helps; take care, and thank you.
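The three-branch logic described in the demo can be sketched as follows. This is an illustrative Python dispatcher, not Sigma's implementation: it picks which Snowflake functions to chain based on a file's content type and returns the resulting SQL expression as a string. AI_COMPLETE, AI_TRANSCRIBE, AI_PARSE_DOCUMENT, TO_FILE, TO_VARCHAR, and GET are real Snowflake functions, but the exact call shapes and the model name below are simplified assumptions.

```python
# Illustrative sketch: choose which Snowflake AI functions to chain
# based on a file's metadata type, mirroring the three branches in the
# demo (image / audio-video / document). The SQL templates are
# simplified; consult Snowflake's docs for exact signatures.

PROMPT = "give a detailed description of the following: "
MODEL = "claude-sonnet-4-5"  # model name taken from the demo; yours may differ

def build_expression(content_type: str, stage: str, file_id: str) -> str:
    """Return a SQL expression that describes the file with AI_COMPLETE."""
    to_file = f"TO_FILE('{stage}', '{file_id}')"
    if content_type.startswith("image/"):
        # Images: hand the file reference to AI_COMPLETE directly.
        return f"AI_COMPLETE('{MODEL}', '{PROMPT}', {to_file})"
    if content_type.startswith(("audio/", "video/")):
        # Audio/video: transcribe first, then describe the transcript.
        return (f"AI_COMPLETE('{MODEL}', '{PROMPT}' || "
                f"TO_VARCHAR(AI_TRANSCRIBE({to_file})))")
    # Documents: parse, pull the 'content' field from the JSON, then describe.
    return (f"AI_COMPLETE('{MODEL}', '{PROMPT}' || "
            f"GET(AI_PARSE_DOCUMENT({to_file}), 'content'))")

print(build_expression("audio/mpeg", "@files_stage", "recording_34.mp3"))
```

The point of the sketch is the branching itself: one prompt, three different preprocessing chains, all ending in AI_COMPLETE.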
Let's back up for a moment and first clarify what we mean by "unstructured data", because it's a term people use in different ways. I usually think about it as:
Structured unstructured data: documents that may look different from one another, but are answering similar questions. Healthcare claims are a good example—one might be a single page, another three pages. The layout changes, the terminology varies, logos move around, but you’re still trying to understand the same core details: what happened, who it involves, and what was charged.
Totally unstructured data: cases where you might have two or more documents with no consistency at all. One could be 20 pages long, another just two pages, covering different topics with different formats. But even then, it can be valuable to ask questions like, “What’s the summary?” or “What was the main point?” or even to pull out a specific detail, like how long a contract is valid.
Multimedia data: things like images, audio, and video also contain a wealth of information that provides necessary business context to dig into from an analytics perspective.
In all these instances, the questions asked of this type of data and the AI-generated responses fall under the umbrella of "unstructured AI". The value here comes from being able to ask these questions at scale, in a way that makes this data usable alongside everything else the business already analyzes.
No, you can’t just do it with ChatGPT
Tools like ChatGPT and Gemini, for example, are incredible. They make it easy to upload a document or an image and start asking questions right away. But for businesses, the limitations of these tools quickly become clear.
For instance, when you copy and paste sensitive documents or customer information into a standalone chat tool, that data no longer sits inside your Snowflake account. It’s not protected by the same permissions, audit controls, and security models you trust for the rest of your data, and that poses significant risks. Scale is another major limitation. Uploading one file at a time might work for a demo, but it doesn’t work when you already have thousands of documents sitting in S3, Azure Blob, or Google Cloud Storage. You can’t easily run the same analysis across all of them, and you can’t connect the results back to the structured data that already lives in your warehouse.
That’s exactly what Sigma and Snowflake tackle together. When AI runs inside Snowflake, those models are operating within your governed data environment. Nothing is being exported or handed off to a third party. And when Sigma sits on top of that, it gives business users a way to ask questions of unstructured data using the same interface they already use for analytics, while keeping everything connected, secure, and repeatable.
How do Sigma & Snowflake enable you to work with unstructured AI?
So, how does it work? The key is making unstructured AI accessible through a workflow that business users can navigate on their own:
Step 1: Upload files directly into your Sigma workbook
This is intentionally lightweight. You can upload a handful of files directly in Sigma to get started, and those files are written to your configured cloud storage infrastructure, such as Amazon S3 or Google Cloud Storage. From there, you can immediately begin exploring what’s possible with unstructured AI, without setting up pipelines or logging into other systems.
Step 2: Access files already sitting in your enterprise cloud storage locations (Amazon S3, MS Azure Blob, and Google Cloud Storage)
Keep using the same cloud storage infrastructure the business already trusts. Sigma can see and access all of these cloud-based files (if granted secure access, of course), while the actual content stays inside your governed environment.
Step 3: Ask questions using AI functions
Sigma can call Snowflake functions like AI_PARSE_DOCUMENT for document files (anything that isn’t an image, audio, or video), extracting the full text from a given document. Likewise, AI_TRANSCRIBE can be called for audio and video files. By then calling AI_COMPLETE, a core Snowflake Cortex function, Sigma enables users to ask questions of images directly, or of the results from the parsing or transcription above. If the user is working with multiple files, they can run the same question across all of them at once. And this isn’t strictly limited to Snowflake: you can leverage similar AI functionality with our other cloud data warehouse partners to achieve comparable results.
Step 4: Work at scale, not one file at a time
Once the files are in place, users can filter, group, and compare results just like any other dataset. They can ask questions across thousands of files that already live in cloud storage.
Step 5: Connect results back to warehouse data
Join outputs back to structured data, like sales, claims, inventory, or customer records, retrieving image and document insights that can be analyzed alongside everything else the business already tracks.
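To make Step 5 concrete, here is a minimal, self-contained sketch of joining AI-extracted fields back to structured warehouse rows. The order numbers, file names, and extracted summaries are all made up for illustration; in practice the join would happen in your warehouse or in Sigma, not in application code.

```python
# Minimal sketch of Step 5: join AI output from unstructured files back
# to structured warehouse records. All data below is fabricated.

# Structured side: order rows as they'd come from the warehouse.
orders = [
    {"order_no": "3411000", "customer": "ABC", "status": "missing"},
    {"order_no": "3411001", "customer": "DEF", "status": "shipped"},
]

# Unstructured side: per-file AI output (e.g. from AI_PARSE_DOCUMENT or
# AI_TRANSCRIBE), keyed by the order number the model surfaced.
ai_extracts = [
    {"file": "bol_morning.pdf", "order_no": "3411000",
     "summary": "1 pallet, 12 packages, moved by the side gate"},
    {"file": "incident.mp3", "order_no": "3411000",
     "summary": "incident in zone B at 4 PM"},
]

# Join: attach every matching extract to its order row as evidence.
by_order = {}
for ex in ai_extracts:
    by_order.setdefault(ex["order_no"], []).append(ex)

enriched = [{**o, "evidence": by_order.get(o["order_no"], [])} for o in orders]

for row in enriched:
    print(row["order_no"], row["status"], len(row["evidence"]), "evidence files")
```

The missing order picks up two pieces of evidence (a bill of lading and an incident recording), while the shipped order picks up none, which is exactly the kind of side-by-side view Step 5 is about.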
Sigma enables teams to directly upload and manage unstructured files like images, PDFs, and audio recordings to bring 'dark data' to life inside workbooks.
By asking a simple question, users can instantly surface actionable insights from billboard photos and maintenance information directly within their existing workflow.
What’s most powerful here is the ability to ask a single question across multiple data types—documents, images, and beyond—to understand how they relate to one another. This is where unstructured AI begins to feel truly multimodal, enabling insights that simply aren’t possible when each format is analyzed in isolation.
Formulas you can use to parse unstructured data in Sigma
For more detail, here are the three key calculations I used in the video at the top of this blog. In Sigma, you can write formulas to invoke LLMs—a process we call AI Query—that are hosted by your cloud data warehouse (e.g. Snowflake, Databricks, Google BigQuery, etc.). While the calculations below are used in an app with a Snowflake connection, you can write similar functions for other cloud warehouses. For step-by-step instructions on how to build an AI App that analyzes multiple file types, review this Quickstart.
Process images
Text(CallVariant("ai_complete", [LLM], "give a detailed description of the following: ", CallText("to_file", [Stage], Text(Json(Text([File])).id))))
Process audio and video files
Text(CallVariant("ai_complete", [LLM], Concat("give a detailed description of the following: ", CallText("to_varchar", CallText("ai_transcribe", CallText("to_file", [Stage], Text(Json(Text([File])).id)))))))
Process documents
Text(CallVariant("ai_complete", [LLM], Concat("give a detailed description of the following: ", CallText("get", CallText("ai_parse_document", CallText("to_file", [Stage], Text(Json(Text([File])).id))), "content"))))
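All three formulas share the same inner step: Json(Text([File])) turns the uploaded file column into its metadata, and .id pulls out the stage path that gets handed to TO_FILE. Here is a rough Python analogue of that extraction, using a made-up metadata payload; the real shape of Sigma's file column may differ.

```python
import json

# Hypothetical metadata payload for an uploaded file. The actual JSON
# Sigma stores may have different keys; this mirrors the idea that the
# file column holds an id (stage path) plus type information.
raw = ('{"id": "uploads/pallet_photo.jpg", '
       '"type": "image/jpeg", "name": "pallet_photo.jpg"}')

meta = json.loads(raw)                       # ~ Json(Text([File]))
file_id = meta["id"]                         # ~ .id, passed to TO_FILE
is_image = meta["type"].startswith("image/") # branch on media type

print(file_id, is_image)
```

This is also where the "are you an image, are you text" branching from the video happens: once the metadata is parsed, a simple check on the type field decides which of the three formulas above to apply.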
The next era of AI and analytics
What this ultimately changes is how we think about analytics. For a long time, it’s been defined by what fits neatly into tables and dashboards. But a huge amount of business context lives outside those boundaries, in documents, images, and files that teams rely on every day but rarely analyze.
By bringing unstructured AI into the same environment as BI, Sigma extends analytics rather than replacing it. Business users don’t need to learn a new tool or hand work off to specialists. They can explore questions, validate results, and connect insights back to warehouse data using workflows they already understand.
Just as importantly, they can do it without breaking governance or security. Unstructured AI moves from isolated experiments to something teams can safely explore and turn into repeatable workflows. When AI lives where business data already lives, it starts becoming part of how decisions actually get made.
Ready to apply AI to unstructured data? Build your own AI Apps at Workflow, Sigma's User Conference.
Note: This content was made with Snowflake functions, but our other cloud data warehouse partners offer similar functionality in varying capacities.