June 12, 2020

Data Demystified Podcast Ep 1: Building Data Architecture From the Ground Up


Data Demystified is Sigma’s monthly video podcast series about building data literacy and surfacing transformative data insights. Sign up to receive enlightening conversations with some of the biggest and most influential personalities in the data and analytics world sent directly to your inbox.

Lending plays a critical role in economic development. Loans help people participate in activities that can help them rise out of poverty like obtaining an education, starting a business, or making an investment.

In some parts of the world, a loan can mean the difference between life and death. The money may be needed to purchase a bus ticket out of a war-torn area, provide food during a famine, or obtain critical medical supplies.

But access to credit is severely limited for many of the people who need it most. In the developing world, many who live in poverty are unable to obtain loans from traditional sources. Their only option is to use private money lenders, a practice that is both risky and dangerous.

Migo is a cloud-based platform that is changing that. Using a combination of technologies, including SMS messaging and feature phones, and data sources like phone bill records, Migo is able to underwrite loans and help people in the developing world obtain the vital credit they need. With Migo augmenting traditional banks, thousands of people throughout the developing world are establishing businesses, obtaining critical necessities, and building better lives for themselves and their communities.

Integrating all of these technologies and data sources and establishing consensus is no easy feat, but it empowers everyone in the organization to take a holistic approach and collaborate to uncover transformative insights. In this episode of Data Demystified, we sat down with Joseph Bates, Analytics Architect at Migo, to dive into his approach and process for building their data architecture. Check out the transcript below to learn:

How multiple data sources are efficiently funneled into Snowflake
The data architecture choices that improved reaction time to volatility in the business
Ways their system enables true self-service analytics

Here is the transcript of our conversation with Joseph Bates from Migo:

Joseph Bates, Analytics Architect, Migo

Daniel Codella: Hello, everyone. Welcome to Data Demystified. This is a video podcast about cultivating data literacy and surfacing data insights. I'm Daniel Codella, Data Evangelist at Sigma Computing. And today, I'm joined by Joseph Bates, Analytics Architect over at Migo. Joseph, thanks so much for joining us.

Joseph Bates: Yeah, my pleasure.

Daniel Codella: Awesome. So, before we get started here, what is Migo and what are you doing over there?

Joseph Bates: Yeah, so Migo is a Series B FinTech. We're a cloud-based platform that basically extends new payment and credit options to customers who are underbanked. So, most of our business is in the third world, Western Africa and South America. And so, people who don't have credit scores, don't have banking history, can still get access to those vital cash lifelines, whether it's to run their business or for more personal uses. We're trying to kind of extend that level of flexibility to everyone.

Daniel Codella: That sounds awesome. What was the status of data analytics at Migo before you joined?

Joseph Bates: Very, I would say definitely in its infancy. We're kind of doing a hodgepodge of things. We had set up Superset to run kind of directly against our MySQL transactional tables. It was a bit clunky. It was kind of getting the job done, but it was holding us back from doing a whole lot more third-party stuff. So, yeah, I mean, things were generally functional, but we knew that we had kind of been building up a lot of debt. And so, that's why they opened up this position. That's why they hired me to come in and do that.

Daniel Codella: So, before you started building the data architecture, you and the team came up with three kind of guiding principles. What were they?

Joseph Bates: Sure. The first principle was just that, we've hired good people. These are smart people and they know lots of stuff. So, the main thing was that we wanted to bridge any gaps that they were kind of missing on the skills side, to do their jobs. So, all of the things around data acquisition and cleanup and analysis, we wanted to make that easy for them.

And then the second one was that we knew that the architecture choices that we were making would help influence how people think, so that we could kind of guide the company toward a unified language and a unified, essentially making the data architecture match the business model. So, our data model would match the business model and kind of keep everyone, yeah, I mean, basically speaking the same language.

And then the third was just that, by making good choices around the data, the kind of extension of the second principle is that it helps the engineers and the data experts understand, build a mutual understanding with the domain experts, the people who really understand how to move the needle on the business.

So, those were kind of the three guiding principles for our architecture choices.

Daniel Codella: So, what were the project's main goals?

Joseph Bates: Sure. So, I mentioned kind of our guiding principles. The goals are a bit more quantifiable. Firstly, we wanted to remove engineering from the critical path for business and career enlightenment. And by that, I just mean that, so many kind of new data use cases or bringing in a new data source, it all went into the engineering backlog. As most startups are, we're severely understaffed. So, every engineering person hour is hugely valuable. And so, it was just hard to move stuff to the front of the engineering queue. It requires product to get involved to help kind of triage and field these requests. And it's just not... It's very easy for the folks asking and making these requests to feel like they're not being heard. So, that was kind of our top goal.

We also just wanted to reduce latency. It takes a long time, it was taking a long time for the data to get into the systems that it needed to, for it to be accessible. We were rebuilding the database every night and that's a lengthy process. We wanted to reduce the maintenance and attention costs. So, from engineering or me and my team or anything, we wanted to kind of reduce the amount of time that we were spending babysitting the data.

And then we also wanted to improve just the overall level of access. Everything was kind of siloed. We didn't have a true unified data store with front-end tools and GUIs that would help people access it. So, that's kind of a multicomponent thing. We wanted to bring in something that would provide a GUI for people to access all that data, but then also make it so that all the sources that they could want, and all the kind of data that needs to be conjoined, was all there.

Then the kind of last and nebulous one is that when you look at our data model, it's a funnel. And people will come and check us out on the website, or they know about us from our SMS kind of application. They can apply for, let's say, a loan or a payment option on their utility bill or something like that. They can contact us to get offers on loans, and then maybe they take an offer. That's great. They get all the way through the process of setting up their bank accounts so they can get the funds. And then, they receive the funds. They get the loan. They pay back the loan. Great. They've come all the way through the funnel.

But there are obviously numerous places where there's drop-off in that funnel. And so, we were very keen to kind of convert all of those to nice automated loops through marketing automation that could get people back into the correct nurturing campaigns, so that they're really in our overall business model. There would be no more funnels, no more termination points. It would all be loops that would have routed back through our marketing automation system. So, that one was a bit harder to quantify, but that was basically, we knew we needed a marketing automation system. We were doing it all kind of more by hand, essentially, with a few exceptions that we built in-house, but it was a bit of a mess. Yeah. Those are our five goals.
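Editor's note: the funnel-to-loop idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The stage names, the campaign names, and the helper functions are hypothetical, invented for this example; they are not Migo's data model or Blueshift's API.

```python
from typing import Optional

# Hypothetical funnel stages, in order, loosely following the steps
# described in the conversation (apply, offer, accept, bank setup,
# funds, repayment).
FUNNEL_STAGES = [
    "applied",
    "offer_received",
    "offer_accepted",
    "bank_account_linked",
    "funds_received",
    "loan_repaid",
]

# Hypothetical mapping from the last stage a customer reached to the
# campaign that should pull them back in. Every stage has an entry,
# including the final one, so there is no termination point.
NURTURE_CAMPAIGNS = {
    "applied": "offer_reminder",
    "offer_received": "offer_followup",
    "offer_accepted": "bank_setup_help",
    "bank_account_linked": "disbursement_check",
    "funds_received": "repayment_reminder",
    "loan_repaid": "repeat_loan_offer",
}


def last_stage_reached(events: list[str]) -> Optional[str]:
    """Return the furthest funnel stage present in a customer's events."""
    reached = [stage for stage in FUNNEL_STAGES if stage in events]
    return reached[-1] if reached else None


def route_to_campaign(events: list[str]) -> str:
    """Pick the campaign for a customer based on where they dropped off."""
    stage = last_stage_reached(events)
    if stage is None:
        # No funnel activity yet: fall back to a generic awareness campaign.
        return "awareness"
    return NURTURE_CAMPAIGNS[stage]
```

For example, `route_to_campaign(["applied", "offer_received"])` returns `"offer_followup"`. The point of the design is that the lookup is total: every place a customer can stop has a route back in.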

Daniel Codella: Oh, yeah, pretty ambitious goals. Love the approach. Could you walk us through what you were able to build?

Joseph Bates: Sure. So, we decided to go with kind of a hub and spoke model, or essentially, we put our data warehouse in the middle, that was Snowflake. And we were pumping all of our platform information directly into Snowflake. So, we had set up some logging through Sumo. And so, we kind of had some legacy stuff that we could rewire up there, pull some data directly out of S3. We run Apache Fineract for our loan ledger and that's pumped in. And then we have some MySQL databases that are also pumped in.

And then, there are numerous kind of spokes that go out. So, Snowflake feeds Blueshift, which is our new marketing automation platform, which itself does a bunch of texting and emailing and things like that. Most of our business is done through texting, which presents interesting challenges, but it's pretty cool. People with feature phones can interact with our apps just by texting star 123 or whatever it is for their country. And then, we also kind of used that to corral all the dark data, meaning local files on people's computers and Google sheets, and our corporate Google account, and Zendesk and our survey data as well, all flow into Snowflake.

And then we kind of have an interesting setup with our ERP and FP&A system. So, we use Sage Intacct for our ERP. It's where we store the general ledger, and Adaptive Insights for our FP&A platform. And both of those have an integration with Snowflake to where all of the fields between the two APIs are kind of mapped, and we can do nightly dumps of data, the most recent data into those platforms. Not only that, we can also take the accounting-enriched data or the forecasting-enriched data and bring it back into Snowflake really easily. So, for all of the kind of easy copies of data into Snowflake, we use Fivetran. Then for all of the more nuanced API to API conversations, we use Tray.

Then, so some of those spokes are bilateral. And then, we also have bilateral connections to our data science and analytics platforms. So, for our analytics platform, we have Sigma and for our data science platform, we use a combination of Jupyter and PyCharm. And both of those have read/write access to certain schemas to be able to do kind of handy materialization jobs or performance optimization.

So, it kind of makes this nice model where Snowflake knows everything and then, all of those downstream systems can kind of feed off of Snowflake. And so now, we're just kind of in the mode of ratcheting down the latencies. We're going to roll out an event, a proper event schema that pumps data in through Kinesis, and it'll be in micro-batches of like two minutes instead of doing what we were doing in the olden days of rebuilding the database every night and then recopying it in every day. That was a mess and takes like a couple hours. So, our goal is to get it down to two minutes.
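Editor's note: the move from nightly rebuilds to two-minute micro-batches comes down to grouping events by a fixed time window. The sketch below shows only that windowing logic in plain Python. It does not use the Kinesis or Snowflake APIs, and the event shape (a timestamp paired with a payload) is an assumption made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Two-minute micro-batches, the target latency mentioned above.
BATCH_SECONDS = 120


def batch_key(ts: datetime) -> datetime:
    """Floor a timezone-aware timestamp to the start of its window."""
    epoch = int(ts.timestamp())
    window_start = epoch - epoch % BATCH_SECONDS
    return datetime.fromtimestamp(window_start, tz=timezone.utc)


def micro_batches(events):
    """Group (timestamp, payload) pairs by the window they fall into.

    `events` is a hypothetical iterable of (datetime, payload) tuples.
    Returns a dict mapping window start -> list of payloads, in arrival
    order within each window.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        batches[batch_key(ts)].append(payload)
    return dict(batches)
```

Each window can then be loaded as one small unit, so the data in the warehouse is never more than about one window behind the source.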

Daniel Codella: Really interesting technologies at play too, with the SMS messaging. Really interesting.

Joseph Bates: Yeah. I mean, that's still the majority of the world. We have an Android app, we have a WhatsApp plugin. We have some of the core kind of more modern things that are starting to get a hold in Western Africa and South America, but the majority of people and the people who need banking services the most tend to be on a feature phone, not a smartphone.

Daniel Codella: That's incredible. So, what have been some of the results that you can report?

Joseph Bates: Well, I would say the biggest thing has been, I'd say the goal that we've achieved the most out of the five has been improving accessibility and ease for the data. So, we have Sigma now properly set up. It's doing the majority of our materialization inside Snowflake. We haven't yet set anything up like dbt or Airflow to kind of do that for us. So, we're actually using Sigma pretty heavily on the orchestration side, just to keep our stuff fast. That's a little bit, maybe not how Sigma was intending for it to be used, but that's what we're using it for, and it's really nice.

And then we've also, I mean, one of the big wins is just rolling out those third-party systems. We are no longer keeping track of our FP&A in an Excel sheet that crashes because it's too big. So, having direct Snowflake to Adaptive connections, and Snowflake to Sage connections has been huge. The biggest one, going back to the goal of converting funnels into loops, was Blueshift. So, we do big JSON and CSV dumps into Blueshift on regular intervals to be able to do text and email campaigns. So, for every point in our business model where people can drop off, we are trying to kind of tackle each of those within our tree campaign.

And then finally, we're also, for folks who have defaulted on their loans for whatever reason, we're working on recovery campaigns. So, even that, it's like the worst-case scenario for us, which is that someone defaults on a loan or a payment. We have recovery campaigns that can run, that can say, "Hey, we realize that life can get in the way, but if you pay back, let's say 60% of what you owe, you can come back into the platform and then take out another loan." It really is a super important lifeline for folks. It's a little bit foreign to us. And in that, being able to take out a loan that feels... It's a 14-day, 30-day loan, but it could be the difference between being able to pay a medical bill or being able to buy a bus ticket out of a dangerous area. And a lot of this stuff is really life or death. So, it's been really interesting to see and hear the stories about how people are using this.

Daniel Codella: Wow. So, yeah, that exchange of data, I mean, there's lives on the other end of that, that really depend on that. That's so interesting. To switch gears a little bit, Migo seems like a very data-driven organization.

Joseph Bates: Totally.

Daniel Codella: How do you break down the data language barrier between data and domain experts, and get everybody collaborating effectively? Are there any tactics or strategies that you've used?

Joseph Bates: The biggest thing I think was, was... So, I spent my first 30 days at the company, this was back last year, just building a data dictionary, and not even, I wouldn't even call it a data dictionary, just a company dictionary, to smooth out the different terms that people are using to mean the same thing, to disambiguate terms that were being used to mean multiple things. People were talking about LTV and to a finance person, that's loan to value. For a product person, that's lifetime value of a customer. So, trying to kind of bridge all those gaps, and to build a data model that we could really form a consensus around.

Consensus is the name of the game. And it takes a while to get buy-in. And the nice thing is that you can kind of, when you're setting up these systems, especially when you're kind of new, like I was, and you're given a pretty strong mandate, I could kind of cheat as far as achieving that consensus in that, in setting up Sigma, I get to name everything and call everything and aggregate everything the way that I know to be correct, or the least ambiguous or whatever. And so, I could kind of impose my will on that, which is a little bit antithetical to consensus, but it kind of speeds things along.

And eventually, you can achieve consensus just by virtue of attrition. As long as you keep everything named consistently and keep all the definitions up to date, eventually, people just get on the train and it's fine. So, obviously, getting Sigma and Jupyter set up and running off of Snowflake was a huge win. It gives us much faster reaction times to volatility in the business, which is also key for keeping everyone's language the same. You release a new product, you have to really jump on it. A new product, new feature, you have to really jump on it and make sure everyone's talking about it the right way, using the same aggregation definitions and things like that. And so, being able to react to those things more quickly is huge for kind of keeping everyone on the same page.
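Editor's note: the company dictionary described in this answer can be pictured as a list of term definitions tagged by team, plus a check that flags any term carrying more than one meaning. The sketch below is a minimal illustration. The two LTV entries come from the conversation; the third entry, the tuple layout, and the function name are assumptions made for this example.

```python
from collections import defaultdict

# Each entry is (term, team, definition). The LTV pair is the example
# given in the conversation; the "default" entry is hypothetical.
ENTRIES = [
    ("LTV", "finance", "loan to value"),
    ("LTV", "product", "lifetime value of a customer"),
    ("default", "finance", "loan not repaid by the end of its term"),
]


def ambiguous_terms(entries):
    """Return terms that have more than one distinct definition.

    The result maps each ambiguous term to its definitions, sorted
    alphabetically so the output is stable.
    """
    definitions = defaultdict(set)
    for term, _team, definition in entries:
        definitions[term].add(definition)
    return {
        term: sorted(defs)
        for term, defs in definitions.items()
        if len(defs) > 1
    }
```

Running `ambiguous_terms(ENTRIES)` flags only `"LTV"`, which is the kind of term that would need to be renamed or split before it is used in a shared metric.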

Daniel Codella: Nice. I love that. So, common vernacular and building consensus around every term.

Joseph Bates: Yeah. I mean, and the great thing about doing that is it makes for a nicer work environment. I mean, if everyone kind of agrees on the nature of the business, like what the business model is, how the world works, and what all our terms mean, and how our metrics are aggregated, you can't really disagree on much, and disagreements that you do have are going to be around strategy and things where disagreement is healthy. Disagreement based on, because my version of the truth is different than your version of the truth, those are unhealthy. Those are completely unproductive, as we've seen over the last four years.

Daniel Codella: Yeah. That makes total sense, having that kind of, and also a single source of truth. That's great.

Joseph Bates: Yeah. Yeah. I mean, I'm a little bit down on that phrase because I think it gets overused and kind of loses its meaning. I think that in general, what I kind of preach is like, sure, you want a zone that people are aware of. This zone is single source of truth. But then you have this kind of DMZ area where it's like, there is no single source of truth. This is stuff that's immigrating into the better zone, but that takes time. That takes a maturity process to get it there.

And so, it's just, you have to be clear on where that white part is and where the gray part is, as far as how stuff gets into that area, where we've agreed that there is a single source of truth. There's a lineage to get it there. So, I think the whole thing around always everything, single source of truth, it's just a bit overblown, and in some cases, counterproductive because you need to bring in new stuff, that maybe hasn't been properly defined yet. And doing the work up front to try to get it all the way to that level of kind of crystalline perfection that you want, might be throwaway work if it's a product or feature that you're not going to keep, or if you're in a super volatile business, like we are.

Daniel Codella: Well, before we wrap up, what's next for yourself or for Migo? Anything interesting coming in the near future?

Joseph Bates: Well, the biggest thing for us is making improvements. We've hired this total crack team of Ph.D. statisticians to rebuild our models. For us, we live and die by the quality of our data. Our customers do not have credit scores. So, we build our own credit scores and distribute loans based on those. And so, if we're not scoring correctly, we're hosed. We've already accepted the fact that we're going to see a roughly 50% default rate on first-time loans. It doesn't matter because enough of those people stick around and then take out a bunch of loans, Migo becomes a lifeline for them, and they're constantly taking out loans to improve their business, that it outweighs the cost of acquiring all those customers and losing half of them.

But as you can imagine, if we were better at scoring those customers based on the data we have about them, we could fine-tune the amounts or the terms or whether or not we even give the loan in the first place, to basically make better predictions and cut those losses. And then once we cut the losses, you could imagine that we could kind of redistribute that margin in terms of providing lower interest rates too, to customers who we know, and that'll create a better lock-in and better lifetime value.

So, a lot of our efforts now are around kind of bringing all this, coalescing all of this stuff together, improving, not only our model's quality, but also just the frequency at which we update them. That was a big thing that Snowflake helped us solve, was being able to bring all the data together, rebuild the models quickly, and be able to rescore on a more regular cadence. As you can imagine, the more frequently we rescore everyone, the better our data, the better the model's going to perform.
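Editor's note: the trade-off described in this answer, accepting a high default rate on first-time loans because repeat borrowers make up for it, can be written as a small expected-value calculation. The sketch below is a deliberately simplified model, not Migo's. Only the roughly 50% first-loan default rate comes from the conversation. The loan size, fee rate, number of repeat loans, and repeat default rate are hypothetical, and the model assumes a defaulted loan loses the whole principal and that every loan is the same size.

```python
def expected_value_per_customer(
    principal: float,
    fee_rate: float,
    first_default_rate: float,
    repeat_loans: int,
    repeat_default_rate: float,
) -> float:
    """Expected profit per acquired customer under the simplified model.

    A repaid loan earns principal * fee_rate; a defaulted loan loses the
    full principal. Only customers who repay their first loan go on to
    take `repeat_loans` further loans.
    """
    first_loan = (
        (1 - first_default_rate) * principal * fee_rate
        - first_default_rate * principal
    )
    per_repeat_loan = (
        (1 - repeat_default_rate) * principal * fee_rate
        - repeat_default_rate * principal
    )
    retained_share = 1 - first_default_rate
    return first_loan + retained_share * repeat_loans * per_repeat_loan


# Hypothetical inputs: 100-unit loans, 15% fee, 10 repeat loans,
# 5% default rate on repeat loans.
baseline = expected_value_per_customer(100, 0.15, 0.50, 10, 0.05)
better_scoring = expected_value_per_customer(100, 0.15, 0.40, 10, 0.05)
```

Under these made-up inputs the first loan loses money on average, the repeat loans more than cover it, and lowering the first-loan default rate raises the per-customer value, which is the margin the speaker describes redistributing as lower interest rates.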

Daniel Codella: That's so fascinating. Well, we hope to maybe even get an update from you in the near future.

Joseph Bates: Sure.

Daniel Codella: Well, Joseph, thanks so much for joining us today. It was super insightful and we look forward to watching Migo continue to evolve.

Want to keep reading? Learn how Migo's Marketing Team Achieves Transformational Customer Retention and Recovery with Sigma.
