In this episode of Behind the Data, Matthew Stibbe interviews David Green, CTO at IWSR, the global leader in data, analytics and insights for the beverage alcohol industry. They discuss the evolving landscape of data management in the beverage industry and explore the role of human expertise in data analysis, the challenges of data ingestion, and the limitations of traditional tools like Excel. David shares insights on how IWSR is leveraging CloverDX to automate data processes, allowing researchers to focus on their core expertise while scaling operations effectively. The conversation highlights the need for a balance between automation and human intuition in data-driven decision-making.
00:00 - Introduction to IWSR and Emerging Trends
03:03 - The Role of Human Expertise in Data
06:04 - Understanding IWSR's Data Collection Process
08:52 - Challenges in Data Ingestion and Automation
11:58 - The Limitations of Excel in Data Management
14:45 - Implementing CloverDX for Data Automation
21:00 - Scaling Data Processes and Future Insights
Matthew Stibbe (00:01)
Hello and welcome to Behind the Data with CloverDX. I'm your host, Matthew Stibbe, and today I'm with David Green, who is CTO at IWSR. Now, as some of you may know from previous episodes, I'm a bit of a wine geek. I used to have a little wine importing business and I did my diploma in wine at the WSET. When we were studying the industry, we had quite a lot of IWSR data coming through in our lectures and courses. So it's a double pleasure to have you on the show today. Welcome, David.
David Green (00:35)
Thank you Matthew, great to be here, looking forward to it.
Matthew Stibbe (00:37)
So before we dive into your world and IWSR a little bit, I'd love to start with something a little more immediate. Are there any emerging trends or themes that you're particularly interested in? What's going on in your world?
David Green (00:53)
I think the thing that's exciting for me at the minute is that it finally feels like we're moving on from years of hype on dashboards and AI. AI is everywhere at the minute, and I'm very excited about AI, but I think the exciting thing for me is that we're moving past that now. We're looking more at the mundane, unglamorous work that's necessary. How do you clean your data? How do you integrate data? How do you ingest it? All of the glue work that needs to be done; we're focusing on that much more now.
You know, the bottleneck has always been messy, inconsistent, human data, and actually turning that into something useful and reliable has always been the real challenge. I think we're starting to focus more on that now and getting less excited by all the new shiny gadgets.
Matthew Stibbe (01:33)
I think the large language models do a pretty good job of appearing intelligent, and potentially are intelligent, but you keep stumbling across the limitations as well, don't you?
David Green (01:47)
Yeah, exactly. I mean, these tools are amazing. I'm a huge fan. I'm very excited about what they can offer. But I think you quite quickly realize where the limitations are. They're good for some things, and they're incredibly powerful. But as soon as you start dealing with human data sets, for example... we tried throwing some messy Excel sheets at a large language model: what can you do with this? It does a kind of halfway OK job, and then you look at it and realize it's just missed all of this other stuff, because there's some commentary and some highlighting and some weird notes. How is AI supposed to pick all of that out? It's hard for humans to understand how the data is structured, never mind a machine.
Matthew Stibbe (02:25)
So there's still a role and a need. I'm an AI enthusiast too. I describe myself as a people-first AI pragmatist, right? But I don't think it's going to instantly replace everybody and everything. And I think there's still a role for human intuition and judgment and experience, isn't there?
David Green (02:42)
Yeah, absolutely. I mean, our business is based on that human intuition and the value of that. In my time here, it's been so key to understand that, with all of the tooling and all the automation we can offer, it's actually our human experts that are the real value we as a business offer to our clients. Enabling them to provide that expertise is absolutely key, because we don't want to automate that. As a business, our customers rely on us for that human judgment, that human expertise, the decades of history that we have. And you can't automate that away. No LLM can replace the deep market judgment that our researchers have.
Matthew Stibbe (03:18)
Yeah, inherently, someone who's got 20 years in the industry, who knows the ins and outs and has seen the patterns... that information isn't available in a form that an LLM can ingest. In a sense, it's your proprietary knowledge and know-how and expertise, and it isn't something that's been captured in the training models.
David Green (03:37)
Yeah, exactly. Our researchers are researching markets, so they're going around interviewing people, talking to people. We talk to thousands of people every year. It's the stuff you read between the lines, right? The stuff that people might not want to write down, might not like to write down, or that isn't written down anywhere. That triangulation, that human ability to take information that is hard to acquire but that people are willing to tell you when you sit down, have a beer with them, and interview them.
That information is invaluable for our researchers to really understand what's actually happening. And I think, until there are robots out in the world doing that and having the same kind of conversations over a glass of beer, we're fine.
Matthew Stibbe (04:21)
Well, I hope so, because I enjoy talking to people, right? The great thing about talking to people is talking to people. So tell me about IWSR. What is it that you actually do? Where does the data and knowledge come in?
David Green (04:24)
So, we're a market data research company. We research 160 markets across the world and we figure out, at a brand line level, how much alcohol is sold year by year. And we sell that data back primarily to the global drinks industry, who, it still amazes me, don't necessarily know where their product ends up. It's a complex, globalized supply chain, and alcohol is regulated all over the world, so there are all sorts of quirks and consequences from that.
So we have a team of expert researchers who travel the world every year, working out at a detailed level exactly how much alcohol, brand by brand, has been sold in all these markets, and we tabulate that. There's a huge data ingestion process there. Some of the data comes through as messy handwritten notes or is heard verbally, some of it is read in the press, but a lot of it inevitably comes through Excel files. So a big part of what we end up doing is taking a lot of that data, triangulating it, and then incorporating it into all of the other sources that we've heard.
Matthew Stibbe (05:31)
So you're bringing data in from manufacturers?
David Green (05:36)
From both the producers and distributors. We work across the supply chain to try and understand: this many cases of Johnnie Walker left the factory, this many went through this distributor, this guy over here says he saw some. We're trying to understand exactly the flow. The data point we're trying to get to is how much went into retail. The company started with wine and spirits, and the thing with wine and spirits is that they can be held in warehouses for a really long time. So you can get a stock problem: a lot may have left the chateau or left on a boat somewhere, but then who holds on to it? It can be anywhere in the supply chain. So what we're trying to understand is what's actually going into retail, what's being sold.
Matthew Stibbe (06:15)
So by the time it's reached retail, you assume it's effectively reaching consumers.
David Green (06:19)
Exactly, because no pubs and bars, no supermarkets are going to hold on to vast volumes of inventory. Retailers generally want to be shot of that inventory quite quickly, whereas for suppliers further up the supply chain it may be worth paying for warehousing space: it might be cheaper for me to hold on to a ton of spirits now and sell it next year when I expect the price to be better. So that's always been our angle: what's going into retail. That's the unique insight that we offer, because obviously the producers know how much they've made.
But what they don't necessarily know is how much people are buying.
Matthew Stibbe (06:49)
Or how much other people are making, or how much of other people are....
David Green (06:59)
Well, yeah, and then it's that consistent, cross-market view. That 160-market-wide view is the real unique selling point for IWSR: nobody else has that same consistent methodology. So you can take a massive market like the US and a tiny market like Botswana and compare them in exactly the same way, so you know you're looking at like for like. Developed markets like the UK and America have really good automated data collection, but you can't use that consistently; if you're looking at a different market, you can't use the same method. So that's always been the reason we rely on this human data collection.
Matthew Stibbe (07:25)
So on some level, what you're doing is processing data that's coming in different formats and sources for different countries and making it comparable, normalizing it.
David Green (07:34)
Exactly, yeah. A big part of the human judgment there is making sure that everything is ultimately cross-comparable, that in every market we're comparing like with like. So there's a huge process there around ensuring that consistency.
Matthew Stibbe (07:46)
Okay, fascinating. So when that data has been processed, how do you deliver that back to your clients?
David Green (07:57)
So we have an online platform that our clients have access to. That's how they get access to the raw numbers. And we also provide a kind of written commentary on top, providing that expert qualitative judgment. There's a real art there. Everybody wants the numbers, the raw data: how did I do? How did my competitors do? What markets are interesting? But actually that expert judgment, the written opinion, is often just as valuable, to give that color to what the raw numbers actually mean. So our platform delivers a mixture of both qualitative and quantitative content.
Matthew Stibbe (08:34)
So as CTO, this data is moving through IWSR's data pipelines. What are some of the challenges that you're facing?
David Green (08:45)
So the challenge really was one of time. So our research process is incredibly condensed. In the world of instantaneous data, we work on an annual publication cycle. So we publish data every May for the previous calendar year, which means our research process starts January 1st, people start traveling.
And then our customers need that data basically as soon as we can get it. We publish in May and it's a real push to get all of the traveling, the in-market interviews, to meet everybody, to collect all of the data, to clean it, to go through all the QC checks it needs to go through to get that published in time for May.
As we started reviewing that process, one of the things we realized is that our domain experts, our researchers, were spending a frightening amount of time on what's effectively manual data entry. We did a screen recording with one of our researchers really early on in my time here, which became legendary just because it was terrifying: I was watching a highly skilled human being copy-pasting numbers from one spreadsheet into our tool, one at a time.
It's just an insane waste of your ability. You're an expert in the field; what you don't know about beverage alcohol in your market isn't worth knowing. And you're sitting here spending an hour of your day just copying numbers back and forth. That's not what your time should be spent on.
Matthew Stibbe (10:00)
Yeah, I mean, who on their deathbed really wants to look back on their life and go, I wish I'd spent more time transposing data from spreadsheets, right?
David Green (10:07)
I mean, job satisfaction must be awful at that point. This, for me, was the real trigger that we needed to do something. And the business needed to be able to scale. We're trying to collect more data, get more depth into more markets, research new products, and the team is having to constantly scale. It's no good for a business if you have to keep hiring more people to grow. Really what we needed was to bake in more scalability, so we needed to look at more automation.
So this was where we started to look at how we can take this data from Excel and build an ingestion pipeline.
Matthew Stibbe (10:40)
And when you were planning this ingestion pipeline, this project, give me a sense of the scale of it. How many data points, how many spreadsheets, how many data sources?
David Green (10:53)
I mean, we receive, I think, tens of thousands of files a year, and each one will have hundreds or thousands of data points in it. So it's a lot of data entry. The net effect was amazing. Our research team of 20 or 30 people each saved eight days per year by the time we'd done this. And that's over a period of only four months, really; they're only really collecting data for four months. So per individual, that's an amazing time saving.
Matthew Stibbe (11:22)
And this is... are you giving spreadsheet templates to your data providers or?
David Green (11:28)
We do to some. Obviously, we're working with a wide range of different partners. Some of them have big, complex ERP systems, so we just get a dump in whatever format their ERP system wants to output and we have to deal with that. But for many of our suppliers, we send them a starter Excel, basically an output from our system, which still runs on Excel. They fill that in and send it back to us, which helps, because mostly that means it's a really understandable format: we know exactly how it's structured, brand lines are here, volumes are here, we know how everything's organized.
But inevitably, because they're often human-edited, there are still mistakes and weirdness creeping in, because somebody does something strange: adds a new brand line in the wrong place, adds volume somewhere weird, or adds a comment saying this number's not true, I'll follow up by email. The range of complexity can be insane. So we try to standardize this, but we can't mandate that our partners provide data in a specific format. There's a real balance there: we have to meet our partners halfway rather than dictate exactly how they send us data, because we need the data.
Matthew Stibbe (12:31)
Well, we were talking about this earlier and I mentioned Pavel Najvar at Clover, who says that if you're using Excel, you've failed in some way. What are some of the problems with using Excel as a data ETL tool?
David Green (12:49)
I mean, Excel's brilliant, right? It's so pervasive and so powerful, but it's used in so many different ways; there's no single way of using Excel. I remember hearing, a long time ago now, that it was a revelation for Microsoft when they realized that people were using Excel to keep lists. So they started adding sorting and filtering functionality, because they hadn't thought that people were collecting lists in it. Of course people collect lists in Excel. What else are people using it for? But this is the trouble, right? Everybody's got their own weird way of using Excel.
David Green (13:17)
So if you say, send us your data in Excel: yeah, here it is. My favorite example: we received one spreadsheet and there were loads of tabs, 10 or 20, I can't remember, all with very similar looking data. And which tab has the correct data in it? Because these all look the same. Apparently the answer was the one labeled 'For Dave'. That's not me. I don't know who Dave was.
David Green (13:40)
Some other guy somewhere in the chain. And how is anybody supposed to know that that was the correct one? In the end, a human had to engage another human: how do I make sense of this, where's the data? It's in 'For Dave'. Sorry, I should have said.
Matthew Stibbe (13:53)
In my world, mad people sometimes send us copy for websites in Excel. They've used it as a word processor, and if you've ever tried writing things where you have to do shift-return in cells, it's the worst possible way to do it. And my wife, who is an accountant and incredibly talented, sees the world through an Excel lens. When we were designing the extension here, she did the interior design in Excel, with each cell being like 10 centimeters. Using it as a CAD tool! So yes, it's incredibly powerful, but also completely irritating. I would ban it if I could. Back in the day when I used to do a lot of marketing for Microsoft, that would have been heresy, but I can say it now. Okay, so.
David Green (14:26)
It's so powerful. I mean, this is it. People use it because it's a really powerful, really flexible tool. So everybody uses the tool that's at hand, you know, and if that's the one you're familiar with, then everybody uses it. This has become our lingua franca, right? You know, we don't have a data interchange format, you know, we're working with partners. Do they send us CSV files? No. Do they send it in JSON? No. Do they send us Parquet? No. It's Excel. Of course it's Excel. And this has become the lingua franca for the whole world.
Matthew Stibbe (15:10)
So at IWSR, you have this tidal wave of Excel files coming in and you are ingesting them how? How did you improve that?
David Green (15:21)
So the big change was that we started integrating CloverDX. This really enabled us to start automating this incredibly manual, time-consuming, labor-intensive process. Using Wrangler, instead of having domain experts spend their valuable time working out how to parse an Excel file, we can bring in relatively inexpensive interns to work with us, because Wrangler's really easy to use and it gives you a nice visual model: here's what the Excel file looks like, just label what the columns are and it'll do the import.
Everything we're receiving is basically tabular data, but there are often headers, or weird stuff off to the side that we don't care about. So it's working out: this is where the brand lines are, this is where last year's volumes are. Okay, great, sorted. We'll pull that data out and then we can do something with it.
So this was a really key first step in our process of not having humans copy-paste values out of spreadsheets. But the real challenge for us has always been mapping. What we receive, once you've got past all of the 'For Dave' nonsense and the weird annotations and the weird commentary and weird structure, you start to get into the real messy mechanics of data. We take brand line level data, which means that all of our partners send it how they think of their brand lines, not how we think of their brand lines. And obviously we have a canonical name for all of the world's drinks brands.
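To give a feel for the kind of extraction David describes, here is a minimal Python sketch. It is not how CloverDX Wrangler does it (Wrangler handles this visually, without code), and the spreadsheet layout and column names are assumed purely for illustration.

```python
# A minimal sketch, not CloverDX Wrangler: Wrangler does this visually.
# The file layout and column names ("Brand Line", "Volume") are hypothetical.
import pandas as pd

def extract_brand_volumes(path: str) -> pd.DataFrame:
    # Read the whole sheet with no header assumption, since the table rarely starts at row 1
    raw = pd.read_excel(path, header=None)

    # Find the header row by looking for a cell mentioning "brand"
    is_header = raw.apply(
        lambda row: row.astype(str).str.contains("brand", case=False).any(), axis=1
    )
    header_row = int(is_header.idxmax())

    # Re-read using the detected header, normalise column names, keep only what we need
    table = pd.read_excel(path, header=header_row)
    table.columns = [str(c).strip() for c in table.columns]
    table = table[["Brand Line", "Volume"]]

    # Drop commentary rows where the volume isn't a number
    table["Volume"] = pd.to_numeric(table["Volume"], errors="coerce")
    return table.dropna(subset=["Brand Line", "Volume"])
```

The point of the sketch is simply that the structural work (find the table, label the columns, discard the noise) can be separated from the domain judgment that comes afterwards.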
Matthew Stibbe (16:38)
So Smirnoff Vodka would be a brand line?
David Green (16:38)
Exactly. Obviously, some people would just call that Smirnoff vodka. So Absolut Vodka is called Absolut Vodka. One of the examples I bumped into recently: there are various flavored versions of Absolut Vodka. We abbreviate 'flavored' across our database to 'flav', because it's briefer; some of them are really quite long names. By the time you've got 'Absolut Raspberry-Flavored Vodka', it's quite a long string, so we shorten it. But obviously some of our partners spell out 'flavored', because why wouldn't they? And some of them just omit the word 'flavored', because 'Absolut Raspberry' is obviously the flavored vodka.
So you have to map from all of the various strings, with foreign characters that we may or may not replicate and misspellings that we may or may not replicate. And in all of this mapping from what third parties call brand lines to what we call them, the real value we were getting from the researchers was that detailed understanding: in this market, these two brand lines that look really similar, are they actually the same thing, or are they two completely different products?
Because this happens increasingly. To get slightly drinks-geeky, a lot of spirits brands are now moving into the ready-to-drink category: prepackaged drinks in a can or a bottle that are ready to drink. They often have really similar names. So knowing that, no, this is the raw spirit, versus, yes, this is the similarly named but prepackaged version, can be quite nuanced. And lots of them have strange, exotic names, because there's lots of creativity in that space at the minute. So the researchers' expertise is knowing which brand line is which, and whether these are two separate things that we need to separate and add to our database, or something we already know about.
This was where Data Manager came in. It allowed us to build a mapping process so we can take just the brand lines. Now we've got the data out of Excel, we can go through each brand line: for this specific brand line, here are some candidates it might be. Does this match? Does this correlate with what we know? Is it the right category? Does the volume look like what we expect it to be? What data do we have to help us triangulate?
So this was the next really key step in the process: focusing in on just that bit of domain knowledge. What's the important bit to get out here? How do we actually map from what our partners call these brands to what we call them?
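As an illustration of the candidate-matching idea David describes, here is a rough Python sketch using simple string similarity. It is not the actual Data Manager logic; the canonical names, supplier identifiers, and similarity threshold are invented for the example.

```python
# Illustrative only: not CloverDX Data Manager's implementation.
# Canonical names, supplier IDs, and the 0.4 cutoff are made-up examples.
from difflib import get_close_matches

CANONICAL_BRAND_LINES = [
    "Absolut Vodka",
    "Absolut Flav Raspberry Vodka",
    "Smirnoff Vodka",
]

# Supplier-specific mappings confirmed by researchers in previous years (hypothetical)
CONFIRMED = {
    ("supplier_a", "Absolut Raspberry-Flavored Vodka"): "Absolut Flav Raspberry Vodka",
}

def suggest_mapping(supplier: str, raw_name: str) -> dict:
    # A previously confirmed supplier-specific mapping passes straight through
    if (supplier, raw_name) in CONFIRMED:
        return {"status": "auto", "canonical": CONFIRMED[(supplier, raw_name)]}

    # Otherwise propose the closest canonical candidates for a researcher to review
    candidates = get_close_matches(raw_name, CANONICAL_BRAND_LINES, n=3, cutoff=0.4)
    return {"status": "needs_review", "candidates": candidates}

print(suggest_mapping("supplier_b", "Absolut Rasberry Flavoured Vodka"))
# A new spelling from an unknown supplier comes back as "needs_review" with candidates.
```

The design choice mirrors what David describes: the machine narrows the options, and the researcher makes the actual call on whether two similar-looking brand lines are the same product.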
Matthew Stibbe (18:45)
And these Data Manager mappings, once you've built that engine, you can run them against different spreadsheets and different data sources?
David Green (18:56)
Yeah, we don't tend to, because we do keep track of the mappings that we've made, but our assumption is that those mappings will probably be supplier-specific. If we've worked with one partner and they've called it this, we can only really safely assume that they will call it that again next year. So next year should be amazing. This year we had to go through a fairly expensive process of mapping all of these things for the first time; next year, we'll have a really good database of all the things we've had from the various suppliers.
We could have taken the view that a mapping from one supplier applies to others, but because some of these are really fiddly and quite fine-grained, the safest thing to do was to keep this quite supplier-centric. It gives us a bit more work to do now, but it means next year it will be really robust. What we don't want is to introduce errors here: if we mis-map these things, you could end up with some really unfortunate errors.
Matthew Stibbe (19:46)
You're sort of learning how each supplier describes their thing and how that translates into fluent IWSR.
David Green (19:53)
Exactly. That's the mapping we built up this year. And next year we should save even more time, because we won't have to go through most of this mapping process again. There'll be some changes, some renames, and some new brand lines appearing, but hopefully it means that work is steadily reduced every year.
Matthew Stibbe (20:08)
How do you pick up those exceptions?
David Green (20:11)
That will be through the Data Manager process. Next year, as we start ingesting new data, most of it will flow through automatically and Data Manager won't prompt the users to map anything. That's the nice thing now: we have a process where things go through Wrangler, the data appears in Data Manager, and then somebody goes into Data Manager and works out what needs mapping. If everything's already pre-mapped, it says there's nothing to do, move on. And if there's a handful of brand lines that need picking up, it will recognize the ones it hasn't got an exact match for and prompt us to come up with a new mapping.
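Continuing the same hypothetical example, a sketch of that exception-driven flow might look like this: pre-mapped brand lines flow straight through, and only unmatched ones are queued for a researcher. Again, the names and data are invented, not IWSR's actual system.

```python
# A sketch of the exception-driven flow described above, with hypothetical names:
# rows whose brand lines are already mapped pass straight through; anything
# unrecognised lands in a review queue for a researcher.
from difflib import get_close_matches

CANONICAL = ["Absolut Vodka", "Absolut Flav Raspberry Vodka", "Smirnoff Vodka"]
KNOWN = {("supplier_a", "Absolut Raspberry-Flavored Vodka"): "Absolut Flav Raspberry Vodka"}

def ingest_rows(supplier: str, rows: list[dict]) -> tuple[list[dict], list[dict]]:
    mapped, review_queue = [], []
    for row in rows:
        canonical = KNOWN.get((supplier, row["brand_line"]))
        if canonical:
            mapped.append({**row, "canonical": canonical})
        else:
            # No exact match: queue the row with candidate suggestions for review
            suggestions = get_close_matches(row["brand_line"], CANONICAL, n=3, cutoff=0.4)
            review_queue.append({**row, "candidates": suggestions})
    return mapped, review_queue

mapped, to_review = ingest_rows("supplier_a", [
    {"brand_line": "Absolut Raspberry-Flavored Vodka", "volume": 1200},
    {"brand_line": "Some Brand New RTD", "volume": 300},
])
# If to_review is empty, there's nothing for anyone to do: the file flows through untouched.
```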
Matthew Stibbe (20:42)
So this is sort of turning the whole situation on its head. Instead of somebody having to manually copy-paste, where most of the time it's brainless and sometimes there's an exception, now the system does that and only asks for the expert interpretation when there's something new or something different.
David Green (21:00)
Exactly, exactly. So it means we can keep that expertise really just focused on the differences year to year, which should be relatively minimal. We work with 20,000, 30,000 brand lines. It's not going to double next year.
Matthew Stibbe (21:12)
And what difference is that going to make to IWSR? Obviously, people are going to have happier lives doing less manual grunt work. But how does it help you as a business?
David Green (21:23)
The real thing it will enable us to do now is to scale: to start ingesting new data sets. Now we have a process that we know works and that we know we can apply, we'll be able to take different data sets and work into different markets, taking on either additional data or entirely new data sets. So it'll allow us to scale the business: taking in more data, selling more products to more customers, delivering more value to our clients, but without having to hire more people, because the expertise is now more concentrated.
This was never an exercise in reducing staff. This was about trying to free up time for our experts to be the experts that they are. It means we can take in more data without taking up more of their time, and they're still free to provide the expert insight and analysis that LLMs can't: that color on top of the raw data, to say, actually, what's really going on here is that this long-term trend is reversing because of such and such. That insight is invaluable. So we really want to free them up to do that, while working on ever bigger data sets.
Matthew Stibbe (22:20)
Free the humans. I completely... I love that. We're almost out of time, but I'd love to ask you just one last question. From that process, deploying that data ingestion, what did you learn? What would you have done differently knowing what you know now?
David Green (22:20)
Yeah. Slightly difficult question. I think mainly we just should have started earlier. As a business, we've spent too long letting humans do work that machines could have done.
So I think in many ways it's a shame we didn't start this earlier, but I'm glad we did do it. And the end result has been excellent. I mean, the feedback from researchers has been amazing that, you know, we've saved them time. They can focus on doing the part of the job that they enjoy, going out, talking to people, being the experts in their field, not just data entry.
So yeah, it's just a shame to me that we didn't start this earlier.
Matthew Stibbe (23:12)
This is, I think, a common regret everywhere. Back when I was a pilot, we used to say there are three useless things: airspace above you, runway behind you, and fuel on the ground. And time you could have spent doing something that you didn't is in that sort of category. So on that nostalgic revelation, I think that brings this episode to a close. David, thank you so much for sharing your experiences. It's been a pleasure having you on the show.
David Green (23:40)
Thank you, Matthew. Pleasure.
Matthew Stibbe (23:41)
And if you're listening to this or watching this and you'd like more practical data insights or you'd like to learn more about CloverDX, please visit cloverdx.com/behind-the-data.
Thank you for listening.