Building resilient data pipelines for sensitive, high-impact use cases


About this episode:

In this episode of Behind the Data, Matthew Stibbe interviews Darren Rooney, Senior Data Engineering Manager at Benchmark Analytics. They discuss the emerging trends in data engineering, particularly the impact of AI, and how Benchmark Analytics is leveraging data to improve police force management. Darren shares insights on the challenges of data ingestion and processing, the implementation of CloverDX for data standardization, and offers valuable advice for successful data integration.

AI-generated transcript

Matthew Stibbe (00:01) Welcome to Behind the Data with CloverDX. I'm your host, Matthew Stibbe, and today I'm talking to Darren Rooney, who is Senior Data Engineering Manager at Benchmark Analytics. Welcome to the show, Darren. Lovely to have you.

Darren R (00:14) I'm happy to be here.

Matthew Stibbe (00:17) So before we dive into your world, let's start with a really practical question. What emerging trends or technologies are catching your eye? What are you particularly excited about?

Darren R (00:30) Good one. I mean, we can't shy away from the AI movement, right? We have various code bases and code stacks in our system, and AI is proving to be a useful tool. We can't turn a blind eye to it. Other businesses are gaining a competitive advantage from it; we must do the same. However, we're trying not to just dive in as it's pioneered in the industry - we're deliberately lagging behind so we can see what other companies are doing well and what they did badly. We don't want to, you know, reinvent the wheel; we want to see them proceed and have success, and then we may adopt what they have. But AI is out there, and we need to adopt or lose that competitive advantage.

Matthew Stibbe (01:12) There's a sense across lots of industries at the moment that people are trying to juice their stock price, or appease their investors, or look like they're on trend by saying 'we have AI in our product' without necessarily thinking through how it's going to be best used. And I'm interested in this idea of yours of being, if I can put it like this, a fast follower. Do you engage in any sort of market intelligence activity to see what other people are doing? How do you track what other people are doing and decide 'we should be using that'?

Darren R (01:44) That's a good question. I mean, the different teams are doing it differently. Our data scientists, I'm sure, are looking at it at a very deep level: what models to use, the security around it, you know, is it Claude, is it ChatGPT, or whatnot?

We as developers are looking at it in terms of the efficiencies and gains that other businesses are getting. Let's just take a pipeline, for example. If they're able to spin up a pipeline in three days because they use AI, and it's been QA'd and is good, and ours takes, let's just say, four or five weeks instead, then we really need to adopt what they're doing, because we're going to lose that speed and that efficiency. So we look at it in terms of efficiency, and I'm sure our data scientists are getting way deeper into it than we are.

Matthew Stibbe (02:28) There's a side of this that lets people who are not data scientists and not developers do some of that stuff, which is kind of interesting. And I'm neither of those things. I love that. And then there's the stuff that actual proper engineers and proper data scientists need, and there's a different quality to it. I suspect there's a level of fidelity and accuracy and reliability required that is at a higher level than vibe coding and chucking some data into ChatGPT to see what it tells you, right?

Darren R (02:36) That's very true. For example, I use it for strategy discussions. I'll ask it: I have a use case, this is what I would like it to do - just on a strategy level, not necessarily writing code, but interacting with it at a conversational level about the strategy I'm working on. My developers may use it for, hey, is this the best way to write this piece of code? Does this make sense? Write me a docstring - all these types of things that they do at the line level, so to speak. But I would use it for strategy: is this the direction we need to go? Is it cost effective? All those types of things. I don't have a senior right now; I have two professionals. So I have to bridge that gap of cost effectiveness and where the team is going, and let the team members kind of build it up.

Matthew Stibbe (03:44) Your world - we'll come onto the business and what you do, but I know that the data is sensitive: it's safety related, it's related to people's careers and personal lives. And so you need some guardrails in there. Tell me a little bit about what your priorities are in terms of efficacy and quality and so on.

Darren R (04:07) Yeah, so our data is police force management data, so you would expect it to be not very accurate and all over the place. So it's very important that we wrangle that data - not as quickly as possible; we do want to get data in quickly, but accuracy is at the forefront. So we take that data, we wrangle it as much as possible, standardize it, right? Get it into our pipelines and then feed that to our data scientists.

If we didn't do the wrangling part accurately enough, or we got lazy, or we didn't fix errors within an SLA or something like that, our efficacy rate would drop. Right now it's around 87 to 89%, and we want it to go up. The cleaner the data we can get through, the better job our data scientists can do. If we don't get clean data in, they have a harder job to do and that rate may drop.

Matthew Stibbe (04:57) There's no room, or little room, in there for the risk of hallucination. You know, if you bring AI in, it has to be explainable in some way.

Darren R (05:03) Very true, very true.

Exactly, so we don't want to just shotgun AI into all of our data processing, because the QA would run crazy, right? And then QA has to use AI to do it. So we'd have processes running AI to create code to do things, and then we'd have QA running AI to check it. We'd have to have a framework in place to make sure our data is correct. We don't want hallucinations.

We don't want our sensitive, public-sector PII data to be in these open models. That's an absolute no-no. So we have to use things that are encapsulated, and things like that. So we're limited very much on the government data we have, but hallucination is not something we could accept. A lot of QA work would need to go into it if we did that.

Matthew Stibbe (05:36) Yeah.

This is fascinating - thinking about the limitations as well as the opportunities of AI. But we've jumped right into this, and I'd love to know: tell me about the business, tell me about Benchmark Analytics. What is it that you do?

Darren R (06:05) Sure. So Benchmark Analytics is an early intervention system for police force management, backed by machine learning and research-based decisions. So, for example, we have NYPD. They'll send us a ton of data: arrest data, use of force data, promotion data, merit data, all these types of things.

We will run that through kind of the pipeline I talked about earlier, where we wrangle it, make sure it's good, and standardize it through, you know, probably from a bronze to a silver layer, and then we'll push that through our application into what we consider our gold standard data, which is where our data scientists will take it and run. The goal of that is to see if the officer is advisable for action, actionable, or neither of those things. And then it's up to top brass, or whomever is using the application in that agency, to make a decision.
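For readers who want to picture that bronze-to-silver-to-gold flow, here is a minimal Python sketch of the idea. The column names and the pandas-based approach are illustrative assumptions, not Benchmark's actual implementation.

```python
# Hypothetical sketch of a bronze -> silver -> gold flow like the one Darren describes.
import pandas as pd

def to_silver(bronze: pd.DataFrame) -> pd.DataFrame:
    """Wrangle raw agency data (bronze) into a standardized layer (silver)."""
    silver = bronze.rename(columns={"off_badge": "badge_number",
                                    "uof_date": "use_of_force_date"})
    silver["use_of_force_date"] = pd.to_datetime(silver["use_of_force_date"],
                                                 errors="coerce")
    # Drop records missing required identifiers rather than guessing them.
    return silver.dropna(subset=["badge_number"])

def to_gold(silver: pd.DataFrame) -> pd.DataFrame:
    """Promote validated, de-duplicated records to the gold layer used by data science."""
    gold = silver.drop_duplicates(subset=["badge_number", "use_of_force_date"])
    gold["ingested_at"] = pd.Timestamp.now(tz="UTC")
    return gold
```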

Matthew Stibbe (06:56) Forgive my ignorance. Advisable for action means what? They're at risk of exceeding their authority or something like that.

Darren R (07:03) It depends, right?

It's open to interpretation based on the agency. So based on the forms - and again, this is an area of data science I'm not privy to in depth - our application will say an officer is advisable. That means a person in the agency needs to take a look at this officer and interpret what's going on, to say, hey, maybe we need to have a course of action with this officer: talk to them about wellness, talk to them about the arrest patterns they're showing. Anything they may need to intervene on is where that goes from there. And actionable is really close to that - they're farther down that path, so we really need to talk to that officer. Not us, of course, but the police.

Matthew Stibbe (07:46) Yes. Fascinating. How many agencies, police forces, are, you know, using this? What sort of organizations are you getting data from?

Darren R (08:01) Yeah, good question. I mean, we're growing fast, right? We're in a growth phase. I think we have about 800 agencies - I'd have to look to give you the right number - and we're targeting more every day if we can. Anything from municipalities up to the NYPDs, and the POST agencies - police officer standards - which encompass a whole state. Colorado, for example: POST there encompasses every agency in Colorado. We'll have those types of clients.

So we are targeting as many as we can, because we are making real impact. The efficacy rate I told you about is really good because with any data we get, we're doing a very good job: the data scientists are making interpretations on it and feeding the models. And we are making real impact with that.

Matthew Stibbe (08:48) You measure efficacy how?

Darren R (08:50) That I can't answer; it's going to be the data scientists that give you that. We have a monthly meeting with them. They tell us that rate, they get into it, but it's more of a 'here's how we're doing' rather than a 'here's how we're doing it' for the business.

Matthew Stibbe (08:53) Okay. Yes. Right. OK. And what kind of data is coming into the system?

Darren R (09:14) Great. So it's police data on the officers themselves. It's PII and GovCloud data, which means it needs to stay in GovCloud - that data cannot leave under any circumstance. The types of data would be their personal and demographic data, their badge number, their gun number, social security number, anything that's PII related to the officer. We also get their employment history through their career in that police agency. We also get the arrest data the officer has made, and any narratives and AI data. Not AI, sorry, IA data. I get those two mixed up now.

Matthew Stibbe (09:53) Internal Affairs.

Darren R (09:56) Internal Affairs. We take all that data in, and then we feed it through the model from there.

Matthew Stibbe (10:05) And what is the mix of structured and unstructured data, and indeed digital and non-digital analog data?

Darren R (10:09) Also a good question. So we get a wide array, right? We have some backwater agencies that are still on paper, where we'll need to scrub that data - you know, OCR it, get it digitized - and then we're off to the races getting it in. Some agencies are proficient, like NYPD, where they have an IT department and they give us almost exactly what we need, which is great. That's not often the case. And then some agencies don't have any IT, but they're digitized, and they say, here's everything, can you please interpret it for what you need?

Those are the ones we like, because we get all these data sets that we didn't ask for that might be interesting. Like if a police officer has off-duty jobs - bouncing, party management, anything like that - that's actually data that data science would like to use. So there are varying classifications of agencies that we refer to. Some of them are difficult, some of them are not, but the data at the end of the day is that police force management data: the arrests, the IA, use of force and demographics.

Matthew Stibbe (11:08) So it's operational and HR type of information.

Darren R (11:11) HR, basically, yes. So, for example, for an officer we'll get all their demographic details - address, like I told you before, social security, name, all those things. We also get their employment history within their agency: where they have been, maybe a demotion, maybe a promotion, maybe a separation. We get all of that.

So we are a police management software, so to speak, but we have other facets. We call one of them First Sign - that's where that actionable and advisable suite is. But we need the information about the officer in there before we can make that decision. So we do have, so to speak, kind of an HR presence. But that First Sign product of ours is where the heart of it is - the actionable flags.

Matthew Stibbe (11:59) Yes. Okay, and tell me about the process of onboarding a new agency. I mean, how do you go in and audit what data they have? How do you start planning the ingestion and processing of that data?

Darren R (12:17) Good question. So this was actually a scaling issue for us as we went from startup into growth: we were getting so many agencies - our sales team was churning out successful pitches - and the data engineering team couldn't scale enough to get that data from what the agencies were giving us into our pipeline. So we had to construct a team, specifically client facing, that handled getting data out of systems and wrangling it; that was their expertise.

We put that team at the forefront of our pipeline, right? All the way to the left. They are the quick action force, right? They get the data, they wrangle it as much as possible, and then they feed it to us - that would be, you know, the bronze, maybe silver layer. And then our pipelines take it from there, and that's what the engineers work on: that standardized pipeline. Now, with the wide array of data we get - like I told you, there are several classifications - it can be messy.

And it takes a long time to understand the data and build the necessary structure that we need for our pipelines to consume it. And sometimes it's very quick. It all depends on the classification of the agency.

Matthew Stibbe (13:24) You might be looking at, say, arrest records and applying some sort of schema to get that data out and transform it into a format that's standardized for you, for example. I'm imagining that, yeah.

Darren R (13:36) That's exactly right.

Yeah, think of it like a funnel, right? We know what our model is for the pipeline to accept and consume, but we get all these types of disparate data, and we need to wrangle that down into a format we can consume. Otherwise we wouldn't be able to scale with a bespoke, agency-by-agency kind of ingestion pipeline. So we put that team out there to get data into a format as much as possible, and then we standardize.
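A rough sketch of that funnel idea, assuming a simple canonical model: each agency's layout is mapped onto one target schema before the standardized pipeline takes over. The field names and mappings here are invented for illustration.

```python
# Many agency-specific layouts funnelled into one canonical record shape.
CANONICAL_FIELDS = ["badge_number", "first_name", "last_name", "arrest_date"]

AGENCY_MAPPINGS = {
    "agency_a": {"Badge": "badge_number", "FName": "first_name",
                 "LName": "last_name", "ArrestDt": "arrest_date"},
    "agency_b": {"officer_id": "badge_number", "given_name": "first_name",
                 "surname": "last_name", "date_of_arrest": "arrest_date"},
}

def standardize(record: dict, agency: str) -> dict:
    """Map one agency-specific record onto the canonical schema."""
    mapping = AGENCY_MAPPINGS[agency]
    canonical = {target: record.get(source) for source, target in mapping.items()}
    # Anything the mapping doesn't cover is left empty rather than guessed at.
    return {field: canonical.get(field) for field in CANONICAL_FIELDS}

# Example: two very different source layouts end up in the same shape.
print(standardize({"Badge": "123", "FName": "Jane", "LName": "Doe"}, "agency_a"))
print(standardize({"officer_id": "456", "surname": "Smith"}, "agency_b"))
```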

Matthew Stibbe (14:06) So I'd love to hear about how you use Clover to sort of start standardizing some of those processes. What was the situation before you brought Clover in?

Darren R (14:19) Sure, that's a great question. Before we brought on Clover, we had a team of engineers with varying levels of skill sets - that's important to note - because we were doing bespoke, agency-by-agency extraction scripts all the way from source to target. Which was fine at the time, because we only had so many agencies and we could keep up with that, but there was some stress and some strain, all the way down to team morale, because there was a lot of work.

Matthew Stibbe (14:46) Sorry, that's my doorbell. Sorry, listeners, real life happening here. I've set up a home automation script so I don't have to go answer it - someone else will. When someone rings the doorbell, it flashes my lights, and that just made me deeply happy. That's the geekiest thing I've done this year. Anyway, sorry, Darren, I was distracted by that. You were saying about how you were writing custom scripts. Was that time consuming?

Darren R (14:52) Okay.

Okay, no worries.

Yes, exactly. We had bespoke...

Go ahead. Sorry.

Matthew Stibbe (15:17) What were the challenges with doing custom scripts per agency?

Darren R (15:20) Very time consuming. Yeah.

So we would get an agency and we'd have to put a developer on that agency to write the script to get the data all the way, like I said, from source to target. That was a lot of time on data for a single human. The goal was to reduce the time on data per human and put that into a standardized pipeline, which meant we had to invest in a product that took us from little standardization to a lot of standardization very quickly, without the workforce skill-set bump we would otherwise have needed.

So I'll give you a small example. We had an agency that gave us data that was very complex. They just said, here's our data, I don't know what you need, I don't know how to do it, but it's very complex. That engineer took several months just to wrangle the data to get it to a point where we could ingest it.

Needless to say, that doesn't go away with that team I told you about. They still have to wrangle it, but it's not months, because they're not worrying about the standardization through the pipeline - we built that through Clover.

So when we brought Clover on board, it got us from here to there very quickly in terms of productionizing the standardized pipeline, which was good because now we can put the people with very high expertise into other areas of our business and kind of grow that competitive advantage. That was our whole goal behind it.

Matthew Stibbe (16:36) So Clover's giving you a sort of template of a pipeline, and then you can adjust and tweak that for individual agencies.

Darren R (16:42) Yeah, parameters and varying classifications will move switches in the pipeline, for sure, because like I told you, there are varying classifications of agencies. They may have different needs, and because we're SaaS we need to be flexible in what we do and tell our clients. So if they want something, we need to compensate for it - there are a lot of parameters and switches in the pipelines, but it's still a standardized format. So our time on data for a human is very short and the time on data for processing is very large. That was our main goal.
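To illustrate the 'switches' idea, here is a hedged sketch of a single standardized pipeline driven by per-agency parameters. The classification names, flags and helper steps are assumptions for the example, not CloverDX or Benchmark specifics.

```python
from dataclasses import dataclass

@dataclass
class AgencyConfig:
    classification: str        # e.g. "paper", "digitized", "it_supported"
    needs_ocr: bool = False    # scanned paper records must be digitized first
    has_ia_data: bool = True   # some agencies cannot supply Internal Affairs data

def digitize(record: dict) -> dict:
    """Placeholder for the OCR / digitization step used for paper-based agencies."""
    return record

def standardize(record: dict) -> dict:
    """Placeholder for the shared standardization step every agency goes through."""
    return {key.lower(): value for key, value in record.items()}

def run_pipeline(records: list[dict], config: AgencyConfig) -> list[dict]:
    """One standardized pipeline; per-agency behaviour comes from the config switches."""
    if config.needs_ocr:
        records = [digitize(r) for r in records]
    records = [standardize(r) for r in records]
    if not config.has_ia_data:
        records = [dict(r, ia_history=None) for r in records]
    return records

# Example: a small paper-based agency and a large digital one share the same pipeline.
rural = AgencyConfig(classification="paper", needs_ocr=True, has_ia_data=False)
print(run_pipeline([{"Badge": "123"}], rural))
```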

Matthew Stibbe (17:13) Time on data - is that the time to set up the pipeline, or the time spent reviewing and processing the data as it's flowing through? Or both?

Darren R (17:22) Both. Yeah, so a human is on a computer working on a piece of data for X amount of time, and I wanted to limit that as much as possible. AI may help with that significantly, but three years ago a human would be on that data - wrangling it, understanding it, getting it in - for about 80% of our workflow. Now it's down to about 20%: they wrangle it, get it to a point, and then the pipeline does most of the work to get it in, because it knows what to do now that we've built that out.

Matthew Stibbe (17:54) Right, I understand. And in implementing HubSpot - sorry, Freudian slip - in implementing CloverDX to do that, what did you learn and what would you have done differently if you had the time again?

Darren R (18:08) Oh, good question. I came on, I was handed a problem and a solution, and I was laser focused on solving it, which is fine. But in retrospect, if I stepped back and asked what I would have done differently, it would have been to collaborate with other teams a bit further left and earlier on. Like I said, we were laser focused on getting the solution done, and we did it - it was great. But I could have benefited from conversations with other departments, because they'd been here longer than us and may have faced challenges that we didn't know about.

Let me give an example. We have endpoints that our pipeline hits. Some of those endpoints had problems; some of them didn't do exactly what we needed them to do - they were more legacy. And instead of working around them and dealing with that problem, we could have talked to the other team and said, can you fix this endpoint a little bit? Granted, there was a trade-off there; we couldn't make the decision in a vacuum and we had to get the job done. But if I were to go back, I would have said, hey, can we work together and collaborate on this, instead of just doing workaround after workaround to get the data in?

Matthew Stibbe (19:11) And where is your Clover instance sitting? Where's all this? Is it up in sort of AWS GovCloud?

Darren R (19:18) It's on AWS GovCloud, yeah, we have it there. We have varying server levels, right? Our lower environments, our UAT, and our production environment, and we do CI/CD to promote between them. Our developers develop in what we call an ETL environment - it's a prod-like environment, just so we can get real-world data - and then we promote through a CI/CD process, code review and merge requests.

Matthew Stibbe (19:43) In the three years you've been with Benchmark Analytics, how has that evolved, and what would you have done differently at the beginning to make it more efficient? If indeed it is - that question starts with the assumption that it wasn't.

Darren R (19:58) I mean, it's a well-oiled machine now. We have it down to a pretty good science. Our CI/CD is running. The team members are familiar with the product and we're churning out solutions, which is great. I'm not sure if I would change anything, but there's always room for improvement, right? I would say we can get better at our observability and our SLA timing.

Maybe we could handle things differently in the pipeline with what we call recovery. So if there is an error, we'd have a process that takes that error and makes a decision about trying to fix it itself. That's probably where we would go from here, because right now we get an error and a human needs to intervene, react and make a change. What we would like to see is more of a recovery process: here are the errors for today, go through them and see if the pipeline can fix them itself, so that a developer doesn't have to.

Matthew Stibbe (20:44) When you're talking about errors though, you're talking about errors in the pipeline or in the data, exceptions in the data.

Darren R (20:48) In the pipeline, the process - there are always those types of errors. For example, a required field that was null, that that team missed, hits the pipeline and that record falls off. So does it alert somebody? Does it report on it? Does it fall off into the ether, which would be really bad? We could catch that and then alert or modify - not modify, sorry, notify - key stakeholders, and they would go back to the agency and say, hey, we need this file.

Matthew Stibbe (21:05) Yes.

Darren R (21:16) And then they'll send it to us again, the pipeline will just consume it, and then we don't have an error. It would be great if that were automatic, instead of a user querying the error table and saying, oh, we have an error, and then writing the email and reaching out - it's a lot of work that we don't necessarily need a human to do.
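A minimal sketch of that recovery pattern, assuming a simple record-level check: required-field failures are collected and a stakeholder is notified, rather than records silently falling into the ether. The field names and the notification hook are illustrative, not Benchmark's actual code.

```python
# Catch records that fail a required-field check and notify someone,
# instead of letting them drop out of the pipeline unnoticed.
REQUIRED_FIELDS = ["badge_number", "incident_date"]

def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are missing or null in this record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]

def notify_stakeholders(rejected: list[dict]) -> None:
    """Stand-in for the alerting step (email, ticket, chat message, etc.)."""
    print(f"{len(rejected)} record(s) need to be re-requested from the agency")

def process(records: list[dict]) -> tuple[list[dict], list[dict]]:
    accepted, rejected = [], []
    for record in records:
        missing = missing_fields(record)
        if missing:
            rejected.append({"record": record, "missing": missing})
        else:
            accepted.append(record)
    if rejected:
        notify_stakeholders(rejected)
    return accepted, rejected

# Example: one good record, one with a null required field.
process([{"badge_number": "123", "incident_date": "2024-01-01"},
         {"badge_number": None, "incident_date": "2024-01-02"}])
```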

Matthew Stibbe (21:33) So sort of automating, or semi-automating, the exception and error catching.

Darren R (21:38) That's right. Yeah, we call it recovering here, but that's essentially it.

Matthew Stibbe (21:42) Is there a loop from that back into the pipeline, where you can learn from it - we saw a lot of these errors, we had to do a lot of recovery on that, and if we put that into the pipeline, we can catch them next time without having to...

Darren R (21:55) Absolutely.

This is a living pipeline. We take those errors and then we rebuild; the idea is to not have them, as much as possible. We call them type two errors. Type two errors are what we produce in our pipelines, and those need to go away. For example, if we were hitting a downstream API or endpoint and we didn't have a retry mechanism on it, and it just failed once and stopped the pipeline, a developer can go in and say...

hey, this is not what this is supposed to do, let's get a retry mechanism in there. And then we might not see that error again. The type one errors are errors from agencies - we do data contracts, right? So if they change the contract, or change the file schema or something like that, it will break some of these things. And we're trying to work on that.
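As a sketch of the retry idea for transient 'type two' errors, something along these lines keeps a single failed call to a downstream endpoint from stopping the whole pipeline. The endpoint, retry limits and backoff values are assumptions for illustration.

```python
# Retry transient downstream failures instead of halting the pipeline on the first one.
import time
import urllib.error
import urllib.request

def call_endpoint(url: str, retries: int = 3, backoff_seconds: float = 2.0) -> bytes:
    """Call a downstream endpoint, retrying transient failures with a simple backoff."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()
        except urllib.error.URLError:
            if attempt == retries:
                raise                               # persistent failure: surface it for recovery
            time.sleep(backoff_seconds * attempt)   # wait a little longer each attempt
```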

Matthew Stibbe (22:44) Forgive my ignorance tell me what a data contract is.

Darren R (22:47) Sure. So we get, we enter in contract, data contract with an agency. An agency has promised to send us this data in this format and this cadence. And then we build pipelines around that. So we're not event driven pipelines, not yet, because our agencies want to know exactly when we're processing that data. So when they do that contract and they say, this is the schema that we have, we are locked into that schema.

So if they change it, we'll know immediately. Because, for example, first name and last name - maybe they switched them and we wouldn't otherwise know. It's a very small, minute example, but the pipeline would start mixing names up and things like that. So that contract is locked in there. We're having an...
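A hedged illustration of how a data contract can catch schema drift immediately, such as swapped first and last name columns; the contract contents, feed name and CSV layout are invented for the example.

```python
# Pin the agreed schema and reject an incoming file the moment its columns drift.
import csv
import io

CONTRACT = {
    "arrests": ["badge_number", "first_name", "last_name", "arrest_date", "charge"],
}

def check_contract(feed_name: str, csv_text: str) -> None:
    """Raise immediately if the file header no longer matches the agreed schema."""
    header = next(csv.reader(io.StringIO(csv_text)))
    expected = CONTRACT[feed_name]
    if header != expected:
        raise ValueError(
            f"{feed_name}: schema drift detected - expected {expected}, got {header}"
        )

# Example: swapped first/last name columns are caught before any records load.
sample = "badge_number,last_name,first_name,arrest_date,charge\n123,Doe,Jane,2024-01-01,theft\n"
try:
    check_contract("arrests", sample)
except ValueError as err:
    print(err)
```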

Matthew Stibbe (23:25) So sort of data typing and things. Yeah. Okay, fascinating. All right. Well, we're almost out of time, but before we bring this interesting conversation to a close, I'd love to ask if there's a piece of advice you would give to anyone who's embarking on a data integration project similar or related to some of the things we've been talking about today.

Darren R (23:50) I'd say POC a lot. Get your use cases set up, get your functionality that you need for the business. Try to keep it tight. Not all bells and whistles are related to what you need for your solution, and then focus on getting that solution done. POC a lot of software, POC a lot of data pipelines. Spend a little money on materializing those POCs and then weigh out the costs and benefits. But what I see a lot is that the first part doesn't happen.

We say we have a solution, let's just go find it, but we need to be focused on a list of exactly what we need and kind of weed out what we don't need as an anti-pattern so we don't get distracted with all these bells and whistles that software companies like to show you.

Matthew Stibbe (24:31) I love the idea of doing proofs of concept and starting with something and learning cheaply. And I remember something from my coding days back in the very long ago: non-existent code doesn't crash. And the thought behind that was, if you focus, as you've been describing, on the things that you absolutely need to have, you're not creating all this debt and this surface that can go wrong and has to be maintained.

You know, you can expand from that by adding other things that you need to have. The risk I keep seeing is people going, okay, we're going to gold plate it: we need to do this and we need to do this, and the specification is this and the requirement is that, and suddenly you've got this really big complicated thing. Yeah. So yeah.

Darren R (25:10) Right, right. It can be done. We've done this a few times, because in our startup phase we tried a lot of things. Then we went through growth and we're trying to mature in what we do. We did several POCs that didn't work, but we learned a lot - we learned what not to do, most importantly, as well as what to do. Then we whittled what we didn't need down into a list of what we did need, and we found a solution to fit that list. It took a lot of time, but it saved us down the road in building, because building takes a lot of time as well. So we knew what we wanted to do, we were all in agreement, we think it works well, and then we took the time to build it.

Matthew Stibbe (25:49) There's a poem that I forget, but it ends with the line 'fail better', I think. Fail faster, fail better. Good. Well, on that bombshell, this has been a non-failure, high-success interview. Thank you very much, Darren, I've really enjoyed it. And that brings this episode to a close. If you're listening and you'd like to get more practical data insights and learn more about CloverDX, please visit cloverdx.com/behind-the-data.

Matthew Stibbe (26:18) Thank you very much for listening and goodbye.

Darren R (26:21) Thank you.



Subscribe on your favorite podcast platform and follow us on social to keep up with the latest episodes.


Our podcast takes you inside the world of data management through engaging, commute-length interviews with some of the field’s most inspiring figures. Each episode explores the stories and challenges behind innovative data solutions, featuring insights and lessons from industry pioneers and thought leaders.