Behind the Data

Ethics, critical thinking and disambiguation in data

Written by CloverDX | Oct 28, 2025 2:08:57 PM

About this episode:

In this episode of Behind the Data, Matthew Stibbe interviews James Randall, CTO of Blackdot Solutions, discussing the impact of AI on data analysis in the open source intelligence (OSINT) space and how Blackdot's Videris platform helps experienced investigators analyze vast amounts of structured and unstructured data to identify financial risks. The conversation explores how large language models are revolutionizing entity disambiguation and extraction from unstructured data, the growing challenge of AI-generated "slop" polluting online information sources, and the critical importance of data provenance and ethical sourcing in investigations.

Timestamps

00:00 Intro
00:55 Using AI for entity disambiguation
02:13 LLMs for interpreting and analyzing unstructured data
06:17 Open source intelligence and the Videris platform
14:27 James's career journey
19:23 Applying critical thinking to data sources
20:22 Grounding LLMs in trustworthy information
23:45 Human vs LLM knowledge graphs
25:23 The ethics of data collection in OSINT

Key discussion topics

  • Why LLMs dramatically outperform traditional NLP for entity extraction and disambiguation
  • How to ground AI analysis with trustworthy data to reduce hallucinations and maintain reliability in investigations - similar to the way human investigators build knowledge graphs
  • Why understanding data provenance, and how much you can trust it, is critical in the age of AI-generated information 
  • The importance of applying human judgment about which data sources align with ethical standards - legal doesn't always equal ethical

Transcript

Matthew Stibbe (00:00) Welcome to Behind the Data with CloverDX. I'm your host, Matthew Stibbe, and today I'm talking to James Randall, who is CTO at Blackdot Solutions, who are another client of ours and a fabulous company. Great to have you on the show, James.

James (00:14) Thanks for having me Matthew, nice to be here.

Matthew Stibbe (00:16) Well, before we dive into your world and your career, what are you geeking out about at the moment?

James (00:23) Somewhat predictably, AI, which I think is probably an inevitable answer to that question at the moment. We've been spending a lot of time researching AI on a number of fronts. I think as an engineer, there's the obvious, having it help me do engineering and help our teams do engineering, but also how it can help our customers and supplement or replace some of the techniques we've used in the past for analyzing data and looking at data.

And one of the interesting things that's fallen out of it is entity disambiguation and how good LLMs are at figuring out what the entities are in documents and evaluating whether they are what I would call meaningfully the same. I think meaningful is a good word, because it's easy to look at a list of properties and say, are they the same values, or are they roughly the same values?

But actually 'is this content meaningfully the same as this content?' is a slightly harder question.

Matthew Stibbe (01:23) Can you give me some examples, just in your world? We'll introduce Blackdot in a minute, but what sort of entities might need to be disambiguated?

James (01:32) Yeah, so the Videris [platform] looks at a whole load of open source data. Typically it's focused on individuals or organizations. It's an investigations platform, so people are trying to figure out, does somebody represent some kind of financial risk, or does an organization have risky investors?

They're looking for those kinds of things, and Videris deals with, I would say, two kinds of data. Structured data, so data we've got from sources that are curated and you can trust the results. They're often presented to you as JSON. You've got a surname property, a forename property, you can trust that result.

But we also deal with unstructured data, increasingly so in fact. And that can be web pages, PDF files, images, and they contain all sorts of content. And often what you're looking to do is figure out, firstly, what entities might be in that document. You know, is Matthew in this document? Does he appear in this document? Are we sure it's Matthew? Why are we sure it's Matthew? But also, once we've got that out, is that entity the same as another entity that we're already aware of? And that can be quite interesting, because often they are partial. They're not always the same.

The other thing that is interesting when you look at that, particularly when you look at unstructured data - and traditional techniques have not always been that good at this - is that the pieces of relevant information about the entity, so let's say it's a date of birth, a surname and an address, are not necessarily close to each other in the document. And even if they are, they're linked with many other words. And traditional NLP has not always done a great job of making those associations.

And what we're finding... again, we kind of started looking at LLMs; we actually looked at them initially for a feature we introduced to help people browse the web in a secure fashion. We were extracting entities from those documents, and we experimented with using a traditional NLP solution. At the same time, we also said, let's try an LLM, and we applied an LLM to the same problem. And the difference in quality of that extraction and subsequent disambiguation was really quite impressive. The LLM won by a good margin.

The traditional NLP approach would struggle to collect together the properties about that person from across the document, whereas the LLM... okay, I always struggle to talk about LLMs because I really don't want to say they think, they know, they reason, because none of that is true, but they managed to make that association across a gap of language in a much more efficient manner.
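To make that concrete, here is a minimal sketch of the LLM-based extraction step, assuming an OpenAI-style chat completions API. The model name, prompt and output schema are illustrative assumptions; the episode doesn't describe Blackdot's actual implementation.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """Extract every person and organization mentioned in the document
below. For each entity, collect the attributes scattered across the text
(name, date of birth, address, ...), even when they appear far apart.
Respond with a JSON object of the form {{"entities": [...]}}.

Document:
{document}"""

def extract_entities(document: str) -> list[dict]:
    """Pull entities and their attributes out of unstructured text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT.format(document=document)}],
        response_format={"type": "json_object"},  # keeps the reply parseable
    )
    return json.loads(response.choices[0].message.content)["entities"]
```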

And in the same way... what's also really interesting about LLMs, particularly in the data space, is they don't really care about what format the data's in. So you can lob in some JSON, you can lob in a CSV with the same entity in it, even partial, and say 'is this the same thing?' or 'consolidate those two things', and it will generally do a pretty good job of it.

Whereas if you wind back a year or two, you'd be trying to kind of parcel that in code, get it into some sort of comparable format and really worrying about the structure of the data and then the content of that data in the structure. What's really interesting about LLMs is you tend not to have to do that. You can help them, but they are very good at working with loose structures.
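Here is a sketch of that format-agnostic comparison, feeding the model one record as JSON and one as CSV. Again, the prompt, model and example records are assumptions for illustration, not Blackdot's code.

```python
import json
from openai import OpenAI

client = OpenAI()

def same_entity(record_a: str, record_b: str) -> dict:
    """Ask whether two partial records, in any format, describe the same entity."""
    prompt = (
        "The two records below may describe the same real-world entity. "
        "They may be partial and in different formats.\n"
        'Respond as JSON: {"same": true or false, "reasoning": "..."}.\n\n'
        f"Record A:\n{record_a}\n\nRecord B:\n{record_b}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# One record as JSON, one as CSV; no parsing or schema alignment needed first.
verdict = same_entity(
    '{"surname": "Randall", "forename": "James", "company": "Blackdot"}',
    "forename,surname,role\nJim,Randall,CTO",
)
print(verdict["same"], verdict["reasoning"])
```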

Matthew Stibbe (05:16) It's extraordinary, isn't it? An old colleague of mine runs a war game simulation business now. I mean, they do it for militaries and things, and they can take the output of some of the simulations they run, put that log file straight into an LLM, and it then turns it into a narrative of what happened in the war games.

James (05:36) It's crazy. Yeah. I was looking at... it's kind of a data problem, but it's a bit more on the coding side. I was looking at a lengthy stack trace, a big sort of code dump, I think; something had gone wrong, a big, big log file. And I was struggling to figure out what had actually happened. And I just pasted it into an LLM and said, can you summarize what's actually gone wrong here? And it was spot on and told me in seconds. Oh my goodness, this is quite astounding.

Matthew Stibbe (06:03) So we might come back to this, but what I'm sure everyone listening is now eager to know is: what does Blackdot Solutions do? What is Videris? Tell us about the world of open source intelligence.

James (06:07) Sure. So as I mentioned earlier, it's an investigations platform and it's generally been targeted at what I would call experienced investigators. So there are people in organizations such as large corporations, financial institutions, government agencies who need to conduct quite thorough investigations into individuals.

A bank might be looking for connections to sanctioned individuals, particularly, obviously, as the Ukraine situation developed and all of a sudden there were lots of sanctions applied to lots of people, you know, Russians and people related to those Russians. That was a big job for people to understand, and the connections are not always obvious. These people are really good at covering their tracks. A Russian oligarch does not directly invest in a utility company in the UK, not any more anyway; they wouldn't be allowed to.

But they will do it not just through shell companies, but through individuals they know who are often connected across perhaps social media. And in some cases, they've hardly left any tracks for connecting the two people together. But they are connected and they are actually deeply connected and attempting to evade sanctions for financial gain. So they will use our tool to kind of explore that space and try and understand where the connections are.

And Videris will help them in a number of ways. One, it gives them access to an awful lot of open source data sources, which you could often get to through Google, or by logging into a social network, or by visiting company records. But doing that at scale, just trying to get to that data at scale, is really difficult.

It's time consuming. You know, Videris can do that for you instantly. Go and find this person in all the corporate records systems, you know, and it will just come back with the results for you. You know, look for this person in social media. It will come back with results for you. So it shortcuts the data gathering process, but we then kind of present that information in a graph to people so they can start to visualize just how things are connected and, you know, kind of annotate that graph, add things to that graph, explore further.

And ultimately we hope they can get to a conclusion, which is that person is okay or that person is not okay. It's human-led as a result; these experienced investigators are in the driving seat, and what we try and do is provide them tools to assist them. So we're not making decisions for them, but we will try and surface information to them that is relevant, and if it's not relevant they can discard it and move on.

Matthew Stibbe (09:06) And there's an element, isn't there, I think, of making that data evidentially, if that's a word, it is now, valid; recording what was done and what was seen, and so on.

James (09:17) Yes, it all becomes recorded in the system. There's a full audit trail, so if somebody goes in there and starts tinkering with it, you can see it. You can't bypass the sources of truth. I mean, it can be valid to edit this data for all sorts of reasons, but that all gets recorded. We don't generally get involved in criminal stuff for the most part; it's mostly in the financial corporate space.

The controls on that are not quite as stringent as they would be in, say, the police; if you're working in the police, you've obviously got a bunch of controls that you need to adhere to, which is not quite the same in, let's say, a corporation. And then you have a bank, which is in a slightly different position: they're financially regulated and have controls, but they're generally different kinds of controls to, say, a law enforcement agency. So we sit in the middle of that.

Matthew Stibbe (10:12) And inside Videris, there must be, if you've got hundreds of sources of data coming in, there must be a lot of work to process, evaluate, translate that. Can you talk a little bit about the world of data inside Videris?

James (10:26) Yeah, sure. We have a system, which runs in the cloud for the most part, which will connect outwards to all these different data sources. And obviously there's a lot of data sources. I can't even tell you how many there are. And it gets added to all the time. So we have a tool that makes it easy for people to build new connectors to new APIs. This data is normally exposed through an API.

So we have a tool we call extension builder, which basically allows a non-engineer to connect to an API, shape the data and bring it into the system. So that's kind of like data entry. The system that sits in the middle is called COG, and it talks to those extensions, or runs those extensions, and coordinates the activity across many data sources. So if you search for James Randall, you might be searching across Google, corporate record databases and social networks; it will fire off all those searches and queries and then bring the results back together. At the same time, it will start extracting entities from those results. So if you've got something from Google, you've got search results, and it will try and figure out what's in the search results that might be relevant too. And that eventually gets pushed back into Videris, the application.

It appears kind of piece by piece. We don't wait for that entire thing to complete. One of the sort of interesting, slightly awkward things about the open source world is not all the data sources are reliable all the time. You are very much talking to a lot of platforms and some of those platforms are talking directly to the internet, as it were. So things can appear and disappear and change. So there's quite a high degree of fault tolerance around that.

And part of that flow is we don't wait for it all to complete, because something could be taking five minutes for some transient reason on the internet. That goes back into Videris and, depending on the context of what the user did, it will get presented to them in a number of ways. One of the most common is our search interface, which looks a little bit like a search engine interface but is sat on top of this wider variety of data. And people can see the facets and the sources.
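The fan-out-and-stream shape James describes, fire everything at once, surface results as they arrive, and tolerate slow or dead sources, can be sketched with asyncio. The connector below is a stand-in invented for the example; COG's internals aren't public.

```python
import asyncio
import random

async def query_source(name: str, search: str) -> tuple[str, list[str]]:
    """Stand-in for a real connector (Google, corporate records, social media)."""
    await asyncio.sleep(random.uniform(0.1, 2.0))  # simulate network latency
    if random.random() < 0.2:
        raise ConnectionError(f"{name} is having a bad day")
    return name, [f"'{search}' result from {name}"]

async def fan_out(search: str, sources: list[str], timeout: float = 30.0):
    """Run every source query concurrently and surface results piece by piece,
    rather than waiting for the slowest source to finish."""
    tasks = [asyncio.create_task(query_source(s, search)) for s in sources]
    for finished in asyncio.as_completed(tasks, timeout=timeout):
        try:
            name, results = await finished
            print(name, results)       # push partial results to the UI now
        except Exception as exc:       # one flaky source mustn't sink the rest
            print(f"source failed, skipping: {exc}")

asyncio.run(fan_out("James Randall", ["google", "corporate records", "social"]))
```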

We attempt to apply a risk rating to it. So if they're looking for risk, and that's generally financial-type risk...

Matthew Stibbe (12:55) Risk meaning confidence in the data rather than risk in a financial services sense?

James (13:00) Risk in a financial sense or a criminal sense. You know, does this person have indicators that they are a risky person? A good example would be we've been able to identify they are sanctioned. That would obviously constitute quite a risk no matter who you are; if you're a legitimate business or organization, that's a generically risky state for an individual to be in. So we surface all that up in a sort of traditional search engine type interface, it's not quite the same, and users will pick those things. We rank them, they will look through them, and they'll start to move those into our chart, at which point they can start forming links, and the system starts forming links for them, to understand that.

So that's the general data flow. There are some spins on that. We have something called enrichment, which is where you've got one of these items of data and you want to find out more about it. An example might be you have found a company, and the company is available to you. You can enrich that to find all the officers associated with that company. And from those officers, you could go and find other companies, or perhaps the social media they're on, and so on and so forth through the system.
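In graph terms, enrichment is a neighbour-expansion step: take a node, ask the relevant sources for connected entities, and add them to the chart. A toy sketch follows; the entity shape and connector signature are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    kind: str   # "company", "person", ...
    name: str

def officers_of(company: Entity) -> list[Entity]:
    """Stand-in for a corporate-records lookup."""
    return [Entity("person", f"Officer of {company.name}")]

# Map each entity kind to the connectors that can expand it.
CONNECTORS = {"company": [officers_of]}

def enrich(entity: Entity) -> list[Entity]:
    """Expand one node into its neighbours, e.g. a company into its officers."""
    neighbours = []
    for connector in CONNECTORS.get(entity.kind, []):
        neighbours.extend(connector(entity))
    return neighbours

company = Entity("company", "Acme Ltd")
for person in enrich(company):   # from these officers you'd expand again...
    print(person)
```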

Matthew Stibbe (14:15) I've seen the thing, and it sort of ends up looking like this sort of web network of relationships and entities. It's quite a fascinating thing to see. So tell me, how did you get to Blackdot? Tell me a little bit about your career.

James (14:19) Yeah. So I think I've taken a fairly traditional route for somebody of my generation. I'm just knocking on the door of 50, which horrifies me; I'm counting down the months in a really horrible way... can't believe I got there. But I was born in the mid 70s, and as such that wave of accessible 8-bit micros was perfectly timed for me...

Matthew Stibbe (14:56) I think we're both BBC micro collectors, right?

James (14:58) Exactly. Yeah, I had a BBC Model B, I think in 1983, which my dad brought home. And it's interesting, I enjoyed the games, but I was really interested in coding from a very early age. I found it fascinating, like, enthralling. So I was, you know, writing really basic stuff when I was sort of six or seven years old and never stopped, never ever stopped. Still do, constantly.

Not in BBC Basic so much, but I do dig it out from time to time and have a little play. Yeah, so, it was quite a long time ago, but I think I had a place at university and I took a year out. I was kind of a bit burned out by education. I'm definitely somebody who prefers just to have some space to problem solve in; I'm not always great in a super structured environment. I think I thrive a bit better elsewhere. So I took a year out.

And I ended up, again, kind of... sign of an era: I sent a floppy disk, three and a half inches, so not super floppy, to a local company, with some C source code on it, a game I'd written. And they gave me a job, basically. So I ended up working as a developer on an absolutely fascinating project. I could wax lyrical about this stuff forever, but I find it fascinating because it shows how the industry has changed. I was 18, 19 years old, and I was the only developer on this project. I turned up and they wanted me to write what I now know is a scripting engine; I didn't know it was a scripting engine then. I was writing a very simple operating system for this little 68000-based embedded device, and a control program on a PC that would run the whole thing. It was a training system for phones: a mini switchboard, and the PC would show training software. Now I think it's fascinating because if you went to do that today, somebody would say you need 10 people and several million pounds. It's fascinating how the industry has changed. But yes, I ended up moving on, in a very traditional sense, through a few companies. I spent a lot of time in EdTech, which was really interesting, and I really enjoyed working there; unfortunately, as a lot of companies do over time, the company changed, and it became not for me anymore, and I moved on.

I spent time in EdTech, a little bit of time in consultancy, and some freelance work on and off. And as part of freelancing, at one point I was working with the founder of Blackdot, who had moved on to a different business. He'd started another business, but he was still involved in Blackdot, and I did some work for him for his new business. And at some point... obviously I wasn't at Blackdot, so I can't quite say exactly what was going on...

At some point, Blackdot was looking for somebody to run their engineering teams, and my name was brought up: James did some great work for us, maybe we should speak to him. So I ended up at Blackdot.

I'd been a CTO previously, so it wasn't my first CTO gig, but CTO... it's a bit like architect. What does it mean? It means completely different things in different businesses. To me at least, I prefer to look at it just as being a problem solver. I like to create solutions, I love to create products. It's an interesting term, CTO; it can mean such different things.

Matthew Stibbe (18:23) We were discussing earlier some of the challenges that are coming up in your world. Just for viewers, in the interest of full disclosure: I also host a podcast for Blackdot, where I talk to a lot of open source intelligence practitioners and investigators. And it comes up a lot there as well... this question of source criticism, as I would call it, or critical thinking, and how much you can rely on data.

And I'm wondering if that is something that is under threat when you're thinking about it as a CTO in the biggest sense, when you're thinking about data. How do you protect your critical thinking?

James (19:01) Yeah. It's really hard. It's a very good question. It's a very broad question, I think. So I'll try and pick off some relevant parts.

I think the first thing you have to be aware of, if you're applying critical thinking to data you've sourced from online or various sources, is what's the provenance of that data and how much can you trust it? There's an awful lot of, again, back to AI, what do we call it? Slop, essentially. Slop, which is not just rubbish YouTube content, but also vast amounts of nonsense blogs and really low value content that you can't really trust. It's generally created by AI. It might have been created with good intentions, it might have been created with misleading intentions; sometimes it's being created to literally obscure a problem.

Matthew Stibbe (19:58) A sort of digital chaff to mess up the picture.

James (20:10) Exactly, and you have to pick your way through that. So part of it is understanding the provenance of the data. And something we do at Blackdot there... I come back to these curated data sources. We use curated data sources which can at least help ground an investigation in something that you can trust. And interestingly, if you apply LLMs to this space... you can't eliminate hallucinations, and you can't eliminate LLMs going off the rails, but a good way to help them stay on the rails is to ground them with information that they can trust.

And then having got some of that trustworthy information into the LLM, allow it to then do more exploratory sourcing of information, but giving it something that's grounded. And I think that applies to people too. If you've got something that is a solid starting point for your thought process and you feel you can trust that data, you can kind of start to push out from there, look at other spaces, look at less trustworthy sources and have some means to assess them back against that sort of data set.
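A minimal sketch of that grounding idea: seed the prompt with facts from curated sources and instruct the model to treat everything beyond them as unverified. The wording and structure here are assumptions, not Blackdot's implementation.

```python
def grounded_prompt(question: str, curated_facts: list[str]) -> str:
    """Anchor the model in trusted records before letting it explore."""
    facts = "\n".join(f"- {fact}" for fact in curated_facts)
    return (
        "You are assisting an investigation. The facts below come from "
        "curated, trusted sources. Treat them as ground truth, and clearly "
        "label anything you infer beyond them as unverified.\n\n"
        f"Trusted facts:\n{facts}\n\n"
        f"Question: {question}"
    )

# Hypothetical usage: the facts carry their (trusted) source with them.
prompt = grounded_prompt(
    "What connections might link these two companies?",
    ["Company A is registered in the UK (corporate registry).",
     "Person X is a director of Company A (corporate registry)."],
)
```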

It is difficult, and I think this is going to become an increasing challenge. Back in about 2010, people would often talk about bots, but they were kind of limited in scale. A lot of the time, and it's probably not entirely true, we've talked about them in the context of elections, you know, Russians interfering... back to Russians... it's not Russians, I shouldn't say that, it's Russia, the Russian state, interfering in elections through the use of bots on social media. All of a sudden that stuff's been ramped up to a massive scale, and that's in the open source space.

So if you're working in that space, it's becoming really challenging, and I think that challenge is only going to increase over the next few years. It's going to get harder and harder. Part of the solution to that, I think, is to adopt some of the same techniques in reverse. So if people are using LLMs to generate lots of rubbish data, you need to start using similar solutions to try and unpack what is relevant and what is not. I think that's early days; historically, AI has not been good at sniffing out AI. And there's a reason Facebook don't just use AI to moderate Facebook: if they could, they would. I think they're now starting to push more into automation, they've always had a heavy hand of automation, but I think we'll see them reducing their manual efforts. It's been really hard for AI to spot AI.

Matthew Stibbe (22:40) It feels like, if you've got reliable data or reliable, deterministic processes on one side of the equation, you can be more confident about the outcome. But if you've got a non-deterministic process on the one hand and unreliable or polluted data on the other, the problem is going to multiply; the noise is going to multiply rather than reduce. And certainly from the Clover perspective, they're often talking about using Clover to deliver reliable data into an AI. Anyway, I'm not here to shill for Clover. It's interesting.

James (23:05) Exactly. Yeah, no, but I think that is a way to go with AI.

It's interesting. I often sit with a chap called Brett, who's a really seasoned investigator who works for us and helps our customers. And if you sit with him, it's really fascinating to watch what he does. He essentially builds up a knowledge graph in his head. So he's working in Videris, he's got a graph there, but then you realize he's got this whole other graph in his head.

And he's literally building up these same connections and applying these same kinds of criteria to them, subjectively. And it's interesting, when you look at AI, at LLMs, and at how they work in this agentic model that people talk about, it often comes back to building up and maintaining a knowledge graph. And I think as we move forward, there's going to be a lot of work and thought going into how you build up a knowledge graph which has a sense of trust in it, and a level of trust across it, because it won't all have an equal level of trust.

You know, helping the LLM to navigate that space. It's much the same. It's interesting watching those two things, a human and an LLM, do similar things. The underlying thought process, as it were, and I hate again to apply that word to an LLM, but the underlying process is not that dissimilar in some ways.
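One way to make that concrete is to carry a trust score on every edge, so the graph itself records how far each connection can be relied on. A toy sketch with networkx; the scores, sources and schema are invented for illustration.

```python
import networkx as nx

g = nx.Graph()
# Trust travels with the evidence: a curated registry beats a scraped page.
g.add_edge("Person X", "Company A", relation="director",
           source="corporate registry", trust=0.95)
g.add_edge("Person X", "Person Y", relation="follows",
           source="social media scrape", trust=0.40)
g.add_edge("Person Y", "Company B", relation="investor",
           source="blog post", trust=0.25)

def path_trust(graph: nx.Graph, path: list[str]) -> float:
    """A chain of connections is only as trustworthy as its weakest link."""
    return min(graph.edges[u, v]["trust"] for u, v in zip(path, path[1:]))

print(path_trust(g, ["Company A", "Person X", "Person Y", "Company B"]))  # 0.25
```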

Matthew Stibbe (24:32) There's a real danger of anthropomorphisation with LLMs. You see what they produce and you think they've gone through a sort of human intelligence process to do it. You infer that, and that's not what's happening; something interesting is happening, but only the outcome looks like that. I'm thinking about the challenges of how we produce a sort of reverse Turing test, to prove I am human rather than that I'm not. Anyway, we're almost out of time. There are lots of interesting directions we could go in, but there's one last thing I wanted to ask you about, related to this reliability of data. How do you evaluate the ethical sourcing of data? It must be very tempting, or very easy, in that OSINT world to use data that is either illegal or unethical. They're not always the same.

James (25:23) Yeah, it's really interesting. A lot of our customers are very concerned about the ethics behind the data collection, and so as a result we have to be very concerned about the data collection. Essentially, the way we work is, anything that we look to present to a customer through Videris, we do a degree of due diligence around. How are they collecting the data? Where are they based? Essentially, we run an investigation on the organization, or individual, providing the data, in the same way that somebody would anywhere else.

And it's interesting, because some stuff is legal yet unethical. You can do some things, and it's not even consistent across the world, that are legal but that many businesses would not want to get involved in. So you have to kind of avoid those things. It just comes down to applying an ethical standard to the data sources you adopt. We do that with humans: we have our product team, we have a compliance team, and they will go and look at a source and ask, is this a data source we are comfortable adopting? And there are many we have not adopted as a result. Sometimes people want them, but we can't adopt them.

Matthew Stibbe (26:36) It's important, I think. Anyway, sorry, we are out of time now. So as we bring this episode to a close, James, it's been an absolute delight talking to you. Thank you very much for being on the show.

James (26:47) Thanks for having me on.

Matthew Stibbe (26:56) And if you're still with us: if you'd like to know more about Blackdot or Videris, blackdotsolutions.com is the domain, the website. And on there you'll find a link to the From the Source podcast; if you want more of me doing interviews, that's a good place to go. And if you'd like more practical data insights, or to learn more about CloverDX, please visit cloverdx.com/behind-the-data. Thank you very much for listening, and goodbye.

Resources

  • Blackdot Solutions: blackdotsolutions.com - Provider of the Videris investigations platform for financial risk assessment and compliance
  • From the Source Podcast: blackdotsolutions.com/podcast - Blackdot's podcast featuring interviews with OSINT practitioners and investigators
  • CloverDX Behind the Data: cloverdx.com/behind-the-data - Podcast homepage with more episodes on data integration and management