Building trust in open source data: Investigations, AI and ethical sourcing

Written by CloverDX | October 30, 2025

James Randall, CTO at Blackdot Solutions, is focused on helping investigators find clarity in a noisy world.

His team builds and manages Videris, a platform that supports financial institutions, corporations, and government agencies as they trace connections between people, organizations, and risks across vast quantities of open source data.

Their goal isn’t just to surface information but to surface it responsibly.


Disambiguating entities across messy data

Videris processes both structured and unstructured data to support investigations. Structured data is the more reliable of the two, thanks to its consistent formatting. But much of the real insight comes from unstructured sources like web pages, PDFs, and images, where names, dates, and identifiers are often incomplete, far apart in the document, or buried under irrelevant text.

That’s where large language models (LLMs) have proven surprisingly helpful.

“We were extracting entities from those documents. We experimented with a traditional natural language processing (NLP) solution… and we also said, let’s try an LLM,” James explains. “The difference in quality of that extraction and subsequent disambiguation was really quite impressive. The LLM won by a good margin.”

Notably, the model was able to associate data points across long distances in the document. “You might be looking for something like a date of birth or an address. They’re not necessarily close together, and even if they are, they’re surrounded by a lot of other words,” James says. “Traditional NLP has not always done a great job of making those associations.”
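
To make this concrete, here is a minimal sketch of the kind of extraction James describes. It isn't Videris code: call_llm is a hypothetical stand-in for whatever chat-completion client you use, and the prompt and output schema are purely illustrative.

```python
import json

# Hypothetical helper: wrap whichever chat-completion API you use
# (a hosted model or a local one) and return the model's text reply.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

EXTRACTION_PROMPT = """\
Extract every person and organization from the document below.
For each entity, attach any date of birth, address, or identifier
mentioned ANYWHERE in the document -- related facts may sit far apart.
Return only a JSON list of objects with the keys:
name, type ("person" or "organization"), date_of_birth, address, identifiers.
Use null for anything unknown.

Document:
{document}
"""

def extract_entities(document: str) -> list[dict]:
    reply = call_llm(EXTRACTION_PROMPT.format(document=document))
    return json.loads(reply)  # in production, validate against a schema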

Flexibility is the future

James emphasizes how LLMs don’t depend on strict input formatting to perform well. “You can lob in some JSON, you can lob in a CSV with the same entity in, and it can be partial, and say, ‘is this the same thing?’ or ‘consolidate those two things,’ and it will generally do a pretty good job of it.”

In the past, this kind of comparison would’ve required extensive preprocessing and data normalization. “What’s really interesting about LLMs is you tend not to have to do that,” James adds. “You can help them, but they’re very good at working with loose structures.”
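
The same looseness extends to consolidation. Here is a hedged sketch of that "is this the same thing?" pattern, reusing the hypothetical call_llm helper from the sketch above, with one partial record as JSON and another as a CSV row:

```python
import json  # call_llm as defined in the previous sketch

MATCH_PROMPT = """\
Record A (JSON): {a}
Record B (CSV row): {b}

Both records may be partial. Do they describe the same real-world entity?
Return only JSON: {{"same_entity": true or false,
"merged": <consolidated record, or null>, "reasoning": "<one sentence>"}}
"""

def consolidate(record_a: dict, record_b_csv: str) -> dict:
    reply = call_llm(MATCH_PROMPT.format(a=json.dumps(record_a), b=record_b_csv))
    return json.loads(reply)

# e.g. consolidate({"name": "J. Smith", "dob": "1974-02-01"},
#                  "James Smith,1 Feb 1974,London EC2")
```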

Investigations powered by human expertise

Blackdot’s mission isn’t to automate decision-making, but to assist trained professionals in complex investigative work. Videris lets users query across a wide range of sources (including corporate records and social media), then presents the results in a visual graph that helps uncover links between entities.

“A Russian oligarch does not directly invest in a utility company in the UK,” James says, as an example. “They’ll do it not just through shell companies, but through individuals they know who are often connected across social media. And in some cases, they’ve hardly left any tracks.”

The platform brings all that data together and gives users the tools to explore, annotate, and decide. “Ultimately, we hope they can get to a conclusion. Which is, that person is okay or that person is not okay,” James explains.

Everything is auditable, he explains: “There’s a full audit trail. So if somebody goes in there and starts tinkering with it, you can see it.”
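
Blackdot hasn’t published how that trail is built, but one common way to make tampering visible is a hash-chained, append-only log: each entry commits to the hash of the one before it, so a retroactive edit breaks the chain. A rough sketch of the idea, not the Videris implementation:

```python
import hashlib
import json
import time

class AuditTrail:
    """Append-only log; each entry embeds the previous entry's hash,
    so editing history after the fact is detectable."""

    def __init__(self):
        self.entries = []

    def record(self, user: str, action: str, detail: dict) -> None:
        entry = {
            "ts": time.time(), "user": user, "action": action, "detail": detail,
            "prev": self.entries[-1]["hash"] if self.entries else "genesis",
        }
        entry["hash"] = self._digest(entry)
        self.entries.append(entry)

    def verify(self) -> bool:
        prev = "genesis"
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if e["prev"] != prev or e["hash"] != self._digest(body):
                return False  # somebody tinkered with it
            prev = e["hash"]
        return True

    @staticmethod
    def _digest(body: dict) -> str:
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
```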

From data ingestion to actionable info

The underlying platform is powered by a toolset that allows teams to quickly connect to APIs, pull data in, and extract relevant entities.

“We have a tool we call Extension Builder. It allows a non-engineer to connect to an API, shape the data, and bring it into the system,” James says.

This ingestion process is designed for fault tolerance. With sources across the open web, delays and dropouts are common. “We don’t wait for it all to complete because something could be taking five minutes for some transient reason on the internet,” he adds.

Instead, data flows into Videris as it becomes available. “One of the most common ways people access it is our search interface. It looks a little bit like a search engine, but it’s sitting on top of this wide variety of data.”
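
As an illustration of that pattern (not Blackdot’s implementation), an ingestion layer can fan a query out to every source in parallel and stream each result back the moment it lands, abandoning stragglers after a deadline. A sketch using Python’s standard library, where sources maps a name to a hypothetical fetch callable:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FutureTimeout

def search_all(sources: dict, query: str, timeout: float = 30.0):
    """Query every source in parallel, yielding each result as soon as it
    arrives; one slow or failing source never blocks the rest."""
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(fetch, query): name for name, fetch in sources.items()}
    try:
        for future in as_completed(futures, timeout=timeout):
            name = futures[future]
            try:
                yield name, future.result()
            except Exception as err:  # transient dropout: surface it, move on
                yield name, {"error": str(err)}
    except FutureTimeout:
        pass  # deadline hit: stop waiting for the stragglers
    finally:
        pool.shutdown(wait=False, cancel_futures=True)

# e.g. for name, result in search_all(
#          {"corporate_records": fetch_companies, "social": fetch_social},
#          "ACME Holdings"):
#     render(name, result)   # fetch_* and render are hypothetical
```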

Trust starts with provenance

The team at Blackdot also takes care to vet the data sources themselves. Legal doesn’t always mean ethical, and both matter.

“Some stuff is legal yet unethical. You can do some things, particularly across the world, that are legal but many businesses would not want to get involved in,” James explains. “It just comes down to applying an ethical standard to the data sources you adopt.”

That process is human-led, he explains: “We have our product team and our compliance team. They will go and look at it and go, ‘is this a data source we are comfortable adopting?’”

Critical thinking is a disappearing skill

In an age of AI-generated content and information pollution, critical thinking is more essential, and more endangered, than ever.

“You have to be aware of the provenance of that data and how much can you trust it,” James says. “There’s an awful lot of what we call slop. Not just rubbish YouTube content, but vast amounts of nonsense blogs and really low-value content that you can’t really trust.”

Some of it is accidental, but much of it is deliberate. “Sometimes it’s being created to literally obscure a problem,” James adds.

Blackdot is looking at ways to combat this with the same tools used to create it. “If people are using LLMs to generate lots of rubbish data, you need to start using similar solutions to try and unpack that, to find what is not relevant and what is.”

Building human-like reasoning into AI workflows

One of the most promising paths forward is to mimic the kind of cognitive modeling used by expert investigators.

“If you sit with someone like Brett Redman, Head of Intelligence at Blackdot, who’s a really seasoned investigator, he’s essentially building up a knowledge graph in his head,” James says. “He’s working in Videris, he’s got a graph there, but then he realizes he’s got this whole other graph in his head.”

James sees parallels in the agentic workflows used with LLMs. “The underlying process is not that dissimilar in some ways. It’s about building up and maintaining a knowledge graph, and eventually we’ll need ways to track trust across that graph. Because not every connection will be equally trustworthy.”
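
As a toy illustration of that last point (an assumption about where this could go, not Videris internals), a knowledge graph can carry a trust score on every edge, with confidence in a chain of connections capped by its weakest hops:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeGraph:
    # adjacency map: entity -> {neighbour: trust in that connection, 0..1}
    edges: dict[str, dict[str, float]] = field(default_factory=dict)

    def connect(self, a: str, b: str, trust: float) -> None:
        self.edges.setdefault(a, {})[b] = trust
        self.edges.setdefault(b, {})[a] = trust

    def path_trust(self, path: list[str]) -> float:
        """Multiply trust along a chain of connections: one unverified
        hop (say, a loose social-media link) weakens the whole path."""
        trust = 1.0
        for a, b in zip(path, path[1:]):
            trust *= self.edges.get(a, {}).get(b, 0.0)
        return trust

g = KnowledgeGraph()
g.connect("Oligarch", "Associate", 0.9)    # documented relationship
g.connect("Associate", "ShellCo", 0.6)     # inferred from filings
g.connect("ShellCo", "UK Utility", 0.95)   # corporate record
print(g.path_trust(["Oligarch", "Associate", "ShellCo", "UK Utility"]))  # ~0.51
```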

To get more insights from James on OSINT and AI, listen to the full Behind the Data podcast episode here: Ethics, critical thinking and disambiguation in data