Behind the Data

Building high-volume, high-stakes, real-time data pipelines

Written by CloverDX | May 22, 2025 5:00:14 PM

About this episode:

In this episode of Behind the Data, Matthew Stibbe interviews Yossi Leon, CTO of FIA Tech, discussing the complexities and innovations in data engineering, particularly in the financial sector. Yossi shares insights on the challenges of normalizing diverse data sources, the importance of real-time data processing, and the development of the Trade Data Network. The conversation highlights the significance of adopting industry standards and the role of technology in enhancing data efficiency and transparency.

AI-generated transcript

Matthew Stibbe (00:01.386) Well, hello everybody. Welcome to Behind the Data with CloverDX. I'm your host, Matthew Stibbe, and today I'm with Yossi Leon, who is Chief Technology Officer at FIA Tech. Great to have you on the show, Yossi.

Yossi Leon (00:14.456) Thank you, thank you. Glad to be here.

Matthew Stibbe (00:17.29) And before we dive into your world, I'd love to start by asking you, what is really exciting you at the moment in the world of data engineering?

Yossi Leon (00:28.014) I think the fact that we're in a jungle of data and everybody's speaking a different language, and you're actually the one bringing them all together, trying to normalize and be the translator between those different languages - this is what excites me, that we're basically helping bring people and organizations together to speak the same language.

Matthew Stibbe (00:55.107) And what's particularly, what piece of technology is allowing you to be that ambassador, that translator between worlds?

Yossi Leon (01:03.982) We're basically connected to all the different participants in the market, consuming their data and normalizing it using our ETL, CloverDX, into a single, generic reference data. That basically helps everybody just see the same content and speak the same language.

Matthew Stibbe (01:36.758) Well, so tell me about this world where you've got the data coming in. Where does FIA Tech come from? Who are the sponsors? Let's start with business, with the organization.

Yossi Leon (01:46.894) Yeah, that makes sense. Yeah, so I'll take a step back. FIA Tech is a global technology company, and we provide solutions to increase efficiency, but also reduce risk. And all of this is happening in the futures market - options, listed derivatives. And so we build products like a brokerage settlement platform, a legal document repository, a reconciliation platform, a cross-exchange reference data subscription, and also a trade ledger called Trade Data Network.

So we're very much active in the futures market and our customers are financial institutions from the biggest banks to the smallest hedge funds.

Matthew Stibbe (02:44.443) And you're the fabric that connects all of that together.

Yossi Leon (02:48.34) Exactly, yeah. So they're using our products when they want to trade futures. They go into a legal agreement in our platform and they all sign those legal agreements before trading futures. They are paying each other brokerage fees using our brokerage settlement platform and in the trade data network they're basically all sending us data.

Matthew Stibbe (03:14.919) So this is quite high stakes stuff, right? I mean, it's got to work or else money is lost, yeah?

Yossi Leon (03:21.986) Yeah, 100%. It's all done in real time. The market is very much dependent on us, especially in times when there is a lot of volatility in the market. So they're dependent on us and our platform to provide them transparency and showing where the data, where the trade is at a given time.

Matthew Stibbe (03:45.782) So you talked earlier about being an intermediary, I used the word ambassador, a translator. How many different sources of data are we talking about?

Yossi Leon (03:57.9) Right, so we're talking about 100 different data sources, and some of them are coming from the buy-side firms and some of them are coming from the sell side, from the brokers in the market. Some of them are coming from the clearing houses and exchanges sending us data. So those are different types of sources, and they're all sending us the data. So really a lot of sources, with a lot of formats, with different frequencies and different types of connectivity as well, all coming into our platform.

Matthew Stibbe (04:36.98) And that all has to be translated and normalized into one dataset or one format?

Yossi Leon (04:45.534) Exactly. So as part of the definition of what the data structure looks like, the schema, we have the standards. And in many cases, we ask the participants to use the standard to send us the data. But in many cases, this is a long-term project for them to be able to build to the standards. So they're basically sending us whatever they have.

Matthew Stibbe (05:14.166) They've got...

Yossi Leon (05:14.508) ...the way they represent the trade exactly and we're doing all the translation to basically transform this into a standard format, standard reference data.
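
To make that normalization step concrete, here is a minimal sketch of mapping one participant's own trade representation onto a single standard record. The field names and schema below are illustrative assumptions, not FIA Tech's actual standard:

```python
from dataclasses import dataclass

# Illustrative standard trade record; field names are assumptions, not FIA Tech's schema
@dataclass
class StandardTrade:
    trade_id: str
    product_code: str     # normalized product identifier
    quantity: int
    price: float
    execution_type: str   # e.g. "ELECTRONIC" or "VOICE"

def normalize_broker_a(raw: dict) -> StandardTrade:
    """One per-source mapping: this broker sends 'whatever they have'."""
    return StandardTrade(
        trade_id=raw["TradeRef"],
        product_code=raw["Instrument"],
        quantity=int(raw["Qty"]),
        price=float(raw["Px"]),
        execution_type="ELECTRONIC" if raw.get("ExecVenue") == "E" else "VOICE",
    )

raw = {"TradeRef": "T-1001", "Instrument": "ZC", "Qty": "50", "Px": "456.25", "ExecVenue": "E"}
print(normalize_broker_a(raw))
```

In practice each source would get its own small mapping like this, all feeding the same target schema.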

Matthew Stibbe (05:26.742) And one of the challenges, I guess, must be around data validation and anomaly detection for that. And how do you manage that? And how is that changing, especially if you're having to do it faster and more in real time?

Yossi Leon (05:41.772) Right, there is a lot that needs to be done, and as you mentioned, it's all done in real time. So we're identifying and trying to normalize this based on the reference data that we have. But in cases where, for example, a new account is being introduced, a new type of trade, or they send corrupted data, this is basically when you need to raise a flag, an exception, and put this on an exception queue, either for a system to review it and try different types of options, or eventually it gets to an operations person who will validate this anomaly and provide the right reference data or aliasing.
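
A rough sketch of that validation-and-exception pattern, with in-memory stand-ins for the reference data and the exception queue (all names are illustrative):

```python
from collections import deque

# In-memory stand-ins for reference data and the exception queue
known_accounts = {"ACC-001", "ACC-002"}
known_trade_types = {"FUTURE", "OPTION"}
exception_queue = deque()

def validate(trade: dict) -> bool:
    problems = []
    if trade.get("account") not in known_accounts:
        problems.append("unknown account")
    if trade.get("trade_type") not in known_trade_types:
        problems.append("unknown trade type")
    if problems:
        # Raise a flag: park the record for automated retry or manual review
        exception_queue.append({"trade": trade, "problems": problems})
        return False
    return True

validate({"account": "ACC-999", "trade_type": "FUTURE"})
print(exception_queue)  # deque([{'trade': {...}, 'problems': ['unknown account']}])
```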

Matthew Stibbe (06:37.323) Is there a role for AI in that? Is that something that's coming or here?

Yossi Leon (06:41.43) Yes, I mean definitely there's a lot of historical data that you can train on, get the model learning, and basically identify what that new symbology, that new trade type, could potentially be, and assign that. This is something that we're working on, to basically get to a point where we can use something like this day to day.
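
As a deliberately naive stand-in for the machine-learning approach described here, even a simple fuzzy match against historically resolved aliases can suggest candidates for an operations person to confirm (the aliases below are invented):

```python
import difflib

# Historical aliases that operations staff have already resolved (illustrative)
historical_aliases = {
    "CORN CME": "ZC",
    "C CME": "ZC",
    "WTI CRUDE NYMEX": "CL",
}

def suggest_alias(unknown_symbol: str) -> str | None:
    """Suggest the most likely product code for a symbol we have not seen before."""
    match = difflib.get_close_matches(
        unknown_symbol.upper(), historical_aliases.keys(), n=1, cutoff=0.6
    )
    return historical_aliases[match[0]] if match else None

print(suggest_alias("Corn (CME)"))  # suggests "ZC"
```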

Matthew Stibbe (07:16.246) How did FIA Tech come into existence? When and why?

Yossi Leon (07:22.734) We started a long time ago out of a need for financial institutions to have a technology company that would be able to build products that don't make sense for each company to build on its own, but require some kind of a hub or a network to be built.

A good example is the brokerage settlement platform: instead of each firm building their own settlement platform and doing that with their counterparty, having one centralized product to do that made sense. And this is how we all started, with the need coming from the futures industry participants for a technology solution, and then one solution after the other basically came into place as needed. In many cases you involve, for example, lawyers. There is a lot on the compliance side where what used to be our mother company, FIA, gets involved. In other cases, you have working groups that help to shape the product - representatives from the different financial institutions helping us to build the product based on their needs.

Matthew Stibbe (08:48.404) And that sounds like there's quite a complicated governance model or a lot of stakeholders involved. Can you talk to me a little bit about how you manage, wrangle all of that? Or maybe you don't, maybe you don't need to. I don't know what your setup is.

Yossi Leon (09:01.95) It is complicated, to start with, to get an agreement or an alignment from so many participants. But I think over time we've learned to work with those firms, and they're helping us to shape our products. In some cases, as I mentioned, we use FIA, which is the Futures Industry Association - they have committees that are very much specialized on the legal and compliance side. So we take recommendations from that committee to know how to build the product and what the requirements for the product are, so it's aligned with the requirements coming from them.

Matthew Stibbe (09:47.742) I see. So let's dive into a specific project. And I think you've mentioned the trade data network. Tell me first of all, the history of that. When did that project start? What was it intended to achieve?

Yossi Leon (10:02.666) Right. So that started in 2021, after our Series A investment by 12 different banks. They basically wanted us to do more and bigger. And the goal was to build a product that would be a golden-source trade ledger. One of the challenges in the futures market today is that there is inconsistency. There are many breaks, trade breaks, because each participant, each side of the trade, is seeing the trade differently.

So the goal was to build a golden source that will capture the data from the different sources and basically provide transparency on how the trade was executed, what the counterparty and account for this trade are, and what the fees and commissions for this trade are.

As I mentioned, a lot of breaks happen for those reasons - firms, participants, are not sure how the trade was executed. A good example is whether it was done electronically or via voice.

Matthew Stibbe (11:13.374) Does that still happen?

Yossi Leon: Yeah, that still happens and it causes many breaks. Basically, having this trade ledger, we start consuming data from the different sources...

Yossi Leon (11:28.256) and almost like building a puzzle that correlates between the different pieces of how the trade is being represented by those different sources and reflecting it back to the participants and the counterparties on the trade. So this is the trade data network and as it sounds, it's a network. So you need to start having participants joining this, both from the clearinghouse but also from the buy side and also from the sell side, which are the banks, the brokers.

So it started in 2021. We built this as a microservices, cloud-native solution. And basically about two years later, we already had something available in production and firms starting to use it.

Matthew Stibbe (12:24.534) I read that that's hosted on AWS, but what are the elements and components of the tech stack to make that work?

Yossi Leon (12:32.236) Right, all of this is being done in microservices on Kubernetes using AWS native components, but we also have an ETL layer outside of this that is highly available, scaling vertically and horizontally, and able to consume so much data in real time and scale up and down, especially when the market is volatile and we have high volume. So that was really the goal. And AWS enabled us to build something robust like this.

Matthew Stibbe (13:20.31) So the volume of activity and the data flows vary over time. Is it very peaky?

Yossi Leon (13:29.44) It can be very peaky. At times like right now, it peaks, and we're seeing the challenge with other participants not being able to send us the data, because they have issues on their end - their systems are having difficulty coping with the volume.

Matthew Stibbe (13:47.958) So you've got the scalability, they don't perhaps. And the ETL piece, is that on Clover?

Yossi Leon (13:56.302) Yeah, this is CloverDX, and basically we tried different ETLs. Initially, as engineers, you're saying, okay, I know how to build something like this, but then you ask yourself, why would I? There are enough solutions out there. So I'd rather spend the time and money on something that doesn't exist out there.

Matthew Stibbe (14:07.615) Hmm.

Yossi Leon (14:17.998) So we looked for different solutions. We tried one ETL, but it wasn't scalable enough. It wasn't generic enough. You could not build something... the design was not flexible enough. And this is when we switched to CloverDX, and we found something that works very well from a scalability perspective, from a design perspective, from basically building generic components that can be used for different use cases, and working well in the AWS environment with high availability and redundancy. So that worked very well for us.

Matthew Stibbe (15:04.17) And you describe data coming in, but you have to also then transform it and do things and then get the data back out to people. In what ways is the data consumed once it's been ingested and processed? Where does it go?

Yossi Leon (15:23.69) It goes back to the market participants. So basically, think about this: the trade coming in, we get one side of the trade, we get another side of the trade, we match between those two sides, and basically send back information to the broker. One good example is a broker sending us a trade on how the trade was executed.

The client is saying, okay, I want you to allocate the trade 40%, 60% to two different accounts. And we send those instructions back to the broker so they know how to allocate. So this is one example. Another example: our customers are interested in getting an update for every event that is happening, every change on the trade. They want to get a dispatch, an event going out using an API. They want to consume it so they can also reflect this in their systems.
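
A toy illustration of that flow: match the two sides of a trade on a few shared fields, then turn a client's 40/60 instruction into per-account quantities to send back to the broker. The matching keys and field names are assumptions, not the real matching engine:

```python
def match_sides(broker_trade: dict, client_trade: dict) -> bool:
    # Match the two representations of the same trade on a few shared fields
    keys = ("product_code", "trade_date", "quantity", "price")
    return all(broker_trade[k] == client_trade[k] for k in keys)

def allocate(total_qty: int, instructions: dict) -> dict:
    """Split a matched trade across accounts, e.g. {'ACC-A': 0.4, 'ACC-B': 0.6}."""
    allocations = {acct: int(total_qty * pct) for acct, pct in instructions.items()}
    # Put any rounding remainder on the last account so quantities sum exactly
    remainder = total_qty - sum(allocations.values())
    last_account = list(instructions)[-1]
    allocations[last_account] += remainder
    return allocations

print(allocate(25, {"ACC-A": 0.4, "ACC-B": 0.6}))  # {'ACC-A': 10, 'ACC-B': 15}
```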

Matthew Stibbe (16:28.694) When you were designing and building this system, what were the biggest challenges that you had to overcome?

Yossi Leon (16:37.762) I think it's thinking generically and thinking about all the possible use cases, so we don't need, a year from now, to rebuild some components. So make it generic enough: building the trade lifecycle generic enough, building a matching engine that is generic enough. And also on the ETL side, building components that will serve us not just for the next 10 clients, but for the next 100 clients.

That will minimize the time to onboard new firms and also minimize the number of issues and the data quality issues that we're gonna have.

Matthew Stibbe (17:22.048) Were you able to, for example, thinking about data quality, build validation systems or checks or something that you could then reuse for each new inbound pipeline? How did you approach that?

Yossi Leon (17:36.94) Yeah, in some cases we even had to build microservices that will serve and be used by the ETL. A good example is reference data symbology. Each market participant might use a different symbology. One is using Bloomberg symbology to define the product in the market.

Matthew Stibbe (18:04.384) This is the future I want to buy. It's called this.

Yossi Leon (18:07.788) Yeah, exactly. Corn on CME. The other one is using Reuters symbology. The other one is using clearing product codes. GMI. There are just different types of symbologies out there. And you want to be able to eventually translate all of them into one generic one - we use the clearing product code. So we built a microservice that you're able to call and say, okay, this is what I got, I'm not sure what to do with this, and then you get back the response: this is the product code you should use, and this is how it's flowing now into the product. So that's a good example of something we've built that is basically being used by the product itself, but also by the ETL.
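
A stripped-down sketch of that kind of lookup: the caller passes whatever identifier it has and gets back the clearing product code to use. The symbols and codes here are illustrative, not verified mappings:

```python
# Illustrative mappings from various symbologies to one clearing product code (not verified)
SYMBOLOGY_MAP = {
    ("BLOOMBERG", "C A Comdty"): "ZC",   # corn on CME, for illustration only
    ("REUTERS", "Cc1"): "ZC",
    ("GMI", "25"): "ZC",
}

def resolve_product_code(symbology: str, symbol: str) -> str:
    """Translate whatever identifier the caller has into the clearing product code."""
    try:
        return SYMBOLOGY_MAP[(symbology.upper(), symbol)]
    except KeyError:
        # Unrecognized symbol: this is where the exception-queue handling would take over
        raise LookupError(f"No mapping for {symbology}:{symbol}")

print(resolve_product_code("bloomberg", "C A Comdty"))  # ZC
```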

Matthew Stibbe (18:41.323) Right.

Matthew Stibbe (18:58.922) And how did you tackle this challenge of making it, you use the word generic, but I suppose future-proof or scalable. What did you have to do to make that happen?

Yossi Leon (19:13.934) It's being able to, first of all, break it into the right services or microservices - knowing when something is too big and breaking it into microservices so that you can scale each microservice by itself. So when we're seeing an increase in volume and the matching service, for example, needs to have more capacity, it's able to consume more, building additional matching groups, for example, and processing them in time.

So it's a lot about building it right from an infrastructure perspective but also from a code perspective, to utilize the microservice architecture, and all of this enabled us to have something more scalable.

Obviously we haven't thought about everything, and we're learning as we go sometimes. So one component, you identify after six months that it's not scalable enough, it's not generic enough, it doesn't have all the use cases, we see a peak that it's not able to handle. So you fix that specific component, but you don't need to start rebuilding your entire platform.
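
As a toy example of that scaling idea, trades can be partitioned into independent matching groups so more workers (or more service instances) can be added when volume peaks. The grouping key is an assumption:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def matching_group(trade: dict) -> str:
    # Partition on something both sides of a trade share; the key choice is an assumption
    return f"{trade['product_code']}|{trade['trade_date']}"

def match_group(trades: list) -> int:
    # Placeholder for real matching logic; each group is independent of the others
    return len(trades)

def run_matching(trades: list, workers: int = 4) -> list:
    groups = defaultdict(list)
    for t in trades:
        groups[matching_group(t)].append(t)
    # Each group can be handled by its own worker, or by its own service instance
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(match_group, groups.values()))

trades = [{"product_code": "ZC", "trade_date": "2025-05-22"},
          {"product_code": "CL", "trade_date": "2025-05-22"}]
print(run_matching(trades))  # [1, 1]
```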

Matthew Stibbe (20:31.97) My programming days were before the days of containerization and microservices. You just wrote a big load of C and C++ and then it just ran somewhere. Forgive me, I don't know if this is the case or not, but it seems to me with microservices, a big challenge is to figure out - too small is bad, too big is bad - getting the right level of granularity. Am I on the right lines there?

Yossi Leon (20:59.476) Yeah, yeah, it's definitely one of the challenges in microservices. You can break it into an endless number of components, basically. So you need to have something that has enough but not too much, as you said. And I think developers have learned, in the last few years, more and more where that borderline is.

And in some cases, we see that they're building something too big. So as part of code review, your goal as a manager is to say, okay, just think about this: if you build this functionality and break it into two different pieces, it can be used by other services. So it makes more sense. So that's something that feels like it's part of education, and in some cases it's just trial and error.

Matthew Stibbe (21:58.518) What are the, we touched on this a little bit with the symbol code lookup. What are the challenges around building, it's been a long day here in London, I've forgotten the name, but sort of a dictionary definition of all the data structures and having that sort of regularized and standardized.

Yossi Leon (22:20.782) Luckily, that's something that we have. It's one of our core offerings as a company, something called Data Bank, that is already capturing the data from 75 different exchanges. Basically, having all the different codes, the product codes of all those exchanges, including position limits and fees and product definitions. But that is the base that you start with. And now you need to start mapping everything else to this base and do the correlation.

So we're in a position where we already have the core of the data. And now the challenge is to know how to do the linkage between the other reference data and this core reference data. And some people feel like it's trivial, but it's far from trivial, just because there is a lot of logic in many cases on how to map something. It's not one-to-one in many cases.
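
To show why that linkage is rarely one-to-one, here is an invented rule where the target code depends on more than the symbol itself:

```python
def map_to_clearing_code(exchange: str, symbol: str, contract_type: str) -> str:
    """Invented mapping rule: the clearing code depends on more than the symbol alone."""
    rules = {
        ("CME", "ZC", "FUTURE"): "ZC",
        ("CME", "ZC", "OPTION"): "OZC",   # same symbol, different contract type, different code
    }
    try:
        return rules[(exchange, symbol, contract_type)]
    except KeyError:
        raise LookupError(f"No rule for {exchange}/{symbol}/{contract_type}")

print(map_to_clearing_code("CME", "ZC", "OPTION"))  # OZC
```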

Matthew Stibbe (23:37.758) Yes, simple isn't always easy and easy isn't always simple. You mentioned to me before we started recording this lovely idea that the markets will reward firms that treat data not just as infrastructure but as product. And I wondered if you could unpack that a little bit and explain a little bit what you meant and how that applies to this trade data network perhaps.

Yossi Leon (24:06.284) Yeah, I think that companies are more and more embracing normalizing product reference data and basically adopting market standards. And we see those challenges with large organizations like clearing houses, but also with brokers and buy side.

It's difficult to embrace standards because your systems in many cases are either legacy platforms that are just not able to support that format, or it just takes a lot of operational resources. But once you're able to comply with that, or align your systems with those standards, you really start benefiting from having minimal breaks, having more efficiency, and really speaking the same language as the market. And that also means streamlining your data in a better way to FIA Tech, but also to other market participants. So it really becomes essential for firms to adopt standards.

Matthew Stibbe (25:36.288) There's a network effect in play here, isn't there? The more people adopt the standard, the more valuable the standard becomes, I suppose.

Yossi Leon (25:44.086) Yeah, and in many cases you see the value - if you're not seeing the value for your own firm, you start seeing the value for other firms, and at some point you feel like, okay, I'm not top of the list in terms of efficiency. And that also causes firms to pick other firms that are top from an efficiency perspective, because from a cost perspective they can probably afford lower costs. So it ends up being something that firms feel the need to adopt at some point, not to be left behind. And as a firm, what we've done in the past is provide scorecards for firms, to basically tell them where they are compared to their peers. Obviously, they see themselves compared to everyone else.

Matthew Stibbe (26:24.308) Hmm.

Yossi Leon (26:41.966) It's not necessarily open for everyone, but at least they have a sense where they are and where they were a quarter ago and a year ago.

Matthew Stibbe (26:52.394) Data efficiency scorecards, benchmarking. I love that idea. I'm going to use that. But we're almost out of time, Yossi, although I think we could continue with this, it's so fascinating. So I'd like to close with one last question. Thinking back, what, four years now, to the beginning of this Trade Data Network project: knowing what you know now, what do you wish you had known four years ago?

Yossi Leon (27:20.429) Yeah.

I feel like it was a very interesting journey, and I personally enjoyed going through this from the beginning to productionizing the Trade Data Network. But I think it's part of the evaluation that we have done, especially on the ETL side: basically being able to think more about the real use cases that you're going to have, and trying as much as possible to build a proof of concept that is close enough to what you think is going to happen once the product is in production. So build something very narrow but robust enough to mimic both the volume and the complexity of the data structure you expect to get.

I think this is something that is essential, and in many cases you're saying, okay, I tested enough, I did my due diligence, I feel comfortable. But you need to spend, I feel, more time, more use cases, more close-to-real-life examples to really test the products that you're going to use.
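
One hedged illustration of that advice: a proof of concept can be fed synthetic data that mimics the expected volume and messiness. Everything below is made up for illustration:

```python
import random
import string

def synthetic_trade() -> dict:
    """One made-up trade record; the point is realistic volume and messiness, not real values."""
    return {
        "TradeRef": "T-" + "".join(random.choices(string.digits, k=8)),
        "Instrument": random.choice(["ZC", "CL", "ES", "???"]),   # include bad values on purpose
        "Qty": random.randint(1, 10_000),
        "Px": round(random.uniform(1, 5_000), 2),
        "ExecVenue": random.choice(["E", "V", None]),             # and missing data
    }

# Stream something close to an expected peak-day volume through the prototype pipeline
sample = (synthetic_trade() for _ in range(100_000))
print(next(sample))
```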

Matthew Stibbe (28:45.302) That resonates so much for me. In my history, I used to design and program computer games a long time ago. Nobody can write down what's gonna make a game fun in a specification, right? You have to prototype and do proofs of concept and explore. And when you sit at the beginning of a project and go, I'm going to write a specification that's gonna cover all possible eventualities - how can you do that? Nobody in the world can write down a list of the things they do not know.

And so there is absolutely a role for planning and thinking and specifications and generalizing. And there's a role for, like, let's do something, learn something, see what we need to know. And on that bombshell, if I can say, Yossi, it's been an absolute delight talking to you. Thank you very much indeed. And that brings the episode to a close. Thank you for joining us.

Yossi Leon (29:37.166) Thank you. Thank you for having me.

Matthew Stibbe (29:39.25) And if you're watching this or listening to this and you'd like to get more practical insights about data and ETL and learn more about CloverDX, please visit cloverdx.com/behind-the-data. That's the URL for this podcast. Thank you for joining us today and goodbye.