CloverDX is a new name for CloverETL Learn more
The concept of “event-driven programming” has been around for a long time. And this approach can actually be implemented when designing data transformations with virtually no additional effort—depending on proper tooling, of course. From my experience, every data-driven company is in dire need of having data movement automated.I want to kick off a blog series by focusing on what event-driven design is and its value compared to timetable scheduling. In subsequent posts, I’ll address how the day-to-day struggles of companies can be minimized or even completely solved by event-driven design. Let’s start with the basics.
Regular timetable scheduling certainly has its place for things that don't have downstream dependencies, like taking database snapshots or backups, doing regular housekeeping, or updating reporting databases. However, it starts to fall apart in larger integration and data pipeline projects, especially when pipeline timing is of the essence, but data may come in irregularly. Imagine a situation where data arrives a couple of seconds later than the processing is scheduled. In the worst case, this might cause overloading of the ETL (Extract-Transform-Load) server with data—the next iteration would have to process everything from the previous batch plus everything from the new one.
I’ve heard many opinions about how event-driven ETL processing is an unnecessary luxury. These voices usually claim that if infrastructure is carefully timed, there really shouldn’t any issues or risks. But I’ve witnessed this schema fail way too many times when, for example, billing systems were empty because invoice data was stuck somewhere mid-process due to a network spike and/or outage, waiting for the schedule of next application to kick off.
If you’re lucky, the support teams (who are, of course, assigned to individual applications rather than being dedicated to the overall pipeline) will notice the delays in the inbound data. Their typical remedy is to run out-of-schedule load jobs or change the schedule, both of which could present a rather unpleasant ripple effect in the entire pipeline.
With ever-growing data volumes and infrastructure complexity, the question is not if that’s going to happen, but rather, when? Take for example a manufacturing company running its own B2B eCommerce system. The billing team could tell you how they feel coming to work in the morning only to find out there’s nothing to process, making them wait until the support team pushes data through manually. But I’m getting a little ahead of myself, since I’ll cover this more in depth in a later blog.
What are the advantages of the event-driven ETL approach then? First, it gets rid of one variable in the “What will go wrong?” equation: the missed processing windows followed by overloading the subsequent iterations. Reacting to events directly rather than imposing some arbitrary schedule also creates a more fluid and reliable pipeline.
An event-driven setup allows a batch processing engine to act in real time as close as possible. Actually, “near real-time” would be a much better term, as we’re talking about processing data as soon as possible, not necessarily guaranteeing a predefined response time as “real-time” would demand.
Yes, there is some overhead, but it’s usually negligible in comparison to the cost of spinning up a scheduled job just to realize there’s nothing to process. You wouldn’t wake up in the middle of the night, take a shower, get dressed for work, and walk outside just to realize it's 4am and you still have a few precious hours left of shut eye. Instead you’d just check the alarm clock, see it’s still early, and go back to sleep.
Some actions may not even be possible to schedule efficiently, especially if you’re dealing with people’s unpredictable behavior—e.g. generating data based on a form submission. People don’t like waiting, even if it’s just for a few seconds (not even for pedestrian signals, myself included!)
An event-driven approach is also not restrictive in terms of spreading out the workload. Data gets processed as it arrives throughout the day. Of course, you could simulate similar behavior by simply scheduling a frequently recurring check (down to once a minute in “cron” or even shorter intervals with dedicated schedulers) but this can have the negative performance impact I mentioned earlier. Running jobs this frequently can also cause trouble with parallelism. The moment they take longer than a single minute or whatever interval you choose, you can end up with multiple processes fighting over locked files or simply overloading the system. Yes, there has been tremendous progress in dealing with mutexes (e.g. lockfile or flock), but I still don’t trust them too much. With event-driven systems, you can throttle the event responses into a queue and thus ensure a much more predictable operation.
So how does all this translate to CloverDX?
There are a couple of neat options in CloverDX which can be used to set up triggers that fire (start a job, send email, etc.) when something happens. (We’re talking about Server here, Designer doesn't have these functions.) It has never failed me in terms of its capabilities and it’s pretty straightforward to configure, especially with some of the latest improvements. If you have any experience with event triggers in CloverDX or in any other data integration tool you liked (or hated), feel free to share your thoughts in the comments section below.
There are two different ways of “listening to events” available in CloverDX Server Corporate and Cluster instances at the time of writing:
With these kind of “event listeners” you can subscribe to an event source, basically saying, “Let me know whenever something happens and I’ll react to it.” For those who are keen on programming patterns, this would be similar to an Observer.
Since some events don’t have a particular source that would actively broadcast to its subscribers, CloverDX has to proactively check some state (e.g. contents of a folder) and trigger the associated job once it sees some change.
Here’s a quick overview of all available listeners:
|Jobflow event listener||File event listener|
|Graph event listener||Universal event listener|
|JMS message listener|
|Task failure listener|
Even though they're all very different, they share a common feature: not only can they run CloverDX jobs (transformation graphs, jobflows and profiling jobs), but they also are capable of running external scripts, sending emails, aborting running job, posting JMS messages, or executing any Groovy code snippet without needing to open CloverDX Designer at all, making CloverDX Server a powerful orchestration and automation tool that extends beyond just ETL and data transformations.
There’s one more listener that I won’t cover in this post. It’s called Launch Services and basically allows you to publish data transformations (and jobflows) as an HTTP API endpoint. I hope to get into more detail of this exciting capability in one of my followups. The great news is that there is a plan improve the feature in a future version of CloverDX, hopefully 4.7. I can’t wait!
In this blog, we covered caveats of timetable scheduling and why I think they’ll fail in the long term, advantages of event-driven ETL, and which event triggers CloverDX Corporate and Cluster Server support. In the future, I’ll continue this theme by explaining what steps need to be taken to get your (existing) jobs and transformations event listener ready, and also talk about file event listeners more in depth. We’ll see if and how file event listeners save the day.