CloverDX Blog on Data Integration

Joining Data with RelationalJoin

Written by Martin Janik | February 11, 2010

CloverDX version 2.8 offered a brand new commercial component called RelationalJoin. It extends the CloverDX pallete of joiner components with new functionality - joining data with relational operators different from equal (=). This component has two input ports for master and slave data records, and a single output port for joined data records. Master and slave data records are joined and sent to the output port if they are in a specific relation to each other. In other words, it's just another joiner that uses a relational operator to drive the joining process.

Relations Between Data Records

The relation between two data records is specified by any of the following relational operators: !=, <, <=, >, >=. For example the < operator means that each master data record is less than all the slave data records that it was joined with. If you look at it from the slave's perspective, all slave data records joined with a certain master data records are greater in this case. It's valid in both ways. ;-)

In order to make the joining process as effective as possible, input data records coming through both input ports have to be sorted appropriately. Except for the != operator that doesn't require any sorting at all. However, if you choose the < or <= operator, you need to sort both streams of input data records in the descending order. In case of the > and >= operators, you need to make sure the data records are sorted in the ascending order. If the sort order is invalid, execution of the component fails.

When processing large data sets, be aware that the master data records are processed one by one while the slave data records need to be buffered. In the worst case, all the slave data records need to be buffered and thus their number should be as small as possible.

Practical Example

They say that an example is worth a thousand words so let's create a simple one. Imagine that you've got a set of several distinct numbers. If for some reason you need to pair each of them with all numbers from the same set that are greater, it is a pretty simple task for RelationalJoin! :-) See the example graph below.

 

We read the numbers from a flat file, sort them appropriately, send them as two independent data streams to RelationalJoin which produces the desired output, and finally write them to another flat file. Pretty simple, don't you think? ;-)

In order to configure RelationalJoin, we need to specify a transformation, join key and join relation. Setting the first two is simple, you have done that a hundred times. In case of join relation, it is simple as well, just select "master(D) < slave(D)" from the combo box. The letter D in the round brackets denotes that we need to sort both streams of data records in the descending order. Thus we need to configure ExtSort in this way.

That's all we had to do, just run the graph and see the desired results!