CloverDX Blog on Data Integration

4 Tips for Integrating CloverDX to Third Party Applications

Written by Martin Zatopek | August 06, 2010

I am sure you are often finding yourself in your work in a situation when you have  data so-called „in your hands“ and you would like to transform them to a proper form, clean up them or load them to a data destination (e.g. a database). But you do not want to reinvent the wheel again and again. It is the time when you start searching a suitable technology which can help you to avoid developing what has been already developed, tuned and is  stable. Most of the ETL tools often satisfy all these requirements and CloverETL is not an exception. In addition, CloverETL Engine provides a few tricky approaches how to process your data a little bit unconventional.

  1. Transformation graph does not have to be statically defined in a standard XML file with an .grf extension – if you need to generate a graph based on dynamic user settings, it is nothing easier than to create your transformation dynamically (details are described at http://wiki.cloveretl.com/doku.php?id=embedding_clover) in your Java application. If you have your graph already created (you have its instance), no matter if it was created as a result of an .grf file or you created it manually, then you can easily reuse it – run the graph repeatedly without a need to rebuild it again and again. This ability is very useful for transformations that run really short and need to be triggered a couple of times per second (for instance as a reaction on a user 'click' in your massive multi-user application).
  2. Next interesting extension of your development tools can be creating your own shell stream filter – a simple application which transforms standard input (stdin) to standard output (stdout) (for example sort, cat) – everything based on CloverETL. You can prepare a whole set of specialized command line applications which can be exploited in your complex shell scripts. It will be enough if your graph contains just one UniversalDataReader and just one UniversalDataWriter with a fileURL parameter set to the dash character '-' which stands for standard input (stdin) or standard output (stdout) stream. I have prepared a simple example that filters out undesirable lines from standard input. A pretty straightforward usage:cat employees.txt | clover -D:metadata=“employees.fmt“ -D:filter=“$salary>10000“ filter.grf > beFired.txt
  3. In the first paragraph, I tackled a possibility to build transformation graphs directly from your Java code. In this approach, you will encounter nature limitation of most of ETL tools very soon – narrow set of data types processed by public set of components. Typically you have java.io.InputStream with the data ready for further processing. If we decide to use classical Clover instruments, we would have to store the data into a temporary file first and prepare a graph which reads the temporary file. We would have a similar issue with handling output data, a temporary file would have to be used again. The issue is obvious – how to avoid needless temporary files? CloverETL engine has a prepared solution for this purpose called Dictionary. It is a sort of shared memory which can be concurrently accessed by all components all the time a graph is running. The dictionary is typically used for passing a data object as an input into the graph and retrieving another data object as an output from the graph. We can look at Dictionary as a classical Map<key, value>. Key is a string identifier of a value which can take a whole set of different dictionary types – integer, boolean, decimal, … and last but not least ReadableChannel and WritableChannel. The two last mentioned dictionary entry types can be exploited for passing and retrieving data to and from a graph. See the following piece of code how dictionary can be populated://prepare channel with the data for ETL processing
    InputStream inputData = getInputDataStream();
    //prepare channel where the resuled data will be formatted
    OutputStream outputData = prepareOutputDataStream();

    //create graph instance based on grf file and initialize it
    TransformationGraph graph = TransformationGraphXMLReaderWriter.read(File);
    EngineInitializer.initGraph(graph);

    //initialize graph dictionary
    //our input channel will be registered under „inputStream“ key
    graph.getDictionary().setValue(„inputStream“, „ReadableChannel“, inputData);

    //our output channel will be registered under „outputStream“ key
    graph.getDictionary().setValue(„outputStream“, „WritableChannel“, outputData);

    //execute graph – output data will be pushed to output stream during graph run
    runGraph.executeGraph(graph);

    Now you probably ask yourself how the graph knows that input data are ready in dictionary under „inputStream“ key and on the other hand how it knows where to write the result output data. The answer is simple – fileURL attribute of UniversalDataReader/Writer has a specialized syntax for dictionary entries. Reader can have fileURL set to "dict:inputStream". In case of Writer we need to setup fileURL attribute to "dict:outputStream". That is all – the CloverETL engine takes care of data transmission between your data streams and the CloverETL graph automatically. Data prepared in input stream will be parsed by a dedicated data reader and will be passed as Clover data records for further processing to next components down the graph. And incoming data to a data writer will be formatted into your channel prepared in dictionary under „outputStream“ key.

    As it was already mentioned,  Dictionary can handle various data types. Beside already described data streams, it is possible to store all basic Clover data types – string, integer, long, number (Java equivalent double), decimal (Java equivalent BigDecimal), byte, and boolean. So Dictionary can be used for passing input values to a graph or also for inter-component communication – the first component writes some semi-result into Dictionary and the second component can pick up this value for further processing.

    Probably the most advanced way how to exploit Dictionary is possibility to define your own proprietary dictionary data types. Similarly to components, connections, CTL functions and so on, the dictionary entry types are also fully pluginable. So you can easily introduce your own type that corresponds to your needs. For example, you can extend CTL by your own function that allows you to access this data value from Clover and converts it to a CloverETL data record – the basic data element processed by CloverETL engine. It is certainly possible to create a new set of components that understand your specific data format. In the scope of  component run your custom data format can be retrieved from Dictionary, transformed into a standard CloverETL data record and passed to an output port for following processing. We have been using this approach successfully in several projects where the data format was totally incompatible with CloverETL records.

I hope this little bit technical insight into CloverETL engine inspires you for its usage in situations that seemed inappropriate till now.