PowerUp
Talend Cache components

Talend component : tCache



Background : Talend Open Studio can read from and write to many different sources, so it is generally easy for a good data integration architect to design solutions that "cache" recordsets in database tables, temporary files, etc.
However, this adds extra steps and sometimes even extra architecture components, such as a database, adding complexity to the job itself.
Often it would be handy to have a simple in-memory buffer to store temporary information, perhaps populated incrementally over a number of iterations. It would also be nice to be able to dump this buffer to disk and reload it whenever needed.

The solution

A pair of components that handle cache management tasks, both in memory and on disk.
They are easy to use in any Talend Data Integration job, allowing temporary or persistent data storage. Cache files and memory buffers can be loaded incrementally using loops.
It is also possible to "init" a cache from a previously stored file and incrementally append new records.
Storing data in memory can quickly use up the Java heap, also because this release of the routines is not yet optimized. There is currently no data compression involved; this could be included in a future release. I successfully managed to load a 6 million record table (4 fields) in memory, but you should account for concurrent processes that may also make heavy use of the heap.
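
As a rough mental model only (not the actual implementation of the components), the memory cache can be pictured as a list of rows kept in the job's globalMap under the configured buffer name. The tJavaRow-style sketch below uses a hypothetical buffer name "cache" and hypothetical input columns (row1.TITLE, row1.LINK) purely for illustration.

// Illustrative sketch only: one way a named in-memory buffer could be kept in globalMap.
// The buffer name "cache" and the row1 columns are hypothetical examples.
java.util.List<Object[]> buf = (java.util.List<Object[]>) globalMap.get("cache");
if (buf == null) {
    buf = new java.util.ArrayList<Object[]>();   // first iteration: create the buffer
    globalMap.put("cache", buf);
}
buf.add(new Object[] { row1.TITLE, row1.LINK }); // append the current row to the buffer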

Current version

tCacheOutput
Version : 0.4
Release Date : Jan 31 2012
Status : Beta

tCacheInput
Version : 0.4
Release Date : Jan 31 2012
Status : Beta
Note (0.3) : Improved memory cleanup on exit if the "remove buffer after read?" option is set
Note (0.4) : the global buffer name variable is now dynamic, meaning you can set it with an expression such as "buffer_"+context.bufname. For this reason, when using a constant string, it NOW has to be enclosed in double quotes ("").
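
For example, both of the following are now valid values for the global buffer name field (context.bufname is a hypothetical context variable, shown only to illustrate the dynamic form):

"cache"                        // constant buffer name, note the enclosing double quotes
"buffer_" + context.bufname    // buffer name built dynamically at runtime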

Example

The job tCacheDemo demonstrates the usage of the two components.


The Cache Load Loop subJob loads the cache buffer in memory and generates a set of cache files.
To demonstrate the incremental load capabilities, a loop is used to process a data source multiple times (in this example an RSS feed; any recordset would work).
tLoop_1 This step simply performs 100 iterations.

tRSSInput_1 A sample data source returning 10 records. In the demo job it is configured to read an XML file, "news.rss", a local copy of a Google News RSS feed.

tCacheOutput_1 This component is set to cache data both on disk and in memory. The global buffer name is arbitrary; here it was set to "cache".
The Cache file name is set to context.baseDir+"/cache"+((Integer)globalMap.get("tLoop_1_CURRENT_VALUE"))+".dat" to demonstrate the ability to change the file name at each iteration. Finally, the append to file option is unchecked so that the cache files are reset at each run. Feel free to play around with these settings.
This component is a "DATA_AUTOPROPAGATE" component, meaning that the flow arriving on its input is also available on its output, allowing it to be used as a "middle" step in a flow.
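
To make the file name expression concrete, the sketch below shows how it resolves to a different file for every loop value (the resulting paths are illustrative and depend on your baseDir):

// How the Cache file name expression evaluates per iteration (illustrative)
String fileName = context.baseDir + "/cache"
    + ((Integer) globalMap.get("tLoop_1_CURRENT_VALUE")) + ".dat";
// iteration 1   -> <baseDir>/cache1.dat
// iteration 2   -> <baseDir>/cache2.dat
// ...
// iteration 100 -> <baseDir>/cache100.dat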

The "Record count" subJob simply uses a tJava component to output the two record counters available in the tCacheOutput component, one being the number of records processed in the current iteration and the other being the number of records stored in the memory cache.

tJava_1 Simply outputs a string to the console; the code is:
System.out.println("Iteration records : "+((Integer)globalMap.get("tCacheOutput_1_NB_LINE"))+" Record in memory cache : "+((Integer)globalMap.get("tCacheOutput_1_NB_CACHE_LINE")) );
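
Assuming the RSS feed returns its 10 records on every pass, the console line printed after, say, the third iteration would look roughly like:

Iteration records : 10 Record in memory cache : 30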

Finally, the Cache read subJob is activated once all the iterations have terminated (via a "subjob OK" trigger link originating from tLoop_1) and sends its output to the tLogRow component. In a real-world application this would be connected to a destination table, a tBufferOutput or another flow consumer.

tCacheInput_1 By setting this component to read from the memory "cache" buffer, all the records stored during the 100 iterations are returned.
An optional check box removes the buffer from memory once it has been read. You can decide to leave the data in memory if you need to process it again (possibly with another tCacheInput component).
Alternatively, the cache source can be set to one of the generated cache files, as sketched below.
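
For instance, reusing the naming scheme from tCacheOutput_1, the cache file setting could be given a value such as the one below (the exact option label in tCacheInput may differ; this is only meant to show the kind of expression involved):

// Reading back one of the generated cache files instead of the memory buffer
context.baseDir + "/cache1.dat"   // illustrative value for the cache file setting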





Downloads
  • Download tCacheInput
  • Download tCacheOutput
  • Download sample Job. Remember to configure the baseDir context variable; this will be the location for the cache files and the sample input data
  • Download sample data

  • License

    THIS SOFTWARE IS PROVIDED BY POWERUP ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL POWERUP BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.