30 October 2014

Cascading

Cascading is an alternative to Hive and Pig for developers, in which big data is processed through workflow streams of map, filter, and reduce steps. The work stays within the ETL process and is modeled as a directed acyclic graph in which data flows directly from sources to sinks. The approach also adds an abstraction over the explicit programmatic complexity of the underlying MapReduce job implementations. Cascading depends on the Hadoop layer but also provides connectivity to a multitude of data sources. Hadoop itself can run either in standalone mode or in a clustered environment.

A developer can then work through an entire process stream in a single, integrated workflow, which makes ETL abstractions for business domains very plausible and reduces the complexity of handling large amounts of data. The process streams can even be rendered as visual representations. Wrappers built on Cascading are also available, including Scalding, Cascalog, PyCascading, and others. The platform is a nice alternative for developers who want to think through problems in terms of business domain abstractions, covering entire feature and story cases of complex data processing with the test-driven and behavior-driven approaches used by agile teams.
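To make the source-to-sink idea concrete, here is a minimal sketch in plain Python of a stream that passes through map, filter, and reduce steps. It does not use the Cascading API, and it runs in memory rather than on Hadoop. The sample input and the function names (`source`, `split_words`, `drop_short`, `count_by_word`, `run_flow`) are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory input; a real flow would read from HDFS or another store.
SOURCE_LINES = [
    "big data needs big tools",
    "",
    "data flows from source to sink",
]


def source(lines):
    """Source: emit raw records one at a time."""
    yield from lines


def split_words(records):
    """Map step: turn each line into zero or more lowercase words."""
    for line in records:
        for word in line.lower().split():
            yield word


def drop_short(words, min_len=4):
    """Filter step: keep only words with at least min_len characters."""
    return (w for w in words if len(w) >= min_len)


def count_by_word(words):
    """Reduce step: group identical words and count each group."""
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return dict(counts)


def run_flow(lines):
    """Wire the steps into one stream; the returned dict stands in for the sink."""
    return count_by_word(drop_short(split_words(source(lines))))


result = run_flow(SOURCE_LINES)
print(result)
```

Because each step is a small function over a stream, every step can be unit tested on its own and the whole flow can be tested end to end, which is the property that suits the test-driven and behavior-driven style described above.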