30 August 2014

When Not To Use Hadoop

Hadoop has become a default choice for almost all analytical applications with huge data processing requirements. It offers open source flexibility as well as a range of subprojects for ingesting, processing, and moving inputs and outputs downstream. However, Hadoop is not appropriate for every business application. A first litmus test when deciding whether to use Hadoop should be to answer a few specific questions about loading and processing the data. If the data can be loaded into a standard database without much trouble, then Hadoop is surely not the way to go. Is a dataset of a few hundred MB a business case for Hadoop? What about a few hundred GB? Hadoop is also not a replacement for standard databases.

In general, Hadoop has problems dealing with small files, so a large number of small files will be processed far less efficiently than a smaller number of large files. This is primarily because the platform is built around a MapReduce approach with HDFS as the underlying layer: a design that handles data processing at a scale standard approaches simply cannot, albeit at a cost. A common workaround is to pack small files into a larger container format before processing. Similarly, working with XML/RDF types of data poses problems and requires pre-processing to deserialize the data into other processing formats such as SequenceFiles, Avro, Protocol Buffers, or Thrift. Both workarounds are sketched at the end of this post.

Hadoop is also not appropriate for direct real-time processing needs, although stream processing options have become available. It is most appropriate as a flexible data warehouse where largely static data is stored for analysis, rather than as a home for rapidly changing datasets. It is useful for merging and unlocking large amounts of corporate and even web data from various data sources, providing analytical processing for useful insights and filtering results out to other systems.

Hadoop in the cloud can save much operational headache. However, it still requires a careful strategy for managing an appropriately sized cluster and for capacity planning around NameNodes. Otherwise, cloud costs can get out of hand very quickly, given the high computational requirements of Big Data.

The term Big Data also needs some clarity. Datasets in the order of terabytes and petabytes at web scale are aptly classed as Big Data: not only is one working with unstructured data, but the data is so large that it cannot sensibly fit into a standard data architecture for continuous processing. Here Hadoop can work wonderfully with HBase as a storage layer for the unstructured data, filtering more structured data downstream to other, more appropriate systems (a small example appears below). Increasingly, NoSQL stores have also started to provide their own equivalents of MapReduce. MongoDB, for example, offers MapReduce functionality and, among its varied use cases, is widely used for real-time advertising (also sketched below). MapReduce on MongoDB, however, is in no way comparable to the level of processing that can be done on Hadoop at scale.

One obviously needs to understand, firstly, the data, and secondly, what one plans to do with it. A few rough sketches below illustrate the points above, and the links that follow provide further views on why Hadoop may not be the right approach for solving a particular business problem.
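As a rough sketch of the small-files workaround, the following Java snippet packs a directory of small files into a single SequenceFile keyed by file name, so that HDFS and MapReduce see one large file instead of many small ones. It assumes a Hadoop 2.x client on the classpath; the class name SmallFilePacker and its command-line arguments are purely illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs a directory of small files into one SequenceFile keyed by file name.
public class SmallFilePacker {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path(args[0]);    // directory of small files
        Path packedFile = new Path(args[1]);  // target SequenceFile

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packedFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) {
                    continue;
                }
                // Small files by definition fit comfortably in memory.
                byte[] contents = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    in.readFully(contents);
                }
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(contents));
            }
        }
    }
}

Each small file becomes one key/value record in the container, which a MapReduce job can then stream through without per-file scheduling overhead.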
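For the XML pre-processing case, a typical pattern is to parse the documents and write the extracted fields into an Avro container file that downstream jobs can split and read efficiently. The sketch below assumes the Avro Java library; the Product schema, its fields, and the hard-coded record (standing in for output from a real XML parser such as StAX) are hypothetical.

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Writes records extracted from XML into an Avro container file.
public class XmlToAvro {

    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Product\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"string\"},"
      + "{\"name\":\"price\",\"type\":\"double\"}]}";

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("products.avro"));

            // In practice these values would come from the XML parsing step;
            // a single hard-coded record stands in for that here.
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", "sku-123");
            record.put("price", 9.99);
            writer.append(record);
        }
    }
}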
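Using HBase as the storage layer for raw, unstructured documents can be as simple as writing each incoming document into a table under a single column family, leaving downstream jobs to filter out more structured data. This sketch assumes an HBase client of that era (0.94/0.96); the table name raw_events, the column family doc, and the row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Stores one raw document per row in an HBase table.
public class RawEventStore {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "raw_events"); // table with a 'doc' column family

        Put put = new Put(Bytes.toBytes("event-0001"));
        put.add(Bytes.toBytes("doc"), Bytes.toBytes("body"),
                Bytes.toBytes("<event><type>click</type></event>"));
        table.put(put);

        table.close();
    }
}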
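For comparison, MongoDB's MapReduce takes JavaScript map and reduce functions and runs them inside the database. The sketch below, counting clicks per campaign, uses the legacy MongoDB Java driver of that era; the ads database, clicks collection, and campaignId field are hypothetical.

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MapReduceCommand;
import com.mongodb.MapReduceOutput;
import com.mongodb.MongoClient;

// Counts clicks per campaign with MongoDB's built-in MapReduce.
public class AdClickCounts {

    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DB db = client.getDB("ads");
        DBCollection clicks = db.getCollection("clicks");

        // Map and reduce are JavaScript functions executed inside mongod.
        String map = "function() { emit(this.campaignId, 1); }";
        String reduce = "function(key, values) { return Array.sum(values); }";

        MapReduceOutput out = clicks.mapReduce(
                map, reduce, null, MapReduceCommand.OutputType.INLINE,
                new BasicDBObject());

        for (DBObject result : out.results()) {
            System.out.println(result);
        }
        client.close();
    }
}

This works for modest aggregations over data already sitting in MongoDB, but as noted above it does not approach the scale of processing possible on a Hadoop cluster.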