13 September 2016

MapR vs Cloudera vs Hortonworks

Distributions Compared

Pivotal HD

Four factors for comparing the top Hadoop distributions

Certifications Compared

MapR offers more accessible free courseware and a less complex learning pathway, although its platform carries more customizations. Cloudera's pathways are more rigorous and more expensive, but its certifications are recognized as a mark of pedigree in the big data space. Cloudera also customizes its commercial product offerings significantly, which tends to mean a more stable platform. Hortonworks provides flexibility across the developer, administrator, and data analyst tracks. Its stack is mostly open source, which can also mean a less stable product offering. Hortonworks also provides full self-paced training, but at a premium price, since the material from its essential courses may not be sufficient for certification study.

If one wants to focus on open source, choose the Hortonworks pathway. If one wants more rigor and a data scientist pathway, choose Cloudera and its CCP exam. MapR offers a developer pathway somewhere in between that is also easier on the pocket. Ultimately, though, the employer dictates the appropriate certification, based on the Hadoop distribution the workplace uses and supports. In the end, the choice comes down to requirements and to the value one places on attaining certifications.

Quick Vocabulary Lesson

Kafka (publish/subscribe messaging system)
Mahout (machine learning)
Hive (map data to structures and use SQL-like queries)
Pig (data transformation language for big data)
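ZooKeeper (coordination service for distributed applications: configuration, naming, synchronization)
Sqoop (bulk data transfer between external stores, such as relational databases, and Hadoop)
Storm (real-time stream processing)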
Oozie (workflow scheduler)
Avro (data serialization with schemas defined in JSON and a compact binary format)
Flume (collect, aggregate, and ingest streaming log and event data)
Nutch (web crawler)
Ambari (provisioning, managing, and monitoring Hadoop)
Chukwa (data collection for monitoring large distributed systems)
Tez (data-flow framework)
Hama (bulk synchronous parallel computing framework for big data analytics)

Wide-column (HBase, Cassandra)
Key-value (Riak, Redis)
Document (MongoDB, CouchDB)
Graph (Neo4j, Titan)
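
The four store categories above (column-oriented, key-value, document, graph) differ mainly in how a record is shaped and looked up. A minimal sketch of those shapes follows, using plain in-memory Python structures. The keys, field names, and the `neighbours` helper are all hypothetical and chosen for illustration; none of this is the API of HBase, Redis, MongoDB, Neo4j, or any other product named here.

```python
# Toy, in-memory stand-ins for the four NoSQL data models.
# Assumption: all names and values below are invented for illustration only.

# Key-value: an opaque value retrieved by a single key.
key_value = {"session:42": "alice"}

# Document: a self-describing, nested record retrieved by id.
document = {"user:1": {"name": "Alice", "skills": ["Hive", "Pig"]}}

# Column-oriented (wide-column): row key -> column family -> column -> value.
wide_column = {
    "row1": {
        "profile": {"name": "Alice"},
        "stats": {"logins": 3},
    }
}

# Graph: a set of nodes plus labelled, directed edges between them.
graph = {
    "nodes": {"alice", "bob"},
    "edges": [("alice", "KNOWS", "bob")],
}


def neighbours(g, node, label):
    """Return the nodes reachable from `node` over edges with `label`."""
    return [dst for src, lbl, dst in g["edges"] if src == node and lbl == label]


if __name__ == "__main__":
    print(key_value["session:42"])
    print(document["user:1"]["skills"])
    print(wide_column["row1"]["stats"]["logins"])
    print(neighbours(graph, "alice", "KNOWS"))
```

The point of the sketch is only the access pattern: one key for key-value, an id plus nested fields for documents, row plus column family plus column for column-oriented stores, and edge traversal for graphs.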