17 February 2017

R, Python, Scala, and Julia

Three languages have become critical parts of the data scientist's arsenal: R, Python, and Scala. A major ecosystem of accessible libraries to support statistical computing and machine learning is critical, especially at scale. Scala is still a stumbling block for data scientists, as the language can be quite complex. Often data scientists use R and Python without venturing beyond. However, there are significant gains to be made on computationally intensive and data-intensive workloads by using languages like Julia and Scala. That said, certain microbenchmarks call even Julia's performance into question, along with the state of the language itself.

If one is a graduate just starting out in data science, then Python is the best choice. For a research scholar, R, Python, Scala, and even Julia become the languages of choice. For an employee, the usual alternatives are again Python and R, and even Scala, especially with Spark. However, if one is willing to take the plunge, Julia is emerging as a useful contender for Big Data and is likely to play a stronger role in the future if the language takes shape within the open source community.

In general, if one needs to be flexible and work with data across a multitude of different algorithms, then the choice is often R. However, if that flexibility needs to extend to data structures and integration with external applications, then Python seems to be the better alternative, with the optimizations that can be gained from low-level C implementations. But to build massively scalable components using batch and streaming data pipelines, one can't beat the Big Data ecosystem around Java/Scala and Python.

Julia still has a long way to go in catching up to the likes of Python.
A few areas of Julia that still require improvement are performance, syntax, interoperability with other languages, text formatting, testing (current issues make it difficult to write robust code with defensive programming), and accessibility of the native API. It is also still a very research-led language, which limits how easily the larger open source community can contribute libraries and frameworks.