16 May 2019

Industrial Data Science

Over the years, Data Science has emerged to play a pivotal role across many industry sectors, providing avenues for analytical growth and insights that lead to more effective products and services. However, several glaring aspects of the field are riddled with misconceptions and ineffective practices. Traditional Data Science was about Data Warehouses, a relational way of thinking, Business Intelligence, and overfitted models. In the current landscape, however, Artificial Intelligence as a discipline encourages out-of-the-box thinking and is reshaping Data Science practice. In AI practice, the Data Engineering and Data Science functions tend to merge into one. Relational Algebra is replaced with semantics and context via Knowledge Graphs, which form the important metadata layer of a Linked Data Lake. While traditional Data Science relied entirely on statistical methods, the newer approaches combine Machine Learning with Knowledge Representation and Reasoning in a hybrid model, for better Transfer Learning and generalizability. Deep Learning, a purely statistical method and a sub-field of Neural Networks, is by its very nature implemented as a set of distributed, probabilistic graphical models.

It makes very little sense to split teams between Data Engineering and Data Science, because the person building the model also has to think about scalability and performance. Invariably, splitting teams means duplication of work, communication issues, and degradation of output in production when work is handed from Data Science to Data Engineering. In many AI domains there is an inclination towards open-box thinking about problems: in AI, only about 30% of the effort is Machine Learning, while the remaining 70% is Computer Science principles and theory. Evidence of this can be seen in the Norvig book, which is often the basis of many taught AI 101 courses.
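To make the idea of a Knowledge Graph as a semantic metadata layer concrete, here is a minimal sketch in plain Python. The entity and relation names (the column mapping, the "subclass_of" link, the lake zone) are illustrative assumptions, not taken from any particular system; a real deployment would use an RDF store and SPARQL rather than a set of tuples.

```python
# A toy knowledge graph: (subject, predicate, object) triples acting as
# the semantic metadata layer over a data lake. All names are hypothetical.
triples = {
    ("order_table.cust_id", "maps_to", "Customer"),   # schema-to-concept mapping
    ("Customer", "subclass_of", "Party"),             # domain semantics
    ("order_table", "stored_in", "data_lake/raw"),    # provenance / location
}

def describe(entity):
    """Return every fact about an entity - context that a bare
    relational schema cannot express."""
    return sorted((p, o) for s, p, o in triples if s == entity)

print(describe("Customer"))     # the concept's place in the domain model
print(describe("order_table"))  # where the physical data lives
```

The point of the sketch is that both business meaning and provenance live in one queryable structure, instead of being split between a schema and tribal knowledge.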
Often at universities, advanced courses neglect to cover the entire Data Science method, stressing only Machine Learning and statistical methods at the exploratory stage while forgetting the rest of the Computer Science concepts. As a result, we see many Data Scientists with PhDs who are ill-equipped to tackle practical business cases: productionizing their models against small and large datasets, performing appropriate Feature Engineering for semantics, and building the associated pipelines. Furthermore, many institutions skip Feature Engineering entirely, even though it is really 70% of the Data Science method and possibly the most important stage of the process. Invariably, this Feature Engineering step is partially transferred to the Data Engineering function. One has to wonder why the Data Scientist, even holding a PhD, does only 30% of the work of the Data Science method while passing the remainder of the hard work to the Data Engineer as part of the formal ETL process.

The whole point of a Knowledge Graph is to add the value of semantics and context to your data, moving it towards information and knowledge. This is important not only for Feature Engineering but also as a feedback mechanism, where one can cyclically improve the model's learning while allowing the model to improve the semantics in a semi-supervised manner. The Knowledge Graph also enables natural language queries, making the data available to the entire organization: there is no longer a need to hire specialists who understand SQL in order to produce Business Intelligence reports for the business. The whole point is to make data available and accessible to the entire organization, while increasing efficiencies and enabling a manageable way of attaining trust through centralized governance and provenance of the data.
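As a hedged sketch of what "Feature Engineering for semantics" can mean in practice, the snippet below enriches raw transaction rows with a category feature derived by walking a small knowledge graph. The items, relations, and prices are invented for illustration; the technique, deriving model features from graph structure rather than from the raw columns alone, is the point.

```python
# Hypothetical knowledge graph: each item links to a more general class.
kg = {
    "espresso":  {"is_a": "coffee"},
    "coffee":    {"is_a": "beverage"},
    "croissant": {"is_a": "pastry"},
    "pastry":    {"is_a": "food"},
}

def top_category(item):
    """Walk 'is_a' links to the most general class - a semantic feature
    that no amount of statistics on the raw rows would reveal."""
    while item in kg:
        item = kg[item]["is_a"]
    return item

# Raw rows as they might arrive from an operational system.
rows = [{"item": "espresso", "price": 3.0},
        {"item": "croissant", "price": 2.5}]

# Enriched rows, ready for a downstream model.
features = [{**r, "category": top_category(r["item"])} for r in rows]
print(features)  # espresso -> beverage, croissant -> food
```

The same graph can later absorb corrections from the model or from humans, which is the feedback loop described above.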
This lets the data adapt to the organization's needs, rather than forcing the organization to adjust its resources to the needs of working with the data. There needs to be a shift in how many organizations build Data Science teams, how the subject is taught at universities, and how organizations architect AI transformation solutions. Although Deep Learning is good at representation learning, it initially requires a large amount of training data. Where large amounts of training data are lacking, one can rely on semantic Knowledge Graphs, human input, and clustering techniques to get further with Data Science executions, which in the long term will be of far greater benefit to an organization. Many organizations ignore the value of metadata at the start, and as the data grows, so do the complexity and the many challenges of integration. Why must we always push for statistical methods alone, when much of the value can be attained directly through inference over semantic metadata, or through a combination of both approaches?

Probability is, by its nature, unintuitive for humans. When does the average person ever think in statistics while going about their daily life, travelling to work, buying groceries at a supermarket, or talking to a colleague on the phone? Hardly ever. And yet an average human is still smarter in many respects, across domains of understanding and of adaptability through transfer learning and semantic associations, than the most sophisticated Machine Learning algorithm, which can only be trained to be good at a particular task. However, when the human Data Scientist arrives at work, they reduce the scope of the business problem to mere statistically derived methods.
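To illustrate "direct value through inference over semantic metadata", here is a minimal sketch: a transitive subclass hierarchy answers a question by logical inference, with no training data and no statistics involved. The class names are illustrative assumptions only.

```python
# Hypothetical subclass assertions: (narrower, broader) pairs.
subclass = {
    ("Espresso", "Coffee"),
    ("Latte", "Coffee"),
    ("Coffee", "Beverage"),
}

def is_a(x, y):
    """Transitive 'subclass_of' inference: deduce facts that were
    never stated explicitly, without any training data."""
    if (x, y) in subclass:
        return True
    return any(is_a(mid, y) for (a, mid) in subclass if a == x)

print(is_a("Espresso", "Beverage"))  # True - inferred, never asserted
print(is_a("Beverage", "Espresso"))  # False - inference is directional
```

A hybrid system would use deductions like this alongside a learned model, rather than asking the model to rediscover the hierarchy from data.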
If AI is to move forward, we must think beyond purely statistical methods, reason through complex business cases with flexible semantics, and take more inspiration from the human mind, for all the things we already take for granted in our daily lives that machines still find significantly difficult to understand, adapt to, and learn.