Mabble Rabble: Data Science Methods

13 January 2019

Data Science Methods

Generalizable Method (your mileage may vary, given business case and time constraints):

Identify and understand the business case (as a use case or story) - in most cases you are not handed a ready-made use case or story, so this step is really about understanding the problem
Explore and prototype including background research (exploratory stage)
Identify cases for reuse
Identify whether this story even requires a model
Identify relevant datasets - curation
Visualize the data (how sparse/dense/dirty it is; multiple open source tools are available for refining features; identify any additional effort needed for the model build)
Identify the relevant variances and biases (will the modeling steps lead to overfitting or underfitting? The objective is to build a generalizable model)
Feature Selection/Extraction (may use other ML or natural computation techniques here)
Feature engineering (this may also include curation/enrichment of metadata)
Feature re-engineering (this may also include curation/enrichment of metadata)
Identify the simplest possible solution
Identify the reasoning for using a complex solution
Custom model to solve the business case (do not just copy a model out of a research paper - this is what the exploratory stage was for)
Evaluation and Benchmarking (formal tests may/may not be used, depends on business case)
How well does the model cope with small data and large data - identify sufficiency at average and worst-case time/cost
Re-Tune/Rinse/Repeat
Incrementally improve the model
Incrementally optimize/scale the model (scale only when necessary)
One simple, one complex - one that is sub-par, and one that is riskier
Evaluation and Documentation
Pipeline the Solution in Dev-mode (identify bottlenecks with the model - dry run/end-to-end for integration - at this stage a repeatable build/test/deploy/evaluate cycle may be used - DS/DE)
A/B/N/Bandit Testing in Stage (generally this stage is covered by the product team, alongside automated acceptance tests, if they know the techniques; otherwise DS/DE may be involved)
Release/Integrate for Production (depends whether this is a B2B or B2C case, or beta mode)
Storytelling (how well does the model answer/solve the question or problem statement - 'through the looking glass' - refers to both dev and stage/prod cases)
User/Stakeholder/Client Feedback (Rinse & Repeat, depending on B2B or B2C cases)
Incremental Analysis and Review of Models
Rinse & Repeat (some of the steps above repeated multiple times before production release)
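
The over/underfitting check and the "one simple, one complex" comparison from the steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic (a noisy straight line), and the three candidate models and all function names (make_data, fit_mean, fit_line, fit_1nn, compare) are hypothetical stand-ins, not a prescribed method. The point is the pattern: score every candidate on both training and held-out data and look at the gap.

```python
import random

def make_data(n, rng):
    """Synthetic data, y = 2x + noise. Purely illustrative, not real data."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0, 2.0)) for x in xs]

def mse(predict, data):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

def fit_mean(train):
    """Underfitting candidate: always predict the training mean."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Simple candidate: ordinary least squares on a single feature."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def fit_1nn(train):
    """Overfitting candidate: memorize the training set (1-nearest neighbor)."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def compare(seed=0):
    """Return {model name: (training MSE, validation MSE)}."""
    rng = random.Random(seed)
    train, valid = make_data(200, rng), make_data(200, rng)
    report = {}
    for name, fit in [("mean", fit_mean), ("line", fit_line), ("1nn", fit_1nn)]:
        model = fit(train)
        report[name] = (mse(model, train), mse(model, valid))
    return report

if __name__ == "__main__":
    for name, (train_err, valid_err) in compare().items():
        print(f"{name:>4}: train MSE={train_err:.2f}  validation MSE={valid_err:.2f}")
```

Reading the output: the mean predictor has high error on both sets (underfitting), the 1-nearest-neighbor model has zero training error but a clearly worse validation error (overfitting), and the plain line typically sits closest to the noise floor on both. A large gap between training and validation error is the signal to revisit features or model complexity before scaling anything.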

Process Flow:

R&D → Dev/UI/UX → Prod

Generally, with a heavy R&D/backend-focused team, the features and functionality tend to be dictated by the forward flow (a bottom-up approach); most AI projects at startups tend to be built that way. The frontend then becomes a thin client, a view to the world that assimilates the backend efforts. The typical pattern is an informational dashboard for storytelling - 'through the looking glass'. This is because, in a top-down approach, many of the backend efforts would get lost in translation (although in some business cases top-down may work better).

Data → Information → Knowledge

State-of-the-art in the literature may not be state-of-the-art for your business case, and may in fact lead to sub-optimal results and more effort. It is all very subjective: it depends on the data, the associated features for training a model, and the business case you are trying to solve. Work towards the least effort that gives a mostly efficient or sensible outcome.
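
One way to make "least effort, sensible outcome" concrete is a selection rule: among the candidates whose validation error is within some tolerance of the best, take the cheapest. The sketch below is a hypothetical illustration, not a standard procedure. The Candidate fields, the pick_least_effort name, the 5% tolerance, and the numbers in the usage example are all made-up placeholders; the right tolerance and cost measure depend on the business case.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    name: str
    error: float  # validation error, lower is better
    cost: float   # effort/compute in arbitrary (hypothetical) units, lower is better

def pick_least_effort(candidates: List[Candidate], tolerance: float = 0.05) -> Candidate:
    """Pick the cheapest candidate whose error is within `tolerance`
    (relative) of the best error seen across all candidates."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best_error = min(c.error for c in candidates)
    good_enough = [c for c in candidates if c.error <= best_error * (1 + tolerance)]
    return min(good_enough, key=lambda c: c.cost)

if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measured results.
    candidates = [
        Candidate("simple baseline", error=0.208, cost=1),
        Candidate("tuned model", error=0.205, cost=5),
        Candidate("paper model", error=0.200, cost=50),
    ]
    print(pick_least_effort(candidates).name)
```

With these placeholder values the baseline is within 5% of the best error, so it wins on cost. Setting the tolerance to zero would select the most accurate candidate regardless of effort, which is the trade-off the paragraph above warns about.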