13 January 2019

Data Science Methods

Generalizable Method (your mileage may vary, given business case and time constraints):
  • Identify and understand the business case (as a use case or story). In most cases you are not handed a translated use case or story, so it is really about understanding the problem
  • Explore and prototype including background research (exploratory stage)
  • Identify cases for reuse
  • Identify whether this story even requires a model
  • Identify relevant datasets - curation
  • Visualize the data (how sparse/dense/dirty it is; multiple open-source tools are available for feature-refinement steps; identify any additional effort necessary for the model build)
  • Identify the relevant variance and bias (will the modelling steps lead to overfitting or underfitting? The objective is to build a generalizable model)
  • Feature Selection/Extraction (may use other ML or natural computation techniques here)
  • Feature engineering (this may also include curation/enrichment of metadata)
  • Feature re-engineering (this may also include curation/enrichment of metadata)
  • Identify the simplest possible solution
  • Identify the reasoning for using a more complex solution
  • Custom model to solve the business case (do not just copy a model out of a research paper; that is what the exploratory stage was for)
  • Evaluation and Benchmarking (formal tests may or may not be used, depending on the business case)
  • How well does the model scale against small data and large data? Identify sufficiency at average and worst-case time/cost
  • Re-Tune/Rinse/Repeat
  • Incrementally improve the model
  • Incrementally optimize/scale the model (scale only when necessary)
  • One simple, one complex: one that is sub-par, and one that is riskier
  • Evaluation and Documentation
  • Pipeline the Solution in Dev-mode (identify bottlenecks with the model; dry run/end-to-end for integration; at this stage a repeatable build/test/deploy/evaluate cycle may be used - DS/DE)
  • A/B/N/Bandit Testing in Stage (generally this stage is covered by the product team, alongside automated acceptance tests, if they know the techniques; otherwise DS/DE may be involved)
  • Release/Integrate for Production (depends whether this is a B2B or B2C case, or beta mode)
  • Storytelling (how well does the model answer/solve the question or problem statement - 'through the looking glass' - refers to both dev and stage/prod cases)
  • User/Stakeholder/Client Feedback (Rinse & Repeat, depending on B2B or B2C cases)
  • Incremental Analysis and Review of Models
  • Rinse & Repeat (some of the steps above repeated multiple times before production release)
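
As a concrete illustration of the "visualize the data" step, here is a minimal first-pass missingness check; the field names (age, income, clicks) and the records themselves are hypothetical, stand-ins for whatever raw data the business case provides:

```python
import random

random.seed(1)
# Hypothetical raw records; None marks a missing value.
rows = [{"age": random.choice([25, 31, None, 47]),
         "income": random.choice([30000, None, 52000]),
         "clicks": random.randrange(10)} for _ in range(200)]

def missingness(records):
    """Fraction of missing values per field: a cheap first look at how
    sparse/dirty a dataset is before committing to a model build."""
    counts = {}
    for row in records:
        for key, val in row.items():
            miss, total = counts.get(key, (0, 0))
            counts[key] = (miss + (val is None), total + 1)
    return {k: miss / total for k, (miss, total) in counts.items()}

print(missingness(rows))
```

A per-field missingness table like this often decides the curation effort estimate before any plotting tool is opened.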
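
The variance/bias step can be sketched with a toy experiment: fit polynomials of increasing degree to synthetic quadratic data and compare train versus validation error. The dataset, split sizes, and degrees here are illustrative assumptions, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D regression data: a noisy quadratic.
x = rng.uniform(-3, 3, 60)
y = x**2 + rng.normal(0, 1.0, x.size)

# Hold out a validation split so train and validation error can diverge.
idx = rng.permutation(x.size)
tr, va = idx[:40], idx[40:]

def mse(degree):
    """Fit a polynomial of the given degree on the train split and
    return (train_mse, validation_mse)."""
    coeffs = np.polyfit(x[tr], y[tr], degree)
    pred = np.polyval(coeffs, x)
    return (np.mean((pred[tr] - y[tr]) ** 2),
            np.mean((pred[va] - y[va]) ** 2))

for d in (1, 2, 12):
    tr_err, va_err = mse(d)
    print(f"degree={d:2d}  train={tr_err:7.3f}  val={va_err:7.3f}")
```

Underfitting shows as high error on both splits (degree 1), a generalizable model keeps the two close (degree 2), and overfitting shows as low train error with a widening validation gap (degree 12).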
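
For the A/B testing stage, a minimal sketch of a two-proportion z-test on hypothetical conversion counts (the counts and sample sizes are made up for illustration; bandit variants would replace the fixed split with adaptive allocation):

```python
from math import erf, sqrt

def ab_z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for an A/B experiment.
    Returns (z, two_sided_p) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p

# Hypothetical experiment: variant B converts 165/2400 vs A's 120/2400.
z, p = ab_z_test(120, 2400, 165, 2400)
print(f"z={z:.2f}, p={p:.4f}")
```

Whether the product team or DS/DE runs it, the decision rule is the same: ship the variant only if the effect clears a pre-agreed significance threshold.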

Process Flow:

R&D → Dev/UI/UX → Prod

Generally, with a heavy R&D/backend-focused team, the features and functionality tend to be dictated by the forward flow (a bottom-up approach); most AI projects at startups tend to be built that way. The frontend then becomes a thin client, a view to the world for assimilating the backend efforts; the typical pattern is an informational dashboard for storytelling: 'through the looking glass'. This is because, in a top-down approach, many of the backend efforts would get lost in translation (though in some business cases top-down may work better).

Data → Information → Knowledge

State-of-the-art in the literature may not imply state-of-the-art for your business case, and may in fact lead to sub-optimal results and more effort. It is all very subjective: it depends on the data, the associated features for training a model, and the business case you are trying to solve. Work towards the least effort that yields an efficient or sensible outcome.