17 January 2019

NLP Games with Purpose

Games with a purpose (GWAP) turn NLP annotation tasks into games, which makes the process more enjoyable for the oracle (annotator); they are often run in a crowdsourced manner. A few examples are listed below:

  • Phrase Detectives
  • Sentiment Quiz
  • Guess What
  • ESP Game

Active Learning

Active Learning Approaches:
  • Pool-based Sampling
  • Membership Query Synthesis
  • Stream-based Selective Sampling
  • Uncertainty Sampling
  • Query-by-Committee
  • Expected Model Change
  • Expected Error Reduction
  • Variance Reduction
  • Density-Weighted Methods
  • Query from Diverse Subspaces
  • Exponentiated Gradient Exploration
  • Balance Exploration and Exploitation
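
As a rough illustration of how pool-based sampling and uncertainty sampling fit together, the sketch below ranks an unlabeled pool by the entropy of the model's predicted class probabilities and picks the most uncertain items for the annotator. The probabilities and the helper names (`entropy`, `select_queries`) are made up for the example and do not come from any particular library.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(pool_probs, batch_size):
    """Return the indices of the `batch_size` most uncertain pool items."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:batch_size]

# Hypothetical model outputs for four unlabeled sentences (two classes each).
pool_probs = [
    [0.98, 0.02],  # model is confident
    [0.55, 0.45],  # model is unsure
    [0.80, 0.20],
    [0.50, 0.50],  # maximally uncertain
]

# Items 3 and 1 have the highest entropy, so they go to the annotator first.
to_annotate = select_queries(pool_probs, batch_size=2)
print(to_annotate)
```

In a real loop the selected items would be labeled, added to the training set, and the model retrained before the next round of selection.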

The Language Grid

The Language Grid

13 January 2019

Data Science Methods

Generalizable Method (your mileage may vary, given business case and time constraints):
  • Identify and understand the business case (user story) - in most cases you are not given a user story, so it is really about understanding the problem
  • Explore and prototype including background research (exploratory stage)
  • Identify cases for reuse
  • Identify whether this story even requires a model
  • Identify relevant datasets - curation
  • Visualize the data (how sparse/dense/dirty it is, multiple open source tools available for refinement steps for features, identify additional effort necessary for model build)
  • Identify the relevant variances and biases (will the modelling steps lead to overfitting or underfitting? The objective is to build a generalizable model)
  • Feature Selection/Extraction (may use other ML or natural computation techniques here)
  • Feature engineering (this may also include curation/enrichment of metadata)
  • Feature re-engineering (this may also include curation/enrichment of metadata)
  • Identify the simplest solution that is possible
  • Identify the reasoning of using a complex solution
  • Custom model to solve the business case (do not just copy model out of a research paper - this is what the exploratory stage was for)
  • Evaluation and Benchmarking (formal tests may/may not be used, depends on business case)
  • How well does the model cope with small data and large data - identify sufficiency at average and worst-case time/cost
  • Re-Tune/Rinse/Repeat
  • Incrementally improve the model
  • Incrementally optimize/scale the model (scale only when necessary)
  • Keep one simple and one complex candidate - the simple one may be sub-par, the complex one is riskier
  • Evaluation and Documentation
  • Pipeline the Solution in Dev-mode (Identify bottlenecks with the model - dry run/end-2-end for integration - at this stage a repeatable build/test/deploy/evaluate cycle may be used - DS/DE)
  • A/B/N/Bandit Testing in Stage (generally this stage is covered by the product team, alongside automated acceptance tests, if they know the techniques; otherwise DS/DE may be involved)
  • Release/Integrate for Production (depends whether this is a B2B or B2C case, or beta mode)
  • Storytelling (how well does the model answer/solve the question or problem statement - 'through the looking glass' - refers to both dev and stage/prod cases)
  • User/Stakeholder/Client Feedback (Rinse & Repeat, depending on B2B or B2C cases)
  • Incremental Analysis and Review of Models
  • Rinse & Repeat (some of the steps above repeated multiple times before production release)
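
To make the bandit part of the A/B/N/Bandit testing step more concrete, here is a minimal epsilon-greedy simulation. It is only a sketch: the success rates, the number of rounds and the function name `epsilon_greedy` are assumptions chosen for illustration, and a real test would observe user outcomes rather than simulate them.

```python
import random

def epsilon_greedy(true_rates, rounds, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over variants with hidden success rates."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms   # how often each variant was shown
    wins = [0] * n_arms    # how often each variant succeeded
    for t in range(rounds):
        if t < n_arms:
            arm = t                                  # show every variant once first
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)              # explore: random variant
        else:
            arm = max(range(n_arms),                 # exploit: best observed rate
                      key=lambda a: wins[a] / pulls[a])
        reward = 1 if rng.random() < true_rates[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
    return pulls, wins

# Three hypothetical variants; the third has the highest (hidden) success rate.
pulls, wins = epsilon_greedy([0.1, 0.5, 0.9], rounds=5000)
print(pulls)
```

Unlike a fixed A/B/N split, the bandit shifts most traffic to the variant that looks best while still spending a fraction `epsilon` of rounds on exploration.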

Process Flow:

R&D → Dev/UI/UX → Prod

Generally, with a heavily R&D/backend-focused team, the features and functionality tend to be dictated by the forward flow (bottom-up approach); most AI projects at startups tend to be built that way. The frontend then becomes a thin client, a view to the world that presents the backend efforts. A typical pattern is an informational dashboard for storytelling - 'through the looking glass'. This is because, in a top-down approach, many of the backend efforts would get lost in translation (although in some business cases top-down may work better).

Data → Information → Knowledge

State-of-the-art in research may not be state-of-the-art for your business case, and may in fact lead to sub-optimal results and more effort. It is all very subjective and depends on the data, the features used to train the model, and the business case you are trying to solve. Work towards the least-effort, most efficient or sensible outcome.

8 January 2019

NLP High-Performance Computing

There are two primary approaches for working towards high-performance computing in NLP domains:
  • Add GPUs to server
  • Connect CPUs on multiple servers
Scaled-out approaches generally aim to keep RAM utilization consistently high: they automatically traverse the computational graph to allocate resources and optimize throughput. In many cases, particularly for deep learning models, heavy acceleration of parallelized matrix multiplications makes a big difference. In neural networks, backpropagation is more computationally expensive than the forward pass. Once the model is trained, the weights and structure can be exported for prediction on other hardware, which needs only a forward (inference) pass.
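
The point about exporting a trained model can be sketched as follows: the weights are serialized, loaded elsewhere, and prediction then needs only a forward pass. The tiny network, its weight values and the `forward` helper are hypothetical stand-ins; real frameworks ship their own export formats.

```python
import json
import math

def forward(weights, x):
    """Forward pass of a one-hidden-layer network: tanh hidden layer, sigmoid output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(weights["W1"], weights["b1"])]
    z = sum(w * h for w, h in zip(weights["W2"], hidden)) + weights["b2"]
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights standing in for the result of training.
trained = {
    "W1": [[0.5, -0.2], [0.1, 0.3]],
    "b1": [0.0, 0.1],
    "W2": [1.0, -1.0],
    "b2": 0.05,
}

exported = json.dumps(trained)    # serialize on the training machine
restored = json.loads(exported)   # load on the prediction machine

# Prediction uses only the forward pass; no gradients are computed.
score = forward(restored, [1.0, 2.0])
print(score)
```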

Approximate Nearest Neighbor Matching

Annoy-Hamming
BallTree (NMSLib)
Brute Force (BLAS)
Brute Force (NMSLib)
DolphinnPy
RPForest
Datasketch
MIH
Panns
Falconn
FLANN
HNSW (NMSLib)
Kdtree
NearPy
KeyedVectors (Gensim)
FAISS-IVF
SW-Graph (NMSLib)
KGraph (NMSLib)

ANN Benchmarks
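
For context, the brute-force entries above are the exact baseline that approximate methods trade accuracy against. The sketch below shows exact cosine nearest-neighbour search in plain Python and a recall measure for comparing an approximate result against it. The vectors, the function names and the "approximate" result are invented for illustration; this is not how any of the listed libraries are implemented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def brute_force_knn(query, vectors, k):
    """Exact top-k search: compares the query against every stored vector."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate result found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Four toy 2-d vectors and one query.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
exact = brute_force_knn([1.0, 0.05], vectors, k=2)

# A hypothetical approximate index that returned ids 0 and 2.
recall = recall_at_k([0, 2], exact)
print(exact, recall)
```

Brute force costs one comparison per stored vector for every query, which is why approximate indexes become attractive as the collection grows.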

Chatbot Prizes

Loebner Prize
Alexa Prize
Winograd Schema
Marcus Test
Lovelace Test

SentencePiece

SentencePiece

Sentence Segmentation

spaCy
DetectorMorse
CoreNLP
SyntaxNet
NLTK - Punkt
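
A naive rule-based splitter helps show why dedicated tools like the ones above exist. The sketch below splits on sentence-final punctuation followed by whitespace and an uppercase letter; the rule and the `naive_split` name are assumptions for illustration and are not the algorithm used by any of the listed tools.

```python
import re

# Boundary: whitespace that follows ., ! or ? and precedes an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def naive_split(text):
    """Split text into sentences using a single punctuation rule."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]

simple = naive_split("It works. Does it? Yes!")
tricky = naive_split("Dr. Smith arrived. He sat down.")

print(simple)  # three clean sentences
print(tricky)  # the abbreviation "Dr." is wrongly treated as a sentence
```

Abbreviations, initials and quoted speech break rules like this one, which is the kind of case that trained or more carefully engineered segmenters are meant to handle.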

TM-Town

TM-Town

DeepMind QA

DMQA

Visual Question Answering

Visual Question Answering

ManyThings

ManyThings