27 February 2017

Model Evaluation Techniques

data mining map

Things to watch out for in cross-validation:
  • make sure the training data is a representative sample of the population and that new data falls within the coverage of that training data; otherwise the error estimates will be optimistic, so minimize the bias in the training data
  • when working with temporal datasets, structure the cross-validation so that all training set data is collected before the testing set
  • a larger number of folds (k) gives better error estimates but takes longer to run; 10 folds or more is better, and for models that are quick to train and predict, use leave-one-out cross-validation
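
The splitting strategies above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names `k_fold_splits` and `time_ordered_splits` are made up for this example, the folds are contiguous (shuffle the indices first for non-temporal data), and the temporal variant assumes the rows are already sorted by time.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]


def k_fold_splits(n_samples: int, k: int) -> Iterator[Split]:
    """Yield (train_indices, test_indices) for k contiguous folds.

    Setting k == n_samples gives leave-one-out cross-validation.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def time_ordered_splits(n_samples: int, n_splits: int) -> Iterator[Split]:
    """Yield expanding-window splits for data already sorted by time.

    Every training index precedes every test index, so the model is never
    trained on observations collected after the ones it is tested on.
    """
    if not 1 <= n_splits < n_samples:
        raise ValueError("n_splits must be between 1 and n_samples - 1")
    test_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = n_samples - (n_splits - i + 1) * test_size
        yield list(range(train_end)), list(range(train_end, train_end + test_size))
```

Averaging a model's error over the test folds gives the cross-validated estimate; the cost grows with k because the model is refit once per fold.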

ROC curves apply to binary classification, where predictions are divided into negative and positive classes. The area under the ROC curve (AUC) is another evaluation metric. For multiclass problems one uses the one-versus-all trick, and in most multiclass cases one uses both the ROC curve and the confusion matrix. The confusion matrix shows class-wise accuracy as a table of true versus predicted classes, which is two-by-two in the binary case. Regression performance is measured using the mean squared error (MSE), the root-mean-squared error (RMSE), or R-squared. Other criteria used with regression models include AIC and BIC. A brute-force grid search is a standard way to optimize the choice of tuning parameters; scoring each candidate with cross-validation ties the strategies of cross-validation and model evaluation together.
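
As a rough sketch of the metrics mentioned above, the helpers below compute a confusion matrix, the AUC, RMSE, and R-squared in plain Python. The function names are chosen for this example only. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative), which is simple but quadratic in the number of examples.

```python
import math
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int) -> List[List[int]]:
    """Count predictions; rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC for binary labels (0 or 1); a tied positive/negative pair counts as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-squared error; squaring the result gives the MSE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Share of variance explained: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

For a multiclass model, the one-versus-all trick amounts to calling `roc_auc` once per class, with that class relabelled as 1 and every other class as 0.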

Rated Funds

Rated funds

Elasticsearch Graph

Elasticsearch Graph
elasticsearch expands relationship modeling with graph

24 February 2017

Supervised Learning Use Cases


Example Use Case | Type of ML
Spam Filtering | Classification
Sentiment Analysis | Classification
Fraud Detection | Classification
Customer Ad Targeting | Classification
Churn Prediction | Classification
Support Case Flagging | Classification
Content Personalization | Classification
Detecting Manufacturing Defects | Classification
Customer Segmentation | Classification
Event Discovery | Classification
Genomics | Classification
Drug Efficacy | Classification
Stock Market Prediction | Regression
Demand Forecasting | Regression
Price Estimation | Regression
Ad Bid Optimization | Regression
Risk Management | Regression
Asset Management | Regression
Weather Forecasting | Regression
Sports Prediction | Regression
Product Recommendation | Recommendation
Job Recruiting | Recommendation
Netflix Prize | Recommendation
Online Dating | Recommendation
Content Recommendations | Recommendation
Incomplete Patient Records | Imputation
Missing Customer Data | Imputation
Census Data | Imputation

Mern Stack

Mern

22 February 2017

Outstanding Ontologies

There are different types of ontologies, including knowledge representation ontologies, domain ontologies, linguistic ontologies, and top-level ontologies. A few examples of each type are provided below.

Knowledge Representation Ontologies:
Frame Ontology
OKBC

Top-Level Ontologies: 
Cyc
SOWA
Standard Upper Ontology

Linguistic Ontologies: 
WordNet
Generalized Upper Model
SENSUS
EuroWordNet
Mikrokosmos

Ecommerce Ontologies (Domain Ontology): 
United Nations Standard Products and Services Code
North American Industry Classification System
Standard Classification of Transported Goods
E-Cl@ss
RosettaNet

Medical Ontologies (Domain Ontology):
GALEN
UMLS
ON9

Engineering Ontologies (Domain Ontology):
EngMath
PhysSys

Enterprise Ontologies (Domain Ontology):
Enterprise Ontology
TOVE

Chemistry Ontologies (Domain Ontology):
Chemicals
Ions
Environmental Pollutants

Knowledge Management Ontologies (Domain Ontology):
KA Ontology - Project, Organization, Person, Publication, Event, Research-Topic, Research-Product

Nature.com Subjects Ontologies

Infrastructure as Code & Automation

Terraform / Nomad / Vault / Consul (HashiCorp)
CloudFormation (Troposphere)
Boto
Chef
Puppet
Heat
SaltStack
Ansible
Fabric
Pallet
Rundeck

20 February 2017

JavaScript Templating Engines

Mustache
Handlebars
Dust
EJS
Underscore
Jade (Pug)
doT
ECT
jTemplate
Marko
Template7
Nunjucks
Swig
Twig
Vash
Hyperscript
Hogan
Closure Templates
JsRender
Pure
Json2html
jQuery Templating

best javascript templating engines
top templating engines for javascript

Customer Relationship Management

CRM is a critical aspect of business: an applied area that provides insight into how a business retains, loses, or gains customers. Customer dealings, whether by email, phone, post, or in person, are a critical part of business. However, in most capitalist societies customer service seems to take a backseat, at which point businesses lose face and reputation, which inevitably leads to a fall in sales. Social media has made customer service even more complex. Understanding customer needs can also help shape and drive business products and services. Machine learning can play an even greater role in reshaping the customer service of businesses towards more effective customer relationship management KPIs and ROI. Important areas include customer loyalty and retention, promotions, offers, engagement, measuring spending, predicting campaign performance, identifying new customers, identifying changes in customer behavior, customer churn prediction, customer segmentation, and customer lifetime value forecasting. Data, much of it unstructured, can come in the form of purchase behavior, customer service interactions, social media, and responses to previous campaigns. Several areas where machine learning can come into play are listed below:

  • Phone/post/email/in-person/social media monitoring
  • Customer retention
  • Customer insights
  • Customer predictions
  • Customer segmentation
  • Customer scores and profiles
  • Customer sentiment management
  • Customer promotions
  • Customer engagements
  • Customer transactions
  • Customer KYC
  • Customer efficiency
  • Customer/Provider Database
  • Customer analysis
  • Customer care management
  • Customer intelligent agents/advisors
  • Customer FAQs knowledge base
  • Marketing analysis
  • Marketing quizzes
  • Provider scores and profiles
  • Provider analysis
  • Discovering the 'why'
  • Measuring/Predicting customer/provider value
  • Measuring/Predicting customer/provider reputation and effects from forecasting services
  • Measuring/Predicting churn
  • Lead generation
  • Planning effective campaigns and promotions
  • Personalizing customer/provider measurements and predictions
  • Ontologies for customer/provider knowledge representation and metadata enrichment

Key CRM Vendors:

Open Source CRM Platforms:

Natural Language Processing Diagram

Artificial Intelligence

Data Science Cheatsheets

data science machine learning cheat sheets

17 February 2017

Blazegraph

Blazegraph - Semantic Graph Database / Triplestore
Technology To Watch For 2016

Analytical Task Workflows

Celery
Akka
Luigi
Airflow
Dask
Azkaban
Oozie
Aurora
Falcon
Chronos
Sparrow
Pinball
BigDataScript
Makeflow

Cool Vendors for Data Integration & Data Quality

Alation
Capsenta
Cirro
Qubole
Verato

R, Python, Scala, and Julia

Three languages have become critical parts of the data scientist's arsenal: R, Python, and Scala. A major ecosystem of accessible libraries for statistical computing and machine learning is critical, especially at scale. Scala is still a stumbling block for data scientists, as the language can be quite complex. Often data scientists use R and Python without venturing beyond. However, there are significant computational and data-intensive gains to be made by using languages like Julia and Scala, although in certain microbenchmarks even the performance of Julia, and the state of the language itself, can come into question. For a graduate just starting out in data science, Python is the best choice. For a research scholar, R, Python, Scala, and even Julia become the languages of choice. For an employee, the usual alternatives are again Python and R, and also Scala, especially with Spark. However, for those willing to take the plunge, Julia is emerging as a useful contender for Big Data and is likely to play a stronger role in the future if the language takes shape within the open source community. In general, if one needs to be flexible and work with data across a multitude of different algorithms, the choice is often R. If that flexibility needs to extend to data structures and external application integration, Python seems to be the better alternative, with the optimizations that can be gained from low-level C implementations. But for building massively scalable components with batch and streaming data pipelines, one can't beat the Big Data ecosystem around Java/Scala and Python. Julia still has a long way to go in catching up to the likes of Python.
A few areas where Julia still requires improvement are performance, syntax, interoperability with other languages, text formatting, testing issues that make it difficult to write robust code with defensive programming, and accessibility of the native API. It also remains a very research-led language, which limits how accessible it is for the larger open source community to contribute libraries and frameworks.

9 February 2017

Big Data Watch

Airflow
Apex
Arrow
Beam
BlinkDB
Cascading
DL4J
Drill
Druid
Flink
Flume
Gearpump
GlusterFS
H2O
Hadoop
Heron
Ignite
Impala
Kafka
Kudu
Mahout
Nifi
Phoenix
Prestodb
Samza
Scalding
Spark
Storm
Streamsets
Zookeeper
Oryx

hadoop ecosystem table

OpenBankProject & OpenTransactions

OpenBankProject
OpenTransactions

BankInnovation

Chatbots

wit.ai
api.ai
luis.ai
rasa
mindmeld
chatbots.io
bot builder
chatscript
rebot.me
imperson
errbot
nestor
gupshup
botkit
will
motion.ai
recast.ai
snips.ai
amazon lex
facebook messenger

bots apis
chat apis
twitterbots
the best twitter bots of 2015
best twitter bots 2016
botwiki
how to code twitter bot
comparison between luis.ai vs api.ai vs wit.ai
easy context intent prediction and slot detection
exploiting shared information for multi intent natural language sentence classification
chatbot architecture

ConceptNet

ConceptNet

ConceptNet paper

Google Knowledge Graph Search API

Knowledge Graph API

Deep Learning for Various Languages

There are different kinds of deep learning architectures: generative, discriminative, and hybrid. Generative architectures are unsupervised and extract features from data. Discriminative architectures are supervised and classify inputs into classes. Hybrid architectures combine the two, with a generative network feeding into a discriminative network. The following are deep learning libraries in various programming languages; the list is not exhaustive.

Python
Java/Scala
JavaScript
Various

3 February 2017

Containerization

Docker/Swarm
CoreOS/RKT
Kubernetes
Canonical
OCI
Mesos
CloudFoundry Garden

Serverless Container Architecture with Funktion
CI/CD automation with Wercker and Shippable
Alternatively, Jenkins for development in combination with Rundeck for operations.

2 February 2017

Text-Driven Forecasting

Text-Driven Forecasting is about building systems that predict the future by analyzing a collection of natural language documents. Often they predict numeric quantities about a certain event from various textual sources and feeds (e.g. news, Twitter, Facebook, polling data, opinion blogs, financial reports, Amazon reviews, economic data) and draw information gain from sentiment analysis and subjectivity. Machine learning algorithms applicable to this domain include regression, deep learning, decision trees, and others.

Examples:
Predicting movie reviews using social media
Predicting opinion polls using social media
Predicting stock volatility using financial data
Predicting government elections and referendums
Predicting product sales using social media
Predicting property prices in the future
Predicting risk of a potential course of action or decision
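
To make the idea concrete, here is a deliberately small sketch: each document is reduced to one sentiment score, and a straight line is fitted from score to the numeric target. Everything in it is an assumption made for illustration. The four-word `LEXICON` is hypothetical, the function names are invented, and a real system would use far richer text features and one of the learning algorithms listed above.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical toy lexicon; a real system would use a much larger resource
# or learn word weights directly from the data.
LEXICON: Dict[str, float] = {"great": 1.0, "good": 0.5, "bad": -0.5, "awful": -1.0}


def sentiment_score(text: str) -> float:
    """Average lexicon weight over the tokens of a document (0.0 if empty)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(LEXICON.get(tok.strip(".,!?"), 0.0) for tok in tokens) / len(tokens)


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def forecast(text: str, slope: float, intercept: float) -> float:
    """Predict the numeric target for a new document from its sentiment score."""
    return slope * sentiment_score(text) + intercept
```

In use, `fit_line` is given the sentiment scores of past documents together with the quantity observed afterwards (sales, poll share, volatility), and `forecast` then applies the fitted line to new text.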

smith whitepaper

Related Courses & Resources:
Priberam Labs
Social Media Analysis & Computational Social Science
Natural Language Processing & Social Interaction
Computational Social Science
Social & Information Network Analysis
Text as Data
NLP for Social Science
Computational Social Science
Computational Linguistics / Computational Social Science
Predicting Economic Indicators from Web Text Using Sentiment Composition
Making Predictions with Textual Contents

Converting Natural Language to Queries

Distributed queries in the form of natural language can be very versatile and useful for analytics in Big Data. Linked Data in the form of a data lake can provide a way to semantically frame natural language questions, which are then translated into queries, especially SPARQL. Such approaches can be extended to other types of queries. Natural Language Generation is another aspect of these conversion and transformation steps. Often such approaches are replicated in a search engine or in the semantic web, where tokenized words are exposed as subject-predicate-object triples linked to relative URI references that map to an ontology schema, such as one from a knowledge base like DBpedia. An application of this approach can be found in Quepy, which uses transformations and semantic relations.
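
A toy version of this translation step might look like the following. It is a sketch under stated assumptions and not how Quepy works internally: the two regular-expression rules, the `to_sparql` function, and the choice of the DBpedia properties `dbo:birthPlace` and `dbo:abstract` are all picked for illustration, and the code only builds the query text without sending it to an endpoint.

```python
import re
from typing import List, Optional, Tuple

PREFIXES = (
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
)

# (question pattern, triple pattern) pairs; {name} is filled in from the match.
# Both rules are illustrative assumptions, not an exhaustive grammar.
RULES: List[Tuple[re.Pattern, str]] = [
    (
        re.compile(r"^where was (?P<name>.+?) born\??$", re.IGNORECASE),
        '?s rdfs:label "{name}"@en ; dbo:birthPlace ?answer .',
    ),
    (
        re.compile(r"^(?:who|what) (?:is|was) (?P<name>.+?)\??$", re.IGNORECASE),
        '?s rdfs:label "{name}"@en ; dbo:abstract ?answer .\n'
        '  FILTER (lang(?answer) = "en")',
    ),
]


def to_sparql(question: str) -> Optional[str]:
    """Translate a question into a SPARQL SELECT, or None if no rule matches."""
    text = question.strip()
    for pattern, body in RULES:
        match = pattern.match(text)
        if match:
            # Escape backslashes and quotes so the name stays a valid literal.
            name = match.group("name").replace("\\", "\\\\").replace('"', '\\"')
            return f"{PREFIXES}SELECT ?answer WHERE {{\n  {body.format(name=name)}\n}}"
    return None
```

Each rule maps the subject of the question to an `rdfs:label` match and the verb phrase to a predicate, which mirrors the subject-predicate-object framing described above; unsupported questions return `None` instead of a guessed query.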

Data Science Competitions

kaggle competitions
crowdanalytix competitions
drivendata competitions
innocentive competitions
tunedit competitions
texata championships
topcoder competitions
data science challenge
EvalAI Challenge

best kept secret about data science competitions
data science bowl