27 November 2016

PoolParty Academy

PoolParty Academy is an initiative that provides certifications for the Semantic Web, for the PoolParty Semantic Suite, and for metadata management integration. The list below highlights the certification roles on offer. PoolParty is used extensively in industry to manage SKOS-based semantic schemas and for linked data management, which are applied towards various enrichments for knowledge engineering in natural language processing tasks.

  • Semantic Web Associate
  • Knowledge Engineering Specialist
  • Semantic Integration Expert

2 November 2016

Clueless Interviewers

One is sometimes stuck in a room with an interviewer and, through the process, comes to the realization that the person has no clue what they are talking about, yet they are recruiting for a Big Data Engineer. Such interview episodes seem to be a common occurrence in the Big Data world, where even managers and architects have no idea what Big Data is about nor how to tackle it for their next project. One would suppose that the first step would be to recruit sufficiently skilled individuals through a sufficiently experienced and professional hiring practice. It is even worse when the interviewer comes back with feedback that clearly displays their lack of understanding of Big Data concepts, leaving not only a bad taste but also a humorous impression of the interviewer's frankly stated opinions. Such roles are even more difficult for human resources to recruit for, as the number of keywords far outweighs their often limited vocabulary. The dynamic nature of Big Data is also a challenge, as companies want pragmatic and cost-effective ways of implementing for the future. Training the management is often the right first step in defining a convincing strategy for adopting Big Data on projects. There is also a higher level of risk involved due to the heavy requirements of cleansing data, moving it out of silos, and forming a data lake. Many companies are still struggling to understand the Semantic Web and Linked Data, and the benefits of such approaches for Big Data. The complexity of the domain is often met with clueless management, interviewers, and human resources personnel adapting to a dynamic environment, while newly recruited data engineers and data scientists are expected to provide considerable input and guidance towards such a shift in Big Data adoption.
Talented engineers are often frustrated at being judged on the keywords in a CV rather than the context of their use, and by bemused interviewers making only a few whimsical attempts to understand the Big Data landscape. Many data engineers have to wear multiple hats while steeped in annoyance at inept data scientists. As we know, everyone is calling themselves a data scientist these days; how many of them are actually qualified to do the job, with a well-rounded skill set in the area, is fairly doubtful.

CAP Theorem

The following guarantees of Brewer's Theorem (the CAP Theorem) play a balancing act in a distributed system, especially in the context of big data: in the presence of a network partition, a distributed system can fully provide at most two of the three.

  • Consistency
  • Availability
  • Partition Tolerance
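The trade-off between the three guarantees can be sketched in a few lines. Below is a toy, hedged illustration (not a real distributed system): two replicas of a key-value store that normally replicate synchronously; during a partition, a CP-leaning store rejects writes it cannot replicate, while an AP-leaning store accepts them and lets the replicas diverge.

```python
# Toy illustration of the CAP trade-off with two in-process "replicas".
class Replica:
    def __init__(self):
        self.data = {}

class ToyStore:
    def __init__(self, mode):
        self.mode = mode                  # "CP" or "AP"
        self.a, self.b = Replica(), Replica()
        self.partitioned = False          # simulated network partition

    def write(self, key, value):
        self.a.data[key] = value
        if self.partitioned:
            if self.mode == "CP":
                # sacrifice availability: roll back and reject the write
                del self.a.data[key]
                raise RuntimeError("write rejected during partition")
            # AP mode sacrifices consistency: replicas now disagree
        else:
            self.b.data[key] = value      # normal synchronous replication

cp = ToyStore("CP")
cp.partitioned = True
try:
    cp.write("x", 1)
except RuntimeError:
    pass  # unavailable, but the replicas stayed consistent

ap = ToyStore("AP")
ap.partitioned = True
ap.write("x", 1)  # available, but replica b never sees "x": inconsistent
```

The class and mode names here are made up purely for illustration; real systems (e.g. quorum-based stores) sit at many points along this spectrum rather than at the extremes.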

26 October 2016

Machine Learning Taxonomy

Machine Learning is about designing algorithms that give a computer the means to learn, often by finding patterns in data. The outline below covers the key areas of the machine learning taxonomy.

Supervised Learning
Semi-Supervised Learning
Unsupervised Learning
Reinforcement Learning
Transduction
Learning to Learn
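The contrast between the supervised and unsupervised branches of the taxonomy can be shown in a few lines of plain Python. This is a hedged sketch on made-up one-dimensional data, not a production implementation: a 1-nearest-neighbour predictor learns from labelled examples, while a gap-based grouping finds structure in unlabelled points.

```python
def nn_classify(train, point):
    """Supervised: predict from labelled (feature, label) pairs via 1-NN."""
    return min(train, key=lambda fl: abs(fl[0] - point))[1]

def cluster_1d(points, gap=1.0):
    """Unsupervised: group sorted points separated by more than `gap`."""
    pts = sorted(points)
    clusters = [[pts[0]]]
    for p in pts[1:]:
        if p - clusters[-1][-1] > gap:
            clusters.append([p])      # large gap starts a new cluster
        else:
            clusters[-1].append(p)
    return clusters

labelled = [(1.0, "low"), (1.2, "low"), (9.8, "high"), (10.1, "high")]
print(nn_classify(labelled, 9.5))         # -> high
print(cluster_1d([1.0, 1.2, 9.8, 10.1]))  # -> [[1.0, 1.2], [9.8, 10.1]]
```

Semi-supervised learning sits between the two, using a few labelled examples to guide structure found in many unlabelled ones.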

Scala Data Tools

A list is provided below of the general mathematics and machine learning data tools that have emerged in Scala, aside from Hadoop and the Scala APIs for databases.
  • Algebird: Twitter’s API for abstract algebra that can be used with almost any Big Data API.
  • Factorie: A toolkit for deployable probabilistic modeling, with a succinct language for creating relational factor graphs, estimating parameters, and performing inference.
  • Figaro: A toolkit for probabilistic programming.
  • H2O: A high-performance, in-memory distributed compute engine for data analytics. Written in Java with Scala and R APIs.
  • Relate: A thin database access layer focused on performance.
  • ScalaNLP: A suite of Machine Learning and numerical computing libraries. It is an umbrella project for several libraries, including Breeze, for machine learning and numerical computing, and Epic, for statistical parsing and structured prediction.
  • ScalaStorm: A Scala API for Storm.
  • Scalding: Twitter’s Scala API around Cascading that popularized Scala as a language for Hadoop programming.
  • Scoobi: A Scala abstraction layer on top of MapReduce with an API that’s similar to Scalding’s and Spark’s.
  • Slick: A database access layer developed by Typesafe. 
  • Spark: The emerging standard for distributed computation in Hadoop environments, as well in Mesos clusters and on single machines (“local” mode).
  • Spire: A numerics library that is intended to be generic, fast, and precise.
  • Summingbird: Twitter’s API that abstracts computation over Scalding (batch mode) and Storm (event streaming). 
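The core idea behind Algebird-style aggregation, sketched here in Python for illustration: if an aggregate forms a monoid (an associative combine operation plus an identity), partial results from any partitioning of the data can be merged in any order, which is what makes the approach usable with almost any Big Data API. The class and variable names below are made up for the sketch.

```python
class Monoid:
    """A value type with an identity element and an associative combine."""
    def __init__(self, zero, plus):
        self.zero, self.plus = zero, plus

    def sum(self, xs):
        total = self.zero
        for x in xs:
            total = self.plus(total, x)
        return total

sum_monoid = Monoid(0, lambda a, b: a + b)
max_monoid = Monoid(float("-inf"), max)

# Simulate map-side partial aggregation followed by a reduce-side merge,
# exactly as a distributed job would do per partition.
partitions = [[1, 2, 3], [4, 5], [6]]
partials = [sum_monoid.sum(p) for p in partitions]  # [6, 9, 6]
print(sum_monoid.sum(partials))                     # 21
```

Because the combine is associative, the merge tree's shape does not matter, so the same aggregate works in batch, streaming, or incremental settings.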

25 October 2016

Reactive Manifesto

The Reactive Manifesto is an effort to define what a reactive system should look like, with four sets of characteristics:
  • Message or Event-driven: As a baseline the system needs to respond to messages or events
  • Elastically Scalable: System needs to meet scale out demands (horizontal scaling via processes, cores, nodes)
  • Resilient: System needs to be able to recover gracefully from failures
  • Responsive: System is available for service requests even if this means graceful degradation of failed components during high traffic

Reactive Extensions
Functional Reactive Programming
Akka (Actors Model)
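The actor model referenced above can be sketched in plain Python with a thread and a mailbox queue. This is a hedged, minimal illustration of the idea behind Akka, not its API: each actor owns private state and processes messages from its mailbox one at a time, so no locks are needed around that state.

```python
import queue
import threading

class CounterActor:
    """A toy actor: private state plus a mailbox drained by one thread."""
    def __init__(self):
        self.count = 0                    # touched only by the actor thread
        self.mailbox = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()      # messages processed sequentially
            if msg == "stop":
                break
            self.count += msg

    def tell(self, msg):
        self.mailbox.put(msg)             # fire-and-forget, event-driven

actor = CounterActor()
for _ in range(100):
    actor.tell(1)
actor.tell("stop")
actor.thread.join(timeout=5)
print(actor.count)  # 100
```

Real actor systems add supervision hierarchies, location transparency, and backpressure on top of this basic mailbox discipline.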

21 October 2016

Alternatives to Kafka

Kinesis
RabbitMQ
ZeroMQ
Kudu
Storm
Samza
SQS
Redis
Aeron
MAPR Streams

Kafka for Beginners
Confluent

One must note that Storm and Samza can in fact be used alongside Kafka in a data pipeline. It is the context of how one plans to use a platform, invariably dictated by the constraints of the problem at hand, which may come in the form of either batch or real-time streams.

18 October 2016

Beer Slangs

Homebrew, the macOS package manager, uses a beer analogy. Beer is also a staple of social gatherings within the data science field and has become an essential element of society. Over the years it has evolved a diverse set of regional slangs as well as a variety of flavors from around the world. An ontology can even be produced for beer as a consumable term, a concept or thing as well as a product with a set of ingredients, categories, and tastes; in the process helping people to explore, and producing a recommendation graph that associates their evolving tastes, merry meet-ups, and choices of food accompaniment.

beer slang
thrillist
beerslanging
15 brewtastic ways say beer
craftbeer
alldownunder
irishdrinking
1800s beer slang

13 October 2016

Frozen Yogurts in London

Frozen yogurt is an interesting analogy for applying machine learning, or specifically data science, towards understanding the customer based on their scoop and taste choices. Analytics has given way to self-service frozen yogurt, putting the choice of flavors in the hands of the user and in the process improving the customer experience. This defines a value shift towards the user and the association of data that relates to them. It also shows that huge investments do not need to be made to shift business models; self-service actually reduces labor costs. This is all part of analytics towards the maximization of revenue. By shifting control to the user, one allows a customer to attain better satisfaction and a sense of assurance that they are getting their money's worth. The list below provides a few interesting frozen yogurt places in the dynamic society of London.

Pinkberry
Snog
Itsu
Frae
Moosh
Moto Yogo
Yoomoo
Yogland

5 October 2016

Visual Business Intelligence & Analytics

Tableau
Qlik
Pentaho
Looker
SAS Visual Analytics
Tibco Spotfire
SiSense
GoodData
Alteryx
Google Chart Tools
Geckoboard
Raw
NVD3
Google Fusion Tables

Intelligent Data Center

The holy grail of the data center is complete automation and intelligent management of all services, infrastructure, storage, security, and data. However, to get to that point one has to think outside the box of a standard system. Data centers run many complex and large-scale applications that are difficult to manage, and there is ultimately a requirement to manage the infrastructure at massive scale, especially for Big Data and the abstractions of the Cloud. Why do we need engineers in data centers when we can converge, automate, and build software that allows intelligent agents to do the work for us? How does one introduce intelligence into an existing system or its sub-systems? Through machine learning and the representation of knowledge. The following sections look at various areas for tackling the impedance and complexities of the data center, as well as intelligent data protection services.

Key areas identified for data center efficiency and management:
  • data center operation automation
  • characterization and synthesis of workload spikes
  • dynamic resource allocation
  • quick and accurate identification of recurring performance problems
  • optimization of systems
  • energy resource optimization
  • fault tolerance
  • operational readiness and maintenance
  • fundamental protection of data

Knowledge representation is already available in databases; however, it is not semantic enough for agents to understand. Going further, semantics also helps to categorise and facilitate searching for information. One immediate benefit is smart and custom categorization, as well as sensible defaults and the merging of both. This can be extended across all products and services, into a data protection ontology, and into a knowledge representation for the entire data center. Such approaches can even be applied programmatically so that agents can infer and reason over concepts and things. Semantic ontologies can also be applied here to entity management, search, and dynamic reasoning. Various reasoning approaches can then be applied, going beyond the constraints of simple rules to more complex reasoning metaphors such as probabilistic, commonsense, deductive, and inductive reasoning. Once semantics are applied, an agent can utilise a BDI approach and ultimately gain from argumentation in communication via game theory. Smarter agents mean better data protection and more cohesive fault tolerance, recovery, and management of data centers.

Extending this further into graph theory, one can build a complete map of everything happening in a data center, or a set of dispersed geolocated data centers, for a customer, using connected linked data for knowledge discovery as well as complex networks. Recommendations can be derived using approaches such as LDA and matrix factorisation. Ultimately the semantic knowledge branches will grow and consume more data incrementally, like a connected knowledge graph, through clustering, even applying approaches such as word2vec or GloVe for word embeddings. Tagged data can facilitate topic maps and contextual identification, and probabilistic reasoning can add value to contextual scores.

Going further, approaches like deep learning can extend a semantic Bayesian network into a deep belief network for understanding the complexities of data center infrastructure, harnessing existing resources to build a conceptual map of the world. Another approach is to use a global optimisation strategy in the form of swarm intelligence, inspired by natural computation, for foraging, detecting points of anomaly, and increasing fault tolerance. There are even benefits here in reducing energy consumption to facilitate cost-effective data center management. This only touches the surface of what automation of the data center should mean: an instant advisor to engineers and managers via a mobile phone, a smart bot like Siri or AlphaGo that can answer questions about the data center and provide deep insights, while at the same time connecting with the client to make their lives simpler and easier. Is just providing an informational dashboard sufficient? Perhaps being able to interact with a virtual advisor through the dashboard may even add to the entire conversation between the engineer and the objective of managing a data center effectively. Having lots of data is good, but it truly matters when one can automate and make that data semantically meaningful for human consumption, and also reduce the burden of storage and infrastructure management through intelligent means without any manual intervention. A question might arise that the limitations of insight lie to some degree in third-party software; however, these products can be enriched to facilitate even more contextualization and intelligence.
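As one concrete, hedged sketch of the anomaly-detection component described above: flag a data-centre metric sample as anomalous when it deviates from a recent window by more than k standard deviations. Real systems would use far richer models (the swarm and deep approaches mentioned above); the names and sample data here are made up to illustrate the shape of such a component.

```python
from collections import deque
from statistics import mean, stdev

class WindowedDetector:
    """Rolling z-score detector over the last `window` samples."""
    def __init__(self, window=20, k=3.0):
        self.samples = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        anomalous = False
        if len(self.samples) >= 2:
            mu, sigma = mean(self.samples), stdev(self.samples)
            anomalous = sigma > 0 and abs(value - mu) > self.k * sigma
        self.samples.append(value)
        return anomalous

detector = WindowedDetector()
cpu_load = [0.30, 0.32, 0.31, 0.29, 0.30, 0.33, 0.31, 0.30, 0.95]
flags = [detector.observe(v) for v in cpu_load]
print(flags)  # only the final 0.95 spike is flagged
```

Such a detector would be one small agent in the larger pipeline, feeding its flags into correlation, reasoning, and automated remediation layers.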

Additionally, such approaches can be reused outside of the data center context, for example for lead generation for sales representatives, and for sentiment analysis for marketing and PR, helping to find new customers in the process as well as identifying more features that customers desire from the product. Features can be provided as modules in a microservices platform, allowing the client to pick and choose a tailored option that meets their data center requirements. The goal is to provide not just actionable insights but an entire intelligent path towards the automation of the whole data center from the view of data management and infrastructure.

Architecture and Implementation Ideas:
  • Knowledge Graph: Cassandra/Titan/Elasticsearch
  • Deep Learning: DL4J or TensorFlow
  • Big Data: Spark/Flink/Hadoop, Kafka, and others
  • Semantic Linked Data: DBPedia, ConceptNet (analogical reasoning), WordNet, SKOS (Thesaurus server), Event Calculus (commonsense reasoning), Reasoners, semantic & faceted search
  • Probabilistic Reasoning: Figaro/Factorie
  • NLP: CoreNLP, UIMA, Gate, OpenNLP, Sphinx, DL4J/TensorFlow
  • Microservices: Restlet/Dropwizard,  Distributed Tracing, Service Discovery, Anomaly Detection, Anomaly Correlation, Centralised Logging, Reactive Programming, Circuit Breakers
  • Dashboard: D3, Bokeh, Seaborn, Gephi

Connected Concepts & Things

A sample idea for connected retail:

Within the etiquette of robots.txt constraints, crawl the link graph as determined by the sitemap in order to formulate a custom ontology, which could then be linked to DBPedia, the GoodRelations schema, and various search engines (especially Google). The ontology is then mapreduced against any products and services available on Amazon. Do this across all UK and US retailers based on the consumable context of products and services; it may then grow regionally. Such ontological context can then be expressed as schema.org markup to enrich searchability, whether in the context of chatbots, web search, mobile, contextual advertising, or even in-store promotions.
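The crawl etiquette mentioned above can be checked with Python's standard library before fetching any page; the sitemap and ontology steps are out of scope for this sketch, and the robots.txt content and domain below are made-up examples.

```python
from urllib.robotparser import RobotFileParser

# A fabricated robots.txt for illustration only.
robots_txt = """\
User-agent: *
Disallow: /checkout/
Sitemap: https://retailer.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Consult the parser before every fetch to stay within crawl etiquette.
print(parser.can_fetch("*", "https://retailer.example.com/products/tea"))   # True
print(parser.can_fetch("*", "https://retailer.example.com/checkout/cart"))  # False
```

A polite crawler would additionally honour any Crawl-delay hints and throttle its own request rate per host.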

Example of Supermarkets:
  • Sainsburys
  • Morrisons
  • Asda
  • Lidl
  • Iceland
  • Aldi
  • Tesco
  • Walmart
  • Marks and Spencer
  • Whole Foods
  • Farmer's Market
  • Lowes
  • Giant
  • Safeway
  • Vons
  • Shoprite
  • Meijer
  • Costco
  • Kroger

Example of Departmental Stores:
  • Selfridges
  • Harrods
  • Macy's
  • Bloomingdales
  • Debenhams
  • Harvey Nichols
  • Fenwicks
  • House of Fraser
  • Fortnum and Mason
  • Marks and Spencer
  • Neiman Marcus
  • Saks
  • Kohls
  • Sears
  • Dillards
  • Nordstrom
  • JCPenney
  • Lord & Taylor
  • Target
  • KMart
  • Walmart
  • Marshalls
  • John Lewis

Individual Retail Brands
And various consumable and service contexts, ranging from banking to clothing/apparel and electronics; essentially, many of the same domains as Amazon's categories.

Benefits of such an approach include:
  • Free and Open Source so any business can make their products and services more reachable and findable to target customers
  • Free for customers to compare prices
  • Free for customers to check for availability
  • Free for customer recommendations
  • Free for enriching localized ecommerce searching
  • Free for enrichment of products and services for retailers
  • Free for SPARQL queries
  • All products and services essentially become resources in context of URI/URLs
  • Free custom and focused chat bots for customers
  • Free to leverage insights from customer behavior via machine learning
  • However, all data storage is decentralized, so there is no real localized store for any personal information on customers, and all competitive data is stored on retailer systems
  • As such services provide Linked Data as a Web of Data, one can do more NLP and Semantic Web work to better understand customers as well as product pricing and sales
  • They are also a way to make it easier for customers to find things on the web and shop on the go; one can even target or identify customers who are not entering stores
  • One can also find out clusters through network science about customers and which customers one should be targeting and in what way.
  • This approach of resources turns queries into a connected linked data graph or knowledge graph
  • Basic knowledge is already derived from DBPedia, which understands what a retailer is in the context of a business, along with various other concepts and things
  • Also, it can be applied to keeping track of new releases, new fashions, new trends, and news in general on retail
  • Postal Deliveries and shop at your convenience
  • Semantic Product/Service and Collaborative Recommendations
  • Semantic Sentiment Analysis on Customer Service Experience
  • Semantic Intent Graph formulations
  • Semantic Customer Understanding

The services are supposed to be free and funded by ad revenue, to try to remain objective in searchability with no preferential affiliations; possibly even with a percentage on targeted conversions.

Such things are the natural steps towards Web 3.0 and the Internet of Things, where everything is available. Retailers are facing tough competition from Amazon, and the attempt here is to make all retailers essentially more available, targetable, and reachable to customers as a basic enrichment of the customer service experience.
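The schema.org enrichment described earlier can be sketched as emitting JSON-LD Product markup that a retailer could embed in a page for search engines and chatbots to consume. The product values and helper name below are illustrative only.

```python
import json

def product_jsonld(name, sku, price, currency="GBP"):
    """Build a minimal schema.org Product description as JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "sku": sku,
        "offers": {
            "@type": "Offer",
            "price": str(price),
            "priceCurrency": currency,
            "availability": "https://schema.org/InStock",
        },
    }

markup = product_jsonld("Breakfast Tea 80 Bags", "TEA-080", 2.50)
# Embedded in a page inside <script type="application/ld+json">…</script>
print(json.dumps(markup, indent=2))
```

Richer markup would link the product to the retailer's custom ontology and to GoodRelations/DBPedia identifiers via `sameAs` properties.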

Alternatively, such approaches are currently being used for:

  • connected libraries
  • connected research
  • connected learning
  • connected businesses
  • connected social
  • connected games
  • connected entertainment
  • connected interests
  • connected travel
  • connected news
  • connected profiles
  • connected utilities
  • connected city
  • connected publishing
  • connected ads
  • connected things 

This makes the most sense in the context of connected retailers, especially where the market is so fiercely competitive.

3 October 2016

Scala Pronunciation

Scala is a very full-featured language. One would assume at first that a language with roots in Lausanne, Switzerland would have an ornate history, like the circular stairway of its fantasy-fanfare-style logo design. It is a very continental language, dispersed in many forms, which even shows in the complexity it derives from being a mix of functional and object-oriented styles. In reality, however, Scala is simply an abbreviation of two words combined: Scalable Language. One then wonders how the name is pronounced. Some people choose to pronounce Scala literally from its derivation, as in 'scale' or 'scalable', which seems more natural in American English, while others pronounce it with the vowel of 'Scarlet', which seems more natural in British English. In the way a community likes to be eccentric, the word is apparently rightly pronounced in the manner of 'Scarlet'. However, such things are misnomers at best, at least given the literal abbreviation roots of a 'Scalable Language'.


13 September 2016

Pragmatic Programmer

Every programming language has its strengths and weaknesses. After having achieved an understanding of the theory of programming languages, it becomes easier to adapt and quickly learn new ones with a bit of practice. There are always going to be some programming languages that are more popular in industry versus others that are used more academically. With a bit of pragmatism one can work out when to use which language and treat it as a dispensable tool. When new programming constructs and dialects evolve out of research, to tame the increasingly dynamic changes in system and application complexity, it becomes equally paramount to keep one's language skills up to date. The lists below cover the programming languages currently most used in industry and those that could potentially be next on a programmer's radar for learning.

Popular Programming Languages in Industry:

R

Evolving Programming Language Trends:

D

MapR vs Cloudera vs Hortonworks

Distributions Compared

Cloudera
MapR
Hortonworks
Pivotal HD



dezyre
curiousinsight
Four factors for comparing the top Hadoop distributions
comparing hadoop distributions

Certifications Compared

MapR has a more accessible free courseware option and a less complex pathway to learning, although they provide more customizations to their platform. Cloudera pathways are more rigorous and more expensive, but their certifications are recognized as a pedigree in the big data space. Cloudera also has significant customizations in its commercial product offerings, which means a more stable platform. Hortonworks provides flexibility between the developer, administrator, and data analyst pathways. They also cover mostly open source stacks, which can mean a less stable product offering. They provide full self-paced training, but at a premium price, as material from their essentials courses may not be sufficient for certification study. If one wants to focus on open source, choose the Hortonworks pathway. If one wants more rigor and a data scientist pathway, choose Cloudera and the CCP exam. MapR offers a developer pathway somewhere in between, which is also easier on the pocket. Ultimately, though, the employer dictates the appropriate certification choice for the workplace and the Hadoop distribution to use and support. In the end, it comes down to requirements and the value one places on the attainment and measure of certifications.

Quick Vocabulary Lesson

Kafka (publish/subscribe messaging system)
Mahout (machine learning)
Hive (map data to structures and use SQL-like queries)
Pig (data transformation language for big data)
Zookeeper (used to manage and administer Hadoop)
Sqoop (extract external sources and load to Hadoop)
Storm (real-time ETL)
Oozie (workflow scheduler)
Avro (data serialization like JSON)
Flume (ingest unstructured data)
Nutch (crawler)
Ambari (provisioning, managing, and monitoring Hadoop)
Chukwa (data collection)
Tez (data-flow framework)
Hama (big data analytics)

NoSQL store types:
Columnar (HBase, Cassandra)
KeyValue (Riak, Redis)
Document (MongoDB, CouchDB)
Graph (Neo4J, Titan)

6 September 2016

Delta Architecture

Delta

Lambda Architecture

Lambda architecture essentially is composed of the batch layer, speed layer, and the serving layer.

Kappa Architecture

Kappa architecture is essentially composed of the speed layer and serving layer. The batch layer becomes a subset of the speed layer.
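The two architectures can be sketched with a toy word count. In the Lambda sketch below, the batch layer periodically recomputes a view over the master dataset, the speed layer keeps an incremental view of events since the last batch run, and the serving layer merges both at query time; in a Kappa design, the "batch" view would simply be the same streaming code replayed over the log. All data and names here are illustrative.

```python
from collections import Counter

master_dataset = ["kafka", "spark", "kafka"]   # all events processed by the last batch run
recent_events = ["flink", "kafka"]             # events that arrived since then

batch_view = Counter(master_dataset)           # batch layer: full recompute
speed_view = Counter(recent_events)            # speed layer: incremental updates

def serve(word):
    """Serving layer: merge the batch and speed views per query."""
    return batch_view[word] + speed_view[word]

print(serve("kafka"))  # 3
print(serve("flink"))  # 1
```

When the next batch run completes, the speed view for the absorbed events is discarded, which is what keeps approximation errors in the speed layer from accumulating.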

Apache Oryx

Oryx

5 September 2016

SKOS

SKOS is a very common data model for representing knowledge in the form of thesauri or controlled vocabularies, which can provide interlinked knowledge graphs as a form of linked data. SKOS itself is defined as a lightweight and flexible OWL ontology and is available in the various RDF syntaxes. OWL, on the other hand, is a full ontology language; it is possible to convert from SKOS to OWL and even to combine them. The links below provide some related tools and libraries for working with SKOS models.
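To make the data model concrete, here is a minimal SKOS concept serialized as Turtle, assembled with plain string formatting purely for illustration (a real pipeline would use an RDF library such as rdflib). The vocabulary URI and labels are made-up examples.

```python
def skos_concept(base, slug, pref_label, alt_labels=(), broader=None):
    """Emit a single skos:Concept as a Turtle snippet."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{base}{slug}> a skos:Concept ;",
        f'    skos:prefLabel "{pref_label}"@en ;',
    ]
    for alt in alt_labels:
        lines.append(f'    skos:altLabel "{alt}"@en ;')
    if broader:
        lines.append(f"    skos:broader <{base}{broader}> ;")
    # Turtle statements end with "." rather than ";" on the last property.
    lines[-1] = lines[-1].rstrip(" ;") + " ."
    return "\n".join(lines)

ttl = skos_concept("http://example.org/vocab/", "ale", "Ale",
                   alt_labels=["pale ale"], broader="beer")
print(ttl)
```

The `skos:broader`/`skos:narrower` pairs are what give a SKOS vocabulary its thesaurus-like hierarchy, and `skos:exactMatch` links are what interlink vocabularies into a knowledge graph.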

JSKOS
SKOSAPI
OWLAPI
SKOSEd
OpenSKOS
TemTres
THManager
PoolParty
TopBraid
Thesaurus Master
Lexaurus
Fluent Editor
Intelligent Topic Manager
SKOS2OWL
Protege
SKOSIFY
Poolparty Consistency Checker
KEA
SKOSMOS
SILK

W3C SKOS
SKOS: A Guide for Information Professionals
SKOS Taxonomy
The Accidental Taxonomist
Knowledge Engineering with Semantic Web Technologies
LinkedData Engineering
PoolParty Academy
Gate
Ontotext
Knowledge Extraction
Taxonomy Warehouse
Synaptica

31 August 2016

Artificial Intelligence for Retail

The outline below covers some areas of feature engineering that could be addressed using various machine learning and deep learning techniques. There are also options here to build significant robots or drones.

In-Store Analytics (conversion of customers when in the store)
  • sentiments of customers
  • product vs purchase stock history
  • order fulfillment
  • stock and inventory monitoring
  • loyalty promotions
  • personalization
  • helping customers find bargains
  • identifying customer shopping basket history
  • maximization of conversion
  • shelving analysis (which products get bought more when placed next to which products)
  • supply-chain on-demand by product (product just bought, re-shelf, check stock availability)
  • curiosity shopping conversion
  • price analytics
  • offer tracking
  • streaming offers for loyalty customers
  • ereceipts
  • discount tiers (more customer buys the more discounts they get)
  • targeting age group buying habits
  • semantic search relevance
  • customer agents
  • cashier agents
  • warehouse agents
  • morelikethis
  • track customer experience
  • deep insights on product recommendations (i like this heel, this color, this buckle, perfect!)
  • nutritionist/wellness agents (people that are positively conscientious of their health)
  • product tabs
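The shelving and basket-history items above boil down to counting which products co-occur in the same basket. A minimal pure-Python sketch over made-up transaction data follows; real systems would use dedicated algorithms such as Apriori or FP-Growth over far larger data.

```python
from collections import Counter
from itertools import combinations

# Fabricated baskets for illustration.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
]

pair_counts = Counter()
for basket in baskets:
    # count every unordered product pair within a basket
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(2))  # the most frequently co-purchased pairs
```

Dividing a pair's count by the number of baskets containing either item turns these raw counts into support and confidence, the usual association-rule metrics.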

Out-Store Analytics (People passing outside the store)
  • competitor insights (is the product cheaper at retailer Y, is the product available at retailer X)
  • insight on window dressing (engage the right window dressing to attract customers)
  • insights on social media/viral marketing
  • promotions
  • augmented reality (customers can check product availability at X retailer, price, promotions, etc)
  • lead generation
  • product trends
  • reviews
  • social sentiment about the store/identify why the customer does not enter the store
  • effectiveness of advertising to conversion ratio
  • local store optimization (location)

Types of customers:
  • Curiosity/Bargain Shoppers
  • Spendthrift/Impulse Shoppers (heavy shopping one day, no shopping the next - mood swingers)
  • Loyalty/Informed Shoppers
  • Indecisive Shoppers
  • Wanderers
  • Complainers
  • Green Shoppers (Vegans, Weight Watchers, Calorie/Nutrition, Organic, Free Range, Gluten-Free, Religious)

Other Areas:
  • Time of Day (night, morning, afternoon, weekday, weekend, bank holiday)
  • Product Categorization and Labeling
  • Teens
  • Professionals
  • Parents
  • Pensioners
  • Students
  • Tourist
  • Singles
  • Kids

Core Areas of Retail Analytics:
  • In-Store (local user experience)
  • Out-Store (local user experience)
  • Ecommerce (online user experience)
  • Home Delivery (remote user experience)
  • Supply-Chain & Logistics (inventory/product/stock/transportation/warehousing)
  • Pricing/Loyalty (core sales/loyalty user experience/competition)
  • Contextual Advertising (core marketing)
  • Social Media/Rumor Mill (Sentiment Analysis - Reviews/Brand/Product/Event/Experience)
  • Recruitment (staffing)
  • Geographical (Locational/Regulatory Compliance)
  • Security (Local/Online/Remote)
  • CRM (core 360)

Key In-Store Q’s for Analytics:
  • Shrinkage - shoplifters/theft/weak links
  • Managing the Moment - achieving customer needs in real-time
  • Measure Customer In-store experience (wants/needs/desires)
  • What drives sales conversion
  • What is not in the basket
  • Feedback on Promotions/Effectiveness of Promotions
  • Complex Data Insights/Summaries
  • Information on ROI - Return on Investment
  • Optimization for Product Mix - What moves the Shopper to Purchase
  • Shopper vs Buyer

Pricing and Semantic Publishing Pipeline


Microservices Subsystems for Data Protection


Apache Spark Architecture



CV and JobSeeker Profile Enrichment Pipeline


Entity Extraction Enrichment Pipeline


Generalization of Machine Learning Pipeline

Open-Domain Question/Answering Pipeline


Data Science Projects

Data sources and project descriptions:
  • Land Registry 10 Years Data: Build a story visualization of sold property prices and a timeline of trends across the UK
  • Marvel API: Using the Marvel API and social media, collect, mine, and build a comical visualization story for characters
  • TFL Data Feeds: Track TFL data across London
  • Local Urban Data: WhatsOn, Congestion, Events, Hubbub, GeoLocation
  • Social Media, Blogs, News, Reviews: Product or brand tracking/engagement on the web
  • Github, Twitter, Meetups, Quora, Stackoverflow, MailingLists, Stackshare: Monitor/track technology trends (Big Data, ML, batch/stream processing, etc)
  • Social Media, Blogs, News, Alerts: Monitor and visualize political risk, events, and trends with a story timeline
  • Google N-Grams, Gutenberg, Wiktionary, WordNet, etc: Spelling checker using word2vec/GloVe
  • Single and Multi-Documents (News Feeds, Journals, Business Documents, etc): Information extraction (summary, topic tags, language detection, author, etc)
  • Santander: Measuring customer satisfaction
  • HomeDepot: Search relevance of search terms
  • Company House, Social Media, Corporate Sites, Compliance, Angelist: Track companies with partners, creditors, suppliers, sponsors, buyers
  • Walmart: Use historical data to predict store sales
  • Historical Stock Prices, News: Monitor and track stock prices and news for forecasting
  • WorldBank Datasets & Indicators, UK Office of National Statistics, US Census Data, IMF Data, Census Hub, and others: Track and visualize census data across regions
  • World University Rankings: Find the best universities of the world
  • World Food Facts: Find the nutrition facts in foods
  • Reddit Comments: Storytelling and visualization of contextualized comments on Reddit
  • Handwriting and Digits: Training a computer to detect handwriting
  • Faces: Training a computer to detect facial expressions
  • Twitter and others: Building a profile of how people view the EU
  • Cats and Dogs Dataset: Distinguish dogs from cats
  • Any music/video stream: Write a stream sampler that takes a random (representative) sample of size k from a stream of values of unknown and possibly very large length; receiving data, the sampler should work with two kinds of inputs: values piped directly into the process (stdin), and values generated using a good random source
  • Expedia Hotels: Which hotel type will an Expedia customer book; learning to rank hotels
  • Amazon Fine Foods: Analyze reviews; what does the product-reviewer graph look like? What words tend to indicate positive and negative reviews? What types of food products get reviewed the most? How does the review score distribution vary across reviewers? What makes a review helpful?
  • NIPS 2015: Analyze and explore research papers and citations
  • Data Curation/Scraping + DBPedia: Ontology engineering of a few custom/domain contexts, scraping, building a commonsense graph/reasoning
  • Anomaly Detection (Spam, Fraud, Fault, Network): Monitor, track, and identify anomalies in data
  • Domain Data: Monitor/track domain websites
  • Images/Videos/Music/Shows/News Feeds/Twitter/Facebook/Reviews: Develop semantic recommendations (processing multiple types of streams)
  • FAQ sources: Build a FAQ graph and recommendation for technology
  • Recipes, Barcodes, etc: Mining ingredients for wellness, nutrition, religion, quantified self, fitness, and health
  • Museum, gallery, and library (WorldCat) datasets, catalogs, Library of Congress, etc: Mining and visualization of connected archives
  • Relevant contextual dataset: Real-time topic extraction in NLP to do recommendations using LDA
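The stream-sampler project above is the classic reservoir sampling algorithm: keep the first k values, then replace a random reservoir slot with decreasing probability, so every element of a stream of unknown length ends up in the sample with equal probability k/n. A minimal sketch (stdin handling omitted):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for i, value in enumerate(stream):
        if i < k:
            reservoir.append(value)       # fill the reservoir first
        else:
            j = rng.randint(0, i)         # uniform over all i+1 items seen
            if j < k:
                reservoir[j] = value      # replace with probability k/(i+1)
    return reservoir

sample = reservoir_sample(range(1_000_000), 5)
print(len(sample))  # always 5, regardless of stream length
```

For the stdin variant, the same function can be fed `(line.rstrip() for line in sys.stdin)` since it only needs a single pass over an iterable.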

Public Data Sources