Data Source | Description |
Land Registry 10 Years Data | Build a story visualization of sold property prices and a timeline of trends across the UK |
Marvel API | Using the Marvel API and social media, collect, mine and build a comical visualization story for characters |
TFL Data Feeds | Track TFL Data across London |
Local Urban Data | WhatsOn, Congestion, Events, Hubbub, GeoLocation |
Social Media, Blogs, News, Reviews | Product or Brand tracking/engagement on the web |
GitHub, Twitter, Meetups, Quora, Stack Overflow, mailing lists, StackShare | Monitor/track technology trends (Big Data, ML, batch/stream processing, etc.) |
Social Media, Blogs, News, Alerts | Monitor and visualize political risk, events, and trends with a story timeline |
Google N-Grams, Gutenberg, Wiktionary, WordNet, etc. | Spelling checker using word2vec/GloVe |
Single and Multi-Documents (News Feeds, Journals, Business Documents, etc) | Information Extraction (Summary, Topic Tags, Language Detection, Author, etc) |
Santander | Measuring customer satisfaction |
HomeDepot | Search relevance of search terms |
Companies House, Social Media, Corporate Sites, Compliance, AngelList | Track companies with partners, creditors, suppliers, sponsors, buyers |
Walmart | Use historical data to predict store sales |
Historical Stock Prices, News | Monitor and track stock prices and news for forecasting |
World Bank Datasets & Indicators, UK Office for National Statistics, US Census Data, IMF Data, Census Hub, and others | Track and visualize census data across regions |
World University Rankings | Find the best universities of the world |
World Food Facts | Find the nutrition facts in foods |
Reddit Comments | Storytelling and visualization of contextualized comments on Reddit |
Handwriting and Digits | Training a computer to detect handwriting |
Faces | Training a computer to detect facial expressions |
Twitter and others | Building a profile of how people view the EU |
Cats and Dogs Dataset | Distinguish Dogs from Cats |
Any music/video stream | Write a stream sampler that takes a random (representative) sample of size k from a stream of values of unknown and possibly very large length. The sampler should work with two kinds of input: values piped directly into the process (stdin), and values generated using a good random source |
Expedia Hotels | Predict which hotel type an Expedia customer will book (learning to rank hotels) |
Amazon Fine Foods | Analyze reviews: What does the product-reviewer graph look like? What words tend to indicate positive and negative reviews? What types of food products get reviewed the most? How does the review score distribution vary across reviewers? What makes a review helpful? |
NIPS 2015 | analyze and explore research papers, citations |
Data Curation/Scraping + DBPedia | ontology engineering of a few custom/domain contexts, scraping, building a commonsense graph/reasoning |
Anomaly Detection (Spam, Fraud, Fault, Network) | Monitor/Track/Identify Anomalies from Data |
Domain Data | Monitor/Track Domain Websites |
Images/Videos/Music/Shows/News Feeds/Twitter/Facebook/Reviews | Develop semantic recommendations (processing multiple types of streaming) |
FAQ sources | Build a FAQ graph and recommendation for technology |
Recipes, Barcodes, etc | mining ingredients for: wellness, nutrition, religion, quantified self, fitness and health |
Museum, gallery, and library (WorldCat) datasets, catalogs, Library of Congress, etc. | Mining and visualization of connected archives |
Relevant contextual dataset | Real-time topic extraction (NLP) using LDA to drive recommendations |
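
The stream sampler idea in the "Any music/video stream" row can be sketched with reservoir sampling (Algorithm R). The table does not name a technique, so this choice is an assumption. The helper names `reservoir_sample`, `sample_lines`, and `random_values` are hypothetical, and `random.SystemRandom` stands in for "a good random source".

```python
import random
from typing import Iterable, Iterator, List, Optional, TextIO, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform random sample of at most k items from a stream.

    The stream is consumed once and only k items are held in memory,
    so its length may be unknown and very large.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if rng is None:
        rng = random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def sample_lines(handle: TextIO, k: int) -> List[str]:
    """Input kind 1: values piped into the process, one per line."""
    return reservoir_sample((line.rstrip("\n") for line in handle), k)


def random_values(n: int) -> Iterator[int]:
    """Input kind 2: n values drawn from the OS entropy source (assumed
    here to be an acceptable 'good random source')."""
    source = random.SystemRandom()
    for _ in range(n):
        yield source.randrange(256)
```

For piped input, a script could call `sample_lines(sys.stdin, 5)` after `import sys`. For generated input, `reservoir_sample(random_values(100000), 5)` draws the sample without storing the whole stream.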
Public Data Sources