31 August 2016

Data Science Projects

| Data Source | Description |
| --- | --- |
| Land Registry 10 Years Data | Build a story visualization of sold property prices and a timeline of trends across the UK |
| Marvel API | Using the Marvel API and social media, collect, mine and build a comical visualization story for characters |
| TfL Data Feeds | Track TfL data across London |
| Local Urban Data | WhatsOn, Congestion, Events, Hubbub, GeoLocation |
| Social Media, Blogs, News, Reviews | Product or brand tracking/engagement on the web |
| GitHub, Twitter, Meetups, Quora, Stack Overflow, Mailing Lists, StackShare | Monitor/track technology trends (Big Data, ML, batch/stream processing, etc.) |
| Social Media, Blogs, News, Alerts | Monitor and visualize political risk, events, and trends with a story timeline |
| Google N-Grams, Gutenberg, Wiktionary, WordNet, etc. | Spelling checker using word2vec/GloVe |
| Single and Multi-Documents (News Feeds, Journals, Business Documents, etc.) | Information extraction (summary, topic tags, language detection, author, etc.) |
| Santander | Measure customer satisfaction |
| Home Depot | Search relevance of search terms |
| Companies House, Social Media, Corporate Sites, Compliance, AngelList | Track companies with partners, creditors, suppliers, sponsors, buyers |
| Walmart | Use historical data to predict store sales |
| Historical Stock Prices, News | Monitor and track stock prices and news for forecasting |
| World Bank Datasets & Indicators, UK Office for National Statistics, US Census Data, IMF Data, Census Hub, and others | Track and visualize census data across regions |
| World University Rankings | Find the best universities in the world |
| World Food Facts | Find the nutrition facts for foods |
| Reddit Comments | Storytelling and visualization of contextualized comments on Reddit |
| Handwriting and Digits | Train a computer to detect handwriting |
| Faces | Train a computer to detect facial expressions |
| Twitter and others | Build a profile of how people view the EU |
| Cats and Dogs Dataset | Distinguish dogs from cats |
| Any music/video stream | Write a stream sampler that takes a random (representative) sample of size k from a stream of values of unknown and possibly very large length. The sampler should work with two kinds of input: values piped directly into the process (stdin), and values generated using a good random source |
| Expedia Hotels | Predict which hotel type an Expedia customer will book; learning to rank hotels |
| Amazon Fine Foods | Analyze reviews: What does the product-reviewer graph look like? What words tend to indicate positive and negative reviews? What types of food products get reviewed the most? How does the review score distribution vary across reviewers? What makes a review helpful? |
| NIPS 2015 | Analyze and explore research papers and citations |
| Data Curation/Scraping + DBpedia | Ontology engineering of a few custom/domain contexts, scraping, building a commonsense graph/reasoning |
| Anomaly Detection (Spam, Fraud, Fault, Network) | Monitor/track/identify anomalies in data |
| Domain Data | Monitor/track domain websites |
| Images/Videos/Music/Shows/News Feeds/Twitter/Facebook/Reviews | Develop semantic recommendations (processing multiple types of streaming data) |
| FAQ sources | Build a FAQ graph and recommendations for technology |
| Recipes, Barcodes, etc. | Mine ingredients for wellness, nutrition, religion, quantified self, fitness and health |
| Museum, gallery, and library (WorldCat) datasets, catalogs, Library of Congress, etc. | Mining and visualization of connected archives |
| Relevant contextual dataset | Real-time topic extraction in NLP using LDA, to drive recommendations |
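The stream sampler project in the table above is a natural fit for reservoir sampling, which keeps a uniform random sample of size k without knowing the stream length in advance. The table does not name a technique, so treat this as one possible starting point rather than the intended solution. The function names `reservoir_sample`, `sample_lines` and `random_values` are made up for illustration, and `random.SystemRandom` is assumed here as the "good random source".

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Uses the classic reservoir approach (Algorithm R): one pass over the
    stream and O(k) memory, regardless of how long the stream is.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k/(i+1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


def sample_lines(handle: Iterable[str], k: int,
                 rng: Optional[random.Random] = None) -> List[str]:
    """Sample k lines from a text stream such as sys.stdin (hypothetical helper)."""
    return reservoir_sample((line.rstrip("\n") for line in handle), k, rng)


def random_values(n: int, rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield n floats in [0, 1); defaults to the OS entropy source (an assumption)."""
    rng = rng or random.SystemRandom()
    for _ in range(n):
        yield rng.random()
```

For piped input, something like `sample_lines(sys.stdin, 5)` (after `import sys`) covers the first input kind, while `reservoir_sample(random_values(1_000_000), 5)` covers the second. If the stream holds fewer than k values, the sample simply contains all of them.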

Public Data Sources