10 June 2016

Brexit Pipeline

Studying sentiment analysis in the context of Brexit (the EU Referendum) is currently an intensive area, as the polling stations will very soon be open to voters. Input sources from social media and news feeds can be a focal point for storytelling about the various events: they can be consumed as streams, processed with machine learning, and then indexed into Elasticsearch for summary. A sample workflow is provided below; the reader will notice that it also serves as an example for learning Apache Flink.

The workflow can be modified as required. For example, one could use a Redis cache layer between the machine learning process and Elasticsearch, or extend the pipeline with an NLP toolkit (GATE/UIMA, or simply OpenNLP/CoreNLP) for information extraction. One could replace Apache Flink with Spark or GraphLab, or replace Kafka with Kinesis and apply the AWS Data Pipeline, with the data sources stored in S3. One could even use DL4J with Spark on Elastic MapReduce to apply a deep learning approach in the form of a convolutional neural network model, although Python developers may be more inclined to use Theano, TensorFlow, and possibly RabbitMQ. For a graph representation one could use Titan, GraphX, Elasticsearch Graph, Cayley, PowerGraph, or Gelly, among others.

As one can see, there are several ways of implementing a solution on a case-by-case basis to translate the requirements of stories. However, prototyping small and then scaling out incrementally is always the best way to go, i.e. fail fast.

Input -> Kafka -> Apache Flink -> Elasticsearch -> Output
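
As a concrete starting point, below is a minimal sketch of this workflow using Flink's Scala API with the Kafka 0.9 connector. The broker address, the brexit-tweets topic name, and the toy lexicon scorer are illustrative assumptions, and the Elasticsearch sink is stubbed out with print() to keep the sketch self-contained.

import java.util.Properties

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object BrexitPipeline {

  // Toy lexicon scorer standing in for the machine learning step.
  val positive = Set("hope", "good", "strong", "win", "confident")
  val negative = Set("fear", "bad", "weak", "risk", "crisis")

  def score(text: String): Int = {
    val tokens = text.toLowerCase.split("\\W+")
    tokens.count(positive) - tokens.count(negative)
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Kafka consumer configuration; broker address and group id are assumptions.
    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("group.id", "brexit-sentiment")

    // Log -> Analyze: read raw texts from the hypothetical "brexit-tweets"
    // topic and attach a sentiment score to each.
    val scored = env
      .addSource(new FlinkKafkaConsumer09[String]("brexit-tweets", new SimpleStringSchema(), props))
      .map(t => (t, score(t)))

    // Serve & Store: printed here for brevity; a real deployment would write
    // to Elasticsearch via the flink-connector-elasticsearch module.
    scored.print()

    env.execute("Brexit Sentiment Pipeline")
  }
}

Keeping the scoring behind a single score() function is deliberate: it is the seam where OpenNLP/CoreNLP, or a trained model, could later be swapped in without touching the pipeline wiring.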

Steps:
  1. Collect (gather items from the input sources)
  2. Log (publish them to Kafka)
  3. Analyze (score them with Apache Flink)
  4. Serve & Store (index results into Elasticsearch; a minimal indexing sketch follows this list)
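
For the Serve & Store step, documents can be indexed into Elasticsearch over its REST API. The sketch below uses only the JDK; the brexit index, tweet type, and localhost:9200 address are assumptions, and in a real deployment Flink's Elasticsearch connector would write documents directly from the stream.

import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}

object IndexToElasticsearch {
  def main(args: Array[String]): Unit = {
    // A scored document as produced by the Analyze step.
    val doc = """{"text": "Feeling confident about the result", "sentiment": 1}"""

    // POST to /index/type lets Elasticsearch assign the document id.
    val url = new URL("http://localhost:9200/brexit/tweet")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)

    val out = new OutputStreamWriter(conn.getOutputStream)
    out.write(doc)
    out.close()

    // 201 Created indicates the document was indexed.
    println(s"Elasticsearch responded: ${conn.getResponseCode}")
    conn.disconnect()
  }
}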
List of Input Sources:

  - Social media streams (e.g., the Twitter streaming API)
  - News feeds (e.g., RSS from major outlets)
  - Live polling data, where available

As a side note, GNIP and DataSift provide an entire data-source pipeline for building out a firehose of streaming inputs. Live polling data can also be used to gather voting trends as they happen. Once the referendum is past, one can probably get hold of the full dataset or an API for it.
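
Whichever source is used, the Collect and Log steps end with items being published to a Kafka topic. Below is a minimal producer sketch using the Kafka Java client from Scala; the broker address, the brexit-tweets topic, and the hard-coded sample texts are stand-ins for a real Twitter/RSS/firehose collector.

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object CollectToKafka {
  def main(args: Array[String]): Unit = {
    // Producer configuration; broker address is an assumption.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Stand-ins for items pulled from the Twitter streaming API, an RSS
    // poller, or a GNIP/DataSift firehose.
    val samples = Seq(
      "Feeling confident the economy stays strong",
      "Real fear of a crisis after the vote")

    samples.foreach(text => producer.send(new ProducerRecord[String, String]("brexit-tweets", text)))
    producer.close()
  }
}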