Mabble Rabble: October 2016

29 October 2016

Big Data Stream Processing

Spark
Flink
DataFlow/Beam
Streamsets

awesome streaming

26 October 2016

Machine Learning Taxonomy

Machine learning is about designing algorithms that give a computer the means to learn, often by finding patterns in data. The list below outlines the key taxonomy areas of machine learning.

Supervised Learning

Semi-Supervised Learning
Unsupervised Learning
Reinforcement Learning
Transduction
Learning to Learn

Scala Data Tools

Listed below are the general mathematics and machine learning data tools that have emerged in Scala, aside from the Hadoop and Scala APIs for databases.

Algebird: Twitter’s API for abstract algebra that can be used with almost any Big Data API.
Factorie: A toolkit for deployable probabilistic modeling, with a succinct language for creating relational factor graphs, estimating parameters, and performing inference.
Figaro: A toolkit for probabilistic programming.
H2O: A high-performance, in-memory distributed compute engine for data analytics. Written in Java with Scala and R APIs.
Relate: A thin database access layer focused on performance.
ScalaNLP: A suite of Machine Learning and numerical computing libraries. It is an umbrella project for several libraries, including Breeze, for machine learning and numerical computing, and Epic, for statistical parsing and structured prediction.
ScalaStorm: A Scala API for Storm.
Scalding: Twitter’s Scala API around Cascading that popularized Scala as a language for Hadoop programming.
Scoobi: A Scala abstraction layer on top of MapReduce with an API that’s similar to Scalding’s and Spark’s.
Slick: A database access layer developed by Typesafe.
Spark: The emerging standard for distributed computation in Hadoop environments, as well as in Mesos clusters and on single machines (“local” mode).
Spire: A numerics library that is intended to be generic, fast, and precise.
Summingbird: Twitter’s API that abstracts computation over Scalding (batch mode) and Storm (event streaming).

25 October 2016

Reactive Manifesto

The Reactive Manifesto is an effort to provide a definition of what a reactive system should look like with four sets of characteristics:

Message or Event-driven: As a baseline the system needs to respond to messages or events

Elastically Scalable: System needs to meet scale out demands (horizontal scaling via processes, cores, nodes)

Resilient: System needs to be able to recover gracefully from failures

Responsive: System is available for service requests even if this means graceful degradation of failed components during high traffic
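As a rough illustration of the characteristics above, here is a minimal Python sketch of a message-driven component. The names `Mailbox`, `run`, and `halve` are hypothetical and do not come from the Reactive Manifesto or from any library. A bounded mailbox rejects messages quickly when full, which is one simple form of graceful degradation, and the processing loop isolates handler failures so that one bad message does not stop the rest.

```python
from collections import deque


class Mailbox:
    """Bounded mailbox: a hypothetical stand-in for an actor's message queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.messages = deque()
        self.dropped = 0

    def offer(self, message):
        # Responsive: reject quickly instead of blocking when the mailbox is full.
        if len(self.messages) >= self.capacity:
            self.dropped += 1
            return False
        self.messages.append(message)
        return True


def run(mailbox, handler):
    """Drain the mailbox, isolating handler failures (resilient)."""
    results, failures = [], []
    while mailbox.messages:
        message = mailbox.messages.popleft()
        try:
            results.append(handler(message))
        except Exception as error:
            # Record the failure and carry on with the next message.
            failures.append((message, str(error)))
    return results, failures


def halve(n):
    """Example handler that fails on odd input."""
    if n % 2:
        raise ValueError("odd input")
    return n // 2


box = Mailbox(capacity=3)
accepted = [box.offer(n) for n in [2, 3, 4, 6]]  # the fourth message is shed
results, failures = run(box, halve)
```

Elastic scaling is not shown here; in a real system it would mean running more such consumers across processes, cores, or nodes.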

Reactive Extensions
Functional Reactive Programming
Akka (Actors Model)

21 October 2016

Alternatives to Kafka

Kinesis
RabbitMQ
ZeroMQ
Kudu
Storm
Samza
SQS
Redis
Aeron
MAPR Streams

Kafka for Beginners
Confluent

Note that Storm and Samza can in fact be used alongside Kafka in a data pipeline. How one plans to use a platform depends on the constraints of the problem at hand, including whether the data arrives as batches or as real-time streams.

18 October 2016

Beer Slangs

Homebrew, the Mac package manager, uses a beer analogy. Beer is also a staple of social gatherings within the data science field, and it has become an essential element of society. Over the years it has picked up a diverse set of regional slang terms as well as a variety of flavors from around the world. One could even produce an ontology for beer, treating it as a concept or thing as well as a product with a set of ingredients, categories, and tastes. In the process, this would help people explore and produce a recommendation graph tied to their evolving tastes, merry meetups, and choices of food accompaniment.

beer slang
thrillist
beerslanging
15 brewtastic ways say beer
craftbeer
alldownunder
irishdrinking
1800s beer slang

13 October 2016

Frozen Yogurts in London

Frozen yogurt offers an interesting analogy for applying machine learning, or more specifically data science, to understanding the customer based on their scoops and taste choices. Analytics has given way to self-service frozen yogurt, which puts the choice of flavors in the hands of the user and in the process improves the customer experience. This marks a value shift towards the user and the data associated with them. It also shows that huge investments are not needed to shift business models; self-service actually reduces labor costs. This is all part of using analytics to maximize revenue. By shifting control to the user, one gives the customer better satisfaction and a sense of assurance that they are getting their money's worth. The list below provides a few interesting frozen yogurt places in the dynamic society of London.

Pinkberry
Snog
Itsu
Frae
Moosh
Moto Yogo
Yoomoo
Yogland

12 October 2016

Sentiment Ontologies

SenticNet
Marl
Onyx
EmotionML
EmotionML Vocabularies
Lemon
FOAF

Sentiment Analysis in Social Networks

6 October 2016

SQL on Hadoop Frameworks

Presto
Spark SQL
Impala
Phoenix
Drill
Tajo
Hive

5 October 2016

Visual Business Intelligence & Analytics

Tableau
Qlik
Pentaho
Looker
SAS Visual Analytics
Tibco Spotfire
SiSense
GoodData
Alteryx
Google Chart Tools
Geckoboard
Raw
NVD3
Google Fusion Tables

Intelligent Data Center

The holy grail of the data center is complete automation and intelligent management of all the services, infrastructure, storage, security, and data. However, to get to that point one has to think outside the box of a standard system. Data centers run many complex, large-scale applications that are difficult to manage. There is ultimately a requirement to manage the infrastructure at massive scale, especially for Big Data and the abstractions of the Cloud. Why do we need engineers in data centers when we can converge, automate, and build software that allows intelligent agents to do our work for us? How does one bring intelligence into an existing system or its sub-systems? Through machine learning and the representation of knowledge. The following sections look at various ways of tackling the impedance and complexity of the data center, as well as intelligent data protection services.

Key areas identified for data center efficiency and management:

data center operation automation
characterization and synthesis of workload spikes
dynamic resource allocation
quick and accurate identification of recurring performance problems
optimization of systems
energy resource optimisation
fault tolerance
operational readiness and maintenance
fundamental protection of data
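To make one item from the list above concrete, the characterization of workload spikes, here is a minimal Python sketch that flags a sample when it rises well above the mean of the preceding window. The function name `find_spikes`, its parameters, and the sample CPU figures are all assumptions made for illustration; a production system would likely use something more robust.

```python
from statistics import mean, pstdev


def find_spikes(samples, window=5, threshold=3.0, min_sigma=1.0):
    """Return indices whose value exceeds mean + threshold * stdev
    of the `window` samples immediately before them."""
    spikes = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        # Floor the deviation so a perfectly flat window does not
        # flag every tiny increase as a spike.
        sigma = max(pstdev(history), min_sigma)
        if samples[i] > mu + threshold * sigma:
            spikes.append(i)
    return spikes


# Hypothetical CPU load percentages sampled at regular intervals.
cpu_load = [40, 42, 41, 39, 40, 41, 95, 42, 40, 41]
spike_indices = find_spikes(cpu_load)
```

Flagged indices could then feed the dynamic resource allocation step, for example by triggering a scale-out.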

Knowledge representation is already available in databases. However, this is not semantic enough for agents to understand. Ontologies go further: they also help to categorise information and make it easier to search. One immediate benefit is smart and custom categorization, along with sensible defaults and the merging of both. This can be extended to all products and services, to a data protection ontology, and to a knowledge representation for the entire data center. Such approaches can even be applied programmatically so that agents can infer and reason about concepts and things. Semantic ontologies can also be applied to entity management, search, and dynamic reasoning. Various reasoning approaches can be applied, moving beyond the constraints of simple rules to more complex forms such as probabilistic, commonsense, deductive, and inductive reasoning. Once semantics are applied, an agent can use a BDI (belief-desire-intention) approach and ultimately gain from argumentation in communication via game theory. Smarter agents mean better data protection and more cohesive fault tolerance, recovery, and management of data centers. Extending this into graph theory, one can build a complete map of everything that is happening in a data center, or in a set of geographically dispersed data centers for a customer, using connected linked data for knowledge discovery as well as complex networks. Recommendations can be derived using approaches such as LDA and matrix factorisation. Ultimately the semantic knowledge branches will grow and consume more data incrementally, like a connected knowledge graph, through clustering and even through word embedding approaches such as word2vec or GloVe. Tagged data can feed topic maps and help identify context. Probabilistic reasoning can also add value through contextual scores.
Going further, approaches like deep learning can extend from a semantic Bayesian network to a deep belief network that captures the complexities of data center infrastructure and harnesses the existing resources to build a conceptual map of the world. Another approach is a global optimisation strategy in the form of swarm intelligence, inspired by natural computation, for foraging for and detecting points of anomaly and increasing fault tolerance. There are even benefits here for reducing energy consumption, which makes data center management more cost effective. This only scratches the surface of what automation of the data center should mean. Consider an instant advisor to engineers and managers, delivered through a mobile phone smart bot like Siri or AlphaGo, that can answer questions about the data center and provide deep insights while also connecting with the client to make their lives simpler and easier. Is just providing an informational dashboard sufficient? Perhaps being able to interact with a virtual advisor through the dashboard would add to the whole conversation between the engineer and the objective of managing a data center effectively. Having lots of data is good, but it truly matters when one can automate and make that data semantically meaningful for human consumption, and also reduce the burden of storage and infrastructure management through intelligent means without any manual intervention. A question might also arise here as to whether the limits on insight lie to some degree in the third-party software. If so, can these products be enriched to provide even more contextualization and intelligence?
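The idea of agents inferring over a knowledge representation can be sketched in a few lines of Python. This is only a toy under stated assumptions: the triples, the `dependsOn` and `locatedIn` predicates, the component names, and the `impacted_by` function are all invented for illustration and stand in for what a real ontology and reasoner would provide. The sketch deduces everything affected by a failure by following dependency links transitively.

```python
def impacted_by(failed, triples):
    """Return every component that depends, directly or transitively, on `failed`."""
    # Index the graph as: component -> set of components that depend on it.
    dependents = {}
    for subject, predicate, obj in triples:
        if predicate == "dependsOn":
            dependents.setdefault(obj, set()).add(subject)

    impacted, frontier = set(), [failed]
    while frontier:
        node = frontier.pop()
        for dependent in dependents.get(node, ()):
            if dependent not in impacted:  # also guards against cycles
                impacted.add(dependent)
                frontier.append(dependent)
    return impacted


# Hypothetical (subject, predicate, object) facts about a data center.
triples = [
    ("backup-service", "dependsOn", "storage-array-1"),
    ("billing-api", "dependsOn", "backup-service"),
    ("dashboard", "dependsOn", "billing-api"),
    ("dashboard", "locatedIn", "london-dc"),
    ("mail-relay", "dependsOn", "storage-array-2"),
]

affected = impacted_by("storage-array-1", triples)
```

A virtual advisor of the kind described above could use such an inference to answer a question like "what breaks if this array fails?".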

Additionally, such approaches can be reused outside the data center context, for example in lead generation for sales representatives and in sentiment analysis for marketing and PR, helping to find new customers in the process and to identify more of the features that customers want from the product. Features can be provided to the client as modules in a microservices platform, allowing them to pick and choose a tailored option that meets their data center requirements. The goal is to provide not just actionable insights but an entire intelligent path towards automating the whole data center from the viewpoint of data management and infrastructure.

Architecture and Implementation Ideas:

Knowledge Graph: Cassandra/Titan/Elasticsearch
Deep Learning: DL4J or TensorFlow
Big Data: Spark/Flink/Hadoop, Kafka, and others
Semantic Linked Data: DBpedia, ConceptNet (analogical reasoning), WordNet, SKOS (Thesaurus server), Event Calculus (commonsense reasoning), Reasoners, semantic & faceted search
Probabilistic Reasoning: Figaro/Factorie
NLP: CoreNLP, UIMA, GATE, OpenNLP, Sphinx, DL4J/TensorFlow
Microservices: Restlet/Dropwizard, Distributed Tracing, Service Discovery, Anomaly Detection, Anomaly Correlation, Centralised Logging, Reactive Programming, Circuit Breakers
Dashboard: D3, Bokeh, Seaborn, Gephi

Connected Concepts & Things

A sample idea for connected retail:

Within the etiquette of robots.txt constraints, crawl the link graph as determined by the sitemap in order to formulate a custom ontology, which could then be linked to DBpedia, the GoodRelations schema, and various search engines (especially Google). The ontology is then map-reduced against any products and services available on Amazon. Do this across all UK and US retailers, based on the consumable context of their products and services; this may later grow regionally. Such ontological context can then be expressed as schema.org markup to enrich searchability, whether in the context of chatbots, web search, mobile, contextual advertising, or even in-store promotions.
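Two steps of this idea are easy to sketch in Python: honouring robots.txt before crawling, and emitting schema.org markup for a product. The sketch below uses the standard library's `urllib.robotparser` and builds JSON-LD with the schema.org `Product` and `Offer` types. The retailer domain, the robots.txt rules, the bot name, the product, and the helper functions `allowed_urls` and `product_jsonld` are all made up for illustration; the ontology building and Amazon matching steps are not covered.

```python
import json
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example retailer.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
"""


def allowed_urls(robots_txt, urls, agent="connected-retail-bot"):
    """Keep only the URLs that the given robots.txt permits `agent` to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return [url for url in urls if parser.can_fetch(agent, url)]


def product_jsonld(name, price, currency, url):
    """Describe a product as schema.org JSON-LD (Product with a nested Offer)."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "url": url,
        "offers": {
            "@type": "Offer",
            "price": price,
            "priceCurrency": currency,
            "availability": "https://schema.org/InStock",
        },
    }


candidate_urls = [
    "https://shop.example/products/oat-milk",
    "https://shop.example/checkout/basket",
]
crawlable = allowed_urls(ROBOTS_TXT, candidate_urls)
markup = product_jsonld("Oat Milk 1L", "1.50", "GBP", crawlable[0])
markup_json = json.dumps(markup, indent=2)
```

The resulting JSON string is what a retailer page would embed in a `script` element of type `application/ld+json` so that search engines and chatbots can pick it up.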

Example of Supermarkets:

Sainsburys
Morrisons
Asda
Lidl
Iceland
Aldi
Tesco
Walmart
Marks and Spencer
Whole Foods
Farmer's Market
Lowes
Giant
Safeway
Vons
Shoprite
Meijer
Costco
Kroger

Example of Department Stores:

Selfridges
Harrods
Macy's
Bloomingdales
Debenhams
Harvey Nichols
Fenwicks
House of Fraser
Fortnum and Mason
Marks and Spencer
Neiman Marcus
Saks
Kohls
Sears
Dillards
Nordstrom
JCPenney
Lord & Taylor
Target
KMart
Walmart
Marshalls
John Lewis

Individual Retail Brands
And various consumable and service contexts, ranging from banking to clothing/apparel and electronics. Essentially, many of the same domains as the Amazon categories.

Benefits of such an approach include:

Free and Open Source so any business can make their products and services more reachable and findable to target customers
Free for customers to compare prices
Free for customers to check for availability
Free for customer recommendations
Free for enriching localized ecommerce searching
Free for enrichment of products and services for retailers
Free for SPARQL queries
All products and services essentially become resources in context of URI/URLs
Free to check custom and focused chat bots for customers
Free to leverage insights from customer behavior via machine learning.
However, all data storage is decentralized, so there is no real localized store of personal information on customers, and all competitive data is stored on retailer systems.
Because such services provide Linked Data as a Web of Data, one can apply more NLP and Semantic Web techniques to better understand customers as well as product pricing and sales.
They are also a way to make it easier for customers to find things on the web and shop on the go. One can even target or identify customers who are not entering stores.
Through network science, one can also find clusters of customers and work out which customers to target and in what way.
Treating everything as a resource turns queries into a connected linked data graph, or knowledge graph.
Basic knowledge is already available from DBpedia, which understands what a retailer is in the context of a business, along with various other concepts and things.
It can also be applied to keeping track of new releases, new fashions, new trends, and retail news in general.
Postal Deliveries and shop at your convenience
Semantic Product/Service and Collaborative Recommendations
Semantic Sentiment Analysis on Customer Service Experience
Semantic Intent Graph formulations
Semantic Customer Understanding

The services are supposed to be free and funded by ad revenue, in order to remain objective in searchability with no preferential affiliations. Possibly even with a percentage on targeted conversions.

Such things are the natural steps towards Web 3.0 and the Internet of Things, where everything is available. Retailers are facing tough competition from Amazon, and the attempt here is to make all retailers more available, targetable, and reachable to customers as a basic enrichment of the customer service experience.

Alternatively, such approaches are currently being used for:

connected libraries
connected research
connected learning
connected businesses
connected social
connected games
connected entertainment
connected interests
connected travel
connected news
connected profiles
connected utilities
connected city
connected publishing
connected ads
connected things

This makes the most sense for connected retailers, where the market is so fiercely competitive.

3 October 2016

Scala Pronunciation

Scala is a very full-featured language. One would assume at first that a language with roots in Lausanne, Switzerland, would have an ornate history, like a circular stairway leading to some fantasy fanfare, much as its derivative logo design suggests. It is a very continental language, dispersed in many forms, which even shows in the complexity it derives from being a mix of functional and object-oriented. In reality, however, Scala is simply an abbreviation of two words combined: Scalable Language. One then wonders how the name is pronounced. Some people pronounce Scala in the literal manner of the words it derives from, 'Scale' or 'Scalable', which seems more at home in American English. Others pronounce it like 'Scarlet', which seems more at home in British English. Apparently, in the way a community likes to be eccentric, the correct pronunciation is in the manner of 'Scarlet'. However, that is a misnomer at best, at least given the literal roots of the abbreviation in 'Scalable Language'.
