26 October 2016

Machine Learning Taxonomy

Machine learning is about designing algorithms that give a computer the means to learn, often by finding patterns in data. The list below outlines the key areas of the machine learning taxonomy.

Semi-Supervised Learning
Unsupervised Learning
Reinforcement Learning
Learning to Learn

Scala Data Tools

The list below covers the general mathematics and machine learning data tools that have emerged in Scala, aside from the Hadoop and Scala APIs for databases.
  • Algebird: Twitter’s API for abstract algebra that can be used with almost any Big Data API.
  • Factorie: A toolkit for deployable probabilistic modeling, with a succinct language for creating relational factor graphs, estimating parameters, and performing inference.
  • Figaro: A toolkit for probabilistic programming.
  • H2O: A high-performance, in-memory distributed compute engine for data analytics. Written in Java with Scala and R APIs.
  • Relate: A thin database access layer focused on performance.
  • ScalaNLP: A suite of Machine Learning and numerical computing libraries. It is an umbrella project for several libraries, including Breeze, for machine learning and numerical computing, and Epic, for statistical parsing and structured prediction.
  • ScalaStorm: A Scala API for Storm.
  • Scalding: Twitter’s Scala API around Cascading that popularized Scala as a language for Hadoop programming.
  • Scoobi: A Scala abstraction layer on top of MapReduce with an API that’s similar to Scalding’s and Spark’s.
  • Slick: A database access layer developed by Typesafe. 
  • Spark: The emerging standard for distributed computation in Hadoop environments, as well as in Mesos clusters and on single machines (“local” mode).
  • Spire: A numerics library that is intended to be generic, fast, and precise.
  • Summingbird: Twitter’s API that abstracts computation over Scalding (batch mode) and Storm (event streaming). 
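
The reason an abstract-algebra layer such as Algebird pairs well with both batch and streaming engines can be sketched with a monoid: an associative "plus" with a "zero" lets partial aggregates be computed anywhere and merged later. The sketch below is plain Python and does not use Algebird's or Summingbird's actual APIs; the names zero, plus, and count_words are hypothetical and chosen only for illustration.

```python
from collections import Counter
from functools import reduce


def zero():
    """Identity element of the word-count monoid (hypothetical helper)."""
    return Counter()


def plus(a, b):
    """Associative merge of two partial word counts."""
    return a + b


def count_words(line):
    """Map one line of text to a partial aggregate."""
    return Counter(line.split())


lines = ["spark storm", "storm scalding", "spark spark"]

# Batch style: fold over the whole data set in one pass.
batch = reduce(plus, map(count_words, lines), zero())

# Streaming style: aggregate two partitions separately, then merge them.
left = reduce(plus, map(count_words, lines[:1]), zero())
right = reduce(plus, map(count_words, lines[1:]), zero())
merged = plus(left, right)
```

Because plus is associative, batch and merged hold the same counts regardless of how the input was partitioned.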

25 October 2016

Reactive Manifesto

The Reactive Manifesto is an effort to provide a definition of what a reactive system should look like with four sets of characteristics:
  • Message or Event-driven: As a baseline the system needs to respond to messages or events
  • Elastically Scalable: System needs to meet scale out demands (horizontal scaling via processes, cores, nodes)
  • Resilient: System needs to be able to recover gracefully from failures
  • Responsive: System is available for service requests even if this means graceful degradation of failed components during high traffic
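
Two of these characteristics, message-driven and resilient, can be illustrated with a minimal single-threaded sketch. It is not Akka or any real actor library: the Mailbox class, the PRICES catalogue, and both handler functions are hypothetical names introduced only for this example. A failing message produces a degraded reply instead of stopping the loop; elasticity is not covered here.

```python
from collections import deque

PRICES = {"tea": 3.5}  # hypothetical catalogue used by the handler


def price_handler(item):
    """Normal path: raises KeyError when the item is unknown."""
    return f"{item}: {PRICES[item]}"


def degraded_reply(item):
    """Fallback path: a reduced but still useful answer."""
    return f"{item}: unavailable"


class Mailbox:
    """Queues messages and processes them one at a time."""

    def __init__(self, handler, fallback):
        self.queue = deque()
        self.handler = handler
        self.fallback = fallback

    def send(self, message):
        # Senders only enqueue; they never call the handler directly.
        self.queue.append(message)

    def drain(self):
        replies = []
        while self.queue:
            message = self.queue.popleft()
            try:
                replies.append(self.handler(message))
            except Exception:
                # Isolate the failure and degrade gracefully.
                replies.append(self.fallback(message))
        return replies


mailbox = Mailbox(price_handler, degraded_reply)
mailbox.send("tea")
mailbox.send("coffee")
replies = mailbox.drain()
```

The failure on "coffee" stays contained to that one message, so the mailbox keeps answering the others.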

Reactive Extensions
Functional Reactive Programming
Akka (Actors Model)

21 October 2016

Alternatives to Kafka

MAPR Streams

Kafka for Beginners

Note that Storm and Samza can in fact be used alongside Kafka in a data pipeline. How one plans to use a platform depends on context, which is invariably dictated by the constraints of the problem at hand, whether that means batch processing or real-time streams.

18 October 2016

Beer Slangs

Homebrew uses a beer analogy as a Mac package manager. Beer is also a staple of social gatherings within the data science field. It has become an essential element of society. Over the years it has evolved a diverse set of regional slang terms as well as a variety of flavors from around the world. An ontology can even be produced for beer, both as a concept or thing and as a product with a set of ingredients, categories, and tastes. In the process, this helps people to explore and produce a recommendation graph that matches their evolving tastes, merry meetups, and choice of food accompaniment.

beer slang
15 brewtastic ways to say beer
1800s beer slang

13 October 2016

Frozen Yogurts in London

Frozen yogurt is an interesting analogy for applying machine learning, or more specifically data science, to understanding the customer based on scoops and taste choices. Analytics has given way to self-service frozen yogurt, putting the choice of flavors in the hands of the user and in the process improving the customer experience. This defines a value shift towards the user and the data that relates to them. It also shows that huge investments do not need to be made to shift business models: self-service actually reduces labor costs. This is all part of analytics aimed at maximizing revenue. By shifting control to the user, one allows a customer to attain better satisfaction and a sense of assurance that they are getting their money's worth. The list below provides a few interesting frozen yogurt places in the dynamic society of London.

Moto Yogo

5 October 2016

Visual Business Intelligence & Analytics

SAS Visual Analytics
Tibco Spotfire
Google Chart Tools
Google Fusion Tables

Intelligent Data Center

The holy grail of the data center is complete automation and intelligent management of all the services, infrastructure, storage, security, and data. However, to get to that point one has to think outside the box of a standard system. Data centers run many complex and large-scale applications that are difficult to manage. There is ultimately a requirement to manage the infrastructure at massive scale, especially for Big Data and the abstractions of the Cloud. Why do we need engineers in data centers when we can converge, automate, and build software that allows intelligent agents to do our work for us? How does one bring intelligence into an existing system or set of sub-systems? Through machine learning and the representation of knowledge. The following sections look at various ways of tackling the impedance and complexities of the data center, as well as at intelligent data protection services.

Key areas identified for data center efficiency and management:
  • data center operation automation
  • characterization and synthesis of workload spikes
  • dynamic resource allocation
  • quick and accurate identification of recurring performance problems
  • optimization of systems
  • energy resource optimisation
  • fault tolerance
  • operational readiness and maintenance
  • fundamental protection of data

Knowledge representation is already available in databases. However, this is not semantic enough for agents to understand. Ontologies go further: they also help to categorise information and facilitate searching for it. One immediate benefit is smart and custom categorization, as well as defaults and the merging of both. This can also be extended to all products and services, to a data protection ontology, and to a knowledge representation for the entire data center. Such approaches can even be applied programmatically so that agents can infer and reason on concepts and things. Semantic ontologies can also be applied to entity management, search, and dynamic reasoning. Various reasoning approaches can be applied, going beyond the constraints of simple rules to more complex reasoning styles such as probabilistic, commonsense, deductive, and inductive. Once semantics are applied, an agent can utilise a BDI (belief-desire-intention) approach and ultimately gain from argumentation in communication via game theory. Smarter agents mean better data protection and more cohesive fault tolerance, recovery, and management of data centers. Extending this further into graph theory, one can build a complete map of everything that is happening in a data center, or in a set of geographically dispersed data centers for a customer, using connected linked data for knowledge discovery as well as complex networks. Recommendations can be derived using various approaches such as LDA and matrix factorisation. Ultimately the semantic knowledge branches will grow and consume more data incrementally, like a connected knowledge graph, through clustering and even through approaches such as word2vec or GloVe for word embeddings. Tagged data can facilitate topic maps and contextual identification. Probabilistic reasoning can also add value to contextual scores.
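
As a rough illustration of an agent inferring over concepts and things, the sketch below forward-chains over a tiny set of subject-predicate-object triples: if a thing has a type, and that type is a subclass of another, the thing also has the broader type. The asset names, class names, and the infer_types function are all hypothetical, and a real deployment would more likely use an ontology language and a dedicated reasoner.

```python
def infer_types(triples):
    """Apply one rule until no new facts appear:
    (x, type, A) and (A, subClassOf, B)  =>  (x, type, B)
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for subject, predicate, obj in list(facts):
            if predicate != "type":
                continue
            for child, relation, parent in list(facts):
                if relation == "subClassOf" and child == obj:
                    new_fact = (subject, "type", parent)
                    if new_fact not in facts:
                        facts.add(new_fact)
                        changed = True
    return facts


# Hypothetical data center ontology fragment.
triples = {
    ("StorageArray", "subClassOf", "StorageDevice"),
    ("StorageDevice", "subClassOf", "DataCenterAsset"),
    ("array-01", "type", "StorageArray"),
}

facts = infer_types(triples)
```

After inference, array-01 is known to be a DataCenterAsset even though that fact was never stated directly, which is the kind of step a simple database lookup would not make on its own.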
Going further, approaches like deep learning can extend from a semantic Bayesian network into a deep belief network for understanding the complexities of data center infrastructure, harnessing the existing resources to build a conceptual map of the world. Another approach is to use a global optimisation strategy in the form of swarm intelligence, inspired by natural computation, for foraging and detecting points of anomaly and increasing fault tolerance. There are even benefits here towards reducing energy consumption for cost-effective data center management. This only touches the surface of what automation of the data center should mean. Consider an instant advisor for engineers and managers, delivered via a mobile phone smart bot like Siri or AlphaGo, that can answer questions about the data center and provide deep insights, but at the same time connect with the client in order to make their lives simpler and easier. Is just providing an informational dashboard sufficient? Perhaps being able to interact with a virtual advisor through the dashboard would add to the engineer's conversation and to the objective of managing a data center effectively. Having lots of data is good, but it truly matters when one can automate and make that data semantically meaningful for human consumption, and also reduce the burden of storage and infrastructure management through intelligent means without any manual intervention. A question might also arise here: the limits on insight lie to some degree in the third-party software. Can these products, however, be enriched to provide even more contextualization and intelligence?

Additionally, such approaches can be reused outside of the data center context, for example for lead generation for sales representatives and for sentiment analysis for marketing and PR, helping to find new customers in the process as well as to identify more features that customers want from the product. Features can be provided to the client as modules in a microservices platform, allowing them to pick and choose a tailored option that meets their data center requirements. The goal is to provide not just actionable insights but also an entire intelligent path towards the automation of the whole data center from the viewpoint of data management and infrastructure.

Architecture and Implementation Ideas:
  • Knowledge Graph: Cassandra/Titan/Elasticsearch
  • Deep Learning: DL4J or TensorFlow
  • Big Data: Spark/Flink/Hadoop, Kafka, and others
  • Semantic Linked Data: DBpedia, ConceptNet (analogical reasoning), WordNet, SKOS (Thesaurus server), Event Calculus (commonsense reasoning), Reasoners, semantic & faceted search
  • Probabilistic Reasoning: Figaro/Factorie
  • NLP: CoreNLP, UIMA, GATE, OpenNLP, Sphinx, DL4J/TensorFlow
  • Microservices: Restlet/Dropwizard, Distributed Tracing, Service Discovery, Anomaly Detection, Anomaly Correlation, Centralised Logging, Reactive Programming, Circuit Breakers
  • Dashboard: D3, Bokeh, Seaborn, Gephi

Connected Concepts & Things

A sample idea for connected retail:

Within the etiquette of robots.txt constraints, crawl the link graph as determined by the sitemap in order to formulate a custom ontology, which could then be linked to DBpedia, the GoodRelations schema, and various search engines (especially Google). The ontology is then map-reduced against any products and services available on Amazon. Do this across all UK and US retailers, based on the consumable context of products and services; this may later grow to other regions. Such ontological context can then be published as schema.org markup to enrich searchability, whether in the context of chatbots, web search, mobile, contextual advertising, or even in-store promotions.
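
The first and last steps of this idea can be sketched with the Python standard library: honouring robots.txt before touching a page, and emitting schema.org markup as JSON-LD for the pages that are allowed. The robots.txt content, the shop.example URLs, the example-bot agent name, and the product data are all made up for illustration, and no network access or real crawling takes place.

```python
import json
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed in memory rather than fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
"""

# Hypothetical pages, as if discovered via a sitemap.
PAGES = [
    {"url": "https://shop.example/products/earl-grey",
     "name": "Earl Grey Tea", "price": "3.50", "currency": "GBP"},
    {"url": "https://shop.example/checkout/basket",
     "name": "Basket", "price": "0.00", "currency": "GBP"},
]


def build_parser(robots_txt):
    """Load robots.txt rules from a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def product_jsonld(page):
    """Describe one page as a schema.org Product with an Offer."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": page["name"],
        "url": page["url"],
        "offers": {
            "@type": "Offer",
            "price": page["price"],
            "priceCurrency": page["currency"],
        },
    }


def crawl(parser, pages, agent="example-bot"):
    """Keep only pages the robots.txt rules allow, as JSON-LD."""
    return [product_jsonld(p) for p in pages
            if parser.can_fetch(agent, p["url"])]


markup = crawl(build_parser(ROBOTS_TXT), PAGES)
print(json.dumps(markup, indent=2))
```

The checkout page is skipped because of the Disallow rule, so only the product page is turned into markup. Linking the result to DBpedia or GoodRelations would be a further step not shown here.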

Example of Supermarkets:
  • Sainsbury's
  • Morrisons
  • Asda
  • Lidl
  • Iceland
  • Aldi
  • Tesco
  • Walmart
  • Marks and Spencer
  • Whole Foods
  • Farmer's Market
  • Lowes
  • Giant
  • Safeway
  • Vons
  • Shoprite
  • Meijer
  • Costco
  • Kroger

Example of Department Stores:
  • Selfridges
  • Harrods
  • Macy's
  • Bloomingdale's
  • Debenhams
  • Harvey Nichols
  • Fenwick
  • House of Fraser
  • Fortnum and Mason
  • Marks and Spencer
  • Neiman Marcus
  • Saks
  • Kohl's
  • Sears
  • Dillard's
  • Nordstrom
  • JCPenney
  • Lord & Taylor
  • Target
  • Kmart
  • Walmart
  • Marshalls
  • John Lewis

Individual Retail Brands
There are also various consumable and service contexts, ranging from banking to clothing and apparel and electronics: essentially, many of the same domains as Amazon's categories.

Benefits of such an approach include:
  • Free and Open Source so any business can make their products and services more reachable and findable to target customers
  • Free for customers to compare prices
  • Free for customers to check for availability
  • Free for customer recommendations
  • Free for enriching localized ecommerce searching
  • Free for enrichment of products and services for retailers
  • Free for SPARQL queries
  • All products and services essentially become resources in context of URI/URLs
  • Free to check custom and focused chat bots for customers
  • Free to leverage insights from customer behavior via machine learning.
  • All data storage is decentralized, so there is no real localized store for any personal information on customers, and all competitive data is stored on retailer systems.
  • Such services provide Linked Data as a Web of Data, over which one can apply more NLP and Semantic Web techniques to better understand customers as well as product pricing and sales
  • They are also a way to make it easier for customers to find things on the web and shop on the go. One can even target or identify customers who are not entering stores
  • Through network science one can also find clusters of customers, and work out which customers to target and in what way.
  • This resource-based approach turns queries into a connected linked data graph or knowledge graph
  • Basic knowledge is already derived from DBpedia, which understands what a retailer is in the context of a business, along with various other concepts and things.
  • It can also be applied to keeping track of new releases, new fashions, new trends, and retail news in general
  • Postal Deliveries and shop at your convenience
  • Semantic Product/Service and Collaborative Recommendations
  • Semantic Sentiment Analysis on Customer Service Experience
  • Semantic Intent Graph formulations
  • Semantic Customer Understanding

The services are supposed to be free and funded by ad revenue, to try to remain objective in searchability with no preferential affiliations, possibly even with a percentage on targeted conversions.

Such things are the natural steps towards Web 3.0 and the Internet of Things, where everything is available. Retailers are facing tough competition from Amazon, and the attempt here is to make all retailers more available, targetable, and reachable to customers as a basic enrichment of the customer service experience.

Alternatively, such approaches are currently being used for:

  • connected libraries
  • connected research
  • connected learning
  • connected businesses
  • connected social
  • connected games
  • connected entertainment
  • connected interests
  • connected travel
  • connected news
  • connected profiles
  • connected utilities
  • connected city
  • connected publishing
  • connected ads
  • connected things 

In the context of connected retailers this makes even more sense, as the market is so fiercely competitive.

3 October 2016

Scala Pronunciation

Scala is a very full-featured language. One would assume at first that the language, having its roots in Lausanne, Switzerland, would have an ornate history, like a circular stairway leading to some fantasy fanfare, as suggested by its logo design. It is a very continental language, dispersed in many forms, which shows even in the complexity it derives from being a mix of functional and object-oriented. In reality, however, Scala is simply an abbreviation of two words combined: Scalable Language. One then wonders how the name is pronounced. Some people pronounce Scala in the literal manner of the words it derives from, 'Scale' or 'Scalable', which seems more appropriate in American English. Others pronounce it like 'Scarlet', which seems more appropriate in British English. Apparently, in the way a community likes to be eccentric, the correct pronunciation is in the manner of 'Scarlet'. However, that is a misnomer at best, at least judged against the literal abbreviation roots of a 'Scalable Language'.