15 December 2015

Automatic Summarization

Automatic Summarization is a valuable aspect of Information Extraction in Natural Language Processing. It is applied within Information Retrieval, news summaries, research abstract generation, and various knowledge exploration contexts. Summarization can be applied over either single or multiple documents, and over anything from simple to richly structured text. The following outlines the various aspects of Automatic Summarization that are under active research and in use across different textual domains.

Summarization Types:
single document
main point
key point

Summary Sentence Approaches:
sentence selection vs summary selection

Unsupervised Methods:
word frequency
word probability
tf*idf weighting
log-likelihood ratio for topic signatures
sentence clustering
graph based methods for sentence ranks
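Several of the unsupervised methods above can be sketched in a few lines. Below is a minimal Python sketch of word-probability sentence scoring, in the spirit of SumBasic; the sentence splitter and scoring details are simplifications for illustration, not a faithful reimplementation of any published system.

```python
import re
from collections import Counter

def summarize(text, n=2):
    """Score sentences by average word probability and keep the top n."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    prob = Counter(words)          # raw word frequencies over the document
    total = sum(prob.values())

    def score(sent):
        toks = re.findall(r"[a-z']+", sent.lower())
        # average probability of the words in the sentence
        return sum(prob[t] / total for t in toks) / max(len(toks), 1)

    ranked = sorted(sentences, key=score, reverse=True)
    chosen = set(ranked[:n])
    # restore document order for readability of the extract
    return [s for s in sentences if s in chosen]
```

Graph-based methods such as TextRank replace the scoring step with a sentence-similarity graph and a PageRank-style iteration, but the select-and-reorder skeleton stays the same.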

Semantics and Discourse:
lexical chaining
latent semantic analysis
rhetorical structure
discourse-driven graph representation

Summary Generation Methods:
rule-based compression
statistical compression
context dependent revision
ordering of information

Various Genres and Domains:
journal article
conversation log
financial data
social media

Evaluation:
linguistic quality

14 December 2015

Question/Answering Approaches In Perspective

Question/Answering has become a hot topic in recent years, as it can be applied across a variety of domain contexts for data mining, knowledge discovery, and as an application of Natural Language Processing. One of the core underpinnings has always been matching a question to an answer and its reformulations. In simple terms, one could apply a decision-tree style approach or formalize keyphrase matching over a set of rules. In recent years, there has been much growth in probabilistic techniques over rules-based systems, and a hybrid approach in artificial intelligence has proven optimal in many contexts. Including semantic constructs through ontologies allows an agent to understand and reason over domain knowledge through inference and deduction. Furthermore, one can take such an intelligent metaphor of understanding a step further into the BDI context of multi-agent systems, with mediation for argumentation and game theory. Deep Learning has also provided some robust alternatives. Below is a listing of proposed ideas on how potentially effective question/answering strategies could be achieved for open/closed-domain understanding. In every case, a semantic ontological understanding becomes important as a somewhat guided way of reasoning about the open world. One can view question/answering almost as a data funnel or pipeline of question-to-answer matching through a series of filtration steps: Sentiment Analysis, sentence comprehension as chains of thought or tokens, Machine Learning for classification and clustering, as well as semantic domain concepts. In such respects, one can formulate a knowledge graph from a generalized view of the open world and gradually layer specialized curated domain ontologies on top to provide for Commonsense Reasoning, analogous to a human. DBpedia is one starting point to the open world, and the entire web is another.
A separate lexical store could also be used, such as WordNet, SentiWordNet, or Wiktionary. Alternative examples to further build out the knowledge base include Yago-Sumo, UMBEL, SenticNet, OMCS, and ConceptNet. One could even build a graph of the various curated FAQ sites for a connected knowledge source. One day, however, the Web of Data could itself provide a gigantic linked data graph of queryable knowledge via metadata; today such options come in the form of Schema.org and others. As research evolves, cognitive agents will become more self-aware of their world, with more granular and efficient ways of understanding that require less guidance. Another practical note here is the desirability of a feedback loop between short-term and long-term retention of knowledge cues, to avoid excessive repeated backtracking for inference on similar question patterns in context.

Description | Steps | Agent Belief
QA | Semantic Domain Ontologies/NLP + BDI Multiagent Ensemble Classifiers (potential for Deep Learning) | Multiple BDI
QA | Semantic Domain Ontologies/NLP + BDI Multiagent Belief Networks using Radial Basis Functions (Autoencoders vs Argumentation) | Multiple BDI
QA | Semantic Domain Ontologies/NLP + BDI Multiagent Reinforcement Learning/Q-Learning | Multiple BDI
QA | Semantic Domain Ontologies/NLP + Predicate Calculus for Deductive Inference | Single
QA | Semantic Domain Ontologies/NLP + Basic Commonsense Reasoning | Single
QA | Semantic Domain Ontologies/NLP + Deep Learning (DBN/Autoencoders) | Single
QA | Semantic Domain Ontologies/NLP + LDA/LSA/Search Driven | Single
QA | Semantic Domain Ontologies/NLP + Predicate Calculus for Deductive Inference + Commonsense Reasoning | Single
QA | Semantic Domain Ontologies/NLP + Groovy/Prolog Rules | Single
QA | Semantic Domain Ontologies/NLP + Bayesian Networks | Single
QA | TopicMap/NLP + Deep Learning (Recursive Neural Tensor Network) | Single
QA | Semantic Domain Ontologies/NLP + QA TopicMap + Self-Organizing Map | Single
QA | Semantic Domain Ontologies/NLP + Connected Memory/Neuroscience (Associative Memory/Hebbian Learning) | Single
QA | Semantic Domain Ontologies/NLP + Machine Learning/Clustering in a DataGrid like GridGain | Single
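As a concrete illustration of the question-to-answer matching that underpins all of the strategies above, here is a minimal bag-of-words cosine matcher in Python. It is a toy sketch of the retrieval step only, not any of the listed ensembles, and the FAQ data is invented for illustration.

```python
import math
import re
from collections import Counter

def vectorize(text):
    # Bag-of-words term frequencies; a real system would add tf*idf weighting.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_answer(question, faq):
    """faq is a list of (question, answer) pairs; return the answer
    whose stored question is closest to the incoming one."""
    qv = vectorize(question)
    return max(faq, key=lambda pair: cosine(qv, vectorize(pair[0])))[1]
```

The semantic and BDI layers in the table would sit on top of (or replace) this lexical matching with reasoning over domain ontologies.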

29 November 2015

Applied Design Patterns

Design Patterns have proven quite useful in software engineering practice. When used appropriately they bring many benefits and have become an indispensable design approach for architects. A design pattern is essentially a reusable approach to a recurring problem in context. Patterns benefit not only architects and software engineers/developers but also other role players in an agile process, including project sponsors, project managers, testers, and users. There are many design patterns, and the field is always changing as anti-patterns are found that undermine existing approaches, making way for new patterns. These patterns may also be elaborated for use at object and class scope levels. The Gang of Four patterns have become a fundamental aspect of object-oriented theory and design. However, not everyone is at home applying such approaches in practice. Having patterns baked into the language is often seen as a good thing, and perhaps this is a flaw in languages like Java, which formally expect software engineers to bring a design pattern style of thinking to software development and object-oriented design in particular, whereas functional programming languages take a different route and are simpler. Considering the wide array of design patterns available, most with their own relevant domain contexts, it seems plausible to build an intelligent template solution as a refactoring tool/library/plugin. This could be one extension to an intelligent agent model as part of the software engineering development process. Such a pragmatic agent would need to interpret code on both a logical and a more contextual basis, and reason about where it is appropriate to apply the right design pattern, or even to identify an anti-pattern.
This context of software development automation could be extended to other uses within the refactoring process of functional and service/object-oriented programming, data/object modelling, as well as the various pattern families listed below.

23 Gang Of Four Design Patterns

Behavioral: manage relationships, algorithms, responsibilities between objects

  • Chain of Responsibility (Object Scope) 
  • Command (Object Scope) 
  • Interpreter (Class Scope) 
  • Iterator (Object Scope) 
  • Mediator (Object Scope) 
  • Memento (Object Scope) 
  • Observer (Object Scope) 
  • State (Object Scope)
  • Strategy (Object Scope) 
  • Template Method (Class Scope) 
  • Visitor (Object Scope) 

Structural: build large object structures from disparate objects

  • Composite (Object Scope) 
  • Decorator (Object Scope) 
  • Facade (Object Scope) 
  • Flyweight (Object Scope) 
  • Proxy (Object Scope) 
  • Adapter (Class/Object Scope) 
  • Bridge (Object Scope) 

Creational: construct objects able to be decoupled from implementation

  • Abstract Factory (Object Scope) 
  • Factory Method (Class Scope) 
  • Builder (Object Scope) 
  • Prototype (Object Scope) 
  • Singleton (Object Scope)
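To make the catalogue concrete, here is a minimal sketch of one behavioral pattern, Strategy, in Python. The discount example and the class names are purely illustrative, not drawn from any particular codebase.

```python
from abc import ABC, abstractmethod

class DiscountStrategy(ABC):
    """The Strategy interface: interchangeable algorithms behind one contract."""
    @abstractmethod
    def apply(self, price: float) -> float: ...

class NoDiscount(DiscountStrategy):
    def apply(self, price: float) -> float:
        return price

class PercentageDiscount(DiscountStrategy):
    def __init__(self, pct: float):
        self.pct = pct
    def apply(self, price: float) -> float:
        return price * (1 - self.pct / 100)

class Order:
    """The context: holds a strategy and delegates to it, so the pricing
    algorithm can vary independently of the order logic."""
    def __init__(self, total: float, strategy: DiscountStrategy):
        self.total = total
        self.strategy = strategy
    def checkout(self) -> float:
        return self.strategy.apply(self.total)
```

In a duck-typed language the abstract base class is optional (a plain function would do), which is one reason some patterns feel heavier in Java than in more dynamic or functional languages.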

Software Design Pattern
SOA Patterns
Data Science Design Patterns
Big Data Workload Design Patterns
Architectural Patterns
Concurrency Patterns
Interactive Design Patterns
Big Data Architectural Patterns
Microservices Patterns
Microservices Architecture Patterns
Service Design Sheet
Linked Data Design Patterns
Ontology Design Patterns
Enterprise Architecture Patterns
Enterprise Integration Patterns
Cloud Design Patterns

18 October 2015

Enterprise Architecture

Enterprise Architecture is a formidable terrain for large organizations steeped in system complexity and poor business alignment. Hence, various formal frameworks and methods have been defined to manage the architecture of such deliverables. Oftentimes the technology architecture mimics the dynamics of a business culture or organizational functions. Below is a list of the four key methodologies used for enterprise architecture, along with a comparison.

16 October 2015

Startup Stacks

It is always interesting to see what technology stacks are being used by startups, especially ones that have been successful. Compared to enterprises, startups often have nominal legacy code and are open to trying new approaches with bleeding-edge technology. The below link sheds some light on the technology stacks used by various startups in the industry, and gives an idea of the trends across different tools and services.

7 October 2015

Creative Work Licenses for Software

Original work should always be licensed in some way, whether for the open source community or for full protection of rights. In a competitive world everyone is looking for the shiny new artifact that could take a digital community by storm, so it seems only plausible that one protect their hard work, whether for sharing or otherwise. However, the license terms available are broad and varied, and one has to be fully mindful and aware of them. Below are some helpful links for making an informed decision on the license terms that best suit an artifact's or project's requirements.

3 October 2015

Microservices Monitoring

Breaking down a system into more granular services guided by the single responsibility principle has the multiple benefits of a bounded context. However, it also adds a degree of complexity that requires more extensive monitoring. Multiple services interacting in a distributed systems context imply multiple log files, a need to aggregate them, and multiple places for network latency issues to arise. One simple approach is to monitor everything in the entire workflow of the services as well as the system as a whole, while at the same time trying to get the bigger picture through aggregation. Also, add structure to the logs by utilizing correlation IDs, which can then provide a guided trail. Responsiveness matters too, so real-time alerting may be needed in order to avoid cascading issues. One can abstract away the service from the system for a monitoring strategy. The current trend is to monitor holistically, getting the full picture of the entire system including all its sub-systems as well as all the service interactions within it. A breakdown of the types of things that can be monitored, with examples of tools, is given below.

Service-Level Tracking:
  • check inbound response times, error rates, and application metrics
  • check downstream response health, response times of calls, error rates (Hystrix)
  • standardize metrics collection process and pipelines
  • standardize on logging formats so aggregation is easier
  • check system processes for the OS in order to plan for capacity

System-Level Tracking:
  • check host metrics like CPU
  • check system logs and aggregate them so it is possible to filter on individual hosts
  • standardize on single query option for searching through logs
  • standardize on correlation IDs
  • standardize on an action plan and alert levels
  • unify aggregation (Riemann or Suro)

Logstash and Graphite/Collectd/StatsD are often used in conjunction for the collection and aggregation of logs; one can also apply the ELK stack. The Java Metrics library can be utilized to get insight into code during production. Other tools, like Skyline and Oculus, are available for anomaly detection and correlation.
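The two standardization points above, structured log formats and correlation IDs, can be sketched with nothing but the standard library. Below is a minimal Python sketch; the JSON field names and the service name are invented for illustration, and a real service would propagate the correlation ID via an HTTP header rather than a function argument.

```python
import io
import json
import logging
import uuid

def get_logger(stream):
    # Structured (JSON) log lines make aggregation and filtering much easier.
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        '{"level": "%(levelname)s", "correlation_id": "%(correlation_id)s", '
        '"message": "%(message)s"}'))
    logger = logging.getLogger("svc")
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(logging.INFO)
    return logger

def handle_request(logger, correlation_id=None):
    # Reuse the inbound correlation ID, or mint one at the edge of the system.
    cid = correlation_id or str(uuid.uuid4())
    log = logging.LoggerAdapter(logger, {"correlation_id": cid})
    log.info("request received")
    log.info("downstream call ok")
    return cid
```

Once every service emits the same ID for one logical request, the aggregated logs form the "guided trail" across service boundaries.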

30 September 2015

Open Data and Knowledge

OpenData is all about making data freely available to all without restrictions, mirroring other open source initiatives; Data.gov and Data.gov.uk are parallel efforts. To get involved with Open Knowledge, one can check out Open Knowledge Labs. Open Knowledge working group areas and data process tools are listed below.

Lobbying Transparency
Open Access
Open Bibliography
Open Definition
Open Design & Hardware
Open Development
Open Economics
Open Education
Open Government Data
Open Humanities
Open Linguistics
Open Product Data
Open Science
Open Sustainability
Open Transport
Personal Data and Privacy
Public Domain

Further details can be found on School of Data.

Open Data Institute

15 September 2015

Computational Linguistics and NLP Conferences

The below link provides the entire schedule of computational linguistics and natural language processing conferences taking place globally for the year, as well as an archive of past dates.

18 July 2015

ICML 2015

This year the International Conference on Machine Learning took place in Lille, France. It was a fantastic event, bringing research from diverse areas of Machine Learning into a collaborative setting, and the conference went down really well. An immense amount of research was shared within the community, and there was a noticeable increase in the number of attendees this year. The schedule was broken down into conference sessions, workshops, and tutorials, with an open question and discussion period after each session. The banquet was a joyful experience, though both the banquet and the local Lille food left much to be desired. Cheese was on display in all forms, showing itself on every French menu; for vegetarians, Lille offers cheese, french fries, and salad. Some of the most popular areas of research covered included Deep Learning, Topic Modelling, Structured Prediction, Networks and Graphs, Natural Language Processing, Reinforcement Learning, and Transfer Learning, with Deep Learning, Reinforcement Learning, and Word2Vec drawing the most attention. Many of the presented papers can also be found on arXiv. The conference showed how far Machine Learning has come, as well as the level of popularity it has garnered over the years. Machine Learning is proving to be invaluable in a multitude of domains, with profound effects for business and society as a whole. But one thing reverberated throughout the conference: there is still a lot to be discovered before Artificial Intelligence can truly match the abilities of a human being.

15 July 2015

London Shopping Centres

Shopping in the UK is not comparable to the quality and vast expanse of malls in the US, and department stores are relatively unmatched apart from Harrods, Selfridges, and John Lewis. Not only is shopping in the UK far more expensive than in the US, but there is also less variety and less competition for bargains. Things are slowly changing, though, especially in London, where there are plenty of options because the capital gets such a huge influx of tourists year round. Although not exhaustive, the following list includes some popular shopping arcades in London, as well as a link to a few popular areas around the UK and a Lonely Planet guide to shopping in London.

Awesome Big Data

Big Data has grown in leaps and bounds for distributed systems as well as machine learning. The following links provide a useful curated and categorized list of Big Data frameworks, libraries, resources, and other related technologies. No doubt this will change, as the domain has proven to be very dynamic.

3 May 2015

Common Crawl

Common Crawl provides an archived snapshot dataset of the web which can be utilized for a massive array of applications. It is based on the Heritrix archival crawler, making it quite reusable and extensible for open-ended solutions, whether that be building a search engine against years of web page data, extracting specific data from web page documents, or training machine learning algorithms. Common Crawl is available via the AWS public data repository and accessible via the AWS S3 blob store. There are plenty of MapReduce examples available in both Python and Java to make it approachable for developers. Having years of data at a developer's disposal saves one from manually setting up such crawler processes.

26 March 2015

Deep Learning for Java

Deep Learning has become the next big thing in the realization of Artificial Intelligence. However, many libraries and frameworks are still very much experimental and intended for research purposes. For deep learning to be a viable option in realistic business applications, it has to be scalable over Big Data; cloud environments and massive parallelization have made such scalability requirements of Machine Learning a possibility. DL4J is an open source library much needed in the Deep Learning community. It provides an interesting option and an array of developer-friendly neural network implementations. Whatever the domain requirements of a business, DL4J provides a viable and accessible option for delivering a working, production-ready implementation.

Deep Learning: A Practitioner's Approach
Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms

24 March 2015

Natural Language Processing

Natural Language Processing has come a long way from the past eras of rule-driven approaches to utilizing more Machine Learning techniques, paving the way for even more advanced hybrid methods. The area is quite diverse and constantly growing, with active research in the community. We also find Natural Language Processing as an applied discipline for almost all web and document related extraction problems. However, there is still room for more scalable libraries and frameworks, as they tend to emerge mainly out of research and at times carry restrictive user licenses. Natural Language Processing applications are usually designed in a pipeline architecture. They can utilize rich domain semantics from Linked Data ontologies, vocabularies, thesauri, or even commonsense knowledge bases, and increasingly they utilize deep learning methods. There are also formal frameworks supported by industry collaborations, such as UIMA, for building entire pipelines, and frameworks like GATE that provide a variety of pluggable libraries for different domain cases and pipeline tasks. The following are some interesting libraries that could be applied in Natural Language Processing applications.
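The pipeline architecture mentioned above can be sketched very simply: each stage takes an annotated document and enriches it before passing it on, which is essentially what frameworks like UIMA and GATE formalize. A minimal Python sketch, where the stage names and the tiny stopword list are purely illustrative:

```python
import re
from collections import Counter

def tokenize(doc):
    # Stage 1: split raw text into lowercase word tokens.
    doc["tokens"] = re.findall(r"\w+", doc["text"].lower())
    return doc

def remove_stopwords(doc):
    # Stage 2: drop high-frequency function words (tiny illustrative list).
    stopwords = {"the", "a", "is", "in", "of"}
    doc["tokens"] = [t for t in doc["tokens"] if t not in stopwords]
    return doc

def count_terms(doc):
    # Stage 3: a downstream annotation built on the earlier stages' output.
    doc["term_counts"] = Counter(doc["tokens"])
    return doc

def run_pipeline(text, stages):
    """Thread a shared annotation dictionary through each stage in order."""
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc
```

Because each stage only reads and writes annotations, stages can be swapped, reordered, or replaced with library-backed implementations without changing the rest of the pipeline.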


16 March 2015

Mind Mapping

Mind mapping and brainstorming tools come in handy for visually working out relationships between ideas and concepts, and often elucidate our thoughts towards more plausible and realistic outcomes. Mind mapping tools are also useful to information architects in structuring information flows around concepts within a domain context. Brainstorming exercises are often the best way of working out all the corner cases of a knowledge representation over data. The whole process can also provide a way of discovering new connected ideas and storyboarding before formalizing an implementation strategy. The below provides links to a few mind mapping tools.

4 March 2015

Online CI Providers

Hosted Continuous Integration is a hot but very competitive domain. While some choose to have it hosted in the cloud, others prefer more corporate autonomy using tools such as Jenkins and TeamCity. Continuous Integration is an agile workflow practice in which developers integrate code in shared repositories and use automated tests to verify build quality, allowing teams to check for issues early and often on a daily basis. A step further in the process is Continuous Delivery, the hardest bit to fully achieve on a large, complex architecture, and one that may even prove foolhardy. Although CI has been around for years, it really boils down to team dynamics and whether one has the time to manually set up and monitor builds compared to a hosted option. In some corporate environments, teams may even have a dedicated team member for build and configuration management. The following is a list of a few hosted Continuous Integration providers and the different use cases they serve in an agile software engineering process.


Comparison of continuous integration software

28 February 2015

Alternatives To OpenRefine

OpenRefine, which used to be part of a Google project stream, has become an almost irreplaceable tool for data cleansing and transformation, an activity generally regarded as data wrangling. One can clean messy data, transform data into various normalizations/denormalizations, parse data from websites, merge data from various sources, and reconcile with Freebase (now discontinued, with work continuing on Wikidata). However, the tool does have its quirks and limitations. Quite a few alternative tools are available, most of which stem from research and end up becoming commercial products in their own right; unfortunately, other open source options are often left as experimental and then slowly become unavailable for public use. A few interesting free alternatives are listed below.

DataWrangler (commercialized into Trifacta)
Many Eyes (discontinued)
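To give a flavor of what these wrangling tools automate, here is a simplified Python sketch of key-collision clustering in the spirit of OpenRefine's fingerprint method, used to group near-duplicate values for cleanup. The normalization steps here are a reduced version of the real method, kept stdlib-only for illustration.

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Reduce a value to a canonical key: trim, lowercase, strip
    punctuation, then dedupe and sort the remaining tokens."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    tokens = re.split(r"\s+", cleaned)
    return " ".join(sorted(set(t for t in tokens if t)))

def cluster(values):
    # Values that collide on the same fingerprint are likely duplicates.
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [vs for vs in groups.values() if len(vs) > 1]
```

A human then reviews each cluster and picks the canonical spelling, which is exactly the workflow OpenRefine's clustering dialog supports.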

School of Data Online Resources

Alternatives To Zookeeper

Zookeeper has over the years become a basis for many open source distributed service projects. The important aspects to consider when choosing location and coordination services are the right discovery architecture as well as the operational requirements. In general, the key concerns are load balancing, monitoring, integration, runtime dependencies, and availability needs. As the number of disparate services grows for scalability, it becomes paramount to have dynamic service registries and discovery to coordinate their changing locations and deployments, minimizing failure and interruption. In many respects, Zookeeper can be viewed as a relatively old implementation that does not provide many out-of-the-box service discovery options compared to newer alternatives. Consul, for example, goes some way further than Zookeeper in certain functional features and capabilities, and there are other interesting options like Eureka, Etcd, and Serf. The intention of many dynamic service registries is to resolve the downsides of using standard DNS for finding nodes in highly dynamic environments. The following list provides some alternatives to Zookeeper, from the perspective of both general and single-purpose registries and coordination.
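To make the registry idea concrete, here is a toy in-memory Python sketch of lease-based registration with TTL expiry, the core mechanic behind health-checked discovery in tools like Consul and Eureka. The API and TTL numbers are invented for illustration; a real registry is distributed, replicated, and fault-tolerant.

```python
import time

class ServiceRegistry:
    """Toy in-memory service registry: instances register and heartbeat,
    and entries whose lease has expired are hidden from lookups."""

    def __init__(self, ttl_seconds=30, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable for testing
        self.entries = {}           # (service, address) -> last heartbeat time

    def register(self, service, address):
        self.entries[(service, address)] = self.clock()

    # Re-registering simply refreshes the lease.
    heartbeat = register

    def lookup(self, service):
        now = self.clock()
        return [addr for (svc, addr), ts in self.entries.items()
                if svc == service and now - ts <= self.ttl]
```

An instance that crashes simply stops heartbeating and falls out of lookups after one TTL, which is how these registries avoid the stale-record problem of plain DNS.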


JavaScript Map Libraries

The JavaScript ecosystem provides innumerable options for mapping and for building holistic GIS applications. There is a huge array of libraries, plugins, and APIs to choose from to harness, process, and customize the visualization of data. GIS is a hot domain advancing at a fast pace, especially as public service initiatives unlock key data for developers to explore and build creative applications with. Although not fully exhaustive, the below list provides some interesting JavaScript mapping tools.

MapQuest Maps
Bing Maps
WebGL Globe
Clickable Maps
jQuery Mapael
jQuery Birdseye

A simple map making tutorial 

27 February 2015

Watch Movies Legally For Free

Illegal movie sites are popping up all over the Web, and at the same time they are under legal scrutiny under DMCA restrictions. Some search engines, like Google, have even taken the initiative to block such sites from search listings. ChillingEffects is an open archive initiative that works collaboratively to protect lawful online activity from legal threats and provides information on legal rights as well as responsibilities. While free access to content can sometimes be dubious, there are a few legitimate sites that provide legally free movies for Internet users. Such sites may have limited or at times outdated content; however, others do provide constantly updated streams. With a bit of searching around, there may just be something for everyone to arouse a keen curiosity. A few legal sites for free movies are listed below. Another, often overlooked, option is the local public library's archive of videos.

22 February 2015

Outsourcing Development

Many companies look to outsourcing as a means of cost efficiency and rapid turnaround of work. On other occasions it is about a lack of in-house skills, for which they need outside support. Although outsourcing may appear to increase development efficiency in the short term, it more than erodes it in the long term. Outsourcing is also detrimental to agile processes within a team environment, and it ordinarily reduces the scope of development work for existing permanent staff, which inevitably leads to loss of morale and productivity. Although management might see outsourcing as the way to go, for many developers it is often an unpleasant experience. Not only does outsourcing bring the frustration of third-party communication, it also involves lower quality and increased risk of unexpected delays in project deliveries. Invariably, with outsourcing one is also limited to the skills and experience of the third party. On many occasions, once a project is delivered, third parties also play tricks to continue the contracted work through maintenance or further project deliveries as a business opportunity. While the product owner values quality assurance, the third-party outsourcing agency values delivery of the work for which they are paid. At times, delaying work means more income from the product owner; on other occasions, delivering buggy projects means more continued maintenance work later down the line. All this stretches the budget constraints of an outsourced project and of the product owner. The best approach for most corporate environments looking to deliver on a project is to bring in permanent developers for the full life cycle of work. This means a clean execution of development with clear deliverables, as well as a chance to form an agile culture that grows internally as the project evolves.
It also avoids wasted time for the product owner as well as for development staff. Most hands-on developers who enjoy development work will never favor outsourcing or contracting work out, because it reduces the scope of their work and increases uncertainty. Cloud computing has also allowed for more efficient development, with the performance and scale to meet business demands for growth. Organizations that make outsourcing the core of their development budgets really need to re-evaluate their strategy for the long term, as internal development teams will far surpass outsourced ones in quality of work and provide continued cost efficiencies. In the long term, this is almost a necessity, especially with the changing trends in technology and the demands of the business environment for maintaining a competitive advantage.

10 February 2015

Go for Robotics and Internet of Things

Go is a very easy language to learn; a mere mortal can get started in a day at most, and probably less. The language is taking shape in diverse application domains. It is becoming quite useful for cloud computing, embedded work, and low-level development, and is even an interesting replacement for shell scripting. Although the language uses garbage collection, it derives its roots from the C programming language, especially looking at the backgrounds of its creators. For Python developers, it will be another useful and familiar arsenal at their disposal. Go is becoming especially useful for Robotics and the Internet of Things, with several interesting approaches becoming available in the community, and it is even taking shape in the Semantic Linked Data space. At the moment, the community is slowly starting to accept Go for a variety of application use cases through experimentation and evolution of the language. In time, we are going to witness the emergence of a new ecosystem of Go developers and an almost religiously active use for large-scale enterprise applications.

4 February 2015

Convergence of TV and Internet

Although in the past TV and the Internet have been treated as significantly different media streams, it is becoming evident that they have more synergies for the future. Not only does this mean access to a wider digital customer base, but also reuse of existing content and technologies. Advertising companies are already focusing on context and behavior across mobile, TV, and Internet. Sooner or later, WebTV and ubiquity are going to take over, and content providers will need to treat all such media sources as one ultimate streaming option, all dependent on metadata semantics. Even retailers such as eBay and Amazon will have to look to digital TV options to enhance their already wide customer bases. This will mean more complex use cases and yet a consolidation of technology for big data. The desktop is again going to take center stage as more people look for multi-modal access to media streaming. People want applications and platforms that work across digital divides with a minimum of fuss over compatibility and re-engineering by the consumer. Advertising will again mean a reduction in subscription costs and a higher revenue stream. Essentially, ubiquity will take center stage as digital technology becomes part of our very existence rather than an overly visible tool constraint, an almost plug-and-play for everything. And as digital electronics and software converge, mass appeal will follow from the multi-modal transparency of use. Such approaches already exist today through Apple and Google products; however, people also want more versatility of contextual use, as well as smarter intelligence. As more people spend time with digital technology, they will also become more aware of their own health and well-being, which means more ubiquitous technology in the home, the office, and the daily leisure life of an individual.
If there is a publishing or retail site on the Internet, it will only become more contextually beneficial to have a channel brand as well to broadcast the content. Aggregated outlets will have the benefit of larger contextual semantics for streaming. Real-time needs are growing as consumers become more demanding, with more complex tastes and digitally connected lives. It will inevitably become a case of consolidation: meeting digital trends, harnessing cost-effectiveness from technology, and reaching a wider consumer base.