26 March 2014

Semantic Annotations

Semantic annotation is a broad and complex area, often requiring a mixture of natural language processing and knowledge representation. One of the major inherent requirements in an application is word sense disambiguation. There are also more lightweight approaches that generalize on the semantics alone, in the form of ontologies, especially for maintaining publications and cataloging. Such semantics can cater for text as well as multimedia. What this enables is that semantic labels can be constructed in context and used for findability, better visualization, and reasoning over a set of web resources, supporting the conversion from syntactic structures to knowledge structures.

One can approach this manually or in an automated fashion. The manual route typically transforms syntactic resources into interlinked knowledge, usually with third-party tools, without taking much account of the multiple perspectives of the underlying data sources. There are also semi-automated annotation approaches, though they too require human intervention at various phases of the process; GATE is one such semi-automated tool for extracting entity sets.

Fully automated approaches usually require tuning and re-tuning after training. They can acquire their knowledge from the web and apply it to content in a context-driven manner for automatic extraction and annotation. Wrappers are created that identify and recognize patterns in text for annotation, at times with human assistance, and supervised classifiers may be used to learn such patterns.

For multimedia, annotation often takes the form of rich metadata; alternatively, it can address the content semantics directly, or operate at a finer granularity within the media. Annotations can be global, collaborative, or local.
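To make the wrapper idea concrete, here is a minimal sketch in Python of a pattern-based annotator. In practice wrappers are learned or induced from examples; the hand-written patterns and label names below are purely illustrative stand-ins.

```python
import re

# Illustrative patterns standing in for learned wrappers.
# Each match becomes an annotation span over the raw text.
PATTERNS = {
    "DATE": re.compile(
        r"\b\d{1,2} (?:January|February|March|April|May|June|July|August"
        r"|September|October|November|December) \d{4}\b"),
    "YEAR": re.compile(r"\b(?:19|20)\d{2}\b"),
}

def annotate(text):
    """Return (start, end, label, surface) spans for every pattern match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label, m.group()))
    return sorted(spans)

spans = annotate("The workshop ran on 26 March 2014 and again in 2015.")
# Each span records offsets, a semantic label, and the surface text,
# so the annotations can later be serialized as stand-off metadata.
```

Note that overlapping matches are kept here (the year inside the full date is annotated twice); a real system would add a resolution step to prefer the longer or more specific annotation.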
One can extend and provide rich annotations using custom metadata, variously defined through controlled vocabularies, taxonomies, ontologies, topic maps, and thesauri for different contexts. There is a W3C effort for open annotation, as well as the LRMI effort, a learning resources initiative based on schema.org. One can also build a pipeline through the various workflow stages of a content-filtering process using UIMA, or take a CMS-style approach similar to Apache Stanbol. Standard tools like Tika, Solr, OpenNLP, and Kea can also be useful. Languages such as Java, Groovy, and Python are often used for implementations, along with XML, RDF, and OWL for rich textual semantics; increasingly, tools are emerging on Scala as well.
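The pipeline idea can be sketched in plain Python as a chain of stages that each enrich a shared document object, loosely analogous to how UIMA analysis engines pass a common analysis structure along. The stage names, document fields, and the toy controlled vocabulary below are invented for illustration, not taken from any particular framework.

```python
# A plain-Python analogy of a UIMA-style pipeline: each stage receives
# a shared document dict, enriches it, and passes it on to the next.

def tokenize(doc):
    # Whitespace tokenization; a real pipeline would use a proper tokenizer.
    doc["tokens"] = doc["text"].split()
    return doc

def filter_stopwords(doc):
    # Filtering stage: drop function words before semantic tagging.
    stop = {"the", "a", "of", "and"}
    doc["content_tokens"] = [t for t in doc["tokens"] if t.lower() not in stop]
    return doc

def tag_keywords(doc):
    # Tagging stage against a toy controlled vocabulary (illustrative).
    vocabulary = {"semantics", "ontology", "annotation"}
    doc["keywords"] = [t for t in doc["content_tokens"]
                       if t.lower() in vocabulary]
    return doc

def run_pipeline(text, stages):
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc

doc = run_pipeline("The semantics of annotation and ontology design",
                   [tokenize, filter_stopwords, tag_keywords])
```

Because each stage only reads and writes the shared document, stages can be reordered, swapped, or extended independently, which is the main appeal of the pipeline style for annotation workflows.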