Mabble Rabble: Linear Annotations

1 November 2020

Linear Annotations

In theory, annotations can be implemented in many different ways. In practice, they are almost always linear. Language itself is hierarchical and therefore not very linear, but applying non-linear steps programmatically means multiple dependency points, many side effects, increased complexity, and a requirement to keep state. It also means the annotations become dynamic rather than static. In big data terms, this is not only a huge computational cost: as the data sources and annotations grow, many raw sources will fail the annotation process and bring the entire pipeline to a sudden, grinding halt. Even with only two to five annotations and thousands of data sources, the computational requirements can be heavy and time-consuming. In practice, the count can range anywhere from one annotation to thousands. In media and publishing, annotation sets can be so large that they are given their own numbering system against a set standard.

A callable annotation process rests on a few fundamental abstractions, which may include: model, score, label, annotation, and metrics. Invariably, there is also a frontend component that talks to the backend model components, as well as an evaluation and adjudication process to validate and mediate the production of a corpus. A separate method may be applied to quality-check the human annotations, for example via active learning and predetermined evaluation metrics.

The model can be linear or non-linear. However, the annotation process itself in a pipeline is almost always linear in nature. An example of basic, linearly defined and aggregated parsing steps is an Entity Recognizer. One can use annotation tree structures on the frontend, but this creates complexity issues for the backend model process from an input/output perspective.
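
To make the idea of a linear annotation pipeline concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Annotation` and `Document` types, the `tokenize` and `recognize_entities` steps, and the capitalization rule standing in for a real model. The constant `0.5` score is a placeholder, not a real confidence. The point is only the shape: each step takes a document and returns a new one, and the steps run in a fixed order with no shared state.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Annotation:
    start: int    # character offset where the span begins
    end: int      # character offset just past the span
    label: str    # hypothetical label names: "TOKEN", "ENTITY"
    score: float  # placeholder confidence, not produced by a real model


@dataclass(frozen=True)
class Document:
    text: str
    annotations: Tuple[Annotation, ...] = ()


# A step is a pure function: document in, new document out.
Step = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    """Add one TOKEN annotation per run of word characters."""
    tokens = tuple(
        Annotation(m.start(), m.end(), "TOKEN", 1.0)
        for m in re.finditer(r"\w+", doc.text)
    )
    return Document(doc.text, doc.annotations + tokens)


def recognize_entities(doc: Document) -> Document:
    """Toy rule standing in for a model: capitalized tokens are entities."""
    entities = tuple(
        Annotation(a.start, a.end, "ENTITY", 0.5)
        for a in doc.annotations
        if a.label == "TOKEN" and doc.text[a.start].isupper()
    )
    return Document(doc.text, doc.annotations + entities)


def run_pipeline(doc: Document, steps: Iterable[Step]) -> Document:
    """Apply the steps strictly in order; no branching, no shared state."""
    for step in steps:
        doc = step(doc)
    return doc


def spans(doc: Document, label: str) -> list:
    """Return the text covered by every annotation with the given label."""
    return [doc.text[a.start:a.end] for a in doc.annotations if a.label == label]


raw = Document("Alice met Bob in Paris")
annotated = run_pipeline(raw, [tokenize, recognize_entities])
print(spans(annotated, "ENTITY"))  # ['Alice', 'Bob', 'Paris']
```

Because every step only appends to a frozen document, a failing source can be skipped or retried on its own without touching the results of any other source.
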

Some model steps may follow a serially defined dependency, where one model process step leads into another. In the cloud, data tends to follow an immutability constraint, both for processing and for storage. In industry, because of the huge cost of exponential complexity, not to mention the increase in errors when processing a large number of data sources, non-linear structured annotations have not received widespread adoption outside the scientific research community. It also seems plausible that people who suggest non-linear annotations to businesses as an alternative lack the experience to understand the practical complexities such an architecture brings in production, unless they can point to at least one successfully deployed production pipeline in industry that used non-linear annotations, with the associated metrics to back up their claims. In fairness, it may just be confusion on their part: what they regard as non-linear may simply be an aggregation of linear annotation steps.
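
The last point, that something which looks non-linear may just be an aggregation of linear steps, can be illustrated with a small Python sketch. The `Record` type, the three steps, and the `compose` helper are all hypothetical names invented for this example. One step depends serially on the label written by the step before it, and every step returns a new immutable record instead of modifying its input.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, Tuple


@dataclass(frozen=True)
class Record:
    text: str
    labels: Tuple[str, ...] = ()


Step = Callable[[Record], Record]


def normalize(rec: Record) -> Record:
    """Return a copy with trimmed, lower-cased text."""
    return replace(rec, text=rec.text.strip().lower())


def label_length(rec: Record) -> Record:
    """Label the record by word count (the threshold of 5 is arbitrary)."""
    label = "long" if len(rec.text.split()) > 5 else "short"
    return replace(rec, labels=rec.labels + (label,))


def label_review(rec: Record) -> Record:
    """Serial dependency: reads the label written by label_length."""
    if "long" in rec.labels:
        return replace(rec, labels=rec.labels + ("needs-review",))
    return rec


def compose(*steps: Step) -> Step:
    """Aggregate several linear steps into a single step."""
    def composed(rec: Record) -> Record:
        return reduce(lambda acc, step: step(acc), steps, rec)
    return composed


source = Record("  A Fairly Long Sentence About Annotations  ")

# Looks like a tree of steps, but flattens to the same linear sequence.
nested = compose(normalize, compose(label_length, label_review))
flat = compose(normalize, label_length, label_review)

print(nested(source) == flat(source))  # True
print(source.labels)                   # () since the input is never mutated
```

The nested grouping only changes how the steps are packaged. The order in which they execute, and therefore the result, is the same as in the flat sequence.
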