18 June 2021

Why Pure Probabilistic Solutions Are Bad

In data science, there is a tendency to focus on machine learning models whose outcomes are inherently statistical and therefore probabilistic in nature. To test these models one uses an evaluation method that is also steeped in statistics, and to analyse their explainability and interpretability one reaches for yet another statistical method. This turns into a vicious cycle of using statistics to explain statistics, with uncertain outcomes at every step. At some point, one needs to incorporate certainty to gain confidence in the models being derived for a business case. Knowledge graphs serve this purpose in several ways: they increase the certainty of the models, and they provide logical semantics that can be derived through constructive machine-driven inference. Logic, through inference, can give a definite answer, while a machine learning model can at most provide a confidence score for whether something holds or not, with no regard for contextual semantics.

A machine learning model rarely provides a guaranteed solution, as it targets approximations and error; hence the tendency to measure bias and variance across training, testing, and validation data. The evaluation itself rests on approximations built from false positives, false negatives, true positives, and true negatives. Logical methods can be formally tested; a machine learning model can at most be subjectively evaluated, with some degree of bias. At any iterative time slice, a purely statistically derived model will be overfitted to the data to some degree. Statistics produce rigid models that do not lend themselves to definite guarantees in a highly uncertain world. Invariably, statistics are used to simplify a problem into mathematical terms that a human can understand, solve, and constructively communicate; hence the strong statistical bias in academia, which is traditionally a conservative domain where processing thoughts and reasoning over concepts form the critical evaluation method of the research community.

One could argue that such a suboptimal solution is good enough. But is it really? One can always feed in garbage data and train the model to produce garbage output, and all the while the statistical model never understands the semantics of the data well enough to correct itself. Even transfer learning in a purely statistical model is derived in a probabilistic manner. The most a statistically derived model can do is pick up on patterns; the semantic interpretability of those patterns remains undetermined, and any guarantee of certainty is presumably lost in translation. Even the notion of a state-of-the-art model is fairly subjective.

Evaluations that only look at a best-cost analysis in terms of higher accuracy are flawed. If someone says their model is 90% accurate, one should ask: accurate in terms of what? And what happens to the other 10% they failed to account for, an error that whoever puts the model into a production pipeline will have to absorb? Such a model will likely have to be re-evaluated in terms of average-cost and worst-cost, which means adding a variable error of between 5% and 15%: the average-cost error is likely to lie around 10% and the worst-cost error near 15%.
So, 90% of the time in production, the idealized performance accuracy of the model would be 90% - 10% = 80% on an average-cost basis, plus or minus 5% on a best-cost basis, and anywhere from minus 10% to minus 15% on a worst-cost basis. In other words, the model performs at best-cost roughly 5% of the time, at worst-cost roughly 5% of the time, and at average-cost the remaining 90% of the time, where the idealized predictive accuracy, once the full extent of the error is taken into account, is about 80% (a short sketch of this arithmetic follows below). Although this is still fairly subjective, an idealized metric that takes these environmental factors into account at least improves on the certainty. This is because, in most cases, a model is built under the assumption of perfect conditions, without accounting for the complexity and uncertainty present in a production environment. There is also a need to be mindful, sensible, and rational about the accuracy paradox: on imbalanced data, a model that simply predicts the majority class can report high accuracy while being useless for the cases that actually matter.

One can conclude that a hybrid solution combining probabilistic and logical approaches would be the best alternative: it can reach a level of model generalization with sufficient certainty to support an adaptable mechanism for process control, while still capturing the complexity and uncertainty of the world.
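As a rough illustration of the back-of-the-envelope estimate above, the snippet below weights each cost regime by the share of production time assumed for it and the extra error attributed to it. The regime shares (5% best-cost, 90% average-cost, 5% worst-cost) and the error figures are the illustrative numbers from this post, not measurements from any real system.

```python
# Illustrative only: the shares and error figures come from the text above,
# not from a measured system.
claimed_accuracy = 0.90  # headline accuracy reported on an idealised test set

# regime -> (share of production time, extra error assumed in that regime)
regimes = {
    "best_cost":    (0.05, 0.05),
    "average_cost": (0.90, 0.10),
    "worst_cost":   (0.05, 0.15),
}

# Expected accuracy once each regime's error is weighted by how often it occurs.
expected = sum(share * (claimed_accuracy - extra_error)
               for share, extra_error in regimes.values())

print(f"Expected production accuracy: {expected:.1%}")
# Expected production accuracy: 80.0% -- matching the average-cost figure
# and noticeably below the headline 90%.
```

Weighting each regime by the time the system actually spends in it is what pulls the headline figure down toward the average-cost number.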
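And to make the logical half of such a hybrid concrete, here is a minimal, purely hypothetical sketch: a couple of knowledge-graph triples and a single hand-written rule. The facts, predicates, and rule are invented for illustration; the point is only that whatever the rule derives holds by construction, whereas a trained classifier could at best attach a confidence score to the same statement.

```python
# Hypothetical knowledge-graph facts as (subject, predicate, object) triples.
facts = {
    ("aspirin", "treats", "headache"),
    ("headache", "is_a", "symptom"),
}

def infer_treats_some(facts):
    """If X treats Y and Y is_a Z, derive (X, treats_some, Z)."""
    derived = set()
    for s1, p1, o1 in facts:
        if p1 != "treats":
            continue
        for s2, p2, o2 in facts:
            if p2 == "is_a" and s2 == o1:
                derived.add((s1, "treats_some", o2))
    return derived

print(infer_treats_some(facts))
# {('aspirin', 'treats_some', 'symptom')} -- definite, derived by the rule.
# A purely statistical model could at best report something like
# P(treats_some = True) = 0.87 for the same pair: a confidence, not a guarantee.
```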