Mabble Rabble: DiffBot

8 May 2022

DiffBot

Diffbot is one of the most useless solutions out there for harvesting the web. Their solution is basically what Google already provides for free, and it relies on methods that multiple providers have used for the last twenty years. They do not, in fact, provide a knowledge graph. The solution is simply an indexed crawl that one can replicate with Elasticsearch, or even with Common Crawl data. What they are doing is trying to make fools of organizations and charging a premium for it. There are many free alternatives out there that do a better job.

Their notion of a knowledge graph is a marketing gimmick. The knowledge graph has no real semantics and supports no meaningful inference. Even the data they extract is just raw data, not machine-readable. They add virtually no real metadata. Their solution does not even use schema.org, let alone any W3C standards.

They also don't follow the web etiquette of obeying robots.txt. Diffbot uses a ruthless form of crawling, disguising itself as a human visitor via spoofing. In most cases, their approach is also likely to violate GDPR. There are also no real deep learning models being used for computer vision, AI, or natural language processing. This is a perfect example of an organization trying to sell something that has no real value to would-be customers.
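
For readers unfamiliar with the robots.txt etiquette mentioned above, here is a minimal sketch of the check a well-behaved crawler would run before fetching a page. It uses Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user agent, the `is_allowed` helper, and the example.com URLs are all hypothetical, and the sketch says nothing about how any particular crawler is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would typically
    # call set_url() and read() to download the site's robots.txt.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a made-up, honestly declared user agent.
    for url in (
        "https://example.com/index.html",
        "https://example.com/private/data.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

The check only works if the crawler identifies itself honestly. A crawler that spoofs a browser's user agent sidesteps any rule written for bots by name.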