27 February 2020

Why Neo4j Sucks

  • issues in cluster mode and scaling
  • not built for distributed data (or even big data)
  • issues with bulk loading, indexing (e.g. range, sort, etc.), slow upserts
  • bottleneck with large volume of writes due to slave/master topology
  • extremely expensive to replicate entire graph across each node
  • confusing license terms for production use
  • heap allocation and GC cause out-of-memory errors (buggy configs)
  • practically only useful for small dataset reads that require visualization
  • not appropriate for high read/write/index searches on KG
  • re-indexing can be very slow and requires Elasticsearch/Solr as an added dependency
  • deceptive marketing ploys that rely on flawed benchmarks

25 February 2020

Fake Data Engineers

How to spot the fake data engineer?
  • They prefer to use drag-and-drop interfaces rather than programmatically defining their pipelines
  • They don't use any source control
  • They have no CI/CD process
  • They have no tests
  • They look like they have just spent their entire day coloring in their drawing book, i.e. a GUI
  • They don't have a clue what MapReduce is but know how to use Spark (to some degree)
  • They prefer to use static SQL instead of programmatically defining their DAGs
  • They have no clue what a DAG is
  • They don't have a clue what a dataframe is
  • They can't tell the difference between stream and batch processing
  • They don't know what immutability means
  • They smirk at the thought of thinking through the folder structure axis for a data lake
  • They turn an entire folder structure axis into internal vs external data
  • They transitioned from a SQL/BI background
  • They don't know what a computational graph is
  • They can't tell the difference between a static and a dynamic graph
  • They don't understand loose coupling or separation of concerns even between a pipeline and a model
  • They have no clue about data lineage
  • They have weak skills at abstracting out the workflow steps
  • They can't tell the difference between unstructured and structured data
  • They have no clear idea about software engineering principles
  • They class open big data stacks as a very small aspect of data engineering
  • They think NoSQL literally means no SQL so therefore can only think one-dimensionally
  • They can't think outside the box to solve business cases
  • They have more experience with Azure Cloud compared to any other cloud providers
  • They prefer to use notebooks rather than development lifecycle tools for their work
  • They have never scaled a machine learning model before
  • They know the jargon of Docker and Kubernetes but have no clue about containers
  • They pronounce Kubernetes as kuberneetes
  • They confuse serial and parallel workflows
  • They prefer to use C#/GUIs and PowerShell rather than Go, Python, Java, and Scala for their work
  • They have never attended a conference or summit that relates to their core work
  • They have more certifications, especially Microsoft ones, on their resume than open source work
  • Somehow their workflows have always been smooth and perfectly delivered for production
  • They are unable to break problems down sufficiently into epics and stories
  • They never take ownership when things go wrong; instead they look to blame others
  • They never learn from their mistakes, so they repeat the same mistakes over and over again
  • They act like teachers in the team, but inevitably have little practical experience
  • They are neither very adaptable nor very curious about their work, the data, or the choice of tech stacks
  • They have little depth and breadth of practical experience
  • They don't take ownership of the end-to-end delivery of their work
  • They never acknowledge when they are wrong, and rarely are convinced by others
  • They don't embrace modern data architecture thinking to tackle business challenges
  • They stick to what they know, but are unwilling to train up for what they don't know
  • They are neither resourceful nor willing to scope out better open source alternatives to premium options
  • They build solutions that are of poor working quality
  • They don't ask for clarifications when they don't understand the abstractions of a data workflow
  • They are not very multi-disciplined
  • They make poor problem solvers especially during critical issues
  • They don't understand the value of feature engineering
  • They approach Gartner for everything the minute they need to use a fraction of a brain cell
  • They will try to question or challenge you on the use of best practices, which is usually obvious evidence of their lack of experience
  • They will ask questions for which they already should know the answers as part of their job
  • They will spend more time trying to teach you how to do your job, while at the same time not knowing how to do their own
  • They either over-engineer or under-engineer the solution by spending more time on optimization or how pretty something looks in their GUI
  • They generally don't like to follow standards-driven approaches that could make their job easier for integration work
  • They lack much of the technical experience required for the work, so they try to make up for it with an air of superiority by trying to correct others at every opportunity
  • They have no clue how to clean, extract, mine, load, transform noisy data into an enriched source for consumption
  • They reject the value of metadata and semantic knowledge over the use of static SQL queries
  • They don't understand modularity, so they class everything as shellcode or expect "richer code" (who knows what that means?). Or do they expect monolithic repos that grow endlessly from inception into an unmaintainable pile of crap?
  • Their only source of knowledge is Stack Overflow or asking others; they are inept at googling things themselves, and for everything they will likely want to go on a certification course
  • When they fix a bug in their code, they end up changing code in other places where they are not supposed to, or introduce even more bugs
  • When you refer to any patterns or best practices, they will often ask "who told you that?" or "did you just make it up yourself?". This indicates that they have probably never come across those approaches before, that they only learn when someone formally teaches them how to do something, that they get insecure on realizing how inexperienced they really are, or that they are being defensive by attempting to question your credibility
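
For readers wondering what "programmatically defining a DAG" looks like in practice, here is a minimal sketch using only the Python standard library (`graphlib`, Python 3.9+). The step names, the sample records, and the `PIPELINE` and `run` helpers are all hypothetical and only there for illustration; a real pipeline would more likely use a dedicated orchestrator.

```python
from graphlib import TopologicalSorter


def extract():
    # Hypothetical raw records; a real pipeline would read from a source system.
    return [
        {"id": 1, "value": " 10 "},
        {"id": 2, "value": "n/a"},
        {"id": 3, "value": "7"},
    ]


def clean(rows):
    # Build new records instead of mutating the input (immutability).
    stripped = [{**r, "value": r["value"].strip()} for r in rows]
    return [r for r in stripped if r["value"].isdigit()]


def transform(rows):
    # Convert the cleaned string values to integers.
    return [{**r, "value": int(r["value"])} for r in rows]


def load(rows):
    # Stand-in for a sink: just aggregate the values.
    return sum(r["value"] for r in rows)


# Each step lists the steps it depends on; this mapping is the DAG.
PIPELINE = {
    "extract": (extract, []),
    "clean": (clean, ["extract"]),
    "transform": (transform, ["clean"]),
    "load": (load, ["transform"]),
}


def run(pipeline):
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(deps) for name, (_, deps) in pipeline.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = pipeline[name]
        # Feed each step the outputs of the steps it depends on.
        results[name] = func(*(results[d] for d in deps))
    return results


results = run(PIPELINE)
print(results["load"])
```

Because the dependencies are plain data, the same definition can be versioned, tested, and reordered without touching the step functions, which is the loose coupling the list above keeps coming back to.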

23 February 2020

Ontology Development Stages

  • Ontology Scope
    • What is the purpose of the taxonomy?
    • Who will be using the taxonomy?
    • What content will the taxonomy be covering?
    • What is the scope of the taxonomy?
    • What resources are available for developing the taxonomy?
  • Ontology Reuse
  • Identify Useful Software
  • Knowledge Acquisition
  • Identify Important Terms
  • Identify Additional Terms, Attributes, and Relationships
  • Specify Definitions
  • Integrate With Existing Ontologies
  • Implementation
  • Evaluation
  • Documentation
  • Sustainability

14 February 2020

Types of Knowledge Graph Databases

  • Stardog
  • AllegroGraph
  • JanusGraph
  • Neptune
  • GraphDB
  • Cayley
  • Grakn
  • Blazegraph
  • CM-Well
  • Akutan
  • Halyard
  • Hoply
  • Marmotta
  • NebulaGraph
  • Rya
  • AgensGraph
  • Jena/Fuseki
  • OrientDB
  • Cosmos DB
  • Dgraph
  • Virtuoso
  • Memgraph
  • TigerGraph
  • Sparksee
  • Parliament

Types of Controlled Vocabularies

  • Thesauri
  • Lists
  • Synonym Rings
  • Authority Files
  • Taxonomy
  • Ontology
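
As a rough illustration of how two of these differ structurally, the sketch below models a synonym ring as a flat set of equivalent terms and a taxonomy as a broader-term hierarchy. The terms and the `expand` and `ancestors` helpers are made up for the example and are not taken from any particular vocabulary or tool.

```python
# A synonym ring: a flat set of terms treated as equivalent for retrieval.
SYNONYM_RINGS = [
    {"car", "automobile", "motorcar"},
    {"lorry", "truck"},
]

# A taxonomy: each term points to its broader (parent) term.
BROADER = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": None,
}


def expand(term):
    # Return every term in the same ring, or just the term itself.
    for ring in SYNONYM_RINGS:
        if term in ring:
            return ring
    return {term}


def ancestors(term):
    # Walk up the hierarchy collecting broader terms.
    path = []
    parent = BROADER.get(term)
    while parent is not None:
        path.append(parent)
        parent = BROADER.get(parent)
    return path


print(expand("automobile"))
print(ancestors("car"))
```

An ontology would go further than either structure by adding typed relationships and attributes between terms, which is beyond this sketch.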

1 February 2020

PyATS

Gartner

What is the point of Gartner? It is a marketing intelligence company for enterprise solutions. Let's say that again: a marketing intelligence company that allows organizations to blow their own horn with hyped-up reports. At times it is also an advisory consultancy to clueless and gullible leaders. One has to wonder how Microsoft keeps reappearing at the head of Gartner reports, time and time again, even with its substandard products. A head of something in technology who holds a Gartner subscription is likely using it to sound like they know what they are talking about in front of their technical team and management, which is an indication of incompetence. Most of what Gartner provides as consultancy and intelligence is stale information: not state of the art, not properly benchmarked, likely promoted, and rarely workable for AI solutions. One should not play along with the bluff, and should consider the person relying on such advice as lacking prime-time technical know-how. It could also mean that the head of something has never released an AI solution in production, which is a strong indication of their lack of experience. Furthermore, if after a meeting with a Gartner consultant they start talking about what will or won't work, or aimlessly keep questioning the approach, without ever having used the tools or methods themselves, that is a sure indication that the person is out of their depth. In such organizations, where resources are lacking, there is a high risk of failure: management gives more weight to so-called third-party specialists, whose advice is rarely very useful, while openly rejecting or questioning the practical experience of the technical teams who actually know how things work in practice.
Just don't use Gartner for advice unless you want to lose face over time with your technical team, especially in relation to open source and AI-related projects.

Gartner Magic Quadrant