25 March 2020

Fake Data Scientists

How do you spot a fake data scientist?
  • they have no clue about the stages of the data science method
  • they skip the feature engineering part of the data science method
  • they require data engineers to provide them with cleaned data through an ETL process
  • they need a whole team of technical people to support their work
  • they are only interested in building models, and the models they build are almost always overfitted because they never bother to do the feature engineering work themselves
  • they don't consider creating their own corpus an important step of model building
  • they don't understand the value of features when training a model to solve a business case
  • they have no clue how to build, scale, deploy, and evaluate their models in production
  • they think a PhD means they know everything, but in practice they can do next to nothing
  • they rarely bother to understand the business case or ask the right questions
  • they don't know how to augment the data to create their own corpus for training
  • they don't know how to apply feature selection
  • they don't know how to generalize a model, so they sit there re-tuning their overfitted models
  • they spend years and years sitting in organizations building overfitted models when they could have built generalizable models in weeks or months
  • they don't understand the value of metadata or the value of knowledge graphs for feature engineering
  • they raise ridiculously dumb issues during agile standups, such as having built a model that is missing certain features (i.e. they skipped the feature engineering step)
  • they build a model straight out of a research paper and assume the exploratory step is the entire data science method
  • they use classification approaches when they should be using clustering methods
  • they are unwilling to learn new ways of doing things and unwilling to adapt to change
  • they prefer to use notebooks rather than build a full structured implementation of their models that can be deployed to production
  • they build models that contain no formal evaluation or testing metrics
  • they only partially solve a business case because they skipped the feature engineering or passed that effort to a data engineer
  • they are only interested in quantitative methods and unwilling to think outside the box of what they were taught in academia
  • they build academic models that are not fit for production and add no business value
  • they require a lot of handholding and mentoring to be taught basic coding skills
  • they struggle to understand research papers, and fail to recognize that 80% of such research work is useless and of no inherent value
  • they literally assume that something is state of the art because it is mentioned in a research paper, rather than assessing whether the model is appropriate for the business case
  • they don't bother to visualize the data as part of the exploration stage
  • they don't bother to do background research to identify use cases where a certain approach has worked or not worked for a business
  • they don't bother to consider reusing existing work where it is appropriate
  • they have no understanding of how to clean data
  • they try every model type until something sticks
  • they don't have clarity on how the different model types work
  • they don't fully understand the appropriate context of when to apply a model type
  • they know only a few model methods, and only how to apply them to a narrow set of business cases
  • they don't understand bias and variance
  • they don't know whether they want accuracy or interpretability, nor how to choose between them
  • they don't know what a baseline is
  • they use the wrong set of metrics
  • they incorrectly apply the train/validation/test split (a minimal sketch of a sensible split and baseline appears after this list)
  • they go to the other extreme of focusing on optimization before actually solving the problem
  • they have a PhD and the arrogance to match, but literally no practical experience of applying any of it productively in the workplace, especially against noisy, unstructured data
  • they come with fancy PhDs and spend time teaching others how to do their job, but usually need the help of everyone on the team to do their own
  • they come with a PhD in a specific area but have no willingness to understand how other scientific disciplines apply to data, or they outright dismiss such methods
  • they think AI is just machine learning
  • they want someone to hand them a clean dataset on a silver platter because they can't be bothered to do it themselves, and they don't think it is an important aspect of their work
  • they can't seem to think beyond statistics to solve a problem
  • they have a tendency to look down on people and dismiss anyone who doesn't hold a PhD
  • they struggle to understand basic concepts in computer science
  • they need a separate resource to help them refactor their code, and won't be bothered to do it themselves
  • they find that services like DataRobot help their work by automating machine learning, especially feature engineering, which simply lets them build overfitted models much faster
  • they can't tell the difference between structured and unstructured data
  • they don't have a clue how to deal with noisy data
  • they are not very resourceful in hunting for datasets as part of a curation step
  • they need to be shown how to google for things, and basically need someone constantly showing them how to do things to be effective in the workplace
  • they prefer GUI interfaces that let them build models with buttons and drag-and-drop rather than building them by hand
  • they claim to have been a data scientist for the last 20 years, when the role only went mainstream in industry in the last 4 or 5 years (a reasonable indication of when the role emerged is when it first started appearing on recruitment boards and within organizations)
  • they want to apply machine learning to everything, even where it may be overkill
  • they hold a PhD but are more than happy to plagiarize other people's work and try to take credit for it; in many cases their contribution is probably just exposing it as an API
  • they hold a PhD but try to take credit for the entire piece of work, even when someone else or an entire team has done 80% of it
  • they use personal pronouns like 'I' in most cases, but rarely 'we' when working in a team
  • they only care about their own inputs, outputs, and dependencies for building a model, rather than being flexible, considerate, and thinking as a team about the bigger picture
  • if your 'head of data science' says 'I don't understand' to the point of annoyance, it is a likely indication that they lack the technical competence for that role
  • they think decision trees are just a bunch of rules rather than a machine learning technique (see the second sketch below)
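
On the baseline and train/validation/test points above, here is a minimal sketch of what a sensible setup looks like. It assumes Python with scikit-learn and uses synthetic stand-in data, so treat it as illustrative only: a trivial baseline that any real model has to beat, a validation set for tuning decisions, and a test set that is touched exactly once.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (hypothetical); in practice this comes from your curated corpus.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

# 60/20/20 train/validation/test split; the test set is held back until the very end.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# A trivial baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression().fit(X_train, y_train)

# Model selection and tuning decisions are made on the validation set only.
print("baseline accuracy (val):", accuracy_score(y_val, baseline.predict(X_val)))
print("model accuracy (val):   ", accuracy_score(y_val, model.predict(X_val)))

# The held-out test set is used once, for the final report.
print("model accuracy (test):  ", accuracy_score(y_test, model.predict(X_test)))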
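
And on the decision tree point, this second sketch (again Python with scikit-learn, using the built-in iris dataset purely for illustration) shows that the 'rules' in a decision tree are induced from the training data by a learning algorithm rather than written by hand, which is exactly what makes it a machine learning technique.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Built-in toy dataset, used purely for illustration.
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

# The tree's splits are induced from the training data, not authored by a person.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", tree.score(X_test, y_test))

# These "rules" are the output of learning and would change if the data changed.
print(export_text(tree, feature_names=data.feature_names))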