Big Data, machine learning, AI – the buzzwords that pull you into the magic circle of modern technology. Everyone wants them, everyone wants the possibilities, the growth, the kickstart that a good amount of data and smartly designed software built around it can give you.
But then why do 87% of data analytics, big data and AI projects fail? Aside from organizational and cooperation-related problems (more info here), there is more behind this amazingly high failure rate.
Is your data good data?
You could ask: are good data and bad data actually a thing? Well, maybe not in themselves – but data without context is worthless (read more in our earlier blogpost here). Building the wrong relations, the wrong connections, having the wrong approach and drawing the wrong conclusions can turn your data into bad data.
Like when, in The Big Bang Theory, Leonard, Howard and Raj found the notebooks of a late physicist, Professor Abbot, filled with hundreds of pages of seemingly random numbers without any notes or explanation. It seems like something important – maybe his life's work. Exciting, isn't it? Then they realize it is in fact his daily calorie intake diary. Such a bummer...
You can't know whether your data is valuable if you don't have the context. You have to know what the data is about; otherwise you might well end up building a model to predict a long-dead professor's calorie intake...
Seemed like a good idea...
There are a lot of projects that fail because they were built on something that seemed like a good idea but in reality was completely worthless.
A company with a relatively high employee turnover decided to up their "HR game". They started researching the signs that precede a resignation, trying to predict which employees would leave the company soon. They tracked measurable data – the number of years worked for the company, commuting distance, salary, overtime, sick leaves – and some hardly quantifiable data, like engagement with the company. They were looking for detectable connections that would help them make their employees' next move predictable.
While in some cases collecting data and finding connections simply turns out not to be viable in production at all, in our case that was not the problem. The issue was a wrong understanding of what data is and of what data science and statistics can do for you.
The company was actually planning to use the knowledge they gained: they wanted to apply it in their evaluation and hiring process. The idea was to score new candidates on their likelihood of leaving, and only hire those applicants whose likelihood of leaving the company within the first three years was below a certain threshold.
...but it simply does not work that way!
Now, this sounds great, doesn't it? Just feed in a couple of variables about your new candidates, and a piece of software will spit out the probability of them becoming a loyal, long-term employee.
Is it doable? From a programming point of view, with quantifiable variables, it is no magic to create a program like that. But will it work? The answer is a straight and obvious no.
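To make the "no magic" point concrete, here is a minimal sketch of what such a program boils down to. Everything in it is a made-up assumption for illustration – the feature names and the hand-picked weights are hypothetical, not anything the company actually used:

```python
import math

# Hypothetical, hand-picked weights -- illustration only, not a real model.
WEIGHTS = {
    "years_at_company": -0.3,        # assumed: longer tenure, less likely to leave
    "commute_hours": 0.5,            # assumed: longer commute, more likely to leave
    "overtime_hours_per_week": 0.1,
    "salary_vs_market": -1.2,        # assumed: ratio of salary to market rate
}
BIAS = 0.0

def leave_probability(candidate: dict) -> float:
    """Logistic score between 0 and 1. Computing this is trivial --
    the hard part is whether the inputs mean anything at all."""
    z = BIAS + sum(WEIGHTS[k] * candidate.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

p = leave_probability({
    "years_at_company": 2,
    "commute_hours": 1.5,
    "overtime_hours_per_week": 6,
    "salary_vs_market": 1.0,
})
```

A dozen lines produce a confident-looking "probability of leaving" for any candidate – which is exactly why the idea is seductive, and exactly why the question is never "can we compute a number?" but "does the number mean anything?".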
Besides the fact that collecting data like this is questionable at best with regard to the privacy of employees, it does not make sense from a business perspective either. If we give some real thought to why anyone would want to quit their job, the factors are many, and to begin with they are neither predictable nor measurable. How someone will respond to workplace dynamics, or to personal issues they might encounter, is not quantifiable and cannot be predicted even with the utmost care.
And beware of the self-fulfilling prophecy! A simple notification to a manager that a certain employee is likely to leave the company in the near future will itself affect the workplace dynamics – and might bring about exactly what one wanted to prevent.
The idea completely ignores the single most important reason why some people stay at a workplace even when underpaid or working extreme hours, and why others with a generous salary and personalized benefits will still leave: the human factor (read more here).
Causes or consequences – did you make the right conclusions?
But let's assume that you are able to quantify and take into consideration every human factor – probably a mission impossible, but sticking to the above example, let's assume it for a second and dig deeper into why the idea is inherently wrong.
The software would weigh a vast number of variables that are not necessarily causes, but most likely consequences of an employee leaving the company. Working less overtime or going on sick leave, among other signals, can actually be consequences of a decision already made: the employee has already decided to leave and, preparing for the change, is looking for another job, and so on.
Such an algorithm does not predict; it just tells us what is already happening. By then we are too late to do anything about this employee leaving. That is how wrong connections, or relations that only seemingly exist, can make your data bad data – and lead to wrong conclusions and a wrong business strategy.
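The cause-versus-consequence trap can be made visible with a toy simulation. All the numbers below are invented for illustration: we assume the decision to leave comes first, and the observable "signal" (dropping overtime) mostly appears after that decision:

```python
import random

random.seed(0)

# Toy simulation, purely illustrative: the decision to leave comes FIRST;
# the observable signal is a CONSEQUENCE of that decision.
employees = []
for _ in range(10_000):
    decided_to_leave = random.random() < 0.15           # assumed base rate
    # Overtime drops mostly after the decision has already been made.
    overtime_dropped = random.random() < (0.8 if decided_to_leave else 0.1)
    employees.append((overtime_dropped, decided_to_leave))

# Flag everyone whose overtime dropped and check how often they leave.
flagged = [leaves for dropped, leaves in employees if dropped]
precision = sum(flagged) / len(flagged)
# The flag looks far better than the base rate -- but it only detects a
# decision already made; acting on the flag comes too late to change it.
```

The "predictor" comfortably beats the base rate, which is precisely what makes it misleading: high precision here measures detection of a past decision, not prediction of a future one.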
Machine learning and its limits
Last but not least, even if you get every single part of your use case right – and although we mentioned it in our previous blog post, it can never be emphasized enough – machine learning can be a goal, but it is very rarely a sensible first step in data analytics. The reason is simple: you need a huge, and I mean really huge, amount of data for your software to actually find working relations, rule out the insignificant factors, account for unique cases and extremes, and create predictions that give you useful, trustworthy information.
Having a machine learn relations and connections and create predictions is not so different from having economists and researchers conduct statistical analysis. You need an unbiased, representative sample to reach a conclusion, and to avoid a biased sample you need to know your population. In our example, even if we ignore every other issue, simply reviewing the actions of 500 employees is just not enough to get clean data that can serve as the basis of any sort of prediction.
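A back-of-the-envelope calculation shows why 500 employees is thin. Assuming, purely for illustration, a true yearly attrition rate of 10%, the standard error of a sample proportion already makes the overall estimate fuzzy – before any slicing by tenure, salary band, or commute distance:

```python
import math

# Rough illustration with assumed numbers: n = 500 employees,
# p = 10% true yearly attrition rate.
n = 500
p = 0.10
se = math.sqrt(p * (1 - p) / n)        # standard error of the sample proportion
ci_low = p - 1.96 * se                 # 95% confidence interval, lower bound
ci_high = p + 1.96 * se                # upper bound
# The interval spans roughly 7.4%..12.6% -- for ONE aggregate rate.
# Split those 500 people into subgroups and each estimate gets far noisier.
```

And that is the optimistic case: a single, clean proportion. A model that weighs many interacting variables needs orders of magnitude more examples per combination of factors, which is why "just 500 employees" cannot anchor any trustworthy prediction.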
For a kickstart, most companies don't need machine learning right away: synchronizing and centralizing your data, and automating just some of your daily tasks, can already be a huge step in the right direction.
This is how even small but fast-moving companies are sometimes able to steal market share from well-known giants: they already know there is no need to reinvent the wheel to gain momentum.