Health research in the 21st century has been marked by a massive increase in the availability and use of Big Data, data that include more participants and are more complex than conventional datasets.
This availability, combined with technological developments in prediction from Big Data, such as deep learning, has transformed many industries. Consequently, many have asked whether Big Data will finally allow us to develop cures for psychiatric conditions. The short answer is no.
The development of a cure for a health condition requires us to ask questions that involve causality. In particular, finding a cure asks two questions:
“If I take this treatment, will I stop having this condition?”
and
“If I had not taken the treatment, would I still have the condition?”
Even if we set aside an actual cure and are interested only in discovering the cause of a psychiatric condition, we are still forced to ask the question “If I am exposed to this factor, will I develop a psychiatric condition that I would not have developed had I not been exposed to this factor?”
This sort of causal question requires us to have data on two different worlds: the world in which we were exposed to a factor, and the world in which we were not exposed to it.
Without a time machine, these two worlds are mutually exclusive: only one of them will ever occur.
This means that, for any given question, we only ever have data on one possible world, and for any single individual it is impossible to prove whether their exposure caused their outcome. Even with observational data on a whole population, we still cannot answer individual causal questions.
However, by combining our data with assumptions, we can answer causal questions about how much everyone’s outcome would have changed, on average, had everyone in the population been given a particular exposure of interest.
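To make this concrete, here is a minimal simulation sketch in Python. Everything in it is hypothetical: the risks of 30% and 20%, the function names, and above all the assumption that exposure is assigned by a coin flip, which stands in for the kind of assumption we would have to defend with expert knowledge in real data. Because it is a simulation, we can peek at both worlds for every person, something real data never allow.

```python
import random


def simulate_population(n=100_000, seed=0):
    """Return (y0, y1, exposed) for n hypothetical people.

    y0 is the outcome in the world without exposure and y1 the outcome in the
    world with exposure. Real data never contain both for the same person.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        y0 = int(rng.random() < 0.30)  # assumed risk without exposure
        y1 = int(rng.random() < 0.20)  # assumed risk with exposure
        exposed = rng.random() < 0.5   # coin flip: the key assumption
        people.append((y0, y1, exposed))
    return people


def true_average_effect(people):
    """Average of y1 - y0; computable only because this is a simulation."""
    return sum(y1 - y0 for y0, y1, _ in people) / len(people)


def estimated_average_effect(people):
    """Difference in mean observed outcomes between exposed and unexposed.

    Uses y1 only for exposed people and y0 only for unexposed people,
    which is all that a real dataset would reveal.
    """
    exposed = [y1 for _, y1, e in people if e]
    unexposed = [y0 for y0, _, e in people if not e]
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)


people = simulate_population()
true_ate = true_average_effect(people)
estimated_ate = estimated_average_effect(people)
print(f"true average effect:      {true_ate:+.3f}")
print(f"estimated average effect: {estimated_ate:+.3f}")
```

Both numbers should land near -0.10, even though the estimate never uses more than one world per person. No individual's effect is recovered; only the population average is, and only because the coin-flip assumption holds by construction.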
How then, you might ask, are companies like Google and Facebook able to use our data so effectively to figure out our buying habits? The answer lies in a key distinction between two tasks: prediction and causal inference. Prediction refers to the mapping of observed inputs to observed outputs. Causal inference, on the other hand, requires asking questions about the two mutually exclusive worlds mentioned earlier, for example the worlds in which an individual is and is not exposed to a particular factor.
The technologies used by major companies like Facebook and Google are highly successful at prediction. They can very effectively identify individuals who will visit particular websites or buy particular items, based on the websites those individuals visit and the information they post. In the context of psychiatric conditions, applying these algorithms to sufficiently large datasets may help us to predict who will develop a psychiatric disorder in the next five years.
However, such algorithms cannot tell us how to intervene in the world in order to reduce the number of individuals who will develop a psychiatric disorder. Those questions are causal questions, and they cannot be answered with the data alone, because the data alone do not contain information on each individual’s outcome under multiple possible worlds: the world in which they were exposed to some intervention and the world in which they were not.
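The gap between predicting and intervening can be illustrated with a small Python sketch. It is a toy model with made-up numbers, not an analysis of any real dataset: a hidden factor `u` (hypothetical) makes people both more likely to be exposed and more likely to develop the outcome, while the exposure itself has no effect at all.

```python
import random


def simulate(n=100_000, seed=1, intervene=None):
    """Return (exposed, outcome) pairs for n hypothetical people.

    intervene=None mimics observational data, where exposure depends on a
    hidden factor u. intervene=True/False sets everyone's exposure by hand.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random() < 0.5  # unmeasured common cause
        if intervene is None:
            x = rng.random() < (0.8 if u else 0.2)  # exposure follows u
        else:
            x = intervene                           # exposure forced
        y = rng.random() < (0.4 if u else 0.1)      # outcome ignores x
        rows.append((x, y))
    return rows


def risk(rows, exposed):
    """Share of people with the outcome among those with this exposure."""
    outcomes = [y for x, y in rows if x == exposed]
    return sum(outcomes) / len(outcomes)


observed = simulate()
observed_diff = risk(observed, True) - risk(observed, False)

everyone_exposed = simulate(intervene=True)
no_one_exposed = simulate(intervene=False)
causal_diff = risk(everyone_exposed, True) - risk(no_one_exposed, False)

print(f"risk difference seen in the data:    {observed_diff:+.3f}")
print(f"risk difference under intervention:  {causal_diff:+.3f}")
```

In the observational data, exposure is a useful predictor: exposed people have a clearly higher risk. Yet forcing everyone to be exposed or unexposed leaves the risk unchanged, because the association came entirely from the hidden factor. A prediction algorithm trained on the first dataset has no way of knowing this without outside assumptions about `u`.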
Instead, in order to conduct causal inference, we must combine data with strong assumptions based on expert knowledge. Big Data alone will never give us the answers we seek.