Machine learning techniques used to analyze data might be producing misleading results
Much scientific research across many fields now relies on machine learning software to analyze data that has already been collected. From biomedical research to astronomy, the data sets are very large and expensive to collect. Machine learning (ML) is a branch of statistics and computer science that builds computational systems that learn from data rather than following explicit instructions.
Genevera Allen, associate professor of statistics, computer science, and electrical and computer engineering at Rice University in Houston, warned scientists that the increased use of these techniques is contributing to a growing “crisis in science”:
“There is general recognition of a reproducibility crisis in science right now. I would venture to argue that a huge part of that does come from the use of machine learning techniques in science.”
She stated that scientists should stop leaning on machine learning algorithms and start questioning the accuracy and reproducibility of scientific discoveries made with the help of these techniques. Dr. Allen presented her research at the 2019 annual meeting of the American Association for the Advancement of Science (AAAS), a prominent scientific conference that took place this week in Washington, saying:
“The question is, ‘Can we really trust the discoveries that are currently being made using machine-learning techniques applied to large data sets?’ The answer in many situations is probably, ‘Not without checking,’ but work is underway on next-generation machine-learning systems that will assess the uncertainty and reproducibility of their predictions.”
She says this happens because the software identifies patterns that exist only in the analyzed data and not in the real world. That is why Dr. Allen is now working with a group of biomedical researchers at Baylor College of Medicine in Houston to improve the reliability of results. They are developing the next generation of machine learning algorithms and statistical techniques that can not only sift through big sets of data to make discoveries, but can also evaluate how reliable their predictions are:
“Collecting these huge data sets is incredibly expensive. And I tell the scientists that I work with that it might take you longer to get published, but in the end your results are going to stand the test of time. It will save scientists money and it’s also important to advance science by not going down all of these wrong possible directions.”
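To see how software can find a pattern that exists only in the analyzed data, consider the rough Python sketch below. It is an illustration only, not Dr. Allen's method or her group's software, and every name and number in it (`make_noise`, `best_feature`, 30 samples, 500 features) is an assumption chosen for the example. It generates pure random noise, picks the feature that looks most strongly correlated with the outcome, and then checks whether the same feature holds up on freshly generated data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def make_noise(rng, n_samples, n_features):
    """Hypothetical 'study': every feature and the outcome are pure noise."""
    features = [[rng.gauss(0, 1) for _ in range(n_samples)]
                for _ in range(n_features)]
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    return features, outcome


def best_feature(features, outcome):
    """Return the index and |r| of the feature most correlated with the outcome."""
    scores = [abs(pearson(f, outcome)) for f in features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


rng = random.Random(0)  # fixed seed so the run is repeatable

# "Discovery" data set: few samples, many candidate features.
features, outcome = make_noise(rng, n_samples=30, n_features=500)
idx, r_discovery = best_feature(features, outcome)

# "Replication" data set: new draws from the same process, same feature index.
new_features, new_outcome = make_noise(rng, n_samples=30, n_features=500)
r_replication = abs(pearson(new_features[idx], new_outcome))

print(f"feature {idx}: |r| in discovery data   = {r_discovery:.2f}")
print(f"feature {idx}: |r| in replication data = {r_replication:.2f}")
```

Because nothing in the data is real, any strong correlation in the first data set is an artifact of searching through hundreds of features. Checking the "discovery" against independent data is one simple form of the "not without checking" advice quoted above.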