Machine learning — getting results that are completely wrong

from Lars Syll

Machine-learning techniques used by thousands of scientists to analyse data are producing results that are misleading and often completely wrong.

Dr Genevera Allen from Rice University in Houston said that the increased use of such systems was contributing to a “crisis in science” …

The data sets are very large and expensive. But, according to Dr Allen, the answers they come up with are likely to be inaccurate or wrong because the software is identifying patterns that exist only in that data set and not the real world …

Machine learning systems and the use of big data sets has accelerated the crisis, according to Dr Allen. That is because machine learning algorithms have been developed specifically to find interesting things in datasets and so when they search through huge amounts of data they will inevitably find a pattern.

“The challenge is can we really trust those findings?” she told BBC News.

“Are those really true discoveries that really represent science? Are they reproducible? If we had an additional dataset would we see the same scientific discovery or principle on the same dataset? And unfortunately the answer is often probably not.”

BBC News
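
The mechanism described in the quoted passage, that a search through enough data will inevitably turn up some pattern, can be illustrated with a minimal sketch. It is not taken from Dr Allen's work or from the BBC report: the sizes (50 rows, 2,000 features), the fixed seed, and the helper names `pearson` and `noise_table` are assumptions chosen purely for illustration. Every number is independent random noise, so any "discovery" is spurious by construction.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def noise_table(rng, n_rows, n_features):
    """Return (features, target); every value is independent Gaussian noise."""
    features = [[rng.gauss(0, 1) for _ in range(n_rows)]
                for _ in range(n_features)]
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    return features, target


rng = random.Random(0)          # fixed seed (assumption) so runs are repeatable
N_ROWS, N_FEATURES = 50, 2000   # illustrative sizes, not from the article

# "Discovery" step: search all noise features for the one that best
# tracks the (equally random) target.
features, target = noise_table(rng, N_ROWS, N_FEATURES)
correlations = [pearson(f, target) for f in features]
best = max(range(N_FEATURES), key=lambda i: abs(correlations[i]))
in_sample = correlations[best]

# Replication step: draw a fresh dataset from the same pattern-free
# process and check the same feature index again.
new_features, new_target = noise_table(rng, N_ROWS, N_FEATURES)
out_of_sample = pearson(new_features[best], new_target)

print(f"best feature in the first dataset: r = {in_sample:+.2f}")
print(f"same feature in a fresh dataset:   r = {out_of_sample:+.2f}")
```

Under these assumptions the winning feature typically shows a sizeable correlation in the dataset it was selected from and a much weaker one in the fresh dataset, which is the reproducibility failure the quote describes.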

The central problem with the present ‘machine learning’ and ‘big data’ hype is that so many think that they can get away with analysing real-world phenomena without any (commitment to) theory. But — data never speaks for itself. Without a prior statistical set-up, there actually are no data at all to process. And — using a machine learning algorithm will only produce what you are looking for.

Machine learning algorithms always express a view of what constitutes a pattern or regularity. They are never theory-neutral.
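
One way to see this concretely is a hedged sketch using k-means clustering. The choice of k-means is an assumption made for illustration, not a method discussed in the post, and the function name `kmeans_1d`, the seed, and the sample size are hypothetical. The algorithm's built-in "theory" is that the data consist of k compact groups, so it reports k groups even when the data are spread uniformly and contain no groups at all.

```python
import random


def kmeans_1d(points, k, rng, iterations=50):
    """Very small 1-D k-means (Lloyd's algorithm); returns sorted centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)


rng = random.Random(1)  # fixed seed (assumption) so runs are repeatable

# Structureless data: 1,000 points spread uniformly over [0, 1).
points = [rng.random() for _ in range(1000)]

# Whatever k we ask for, the algorithm hands back exactly k "groups".
for k in (2, 3, 5):
    centroids = kmeans_1d(points, k, rng)
    print(k, [round(c, 2) for c in centroids])
```

The number of clusters reported is decided by the analyst's choice of k, not by anything in the data, which is one sense in which the output reflects the assumptions built into the method.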

Clever data-mining tricks are not enough to answer important scientific questions. Theory matters.

  1. Helge Nome
    February 20, 2019 at 12:21 am

    A machine is a machine is a machine. The so-called “intelligence” in the machine is that of the person who creates the machine program. The rest is just hype. (I have worked with digital computers since the 1960s and they have not fundamentally changed.)

  2. February 20, 2019 at 12:57 am

    New methods of investigation often advance the sciences, and the sciences need new methods or tools of analysis. Experiments were a new method at the dawn of modern science. At that time, there were no definite criteria by which one could judge whether the results obtained were reliable. It took many years for scientists to arrive at a set of criteria such as repeatability, objective measurability, and a well-defined description of the experiment.

    Agent-based simulation is a rather new method that has emerged in the social sciences. Its status and criteria are not yet well defined. However, we should be tolerant until it reaches a certain maturity. In this regard, see my paper A Guided Tour of the Backside of Agent-Based Simulation.

    Machine learning is a new method, but it is rather a method of discovery, or a heuristic. A new finding is not by itself a new truth. We have to find a set of criteria for testing whether such a result is reliable or not. Lars Syll is right to say that theory matters. Concordance with existing theory is one thing. But building a checking system is another effort we need in the new era of big data.

    Another reservation we must keep in mind is the necessity of small-sample but deep analysis. I hear that, in some journals, referees reject research that does not use more than 5,000 samples (the minimum at which a set of data is considered to be “big”). This must be a very silly practice. The best way to counter this tendency is to produce good in-depth research. Objections alone are not sufficient.

  3. Helen Sakho
    February 20, 2019 at 2:08 am

    I agree with Helge. However, machines (reflecting the intelligence of their creators) have to some extent become humanised; it is a shame about humans who have not.

  4. February 20, 2019 at 10:59 am

    The word is out: just heaping lots of data on a pile and doing a stir-fry with math in the hypothesis space does NOT guarantee good output. That is why I am very hesitant about calls for big tech to open their data troves (as a common good). Problems will get worse not better. See e.g. my https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3256146

  5. Grayce
    February 21, 2019 at 6:03 pm

    To add to Hildebrandt’s comment per her abstract, preregistration of the sources is good, but also needed is the DATE of the collected knowledge. Concrete example: In the case of insurance “likelihoods” in the USA, there is an algorithm that conflates use of credit cards with the likelihood of filing an auto accident claim. Before the advent of frequent flyer miles and cash-back arrangements, various cardmembers held multiple retail cards that they either cancelled or let lapse. For those that lapsed, the retail card issuers closed the accounts with various statements to credit reporting agencies such as Experian. Some had negative ratings due to “failure to respond”, and in other cases it appeared that the issuer had dropped the cardholder for cause.

    The net result is less-than-lowest premia for auto insurance to this day. But what if the underlying data belong to a time when the correlation had different drivers?

    Current example. 800-242-6422 – What is a credit-based insurance score?
    A credit-based insurance score, also known as an insurance score, is a snapshot of a consumer’s insurance risk picture at a particular point in time based on information contained in a consumer’s credit report. Since insurance scores have statistically proven to be a sound predictor of future loss, insurers use these scores, along with many other factors, to evaluate new and renewal insurance policies.

    How are insurance scores determined?
    The consumer’s credit information is entered into a computer model which analyzes the information and generates an insurance score. The scores are dynamic, changing as new information is added to a consumer’s credit report. Insurers will typically ask for a current score when they receive a new application for insurance, or prepare to renew an existing policy, so they have the most recent information available.

    What information affects my insurance score?
    An insurance score is generally based on the following: payment history, length of credit history, the amount of outstanding debt in relation to credit limits, types of credit in use and new applications for credit. These items may vary from state to state.

    What information does not affect my insurance score?
    An insurance score does not take into account income, race, gender, marital status, religion, age, geographic location, nationality, ethnicity or handicap. It only considers your credit history.
