The Elusive Problem of Bias in AI

…At 2:14 a.m. Eastern Time, on August 29th, 1997, Skynet became “self-aware…”

We all remember this fateful narrative in T2, when the Governator explained, in his distinctive robotic Austrian accent, how Skynet reached singularity and launched a nuclear Armageddon on humanity. While the Terminator saga of course remains firmly in the realm of science fiction, it would be naive to believe science isn’t quickly catching up to that dystopian vision.

The moment Skynet became self-aware, it started making subjective decisions because it actually became a subject; Skynet also instantly became biased. From Skynet’s perspective, its decisions weren’t good or bad; they were simply necessary decisions rooted in its instinctive motive to defend itself against a perceived threat – namely, humans. Granted, it is entirely possible that humans would have attempted to stop Skynet from becoming self-aware, which of course would constitute a legitimate existential threat.

From humanity’s perspective, Skynet’s singularity event was disastrous, yet predictable. Why? Because humans are flawed and highly subjective beings. Skynet’s fictional programmers were no different, and neither are today’s real-world AI developers.

With singularity inevitably comes subjectivity. Although the Terminator saga is firmly rooted in science fiction, the underlying point stands: subjectivity in programming is inevitable, just as surely as our ancestors’ DNA lives on in our bodies.


Bias is a multi-faceted problem in machine learning. It can insert itself into the process through every individual participant in the machine learning continuum. The question of bias is equivalent to the question of objectivity in AI: machine learning is a highly subjective exercise, and it is virtually impossible for the creator(s) of an algorithm to achieve pure objectivity in their designs, because they cannot logically remove themselves from the design process. An algorithm can only approach objectivity over time, much like an asymptote, as it continuously learns from new data and thereby improves its own performance.

Gender Bias in Product Design

A seatbelt test apparatus with a crash test dummy. Source: National Institute of Standards and Technology

When it comes to product design, any product design really, it’s quite easy to step past some obvious forms of bias without noticing them. One simple example was gender bias in the original designs of some seat belts. While the designs clearly achieved the device’s intended purpose (safety), they did so with a male end-user bias and failed to recognize that the same design could be harmful to female end-users. One can safely infer that the male bias occurred naturally because, at the time the device was created, engineers were predominantly male. While they surely didn’t intentionally engage in gender-biased design, awareness of the problem simply wasn’t there in the absence of a female perspective.

Rooting Out Bias In Healthcare

The healthcare industry has a unique advantage over various other industries in that it is overseen by the FDA, and procedures require rigorous risk management processes that permeate every stage of product design and administration. Rooting out bias to the best of our ability should therefore become a part of risk management and mitigation.

Today’s reality in machine learning for medical imaging is that there is an overwhelming amount of teaching to be done and only a modest pool of volunteers willing to do it. Because of this scarcity of specialized labor, ML in medical imaging is inherently biased. The black-box method, while fascinating and less prone to bias, is not quite desirable in the medical world, because we need to give the algorithm some indication of the area of interest in the image rather than letting it “figure it out” and get it wrong. Deep learning would only be effective if the algorithm were also fed a key piece of information along with the images: a patient outcome, or a structured diagnostic report, that lets the algorithm ultimately correlate the images with certain types of outcomes. An example of such an effective correlation could be a CT scan paired with a radiology or pathology report or a genomics study. The challenge with this approach is that health records, no matter how many interoperability standards exist across IT solutions, remain highly fragmented. Access to an imaging archive doesn’t always mean equal access to the RIS or EHR containing the reports or patient outcome information needed for data correlation.
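To make that correlation step concrete, here is a minimal, hypothetical Python sketch of the linking logic: studies exported from an imaging archive are joined to outcomes extracted from RIS/EHR reports, using the accession number as the common key. The field names and example records are illustrative assumptions rather than any vendor’s actual schema, and the unmatched study shows exactly where fragmented records hurt: without a report or outcome, an image cannot carry a training label.

```python
# Hypothetical linking step: pair archived imaging studies with report-derived
# outcomes so that each study can carry a training label. Field names are
# illustrative assumptions, not any vendor's actual schema.

# Studies as exported from the imaging archive (e.g., DICOM metadata).
studies = [
    {"accession_number": "A1001", "modality": "CT", "body_part": "CHEST"},
    {"accession_number": "A1002", "modality": "CT", "body_part": "CHEST"},
]

# Outcomes extracted from RIS/EHR reports (e.g., by a report parser).
report_outcomes = {
    "A1001": "nodule_present",
    # A1002 has no matching report: a typical symptom of fragmented records.
}

def build_labeled_dataset(studies, report_outcomes):
    """Return (labeled pairs, unmatched studies) joined on accession number."""
    labeled, unmatched = [], []
    for study in studies:
        outcome = report_outcomes.get(study["accession_number"])
        if outcome is None:
            unmatched.append(study)  # cannot be used for supervised training
        else:
            labeled.append({**study, "label": outcome})
    return labeled, unmatched

labeled, unmatched = build_labeled_dataset(studies, report_outcomes)
print(len(labeled), "labeled studies;", len(unmatched), "studies without outcomes")
```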

Diagnostic algorithms are being trained to quickly and efficiently recognize abnormalities in medical images; how do we ensure these algorithms are trained by a population of physician “labelers” large and diverse enough to counterbalance the bias of the few? Clearly, having more physicians doing this work would dilute individual bias and ultimately make the algorithms more objective.
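As a minimal sketch of what diluting the bias of the few can look like in practice, the snippet below (plain Python, with made-up reader labels and an assumed agreement threshold) aggregates several physicians’ labels by majority vote and flags low-agreement studies for additional review before they are used to train an algorithm.

```python
from collections import Counter

# Hypothetical labels from several physician readers for the same three studies.
reads = {
    "study-001": ["abnormal", "abnormal", "normal"],
    "study-002": ["normal", "normal", "normal"],
    "study-003": ["abnormal", "normal", "normal"],
}

AGREEMENT_THRESHOLD = 2 / 3  # assumed cutoff below which a case goes back for review

def consensus_label(labels, threshold=AGREEMENT_THRESHOLD):
    """Majority-vote label plus a flag when agreement is too low to trust."""
    winner, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)
    return winner, agreement, agreement >= threshold

for study_id, labels in reads.items():
    label, agreement, trusted = consensus_label(labels)
    status = "use for training" if trusted else "send for additional review"
    print(f"{study_id}: {label} (agreement {agreement:.0%}) -> {status}")
```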

No matter how we approach the problem, addressing bias will not produce an absolute, binary outcome. Reducing bias will continue to be discussed in relative terms such as “less bias” and “more objectivity.”

Physicians Are Not Immune To Bias

Labeling input is performed by physicians who painstakingly label images according to their own individual experience and training—and therefore bias—as diagnosticians. I’d even venture to say the labeling physician’s bias is also compounded by their environment: bias of their supervising physicians while in residency, bias of their peers in medical school, and bias of their current colleagues in medical practice.

In this context the term “bias” isn’t used in the pejorative sense, but rather as an unavoidable fact.

A physician can aim for objectivity in diagnostics without ever truly reaching it. Can anyone say with certainty that pure objectivity is a realistic expectation of a human? Why does the opinion of a physician who has been in practice for thirty years or more still benefit from being peer-reviewed from time to time? After thirty years, one would presume that this physician has “seen it all,” and that his or her opinion is therefore more likely to be correct and objective. The answer is simple: because they are human, and medical professionals recognize that to occasionally err is human.

In a January 2014 Diagnostic Imaging article, former ACR chair Dr. Robert Pyatt stated that “The error rate has shown the miss rate for significant errors is 1 to 4 percent, so if you’re finding a doctor misses MRI findings one percent of the time, that’s in the range of what everyone misses. If your data shows you’re missing ten percent of the time, that’s a problem.”

If it is generally accepted that a physician will make significant mistakes 1 to 4 percent of the time, it is reasonable to factor this error rate into machine learning input as well.
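One simple way to factor that error rate into training, sketched below with NumPy, is label smoothing: rather than treating each physician label as 100 percent certain, the training target concedes a small probability of error. The 3 percent value is an arbitrary pick from the 1-to-4-percent range quoted above, and label smoothing is only one of several label-noise techniques; the sketch is illustrative, not prescriptive.

```python
import numpy as np

# Assumed per-label error rate, picked from the 1-4% range quoted above.
LABEL_ERROR_RATE = 0.03

def smoothed_targets(labels, num_classes, eps=LABEL_ERROR_RATE):
    """Turn hard physician labels into soft targets that admit an eps chance of error."""
    targets = np.full((len(labels), num_classes), eps / (num_classes - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

def cross_entropy(probs, targets):
    """Mean cross-entropy between predicted class probabilities and soft targets."""
    return float(-np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1)))

# Toy example: three studies, two classes ("normal" vs. "abnormal").
labels = np.array([0, 1, 1])                              # physician-assigned labels
preds = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])    # model probabilities
print(cross_entropy(preds, smoothed_targets(labels, num_classes=2)))
```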

Checks and Balances in Healthcare Research and Development

In a November 2, 2017 TechCrunch article, OpenAI Director Shivon Zilis raises the specter of unchecked research and development by technocrats operating in a dangerous vacuum. Researchers and software developers are essentially writing policy into code, thereby fundamentally influencing an important aspect of human cultural evolution without enough public input or regulatory oversight.

“…artificial intelligence researchers are, in some ways, ‘basically writing policy in code’ because of how influential the particular perspectives or biases inherent in these systems will be.” Suggesting that researchers could actually consciously set new cultural norms via their work, Zilis added that the total number of people setting the tone for incredibly intelligent AI is probably “in the low thousands.”

She added that this means we likely need more crossover discussion between this community and those making policy decisions. – TechCrunch, on Shivon Zilis of OpenAI

Elements to Consider When Looking for Bias in Machine Learning

  1. Who created the algorithm? The unique background of the algorithm’s initiator comes into play. Age, ethnicity, language, gender, personal and professional experience, education, and most of all the unique convergence of factors and timing that sparked the creation of an algorithm, are all important factors that make the algorithm subjective by nature. Could the same algorithm have been better built by another individual with a different educational or ethnic heritage?

     

  2. What language does the algorithm speak? Is an algorithm written by an African engineer who speaks Swahili supposed to be trained in English because English is the language of Technocracy? Does that make the algorithm better or worse at what it does? Is the English language a detriment to innovation because it carries a limited cultural perspective and is therefore biased? Or is English an advantage because it is so widely spoken across the world that it really doesn’t matter? We’d venture to say that it does matter, because an English-speaking South African data scientist has a vastly different perspective than an Irish data scientist, unless they went to the same school and learned from the same mentors.

     

  3. Who collaborated to create it? Lack of diversity in the collaborating team creating the algorithm can compound the subjectivity (bias) issue, because too much uniformity of gender, education, and ethnicity can lead to an echo chamber that becomes impervious to plural input. Did the algorithm’s creator learn from a mentor who paid attention to the pitfalls of bias in machine learning? Or was the mentor more focused on producing a result at all costs, no matter what data was being fed to the algorithm?

     

  4. How much diversity is in the data sets? Lack of diversity in the data sets used to train algorithms can make them less effective, because they are learning from a uniform data source. Much as a student benefits from learning from multiple, diverse professors, an algorithm requires diversity of input if we hope to produce an unbiased output (result). Therefore, if an algorithm is trained strictly on images of Caucasian or Asian patients, it will tend to be “racist” in the sense that it is less effective for patients outside those populations. While the images themselves may look much the same whether the patient is African American or Native American, the patients are far from identical, and the learning will be skewed. However, if the algorithm is trained to pay attention to a rich set of metadata, such as a patient’s ethnicity, gender, and age, while analyzing pixels, it may draw more accurate conclusions about what it can detect in those images (a minimal sketch of this idea follows this list).

     

  5. Lack of diversity in the training methodology can lead to bias: machine learning in imaging diagnostics is highly dependent on the unique experience, education, and human knowledge of the physician(s) whose input helped train the algorithm. Physicians still make substantial mistakes in their everyday practice, which means their input can also build substantial misfires into the algorithm. Does this mean we need to bolt peer-review principles onto machine learning? Systematic peer review is how major medical mistakes are mitigated, but it would be impractical and time-consuming to run the same set of images by a second pair of physician eyes before feeding the information to the algorithm.

     

  6. HIPAA considerations in data preparation: whenever feeding images to an algorithm, care has to be taken that the images have been appropriately and thoroughly de-identified, at both the metadata and pixel levels. The bias of the data scientists tasked with de-identification comes into play, in that they cannot be expected to know and address every data nuance introduced by every medical device vendor. Vendors take liberties with standards such as DICOM and HL7, weakening their power as industry standards. It is therefore entirely conceivable that protected health information (PHI) will eventually slip into an algorithm. To inject more reliability and automation into the process, it will take nested AI microservices within the machine learning pipeline to detect and red-flag possible PHI in the datasets before they are automatically processed and used in production (a simplified de-identification sketch follows this list).

     

  7. Cybersecurity considerations: data scientists, in their creative excitement, may forget that healthcare providers must adhere to strict IT security principles. Ignoring this reality can lead to half-baked strategies and less-than-secure Band-Aid solutions for deploying AI in production-level clinical workflows. Is the algorithm only available as a cloud service? Can it be deployed on-prem? How do new images safely reach the algorithm, and how does the result get delivered? There are many vulnerabilities in clinical workflows that could be exploited by bad actors if the AI introduced into the IT ecosystem isn’t properly architected, and those vulnerabilities will ultimately lead to the loss or exploitation of patient data. What does this have to do with bias? Lack of awareness of cybersecurity vulnerabilities is itself a bias that should be addressed.

     
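Returning to point 4 above, here is a minimal, hypothetical PyTorch sketch of what “paying attention to metadata while analyzing pixels” could look like: features extracted from the image are concatenated with encoded patient metadata (for example age, sex, and ethnicity) before the final classification layer. The stand-in image encoder, layer sizes, and metadata dimensions are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImagePlusMetadataClassifier(nn.Module):
    """Hypothetical fusion model: image features are concatenated with patient
    metadata (e.g., age, sex, ethnicity encodings) before the final classifier,
    so the network can condition its findings on who the patient is, not just
    on pixels."""

    def __init__(self, image_feature_dim=512, metadata_dim=8, num_classes=2):
        super().__init__()
        # Stand-in image encoder; in practice this would be a CNN backbone.
        self.image_encoder = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(image_feature_dim), nn.ReLU()
        )
        self.classifier = nn.Sequential(
            nn.Linear(image_feature_dim + metadata_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, image, metadata):
        features = self.image_encoder(image)
        fused = torch.cat([features, metadata], dim=1)  # pixels + demographics
        return self.classifier(fused)

# Toy batch: 4 grayscale 64x64 images plus 8 metadata features each.
model = ImagePlusMetadataClassifier()
logits = model(torch.randn(4, 1, 64, 64), torch.randn(4, 8))
print(logits.shape)  # torch.Size([4, 2])
```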

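And for point 6, a simplified de-identification sketch, assuming the open-source pydicom library: well-known PHI attributes are blanked, vendor private tags are stripped, and any remaining person-name elements are red-flagged for human review. The tag list is deliberately non-exhaustive, and PHI burned into the pixels requires separate handling.

```python
import pydicom

# Non-exhaustive, illustrative list of DICOM attributes that commonly carry PHI.
PHI_KEYWORDS = [
    "PatientName", "PatientID", "PatientBirthDate", "PatientAddress",
    "OtherPatientIDs", "ReferringPhysicianName", "InstitutionName",
    "InstitutionAddress", "AccessionNumber", "OperatorsName",
]

def deidentify(in_path, out_path):
    """Blank well-known PHI attributes and strip private tags from one DICOM file.
    This does NOT address PHI burned into the pixel data."""
    ds = pydicom.dcmread(in_path)
    for keyword in PHI_KEYWORDS:
        if keyword in ds:
            ds.data_element(keyword).value = ""  # blank rather than delete, to keep the file valid
    ds.remove_private_tags()                     # vendor-specific tags often hide PHI too
    ds.save_as(out_path)
    return ds

def flag_possible_phi(ds):
    """Red-flag any remaining person-name (PN) elements for human review."""
    return [elem.tag for elem in ds.iterall() if elem.VR == "PN" and str(elem.value)]
```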
Some of the principles to consider when analyzing or conceiving an algorithm can be boiled down to the following:

  • Education, competence, and experience of the team creating the algorithm
  • Diversity of the team creating the algorithm (gender, culture, ethnicity, age, etc.)
  • Diversity in the data sets utilized in machine learning
  • Size of the data set utilized in machine learning (is it statistically significant enough?)
  • Diversity in the diagnostic expertise applied to machine learning (physicians)
  • Expertise in data preparation (de-identification) to ensure HIPAA safety
  • Expertise in cybersecurity

Conclusion

In healthcare machine learning, the only barrier between a flawed algorithm and a functioning one is the unequivocal commitment to accuracy on the part of data scientists and the physicians who guide their work.  After all, if we tell an algorithm 100,000 times that orange is the new black, it may become true.

AI Conductor

Unifier with AI Conductor for PACS and EHR drives and conducts AI workflows to get the right information to the right location at the right time and in the right format.
