The Elusive Problem of Bias in AI
We all remember the fateful scene in T2 when the Governator explained, in his distinctive robotic Austrian accent, how Skynet reached singularity and launched a nuclear armageddon on humanity. While the Terminator saga of course remains firmly in the realm of science fiction, it would be naive to believe science isn’t quickly catching up to that dystopian vision.
From humanity’s perspective, Skynet’s singularity event was disastrous, yet predictable. Why? Because humans are flawed and highly subjective beings. Skynet’s fictional programmers were no exception, and neither are today’s real-world AI developers.
With singularity inevitably comes subjectivity. Fiction aside, subjectivity in programming is inevitable, just as surely as our ancestors’ DNA lives on in our bodies.
Gender Bias in Product Design
Rooting Out Bias In Healthcare
Today’s reality in machine learning for medical imaging is that there is an overwhelming amount of teaching (labeling) to be done and only a modest pool of volunteers willing to do it. Because of this scarcity of specialized labor, ML in medical imaging is inherently biased. The black-box method, while fascinating and less prone to bias, is not desirable in the medical world: we need to give the algorithm some indication of the area of interest in an image, rather than letting it “figure it out” and get it wrong. Deep learning is only effective if the algorithm is also fed a key piece of information along with the images: a patient outcome, or a structured diagnostic report that lets the algorithm ultimately correlate the images with certain types of outcomes. An example of such an effective correlation could be a CT scan paired with a radiology or pathology report or a genomics study. The challenge with this approach is that health records, no matter how many interoperability standards exist across IT solutions, remain highly fragmented. Access to an imaging archive doesn’t always mean equal access to the RIS or EHR containing the reports or patient outcome information needed for data correlation.
Diagnostic algorithms are being trained to quickly and efficiently recognize abnormalities in medical images; how do we ensure these algorithms are trained by a population of physician “labelers” that is statistically significant enough to counterbalance the bias of the few? Clearly, having more physicians doing this work would result in diluted bias, and ultimately more objectivity on the part of the algorithms.
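As a back-of-the-envelope illustration (our sketch, not a method the article prescribes), the dilution effect of adding labelers can be quantified. If each physician independently mislabels a case with some probability, a majority vote across an odd panel of k labelers drives the expected label error down sharply:

```python
from math import comb

def majority_vote_error(per_labeler_error: float, k: int) -> float:
    """Probability that a majority of k independent labelers
    (k odd, to avoid ties) agrees on the wrong label."""
    assert k % 2 == 1, "use an odd panel size to avoid ties"
    needed = k // 2 + 1  # wrong votes needed for a wrong majority
    return sum(
        comb(k, i) * per_labeler_error**i * (1 - per_labeler_error)**(k - i)
        for i in range(needed, k + 1)
    )

# A single physician missing 4% of findings...
print(majority_vote_error(0.04, 1))  # 0.04
# ...versus a panel of five voting by majority: roughly a 60x reduction.
print(majority_vote_error(0.04, 5))
```

The independence assumption is the catch: physicians trained in the same school of thought make correlated mistakes, which is exactly the diversity argument made throughout this piece.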
No matter how we approach the problem, addressing bias will not produce an absolute, binary outcome. Reducing bias will continue to be discussed in relative terms such as “less bias” and “more objectivity.”
Physicians Are Not Immune To Bias
In this context, the term “bias” isn’t used in the pejorative sense, but rather as an unavoidable fact.
A physician can aim for objectivity in diagnostics without ever truly reaching it. Can anyone honestly claim that pure objectivity is a realistic expectation of a human? Why does the opinion of a physician who has been in practice for thirty years or more still benefit from being peer-reviewed from time to time? After thirty years, one would presume that this physician has “seen it all,” and that his or her opinion is therefore more likely to be correct and objective. The answer is simple: because they are human, and medical professionals recognize that to occasionally err is human.
In a January 2014 Diagnostic Imaging article, former ACR chair Dr. Robert Pyatt stated that “The error rate has shown the miss rate for significant errors is 1 to 4 percent, so if you’re finding a doctor misses MRI findings one percent of the time, that’s in the range of what everyone misses. If your data shows you’re missing ten percent of the time, that’s a problem.”
If it is generally accepted that a physician will make significant mistakes 1 to 4 percent of the time, it is reasonable to factor this error rate into machine learning input as well.
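One generic way to factor a known annotation error rate into training (our assumption of technique; the article does not prescribe one) is to soften the training targets rather than treat physician labels as absolute ground truth:

```python
def noise_adjusted_target(label: int, n_classes: int, error_rate: float) -> list[float]:
    """Turn a hard physician label into a soft target distribution that
    reflects an assumed per-label error rate (e.g. the 1-4% miss rate above)."""
    # Spread the assumed error mass evenly over the alternative classes...
    soft = [error_rate / (n_classes - 1)] * n_classes
    # ...and keep the remaining mass on the assigned label.
    soft[label] = 1.0 - error_rate
    return soft

# A binary "normal vs. abnormal" label with a 4% assumed miss rate:
print(noise_adjusted_target(1, 2, 0.04))  # [0.04, 0.96]
```

Trained against such targets, a model is never pushed to be more confident than its human teachers have earned the right to be.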
Checks and Balances in Healthcare Research and Development
“…artificial intelligence researchers are, in some ways, ‘basically writing policy in code’” because of how influential the particular perspectives or biases inherent in these systems will be. Researchers could consciously set new cultural norms via their work, yet the total number of people setting the tone for incredibly intelligent AI is probably “in the low thousands.” This means we likely need more crossover discussion between this community and those making policy decisions. — Shivon Zilis, OpenAI
Elements to Consider When Looking for Bias in Machine Learning
- Who created the algorithm? The unique background of the algorithm’s creator comes into play. Age, ethnicity, language, gender, personal and professional experience, education, and above all the unique convergence of factors and timing that sparked the creation of the algorithm all make it subjective by nature. Could the same algorithm have been built better by another individual with a different educational or ethnic heritage?
- What language does the algorithm speak? Is an algorithm written by an African engineer who speaks Swahili supposed to be trained in English because English is the language of technocracy? Does that make the algorithm better or worse at what it does? Is the English language a detriment to innovation because it carries a limited cultural perspective and is therefore biased? Or is English an advantage because it is so widely spoken across the world that it really doesn’t matter? We’d venture to say that it does matter, because an English-speaking South African data scientist has a vastly different perspective than an Irish data scientist, unless they went to the same school and learned from the same mentors.
- Who collaborated to create it? Lack of diversity in the team building the algorithm can compound the subjectivity (bias) issue, because too much uniformity of gender, education, and ethnicity can lead to an echo chamber that becomes impervious to outside input. Did the algorithm’s creator learn from a mentor who paid attention to the pitfalls of bias in machine learning? Or was the mentor more focused on producing a result at all costs, no matter what data was being fed to the algorithm?
- How much diversity is in the data sets? Lack of diversity in the data sets used to train algorithms can make them less effective, because they are learning from a uniform data source. Much as a student benefits from learning from multiple, diverse professors, an algorithm requires diversity of input if we hope to produce an unbiased output. If an algorithm is trained strictly on images of Caucasian or Asian patients, it will tend to be “racist” in the sense that it is less effective at distinguishing between patients. While the images themselves may look much the same whether the patient is African American or Native American, the patients differ from one another in meaningful ways, and the learning will be skewed. However, if the algorithm is trained to pay attention to a rich set of metadata, such as a patient’s ethnicity, gender, and age, while analyzing pixels, it may draw more accurate conclusions about what it can detect in those images.
- Lack of diversity in the training methodology can lead to bias: machine learning in imaging diagnostics is highly dependent upon the unique experience, education, and human knowledge of the physician(s) whose input helped train the algorithm. Physicians can still make substantial mistakes in their everyday practice, which means their input could also lead to substantial errors being built into the algorithm. Does this mean we need to bolt peer-review principles onto machine learning? Systematic peer review is how major medical mistakes can be mitigated, but it would be impractical and time-consuming to run the same set of images by a second pair of physician eyes before feeding the information to the algorithm.
- HIPAA considerations in data preparation: whenever feeding images to an algorithm, care has to be taken that the images have been appropriately and thoroughly de-identified, both at the metadata and pixel levels. The bias of the data scientists tasked with de-identification comes into play, in that they cannot be expected to know and address every data nuance introduced by every medical device vendor. Vendors take liberties with standards such as DICOM and HL7, weakening their power as industry standards. It is therefore entirely conceivable that protected health information (PHI) will slip into an algorithm’s training data. To inject more reliability and automation into the process, it will take nested AI microservices within the machine learning pipeline to detect and red-flag possible PHI among the datasets before they are automatically processed and used in production.
- Cybersecurity considerations: data scientists, in their creative excitement, may forget that healthcare providers must adhere to strict IT security principles. Ignoring this reality can lead to half-baked strategies and less-than-secure Band-Aid solutions for deploying AI in production-level clinical workflows. Is the algorithm only available as a cloud service? Can it be deployed on-prem? How do new images safely reach the algorithm, and how does the result get delivered? There are many vulnerabilities in clinical workflows that could be exploited by bad actors if the AI introduced into the IT ecosystem isn’t properly architected. Vulnerabilities in clinical workflows will ultimately lead to loss or exploitation of patient data. What does this have to do with bias? Lack of awareness of cybersecurity-related vulnerabilities is itself a bias that should be addressed.
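The PHI red-flagging microservice described above can be sketched at its simplest as a rule-based screen over image metadata. Everything here is illustrative: the tag names mimic DICOM attributes but the dictionary is a stand-in, and real de-identification follows the DICOM standard’s confidentiality profiles, which cover far more than a few regexes.

```python
import re

# Hypothetical identifying tags and value patterns -- a real pipeline
# would implement the DICOM de-identification profiles in full.
PHI_TAG_NAMES = {"PatientName", "PatientID", "PatientBirthDate", "ReferringPhysicianName"}
PHI_VALUE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like number
    re.compile(r"\b\d{4}[-/]\d{2}[-/]\d{2}\b"),  # ISO-style date
]

def flag_possible_phi(metadata: dict) -> list:
    """Return (tag, reason) pairs that a human should review before
    the record is allowed into a training pipeline."""
    flags = []
    for tag, value in metadata.items():
        if tag in PHI_TAG_NAMES:
            flags.append((tag, "known identifying tag"))
        elif any(p.search(str(value)) for p in PHI_VALUE_PATTERNS):
            flags.append((tag, "value matches a PHI-like pattern"))
    return flags

record = {"PatientName": "DOE^JANE", "StudyDescription": "CT CHEST", "StudyDate": "2021-03-14"}
print(flag_possible_phi(record))
```

Such a screen is deliberately paranoid: it flags the study date even though dates are only sometimes PHI, because the cost of a false alarm is a human review, while the cost of a miss is a HIPAA breach.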
- Education, competence, and experience of the team creating the algorithm
- Diversity of the team creating the algorithm (gender, culture, ethnicity, age, etc.)
- Diversity in the data sets utilized in machine learning
- Size of the data set utilized in machine learning (is it statistically significant enough?)
- Diversity in the diagnostic expertise applied to machine learning (physicians)
- Expertise in data preparation (de-identification) to ensure HIPAA safety
- Expertise in cyber-security