AI Series Part III: Bias in Machine Learning – Peer-Review in the Age of Artificial Intelligence

By Florent Saint-Clair

A recent Technology Review article on bias in machine learning resonates ominously in medical imaging.

“The real safety question, if you want to call it that, is that if we give these systems biased data, they will be biased.” John Giannandrea, Google AI Chief

In our previous blog post, we argued that physicians must be incentivized to “donate” their expertise to the cause of machine learning.  Crowdsourcing diagnostic expertise is a tall order, as physicians are both suspicious of artificial intelligence and have little incentive to do for free what they are routinely paid to do as professionals.  Today’s reality in ML for medical imaging is that there is an overwhelming amount of teaching to do and only a modest pool of volunteers available to do it.  Because of this scarcity of specialized labor, ML in medical imaging is inherently biased.

Does The “Black Box” Method Of Teaching Neural Networks Have A Place In Medical Imaging?

Crowdsourcing is already being used to let neural networks teach themselves, an approach also known as the “Black Box” method.  This deep learning methodology essentially lets algorithms figure things out on their own: the engine is fed millions of pictures, each accompanied by a single label such as “cat” or “dog.”  Without being told where in the picture the cat or the dog is, the algorithm eventually associates shapes in the image with the concept of a cat or a dog.  The Black Box method, while fascinating and devoid of bias, is not quite desirable in the medical world, because we would need to give the algorithm some indication of the area of interest in the image rather than let it figure that out on its own and possibly get it wrong.

Technically, the same deep learning methodology could be utilized in medical imaging if and only if the algorithm were also fed a key piece of information along with the images: a patient outcome, or a structured diagnostic report that lets the algorithm ultimately correlate the images with certain types of outcomes.  An example of such an effective correlation could be a CT scan paired with a pathology report or genomics study.  The challenge with this approach is that health IT solutions, no matter how many interoperability standards exist, remain highly fragmented.  Access to an imaging archive doesn’t always mean equal access to a RIS or EHR containing reports or patient outcome information.

As I was researching this topic, a radiologist friend of mine pointed out that Black Box methodology could apply to more mechanical functions, such as distinguishing good-quality from bad-quality images (for example, sharp vs. noisy), rejecting the bad ones, and sending a request to the technologist for new images to be captured and added to the exam before it reaches the diagnostician.  However, we are still quite far from being able to entrust an algorithm with actual diagnostic duties – meaning that the supervision of a human physician remains an absolute necessity.

Diagnostic Algorithms Recognize Abnormalities In Medical Images

Diagnostic algorithms are being trained to quickly and efficiently recognize abnormalities in medical images; how do we ensure these algorithms are fed by a population of physician “labelers” that is statistically large enough to counterbalance the bias of the few?  Clearly, having more physicians doing this work would result in diluted bias, and ultimately more objectivity on the part of the algorithms.

The Inherent Human Bias In Image Labeling

The labeling input is performed by physicians who painstakingly label images according to their own individual experience and training—and therefore bias—as diagnosticians. I’d even venture to say the labeling physician’s bias is also compounded by their environment: bias of their supervising physicians in residency, bias of their peers in medical school, and bias of their current colleagues in medical practice.

In this context the term “bias” isn’t used in the pejorative sense, but rather as an unavoidable fact.  A physician can aim for objectivity in diagnostics without ever truly reaching it.  Can anyone truthfully claim that pure objectivity is a realistic expectation of a human?  Why does the opinion of a physician who has been in practice for thirty years or more still benefit from being peer-reviewed from time to time?  After thirty years, one would presume that this physician has “seen it all,” and that their opinion is therefore more likely to be correct and objective.  The answer is simple: because they are human, and medical professionals recognize that to occasionally err is human.

How Many Images Does It Take For An Algorithm To Learn?

In a recent VentureBeat article, Google Brain chief Jeff Dean points out that deep learning takes at least 100,000 examples.  In other words, an algorithm is considered to have successfully learned one specific “thing” once the data sample fed to it becomes statistically significant; the empirical observation by Google Brain scientists is that this threshold sits around 100,000 examples.  Following this logic in medical imaging, it would take 1,000,000,000 image examples (10,000 × 100,000) for an algorithm, or a neural network made up of multiple algorithms, to learn 10,000 “things.”  The labeling task is enormous.  Because of these numbers, it makes a lot of sense to focus the effort on low-hanging, high-value targets such as cancer.
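The arithmetic here can be sketched in a few lines of Python.  The 100,000-examples-per-concept threshold is the empirical Google Brain figure cited above; the 10,000-concept target is our hypothetical.

```python
# Back-of-the-envelope labeling effort, using the empirical
# ~100,000-examples-per-concept threshold cited above.
EXAMPLES_PER_CONCEPT = 100_000

def total_examples(concepts: int, per_concept: int = EXAMPLES_PER_CONCEPT) -> int:
    """Labeled images needed to teach `concepts` distinct findings."""
    return concepts * per_concept

# Teaching 10,000 distinct findings:
print(total_examples(10_000))  # 1,000,000,000 labeled images
```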

Presuming that we solve this data supply-chain puzzle, the question remains: now that this algorithm has learned this one thing after seeing it 100,000 times, is the one thing it learned biased?  Who labeled these 100,000 examples?  Was it 100 physicians each labeling 1,000 images?  10,000 physicians each labeling 10 images?  100,000 physicians each labeling one image?  We can agree that as the number of physicians providing input increases, the level of relative objectivity also increases, and the amount of bias becomes more diluted (though never eliminated).
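The dilution claim follows standard sampling logic: if each labeler contributes an independent systematic offset, the bias of the pooled labels shrinks roughly with the square root of the number of labelers.  A minimal sketch (the per-labeler spread of 1.0 is purely illustrative):

```python
import math

def pooled_bias_spread(per_labeler_spread: float, n_labelers: int) -> float:
    """Std. deviation of the average bias across n independent labelers.

    Assumes each labeler's systematic bias is an independent draw with
    the given spread; averaging n of them shrinks it by sqrt(n).
    """
    return per_labeler_spread / math.sqrt(n_labelers)

# 100 physicians labeling 1,000 images each vs. 100,000 physicians
# labeling one image each: same 100,000 labels, very different bias.
print(pooled_bias_spread(1.0, 100))      # 0.1
print(pooled_bias_spread(1.0, 100_000))  # ~0.003
```

Correlated biases (shared training, shared institutional culture) shrink more slowly than this, which is exactly why the post argues for as wide and varied a labeler population as possible.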

Preparing DICOM Images For Machine Learning Algorithms: De-Identify and Label

Therein lies a fundamental challenge in imaging AI: raw images in DICOM format cannot be fed natively to machine learning algorithms.  Before 100,000 examples can reach the algorithm while preserving the absolute integrity of patient privacy and security, those 100,000 image examples must be methodically selected, extracted from their respective data sources, and rigorously de-identified.  Once de-identified, the images must be labeled by physicians before they can become relevant sources of data for an algorithm to learn from.
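The de-identification step can be illustrated at the header level.  The sketch below scrubs a parsed DICOM header represented as a plain dict; the tag list is a small illustrative subset – real de-identification follows the much longer DICOM PS3.15 confidentiality profile and must also handle burned-in annotations in the pixel data.

```python
# Illustrative subset of identifying DICOM attributes; the full
# confidentiality profile (DICOM PS3.15) lists far more.
PHI_TAGS = {
    "PatientName", "PatientID", "PatientBirthDate",
    "PatientAddress", "ReferringPhysicianName", "InstitutionName",
}

def deidentify_header(header: dict) -> dict:
    """Return a copy of a parsed DICOM header with PHI attributes removed."""
    return {tag: value for tag, value in header.items() if tag not in PHI_TAGS}

exam = {
    "PatientName": "DOE^JANE",
    "PatientID": "12345",
    "Modality": "CT",
    "StudyDescription": "CHEST W/O CONTRAST",
}
clean = deidentify_header(exam)
print(clean)  # {'Modality': 'CT', 'StudyDescription': 'CHEST W/O CONTRAST'}
```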

For this reason, a de-identified data-lake approach to machine learning is justified.  In order to increase objectivity and reduce built-in bias in diagnostic imaging AI, the labeling effort must be crowdsourced from as wide a population of diagnosticians as possible.  The larger the “labeler” population, the higher the level of objectivity, and the lower the chances that an ML algorithm will have been trained with built-in bias.

Peer Reviews Protect Against Human Bias in The Pre-AI World

In the analog world of imaging diagnostics (natural intelligence), we guard against bias in diagnostics by requiring systematic and random peer-reviews to ensure that images were correctly interpreted by the primary reading physician.

In the realm of AI, wouldn’t it be counter-intuitive to have humans peer-review AI-generated interpretations?  It would also be impractical to expect humans to peer-review as quickly as AI can produce a report or a preliminary interpretation.  After all, we’re training algorithms to accelerate diagnostics (given the scarcity of available diagnosticians and increasing imaging volume), not to slow them down.

A corollary question: can AI be an objective diagnostician if its training came from a relatively limited pool of contributing physicians?  Wouldn’t the individual bias of these few contributors be perpetuated inside the resulting algorithm?  So far, we’ve seen diagnostic accuracy from neural networks ranging from 80% to 95%.  Are these numbers within acceptable margins of error?  Are they consistent with human diagnostic results?  At what point do these percentages become consistent enough for a neural network to be formally entrusted with some routine diagnostic tasks?

The Human Error Factor In Image Labeling

In a January 2014 Diagnostic Imaging article, former ACR chair Dr. Robert Pyatt stated that “The error rate has shown the miss rate for significant errors is 1 to 4 percent, so if you’re finding a doctor misses MRI findings one percent of the time, that’s in the range of what everyone misses. If your data shows you’re missing ten percent of the time, that’s a problem.”

If it is generally accepted that a physician will make significant mistakes 1 to 4 percent of the time, it is reasonable to also factor this error rate into Machine Learning labeling input.

Let’s consider this anecdotally: a radiologist is in the reading room for 10 hours, reading MRIs with a 1 to 4 percent error rate, finishes his or her worklist for the day, goes home, has dinner, a glass of wine, then moonlights as a ML labeler for a few hours.  Is it safe to presume that same radiologist will bring home the same error rate of 1 to 4 percent into his or her ML labeling activities?  Is it possible that fatigue may introduce a higher percentage of error into Machine Learning input?  If so, then the algorithm will inherit these error rates.

Should error rates be considered part of the bias physicians unavoidably impart to the neural networks they are training?  This statistic suggests there should be just as much rigor applied to Machine Learning input as there is in actual clinical diagnostics.  How controlled is the environment in which the labeling takes place?  How many distractions exist?  What kind of software is being used to do the labeling, and how intuitive is its user interface?
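One concrete form of rigor in labeling is redundant labeling with a majority vote.  If each labeler independently errs at the 1-to-4-percent rate cited above, a simple binomial calculation shows how quickly consensus suppresses the inherited error.  This is an idealized sketch: it assumes labelers err independently, which shared training and shared bias will undermine.

```python
from math import comb

def majority_error(p: float, k: int) -> float:
    """Probability a majority of k independent labelers (k odd) is wrong,
    given each errs with probability p."""
    need = k // 2 + 1  # votes needed for a wrong majority
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(need, k + 1))

# One labeler at a 3% error rate vs. a 3- and 5-way consensus:
print(majority_error(0.03, 1))  # 0.03
print(majority_error(0.03, 3))  # ~0.0026
print(majority_error(0.03, 5))  # ~0.0003
```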

How To Train An Algorithm: Roles & Responsibilities

The activity of training algorithms is, at this time, largely a research collaboration between industry and academia.  Who has oversight over their results?  Who declares an algorithm ready for production?  What methodology do they use to evaluate an algorithm’s readiness for clinical duty?  An algorithm in diagnostic imaging is considered a medical device, and is overseen by the same authority that oversees the manufacture of an MRI scanner or PACS software: the FDA.

For the FDA to clear an algorithm for duty, the creators of the algorithm need only document and demonstrate to the FDA that a radiologist tends to be more accurate when assisted by the algorithm than without it.  In other words, the creators need not document the accuracy rate of the algorithm itself, only the accuracy of the physician with and without the algorithm involved in the diagnostic process.
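That clearance logic reduces to a paired reader comparison: measure each radiologist’s accuracy reading unaided, then aided, and show the aided condition is better.  A toy sketch with entirely made-up numbers:

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical per-reader accuracies from a reader study (not real data).
unaided = [0.88, 0.91, 0.85, 0.90]
aided   = [0.93, 0.94, 0.90, 0.95]

improvement = mean(aided) - mean(unaided)
print(f"Mean improvement with algorithm assist: {improvement:.3f}")
# A real submission would report this with confidence intervals from a
# proper multi-reader, multi-case (MRMC) analysis, not a raw mean.
```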

For good reason, the majority of algorithms cleared by the FDA at this time fit in the category of CADe (Computer Aided Detection), versus CADx (Computer Aided Diagnosis).  In the realm of CADe, the physician remains the sole authority for the final diagnosis at all times.  In the future, we can expect an increasing amount of research and FDA clearance of algorithms that routinely fulfill basic diagnostic work, sifting through massive amounts of images and flagging only the abnormalities for a human physician to review and diagnose.

Until then, the grunt work of image labeling will continue.

How much scrutiny do labelers undergo in their labeling activities?  Even if radiologists are being paid for their contribution to ML, how much peer review should there be in labeling, if any?  Are labelers also reading the report associated with the exam they are labeling?  Is the report even available to them?  If physicians are donating or discounting their time for the sake of science and progress, it is entirely possible that their activities and deliverables are received with gratitude, and therefore a certain level of laissez-faire, leading to the flawed assumption that all the labeling that went into an algorithm was accurate.

Checks and Balances In Healthcare AI Research & Developments

So where are the checks and balances in healthcare AI research and development?

In a November 2, 2017 TechCrunch article, OpenAI Director Shivon Zilis raises the specter of unchecked research and development by technocrats in a dangerous vacuum.  Researchers and software developers are essentially writing policy into code, thereby fundamentally influencing an important aspect of human cultural evolution without enough input from the public.  

“…artificial intelligence researchers are, in some ways, ‘basically writing policy in code’ because of how influential the particular perspectives or biases inherent in these systems will be…” Zilis added that the total number of people setting the tone for incredibly intelligent AI is probably “in the low thousands.”

She added that this means we likely need more crossover discussion between this community and those making policy decisions. – TechCrunch, on Shivon Zilis, OpenAI

In healthcare machine learning, the only barrier between a flawed algorithm and a functioning one is the physician’s unequivocal commitment to the Hippocratic Oath, and their professionalism.  After all, if we tell an algorithm 100,000 times that orange is the new black…

Would it be far-fetched to envision a point when machines are entrusted with peer-review of human diagnosticians?  What if the algorithms become sophisticated enough to have a detection success rate consistently superior to that of a human diagnostician for some specific diagnosis?

Approach Medical Imaging With Caution

The question remains: once we place artificial diagnosticians into production, who inherits the responsibility of peer-reviewing the diagnostic results?  The same ML algorithm cannot be in charge of reviewing its own work, for the same reason a physician cannot self-review – given the same input, its diagnostic conclusion would be identical.  If the task of peer-reviewing is entrusted to a distinctly separate neural network, we are presupposing that this second, peer-reviewing algorithm was originally taught on a separate set of similar images, labeled by a distinctly separate group of physicians.
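One practical reading of a “distinctly separate neural network as peer reviewer” is a disagreement filter: run two independently trained models on the same study and route to a human only the cases where they disagree, or where either is unsure.  A minimal sketch with stand-in predictions and an illustrative confidence threshold:

```python
def needs_human_review(pred_a: str, pred_b: str,
                       conf_a: float, conf_b: float,
                       min_conf: float = 0.9) -> bool:
    """Flag a case for human peer review when two independently
    trained models disagree, or either falls below a confidence floor."""
    return pred_a != pred_b or conf_a < min_conf or conf_b < min_conf

# Stand-in outputs for one study:
print(needs_human_review("normal", "normal", 0.97, 0.95))  # False
print(needs_human_review("normal", "nodule", 0.97, 0.95))  # True
```

The value of this scheme rests entirely on the premise stated above: the two models must have been trained on separate data labeled by separate physicians, or their agreement merely echoes a shared bias.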

We are raising these issues not because we have all the answers.  We are raising them so we stop and think, and because AI in healthcare will inexorably have an impact on patient care and outcomes – some good, some not so good.  There is an overwhelming amount of noise in health IT today, coming from a multitude of vendors and research institutions who are all fighting for a leading position in this burgeoning field, and hoping to convert a leading position into higher EPS.  

Wall Street expectations – aka greed – have a way of bringing the worst out of us, including a tendency to rush into commercializing solutions that may or may not be in the best interest of patient safety.  While some continue to adhere to sound ethical and scientific standards, there are others who are prematurely pushing half-baked solutions to the market, before many of the questions we raise in this article have been addressed or, at the very least, acknowledged.

Artificial Intelligence is an awe-inspiring endeavor that deserves our respect, enthusiasm and collective contribution.  As a vendor, Dicom Systems is thrilled to play a part in the evolution of this exciting field.  Nonetheless, we caution our fellow Health IT vendors that there is still a long and arduous road ahead.

Creating AI solutions in the field of medical imaging is a laborious and unglamorous endeavor, yet it is worth the effort because it presents monumental possibilities for patient care.  For the sake of patient safety, we must ponder the consequences of our designs as carefully as the NASA engineers who were entrusted with the safety of astronauts sent into space for the first time.