Ceci n’est pas une pipe (This is not a pipe): OCR (Optical Character Recognition) progress and pitfalls for healthcare

There was a time, not so long ago, when digitizing text meant painstakingly retyping an abstract, a contract, or even an entire book. Today, OCR (Optical Character Recognition) can accomplish the same task in mere seconds.

What is OCR?

OCR is a set of technologies that turn images, faxes, PDFs, and other printed documents into data so that they don’t have to be keyed in manually. OCR is used in virtually every field. Companies use the software to scan paper documents, digitally index and archive them, and make them easily searchable. If you take a photo of a document with your mobile phone, the result is only an image of the document: the characters aren’t really characters but pixel representations of characters, and none of the words on the page are searchable. OCR software analyzes that image and converts the picture of words into characters as if they had been typed. Those converted characters become useful data that can be sent to other systems or stored with the document to make it indexable and more easily searchable in the future. In healthcare, OCR allows medical record images to be searchable so that, for example, a nurse may search for all the lab results with the phrase “Liver Panel” and go directly to those specific results.
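To make the "searchable" part concrete, here is a minimal sketch of how text that has already come out of an OCR engine might be indexed and searched by phrase. The page identifiers, sample text, and the function names `build_index` and `search_phrase` are all hypothetical and only illustrate the idea; a production system would more likely rely on a dedicated search engine.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            # Drop punctuation that clings to a word, e.g. "panel:" -> "panel".
            index[word.strip('.,:;()"')].add(page_id)
    return index


def search_phrase(pages, index, phrase):
    """Return the sorted ids of pages whose OCR text contains the phrase."""
    words = phrase.lower().split()
    if not words:
        return []
    # Use the index to narrow down to pages that contain every word...
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # ...then confirm that the words appear together as a phrase.
    return sorted(p for p in candidates if phrase.lower() in pages[p].lower())


# Hypothetical OCR output for three scanned pages.
pages = {
    "scan_001": "Liver Panel: ALT 32 U/L, AST 28 U/L",
    "scan_002": "Chest X-ray report. No acute findings.",
    "scan_003": "Repeat liver panel ordered for follow-up.",
}
index = build_index(pages)
print(search_phrase(pages, index, "Liver Panel"))
```

The search is case-insensitive, so "Liver Panel" and "liver panel" find the same pages. Without the OCR step there would be no text to index at all, only pixels.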

Traditional OCR finds its roots in 1976 with Ray Kurzweil’s OCR computer program, the first practical application of OCR technology. The Kurzweil Reading Machine combined omni-font OCR, a flat-bed scanner, and text-to-speech synthesis to create the first print-to-speech reading machine for the blind. It was the first computer to transform random text into computer-spoken words, enabling blind and visually impaired people to read any printed materials. In the 1980s, Kurzweil sold the technology to Xerox, which commercialized paper-to-computer text conversion.

Ray Kurzweil with the Reading Machine for the Blind, 1976. Source: Kurzweil

How does OCR Work?

Different fonts and ways to write a single character make precise text recognition a challenge, much like various accents make speech recognition difficult for natural language processing (NLP). Before an OCR algorithm is applied, the image must be preprocessed so that it is ready to be “read.”

OCR software often pre-processes images to boost the chances of recognition. Pre-processing techniques include the following nine steps, as outlined in The Comprehensive Guide to Optical Character Recognition (OCR):

1. De-skew tilts the document a few degrees clockwise or counterclockwise to create text lines that are horizontal or vertical if the document was not correctly aligned when scanned.

2. Despeckle removes positive and negative spots, smoothing edges.

Source: verypdf

3. Binarization converts an image to black-and-white (called a “binary image” because there are two colors). The binarization task is conducted as an easy and accurate way to distinguish text (or any other required image element) from the background.

4. Line removal cleans up non-glyph boxes and lines.

Source: verypdf

5. Layout analysis or “zoning” identifies columns, paragraphs, captions, etc., as blocks. This is particularly useful in multi-column layouts and tables.

6. Line and word detection establishes a baseline for word and character shapes and divides words when required.

Source: Moov.ai

7. Script recognition identifies the script in use. In documents with multiple languages, the script may change at the word level, so it must be identified before the relevant OCR can be applied to that particular script.

8. Character isolation or “segmentation” separates characters that are linked together by image artifacts and reconnects single characters that artifacts have broken into several pieces.

9. Normalization finalizes aspect ratio and scale.
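Two of the steps above, binarization (step 3) and despeckling (step 2), are simple enough to sketch in plain Python. This is an illustrative sketch under stated assumptions, not how any particular OCR product implements them. The image is assumed to be a list of rows of grayscale values from 0 (black) to 255 (white), the fixed threshold of 128 is an arbitrary choice (real tools often pick the threshold adaptively), and this despeckle only removes isolated dark specks, not the "negative" light spots also mentioned above. The function names are hypothetical.

```python
def binarize(gray, threshold=128):
    """Return 1 for dark (ink) pixels and 0 for light (background) pixels."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(binary):
    """Clear ink pixels that have no ink pixel among their 8 neighbors."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1:
                continue
            has_ink_neighbor = any(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_ink_neighbor:
                cleaned[y][x] = 0  # an isolated speck, treated as noise
    return cleaned


# A tiny made-up grayscale scan: a 3-pixel stroke plus one stray dark speck.
gray = [
    [250, 250, 250, 250, 250],
    [250,  20,  30, 250, 250],
    [250,  25, 250, 250, 250],
    [250, 250, 250, 250,  10],
]
for row in despeckle(binarize(gray)):
    print(row)
```

Running binarization first matters here: once every pixel is either ink or background, deciding whether a pixel is an isolated speck becomes a simple neighbor count.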

Feature Extraction
There are two main methods for extracting features in OCR:

  1. In the first method, the algorithm for feature detection defines a character by evaluating its lines and strokes.
  2. In the second method, pattern recognition works by identifying the entire character. We can recognize a line of text by searching for rows of white pixels that have rows containing black pixels in between. Similarly, by looking at pixel columns within a line, we can recognize where a character starts and finishes.
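The search for white rows with black pixels in between is often called a projection profile: count the ink pixels in every row, and each run of non-empty rows is a line of text. The sketch below assumes a binarized image stored as a list of rows, with 1 for black (ink) and 0 for white. The function names are hypothetical, and real pages with skew, noise, or touching characters need more than this.

```python
def find_runs(profile):
    """Return (start, end) pairs, end exclusive, for each run of nonzero values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))        # an all-white row or column ends the run
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def text_lines(binary):
    """Find text lines as runs of pixel rows that contain at least one ink pixel."""
    return find_runs([sum(row) for row in binary])


def characters_in_line(binary, top, bottom):
    """Within rows top..bottom-1, find characters as runs of non-empty columns."""
    width = len(binary[0]) if binary else 0
    column_ink = [sum(binary[y][x] for y in range(top, bottom)) for x in range(width)]
    return find_runs(column_ink)


# A made-up 6x7 binary page with two text lines; the first holds two characters.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
for top, bottom in text_lines(page):
    print((top, bottom), characters_in_line(page, top, bottom))
```

The same helper serves both directions: applied to row sums it finds lines, and applied to column sums inside one line it finds where each character starts and finishes.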

Top Applications of OCR

Popular use cases for OCR technology include digitizing books and other unstructured documents that enable human-human communication. For example, Google Translate’s OCR enables users to read in any language.

The increasing desire for digitization makes OCR technology necessary for businesses. Throughout business processes, unstructured information remains trapped within legacy paper records and contemporary electronic documents, requiring large amounts of time and significant cost to transfer data into back-end systems for it to be used effectively. Invoices, orders, freight bills, application forms, and insurance claims are all examples of documents that need to be classified, separated, and extracted. Global supply chains (shipping, customs, etc.), regulatory filings, lawsuits, mergers and acquisitions (M&A) related documentation, and real estate purchases are all examples of document-intensive processes that generate enormous amounts of paper, all of which benefit from being indexed and archived electronically for easy access and shareability.

OCR use cases by industry

OCR can extract data in:

  • Checks, to capture the account information, handwritten dollar amount, and signature.
  • Mortgage applications, which contain numerous documents.
  • Payslips, which are one of the best indicators of disposable income.
  • Claims processing can be automated by OCR and supporting technologies.
  • Legal firms can digitize all of their printed documents, such as affidavits, judgments, filings, statements, and wills, via OCR.
  • OCR can scan reports that contain X-rays, previous diseases, treatments or diagnostics, tests, hospital records, and insurance payments.

OCR used in the healthcare industry. Photo credit: Moov.ai

OCR and De-Identification of Protected Health Information (PHI)

According to the U.S. Health Insurance Portability and Accountability Act (HIPAA) and the European General Data Protection Regulation (GDPR), written informed consent from patients is not always necessary, but both retrospectively and prospectively gathered data require proper de-identification. Sensitive information includes, but is not limited to, name, medical record number, and date of birth. Removing information embedded in images requires more advanced de-identification methods, such as optical character recognition, plus human review for handwriting on scanned images, which automated methods do not always recognize.
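Once OCR has turned a scanned page into text, a first automated pass over that text might look like the sketch below. It is deliberately simplistic. The three patterns, the labels such as "MRN:" and "DOB:", and the function name `redact_phi` are assumptions made for illustration; real records use many other formats, and a pattern list like this will miss PHI and occasionally redact too much, which is exactly why human review remains part of the process.

```python
import re

# Hypothetical patterns for three kinds of PHI as they might appear in OCR text.
# Real documents vary widely; these are illustrative, not exhaustive.
PHI_PATTERNS = [
    ("MRN", re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE)),
    ("DOB", re.compile(r"\b(?:DOB|Date of Birth):?\s*\d{1,2}/\d{1,2}/\d{4}\b",
                       re.IGNORECASE)),
    ("NAME", re.compile(r"\bPatient(?: Name)?:\s*[A-Z][a-z]+(?: [A-Z][a-z]+)+")),
]


def redact_phi(text):
    """Replace each pattern match with a placeholder; return (text, labels found)."""
    found = []
    for label, pattern in PHI_PATTERNS:
        text, count = pattern.subn(f"[{label} REDACTED]", text)
        if count:
            found.append(label)
    return text, found


# Made-up OCR output from a scanned lab report.
ocr_text = "Patient: Jane Doe MRN: 00123456 DOB: 04/12/1961 Liver Panel: ALT 32"
redacted, found = redact_phi(ocr_text)
print(redacted)
print(found)
```

Returning the list of labels that fired gives a reviewer a quick signal about which pages contained something to redact. A page where nothing fired is not proof that the page is clean, only that these particular patterns did not match.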

OCR with Artificial Intelligence (AI)

With the advent of machine learning and artificial intelligence, the limits that have kept traditional OCR technology from innovating at scale have been shattered. Newer OCR technologies leverage Artificial Intelligence (AI) and Machine Learning (ML) to radically improve the success of OCR on highly unstructured documents like medical records. AI allows OCR software to adapt to a document’s context, searching for a piece of data that may not be in the same place on every form, such as a blood pressure reading or lab test results. ML can “learn” via self-teaching and interactive training that allows it to learn different documents quickly, with little to no human intervention required.

What are the Challenges with Reading a Medical Record?

There are a number of challenges in reading a medical record with OCR. Quality varies greatly, especially fax quality. There are also countless document formats that need to be read. Handprinted characters and the legibility of physicians’ handwriting can also cause issues, as can notes written in different languages or alphabets. OCR is not easy. In fact, OCR of medical records with traditional OCR solutions is impractical at best. However, combining OCR, AI, and ML changes the game. The adaptive, learning nature of AI/ML combined with OCR overcomes most of the challenges listed above. It is now technically practical to read medical record documents and convert these images to usable, searchable data.

Ceci n’est pas une Pipe (This is Not a Pipe)

In a recent blog post, we mentioned an interesting branch of OCR that is being combined with NLP for the purpose of detecting language among pixels. A clear distinction should be made between unstructured text, which is still searchable text, and words on a scanned document, which are pixel representations of words, not actual words. In 1929, French Surrealist René Magritte painted one of his most famous pieces: “Ceci n’est pas une pipe.” In this piece, Magritte points out that a word is an arbitrary construct that doesn’t necessarily represent actual meaning to the person viewing it. In this example, we are not looking at a pipe, but rather at a visual representation of something called a pipe in some cultures.

René Magritte’s famous painting The Treachery of Images

Recognizing objects and patterns in medical images or documents can be fraught with misinterpretation and errors. Distinguishing the meaning of words that are represented by pixels can be even more challenging because words must be first identified, then processed, then understood with the right context and knowledge of that word.

This branch of NLP, combined with OCR, has become particularly important in the field of machine learning in medical imaging. As millions of images can be leveraged to unlock valuable knowledge of pathologies for AI, it’s also important to ensure no protected health information (PHI) is left over in the dataset being fed to the algorithm. De-identification of metadata is a relatively straightforward, albeit tedious, process. However, scanned documents, non-DICOM data, secondary capture images, and handwritten notes on documents can all constitute potential HIPAA violations if pixel-based PHI is still present in the dataset. NLP combined with OCR cannot be 100% trusted to catch and de-identify PHI that exists at the pixel level, so these potential HIPAA vulnerabilities remain.

For this reason, many valuable images must be consciously excluded from data sets prior to feeding them to machine learning algorithms, rather than risk the accidental and inevitable inclusion of embedded PHI. In a highly sensitized climate of cyber security attacks, ransomware, and breaches of privacy on a massive scale, the consequences of nonchalance in data preparation could far outweigh the benefits of training AI to enhance medical workflows.

Because OCR has become so ubiquitous in all sectors of our lives, personal as well as professional, it’s easy to forget that it constitutes a vital building block toward mass-customization of medicine for individuals. When combined with AI and ML, OCR will ultimately break language barriers, accent barriers in dictated content, and interchangeably recognize and process 8-bit characters as well as western alphabets.

OCR holds even more promise for population health, making it far easier for nations to aggregate and share vital health information on a massive scale, turning healthcare data into global “disease radars,” and ultimately better equipping us to combat pandemics and other life-threatening scourges of humanity.