De-Identification In Medical Imaging

Healthcare organizations obtain and have access to vast amounts of personal data contained in hospital records, results of clinical examinations, and medical images. This data is often shared with other healthcare systems or providers to maintain the care continuum or provide relevant patient medical history. Medical images are a foundational tool enabling clinicians to visualize critical information about a patient to help diagnose and treat them. The digitization of medical images has improved the ability to reliably store, share, view, search, and curate these images to assist medical professionals.

What is medical image sharing and how is it different from sharing data?

When medical images are shared the images are presented in a DICOM format. Sharing can be done between different physicians and departments in the same healthcare facility or shared outside the facility with consultants or with the patients themselves.

DICOM images are unlike other types of data and cannot be shared or viewed in the same manner. DICOM image files are large because of their high resolution and quality, which limits how they can be shared. They also require a specific viewer. DICOM router software that can recognize and process these high-quality images is necessary, along with a viewer for reading the images.

Sharing medical information is an important aspect of healthcare operations and clinical research. However, it is critical to ensure that all identifiable patient information is de-identified, especially when medical images are shared.

What is data de-identification?

Data de-identification is the process used to prevent personal identity from being revealed. It involves the removal or transformation of personal identifiers. Once personal identifiers are removed or transformed, it is much easier to safely reuse and share the data with third parties.

When applied to metadata or general data, the process is also known as data anonymization. Common strategies include deleting or masking personal identifiers, such as personal names, and suppressing or generalizing quasi-identifiers, such as date of birth. The reverse process of using de-identified data to identify individuals is known as data re-identification.
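
The strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a compliance tool: the record layout and the field names (name, date_of_birth, finding) are assumptions made up for the example. It masks a direct identifier and generalizes a quasi-identifier from a full date to a year.

```python
from datetime import date


def deidentify_record(record):
    """Return a copy with the name masked and the birth date generalized."""
    cleaned = dict(record)  # work on a copy so the original is untouched
    # Direct identifier: mask the personal name entirely.
    cleaned["name"] = "REDACTED"
    # Quasi-identifier: keep only the year of the date of birth.
    cleaned["birth_year"] = cleaned.pop("date_of_birth").year
    return cleaned


# Hypothetical patient record used only for illustration.
patient = {
    "name": "Jane Doe",
    "date_of_birth": date(1984, 3, 17),
    "finding": "distal radius fracture",
}

print(deidentify_record(patient))
```

The clinical finding is left untouched, which is the point of the exercise: the identifying fields change while the information needed for research stays usable.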

Original de-identified medical data. Source: ResearchGate

Which Industries Use De-identification?

Data de-identification is governed under HIPAA and is most often associated with medical data. For example, data produced during human subject research might be de-identified to preserve the privacy of research participants. Biological data may be de-identified in order to comply with HIPAA regulations that define and stipulate patient privacy laws.

However, data de-identification is also important for businesses or agencies that want or need to mask identities under other frameworks, such as CCPA and CPRA, GDPR or in fields of communications, multimedia, biometrics, big data, cloud computing, data mining, internet, social networks, and audio or video surveillance.

Requirements for California Consumer Privacy Act (CCPA), California Privacy Rights Act (CPRA), Virginia Consumer Data Protection Act (VCDPA), and Colorado Privacy Act (CPA). Source: Greenberg Traurig

Surveys are another example. When a survey such as a census collects information from a specific group of people, the researchers try to design it so that no participant’s individual responses can be matched with any published data. This both encourages participation and protects the privacy of respondents.

Re-identification also has a separate meaning in computer vision: a system compares two images of an object and determines whether they show the same identity, which is useful in object tracking. For example, during a soccer match the goal (pun intended) might be to recognize one particular player and also to identify every other player on the team. The most accurate way to do this is to reference a catalog that maps each identity to its image (player). If no catalog exists, re-identification algorithms can still distinguish the players from one another during the match without determining who each one is.

De-Identification In Medical Imaging

The process of de-identification, by which identifiers are removed from the health information, ensures patient privacy while allowing the use of imaging data for comparative effectiveness studies, policy assessment, life sciences research, and other endeavors. De-identification in healthcare removes all direct identifiers from patient data and allows organizations to share it without the potential of violating HIPAA. Direct identifiers, known as Protected Health Information, can include a patient’s name, address, and medical record information, and convey a patient’s physical or mental health condition, any healthcare services rendered to that individual, as well as financial data related to healthcare. For example, medical records, hospital bills, and lab results are all PHI.

The increasing adoption of health information technologies highlights their potential to facilitate beneficial studies that combine large, complex data sets from multiple sources. Since large sets of health data can enable clinical research and benefit the medical community, the HIPAA Privacy Rule permits a covered entity or business associate to de-identify data using specific standards and specifications.

Medical image before de-identification. Source: Dicom Systems

Why Is De-Identification Important?

De-identification protects the privacy of individuals: once a data set has been de-identified and no longer contains personal information, its use or disclosure cannot violate their privacy.
De-identification in the healthcare industry has multiple use cases, including:
  • Sharing health information with non-privileged parties
  • Creating datasets from multiple sources and analyzing them
  • Anonymizing data so that it can be used in machine learning models
  • Providing public health warnings without revealing PHI
  • Licensing de-identified patient data to analyze trends and patterns that help verify efficacy or buying trends
De-identification preserves patient confidentiality without affecting the values and the information that could be needed for different research purposes and protects specific health information that could identify living or deceased individuals.

How Do Researchers Use De-Identified Data?

Data sharing allows the healthcare sector to create better tools and treatments to help improve patient care and outcomes. However, according to the Centers for Disease Control & Prevention (CDC), HIPAA law states that patient information must be protected and cannot be shared with other entities without the patient’s knowledge and consent.

By de-identifying data, researchers can safely share information, such as imaging data, with other organizations to advance medical research and treatment. Additionally, de-identifying the data removes some liability regarding HIPAA violations.

One example of the power and impact of de-identification for a Clinical Research Organization (CRO) is Dicom Systems’ collaboration with the HSS Global Innovation Institute to de-identify 5.3 million exams. The Hospital for Special Surgery (HSS) is the world’s leading academic medical center focused on musculoskeletal health. The HSS Global Innovation Institute, which works with industry, investors, and entrepreneurs to commercialize their portfolio of technologies, partnered with an AI company to develop algorithms targeted to address the issue of fracture misdiagnosis.

HSS needed to provide their AI partners with a high volume of de-identified clinical data for the purpose of feeding machine learning algorithms in a way that was efficient, secure, and compliant with HIPAA Safe Harbor de-identification regulation guidelines. Additionally, the process required exceptional controls to ensure the highest level of integrity and reliability acceptable under the review of a 3rd party auditing agency.
De-identification played a critical role in unlocking the hospital’s vast archive of studies. With proven technology verified by an independent third-party auditor, Dicom Systems delivered the successful de-identification of 5.3 million exams to HSS. Dicom Systems created a custom de-identification framework that satisfied all clinical, technical, privacy, and business requirements of the HSS AI ingestion pipeline. The scope included de-identification of DICOM header metadata spanning 6 unique modality types, as well as advanced image pixel de-identification to mask burned-in PHI while ensuring clinical image fidelity remained intact.

Results Achieved with Dicom Systems De-Identification Framework for HSS Global Innovation Institute

  • 5.3M exams de-identified
  • Successful validation by 3rd party auditing agency
  • AI Algorithms safely provided with precise type and volume of clinical data required for effective machine learning

Dicom Systems example of de-Identification for Biomedical Research

De-Identification Techniques

De-identification methods fall into one of two categories, depending on where in the file the PHI is stored. In medical imaging, patient data can be part of the image, also known as burned-in annotations or image overlay, or it can be stored as a meta tag in file properties. Identification of patient and date as text in an encapsulated document (e.g., in an XML attribute or element) is equivalent to a “burned-in annotation.”

Chest x-ray with burned-in annotation. Source: ResearchGate

Each DICOM file contains extensive metadata in a header. Metadata is information that describes the image. The header portion of a DICOM file typically contains PHI. Metadata is a powerful tool to annotate and leverage image-related information for clinical and research purposes and to organize and retrieve archived images and associated data.

DICOM object with metadata and image. Source: ResearchGate
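
To make the header idea concrete, here is a small Python sketch that treats a DICOM header as a plain dictionary keyed by (group, element) tag. The tag numbers are the standard DICOM ones for these attributes; everything else (the policy of which tags to blank, the replacement ID "ANON-0001", and reducing the study date to January 1 of its year) is an assumption chosen for illustration. A real pipeline would read and write the files with a DICOM library and follow a reviewed de-identification profile.

```python
# Standard DICOM tags, written as (group, element).
STUDY_DATE = (0x0008, 0x0020)
MODALITY = (0x0008, 0x0060)
PATIENT_NAME = (0x0010, 0x0010)
PATIENT_ID = (0x0010, 0x0020)
PATIENT_BIRTH_DATE = (0x0010, 0x0030)

# Hypothetical policy for this example only.
TAGS_TO_BLANK = {PATIENT_NAME, PATIENT_BIRTH_DATE}
TAGS_TO_REPLACE = {PATIENT_ID: "ANON-0001"}
DATE_TAGS_TO_COARSEN = {STUDY_DATE}


def deidentify_header(header):
    """Return a new header dict with identifying tags blanked or replaced."""
    cleaned = {}
    for tag, value in header.items():
        if tag in TAGS_TO_BLANK:
            cleaned[tag] = ""
        elif tag in TAGS_TO_REPLACE:
            cleaned[tag] = TAGS_TO_REPLACE[tag]
        elif tag in DATE_TAGS_TO_COARSEN:
            # DICOM dates are YYYYMMDD strings; keep the year only.
            cleaned[tag] = value[:4] + "0101"
        else:
            cleaned[tag] = value  # clinically relevant data passes through
    return cleaned


# Hypothetical header values.
header = {
    PATIENT_NAME: "DOE^JANE",
    PATIENT_ID: "MRN-1001",
    PATIENT_BIRTH_DATE: "19840317",
    STUDY_DATE: "20210314",
    MODALITY: "CT",
}

clean_header = deidentify_header(header)
```

Keeping the policy in separate sets rather than inside the function makes it easier to review which tags are touched and to adjust the rules per project.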

Historically, de-identification methodologies relied on image masking for burned-in data. Image masking can be a manual process that must be completed on each image. Based on coordinates for a particular model of equipment, an editor can find the identifiers in the file and mask them. Methods of de-identification include blurring, pixelating, or blocking.
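
Coordinate-based blocking can be sketched as follows. The image is represented as a simple list of pixel rows, and the lookup table of burned-in regions per vendor and model ("ExampleVendor", "ModelX") is entirely hypothetical; it stands in for whatever coordinate catalog an editor or tool would actually use.

```python
# Hypothetical catalog: where a given scanner model burns text into the image.
# Each region is (top, left, bottom, right) in pixels; bottom/right exclusive.
BURNED_IN_REGIONS = {
    ("ExampleVendor", "ModelX"): [(0, 0, 2, 4)],
}


def mask_burned_in(pixels, vendor, model, fill=0):
    """Return a copy of the image with known text regions blocked out."""
    masked = [row[:] for row in pixels]  # copy each row
    for top, left, bottom, right in BURNED_IN_REGIONS.get((vendor, model), []):
        for r in range(top, min(bottom, len(masked))):
            for c in range(left, min(right, len(masked[r]))):
                masked[r][c] = fill
    return masked


# A tiny 4x6 "image" where every pixel has the value 9.
image = [[9] * 6 for _ in range(4)]
masked = mask_burned_in(image, "ExampleVendor", "ModelX")

for row in masked:
    print(row)
```

Only the pixels inside the listed rectangle are overwritten, so the rest of the image keeps its original values. A model that is missing from the catalog is returned unchanged, which is one reason purely coordinate-based masking can miss fields.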

Data masking has many limitations, due to the effort required and the opportunity for errors or missed fields. To replace manual masking, automated tools using computer vision and optical character recognition (OCR) have been developed. These de-identification tools can tell patient information apart from clinically relevant information and automatically mask the former while leaving the latter intact, all in a fraction of the time it takes a human to perform the same task.

The HIPAA De-Identification Standard

Data de-identification is expressly governed under HIPAA. There are two ways to de-identify data in accordance with HIPAA. The first is Safe Harbor whereby all 18 identifiers are explicitly and implicitly removed. The second is Expert Determination whereby a qualified subject matter expert determines that the risk of re-identification of an individual from the data set is very small. In addition, the expert must document their analysis thoroughly to ensure compliance. The Expert Determination method allows individuals to extract key data points while still protecting patient privacy, but it has limitations.

Two methods to achieve de-identification in accordance with the HIPAA Privacy Rule. Source

Regardless of the method by which de-identification is achieved, the Privacy Rule does not restrict the use or disclosure of de-identified health information, as it is no longer considered protected health information.
The Safe Harbor method of de-identification requires removing 18 types of identifiers, like those listed below, so that residual information cannot be used for identification:
  1. Names
  2. All geographic subdivisions smaller than a state
  3. Dates
  4. Telephone Numbers
  5. Vehicle Identifiers
  6. Fax Numbers
  7. Device Identifiers and Serial Numbers
  8. Emails
  9. URLs
  10. Social Security Numbers
  11. Medical Record Numbers
  12. IP Addresses
  13. Biometric Identifiers
  14. Health Plan Beneficiary Numbers
  15. Full-face photographic images and any comparable images
  16. Account Numbers
  17. Certificate/license numbers
  18. Any other unique identifying number, characteristic, or code.
Once all of those elements are removed from the data, HIPAA protections no longer apply.
Any derivatives of any of the listed identifiers cannot be used under the Safe Harbor method. For example, a document containing the last four digits of a Social Security number would not meet the de-identification requirement.
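
As a rough illustration of screening text for a few of the listed identifier types, the Python sketch below scans a free-text note with regular expressions. The three patterns and the sample note are assumptions made for this example, and they cover only a small part of the list. Pattern matching of this kind also cannot recognize derivatives such as the last four digits of a Social Security number, so it is a first-pass aid rather than proof that a data set meets Safe Harbor.

```python
import re

# Illustrative patterns for three of the 18 identifier types.
PATTERNS = {
    "Social Security number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Email address": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IP address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_identifiers(text):
    """Return (identifier type, matched text) pairs found in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


# Hypothetical note containing made-up identifiers.
note = "Contact jane.doe@example.org, SSN 123-45-6789, seen from 10.0.0.5."

for label, value in find_identifiers(note):
    print(f"{label}: {value}")
```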

How to De-identify Data

Data de-identification is typically managed in a two-step process.

The first step of data de-identification consists of classifying and tagging direct and indirect identifiers. Identifiers that are unique to a single individual, such as Social Security numbers, passport numbers, and taxpayer identification numbers are known as “direct identifiers.” The remaining types of identifiers are known as “indirect identifiers,” and generally consist of personal attributes that are not unique to a specific individual on their own. Examples of indirect identifiers include height, ethnicity, hair color, and more. Though not independently unique, indirect identifiers can be used in combination to single out an individual’s records.
Once the data classifiers have been verified and represent what is within data sources, it is possible to automate the tagging process. This makes the de-identification process vastly more efficient for data teams.
Data can then be de-identified through the combination of various dynamic data masking techniques and data access controls. These technical and organizational measures impact both the data’s appearance and its environment, including who can access the data and for which purposes, among other contingencies.
As an example, a two-stage de-identification process can be applied to CT scan images in DICOM file format. In the first stage, the patient’s PII, including name and date of birth, is removed at the hospital facility using the export process available in its Picture Archiving and Communication System (PACS). The second stage employs a DICOM de-identification tool for an exhaustive attribute-level investigation to further de-identify the files and ensure that all PII has been removed. A process like this can also serve as a roadmap toward a semi-automated or automated tool for de-identifying DICOM datasets.
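
The classification step, and the reason indirect identifiers matter, can be sketched in Python. The field names and the split into direct and indirect sets are assumptions for this example. The second function counts how many records share each combination of indirect identifiers; any combination that appears only once singles out one person even though no direct identifier is present.

```python
from collections import Counter

# Hypothetical classification of field names.
DIRECT = {"ssn", "passport_number"}
INDIRECT = {"height_cm", "ethnicity", "hair_color"}


def tag_fields(record):
    """Label each field as a direct identifier, indirect identifier, or other."""
    tags = {}
    for field in record:
        if field in DIRECT:
            tags[field] = "direct"
        elif field in INDIRECT:
            tags[field] = "indirect"
        else:
            tags[field] = "other"
    return tags


def unique_combinations(records):
    """Return indirect-identifier combinations that occur in exactly one record."""
    fields = sorted(INDIRECT)
    counts = Counter(tuple(r.get(f) for f in fields) for r in records)
    return [combo for combo, n in counts.items() if n == 1]


# Made-up records; the ethnicity values are placeholder group labels.
records = [
    {"ssn": "123-45-6789", "height_cm": 170, "ethnicity": "A", "hair_color": "brown", "finding": "normal"},
    {"ssn": "987-65-4321", "height_cm": 170, "ethnicity": "A", "hair_color": "brown", "finding": "fracture"},
    {"ssn": "555-12-3456", "height_cm": 201, "ethnicity": "B", "hair_color": "red", "finding": "normal"},
]

print(tag_fields(records[0]))
print(unique_combinations(records))
```

In this toy data set the third record is the only one with its combination of ethnicity, hair color, and height, so removing the Social Security number alone would not be enough to protect that individual.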

Medical image after de-identification. Source: Dicom Systems

Lack Of Proper De-Identification Runs The Risk of HIPAA Violation

Data breaches in healthcare are becoming more frequent, with news reports appearing almost daily, and frequent breaches could become the norm. That calls for more diligence with security measures to ensure patient data is protected. HIPAA regulations were established to protect patient privacy, and non-compliance can result in steep fines for organizations that are exposed to data breaches, as well as damage to an institution’s established reputation and public distrust.

The most common HIPAA violations that have resulted in financial penalties are the failure to perform an organization-wide risk analysis to identify risks to the confidentiality, integrity, and availability of protected health information (PHI); the failure to enter into a HIPAA-compliant business associate agreement; impermissible disclosures of PHI; delayed breach notifications; and the failure to safeguard PHI.

The failure to perform an organization-wide risk analysis is one of the most common HIPAA violations to result in a financial penalty. If the risk analysis is not performed regularly, organizations will not be able to determine whether any vulnerabilities to the confidentiality, integrity, and availability of PHI exist. Risks are therefore likely to remain unaddressed, leaving the door wide open to hackers.
HIPAA settlements with covered entities for the failure to conduct an organization-wide risk assessment include:
  • Premera Blue Cross – $6,850,000 settlement for risk analysis and risk management failures, and other potential HIPAA violations.
  • Excellus Health Plan – $5,100,000 settlement for risk analysis and risk management failures, and other potential HIPAA violations.
  • Oregon Health & Science University – $2,700,000 settlement for the lack of an enterprise-wide risk analysis.
  • Cardionet – $2,500,000 settlement for incomplete risk analysis and lack of risk management processes.
  • Cancer Care Group – $750,000 settlement for the failure to conduct an enterprise-wide risk analysis.
  • Lahey Hospital and Medical Center – $850,000 settlement for the failure to conduct an organization-wide risk assessment and other HIPAA violations.
  • Steven A. Porter, M.D. – $100,000 penalty for risk analysis and risk management failures.

De-Identification vs. Anonymization

It is important to distinguish between data de-identification and anonymization. It is not uncommon for researchers to use the terms de-identified and anonymous interchangeably, but the distinction between them can mean the difference in the study needing to comply with the federal regulations regarding human subject research or not.

Both methods remove personal identifiers from a data set. However, after anonymization, the dataset does not contain any identifiable information and there is no way to link the information back to identifiable information. In contrast, de-identified data can potentially be re-identified.
It is also important to make a distinction between datasets that have to comply with the Health Insurance Portability and Accountability Act (HIPAA) and those that do not.
Under HIPAA, a dataset is considered de-identified if all 18 identifiers listed at 45 CFR 164.514(b)(2) are removed. If the dataset is not subject to HIPAA, it is considered anonymous if the identity of the human subjects cannot be readily ascertained. Identity is considered readily ascertainable if the information is publicly available or could be determined from publicly available information.
The reason it is important to understand the distinction between anonymous data and de-identified data is that research with anonymous data is not considered human subject research and does not need to comply with the federal regulations regarding human subjects research.

Medical Image De-Identification in the Cloud

The adoption of cloud computing in healthcare is enabling greater integration and collaboration between hospitals, medical organizations, and healthcare providers. The benefits of storing patient data in the cloud include enhanced security, scalability, reduced IT costs, greater availability and reliability, and 24/7 access from multiple locations.

When it comes to de-identification of data stored in the cloud, adoption is strongly linked to compliance with Safe Harbor. Medical image de-identification is gaining adoption among healthcare organizations that store data in a private cloud. A private cloud is essentially an extension of the client’s data center and therefore is governed and protected by the same security and privacy methods as on-prem storage. In contrast, a public cloud, such as Google Cloud, Amazon’s AWS, or Microsoft’s Azure, sits outside the client’s data center, so images placed there that have not been properly de-identified risk falling short of HIPAA’s Safe Harbor standard.


Covered entities may also wish to re-identify data at a later date. Data re-identification, or de-anonymization, is the practice of matching de-identified data with publicly available information, or auxiliary data, in order to discover the individual to whom the data belong.

If a covered entity or business associate successfully undertook an effort to identify the subject of de-identified information it maintained, the health information now related to a specific individual would again be protected by the Privacy Rule, as it would meet the definition of PHI. Disclosure of a code or other means of record identification designed to enable coded or otherwise de-identified information to be re-identified is also considered a disclosure of PHI.
In order to prepare for re-identification, covered entities may assign a unique code to the dataset or specific records, only if the code is not derived from information about the individual, and the entity does not disclose the code or the mechanism for re-identification to any third party.
If the data is re-identified, it is once again considered PHI under HIPAA.
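
The coding approach described above can be sketched in Python. The class name, the use of a medical record number as the lookup value, and the 16-character code length are assumptions for this example. The key property is that each code comes from a random generator rather than from anything about the patient, and that the mapping stays inside the covered entity.

```python
import secrets


class ReidentificationVault:
    """Keeps the code-to-patient mapping private to the covered entity."""

    def __init__(self):
        self._code_to_mrn = {}

    def assign_code(self, mrn):
        """Create a random code that is not derived from patient information."""
        code = secrets.token_hex(8)  # 16 random hexadecimal characters
        while code in self._code_to_mrn:  # regenerate on the rare collision
            code = secrets.token_hex(8)
        self._code_to_mrn[code] = mrn
        return code

    def reidentify(self, code):
        """Look up the original record number; the result is PHI again."""
        return self._code_to_mrn[code]


vault = ReidentificationVault()
code = vault.assign_code("MRN-1001")  # hypothetical medical record number

# Only the code travels with the shared data set.
shared_record = {"patient_code": code, "finding": "distal radius fracture"}
```

A hash of the record number would not fit the description above, because it is derived from information about the individual. A random code avoids that, at the cost of having to store and protect the mapping.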

Object detection and tracking of soccer players. Source: Klap

De-Identification As On-Ramp to AI and Machine Learning Innovation

Thanks to recent advances, artificial intelligence (AI) is quickly becoming an integral part of modern healthcare. AI algorithms and other applications powered by AI are being used to support medical professionals in clinical settings and in ongoing research. In medical imaging, AI tools are being used to analyze CT scans, x-rays, MRIs, and other images for lesions or other findings that a human radiologist might miss. Artificial intelligence can enhance medical imaging for screenings, precision medicine, and risk assessment.
High-quality data is an essential contribution to better machine learning algorithms. The more volume and variety of data made available to an algorithm, the more accurate the learning, and the more accurate the outcomes will be. For AI algorithms to properly “consume” patient data, the data must be standardized and HIPAA-compliant. To that end, Unifier’s de-identification functionality seamlessly delivers de-identified data in the right format and quality, for use in any relevant AI algorithm.

Dicom Systems Unifier platform AI Conductor

Dicom Systems Unifier Offers De-Identification Solution

The Dicom Systems Unifier platform can de-identify DICOM, XML, TIFF, JPEG, PDF, and other image formats in compliance with the HIPAA Safe Harbor requirements for de-identification of Protected Health Information (PHI). Images and data are received and translated into a standardized format that can then be transferred to or accessed by referring physicians, radiologists, PACS/MIMPS, RIS, or any radiology workstation, regardless of its physical location.

Dicom Systems Unifier De-Identification Features and Benefits:

  • Adherence to the HIPAA Privacy Rule and Safe Harbor audited by third parties.
  • Full customization of processes and output
  • Robust enough for large-scale de-identification with no impact on clinical workflow.
  • Supports full DICOM, DICOMweb, FHIR, and HL7 interoperability with compatible devices
  • Performs pixel, mask-based, metadata, and text removal from images, referencing an industry database of modalities based on vendor and model to find the exact coordinates where text was burned into the medical images. Optical Character Recognition (OCR) software capability is available for certain use cases.
  • When deployed in conjunction with the Unifier platform Vendor Neutral Archive (VNA), it leverages a robust framework for imaging lifecycle management and archiving.
  • Bi-directional dynamic tag morphing makes changes in input and output.
  • Advanced pixel-level de-identification to avoid accidental corruption or truncation of the image file.
  • Complex DICOM tag substitutions, removals, or morphing are automated by designing transformations into the Lua script framework.
Want to learn more about de-identification with the Dicom Systems Unifier platform? Meet with one of our enterprise imaging workflow experts.