De-Identification In Medical Imaging
Healthcare organizations obtain and have access to vast amounts of personal data contained in hospital records, results of clinical examinations, and medical images. This data is often shared with other healthcare systems or providers to maintain the care continuum or provide relevant patient medical history. Medical images are a foundational tool enabling clinicians to visualize critical information about a patient to help diagnose and treat them. The digitization of medical images has improved the ability to reliably store, share, view, search, and curate these images to assist medical professionals.
What is medical image sharing and how is it different from sharing data?
When medical images are shared, they are presented in the DICOM (Digital Imaging and Communications in Medicine) format. Sharing can take place between physicians and departments within the same healthcare facility, or outside the facility with consultants or with the patients themselves.
DICOM images are unlike other types of data and cannot be shared or viewed in the same manner. The files are large because of their high resolution and quality, which limits how they can be shared, and they require a specific viewer. DICOM router software that can recognize and process these high-quality images is necessary, along with a viewer for reading them.
What is data de-identification?
Data de-identification is the process used to prevent personal identity from being revealed. It involves the removal or transformation of personal identifiers. Once personal identifiers are removed or transformed, it is much easier to safely reuse and share the data with third parties.
When applied to metadata or general data, the process is also known as data anonymization. Common strategies include deleting or masking personal identifiers, such as personal names, and suppressing or generalizing quasi-identifiers, such as date of birth. The reverse process of using de-identified data to identify individuals is known as data re-identification.
Original de-identified medical data. Source: ResearchGate
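The strategies described above (masking direct identifiers and generalizing quasi-identifiers) can be sketched in a few lines of Python. This is a minimal illustration, not a complete de-identification routine: the field names and the sample record are hypothetical, and a real dataset would need a much longer list of identifiers.

```python
from datetime import date

# Hypothetical field names treated as direct identifiers in this sketch.
DIRECT_IDENTIFIERS = {"name", "phone"}

def deidentify_record(record):
    """Return a copy with direct identifiers masked and date of birth generalized."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            cleaned[field] = "REDACTED"         # mask the direct identifier
        elif field == "date_of_birth":
            cleaned["birth_year"] = value.year  # generalize the quasi-identifier to a year
        else:
            cleaned[field] = value              # keep all other fields unchanged
    return cleaned

# Hypothetical sample record.
patient = {
    "name": "Jane Doe",
    "phone": "212-555-0100",
    "date_of_birth": date(1980, 4, 12),
    "diagnosis": "wrist fracture",
}
print(deidentify_record(patient))
```

The function returns a new dictionary rather than editing the record in place, so the original data stays available to whoever is authorized to hold it.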
Which Industries Use De-identification?
Data de-identification is governed under HIPAA and is most often associated with medical data. For example, data produced during human subject research might be de-identified to preserve the privacy of research participants. Biological data may be de-identified in order to comply with HIPAA regulations that define and stipulate patient privacy laws.
However, data de-identification is also important for businesses or agencies that want or need to mask identities under other frameworks, such as CCPA and CPRA, GDPR or in fields of communications, multimedia, biometrics, big data, cloud computing, data mining, internet, social networks, and audio or video surveillance.
Requirements for California Consumer Privacy Act (CCPA), California Privacy Rights Act (CPRA), Virginia Consumer Data Protection Act (VCDPA), and Colorado Privacy Act (CPA). Source: Greenberg Traurig
Surveys such as a census are another example, since information is collected from a specific group of people. To encourage participation and protect respondents' privacy, researchers try to design the survey so that no participant's individual responses can be matched with any published data.
Re-identification, in which a system compares two images of an object and determines whether they share the same identity, has a use case in object tracking. For example, to detect a soccer player during a match, the goal (pun intended) would be to recognize that particular player, and also to recognize and identify every other player on the team. The most accurate way to do this is to reference a catalog that maps each identity to an image (player). If no catalog exists, re-identification algorithms can still distinguish the players from one another during the match, without determining their actual identities.
De-Identification In Medical Imaging
The process of de-identification, by which identifiers are removed from the health information, ensures patient privacy while allowing the use of imaging data for comparative effectiveness studies, policy assessment, life sciences research, and other endeavors. De-identification in healthcare removes all direct identifiers from patient data and allows organizations to share it without the potential of violating HIPAA. Direct identifiers, known as Protected Health Information (PHI), can include a patient’s name, address, and medical record information, and convey a patient’s physical or mental health condition, any healthcare services rendered to that individual, as well as financial data related to healthcare. For example, medical records, hospital bills, and lab results are all PHI.
The increasing adoption of health information technologies highlights their potential to facilitate beneficial studies that combine large, complex data sets from multiple sources. Since large sets of health data can enable clinical research and benefit the medical community, the HIPAA Privacy Rule permits a covered entity or business associate to de-identify data using specific standards and specifications.
Medical image before de-identification. Source: Dicom Systems
Why Is De-Identification Important?
- Sharing health information with non-privileged parties
- Creating datasets from multiple sources and analyzing them
- Anonymizing data so that it can be used in machine learning models
- Providing public health warnings without revealing PHI
- Licensing de-identified patient data to companies that analyze trends and patterns to help verify efficacy or buying trends
How Do Researchers Use De-Identified Data?
By de-identifying data, researchers can safely share information, such as imaging data, with other organizations to advance medical research and treatment. Additionally, de-identifying the data removes some liability regarding HIPAA violations.
One example of the power and impact of de-identification for a Clinical Research Organization (CRO) is Dicom Systems’ collaboration with the HSS Global Innovation Institute to de-identify 5.3 million exams. The Hospital for Special Surgery (HSS) is the world’s leading academic medical center focused on musculoskeletal health. The HSS Global Innovation Institute, which works with industry, investors, and entrepreneurs to commercialize their portfolio of technologies, partnered with an AI company to develop algorithms targeted to address the issue of fracture misdiagnosis.
Results Achieved with Dicom Systems De-Identification Framework for HSS Global Innovation Institute
- 5.3M images de-identified
- Successful validation by 3rd party auditing agency
- AI Algorithms safely provided with precise type and volume of clinical data required for effective machine learning

Dicom Systems example of de-Identification for Biomedical Research
De-Identification Techniques
De-identification methods fall into one of two categories, depending on where in the file the PHI is stored. In medical imaging, patient data can be part of the image itself, known as burned-in annotations or image overlays, or it can be stored as a metadata tag in the file properties. Patient and date information that appears as text in an encapsulated document (e.g., in an XML attribute or element) is treated as equivalent to a burned-in annotation.
Chest x-ray with burned-in annotation. Source: Research Gate
Each DICOM file contains extensive metadata in a header. Metadata is information that describes the image. The header portion of a DICOM file typically contains PHI. Metadata is a powerful tool to annotate and leverage image-related information for clinical and research purposes and to organize and retrieve archived images and associated data.
DICOM object with metadata and image. Source: Research Gate.
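To make the header side concrete, here is a minimal sketch that models a DICOM header as a plain Python dictionary keyed by (group, element) tag and drops a handful of tags that commonly carry PHI. The tag numbers are standard DICOM attributes, but the short list is illustrative only and is not a complete de-identification profile. The dictionary stands in for a real DICOM parser, and the sample values are hypothetical.

```python
# Standard DICOM attribute tags that commonly carry PHI.
# Illustrative subset only; a real profile covers many more attributes.
PHI_TAGS = {
    (0x0010, 0x0010),  # PatientName
    (0x0010, 0x0020),  # PatientID
    (0x0010, 0x0030),  # PatientBirthDate
    (0x0008, 0x0050),  # AccessionNumber
    (0x0008, 0x0080),  # InstitutionName
}

def strip_phi_tags(header):
    """Return a copy of the header without the tags listed in PHI_TAGS."""
    return {tag: value for tag, value in header.items() if tag not in PHI_TAGS}

# Hypothetical header values.
header = {
    (0x0010, 0x0010): "DOE^JANE",
    (0x0010, 0x0020): "MRN-000123",
    (0x0008, 0x0060): "CR",  # Modality, clinically relevant, kept
}
print(strip_phi_tags(header))
```

Keeping the tag list in one place makes it easy to review and extend, which matters because missed fields are the main failure mode of header de-identification.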
Historically, de-identification methodologies relied on image masking for burned-in data. Image masking can be a manual process that must be completed on each image. Based on the coordinates used by a particular model of equipment, an editor can find the identifiers in the image and mask them. Masking methods include blurring, pixelating, or blocking.
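Coordinate-based masking can be sketched as follows. The image is a plain list of pixel rows, and the lookup table maps an equipment model to the rectangles where text is assumed to be burned in. The vendor name, model name, and coordinates are hypothetical, and the sketch uses blocking (overwriting with a fill value) rather than blurring or pixelating.

```python
# Hypothetical lookup: (manufacturer, model) -> list of (top, left, bottom, right)
# rectangles, with bottom and right exclusive.
BURNED_IN_REGIONS = {
    ("ExampleVendor", "US-1000"): [(0, 0, 2, 4)],
}

def mask_burned_in_text(pixels, manufacturer, model, fill=0):
    """Return a copy of the image with the known text regions blocked out."""
    regions = BURNED_IN_REGIONS.get((manufacturer, model), [])
    masked = [row[:] for row in pixels]  # copy rows so the input is untouched
    for top, left, bottom, right in regions:
        for r in range(top, min(bottom, len(masked))):
            for c in range(left, min(right, len(masked[r]))):
                masked[r][c] = fill      # overwrite the pixel with the fill value
    return masked

image = [[9] * 6 for _ in range(4)]      # tiny 4x6 stand-in for a real image
for row in mask_burned_in_text(image, "ExampleVendor", "US-1000"):
    print(row)
```

An unknown model returns the image unchanged in this sketch. A production workflow would more likely flag such images for review, since an unmasked image may still contain identifiers.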
Manual masking has many limitations, due to the effort required and the opportunity for errors or missed fields. To replace it, automated tools using computer vision and optical character recognition (OCR) have been developed. These tools can distinguish patient information from clinically relevant information and automatically mask the former while leaving the latter intact, all in a fraction of the time it takes a human to perform the same task.
The HIPAA De-Identification Standard
Data de-identification is expressly governed under HIPAA. There are two ways to de-identify data in accordance with HIPAA. The first is Safe Harbor, whereby all 18 types of identifiers, including any derivatives of them, are removed. The second is Expert Determination, whereby a qualified subject matter expert determines that the risk of re-identification of an individual from the data set is very small. In addition, the expert must document their analysis thoroughly to ensure compliance. The Expert Determination method allows individuals to extract key data points while still protecting patient privacy, but it has limitations.
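The source does not say how an expert quantifies re-identification risk. One commonly cited measure is k-anonymity: the size of the smallest group of records that share the same quasi-identifier values. The sketch below computes that number for a hypothetical dataset. It is offered only as an illustration of the idea, not as the method HIPAA requires, and the field names are assumptions.

```python
from collections import Counter

def smallest_group_size(records, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

# Hypothetical records with two quasi-identifiers.
records = [
    {"zip3": "100", "birth_year": 1980},
    {"zip3": "100", "birth_year": 1980},
    {"zip3": "945", "birth_year": 1975},
]
print(smallest_group_size(records, ["zip3", "birth_year"]))
```

A result of 1 means at least one record is unique on those fields, which is the situation an expert would typically want to avoid or justify.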
Two methods to achieve de-identification in accordance with the HIPAA Privacy Rule. Source: hhs.gov
The 18 identifiers that must be removed under Safe Harbor are:
- Names
- All geographic subdivisions smaller than a state
- Dates
- Telephone Numbers
- Vehicle Identifiers
- Fax Numbers
- Device Identifiers and Serial Numbers
- Emails
- URLs
- Social Security Numbers
- Medical Record Numbers
- IP Addresses
- Biometric Identifiers
- Health Plan Beneficiary Numbers
- Full-face photographic images and any comparable images
- Account Numbers
- Certificate/license numbers
- Any other unique identifying number, characteristic, or code.
Any derivatives of any of the listed identifiers cannot be used under the Safe Harbor method. For example, a document containing the last four digits of a Social Security number would not meet the de-identification requirement.
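A simple way to screen free text for some of the identifiers listed above is pattern matching. The sketch below covers only a few categories, and the regular expressions are deliberately simplified assumptions (US-style phone and Social Security number formats). Pattern scanning cannot reliably catch derivatives such as the last four digits of a Social Security number, so it complements a field-by-field review rather than replacing it.

```python
import re

# Simplified, illustrative patterns for a few of the identifier categories.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "url": re.compile(r"https?://\S+"),
}

def find_identifiers(text):
    """Return a dict mapping each category to the matches found in the text."""
    hits = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits

# Hypothetical note text.
note = "Contact jane@example.org or 212-555-0100, SSN 123-45-6789"
print(find_identifiers(note))
```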
How to De-identify Data
Data de-identification is typically managed in a two-step process.
Medical image after de-identification. Source: Dicom Systems
Lack Of Proper De-Identification Runs The Risk of HIPAA Violation
The most common HIPAA violations that have resulted in financial penalties are the failure to perform an organization-wide risk analysis to identify risks to the confidentiality, integrity, and availability of protected health information (PHI); the failure to enter into a HIPAA-compliant business associate agreement; impermissible disclosures of PHI; delayed breach notifications; and the failure to safeguard PHI.
HIPAA settlements with covered entities for the failure to conduct an organization-wide risk assessment include:
- Premera Blue Cross – $6,850,000 settlement for risk analysis and risk management failures, and other potential HIPAA violations.
- Excellus Health Plan – $5,100,000 settlement for risk analysis and risk management failures, and other potential HIPAA violations.
- Oregon Health & Science University – $2.7 million settlement for the lack of an enterprise-wide risk analysis.
- Cardionet – $2.5 million settlement for incomplete risk analysis and lack of risk management processes.
- Cancer Care Group – $750,000 settlement for the failure to conduct an enterprise-wide risk analysis.
- Lahey Hospital and Medical Center – $850,000 settlement for the failure to conduct an organization-wide risk assessment and other HIPAA violations.
- Steven A. Porter, M.D. – $100,000 penalty for risk analysis and risk management failures.
De-Identification vs. Anonymization
It is important to distinguish between data de-identification and anonymization. It is not uncommon for researchers to use the terms de-identified and anonymous interchangeably, but the distinction can determine whether or not a study must comply with the federal regulations on human subject research.
Medical Image De-Identification in the Cloud
The adoption of cloud computing in healthcare is enabling greater integration and collaboration between hospitals, medical organizations, and healthcare providers. The benefits of storing patient data in the cloud include enhanced security, scalability, reduced IT costs, greater availability and reliability, and 24/7 access from multiple locations.
When it comes to de-identification of data stored in the cloud, adoption is strongly linked to compliance with Safe Harbor. Medical image de-identification is gaining adoption among healthcare organizations that store data in a private cloud. A private cloud is essentially an extension of the client’s data center and therefore is governed and protected by the same security and privacy methods as on-prem storage. In contrast, a public cloud, such as Google Cloud, Amazon’s AWS, or Microsoft’s Azure, sits outside the organization’s own data center, so sending data there that has not been properly de-identified could violate HIPAA unless other safeguards, such as a business associate agreement with the provider, are in place.
Medical image before de-identification. Source: Dicom Systems
Re-Identification
Covered entities may also wish to re-identify data at a later date. Data re-identification or de-anonymization is the practice of matching anonymous data (also known as de-identified data) with publicly available information, or auxiliary data, in order to discover the individual to which the data belong.
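The matching step described above can be sketched as a join on quasi-identifiers. A de-identified row counts as re-identified here when exactly one person in the auxiliary data shares its values. All records, names, and field names below are hypothetical.

```python
def link_records(deidentified, auxiliary, keys):
    """Return (name, row) pairs where a de-identified row matches exactly one person."""
    index = {}
    for person in auxiliary:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = []
    for row in deidentified:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match reveals the identity
            matches.append((candidates[0], row))
    return matches

# Hypothetical de-identified dataset and public auxiliary data.
deidentified = [
    {"zip": "10021", "birth_year": 1980, "sex": "F", "diagnosis": "fracture"},
    {"zip": "94110", "birth_year": 1975, "sex": "M", "diagnosis": "sprain"},
]
auxiliary = [
    {"name": "Jane Doe", "zip": "10021", "birth_year": 1980, "sex": "F"},
    {"name": "John Roe", "zip": "94110", "birth_year": 1975, "sex": "M"},
    {"name": "Sam Poe", "zip": "94110", "birth_year": 1975, "sex": "M"},
]
print(link_records(deidentified, auxiliary, ["zip", "birth_year", "sex"]))
```

In this sample only the first row is linked, because the second row matches two people. That is why generalizing quasi-identifiers, which makes groups larger, lowers re-identification risk.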
Object detection and tracking of soccer players. Source: Klap
De-Identification As On-Ramp to AI and Machine Learning Innovation

Dicom Systems Unifier platform AI Conductor
Dicom Systems Unifier Offers Deidentification Solution
Dicom Systems Unifier Deidentification Features and Benefits:
- Adherence to the HIPAA Privacy Rule and Safe Harbor audited by third parties.
- Full customization of processes and output
- Robust enough for large-scale de-identification with no impact on clinical workflow.
- Supports full DICOM, DICOMweb, FHIR, and HL7 interoperability with compatible devices
- Performs pixel, mask-based, metadata, and text removal on images, referencing an industry database of modalities based on vendor and model to find the exact coordinates where text was burnt into the medical images. Optical Character Recognition (OCR) software capability is available for certain use cases.
- When deployed in conjunction with the Unifier platform Vendor Neutral Archive (VNA), it leverages a robust framework for imaging lifecycle management and archiving.
- Bi-directional dynamic tag morphing applies changes on both input and output.
- Advanced pixel-level de-identification to avoid accidental corruption or truncation of the image file.
- Complex DICOM tag substitutions, removals, or morphing are automated by designing transformations in the Lua script framework.