AI, Data, & Tech Innovations

How to train AI models with personal data without violating the GDPR

Arthur
14.08.2025

The most important facts at a glance

  • Broad database: AI models process direct and indirect personal data.
  • Legal framework: GDPR only allows AI training with a legal basis, transparency, purpose limitation, and data minimization.
  • Technical implementation: Anonymization, data governance, DPIAs, and fairness checks reduce risks.
  • Challenges: Sensitive data, complex data flows, and bias require clear governance.
  • Recommendation: Establish processes, training, and protective measures at an early stage.

Background – What happened?

Training AI models with personal data has long been an integral part of many business processes – from e-commerce recommendations to medical diagnosis systems and HR applicant tools. However, the GDPR sets strict limits: companies must ensure a clear legal basis, transparency, purpose limitation, and data minimization. It becomes particularly sensitive when indirect data such as click behavior or speech patterns become identifiable in combination. Lack of governance, complex data flows, and inadequate technical safeguards increase the risk of data protection violations. Early processes, documentation, and technical safeguards are crucial to avoid regulatory conflicts and build trust.

What is personal data in the context of AI?

According to Art. 4 No. 1 GDPR, personal data is any information relating to an identified or identifiable person.
In the context of AI, this can be direct (name, email) or indirect (click behavior, IP address, speech patterns).

Examples of personal data in AI training:

  • Customer data from CRM systems for sales prediction
  • Applicant data for optimizing recruiting algorithms
  • Chat transcripts for chatbot training
  • Behavioral patterns in e-commerce for personalization

Problem: Many companies underestimate the fact that even seemingly harmless usage data (such as timestamps or navigation paths) can be linked to individuals when combined.
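
How quickly this happens can be illustrated with a minimal sketch: it checks how many rows in a usage log become unique once several "harmless" attributes are combined. All data and column names are invented for illustration.

import pandas as pd

# Hypothetical usage log: no names or emails, only "harmless" attributes.
logs = pd.DataFrame({
    "timestamp":  ["09:01", "09:01", "09:02", "09:02"],
    "nav_path":   ["/home>/jobs", "/home>/shop", "/home>/jobs", "/home>/blog"],
    "user_agent": ["Firefox/Linux", "Chrome/Win", "Firefox/Linux", "Safari/Mac"],
})

# Group by the combination of attributes: a group of size 1 means that
# combination singles out exactly one visitor (k-anonymity with k = 1).
group_sizes = logs.groupby(["timestamp", "nav_path", "user_agent"]).size()
unique_rows = group_sizes[group_sizes == 1]
print(f"{len(unique_rows)} of {len(logs)} rows identify a single visitor")

In this toy example, every combination is unique: exactly the situation in which supposedly anonymous usage data becomes personal data.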

Where is AI already being trained with personal data today?

In almost all digital business models:

E-commerce & marketing

  • Product recommendations based on user behavior
  • A/B testing to optimize personalization
  • Lookalike audiences for advertising

Healthcare

  • Training data for diagnostic algorithms using patient data
  • Speech recognition in medical documentation

Predictive analytics

  • Customer churn prediction
  • Sales forecasts using CRM histories

Human resources

  • Pre-sorting of applicants based on old application data
  • Performance forecasts using HR feedback data

Use case:

A SaaS provider in the HR sector trains a model for applicant selection using historical resumes. Much of this data contains gender, origin, and age – in other words, sensitive personal characteristics.

What is permitted from a data protection perspective – and what is not?

According to the GDPR, training AI with personal data is generally permissible, but only under certain conditions:

1. Legal basis according to Art. 6 GDPR

  • Most relevant: consent or legitimate interest
  • For particularly sensitive data (e.g., health): Art. 9 GDPR → explicit consent required

2. Transparency and purpose limitation

  • Users must clearly understand that their data is being used for AI purposes.
  • The purpose must be clear (e.g., “improvement of recommendation logic”).

3. Data minimization (Art. 5 GDPR)

  • Only data that is truly necessary may be used
  • Superfluous or outdated information must be excluded

4. Comply with data subject rights

  • Data must be accessible, deletable, and portable upon request (see the sketch below)
  • Profiling must not lead to legal or similarly significant decisions without human involvement (Art. 22 GDPR)
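
As an illustration of what honoring these rights can look like technically, here is a minimal sketch against a simple in-memory store. All class, method, and field names are hypothetical; a real system would have to cover every database and training pipeline that holds the data.

import json

class SubjectRightsHandler:
    # Hypothetical handler for data subject requests.

    def __init__(self, records: dict[str, dict]):
        self.records = records  # keyed by data subject ID

    def access(self, subject_id: str) -> dict:
        # Art. 15 GDPR: return all stored data for the subject.
        return self.records.get(subject_id, {})

    def export(self, subject_id: str) -> str:
        # Art. 20 GDPR: machine-readable, portable format.
        return json.dumps(self.access(subject_id))

    def delete(self, subject_id: str) -> bool:
        # Art. 17 GDPR: erase the subject's data; models trained on it
        # may need retraining or downstream removal.
        return self.records.pop(subject_id, None) is not None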

Technical safeguards for data protection-compliant AI training

The legal framework must be implemented technically – here are the most important measures:

1. Anonymization & pseudonymization

  • Where possible, replace personal characteristics with random values or IDs (see the sketch after this list)
  • Please note: Only true anonymization exempts you from the GDPR – pseudonymization does not!
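
As a minimal sketch of the first point (with hypothetical column names and deliberately simplified key handling), direct identifiers can be replaced with keyed hashes before the data reaches the training pipeline. As long as the key exists, this remains pseudonymization, so the GDPR continues to apply.

import hashlib
import hmac

import pandas as pd

SECRET_KEY = b"store-in-a-vault-and-rotate"  # hypothetical key management

def pseudonymize(value: str) -> str:
    # Keyed hash: deterministic, so records can still be joined,
    # but not reversible without the key.
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

df = pd.DataFrame({"email": ["a@example.com"], "clicks": [42]})
df["user_id"] = df.pop("email").map(pseudonymize)  # raw identifier is dropped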

2. Data governance & versioning

  • For each training run, document in a traceable manner which data was used (see the manifest sketch below).
  • Central deletion logs and retention limits (e.g., 12 months) are useful.
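
One possible implementation, sketched below with an invented manifest format, is a small log per training run that records a hash of the dataset, the model version, and a deletion deadline:

import hashlib
import json
from datetime import date, timedelta

def log_training_run(model_version: str, dataset_path: str) -> dict:
    # Hash the dataset so the manifest proves exactly which data was used.
    with open(dataset_path, "rb") as f:
        dataset_hash = hashlib.sha256(f.read()).hexdigest()
    manifest = {
        "model_version": model_version,
        "dataset_path": dataset_path,
        "dataset_sha256": dataset_hash,
        "trained_on": date.today().isoformat(),
        # Illustrative 12-month retention limit.
        "delete_by": (date.today() + timedelta(days=365)).isoformat(),
    }
    with open(f"training_log_{model_version}.json", "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest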

3. DPIA (data protection impact assessment)

  • Mandatory in cases of high risk for data subjects (e.g., scoring, behavior tracking).
  • Helps to identify risks early on and define measures.

4. “Fairness by Design”

  • Do not use sensitive characteristics (e.g., gender, origin) as features if they have no factual relevance.
  • Regularly perform bias detection and fairness audits, as sketched below (bias detection identifies systematic distortions in data, algorithms, or decisions; fairness audits are structured checks that ensure AI and data systems work fairly, without discrimination, and in compliance with regulations).
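
A basic bias check can be as simple as comparing positive-prediction rates across groups (the demographic parity difference). The data, group labels, and the 10% threshold below are purely illustrative; real fairness audits combine several metrics with statistical tests.

import pandas as pd

results = pd.DataFrame({
    "group":     ["A", "A", "B", "B", "B"],  # e.g., a protected attribute
    "predicted": [1, 0, 0, 0, 1],            # the model's yes/no decision
})

# Positive-prediction rate per group and the gap between groups.
rates = results.groupby("group")["predicted"].mean()
parity_gap = rates.max() - rates.min()
if parity_gap > 0.1:  # illustrative threshold
    print(f"Warning: selection rates differ by {parity_gap:.0%} across groups")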

Practical recommendations for companies

Before training:

  • Establish a legal basis (preferably documented in the record of processing activities)
  • Create transparent data protection notices
  • Evaluate data sources: Which data categories are critical?

During training:

  • Activate pseudonymization or aggregation
  • Consciously remove or neutralize sensitive features (see the sketch after this list)
  • Implement automated risk assessment
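
A minimal preprocessing sketch for the second point, assuming hypothetical column names: sensitive columns are dropped entirely, and exact ages are aggregated into coarse bands.

import pandas as pd

SENSITIVE_COLUMNS = ["gender", "origin", "birth_date"]  # illustrative list

def prepare_training_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop sensitive features that have no factual relevance to the task.
    df = df.drop(columns=[c for c in SENSITIVE_COLUMNS if c in df.columns])
    # Aggregate instead of dropping where coarse information suffices.
    if "age" in df.columns:
        df["age_band"] = pd.cut(df.pop("age"), bins=[0, 30, 50, 120],
                                labels=["<30", "30-50", "50+"])
    return df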

After training:

  • Perform or update the DPIA
  • Technically ensure deletion routines (see the sketch below)
  • Check the resulting models for bias (“fairness check”)
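
A sketch of a deletion routine that could run on a schedule (e.g., nightly), building on the hypothetical training manifest shown earlier: it removes raw datasets whose retention deadline has passed and keeps an auditable record of the deletion.

import json
import os
from datetime import date
from glob import glob

def purge_expired_datasets(log_dir: str = ".") -> None:
    for log_file in glob(os.path.join(log_dir, "training_log_*.json")):
        with open(log_file) as f:
            manifest = json.load(f)
        if date.fromisoformat(manifest["delete_by"]) <= date.today():
            if os.path.exists(manifest["dataset_path"]):
                os.remove(manifest["dataset_path"])  # delete the raw training data
            manifest["deleted_on"] = date.today().isoformat()
            with open(log_file, "w") as f:
                json.dump(manifest, f, indent=2)  # keep an auditable record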

Conclusion

Training AI models with personal data is not prohibited per se – but it is regulated. Companies that combine legal requirements (GDPR) with technical safeguards reap double benefits: they build trust with customers while ensuring that their AI projects remain scalable and future-proof.

Important: The content of this article is for informational purposes only and does not constitute legal advice. The information provided here is no substitute for personalized legal advice from a data protection officer or an attorney. We do not guarantee that the information provided is up to date, complete, or accurate. Any actions taken on the basis of the information contained in this article are at your own risk. We recommend that you always consult a data protection officer or an attorney with any legal questions or problems.