Blog Economie Numérique - AI and Data Security — Interview with Pierluigi Paganini

The use of generative artificial intelligence systems in contexts related to psychological support and mental well-being is experiencing rapid growth, often outside any structured healthcare framework. These uses involve the processing of information of an extremely sensitive nature, characterised by a strong narrative, cumulative and longitudinal dimension.

In this context, issues relating to data security, information traceability and infrastructure governance take on central importance, particularly given the specific vulnerability of psychological data and their potential for re-identification. The evolution of the European regulatory framework on artificial intelligence further reinforces the need for a technical and operational analysis of these risks.

The interview below addresses these issues through the perspective of Pierluigi Paganini, with a specific focus on cybersecurity, system architecture and data protection in contexts where generative AI intercepts information relating to the psychological sphere.

Juliette Haller : Personal, highly intimate conversations generate extremely sensitive and cumulative data. From a cybersecurity perspective, what is currently the main risk factor associated with the prolonged accumulation of this type of data within generative artificial intelligence systems?

Pierluigi Paganini : The primary risk lies in the creation of highly detailed psychological profiles, built over time and potentially combinable with data breaches, OSINT sources and data broker datasets. This enables highly targeted social engineering, blackmail and disinformation campaigns. The danger increases with log retention: the longer such data remain in systems, the greater the attack surface, encompassing application vulnerabilities, cloud misconfigurations and insider threats. This makes data retention and minimisation policies critical.

User accounts, technical logs, metadata and conversation histories: to what extent do these elements enable effective re-identification of users, even in the absence of explicit personal identification data?

Pierluigi Paganini : Re-identification is often possible because technical identifiers such as IP addresses, devices and usage times, combined with behavioural metadata, can be correlated with other breached or tracking datasets. Conversations also contain quasi-identifiers such as profession, life events, relationships and habits. By cross-referencing these elements with public or commercial sources, it is often possible to quickly identify the individual, particularly when interactions are repeated and in-depth.

To what extent do current language model architectures expose risks of data leakage, information extraction or inference regarding sensitive data contained in prompts or training corpora?

Pierluigi Paganini : Current architectures present three main categories of risk: direct leakage through model outputs, inference attacks, and abuse of application integrations. Techniques such as model inversion and membership inference can reveal the use of sensitive data during training, while prompt injection and jailbreak attacks may induce the model to reuse learned information. In addition, insecure integrations with external tools or retrieval-augmented generation (RAG) databases can bypass controls and expose data that should remain isolated.

From a strictly technical standpoint, what specific attack surfaces emerge when highly sensitive data are processed and stored in centralised and heavily shared cloud infrastructures, regardless of their geographical location?

Pierluigi Paganini : Multi-tenant environments carry risks related to improper isolation: vulnerabilities in hypervisors, containers or shared services can enable lateral movement between tenants. Configuration errors, such as exposed storage or excessive IAM permissions, remain a primary cause of data breaches. The sharing of models, MLOps pipelines and common components amplifies the impact of vulnerabilities, while the complexity of the software supply chain increases the risk of upstream compromises and data poisoning.

The concepts of pseudonymisation and anonymisation are often invoked in the context of artificial intelligence. What are their concrete limitations when applied to rich, narrative and highly personal textual data?

Pierluigi Paganini : In personal narrative texts, pseudonymisation removes few explicit identifiers but leaves intact biographical, situational and linguistic references that function as quasi-identifiers, facilitating re-identification through external sources. Truly robust anonymisation would require generalisations and suppressions that would significantly reduce the usefulness of the data for training. Furthermore, stylometric analysis and semantic clustering can link texts to the same “voice,” further weakening traditional anonymisation approaches.

What specific vulnerabilities arise when sensitive data are processed or accessible from infrastructures located outside the European Union, in terms of control, third-party access and effective data protection?

Pierluigi Paganini : These vulnerabilities fall into three categories. First, a regulatory vulnerability: local laws may allow government access or secret orders incompatible with EU standards, beyond contractual safeguards. Second, an operational vulnerability: exercising rights such as erasure or restriction of processing vis-à-vis non-EU entities is difficult, particularly in opaque supply chains. Third, onward transfers: copies created for backup, testing or analytics may spread across multiple jurisdictions, complicating audits and oversight and increasing the risks of misuse or compromise.

What technical measures do you currently consider genuinely effective in reducing the risk of exposure, reuse or compromise of sensitive data in generative artificial intelligence systems?

Pierluigi Paganini : The most effective countermeasures combine prevention and containment. Upstream measures include data minimisation, automated classification of sensitive content and filters to block unnecessary personally identifiable information and psychological details prior to ingestion. At the infrastructure level, end-to-end encryption, strict tenant segregation, zero-trust access controls and immutable logging are essential. At the model level, LLM hardening includes privacy-oriented training and testing, defences against prompt injection, output monitoring, continuous red teaming, and data loss prevention (DLP) and masking applied to logs and datasets.

In light of the European Artificial Intelligence Regulation, what concrete impact will the obligations imposed on general-purpose AI models have, in your view, on data security and the management of risks related to highly sensitive information?

Pierluigi Paganini : The obligations applicable to general-purpose AI models will push providers to implement robust data governance systems, including source traceability, filter documentation, periodic impact assessments and verifiable security controls. This will foster greater discipline in the collection and management of sensitive data and logs. Risk management and transparency requirements will facilitate audits by authorities and enterprise customers, establishing a minimum standard for hardening, monitoring and limiting secondary uses of psychological data.