Comment: Using synthetic data to prioritise data privacy

Synthetic data acts as a protective barrier, preserving individuals' privacy and mitigating the risks associated with handling real-world data, says Steve Harris, CEO of Mindtech.

In today's digital landscape, data privacy has taken centre stage due to a surge in high-profile data breaches and the imposition of stringent data protection regulations. A prime example is the widely publicised 2021 T-Mobile data breach, which inflicted a staggering cost of $350m on the company in 2022 alone (excluding customer reimbursements).

Organisations are grappling with the challenge of extracting valuable insights from data while safeguarding the privacy and confidentiality of individuals. Synthetic data has emerged as a compelling way to bridge this gap, offering a feasible alternative to real-world data.

Synthetic data, typically generated through algorithms and statistical modelling techniques, can replicate the statistical properties, patterns, and structures of authentic datasets while excluding any personally identifiable information (PII). By eliminating sensitive data, synthetic data acts as a protective barrier, preserving individuals' privacy and mitigating the risks associated with handling real-world data.
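To make this concrete, below is a minimal sketch of the simplest version of the idea: fitting a multivariate Gaussian to a numeric table and sampling fresh rows from it. Production generators use far richer models (copulas, generative networks, simulators), and the toy columns here are purely illustrative.

```python
import numpy as np

def fit_and_sample(real_data: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Fit a multivariate Gaussian to numeric columns and sample synthetic rows."""
    rng = np.random.default_rng(seed)
    mean = real_data.mean(axis=0)
    cov = np.cov(real_data, rowvar=False)
    # The samples reproduce the real columns' means and correlations,
    # but no output row corresponds to any individual in the input.
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Toy "age, income" table standing in for a real dataset.
real = np.array([[34, 52000.0], [29, 48000.0], [41, 61000.0], [55, 75000.0]])
synthetic = fit_and_sample(real, n_samples=1000)
print(synthetic[:3])
```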

Protecting privacy and fairness

A significant challenge in training AI models is the scarcity of usable, accurate real-world datasets that adhere to licensing and data privacy regulations. This limitation can lead to constrained datasets for model training and leave sensitive information exposed. Moreover, limiting AI's training data can exacerbate bias and discrimination concerns.

The inherent nature of synthetic data allows it to emulate real-world patterns without relying on genuine data, thereby enhancing privacy by avoiding the need to reveal or transmit PII. Notably, though, the creation of synthetic data often starts from real-world datasets; alternative methods draw on analyst-derived insights or combine both approaches.

Because generation often starts from real data, synthetic records could in principle still leak information about the individuals behind them. To address this, a privacy assurance assessment can be conducted to ensure that the resulting synthetic data remains distinct from actual personal data. This process not only removes sensitive information but also offers the opportunity to modify datasets, resulting in a more balanced representation of society.
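The mechanics of such an assessment are not spelled out here; one common form it takes (assumed in this sketch, with purely numeric features) is a distance-to-closest-record check: any synthetic row that sits almost on top of a real record is treated as a potential leak. The function name and threshold are illustrative.

```python
import numpy as np

def flag_too_close(real: np.ndarray, synthetic: np.ndarray, threshold: float) -> np.ndarray:
    """Return a boolean mask of synthetic rows lying suspiciously close to a real record."""
    # Euclidean distance from every synthetic row to every real row.
    dists = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=-1)
    # Distance to the closest real record, per synthetic row.
    closest = dists.min(axis=1)
    return closest < threshold

# Rows flagged True are near-copies of real individuals; regenerate or drop
# them before the synthetic dataset is released.
```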

Current practice

A pressing data privacy issue in AI model training revolves around the use of real-world images. While the internet is replete with images of individuals and groups, each image is governed by specific licensing rules, whether for commercial or non-commercial use. Determining which images are suitable for model training further complicates matters.

Synthetic data, however, offers a way around this challenge, resolving at a stroke the privacy and licensing issues associated with such content. Synthetic data comes in various forms, including tabular, textual, video, and audio formats, all of which are applicable in real-world scenarios.

Rather than relying on customers' real data to train AI models for database analysis, an innovative approach is to generate configurable facial images synthetically. Because the images are configurable, diverse facial features can be produced on demand, enriching AI training while retaining a recognisable resemblance to faces derived from real-world data.
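As an illustration of what "configurable" can mean in practice, the sketch below enumerates a grid of face parameters. The FaceConfig fields and the downstream renderer are assumptions made for illustration, not a description of any specific product.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class FaceConfig:
    """One synthetic face specification; every field here is illustrative."""
    age_band: str       # e.g. "18-25", "60+"
    skin_tone: int      # index into a reference tone scale
    pose_yaw_deg: int   # head rotation, in degrees
    lighting: str       # e.g. "indoor", "outdoor"

# Sweeping the parameter grid yields a balanced, licence-free training set,
# rather than whatever mix happens to exist in scraped real-world images.
configs = [
    FaceConfig(age, tone, yaw, light)
    for age, tone, yaw, light in product(
        ["18-25", "26-40", "41-60", "60+"], range(1, 7), [-30, 0, 30], ["indoor", "outdoor"]
    )
]
# Each config would then be passed to a renderer or generative model
# (not shown, and hypothetical here) to produce the actual image.
```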

This approach can be extended to generate synthetic documents and train AI systems for tasks such as classification, language translation, and text summarisation, faithfully replicating the structure of real-world documents without compromising personal privacy. This advancement holds immense potential for data privacy compliance, pointing towards a future in which a standardised system of configurable faces and synthetic data becomes the norm for AI model training.
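One lightweight way to produce such documents (an assumption of this sketch, not necessarily any particular vendor's pipeline) is to fill templates with fabricated PII from the open-source Faker library; the invoice template and its fields are illustrative.

```python
from faker import Faker  # open-source library: pip install faker

fake = Faker()
Faker.seed(42)  # reproducible fabricated identities

TEMPLATE = "Invoice for {name}\n{address}\nIssued by {company} on {date}\n"

def synthetic_invoice() -> str:
    """Build a realistic-looking document containing no real person's PII."""
    return TEMPLATE.format(
        name=fake.name(),        # fabricated, not drawn from real records
        address=fake.address(),
        company=fake.company(),
        date=fake.date(),
    )

# Thousands of labelled documents for classification or summarisation training.
docs = [synthetic_invoice() for _ in range(1000)]
```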

The knock-on benefits

The use of synthetic data to uphold data privacy has far-reaching benefits for machine learning training and broader collaboration within the AI community. Effective AI models require a substantial amount of training data. Synthetic data addresses this need by rapidly and cost-effectively generating abundant data streams without the pitfalls of PII.

The computer-generated nature of synthetic data enables on-demand data creation, removing the dependence on procuring sufficient usable real-world datasets. This adaptability also bolsters another facet of AI development: sharing and collaboration. Just as the open-source community shares and modifies code, synthetic data fosters a comparable culture of sharing datasets and jointly training AI models.

Unlike real-world data, where access restrictions hinder sharing among departments or partner companies, synthetic data can be shared largely at its owner's discretion. Encouraging such sharing and collaboration has the potential to make AI model development more responsible, secure, and efficient.

A transformative impact for data privacy?

The T-Mobile incident underscores the risks of working directly with real-world data, exposing organisations to breaches and privacy violations that invite financial losses, legal disputes, and reputational harm. And while genuine data remains integral to synthetic data generation and AI model development, the pool of usable real-world data is often limited and may still contain PII.

The standout feature of synthetic data for data privacy is its privacy-by-design protection. Beyond addressing data scarcity and fostering collaboration, its adoption could transform data privacy and AI training paradigms as we understand them today.

Steve Harris, CEO of Mindtech