Privacy-preserving Machine Learning

Much current research in data science involves machine learning (ML) models that interact with data sourced from large numbers of individuals, who vary widely in their awareness of, consent to, and understanding of the research goals. Researchers therefore have a responsibility to protect the confidentiality and privacy of the people whose data they process. At the same time, sharing both data and trained models drives scientific advancement and promotes the important social goals of open and transparent science.

It is important to note that local and international regulations, such as the General Data Protection Regulation (GDPR) and the EU’s policy on trustworthy AI, also establish legal duties and principles for privacy protection that the tools described below can help researchers meet.

Sharing data with privacy

Training a complex ML model often requires more data than a single researcher or organisation could feasibly generate. Sharing our data not only makes our research more reproducible, but also promotes advancement in the field as a whole. However, it poses the risk of inadvertently releasing personal information that could be used to identify a subject.

Most researchers will remove uniquely identifying information (such as ID numbers, addresses, and phone numbers) before publication, but research has shown that, with access to secondary datasets, such ‘pseudonymised’ datasets can still be traced back to individuals [NS08, SWZ12].
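To make this concrete, the short sketch below (the file and column names are invented for illustration) drops the obvious direct identifiers from a table, but the quasi-identifiers it leaves behind can still be cross-referenced with other datasets:

```python
import pandas as pd

# Hypothetical survey data; the file name and column names are invented.
responses = pd.read_csv("survey_responses.csv")

# Dropping direct identifiers is straightforward...
pseudonymised = responses.drop(columns=["participant_id", "name", "address", "phone"])

# ...but quasi-identifiers such as these can often be cross-referenced with
# secondary datasets (voter rolls, social media, other studies) to re-identify people.
quasi_identifiers = ["postcode", "date_of_birth", "gender"]
print(pseudonymised[quasi_identifiers].head())
```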

Differential privacy

Differential privacy is a statistical framework that quantifies how much a released result could reveal about any single member of a dataset; calibrated noise can then be added to the result so that this risk stays within an agreed bound and privacy is preserved [YZM+12].
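As a minimal sketch of the idea, the Laplace mechanism below releases a noisy count whose noise scale is calibrated to the query’s sensitivity and a chosen privacy parameter epsilon. The function name and parameter values are illustrative rather than taken from any particular library:

```python
import numpy as np

def laplace_count(records, predicate, epsilon, rng=None):
    """Release a differentially private count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(predicate(record) for record in records)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative use: how many participants are over 65, released with epsilon = 0.5
ages = [34, 71, 52, 68, 45, 80, 59]
private_count = laplace_count(ages, lambda age: age > 65, epsilon=0.5)
```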

Synthetic data generation

If sharing the original data raises privacy or ethical concerns, we can still contribute useful information by sharing synthetic datasets that reproduce statistical features of the original dataset without exposing actual instances [TFR20].
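A deliberately naive sketch of the idea is shown below: it fits a multivariate Gaussian to the numeric columns of a dataset and samples new rows that match its means and covariances. Real synthetic data generators, including the methods surveyed in [TFR20], are far more sophisticated, and this simple approach offers no formal privacy guarantee on its own; the file names are placeholders.

```python
import numpy as np
import pandas as pd

def gaussian_synthetic(df, n_samples, rng=None):
    """Sample synthetic rows from a multivariate Gaussian fitted to the numeric columns.

    This preserves the means and covariances of the original data, but does not,
    by itself, provide any formal privacy guarantee.
    """
    rng = rng or np.random.default_rng()
    numeric = df.select_dtypes(include="number")
    mean = numeric.mean().to_numpy()
    cov = numeric.cov().to_numpy()
    samples = rng.multivariate_normal(mean, cov, size=n_samples)
    return pd.DataFrame(samples, columns=numeric.columns)

# Hypothetical usage: 'measurements.csv' stands in for the original dataset
original = pd.read_csv("measurements.csv")
synthetic = gaussian_synthetic(original, n_samples=len(original))
synthetic.to_csv("synthetic_measurements.csv", index=False)
```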

Learning with privacy

Beyond sharing data with other researchers, we can also share our trained models, or make them available as a service that carries out predictions on data provided by others, saving them the time and resources needed to train their own systems. However, this kind of sharing also carries risks for personal privacy. For instance, many ML services require users to send personal data to a central server for processing, exposing it to the risk of interception or misuse. The model itself may also memorise sequences from the training data that we do not wish it to retain, a phenomenon referred to as unintended memorization [CLE+19]. This can be particularly harmful for models trained on large amounts of user-created text [BLM+22].

Federated learning

Federated learning is a design paradigm in which users’ data never leaves their own devices: training is broken down into a set of computations that take place on the edge, and only model updates are sent back to a central coordinator [KMA+19].
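The sketch below illustrates the core loop of federated averaging on a toy linear model: each client computes an update on its own data, and only the resulting weights (never the raw data) are returned to the coordinator, which averages them. The model, learning rates, and client data are all placeholders, not a production protocol.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local step: a few epochs of gradient descent on a simple
    linear model, computed entirely on the client's own data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w -= lr * grad
    return w

def federated_averaging(clients, rounds=10, n_features=3):
    """Coordinator loop: broadcast the current weights, collect local updates,
    and average them, weighted by each client's dataset size."""
    global_w = np.zeros(n_features)
    for _ in range(rounds):
        updates, sizes = [], []
        for X, y in clients:                # raw data never leaves the client
            updates.append(local_update(global_w, X, y))
            sizes.append(len(y))
        global_w = np.average(updates, axis=0, weights=np.array(sizes, dtype=float))
    return global_w

# Illustrative usage: three clients, each holding its own (X, y) data locally
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(3)]
model = federated_averaging(clients)
```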

Adversarial learning

We can also draw on research in cross-domain training to teach models to ignore sensitive information by directly controlling the training process, so that the learned representation carries no signal about the attributes we want to protect [CNC18]. The same approach can be extended beyond private attributes to the elimination of unwanted biases [ZLM18].
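One common recipe is to train an adversary that tries to recover the sensitive attribute from the model’s internal representation, while the main model is trained both to perform its task and to defeat that adversary. The PyTorch sketch below illustrates this with alternating updates; the architecture, data, and hyperparameters are invented for illustration and are not taken from the cited papers.

```python
import torch
import torch.nn as nn

# Placeholder dimensions and toy tensors standing in for a real dataset
n_features, n_hidden = 16, 8
encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
task_head = nn.Linear(n_hidden, 1)   # predicts the target label y
adversary = nn.Linear(n_hidden, 1)   # tries to predict the sensitive attribute s

opt_main = torch.optim.Adam(list(encoder.parameters()) + list(task_head.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

X = torch.randn(256, n_features)            # toy inputs
y = torch.randint(0, 2, (256, 1)).float()   # task label
s = torch.randint(0, 2, (256, 1)).float()   # sensitive attribute to protect

for step in range(200):
    # 1) Train the adversary to predict s from the current (frozen) representation
    opt_adv.zero_grad()
    adv_loss = bce(adversary(encoder(X).detach()), s)
    adv_loss.backward()
    opt_adv.step()

    # 2) Train the encoder and task head to predict y while fooling the adversary
    opt_main.zero_grad()
    z = encoder(X)
    task_loss = bce(task_head(z), y)
    fool_loss = -bce(adversary(z), s)   # maximise the adversary's error
    (task_loss + fool_loss).backward()  # in practice this term is usually scaled by a trade-off weight
    opt_main.step()
```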

Differential privacy

Differential privacy is also widely used to preserve privacy during model training itself: adding small amounts of statistical noise during training reduces the risk of the model memorising individual data points [BDC20, FBDD20].
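The core idea, often called DP-SGD, is to clip each example’s gradient so that no single data point can dominate an update, and then add calibrated Gaussian noise before applying it. The sketch below shows this for a toy linear model; production implementations (for example in libraries such as Opacus or TensorFlow Privacy) additionally track the cumulative privacy budget, and the hyperparameters here are purely illustrative.

```python
import numpy as np

def dp_sgd(X, y, epochs=20, lr=0.1, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Sketch of DP-SGD for a linear model: clip each example's gradient to bound
    its influence, then add Gaussian noise to the summed gradient before updating."""
    rng = rng or np.random.default_rng()
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        clipped = []
        for i in range(n):
            g = (X[i] @ w - y[i]) * X[i]                    # per-example gradient (squared loss)
            norm = np.linalg.norm(g)
            clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
        noise = rng.normal(0.0, noise_multiplier * clip_norm, size=d)
        w -= lr * (np.sum(clipped, axis=0) + noise) / n
    return w
```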
