
BigCode is an open scientific collaboration working on the responsible development and use of large language models (LLMs) for code, aiming to empower the machine learning and open source communities through open governance. Code LLMs enable the completion and synthesis of code, both from other code snippets and from natural language descriptions, and can be used across a wide range of domains, tasks, and programming languages. These models can, for example, assist professionals and hobbyists with building new LLM applications. As part of BigCode, the team created StarCoder and StarCoderBase, large language models for code.

Data Collection and Management Plan

Overview

The primary training dataset used for the BigCode project is The Stack, which was obtained by gathering public code files, issues, and commits from GitHub. For more information on The Stack dataset, visit this page to request access and view the dataset card. To collect GitHub repositories, the team first extracted a list of repositories from GHArchive and then cloned all of them using a large CPU cluster. They also used GHArchive data to extract GitHub issues, and gathered git commits from a public dataset on Google BigQuery. Additionally, they collected a dataset of annotations of several kinds of private information on a subset of The Stack to support their privacy risk mitigation efforts.
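To illustrate what this kind of collection pipeline can look like, the sketch below queries the public GHArchive dataset on BigQuery for repositories with push activity and shallow-clones a small sample. The table name, date, event filter, and clone settings are illustrative assumptions for a minimal example, not BigCode's exact configuration.

```python
# Minimal sketch: build a repository list from the public GHArchive dataset
# on BigQuery, then clone a small sample. Table/date/event choices are
# illustrative assumptions; the real pipeline ran on a large CPU cluster.
import subprocess
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT DISTINCT repo.name AS repo_name
    FROM `githubarchive.day.20230101`   -- one day of GitHub events (assumed example date)
    WHERE type = 'PushEvent'
"""

repo_names = [row.repo_name for row in client.query(query).result()]
print(f"Found {len(repo_names)} repositories with push activity on that day")

# Shallow-clone a handful of repositories as a demonstration.
for name in repo_names[:5]:
    subprocess.run(
        ["git", "clone", "--depth", "1", f"https://github.com/{name}.git"],
        check=False,
    )
```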

The legal basis for data collection under fair use and with regard to the GDPR, as well as the corresponding case law, is still evolving. In this context, the data collection and data management plans were carefully crafted with support from leading experts in the open source and legal tech community who participated in the BigCode Legal, Ethics, Governance Working Group, in a best-effort approach to reflect current understandings of the legal requirements for data collection and management.

The following sections will dive into the details of how the team approached key data governance activities like data collection, data management & opt-out, and data processing.

Data Collection

The StarCoder model was trained on The Stack v1.2, which exclusively contains 6.4TB of permissively licensed data from GitHub repositories, filtered down from an original source dataset of 102TB. Selecting repositories based on licenses is only the first step, however; it is complemented by giving repository owners the ability to opt out of having their repositories included in The Stack, as described in the Data Management & Opt-Out section. A simplified illustration of license-based filtering is sketched below.
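The following sketch shows how a license allow-list can be applied to file records. The field name and the (abbreviated) set of permissive SPDX identifiers are illustrative assumptions, not The Stack's exact schema or license list.

```python
# Minimal sketch of license-based filtering, assuming each file record carries
# SPDX license identifiers in a `license` field. The allow-list below is a
# small illustrative subset, not the full list used for The Stack.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"}

def is_permissive(record: dict) -> bool:
    """Keep a file only if it has at least one license and all are permissive."""
    licenses = record.get("license", [])
    if isinstance(licenses, str):
        licenses = [licenses]
    return bool(licenses) and all(lic in PERMISSIVE_LICENSES for lic in licenses)

files = [
    {"path": "src/app.py", "license": ["MIT"]},
    {"path": "vendor/lib.c", "license": ["GPL-3.0-only"]},
    {"path": "README_no_license.txt", "license": []},
]

kept = [f["path"] for f in files if is_permissive(f)]
print(kept)  # only the MIT-licensed file survives
```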

Data Management & Opt-Out

Data Access

Data Opt-Out

Data Processing

One significant concern with respect to privacy was the risk that the code LLM might generate private information found in its training data, such as private tokens or passwords matched with identifiers or email addresses. Additionally, while users can request (and have requested) that their data be removed from The Stack because it contains personal data, removing specific information from trained model weights after the fact remains an open technical challenge. To minimize this risk, the team chose to apply automated PII redaction at the pre-processing stage, before training.

The PII redaction process consisted of the following steps:

- annotating a subset of The Stack for several kinds of private information (such as names, email addresses, keys, and passwords);
- training a PII detection model on these annotations;
- running the model over the training data and replacing detected PII with placeholder tokens before training.

Learn more about the PII redaction process through this blog post from Toloka.
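The sketch below illustrates the placeholder-replacement step using a token-classification model from the Hugging Face Hub. The model id ("bigcode/starpii"), its entity label names, and the placeholder format are assumptions made for this example; the production pipeline ran this kind of redaction at dataset scale during pre-processing rather than on individual snippets.

```python
# Minimal sketch of placeholder-based PII redaction with a token-classification
# model. Model id and label names are assumptions for illustration.
from transformers import pipeline

pii_detector = pipeline(
    "token-classification",
    model="bigcode/starpii",          # assumed model id for the PII detector
    aggregation_strategy="simple",    # merge sub-tokens into entity spans
)

def redact(code: str) -> str:
    """Replace each detected PII span with a placeholder such as <EMAIL> or <KEY>."""
    pieces, last_end = [], 0
    for entity in sorted(pii_detector(code), key=lambda e: e["start"]):
        pieces.append(code[last_end:entity["start"]])
        pieces.append(f"<{entity['entity_group']}>")
        last_end = entity["end"]
    pieces.append(code[last_end:])
    return "".join(pieces)

print(redact('SMTP_USER = "jane.doe@example.com"  # contact the maintainer'))
```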

Acknowledgments

This case study is based on the BigCode Governance Card, a living document that will evolve over time with the BigCode project, made possible by the efforts of hundreds of BigCode participants. Please leave a comment in the Hugging Face Community space to ask a question or start a conversation about BigCode project governance.