---
title: README
emoji: 🐨
colorFrom: pink
colorTo: indigo
sdk: static
pinned: false
---

# The Stack v2 Training Data

This organization contains the full datasets used to train StarCoder2:

- `the-stack-v2-train-full`: the training data for StarCoder2-15B, covering 600+ programming languages, with files concatenated per repository.
- `the-stack-v2-train-full-files`: the same data as `the-stack-v2-train-full` but without repository concatenation, which makes it easier to filter by file or license.
- `the-stack-v2-train-smol`: the training data for StarCoder2-3B and StarCoder2-7B, covering 17 programming languages, with files concatenated per repository.
- `the-stack-v2-train-smol-files`: the same data as `the-stack-v2-train-smol` but without repository concatenation, which makes it easier to filter by file or license.

See the [tech report](https://arxiv.org/pdf/2402.19173) for full details on the datasets.
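
As a rough illustration of why the `-files` variants are easier to filter, the sketch below picks a dataset name from the four listed above and filters per-file rows by language and license. The helper names (`pick_variant`, `filter_files`) and the row fields (`path`, `language`, `license`) are hypothetical and used only for illustration; check the dataset cards for the actual schema.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# Dataset names as listed above, keyed by (language coverage, per-file layout).
VARIANTS: Dict[Tuple[str, bool], str] = {
    ("full", False): "the-stack-v2-train-full",
    ("full", True): "the-stack-v2-train-full-files",
    ("smol", False): "the-stack-v2-train-smol",
    ("smol", True): "the-stack-v2-train-smol-files",
}


def pick_variant(coverage: str, per_file: bool) -> str:
    """Return the dataset name for 'full' or 'smol' coverage (hypothetical helper)."""
    return VARIANTS[(coverage, per_file)]


def filter_files(
    rows: Iterable[dict], languages: Set[str], licenses: Set[str]
) -> Iterator[dict]:
    """Yield per-file rows whose language and license are both allowed.

    The 'language' and 'license' field names are assumptions, not the real schema.
    """
    for row in rows:
        if row["language"] in languages and row["license"] in licenses:
            yield row


# Toy rows standing in for a per-file dataset; the real data has its own columns.
rows = [
    {"path": "a.py", "language": "Python", "license": "MIT"},
    {"path": "b.rs", "language": "Rust", "license": "GPL-3.0"},
]
kept = list(filter_files(rows, {"Python"}, {"MIT", "Apache-2.0"}))
print(pick_variant("smol", per_file=True), [r["path"] for r in kept])
```

With the repository-concatenated variants, each row holds a whole repository, so a filter like this would first have to split the row back into files. To load real data, the chosen name would be passed to `datasets.load_dataset`, prefixed with this organization's namespace.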