Common datasets and sizes ←
Just thought I'd collect those (partly absolutely insane) numbers whenever I stumble across them.
Text datasets ←
C4 - Colossal Clean Crawled Corpus ←
github.com/google-research/text-to-text-transfer-transformer
The dataset used for Google's T5 text pre-training
- based on Common Crawl
- 7 TB raw download
- 335 CPU days for cleaning
- 750 GB of cleaned text (jmlr.org/20-074, page 7)
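If you just want to poke at C4 without mirroring 750 GB, here's a minimal sketch using the Hugging Face `datasets` library in streaming mode. The hub id "allenai/c4" and the "en" config are assumptions based on the current hub listing; check there if the call fails.

```python
# Minimal sketch: stream C4 instead of downloading the full ~750 GB.
# Assumption: the dataset lives on the HF hub as "allenai/c4" with an "en" config.
from datasets import load_dataset

# streaming=True iterates over the remote files without a local download
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    # each C4 record carries "text", "url" and "timestamp" fields
    print(example["text"][:80])
    if i >= 2:
        break
```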
Text/Image datasets ←
LAION-5B ←
After the mind-blowing release of CLIP (Radford et al., arXiv:2103.00020), the non-profit organisation LAION recreated and released the training dataset (which OpenAI did not). It first released 400 million image/text pairs (LAION-400M), matching the scale described in the paper, and later a 5 billion pair version (LAION-5B).
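Worth knowing: what LAION distributes is metadata (image URLs plus captions), not the images themselves; those have to be fetched separately from the listed URLs. A minimal sketch for peeking at the metadata via streaming, assuming the hub id "laion/laion400m" (check the LAION hub page for the current name):

```python
# Minimal sketch: stream the LAION metadata (URL/caption pairs, no images).
# Assumption: the dataset id "laion/laion400m" -- verify on the HF hub.
from datasets import load_dataset

laion = load_dataset("laion/laion400m", split="train", streaming=True)

for i, row in enumerate(laion):
    # columns typically include the image URL and its alt-text caption
    print(row)
    if i >= 2:
        break
```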