Common datasets and sizes ←
Just thought I'd collect those (partly absolutely insane) numbers whenever I stumble across them.
Text datasets ←
C4 - Colossal Clean Crawled Corpus ←
github.com/google-research/text-to-text-transfer-transformer
The dataset used for Google's T5 text pre-training
- based on Common Crawl
- 7 TB raw download
- 335 CPU days for cleaning
- 750 GB of cleaned text (jmlr.org/20-074, page 7)
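If you just want to poke at C4 without mirroring 750 GB, here's a minimal sketch using the Hugging Face `datasets` library in streaming mode. The hub id "allenai/c4" and the "en" config are assumptions based on the current hub listing; check there if the call fails.

```python
# Minimal sketch: stream C4 instead of downloading the full ~750 GB.
# Assumption: the dataset lives on the HF hub as "allenai/c4" with an "en" config.
from datasets import load_dataset

# streaming=True iterates over the remote files without a local download
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    # each C4 record carries "text", "url" and "timestamp" fields
    print(example["text"][:80])
    if i >= 2:
        break
```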
Text/Image datasets ←
LAION-5B ←
After the mind-blowing release of CLIP (Radford et al., arXiv:2103.00020), the non-profit organisation LAION recreated and released the training dataset (which OpenAI did not). It first released 400 million image/text pairs (LAION-400M), matching the scale described in the paper, and later a 5 billion pair version (LAION-5B).
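Worth knowing: what LAION distributes is metadata (image URLs plus captions), not the images themselves; those have to be fetched separately from the listed URLs. A minimal sketch for peeking at the metadata via streaming, assuming the hub id "laion/laion400m" (check the LAION hub page for the current name):

```python
# Minimal sketch: stream the LAION metadata (URL/caption pairs, no images).
# Assumption: the dataset id "laion/laion400m" -- verify on the HF hub.
from datasets import load_dataset

laion = load_dataset("laion/laion400m", split="train", streaming=True)

for i, row in enumerate(laion):
    # columns typically include the image URL and its alt-text caption
    print(row)
    if i >= 2:
        break
```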