
Common datasets and sizes

Just thought I'd collect those (partly absolutely insane) numbers whenever I stumble across them.

Text datasets

C4 - Colossal Clean Crawled Corpus

github.com/google-research/text-to-text-transfer-transformer

Used for Google's T5 text pre-training; the cleaned English split is roughly 750 GB of text.
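
A minimal sketch of peeking into C4, assuming the Hugging Face datasets library and its allenai/c4 mirror (both are assumptions, not part of the original T5 release):

```python
# Minimal sketch: stream a few C4 records without downloading the full
# (~750 GB) English split. "allenai/c4" is the community mirror on the
# Hugging Face Hub, an assumption here, not part of the original T5 release.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    # each record has "text", "url" and "timestamp" fields
    print(example["text"][:80].replace("\n", " "), "...")
    if i >= 2:
        break
```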

Text/Image datasets

LAION-5B

laion.ai/blog/laion-5b/

After the mind-blowing release of CLIP (Radford et al., arXiv:2103.00020), the non-profit organisation LAION recreated and released a comparable dataset (which OpenAI never did), first with 400 million image/text pairs as described in the CLIP paper, and later a 5,000-million (5 billion) version.
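
A minimal sketch of streaming the released metadata, which consists of URL/caption pairs rather than the images themselves. It assumes the Hugging Face datasets library, the laion/laion2B-en metadata repo (one of the English subsets of LAION-5B) and upper-case URL/TEXT columns; repo name and column names are assumptions based on the released parquet files.

```python
# Minimal sketch: stream a few rows of LAION metadata without downloading
# everything. Repo name and column names (URL/TEXT) are assumptions based on
# the publicly released parquet files; the actual images have to be fetched
# from the URLs separately (e.g. with the img2dataset tool).
from datasets import load_dataset

laion = load_dataset("laion/laion2B-en", split="train", streaming=True)

for i, row in enumerate(laion):
    print(row["URL"], "->", row["TEXT"][:60])
    if i >= 2:
        break
```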