
The Pile

An 800GB Dataset of Diverse Text for Language Modeling

What is the Pile?

The Pile is an 825 GiB diverse, open-source language modeling dataset that consists of 22 smaller, high-quality datasets combined.

Download

The Pile is hosted by the Eye.

The format of the Pile is jsonlines data compressed using zstandard.
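As a minimal sketch of reading this format, the following Python streams records from a zstandard-compressed jsonlines file one line at a time. It assumes the third-party `zstandard` package is installed; the function names and the `text` field used in the sample are assumptions for illustration, not something this page specifies.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_zst_jsonl(path: str) -> Iterator[dict]:
    """Stream records from a zstandard-compressed jsonlines file.

    Assumes the third-party `zstandard` package (pip install zstandard).
    """
    import zstandard  # imported lazily so iter_jsonl works without it

    with open(path, "rb") as fh:
        # stream_reader decompresses incrementally instead of loading the file.
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        yield from iter_jsonl(io.TextIOWrapper(reader, encoding="utf-8"))


if __name__ == "__main__":
    # Hypothetical in-memory sample standing in for a decompressed file.
    sample = io.StringIO('{"text": "hello"}\n{"text": "world"}\n')
    for record in iter_jsonl(sample):
        print(record["text"])
```

Streaming line by line keeps memory use flat, which matters for files of this size; `iter_zst_jsonl` would be called with the local path of a downloaded file.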

Have a model that uses or evaluates on the Pile? Let us know!

Why is the Pile a good training set?

Recent work has shown that, especially for large models, diversity in data sources improves the model's general cross-domain knowledge as well as its downstream generalization capability. In our evaluations, models trained on the Pile not only show moderate improvements on traditional language modeling benchmarks but also show significant improvements on Pile BPB.

Why is the Pile a good benchmark?

To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains, including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. Pile BPB is a measure of world knowledge and reasoning ability in these domains, making it a robust benchmark of general, cross-domain text modeling ability for large language models.
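For readers unfamiliar with the metric, bits per byte is generally computed by summing a model's negative log-likelihood over a text, converting it to bits, and dividing by the text's length in UTF-8 bytes. The sketch below illustrates that general definition only; the function name and inputs are hypothetical, and it is not the evaluation code used for the leaderboard.

```python
import math
from typing import Sequence


def bits_per_byte(token_nll_nats: Sequence[float], text: str) -> float:
    """Convert per-token negative log-likelihoods (in nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    # Dividing by ln(2) converts nats to bits.
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / n_bytes


if __name__ == "__main__":
    # Hypothetical example: a 4-byte text split into two tokens,
    # each assigned probability 1/4 (2 bits per token).
    print(bits_per_byte([math.log(4), math.log(4)], "abcd"))
```

Normalizing by bytes rather than tokens makes scores comparable across models with different tokenizers, since the byte count of a text does not depend on how it is tokenized.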

Citing

If you use the Pile or any of the components, please cite us!

@article{pile,
  title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}
                

Leaderboard

* indicates potential test-set overlap. Zero-shot indicates that not all of the components of the Pile were present in the training data.

Rank  Model                                        Test BPB
1.    GPT-3 (Zero-Shot)* (OpenAI, Jan 1, 2021)     0.7177
2.    GPT-2 (Zero-Shot)* (OpenAI, Jan 1, 2021)     1.2253
