pip install torch numpy transformers datasets tiktoken wandb tqdm
If you are not a deep learning professional and you just want to feel the magic and get your feet wet, the fastest way to get started is to train a character-level GPT on one of the bundled Chinese corpora: modern Chinese poetry or Chinese legal texts. First, the prepare scripts turn the raw text into one large stream of integers:
python data/chinese_modern_poetry/prepare.py
python data/chinese_laws_pretrain/prepare.py

Each prepare script creates a train.bin and val.bin in its own data directory.
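If you're curious what these prepare scripts do, a minimal character-level version in the nanoGPT style looks roughly like the sketch below. The filename input.txt, the 90/10 split, and the meta.pkl vocab file are conventions borrowed from upstream nanoGPT and are assumptions here; the actual scripts in this repo may tokenize differently:

```python
# Hypothetical sketch of a character-level prepare.py in the nanoGPT style;
# the actual scripts under data/ may tokenize differently.
import os
import pickle
import numpy as np

base = os.path.dirname(__file__)
# 'input.txt' is an assumed filename for the raw corpus
with open(os.path.join(base, 'input.txt'), 'r', encoding='utf-8') as f:
    data = f.read()

# Build a character-level vocabulary over the raw text.
# uint16 below assumes vocab_size < 65536, which holds for
# typical Chinese character inventories.
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# 90/10 train/val split, encoded as flat uint16 token streams.
n = len(data)
train_ids = np.array([stoi[c] for c in data[:int(0.9 * n)]], dtype=np.uint16)
val_ids = np.array([stoi[c] for c in data[int(0.9 * n):]], dtype=np.uint16)
train_ids.tofile(os.path.join(base, 'train.bin'))
val_ids.tofile(os.path.join(base, 'val.bin'))

# Save the vocab so sampling can decode generated ids back to characters.
with open(os.path.join(base, 'meta.pkl'), 'wb') as f:
    pickle.dump({'vocab_size': len(chars), 'stoi': stoi, 'itos': itos}, f)
```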
Now it is time to train your GPT. Its size very much depends on the computational resources of your system. I have a GPU. Great, we can quickly train a baby GPT with the settings provided in the config files (the _debug variants are presumably shorter runs for sanity-checking):
python train.py config/train_gpt2_chinese_poetry_debug.py
python train.py config/train_gpt2_chinese_poetry.py
python train.py config/train_gpt2_chinese_laws_debug.py
python train.py config/train_gpt2_chinese_laws.py

If you peek inside one of these config files, you'll see that we're training a GPT with a context size of up to 256 characters and 384 feature channels: a 6-layer Transformer with 6 heads in each layer. On one A100 GPU a training run at this scale takes only a few minutes, and the model checkpoints are written into the --out_dir directory named in the config (out-chinese-poetry or out-chinese-laws).
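For reference, a nanoGPT-style config file is just a plain Python module of overrides. A hypothetical small config might look like this; every value below is an assumption, so check the actual config/train_gpt2_chinese_*.py files for the real numbers:

```python
# Hypothetical values for a small character-level config; the numbers in
# config/train_gpt2_chinese_laws.py etc. may well differ -- check the file.
out_dir = 'out-chinese-laws'       # where checkpoints (ckpt.pt) are written
eval_interval = 250                # evaluate on val.bin every 250 iterations
eval_iters = 200
always_save_checkpoint = False     # only checkpoint when val loss improves

dataset = 'chinese_laws_pretrain'  # reads data/chinese_laws_pretrain/{train,val}.bin
batch_size = 64
block_size = 256                   # context of up to 256 characters

n_layer = 6                        # a 6-layer Transformer...
n_head = 6                         # ...with 6 heads per layer...
n_embd = 384                       # ...and 384 feature channels
dropout = 0.2

learning_rate = 1e-3
max_iters = 5000
lr_decay_iters = 5000
min_lr = 1e-4
```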
So once the training finishes we can sample from the best model by pointing the sampling script at this directory:

python sample-chinese-poetry.py --out_dir=out-chinese-poetry
python sample-chinese-laws.py --out_dir=out-chinese-laws

This generates a few samples.
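Under the hood, sampling is just a matter of loading the best checkpoint from --out_dir and decoding tokens. A minimal sketch, assuming nanoGPT's checkpoint layout and the meta.pkl written by the prepare step (the real sample-chinese-*.py scripts may add CLI flags, seeding, and device handling):

```python
# Hypothetical sketch of what a sample-chinese-*.py script does, assuming
# nanoGPT's checkpoint format: ckpt.pt holding 'model_args' and 'model'.
import pickle
import torch
from model import GPT, GPTConfig  # nanoGPT's model definition

out_dir = 'out-chinese-laws'
ckpt = torch.load(f'{out_dir}/ckpt.pt', map_location='cpu')
model = GPT(GPTConfig(**ckpt['model_args']))
# Note: checkpoints saved from a torch.compile'd model may carry an
# '_orig_mod.' prefix on state-dict keys that has to be stripped first.
model.load_state_dict(ckpt['model'])
model.eval()

# Decode generated ids back to text with the saved character vocab.
with open('data/chinese_laws_pretrain/meta.pkl', 'rb') as f:
    itos = pickle.load(f)['itos']

# Start from a single token and sample 200 new characters.
x = torch.zeros((1, 1), dtype=torch.long)
with torch.no_grad():
    y = model.generate(x, max_new_tokens=200, temperature=0.8, top_k=200)
print(''.join(itos[int(i)] for i in y[0]))
```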
