Let's Build GPT from Scratch for Text Generation

We will use KerasNLP to build a scaled-down Generative Pre-trained Transformer (GPT) model in this example. GPT is a Transformer-based model that allows you to generate sophisticated text from a prompt.

We will train the model on the simplebooks-92 corpus, a dataset made from several novels. It is a good dataset for this example because it has a small vocabulary and high word frequency, which is beneficial when training a model with few parameters.



Table of Contents

◆ Introduction
◆ Setup
◆ Settings & hyperparameters
◆ Load the data
◆ Train the tokenizer
◆ Load tokenizer
◆ Tokenize data
◆ Build the model
◆ Training
◆ Inference
   -> Greedy search
   -> Beam search
   -> Random search

◆ Conclusion

Source: The code used in this article is from the official Keras website.

Setup

Note: If you are running this example on a Colab, make sure to enable GPU runtime for faster training.

The command !pip install -q --upgrade keras-nlp installs and upgrades the keras-nlp package. The -q flag keeps the installation output quiet (minimal output), and --upgrade ensures the latest version of the package is installed.

The command !pip install -q --upgrade keras upgrades the keras package to the latest version.

!pip install -q --upgrade keras-nlp
!pip install -q --upgrade keras # Upgrade to Keras 3.
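
The code in the rest of this article references os, keras, keras_nlp, tf_data, and tf_strings, so the following imports are assumed (they match the original Keras example):

import os

import keras
import keras_nlp
import tensorflow.data as tf_data
import tensorflow.strings as tf_strings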

Settings & hyperparameters

Here, we define several key parameters related to data preparation, model architecture, training, and inference for the transformer model.

# Data
BATCH_SIZE = 64
MIN_STRING_LEN = 512 # Strings shorter than this will be discarded
SEQ_LEN = 128 # Length of training sequences, in tokens

# Model
EMBED_DIM = 256
FEED_FORWARD_DIM = 128
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000 # Limits parameters in model.

# Training
EPOCHS = 5

# Inference
NUM_TOKENS_TO_GENERATE = 80

Load the data

Now, let’s download the dataset! The SimpleBooks dataset consists of 1,573 Gutenberg books and has one of the smallest vocabulary sizes to word-level tokens ratios. Its vocabulary size is ~98k, a third of WikiText-103’s, with around the same number of tokens (~100M). This makes it easy to fit a small model.

keras.utils.get_file(
    origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",
    extract=True,
)

dir = os.path.expanduser("~/.keras/datasets/simplebooks/")

# Load simplebooks-92 train set and filter out short lines.
raw_train_ds = (
    tf_data.TextLineDataset(dir + "simplebooks-92-raw/train.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
)

# Load simplebooks-92 validation set and filter out short lines.
raw_val_ds = (
    tf_data.TextLineDataset(dir + "simplebooks-92-raw/valid.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
)
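
If you want to peek at the raw text before tokenizing (an optional check we added, not part of the original tutorial), you can print the start of one line from a batch:

# Each element of raw_train_ds is a batch of raw text lines.
for batch in raw_train_ds.take(1):
    print(batch[0].numpy()[:200])  # first 200 bytes of one training line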

Train the tokenizer

We train the tokenizer from the training dataset for a vocabulary size of VOCAB_SIZE, which is a tuned hyperparameter. We want to limit the vocabulary as much as possible, as we will see later on that it has a large effect on the number of model parameters. We also don't want to include too few vocabulary terms, or there would be too many out-of-vocabulary (OOV) sub-words. In addition, three tokens are reserved in the vocabulary:

  • "[PAD]" for padding sequences to SEQ_LEN. This token has index 0 in both reserved_tokens and vocab, since WordPieceTokenizer (and other layers) consider 0/vocab[0] as the default padding.
  • "[UNK]" for OOV sub-words, which should match the default oov_token="[UNK]" in WordPieceTokenizer.
  • "[BOS]" stands for beginning of sentence, but here technically it is a token representing the beginning of each line of training data.
# Train tokenizer vocabulary
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
raw_train_ds,
vocabulary_size=VOCAB_SIZE,
lowercase=True,
reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)
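
As an optional sanity check (ours, not part of the original example), you can confirm that the reserved tokens occupy the first slots of the learned vocabulary:

# compute_word_piece_vocabulary returns a list of sub-word strings.
print(len(vocab))  # at most VOCAB_SIZE (5000)
print(vocab[:3])   # expected: ['[PAD]', '[UNK]', '[BOS]']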

Load tokenizer

We use the vocabulary data to initialize keras_nlp.tokenizers.WordPieceTokenizer. WordPieceTokenizer efficiently implements the WordPiece algorithm used by BERT and other models. It will strip, lowercase, and do other irreversible preprocessing operations.

tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)
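
As a quick illustration (the sample sentence is ours), tokenizing a string returns a dense tensor of sub-word ids, lowercased and padded with 0 ([PAD]) up to SEQ_LEN:

ids = tokenizer(["The quick brown fox jumped over the lazy dog."])
print(ids.shape)            # (1, 128)
print(ids[0, :10].numpy())  # first few sub-word ids; trailing positions are 0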

Tokenize data

We preprocess the dataset by tokenizing and splitting it into features and labels.

# packer adds a start token
start_packer = keras_nlp.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]"),
)


def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels

# Tokenize and split into train and label sequences.
train_ds = raw_train_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
val_ds = raw_val_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
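
To see concretely how features and labels line up (an illustrative check we added), inspect one batch: features start with the [BOS] id and labels are the same tokens shifted left by one position, i.e. the next-token targets.

for features, labels in train_ds.take(1):
    print(features[0, :8].numpy())  # [BOS] id (2) followed by the first tokens
    print(labels[0, :8].numpy())    # the same tokens, shifted left by one position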

Build the model

We create our scaled-down GPT model with the following layers:

  • One keras_nlp.layers.TokenAndPositionEmbedding layer, which combines the embedding for the token and its position.
  • Multiple keras_nlp.layers.TransformerDecoder layers, with the default causal masking. The layer has no cross-attention when run with the decoder sequence only.
  • One final dense linear layer.

inputs = keras.layers.Input(shape=(None,), dtype="int32")

# Embedding.
embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)

# Transformer decoders.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_nlp.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention.

# Output.
outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

Let’s take a look at our model summary — a large majority of the parameters are in the token_and_position_embedding and the output dense layer! This means that the vocabulary size (VOCAB_SIZE) has a large effect on the size of the model, while the number of Transformer decoder layers (NUM_LAYERS) doesn't affect it as much.

model.summary()
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape              ┃    Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ input_layer (InputLayer)        │ (None, None)              │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ token_and_position_embedding    │ (None, None, 256)         │  1,312,768 │
│ (TokenAndPositionEmbedding)     │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ transformer_decoder             │ (None, None, 256)         │    329,085 │
│ (TransformerDecoder)            │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ transformer_decoder_1           │ (None, None, 256)         │    329,085 │
│ (TransformerDecoder)            │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense (Dense)                   │ (None, None, 5000)        │  1,285,000 │
└─────────────────────────────────┴───────────────────────────┴────────────┘
 Total params: 3,255,938 (12.42 MB)
 Trainable params: 3,255,938 (12.42 MB)
 Non-trainable params: 0 (0.00 B)
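
To see where those numbers come from, we can reproduce the two largest parameter counts by hand (a quick back-of-the-envelope check using the hyperparameters defined above):

# Token embedding (VOCAB_SIZE x EMBED_DIM) plus learned position embedding (SEQ_LEN x EMBED_DIM).
embedding_params = VOCAB_SIZE * EMBED_DIM + SEQ_LEN * EMBED_DIM
print(embedding_params)  # 1312768 -> the 1,312,768 in token_and_position_embedding

# Final dense projection back onto the vocabulary: weights plus biases.
output_params = EMBED_DIM * VOCAB_SIZE + VOCAB_SIZE
print(output_params)  # 1285000 -> the 1,285,000 in the dense layer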

Training

Now that we have our model, let’s train it with the fit() method.

model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)
Epoch 1/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 216s 66ms/step - loss: 5.0008 - perplexity: 180.0715 - val_loss: 4.2176 - val_perplexity: 68.0438
Epoch 2/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 127s 48ms/step - loss: 4.1699 - perplexity: 64.7740 - val_loss: 4.0553 - val_perplexity: 57.7996
Epoch 3/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 126s 47ms/step - loss: 4.0286 - perplexity: 56.2138 - val_loss: 4.0134 - val_perplexity: 55.4446
Epoch 4/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 134s 50ms/step - loss: 3.9576 - perplexity: 52.3643 - val_loss: 3.9900 - val_perplexity: 54.1153
Epoch 5/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 135s 51ms/step - loss: 3.9080 - perplexity: 49.8242 - val_loss: 3.9500 - val_perplexity: 52.0006
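
Note that the reported perplexity is (approximately) just the exponential of the cross-entropy loss, which you can verify from the final epoch's numbers:

import math

# Perplexity ≈ exp(cross-entropy loss).
print(math.exp(3.9080))  # ≈ 49.8, close to the reported training perplexity of 49.8242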

Inference

With our trained model, we can test it out to gauge its performance. To do this, we can seed our model with an input sequence starting with the "[BOS]" token and progressively sample the model by making predictions for each subsequent token in a loop.

To start, let's build a prompt with the same shape as our model inputs, containing only the "[BOS]" token.

# The "packer" layers adds the [BOS] token for us.
prompt_tokens = start_packer(tokenizer([""]))
prompt_tokens
<tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
dtype=int32)>

We will use the keras_nlp.samplers module for inference, which requires a callback function wrapping the model we just trained. This wrapper calls the model and returns the logit predictions for the current token we are generating.

Note: Two pieces of more advanced functionality are available when defining your callback. The first is the ability to take in a cache of states computed in previous generation steps, which can be used to speed up generation. The second is the ability to output the final dense "hidden state" of each generated token. This is used by keras_nlp.samplers.ContrastiveSampler, which avoids repetition by penalizing repeated hidden states. Both are optional, and we will ignore them for now.

def next(prompt, cache, index):
    logits = model(prompt)[:, index - 1, :]
    # Ignore hidden states for now; only needed for contrastive search.
    hidden_states = None
    return logits, hidden_states, cache

Creating the wrapper function is the most complex part of using these functions. Now that it’s done, let’s test the different utilities, starting with greedy search.

Greedy search

We greedily pick the most probable token at each timestep. In other words, we get the argmax of the model output.

sampler = keras_nlp.samplers.GreedySampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after the [BOS] token.
)
txt = tokenizer.detokenize(output_tokens)
print(f"Greedy search generated text: \n{txt}\n")

Output:

Greedy search generated text: 
[b'[BOS] " i \' m going to tell you , " said the boy , " i \' ll tell you , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good']

As you can see, greedy search starts out making some sense, but quickly starts repeating itself. This is a common problem with text generation that can be fixed by some of the probabilistic text generation utilities shown later on!

Beam search

At a high-level, beam search keeps track of the num_beams most probable sequences at each timestep, and predicts the best next token from all sequences. It is an improvement over greedy search since it stores more possibilities. However, it is less efficient than greedy search since it has to compute and store multiple potential sequences.

Note: beam search with num_beams=1 is identical to greedy search.

sampler = keras_nlp.samplers.BeamSampler(num_beams=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Beam search generated text: \n{txt}\n")

Output:

Beam search generated text: 
[b'[BOS] " i don \' t know anything about it , " she said . " i don \' t know anything about it . i don \' t know anything about it , but i don \' t know anything about it . i don \' t know anything about it , but i don \' t know anything about it . i don \' t know anything about it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \'']

Like greedy search, beam search repeats itself since it is still deterministic.

Random search

Our first probabilistic method is random search. At each time step, it samples the next token using the softmax probabilities provided by the model.

sampler = keras_nlp.samplers.RandomSampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Random search generated text: \n{txt}\n")

Output:

Random search generated text: 
[b'[BOS] eleanor . like ice , not children would have suspicious forehead . they will see him , no goods in her plums . i have made a stump one , on the occasion , - - it is sacred , and one is unholy - plaything - - the partial consequences , and one refuge in a style of a boy , who was his grandmother . it was a young gentleman who bore off upon the middle of the day , rush and as he maltreated the female society , were growing at once . in and out of the craid little plays , stopping']

Conclusion

In this example, we use KerasNLP layers to train a sub-word vocabulary, tokenize training data, create a miniature GPT model, and perform inference with the text generation library.

If you would like to understand how Transformers work or learn more about training the full GPT model, the original example and further readings are available on the official Keras website.
