This is everything you need to either get started building your own GPT models from the ground up, or fine-tune existing models using your own data.
This project contains two main directories, models and tune:
├── src
│   ├── models
│   │   ├── bigram
│   │   ├── encoder.py
│   │   └── gpt
│   └── tune
│       ├── chatbot.py
│       ├── fine_tune.py
│       ├── hyperparameters.py
│       └── inference.py
You can experiment with language-model concepts in the models folder, which documents two modeling strategies:
- Bigram Language Model: A bigram model is a type of n-gram model that predicts the next word from the previous word alone. It looks at pairs of consecutive words (bigrams) and estimates the conditional probability of the current word given the previous one.
- GPT Language Model: GPT is a language model built on the Transformer, a deep neural network architecture. It is far more sophisticated than a bigram model because it considers the whole preceding context, not just the previous word. Its self-attention mechanism weighs the importance of each word in the context, which lets it capture long-range dependencies in text.
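To make the contrast concrete, here is a minimal sketch of both ideas in plain Python. It is not the code in src/models: the function names are hypothetical, the bigram model is just a table of counts, and the attention function uses no learned weights, so it only illustrates how each position mixes in the context before it.

```python
import math
from collections import Counter, defaultdict


def bigram_probabilities(text):
    """Estimate P(current word | previous word) from raw pair counts."""
    words = text.split()
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        pair_counts[prev][cur] += 1
    return {
        prev: {cur: n / sum(counts.values()) for cur, n in counts.items()}
        for prev, counts in pair_counts.items()
    }


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(vectors):
    """Single-head scaled dot-product attention without learned weights.

    Each position attends to itself and every earlier position, so the
    output at position i is a weighted mix of vectors[0..i].
    """
    dim = len(vectors[0])
    outputs = []
    for i, query in enumerate(vectors):
        context = vectors[: i + 1]  # causal mask: no looking ahead
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return outputs


probs = bigram_probabilities("the cat sat on the mat")
print(probs["the"])  # {'cat': 0.5, 'mat': 0.5}

mixed = causal_self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(mixed[0])  # the first position can only attend to itself: [1.0, 0.0]
```

The bigram table can only ever look one word back, while the attention output at the last position blends all three input vectors. That difference is what the two model folders explore.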
You can fine-tune pre-existing models using your own training text. The primary logic for this lives in the src/tune/fine_tune.py file. You can place your existing model anywhere, as long as you set the PRETRAINED_MODEL constant inside fine_tune.py to the path of that model's directory.
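Because a wrong path is the most likely setup mistake, it can help to check the directory before starting a run. The sketch below is an illustration, not code from fine_tune.py: the ./gpt2 value and the check_model_dir helper are hypothetical, and the config.json check assumes a Hugging Face-style model folder.

```python
from pathlib import Path

# Hypothetical value; point this at wherever your model directory lives.
PRETRAINED_MODEL = "./gpt2"


def check_model_dir(path):
    """Return the resolved model directory, or raise if it looks wrong."""
    model_dir = Path(path).expanduser().resolve()
    if not model_dir.is_dir():
        raise FileNotFoundError(f"Model directory not found: {model_dir}")
    # Assumption: Hugging Face-style model folders ship a config.json.
    if not (model_dir / "config.json").is_file():
        raise FileNotFoundError(f"No config.json in {model_dir}")
    return model_dir


try:
    print(f"Using model at {check_model_dir(PRETRAINED_MODEL)}")
except FileNotFoundError as err:
    print(f"Fix PRETRAINED_MODEL before fine-tuning: {err}")
```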
One pretrained model you can use is GPT-2. Clone the gpt2 repository somewhere on your computer, then set its path in the fine_tune.py file:
git clone https://huggingface.co/gpt2
The src/tune folder includes two ways to test inference with the model. You can run the src/tune/inference.py file, or try the model in a simple terminal chat interface with the src/tune/chatbot.py file.
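For a sense of what a terminal chat interface involves, here is a minimal sketch of a read-and-reply loop. It does not reproduce chatbot.py: chat_loop and echo_model are hypothetical names, and echo_model stands in for a real call to the fine-tuned model.

```python
def chat_loop(generate, read=input, write=print):
    """Read prompts until the user types 'exit' or 'quit', or input ends."""
    while True:
        try:
            prompt = read("You: ").strip()
        except EOFError:
            break
        if prompt.lower() in {"exit", "quit"}:
            break
        if prompt:  # ignore empty lines
            write(f"Bot: {generate(prompt)}")


def echo_model(prompt):
    """Placeholder for a real model call (hypothetical)."""
    return prompt.upper()


# Scripted demo so the example runs without a keyboard.
scripted = iter(["hello", "", "exit"])
replies = []
chat_loop(echo_model, read=lambda _: next(scripted), write=replies.append)
print(replies)  # ['Bot: HELLO']
```

Passing the reader and writer as arguments keeps the loop testable. Calling chat_loop(echo_model) with the defaults gives an interactive session in the terminal.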
Optional: Create a virtual environment
python -m venv env
source env/bin/activate
Add your input data to the data/ folder. Either name the file input.txt, or update the INPUT_FILE variable to match your file's name.
pip install -r requirements.txt
This project contains two language models, a bigram model and a GPT model. Train either one with:
python -m src.models.bigram.train
python -m src.models.gpt.train
Before fine-tuning, make sure you have specified the model path inside the fine_tune.py file. Your model can live anywhere, as long as fine_tune.py points to its directory.
python -m src.tune.fine_tune
python -m src.tune.inference
python -m src.tune.chatbot