
Language Models Starter Kit

This repository contains everything you need to get started building your own GPT models from the ground up, or to fine-tune existing models on your own data.

This project contains two main directories, models and tune:

├── src
│   ├── models
│   │   ├── bigram
│   │   ├── encoder.py
│   │   └── gpt
│   └── tune
│       ├── chatbot.py
│       ├── fine_tune.py
│       ├── hyperparameters.py
│       └── inference.py

Model Concepts

You can play around with the concepts of language models in the models folder. Two language-model strategies are documented in this repo:

  • Bigram Language Model: A bigram model is a type of n-gram model that predicts the probability of a word based on the previous word. It considers pairs of consecutive words (bigrams) and calculates the conditional probability of the current word given the previous word. A minimal sketch follows this list.

  • GPT Language Model: GPT is a language model built on the Transformer, a deep neural network architecture. It is a much more sophisticated model that considers the entire context of a sentence, not just the previous word. GPT relies on a self-attention mechanism that lets it weigh the importance of each word in the context and capture long-range dependencies in text.
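
To make the bigram idea concrete, here is a minimal sketch of a bigram language model, assuming PyTorch. It is illustrative only; the actual implementation in src/models/bigram may differ.

# Minimal bigram language model sketch (illustrative; not the repo's exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Each token directly looks up the logits for the next token.
        self.token_embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding(idx)  # (batch, time, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # Repeatedly sample the next token from the distribution at the last position.
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx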

Fine Tuning

You can fine-tune pre-existing models using your own training text. This is fairly straightforward: all the primary logic lives in src/tune/fine_tune.py. You can place your existing model anywhere, as long as you point the PRETRAINED_MODEL constant in fine_tune.py at that model's directory.

Here's an example of a pretrained model you can use:

https://huggingface.co/gpt2

Simply clone the gpt2 repository somewhere on your computer and set the path in fine_tune.py.

git clone https://huggingface.co/gpt2
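
For orientation, here is a minimal sketch of what causal-LM fine-tuning typically looks like with the Hugging Face transformers and datasets libraries. The repo's fine_tune.py may differ; the PRETRAINED_MODEL path, data path, and training arguments below are assumptions for illustration.

# Minimal fine-tuning sketch (illustrative; src/tune/fine_tune.py may differ).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

PRETRAINED_MODEL = "/path/to/gpt2"  # placeholder: local clone of https://huggingface.co/gpt2

tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(PRETRAINED_MODEL)

# Load raw training text and tokenize it.
dataset = load_dataset("text", data_files={"train": "data/input.txt"})
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.filter(lambda x: len(x["input_ids"]) > 0)  # drop empty lines

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective
args = TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=2)

trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator)
trainer.train()
trainer.save_model("out")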

Inference

Two methods of testing inference are built into the src/tune folder: you can run src/tune/inference.py, or try a simple terminal chat interface in src/tune/chatbot.py.
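
As a rough idea of what inference with a saved model looks like using the transformers library (the repo's inference.py may differ; MODEL_DIR and the prompt are placeholders):

# Minimal inference sketch (illustrative; src/tune/inference.py may differ).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "path/to/your/fine-tuned-model"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))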

Usage

Optional: Create an env

python -m venv env

source env/bin/activate

Training data

Add your input data to the data/ folder. Name the file input.txt, or update the INPUT_FILE variable to point to your file if it has a custom name.
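
To illustrate how the training scripts consume this file, here is a sketch of simple character-level encoding of the kind a file like src/models/encoder.py might perform. The actual encoder may differ; INPUT_FILE here assumes the default name.

# Character-level encoding sketch (illustrative; src/models/encoder.py may differ).
INPUT_FILE = "data/input.txt"  # assumed default name

with open(INPUT_FILE, "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))  # vocabulary of unique characters
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]              # string -> list of ints
decode = lambda ids: "".join(itos[i] for i in ids)   # list of ints -> string

print(f"vocab size: {len(chars)}")
print(decode(encode(text[:50])))  # round-trip check on the first 50 characters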

Download packages

pip install -r requirements.txt

Building your own language model from the ground up

This project contains two language models, a bigram model and a GPT model, which you can train from scratch:

Run Bigram Model

python -m src.models.bigram.train

Run GPT Model

python -m src.models.gpt.train

Fine Tuning Existing Models on Your Own Data

Ensure you have specified the model path inside fine_tune.py. You can place your model anywhere as long as you set the path to its directory in that file.

Run Fine-Tuning on a Pre-Existing Model

python -m src.tune.fine_tune

Test Inference

python -m src.tune.inference

Run ChatBot

python -m src.tune.chatbot
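
The chatbot is a simple read-generate-print loop in the terminal. A rough sketch of that pattern, assuming the transformers library (the repo's chatbot.py may differ; MODEL_DIR is a placeholder):

# Minimal terminal chat loop sketch (illustrative; src/tune/chatbot.py may differ).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "path/to/your/fine-tuned-model"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)

while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"quit", "exit"}:
        break
    inputs = tokenizer(user_input, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_p=0.9)
    # Print only the newly generated tokens, not the echoed prompt.
    reply = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print("Bot:", reply)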
