Finetuning a Language Model


In this post we look at how to finetune a language model, expanding on one of the examples from the Hugging Face Transformers project.

What is a Language Model?

A language model is a statistical model of characters, words and other tokens that captures how these basic units carry information useful for human communication. In particular, it encodes the probability distribution of these units, especially the probability of words in relation to each other. In the end, a language model is able to predict the next token given a sequence of tokens.
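As a concrete illustration, here is a minimal sketch that asks a small pretrained causal model (GPT-2, not the BERT model we finetune below) for its next-token probability distribution. It assumes the transformers library (installed below) and PyTorch are available.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a small pretrained causal (left-to-right) language model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Probability distribution over the vocabulary for the next token
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {float(p):.4f}")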

HuggingFace Transformers

The easiest way to get started with language models, and to try newer and better models as they become available, is to use the Hugging Face Transformers project. You can find more information on the project’s documentation page.

Install Transformers

The transformers library is easy to install using pip.

!pip install transformers
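A quick way to confirm the install worked is to import the library and print its version.

import transformers
print(transformers.__version__)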

Download WikiText

The WikiText dataset is created from verified Good and Featured articles on Wikipedia. It is a commonly used source for training language models because of its size and quality.

Here we download and extract the WikiText raw text - the version that is not tokenized.

%%shell
if [ ! -d "wikitext-2-raw" ]; then
    wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
    unzip wikitext-2-raw-v1.zip
fi

Now we can list the contents of the directory.

!ls wikitext-2-raw
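We can also peek at the first few lines of the training file to confirm the raw, untokenized text looks as expected.

!head -n 3 wikitext-2-raw/wiki.train.raw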


Make output directory

We make an output directory for the results of the finetuning.

!mkdir -p output

Running the Finetuning

Now we can run the finetuning script run_lm_finetuning.py from the Transformers examples. We choose the cased BERT base model and, because BERT is a masked language model, pass the --mlm flag. The results will be saved to the output directory.

%%shell
python run_lm_finetuning.py \
       --output_dir=output \
       --overwrite_output_dir \
       --mlm \
       --model_type=bert \
       --model_name_or_path=bert-base-cased \
       --do_train --train_data_file=wikitext-2-raw/wiki.train.raw  \
       --do_eval --eval_data_file=wikitext-2-raw/wiki.test.raw 

Viewing the output

If we look inside the output directory we see a number of checkpoint files. We also see the finetuned model file, pytorch_model.bin, which is 416MB in size. It is this model file that we can use for predicting tokens or in a pipeline for other NLP tasks like classification.
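As a quick sanity check, the finetuned model can be loaded straight from the output directory into a fill-mask pipeline. This is just a sketch - the example sentence is made up, and it assumes the script saved the tokenizer alongside the model in the output directory (which it does by default).

from transformers import pipeline

# Load the finetuned BERT model and tokenizer from the output directory
fill_mask = pipeline("fill-mask", model="output", tokenizer="output")
print(fill_mask("The Winter Olympic Games are held every four [MASK]."))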

Output Directory

Google Colab

This code was written as a Google Colab notebook that is made available as a GitHub Gist. You can select an action on that gist to run it in Google Colab.

Simple Gist

A simpler gist in the form of a script is also available. Here is a preview.
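Roughly, the script collects the notebook steps into a single shell session. Here is a sketch of what it looks like, assuming the same file names and finetuning script as above (the actual gist may differ in details).

#!/usr/bin/env bash
set -e

pip install transformers

if [ ! -d "wikitext-2-raw" ]; then
    wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
    unzip wikitext-2-raw-v1.zip
fi

mkdir -p output

python run_lm_finetuning.py \
       --output_dir=output \
       --overwrite_output_dir \
       --mlm \
       --model_type=bert \
       --model_name_or_path=bert-base-cased \
       --do_train --train_data_file=wikitext-2-raw/wiki.train.raw \
       --do_eval --eval_data_file=wikitext-2-raw/wiki.test.raw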