In this post we look at how to finetune a language model - expanding on one of the examples from the Hugging Face Transformers project.
What is a Language Model?
A language model is a statistical model of characters, words and other tokens, capturing how these basic units carry information useful for human communication. This includes the probability distribution of the units themselves and, especially, the probability distribution of words in relation to each other. As a result, a language model can predict the next token given a sequence of tokens.
The easiest way to get started with language models, and to try newer and better models as they become available, is to use the Hugging Face transformers project. You can find more information on the project’s documentation page.
The transformers library is easy to install using pip:
!pip install transformers
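As a quick sanity check, and to illustrate the token-prediction idea described above, we can ask a pretrained model to fill in a masked token. This is a minimal sketch, not part of the finetuning workflow; the fill-mask pipeline and the example sentence are chosen here for illustration:

from transformers import pipeline

# Predict likely replacements for the [MASK] token using the pretrained model
fill_mask = pipeline("fill-mask", model="bert-base-cased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["sequence"], prediction["score"])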
The Wikitext dataset is created from verified Good and Featured articles on Wikipedia. It is a commonly used source for training language models because of its size and quality.
Here we download and extract the Wikitext raw text - the version that is not tokenized.
%%shell
if [ ! -d "wikitext-2-raw" ]; then
  wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
  unzip wikitext-2-raw-v1.zip
fi
Now we can list the contents of the directory.
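A simple shell cell does the job (assuming the archive extracted to wikitext-2-raw as above):

!ls -lh wikitext-2-raw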
Make output directory
We make an output directory for the results of the finetuning.
!mkdir -p output
Running the Finetuning
Now we can run the finetuning script. We choose the cased BERT base model. The results will be saved to the output directory.
%%shell
python run_lm_finetuning.py \
  --output_dir=output \
  --overwrite_output_dir \
  --mlm \
  --model_type=bert \
  --model_name_or_path=bert-base-cased \
  --do_train \
  --train_data_file=wikitext-2-raw/wiki.train.raw \
  --do_eval \
  --eval_data_file=wikitext-2-raw/wiki.test.raw
Viewing the output
If we look inside the output directory we see a number of checkpoint files. We also see the finetuned model file, pytorch_model.bin, which is 416MB in size. It is this model file that we can use for predicting tokens or in a pipeline for other NLP tasks like classification.
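As an illustration - a minimal sketch, assuming the finetuned model and tokenizer files were all saved to the output directory - the finetuned model can be loaded back into a fill-mask pipeline just like the pretrained one:

from transformers import pipeline

# Load the finetuned model and tokenizer from the output directory
fill_mask = pipeline("fill-mask", model="output", tokenizer="output")
for prediction in fill_mask("Wikipedia is a free online [MASK]."):
    print(prediction["sequence"], prediction["score"])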
This code was developed as a Google Colab notebook that is made available as a GitHub Gist. You can select an action on that gist to run it in Google Colab.
A simpler gist in the form of a script is also available. Here is a preview.