Decoding by Contrasting Layers (DoLa) is a technique that takes a different approach to computing next-token probabilities in a transformer; it is described in the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" (Chuang et al.). What is interesting is that, without retraining or otherwise changing the model, a code change to the decoding step alone can give a noticeable boost in the model's factual accuracy and reduce hallucinations.
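In a nutshell, DoLa contrasts the next-token distribution from the final ("mature") layer with one obtained by early-exiting from an earlier ("premature") layer, amplifying the factual knowledge that tends to emerge only in the later layers. Below is a minimal sketch of that core idea under simplifying assumptions (a fixed premature layer, no plausibility filtering, placeholder model name and prompt); it is not the Hugging Face implementation, which picks the premature layer dynamically:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stabilityai/stablelm-2-1_6b"  # placeholder; any causal LM exposing hidden states
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

premature_layer = 8  # assumption: an arbitrary early layer to contrast against

# Logits at the last position from the final ("mature") hidden state...
mature_logits = model.lm_head(out.hidden_states[-1][:, -1])
# ...and from an early-exit ("premature") hidden state; a faithful version
# would also pass it through the model's final layer norm first.
premature_logits = model.lm_head(out.hidden_states[premature_layer][:, -1])

# DoLa score: difference of log-probabilities between the two layers.
contrast = F.log_softmax(mature_logits, dim=-1) - F.log_softmax(premature_logits, dim=-1)
next_token = contrast.argmax(dim=-1)
print(tokenizer.decode(next_token))
```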
A few days ago a PR was merged into the Hugging Face Transformers library implementing this trick.
It so happened that I already had MT-Bench set up while tinkering with a 1.6B model and running evals. The LLM Judge relies on HF Transformers, so it was easy to do a quick trial of DoLa and see whether it improves the chatbot's overall performance (reasoning, coding, writing, etc.).
- I installed Transformers from source (the new feature is not available on PyPI yet):

```bash
pip install git+https://github.com/huggingface/transformers
```
- Made a change to [gen_model_answer.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/gen_model_answer.py), adding the `dola_layers` param:

```python
output_ids = model.generate(
    torch.as_tensor(input_ids).cuda(),
    do_sample=do_sample,
    temperature=temperature,
    max_new_tokens=max_new_token,
    dola_layers='low',  # the new DoLa param: contrast the final layer with lower layers
)
```
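For reference, the same flag works with a plain `generate()` call outside of FastChat. Here is a quick sketch; the model name and prompt are placeholders, and the `repetition_penalty` value is just a commonly suggested setting for DoLa, not something from this experiment:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stabilityai/stablelm-2-1_6b"  # placeholder; any supported causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

inputs = tokenizer("Who wrote 'War and Peace'?", return_tensors="pt").to("cuda")
output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    dola_layers="low",       # or "high", or an explicit list of layer indices
    repetition_penalty=1.2,  # assumption: often recommended alongside DoLa
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```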
- Ran MT-Bench three times: with the `dola_layers` param commented out, set to `low`, and set to `high`.

Here are the results:
```
Mode: single
Input file: data/mt_bench/model_judgment/gpt-4_single.jsonl

########## First turn ##########
                                           score
model                                turn
stablelm-2-brief-1_6b_v8_r57         1     4.8500
stablelm-2-brief-1_6b_r57_dola_low   1     4.6125
stablelm-2-brief-1_6b_r57_dola_high  1     3.9500

########## Second turn ##########
                                           score
model                                turn
stablelm-2-brief-1_6b_r57_dola_low   2     3.700
stablelm-2-brief-1_6b_v8_r57         2     3.700
stablelm-2-brief-1_6b_r57_dola_high  2     2.825

########## Average ##########
                                      score
model
stablelm-2-brief-1_6b_v8_r57          4.27500
stablelm-2-brief-1_6b_r57_dola_low    4.15625
stablelm-2-brief-1_6b_r57_dola_high   3.38750
```
As you can see, the trick didn't quite pay off; there was no quick win here.