Tuesday, March 18, 2025

DoLa and MT-Bench – A Quick Eval of a new LLM trick

Decoding by Contrasting Layers (DoLa) is a technique that proposes a different way of computing next-token probabilities in a transformer. It is described in this paper. What is interesting is that, without any changes to the model itself, a code change to the decoding step can deliver a noticeable boost in factual accuracy and fewer hallucinations.
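
To make the idea concrete, here is a minimal sketch of the core contrast step in plain PyTorch. It assumes you already have vocab-sized logits from the final layer and from one early ("premature") layer; the paper additionally selects the premature layer dynamically via Jensen-Shannon divergence, which is omitted here. The function name and the alpha default are mine, following the paper's adaptive plausibility constraint.

import torch
import torch.nn.functional as F

def dola_contrast(final_logits: torch.Tensor,
                  premature_logits: torch.Tensor,
                  alpha: float = 0.1) -> torch.Tensor:
    """Contrast the final ("mature") layer's next-token distribution
    against an early ("premature") layer's. Both inputs are vocab-sized
    logits obtained by applying the LM head to each layer's hidden state."""
    log_final = F.log_softmax(final_logits, dim=-1)
    log_premature = F.log_softmax(premature_logits, dim=-1)

    # Adaptive plausibility constraint: keep only tokens whose final-layer
    # probability is within a factor alpha of the most likely token.
    p_final = log_final.exp()
    keep = p_final >= alpha * p_final.max(dim=-1, keepdim=True).values

    # Score = difference of log-probs; implausible tokens are masked out.
    contrast = log_final - log_premature
    return contrast.masked_fill(~keep, float("-inf"))

Taking a softmax (or argmax) over the returned scores gives the DoLa next-token distribution.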

A few days ago, a PR implementing this trick was merged into the Hugging Face Transformers library.
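
For context, using the feature outside of MT-Bench looks roughly like this. A minimal sketch: the checkpoint name is just an example 1.6B model, and I'm assuming the dola_layers argument landed in generate() as the PR describes (the HF docs also suggest a repetition penalty alongside DoLa).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-2-1_6b"  # example checkpoint; any causal LM should do
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

inputs = tokenizer("The capital of Australia is", return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=32,
    dola_layers="low",       # contrast against the lower half of layers; "high" also works
    repetition_penalty=1.2,  # suggested in the docs to counter DoLa's tendency to repeat
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))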

It so happened that I had MT-Bench set up while tinkering with a 1.6B model and running evals. The LLM Judge relies on HF Transformers, so it was easy to do a quick trial of DoLa and see whether it improves the chatbot's overall performance (reasoning, coding, writing, etc.)

  1. I installed Transformers from source (the new feature is not available on PyPI yet): pip install git+https://github.com/huggingface/transformers

  2. Made a change to [gen_model_answer.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/gen_model_answer.py), adding the dola_layers param:

output_ids = model.generate(
    torch.as_tensor(input_ids).cuda(),
    do_sample=do_sample,
    temperature=temperature,
    max_new_tokens=max_new_token,
    dola_layers='low'  # the new DoLa knob: 'low' or 'high' picks which half of the layers to contrast against
)

  3. Ran MT-Bench three times: with the new parameter commented out (the baseline), set to 'low', and set to 'high' (the flow is sketched below)
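
For reference, each run followed the usual FastChat llm_judge flow from its README; roughly, with placeholder paths and model IDs:

python gen_model_answer.py --model-path <path-to-checkpoint> --model-id stablelm-2-brief-1_6b_r57_dola_low
python gen_judgment.py --model-list stablelm-2-brief-1_6b_r57_dola_low
python show_result.py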

Here are the results:

Mode: single
Input file: data/mt_bench/model_judgment/gpt-4_single.jsonl

########## First turn ##########
                                             score
model                                 turn        
stablelm-2-brief-1_6b_v8_r57          1     4.8500
stablelm-2-brief-1_6b_r57_dola_low    1     4.6125
stablelm-2-brief-1_6b_r57_dola_high   1     3.9500

########## Second turn ##########
                                            score
model                                 turn       
stablelm-2-brief-1_6b_r57_dola_low    2     3.700
stablelm-2-brief-1_6b_v8_r57          2     3.700
stablelm-2-brief-1_6b_r57_dola_high   2     2.825

########## Average ##########
                                         score
model                                         
stablelm-2-brief-1_6b_v8_r57           4.27500
stablelm-2-brief-1_6b_r57_dola_low     4.15625
stablelm-2-brief-1_6b_r57_dola_high    3.38750

As you can see, the trick didn't quite pay off: both DoLa settings scored at or below the baseline on average, with 'high' notably worse. No quick win here.
