
Explanation of Japanese LLM "ELYZA-japanese-Llama-2-7b" released by ELYZA (Pretraining)

Hi, this is Wataru Fukuda, CEO of Techno-Trade Japan Co., Ltd. and Strada AIDX Inc. I hold a Master's degree from the Graduate School of Informatics and Engineering at the University of Electro-Communications. My goal is to leverage technology and innovation to solve challenges faced by businesses and society in Japan.

This article is about a company named ELYZA, which is known for its LLM research and is based out of the University of Tokyo. This company recently unveiled a commercial use (※check details) Japanese LLM based on Meta's "Llama 2". The article goes into detail about the motivations behind this development, the models it includes, and specifics regarding the experiments and methodologies used in training the models.

Motivation

For this release, I've compiled articles in English to make them accessible to users outside of Japan as well.

https://zenn.dev/elyza/articles/2fd451c944649d

Here is the English translation:
※Please make sure to check the original website↑

Again, ELYZA, a corporation originating from the University of Tokyo and renowned for its LLM research in Japan, has unveiled a commercially usable Japanese LLM named "ELYZA-japanese-Llama-2-7b", based on Meta's "Llama 2".


Introduction

ELYZA Corporation recently issued the following release.

The above release includes the following models based on Meta's Llama 2:

This article focuses on the model with additional Japanese pre-training and describes the motivation behind it, details of the training procedure, and experiments that have not been successful so far.

Motivation for additional pre-training in Japanese based on Llama 2

※"we" = ELYZA Corporation

The biggest reason why we did not train a Japanese model from scratch, but instead performed additional pre-training in Japanese on an English-based model, is the overwhelming volume of English text data compared to Japanese.
In recent years, even open models based on English have reached a very high level of performance, and we believe this is due to the existence of very large English text corpora.
As mentioned in LIMA: Less Is More for Alignment (Zhou+, arXiv 2023), if the source of LLMs' knowledge and ability to follow instructions lies in the pre-trained model, the existence of such very large text data may be essential.
Japanese text data, in comparison, is extremely scarce, so even if a Japanese model were trained from scratch using only Japanese text data, it might not be able to match the impressive performance of models based mainly on English.
Therefore, with the idea of "standing on the shoulders of giants (English)," we opted for additional pre-training on top of an English-dominant model.

Second, we based this effort on Llama 2 for the following reasons:

  • Its pre-training was done on a very large scale, with 2 trillion tokens, mainly in English.

    • Since Llama 2 was pre-trained on such a large amount of data, we expected it to have superior linguistic ability and common-sense knowledge compared to other publicly available models.

    • This also had the potential to reduce costs compared to pre-training a Japanese model from scratch.

  • It incorporates elemental technologies developed in recent years.

    • SFT (Supervised Fine-Tuning) has been implemented, as well as RLHF (Reinforcement Learning from Human Feedback).

    • Furthermore, we focused on its use of recently proposed elemental technologies such as SwiGLU, RMSNorm, and RoPE.

    • In addition, GAtt (Ghost Attention), newly proposed in Llama 2, is expected to make the model robust against forgetting instructions during repeated interaction.

    • Based on the above, we viewed it as the best base model among the current options.

  • Many derivative models were created from its predecessor, LLaMA.

    • Llama 2's predecessor, LLaMA, attracted a great deal of attention worldwide, and many derivative models, including Alpaca, were researched and developed despite its non-commercial license.

    • Since Llama 2 is now licensed for commercial use, albeit with conditions as stated in the LLAMA 2 COMMUNITY LICENSE, we expect the movement to build derivative models on top of it to be even more active than with LLaMA. With hopes for such an upsurge in the research and development community, we also focused our attention on Llama 2.

    • In fact, many derivatives of Llama 2 have appeared on Hugging Face.


Details of the Additional Pre-training

Overview

In this effort, we adopted meta-llama/Llama-2-7b-chat-hf, the Llama 2 variant on which SFT and RLHF had already been performed, as the base model.
The reason is that, as described in the motivation above, we wanted to inherit as much as possible the instruction-following ability acquired mainly in English and the safety of the output.

The original Llama 2 pre-training data contained only about 0.1% (2 billion tokens) of Japanese.
We considered this small amount of Japanese to be the root cause of Llama 2's difficulty in handling Japanese, and set a goal of additionally pre-training on 18 billion tokens of Japanese data, for a total of 20 billion Japanese tokens.
We used general Japanese text data such as OSCAR, Wikipedia, and other crawled data for pre-training.
Pre-processing such as removing duplicate strings and filtering by NG words was performed separately.

Experimental environment and detailed settings during training

Almost all experiments were conducted using ABCI.
Because the nodes with A100 GPUs (rt_AF) were not available during the experiment period (August 2023), we mostly used instances with four V100s (rt_F).
We used DeepSpeed's ZeRO Stage 3 as the training framework, and training one epoch of the aforementioned Japanese text data (18 billion tokens) took about 1 to 1.5 days on rt_F × 64 nodes.
One of the reasons we were able to proceed with this project so quickly is that, by using Llama 2 (7B) as the base, we could run the experiment cycle in a relatively short time even on V100s, which are relatively easy to obtain, rather than A100s, which are hard to secure these days.
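
For reference, a minimal DeepSpeed ZeRO Stage 3 configuration of the kind used with the Hugging Face Trainer might look like the sketch below. This is a hedged illustration rather than ELYZA's actual config; the "auto" values are filled in by the Trainer.

import json

# Minimal (hypothetical) ZeRO Stage 3 config for use with the Hugging Face Trainer;
# "auto" values are resolved by the Trainer from its own arguments.
ds_config = {
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("ds_zero3_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)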

The following figures show the training loss of ELYZA-japanese-Llama-2-7b and ELYZA-japanese-Llama-2-7b-fast.
Although the loss of the fast model is higher due to the added vocabulary, it is apparent that the loss of each model decreases steadily.

[For images and code sections, please check out this site.]

The hyperparameters for training basically follow the pre-training chapter of the Llama 2 paper (2.2 Training Details) and are set as follows:

optim: "adamw_torch"
adam_beta1: 0.9
adam_beta2: 0.95
adam_epsilon: 1.0e-5
weight_decay: 0.1
# lr scheduler not used
learning_rate: 3.0e-5
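
For reference, here is a minimal sketch of how these values could be passed to the Hugging Face Trainer; only the hyperparameters above come from the article, while the output directory and the DeepSpeed config path (the hypothetical ds_zero3_config.json sketched earlier) are placeholders.

from transformers import TrainingArguments

# Hyperparameters taken from the settings above; lr_scheduler_type="constant"
# reflects "lr scheduler not used". output_dir and deepspeed are hypothetical
# placeholders.
training_args = TrainingArguments(
    output_dir="./checkpoints",
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1.0e-5,
    weight_decay=0.1,
    learning_rate=3.0e-5,
    lr_scheduler_type="constant",
    deepspeed="ds_zero3_config.json",
)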

Adding Japanese vocabulary (ELYZA-japanese-Llama-2-7b-fast and -fast-instruct only)

The meta-llama/Llama-2-7b-chat-hf tokenizer on which this project is based has a vocabulary of 32,000 tokens, but that vocabulary is not optimized for Japanese, so the cost of handling Japanese is higher than for English.
For example, "こんにちは" (konnichiwa, hello) is tokenized into one token per character, "東京" (Tokyo) into one token per character, and "肉" (niku, meat), a kanji that was apparently infrequent in the original Llama 2 pre-training data, is represented by 3 byte-level tokens. (On the other hand, English words such as "Hello" and "World" are each represented by a single token.)

To address these issues, we have added a Japanese vocabulary to Llama 2's Tokenizer. The following is an overview of the project.

[For images and code sections, please check out this site.]
(Sorry, again😃)

The specific steps are as follows:

  1. Independently of the original Llama 2 Tokenizer, a separate BPE Tokenizer is trained using only Japanese text (a minimal training sketch follows this list). The vocabulary size here is set to 15,000. This allows us to combine the Llama 2 Tokenizer, which tends to break Japanese text into small character- and byte-level units, with a BPE Tokenizer that can appropriately break frequently occurring Japanese text into words and other larger units.

  2. Since each of the above two Tokenizers is a BPE Tokenizer, their vocabularies can be combined using the implementation in the tokenizers library to obtain a Tokenizer with a vocabulary of 45,043 that takes advantage of the merits of both.

  3. Combining the Tokenizers is complete at this point, but the model's embed_tokens and lm_head still correspond to the original vocabulary (32,000 tokens), so the added vocabulary must be handled. We therefore decided to use the average of the vectors in embed_tokens and lm_head corresponding to the original tokens (e.g., the vectors for 東 and 京) as the initial value of the vector for an added token (e.g., the vector for 東京, "Tokyo"); a sketch of this initialization is shown after the merging code below. Preliminary experiments with random initialization did not show much change, so we adopted this policy in the hope that the performance of the original Llama 2 would be inherited. From another point of view, we also confirmed that the loss drops more sharply than with random initialization.

  4. After the above steps, the additional pre-training described earlier is performed. As noted in step 3, we expect the added tokens to be learned starting from "a state with reasonably good initial values."
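
The article does not include the training code for the Japanese-only BPE Tokenizer in step 1. The following is a minimal sketch using the Hugging Face tokenizers library, assuming a hypothetical plain-text corpus path; normalization and pre-tokenization settings are omitted and may differ from what was actually used.

from tokenizers import Tokenizer, models, trainers

# Hypothetical: a plain-text file of Japanese sentences, one per line.
corpus_files = ["/path/to/japanese_corpus.txt"]

# Train a standalone BPE Tokenizer with a 15,000-token vocabulary, as in step 1.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
trainer = trainers.BpeTrainer(vocab_size=15000, special_tokens=["<unk>"])
tokenizer.train(files=corpus_files, trainer=trainer)

# Save in tokenizer.json format, which the merging code below reads.
tokenizer.save("/path/to/append/tokenizer.json")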

The specific source code for combining the two Tokenizers is shown below.

from copy import deepcopy
import json

# Load the original tokenizer.json
with open("/path/to/original/tokenizer.json") as f:
    original = json.load(f)

# Load the tokenizer.json to be added
with open("/path/to/append/tokenizer.json") as f:
    append = json.load(f)


def merge_tokenizer(data1: dict, data2: dict):
    vocab1 = data1["model"]["vocab"]
    vocab2 = data2["model"]["vocab"]

    merges1 = data1["model"]["merges"]
    merges2 = data2["model"]["merges"]

    # Define variables for the output
    vocab_out = deepcopy(vocab1)
    data_out = deepcopy(data1)

    # Get the maximum index before merging
    idx = max(vocab_out.values())

    # Add tokens from vocab2 that are not in vocab1 to vocab_out, incrementing idx
    for token in vocab2.keys():
        if token not in vocab1:
            idx += 1
            vocab_out[token] = idx

    # For every token in vocab_out, if splitting it at some position yields left and
    # right pieces that are both in vocab_out, add the pair to merges_out
    # Reference: https://github.com/huggingface/transformers/pull/17199
    merges_out = []
    for candidate, piece_id in vocab_out.items():
        for i in range(1, len(candidate)):
            left, right = candidate[:i], candidate[i:]

            left_id = vocab_out.get(left, None)
            right_id = vocab_out.get(right, None)

            if left_id is not None and right_id is not None:
                merges_out += [f"{left} {right}"]

    data_out["model"]["vocab"] = vocab_out
    data_out["model"]["merges"] = merges_out

    with open("/path/to/merged/tokenizer.json", "w") as f:
        json.dump(data_out, f, ensure_ascii=False, indent=2)

# Merge the original tokenizer and the tokenizer to be added using the function defined above
merge_tokenizer(data1=original, data2=append)
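
As a complement to step 3 above, the following is a minimal sketch of how the average-vector initialization of embed_tokens and lm_head could be done with transformers. The paths are hypothetical, and details such as handling the "▁" word-boundary marker are glossed over, so this is an illustration rather than the exact procedure used.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical paths: the base model and a directory containing the merged tokenizer.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
old_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
new_tokenizer = AutoTokenizer.from_pretrained("/path/to/merged/tokenizer/dir")

old_vocab_size = len(old_tokenizer)                # 32,000
model.resize_token_embeddings(len(new_tokenizer))  # 45,043 after merging

input_emb = model.get_input_embeddings().weight.data    # embed_tokens
output_emb = model.get_output_embeddings().weight.data  # lm_head

with torch.no_grad():
    for new_id in range(old_vocab_size, len(new_tokenizer)):
        token_text = new_tokenizer.convert_ids_to_tokens(new_id)
        # Decompose the added token into pieces of the original vocabulary and
        # initialize its vectors with the average of those pieces
        # (e.g., 東京 from the average of 東 and 京).
        piece_ids = old_tokenizer(token_text, add_special_tokens=False)["input_ids"]
        if piece_ids:
            input_emb[new_id] = input_emb[piece_ids].mean(dim=0)
            output_emb[new_id] = output_emb[piece_ids].mean(dim=0)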

By following the above steps, the number of tokens required to handle Japanese text was reduced, and inference speed improved by a factor of approximately 1.8.
As a side effect, training also became more efficient: text that would have required 30 billion tokens with the original Tokenizer was represented with 16 billion tokens.
As a result, although ELYZA-japanese-Llama-2-7b-fast was trained on 16 billion tokens, fewer than the 18 billion tokens ELYZA-japanese-Llama-2-7b saw during additional pre-training, it actually saw approximately 1.66 times more text.

In this project, we succeeded in adding a Japanese vocabulary by following the above procedure after much trial and error. We will continue to investigate better methods.

Continual Learning Validation

(The contents verified in this section have not been reflected in the published models for performance reasons.)

Although we used Llama 2 as the base for this work, there was concern that additional pre-training in Japanese alone would cause the model to lose the original Llama 2's instruction-following ability (catastrophic forgetting), so we also experimented with continual learning.

As a general guideline for this validation, we referred to Fine-tuned Language Models are Continual Learners (Scialom+, EMNLP 2022).
That paper reports that when a new task is added, mixing in a certain amount (1% in the paper) of each previously learned task allows the new task to be learned without forgetting the previous ones.

However, since the specific data Llama 2 used for SFT is not publicly available, we used the following dataset for continual learning.

As described above, the continual learning this time was not task-specific but used general data including pre-training-style text in both Japanese and English, so we set the ratio of additional pre-training data to continual-learning data at 97.5 : 2.5, which differs slightly from the setting in the paper above.
However, the resulting performance after instruction tuning was worse than without continual learning, so continual learning is not reflected in the published models.
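
Purely as an illustration, the following is a minimal sketch of how such a 97.5 : 2.5 mixture could be assembled with the Hugging Face datasets library. The file paths are hypothetical, and the article does not specify how the mixture was actually implemented.

from datasets import interleave_datasets, load_dataset

# Hypothetical stand-ins for the Japanese pre-training corpus and the
# continual-learning data, mixed at a 97.5 : 2.5 ratio by sampling.
japanese_pretraining = load_dataset("text", data_files="/path/to/japanese_corpus.txt", split="train")
continual_data = load_dataset("text", data_files="/path/to/continual_data.txt", split="train")

mixed = interleave_datasets(
    [japanese_pretraining, continual_data],
    probabilities=[0.975, 0.025],
    seed=42,
)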

One possible reason is that including English in the continual-learning data left a tendency to output English in situations where Japanese should have been output.
For example, in the following case, English expressions such as "potentiality" and "White Collar work field" appear in the output even though the instruction is given in Japanese.

(Instructions):
Please explain for elementary school students. Large-scale linguistic AI is a technology that can handle language with human-like or human-exceeding accuracy by computing large amounts of linguistic data. This large-scale linguistic AI is attracting attention around the world for its potential to create significant operational efficiency, mainly in the white-collar domain, by increasing accuracy through a combination of quantity and quality data sets and the technical capabilities and know-how to handle the latest technologies. However, as with other AI technologies, Japan lags behind other countries in understanding large-scale linguistic AI and, as a result, in its practical application. ELYZA has developed its own large-scale language AI for Japanese language, and has been one of the first to implement it in society, mainly with large companies.
(Output):
Large-scale language AI is a technology that processes large amounts of text and voice data at high speed to understand and generate language with greater capabilities than humans Since its announcement in 2018, its potentiality has been recognized to the extent that it is being studied in many countries. In particular, it has been evaluated as enabling significant productivity improvement in the White Collar work field, which supports clerical work, etc. In Japan, however, it is relatively late to the game. On the other hand, Japan is relatively behind in the development of big language AI, and ELYZA is working on the early stage of social implementation of its originally developed Japanese big language AI to bring the use of big language AI to a world-class level in Japan. ELYZA is now working on the early stage of social implementation of its originally developed Japanese big language AI.

Continual learning will continue to be explored in the future, including verification of the extent to which English ability is maintained.

Conclusion

In this article, we have explained the pre-training of the ELYZA-japanese-Llama-2-7b series.
At this time, we have released models with up to 7B (7 billion) parameters, but we have already started developing models with 13B (13 billion) and 70B (70 billion) parameters.
In future articles, we plan to explain the ELYZA-tasks-100 evaluation dataset that we released along with the models.
ELYZA Corporation focuses on large-scale Japanese language models, conducting joint research with companies and developing cloud services based on the philosophy of "creating the obvious in unexplored areas."
Through research and development of cutting-edge technologies and consulting, we promote the introduction and implementation of language-generating AI in a way that contributes to corporate growth.

Thank you for reading. For more details, please refer to the Japanese original version.


