Trying out LLaVA-1.6: a multimodal LLM that also handles Japanese

February 14, 2024, 12:30

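It has been a while since my last LLM article. While moving to a new OS, I set out to run LLaVA, which is said to be available for commercial use now, and found that it had been updated to version 1.6 and had changed considerably since the last time I ran it.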

Environment

Install it by following the repository's instructions.

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

# Optional: additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

# Upgrade an existing checkout to the latest code
git pull
pip install -e .

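An SGLang worker has been added, and skipping its installation step results in an error. I only tried this briefly, so I cannot be certain, but at first neither the GUI nor the CLI would run.
The relevant part of the README is quoted below.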

Launch a SGLang worker

This is the recommended way to serve LLaVA model with high throughput, and you need to install SGLang first. Note that currently 4-bit quantization is not supported yet on SGLang-LLaVA, and if you have limited GPU VRAM, please check out model worker with quantization.

pip install "sglang[all]"

You'll first launch a SGLang backend worker which will execute the models on GPUs. Remember the --port you've set and you'll use that later.

# Single GPU
CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000

# Multiple GPUs with tensor parallel
CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-13b --tokenizer-path llava-hf/llava-1.5-13b-hf --port 30000

Presumably the pip install "sglang[all]" step is the part that matters.
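
Since a missing SGLang install appears to have been what broke startup for me, it can help to confirm that both packages are importable before launching anything. The following is a minimal sketch using only the Python standard library. The helper name missing_modules is my own invention for illustration, and the package names llava and sglang are taken from the python -m commands in this article.

```python
import importlib.util


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Top-level package names as used by `python -m llava...` and
    # `python3 -m sglang.launch_server` in this article.
    missing = missing_modules(["llava", "sglang"])
    if missing:
        print("Not installed in this environment:", ", ".join(missing))
    else:
        print("llava and sglang are both importable.")
```

Run it inside the llava conda environment. It only checks that the packages can be found, not that they work with your GPU setup.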

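Running the GUI

Open four terminals, and in each one change to the LLaVA directory and activate the llava conda environment. Then start the processes in the following order.

Controller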

python -m llava.serve.controller --host 0.0.0.0 --port 10000

Gradio web server

python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload

SGLang server

CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000

Model worker

python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b
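
With four processes spread over four terminals, it is easy to lose track of which ones are actually up, especially while the model is still loading. The sketch below polls the TCP ports used in the commands above (10000, 30000, 40000) plus 7860 for the Gradio page. The function names and the SERVICES table are hypothetical helpers of mine, not part of LLaVA, and an open port only means something is listening there, not that the model has finished loading.

```python
import socket
import time

# Ports taken from the commands in this article; 7860 is the Gradio page.
SERVICES = {
    "controller": 10000,
    "SGLang server": 30000,
    "model worker": 40000,
    "Gradio web server": 7860,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(host, services, total_timeout=60.0, interval=1.0):
    """Poll until every service port is open.

    Returns the names of the services that are still closed when the
    timeout expires (an empty list means everything is listening).
    """
    deadline = time.monotonic() + total_timeout
    pending = dict(services)
    while pending:
        pending = {n: p for n, p in pending.items() if not port_is_open(host, p)}
        if not pending or time.monotonic() >= deadline:
            break
        time.sleep(interval)
    return list(pending)
```

Calling wait_for_services("localhost", SERVICES) from a fifth terminal returns the names of anything that has not come up within a minute.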

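Open http://0.0.0.0:7860/ in a browser.
The Gradio UI comes up. It is the same as the official demo.

Clicking a sample image loads it into the image area and fills the question box with
What is unusual about this image?
Press the SEND button as is, and the answer appears.
Answer: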
The unusual aspect of this image is that a man is sitting on a folding chair in the back of a yellow taxi, which is driving down a busy city street. This is not a typical sight, as people usually sit inside the taxi or use a designated seating area for passengers. The man's presence in the back of the taxi, along with the folding chair, creates an unconventional and unexpected scene that captures attention.
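So by default the answer is in English.

Next, I asked questions in Japanese.
Question: 車の色は何色ですか (What color is the car?)
Answer: 車の色は黄色です。 (The car is yellow.)
It answered in Japanese.
Question: 写っている人は何を市hていますか？ (meant as "What is the person in the picture doing?"; the input contains a typo)
Answer: 写っている人は、衣服を市場に乗せています (The person in the picture is putting clothes on the market.)
That one is a little off, although the original image is odd to begin with.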

Running from the CLI

Stop all the servers started above.
Then enter the following in any one of the terminals.

python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --load-4bit

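The chat then starts. With the sample command as written, the image that gets loaded is the pier picture, the same one as in the GUI samples.
USER:
When this prompt appears, type a greeting:
USER: こにちは
(a mistyped こんにちは, "hello")
ASSISTANT: こんいちは、私の日本語の説明を理解できるかどうか確認してください。 (roughly: "Hello, please check whether you can understand my explanation in Japanese.")
comes back. I was impressed.
USER: この絵の説明をしてください。 (Please describe this picture.)
ASSISTANT: この絵は、木製の桟橋が湖に突き出している景色です。桟橋は、2つの木製の板で作られており、それぞれが長さに応じて異なる高さになっています。桟橋の下には、湖の水が流れているように見えます。周りには、山が迷彩を演じており、山の裏には、豊かな樹木が生えています。この風景は、湖畔の自然の美しさを表現しています。 (roughly: "This picture is a view of a wooden pier jutting out into a lake. The pier is made of two wooden boards, each at a different height according to its length. Beneath the pier, the water of the lake appears to be flowing. Around it, mountains are playing camouflage, and behind the mountains lush trees are growing. This scene expresses the natural beauty of the lakeside.")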

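It answers properly, and in Japanese.
USER: 桟橋は何色？ (What color is the pier?)
ASSISTANT: 桟橋は木製であり、その色は自然の木材色である灰色です (The pier is made of wood, and its color is gray, a natural wood color.)
So it says.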
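
To try other images or models without retyping the command, the CLI invocation from this section can be assembled in Python. This is only a sketch: build_cli_command is a hypothetical helper of mine, and it uses nothing beyond the three flags shown in the command above (--model-path, --image-file, --load-4bit).

```python
import shlex


def build_cli_command(model_path, image_file, load_4bit=True):
    """Build the argument list for `python -m llava.serve.cli`."""
    cmd = [
        "python", "-m", "llava.serve.cli",
        "--model-path", model_path,
        "--image-file", image_file,
    ]
    if load_4bit:
        # 4-bit loading reduces VRAM use, as in the sample command.
        cmd.append("--load-4bit")
    return cmd


if __name__ == "__main__":
    cmd = build_cli_command(
        "liuhaotian/llava-v1.5-7b",
        "https://llava-vl.github.io/static/images/view.jpg",
    )
    # Print a copy-pasteable shell command; pass `cmd` to subprocess.run
    # instead if you want to start the interactive chat from Python.
    print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps image paths containing spaces intact when the list is handed to subprocess.run.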

API

It is quite simple. The server looks ready to use as soon as you change the IP address and port.
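
One way to read "changing the IP address and port" is to keep those values in a single place and derive the launch commands from them. The sketch below does that for the controller and model worker commands used earlier in this article. ServeConfig and its field names are my own assumptions for illustration, not something LLaVA provides. The defaults reproduce the commands exactly as I ran them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServeConfig:
    """Host and ports for the controller and model worker (hypothetical helper)."""
    host: str = "localhost"
    controller_port: int = 10000
    worker_port: int = 40000
    model_path: str = "liuhaotian/llava-v1.5-13b"

    @property
    def controller_url(self):
        return f"http://{self.host}:{self.controller_port}"

    @property
    def worker_url(self):
        return f"http://{self.host}:{self.worker_port}"

    def controller_command(self):
        # Binds to all interfaces, as in the original command.
        return ["python", "-m", "llava.serve.controller",
                "--host", "0.0.0.0", "--port", str(self.controller_port)]

    def worker_command(self):
        return ["python", "-m", "llava.serve.model_worker",
                "--host", "0.0.0.0",
                "--controller", self.controller_url,
                "--port", str(self.worker_port),
                "--worker", self.worker_url,
                "--model-path", self.model_path]


if __name__ == "__main__":
    cfg = ServeConfig()
    print(" ".join(cfg.controller_command()))
    print(" ".join(cfg.worker_command()))
```

Passing a different host or port, for example ServeConfig(host="192.168.0.10"), changes every URL that depends on it in one step.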

Summary

Having a capable multimodal LLM that can genuinely converse in Japanese and runs locally is a great help.