Trying RPG-DiffusionMaster on WSL2

January 25, 2024, 19:59

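I'm going to try "Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs", or RPG-DiffusionMaster for short(?). It is apparently "a powerful training-free paradigm that can use proprietary multimodal LLMs (MLLMs) or open-source local MLLMs as the prompt recaptioner and image region planner, with complementary regional diffusion, to achieve text-to-image generation and editing". Since "complementary image regional diffusion" is long, I'll just call it "regional diffusion" below.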

The PC I'm using is a GALLERIA UL9C-R49 from Dospara. Its specs are:
・CPU: Intel® Core™ i9-13900HX Processor
・Mem: 64 GB
・GPU: NVIDIA® GeForce RTX™ 4090 Laptop GPU (16GB)
・GPU: NVIDIA® GeForce RTX™ 4090 (24GB)
・OS: Ubuntu 22.04 on WSL2 (Windows 11)

1. Preparation

Setting up the environment

python3 -m venv prg-diffusionmaster
cd $_
source bin/activate

Clone the repository:

git clone https://github.com/YangLing0818/RPG-DiffusionMaster
cd RPG-DiffusionMaster

Then run pip install:

pip install -r requirements.txt

That's 193 packages…

$ wc -l requirements.txt
193
$

Create the directories.

mkdir repositories
mkdir -p generated_imgs/demo_imgs
mkdir -p models/Stable-diffusion

Downloading the required libraries

cd repositories
git clone https://github.com/Stability-AI/generative-models
git clone https://github.com/Stability-AI/stablediffusion
git clone https://github.com/sczhou/CodeFormer
git clone https://github.com/crowsonkb/k-diffusion
git clone https://github.com/salesforce/BLIP
mv stablediffusion stable-diffusion-stability-ai
cd ..
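
Before moving on, a quick sanity check that everything created and cloned above is in place can be sketched in Python. The path list is taken from the commands in this section; the helper name `missing_paths` is my own and is not part of the repository.

```python
from pathlib import Path

# Directories that should exist after the mkdir / git clone / mv steps above.
EXPECTED_DIRS = [
    "generated_imgs/demo_imgs",
    "models/Stable-diffusion",
    "repositories/generative-models",
    "repositories/stable-diffusion-stability-ai",  # renamed from "stablediffusion"
    "repositories/CodeFormer",
    "repositories/k-diffusion",
    "repositories/BLIP",
]


def missing_paths(root="."):
    """Return the expected directories that do not exist under root."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]


if __name__ == "__main__":
    # Run from the RPG-DiffusionMaster directory.
    missing = missing_paths(".")
    if missing:
        print("Missing directories:")
        for d in missing:
            print("  " + d)
    else:
        print("All expected directories are present.")
```

An empty result means the layout matches what this section sets up; it says nothing about whether the clones themselves are complete.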

2. Checking the code before trying it

The program to run is RPG.py. Let's unpack its options, going through the ones that are currently implemented.

RPG.py options

python RPG.py \
	--user_prompt 'A blonde hair girl with black suit and white skirt' \
	--model_name v1-5-pruned-emaonly.safetensors \
	--version_number 0 \
	--api_key "${OPENAI_API_KEY}" \
	--use_gpt

LLM-related options:

--user_prompt : A prompt that roughly summarizes the content of the image. This is the user prompt sent to the LLM (GPT-4 or similar).
--version_number : Currently 0 and 1 can be specified. The number selects the template used to generate the prompt. This text (the prompt) is the key ingredient.
- 0 / multi-attribute : template/human_multi_attribute_examples.txt
- 1 / multi-object : template/complex_multi_object_examples.txt
--use_gpt : Use GPT-4 (gpt-4-1106-preview)
- --api_key : API key for the GPT-4 API
--use_local : Use a local LLM
- --llm_path : Path to the local LLM

txt2img-related options:

--model_name : File name only of a model stored in the models/Stable-diffusion directory
--activate : Enable regional diffusion (default True)
--use_base : Use a base prompt for each region of the image (default True)
- --base_ratio : Weight of the base prompt (default 0.3)
- --base_prompt : Base prompt (default None). If not specified, user_prompt is used.
--batch_size : txt2img batch size (default 1)
--seed : txt2img seed (default 1234)
--cfg : Classifier-free guidance scale (default 5)
--steps : Number of txt2img steps (default 20)
--height : Height of the generated image (default 1024)
--width : Width of the generated image (default 1024)

Other

--demo : Runs the demo
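
To keep the options straight, here is a sketch of how the lists above might map onto argparse. It is reconstructed only from the option names and defaults described in this section and is not copied from RPG.py. The argument types, the use of `BooleanOptionalAction` for the True-by-default switches, and the `build_parser` name are all assumptions on my part.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options described in this article."""
    p = argparse.ArgumentParser(description="Sketch of the RPG.py options")

    # LLM-related options
    p.add_argument("--user_prompt", type=str)
    p.add_argument("--version_number", type=int, choices=[0, 1])
    p.add_argument("--use_gpt", action="store_true")
    p.add_argument("--api_key", type=str)
    p.add_argument("--use_local", action="store_true")
    p.add_argument("--llm_path", type=str)

    # txt2img-related options (defaults as listed above)
    p.add_argument("--model_name", type=str)
    # Assumption: modeled as on/off switches; the real script may differ.
    p.add_argument("--activate", action=argparse.BooleanOptionalAction, default=True)
    p.add_argument("--use_base", action=argparse.BooleanOptionalAction, default=True)
    p.add_argument("--base_ratio", type=float, default=0.3)
    p.add_argument("--base_prompt", type=str, default=None)
    p.add_argument("--batch_size", type=int, default=1)
    p.add_argument("--seed", type=int, default=1234)
    p.add_argument("--cfg", type=float, default=5.0)
    p.add_argument("--steps", type=int, default=20)
    p.add_argument("--height", type=int, default=1024)
    p.add_argument("--width", type=int, default=1024)

    # Other
    p.add_argument("--demo", action="store_true")
    return p


# Parse the same arguments as the example command (API key omitted).
args = build_parser().parse_args([
    "--user_prompt", "A blonde hair girl with black suit and white skirt",
    "--model_name", "v1-5-pruned-emaonly.safetensors",
    "--version_number", "0",
    "--use_gpt",
])
print(args.steps, args.cfg, args.base_ratio, args.use_gpt)
```

`BooleanOptionalAction` needs Python 3.9 or later; I picked it only because it gives a True default that can still be switched off from the command line.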

The prompt sent to the LLM

The prompt is built inside mllm.py.

    with open('template/template.txt', 'r') as f:
        template=f.readlines()
    if version=='multi-attribute':
        with open('template/human_multi_attribute_examples.txt', 'r') as f:
            incontext_examples=f.readlines()
    elif version=='complex-object':
        with open('template/complex_multi_object_examples.txt', 'r') as f:
            incontext_examples=f.readlines()
    user_textprompt=f"Caption:{prompt} \n Let's think step by step:"

    textprompt= f"{' '.join(template)} \n {' '.join(incontext_examples)} \n {user_textprompt}"

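The input prompt is the concatenation of the following three texts and an instruction:

template/template.txt
The template corresponding to the number given with the version_number option
0 / multi-attribute : template/human_multi_attribute_examples.txt
1 / multi-object : template/complex_multi_object_examples.txt
The value of RPG.py's --user_prompt argument, passed in as the Caption,
followed by the closing instruction "Let's think step by step:" (6 words of fixed text, counting the "Caption:" label)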
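
The assembly described above can be sketched as a self-contained function. This is a simplified rewrite of the mllm.py excerpt for illustration, not the repository's code: it reads each file as one string instead of joining `readlines()` output with spaces, so whitespace differs slightly. The names `EXAMPLE_FILES`, `build_text_prompt`, and `load_and_build` are hypothetical.

```python
from pathlib import Path

# --version_number -> in-context example file, as listed above.
EXAMPLE_FILES = {
    0: "template/human_multi_attribute_examples.txt",
    1: "template/complex_multi_object_examples.txt",
}


def build_text_prompt(template: str, examples: str, user_prompt: str) -> str:
    """Concatenate template, in-context examples, and the user caption."""
    user_part = f"Caption:{user_prompt} \n Let's think step by step:"
    return f"{template} \n {examples} \n {user_part}"


def load_and_build(user_prompt: str, version_number: int, root: str = ".") -> str:
    """Read the template files under root and build the full prompt."""
    base = Path(root)
    template = (base / "template/template.txt").read_text(encoding="utf-8")
    examples = (base / EXAMPLE_FILES[version_number]).read_text(encoding="utf-8")
    return build_text_prompt(template, examples, user_prompt)


if __name__ == "__main__":
    # Tiny stand-in strings so the sketch runs without the repository files.
    print(build_text_prompt("<template>", "<examples>", "A blonde hair girl"))
```

Run from the repository root, `load_and_build("A blonde hair girl with black suit and white skirt", 0)` would return the full text for the multi-attribute case.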

The word counts of each text are as follows.

$ wc -w template/*.txt
    4 template/User_playground.txt
 1269 template/complex_multi_object_examples.txt
 1354 template/human_multi_attribute_examples.txt
  517 template/template.txt
$

With version_number 0, 1,877 words (517 + 1354 + 6) plus the word count of user_prompt are passed to the LLM. With an input this long, a consumer GPU is going to struggle…
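
The arithmetic can be reproduced with a short sketch. The per-file figures are the `wc -w` results quoted above, and the 6 fixed words follow this article's own count. The helper name `prompt_word_count` is hypothetical. Keep in mind that this counts whitespace-separated words, not LLM tokens, so the actual token count will be different.

```python
# Word counts from the `wc -w template/*.txt` output above.
TEMPLATE_WORDS = 517
EXAMPLE_WORDS = {
    0: 1354,  # template/human_multi_attribute_examples.txt
    1: 1269,  # template/complex_multi_object_examples.txt
}
FIXED_WORDS = 6  # fixed caption/instruction text, per this article's count


def prompt_word_count(version_number: int, user_prompt: str) -> int:
    """Approximate number of words sent to the LLM for one request."""
    return (
        TEMPLATE_WORDS
        + EXAMPLE_WORDS[version_number]
        + FIXED_WORDS
        + len(user_prompt.split())
    )


if __name__ == "__main__":
    prompt = "A blonde hair girl with black suit and white skirt"
    for version in (0, 1):
        print(version, prompt_word_count(version, prompt))
```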

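3. Trying it out

As described above, either GPT-4 or a local LLM can be chosen as the LLM that generates the prompt fed into image generation.

With a local LLM, VRAM runs out even with two GPUs (24GB + 16GB), so this time I'll play it safe and use GPT-4.

Here is the command I ran.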

CUDA_VISIBLE_DEVICES=0,1 python RPG.py \
	--user_prompt 'A blonde hair girl with black suit and white skirt' \
	--model_name v1-5-pruned-emaonly.safetensors \
	--version_number 0 \
	--api_key "${OPENAI_API_KEY}" \
	--use_gpt

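Result - version_number 0

The total time, from prompt generation through image generation, was 85 seconds.
Of that, the GPT-4 API call took about 54 seconds, so local processing took about 31 seconds (it also averaged 31 seconds over several runs).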

select_checkpoint: v1-5-pruned-emaonly.safetensors [6ce0161689]
100%|██████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00,  1.11it/s]
Total progress: 20it [00:16,  1.22it/s]
Total progress: 20it [00:16,  1.22it/s]
real    1m25.230s
user    0m44.514s
sys     0m11.210s

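Here is the generated image. Images are written to the ./outputs/txt2img-images/YYYY-MM-DD directory or to the ./generated_imgs directory.

Here is the generated prompt. It's split horizontally into three regions… I think. Hard to tell.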

### Original Caption:
"A blonde hair girl with a black suit and white skirt."

### Key phrases identification:
We identify a girl with three attributes: blonde hair, black suit, white skirt. We'll split her features with a particular focus on distinctive elements.
1. Blonde hair (head features of the girl)
2. Black suit (upper garment)
3. White skirt (lower garment)

### Split Ratio Planning:
#### Horizontal Split Ratio: 1;1;1
- We'll split the image into three horizontal rows to provide a focused description for each attribute, from head to lower garment.

#### Vertical Split Ratio: None
- Since each attribute can be described individually in each horizontal section, we do not need to add vertical splits.

#### Detailed Subregion Prompts:
1. **First Row** (`1`):
- **Region 0:** Gleaming blonde hair styled impeccably, reflecting the light and setting a tone of sophistication.
2. **Second Row** (`1`):
- **Region 1:** The black suit is a portrait of professional elegance, with sharp tailoring and a silhouette that exudes confidence.
3. **Third Row** (`1`):
- **Region 2:** A pristine white skirt that complements the ensemble, its crisp lines and fabric adding a sense of chic simplicity.

#### Composition Logic:
- We begin with the blonde hair to immediately capture the viewer's attention with its brightness and texture.
- Moving to the black suit, we focus on the professional aspect and stylish form, making it a central part of the image.
- Lastly, the white skirt adds balance to the darker tones of the suit, completing the look with a clean and sleek appearance.

#### Aesthetic Considerations:
- The blonde hair introduces a vibrant yet polished element, providing a nice contrast to the darker clothing colors.
- The black suit's description will focus on the cut and quality, illustrating a powerful and striking impression of the girl.
- The white skirt enhances the monochromatic theme and adds purity to the professional look, ensuring a visually pleasing and coherent depiction.

By aligning with this plan, we present each region with specific attention to detail, using descriptive language to highlight colors and textures while upholding an elegant visual composition.

### Output:
Horizontal split ratio: 1;1;1
Vertical split ratio: None
Split ratio: 1;1;1
Regional Prompt: Gleaming blonde hair styled impeccably, reflecting the light and setting a tone of sophistication. BREAK
The black suit is a portrait of professional elegance, with sharp tailoring and a silhouette that exudes confidence. BREAK
A pristine white skirt that complements the ensemble, its crisp lines and fabric adding a sense of chic simplicity.
Horizontal split ratio: 1;1;1
Vertical split ratio: None
Split ratio: 1;1;1
Regional Prompt: Gleaming blonde hair styled impeccably, reflecting the light and setting a tone of sophistication. BREAK
The black suit is a portrait of professional elegance, with sharp tailoring and a silhouette that exudes confidence. BREAK
A pristine white skirt that complements the ensemble, its crisp lines and fabric adding a sense of chic simplicity.
{'split ratio': '1;1;1', 'Regional Prompt': 'Gleaming blonde hair styled impeccably, reflecting the light and setting a tone of sophistication. BREAK\nThe black suit is a portrait of professional elegance, with sharp tailoring and a silhouette that exudes confidence. BREAK\nA pristine white skirt that complements the ensemble, its crisp lines and fabric adding a sense of chic simplicity.'}
select_checkpoint: v1-5-pruned-emaonly.safetensors [6ce0161689]
process_script_args (True, False, 'Matrix', 'Columns', 'Mask', 'Prompt', '1;1;1', 0.3, False, False, False, 'Attention', [False], 0, 0, 0.4, None, 0, 0, False)
fatal: No names found, cannot describe anything.
1;1;1 0.3 Horizontal
Regional Prompter Active, Pos tokens : [19, 22, 22], Neg tokens : [0]
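
To see how the final dictionary in the log breaks down into regions, the Regional Prompt can be split on the BREAK keyword. This is only a sketch for reading the output; `split_regional_prompt` is a hypothetical helper and not a function from the repository.

```python
def split_regional_prompt(regional_prompt: str) -> list[str]:
    """Split a Regional Prompt string into one prompt per region."""
    return [part.strip() for part in regional_prompt.split("BREAK") if part.strip()]


# The dictionary printed in the log above.
result = {
    "split ratio": "1;1;1",
    "Regional Prompt": (
        "Gleaming blonde hair styled impeccably, reflecting the light and "
        "setting a tone of sophistication. BREAK\n"
        "The black suit is a portrait of professional elegance, with sharp "
        "tailoring and a silhouette that exudes confidence. BREAK\n"
        "A pristine white skirt that complements the ensemble, its crisp lines "
        "and fabric adding a sense of chic simplicity."
    ),
}

regions = split_regional_prompt(result["Regional Prompt"])
ratios = result["split ratio"].split(";")
for ratio, text in zip(ratios, regions):
    print(f"[{ratio}] {text}")
```

The three region prompts line up with the three entries in the `Pos tokens` list at the end of the log.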

Result - Mt. Fuji
This uses the README's sample prompt with a snowy mountain, a volcano, and a river, with the volcano changed to Mt. Fuji:

A beautiful landscape with a river in the middle the left of the river is in the evening and in the winter with a big iceberg and a small village while some people are skating on the river and some people are skiing, the right of the river is in the summer with Mt. Fuji in the morning and a small village while some people are playing.

Running it with version_number 0 gave the following result.

It could pass for Mt. Fuji. It just looks like a single picture to me, though…
Here is the generated prompt.

### Original Caption:
"A beautiful landscape with a river in the middle, the left of the river is in the evening and in the winter with a big iceberg and a small village, while some people are skating on the river and some people are skiing. The right side of the river is in the summer with Mt. Fuji in the morning and a small village, while some people are playing."

### Key phrases identification:
To capture the complexity of the scene without creating too many subregions, we identify these key components:
1. Evening winter scene with iceberg and village (landscape features on the left of the river)
2. People skating and skiing (activities on the left of the river)
3. Summer scene with Mt. Fuji and village (landscape features on the right of the river)
4. People playing (activities on the right of the river)

We need to split the image into four subregions to maintain the logical flow of the scene.

### Split Ratio Planning:
#### Horizontal Split Ratio: `1;1`
- This ratio splits the image into two horizontal rows, one for the winter evening scene and the other for the summer morning scene.

#### Vertical Split Ratio: `1,(2,1); 1,(2,1)`
- Each row is then further divided into three parts, with the larger portion for landscape features and the smaller for activities. This accommodates the presence of elements such as the iceberg and people in the given space more comfortably.

#### Detailed Subregion Prompts:
1. **First Row** (`1,(2,1)`):
- **Region 0:** A winter evening, with the blue hues of dusk settling over a quaint village next to a glistening iceberg.
- **Region 1:** People joyously skating on the river's frozen surface, with others skiing on the nearby snow-covered slopes.
2. **Second Row** (`1,(2,1)`):
- **Region 2:** A summer morning scene with the iconic silhouette of Mt. Fuji, bathed in the warm glow of sunrise, behind a charming village.
- **Region 3:** A group of people playing in the verdant fields, bringing a sense of liveliness to the summer side of the river.

#### Composition Logic:
- The entire left portion of the image is dedicated to the winter evening scene, with the larger subregion emphasizing the landscape and the smaller one highlighting the activities of skating and skiing.
- The right portion is a mirror of the left in terms of structure but showcases a contrasting summer morning with activities appropriate for the season.

#### Aesthetic Considerations:
- The winter scene's blue tones and the sense of chill from the iceberg evoke a cold atmosphere, while the village suggests a communal warmth.
- The movement of the people skating and skiing introduces dynamism and conveys the joys of winter sports.
- Mt. Fuji in the summer brings a sense of calm and grandeur, and the depiction of sunrise over the mountain and the village adds a warm and optimistic feeling.
- The playfulness of the people in the summer fields contrasts with the more static winter landscape, presenting a full spectrum of seasonal activities.

By following this layout plan, each region portrays either a single element or related elements without overcomplicating the scene, focusing on descriptive aspects to highlight the contrasts between the two sides of the river.

Now, let's output the split ratio and regional prompts according to the plan we've developed.

### Output:
Horizontal split ratio: 1;1
Vertical split ratio: 1,(2,1); 1,(2,1)
Split ratio: 1,(2,1); 1,(2,1)
Regional Prompt:
A winter evening, with the blue hues of dusk settling over a quaint village next to a glistening iceberg. BREAK
People joyously skating on the river's frozen surface, with others skiing on nearby snow-covered slopes. BREAK
A summer morning scene with the iconic silhouette of Mt. Fuji, bathed in the warm glow of sunrise, behind a charming village. BREAK
A group of people playing in the verdant fields, bringing a sense of liveliness to the summer side of the river.
Horizontal split ratio: 1;1
Vertical split ratio: 1,(2,1); 1,(2,1)
Split ratio: 1,(2,1); 1,(2,1)
Regional Prompt:
A winter evening, with the blue hues of dusk settling over a quaint village next to a glistening iceberg. BREAK
People joyously skating on the river's frozen surface, with others skiing on nearby snow-covered slopes. BREAK
A summer morning scene with the iconic silhouette of Mt. Fuji, bathed in the warm glow of sunrise, behind a charming village. BREAK
A group of people playing in the verdant fields, bringing a sense of liveliness to the summer side of the river.
{'split ratio': '1,2,1; 1,2,1', 'Regional Prompt': "A winter evening, with the blue hues of dusk settling over a quaint village next to a glistening iceberg. BREAK\nPeople joyously skating on the river's frozen surface, with others skiing on nearby snow-covered slopes. BREAK\nA summer morning scene with the iconic silhouette of Mt. Fuji, bathed in the warm glow of sunrise, behind a charming village. BREAK\nA group of people playing in the verdant fields, bringing a sense of liveliness to the summer side of the river."}
select_checkpoint: v1-5-pruned-emaonly.safetensors [6ce0161689]
process_script_args (True, False, 'Matrix', 'Columns', 'Mask', 'Prompt', '1,2,1; 1,2,1', 0.3, False, False, False, 'Attention', [False], 0, 0, 0.4, None, 0, 0, False)
fatal: No names found, cannot describe anything.
1,2,1; 1,2,1 0.3 Horizontal
Regional Prompter Active, Pos tokens : [22, 23, 27, 25], Neg tokens : [0]
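
In this log, the LLM's `1,(2,1); 1,(2,1)` ends up as `1,2,1; 1,2,1` in the final dictionary. The sketch below shows one way to read that format, under an assumption I inferred from the logs rather than from the code: each `;`-separated group starts with the group's own weight along the main split axis, and the remaining numbers are the weights of the cells inside that group. The function names are hypothetical.

```python
import re


def flatten_ratio(ratio: str) -> str:
    """Drop the parentheses the LLM adds, e.g. '1,(2,1)' -> '1,2,1'."""
    return re.sub(r"[()]", "", ratio)


def region_fractions(ratio: str) -> list[tuple[float, float]]:
    """Return (main-axis share, cross-axis share) for each region.

    Assumption: the first number of each ';'-separated group is the group's
    weight, and any following numbers are the cell weights inside the group.
    """
    groups = []
    for group in flatten_ratio(ratio).split(";"):
        numbers = [float(x) for x in group.split(",") if x.strip()]
        weight, cells = numbers[0], numbers[1:] or [1.0]
        groups.append((weight, cells))

    total_weight = sum(weight for weight, _ in groups)
    fractions = []
    for weight, cells in groups:
        total_cells = sum(cells)
        for cell in cells:
            fractions.append((weight / total_weight, cell / total_cells))
    return fractions


if __name__ == "__main__":
    for share in region_fractions("1,(2,1); 1,(2,1)"):
        print(share)
```

Under that reading, `1,2,1; 1,2,1` yields four regions and `1;1;1` yields three, which matches the number of entries in the `Pos tokens` lists of the two logs.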

GPU memory usage

・Before querying GPT-4 (first peak): 4,559 MiB
・During image generation (second peak): 23,465 MiB
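
Figures like these can also be sampled from a script instead of being read off a graph. The following is a sketch that calls `nvidia-smi` with its CSV query options and parses the result. The helper names are my own, and it assumes `nvidia-smi` is on the PATH (inside WSL2 that depends on the Windows driver setup).

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int, int]]:
    """Parse lines of 'index, used MiB, total MiB' into integer tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def query_gpu_memory() -> list[tuple[int, int, int]]:
    """Return per-GPU memory usage, or an empty list without nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(out.stdout)


if __name__ == "__main__":
    for index, used, total in query_gpu_memory():
        print(f"GPU {index}: {used} / {total} MiB")
```

Calling `query_gpu_memory()` in a loop while RPG.py runs would show the two peaks as a series of numbers.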

FYI: the single-GPU case

With a single GPU (24GB), VRAM overflowed.

Generation took 12 minutes 47 seconds.

100%|██████████████████████████████████████████████████████████████████████████████████████| 20/20 [11:49<00:00, 35.48s/it]
Total progress: 20it [11:21, 34.06s/it]
Total progress: 20it [11:21, 35.78s/it]
real    12m47.443s
user    9m28.467s
sys     3m0.477s
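
Comparing the progress-bar rates of the two runs (1.11it/s with two GPUs, 35.48s/it with one) shows how large the penalty is: roughly 39 times slower per step. A minimal sketch of that conversion, with `seconds_per_iteration` as a hypothetical helper:

```python
import re


def seconds_per_iteration(rate: str) -> float:
    """Convert a tqdm-style rate ('1.11it/s' or '35.48s/it') to seconds per step."""
    m = re.fullmatch(r"\s*([\d.]+)\s*(it/s|s/it)\s*", rate)
    if m is None:
        raise ValueError(f"unrecognized rate: {rate!r}")
    value = float(m.group(1))
    return 1.0 / value if m.group(2) == "it/s" else value


two_gpus = seconds_per_iteration("1.11it/s")    # from the first run's log
one_gpu = seconds_per_iteration("35.48s/it")    # from the single-GPU log
slowdown = one_gpu / two_gpus
print(f"{two_gpus:.2f}s/it vs {one_gpu:.2f}s/it: {slowdown:.1f}x slower")
```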

Bonus

Generating with "volcano" left as-is gave this. Here the regions really are separated.

### Original Caption:
"A beautiful landscape with a river in the middle, the left of the river is in the evening and in winter with a big iceberg and a small village while some people are skating on the river and some people are skiing, the right of the river is in the summer with Mt. Fuji in the morning and a small village while some people are playing."

### Key Phrases Identification:
The key elements can be identified as follows:
1. River (central element dividing the scene)
2. Left side:
- Evening & Winter landscape
- Big iceberg & Skating people
- Small village & Skiing people
3. Right side:
- Summer & Morning landscape
- Mt. Fuji (iconic mountain)
- Small village & Playing people

### Split Ratio Planning:
#### Horizontal Split Ratio: 1
- We won't split the image horizontally as the river runs centrally and divides the landscape vertically.

#### Vertical Split Ratio: `1,(1,2); 1,(1,2)`
- The vertical split ratio reflects the river acting as a divider and the two contrasting seasons on either side. Each side is further divided to capture the sub-elements: the winter village and the activities, and the summer village with different activities.

#### Detailed Subregion Prompts:
1. **First Row - Left of the River** (`1,(1,2)`):
- **Region 0:** Winter evening ambiance with a glimmering big iceberg in the dim twilight.
- **Region 1:** A cozy, snow-covered village with warm lights peeking through windows; inhabitants skating joyously on the frozen river and nearby slopes dotted with skiers descending softly amidst flurries.
2. **First Row - Right of the River** (`1,(1,2)`):
- **Region 2:** Summer morning vista with a majestic Mt. Fuji bathed in the warm glow of the sunrise.
- **Region 3:** A lively, verdant village with children and adults playing in the open fields, laughter mixing with the bright sounds of summer.

#### Composition Logic:
- The image artfully features a split-narrative layout, dividing the landscape through the middle with a river that sets the boundary between two seasons and consequent activities.
- The winter evening on the left provides a contrast to the summer morning on the right, each with corresponding village life and activities, producing a rich tapestry of human interaction with the landscape.

#### Aesthetic Considerations:
- The contrasting times of day (evening on the left, morning on the right) and seasons (winter versus summer) offer a unique visual experience, crafting a scene that invites viewers to a story-like journey across times and activities.
- The presence of people engaging in season-appropriate activities on both sides of the river adds a layer of dynamism to the landscape, creating interactivity and connection within the whole image.

By meticulously adhering to this layout plan, we can ensure each of the key phrases is adequately represented, focusing on the picturesque nature of the landscape, the distinguished characteristics of each season, and the spirited depictions of village life and leisure activities.

### Output:
Horizontal split ratio: 1
Vertical split ratio: 1,(1,2); 1,(1,2)
Split ratio: 1,(1,2); 1,(1,2)

Regional Prompt:
Winter evening ambiance with a glimmering big iceberg in the dim twilight. BREAK
A cozy, snow-covered village with warm lights peeking through windows; inhabitants skating joyously on the frozen river and nearby slopes dotted with skiers descending softly amidst flurries. BREAK
Summer morning vista with a majestic Mt. Fuji bathed in the warm glow of the sunrise. BREAK
A lively, verdant village with children and adults playing in the open fields, laughter mixing with the bright sounds of summer.
Horizontal split ratio: 1
Vertical split ratio: 1,(1,2); 1,(1,2)
Split ratio: 1,(1,2); 1,(1,2)

Regional Prompt:
Winter evening ambiance with a glimmering big iceberg in the dim twilight. BREAK
A cozy, snow-covered village with warm lights peeking through windows; inhabitants skating joyously on the frozen river and nearby slopes dotted with skiers descending softly amidst flurries. BREAK
Summer morning vista with a majestic Mt. Fuji bathed in the warm glow of the sunrise. BREAK
A lively, verdant village with children and adults playing in the open fields, laughter mixing with the bright sounds of summer.
{'split ratio': '1,1,2; 1,1,2', 'Regional Prompt': 'Winter evening ambiance with a glimmering big iceberg in the dim twilight. BREAK\nA cozy, snow-covered village with warm lights peeking through windows; inhabitants skating joyously on the frozen river and nearby slopes dotted with skiers descending softly amidst flurries. BREAK\nSummer morning vista with a majestic Mt. Fuji bathed in the warm glow of the sunrise. BREAK\nA lively, verdant village with children and adults playing in the open fields, laughter mixing with the bright sounds of summer.'}
select_checkpoint: v1-5-pruned-emaonly.safetensors [6ce0161689]
process_script_args (True, False, 'Matrix', 'Columns', 'Mask', 'Prompt', '1,1,2; 1,1,2', 0.3, False, False, False, 'Attention', [False], 0, 0, 0.4, None, 0, 0, False)
fatal: No names found, cannot describe anything.
1,1,2; 1,1,2 0.3 Horizontal
Regional Prompter Active, Pos tokens : [16, 37, 19, 25], Neg tokens : [0]

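4. Summary

A single RTX 4090 (24GB) ran out of VRAM.
Yet with two GPUs (24GB + 16GB), total usage stayed under 24GB. Strange.

Image generation speed naturally depends on GPT-4's response time. When that was fast, a run took under a minute (31 seconds excluding the GPT-4 call). If you want consistent generation times, running a local LLM alongside is probably the way to go.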