Using StreamDiffusion through the Stream object (a Mini-Tips roundup)

January 7, 2024, 22:05

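It has been two weeks since StreamDiffusion, the ultra-fast image generation AI, was released. Many ways of using it have already been proposed, and I look forward to what comes next. The bundled demos are extensive and should be enough for most people for a while. This article goes two steps further and collects the key points and small tips I have found so far for working with the Stream object directly.
The complete source code used for the tests is at the end.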

Why use the Stream object

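Normally, wrapper.py in the utils folder is said to be sufficient. It defines the initialization parameters and their defaults, and in wrapper.py a description of each parameter follows the signature.
Initialization parameters of wrapper.py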

class StreamDiffusionWrapper:
    def __init__(
        self,
        model_id_or_path: str,
        t_index_list: List[int],
        lora_dict: Optional[Dict[str, float]] = None,
        mode: Literal["img2img", "txt2img"] = "img2img",
        output_type: Literal["pil", "pt", "np", "latent"] = "pil",
        lcm_lora_id: Optional[str] = None,
        vae_id: Optional[str] = None,
        device: Literal["cpu", "cuda"] = "cuda",
        dtype: torch.dtype = torch.float16,
        frame_buffer_size: int = 1,
        width: int = 512,
        height: int = 512,
        warmup: int = 10,
        acceleration: Literal["none", "xformers", "tensorrt"] = "tensorrt",
        do_add_noise: bool = True,
        device_ids: Optional[List[int]] = None,
        use_lcm_lora: bool = True,
        use_tiny_vae: bool = True,
        enable_similar_image_filter: bool = False,
        similar_image_filter_threshold: float = 0.98,
        similar_image_filter_max_skip_frame: int = 10,
        use_denoising_batch: bool = True,
        cfg_type: Literal["none", "full", "self", "initialize"] = "self",
        seed: int = 2,
        use_safety_checker: bool = False,
        engine_dir: Optional[Union[str, Path]] = "engines",
    ):

In addition, the following methods are provided for different uses.
prepare()
txt2img()
img2img()
preprocess_image()
postprocess_image()
After initializing StreamDiffusionWrapper, you can run generation easily by calling these methods with the appropriate parameters.

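My reasons for bypassing the wrapper and driving Stream directly are, honestly, not very clear-cut. If I had to name the main one, it is performance. The wrapper methods listed above go through various checks before they call the Stream object, and I wanted to skip that extra work and see what StreamDiffusion itself can do. Another reason is simply that, coming from Diffusers, I started out with the sample shown in the README.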

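Workflow for using Stream

A previous article of mine explains this by following the code step by step.

Since writing that article there has been a version update, and I have kept quietly working through unglamorous tests. This article summarizes what I learned, the performance measurements, and assorted tips.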

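How to use and modify the main parts

Loading the model

Using a downloaded model

I touched on this in the earlier article; the code looks like this.
pipe is a Diffusers class, so Hugging Face's Diffusers documentation offers extensive material and plenty of samples.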

pipe = StableDiffusionPipeline.from_pretrained("KBlueLeaf/kohaku-v2.1").to(
    device=torch.device("cuda"),
    dtype=torch.float16,
)

To use a locally downloaded model, change from_pretrained in pipe = StableDiffusionPipeline.from_pretrained to from_single_file, as below, and pass the path to the model file, including its file name.

pipe = StableDiffusionPipeline.from_single_file(
    "/auto1111/models/Counterfeit-V3.0/Counterfeit-V3.0_fix_fp16.safetensors").to(
    device=torch.device("cuda"),
    dtype=torch.float16,
)

Initializing stream

Wrap the pipe defined with Diffusers above in StreamDiffusion to initialize it. t_index_list and cfg_type are the important parameters.

    stream = StreamDiffusion(
        pipe,
        t_index_list=index_list,
        torch_dtype=torch.float16,
        cfg_type=cfg_type,
    )

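The stream object is defined in pipeline.py under src/streamdiffusion/. The initialization arguments and their defaults are listed below. Specifying width and height lets you generate images at the size you want. pipeline.py has no parameter descriptions, so the wrapper.py shown above is a useful reference.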

class StreamDiffusion:
    def __init__(
        self,
        pipe: StableDiffusionPipeline,
        t_index_list: List[int],
        torch_dtype: torch.dtype = torch.float16,
        width: int = 512,
        height: int = 512,
        do_add_noise: bool = True,
        use_denoising_batch: bool = True,
        frame_buffer_size: int = 1,
        cfg_type: Literal["none", "full", "self", "initialize"] = "self",
    ):

Loading LCM-LoRA

LCM-LoRA is used for acceleration, so its weights are loaded.
The default model is "latent-consistency/lcm-lora-sdv1-5".

Loading your own LoRA (Style-LoRA), the official way

You can load a LoRA that you trained yourself. The code is as follows.

stream.load_lora("./models/LoRA/megu_sports_v02.safetensors")
stream.fuse_lora(lora_scale=1.0)

stream.fuse_lora enables the LoRA, and lora_scale adjusts how strongly it applies. The default is lora_scale=1.0, which feels too strong to me. Judging from the source code, adapter_name can apparently be specified as well.

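Loading your own LoRA (Style-LoRA), the unofficial way

Being unofficial, this is not an endorsed approach, but it clearly has an effect, so I describe it here. The idea is to apply the usual Diffusers technique before wrapping the pipe in StreamDiffusion. Put the following code immediately after the pipe initialization.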

pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5", adapter_name="lcm")  # LCM LoRA for Stable Diffusion 1.5
pipe.load_lora_weights("./models/LoRA/megu_sports_v02.safetensors", adapter_name="papercut")
pipe.set_adapters(["lcm", "papercut"], adapter_weights=[1.0, 0.5])

pipe.set_adapters(["lcm", "papercut"], adapter_weights=[1.0, 0.5])
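binds the LCM LoRA and your own LoRA together. This is strictly unofficial, but a setting of
adapter_weights=[1.0, 0.5]
takes effect gently and feels easy to work with. Using set_adapters requires PEFT to be installed. Since lcm-lora-sdv1-5 is also loaded on the stream side, loading it in one place is probably enough, but loading it in both did not raise an error.
Again, this is unofficial, so use it at your own risk. It may stop working in a future version. This way of setting up LoRA also works with TensorRT.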

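Using TensorRT

This is the tricky part. I will explain the following snippet from the documentation. The first line just imports the TensorRT module, so there is nothing confusing there. The question is what "engines" means.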

from streamdiffusion.acceleration.tensorrt import accelerate_with_tensorrt

stream = accelerate_with_tensorrt(
    stream, "engines", max_batch_size=2,
)

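In short, it is the name of the folder where the engine built for TensorRT is saved. TensorRT is somewhat temperamental, and merely changing a parameter can stop it from working correctly. A build takes about 10 minutes, so even checking a change is laborious. If you leave "engines" unchanged, no new TensorRT engine is built and the files already in "engines" are used. They are not overwritten; the build simply does not happen, so generation does not behave correctly. To rebuild a valid engine for new parameters, specify a folder name that does not exist yet in place of "engines". An engine folder takes roughly 5.2 GB. If you keep changing parameters and "engines" and rebuilding, disk space disappears quickly, so it is best to delete folders that you no longer use, or that produced errors, promptly.
Notes
1) A build appears to use about 8 GB of GPU VRAM, so make sure enough is free. Leftovers sometimes remain in VRAM after use.
2) Rebuilding whenever you change a parameter is the safe choice.
3) The folders are large, so clean them up regularly.
4) Use folder names that make it obvious under which conditions the engine was built. The engine folder names created by the demos are a good reference.
5) For image sizes other than 512x512 the build succeeds, but image generation then fails with an error (at present, and in my environment including my settings).
6) A build takes roughly 10 minutes on an i5. A CPU with strong single-core performance has the advantage.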
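
Notes 2) and 4) can be combined by deriving the folder name from the settings, so that a changed setting points at a folder that does not exist yet and a rebuild is triggered. The following is only a minimal sketch: engine_dir_for and its naming scheme are hypothetical and not part of StreamDiffusion, and which settings actually require a rebuild is an assumption here.

```python
from pathlib import Path


def engine_dir_for(mode, t_index_list, cfg_type, width=512, height=512,
                   max_batch_size=2, root="engines"):
    """Build a folder name that encodes the settings the engine was built with.

    Hypothetical helper: the naming scheme is only an illustration.
    """
    steps = "-".join(str(t) for t in t_index_list)
    name = f"{mode}_t{steps}_cfg-{cfg_type}_{width}x{height}_b{max_batch_size}"
    return Path(root) / name


engine_dir = engine_dir_for("t2i", [0, 16, 32, 45], "self", max_batch_size=4)
print(engine_dir.as_posix())  # engines/t2i_t0-16-32-45_cfg-self_512x512_b4

# The result would then be passed in place of "engines", for example:
# stream = accelerate_with_tensorrt(stream, str(engine_dir), max_batch_size=4)
```

Because every setting is part of the name, old folders are easy to identify and delete when disk space runs low.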

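Prompts

Prompts work the same way as in Stable Diffusion, except that the maximum length is 77 tokens.
(I would like this changed so that, as in Diffusers, embeddings can be used to get past the 77-token wall.) The prompt is first passed to
stream.prepare(prompt)
which runs various preprocessing steps. Generation then uses the cache that was built at that point. There is also a way to change the prompt interactively, for example when you want to change it inside a loop that generates continuously. The example below takes the prompts stored in prompt_list one at a time and changes the prompt for every generation. Using
stream.update_prompt(prompt)
as it does here lets you change the prompt every time without a large effect on generation time.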

for i in tqdm(range(len(prompt_list))):
    start_time = time.time()
    # Change the prompt dynamically
    prompt = prompt_list[i]
    stream.update_prompt(prompt)
    x_output = stream.txt2img()
    image=postprocess_image(x_output, output_type="np")[0]

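The following shows how the image actually changes during generation. Here, instead of prompt= prompt_list[i], I used prompt=prompt+prompt_list[i], wrote one word per element in prompt_list, and made the prompt longer step by step.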

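#StreamDiffusion
Following Aki-sensei's advice, I append to the prompt step by step via
stream.update_prompt( prompt)
until the image gradually turns into Megu, then make her walk with "walk a head".
LoRA enabled, index_list=[0, 16, 32, 45], cfg_type = "self"
It reaches 14.88 fps, so I wait 100 ms after each generation with cv2.waitKey(100) pic.twitter.com/n192KlZGUq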
— ゆずき (@uzuki425) January 7, 2024
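
The growing-prompt variant described above (prompt=prompt+prompt_list[i]) can be sketched without a GPU. growing_prompts is a hypothetical helper name; in the real loop each yielded string would be passed to stream.update_prompt before calling stream.txt2img.

```python
from itertools import accumulate


def growing_prompts(base, fragments):
    """Yield base+fragments[0], base+fragments[0]+fragments[1], and so on."""
    # accumulate() with an initial value yields the bare base first, so skip it.
    prompts = accumulate(fragments, lambda acc, frag: acc + frag, initial=base)
    next(prompts)
    return prompts


base = "masterpiece, best quality, 1girl,"
for prompt in growing_prompts(base, ["long hair,", "white shirt,", "smile,"]):
    print(prompt)
    # In the real loop: stream.update_prompt(prompt); x_output = stream.txt2img()
```

Because the prompt only ever grows, a long prompt_list will eventually run into the 77-token limit mentioned above.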

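Format of generated images

x_output = stream(init_image)  (i2i)
x_output = stream.txt2img()  (t2i)
The x_output returned by these calls is a tensor. That is inconvenient for display, so
postprocess_image(x_output, output_type="pil")[0]
is provided. It is defined in
/src/streamdiffusion/image_utils.py
The output format is selected with output_type=, and the default is "pil", that is, a Pillow image. The available formats are
"latent", "pt", "pil" and "np". With "np", a small amount of extra work makes the image easy to handle in OpenCV. "latent" (latent-space format) and "pt" (PyTorch tensor) are unlikely to be needed by end users.
Images output as np are normalized to the 0 to 1 range, so they must be scaled when converting:
imgCV_RGB = (np.array(image) * 255).round().astype(np.uint8)  # scale to 0-255
imgCV_BGR = imgCV_RGB[:, :, ::-1]  # RGB to BGR
This gives OpenCV's usual format.

Displaying with "np"

With code like the following you can display the image using OpenCV's cv2.imshow("t2i",image). As long as the window name stays the same, cv2.imshow overwrites the previous image, which is handy for watching the changes during continuous generation. It looks like a video.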

    image=postprocess_image(x_output, output_type="np")[0]
    image = np.array(image)[:, :, ::-1]
    times.append(time.time() - start_time)
    cv2.imshow("t2i",image)
    cv2.waitKey(1)
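
The display code above works because cv2.imshow accepts floating-point images in the 0 to 1 range. For anything that expects 8-bit data, such as saving a file or sending the image over a network, the values need to be scaled first. A minimal sketch follows, assuming the "np" output is an (H, W, 3) RGB float array in 0 to 1; np_output_to_bgr_uint8 is a hypothetical helper name and the array below is a stand-in for a real generated image.

```python
import numpy as np


def np_output_to_bgr_uint8(image):
    """Convert a normalized RGB float image (H, W, 3) to BGR uint8 for OpenCV."""
    scaled = np.clip(np.asarray(image, dtype=np.float32) * 255.0, 0, 255)
    rgb = scaled.round().astype(np.uint8)
    # Reverse the channel axis (RGB -> BGR); copy so the result is contiguous.
    return np.ascontiguousarray(rgb[:, :, ::-1])


# Stand-in for postprocess_image(x_output, output_type="np")[0]
fake = np.zeros((2, 2, 3), dtype=np.float32)
fake[..., 0] = 1.0  # pure red in RGB
bgr = np_output_to_bgr_uint8(fake)
print(bgr.dtype, bgr[0, 0].tolist())  # uint8 [0, 0, 255]
```

Casting with dtype=np.uint8 alone, without multiplying by 255, would truncate almost every pixel to 0.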

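Input image format for i2i

The accepted formats appear to be
Tensor, PIL.Image objects, and ndarray.
When using an ndarray, note that it must not be an OpenCV-style array of 0 to 255 integers;
it must hold floating-point values normalized to the 0 to 1 range. An OpenCV image read from a webcam or similar is converted like this:
image=image/255.0  # convert to 0-1 floats (ndarray)
Pillow images need no special handling.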
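
A minimal sketch of that conversion for a webcam-style frame is shown below. Two points are assumptions rather than facts taken from the library: that the pipeline expects RGB channel order (OpenCV delivers BGR), and that float32 is acceptable as input (plain image/255.0 produces float64). bgr_uint8_to_input is a hypothetical helper name.

```python
import numpy as np


def bgr_uint8_to_input(frame):
    """Convert an OpenCV frame (H, W, 3, BGR, 0-255) to RGB floats in 0-1."""
    rgb = frame[:, :, ::-1]  # BGR -> RGB (assumes the pipeline expects RGB)
    # Copy into a contiguous float32 array, then normalize to 0-1.
    return np.ascontiguousarray(rgb, dtype=np.float32) / 255.0


frame = np.zeros((2, 2, 3), dtype=np.uint8)
frame[..., 0] = 255  # pure blue in BGR
x = bgr_uint8_to_input(frame)
print(x.dtype, x[0, 0].tolist())  # float32 [0.0, 0.0, 1.0]
```

The result would then be passed to the stream in place of a Pillow image, for example x_output = stream(x).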

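Effect of input and output formats on generation time

StreamDiffusion's generation time is extremely short. With a high-end GPU such as the RTX 4090 this hardly matters, but getting practical generation times from a mid-range GPU takes various tricks. To use StreamDiffusion's performance to the limit, I want to avoid spending time on anything other than generation. This applies, for example, when a high frame rate must be maintained while generating or converting video in real time.
Input: Pillow is faster than np. This is puzzling, since np looks like it should be faster, but it is the other way round.
Output: np is faster than Pillow, the reverse of the input case. Looking at the postprocess_image code, np returns the array right after converting from the tensor with no further processing, whereas pil performs an additional conversion. The fastest combination is therefore
input: Pillow, output: np.
Measurements follow.
The data combine the input format with the output method: the upper two rows do no display inside the loop and hand the image to a thread, and the lower two display it inside the loop.
All outputs use "np".
Measured data
18.76 fps  Pillow input + thread-OUT (image displayed in a thread)
18.49 fps  OpenCV input + thread-OUT (image displayed in a thread)
17.65 fps  Pillow input + cv2.imshow (displayed inside the loop)
17.20 fps  OpenCV input + cv2.imshow (displayed inside the loop)
Conditions
RTX4070 + i5-13600K, TensorRT enabled,
index_list=[35,40 ,42,45], cfg_type = "self"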
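
Comparisons like these are easier to repeat with a small timing helper that averages over many frames instead of printing one value per frame. The sketch below is not the code used for the figures above; measure_fps is a hypothetical helper, and fake_step stands in for one iteration of generation, postprocessing and display.

```python
import time


def measure_fps(step, n_frames=100, warmup=5):
    """Call step() repeatedly and return the average frames per second."""
    for _ in range(warmup):  # excluded from timing (caches, lazy initialization)
        step()
    start = time.perf_counter()
    for _ in range(n_frames):
        step()
    elapsed = time.perf_counter() - start
    return n_frames / elapsed


# Dummy workload standing in for "stream(image) + postprocess_image + display".
def fake_step():
    time.sleep(0.002)


fps = measure_fps(fake_step, n_frames=20, warmup=2)
print(f"{fps:.1f} fps")
```

To compare input formats, the same helper would be called once with a step that passes a Pillow image and once with a step that passes an ndarray, keeping everything else identical.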

Other Mini-Tips

Speed comparison of i2i and t2i

#StreamDiffusion
Comparing t2i and i2i. Generation time was almost the same.
Model: Counterfeit-V3.0_fix_fp16.safetensors
No LoRA
cfg_type = "none"
t2i=36.68fps  t_index_list=[0, 16, 32, 45]
i2i=34.1fps  t_index_list=[35,40 ,42,45]
RTX4090+i5-13400K
The CPU makes little difference compared with the 13400F I used over New Year; this is the power of the 4090 pic.twitter.com/6ACPWhrRcH
— ゆずき (@uzuki425) January 5, 2024

Changing the image size

Specify it when defining stream, as follows.
stream = StreamDiffusion(
    pipe,
    t_index_list=[40,42,44,45],
    torch_dtype=torch.float16,
    cfg_type=cfg_type,
    width=512,
    height=768,
)
Note: changing the image size appears to make TensorRT unusable.

#StreamDiffusion
Mini-Tips for Stream
Changing the image size
stream = StreamDiffusion(
    pipe,
    t_index_list=index_list,
    torch_dtype=torch.float16,
    cfg_type=cfg_type,
    width=512,
    height=768,
)
Do it in the stream definition, as above
t2i 8.1fps, 4070+i5, no LoRA pic.twitter.com/y2p1AyfsKv
— ゆずき (@uzuki425) January 7, 2024

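Generation speed and t_index_list

Number of steps versus generation speed. The number of steps is the number of elements in t_index_list.
1 step: 32.68 fps
2 steps: 23.84 fps
3 steps: 18.15 fps
4 steps: 15.18 fps
RTX4070+i5-12400F, TensorRT enabled, np output → OpenCV
The number of elements in t_index_list has a large effect on generation time.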
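
Converting these frame rates into time per frame makes the cost of each extra step easier to see. The short calculation below uses only the four measured values listed above.

```python
# fps values measured above, keyed by number of steps (len(t_index_list))
fps_by_steps = {1: 32.68, 2: 23.84, 3: 18.15, 4: 15.18}

# Time per frame in milliseconds is 1000 / fps.
frame_ms = {steps: 1000.0 / fps for steps, fps in fps_by_steps.items()}
for steps, ms in frame_ms.items():
    print(f"{steps} step(s): {ms:.1f} ms per frame")

# Extra time added by each additional step.
added_ms = [frame_ms[s + 1] - frame_ms[s] for s in (1, 2, 3)]
print([f"{ms:.2f}" for ms in added_ms])
```

The frame time rises from about 30.6 ms at one step to about 65.9 ms at four, so under these conditions each additional step costs roughly 11 to 13 ms.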

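Differences between the formats specified with output_type in postprocess_image()

"pil": 21.26 fps
  → displayed after conversion to OpenCV format
"np": 23.91 fps
  → OpenCV conversion: image = np.array(image)[:, :, ::-1]
x_output is a Tensor, so np/OpenCV has the advantage.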

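TensorRT build time and engine size

Build time
About 10 minutes
Size of the engine folder after the build
5.2 GB
If you keep creating engines with different settings, the remaining SSD space shrinks before your eyes.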

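Pillow or OpenCV

This becomes a concern when turning the pipeline into a server.
Pillow: easy to handle but large. An advantage for i2i input.
ndarray: can be used as i2i input as is. Not particularly large.
OpenCV: data is 0 to 255; i2i input requires conversion to a normalized ndarray;
        the most compact.
When you build an API and send and receive images over the network, large payloads mean more communication overhead, and the overhead is especially large with POST/GET. The normalized ndarray is at a disadvantage because each element is larger than in an OpenCV array.
Measured values
For i2i with images sent and received as input and output:
POST via FastAPI: 8 ms
TCP/IP packets: 3 ms
The data is in OpenCV format (0 to 255). When you are working to shave off milliseconds, this difference is very large.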
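
The element-size difference can be made concrete with a little arithmetic for an uncompressed 512x512 RGB image. The element sizes are standard: uint8 is 1 byte, float32 is 4 bytes, and float64, which is what dividing a uint8 array by 255.0 produces in NumPy, is 8 bytes. payload_bytes is a hypothetical helper name.

```python
def payload_bytes(width, height, channels=3, bytes_per_element=1):
    """Raw size in bytes of an uncompressed image array."""
    return width * height * channels * bytes_per_element


sizes = {
    "uint8 (OpenCV, 0-255)": payload_bytes(512, 512, bytes_per_element=1),
    "float32 (normalized)": payload_bytes(512, 512, bytes_per_element=4),
    "float64 (result of image/255.0)": payload_bytes(512, 512, bytes_per_element=8),
}
for name, size in sizes.items():
    print(f"{name}: {size:,} bytes ({size // 1024} KiB)")
```

A raw uint8 frame is 768 KiB, while the same frame as normalized float64 is 6,144 KiB, which is why it pays to transmit the 0 to 255 data and normalize on the receiving side.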

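Summary so far

There is still plenty to try, but I will pause here and move on to writing about the API I was able to implement during these tests.
Improvements will surely continue. To be honest, I am not confident at this point that everything written here is correct. Please bear with me if anything is wrong.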

Appendix: test code

t2i test code

import torch
from diffusers import AutoencoderTiny, StableDiffusionPipeline
from diffusers.utils import load_image

from streamdiffusion import StreamDiffusion
from streamdiffusion.image_utils import postprocess_image

import numpy as np
import time
import cv2
from PIL import Image

#--- Initialize pipe with Diffusers
pipe = StableDiffusionPipeline.from_single_file(
    "/home/animede/auto1111/models/Counterfeit-V3.0/Counterfeit-V3.0_fix_fp16.safetensors").to(
    device=torch.device("cuda"),
    dtype=torch.float16,
)
#--- Unofficial way of loading your own LoRA
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5", adapter_name="lcm")  # LCM LoRA for Stable Diffusion 1.5
pipe.load_lora_weights("./models/LoRA/megu_sports_v02.safetensors", adapter_name="papercut")  # your own trained LoRA
pipe.set_adapters(["lcm", "papercut"], adapter_weights=[1.0, 0.5])

#--- Choose the RCFG type
#cfg_type = "none"
#cfg_type = "full"
cfg_type = "self"
#cfg_type = "initialize"

#--- Choose t_index_list
#index_list = [40]
#index_list = [0, 45]
#index_list = [38, 40, 42, 45]
#index_list = [20, 30, 40]
#index_list = [35, 40, 42, 45]
#index_list = [41, 42, 44, 45]  # cam
index_list = [0, 16, 32, 45]  # t2i

# ---Wrap the pipeline in StreamDiffusion
stream = StreamDiffusion(
    pipe,
    t_index_list=index_list,
    torch_dtype=torch.float16,
    cfg_type=cfg_type,
    width =512,
    height = 512,
    #height = 768,  # with TensorRT enabled, mind the size: 512x512 only
)

# ---If the loaded model is not LCM, merge LCM
stream.load_lcm_lora()
stream.fuse_lora()
# --- Official way of loading your own LoRA
#stream.load_lora("./models/LoRA/megu_sports_v02.safetensors")
#stream.fuse_lora(lora_scale=0.1)

# --- Use Tiny VAE for further acceleration
stream.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd").to(device=pipe.device, dtype=pipe.dtype)

# --- Enable acceleration (enable one of the following)
#>>> with xformers
#pipe.enable_xformers_memory_efficient_attention()
#>>> with TensorRT
from streamdiffusion.acceleration.tensorrt import accelerate_with_tensorrt
stream = accelerate_with_tensorrt(stream,  "engines_t2i_4_m_t2i_self_tate",  max_batch_size=4,) #Step =3

# --- Prepare the stream with prompt
#>>> For a fixed prompt
#prompt = "masterpiece, best quality, 1girl, solo, long hair,  white shirt, serafuku,  brown hair,looking at viewer,blush,smile,bangs,blue eyes,simple background,\
#                            t-shirt,white background,closed mouth,standing,white t-shirt,shorts,short shorts,headphones,black shorts,light brown hair,blue shorts ,running"
#>>> Initial prompt for the dynamic prompt
prompt = "masterpiece, best quality, 1girl,"
prompt_list=[
            "1girl","long hair,","white shirt,","serafuku,","brown hair,","looking at viewer,","blush,","smile,", "bangs,","blue eyes,","simple background,", "t-shirt,",\
             "white background,","walk a  head,","white background,","walk a  head,","white background,","walk a  head,","white background,","walk a  head,","white background,"]

# --- Precomputation
stream.prepare(prompt,
               guidance_scale = 1.0,
               seed=1,
               )

# ---Warmup >= len(t_index_list) x frame_buffer_size
for _ in range(len(index_list)):
    stream()



# --- Prepare threaded display (optional)
#>>> Define the display thread
t2i_img_flag = False
import threading
def disp_t2i():
    global t2i_img , t2i_img_flag
    while True:
        if t2i_img_flag==True:
             cv2.imshow("i2i_t", t2i_img)
             cv2.waitKey(1)
             t2i_img_flag=False
        time.sleep(0.005)
#>>> Start the display thread
thread = threading.Thread(target=disp_t2i, name='t2i',daemon = True)
thread.start()
# --- End of threaded-display preparation



# --- Run the stream infinitely
for i in range(len(prompt_list)):
    start_time = time.time()
    # Change the prompt dynamically (not needed with a fixed prompt)
    prompt=prompt+prompt_list[i]
    stream.update_prompt(prompt)

    # Generate an image with stream.txt2img(), output_type="np"
    x_output = stream.txt2img()
    image=postprocess_image(x_output, output_type="np")[0]
    image_out_cv=(np.array(image) * 255).round().astype(np.uint8)  # 0-255 RGB copy; np output is normalized to 0-1
    image = np.array(image)[:, :, ::-1]  # RGB to BGR for OpenCV

    # --- Output method: choose one
    #>>> with cv2.imshow
    cv2.imshow("t2i",image)
    cv2.waitKey(1)
    #>>> with out=thread (force-quit with Ctrl-Z at the end)
    #t2i_img = image
    #t2i_img_flag = True

    end_time=time.time()
    print("generation time",end_time- start_time)
    print("i-fps",1/(end_time- start_time))
# Show the last generated image until a key is pressed (for output_type="np" with cv2.imshow)
cv2.imshow("t2i",image)
cv2.waitKey()
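
The display thread in the test code above receives frames through a global variable and a flag polled every few milliseconds. One possible alternative, shown here only as a sketch, is a small "latest frame" holder built on threading.Condition, which wakes the display thread as soon as a frame arrives and lets it shut down cleanly. LatestFrame and display_loop are hypothetical names, and strings stand in for images so the example runs without OpenCV.

```python
import threading


class LatestFrame:
    """Hold only the newest frame; older undisplayed frames are dropped."""

    def __init__(self):
        self._cond = threading.Condition()
        self._frame = None
        self._closed = False

    def put(self, frame):
        with self._cond:
            self._frame = frame  # overwrite any frame not yet displayed
            self._cond.notify()

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def get(self):
        """Block until a frame is available; return None once closed and empty."""
        with self._cond:
            while self._frame is None and not self._closed:
                self._cond.wait()
            frame, self._frame = self._frame, None
            return frame


shown = []


def display_loop(box):
    while True:
        frame = box.get()
        if frame is None:
            break
        shown.append(frame)  # real code: cv2.imshow("t2i", frame); cv2.waitKey(1)


box = LatestFrame()
worker = threading.Thread(target=display_loop, args=(box,), daemon=True)
worker.start()
for i in range(5):
    box.put(f"frame-{i}")  # stand-in for the generated image
box.close()
worker.join(timeout=1.0)
print(shown[-1])  # frame-4
```

If generation is faster than display, intermediate frames are skipped rather than queued, so the generation loop is never held back by the display.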

i2i test code

import torch
from diffusers import AutoencoderTiny, StableDiffusionPipeline,StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image,make_image_grid

from streamdiffusion import StreamDiffusion
from streamdiffusion.image_utils import postprocess_image

import numpy as np
import time
import cv2
from PIL import Image,ImageOps
#pipe = StableDiffusionImg2ImgPipeline.from_single_file(
# or
pipe = StableDiffusionPipeline.from_single_file(
    "/home/animede/auto1111/models/Counterfeit-V3.0/Counterfeit-V3.0_fix_fp16.safetensors").to(
    device=torch.device("cuda"),
    dtype=torch.float16,
)

#--- Unofficial way of loading your own LoRA
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5", adapter_name="lcm")  # LCM LoRA for Stable Diffusion 1.5
pipe.load_lora_weights("./models/LoRA/megu_sports_v02.safetensors", adapter_name="papercut")  # your own trained LoRA
pipe.set_adapters(["lcm", "papercut"], adapter_weights=[1.0, 0.8])

#--- Choose the RCFG type
# RCFG Onetime-Negative
#cfg_type = "none"
#cfg_type = "full"
cfg_type = "self"
#cfg_type = "initialize"

#--- Choose t_index_list
#index_list = [40]
#index_list = [32, 45]
#index_list = [38, 40, 42, 45]
#index_list = [20, 30, 40]
#index_list = [35, 40, 42, 45]  # cam2
#index_list = [41, 42, 44, 45]  # cam
index_list = [30, 40, 42, 45]

# ---Wrap the pipeline in StreamDiffusion
stream = StreamDiffusion(
    pipe,
    t_index_list=index_list,
    torch_dtype=torch.float16,
    cfg_type=cfg_type,
    width =512,
    height = 512,
    #height = 768,  # with TensorRT enabled, mind the size: 512x512 only
)

# --- If the loaded model is not LCM, merge LCM
stream.load_lcm_lora()
stream.fuse_lora()
# --- Official way of loading your own LoRA
#stream.load_lora("./models/LoRA/megu_sports_v02.safetensors")
#stream.fuse_lora(lora_scale=1.0)

# Use Tiny VAE for further acceleration
stream.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd").to(device=pipe.device, dtype=pipe.dtype)

# --- Enable acceleration (enable one of the following)
#>>> with xformers
#pipe.enable_xformers_memory_efficient_attention()
#>>> with TensorRT
from streamdiffusion.acceleration.tensorrt import accelerate_with_tensorrt
stream = accelerate_with_tensorrt(stream,  "engines_i2i_4_m_cfg_full",  max_batch_size=4,) #Step =3

# --- Prepare the input image
init_image = load_image("pose2.png").resize((512, 512))
image_pil=init_image
image_pil.show()
#>>>for OpenCV input
imgeCV=np.array(init_image , dtype=np.uint8)
imgeCV= np.array(imgeCV)[:, :, ::-1]
imgeCV=imgeCV/255.0  # convert to 0-1 floats (ndarray)

#--- Prepare the stream with prompt
#>>> Initial prompt for the dynamic prompt
prompt = "masterpiece, best quality, 1girl,"
prompt_list=[
            "1girl","long hair,","white shirt,","serafuku,","brown hair,","looking at viewer,","blush,","smile,", "bangs,","blue eyes,","simple background,", "t-shirt,",\
             "white background,","walk a  head,","white background,","walk a  head,","white background,","walk a  head,","white background,","walk a  head,","white background,"]
#>>> For a fixed prompt
#prompt = "masterpiece, best quality, 1girl, solo, long hair,  white shirt, serafuku,  brown hair,looking at viewer,blush,smile,bangs,blue eyes,simple background,t-shirt,white background,closed mouth,standing,white t-shirt,shorts,short shorts,headphones,black shorts,light brown hair,blue shorts ,running"

# --- Precomputation
stream.prepare(prompt,
               guidance_scale = 1.0,
               seed=1,
               )

# ---Warmup >= len(t_index_list) x frame_buffer_size
for _ in range(len(index_list)):
    stream(imgeCV)



# --- Prepare threaded display (optional)
#>>> Define the display thread
import threading
i2i_img_flag=False
def disp_org():
    global  i2i_img , i2i_img_flag
    while True:
        if i2i_img_flag==True: #When  IN=CV2
             cv2.imshow("i2i_t",i2i_img)
             cv2.waitKey(1)
             i2i_img_flag=False
        time.sleep(0.002)
#>>> Start the display thread
thread = threading.Thread(target=disp_org, name='disp_org',daemon = True)
thread.start()
# --- End of threaded-display preparation




# --- Run the stream infinitely
gen_count=100
for i in range(gen_count):
    start_time = time.time()
    """
    # Change the prompt dynamically (not needed with a fixed prompt)
    prompt=prompt+prompt_list[i]
    stream.update_prompt(prompt)
    """

    #>>> IN=PIL, out=cv2.imshow

    x_output = stream(image_pil)
    image=postprocess_image(x_output, output_type="np")[0]
    image_out_cv=(np.array(image) * 255).round().astype(np.uint8)  # 0-255 RGB copy; np output is normalized to 0-1
    image = np.array(image )[:, :, ::-1]  # RGB to BGR for OpenCV
    cv2.imshow("i2i_c",image )
    cv2.waitKey(1)


    #>>> IN=ndarray, out=cv2.imshow
    """
    x_output = stream(imgeCV)
    image=postprocess_image(x_output, output_type="np")[0]
    image_out_cv=(np.array(image) * 255).round().astype(np.uint8)
    image = np.array(image )[:, :, ::-1]
    cv2.imshow("i2i_c",image )
    cv2.waitKey(1)
    """

    #>>> IN=ndarray, out=thread
    """
    x_output = stream(imgeCV)
    image=postprocess_image(x_output, output_type="np")[0]
    image_out_cv=(np.array(image) * 255).round().astype(np.uint8)
    image = np.array(image )[:, :, ::-1]
    i2i_img =image
    i2i_img_flag=True
    """

    #>>> IN=PIL, out=thread
    """
    x_output = stream(image_pil )
    image=postprocess_image(x_output, output_type="np")[0]
    image_out_cv=(np.array(image) * 255).round().astype(np.uint8)
    image = np.array(image )[:, :, ::-1]
    i2i_img =image
    i2i_img_flag=True
    """

    end_time=time.time()
    print("generation time",end_time- start_time)
    print("i-fps",1/(end_time- start_time))
#imgCV_RGB = (np.array(image) * 255).round().astype(np.uint8)
#imgCV_BGR = np.array(imgCV_RGB)[:, :, ::-1]
#cv2.imshow("i2i",imgCV_BGR)
#cv2.waitKey()