Embedding a GitHub Repository and Having It Answer Questions [Open Interpreter]

ニケちゃん

October 21, 2023, 19:54

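Yesterday I wrote an article on this same kind of task.

I then heard that a repository doing something similar had been presented at the Open Interpreter hackathon, so I tried it out.

For the other hackathon entries, see Abe-san's article.
(I should have submitted my Bishoujo OPInterpreter too...)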

How to do it

Everything is in the README, but I'll walk through it with a few comments of my own.
Note that the documented steps did not work on my Mac, so I ran everything on Windows.

First, clone the repository and install its dependencies with pip.

git clone https://github.com/Raghavan1988/CodeBaseBuddy.git
 
cd CodeBaseBuddy
 
pip install -r requirements.txt

Next, clone the repository you want to embed. You pass its location as an argument later, so it can be cloned anywhere.

git clone https://github.com/run-llama/llama_index.git

Set your OpenAI API key as an environment variable.

export OPENAI_API_KEY=
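
The later steps depend on this variable being visible to the Python process, so it can be worth checking before starting a long embedding run. The following is only a minimal sketch: the helper name `require_api_key` is hypothetical and is not part of CodeBaseBuddy.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key found in `env`, or raise if it is missing or empty."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set in this shell session")
    return key
```

Calling `require_api_key()` with no arguments reads the real environment; passing a plain dict instead makes the check easy to try without touching your shell.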

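Now run the embedding. The first argument is the name of the file to create, and the second is the target repository you just cloned. It takes a fair amount of time, so be patient.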

python build_embeddings.py llama_index ./llama_index
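
To give a feel for what a step like this does conceptually, here is a small sketch that walks a directory, turns each file into a vector, and saves the result. I have not checked how `build_embeddings.py` is actually implemented, so everything here is an assumption for illustration: `toy_embed` is a hashing stand-in for a real embedding model (no OpenAI call is made), and `build_index`, `DIM`, and the JSON output format are hypothetical.

```python
import hashlib
import json
import math
import os

DIM = 64  # size of the toy embedding vector (arbitrary choice for this sketch)


def toy_embed(text, dim=DIM):
    """Hash each whitespace-separated token into a bucket, then L2-normalize."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(repo_dir, out_path, suffix=".py"):
    """Embed every file under repo_dir whose name ends with suffix.

    Writes a {path: vector} mapping to out_path as JSON and also returns it.
    """
    index = {}
    for root, _dirs, files in os.walk(repo_dir):
        for name in files:
            if name.endswith(suffix):
                path = os.path.join(root, name)
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    index[path] = toy_embed(fh.read())
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(index, fh)
    return index
```

A real run spends most of its time on the embedding calls themselves, which is presumably why this step is slow on a repository the size of llama_index.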

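Once that finishes, send a query.
The first argument is the query (your question).
The second argument is the search depth, which specifies how many of the top search results to return.
The third argument is the file name you specified in the embedding step above.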
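
The "top results" idea can be sketched as a plain top-k search by cosine similarity. This is an illustration under assumptions, not the code in `search.py`: the function names and the tiny hand-written vectors are hypothetical. Note also that in the run below a depth of 5 lists more than five files, so the script's real mapping from depth to result count may differ from this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec, index, k):
    """Return the k paths whose vectors are most similar to query_vec."""
    scored = sorted(
        index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [path for path, _vec in scored[:k]]


# Hypothetical three-file index with hand-written vectors.
index = {
    "indices/__init__.py": [1.0, 0.0, 0.0],
    "objects/__init__.py": [0.8, 0.6, 0.0],
    "cli/__init__.py": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.1, 0.0], index, 2))
# prints ['indices/__init__.py', 'objects/__init__.py']
```

The contents of the files selected this way are what get handed to Open Interpreter together with the question.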

> python search.py "Explain each LlamaIndex class? in Japanese" 5 llama_index                      

▌ A new version of Open Interpreter is available.

▌ Please run: pip install --upgrade open-interpreter

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Files you might want to read:
./llama_index\llama_index\indices\__init__.py
./llama_index\llama_index\objects\__init__.py
./llama_index\llama_index\indices\registry.py
./llama_index\llama_index\data_structs\registry.py
./llama_index\llama_index\langchain_helpers\agents\toolkits.py
./llama_index\experimental\cli\__init__.py
./llama_index\llama_index\agent\react\__init__.py
./llama_index\llama_index\bridge\__init__.py
./llama_index\llama_index\evaluation\retrieval\__init__.py
./llama_index\llama_index\indices\query\__init__.py
open interpreter's recommendation

▌ Model set to GPT-4

Open Interpreter will require approval before running code. Use interpreter -y to bypass this.

Press CTRL-C to exit.


  Firstly, to interpret the python code and classes, we need an understanding of the purpose and structure of the LlamaIndex. It seems to be a structure for handling and indexing      
  data, particularly structured information. We have different kinds of Indexes, Tools, and Toolkits in the context.

  Here is a brief overview of the classes found in the provided code, and the corresponding explanations in Japanese:

   1 DocumentSummaryIndex: This class likely deals with the indexing of the summary of documents. (Japanese interpretation:
     このクラスは、おそらくドキュメントのサマリーのインデキシングを扱います。)
   2 KeywordTableIndex, RAKEKeywordTableIndex, SimpleKeywordTableIndex: These set of classes are possibly handling the indexing associated with keywords in some form of table.
     (Japanese interpretation: これらのクラスのセットは、テーブルの中のキーワードに関連するインデキシングを扱っている可能性があります。)
   3 ListIndex, GPTListIndex, SummaryIndex: As from the name, These classes probably work with some form of list indexing. (Japanese interpretation:
     名前から推測すると、これらのクラスは何らかの形のリストインデキシングを行いそうです。)
   4 VectorIndex: This class probably deals with creating indexes for vectors. (Japanese interpretation:
     このクラスでは、ベクトルのためのインデックスを作成することを扱っている可能性があります。)
   5 TreeIndex, GPTTreeIndex: These classes likely work with some form of tree indexing. (Japanese interpretation: これらのクラスは何らかの形のツリーインデキシングを行いそうです。)    
   6 ObjectRetriever, ObjectIndex: These classes possibly work with retrieving and indexing objects. (Japanese interpretation:
     これらのクラスは、オブジェクトの取得とインデキシングを行う可能性があります。)
   7 BaseIndex: This class might be a base (or parent) class for all other Index classes providing some common functionality. (Japanese interpretation:
     このクラスは、他のすべてのインデックスクラスに共通の機能を提供するベース（または親）クラスである可能性があります。)
   8 LlamaToolkit: This class possibly provides a toolkit for interacting with Llama indices. (Japanese interpretation:
     このクラスは、Llamaインデックスと対話するためのツールキットを提供する可能性があります。)

  To provide a more accurate description, we need more information about the LlamaIndex project and the context in which these classes are used.

interpreter's recommendation done. (Risk: LLMs are known to hallucinate)

It worked!!!!

Comparison

Let's compare this with the method from my previous article.
Previous => LlamaIndex
This time => Open Interpreter

Setup

LlamaIndex ≒ Open Interpreter
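LlamaIndex requires obtaining a GitHub token, while the Open Interpreter approach requires cloning the target repository. Each adds a step, but neither is much of a hassle.
That said, having to keep a local clone around for Open Interpreter may make things slightly more tedious to manage.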

Embedding speed

LlamaIndex ≒ Open Interpreter

Query response speed

LlamaIndex < Open Interpreter
Open Interpreter felt faster; LlamaIndex seems to spend time loading the index.

Answer quality

LlamaIndex > Open Interpreter
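This may partly come down to preference, but LlamaIndex more often returned the answer I was looking for.
This repository also sometimes failed to give the answer I wanted on the first try.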

Overall

LlamaIndex > Open Interpreter
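My impression is that this repository comes out slightly behind.
However, Open Interpreter seems to be designed around solving problems through conversation in the first place, so I suspect that modifying the script to work in a dialogue-based way would improve the results.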

Modification

So, let's modify it.
search.py is the script that gets executed, so I'll make a rough change to it.

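# Change the last few lines of search.py as follows.
# (query, file_content and interpreter are already defined earlier in the script.)

print("open interpreter's recommendation")

message = "I have a task to complete. Please help with the task below and answer my question. Task : READ THE FILE content below and their paths and answer " + query + "\n" + file_content

interpreter.auto_run = True

# First turn: send the task together with the retrieved file contents.
interpreter.chat(message)

# Follow-up turns: keep chatting until the user enters an empty line or "exit".
# (Without an exit condition the final print below would never be reached.)
while True:
    message = input("Enter your message: ")
    if message.strip().lower() in ("", "exit"):
        break
    interpreter.chat(message)

print("interpreter's recommendation done. (Risk: LLMs are known to hallucinate)")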

Now it works in a conversational style. This is what Open Interpreter is all about.

> python search.py "Explain each LlamaIndex class in llama_index folder? in Japanese" 5 llama_index

▌ A new version of Open Interpreter is available.

▌ Please run: pip install --upgrade open-interpreter

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Files you might want to read:
./llama_index\llama_index\indices\__init__.py
./llama_index\llama_index\objects\__init__.py
./llama_index\llama_index\indices\registry.py
./llama_index\llama_index\__init__.py
./llama_index\llama_index\data_structs\registry.py
./llama_index\llama_index\indices\managed\__init__.py
./llama_index\llama_index\indices\empty\__init__.py
./llama_index\llama_index\llm_predictor\vellum\__init__.py
./llama_index\llama_index\indices\managed\vectara\__init__.py
open interpreter's recommendation


  From a look into the directories and the various import statements, the codebases consist of several classes that represent the data structures and functionalities of a LlamaIndex.
  Let's take a more detailed look into each of these classes. The explanations will be in Japanese as requested.

    1 DocumentSummaryIndex: ドキュメントの要約をインデックスするためのクラス。要約に基づいてドキュメントを検索したり操作したりする機能を提供します。
    2 KeywordTableIndex, GPTKeywordTableIndex, RAKEKeywordTableIndex, SimpleKeywordTableIndex, GPTSimpleKeywordTableIndex, GPTRAKEKeywordTableIndex:
      キーワードテーブルを管理するためのクラス群。キーワードの集合を表現し、それらに対する検索や操作を行います。これらはキーワードの抽出方法や管理方法に違いがあります。
    3 ListIndex, GPTListIndex, and SummaryIndex: リストや要約の管理および操作を行うクラスです。
    4 TreeIndex, GPTTreeIndex: 木構造データを管理するためのクラス。木構造に基づいてデータを配置し、検索や操作を行います。
    5 VectaraIndex: Vectaraというベクター管理システムに関連する機能を提供するクラス。
    6 ObjectIndex, ObjectRetriever: オブジェクトの索引作成や検索を行うためのクラス。
    7 SimpleObjectNodeMapping, SQLTableNodeMapping, SimpleToolNodeMapping, SimpleQueryToolNodeMapping: データベーステーブルやツールへのマッピングを行う機能を提供するクラス。
    8 IndexStructType, BaseIndex, EmptyIndex, KnowledgeGraphIndex, SQLStructStoreIndex, PandasIndex:
      インデックスの基本型を定義した基本クラス、空のインデックス、知識グラフ、各種データストアへのインデックス。
    9 ComposableGraph, ServiceContext, Prompts: インデックスの拡張機能やサービス、プロンプトメッセージの設定などを行うクラス。
   10 llm_predictor, vellum: 言語モデルやVellumという予測ツールに関連するクラス。

  Please note that some of these are parent classes for others, and many of them are likely related or depend on each other in various ways. Each class represents different aspects    
  or functionalities of the LlamaIndex.  
 
Enter your message: DocumentSummaryIndexはどのようにつかいますか？

  DocumentSummaryIndex は、ドキュメントの要約を管理およびインデクシングするためのラマインデクサ（LlamaIndex）のクラスです。
  このクラスは、さまざまな情報を扱うために使用されますが、主に大量の書かれたテキストから情報の要約を生成し、利用者が短時間でその内容を 把握できるようにするためのものです。


  使用方法

  このクラスの具体的な使用は、開発目的や使用環境によりますが、基本的な利用法は以下の通りです。

   1 インスタンスの生成 : DocumentSummaryIndex クラスのインスタンスを生成します。これは通常、特定のデータセットに対して行います。

                                                                                                                                                                                        
   dsi = DocumentSummaryIndex()                                                                                                                                                         
                                                                                                                                                                                        

   2 データの追加 : 要約するドキュメントのデータを追加します。これは通常、文書のリストやテキストファイルの形式で行われます。

                                                                                                                                                                                        
   dsi.add_documents(["document1", "document2", "document3"])                                                                                                                           
                                                                                                                                                                                        

   3 要約の取得 : 特定のキー（たとえば、ドキュメントのIDや名前）に関連付けられたドキュメントの要約を取得します。

                                                                                                                                                                                        
   summary = dsi.get_summary("document1")                                                                                                                                               
                                                                                                                                                                                        

  上記のコードは、実際の DocumentSummaryIndex の使用方法の一例です。 コードの詳細は、LlamaIndexの実装や使用目的により異なる可能性があります。言い換えれば、特定の目的に合わせて
  DocumentSummaryIndex クラスをカスタマイズすることが可能です。
 
Enter your message:

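Looks good.
As the conversation goes on, the actual repository contents eventually drop out of the conversation history, so adding a step that feeds them in again when that happens would probably help.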
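
One simple way to approximate that idea is to re-send the file contents every few turns. The sketch below is an assumption-laden illustration rather than tested integration code: `chat_with_refresh` and `refresh_every` are hypothetical names, and a fixed turn count is used as a crude stand-in for detecting that the context has actually been dropped. The `chat` argument is any callable that accepts a message, so `interpreter.chat` from the modified search.py could be passed in.

```python
def chat_with_refresh(chat, first_message, context, user_inputs, refresh_every=3):
    """Send first_message, then each follow-up message from user_inputs.

    On every `refresh_every`-th follow-up turn, `context` is appended to the
    message so the repository contents re-enter the conversation history.
    """
    chat(first_message)
    for turn, text in enumerate(user_inputs, start=1):
        if turn % refresh_every == 0:
            text = text + "\n\nReference files (re-sent):\n" + context
        chat(text)


# Demo with a fake chat function that just records what would be sent.
sent = []
chat_with_refresh(sent.append, "task + files", "<file contents>", ["q1", "q2", "q3"])
print(len(sent))  # prints 4
```

For interactive use, `user_inputs` could be `iter(lambda: input("Enter your message: "), "")`, which yields messages until an empty line is entered.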
