[GPTs Development Diary] (7) Paper Summary GPT

ai300lab

April 5, 2024, 06:51

Introduction

With video generation AIs such as Lumiere, Sora, and Genie arriving one after another, I worked on a GPT that summarizes papers on video generation AI.

Overview

Feature overview

Extract paper URLs, read each paper, and compile its summary, novelty, and related work.

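Development background

To prepare for reading video generation AI papers, I studied the instructions of existing paper-summarizing GPTs. Many of them were quite simple, so I decided to build one myself.
I thought it would be handy to summarize many papers with little effort, but it proved unexpectedly difficult.
The difficulties were as follows: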

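Building a list of arXiv paper URLs was hard
- I expected this to be easy, but simply asking for the URL list apparently pulls in unwanted derivative URLs as well, and the GPT seemed to fall into an infinite loop.
Reading the papers from the URL list was hard
- This appears to be because ChatGPT restricts its browsing feature to prevent illegal or otherwise problematic behavior such as scraping.
- Scraping also looked like an option, but it did not work well. Because ChatGPT plugins were being discontinued in March 2024, I gave up pursuing it.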
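
One way to keep derivative URLs out of the list is to normalize every collected link to a single canonical form per paper. The following is only a minimal sketch: the function name `canonical_arxiv_urls` and the assumed link shapes (`/abs/` or `/pdf/`, an optional version suffix, an optional `.pdf`) are my own illustration, not something the GPT does internally.

```python
import re

# Assumed shape of an arXiv paper link: /abs/ or /pdf/, a new-style ID
# (YYMM.NNNNN), an optional version such as v2, and an optional ".pdf".
ARXIV_LINK = re.compile(
    r"^https?://arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$"
)


def canonical_arxiv_urls(links):
    """Return one https://arxiv.org/abs/<id> URL per paper, in first-seen order."""
    seen = set()
    result = []
    for link in links:
        match = ARXIV_LINK.match(link.strip())
        if match is None:
            continue  # not a paper link (listing page, other site, ...)
        paper_id = match.group(1)
        if paper_id not in seen:
            seen.add(paper_id)
            result.append(f"https://arxiv.org/abs/{paper_id}")
    return result
```

Because every variant of a paper collapses to the same ID, the list cannot grow without bound no matter how many derivative links the page contains.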

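Instructions

In the end this did not work out, but I did manage to collect the URLs from a GitHub page that curates arXiv papers, using the regular-expression prompt shown at the end of this section.

Reading papers in bulk seems to be refused, but a single paper appears to work.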

Read the file at the following URL and tell me the title and summary
### URL ### 
https://arxiv.org/pdf/2306.02781.pdf

When several URLs are listed, it reads only one of them.

Read the file at the following URL and tell me the title and summary
### URL ###
https://arxiv.org/pdf/2306.02781.pdf
https://arxiv.org/pdf/2304.08818.pdf

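For some reason, this version worked. The only differences are the word "each" in the request and the plural "URLs" in the header.

Read the files at the following URLs and tell me the title and summary of each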
### URLs ###
https://arxiv.org/pdf/2306.02781.pdf
https://arxiv.org/pdf/2304.08818.pdf
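
If the multi-URL prompt has to be produced for a longer list, it could be generated in small batches. This is a rough sketch under my own assumptions: the helper names `to_pdf_url` and `build_prompts` are hypothetical, the prompt wording is an English rendering of the one that worked above, and the default batch size of 2 is only a guess based on the two-URL case, since large batches were refused.

```python
def to_pdf_url(url):
    """Convert an arXiv abs URL to the .pdf form used in the prompts above."""
    if "/abs/" in url:
        url = url.replace("/abs/", "/pdf/", 1)
    if not url.endswith(".pdf"):
        url += ".pdf"
    return url


def build_prompts(urls, batch_size=2):
    """Split urls into batches and build one multi-URL prompt per batch."""
    header = (
        "Read the files at the following URLs and "
        "tell me the title and summary of each"
    )
    prompts = []
    for start in range(0, len(urls), batch_size):
        batch = [to_pdf_url(u) for u in urls[start:start + batch_size]]
        prompts.append("\n".join([header, "### URLs ###", *batch]))
    return prompts
```

Each returned string can be pasted into the chat as one message. Whether ChatGPT actually reads every URL in a batch still has to be checked by hand.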

Building the list took priority. I found the source file by searching for "awesome video generative AI github". The input was the HTML file of https://github.com/filipecalegario/awesome-generative-ai, from which 146 URLs were extracted.


Read the file,
extract every string that matches the regular expression "https://arxiv.org/abs/[0-9\.]*",
and enter the results into an Excel sheet for download.
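
The same extraction can be run locally. The sketch below is only an approximation of what the prompt asks for, with two deliberate changes that are my own assumptions: the pattern is tightened so that it cannot match an empty ID or a trailing dot, and the output is a CSV file (which Excel can open) rather than a native Excel sheet, so that only the standard library is needed. The function names are hypothetical.

```python
import csv
import re

# Tightened variant of the pattern in the prompt: "[0-9\.]*" would also
# accept an empty ID or a trailing dot, so require digits, a dot, digits.
ARXIV_ABS = re.compile(r"https://arxiv\.org/abs/[0-9]+\.[0-9]+")


def extract_arxiv_urls(html_text):
    """Return the unique arXiv abs URLs in html_text, in order of first appearance."""
    return list(dict.fromkeys(ARXIV_ABS.findall(html_text)))


def save_urls_to_csv(urls, path):
    """Write one URL per row, under a 'url' header, to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        writer.writerows([u] for u in urls)
```

To use it, read the saved HTML of the GitHub page into a string, pass it to `extract_arxiv_urls`, and hand the result to `save_urls_to_csv`. Duplicates are dropped, so the count may differ from the 146 reported above.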

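Future plans

For now I have a working collection of arXiv links, so I will think about the next steps at my own pace.
Even with just one paper loaded, ChatGPT answers a range of questions about it, which is reasonably useful in itself.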

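Conclusion

Storing the latest AI papers in Excel in one shot and getting answers to any question about them has, for now, remained a dream. Because such a tool could be used in ways that raise governance concerns, building it freely and without restrictions looks difficult.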

Appendix

Appendix A: Error messages refusing to browse

No. 1
The response from scraping the requested webpage for links was too large to process in one go. This usually means there are a lot of links on the page. I can try a different approach by fetching the content in smaller parts or focusing on specific sections of the page. Would you like me to try another method, or do you have a specific part of the page in mind that might contain the arXiv.org URLs?

No. 2 (with the number of URLs reduced)
It seems we're still encountering issues with the size of the response when trying to fetch all links from the page. Given the limitations, I'll adjust the strategy to directly fetch and process the page content, then extract the first 100 URLs manually from that content, focusing on identifying any arXiv.org URLs within that subset. Let's proceed with fetching the text content of the page, and I'll manually search for the URLs within that content. Please bear with me for a moment.

The text content retrieved from the webpage is extensive and covers a wide range of topics related to Generative AI, including papers, tools, applications, and educational resources. However, it does not directly provide URLs in a format that can be easily extracted programmatically from this response.

Given the nature of the content, extracting specific arXiv URLs or any URLs directly from this text would require a more nuanced approach, such as identifying patterns or specific keywords related to arXiv papers. Unfortunately, this method might not be efficient or accurate in this context due to the sheer volume and variety of the content.

If you're looking for arXiv papers mentioned in the awesome-generative-ai list, I recommend visiting the GitHub page directly and using the browser's search function (usually accessible with Ctrl+F or Cmd+F) to search for "arxiv" keywords. This approach allows you to quickly find and access any arXiv links mentioned in the repository.

If you have any specific topics or papers in mind, please let me know, and I can assist you further!
