[Idea] Automatically generating datasets for explainable LLM-based structure-property prediction models

Introduction

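Predicting physical properties from chemical structure has long been studied under the name "structure-property relationships." Over the past ten years or so, approaches such as graph neural networks have attracted attention. However, these models have known problems: their predictions are often not very accurate, and it is hard to tell why they predicted what they did.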

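With a high-performance large language model, it may be possible to build an AI that predicts experimental results both explainably and accurately. For example, it might become possible to predict a melting point from a molecular structure with high accuracy and with stated reasons.
This raises a question, though: how should we build a dataset that makes the model explainable and also improves its accuracy?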

This post describes an idea for building such a dataset.
The idea builds on the following preprint.


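Concept

  1. Give the model a question-answer pair

  2. Have it come up with the reasons for the answer

  3. If the question plus those reasons leads to a reasonably good answer, adopt them into the dataset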
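The three steps above can be sketched roughly as follows. This is only a minimal illustration, not the method of the preprint: `ask_llm` and `extract_value` are hypothetical callables that you would supply yourself (an LLM client and a number parser), and the prompt wording and the `tolerance` threshold are my own assumptions.

```python
from typing import Callable, Optional


def build_example(
    question: str,
    true_value: float,
    ask_llm: Callable[[str], str],                       # hypothetical: prompt in, text out
    extract_value: Callable[[str], Optional[float]],     # hypothetical: text in, number out
    tolerance: float,                                    # assumed acceptance threshold
) -> Optional[dict]:
    """Return a dataset record if the generated reasons lead to a good answer."""
    # Steps 1 and 2: show the question with its known answer and ask for reasons.
    rationale = ask_llm(
        f"Q: {question}\nA: {true_value}\nExplain the reasons for this answer."
    )
    # Step 3: answer again from the question plus the reasons only.
    answer = ask_llm(f"Q: {question}\nHints: {rationale}\nA:")
    predicted = extract_value(answer)
    # Reject the record when no number was found or the error is too large.
    if predicted is None or abs(predicted - true_value) > tolerance:
        return None
    return {"question": question, "rationale": rationale, "answer": true_value}
```

Passing the LLM call in as a function keeps the sketch independent of any particular API, so it can be tried with a stub first.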

I was curious whether this concept would actually work for a structure-property task, so I tried it out.

Example problem

Let's consider a question asking for the melting point of 1,4-Benzenedithiol.

The actual melting point is 98 °C (reference).

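GPT-4's answer

The answer is wrong. GPT-4 does not seem to know the melting point of this compound. In other words, it cannot simply recall the value, which makes this compound a suitable test case for the dataset-generation task.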

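Having the model come up with reasons for the prediction

Next, I give the model both the Q and the A and have it work out the reasons.

The response is below. It came back with a plausible-sounding analysis.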

To prepare quantitative hints for predicting the melting point of 1,4-Benzenedithiol, which is 97 degrees Celsius, you can consider various factors that typically influence the melting point of a compound. These hints can include:
Molecular Structure: 1,4-Benzenedithiol consists of a benzene ring with two thiol (SH) groups at the 1 and 4 positions. The presence of a benzene ring usually suggests a higher melting point due to the stability of the aromatic ring. Additionally, the thiol groups might influence intermolecular interactions.
Intermolecular Forces: The thiol groups in 1,4-Benzenedithiol can form hydrogen bonds, which are strong intermolecular forces. Stronger intermolecular forces generally lead to higher melting points.
Symmetry of the Molecule: Symmetrical molecules tend to pack better in a solid state, which can lead to higher melting points. The 1,4-substitution pattern on the benzene ring provides a degree of symmetry.
Molar Mass: Generally, compounds with higher molar masses have higher melting points. The molar mass of 1,4-Benzenedithiol can be considered moderate in comparison to other organic compounds.
Comparison with Similar Compounds: Comparing with similar compounds, such as benzene or other substituted benzenes, can give insights. For example, benzene has a melting point of 5.5 degrees Celsius, but the introduction of functional groups typically increases the melting point.
Purity of the Sample: The purity of a compound significantly affects its melting point. Impurities usually lower the melting point and broaden the melting point range.
Experimental Conditions: The method and conditions under which the melting point is measured (like atmospheric pressure) can also affect the result.
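Note that the response above opens by restating the answer ("which is 97 degrees Celsius"), while the hints reused in the next step leave that opening sentence out. If the answer stayed in the hints, the second prediction would be trivial. Below is a minimal sketch of automating that cleanup. The function name `drop_answer_sentences`, the naive sentence splitting, and the `margin` parameter are all my own assumptions, and the filter would wrongly drop useful sentences whenever an unrelated number happens to be close to the answer.

```python
import re


def drop_answer_sentences(text: str, answer: float, margin: float = 1.0) -> str:
    """Remove sentences that mention a number within `margin` of the known answer."""
    # Naive split on sentence-ending punctuation followed by whitespace,
    # so decimals such as "5.5" are not split.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", sentence)]
        if any(abs(n - answer) <= margin for n in numbers):
            continue  # this sentence would leak the answer into the hints
        kept.append(sentence)
    return " ".join(kept)
```

The `margin` is there so that a rounded restatement of the answer (97 versus 98, say) is also caught.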

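Predicting with hints

Next, I have the model answer using "the question + the hints GPT-4 produced."

A rough guess is fine, but you must always predict some numerical value.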
Q: Melting point of 1,4-Benzenedithiol?
Hints: Molecular Structure: 1,4-Benzenedithiol consists of a benzene ring with two thiol (SH) groups at the 1 and 4 positions. The presence of a benzene ring usually suggests a higher melting point due to the stability of the aromatic ring. Additionally, the thiol groups might influence intermolecular interactions. Intermolecular Forces: The thiol groups in 1,4-Benzenedithiol can form hydrogen bonds, which are strong intermolecular forces. Stronger intermolecular forces generally lead to higher melting points. Symmetry of the Molecule: Symmetrical molecules tend to pack better in a solid state, which can lead to higher melting points. The 1,4-substitution pattern on the benzene ring provides a degree of symmetry. Molar Mass: Generally, compounds with higher molar masses have higher melting points. The molar mass of 1,4-Benzenedithiol can be considered moderate in comparison to other organic compounds. Comparison with Similar Compounds: Comparing with similar compounds, such as benzene or other substituted benzenes, can give insights. For example, benzene has a melting point of 5.5 degrees Celsius, but the introduction of functional groups typically increases the melting point. Purity of the Sample: The purity of a compound significantly affects its melting point. Impurities usually lower the melting point and broaden the melting point range. Experimental Conditions: The method and conditions under which the melting point is measured (like atmospheric pressure) can also affect the result.
A:

Answer

Predicting the exact melting point of a compound like 1,4-Benzenedithiol without experimental data is challenging, but we can make an educated guess based on the given hints:

  1. Molecular Structure: The presence of a benzene ring suggests a higher melting point due to aromatic stability. Thiol groups at the 1 and 4 positions may enhance intermolecular interactions.

  2. Intermolecular Forces: Thiol groups can form hydrogen bonds, a strong intermolecular force. This typically leads to a higher melting point.

  3. Symmetry of the Molecule: The symmetrical 1,4-substitution pattern on the benzene ring aids in efficient packing in the solid state, which usually correlates with a higher melting point.

  4. Molar Mass: 1,4-Benzenedithiol has a moderate molar mass relative to other organic compounds, which could imply a moderately high melting point.

  5. Comparison with Similar Compounds: Benzene itself melts at 5.5 °C. Substituted benzenes often have higher melting points due to added functional groups. Since thiol groups can form strong hydrogen bonds, this could significantly increase the melting point compared to benzene.

  6. Purity of the Sample: Assuming a pure sample, we would expect the melting point to be at its highest and most narrow range.

  7. Experimental Conditions: Assuming standard conditions, the prediction would be more straightforward.

Considering these factors, especially the impact of hydrogen bonding and the moderate molar mass, the melting point of 1,4-Benzenedithiol is likely to be significantly higher than that of benzene, but not extremely high due to its moderate molar mass. A rough estimate might place the melting point in the range of 50 to 150 °C. However, without experimental data, this remains an educated guess.

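The prediction is a range rather than a single value, but the answer does look somewhat better: the 50 to 150 °C range contains the actual melting point.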
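To decide automatically whether a range answer like this counts as "reasonably good," one option is to parse the range and score it against the known value. The following is a minimal sketch under my own assumptions: the regular expression only handles phrasings like "50 to 150 °C," and the function names and the three metrics are illustrative, not an established criterion.

```python
import re
from typing import Optional, Tuple

# Matches "50 to 150 °C", "50-150 C", and similar (assumed answer formats).
RANGE_PATTERN = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*(?:to|-|–)\s*(-?\d+(?:\.\d+)?)\s*°?\s*C"
)


def parse_range(answer: str) -> Optional[Tuple[float, float]]:
    """Extract the first temperature range from free text, or None if absent."""
    match = RANGE_PATTERN.search(answer)
    if match is None:
        return None
    low, high = sorted((float(match.group(1)), float(match.group(2))))
    return low, high


def score_range(low: float, high: float, true_value: float) -> dict:
    """Summarize how well a predicted range matches the known value."""
    return {
        "contains_truth": low <= true_value <= high,
        "width": high - low,
        "midpoint_error": abs((low + high) / 2 - true_value),
    }
```

For the answer above and a true value of 98, the range contains the truth, has a width of 100, and has a midpoint error of 2. A very wide range contains the truth trivially, so an acceptance rule would probably also need an upper limit on the width.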

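Summary

I have the feeling that the direction of this idea is not wrong. If it is built out carefully, something reasonably good should come of it.
It seems to have the potential to become a mainstream approach in the future.
(If you start the research now and write a paper within about six months, you might still count as something of a pioneer?)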

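Possible variations include:
・Refining how the dataset is built
・Adding low-cost computed values to the input (e.g., group contribution methods, low-cost DFT)
・Combining with graph neural networks and similar models to improve recognition of the molecular structure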

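The difficulties are probably:
・How to write explanations that actually improve prediction accuracy (prompt engineering)
・The risk of generating large amounts of data with incorrect interpretations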

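Because this is fairly hands-on work that requires chemistry domain knowledge, it may also have the advantage of avoiding head-on competition with the giants led by Google and Meta (roughly, groups with overwhelming GPU power and genius engineers).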

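This looks like it will take more trial and error and manpower than you might expect, so I am also looking for people willing to do the hands-on work... lol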
