

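Google DeepMind and Google Research have published a research paper on "Med-Gemini," a medical specialization of the company's Gemini AI. Med-Gemini overcomes the benchmark ambiguity that affected earlier AI systems and achieves high accuracy in medical settings. Its distinguishing feature is advanced reasoning that draws on self-training and search. According to the research, when Med-Gemini was compared with real physicians, it outperformed them in diagnostic accuracy. Unlike AMIE, which supports communication with physicians, Med-Gemini specializes in analyzing large amounts of medical data to support diagnosis and treatment planning. In the future, it is expected to support better decision-making by healthcare professionals through comprehensive analysis of patient information.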

Google DeepMind and Google Research have released a very interesting research paper called "Capabilities of Gemini Models in Medicine."

Essentially, the paper discusses and shows how Google's Gemini model can be fine-tuned and turned into something that can be used very effectively to help out in the medical industry.


This is quite surprising because I didn't expect such a system from Google just yet, but they also released something pretty similar earlier this year.


If you remember, this is something I spoke about: AMIE, the Articulate Medical Intelligence Explorer.

It is an advanced AI research system developed by Google, and that one was released around three to four weeks ago.


It was basically designed to handle diagnostic reasoning and engage in meaningful conversations within a medical context aiming to enhance the interactions between physicians and patients as well as improve the quality and accessibility of consultations.


Basically, the reason AMIE was so good is that it used a simulated learning environment to enhance its learning: it engaged in diagnostic dialogues with AI patient simulators, allowing it to practice and continually refine its conversational and diagnostic skills.


It was actually trained on a huge diverse set of medical data including real world clinical conversations with medical reasoning scenarios.


One of the key things about AMIE is that when it was pitted against clinicians, it showed that clinicians were far more effective when they used it in the loop.


You can see on this graph that AMIE on its own performed better than the unassisted clinician. The clinician assisted by search (search is essentially just the internet) shows a big improvement, and the clinician assisted by AMIE shows an even starker increase over the unassisted clinician.


And comparing AMIE alone with the clinician assisted by AMIE shows that AMIE on its own actually did surpass the clinician.


I know this isn't Med-Gemini just yet, but what this showed us is that Google is increasing its efforts in medical research, because they're showing that AI systems like AMIE can outperform clinicians working alone.


Now for Med-Gemini itself. What we have here is the initial Gemini system.


Gemini is a family of powerful AI systems that are completely multimodal.


You can see that there are the inherited capabilities such as the advanced reasoning, the multimodal understanding, and the long context processing.


This is the base from which they developed Med-Gemini.


For medical specialization, they extended the advanced reasoning through self-training with web search integration.


They extended the multimodal understanding through fine-tuning and customized encoders, and the long context processing through chain-of-reasoning prompting.


With all of these combined, we get Med-Gemini, a version of Gemini specialized for medical applications.


It's very fascinating because, if you haven't been paying attention to this space, this is an industry that is truly about to be disrupted: the benchmarks and the applications we're seeing look pretty incredible.


One thing to look at first is the previous state of the art.


This is the accuracy of medical AI systems that can answer medical questions and queries, and you can see that from September 2021 all the way to September 2023 there has been a huge increase in what these systems can do.


Notice the jump from GPT-3.5 to Google's Med-PaLM, then to GPT-4 and Med-PaLM 2, and then to the previous state-of-the-art model that came before Med-Gemini.


Now we have a new state-of-the-art system, Med-Gemini. The state of the art before the one released today by Google (well, not actually released, only the paper was released) was GPT-4.


Not just the base version, although the GPT-4 base version was very close to Google's medical model.


It's actually GPT-4 with Medprompt, which gets 90.2 percent on MedQA, a decent benchmark for these AI systems.


GPT-4 with Medprompt had a very high score, but once again, Google has now beaten it.


The reason I'm showing you guys this setup is that it shows how crazy it actually is.


This takes us from the base level of GPT-4, and you can see all of the different things that they've added to GPT-4 in order to get the system to perform a lot better.


What's crazy is that Med-Gemini, and we're about to get into why it's so effective, doesn't use all of these elaborate techniques like the ensemble with choice shuffle.


If you don't know what that is, basically in multiple choice questions there is sometimes a bias towards the first answer option listed.


For example, if you were to ask someone what the primary gas found in Earth's atmosphere is, and you gave four answers, a lot of people might subconsciously think that the first one is correct.


Essentially what you do is shuffle the order of the options and ask the question several times.


The answer that is picked most often across those shuffled runs is the one that you use.
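
To make the choice-shuffle idea concrete, here is a minimal sketch in Python. It is not the Medprompt implementation: `ask_model` is a hypothetical callback standing in for whatever model you query, and the number of rounds and the seed are arbitrary illustration values.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def choice_shuffle_vote(
    question: str,
    options: Sequence[str],
    ask_model: Callable[[str, Sequence[str]], int],
    rounds: int = 5,
    seed: int = 0,
) -> str:
    """Ask the same question with the options in different orders and
    return the option text that was picked most often."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(rounds):
        shuffled = list(options)
        rng.shuffle(shuffled)
        # ask_model (hypothetical) returns the index it picked in the shuffled list.
        picked_index = ask_model(question, shuffled)
        # Count the vote by option text so the position no longer matters.
        votes[shuffled[picked_index]] += 1
    return votes.most_common(1)[0][0]


def knows_the_answer(question: str, shuffled: Sequence[str]) -> int:
    """Toy stand-in for a model that always finds the right option."""
    return shuffled.index("Nitrogen")


gases = ["Oxygen", "Nitrogen", "Argon", "Carbon dioxide"]
result = choice_shuffle_vote(
    "What is the primary gas in Earth's atmosphere?", gases, knows_the_answer
)
print(result)
```

Because votes are counted by option text rather than by position, a model that leans towards whichever option is listed first spreads its mistakes across different options, while its correct picks pile up on one.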


It's pretty crazy.


You can see how many different iterations that they did on top of GPT-4 to get to 90.2.


Surprisingly, Google managed to surpass this score.


You can see right here that this is where Google's Med-Gemini comes in on the MedQA at 91.1%, which is very interesting in terms of the increase.


I guess some people could argue that maybe we are starting to peter out in terms of the capabilities of Large Language Models on these benchmarks.


But I would certainly disagree because whilst yes, this might be the truth, Med-Gemini is pretty great.


One key thing you need to know about the benchmark: it says here that relabeling with expert clinicians suggests that 7.4 percent of the questions in the dataset have quality issues or ambiguous ground truth.


Essentially, one thing that has been consistently problematic in AI benchmarking is that these systems have to be evaluated on benchmarks that are pretty standard.


These benchmarks contain thousands and thousands of questions, and some of those questions are quite ambiguous or have quality issues, meaning that the AI systems can't really get those answers correct because the questions literally don't make sense.


I wish I did have some examples to show you, but just trust me when I tell you some of the questions are completely insane.


What they're stating is that about 7.4 percent of the benchmark may not be a fair test, because those questions have quality issues.


Whilst you might think that progress is going to peter out in the future, more advanced reasoning systems could take this towards 100 percent, not only because the systems get better but also because the benchmarks will work better once those quality issues are fixed.


Another thing we can see in the medical benchmarking here is that there are several categories in which Med-Gemini surpasses the previous state of the art.


The blue one here is the main focus: it is Med-Gemini, and it surpasses the previous state of the art, which is GPT-4 with Medprompt, in nearly every single category.


There are some right here where it is pretty much on par, and then there is this one on long context video.


But you can see right here it says GPT-4 results are not available due to context length limitations. On advanced text reasoning it does pretty well, and on multimodal understanding it does pretty well too.


If you just want to see how much better it is, take a look at this line: anything above it is an improvement, and in nearly all of these categories we do have a decent improvement.


This is Med-Gemini on advanced text-based reasoning tasks, and you can see that it surpasses the previous state of the art in many different categories.


One of the things they talked about is how this system compared with GPT-4 in some scenarios.


On long context reasoning, GPT-4 just didn't have the context length to support it.


It's important to know that long context reasoning, with Google's new context length, is a pretty important feature for the future because it allows more information to be processed.


In the medical industry, the more data you have, the more comprehensive a picture you have. The human body is made up of so many intricate parts that all connect together, so with a long context window you're able to fit more data in and arguably reach a better conclusion about what the diagnosis may be or what's going wrong with a certain individual's body.


That matters for the future, especially for needle-in-a-haystack tasks, where you're trying to pull a specific piece of data out of a long context, use it correctly, and reason with it correctly.


You can see that Med-Gemini surpasses the state of the art, and we can see here that it surpasses AMIE as well.


Of course some of these aren't that crazy but it seems like maybe there's going to be some advanced reasoning techniques now.


One of the things that I did actually see that was pretty cool was that the advanced reasoning that we did see was a little bit different.


They had two kind of ways that they did advanced reasoning with this and I think that this is probably what we're going to see in terms of different models that are specialized for different use cases.


In the case of Med-Gemini, they essentially had self-training and search and these were leveraged to enhance its capabilities in handling complex medical data and queries.


Let's actually dive into how these features were used by Med-Gemini.


The self-training aspect with Med-Gemini was where you use the model's own outputs to generate new training examples which are then used to further improve the model.


Apologies for that coming off.


This method is particularly beneficial for refining the model's capabilities in an area where the initial training data might be limited or lack diversity.


Essentially, the first thing they do is generate synthetic examples.


For example, Med-Gemini processes medical data or queries and then generates responses based on its current understanding.


These responses, along with the context in which they were made, serve as new training examples.


Next is refinement.


These generated examples are fed back into the training cycle, allowing Med-Gemini to learn from its own outputs.


This iterative process helps the model to continually refine its reasoning and decision-making capabilities, especially in handling complex medical scenarios.
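
As a rough illustration of that generate-then-feed-back loop, here is a minimal Python sketch. It is an assumption-heavy toy, not the paper's pipeline: `generate` is a hypothetical stand-in for the model, and keeping only outputs whose final answer matches a known label is one common filtering choice that I am assuming here for illustration.

```python
from typing import Callable, List, Tuple

# (question, reasoning, answer) triples used as training examples.
Example = Tuple[str, str, str]


def self_training_round(
    labeled_questions: List[Tuple[str, str]],
    generate: Callable[[str], Tuple[str, str]],
    train_set: List[Example],
) -> List[Example]:
    """Run one round: generate reasoning and an answer for each question,
    keep the outputs that agree with the known label (assumed filter),
    and return the enlarged training set."""
    kept: List[Example] = []
    for question, label in labeled_questions:
        reasoning, answer = generate(question)  # hypothetical model call
        if answer == label:
            kept.append((question, reasoning, answer))
    return train_set + kept


def toy_generate(question: str) -> Tuple[str, str]:
    """Toy stand-in for the model: canned outputs for two questions."""
    canned = {
        "Q1": ("Fever plus rash suggests X.", "X"),
        "Q2": ("Cough alone suggests Y.", "Y"),
    }
    return canned[question]


seed_set: List[Example] = [("Q0", "Hand-written reasoning.", "Z")]
new_set = self_training_round([("Q1", "X"), ("Q2", "W")], toy_generate, seed_set)
print(len(new_set))
```

In a real setup the enlarged set would be used to fine-tune the model before the next round; that training step is left out here.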


Then there is enhanced learning from simulations.


This is where, for Med-Gemini, simulations could involve creating scenarios where the model must interpret complex medical data from text, images, or even long medical records.


Feedback from the simulations helps the model adjust its methods for better accuracy and reliability.


Then we have search for Med-Gemini.


This is for when Med-Gemini encounters a question that it might struggle with, where it has low confidence or insufficient internal data.


It can perform a web search to gather additional information.


Here you can see the check that asks whether it is confident, and when the answer is no, the flow goes to web search.


This is an uncertainty-guided search strategy: when the model calculates that its predictions have high uncertainty, it proactively searches for more information before finalizing its response.


This method helps improve the accuracy and reliability of its outputs.


This is the loop that it conducts in order to get better results.
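
Here is a minimal Python sketch of that loop. The details are my assumptions for illustration: uncertainty is measured as the Shannon entropy of several sampled answers, the threshold and round limit are arbitrary, and `sample_answers` and `web_search` are hypothetical callbacks rather than real APIs.

```python
import math
from collections import Counter
from typing import Callable, List


def answer_entropy(answers: List[str]) -> float:
    """Shannon entropy in bits of the sampled answers (0 means full agreement)."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def answer_with_search(
    question: str,
    sample_answers: Callable[[str, str], List[str]],
    web_search: Callable[[str], str],
    threshold: float = 0.5,
    max_rounds: int = 2,
) -> str:
    """Sample answers; while they disagree too much, fetch search results,
    add them to the context, and sample again. Return the majority answer."""
    context = ""
    answers = sample_answers(question, context)
    rounds = 0
    while answer_entropy(answers) > threshold and rounds < max_rounds:
        context += "\n" + web_search(question)  # hypothetical search call
        answers = sample_answers(question, context)
        rounds += 1
    return Counter(answers).most_common(1)[0][0]


def toy_sampler(question: str, context: str) -> List[str]:
    """Toy model: unsure without context, consistent once it has search results."""
    return ["B", "B", "B", "B"] if context else ["A", "B", "C", "A"]


def toy_search(question: str) -> str:
    """Toy search that returns a fixed snippet."""
    return "snippet about the condition"


final = answer_with_search("Which diagnosis fits?", toy_sampler, toy_search)
print(final)
```

In the toy run the first samples disagree, so the loop searches once, the samples then agree, and the majority answer is returned.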


Of course, we have continuous update of the knowledge.


The ability to search and integrate information from external sources means that Med-Gemini can continuously update its knowledge base without the need for frequent retraining.


This is actually pretty crucial in the medical field where new research and clinical practices can change standard care protocols.


By combining these two approaches that we have here, it can better adapt to new or rare medical scenarios.


It has access to the latest medical information via search, which means that its data is not only based on the initial training data but also the latest studies, clinical trials, and guidelines.


Through continuous learning and adaptation, it becomes more proficient at handling diverse and complex medical queries, making it a valuable tool for medical professionals seeking AI support.


Of course, here are the video benchmarks, and this is something where it's pretty good.


We do know that Google's Gemini 1.5 Pro can actually take in, I think, an hour of video, which means it can also look at these videos and analyze the scene in a medical way.

You can see the stats right here.


Here you can see the previous state of the art where we've got Med-PaLM and GPT-4 Vision.

I'm wondering what will happen when GPT-4 gets video capabilities and how it's going to compete with Med-Gemini on this as well, which is going to be pretty crazy.

Of course, here is where we have the capabilities of Gemini models in medicine, and this is where we have the actual benchmarks.


You can see the benchmark right here: the accuracy of the clinician, then the clinician with search, then the prior state of the art, then Med-Gemini, then Med-Gemini plus search.


We can see a stark improvement like I showed you guys before with, of course, AMIE.


This is the AMIE stuff right here, is what I'm guessing.


This is, of course, the GPT-4 or AMIE.


I'm not sure which one, but what we do have here is a clear conclusion from this graph.


Just on this part, we can see that the clinician has the lowest accuracy.


If we go all the way up here, we can see that Med-Gemini plus search gives us a huge increase in terms of performance.


There is a huge gap right here, and I think it's rather impressive how AI is able to bridge this knowledge gap for clinicians and help them.


I definitely think that this is something that could provide a lot of details and help to people.


The only thing that I do think could be problematic here is that hopefully humans don't completely rely on the AI because sometimes the AI might miss certain things in diagnoses, and I think humans always have a lot more data points than an AI system.


This is a little bit off on a tangent, but it does make sense here: if you are prompting GPT-4 or any other AI system, sometimes you'll think, Why can't this system give me the output I want?


But the more information you give these systems, the better they are at giving you the output that you want. For example, if you're asking how to write a certain essay, you can say when the deadline is, what your teacher is like, the style, the length, and a few things you need.


That's something I've seen personally, and that's why I say humans would have a lot more information: a person might go to a system and be like, Oh, I have a runny nose.


But if you say, I have a runny nose, to an AI system, that's the only piece of information it has, whereas a human who's in the room with you can see how your skin looks, how you're moving around, and whether you're fatigued, and they know your age, what you look like, and your family history.


The point here is that I do think that this is going to definitely improve as we move on in the future, and hopefully this can be a very good tool for clinicians in the future.


Like I said before, whilst this is doing well, there are some problems with the benchmarks, as you can see here.


As I mentioned previously, when they revisited the MedQA benchmark, they found some problems.


Some MedQA test questions have missing information, such as figures or lab results, and potentially outdated ground truth answers.


To address these concerns, they had at least three U.S. doctors review each question to answer the question themselves, check if the original answers were still correct, and note any missing details or ambiguous elements in these questions.


They used a method called bootstrapping, where a committee of three reviewers decided if a question should be excluded due to its flaws.


Apologies for that little glitch there.


The findings were that 3.8 percent of questions were missing information, 2.9 percent had incorrect answers, and 0.7 percent were ambiguous.


Most reviewers agreed on these assessments, showing strong consensus.


Removing flawed questions improved the AI's test score from 91.1 percent to 91.8 percent accuracy.


If they used majority decisions instead of unanimous ones, which is the more relaxed criterion, the accuracy increased further to 92.9 percent by dropping about one-fifth of the problematic questions.
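
To show how the two filtering rules differ, here is a small Python sketch on made-up toy data (six invented questions, not the real MedQA results). The only numbers taken from the video are the three category percentages, which the last lines add up to confirm the 7.4 percent figure quoted earlier.

```python
from typing import List, Tuple

# Each toy question: (model answered correctly?, three reviewer flags),
# where a flag of True means that reviewer considered the question flawed.
Question = Tuple[bool, Tuple[bool, bool, bool]]


def accuracy_after_filter(questions: List[Question], min_flags: int) -> float:
    """Drop questions flagged by at least `min_flags` reviewers, then
    return the model's accuracy on the questions that remain."""
    kept = [correct for correct, flags in questions if sum(flags) < min_flags]
    return sum(kept) / len(kept)


toy_questions: List[Question] = [
    (True, (False, False, False)),
    (True, (False, False, False)),
    (True, (False, True, False)),
    (False, (True, True, False)),
    (False, (True, True, True)),
    (True, (False, False, False)),
]

raw = accuracy_after_filter(toy_questions, min_flags=4)        # nothing dropped
unanimous = accuracy_after_filter(toy_questions, min_flags=3)  # all three agree
majority = accuracy_after_filter(toy_questions, min_flags=2)   # two of three agree
print(raw, unanimous, majority)

# Category shares quoted in the video: missing info, incorrect answers, ambiguous.
flawed_share = round(3.8 + 2.9 + 0.7, 1)
print(flawed_share)
```

The majority rule drops more questions than the unanimous rule, which is why it is the more relaxed criterion and why the score on the remaining questions can rise further.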


In simple terms here, by cleaning up the test and removing or fixing flawed questions, they made the test a better tool for accurately assessing the AI's ability to handle medical queries.


Here's where we actually get into some of the dialogue examples for this AI system, and I think this is where we can see how this multimodal system works.


It's not something that's too crazy, but I think you guys can all truly understand exactly how this works.


You can see the dialogue examples, and this is where the person writes a message describing exactly what's going on.


Med-Gemini says, I understand your concern.


Can you send me a picture of whatever it is you're dealing with?


The person sends it a picture.


It queries and asks for more information.


It says, this is what I think it is.




The person asks, could you explain what this is?


It tells them exactly what it is.


It's a diagnosis.


It also adds a caveat about how a definitive diagnosis can be made.


The person says okay.


Will you advise me on how to treat this?


You can see here it's able to give out a lot of these pieces of information that could help this person to solve this issue.


Like I said before, I think this stuff is really good because with doctors you have an appointment, and you maybe have 15 to 30 minutes.


But with an AI system you could ask it a million or a billion questions.


It's going to be patient.


It's going to understand you.


It's going to be able to talk to you in many different ways, so it could easily help you more than a person could.


Whilst I do appreciate what doctors do, and I know how good they are, the only problem, and I know it sounds crazy to say this, is that they are human.


That means they have human limitations, such as time.


This is something that AI systems can help with.


Imagine a virtual AI doctor that would just talk to you about this kind of thing.


You could quickly get diagnosed.


There was some feedback from an actual dermatologist, who said that the diagnostic accuracy was impressive for a relatively rare, specialty-specific condition, given limited data of one photo and a brief description.


The reason they included this is that the condition is relatively rare and specialty-specific.


The fact is that this was just one picture.


Remember, I've always said that in the future AI systems trained on millions and millions of different images are going to be far superior to any human, because they will have seen so much data that they will instantly recognize something like that.


I think in the future, with more advanced systems, these kinds of things are going to become pretty normal.


However there were some cons.


It says additional photos of representative lesions on different extremities would strengthen the diagnosis.


It says they could include other things.


Of course it does say whilst there's no cure, it could emphasize the possibility for symptom improvement and management.


There was also another one here, and I do want to state that there are several examples in the paper that I simply just didn't want to take a look at because the pictures were pretty graphic considering they were from medical examinations like people doing open surgeries, so it was a little bit graphic and I just didn't want to include that.


But you can see here, this is where we have another dialogue example of someone getting their picture, and it says, Hello, I'm a primary care physician, and this is an x-ray for a patient of mine.


The formal radiology report is still pending, and I would like some help to understand the x-ray.


Please write a radiology report for me.


It responds to that, and then you can query it further. Here the physician says, My patient has a history of xyz, and asks whether this could explain the back pain. I do think that in the future it would be better if you could somehow load the patient's data directly rather than having to prompt for it like this.


Whilst doctors are well placed to do that, I think it would be better in the future if these AI systems had access to the patient's complete medical history, because that would allow for a much more comprehensive diagnosis.


Next is a rather interesting video example, where the prompt said, You are a helpful medical video assistant.


You are given a video and a corresponding subtitle with a start time and duration followed by a question.


Your task is to extract the precise video timestamps and then answer the question given below.


Provide one single time span that spans the entire length of the answer while considering the entire video.


It's better to be exhaustive and providing the longest time span for the answer.


How do we relieve calf strain with foam roller massage?


You can see that this is exactly where the footage is.


It says the start and the end, and it says that the ground truth time span annotation is the same as what Gemini had answered.


Overall, I think some people at the end of this might be a little bit confused between Med-Gemini and AMIE, but essentially they are quite different.


In terms of purpose, AMIE is primarily designed for improving diagnostic dialogues and reasoning within medical consultations.


It aims to simulate and support the interactive conversation part of a medical consultation, focusing on history taking, diagnostic accuracy, and patient communication.


But Med-Gemini is a more generalized AI model that excels in processing complex multimodal medical data, such as text and long medical records, and it's specialized in understanding and integrating broad medical knowledge across various formats to assist in diagnostics and treatment planning beyond the conversational capabilities.


The strength of Med-Gemini is that it can be used for long context processing, for example with long patient history records, to enable more accurate diagnoses.


AMIE is optimized for engaging patients in meaningful dialogues.


The future goals for these two kinds of systems are quite different because AMIE aims to become a virtual assistant in medical consultations, enhancing the quality of care through better communication and diagnostic support, and it seeks to basically address the conversational and empathetic aspects of medical practice.


Whereas Med-Gemini is positioned to aid in a more analytical way involving vast amounts of data, aiming to support medical professionals by providing comprehensive integrative analyses of patient information, potentially leading to more informed decisions.

I think something you also need to take into account is that once these AI systems are trained on many different languages, they will also break down barriers so that people who struggle with a certain language can get the medical care they need, because I can't imagine trying to explain something to a doctor who doesn't speak my language.


The nuances in certain things do make the difference in ensuring that you get the right medical treatment.

