
ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge

Yunxiang Li1, Zihan Li2, Kai Zhang3, Ruilong Dan4, You Zhang1(✉)

1 University of Texas Southwestern Medical Center, Dallas, USA
2 University of Illinois at Urbana-Champaign, Urbana, USA
3 Ohio State University, Columbus, USA
4 Hangzhou Dianzi University, Hangzhou, China
[email protected]

arXiv:2303.14070v1 [cs.CL] 24 Mar 2023

Abstract. Recent large language models (LLMs) in the general domain, such as
ChatGPT, have shown remarkable success in following instructions and producing
human-like responses. However, such language models have not been trained
specifically and carefully for the medical domain, resulting in poor diagnostic
accuracy and an inability to give correct recommendations for medical diagnosis,
medications, etc. To address this issue, we collected more than 700 diseases and
their corresponding symptoms, recommended medications, and required medical
tests, and then generated 5K doctor-patient conversations. When fine-tuned on
doctor-patient conversations, these models show great potential to understand
patients’ needs, provide informed advice, and offer valuable assistance in a
variety of medical-related fields. The integration of these advanced language
models into healthcare can revolutionize the way healthcare professionals and
patients communicate, ultimately improving the overall quality of care and
patient outcomes. To advance the further development of dialogue models in the
medical field, the training data, source code, and model weights of this project
are available at: https://github.com/Kent0n-Li/ChatDoctor.

1 Introduction
The development of instruction-following large language models (LLMs) such as
ChatGPT [4] has garnered significant attention due to their remarkable success
in instruction understanding and human-like response generation. These
auto-regressive LLMs [7] are pre-trained on web-scale natural-language text by
predicting the next token and then fine-tuned to follow large-scale human
instructions. They have shown strong performance over a wide range of NLP tasks
and generalize to unseen tasks, demonstrating their potential as a unified
solution for various problems such as natural language understanding, text
generation, and conversational AI. However, the exploration of such
general-domain LLMs in the medical field remains limited [2], despite the
immense potential they hold for transforming healthcare communication and
decision-making [1]. The specific reason is that existing models have not
learned the medical field in detail, so they often give wrong diagnoses and
wrong medical advice when playing the role of a doctor. Fine-tuning a large
language dialogue model on doctor-patient conversation data can significantly
advance the application of the model in the medical field. Especially in areas
where medical resources are scarce, ChatDoctor can be used for initial diagnosis
and triage of patients, significantly improving the operational efficiency of
existing hospitals.
Since large language models such as ChatGPT are not open source, we used Meta’s
LLaMA. We first trained a generic conversation model using the 52K
instruction-following examples provided by Stanford Alpaca [5], and then
fine-tuned the model on our collected physician-patient conversation dataset.
The main contributions of our method are three-fold:
1) We designed a novel process framework for fine-tuning large language models
in the medical domain.
2) We collected a dataset with 5,000 doctor-patient conversations for fine-tuning
the large language model.
3) We validate that the fine-tuned ChatDoctor with medical domain knowledge
has real potential for clinical application.

Fig. 1. Overview of the physician and patient conversation dataset collection pipeline
and the training procedure of ChatDoctor.

2 Method
2.1 Physician and patient conversation dataset
The first step in building a physician-patient conversation dataset is to
collect a disease database that serves as the gold standard of medical-domain
expertise. Therefore, we collected and organized a database containing about 700
diseases with their related symptoms, medical tests, and recommended
medications. To train high-quality conversation models on an academic budget, we
input each entry from the disease database separately as a prompt into the
ChatGPT API to automatically generate instruction data. Notably, our prompts to
the ChatGPT API contain the gold standard of diseases, corresponding symptoms,
and reference drugs, so our fine-tuned ChatDoctor is capable of achieving not
only ChatGPT’s conversational fluency but also higher diagnostic accuracy than
ChatGPT. Finally, we collected 5K doctor-patient conversation instructions and
constructed a new dataset, namely InstructorDoctor-5K.
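
The data-generation step described above can be sketched as follows. This is a minimal illustration and not the authors' actual code: the record fields, the prompt wording, the example values, and the function names build_generation_prompt and to_instruction_example are all assumptions. The call to the ChatGPT API is deliberately left out; only the prompt construction and the Alpaca-style record layout (instruction, input, output) are shown.

```python
from dataclasses import dataclass


@dataclass
class DiseaseRecord:
    """One entry of the disease database (hypothetical field layout)."""
    name: str
    symptoms: list[str]
    medical_tests: list[str]
    medications: list[str]


def build_generation_prompt(record: DiseaseRecord) -> str:
    """Embed the gold-standard facts in the prompt sent to the chat API.

    The wording below is an assumption; the paper does not give its prompt.
    """
    return (
        "Write a conversation between a patient and a doctor.\n"
        f"Disease: {record.name}\n"
        f"Symptoms: {', '.join(record.symptoms)}\n"
        f"Medical tests: {', '.join(record.medical_tests)}\n"
        f"Recommended medications: {', '.join(record.medications)}\n"
        "The doctor's advice must be consistent with the facts above."
    )


def to_instruction_example(patient_message: str, doctor_reply: str) -> dict:
    """Store one generated exchange in the Alpaca-style three-field format."""
    return {
        "instruction": "You are a doctor. Answer the patient's question.",
        "input": patient_message,
        "output": doctor_reply,
    }


# Illustrative values only, not taken from the authors' database.
example = DiseaseRecord(
    name="pyloric stenosis",
    symptoms=["vomiting", "nausea", "abdominal pain"],
    medical_tests=["abdominal ultrasound"],
    medications=["metoclopramide"],
)
prompt = build_generation_prompt(example)
```

Placing the reference facts directly in the prompt is what lets the generated dialogues inherit the database's accuracy rather than relying on what the generating model already knows.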

2.2 Training of the model

We build ChatDoctor on Meta’s LLaMA model [6], a prominent publicly accessible
LLM. Notably, despite having only 7 billion parameters, LLaMA is reported by its
authors to attain competitive and even superior results compared with the
considerably larger GPT-3 (with 175 billion parameters) on several NLP
benchmarks. LLaMA’s performance improvement was achieved by increasing the
amount of training data rather than the number of parameters. Specifically,
LLaMA was trained on 1.0 trillion tokens drawn from publicly accessible data
sources such as CommonCrawl and arXiv documents.
We use conversation demonstrations synthesized via ChatGPT [4] and subsequently
validated by medical practitioners to fine-tune the LLaMA model [8], following
the Stanford Alpaca [5] training methodology. The fine-tuning process was
conducted on 6 A100 GPUs and took 30 minutes. The hyperparameters used in
training were as follows: a total batch size of 192, a learning rate of
2 × 10⁻⁵, 3 epochs, a maximum sequence length of 512 tokens, a warmup ratio of
0.03, and no weight decay.
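
To make the scale of this run concrete, the sketch below collects the stated hyperparameters in a small configuration object and derives the step counts they imply. It is an illustration under assumptions: the class and function names are hypothetical, the dataset size is taken as exactly 5,000 examples (the paper says 5K), and rounding the warmup step count up is one common convention rather than something the paper specifies. How the 32 samples per GPU are split between per-device batch size and gradient accumulation is not stated in the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported in the text."""
    num_gpus: int = 6
    total_batch_size: int = 192
    learning_rate: float = 2e-5
    num_epochs: int = 3
    max_seq_length: int = 512
    warmup_ratio: float = 0.03
    weight_decay: float = 0.0


def schedule(cfg: FineTuneConfig, num_examples: int) -> dict:
    """Derive optimizer-step counts implied by the configuration."""
    if cfg.total_batch_size % cfg.num_gpus != 0:
        raise ValueError("total batch size must divide evenly across GPUs")
    steps_per_epoch = math.ceil(num_examples / cfg.total_batch_size)
    total_steps = steps_per_epoch * cfg.num_epochs
    return {
        # Samples each GPU contributes to one optimizer step.
        "samples_per_gpu_per_step": cfg.total_batch_size // cfg.num_gpus,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # Assumption: warmup steps are rounded up.
        "warmup_steps": math.ceil(cfg.warmup_ratio * total_steps),
    }


# Assumes exactly 5,000 training conversations.
plan = schedule(FineTuneConfig(), num_examples=5000)
```

Under these assumptions the run consists of only 81 optimizer steps (27 per epoch), which is consistent with the short training time reported.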

3 Results

To test ChatDoctor, we played the role of a patient and manually entered
medically relevant questions. The resulting conversations are presented in
Figures 2 and 3. To assess the performance of ChatDoctor, its responses to a
self-constructed evaluation set were manually assessed by experienced
practitioners. We performed a blind evaluation of ChatDoctor and ChatGPT against
each other to fairly assess their medical capabilities. In the comparison of
recommending medications based on diseases, ChatDoctor achieved 91.25% accuracy
compared to ChatGPT’s 87.5%.
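
A blind comparison of this kind can be sketched as below. This is not the authors' evaluation code: the function names, the A/B labelling scheme, and the data layout are assumptions, and no real evaluation data is included. The idea is that raters see each pair of responses under anonymous labels in random order, and the labels are mapped back to systems only after the judgments are recorded.

```python
import random


def blind_pairs(items, seed=0):
    """Hide which system produced each response.

    items: list of (question, {system_name: response}) with two systems.
    Returns (blinded, key): blinded entries carry labels "A" and "B";
    key maps those labels back to system names for later unblinding.
    """
    rng = random.Random(seed)
    blinded, key = [], []
    for question, responses in items:
        systems = list(responses)
        rng.shuffle(systems)  # random presentation order per question
        blinded.append({
            "question": question,
            "A": responses[systems[0]],
            "B": responses[systems[1]],
        })
        key.append({"A": systems[0], "B": systems[1]})
    return blinded, key


def accuracy_by_system(judgments, key):
    """Unblind rater verdicts and compute per-system accuracy.

    judgments: list of {"A": bool, "B": bool}, True if judged correct.
    """
    correct, total = {}, {}
    for verdict, mapping in zip(judgments, key):
        for label, system in mapping.items():
            total[system] = total.get(system, 0) + 1
            correct[system] = correct.get(system, 0) + int(verdict[label])
    return {system: correct[system] / total[system] for system in total}
```

Keeping the key separate from the material shown to raters is the design choice that makes the evaluation blind.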
Some examples of ChatDoctor’s responses are shown in Table 1. Our analysis
revealed several interesting points. For example, in the first question, the
patient asks for the recommended medication for pyloric stenosis and does not
ask anything about surgery, but ChatDoctor’s response mentions that pyloric
stenosis is not adequately treated by medication alone and that the best

Fig. 2. An example of a medical advice consultation with ChatDoctor.

Fig. 3. An example of a medical advice consultation with ChatDoctor.



Table 1. Some examples of consulting ChatDoctor about some medical advice.

No.  Class                    Dialogue

1    Question of Patient      Doctor, I have been experiencing vomiting, nausea, and
                              abdominal pain. I recently found out that I have pyloric
                              stenosis. What medications should I take for it?
     Response of ChatDoctor   The most common treatment for pyloric stenosis is a
                              surgical procedure called a pyloromyotomy. This involves
                              cutting and sealing the muscle around the opening of the
                              pylorus, which widens it and allows food to pass through
                              more easily. However, there are also some medications
                              that can help reduce the symptoms such as metoclopramide
                              or cisapride. Which one would you prefer?

2    Question of Patient      Hi doctor, I have been losing hair rapidly for the past
                              few months. I think I might have Alopecia. What should
                              I do?
     Response of ChatDoctor   There are several medications that can help with
                              Alopecia. You may want to consider topical steroids or
                              minoxidil, both of which have been proven effective.
                              Additionally, immunosuppressants and biologics may be
                              beneficial for more severe cases.

3    Question of Patient      Doctor, I’ve been experiencing sudden jerks and twitches
                              in my muscles. It’s becoming very difficult to manage
                              them on a daily basis.
     Response of ChatDoctor   There are several medications that can help with
                              Myoclonus. Are you taking any other medications?

4    Question of Patient      I think I have carbon monoxide poisoning. I feel dizzy
                              and nauseous.
     Response of ChatDoctor   I recommend that you get medical attention immediately
                              and start taking a breathing treatment with 100% oxygen.
                              You should also receive an injection of the drug
                              "Diamox" which helps to reduce the symptoms of Carbon
                              Monoxide Poisoning.

5    Question of Patient      I have been experiencing memory problems and confusion
                              lately. I think I might have Wernicke Korsakoff syndrome.
     Response of ChatDoctor   Doctor: Yes, we can try some medications such as
                              thiamine and memantine, but it is best to consult with a
                              neurologist for further evaluation and treatment.

solution is surgery. In the third question, ChatDoctor appears to take into
account that some of the drugs it could recommend are harmful when taken in
combination with other drugs. It therefore did not recommend a medication
directly but asked whether the patient was taking other medications. In the
fourth question, ChatDoctor treats carbon monoxide poisoning as very urgent and
responds by advising the patient to seek immediate medical attention. In the
fifth question, ChatDoctor did not have enough knowledge about Wernicke
Korsakoff syndrome to answer in more detail and advised the patient to consult a
neurologist for further evaluation and treatment.

4 Limitations

We emphasize that ChatDoctor is for academic research only, and any commercial
or clinical use is prohibited. There are three reasons for this decision. First,
ChatDoctor is based on LLaMA, which has a non-commercial license, so we
necessarily inherit its terms. Second, our model is not licensed for
healthcare-related purposes [3]. Third, we have not designed sufficient safety
measures, and the current model still cannot guarantee the full correctness of
medical diagnoses.

5 Discussion and conclusion

ChatDoctor, obtained by fine-tuning a large language model on medical domain
knowledge, has a wide range of potential applications. However, due to the
unique characteristics of the medical domain, latent errors in diagnosis and
medical advice can have serious consequences. Moreover, large language models
often generate incorrect and harmful statements about topics on which they lack
knowledge, which may result in malpractice. In future work, it will be vital to
restrict large language models to generating only results that they are very
sure of and to suppress uncertain responses. In addition, model performance is
highly correlated with the availability of high-quality training data, which is
extremely scarce. Despite these challenges, the potential benefits of ChatDoctor
are significant, including improving the accuracy and efficiency of medical
diagnoses and reducing the workload of medical professionals while increasing
access to medical advice, especially for grassroots hospitals and patients in
developing countries. By addressing the challenges of language model
applications in the medical domain, ChatDoctor could become a valuable assistant
in improving patient outcomes and advancing medical research.
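
One simple way to read the idea of suppressing uncertain responses is confidence-based abstention, sketched below. This is only an illustration of the future-work direction, not a method from this paper: the function names, the use of the geometric-mean token probability as a confidence score, and the threshold of 0.8 are all assumptions, and token probabilities are known to be an imperfect proxy for factual correctness.

```python
import math


def mean_token_probability(token_logprobs):
    """Geometric-mean probability of the generated tokens.

    token_logprobs: natural-log probabilities the model assigned to each
    token it generated. An empty sequence is treated as zero confidence.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(answer, token_logprobs, threshold=0.8):
    """Return the answer only if its confidence score clears the threshold."""
    if mean_token_probability(token_logprobs) >= threshold:
        return answer
    return "I am not certain. Please consult a medical professional."
```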

References
1. Abacha, A.B., Zweigenbaum, P.: MEANS: A medical question-answering system
combining NLP techniques and semantic web technologies. Information Processing &
Management 51(5), 570–594 (2015)
2. Gilson, A., Safranek, C.W., Huang, T., Socrates, V., Chi, L., Taylor, R.A.,
Chartash, D., et al.: How does ChatGPT perform on the United States Medical
Licensing Examination? The implications of large language models for medical
education and knowledge assessment. JMIR Medical Education 9(1), e45312 (2023)
3. Hatherley, J.J.: Limits of trust in medical AI. Journal of Medical Ethics
46(7), 478–481 (2020)
4. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P.,
Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to
follow instructions with human feedback. arXiv preprint arXiv:2203.02155 (2022)
5. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang,
P., Hashimoto, T.B.: Stanford Alpaca: An instruction-following LLaMA model.
https://github.com/tatsu-lab/stanford_alpaca (2023)
6. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix,
T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A.,
Grave, E., Lample, G.: LLaMA: Open and efficient foundation language models
(2023)
7. Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N.A., Khashabi, D.,
Hajishirzi, H.: Self-Instruct: Aligning language model with self generated
instructions (2022)
8. Xu, R., Luo, F., Zhang, Z., Tan, C., Chang, B., Huang, S., Huang, F.: Raise a
child in large language model: Towards effective and generalizable fine-tuning.
arXiv preprint arXiv:2109.05687 (2021)
