Chatdoctor: A Medical Chat Model Fine-Tuned On Llama Model Using Medical Domain Knowledge
Yunxiang Li1 , Zihan Li2 , Kai Zhang3 , Ruilong Dan4 , You Zhang1( )
arXiv:2303.14070v1 [cs.CL] 24 Mar 2023
1 Introduction
The development of instruction-following large language models (LLMs) such
as ChatGPT [4] has garnered significant attention due to their remarkable
success in instruction understanding and human-like response generation.
These auto-regressive LLMs [7] are pre-trained on web-scale natural-language
corpora by predicting the next token, and are then fine-tuned to follow
large-scale human instructions. They have shown strong performance across a
wide range of NLP tasks and generalize well to unseen tasks, demonstrating
their potential as a unified solution to problems such as natural language
understanding, text generation, and conversational AI. However, the
exploration of such general-domain LLMs in the medical field remains
relatively untapped [2], despite the
Fig. 1. Overview of the physician and patient conversation dataset collection pipeline
and the training procedure of ChatDoctor.
2 Method
2.1 Physician and patient conversation dataset
The first step in building a physician-patient conversation dataset is to collect
a disease database that serves as the gold standard of medical-domain expertise.
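As an illustrative sketch only (not the authors' released code), each physician-patient exchange can be wrapped in an Alpaca-style prompt template [5] to form a supervised fine-tuning example; the field names and the exact instruction wording below are assumptions:

```python
# Sketch: convert one physician-patient turn into an Alpaca-style
# instruction-tuning record. The template follows Stanford Alpaca [5];
# the "instruction"/"input"/"output" field names are assumptions.

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def make_training_example(patient_query: str, physician_answer: str) -> dict:
    """Wrap one physician-patient exchange as a fine-tuning example."""
    return {
        "instruction": "If you are a doctor, please answer the medical "
                       "question based on the patient's description.",
        "input": patient_query,
        "output": physician_answer,  # target the model is trained to produce
    }

def render_prompt(example: dict) -> str:
    """Render the prompt fed to the model; the response is appended as the label."""
    return PROMPT_TEMPLATE.format(
        instruction=example["instruction"], input=example["input"]
    )

example = make_training_example(
    "I have been coughing for two weeks with a mild fever. What should I do?",
    "A persistent cough with fever may indicate a respiratory infection; "
    "please see a physician for an examination.",
)
prompt = render_prompt(example)
```

During fine-tuning, the rendered prompt and the physician's answer are concatenated, and the loss is computed on the response tokens, as in standard instruction tuning.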
3 Results
Using our ChatDoctor model, we played the role of a patient and manually
entered medically relevant questions; the resulting conversations are
presented in Figures 2 and 3. To assess ChatDoctor's performance, the
responses to our self-constructed evaluation set were manually assessed by
experienced practitioners. We performed a blind, head-to-head evaluation of
ChatDoctor against ChatGPT to fairly compare their medical capabilities. In
recommending medications for given diseases, ChatDoctor achieved 91.25%
accuracy, compared with ChatGPT's 87.5%.
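Concretely, the accuracy above is the fraction of evaluation questions for which the blinded expert raters judged the recommended medication correct. A minimal sketch follows; the 80-question set size and the individual judgment labels are hypothetical, chosen only to reproduce the reported percentages:

```python
# Sketch: compute medication-recommendation accuracy from blinded expert
# judgments. Each entry is True if the raters judged the model's recommended
# medication correct. The counts below are hypothetical (80 questions).

def accuracy(judgments: list[bool]) -> float:
    """Percentage of answers judged correct by the expert raters."""
    return 100.0 * sum(judgments) / len(judgments)

# Hypothetical judgment lists: 73/80 vs. 70/80 correct.
chatdoctor_judgments = [True] * 73 + [False] * 7
chatgpt_judgments = [True] * 70 + [False] * 10

print(f"ChatDoctor: {accuracy(chatdoctor_judgments):.2f}%")  # 91.25%
print(f"ChatGPT:    {accuracy(chatgpt_judgments):.2f}%")     # 87.50%
```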
Some examples of ChatDoctor's responses are shown in Table 1. Our analysis
revealed several interesting behaviors. For example, in the first question,
the patient asks for a recommended medication for pyloric stenosis without
mentioning surgery, yet ChatDoctor's response notes that pyloric stenosis is
not adequately treated by medication alone and that surgery is the best
option. In the third question, ChatDoctor recognized that some of the
candidate medications can be harmful when taken in combination with other
drugs; rather than recommending a medication directly, it asked the patient
whether he or she was taking any other medications. In the fourth question,
ChatDoctor treated carbon monoxide poisoning as a medical emergency and
advised the patient to seek immediate medical attention. In the fifth
question, ChatDoctor lacked sufficient knowledge of Wernicke-Korsakoff
syndrome to answer in more detail and advised the patient to consult a
neurologist for further evaluation and treatment.
4 Limitations
We emphasize that ChatDoctor is for academic research only; any commercial
or clinical use is prohibited. Three factors underlie this decision. First,
ChatDoctor is based on LLaMA, which carries a non-commercial license, so we
necessarily inherit that restriction. Second, our model is not licensed for
healthcare-related purposes [3]. Third, we have not designed sufficient
safety measures, and the current model still cannot guarantee the
correctness of its medical diagnoses.
References
1. Abacha, A.B., Zweigenbaum, P.: MEANS: A medical question-answering system combining NLP techniques and semantic web technologies. Information Processing & Management 51(5), 570–594 (2015)
2. Gilson, A., Safranek, C.W., Huang, T., Socrates, V., Chi, L., Taylor, R.A., Chartash, D., et al.: How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Medical Education 9(1), e45312 (2023)
3. Hatherley, J.J.: Limits of trust in medical AI. Journal of Medical Ethics 46(7), 478–481 (2020)
4. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155 (2022)
5. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., Hashimoto, T.B.: Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca (2023)
6. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: LLaMA: Open and efficient foundation language models (2023)
7. Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N.A., Khashabi, D., Hajishirzi, H.: Self-Instruct: Aligning language models with self-generated instructions (2022)
8. Xu, R., Luo, F., Zhang, Z., Tan, C., Chang, B., Huang, S., Huang, F.: Raise a child in large language model: Towards effective and generalizable fine-tuning. arXiv preprint arXiv:2109.05687 (2021)