
JID: ENBENV

ARTICLE IN PRESS [m5GeSdc;April 9, 2024;6:14]

Energy and Built Environment xxx (xxxx) xxx

Contents lists available at ScienceDirect

Energy and Built Environment


journal homepage: http://www.keaipublishing.com/en/journals/energy-and-built-environment/

Evaluation of large language models (LLMs) on the mastery of knowledge and skills in the heating, ventilation and air conditioning (HVAC) industry
Jie Lu a,b, Xiangning Tian c,d, Chaobo Zhang e, Yang Zhao a,f,∗, Jian Zhang a, Wenkai Zhang a,
Chenxin Feng a, Jianing He a, Jiaxi Wang a, Fengtai He a
a Institute of Refrigeration and Cryogenics, Zhejiang University, Hangzhou, China
b Energy Efficient Cities Initiative, Department of Engineering, University of Cambridge, Cambridge, United Kingdom
c Architectural Design & Research Institute of Zhejiang University Co., Ltd., Hangzhou, China
d Center for Balance Architecture, Zhejiang University, Hangzhou, China
e Department of the Built Environment, Eindhoven University of Technology, Eindhoven, the Netherlands
f Key Laboratory of Clean Energy and Carbon Neutrality of Zhejiang Province, Jiaxing Research Institute, Zhejiang University, Jiaxing, China

a r t i c l e i n f o

Keywords: Large language model; ChatGPT; GPT-4; Artificial general intelligence; HVAC systems

a b s t r a c t

Large language models (LLMs) have shown human-level capabilities in solving various complex tasks. However, it is still unknown whether state-of-the-art LLMs master sufficient knowledge related to heating, ventilation and air conditioning (HVAC) systems. It will be inspiring if LLMs can think and learn like professionals in the HVAC industry. Hence, this study investigates the performance of LLMs on mastering the knowledge and skills related to the HVAC industry by letting them take the ASHRAE Certified HVAC Designer examination, an authoritative examination in the HVAC industry. Three key knowledge capabilities are explored: recall, analysis and application. Twelve representative LLMs are tested, such as GPT-3.5, GPT-4 and LLaMA. According to the results, GPT-4 passes the ASHRAE Certified HVAC Designer examination with scores from 74 to 78, which is higher than about half of human examinees. Besides, GPT-3.5 passes the examination twice out of five times. It demonstrates that some LLMs such as GPT-4 and GPT-3.5 have great potential to assist or replace humans in designing and operating HVAC systems. However, they still make some mistakes sometimes due to the lack of knowledge, poor reasoning capabilities and unsatisfactory equation calculation abilities. Accordingly, four future research directions are proposed to reveal how to utilize and improve LLMs in the HVAC industry: teaching LLMs to use design tools or software in the HVAC industry, enabling LLMs to read and analyze the operational data from HVAC systems, developing tailored corpuses for the HVAC industry, and assessing the performance of LLMs in real-world HVAC design and operation scenarios.

∗ Corresponding author.
E-mail address: [email protected] (Y. Zhao).
https://doi.org/10.1016/j.enbenv.2024.03.010
Received 29 January 2024; Received in revised form 25 March 2024; Accepted 26 March 2024
2666-1233/Copyright © 2024 Southwest Jiaotong University. Publishing services by Elsevier B.V. on behalf of KeAi Communication Co. Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

1. Introduction

The building sector is one of the biggest energy consumers in the world, accounting for about one-third of global energy consumption [1,2]. As the most important components in buildings, heating, ventilation and air-conditioning (HVAC) systems account for about 50 % of the energy used in the building sector [3]. Due to the complexity and diversity of HVAC systems, the introduction of artificial intelligence (AI) is important to ensure the functionality, availability, reliability and energy efficiency of HVAC systems [4].

In the past decade, AI has shown powerful capacities in solving problems in the design and operation stages of HVAC systems [5,6]. In the design stage, AI demonstrates exceptional proficiency in design scheme optimization, energy efficiency estimation, knowledge query, design check, etc. [7]. Seo et al. proposed a genetic algorithm-based method to optimize the design of an HVAC system to minimize the primary energy demand of an apartment house [8]. The results showed that the proposed method was capable of determining the optimal system design for saving energy. Cho et al. adopted a decision tree-based method to estimate the energy performance and investment cost of HVAC systems, yielding information on optimal combinations of HVAC sets in office buildings [9]. Ellis and Mathews applied a specialized knowledge database-based method to assist in quickly searching for information related to HVAC system design, such as building materials [10]. Lee et al. also proposed a knowledge-based method to check automatically whether design proposals comply with building codes and regulations, such as national energy efficiency standards, environmental regulations, and building safety rules [11]. In the operation stage, AI has shown promising capacity in system performance optimization, energy consumption reduction, fault detection and diagnosis, occupant thermal comfort enhancement, etc. [12,13]. For instance, Google has used a deep neural network-based method to optimize the energy efficiency of a chiller system in a data center, reducing the cooling energy of the data center by 40 % [7]. Afram et al. [14], Shultz and Zedeck [15], and Afram and Janabi-Sharifi [16] proposed model-based predictive control methods to optimize the operation of HVAC systems, which are demonstrated to have outstanding energy-saving effects. Additionally, Yang et al. adopted a recurrent neural network-based method to learn occupant behaviors and external environmental factors, making real-time adjustments to temperature, airflow, and humidity levels to ensure optimal indoor comfort and air quality [17]. Furthermore, Zhang et al. developed a causal neural network-based method for accurate and reliable fault detection and diagnosis of HVAC systems [18]. The proposed method achieved high diagnosis accuracy. However, existing AI methods are still unable to think and learn like professionals in the HVAC industry, since they are usually developed based on operational data or limited expert knowledge. It is of great interest to enable AI to mirror the cognitive and experiential learning capabilities of HVAC professionals. In this way, humans could communicate with AI models and ask them to help solve complex, repetitive and time-consuming tasks related to the design and operation of HVAC systems, leading to higher productivity and lower labor costs.

Recently, the development of large language models (LLMs) has seen an impressive boom, reflecting their remarkable logical, cognitive, and knowledge capabilities [19,20]. They demonstrate an advanced understanding of complex concepts, exhibit nuanced reasoning, and access a vast repository of information across various fields, such as medicine, economics and media. Hence, they possess significant potential to

Fig. 1. Knowledge capability evaluation framework for LLMs in the domain of HVAC systems.


Fig. 2. Professional capabilities, scenarios and tasks covered in the ASHRAE CHD examination.


Fig. 3. Sample questions for professional capability evaluation of LLMs in the domain of HVAC systems.

become digital AI experts of the HVAC industry. Many LLMs have been proposed, such as GPT-3.5 [21], GPT-4 [22], ERNIE Bot [23], LLaMA [24], Alpaca [25], Koala [26], Vicuna [27], Dolly [28], Oasst-pythia [29], FastChat-T5 [30], ChatGLM [31] and StableLM-tuned-alpha [32]. Among them, GPT-4, one of the most advanced LLMs, has shown powerful abilities such as reasoning, creativity, and deduction. It is able to perform a variety of tasks, e.g., playing games, using tools, and explaining itself [33]. The cornerstone of their effectiveness as digital AI experts lies in mastering domain-specific knowledge. Only when LLMs have sufficient knowledge and skills related to the HVAC industry can they be expected to assist or replace humans in designing and operating HVAC systems. However, it is important to note that existing popular LLMs are not domain-specific models. Given the complexity of HVAC systems, it is still unknown whether state-of-the-art LLMs like GPT-4 master sufficient HVAC knowledge to become digital AI experts in this field. Accordingly, the following questions are crucial but remain unanswered:

Question 1: How can the mastery of knowledge and skills of LLMs be evaluated in the context of the HVAC industry?
Question 2: Do LLMs master sufficient knowledge of HVAC to support the realization of AI experts in the future?
Question 3: What potential AI innovations could LLMs introduce to the HVAC industry in the coming years?

To answer Questions 1 and 2, this study evaluates the knowledge capabilities of LLMs in the HVAC industry by letting them take the ASHRAE Certified HVAC Designer (CHD) examination, a professional exam for testing the knowledge capabilities of humans who work in this domain. Twelve representative LLMs are considered, i.e., GPT-3.5, GPT-4, ERNIE Bot, LLaMA, Alpaca, Koala, Vicuna, Dolly, Oasst-pythia, FastChat-T5, ChatGLM and StableLM-tuned-alpha. Based on the above analysis, future research directions are discussed to address Question 3.

Section 2 describes the methodology, including the evaluation tests, the representative large language models, and the evaluation metrics. Section 3 presents the exam score results of the large language models in the ASHRAE CHD Examination, and their performance in terms of the recall, analysis and application capabilities. Section 4 discusses the insights


Fig. 4. Performance test flowchart of LLM models.

obtained from the evaluation, including the knowledge capabilities, limitations and future research directions of LLMs in the HVAC industry. Finally, Section 5 summarizes the main findings of this study.

2. Methodology

A framework (see Fig. 1) is proposed to conduct a comprehensive evaluation of the knowledge capabilities of 12 representative state-of-the-art LLMs in the HVAC industry. The professional capabilities of these LLMs are evaluated and compared using three metrics, i.e., overall accuracy, local accuracy and consistency.

2.1. Evaluation tests for large language models in the HVAC industry

2.1.1. Knowledge capabilities to be tested for large language models

Professional licensure exams are prevalent in various fields, including HVAC, accounting, law, medicine, and so on. Achieving success on these exams typically necessitates a combination of three critical elements: (i) acquiring a large amount of theoretical knowledge, (ii) the capability to comprehend and analyze the questions, and (iii) the skill to respond to scenario-specific questions. For the performance evaluation of LLMs in the domain of HVAC, three key knowledge capabilities are taken into consideration: recall, analysis, and application. They are the main dimensions employed by ASHRAE to assess the professional competencies of individuals engaged in this domain [34]. They are defined as follows:

• Recall refers to the ability to remember or recognize specific knowledge. An excellent HVAC designer or engineer is required to have a solid grasp of key concepts, principles, and technical information associated with HVAC systems. This memory-based skill is crucial as it lays the groundwork for solving problems and making decisions in HVAC design processes.
• Analysis involves the ability to analyze the given information, draw upon various sources, generate accurate responses, etc. It is an important capability for HVAC designers and engineers. In a real-world HVAC design task, a good HVAC designer needs to understand the requirements of building owners, architectural designs, environmental factors, and legal guidelines very well.
• Application tests the ability to utilize knowledge. It reflects how well an HVAC designer or engineer can use their domain knowledge to solve tasks associated with HVAC systems. For instance, HVAC designers usually need to select suitable HVAC equipment, design ductwork, troubleshoot system issues and so on according to their domain knowledge. This capability bridges the gap between theory and practice.

2.1.2. ASHRAE certified HVAC designer examination

ASHRAE is a famous organization in the HVAC industry with members from all over the world [35]. It publishes a series of standards and guidelines which are excellent references for a vast array of HVAC designers and engineers. It has consistently been at the forefront of the development of state-of-the-art HVAC technologies. The ASHRAE CHD examination is a demanding and comprehensive test for HVAC designers [34]. It covers a broad range of topics, e.g., HVAC design principles, indoor air quality and hydraulic commissioning. The ASHRAE CHD examination is accredited by the ANSI National Accreditation Board personnel certification program under ISO/IEC 17024, ensuring fairness and validity. Hence, the knowledge capabilities of LLMs are evaluated by letting them take the ASHRAE CHD examination in this study. Researchers interested in replicating the results in this paper can refer to [34].

As shown in Fig. 2, four scenarios (system design, design calculation, procedural and coordination) are considered in this examination for evaluating the three knowledge capabilities (recall, analysis and application) of HVAC designers. These scenarios comprise key concepts, knowledge, and skills related to HVAC design. Each scenario contains multiple tasks. System design involves the fundamental concepts and terms in HVAC design, such as conditions, variables, and operational parameters. Tasks in this scenario include "size supply, return, and exhaust ducts", "create HVAC zoning and sensor locations", and so on. Design calculation involves the tasks related to calculations for HVAC design, such as calculating HVAC system requirements and supporting project estimates for system selection. Procedural focuses on the processes executed by HVAC designers, including analyzing compliance with codes and standards, reviewing shop drawings, performing field reviews, etc. Coordination includes the tasks that need cooperation among

Fig. 5. Exam scores of LLMs in the ASHRAE CHD Examination.


Fig. 6. Accuracy of 12 LLMs on the questions associated with the recall capability.

multiple project stakeholders, such as "assist in basis of design development", "coordinate HVAC equipment space requirements" and "comply with client specs and performance".

The closed-book question-answering method is utilized to test the LLMs, without access to additional resources. 100 exam-style single-choice questions from the ASHRAE CHD examination guide book [34] (Pages 30–38, 43–48, 53–55 and 64–68) are utilized to assess the three knowledge capabilities (recall, analysis, and application) of LLMs in the domain of HVAC design. As suggested by ASHRAE [34], the ratio of recall, analysis, and application questions is 5:3:2. Fig. 3 shows some typical questions as instances. These questions can be applied to test the recall capability to remember key concepts of HVAC design ("What does a system manual typically include?"), the analysis capability to solve questions focused on deep insights and analysis ("What's the MOST accurate statement about the task of sequencing heating and cooling?"), and the application capability to answer calculation questions ("…calculate the total pressure of a centrifugal fan").

The performance test flowchart of the LLMs is illustrated in Fig. 4. The prompt is generated using a prompt template consisting of two parts: instruction and question. The instruction explains the task that the LLMs should conduct, which in this study is "Answer the following single-choice questions about CERTIFIED HVAC DESIGNER exam and explain the reason:". The question is one of the 100 exam-style single-choice questions from the ASHRAE CHD examination guide book, such as the sample in Fig. 3. It should be noted that some LLMs (e.g., ChatGPT) yield answers influenced by previous prompts. To avoid this influence on the test results, no prompts are pre-fed before a test, and each test is conducted in a new dialog box on the platform. Considering the randomness of the answers from the LLMs, each question is asked five times for each LLM, each time in a new chat.
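The closed-book protocol just described (a fixed instruction, one question per fresh chat, five repetitions per question) can be sketched as follows. This is a minimal illustration, not the authors' actual test harness: `ask_llm` is a hypothetical stand-in for whichever chat API is under test, and any question text used with it here is illustrative rather than taken from the ASHRAE guide book.

```python
# Minimal sketch of the closed-book test protocol described above.
# Assumption: `ask_llm` is a hypothetical callable wrapping the chat API
# under test; it must start a fresh conversation on every call.

INSTRUCTION = ("Answer the following single-choice questions about "
               "CERTIFIED HVAC DESIGNER exam and explain the reason:")

def build_prompt(question: str) -> str:
    """Prompt template: the fixed instruction followed by one question."""
    return f"{INSTRUCTION}\n{question}"

def run_exam(ask_llm, questions, n_repeats=5):
    """Ask every question n_repeats times, each in a new chat so that no
    previous prompts can influence the answer. Returns {question: answers}."""
    answers = {}
    for question in questions:
        prompt = build_prompt(question)
        # Each call is an independent dialog; no history is carried over.
        answers[question] = [ask_llm(prompt) for _ in range(n_repeats)]
    return answers
```

With a real client, sampling settings (e.g., the temperature of 1.0 used for the GPT models when measuring consistency) would be fixed inside `ask_llm`.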


Table 1
Representative LLMs used for performance evaluation.

Accessibility        Model name                  Parameter size (B)   Pre-training model
Closed source        GPT-3.5 [21]                –                    GPT-3
                     GPT-4.0 [22]                –                    GPT-3.5
                     ERNIE Bot [23]              –                    ERNIE 3.0
Publicly available   LLaMA [24]                  13                   –
                     Alpaca [25]                 13                   LLaMA
                     Koala [26]                  13                   LLaMA
                     Vicuna [27]                 13                   LLaMA
                     Dolly [28]                  12                   Pythia
                     Oasst-pythia [29]           12                   Pythia
                     FastChat-T5 [30]            3                    FLAN-T5
                     ChatGLM [31]                6                    GLM
                     StableLM-tuned-alpha [32]   7                    LLaMA

Fig. 7. Accuracy of 12 LLMs on the recall capability.
Fig. 8. Consistency of 12 LLMs on the questions associated with the recall capability.

2.2. Representative large language models

The performance of twelve state-of-the-art LLMs is evaluated in this study: GPT-3.5, GPT-4, ERNIE Bot, LLaMA, Alpaca, Koala, Vicuna, Dolly, Oasst-pythia, FastChat-T5, ChatGLM and StableLM-tuned-alpha. They vary in accessibility, parameter size, and built-in pre-training model, as listed in Table 1.

GPT-3.5, GPT-4.0, and ERNIE Bot are closed-source LLMs. The details regarding their design, code, and training data are not publicly available. GPT-3.5 and GPT-4.0 are both developed by OpenAI based on the Transformer architecture [36]. Recent studies have demonstrated that GPT-4 significantly outperforms GPT-3.5 [37]. The former has shown better abilities in understanding complex queries and contexts, as well as generating logical and accurate responses [38]. Additionally, GPT-4 exhibits a marked improvement in reasoning and ambiguity resolution, leading to better human-like performance on some professional tests, such as the Scholastic Assessment Test [39], the Law School Admission Test [15] and the Uniform Bar Exam [40]. ERNIE Bot is a knowledge-enhanced LLM developed by Baidu [23]. It is trained using multi-layered Transformer networks to learn language knowledge. The applications of ERNIE Bot are extensive, such as text generation [41], speech recognition [42], and machine translation [43].

The remaining nine models, i.e., LLaMA, Alpaca, Koala, Vicuna, Dolly, OpenAssistant, FastChat-T5, ChatGLM, and StableLM, are open-source. LLaMA is developed by Meta [24]. It can achieve high performance with a small parameter size, i.e., 7 billion parameters. It has inspired a host of further improved models such as Alpaca, Koala and Vicuna. Alpaca is an LLM developed by Stanford [25]. It is fine-tuned from LLaMA using only 52k data, achieving performance comparable to GPT-3.5 at a very low training cost. Koala is proposed by UC Berkeley, with 13 billion parameters [26]. Unlike other models, it is trained using high-quality data collected from the Internet. Its training dataset includes the open-source dataset from OpenInstructionGeneralist, Stanford's Alpaca model dataset, etc. It is reported that Koala achieves 50 % of the performance of GPT-3.5 [26]. Vicuna-13B is developed by students and faculty from UC Berkeley in collaboration with the University of California San Diego and Carnegie Mellon University [27]. It is fine-tuned from the LLaMA-13B model using user dialogue data generated by ShareGPT, totaling 70 K entries. Dolly 2.0 is the first open-source, instruction-tuned LLM, developed by Databricks [28]. It is fine-tuned based on the


Fig. 9. Accuracy of 12 LLMs on the questions associated with the analysis capability.

Pythia model with 12 billion parameters. Its fine-tuning dataset includes 15,000 high-quality human-generated prompt and response pairs for tasks such as brainstorming, information extraction, and summarization. Oasst-pythia is an English supervised fine-tuning model released by LAION [29]. It is based on the Pythia-12B model and fine-tuned using human demonstrations of assistant conversations. FastChat-T5 is a chat assistant fine-tuned from FLAN-T5 by LMSYS using user-shared conversations from ShareGPT [30]. It is based on an encoder-decoder transformer architecture, and can autoregressively generate responses to users' inputs. ChatGLM-6B is an open-source conversational robot released by Tsinghua University [31]. The foundation of this model is a general language model fine-tuned with autoregressive blank infilling (GLM), with approximately 6 billion parameters. It is proficient in both English and Chinese, and particularly optimized for Chinese. It excels in tasks such as self-cognition, outline writing, and information extraction. StableLM is trained using a new dataset constructed from the Pile, comprising 1.5 trillion content tokens [32]. It performs exceptionally well in conversational and coding tasks.

2.3. Evaluation metrics for large language models in the domain of HVAC systems

2.3.1. Overall accuracy

The exam score is calculated according to the scoring criteria set by the ASHRAE CHD Exam Subcommittee [44]. A CHD exam comprises 100 single-choice questions, with each correctly answered question earning a point, as defined in Eq. (1). The maximum exam score equals 100. This metric can be employed to assess the overall knowledge capabilities


of LLMs.

$$ \mathit{Exam\;score} = \sum_{i=1}^{m} n_i \quad (1) $$

where m is the number of questions in one test, and $n_i$ represents the score for the i-th question ($n_i = 1$ if the answer is correct, $n_i = 0$ otherwise).

2.3.2. Local accuracy

Apart from the overall accuracy, it is also necessary to evaluate the local accuracy of LLMs. Hence, two metrics are designed to assess the accuracy of LLMs on each question and on each knowledge capability. The first metric (Accuracy1) is defined by Eq. (2) to assess the accuracy of LLMs on each question. The second metric (Accuracy2) is defined by Eq. (3) to assess the accuracy of LLMs on each knowledge capability.

$$ \mathit{Accuracy}_{1,i} = \frac{Q_{C,i}}{Q_T} \quad (2) $$

$$ \mathit{Accuracy}_{2} = \frac{\sum_{i=1}^{m} f_i}{m} \quad (3) $$

where $Q_{C,i}$ represents the number of correct answers for the i-th question, $Q_T$ is the total number of test times ($Q_T$ equals 5 in this study), $f_i$ represents the accuracy score for the i-th question associated with a knowledge capability ($f_i = 1$ if $\mathit{Accuracy}_{1,i} \ge 80\,\%$, $f_i = 0$ otherwise), and m is the total number of questions associated with a knowledge capability in the ASHRAE CHD examination.

2.3.3. Consistency

The answers of LLMs to a question usually change across chats. The metric named consistency is applied to evaluate the stability of the answers of LLMs. It measures whether an LLM can replicate the same answers for the same questions across multiple tests. In this study, the consistency of LLMs on each knowledge capability is assessed, as defined by Eq. (4). It should be noted that the hyperparameter named temperature has an influence on the diversity of answers. To control for the effects of temperature, it is set to 1.0 for the GPT models.

$$ \mathit{Consistency} = \frac{\sum_{i=1}^{m} c_i / Q_T}{m} \quad (4) $$

where $c_i$ is the maximum number of identical answers for the i-th question over all tests. Two answers are considered identical if they choose the same choice for the same single-choice question.

3. Results

3.1. Exam scores of large language models in the ASHRAE CHD examination

The exam scores of the twelve representative LLMs in five tests are shown in Fig. 5. Each LLM takes the ASHRAE CHD Examination five times. According to the standard of ASHRAE, an HVAC designer passes the ASHRAE CHD Examination if the exam score is 70 or higher [45]. The results show that there is a huge performance gap among the twelve LLMs. GPT-4 and GPT-3.5 are the only two LLMs which pass the ASHRAE CHD Examination. GPT-4 is the top LLM with quite high exam scores. Its exam scores, between 74 and 78, are always higher than the passing score. GPT-3.5 also passes the ASHRAE CHD Examination twice. Its exam scores are between 66 and 70. It is worth noting that, as of December 2022, the pass rate of first-time human examinees was 53 % in this exam [45]. It means that GPT-4 and GPT-3.5 have outperformed about half of the human examinees in this exam.

The remaining models have significantly lower exam scores than GPT-4 and GPT-3.5. None of them passes the ASHRAE CHD Examination. Vicuna's exam scores vary between 41 and 55. The exam scores of FastChat-T5 are within the range between 50 and 53. Alpaca gets exam scores between 44 and 52. LLaMA's exam scores range from 40 to 47. Koala scores from 37 to 44. ChatGLM has exam scores between 37 and 48. ERNIE Bot scores between 31 and 41. The exam scores of Oasst-pythia are between 28 and 37. Dolly obtains exam scores from 29 to 38. The exam scores of StableLM-tuned-alpha vary from 23 to 37.
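As a concrete illustration, the four metrics defined by Eqs. (1)–(4) can be computed directly from a log of recorded answers. The sketch below is a minimal reading of the equations; the answer log and answer key it operates on are hypothetical, not the study's data:

```python
# Minimal sketch of the metrics in Eqs. (1)-(4); the answer logs and
# answer keys used with these functions are hypothetical examples.
from collections import Counter

def exam_score(answers_one_test, key):
    """Eq. (1): one point per correctly answered question in one test."""
    return sum(1 for q, a in answers_one_test.items() if key[q] == a)

def accuracy_1(trials, correct):
    """Eq. (2): share of the Q_T trials of one question answered correctly."""
    return sum(1 for a in trials if a == correct) / len(trials)

def accuracy_2(log, key):
    """Eq. (3): share of a capability's questions with Accuracy_1 >= 80 %."""
    return sum(1 for q, trials in log.items()
               if accuracy_1(trials, key[q]) >= 0.8) / len(log)

def consistency(log):
    """Eq. (4): mean over questions of c_i / Q_T, where c_i is the count
    of the most frequent (modal) answer to question i across all trials."""
    return sum(Counter(trials).most_common(1)[0][1] / len(trials)
               for trials in log.values()) / len(log)
```

For instance, a question answered "A, A, A, B, A" over the five tests contributes Accuracy1,i = 4/5 and ci/QT = 4/5, regardless of which choice is correct.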

Fig. 10. Accuracy of 12 LLMs on the analysis capability.
Fig. 11. Consistency of 12 LLMs on the questions associated with the analysis capability.

3.2. Performance of large language models in terms of the recall capability

3.2.1. Local accuracy of large language models in terms of the recall capability

The accuracy of the twelve representative LLMs on each question associated with the recall capability is shown in Fig. 6. It shows that GPT-4 and GPT-3.5 are the top performers. GPT-4 performs the best, with the majority of these questions (approximately 72 %) achieving an accuracy of 100 %. GPT-3.5 displays an accuracy of 100 % on 60 % of these questions. The remaining LLMs show significantly lower accuracy on these questions. FastChat-T5 and Alpaca answer 40 % and 32 % of these questions with an accuracy of 100 %, respectively. The percentages of questions of Vicuna, LLaMA, ChatGLM and StableLM-tuned-alpha peak at accuracies between 40 % and 60 %, while those of Koala, ERNIE Bot, Dolly and Oasst-pythia peak at accuracies between 0 % and 20 %.

Fig. 7 shows the accuracy of the 12 LLMs on the recall capability. GPT-4 has the highest accuracy (76 %) on this capability, followed by GPT-3.5 with an accuracy of 68 %. The accuracy of the other LLMs on this capability is significantly lower than that of GPT-4 and GPT-3.5. FastChat-T5 has an accuracy of 56 %. Alpaca presents an accuracy of 38 %, while Vicuna and Koala both achieve an accuracy of 26 %. LLaMA and ChatGLM show similar accuracy (approximately 20 %). ERNIE Bot and Dolly both get an accuracy of 14 %. Oasst-pythia shows an accuracy of 12 %. StableLM-tuned-alpha has the lowest accuracy (8 %). In


Fig. 12. Accuracy of 12 LLMs on the questions associated with the application capability.

summary, GPT-4 and GPT-3.5 are able to correctly answer most questions related to HVAC concepts. The accuracy of the other LLMs is quite low in this case study, indicating that they still lack some specific domain knowledge.

3.2.2. Consistency of large language models in terms of the recall capability

Fig. 8 shows the consistency of the 12 representative LLMs in terms of the recall capability. GPT-4 exhibits the highest consistency of 96 %, closely followed by GPT-3.5 with a consistency of 95 %. The consistency of the other LLMs is significantly lower than those of GPT-4 and GPT-3.5. The consistency of FastChat-T5, Alpaca, ERNIE Bot, ChatGLM, Vicuna, LLaMA, Dolly, Oasst-pythia, Koala and StableLM-tuned-alpha is 83 %, 82 %, 73 %, 71 %, 67 %, 66 %, 64 %, 64 %, 64 %, and 53 %, respectively. It indicates that GPT-4 and GPT-3.5 are highly stable in answering the questions associated with the recall capability, while the answers from the other LLMs are usually unreliable and unstable for these questions.

3.3. Performance of large language models in terms of the analysis capability

3.3.1. Local accuracy of large language models in terms of the analysis capability

The accuracy of the twelve representative LLMs on each question associated with the analysis capability is depicted in Fig. 9. GPT-4 and GPT-3.5 are the top performers. GPT-4 leads with 70 % of these questions achieving an accuracy of 100 %. GPT-3.5 demonstrates an accuracy of 100 % on 63 % of these questions. The percentage of questions of Dolly peaks


Fig. 13. Accuracy of 12 LLMs on the application capability.
Fig. 14. Consistency of 12 LLMs on the questions associated with the application capability.

at an accuracy of 60 %. The percentages of questions of FastChat-T5, Vicuna, LLaMA, Koala, Oasst-pythia, and StableLM-tuned-alpha peak at an accuracy of 40 %. The percentage of questions of ChatGLM peaks at an accuracy of about 20 %. The percentage of questions of ERNIE Bot peaks at an accuracy of 0 %, which is troublingly low. The percentage of questions of Alpaca peaks at accuracies of both 0 % and 100 %.

Fig. 10 shows the accuracy of the 12 LLMs on the analysis capability. GPT-4 has the highest accuracy (nearly 77 %). GPT-3.5 follows with an accuracy of 63 %. There is a noticeable drop in accuracy for the rest of the LLMs. Alpaca and ERNIE Bot both record an accuracy of about 37 %. Vicuna, FastChat-T5 and LLaMA have relatively similar accuracies of about 30 %. ChatGLM, Koala, Dolly, Oasst-pythia, and StableLM-tuned-alpha display quite poor accuracy, ranging from around 7 % to 20 %. In summary, GPT-4 and GPT-3.5 show significantly better analysis capability than the other LLMs.

3.3.2. Consistency of large language models in terms of the analysis capability

Fig. 11 shows the consistency of the twelve representative LLMs in analysis tasks. GPT-4 stands out as the top performer, achieving a consistency rate of 95 %. GPT-3.5 follows closely with a consistency rate of 93 %. Alpaca ranks third with an 85 % consistency rate. ERNIE Bot records a consistency rate of 81 %, outperforming FastChat-T5, which shows a consistency rate of 73 %. The consistency rates of Vicuna and ChatGLM are relatively close at 67 % and 66 % respectively. The rates

Fig. 15. Example about a recall capability question for GPT-4.


Fig. 16. Example about an analysis capability question for GPT-4.

continue to drop slightly, with LLaMA, Dolly, and Koala achieving consistency rates of 63 %, 63 %, and 62 %, respectively. Oasst-pythia shows a consistency rate of 61 %. StableLM-tuned-alpha records the lowest consistency rate among all the models, with a rate of 56 %. Accordingly, GPT-4 and GPT-3.5 are the most consistent performers in analysis tasks among the tested LLMs. The others show lower consistency, indicating a great need for improvement to ensure reliable and consistent performance in analytical tasks.

3.4. Performance of large language models in terms of the application capability

3.4.1. Local accuracy of large language models in terms of the application capability

The accuracy of the twelve representative LLMs on each question associated with the application capability is shown in Fig. 12. GPT-4 and GPT-3.5 are the leading performers among the 12 LLMs. GPT-4 reaches the accuracy of 100 % for approximately 45 % of these questions, while GPT-3.5 shows the accuracy of 100 % for 50 % of these questions. The remaining LLMs show significantly lower accuracy. LLaMA and StableLM-tuned-alpha demonstrate the accuracy of 40 % for about half of these questions. The percentages of questions of Vicuna, ChatGLM, Dolly, and Oasst-pythia peak at the accuracy of 20 %. FastChat-T5, Alpaca, Koala, and Ernie Bot have the accuracy of 0 % for 45 %, 30 %, 30 % and 50 % of these questions, respectively.

Fig. 13 shows the accuracy of the 12 LLMs on the application capability. GPT-4 and GPT-3.5 perform the best, each achieving the accuracy of 55 %. Vicuna obtains the accuracy of 40 %. Alpaca, Ernie Bot, and ChatGLM record the accuracies of 30 %, 25 %, and 25 %, respectively. FastChat-T5 and Koala show similar performances, each attaining the accuracy of 20 %. LLaMA and Dolly present the accuracy of 10 %. Oasst-pythia and StableLM-tuned-alpha show the lowest performance, with the accuracy of 5 %. In summary, GPT-4 and GPT-3.5 demonstrate the highest accuracy on the questions associated with the application capability. As for the other LLMs, they show quite poor capabilities in executing application tasks.

3.4.2. Consistency of large language models in terms of the application capability

Fig. 14 shows the consistency of the 12 representative LLMs in application tasks. GPT-4 ranks as the top performer with a consistency rate of 89 %. GPT-3.5 and Ernie Bot present consistency rates of 84 % and 83 %, respectively. FastChat-T5 shows a rate of 77 %, while Alpaca follows with a consistency rate of 75 %. Oasst-pythia, ChatGLM, Vicuna and Koala have consistency rates of 70 %, 69 %, 69 % and 67 %, respectively. Dolly and LLaMA both show slightly lower performance, with consistency rates of 65 % and 62 %, respectively. StableLM-tuned-alpha has the lowest consistency rate among the evaluated models, recording a rate of 60 %. Accordingly, the results show relatively close consistency amongst the 12 representative LLMs in application tasks. GPT-4 demonstrates the most reliable performance, while the consistency of the other models varies from 60 % to 85 %.

4. Discussions

4.1. Discussions about the knowledge capabilities of LLMs in the HVAC industry

The knowledge capabilities of the twelve representative LLMs are significantly different. GPT-4 and GPT-3.5 both have very strong


Fig. 17. Example about an application capability question for GPT-4.
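The people-plus-area breakdown that the question in Fig. 17 calls for has the general form V = Rp·Pz + Ra·Az. A minimal sketch of that calculation follows; the rates and zone values are illustrative assumptions, not the actual exam figures:

```python
# Breathing-zone outdoor airflow split into a people-based and an
# area-based component (general form: Vbz = Rp*Pz + Ra*Az).
def breathing_zone_airflow(rp_cfm_per_person, pz_people, ra_cfm_per_ft2, az_ft2):
    people_part = rp_cfm_per_person * pz_people  # people-based component
    area_part = ra_cfm_per_ft2 * az_ft2          # area-based component
    return people_part + area_part

# Hypothetical zone: 5 cfm/person, 50 occupants, 0.06 cfm/ft2, 10,000 ft2.
total = breathing_zone_airflow(5, 50, 0.06, 10_000)
print(total)  # 250 + 600 = 850 cfm
```

Keeping the two components explicit, as GPT-4 does in the figure, makes the intermediate steps checkable before they are summed.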

knowledge capabilities (recall, analysis and application), while the other LLMs (ERNIE Bot, LLaMA, Alpaca, Koala, Vicuna, Dolly, Oasst-pythia, FastChat-T5, ChatGLM and StableLM-tuned-alpha) show relatively poor knowledge capabilities. GPT-4 and GPT-3.5 passed the ASHRAE CHD examination, which means that their knowledge capabilities are already better than those of about half of the human examinees, according to the pass rate of the human examinees reported by ASHRAE [45]. Regarding the recall capabilities, GPT-4 consistently demonstrates proficiency. It delivers explanations that are both precise and detailed. As illustrated in Fig. 15, GPT-4 not only accurately identifies the basic types of variable refrigerant flow systems but also elucidates the definition of each type, outlining the reasons why the alternative options are incorrect. This can be attributed to the fact that such questions primarily pertain to conceptual knowledge. Throughout the training phase, the GPT models have extensively learned information from the HVAC domain. Regarding the analysis capabilities, GPT-4 can effectively navigate the nuances of a question and clarify potential misunderstandings of options related to the topic. As shown in Fig. 16, the correct answer, option C, is accurately identified, emphasizing the importance of minimizing simultaneous heating and cooling for efficiency and comfort. The explanations for options A and B are also addressed, pointing out their inaccuracies in the context of HVAC sequencing.

The application capability of GPT-4 is shown in Fig. 17. GPT-4 not only provides the correct answer but also explains the reasoning. The solution is approached in a systematic way, breaking down the total requirement into two separate calculations (people-based and area-based). The calculations are done correctly using the provided rates for both people and area. The intermediate steps are also clearly presented, and the conclusion arrives at the correct answer, option C. Overall, the solution process for this application question is methodical, clear, and accurate. Furthermore, it is discovered that the capabilities of recall and analysis are significantly better than the application capability for GPT-4 and GPT-3.5. It indicates that they have memorized much specific knowledge in this domain and are capable of analyzing the data and information from real-world HVAC systems. However, they still cannot apply this knowledge and information well. More details are discussed in Section 4.2.

The stability of the responses of the twelve LLMs is also explored by using the metric named consistency. It is found that the responses of GPT-4 and GPT-3.5 have quite high stability (higher than 80.0 %). However, the other LLMs have significantly lower stability than GPT-4 and GPT-3.5. It indicates that the knowledge capabilities of GPT-4 and GPT-3.5 are not significantly affected by the randomness of model responses.

Fig. 18. Example about the lack of specific domain knowledge for GPT-4.

4.2. Discussions about the limitations of LLMs in the HVAC industry

Although LLMs have shown powerful knowledge capabilities in the HVAC industry, they still make mistakes sometimes, due to the following three reasons. Take GPT-4 and GPT-3.5 for instance:

• Firstly, GPT-4 and GPT-3.5 still lack some domain knowledge of HVAC systems. For example, as shown in Fig. 18, GPT-4 chooses the correct answer, B, twice but incorrectly opts for C three times when it is


Fig. 19. Example about a wrong reasoning process of GPT-4.

asked to answer why utilizing density for determining glycol concentration is unsatisfactory. However, the understanding of GPT-4 about the peak concentration values of propylene glycol (70 %–75 %) is always wrong in these cases, leading to incorrect options. Specifically, GPT-4 erroneously thinks this value should be within a range between 50 % and 55 %.
• Secondly, GPT-4 and GPT-3.5 show poor reasoning capabilities on some questions. For instance, GPT-4 accurately selects option A twice but mistakenly chooses option B three times when it is asked to determine the type of pump mounted to a horizontal motor (Fig. 19). It is found that GPT-4 often justifies its options with reasons that have no relevance to the selected answer. Specifically, the justification "Frame-mounted, end-suction pumps are typically mounted on a baseplate and can be connected to the motor through a coupling" is insufficient to validate the selection of option B.
• Thirdly, GPT-4 and GPT-3.5 make mistakes in the application of formulas. As shown in Fig. 20, GPT-4 correctly converts units, demonstrating a good grasp of switching between different units of measure. Moreover, GPT-4 correctly identifies the necessary formula to determine the head, but its application is flawed. This indicates that GPT-4 understands the fundamental principles but makes errors in execution, which results in an inconsistency between the result and the given options. In summary, GPT-4 shows understanding of the concepts and principles behind the problem. However, crucial mistakes in the application lead to an incorrect result. Ensuring accurate application of formulas is crucial to arriving at the correct answer in 'application' type problems.

4.3. Future research directions of LLMs in the HVAC industry

4.3.1. LLM-based automated design tools for HVAC systems

LLMs, especially GPT-3.5 and GPT-4, have memorized much knowledge related to the design of HVAC systems. They can also understand and analyze information about the design of HVAC systems. It is also possible for them to apply this knowledge and information to solve some problems (such as equipment selection and ductwork design) regarding the design of HVAC systems. However, this is still not enough for them to be good designers of HVAC systems. In the future, it is suggested to teach them how to use design tools or software in this domain, such as psychrometric chart software, EnergyPlus, etc. This is crucial for making LLMs become digital designers in this domain, which can reduce the design time and improve the level of automation in the design of HVAC systems.

4.3.2. LLM-based automated operation tools for HVAC systems

LLMs should have great potential to become digital experts for the operation and maintenance of HVAC systems. However, it is necessary


Fig. 20. Example about an application capability question for GPT-4.
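The head calculation behind the mistakes illustrated in Fig. 20 has the general form h = Δp/(ρg), where both the unit conversion and the formula execution must be right. A hedged sketch with assumed inputs, not the exam's actual values:

```python
# Convert a pressure difference to pump head: h = dP / (rho * g).
PSI_TO_PA = 6894.757  # 1 psi expressed in pascals
RHO_WATER = 998.0     # kg/m3, water near 20 degC
G = 9.81              # m/s2, standard gravity

def head_m(delta_p_psi):
    delta_p_pa = delta_p_psi * PSI_TO_PA   # unit conversion first
    return delta_p_pa / (RHO_WATER * G)    # then apply h = dP/(rho*g)

print(round(head_m(10.0), 2))  # roughly 7.04 m for a 10 psi difference
```

Separating the conversion step from the formula step, as above, is exactly the discipline the 'application' questions test: a correct formula applied to unconverted numbers still yields a wrong option.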

to further study how to make them read the operational data from real-world HVAC systems automatically and easily, so that they can monitor the operation of HVAC systems. It is also necessary to explore the potential of LLMs in analyzing the massive amounts of operational data of HVAC systems, which can enable them to discover device faults or abnormal operation patterns from these operational data. Moreover, due to the limited number of questions, the evaluation could overlook some dimensions of HVAC operation. In the future, additional questions concerning HVAC system operation could be integrated, including system performance optimization, energy consumption reduction, fault detection and diagnosis, and occupant thermal comfort enhancement.
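As a sketch of the monitoring idea above, operational data could be pre-screened in code so that only suspicious samples are handed to an LLM for interpretation. The variable choice, threshold and readings below are illustrative assumptions, not a prescription:

```python
# Flag anomalous samples in an HVAC time series with a simple z-score
# screen, producing a compact summary an LLM could be asked to interpret.
import statistics

def flag_anomalies(readings, z_threshold=3.0):
    mean = statistics.fmean(readings)
    std = statistics.pstdev(readings)
    if std == 0:
        return []
    return [(i, x) for i, x in enumerate(readings)
            if abs(x - mean) / std > z_threshold]

# Hypothetical chilled-water supply temperatures (degC) with one spike.
temps = [6.8, 7.0, 6.9, 7.1, 7.0, 12.5, 6.9, 7.0]
for idx, value in flag_anomalies(temps, z_threshold=2.0):
    print(f"sample {idx}: {value} degC deviates from normal operation")
```

Such a filter keeps the LLM's context small: instead of the full data stream, the model only receives the flagged excerpts together with a request to diagnose the likely fault.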


4.3.3. Development of customized LLMs for the domain of HVAC

Although LLMs have shown strong knowledge capabilities, they still have some limitations for now (see Section 4.2). The limitations might result from the lack of rich corpora associated with the HVAC industry. Hence, it is of great value to develop specialized corpora for this domain. With the help of these corpora, it is possible to train customized LLMs for this domain. Moreover, it is discovered that many LLMs still have poor knowledge capabilities in this domain, so it might be helpful to improve existing LLMs to enhance their knowledge capabilities. An LLM with a relatively small number of parameters might be better for this domain, considering the costs of model training and deployment. Furthermore, with the development of multimodal technology, more data types could be recognized by LLMs in the future, such as engineering drawings, videos and operation logs. These data would provide more information on various aspects, which would help LLMs establish a more comprehensive understanding of HVAC systems.

4.3.4. Evaluation of LLMs in real cases of HVAC design and operation

While HVAC design and operation rely on domain knowledge, passing the HVAC designer exam does not necessarily mean that LLMs are capable of designing and operating a system from a pragmatic angle. Thus, it is essential to test how LLMs perform in real design and operation cases. For instance, it is worth studying whether LLMs can offer real-time suggestions, predict potential challenges, and provide insights into optimizing designs and operations based on vast amounts of data. Such an evaluation is meaningful to ensure that the designs and operations of LLMs agree with domain common sense.

5. Conclusions

Large language models (LLMs) have shown powerful abilities in solving complex problems like humans in many domains, such as medicine, economics and media. Given the complexity of HVAC systems and the generalist nature of LLMs' training, it is still unknown whether state-of-the-art LLMs master sufficient HVAC knowledge to support the future realization of digital AI experts in the HVAC industry. This paper investigates the knowledge capabilities and skills of LLMs in the HVAC industry. Twelve representative LLMs are selected to take the ASHRAE Certified HVAC Designer certification examination for evaluating the knowledge capabilities of recall, analysis and application.

The results show that the knowledge capabilities of the studied LLMs are significantly different. GPT-4 always passes the ASHRAE Certified HVAC Designer certification examination, with scores between 74 and 78. Its scores are higher than those of about half of the human examinees. GPT-3.5 passes the examination twice out of five attempts. As for the other LLMs, none of them passes the examination. It is further discovered that the capabilities of LLMs are unbalanced. Take GPT-4 and GPT-3.5 for instance: their capabilities of recall and analysis are stronger than their application capability. As for the stability of responses, the results demonstrate that some LLMs, such as GPT-4 and GPT-3.5, can output relatively consistent responses to a question across various chats. This is important for the practical applications of LLMs in this domain, as it can reduce the risk arising from the randomness of model outputs.

Although LLMs have shown great knowledge capabilities in this domain, they still expose some limitations. Take the best-performing GPT-4 and GPT-3.5 for instance: they lack some domain knowledge of HVAC systems, present poor reasoning capabilities on some questions, and make mistakes in the application of formulas. Moreover, four future research directions are proposed in this study. Firstly, it is suggested to teach LLMs how to use design tools or software in the HVAC industry, such as psychrometric chart software, EnergyPlus, etc. Secondly, it is necessary to further study how to make LLMs read and analyze the operational data from real-world HVAC systems automatically and easily, so that they can monitor the operation of HVAC systems. Thirdly, it is of great value to develop specialized corpora for the HVAC industry. With the help of these corpora, it is possible to enhance the knowledge capabilities of LLMs in this domain. Lastly, it is essential to test how the LLMs perform in real design and operation cases. Such an evaluation is meaningful to ensure that the designs and operations of LLMs agree with domain common sense.

Declaration of competing interests

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: the author Prof. Yang Zhao is an Editorial Board Member for Energy and Built Environment and was not involved in the editorial review or the decision to publish this article.

CRediT authorship contribution statement

Jie Lu: Writing – original draft, Validation, Methodology, Conceptualization. Xiangning Tian: Validation, Resources, Investigation. Chaobo Zhang: Writing – original draft, Formal analysis, Data curation. Yang Zhao: Writing – review & editing, Supervision, Resources, Funding acquisition. Jian Zhang: Validation, Investigation, Data curation. Wenkai Zhang: Validation. Chenxin Feng: Investigation. Jianing He: Software. Jiaxi Wang: Investigation. Fengtai He: Investigation.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 52161135202), Hangzhou Key Scientific Research Plan Project (No. 2023SZD0028), the Basic Research Funds for the Central Government ‘Innovative Team of Zhejiang University’ (No. 2022FZZX01-09), and China Scholarship Fund.

References

[1] C. Zhang, J. Li, Y. Zhao, T. Li, Q. Chen, X. Zhang, W. Qiu, Problem of data imbalance in building energy load prediction: concept, influence, and solution, Appl. Energy 297 (2021) 117139, doi:10.1016/j.apenergy.2021.117139.
[2] J. Lu, C. Zhang, J. Li, Y. Zhao, W. Qiu, T. Li, K. Zhou, Graph convolutional networks-based method for estimating design loads of complex buildings in the preliminary design stage, Appl. Energy 322 (2022) 119478, doi:10.1016/j.apenergy.2022.119478.
[3] N. Ma, D. Aviv, H. Guo, W.W. Braham, Measuring the right factors: a review of variables and models for thermal comfort and indoor air quality, Renew. Sustain. Energy Rev. 135 (2021) 110436, doi:10.1016/j.rser.2020.110436.
[4] J. Lu, X. Tian, C. Feng, C. Zhang, Y. Zhao, Y. Zhang, Z. Wang, Clustering compression-based computation-efficient calibration method for digital twin modeling of HVAC system, Build. Simul. (2023), doi:10.1007/s12273-023-0996-2.
[5] C. Zhang, X. Tian, Y. Zhao, J. Lu, Automated machine learning-based building energy load prediction method, J. Build. Eng. 80 (2023) 108071, doi:10.1016/j.jobe.2023.108071.
[6] C. Zhang, Y. Zhao, T. Li, X. Zhang, J. Luo, A comprehensive investigation of knowledge discovered from historical operational data of a typical building energy system, J. Build. Eng. 42 (2021) 102502, doi:10.1016/j.jobe.2021.102502.
[7] D. Lee, S.T. Lee, Artificial intelligence enabled energy-efficient heating, ventilation and air conditioning system: design, analysis and necessary hardware upgrades, Appl. Therm. Eng. 235 (2023) 121253, doi:10.1016/j.applthermaleng.2023.121253.
[8] J. Seo, R. Ooka, J.T. Kim, Y. Nam, Optimization of the HVAC system design to minimize primary energy demand, Energy Build. 76 (2014) 102–108, doi:10.1016/j.enbuild.2014.02.034.
[9] J. Cho, Y. Kim, J. Koo, W. Park, Energy-cost analysis of HVAC system for office buildings: development of a multiple prediction methodology for HVAC system cost estimation, Energy Build. 173 (2018) 562–576, doi:10.1016/j.enbuild.2018.05.019.
[10] M.W. Ellis, E.H. Mathews, Needs and trends in building and HVAC system design tools, Build. Environ. 37 (2002) 461–470, doi:10.1016/S0360-1323(01)00040-3.
[11] P.C. Lee, T.P. Lo, M.Y. Tian, D. Long, An efficient design support system based on automatic rule checking and case-based reasoning, KSCE J. Civ. Eng. 23 (2019) 1952–1962, doi:10.1007/s12205-019-1750-2.
[12] Z. Wang, W. Zhang, H. Fan, C. Zhang, Y. Zhao, Z. Huang, An uncertainty-tolerant robust distributed control strategy for building cooling water systems considering measurement uncertainties, J. Build. Eng. 76 (2023) 107162, doi:10.1016/j.jobe.2023.107162.
[13] Z. Wang, Y. Zhao, C. Zhang, P. Ma, X. Liu, A general multi agent-based distributed framework for optimal control of building HVAC systems, J. Build. Eng. 52 (2022) 104498, doi:10.1016/j.jobe.2022.104498.


[14] A. Afram, F. Janabi-Sharifi, A.S. Fung, K. Raahemifar, Artificial neural network (ANN) based model predictive control (MPC) and optimization of HVAC systems: a state of the art review and case study of a residential HVAC system, Energy Build. 141 (2017) 96–113, doi:10.1016/j.enbuild.2017.02.012.
[15] M. Shultz, S. Zedeck, Predicting lawyer effectiveness: broadening the basis for law, Law Soc. Inq. 36 (2011) 620–661. http://www.jstor.org/stable/23011885.
[16] A. Afram, F. Janabi-Sharifi, Theory and applications of HVAC control systems - a review of model predictive control (MPC), Build. Environ. 72 (2014) 343–355, doi:10.1016/j.buildenv.2013.11.016.
[17] S. Yang, M.P. Wan, W. Chen, B.F. Ng, S. Dubey, Experiment study of machine-learning-based approximate model predictive control for energy-efficient building control, Appl. Energy 288 (2021) 116648, doi:10.1016/j.apenergy.2021.116648.
[18] C. Zhang, X. Tian, Y. Zhao, T. Li, Y. Zhou, X. Zhang, Causal discovery-based external attention in neural networks for accurate and reliable fault detection and diagnosis of building energy systems, Build. Environ. 222 (2022) 109357, doi:10.1016/j.buildenv.2022.109357.
[19] C. Zhang, J. Lu, Y. Zhao, Generative pre-trained transformers (GPT)-based automated data mining for building energy management: advantages, limitations and the future, Energy Built. Environ. 5 (2024) 143–169, doi:10.1016/j.enbenv.2023.06.005.
[20] C. Zhang, J. Zhang, Y. Zhao, J. Lu, Automated data mining framework for building energy conservation aided by generative pre-trained transformers (GPT), Energy Build. 305 (2024) 113877, doi:10.1016/j.enbuild.2023.113877.
[21] A. Koubaa, GPT-4 vs. GPT-3.5: A Concise Showdown, 2023, pp. 1–5, doi:10.20944/preprints202303.0422.v1.
[22] OpenAI, GPT-4 Technical Report, 2023, pp. 1–100. http://arxiv.org/abs/2303.08774.
[23] Baidu, ERNIE Bot, 2023. https://yiyan.baidu.com/welcome.
[24] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and Efficient Foundation Language Models, 2023. http://arxiv.org/abs/2302.13971.
[25] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T.B. Hashimoto, Alpaca: a strong, replicable instruction-following model, Stanford Cent. Res. Found. Model. 3 (2023) 7.
[26] X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine, D. Song, Koala: a dialogue model for academic research, Blog Post (2023).
[27] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J.E. Gonzalez, et al., Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality, 2023 (accessed 14 April 2023).
[28] M. Conover, M. Hayes, A. Mathur, X. Meng, J. Xie, J. Wan, A. Ghodsi, P. Wendell, M. Zaharia, Hello Dolly: Democratizing the Magic of ChatGPT with Open Models, 2023.
[29] Open-Assistant Contributors, Oasst-pythia, 2023. https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b.
[30] LM-SYS, FastChat-T5, 2023. https://github.com/lm-sys/FastChat.
[31] Tsinghua University, ChatGLM, 2023. https://github.com/THUDM/ChatGLM-6B.
[32] Stability AI, StableLM-Tuned-Alpha, 2023. https://huggingface.co/CarperAI/stable-vicuna-13b-delta.
[33] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y.T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M.T. Ribeiro, Y. Zhang, Sparks of Artificial General Intelligence: experiments with an early version of GPT-4, (n.d.).
[34] ASHRAE, Certification Study Guide: Certified HVAC Designer (CHD), 2020.
[35] ASHRAE, ASHRAE, 2023. https://www.ashrae.org/.
[36] OpenAI, OpenAI, 2023. https://openai.com/.
[37] H. Nori, N. King, S.M. McKinney, D. Carignan, E. Horvitz, Capabilities of GPT-4 on Medical Challenge Problems (2023) 1–35. http://arxiv.org/abs/2303.13375.
[38] H. Liu, R. Ning, Z. Teng, J. Liu, Q. Zhou, Y. Zhang, Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4, 2023. http://arxiv.org/abs/2304.03439.
[39] S. Al Ghazali, S. Harous, S. Turaev, Application of artificial neural network to estimate students performance in scholastic assessment test, in: Proceedings - 2022 14th IEEE International Conference on Computational Intelligence and Communication Networks, CICN 2022, 2022, pp. 166–170, doi:10.1109/CICN56167.2022.10008315.
[40] T.K. Odle, J.Y. Bae, M.S. González Canché, The effect of the uniform bar examination on admissions, diversity, affordability, and employment across law schools in the United States, Educ. Eval. Policy Anal. (2022), doi:10.3102/01623737221131803.
[41] Y.C. Chen, Z. Gan, Y. Cheng, J. Liu, J. Liu, Distilling knowledge learned in BERT for text generation, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7893–7905, doi:10.18653/v1/2020.acl-main.705.
[42] T. Wang, L. Zhou, Z. Zhang, Y. Wu, S. Liu, Y. Gaur, Z. Chen, J. Li, F. Wei, VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation, 2023. http://arxiv.org/abs/2305.16107.
[43] R. Puduppully, R. Dabre, A.T. Aw, N.F. Chen, Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models, 2023, p. 5. http://arxiv.org/abs/2305.13085.
[44] ASHRAE, Certified HVAC Designer, 2020. https://www.ashrae.org/professional-development/ashrae-certification/certification-types/chd-certified-hvac-designer.
[45] ASHRAE, Candidate Guidebook: Certified HVAC Designer (CHD), 2023.
