Submissions and Reflections From The 2024 Large Language Model (LLM) Hackathon For Applications in Materials Science and Chemistry
Aswad8 , Jiaru Bai9 , Viktoriia Baibakova10 , Devi Dutta Biswajeet11 , Erik Bitzek7 , Joshua
D. Bocarsly12 , Anna Borisova13 , Andres M Bran13 , L. Catherine Brinson17 , Marcel Moran
Calderon13 , Alessandro Canalicchio14 , Victor Chen15 , Yuan Chiang10,16 , Defne Circi17 ,
Benjamin Charmes9 , Vikrant Chaudhary18,19 , Zizhang Chen20 , Min-Hsueh Chiu21 , Judith
Clymo7 , Kedar Dabhadkar22 , Nathan Daelman18 , Archit Datar74 , Matthew L. Evans23,24 ,
Maryam Ghazizade Fard25 , Giuseppe Fisicaro26 , Abhijeet Sadashiv Gangan27 , Janine
George4,43 , Jose D. Cojal Gonzalez18 , Michael Götte28 , Ankur K. Gupta29 , Hassan Harb20 ,
Pengyu Hong21 , Abdelrahman Ibrahim4 , Ahmed Ilyas18 , Alishba Imran31 , Kevin Ishimwe2 ,
Ramsey Issa33 , Kevin Maik Jablonka4 , Colin Jones2 , Tyler R. Josephson2 , Greg Juhasz34 ,
Sarthak Kapoor18 , Rongda Kang35 , Ghazal Khalighinejad17 , Sartaaj Khan8 , Sascha
Klawohn18 , Suneel Kuman36 , Alvin Noe Ladines18 , Sarom Leang37 , Magdalena
Lederbauer13,38 , Sheng-Lun (Mark) Liao35 , Hao Liu39 , Xuefeng Liu15,73 , Stanley Lo8 ,
Sandeep Madireddy20 , Piyush Ranjan Maharana72 , Shagun Maheshwari40 , Soroush
Mahjoubi3 , José A. Márquez18 , Rob Mills13 , Trupti Mohanty33 , Bernadette Mohr18,41 ,
Seyed Mohamad Moosavi6,8 , Alexander Moßhammer14 , Amirhossein D. Naghdi42 , Aakash
Naik4,43 , Oleksandr Narykov20 , Hampus Näsström18 , Xuan Vu Nguyen44 , Xinyi Ni30 , Dana
O’Connor45 , Teslim Olayiwola46 , Federico Ottomano7 , Aleyna Beste Ozhan3 , Sebastian
Pagel47 , Chiku Parida48 , Jaehee Park15 , Vraj Patel12 , Elena Patyukova7 , Martin Hoffmann
Petersen48 , Luis Pinto49 , José M. Pizarro18 , Dieter Plessers50 , Tapashree Pradhan50 ,
Utkarsh Pratiush51 , Charishma Puli2 , Andrew Qin15 , Mahyar Rajabi8 , Francesco Ricci16 ,
Elliot Risch52 , Martiño Ríos-García4,53 , Aritra Roy71 , Tehseen Rug14 , Hasan M Sayeed33 ,
Markus Scheidgen18 , Mara Schilling-Wilhelmi4 , Marcel Schloz18 , Fabian Schöppach18 , Julia
Schumann18 , Philippe Schwaller13 , Marcus Schwarting15 , Samiha Sharlin2 , Kevin Shen55 ,
Jiale Shi3 , Pradip Si56 , Jennifer D’Souza57 , Taylor Sparks33 , Suraj Sudhakar15 , Leopold
Talirz32 , Dandan Tang58 , Olga Taran59 , Carla Terboven28 , Mark Tropin61 , Anastasiia
Tsymbal62,63 , Katharina Ueltzen43 , Pablo Andres Unzueta64 , Archit Vasan20 , Tirtha
Vinchurkar40 , Trung Vo11 , Gabriel Vogel65 , Christoph Völker14 , Jan Weinreich66 , Faradawn
Yang15 , Mohd Zaki67 , Chi Zhang7 , Sylvester Zhang5 , Weijie Zhang58 , Ruijie Zhu15 , Shang
Zhu69 , Jan Janssen70 , and Ben Blaiszik∗15,20
1 University of the Punjab
2 University of Maryland, Baltimore County
3 Massachusetts Institute of Technology
4 Friedrich-Schiller-Universität Jena
5 McGill University
6 Acceleration Consortium
7 University of Liverpool
8 University of Toronto
9 University of Cambridge
10 University of California at Berkeley
11 University of Illinois at Chicago
12 University of Houston
13 EPFL
14 iteratec GmbH
15 University of Chicago
16 Lawrence Berkeley National Laboratory
17 Duke University
18 Humboldt University of Berlin
19 Technical University of Darmstadt
20 Argonne National Laboratory
21 University of Southern California
22 Lam Research
23 Université catholique de Louvain
24 Matgenix SRL
25 Queen’s University
26 CNR Institute for Microelectronics and Microsystems
27 University of California at Los Angeles
28 Helmholtz-Zentrum Berlin für Materialien und Energie GmbH
29 Soley Therapeutics
30 Brandeis University
31 Kleiner Perkins
32 Schott
33 University of Utah
34 Tokyo Institute of Technology
35 Factorial Energy
36 Molecular Forecaster
37 EP Analytics, Inc.
38 ETH Zurich
39 Fordham University
40 Carnegie Mellon University
41 University of Amsterdam
42 IDEAS NCBR
43 Federal Institute of Materials Research and Testing (BAM)
44 Università degli Studi di Milano
45 Pittsburgh Supercomputing Center
46 Louisiana State University
47 University of Glasgow
48 Technical University of Denmark
49 Independent Researcher
50 KU Leuven
51 University of Tennessee, Knoxville
52 Enterprise Knowledge
53 Instituto de Ciencia y Tecnología del Carbono
54 University of Missouri-Columbia
55 NobleAI
56 University of North Texas
57 Technische Informationsbibliothek
58 University of Virginia
59 University of California at Davis
60 Helmholtz-Zentrum Berlin für Materialien und Energie GmbH
61 Windmill Labs
62 Rutgers University
63 University of Pennsylvania
64 Stanford University
65 Delft University of Technology
66 Quastify GmbH
67 Indian Institute of Technology Delhi
68 RWTH Aachen University
69 University of Michigan-Ann Arbor
70 Max-Planck Institute for Sustainable Materials
71 London South Bank University
72 CSIR-National Chemical Laboratory
73 LinkDot.AI
74 Celanese Corporation
∗ Corresponding author: [email protected]
† These authors also contributed substantially to compiling team results and other paper writing tasks.
Abstract
Here, we present the outcomes from the second Large Language Model (LLM) Hackathon for Appli-
cations in Materials Science and Chemistry, which engaged participants across global hybrid locations,
resulting in 34 team submissions. The submissions spanned seven key application areas and demon-
strated the diverse utility of LLMs for applications in (1) molecular and material property prediction;
(2) molecular and material design; (3) automation and novel interfaces; (4) scientific communication and
education; (5) research data management and automation; (6) hypothesis generation and evaluation; and
(7) knowledge extraction and reasoning from scientific literature. Each team submission is presented in
a summary table with links to the code and as brief papers in the appendix. Beyond team results, we
discuss the hackathon event and its hybrid format, which included physical hubs in Toronto, Montreal,
San Francisco, Berlin, Lausanne, and Tokyo, alongside a global online hub to enable local and virtual
collaboration. Overall, the event highlighted significant improvements in LLM capabilities since the pre-
vious year’s hackathon, suggesting continued expansion of LLMs for applications in materials science and
chemistry research. These outcomes demonstrate the dual utility of LLMs as both multipurpose models for diverse machine learning tasks and as platforms for rapidly prototyping custom applications in scientific research.
Introduction
Science hackathons have emerged as a powerful tool for fostering collaboration, innovation, and
rapid problem-solving in the scientific community [1–3]. By leveraging social media, virtual platforms, and
hybrid event structures, such hackathons can be organized in a cost-effective manner while maximizing
their impact and reach. In this article, we first introduce the project submissions to the second Large
Language Model Hackathon for Applications in Materials Science and Chemistry, detailing the broad classes
of problems addressed by teams, while analyzing trends and patterns in the approaches taken. We then
present each team submission in turn, plus a summary table with the names of team members and links
to code repositories where available. Finally, we include the detailed project documents submitted by each
team, showcasing the depth and breadth of innovation demonstrated during the hackathon.
Overview of Submissions
The hackathon resulted in 34 team submissions (32 of which are included here with written descriptions), categorized as shown in Table 1. From these submissions, we identified seven key application areas:
1. Molecular and Material Property Prediction: Forecasting chemical and physical properties of
molecules and materials using LLMs, particularly excelling in low-data environments and combining
structured/unstructured data.
2. Molecular and Material Design: Generation and optimization of novel molecules and materials
using LLMs, including peptides, metal-organic frameworks, and sustainable construction materials.
3. Automation and Novel Interfaces: Development of natural language interfaces and automated
workflows to simplify complex scientific tasks, making advanced tools and techniques more accessible
to researchers.
4. Scientific Communication and Education: Enhancement of academic communication, automation
of educational content creation, and facilitation of learning in materials science and chemistry.
5. Research Data Management and Automation: Streamlining the handling, organization, and
processing of scientific data through LLM-powered tools and multimodal agents.
6. Hypothesis Generation and Evaluation: Generation, assessment, and validation of scientific hy-
potheses using LLMs, often combining multiple AI agents and statistical approaches.
7. Knowledge Extraction and Reasoning: Extraction of structured information from scientific liter-
ature and sophisticated reasoning about chemical and materials science concepts through knowledge
graphs and multimodal approaches.
We next discuss each application area in more detail and highlight exemplar projects in each.
2. Molecular and Material Design
LLMs have also advanced in molecular and material design, proving capable in both settings [11–15], es-
pecially if pre-trained or fine-tuned with domain-specific data [16]. However, despite these advancements,
LLMs still face limitations in practical applications [17].
Exemplar projects: During the hackathon, teams tackled these challenges through different approaches.
The team behind MC-Peptide (Bran et al.) developed a workflow harnessing LLMs for the design of macro-
cyclic peptides (MCPs). By employing semantic search and constraint-based generation, they automated
the extraction of literature data to propose MCPs with improved permeability, crucial for drug develop-
ment [18, 19]. Meanwhile, the MOF Innovators team (Ansari et al.) focused on metal-organic frameworks
(MOFs). Their AI agent utilized retrieval-augmented generation (RAG) [20] to incorporate design rules
extracted from the literature to optimize MOF properties. In pursuit of sustainable materials, the Green
Construct team (Canalicchio et al.) investigated small-scale LLMs such as Llama 3 [21] and Phi-3 [22] to
streamline the design of alkali-activated concrete. Their models provided insights into reducing emissions of
traditional materials like Portland cement through zero-shot learning.
understand solution steps. Meanwhile, the LLMy Way team (Zhu et al.) focused on simplifying the pro-
cess of creating academic presentations by using LLMs to automatically generate structured slide decks from
research articles. The tool formats summaries of papers into typical sections—such as background, methods,
and results—and transforms them into presentation slides. It offers customization based on audience exper-
tise and time constraints, aiming to make the communication of complex topics more efficient and effective.
Lastly, the WaterLLM team (Baibakova et al.) sought to address water pollution challenges, particularly
in decentralized communities lacking centralized water treatment infrastructure. They developed a chatbot
that uses LLMs enhanced with RAG to suggest optimal water purification methods for microplastic con-
tamination. Grounded in up-to-date scientific literature, the chatbot provides tailored purification protocols
based on contaminant characteristics, resource availability, and cost considerations, promoting effective water
treatment solutions for underserved areas.
information, generate inspirations, and propose hypotheses, which are then evaluated for feasibility, utility,
and novelty. By applying the “Tree of Thoughts” framework [46], this system streamlines the creative process
and improves the quality of scientific hypotheses, as demonstrated through a case study on sustainable
concrete design. Another project, ActiveScience (Chiu), integrated LLMs, knowledge graphs, and RAG
to extract high-level insights from scientific articles. Focused on materials science research, particularly
alloys, this framework organizes extracted information into a Neo4j knowledge graph, allowing for complex
querying and discovery. Additionally, a first-pass peer review system was developed using fine-tuned LLMs,
G-Peer-T (Al-Feghali & Zhang), to assess materials science papers. By analyzing the log probabilities of
abstracts, the system flags works that deviate from typical scientific language patterns, helping to identify
both highly innovative and potentially nonsensical research. Preliminary findings suggest that highly cited
papers tend to use less typical language, highlighting the potential for LLMs to support the peer review
process and detect outlier research.
Figure 1: LLM Hackathon for Applications in Materials and Chemistry hybrid hackathon. Researchers were
able to participate from both remote and in-person locations (purple pins).
multiple continents (Figure 1). The event was a follow-up to the previous hackathon, described in detail in ref. [53]. The event began with a kickoff panel featuring leading researchers from academia and industry, including Elsa Olivetti (MIT), Jon Reifschneider (Duke), Michael Craig (Valence Laboratories), and Marwin Segler (Microsoft). The charge of the hackathon was intentionally open-ended: to explore the vast potential application space and create tangible demonstrations of the most innovative, impactful, and scalable solutions in a constrained time, using open-source and best-in-class multimodal models applied to problems in materials science and chemistry.
Event registration totaled 556 participants, more than 120 of whom formed the 34 teams that submitted completed projects. The hybrid format proved particularly successful, with physical hub locations in Toronto, Montreal, San Francisco, Berlin, Lausanne, and Tokyo facilitating local collaboration while maintaining global connectivity across the physical hubs and remote participants through virtual platforms (Figure 1). This distributed approach enabled researchers to participate from a local site or remotely from anywhere on Earth, blending the strengths of in-person events with the flexibility of remote participation. The result was an inclusive event with teams that crossed international and institutional boundaries, the projects submitted in this paper, and the growth of a new persistent online community of 483 researchers on Slack.
Conclusion
The LLM Hackathon for Applications in Materials Science and Chemistry has demonstrated the dual utility
and immense promise of large language models serving as 1) multipurpose models for diverse machine
learning tasks and 2) platforms for rapid prototyping. Participants effectively utilized LLMs to tackle specific challenges while rapidly evaluating their ideas over a short 24-hour period, showcasing the ability of these models to enhance the efficiency and creativity of research processes in highly diverse ways. Notably, many projects benefited from significant advancements in LLM performance since last year's hackathon: performance across the diverse application space improved simply through the release of new versions of Gemini, ChatGPT, Claude, Llama, and other models. If this trend continues, we expect to see
even broader applications in subsequent hackathons, and in materials science and chemistry more generally.
Additionally, the hackathon's hybrid format has proven effective in creating new scientific collabora-
tions and communities. By uniting individuals from various backgrounds and areas of expertise, these events
facilitate knowledge exchange and promote interdisciplinary approaches, which are essential for advancing
research in this rapidly evolving field.
As the integration of LLMs continues to expand, collaborative initiatives like hackathons will play a
critical role in driving innovation and addressing complex challenges in chemistry, materials science, and
beyond. The outcomes from this event highlight the significance of leveraging LLMs for their adaptability
and their potential to accelerate the development of new concepts and applications.
Table 1: Overview of the tools developed by the various teams, and links to source code repositories. Full
descriptions of the projects can be found in the appendix.
Project | Authors | Links
Leveraging AI Agents for Designing Low Band Gap Metal-Organic Frameworks | Sartaaj Khan, Mahyar Rajabi, Amro Aswad, Seyed Mohamad Moosavi, Mehrad Ansari | GitHub
How Low Can You Go? Leveraging Small LLMs for Material Design | Alessandro Canalicchio, Alexander Moßhammer, Tehseen Rug, Christoph Völker | GitHub
Acknowledgments
Planning for this event was supported by NSF Awards #2226419 and #2209892. We would like to thank event
sponsors who provided platform credits and prizes for teams, including RadicalAI, Iteratec, Reincarnate,
Acceleration Consortium, and Neo4j.
References
[1] A. Nolte, L. B. Hayden and J. D. Herbsleb, Proc. ACM Hum.-Comput. Interact., 2020, 4, 1–23. https://doi.org/10.1145/3392830.
[2] E. P. P. Pe-Than and J. D. Herbsleb, in Lecture Notes in Computer Science, Springer, 2019, vol. 11546, pp. 27–37. https://doi.org/10.1007/978-3-030-15742-5_3.
[3] B. Heller, A. Amir, R. Waxman and Y. Maaravi, J. Innov. Entrep., 2023, 12, 1. https://doi.org/10.1186/s13731-023-00269-0.
[4] K. M. Jablonka, P. Schwaller, A. Ortega-Guerrero and B. Smit, Nat. Mach. Intell., 2024, 6, 161–169. https://doi.org/10.1038/s42256-023-00788-1.
[5] C. Qian, H. Tang, Z. Yang, H. Liang and Y. Liu, arXiv, 2023. https://arxiv.org/abs/2307.07443.
[6] R. Jacobs, M. P. Polak, L. E. Schultz, H. Mahdavi, V. Honavar and D. Morgan, arXiv, 2024. https://arxiv.org/abs/2409.06080.
[7] R. Vacareanu, V. A. Negru, V. Suciu and M. Surdeanu, in First Conference on Language Modeling, 2024. https://openreview.net/forum?id=LzpaUxcNFK.
[8] A. K. Gupta and K. Raghavachari, J. Chem. Theory Comput., 2022, 18, 2132–2143. https://doi.org/10.1021/acs.jctc.1c00504.
[9] S. Chithrananda, G. Grand and B. Ramsundar, arXiv, 2020. https://arxiv.org/abs/2010.09885.
[10] J. Lu and Y. Zhang, J. Chem. Inf. Model., 2022, 62, 1376–1387. https://doi.org/10.1021/acs.jcim.1c01467.
[11] D. Bhattacharya, H. J. Cassady, M. A. Hickner and W. F. Reinhart, J. Chem. Inf. Model., 2024, 64, 7086–7096. https://doi.org/10.1021/acs.jcim.4c01396.
[12] G. Liu, M. Sun, W. Matusik, M. Jiang and J. Chen, arXiv, 2024. https://arxiv.org/abs/2410.04223.
[13] S. Jia, C. Zhang and V. Fung, arXiv, 2024. https://arxiv.org/abs/2406.13163.
[14] H. Jang, Y. Jang, J. Kim and S. Ahn, arXiv, 2024. https://arxiv.org/abs/2410.03138.
[15] J. Lu, Z. Song, Q. Zhao, Y. Du, Y. Cao, H. Jia and C. Duan, arXiv, 2024. https://arxiv.org/abs/2410.18136.
[16] A. Kristiadi et al., in Proceedings of the 41st International Conference on Machine Learning, PMLR, 2024, vol. 235, pp. 25603–25622. https://proceedings.mlr.press/v235/kristiadi24a.html.
[17] S. Miret and N. M. A. Krishnan, arXiv, 2024. https://arxiv.org/abs/2402.05200.
[18] X. Ji, A. L. Nielsen and C. Heinis, Angew. Chem. Int. Ed., 2023, 63, 3. https://doi.org/10.1002/anie.202308251.
[19] M. L. Merz et al., Nat. Chem. Biol., 2023, 20, 624–633. https://doi.org/10.1038/s41589-023-01496-y.
[20] P. Lewis et al., arXiv, 2020. https://arxiv.org/abs/2005.11401.
[21] A. Dubey et al., arXiv, 2024. https://arxiv.org/abs/2407.21783.
[22] M. Abdin et al., arXiv, 2024. https://arxiv.org/abs/2404.14219.
[23] A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White and P. Schwaller, arXiv, 2023. https://arxiv.org/abs/2304.05376.
[24] Y. Song et al., arXiv, 2023. https://arxiv.org/abs/2306.06624.
[25] H. Zhang, Y. Song, Z. Hou, S. Miret and B. Liu, arXiv, 2024. https://arxiv.org/abs/2409.00135.
[26] D. A. Boiko, R. MacKnight, B. Kline and G. Gomes, Nature, 2023, 624, 570–578. https://doi.org/10.1038/s41586-023-06792-0.
[27] K. Darvish et al., arXiv, 2024. https://arxiv.org/abs/2401.06949.
[28] G. Tom et al., Chem. Rev., 2024, 124, 9633–9732. https://doi.org/10.1021/acs.chemrev.4c00055.
[29] H. Chase, LangChain, 2024. https://github.com/langchain-ai/langchain.
[30] RDKit: Open-source cheminformatics, http://www.rdkit.org.
[31] L. Yan et al., Br. J. Educ. Technol., 2023, 55, 90–112. https://doi.org/10.1111/bjet.13370.
[32] S. Wang et al., arXiv, 2024. https://arxiv.org/abs/2403.18105.
[33] E. Kasneci et al., Learn. Individ. Differ., 2023, 103, 102274. https://doi.org/10.1016/j.lindif.2023.102274.
[42] S. Tong, K. Mao, Z. Huang, Y. Zhao and K. Peng, Humanit. Soc. Sci. Commun., 2024, 11, 1. https://doi.org/10.1057/s41599-024-03407-5.
[43] I. Ciucă, Y.-S. Ting, S. Kruk and K. Iyer, arXiv, 2023. https://arxiv.org/abs/2306.11648.
[44] Q. Liu et al., arXiv, 2024. https://arxiv.org/abs/2409.06756.
Appendix: Individual Project Reports
Table of Contents
5 LLMSpectrometry
8 How Low Can You Go? Leveraging Small LLMs for Material Design
9 LangSim
11 T2Dllama: Harnessing Language Model for Density Functional Theory (DFT) Parameter Suggestion
15 LLMy-Way
16 WaterLLM: Creating a Custom ChatGPT for Water Purification Using Prompt-Engineering Techniques
18 LLMads
22 Multi-Agent Hypothesis Generation and Verification through Tree of Thoughts and Retrieval Augmented Generation
23 ActiveScience
28 Ontosynthesis
30 Synthetic Data Generation and Insightful Machine Learning for High Entropy Alloy Hydrides
31 Chemsense: Are large language models aligned with human chemical preference?
32 GlossaGen
1 Leveraging Orbital-Based Bonding Analysis Information in LLMs
Authors: Katharina Ueltzen, Aakash Naik, Janine George
LLMs were recently demonstrated to perform well for materials property prediction, especially in the low-
data limit [1,2]. The Learning LOBSTERs team fine-tuned multiple Llama 3 models on textual descriptions
of 1264 crystal structures to predict the highest-frequency peak in their phonon density of states (DOS) [3,4].
This target is relevant to the thermal properties of materials, and the dataset is part of the MatBench benchmark project [3, 4].
The text descriptions were generated using two packages: the Robocrystallographer package [5] generates
descriptions of structural features like bond lengths, coordination polyhedra, or structure type. It has recently
emerged as a popular tool for materials property prediction models that require text input [6–8]. Further,
text descriptions of orbital-based bonding analyses containing information on covalent bond strengths or
antibonding states were generated using LobsterPy [9]. The data used here is available on Zenodo [10] and was generated as part of our previous study, in which the importance of such bonding information for the same target was demonstrated via a random forest (RF) model [10].
Figure 2: Schematic depicting the prompt for fine-tuning the LLM with Alpaca prompt format.
In the hackathon, one Llama model was fine-tuned with the Alpaca prompt format using both Robocrys-
tallographer and LobsterPy text descriptions, and another one using solely Robocrystallographer input.
Figure 2 depicts the prompt used to fine-tune an LLM to predict the last phonon DOS peak. The train/test/
validation split was 0.64/0.2/0.16. The models were trained for 10 epochs with a validation step after each
epoch. The textual output was converted back into numerical frequency values for the computation of MAEs
and RMSEs. Our results show that including bonding-based information improved the model’s prediction.
The results also corroborate our previous finding that quantum-chemical bond strengths are relevant for this
particular target property [10]. Both model MAEs (Robocrystallographer: 44 cm⁻¹, Robocrystallographer+LobsterPy: 38 cm⁻¹) are comparable to other models in the MatBench test suite, whose MAEs range from 29 cm⁻¹ to 68 cm⁻¹ at the time of writing [11]. However, due to the time constraints of the hackathon, no
five-fold cross-validation was implemented for our model.
Although the preliminary results seem very promising, the models have not yet been exhaustively analyzed
or improved. As the prediction of a numerical value and not its text embedding is of interest to our task,
further model adaptation might be beneficial. For example, Rubungo et al. [7] modified T5 [12], an encoder-
decoder model, for regression tasks by removing its decoder and adding a linear layer on top of its encoder.
Halving the number of model parameters allowed them to fine-tune on longer input sequences, improving
model performance [7].
Easy-to-use packages like Unsloth [13] made it possible to fine-tune an LLM on our materials data for property prediction with very limited resources and time.
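To illustrate the approach, the following is a minimal sketch of such a fine-tuning run with Unsloth; the base checkpoint, LoRA hyperparameters, prompt wording, and field names are illustrative assumptions rather than the team's exact configuration.

```python
# Sketch: LoRA fine-tuning of a Llama 3 model on structure descriptions to
# predict the last phonon DOS peak. Checkpoint, LoRA settings, and the
# Alpaca-style template below are illustrative assumptions.
from unsloth import FastLanguageModel
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # hypothetical base model
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

ALPACA = (
    "Below is an instruction that describes a task, paired with an input.\n\n"
    "### Instruction:\nPredict the last phonon DOS peak (in cm^-1).\n\n"
    "### Input:\n{description}\n\n### Response:\n{target}"
)

def to_prompt(example):
    # Robocrystallographer (+ optionally LobsterPy) text plus the target value.
    return {"text": ALPACA.format(**example)}

train = Dataset.from_dict({
    "description": ["<Robocrystallographer and LobsterPy description here>"],
    "target": ["615"],
}).map(to_prompt)

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=train,
    dataset_text_field="text", max_seq_length=2048,
    args=TrainingArguments(output_dir="out", num_train_epochs=10,
                           per_device_train_batch_size=2),
)
trainer.train()
# At inference, the generated response is parsed back to a number for MAE/RMSE.
```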
References
[1] K. M. Jablonka, P. Schwaller, A. Ortega-Guerrero, B. Smit, Nat Mach Intell, 2024, 6, 161–169.
[11] The Matbench Test Suite, Phonon dataset as of 12.07.2024, https://matbench.materialsproject.org/Leaderboards.
[12] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Journal of
Machine Learning Research, 2020, 21, 1–67.
2 Context-Enhanced Material Property Prediction (CEMPP)
Authors: Federico Ottomano, Elena Patyukova, Judith Clymo, Dmytro Antypov, Chi Zhang,
Aritra Roy, Piyush Ranjan Maharana, Weijie Zhang, Xuefeng Liu, Erik Bitzek
2.1 Introduction
The Liverpool Materials team sought to improve composition-based property prediction models for novel
materials by providing natural language input alongside the chemical composition of interest. In doing
so, we leverage the ability of large language models (LLMs) to capture the broader context relevant to
the composition. We demonstrate that enriching materials representation can be beneficial where training
data is limited. Code to reproduce the experiments described below is available at https://github.com/
fedeotto/lpoolmat_llms.
Our experiments are based on Roost [1], a deep neural network for predicting inorganic material properties
from their composition. Roost consists of a graph attention network that creates an embedding of the
input composition (which we will enrich with context embedding), and a residual network that acts on this
embedding to predict the target property value (Figure 3, top).
Figure 3: Model architecture and the schema of the second experiment. Material composition is encoded
with Roost encoder, additional information extracted from cited paper with Llama3, and encoded with
Mat(Sci)Bert. Composition and LLM embeddings are aggregated and passed through the Residual Net
projection head to predict the property. At the inference stage, the average LLM embedding from 5 nearest
neighbors in composition space is taken. Results show MAE for adding different types of context (top to
bottom): adding random context; not adding context; adding consistently structured context for chemical
and structural family (data extracted by humans); adding automatically extracted context for experimental
conditions and structure-property relationship.
2.2 Experiment 1: Using latent knowledge
We prompt two LLMs trained on materials science literature, MatBert [2] and MatSciBert [3], to directly
attempt the prediction task ("What is the property of material?"). The embedding of the response and
Roost’s composition embedding are aggregated (via summation or concatenation) and passed to the residual
network.
We consider two datasets: matbench-perovskites [4] has a target property of formation energy and Li-
ion conductors [5] dataset has a target property of ionic conductivity. The former has low stoichiometric
diversity in the inputs (all entries have a general formula of ABX3) and the latter is limited in size (only 403
distinct compositions), making the prediction tasks especially challenging. We observe a 26-32% decrease
in mean absolute error (MAE) for all four settings tested (two LLMs and two aggregation schemes) in the
Li-ion conductors task, and a 2-4% decrease in MAE in the matbench-perovskites task.
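As a rough sketch of the fusion step, the module below aggregates a composition embedding with an LLM context embedding by concatenation or summation before a small regression head; the dimensions and the simple MLP are illustrative stand-ins for Roost's actual encoder and residual network.

```python
# Sketch: fuse a Roost-style composition embedding with an LLM context
# embedding. Dimensions and the MLP head are illustrative assumptions.
import torch
import torch.nn as nn

class ContextFusionHead(nn.Module):
    def __init__(self, comp_dim=128, ctx_dim=768, mode="concat"):
        super().__init__()
        self.mode = mode
        if mode == "sum":
            # Project the LLM embedding down to the composition size first.
            self.proj = nn.Linear(ctx_dim, comp_dim)
            in_dim = comp_dim
        else:
            in_dim = comp_dim + ctx_dim
        self.head = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                  nn.Linear(256, 1))

    def forward(self, comp_emb, ctx_emb):
        if self.mode == "sum":
            z = comp_emb + self.proj(ctx_emb)
        else:
            z = torch.cat([comp_emb, ctx_emb], dim=-1)
        return self.head(z)

# comp_emb would come from Roost's graph attention encoder; ctx_emb from
# MatBERT/MatSciBERT run on the prompt or extracted paper context.
pred = ContextFusionHead()(torch.randn(4, 128), torch.randn(4, 768))
```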
References
[1] R. Goodall, A. J. et al., “Predicting materials properties without crystal structure: deep representation
learning from stoichiometry”, Nature Communications, vol. 11, no. 6280, 2020.
[2] A. Trewartha, et al., "Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science", Patterns, vol. 3, no. 8, 2022.
[3] V. Gupta, et al., "MatSciBERT: A materials domain language model for text mining and information extraction", npj Computational Materials, vol. 8, no. 1, 2022.
[4] Matbench Perovskites dataset provided by the Materials Project, https://ml.materialsproject.org/projects/matbench_perovskites.json.gz.
[5] Li-ion Conductors Database, https://pcwww.liv.ac.uk/~msd30/lmds/LiIonDatabase.html.
[6] PyMuPDF, https://pypi.org/project/PyMuPDF.
[7] AI@Meta, Llama 3 Model Card, 2024, https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[8] Hargreaves et al., "The Earth Mover's Distance as a Metric for the Space of Inorganic Compositions", Chem. Mater., 2020.
[9] Hargreaves et al., "A Database of Experimentally Measured Lithium Solid Electrolyte Conductivities Evaluated with Machine Learning", npj Computational Materials, 2022.
3 MolFoundation: Benchmarking Chemistry LLMs on Predictive
Tasks
Authors: Hassan Harb, Xuefeng Liu, Anastasiia Tsymbal, Oleksandr Narykov, Dana O’Connor,
Shagun Maheshwari, Stanley Lo, Archit Vasan, Zartashia Afzal, Kevin Shen
3.1 Summary
The MolFoundation team is focused on enhancing the prediction capabilities of pre-trained large language
models (LLMs) such as ChemBERTa and T5-Chem for specific molecular properties, utilizing the QM9
database. Targeting properties like dipole moment and zero-point vibrational energy (ZPVE), our approach
involves retraining the last layer of these models with a selected dataset and fine-tuning the embeddings to
refine accuracy. After making predictions and evaluating performance, our results indicate that T5-Chem
outperforms ChemBERTa. Additionally, we found little difference between fine-tuned and pre-trained results,
suggesting that the computationally expensive task of finetuning may be avoided.
Figure 4
3.2 Methods
The models are downloaded from Hugging Face (ChemBERTa and T5-Chem). Using the provided code, we
can tokenize our datasets (QM9). The datasets must contain SMILES and we checked that the tokenizer
had all the necessary tokens for our datasets. To make the LLMs compatible with the regression tasks in the QM9 dataset, we froze the LLM embeddings and fine-tuned only the regression layer. Training the full LLM end-to-end was infeasible given our resources, so training a single linear layer was much more efficient.
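A minimal sketch of this frozen-encoder setup is shown below; the ChemBERTa checkpoint name, mean pooling, and optimizer settings are illustrative assumptions.

```python
# Sketch: freeze a pretrained chemistry LLM and train only a linear
# regression head on top of it. Checkpoint and pooling are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
encoder = AutoModel.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
for p in encoder.parameters():
    p.requires_grad = False  # frozen embeddings

head = nn.Linear(encoder.config.hidden_size, 1)  # the only trainable layer
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def predict(smiles):
    batch = tokenizer(smiles, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (B, L, H)
    pooled = hidden.mean(dim=1)  # simple mean pooling over tokens
    return head(pooled).squeeze(-1)

# One training step on a toy batch (SMILES -> ZPVE targets).
targets = torch.tensor([0.112, 0.097])
loss = loss_fn(predict(["CCO", "C1=CC=CC=C1"]), targets)
loss.backward()
opt.step()
```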
3.3 Results
We compare the out-of-the-box (pre-trained) and fine-tuned ChemBERTa and T5-Chem models in predicting the zero-point vibrational energy (ZPVE) of all molecules in the QM9 dataset. We hypothesized that the fine-tuned models would perform better. However, across all models, there is no significant improvement in the fine-tuned models as measured by R² (Figure 6, Figure 7). We noticed that the LLMs required approximately 100K datapoints to show improvements in modeling performance, indicating a saturation regime for the models. Lastly, the T5-Chem model performs significantly better than ChemBERTa.
Figure 5
Figure 6: Model comparison of the pre-trained and fine-tuned T5-Chem on zero-point vibrational energy
(ZPVE).
Figure 7: Model comparison of the pre-trained and fine-tuned ChemBERTa on zero-point vibrational energy
(ZPVE).
Future work will explore different datasets, models, and fine-tuning strategies (i.e., end-to-end retraining, multi-task fine-tuning) to determine the efficacy of LLMs on chemistry prediction tasks.
MolFoundation Github: https://github.com/shagunm1210/MolFoundation
References
[1] Kristiadi, A.; Strieth-Kalthoff, F.; Skreta, M.; Poupart, P.; Aspuru-Guzik, A.; Pleiss, G. A Sober Look
at LLMs for Material Discovery: Are They Actually Good for Bayesian Optimization Over Molecules?
arXiv May 28, 2024. http://arxiv.org/abs/2402.05015 (accessed 2024-06-29).
4 3D Molecular Feature Vectors for Large Language Models
Authors: Jan Weinreich, Ankur K. Gupta, Amirhossein D. Naghdi, Alishba Imran
Link to code repo: https://github.com/janweinreich/geometric-geniuses
Direct link to tutorial: https://github.com/janweinreich/geometric-geniuses/blob/main/tutorial.
ipynb
Accurate chemical property prediction is a central goal in computational chemistry and materials science.
While quantum chemistry methods offer high precision, they often suffer from computational bottlenecks.
Large language models (LLMs) have shown promise as a computationally efficient alternative [1]. However,
common string-based molecular representations like SMILES and SELFIES, despite their success in LLM ap-
plications, inherently lack 3D geometric information. This limitation hinders their applicability in predicting
properties for different conformations of the same molecule - a capability essential for practical applications
such as crystal structure prediction and molecular dynamics simulations. Moreover, recent studies have
demonstrated that naive encoding of geometric information as numerical data can negatively impact LLM
prediction accuracy [2].
Molecular and materials property prediction typically leverages engineered feature vectors, such as those
utilized in quantitative structure-activity relationship (QSAR) models. In contrast, physics-based represen-
tations, which center on molecular geometry due to their direct relevance to the Schrödinger equation, have
demonstrated efficacy in various deep learning architectures [3–5]. This research investigates new strategies
for encoding 3D molecular geometry for LLMs. We hypothesize that augmenting the simple SMILES rep-
resentation with geometric features could enable the integration of complementary information from both
modalities, ultimately enhancing the predictive power of LLMs in the context of molecular properties.
Figure 8: Schematic representation of the training process for a regression model illustrating the novel
string-based encoding of 3D molecular geometry for LLM-based energy prediction. The workflow involves (1)
Computation of high-dimensional feature vectors representing the 3D molecular geometry of each conformer.
(2) Dimensionality reduction to obtain a more compact representation. (3) Conversion of the reduced vectors
into a histogram, where unique string characters are assigned to each bin. (4) Input of the resulting set of
strings (one per conformer) into an LLM for energy prediction.
Figure 9: Performance of an LLM in predicting total energies of benzene and ethanol structures, where the
model was trained on a large dataset of MD-generated configurations. Each point represents a different
molecular structure sampled from MD simulations [7].
The final step involves randomly partitioning the dataset into 80% training and 20% testing sets. The
string-based representations are then employed to train a RoBERTa model augmented with a regression head.
In Figure 9, we showcase scatter plots illustrating the predicted versus true total energies for benzene (trained
for 4 epochs) and ethanol (trained for 20 epochs) using a dataset of 80,000 molecular configurations for each
molecule. Our results do not yet attain the accuracy levels of state-of-the-art equivariant neural networks
like MACE, which reports a mean absolute error (MAE) of 0.009 kcal/mol for benzene [10]. Nonetheless, it
is important to underscore that this represents a novel capability for LLMs, which were previously unable
to process and predict properties of 3D molecular structures differing only by bond rotations. This initial investigation paves the way for advancements through the
exploration of alternative string encoding schemes of numerical vectors in combination with larger LLMs.
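The following sketch illustrates the four-step encoding of Figure 8 under simplifying assumptions: random stand-in descriptors, PCA for dimensionality reduction, and a fixed 32-character alphabet.

```python
# Sketch of the Figure 8 workflow: 3D descriptors -> PCA -> binned values ->
# string. Descriptor choice, dimensions, and bin count are assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for high-dimensional geometric descriptors of N conformers (step 1).
features = rng.normal(size=(1000, 512))

# Step 2: dimensionality reduction to a compact representation.
reduced = PCA(n_components=32).fit_transform(features)

# Steps 3-4: bin each component and map every bin to a unique character,
# yielding one string per conformer that an LLM tokenizer can ingest.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef"  # 32 bins -> 32 symbols
edges = np.linspace(reduced.min(), reduced.max(), num=len(ALPHABET) + 1)

def to_string(vec):
    bins = np.clip(np.digitize(vec, edges) - 1, 0, len(ALPHABET) - 1)
    return "".join(ALPHABET[b] for b in bins)

strings = [to_string(v) for v in reduced]
print(strings[0])  # one symbolic string per conformer, paired with SMILES
```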
Acknowledgements
A.K.G. and W.A.D. acknowledge funding for this project from the U.S. Department of Energy (DOE), Office
of Science, Office of Basic Energy Sciences, through the Rare Earth Project in the Separations Program at
Lawrence Berkeley National Laboratory under Contract DE-AC02-05CH11231.
J.W. thanks EPFL for computational resources and NCCR Catalysis (grant number 180544), a National
Centre of Competence in Research funded by the Swiss National Science Foundation for funding as well as
the Laboratory for Computational Molecular Design.
References
[1] Jablonka, K. M., Ai, Q., Al-Feghali, A., Badhwar, S., Bocarsly, J. D., Bran, A. M., ... & Blaiszik, B.
(2023). 14 examples of how LLMs can transform materials science and chemistry: a reflection on a large
language model hackathon. Digital Discovery, 2(5), 1233-1250.
[2] Alampara, N., Miret, S., & Jablonka, K. M. (2024). MatText: Do Language Models Need More than
Text & Scale for Materials Modeling? arXiv preprint arXiv:2406.17295.
[3] Batzner, S., Musaelian, A., Sun, L., Geiger, M., Mailoa, J. P., Kornbluth, M., ... & Kozinsky, B. (2022).
E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nature
Communications, 13(1), 2453.
[4] Gupta, A. K., & Raghavachari, K. (2022). Three-dimensional convolutional neural networks utilizing
molecular topological features for accurate atomization energy predictions. Journal of Chemical Theory
and Computation, 18(4), 2132-2143.
[5] In-Context Learning of Physical Properties: Few-Shot Adaptation to Out-of-Distribution Molecular
Graphs. Grzegorz Kaszuba, Amirhossein D. Naghdi, Dario Massa, Stefanos Papanikolaou, Andrzej
Jaszkiewicz, Piotr Sankowski, https://arxiv.org/abs/2406.01808.
[6] Khan, D., Heinen, S., & von Lilienfeld, O. A. (2023). Kernel based quantum machine learning at record
rate: Many-body distribution functionals as compact representations. Journal of Chemical Physics,
159(034106).
[7] Chmiela, S., Vassilev-Galindo, V., Unke, O. T., Kabylda, A., Sauceda, H. E., Tkatchenko, A., & Müller,
K. R. (2023). Accurate global machine learning force fields for molecules with hundreds of atoms. Science
Advances, 9(2), https://doi.org/10.1126/sciadv.adf0873.
[8] Bowman, J. M., Qu, C., Conte, R., Nandi, A., Houston, P. L., & Yu, Q. (2022). The MD17 datasets
from the perspective of datasets for gas-phase “small” molecule potentials. The Journal of chemical
physics, 156(24).
[9] Weinreich, J., & Probst, D. (2023). Parameter-Free Molecular Classification and Regression with Gzip.
ChemRxiv.
[10] Batatia, I., Kovacs, D. P., Simm, G., Ortner, C., & Csányi, G. (2022). MACE: Higher order equivariant
message passing neural networks for fast and accurate force fields. Advances in Neural Information
Processing Systems, 35, 11423-11436.
5 LLMSpectrometry
Authors: Tyler Josephson, Fariha Agbere, Kevin Ishimwe, Colin Jones, Charishma Puli,
Samiha Sharlin, Hao Liu
5.1 Introduction
Nuclear Magnetic Resonance (NMR) spectroscopy is a chemical characterization technique that uses oscil-
lating magnetic fields to characterize the structure of molecules. Different atoms in molecules resonate at
different frequencies based on their local chemical environments. The resulting NMR spectrum can be used
to infer interactions between particular atoms in a molecule and determine the molecule’s entire structure.
Solving NMR spectral tasks is critical, as multiple aspects, such as the number, intensity, and shape of
signals, as well as chemical shifts, need to be considered.
Machine learning tools, including the Molecular Transformer [1], have been used to learn a function to
map spectrum to structure, but these require thousands to millions of labeled examples [2], far more than
what humans typically encounter when learning to assign spectra. In contrast, we recognize NMR spectral
tasks as being fundamentally about multi-step reasoning, and we aim to explore the reasoning capabilities
of large language models (LLMs) for solving this task.
The project aims to investigate the capabilities of Large Language Models (LLMs), specifically GPT-4,
for NMR structure determination. In particular, we were interested in evaluating whether GPT-4 could use
chain-of-thought reasoning [3] with a scratchpad to evaluate the components of the spectra and synthesize
an answer in the form of a molecule. Interpreting NMR data is crucial for organic chemists; an AI-assisted
tool could benefit the pharmaceutical or food industry, as well as forensic science, medicine, research, and
teaching.
5.2 Method
We manually gathered 19 experimental ¹H NMR spectra from the Spectral Database for Organic Compounds
(SDBS) website [4]. We then combined Python scripts, LangChain, and API calls to GPT-4 to automate
structure elucidation. The components of the model are shown in Figure 10. First, NMR peak data and
the chemical formula were formatted as text and inserted into a prompt. This prompt instructs the LLM to
reason about the data step-by-step while using a scratchpad to record its “thoughts,” then report its answer
according to an output template. We then used Python to parse the output and compare the answer to the
true answer by matching with chemical names from the NIST Chemistry WebBook.
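A simplified sketch of this loop is given below, using the OpenAI Python client directly rather than LangChain; the prompt wording and answer template are illustrative assumptions.

```python
# Sketch: format NMR peaks into a scratchpad prompt and query GPT-4.
# Prompt wording and the output template are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def elucidate(formula, peaks):
    peak_text = "; ".join(f"{shift} ppm (rel. intensity {i})" for shift, i in peaks)
    prompt = (
        "You are an expert in 1H NMR interpretation.\n"
        f"Chemical formula: {formula}\nPeaks: {peak_text}\n"
        "Reason step by step in a SCRATCHPAD section, then end with\n"
        "FINAL ANSWER: <compound name>"
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # Parse the templated final line for comparison with the true compound.
    return reply.split("FINAL ANSWER:")[-1].strip()

print(elucidate("C2H6O", [(1.2, 3), (2.6, 1), (3.7, 2)]))  # expected: ethanol
```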
5.3 Results
At the 2024 LLM Hackathon for Applications in Materials and Chemistry (May 2024), we found GPT-4 successful on just 3 of the 19 NMR datasets. The scratchpad reveals how GPT-4 follows the prompt and systematically approaches the problem. It analyzes the molecular formula and carefully reads peak values and intensities to predict the molecule. The model correctly identified nonanal, ethanol, and acetic acid, three
relatively simple molecules with few structural isomers. Incorrect answers included more complex molecules,
with significant branching, many functional groups, and aromatic rings, leading to more structural isomers
consistent with the chemical formula.
Figure 10: Scheme of the system. Data is first converted into text format, with peak positions and intensities
represented as (x,y) pairs. These are passed into an LLM prompt, which is tasked to use a scratchpad as it
reasons about the data and the formula, before providing a final answer.
which can enable comparison of GPT-4 to humans. Further analysis of proximity of incorrect answers to
correct answers would provide more granular information about the performance of the AI, for example, an
aromatic molecule with correct substituents in incorrect locations is further from the right answer than a
molecule with incorrect substituents. We think this could form a useful addition to existing LLM benchmarks,
for evaluating chemistry knowledge intertwined with complex multistep reasoning.
Datasets, code, and results are available at: https://github.com/ATOMSLab/LLMSpectroscopy
References
[1] Schwaller, P., Laino, T., Gaudin, T., Bolgar, P., Hunter, C., Bekas, C., Lee, A. A., ACS Central Sci.,
2019, Vol. 5, No. 9, 1572-1583. https://pubs.acs.org/doi/10.1021/acscentsci.9b00576
[2] Alberts, M., Zipoli, F., Vaucher, A. C., ChemRxiv Preprint. https://doi.org/10.26434/
chemrxiv-2023-8wxcz
[3] Wei, J., Wang X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D. Chain-
of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. https://arxiv.
org/abs/2201.11903
[4] Yamaji, T., Saito, T., Hayamizu, K., Yanagisawa, M., Yamamoto, O., Wasada, N., Someno, K., Kinu-
gasa, S., Tanabe, K., Tamura, T., Hiraishi, J., 2024. https://sdbs.db.aist.go.jp
[5] Socha, O., Osifova, Z., Dracinsky, M., J. Chem. Educ., 2023, Vol. 100, No. 2, 962-968. https://pubs.
acs.org/doi/10.1021/acs.jchemed.2c01067
[6] https://github.com/ATOMSLab/LLMSpectroscopy
6 MC-Peptide: An Agentic Workflow for Data-Driven Design of
Macrocyclic Peptides
Authors: Andres M. Bran, Anna Borisova, Marcel M. Calderon, Mark Tropin, Rob Mills,
Philippe Schwaller
6.1 Introduction
Macrocyclic peptides (MCPs) are a class of compounds composed of cyclized chains of amino acids forming
a macrocyclic structure. They are promising for having improved binding affinity, specificity, proteolytic
stability, and enhanced membrane permeability compared to linear peptides [1]. Their unique properties
make them highly suitable for drug development, enabling the creation of therapeutics that can address
currently unmet medical needs [1]. Indeed, it has been shown that MCPs of up to 12 amino acids show great
promise for permeating cellular membranes [2]. Despite the more constrained chemical space offered by this
class of molecules, their design remains a challenging issue due to the vast space of amino acid combinations.
One important parameter of MCPs is permeability, which determines how well the structure can permeate
into cells, making it a relevant factor in assessing the MCP’s pharmacological activity. However, data for
MCPs and their permeabilities is scattered across scientific papers that report data without consensus on
reporting form, making it challenging for systems to compile and use this raw data.
Here, we introduce MC-Peptide, an LLM-based agentic workflow created for tackling this issue in an
end-to-end fashion. As shown in Figure 11, MC-Peptide's main goal is to produce suggestions of novel MCPs,
following from a reasoning process involving: (i) understanding of the design objective, (ii) gathering of
relevant scientific information, (iii) data extraction from papers, and (iv) inference based on the extracted
information. MC-Peptide leverages advances in LLMs such as semantic search, grammar-constrained gen-
eration, and agentic architectures, ultimately yielding candidate MCP variants with enhanced predicted
permeability in comparison to reference designs.
6.2 Methods
We implemented the basic building blocks for the pipeline described in Figure 11, with the most important
components being document retrieval, structured data extraction, and in-context learning for design.
Document Retrieval An important part of this pipeline is the retrieval of relevant documents. As shown
in Figure 11, the pipeline developed here does not rely on a stand-alone predictive system, but rather
aims to leverage existing knowledge in the scientific literature, in an on-demand fashion. This is known
as retrieval-augmented generation (RAG) [8], and one of its key components is a relevance-based retrieval
system. By integrating the Semantic Scholar API, a service that provides access to AI-based search of
scientific documents, the pipeline is able to create and refine a structured knowledge base from papers.
Structured Data Extraction To employ the unstructured data from the previous step, a retrieval-
augmented system was designed that retrieves sections of papers and uses them as context for the generation
of structured data objects with predefined grammars, which can be used to constrain the decoding of LLMs
[3], ensuring that the resulting output follows the same structure and format. This technique also mitigates
hallucination, as it prevents the LLM from generating unnecessary and potentially misleading information [9].
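One way to realize such schema-constrained extraction is sketched below using a Pydantic schema with OpenAI's structured-output parsing; the field names and model choice are illustrative assumptions, not the exact grammar used in this work [3].

```python
# Sketch: constrain extraction output to a fixed schema so every paper
# excerpt yields the same structured record. Fields are assumptions.
from openai import OpenAI
from pydantic import BaseModel

class PeptideRecord(BaseModel):
    sequence: str        # macrocyclic peptide identifier or sequence
    permeability: float  # reported value
    units: str           # e.g. "log Papp (cm/s)"
    assay: str           # e.g. "PAMPA", "Caco-2"

client = OpenAI()

def extract(passage: str) -> PeptideRecord:
    # A schema-conditioned parse forces the decoder to emit JSON matching
    # the grammar, which also curbs hallucinated free text.
    result = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Extract MCP permeability data."},
            {"role": "user", "content": passage},
        ],
        response_format=PeptideRecord,
    )
    return result.choices[0].message.parsed

record = extract("The cyclohexapeptide showed a log Papp of -5.3 in PAMPA.")
print(record.permeability, record.assay)
```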
In-Context Learning The extracted data is then leveraged through the in-context learning capabilities
of LLMs [4], allowing the models to learn from few data points given as context in the prompt and generate
output based on that. This capability has been extensively explored elsewhere [5]. Here we show that, for
the specific task of modifying MCPs to improve permeability, LLMs perform well, as assessed by a surrogate
random forest model.
6.3 Conclusions
We present MC-Peptide, a novel agentic workflow for designing macrocyclic peptides. The system is built
from a few main components: data collection, peptide extraction, and peptide generation. We evaluate the
peptide permeability of newly generated peptides with respect to the initial structures found in reference
articles.
The resulting system shows that LLMs can be successfully leveraged for automating multiple aspects of
peptide design, yielding an end-to-end generative tool that researchers can utilize to accelerate and enhance
their experimental workflows. Furthermore, this workflow can be extended by adding more specialized
modules (e.g., for increasing the diversity of information sources). The modular design of MC-Peptide
ensures extendability to more design objectives and input sources, as well as other domains where data is
reported in an unstructured fashion in papers, such as materials science and organic chemistry.
The code for this project has been made available at: https://github.com/doncamilom/mc-peptide.
Figure 11: MC-Peptide: Pipeline implemented in this work. The example illustrates a user request, followed
by retrieval from the Semantic Scholar API, and the creation of a knowledge base. MCPs and permeabilities
are extracted from [7]. The pipeline finishes with an LLM proposing modifications to an MCP, increasing
its permeability.
References
[1] Ji, X., Nielsen, A. L., Heinis, C., ”Cyclic peptides for drug development,” Angewandte Chemie Interna-
tional Edition, 2024, 63(3), e202308251.
[2] Merz, M.L., Habeshian, S., Li, B. et al., ”De novo development of small cyclic peptides that are orally
bioavailable,” Nat Chem Biol, 2024, 20, 624–633. https://doi.org/10.1038/s41589-023-01496-y.
[3] Beurer-Kellner, L., et al., ”Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation,”
ArXiv, 2024, abs/2403.06988.
[4] Dong, Q., et al., ”A survey on in-context learning,” ArXiv, 2022, arXiv:2301.00234.
[5] Agarwal, R., et al., ”Many-shot in-context learning,” ArXiv, 2024, arXiv:2404.11018.
[6] Kristiadi, A., et al., ”A Sober Look at LLMs for Material Discovery: Are They Actually Good for
Bayesian Optimization Over Molecules?,” ArXiv, 2024, arXiv:2402.05015.
[7] Lewis, I., Schaefer, M., Wagner, T. et al., ”A Detailed Investigation on Conformation, Permeability and
PK Properties of Two Related Cyclohexapeptides,” Int J Pept Res Ther, 2015, 21, 205–221. https:
//doi.org/10.1007/s10989-014-9447-3.
[8] Lewis, P., et al., ”Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” ArXiv, 2020,
abs/2005.11401.
[9] Béchard, P., Ayala, O. M., ”Reducing hallucination in structured outputs via Retrieval-Augmented
Generation,” ArXiv, 2024, abs/2404.08189.
7 Leveraging AI Agents for Designing Low Band Gap Metal-
Organic Frameworks
Authors: Sartaaj Khan, Mahyar Rajabi, Amro Aswad, Seyed Mohamad Moosavi, Mehrad
Ansari
7.1 Introduction
Metal-organic frameworks (MOFs) are known to be excellent candidates for electrocatalysis due to their
large surface area, high adsorption capacity at low CO2 concentrations, and the ability to fine-tune the
spatial arrangement of active sites within their crystalline structure [1]. Low band gap MOFs are crucial
as they efficiently absorb visible light and exhibit higher electrical conductivity, making them suitable for
photocatalysis, solar energy conversion, sensors, and optoelectronics. In this work, we aim to use chemistry-
informed ReAct [2] AI Agents to optimize the band gap property of MOFs. The overview of the workflow
is presented in Figure 12a. The agent inputs a textual representation of the initial MOF structure as a
SMILES (Simplified Molecular Input Line-Entry System) string representation, and a short description of
the property optimization task (i.e., reducing band gap), all in natural language text. This is followed by an
iterative closed-loop suggestion of new MOF candidates with a lower band gap with uncertainty assessment,
by making adjustments to the initial MOF given a set of design guidelines automatically obtained from the
scientific literature. Detailed analysis of this methodology applied to other materials and target properties
can be found in reference [3]. The agent is equipped with three tools:
1. Retrieval-Augmented Generation (RAG): This tool allows the agent to obtain design guidelines
on how to adapt the MOF structure from unstructured text. Specifically, the agent has access to a fixed set of seven MOF research papers (see Refs. [4]–[10]) as PDFs. This tool is designed to extract the most relevant sentences from papers in response to a given query. It works by embedding both the paper and the query into numerical vectors, then identifying the top k passages within the document that either explicitly mention or implicitly suggest adaptations to the band gap property of a MOF. The embedding model is OpenAI's text-embedding-ada-002 [11]. Inspired by our earlier work [12], k is set to 9 but is dynamically adjusted based on the relevant context's length to avoid OpenAI's token limitation error.
2. Surrogate Band Gap Predictor: The surrogate model used is a transformer (MOFormer [13]) that
inputs the MOF as SMILES. This model is pre-trained using a self-supervised learning technique known
as Barlow-Twin [14], where representation learning is done against structure-based embeddings from
a crystal graph convolutional neural network (CGCNN) [15]. This was done against 16,000 BW20K
entries [16]. The pre-trained weights are then transferred and fine-tuned to predict the band gap labels
taken from 7450 entries from the QMOF database [17]. From a 5-fold training, an ensemble of five
transformers are trained to return the mean band gap and the standard deviation, which is used to
assess uncertainty for predictions. For comparison, our transformer’s mean absolute error (MAE) is
approximately 0.467, whereas MOFormer, which was pre-trained on 400,000 entries, achieves an MAE
of approximately 0.387.
3. Chemical Feasibility Evaluator: This tool primarily uses RDKit [18] to convert a SMILES string
into an RDKit Mol object, and performs several validation steps to ensure chemical feasibility. First, it
parses the SMILES string to confirm correct syntax. Next, it validates the atoms and bonds, ensuring
they are chemically valid and recognized. It then checks atomic valences to ensure each atom forms
a reasonable number of bonds. For ring structures, RDKit verifies the correct ring closure notation.
Additionally, it adds implicit hydrogens to satisfy valence requirements and detects aromatic systems,
marking relevant atoms and bonds as aromatic. These steps collectively ensure the molecule’s basic
chemical validity.
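A minimal sketch of such a feasibility check is shown below; in RDKit, parsing and default sanitization bundle most of the validation steps listed above.

```python
# Sketch of the feasibility check with standard RDKit calls. MolFromSmiles
# parses the string and, by default, sanitizes the molecule: valence checks,
# ring-closure validation, aromaticity perception, and implicit-hydrogen
# assignment. It returns None on any failure.
from rdkit import Chem

def is_chemically_feasible(smiles: str) -> bool:
    return Chem.MolFromSmiles(smiles) is not None

print(is_chemically_feasible("c1ccccc1C(=O)O"))  # True: benzoic acid
print(is_chemically_feasible("C(C)(C)(C)(C)C"))  # False: pentavalent carbon
```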
We use OpenAI’s GPT-4 [19] with a temperature of 0.1 as our LLM and LangChain [20] for the application
framework development (note that the choice of LLM is only a hyperparameter and other LLMs can also be used
with the agent).
Figure 12: a) Workflow overview. The ReAct agent looks up guidelines for designing low band gap MOFs
from research papers and suggests a new MOF (likely with lower band gap). It then checks validity of the
new SMILES candidate and predicts band gap with uncertainty estimation using an ensemble of surrogate
fine-tuned MOFormers. b) Band gap predictions for new MOF candidates as a function of agent iterations.
Detailed analysis of this methodology applied to other materials and target properties can be found in
reference [3].
The new MOF candidates and their corresponding inferred band gaps are presented in Figure 12b. The agent starts by retrieving the following design guidelines for low band gap MOFs from research papers: 1) increasing the conjugation in the linker; 2) selecting electron-rich metal nodes; 3) functionalizing the linker with nitro and amino groups; 4) altering linker length; and 5) substituting functional groups (i.e., substituting hydrogen with electron-donating groups on the organic linker). Note that metal node adaptations were restricted simply by changing the system input prompt. The agent iteratively implements the above
strategies and makes changes to the MOF. After each modification, the band gap of the new MOF is assessed
using the fine-tuned surrogate MOFormers to ensure a lower band gap. Subsequently, the chemical feasibility
is evaluated. If the new MOF candidate has an invalid SMILES string or a higher band gap, the agent reverts
to the most recent valid MOF candidate with the lowest band gap.
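The following sketch outlines this closed loop; `propose_modification`, `predict_band_gap`, and `is_chemically_feasible` are placeholder stubs for the agent's LLM call, the MOFormer ensemble, and the RDKit check.

```python
# Sketch of the closed loop described above. The three helpers are stubs
# for the LLM proposal step, the fine-tuned MOFormer ensemble (mean, std),
# and the RDKit feasibility check.
import random

def propose_modification(smiles):   # stub: LLM applies a design rule
    return smiles + "N"

def predict_band_gap(smiles):       # stub: ensemble mean and std
    return random.uniform(1.0, 4.0), 0.2

def is_chemically_feasible(smiles): # stub: see RDKit sketch above
    return True

def optimize_mof(initial_smiles, n_iterations=10):
    best = initial_smiles
    best_gap, _ = predict_band_gap(initial_smiles)
    for _ in range(n_iterations):
        candidate = propose_modification(best)
        if not is_chemically_feasible(candidate):
            continue  # invalid SMILES: keep the last valid best candidate
        gap, sigma = predict_band_gap(candidate)
        if gap < best_gap:  # accept only candidates with a lower predicted gap
            best, best_gap = candidate, gap
    return best, best_gap

print(optimize_mof("O=C(O)c1ccc(C(=O)O)cc1"))  # terephthalic acid linker
```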
7.3 Data and Code Availability
All code and data used to produce results in this study are publicly available in the following GitHub
repository: https://github.com/mehradans92/PoreVoyant.
References
[1] Lirong Li, Han Sol Jung, Jae Won Lee, and Yong Tae Kang. Review on applications of metal–organic frameworks for CO2 capture and the performance enhancement mechanisms. Renewable and Sustainable Energy Reviews, 162:112441, 2022.
[2] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao.
React: Synergizing reasoning and acting in language models. arXiv preprint, arXiv:2210.03629, 2022.
[3] Mehrad Ansari, Jeffrey Watchorn, Carla E. Brown, and Joseph S. Brown. dZiner: Rational Inverse
Design of Materials with AI Agents. arXiv, 2410.03963, 2024. URL: https://arxiv.org/abs/2410.03963.
[4] Muhammad Usman, Shruti Mendiratta, and Kuang-Lieh Lu. Semiconductor metal–organic frameworks:
future low-bandgap materials. Advanced Materials, 29(6):1605071, 2017.
[5] Espen Flage-Larsen, Arne Røyset, Jasmina Hafizovic Cavka, and Knut Thorshaug. Band gap modu-
lations in uio metal–organic frameworks. The Journal of Physical Chemistry C, 117(40):20610–20616,
2013.
[6] Li-Ming Yang, Guo-Yong Fang, Jing Ma, Eric Ganz, and Sang Soo Han. Band gap engineering of
paradigm mof-5. Crystal growth & design, 14(5):2532–2541, 2014.
[7] Li-Ming Yang, Ponniah Vajeeston, Ponniah Ravindran, Helmer Fjellvag, and Mats Tilset. Theoretical
investigations on the chemical bonding, electronic structure, and optical properties of the metal–organic
framework mof-5. Inorganic chemistry, 49(22):10283–10290, 2010.
[8] Maryum Ali, Erum Pervaiz, Tayyaba Noor, Osama Rabi, Rubab Zahra, and Minghui Yang. Recent
advancements in mof-based catalysts for applications in electrochemical and photoelectrochemical water
splitting: A review. International Journal of Energy Research, 45(2):1190–1226, 2021.
[9] Yabin Yan, Chunyu Wang, Zhengqing Cai, Xiaoyuan Wang, and Fuzhen Xuan. Tuning electrical and
mechanical properties of metal–organic frameworks by metal substitution. ACS Applied Materials &
Interfaces, 15(36):42845–42853, 2023.
[10] Chi-Kai Lin, Dan Zhao, Wen-Yang Gao, Zhenzhen Yang, Jingyun Ye, Tao Xu, Qingfeng Ge, Shengqian
Ma, and Di-Jia Liu. Tunability of band gaps in metal–organic frameworks. Inorganic chemistry,
51(16):9039–9044, 2012.
[11] Ryan Greene, Ted Sanders, Lilian Weng, and Arvind Neelakantan. New and improved embedding model,
2022.
[12] Mehrad Ansari and Seyed Mohamad Moosavi. Agent-based learning of materials datasets from scientific
literature. arXiv preprint, arXiv:2312.11690, 2023.
[13] Zhonglin Cao, Rishikesh Magar, Yuyang Wang, and Amir Barati Farimani. Moformer: self-supervised
transformer model for metal–organic framework property prediction. Journal of the American Chemical
Society, 145(5): 2958–2967, 2023.
[14] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-
supervised learning via redundancy reduction. In International Conference on Machine Learning, pages
12310–12320. PMLR, 2021.
[15] Tian Xie and Jeffrey C. Grossman. Crystal graph convolutional neural networks for an accurate and
interpretable prediction of material properties. Physical Review Letters, 120(14):145301, 2018.
[16] Seyed Mohamad Moosavi, Aditya Nandy, Kevin Maik Jablonka, Daniele Ongari, Jon Paul Janet, Peter
G Boyd, Yongjin Lee, Berend Smit, and Heather J Kulik. Understanding the diversity of the metal-
organic framework ecosystem. Nature communications, 11(1):1–10, 2020.
[17] Andrew S. Rosen, Shaelyn M. Iyer, Debmalya Ray, Zhenpeng Yao, Alan Aspuru-Guzik, Laura Gagliardi,
Justin M. Notestein, and Randall Q. Snurr. Machine learning the quantum-chemical properties of
metal–organic frameworks for accelerated materials discovery. Matter, 4(5):1578–1597, 2021.
[18] Greg Landrum. RDKit documentation. Release, 1(1-79):4, 2013.
[19] OpenAI. Gpt-4 technical report, 2023.
[20] Harrison Chase. LangChain, 10 2022. URL https://github.com/langchain-ai/langchain.
8 How Low Can You Go? Leveraging Small LLMs for Material
Design
Authors: Alessandro Canalicchio, Alexander Moßhammer, Tehseen Rug, Christoph Völker
8.3 Methodology
To investigate this question, we deployed two small-scale instruction LLMs on consumer hardware, specifically
Llama 3 8B and Phi-3 3.8B, both quantized to four bits. Our goal was to evaluate their ability to design
alkali-activated concrete with high compressive strength. The LLMs were tasked via system messages to
determine four mixture parameters from a predetermined grid of 240 recipes: 1) the blending ratio of fly
ash to ground granulated blast furnace slag, 2) the water-to-powder ratio, 3) the powder content, and 4) the
curing method. We measured their performance by comparing the compressive strengths of the suggested
formulations to previously published lab results [6].
The workflow and evaluation are shown in Figure 13. Each design run comprised 10 consecutive development
cycles, in which the LLM suggested a formulation and the user provided feedback. This process was
repeated five times for statistical analysis. Additionally, the prompt was rephrased synonymously three times
to increase linguistic variability. Two types of context were provided: one with detailed design instructions,
such as "reducing the water-to-powder ratio leads to higher strengths," and one without additional instructions.
In total, 15 design runs were conducted per model and per context. Finally, we assessed the achieved
10% lower-bound strength, defined as the strength achieved in 90% of the design runs, and compared this
against a random draw as the statistical baseline and against supervised learning (SL).
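A single design run can be sketched as an iterative chat loop. Below is a minimal illustration assuming a llama-cpp-python backend with a 4-bit GGUF model; the model path, system message, and the evaluate_strength and format_feedback helpers are hypothetical placeholders, with the lab-result lookup [6] stubbed out:

from llama_cpp import Llama

DESIGN_INSTRUCTIONS = ("Design an alkali-activated concrete mix with maximal "
                       "compressive strength; choose the blend ratio, "
                       "water-to-powder ratio, powder content, and curing method.")

def evaluate_strength(recipe: str) -> float:
    return 40.0  # placeholder: look up the recipe in the published lab data [6]

def format_feedback(strength: float) -> str:
    return f"The suggested mix reached {strength:.1f} MPa. Please improve it."

llm = Llama(model_path="phi-3-mini-4k-instruct-q4.gguf", n_ctx=4096)
messages = [{"role": "system", "content": DESIGN_INSTRUCTIONS}]
for cycle in range(10):  # 10 consecutive development cycles per design run
    out = llm.create_chat_completion(messages=messages, temperature=0.7)
    recipe = out["choices"][0]["message"]["content"]
    strength = evaluate_strength(recipe)
    messages.append({"role": "assistant", "content": recipe})
    messages.append({"role": "user", "content": format_feedback(strength)})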
Table 2: Achieved compressive strength of designs suggested by LLMs. Statistical assessment in terms of
10% lower bound strength, i.e., the 10% worst cases.
In conclusion, small-scale LLMs performed surprisingly well, with Phi-3 producing results significantly
above a random guess, though it struggled with more complex prompts. The effectiveness of LLMs
in solving design tasks depends on how well material concepts are represented in their hidden states and
how effectively these can be retrieved via prompts, giving larger models an advantage. Despite their smaller
parameter count and less training data, Phi-3 and Llama 3 demonstrated common sense in domain-specific
design tasks, making local deployment a viable option. While LLMs cannot guarantee fully reliable retrieval
of sensible information, small-scale LLMs can generate educated guesses that potentially accelerate
the familiarization process with novel materials.
8.5 Code
The code used to conduct the experiments is open-source and available here: https://github.com/sandrocan/
LLMs-for-design-of-alkali-activated-concrete-formulations.
References
[1] U. Environment, K. L. Scrivener, V. M. John and E. M. Gartner, "Eco-efficient cements: Potential
economically viable solutions for a low-CO2 cement-based materials industry," Cement and Concrete
Research, vol. 114, pp. 2-26; DOI: https://doi.org/10.1016/j.cemconres.2018.03.015, 2018.
[2] J. L. Provis, A. Palomo and C. Shi, "Advances in understanding alkali-activated materials," Cement
and Concrete Research, pp. 110-125, 2015.
Figure 13: LLM-based material design workflow (left) and diagram showing evaluation metric (right).
[3] H. S. Gökçe, M. Tuyan, K. Ramyar and M. L. Nehdi, "Development of Eco-Efficient Fly Ash-Based
Alkali-Activated and Geopolymer Composites with Reduced Alkaline Activator Dosage," Journal of
Materials in Civil Engineering, vol. 32, no. 2, pp. 04019350; DOI: 10.1061/(ASCE)MT.1943-5533.0003017, 2020.
[4] J. He, Y. Jie, J. Zhang, Y. Yu and G. Zhang, "Synthesis and characterization of red mud and rice
husk ash-based geopolymer composites," Cement and Concrete Composites, vol. 37, pp. 108-118; DOI:
http://dx.doi.org/10.1016/j.cemconcomp.2012.11.010, 2013.
[5] C. Völker, T. Rug, K. M. Jablonka and S. Kruschwitz, "LLMs can Design Sustainable Concrete – a
Systematic Benchmark," (Preprint), pp. 1-12; DOI: 10.21203/rs.3.rs-3913272/v1, 2023.
[6] G. M. Rao and T. D. G. Rao, "A quantitative method of approach in designing the mix proportions of
fly ash and GGBS-based geopolymer concrete," Australian Journal of Civil Engineering, vol. 16, no. 1,
pp. 53-63; DOI: 10.1080/14488353.2018.1450716, 2018.
9 LangSim
Authors: Yuan Chiang, Giuseppe Fisicaro, Greg Juhasz, Sarom Leang, Bernadette Mohr,
Utkarsh Pratiush, Francesco Ricci, Leopold Talirz, Pablo A. Unzueta, Trung Vo, Gabriel
Vogel, Sebastian Pagel, Jan Janssen
The complexity and non-intuitive user interfaces of scientific simulation software result in a high barrier
for beginners and limit usage to expert users. In the field of atomistic simulation, the simulation
codes are developed by different communities (chemistry, physics, materials science) using different units,
file names, and variable names. LangSim addresses this challenge by providing a natural language interface
for atomistic simulation in the field of computational materials science.
Since the introduction of ChatGPT, the application of large language models (LLMs) in chemistry and
materials science has transitioned from semantic literature search to research agents capable of autonomously
executing selected steps of the research process. In particular, in research domains with a high level of
automation, like chemical synthesis, the latest research agents already combine access to specialized databases
and scientific software for analysis, as well as to robots for executing the experimental measurements [1,2]. These
research agents divide the research question into a series of individual tasks, addressing each task separately before
combining the results with one controlling agent. With this approach, the flexibility of the LLM is reduced, which
consequently reduces the risk of hallucinations [8].
In analogy, LangSim (Language + Simulation) is a research agent focused on simulation in the field of
computational materials science. LangSim can calculate a series of bulk properties for elemental crystals, like
the equilibrium volume and equilibrium bulk modulus. Internally, this is achieved by constructing simulation
protocols consisting of multiple simulation steps to calculate one material property. For example, the bulk
modulus is calculated by querying the atomistic simulation environment (ASE) [3] for the equilibrium crystal
structure, optimizing the crystal structure depending on the choice of simulation model, and finally
evaluating the energy as a function of volume around the equilibrium to obtain the bulk modulus from the
curvature of the energy-volume curve, B = V d²E/dV². The simulation protocols in LangSim
are independent of the selected level of theory and can be evaluated with either the effective medium theory
model [4] or the foundation machine-learned interatomic potential MACE [5]. Furthermore, to quantify the
uncertainty of these simulation results, LangSim also has access to databases with experimental references
for these bulk properties.
Figure 14
The LangSim research agent is based on the LangChain package. This has two advantages: on the one
hand, the LangChain package [6] simplifies the addition of new simulation agents and simulation workflows
to LangSim; on the other hand, LangChain is compatible with a wide range of LLM providers, preventing
vendor lock-in. LangSim extends the LangChain framework by providing data types for coupling
the simulation codes with the LLM, like a pydantic dataclass [7] representation of the ASE Atoms class, and a
series of pre-defined simulation workflows that highlight how existing simulation workflows can be implemented
as LangChain agents. Once a simulation workflow is represented as a Python function compatible with a
simulation framework like ASE, interfacing it with LangSim is as simple as changing the input arguments
to LLM-compatible data types indicated by type hints and adding a docstring as context for the LLM.
Abstractly, these LangChain agents can be understood in analogy to header files in C programming,
which define the interfaces for public functions. An example LangSim agent to calculate the bulk modulus
is provided below:
from ase.atoms import Atoms
from ase.calculators.calculator import Calculator
from ase.eos import calculate_eos
from ase.units import kJ
from langsim import (
    AtomsDataClass,
    get_ase_calculator_from_str,
)
from langchain.agents import tool


@tool
def get_bulk_modulus_agent(
    atom_dict: AtomsDataClass, calculator_str: str
) -> float:
    """
    Returns the bulk modulus in GPa for a given atoms
    dictionary and a selected model specified by the
    calculator string.
    """
    return get_bulk_modulus_function(
        atoms=Atoms(**atom_dict.dict()),
        calculator=get_ase_calculator_from_str(
            calculator_str=calculator_str
        ),
    )
The example workflow for calculating the bulk modulus highlights how existing simulation frameworks
like ASE can be leveraged to provide the LLM with the ability to construct and execute simulation workflows.
While for this example, in particular for the case of elemental crystals, it would be possible to pre-compute all
combinations and restrict the large language model to a database of pre-computed results, this would become
prohibitive for multi-component alloys and an increasing number of simulation models and workflows. At
this stage, the semantic capabilities of the LLM make it possible to handle all possible combinations in
a systematic way by allowing the LLM to construct and execute the specific simulation workflow when it is
requested by the user.
In summary, the LangSim package provides data classes and utility functions to interface LLMs with
atomistic simulation workflows. This provides the LLM with the ability to execute simulations to answer
scientific questions like a computational material scientist. The functionality is demonstrated for the calcu-
lation of the bulk modulus for elemental crystals.
9.1 One sentence summaries
1. Problem/Task: Develop a natural language interface for simulation codes in the field of computational
chemistry and materials science. Current LLMs, including ChatGPT 4, suffer from hallucination,
resulting in simulation protocols that can be executed but fail to calculate the correct physical property
with the specified unit.
2. Approach: Develop a suite of LangChain agents to interface with the atomic simulation environment
(ASE) and corresponding data classes to represent objects used in atomistic simulation in the context
of large language models.
3. Results and Impact: Developed the LangSim package as a prototype for handling the calculation
of multiple material properties using predefined simulation workflows, independent of the theoretical
model, based on the ASE framework.
4. Challenges and Future Work: The current prototype enables the calculation of bulk properties for
unaries; the next step is to extend this functionality to multi-component alloys and more material
properties.
References
[1] Boiko, D.A., MacKnight, R., Kline, B. et al. Autonomous chemical research with large language models.
Nature 624, 570–578 (2023). https://doi.org/10.1038/s41586-023-06792-0.
[2] M. Bran, A., Cox, S., Schilter, O. et al. Augmenting large language models with chemistry tools. Nat
Mach Intell 6, 525–535 (2024). https://doi.org/10.1038/s42256-024-00832-8.
[3] Hjorth Larsen, A., Jørgen Mortensen, J., Blomqvist, J., Castelli, I. E., Christensen, R., Dulak, M., Friis,
J., Groves, M. N., Hammer, B., Hargus, C., Hermes, E. D., Jennings, P. C., Bjerre Jensen, P., Kermode,
J., Kitchin, J. R., Leonhard Kolsbjerg, E., Kubal, J., Kaasbjerg, K., Lysgaard, S., . . . Jacobsen, K.
W. (2017). The atomic simulation environment—a python library for working with atoms. Journal of
Physics: Condensed Matter, 29(27), 273002. https://doi.org/10.1088/1361-648x/aa680e.
[4] Jacobsen, K. W., Stoltze, P., Nørskov, J. K. (1996). A semi-empirical effective medium theory for metals
and alloys. Surface Science, 366(2), 394–402. https://doi.org/10.1016/0039-6028(96)00816-3.
[5] Batatia, I., Benner, P., Chiang, Y., Elena, A. M., Kovács, D. P., Riebesell, J., Advincula, X. R., Asta,
M., Avaylon, M., Baldwin, W. J., Berger, F., Bernstein, N., Bhowmik, A., Blau, S. M., Cărare, V.,
Darby, J. P., De, S., Della Pia, F., Deringer, V. L. et al. (2024). A foundation model for atomistic
materials chemistry. arXiv. https://arxiv.org/abs/2401.00096.
[6] https://github.com/langchain-ai/langchain.
[7] https://pydantic.dev/.
[8] Ye, H., Liu, T., Zhang, A., Hua, W., Jia, W. (2023). Cognitive mirage: A review of hallucinations in
large language models. arXiv. https://arxiv.org/abs/2309.06794.
10 LLMicroscopilot: assisting microscope operations through LLMs
Authors: Marcel Schloz, Jose D. Cojal Gonzalez
The operation of state-of-the-art microscopes in materials science research is often limited to a selected
group of operators due to their high complexity and significant cost of ownership. This exclusivity creates a
barrier to broadening scientific progress and democratizing access to these powerful instruments. Presently,
operating these microscopes involves time-consuming tasks that demand substantial human expertise, such
as aligning the instrument for optimal performance and transitioning between different operational modes
to address different research questions. These challenges highlight the need for improved user interfaces that
simplify operation and increase the accessibility of microscopes in materials science.
Recent advancements in natural language processing software suggest that integrating large language
models (LLMs) into the user experience of modern microscopes could significantly enhance their usability.
Just as modern chatbots have enabled users without much programming background to create complex
computer programs, LLMs have the potential to simplify the operation of microscopes, thereby making
them more accessible to non-expert users [1]. Early studies have demonstrated the potential of LLMs in
scanning probe microscopy, using microscope-specific external tools for remote access [2] and control [3].
Particularly promising is the application of LLMs as agents with access to specific external tools, providing
operators with a powerful assistant capable of reasoning based on observations and reducing the extensive
hallucinations common in LLM agents. This approach also enhances the accessibility of external tools,
eliminating the need for users to learn tool-specific APIs.
The LLMicroscopilot team (Jose D. Cojal Gonzalez and Marcel Schloz) has shown that the operation of
a scanning transmission electron microscope can be partially performed by the LLM-powered agent
"LLMicroscopilot" through access to microscope-specific control tools. Figure 15 illustrates the interaction process
between the operator and the LLMicroscopilot. The LLMicroscopilot is built on a generally trained founda-
tion model that gains domain-specific knowledge and performance through the provided tools. The initial
prototype uses the API of a microscope experiment simulation tool [4] to perform tasks such as experimental
parameter estimation and experiment execution. This approach reduces the reliance on highly trained human
operators, fostering broader participation in materials science research. Future developments of LLMicro-
scopilot will integrate open-source microscope hardware control tools [5] and database-access tools, allowing
for Retrieval-Augmented Generation possibilities to improve parameter estimation and data analysis.
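As an illustration of the pattern (the actual LLMicroscopilot tool set is more elaborate and not reproduced here), a simulation-backed tool can be exposed to the agent through a LangChain-style interface; run_probe_simulation below is a hypothetical stub standing in for a call into the microscope experiment simulation API [4]:

from langchain.agents import tool

def run_probe_simulation(energy_kev: float, convergence_mrad: float) -> dict:
    # Stub standing in for an abTEM-based parameter-estimation call [4].
    return {"energy_kev": energy_kev, "convergence_mrad": convergence_mrad}

@tool
def estimate_experiment_parameters(energy_kev: float, convergence_mrad: float) -> dict:
    """Estimate STEM experimental parameters for a given beam energy and
    convergence semi-angle."""
    return run_probe_simulation(energy_kev, convergence_mrad)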
References
[1] Stefan Bauer et al, Roadmap on data-centric materials science, Modelling Simul. Mater. Sci. Eng., 2024,
32, 063301.
[2] Diao, Zhuo, Hayato Yamashita, and Masayuki Abe. ”Leveraging Large Language Models and Social
Media for Automation in Scanning Probe Microscopy.” arXiv preprint arXiv:2405.15490 (2024).
[3] Liu, Yongtao, Marti Checa, and Rama K. Vasudevan. ”Synergizing Human Expertise and AI Efficiency
with Language Model for Microscopy Operation and Automated Experiment Design.” Machine Learning:
Science and Technology (2024).
[4] Madsen, Jacob, and Toma Susi. ”The abTEM code: transmission electron microscopy from first princi-
ples.” Open Research Europe 1 (2021).
[5] Meyer, Chris, et al. ”Nion Swift: Open Source Image Processing Software for Instrument Control, Data
Acquisition, Organization, Visualization, and Analysis Using Python.” Microscopy and Microanalysis
25.S2 (2019): 122-123.
Figure 15: Schematic overview of the LLMicroscopilot assistant. The microscope user interface allows the
user to input queries, which are then processed by the LLM. The LLM executes appropriate tools to provide
domain-specific knowledge, support data analysis, or operate the microscope.
11 T2Dllama: Harnessing Language Model for Density Functional
Theory (DFT) Parameter Suggestion
Authors: Chiku Parida, Martin H. Petersen
11.1 Introduction
Large language models are now gaining the attention of many researchers due to their ability to process
human language and perform tasks on which they have not been explicitly trained. This makes them an
invaluable tool in fields where information is available as text, such as scientific journals, blogs,
news articles, and social media posts. This is particularly applicable to the chemical sciences,
which face the challenge of dealing with limited and diverse datasets that are often presented in
text format. LLMs have proven their potential in handling these challenges and are progressively being
utilised to predict chemical characteristics, optimise reactions, and even independently design and execute
experiments [1].
Here, we used an LLM to process published scientific articles and extract simulation parameters and other
relevant information about different materials. This can help experimentalists obtain sensible starting
parameters for Density Functional Theory (DFT) calculations on a newly discovered material family.
Nowadays, DFT is the most valuable tool for modeling materials at the atomistic scale. The idea behind DFT is to use
the Kohn-Sham equations to approximately solve the Schrödinger equation for the atomic material at hand. The
approximation is done by computing the electron density of the material at each ionic step, where the ions
move based on the energies and forces determined by the computed electron density. The biggest
part of the approximation is the exchange-correlation functional, and depending on its complexity,
the approximation becomes more or less comparable with experimental results [6]. When performing
a DFT calculation, the question is always which exchange-correlation functional and parameters to use, as well as
which k-space grid. This is material-dependent and will change for different materials. Unoptimized
parameters can lead to inaccurate results, yielding a DFT calculation that fails to describe the relative
values and is therefore not comparable to experimental results [2]. For that reason, experimentalists
normally collaborate with computational chemists because of their expertise in computational modeling, or
make do without the atomistic model. Instead, our T2Dllama [talk-to-documents using Llama] framework
can be an acceptable solution for suggesting the necessary DFT parameters, and with additional tools on top
of the LLM interface, it can create inputs for atomistic simulations [3, 4] from the structure file provided by
the user.
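A minimal sketch of such a retrieval pipeline, assuming the collected articles live in a local ./papers directory and using the LlamaIndex framework [7] with default settings, might look as follows (the query text is illustrative):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./papers").load_data()  # ingest the articles
index = VectorStoreIndex.from_documents(documents)  # embed and index text chunks
query_engine = index.as_query_engine()
response = query_engine.query(
    "Which exchange-correlation functional, plane-wave cutoff, and k-point "
    "grid were used for DFT calculations on this material family?"
)
print(response)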
Figure 16: Retrieval Augmented Generation [RAG] architecture with LLM interface
References
[1] Adrian Mirza et al., “Are large language models superhuman chemists?”, https://doi.org/10.48550/
arXiv.2404.01475.
[2] Hafner, Jürgen, ”Ab-initio simulations of materials using VASP: Density-functional theory and beyond.”
Journal of computational chemistry 29.13 (2008): 2044-2078.
[3] Kresse, Georg, and Jürgen Hafner, ”Ab initio molecular dynamics for liquid metals.” Physical review B
47.1 (1993): 558.
[4] Mortensen, Jens Jørgen, Lars Bruno Hansen, and Karsten Wedel Jacobsen, ”Real-space grid implemen-
tation of the projector augmented wave method.” Physical Review B—Condensed Matter and Materials
Physics 71.3 (2005): 035109.
[5] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bic, Yi Dai, Jiawei Sun, Meng
Wang, and Haofen Wang, “Retrieval-Augmented Generation for Large Language Models: A Survey”,
https://doi.org/10.48550/arXiv.2312.10997.
[6] Giustino, Feliciano. Materials modelling using density functional theory: properties and predictions.
Oxford University Press, 2014.
[7] LlamaIndex, https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp/.
12 Materials Agent: An LLM-Based Agent with Tool-Calling Ca-
pabilities for Cheminformatics
Authors: Archit Datar, Kedar Dabhadkar
12.1 Introduction
Out-of-the-box large language model (LLM) implementations such as ChatGPT, while offering interesting
responses, generally provide little to no control over the workflow by which the response is
generated. In other words, it is easy to get LLMs to say something in response to a prompt, but difficult to
get them to do something via an expected workflow. A solution to this problem is to equip LLMs with
tool-calling capabilities, i.e., to allow the LLM to generate the response via the available functions
appropriate to the prompt. An LLM with tool-calling capabilities, when prompted, typically
decides which tool(s) to call (execute) and the order in which to execute them, along with the arguments
to pass to them. It then executes these and returns the response. Such a system offers several powerful
capabilities, such as the ability to query databases for the latest information and to reliably perform
mathematical calculations. Such capabilities have been incorporated into ChatGPT via plugins such as
Expedia and Wolfram, among others [1]. In the chemistry literature, recent attempts have been made by
researchers such as Smit and coworkers and Schwaller and coworkers, among others [2, 3].
Through Materials Agent, an LLM-based agent with tool-calling capabilities for cheminformatics,
we seek to build on these attempts, provide a variety of important tool-calling capabilities, and build a
framework to expand on them. We hope that this can serve to increase LLM adoption in the community
and lower the barrier to entry for cheminformatics. Materials Agent is built using the LangChain library [4]
with GPT-3.5-turbo [5] as the underlying LLM, and its user interface is based on the FastDash project [6]. In this
demonstration, we provide tools based on RDKit [7], a popular cheminformatics library, some custom
tools, as well as Retrieval Augmented Generation (RAG) to allow the LLM to interact with documents. The full
code is available at https://github.com/dkedar7/materials-agent and the working application, hosted
on Google Cloud Platform (GCP), is available at https://materials-agent-hpn4y2dvda-ue.a.run.app/.
The demonstration video is uploaded on YouTube at https://www.youtube.com/watch?v=_5yCOg5Bi_Q&ab_channel=ArchitDatar.
In the following section, we describe some key use cases.
Figure 17: Workflow of a response generated to a user prompt by Materials Agent using a tool based on
RDKit.
One such custom tool computes a radial distribution function (RDF) from a TXT file containing snapshots
of the location of the water molecule during the simulation. The distance
computation also accounts for triclinic periodic boundary conditions, which is the correct way to quantify
distances for crystalline systems such as this one. The inputs and outputs for this tool are shown below in
Figure 18(a).
Furthermore, we also stress that this approach is easily scalable and transferable, and adding new tools
is exceedingly easy. The reader is encouraged to clone our GitHub repository and experiment with adding
new tools to this software package. New tools can be easily added to the src/tools.py file in the repository
via the format shown in the code snippet in Figure 18(b); a hedged sketch of such a tool follows the figure.
Figure 18: Niche purpose tools with LLM. (a) Illustration of the RDF computing tool. (b) Code snippet to
highlight the ease of transferability of LLMs with tool-calling capabilities.
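For illustration, a new RDKit-backed tool might look like the sketch below; the exact decorator and registration used in src/tools.py may differ:

from langchain.agents import tool
from rdkit import Chem
from rdkit.Chem import Descriptors

@tool
def molecular_weight(smiles: str) -> float:
    """Return the molecular weight of the molecule given by a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return Descriptors.MolWt(mol)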
12.4 RAG capabilities
Summarizing and asking questions of documents is another common LLM use case. This capability is
provided out of the box by the EmbedChain library [8], and we have integrated it into Materials
Agent for convenience. We demonstrate its utility by supplying the LLM with a URL to a material
safety data sheet (MSDS) and asking questions of it (see Figure 19).
In the future, we aim to expand the toolkit that the LLM is equipped with. For instance, we can add functions
built on the publicly available PubChem database [9], as well as some functions built off of it [10]. We also
aim to train it on the user manuals of commonly used molecular simulation software such as GROMACS [11],
RASPA [12], and QuantumEspresso [13] to assist with setting up molecular simulations.
Through the experience of building Materials Agent, we realized that, while convenient, such agents
cannot replace the need for human vigilance. At the same time, the development of such an agent
will make cheminformatics utilities easier to access for a broader range of users, lower the barrier to entry,
and, ultimately, accelerate the pace of materials development.
References
[1] OpenAI plugins, https://openai.com/index/chatgpt-plugins/
[2] Jablonka, K.M., Schwaller, P., Ortega-Guerrero, A. et al. Leveraging large language models for predictive
chemistry. Nat Mach Intell 6, 161–169 (2024). https://doi.org/10.1038/s42256-023-00788-1.
[3] M. Bran, A., Cox, S., Schilter, O. et al. Augmenting large language models with chemistry tools. Nat
Mach Intell 6, 525–535 (2024). https://doi.org/10.1038/s42256-024-00832-8.
[4] LangChain, https://www.langchain.com/.
[5] OpenAI models, https://platform.openai.com/docs/models.
[6] Fast Dash, https://docs.fastdash.app/.
[7] RDKit: Open-source cheminformatics; http://www.rdkit.org.
[8] Singh, Taranjeet. Embedchain, https://github.com/embedchain/embedchain.
[13] Giannozzi, P., Baroni, S., Bonini, N., Calandra, M., Car, R., Cavazzoni, C., Ceresoli, D., Chiarotti,
G. L., Cococcioni, M., Dabo, I., Dal Corso, A., de Gironcoli, S., Fabris, S., Fratesi, G., Gebauer, R.,
Gerstmann, U., Gougoussis, C., Kokalj, A., Lazzeri, M., . . . Wentzcovitch, R. M. (2009). Quantum
Espresso: A modular and open-source software project for quantum simulations of materials. Journal of
Physics: Condensed Matter, 21(39), 395502. https://doi.org/10.1088/0953-8984/21/39/395502.
13 LLM with Molecular Augmented Token
Authors: Luis Pinto, Xuan Vu Nguyen, Tirtha Vinchurkar, Pradip Si, Suneel Kuman
13.1 Objective
Our primary objective is to explore how chemical encoders such as molecular fingerprints or embeddings from
2D/3D deep learning models (e.g., ChemBERTa [1], UniMol [2]) can enhance large language models (LLMs)
for zero-shot tasks such as property prediction, molecule editing, and generation. We aim to benchmark our
approach against state-of-the-art models like LlaSmol [3] and ChatDrug [4], demonstrating the transformative
potential of LLMs in the field of chemistry.
Figure 20: Workflow for integrating chemical encoders with large language models. Molecular data from
SMILES is transformed into molecular tokens and combined with text embeddings for tasks such as property
prediction, molecule editing, and generation.
13.2 Methodology
We identified two key benchmarks to evaluate our approach:
• LlaSmol: This benchmark involves fine-tuning a Mistral 7B model [5] on 14 different chemistry tasks,
including 6 property prediction tasks. The LlaSmol project demonstrated significant performance
improvements over baseline models, both open-source and proprietary, by using the SMolInstruct
dataset, which contains over three million samples.
• ChatDrug: This framework leverages ChatGPT for molecule generation and editing tasks. It in-
cludes a prompt module, a retrieval and domain feedback module, and a conversation module to
facilitate effective drug editing. ChatDrug showed superior performance across 39 drug editing tasks,
encompassing small molecules, peptides, and proteins, and provided insightful explanations to enhance
interpretability.
Steps Taken:
• Data Preparation: We utilized chemical encoders to transform molecular structures into suitable
embeddings for the LLM.
• Model Modification: We integrated the embeddings into the LLM's forward function to enrich its
input (see the sketch following this list).
• Fine-Tuning: We applied QLoRA for efficient training on limited computational resources.
• Preliminary Results and Ongoing Work: Although we are still fine-tuning the LLMs, initial
results are promising.
• Code Snippets: Screenshots in Figures 21 and 22 demonstrate the modifications made to the model
code to extract embeddings and implement the forward function of the LLM.
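The core idea of the modified forward pass (Figure 22) can be sketched as follows; this is an illustrative simplification rather than our exact code, and the class and argument names are hypothetical:

import torch
import torch.nn as nn

class MolAugmentedLM(nn.Module):
    def __init__(self, llm, d_mol: int, d_model: int):
        super().__init__()
        self.llm = llm  # e.g., a Hugging Face causal LM
        self.proj = nn.Linear(d_mol, d_model)  # map molecular embedding to LLM width

    def forward(self, input_ids, mol_features, attention_mask=None):
        text_emb = self.llm.get_input_embeddings()(input_ids)  # (B, T, d_model)
        mol_tok = self.proj(mol_features).unsqueeze(1)  # molecular token (B, 1, d_model)
        inputs_embeds = torch.cat([mol_tok, text_emb], dim=1)
        if attention_mask is not None:
            ones = torch.ones_like(attention_mask[:, :1])
            attention_mask = torch.cat([ones, attention_mask], dim=1)
        return self.llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask)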
13.4 Conclusion
Our project underscores the potential of using LLMs enhanced with chemical encoders in materials science
and chemistry. By fine-tuning these models, we aim to improve property prediction and facilitate molecule
editing and generation, paving the way for future research and applications in this space. The code is
available at https://github.com/luispintoc/LLM-mol-encoder.
References
[1] S. Chithrananda, G. Grand, and B. Ramsundar, “ChemBERTa: Large-Scale Self-Supervised Pretraining
for Molecular Property Prediction,” arXiv preprint arXiv:2010.09885, 2020. Available: https://arxiv.
org/abs/2010.09885
[2] G. Zhou, Z. Gao, Q. Ding, H. Zheng, H. Xu, Z. Wei, et al., “Uni-Mol: A Universal 3D Molecular
Representation Learning Framework,” ChemRxiv, 2022, doi:10.26434/chemrxiv-2022-jjm0j.
[3] B. Yu, F. N. Baker, Z. Chen, X. Ning, and H. Sun, “LlaSMol: Advancing Large Language Models
for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset,” arXiv
preprint arXiv:2402.09391, 2024. Available: https://arxiv.org/abs/2402.09391
[4] S. Liu, J. Wang, Y. Yang, C. Wang, L. Liu, H. Guo, and C. Xiao, “ChatGPT-powered Conversational
Drug Editing Using Retrieval and Domain Feedback,” arXiv preprint arXiv:2305.18090, 2023. Available:
https://arxiv.org/abs/2305.18090
Figure 22: Modified forward function which allows for the molecular token to be added.
14 MaSTeA: Materials Science Teaching Assistant
Authors: Defne Circi, Abhijeet S. Gangan, Mohd Zaki
14.2 Methodology
Our objective was to automate the evaluation of both open-source and proprietary LLMs on materials science
questions from the MaScQA dataset and provide an interactive interface for students to solve these questions.
We evaluated the performance of several language models, including LLAMA3-8B, HAIKU, SONNET, GPT-
4, and OPUS, across the 14 various categories such as characterization, applications, properties, and behavior.
The evaluation involved:
• Extracting corresponding values: For multiple-choice questions, correct answer options were extracted
using regular expressions to compare model predictions against the correct choices (see the sketch after this list).
• Prediction verification: For numerical questions, the predicted value was checked against a specified
range or exact value. For multiple-choice questions, the predicted answer was verified against the correct
option or the extracted corresponding value.
• Calculating accuracy: Accuracy was calculated for each question type and topic, and the overall
accuracy across all questions was computed.
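A simplified sketch of the extraction step (our evaluation code handles more answer formats than this) could look like:

import re

ANSWER_RE = re.compile(r"(?:answer\s*[:\-]?\s*|\()\s*([A-D])\b\)?", re.IGNORECASE)

def extract_choice(response: str):
    # Pull an option letter such as "(B)" or "Answer: B" from free text.
    match = ANSWER_RE.search(response)
    return match.group(1).upper() if match else None

def is_correct(response: str, correct_option: str) -> bool:
    return extract_choice(response) == correct_option.upper()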
The results of the evaluation are summarized in Table 4, which presents the accuracy of the various models
for different question types and topics. The opus variant of Claude consistently outperformed the others,
achieving the highest accuracy in most categories. GPT-4 also showed strong performance, particularly in
topics related to material processing and fluid mechanics.
Our interactive web app, MaSTeA (Materials Science Teaching Assistant), developed using Streamlit,
allows easy model testing to identify LLMs' strengths and weaknesses in different materials science subfields.
The results suggest that there is significant room for improvement to enhance the accuracy of language models
in answering scientific questions. Once these models become more reliable, MaSTeA could be a valuable tool
for students to practice answering questions and learn the steps to get to the answer. By analyzing LLM
performance, we aimed to guide future model development and pinpoint areas for improvement.
Our code and application can be found at:
• https://github.com/abhijeetgangan/MaSTeA
• https://mastea-nhwpzz8fehvc9b3n5bhzya.streamlit.app/
References
[1] Zaki, M., & Krishnan, N. A. (2024). MaScQA: investigating materials science knowledge of large lan-
guage models. Digital Discovery, 3(2), 313-327.
Table 3: Sample questions from each category: (a) multiple choice question (MCQ), (b) matching type
question (MATCH), (c) numerical question with multiple choices (MCQN), and (d) numerical question
(NUM). Correct answers are in bold.
Table 4: Accuracy of Language Models by Topic
15 LLMy-Way
Authors: Ruijie Zhu, Faradawn Yang, Andrew Qin, Suraj Sudhakar, Jaehee Park, Victor
Chen
15.1 Introduction
In the academic realm, researchers frequently present their work and that of others to colleagues and lab
members. This task, while essential, is fraught with difficulties. For example, below are three challenges:
1. Reading and understanding research papers: Comprehending the intricacies of a research paper can
be daunting, particularly for interdisciplinary subjects like materials science.
2. Creating presentation slides: Designing slides that effectively communicate the content requires sig-
nificant effort, including remaking slides, sourcing images, and determining optimal text and image
placement.
3. Tailoring to the audience: Deciding on the appropriate level of technical vocabulary and the number
of slides needed to fit within a given time limit adds another layer of complexity.
Figure 23
These challenges can be effectively addressed using large language models, which can streamline and
automate the text summarization and slide creation process. LLMy Way leverages the power of GPT-3.5-turbo
to automate the creation of academic slides from research articles. The methodology involves three main
steps, described below.
15.3 Slide Generation
To create slides, we format the language model’s output in a specific manner, using symbols to denote slide
breaks. This output is then parsed and converted into a Markdown file, where additional images and text
formatting are applied as needed. The formatted Markdown file is subsequently transformed into PDF slides
using Marp. Example output formatting:
# Background
Summary of the background here.
---
# Challenge
Summary of the challenge here.
---
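A minimal sketch of this parsing step, assuming the Marp CLI is installed and with an illustrative helper name, is:

def to_marp_markdown(llm_output: str, path: str = "slides.md") -> None:
    # Split the LLM output on the "---" slide-break symbol and write a
    # Marp-compatible Markdown file with front matter enabling Marp.
    slides = [s.strip() for s in llm_output.split("---") if s.strip()]
    with open(path, "w") as f:
        f.write("---\nmarp: true\n---\n\n" + "\n\n---\n\n".join(slides))
    # Render with the Marp CLI afterwards, e.g.: marp slides.md --pdf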
15.5 Conclusion
LLMy Way represents a significant step toward the automation of academic presentation preparation.
By leveraging the structured nature of scientific papers and the capabilities of advanced large language
models, our tool addresses common pain points faced by researchers. The current implementation of our
framework can be summarized in three consecutive steps. First, the research paper is parsed by the LLM
and summarized into predefined sections. Next, the summarized texts are converted into Markdown.
Finally, Marp is used to generate the final slide deck in PDF format. The current implementation uses
GPT-3.5-turbo, but it can be adapted to other language models as needed. We also support LaTeX output
to fit the needs of many researchers. Future work will focus on further refining the tool, incorporating user
feedback, and exploring additional customization options.
References
[1] OpenAI. (2024). GPT-3.5-turbo. https://openai.com/api/.
[2] Marp. (2024). Markdown Presentation Ecosystem. https://marp.app/
16 WaterLLM: Creating a Custom ChatGPT for Water Purifica-
tion Using Prompt-Engineering Techniques
Authors: Viktoriia Baibakova, Maryam G. Fard, Teslim Olayiwola, Olga Taran
16.1 Introduction
Drinking water pollution is a growing environmental problem that, ironically, stems from increasing
industrial production of new materials and chemicals. Common pollutants include heavy metals, perfluorinated
compounds, microplastic particles, excreted medicinal drugs from hospitals, agricultural runoff, and
many others [1]. The communities that suffer the most from water contamination often lack the plumbing
infrastructure necessary for centralized water analysis and treatment. Decentralized and localized water
treatment, based on the resources available to the communities, can alleviate the problem. Since these resources
can vary greatly, from well-equipped analytical facilities to low-cost DIY solutions, a knowledge base that
can rapidly provide information relevant to specific situations is needed. Here we show a prototype Large
Language Model (LLM) chatbot that can take a variety of inputs about possible contaminants (ranging
from detailed LC/MS analyses to a general description of the situation) and propose the best water-treatment
solution for the particular case based on contaminant composition, cost, and resource availability.
We employed the Retrieval Augmented Generation (RAG) capability of ChatGPT to answer questions about
possible water treatments based on the latest scientific literature.
For this project, we focused on advanced oxidation processes (AOPs) for water purification from
microplastics (MPs). In recent times, MPs have received significant attention globally due to their widespread
presence in various species' bodies, environmental media, and even bottled drinking water, as frequently
documented [2]. Numerous trials on the use of AOPs for breaking down diverse persistent microplastics
have been conducted and reported as wastewater treatment methodologies. However, there remains a lack of
guidelines for selecting the most suitable and cost-effective treatment method based on the characteristics of
the contaminant and the achievable MP removal percentage.
The complexity of existing research on AOPs for MPs can be tackled with LLMs enhanced with RAG.
RAG augments an LLM's knowledge and achieves state-of-the-art results on knowledge-intensive tasks.
One straightforward way to implement an LLM with RAG is to configure a custom ChatGPT.
We uploaded current scientific papers under "Knowledge" for RAG and tailored the chatbot's performance
with prompt-engineering techniques such as grounding, context, and chain-of-thought reasoning to ensure
that it delivers accurate, detailed, and useful information.
16.2 Grounding
To make sure that the chatbot provides accurate and scientifically valid answers, we loaded the latest research on
microplastic pollution remediation for RAG and implemented grounding in the chat prompt. To collect the data,
we gathered an initial set of 10 review articles on water purification using advanced oxidation
processes, recommended by an expert in the field. Then, we manually found 112 scientific articles discussing specific
treatment procedures. This way, our dataset covers different pollution sources like laundry, hospitals, and
industry, different pollutant types and their descriptive characteristics like size, shape, and color, and different
treatment methods like ozonation, the Fenton reaction, UV, and heat-activated persulfate. We merged all papers
into 8 PDFs to meet ChatGPT's "Knowledge" restrictions on file number, size, and length. With grounding, we
aim to anchor the chatbot's responses in concrete and factual knowledge. We explicitly asked ChatGPT to
avoid wordy, overly general answers and to provide concise scientific details using the files uploaded
under Knowledge.
16.3 Context
We provided context for the chatbot to understand and respond appropriately to user queries. We specified
that the chatbot has expert-level knowledge of MPs and of strategies for purifying water of MPs and other
contaminants. We defined the user as a technician with basic chemical-engineering knowledge who needs to
choose and apply a purification method. We specified that the communication between chatbot and user
should take the form of an interactive dialog: the chatbot should ask the user follow-up questions
and finally return an accurate purification protocol with all the details needed to reproduce it in an
experiment.
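An abridged, illustrative version of such a system prompt (not our exact wording), combining the grounding and context elements above, is:

"You are an expert on microplastics and water purification. The user is a
technician with basic chemical engineering knowledge. Answer only from the
files uploaded under Knowledge; avoid broad generalizations. Ask follow-up
questions about the contaminant, available equipment, and budget, reason
step by step, and finish with a reproducible purification protocol."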
Figure 24: WaterLLM approach: custom chatGPT with RAG from scientific papers, context and chain-of-
thought allowed for interactive dialog with the user anchored to science.
References
[1] Mishra, R. K.; Mentha, S. S.; Misra, Y.; Dwivedi, N. Emerging Pollutants of Severe Environmental
Concern in Water and Wastewater: A Comprehensive Review on Current Developments and Future
Research. Water-Energy Nexus 2023, 6, 74–95. https://doi.org/10.1016/j.wen.2023.08.002.
[2] Li, Y.; Peng, L.; Fu, J.; Dai, X.; Wang, G. A Microscopic Survey on Microplastics in Beverages: The
Case of Beer, Mineral Water and Tea. Analyst 2022, 147 (6), 1099–1105. https://doi.org/10.1039/
D2AN00083K.
Figure 25: Sample of the WaterLLM communication with the User.
17 yeLLowhaMMer: A Multi-modal Tool-calling Agent for Accel-
erated Research Data Management
Authors: Matthew L. Evans, Benjamin Charmes, Vraj Patel, Joshua D. Bocarsly
As scientific data continues to grow in volume and complexity, there is a great need for tools that can
simplify the job of managing this data to draw insights, increase reproducibility, and accelerate discovery.
Digital systems of record, such as electronic lab notebooks (ELN) or laboratory information management
systems (LIMS), have been a great advancement in this area. However, complex tasks are still often too
laborious, or simply impossible, to accomplish using graphical user interfaces alone, and any barriers to
streamlined data management often lead to lapses in data recording.
As developers of the open-source datalab [1] ELN/LIMS, we explored how large language models (LLMs)
can be used to simplify and accelerate data handling tasks in order to generate new insights, improve
reproducibility, and save time for researchers. Previously, we made progress toward this goal by developing a
conversational assistant, named Whinchat [2], that allows users to ask questions about their data. However,
this assistant was unable to take action with a user’s data. Here, we developed yeLLowhaMmer, a multimodal
large language model (MLLM)-based data management agent capable of taking free-form text and image
instructions from users and executing a variety of complex scientific data management tasks.
Our agent is powered by a low-cost commercial MLLM (Anthropic’s Claude 3 Haiku) used within a
custom agentic infrastructure that allows it to write and execute Python code that interacts with datalab
instances via the datalab-api package. In typical usage, a yeLLowhaMmer user might instruct the agent:
“Pull up my 10 most recent sample entries and summarize the synthetic approaches used.” In this case, the
agent will attempt to write and execute Python code using the datalab API to query for the user’s samples
in the datalab instance and write a human-readable summary. If the code it generates gives an error (or does
not give sufficient information), the agent can iteratively rewrite the program until the task is accomplished
successfully.
Furthermore, we leveraged the powerful multimodal capabilities of the latest MLLMs to allow for prompts
that include visual cues. For example, a user may upload an image of a handwritten lab notebook page and
ask that a new sample entry be added to the datalab instance. The agent uses its multimodal capabilities to
“read” the lab notebook page (even if it is a messy/unstructured page), adds structure to the information
it finds by massaging it into the form requested by the datalab JSON schema, then writes a Python snippet
to ingest the new sample into the datalab instance. Notably, we found that even the inexpensive, fast model
we used (Claude 3 Haiku) was able to perform sufficiently well at this task, while larger models may be
explored in the future to allow for more advanced applications (though with slower speed and greater cost).
We believe the capabilities demonstrated by yeLLowhaMmer show that MLLM agents have the potential to
greatly lower the barrier to advanced data handling in experimental materials and chemistry laboratories.
This proof-of-concept work is accessible on GitHub at bocarsly-group/llm-hackathon-2024, with ongoing
work at datalab-org/yellowhammer.
yeLLowhaMmer was built upon several open-source software packages. The codebox-api Python package
was used to set up a local code sandbox that the agent has access to in order to safely read and save files,
install Python packages, and run the code generated by the model. The datalab-api Python package was
used to interact with datalab instances. An MLLM-compatible tool was designed to allow the model to
use function-calling capabilities, write, and execute code within the sandbox. LangChain was used as a
framework to interact with the commercial MLLM APIs and build the agentic loop. Streamlit was used
to build a responsive GUI to show the conversation and upload/download the files from the codebox. A
customized Streamlit callback was written to display the code and files generated by the agent in a user-
friendly manner.
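The code-execution tool at the heart of this loop can be sketched as follows; this is an illustrative simplification in which an in-process exec stands in for the isolated CodeBox sandbox that yeLLowhaMmer actually uses:

import contextlib
import io

from langchain_core.tools import tool

def run_in_sandbox(code: str) -> str:
    # Placeholder for codebox-api execution; a real sandbox isolates the
    # interpreter instead of exec-ing in-process.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

@tool
def execute_python(code: str) -> str:
    """Execute Python code (e.g., datalab-api calls) and return its stdout."""
    return run_in_sandbox(code)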
An interesting challenge in the development of yeLLowhaMmer was the creation of a system prompt
that would enable the agent to reliably generate robust code using the datalab-api package, a
recent library not included in the training data of the commercial models at the time of writing. Initially, we
copied the existing documentation for the datalab-api into the system prompt, but we found that the code
generated by the model did not work well. Instead, it was helpful to produce a simplified version
of the documentation that removed extraneous information and gave a few concrete examples of scripts.
Additionally, we provided an abridged version of the datalab schemas in the JSON Schema format in the
system prompt, which was necessary for the generation of compliant data to be inserted into datalab.
Figure 26: The yeLLowhaMmer multimodal agent can be used for a variety of data management tasks. Here,
it is shown automatically adding an entry into the datalab lab data management system based on an image
of a handwritten lab notebook page.
Overall, the yeLLowhaMmer system prompt amounts to around 12,000 characters (corresponding to
about 3200 tokens using Claude’s tokenizer). Given the large context windows of the current generation of
MLLMs (e.g., 200k tokens for Claude 3 Haiku), this size of prompt is feasible for fast, extended conversations
involving text, generated code, and images. In the future, we envision that library maintainers may wish
to increase the utility of their libraries by maintaining two parallel sets of documentation: the standard
human-readable documentation, and an abridged agents.txt (or llms.txt – https://llmstxt.org/) file
that can be used by ML agents to write high-quality code using that library.
Going forward, we will undertake further prototyping to incorporate MLLM-based agents more tightly
into our data management workflows, and to ensure that data curated or modified by such an agent will
be appropriately ‘credited’ by, for example, visually demarcating AI-generated content, and providing UI
pathways to verify or ‘relabel’ such data in an efficient manner. Finally, we emphasize the great progress
made within the last year in MLLMs themselves, which are now able to handle audio and video content in
addition to text and images. These will allow MLLM agents to use audiovisual data in real-time to provide
new user interfaces. Based on these promising developments, we believe that data management platforms
are well-placed to help bridge the divide from physical to digital data recording.
References
[1] M. L. Evans and J. D. Bocarsly. datalab, July 2024. URL https://github.com/datalab-org
doi:10.5281/zenodo.12545475.
[2] Jablonka et al. 14 examples of how LLMs can transform materials science and chemistry: a reflection
on a large language model hackathon. Digital Discovery, 2023. doi:10.1039/D3DD00113J.
18 LLMads
Authors: Sarthak Kapoor, José M. Pizarro, Ahmed Ilyas, Alvin N. Ladines, Vikrant Chaud-
hary
Parsing raw data into a structured (meta)data standard or schema is a major Research Data Management
(RDM) topic. While defining F.A.I.R. (Findable, Accessible, Interoperable, and Reusable) metadata schemas
is key to RDM, the schemas' empty fields must still be populated. This is typically done in two ways:
• Fill a schema manually using electronic or physical lab notebooks, or
• Create scripts that read the input/output raw files and parse them into the data schema.
The first option is used in a lab setting where data is entered in real-time as it is generated. The second
option is used when data files are available, albeit in a format incompatible with filling the schema directly. These
can be measurement files coming from instruments or files generated by simulations. Specific parsers for
each raw file type can transfer large amounts of data into schemas, making them essential for automation
and big-data management. However, implementing parsers for all possible raw files to fill a schema can
be laborious and time-consuming. It requires expert knowledge of the structure of the raw files and regular
maintenance to keep up with new versions of the raw files. In this work, we attempted to substitute parsers with
Large Language Models (LLMs).
Large Language Models (LLMs).
We investigated whether LLMs can be used to parse data into structured schemas, thus relieving the
need for coding parsers. As an example, we used raw files from X-ray diffraction (XRD) measurements from
three different instrument vendors (Bruker, Rigaku, and Panalytical). We defined a Pydantic model for our
structured data schema and used the pre-trained Mixtral-8x7B served by Groq. The data schemas are provided
to the LLM using the function-calling mechanism. The schema is constructed by defining Pydantic model
classes whose fields or attributes have well-defined types and descriptions; the LLM then tries to extract
data matching these descriptions from the raw files. Considering the size of the raw files and the token
limits of LLMs, we processed each raw file in chunks, passing the previous response along with each new chunk.
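The class names XRDSettings and XRDResults below follow the text; the field names are illustrative stand-ins for the actual schema. Such Pydantic models, with typed and described fields, are what the LLM receives through function calling:

from pydantic import BaseModel, Field

class XRDSettings(BaseModel):
    source_material: str = Field(description="Anode material of the X-ray source, e.g. Cu")
    wavelength: float = Field(description="X-ray wavelength in angstrom")

class XRDResults(BaseModel):
    two_theta: list[float] = Field(description="2-theta angles in degrees")
    intensity: list[float] = Field(description="Measured intensities in counts")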
We found that, when populating the schema, the LLM correctly extracted values whose data types were
float and str, as was the case for the XRDSettings class. However, when parsing values with the data type
list[float], the LLM was often unable to extract the data in the expected format. This occurred when
populating the XRDResults class with the intensity data: the LLM output included non-numeric characters
like \n or \t, along with hallucinated data values. By providing the previous response along with the new
chunk of data in the prompts, we incorporated some degree of context. We found that smaller chunk sizes
led to rapid replacement of the already-populated data; sometimes, correct data was replaced by
hallucinated values.
We used LangChain to build our models and prompt generators. The code is openly available on GitHub:
https://github.com/ka-sarthak/llmads. Our work uses prompt engineering and function-calling; future
work could explore tuning the model temperature and fine-tuning to combat hallucination. Our results
also indicate a need for human intervention to verify that the schema was filled correctly and to correct it
when necessary. Nevertheless, our prompting strategy proves to be a valuable tool, as it manages to initialize
the schema properly for non-vectorial fields at the minimal effort of providing a structured schema and
the raw files.
19 NOMAD Query Reporter: Automating Research Data Narra-
tives
Authors: Nathan Daelman, Fabian Schöppach, Carla Terboven, Sascha Klawohn, Bernadette
Mohr
Materials science research data management (RDM) platforms and structured data repositories contain
large numbers of entries, each composed of property-value pairs. Users query these repositories by specifying
property-value pairs to find entries matching specific criteria. While this guarantees that all returned entries
have at least the queried properties, it provides no context or insight into the structure and variety
of other data present in them.
Traditionally, it is up to the data scientist to examine the returned properties and interpret the overall
response. To assist with this task, we use an LLM to create context-aware reports based on the properties
and their meanings. We built and tested our "Query Reporter" [1] prototype on the NOMAD [2] repository,
which stores heterogeneous materials science data, including metadata on scientific methodologies, atomistic
structures, and materials properties.
We developed the NOMAD Query Reporter [1] as a proof-of-concept. It fetches and analyzes entries
and produces a summary of the used methodology and standout results. It does so in a scientific style,
lending itself as the basis for a “methods” section in a journal article. Our agent uses a retrieval-augmented
generation (RAG) approach [3], which enriches an LLM’s knowledge of external DB data without performing
retraining or fine-tuning. To allow its safe application on private and unpublished data, we use a self-hosted
Ollama instance running Meta’s Llama3 70B [4] model.
We tested the agent on publicly available data from NOMAD. To manage Llama3’s context window of
8,000 tokens, the entries are collected as rows into a Pandas dataframe. Each row (i.e., each entry) is individually
passed on to the LLM via the chat-completion API. Instead of a single message, it accepts a multi-turn
conversation that simulates several demarcated roles. We use the “system” and “user” roles of the chat to
reinforce the retention of parts of the previous summary. This approach generally conforms to the Naive
RAG category in Gao et al.’s classification [3]. For a step-by-step overview, see Figure 27.
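A minimal sketch of this loop is shown below, assuming the ollama Python client and a dataframe entries_df with one NOMAD entry per row; the exact prompts and model tag are placeholders.

from ollama import Client

client = Client(host="http://localhost:11434")  # self-hosted Ollama instance
summary = ""
for _, row in entries_df.iterrows():
    messages = [
        {"role": "system", "content": "You summarize materials science data "
                                      "in the style of a methods section."},
        {"role": "system", "content": f"Summary so far:\n{summary}"},  # reinforce retention
        {"role": "user", "content": f"Integrate this entry:\n{row.to_json()}"},
    ]
    response = client.chat(model="llama3:70b", messages=messages)
    summary = response["message"]["content"]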
We used two kinds of data for testing: (a) homogeneous, property-value pairs of computational data;
and (b) heterogeneous text typed properties of solar cell experiments, often formatted as in-text tables.
We engineered different prompts for each kind. The agent performed better on the homogeneous than on the heterogeneous data; for the latter, summaries would often suffer from irrelevant threads or even hallucinations. We
theorize that homogeneous data maps more consistently onto our predefined dataframe columns, which aids
the LLM in interpreting follow-up messages. Still, we could not improve the performance for heterogeneous
data within the hackathon.
In short, the NOMAD Query Reporter demonstrates that the combined approach of filtering and RAG can
effectively summarize collections of limited-size, structured data directly stored in research data repositories,
allowing for automated drafting of methods and setups for publications at a consistent level of quality
and style. These results suggest applicability for other well-defined materials science APIs, such as the
OPTIMADE standard [5]. Follow-up work includes investigating the impact of Advanced RAG strategies [3].
19.1 Acknowledgements
N. D., S. K., and B. M. are members of the NFDI consortium FAIRmat, funded by the German Research
Foundation (DFG, Deutsche Forschungsgemeinschaft) in project 460197019. F. S. acknowledges funding
received from the SolMates project, which has been supported by the European Union’s Horizon Europe
research and innovation program under grant agreement No 101122288. C. T. is supported by the German
Federal Ministry of Education and Research (BMBF, Bundesministerium für Bildung und Forschung) in the
framework of the project Catlab 03EW0015A.
References
[1] https://github.com/ndaelman-hu/nomad_query_reporter.
Figure 27: Flowchart of the Query Reporter usage, including the back-end interaction with external resources,
i.e., NOMAD and Llama. Intermediate steps managing hallucinations or token limits are marked in red and
orange, respectively.
[2] NOMAD: A distributed web-based platform for managing materials science research data. M. Scheidgen, et al., JOSS, 8(90), 5388 (2023), doi.org/10.21105/joss.05388.
[3] Retrieval-Augmented Generation for Large Language Models: A Survey. Gao, Y., et al., arXiv (2024),
doi.org/10.48550/arXiv.2312.10997.
[4] LLaMA: Open and efficient foundation language models. H. Touvron, et al., arXiv (2023), doi.org/10.48550/arXiv.2302.13971.
[5] Development and applications of the OPTIMADE API for materials discovery, design, and data exchange. Evans, M. L., et al., Digital Discovery (2024), doi.org/10.1039/D4DD00039K.
20 Speech-schema-filling: Creating Structured Data Directly from
Speech
Authors: Hampus Näsström, Julia Schumann, Michael Götte, José A. Márquez
20.1 Introduction
As the amount of materials science data being created increases, so do the efforts to make this data Findable,
Accessible, Interoperable, and Reusable (FAIR) [1]. One pragmatic approach to creating FAIR data is by
defining so-called data schemas for the various types of data being recorded. These schemas can then be used
in both input tools like electronic lab notebooks (ELNs) and storage solutions like data repositories to create
structured data. One widely adopted standard for writing data schemas is JSON Schema [2]. JSON Schema allows us to define objects, such as a solution preparation experiment in the lab, with properties such as temperature and a list of solutes and solvents (see Figure 28a). These schemas
can then be used to create forms in an ELN like NOMAD [3] (see Figure 28b). However, in many lab
situations, such as when working inside a glovebox, it is difficult to i) navigate the ELN and select the right
form and ii) actually fill in the form with experimental data. In our experience, this usually results in users
having to record their data later from memory or even in data not being recorded at all.
We propose a solution for this using LLMs to:
• convert spoken language in the lab into text using advanced speech recognition technologies, such as OpenAI’s Whisper (see the sketch after this list); and
• select and fill the appropriate schema based on that text, enabling accurate data capture without manual text entry.
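A minimal sketch of the transcription step, using the SpeechRecognition [4] and openai-whisper [5] packages, might look as follows; the microphone capture, file name, and model size are illustrative choices.

import speech_recognition as sr
import whisper

recognizer = sr.Recognizer()
with sr.Microphone() as source:          # record a spoken lab note
    audio = recognizer.listen(source)
with open("lab_note.wav", "wb") as f:    # persist the capture for Whisper
    f.write(audio.get_wav_data())

model = whisper.load_model("base")       # local Whisper model
transcribed_audio = model.transcribe("lab_note.wav")["text"]

The transcribed_audio string then feeds the schema-selection chain shown in the following fragments.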
    },
    {
        "name": "powder_scaling",
        "description": "Schema for powder scaling",
        "parameters": Scaling.schema(),
    }
],
)
where SolutionPreparation and Scaling are Pydantic models for our desired data schemas. Finally, this can be chained together with a prompt template and used to process the transcribed audio from before:
prompt = PromptTemplate.from_template(...)
chain_with_tools = prompt | model
response = chain_with_tools.invoke(transcribed_audio)
20.4 Acknowledgements
H.N., J. S., and J. A. M. are part of the NFDI consortium FAIRmat funded by the Deutsche Forschungsge-
meinschaft (DFG, German Research Foundation) – project 460197019.
References
[1] Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data
management and stewardship. Sci Data 3, 160018 (2016) https://doi.org/10.1038/sdata.2016.18.
[2] https://json-schema.org/
[3] Scheidgen et al., (2023). NOMAD: A distributed web-based platform for managing materials science
research data. Journal of Open Source Software, 8(90), 5388, https://doi.org/10.21105/joss.05388.
[4] https://pypi.org/project/SpeechRecognition/.
[5] https://pypi.org/project/openai-whisper/.
[6] https://pypi.org/project/langchain-experimental/.
Figure 28: a) Part of a JSON Schema defining a data structure for a solution preparation. b) The schema
converted to an ELN form in NOMAD [3].
21 Leveraging LLMs for Bayesian Temporal Evaluation of Scientific Hypotheses
Authors: Marcus Schwarting
21.1 Introduction
Science is predicated on empiricism, and a scientist uses the tools at their disposal to gather observations
that support the veracity of their claims. When one cannot gather firsthand evidence for a claim, they must
rely on their own assessment of evidence presented by others. However, the scientific literature on a claim
is often large and dense (particularly for those without domain expertise), and the scientific consensus on
the veracity of a claim may drift over time. In this work we consider how large language models (LLMs),
in conjunction with temporal Bayesian statistics, can rapidly provide a more holistic view of a scientific
inquiry. We demonstrate our approach on the hypothesis that the material LK-99, which went viral after its
discovery in April 2023, is in fact a room-temperature superconductor.
21.2 Background
Scientific progress requires a researcher to iteratively update their prior understanding based on new obser-
vations. The process of updating a statistical prior based on new information is the backbone of Bayesian
statistics and is routinely used in scientific workflows [1]. Under such a model, a hypothesis has an inferred
probability PH ∈ (0, 1) of being true and will never be completely discarded (PH = 0) or accepted (PH = 1)
but may draw arbitrarily close to these extrema. This inferred probability from a dataset is also a feature
of assessing the power of a claim using statistical hypothesis tests [2], which are commonly used across most
scientific disciplines.
Modelling the veracity of a scientific claim as a probability also stems from the work of philosopher
Karl Popper [3]. Popper posits that a scientific claim should be falsifiable, such that it can be refuted by
empirical evidence. If a scientific claim withstands attempts at refutation, Popper argues, it is not thereby proven, but it gains further credibility from the scientific community. Scientific progress is not made by proving claims, but
instead by hewing away false claims through observation until only those that withstand repeated scrutiny
remain.
The history of science is littered with discarded hypotheses. Some of these claims were believed to be
true for centuries before being dismissed (including geocentrism, phrenology, energeticism, and spontaneous
generation). In this work, we focus on a claim by Lee, Kim, and Kwon that they had successfully synthesized
a room-temperature superconductor, which they called LK-99 [4]. Such a material would necessitate altering
the existing theory of superconductors at a fundamental level and would enable innovations that are currently
beyond reach. LK-99 went viral in summer 2023 [5], but replication efforts quickly ran into issues [6]. Since
the initial claim was made in April 2023, roughly 160 works have been published, and the scientific consensus
now appears established: LK-99 is not a room-temperature superconductor.
21.3 Methods
Our dataset consists of the 160 papers on Google Scholar, ordered by publication date, that are deemed
relevant to the hypothesis “LK-99 is a room-temperature superconductor.” For each paper abstract in
our dataset, we construct an LLM prompt as follows to perform a zero-shot natural language inference
operation [7]:
Given a Hypothesis and an Abstract, determine whether the Hypothesis is ‘entailed’, ‘neutral’, or ‘contradicted’ by the Abstract. \n Hypothesis: {claim} \n Abstract: {abstract}
We then wrote a regular expression that checks the LLM response and makes an assertion for each publication: “entailment,” “neutrality,” or “contradiction.” We use Llama2 to make these assertions on all 160 papers, which completes in under five minutes (on a desktop with an Nvidia GTX-1070 GPU).
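A sketch of this label-extraction step might look like the following; the exact regular expression we used may differ.

import re

def classify(llm_response: str) -> str:
    """Map a free-text LLM response onto one of the three NLI labels."""
    match = re.search(r"entail|contradict|neutral", llm_response, re.IGNORECASE)
    if match is None:
        return "neutrality"  # default when no label is found
    stem = match.group(0).lower()
    return {"entail": "entailment", "neutral": "neutrality",
            "contradict": "contradiction"}[stem]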
Next, we constructed a temporal Bayesian model where, starting from an initial Gaussian prior $\mathcal{N}(\mu_P, \sigma_P^2)$ that models the likelihood of accepting the hypothesis, we can update with a Gaussian likelihood $\mathcal{N}(\mu_L, \sigma_L^2)$.
Our likelihood is a Gaussian designed so that, for a given publication, “entailment” pushes the acceptance probability higher, “contradiction” pushes the acceptance probability lower, and “neutrality” leaves the acceptance probability unchanged. Our likelihood probability is further weighted according to the impact factor of the journal, with an impact factor floor set to accommodate publications that have not passed peer review. Retroactively, we could also weight by the number of citations; however, we omit this feature from our analysis since it is inherently a post-hoc metric and would violate our temporal assessment. We update our Gaussian prior using the equations [8]:
$$\mu_P \leftarrow \left(\frac{1}{\sigma_P^2} + \frac{1}{\sigma_L^2}\right)^{-1} \left(\frac{\mu_P}{\sigma_P^2} + \frac{\mu_L}{\sigma_L^2}\right); \qquad \sigma_P^2 \leftarrow \left(\frac{1}{\sigma_P^2} + \frac{1}{\sigma_L^2}\right)^{-1}$$
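In code, one update step of this conjugate Gaussian model reduces to the following minimal sketch:

def bayes_update(mu_p, var_p, mu_l, var_l):
    """One update of the prior N(mu_p, var_p) with a likelihood N(mu_l, var_l)."""
    precision = 1.0 / var_p + 1.0 / var_l
    mu_post = (mu_p / var_p + mu_l / var_l) / precision
    return mu_post, 1.0 / precision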
We are also able to specify the initial probability associated with the hypothesis ($\mu_P$), as well as how flexible we are in changing our perspective based on new evidence ($\sigma_P^2$). We select two initial probabilities, 50% and 20%, and fix the variances $\sigma_P^2$ and $\sigma_L^2$. Assuming either “contradiction” or “entailment”,
the update due to µL then scales linearly with the publication impact factor. Finally, we can compare our
temporal probability assessment with probabilities provided by the betting platform Manifold Markets [9],
where players bet on the outcome of a successful replication of the LK-99 publication results.
21.4 Results
We find that our temporal Bayesian probabilities, with an adjusted initial prior, can mirror the Manifold
Markets probabilities with two interesting divergences. While the initial probability starts at around 20% for
both, the temporal Bayesian approach never goes above 30%. By contrast, the betting market, following the
hype and virality of the LK-99 publication, reaches a peak at around 60%. Furthermore, while the betting
market has a long tail of low probability starting in mid-August 2023, our approach more quickly disregards
the hypothesis based on a continuing accumulation of studies showing that the LK-99 findings could not be
replicated. Our temporal Bayesian model with the adjusted initial prior reaches a probability below 1% by
mid-September 2023, but never entirely dismisses the chance that the hypothesis is true. Figure 29 showcases
these results.
While we specifically select initial probabilities of 20% and 50%, both trajectories end with a steadily
shrinking probability of accepting the hypothesis. Our initial prior probability of 50% mimics an unbiased
observer with no knowledge about whether the hypothesis should be accepted or rejected. A prior probability
of 20% could be considered a reasonable guess for an observer biased by a baseline understanding and
suspicion of the claims and their corresponding evidence. Such an initial guess is admittedly subjective,
as is the degree to which new information affects one’s inherent biases. We treat these settings as presets; however, they are trivial for others to reconfigure and assess based on their own background. Finally, for claims with established scientific consensus, our approach is guaranteed to asymptotically approach that consensus, with a rate of convergence that varies according to these initial presets.
Figure 29: Likelihood of accepting the hypothesis “LK-99 is a room-temperature superconductor” via three
approaches, from April 15, 2023 to April 15, 2024. The unadjusted initial probability (set to 50%) is shown
in gray, the adjusted initial probability (set to 20%) is shown in black, and the probability according to the
Manifold Markets online betting platform is shown in red.
21.5 Conclusion
Carefully validating a claim based on a body of scientific literature can be a time-consuming and challenging
prospect, especially without domain expertise. In this work, we demonstrate how a claim might be evaluated
using a temporal Bayesian model based on a literature evaluation using natural language inference.
We show how our aggregated literature predictions allow us to quickly reject the hypothesis that LK-99 is
a room-temperature superconductor. In the future, we hope to apply this approach to other scientific claims, including those under ongoing debate as well as claims with an established scientific consensus.
In general, we hope this approach will allow a researcher to quickly measure the scientific community’s
confidence in a claim, as well as aid the public in assessing both the veracity of a claim and the change in
confidence driven by continued experimentation and observation.
References
[1] Settles, Burr. ”Active learning literature survey.” (2009).
[2] Lehmann, Erich Leo, Joseph P. Romano, and George Casella. Testing statistical hypotheses. Vol. 3. New York: Springer, 1986.
[6] Garisto, Dan. ”Claimed superconductor LK-99 is an online sensation—But replication efforts fall short.”
Nature 620, no. 7973 (2023): 253-253.
[7] Liu, Hanmeng, Leyang Cui, Jian Liu, and Yue Zhang. ”Natural language inference in context-
investigating contextual reasoning over long texts.” In Proceedings of the AAAI conference on artificial
intelligence, vol. 35, no. 15, pp. 13388-13396. 2021.
[8] Murphy, Kevin P. ”Conjugate Bayesian analysis of the Gaussian distribution.” def 1, no. 2σ 2 (2007):
16.
[9] “Will the LK-99 room temp superconductivity pre-print replicate in 2023”. Manifold Markets (2024).
https://manifold.markets/Ernie/will-the-lk99-room-temp-ambient-pre-17fc7cb7a2a0.
22 Multi-Agent Hypothesis Generation and Verification through
Tree of Thoughts and Retrieval Augmented Generation
Authors: Aleyna Beste Ozhan, Soroush Mahjoubi
22.1 Introduction
Our project, developed during the “LLM Hackathon for Applications in Materials and Chemistry,” aims to accelerate scientific hypothesis generation and enhance the creativity of scientific inquiry. We propose using a multi-agent system of specialized large language models (LLMs) to streamline and enrich hypothesis generation and verification in materials science. This approach leverages diverse, fine-tuned LLM agents collaborating to generate and validate novel hypotheses more effectively. While similar pipelines have proven useful in the social sciences [1], to the best of our knowledge, this work marks the first adaptation of such an approach to hypothesis generation in materials science. As illustrated in Figure 30, the system includes agents such as a background provider, an inspiration generator, a hypothesis generator, and three evaluators. Each agent plays a crucial role in formulating and assessing hypotheses, ensuring only the most viable and compelling ideas are developed. This approach fosters an environment conducive to scientific inquiry.
22.2 Methodology
Background Extraction The background extraction module is designed to search through a vast database
for relevant information directly related to the user’s query. This module employs advanced embedding-based retrieval techniques to identify and extract a pertinent corpus. As new papers and findings are added to the
repository, the system dynamically updates, ensuring the use of the most current and relevant information.
Inspiration Generator Agent The inspiration generator agent leverages extensive background data to
effectively formulate inspirations using a Retrieval Augmented Generation (RAG) mechanism. Serving as
the strategic core of the hypothesis generation process, it draws inspiration from a broad spectrum of sources to spawn diverse hypotheses, similar to the branching structure of a “Tree of Thoughts” (ToT) [2]. The agent samples k candidates as possible solutions, evaluates their effectiveness through self-feedback, and votes on the most promising candidates. The selection is narrowed down to b promising options per step, and this structured approach helps the agent systematically refine its solutions.
Hypothesis Generator Based on the background information and the inspirations, this module generates
meaningful research hypotheses. It is fine-tuned on reasoning datasets, such as Atlas, which encompasses
various types of reasoning including deductive and inductive reasoning, cognitive biases, decision theory, and
argumentative strategies [3].
22.5 Results
In the initial phase of our “Tree of Thoughts” structure, we generated approximately 5,000 inspirations.
These inspirations were refined to around 1,000 through a distillation step. The hypothesis generator, GPT-
3.5 Turbo, fine-tuned on 13,000 data points from the AtlasUnified/Atlas-Reasoning dataset, produced one
hypothesis per inspiration. The evaluation process involved three agents assessing feasibility, utility, and
novelty (FUN) using embedding-based retrieval to identify relevant abstracts. For enhanced precision, GPT-
4 was employed during the evaluation stages. Ultimately, hypotheses that withstood all evaluation stages
were included in the hypothesis pool. Out of the initial 1,000 hypotheses, 243 passed the feasibility filter,
175 were deemed useful, and only 12 were found to be highly novel. The 12 hypotheses deemed feasible,
novel, and useful are listed in Table 5.
innovative hypotheses in materials science. Similarly, within the domain of materials science, inspirations
generated based on concrete research could be used to develop hypotheses for other materials, such as
ceramics or composites. This cross-pollination of ideas can foster creativity and drive breakthroughs by
applying concepts from one domain to another.
Table 5: The 12 hypotheses that passed the feasibility, utility, and novelty filters.

1. Incorporating Stainless Steel (SS) micropowder from additive manufacturing into cement paste mixtures can improve the mechanical strength and durability of the mixture, with an optimal addition of 5% SS micropowder by volume.
2. The use of synthesized zeolites in self-healing concrete can significantly improve the durability and longevity of concrete structures.
3. The utilization of municipal solid waste (MSW) in cement production, by integrating anaerobic digestion and mechanical-biological treatment to produce refuse-derived fuel (RDF) for cement kilns, can reduce environmental impacts, establish a sustainable waste-to-energy solution, and create a closed-loop process that aligns waste management with cement production for a more sustainable future.
4. The use of smart fiber-reinforced concrete systems with embedded sensing capabilities can revolutionize infrastructure monitoring and maintenance by providing real-time feedback on structural health, leading to safer and more resilient built environments.
5. The use of advanced additives or nanomaterials in geopolymer well cement can enhance its mechanical properties and durability, leading to more reliable CO2 sequestration projects.
6. The use of carbonated steel slag as an aggregate in concrete can enhance the self-healing performance of concrete, leading to improved durability and longevity.
7. The synergistic effect of combining different pozzolanic materials with varying particle sizes and reactivities can lead to the development of novel high-performance concrete formulations with superior properties.
8. Smart bio-concrete incorporating bacterial silica leaching exhibits superior strength, durability, and reduced water absorption capacity compared to traditional concrete.
9. A novel eco-concrete formulation combining carbonated aggregates with other sustainable materials, like volcanic ash or limestone powder, can create a carbon-negative concrete with superior mechanical strength, durability, and thermal conductivity.
10. The use of nano-enhanced steel fiber reinforced concrete (NSFRC) will result in a significant improvement in the mechanical properties, durability, and crack resistance of concrete structures compared to traditional steel fiber reinforced concrete.
11. The combined addition of silica fume (SF) and nano-silica (NS) can enhance sulphate and chloride resistance beyond what is possible with the single addition of SF or NS.
12. The utilization of oil shale fly ash (OSFA) in concrete production can be optimized to develop sustainable and high-performance construction materials.
References
[1] Yang, Zonglin, et al. ”Large Language Models for Automated Open-domain Scientific Hypotheses Dis-
covery.” arXiv preprint arXiv:2309.02726 (2023). https://arxiv.org/pdf/2309.02726
[2] Yao, Shunyu, et al. ”Tree of thoughts: Deliberate problem solving with large language models.” Advances
in Neural Information Processing Systems 36 (2024).
[3] https://huggingface.co/datasets/AtlasUnified/Atlas-Reasoning/commits/main.
[4] https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2.
23 ActiveScience
Authors: Min-Hsueh Chiu
23.1 Introduction
Humans have conducted materials research for thousands of years, yet the vast chemical space, estimated to encompass up to $10^{60}$ compounds, remains largely unexplored. Traditional research methods often focus on incremental improvements, making the discovery of new materials a slow process. As data infrastructure has developed, data mining techniques have increasingly accelerated materials discovery. However, three significant obstacles hinder this process. First, the availability and sparsity of data present a major challenge: comprehensive and high-quality datasets are essential for effective data mining, yet materials science suffers from limited data availability. Second, each database typically consists of specific types of quantitative
properties, which may not fully meet researchers’ needs. This fragmentation and specialization of databases
can impede the holistic analysis necessary for breakthrough discoveries. Third, scientists usually focus on
certain materials and related articles, decreasing the likelihood of deeply exploring diverse literature that
reports potential materials or applications not yet utilized in their specific field. This siloed approach further
limits the scope of discovery and innovation.
Unlike the intrinsic properties found in databases, scientific articles provide unstructured but higher-level
information, such as applications, material categories, and properties that might not be explicitly recorded
in databases. Additionally, these texts include inferences and theories proposed by domain experts, which
are crucial for guiding research directions. The challenge lies in automatically extracting and digesting this
unstructured text into actionable insights. This process typically requires experienced experts, creativity,
and a measure of luck to identify the next desirable candidate. These challenges motivate the potential
of utilizing large language models to parse scientific reports and integrate the extracted information into
knowledge graphs, thereby constructing high-level insights.
23.2 Approach
The Python-based ActiveScience framework consists of three key functionalities: data source API, large
language model, and graph database. LangChain is employed for downstream applications within this
framework. The schematic pipeline is illustrated in Figure 31. Notably, this framework is not restricted to the
specific packages or APIs used in this demonstration; alternative tools that provide the required functionality
and input/output can also be integrated.
The arXiv API was used to retrieve scientific report titles, abstracts, and URLs. This demonstration focused on reports related to alloys; consequently, the query string “cat:cond-mat.mes-hall AND ti:alloy” was submitted to the arXiv API, which returned the relevant articles. GPT-3.5 Turbo was used as the large language model. The system’s role was defined as: “You are a material science professor and want to extract
information from the paper’s abstract.” The provided prompt was: “Given the abstract of a paper, can you
generate a Cypher code to construct a knowledge graph in Neo4j? ...” along with the designated ontology
schema. The generated Cypher code is then input into Neo4j, which is used to ingest the entity relationships,
store the knowledge graph, and provide robust visualization and querying interfaces.
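A condensed sketch of this extraction-and-ingestion step is given below; the model call follows the OpenAI chat API, while the connection details, helper name, and inputs (abstract, ontology_schema) are placeholders.

from neo4j import GraphDatabase
from openai import OpenAI

client = OpenAI()

def abstract_to_cypher(abstract: str, ontology: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a material science professor and want "
                                          "to extract information from the paper's abstract."},
            {"role": "user", "content": "Given the abstract of a paper, can you generate a "
                                        "Cypher code to construct a knowledge graph in Neo4j? "
                                        f"Ontology schema:\n{ontology}\n\nAbstract:\n{abstract}"},
        ],
    )
    return completion.choices[0].message.content

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(abstract_to_cypher(abstract, ontology_schema))  # ingest the generated graph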
23.3 Results
With the implemented knowledge graph, the GraphCypherQAChain module from LangChain was employed to perform retrieval-augmented generation. For instance, when asked, “Give me the top 3 references URLs where
the Property contains ’opti’?” GraphCypherQAChain automatically generates a Cypher query according to
the designated schema, executes it in Neo4j, and ultimately returns the relevant answer, as shown in the
right bottom box in Figure 31. Although this demonstration used a simple question, more complex queries
can be processed using this framework. However, handling such queries effectively will require more refined
prompting techniques.
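A sketch of this retrieval step, assuming current LangChain community APIs and placeholder connection details, is:

from langchain.chains import GraphCypherQAChain
from langchain_community.graphs import Neo4jGraph
from langchain_openai import ChatOpenAI

graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")
chain = GraphCypherQAChain.from_llm(ChatOpenAI(model="gpt-3.5-turbo"), graph=graph)
result = chain.invoke({"query": "Give me the top 3 references URLs "
                                "where the Property contains 'opti'?"})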
Figure 31: A schematic illustration of the ActiveScience architecture and its potential applications, with a code snippet demonstrating the use of LangChain.
24 G-Peer-T: LLM Probabilities For Assessing Scientific Novelty
and Nonsense
Authors: Alexander Al-Feghali, Sylvester Zhang
Large language models (LLMs) and foundation models have garnered significant attention lately due to
their natural language programmability and potential to parse high-dimensional data from reactions to the
scientific literature [1]. While these models have demonstrated utility in various chemical and materials
science applications, we propose leveraging their designed strength in language processing to develop a first-pass peer-review system for materials science research papers [2].
Traditional n-gram metrics such as BLEU or ROUGE, as well as X-of-thought LLM-based evaluations, are not sensitive enough to creativity or diversity in scientific writing [3]. Our approach utilizes the learned probabilistic features to establish a baseline for typical scientific language in materials science, based on fine-tuning on materials science abstracts through a historical train-test split. New abstracts are scored by their weighted-average probabilities, identifying those that deviate from the expected norms and flagging both possibly innovative and potentially nonsensical works.
As a proof-of-concept in this direction, we fine-tuned two models, OPT (6.7B) and TinyLlama (1.1B), using the Huggingface PEFT library’s Low-Rank Adapters (LoRA) to access the log probabilities of the abstracts, which are not typically accessible through modern API services [4–6]. Our results come with the usual caveats for small models trained at small computational cost.
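The scoring itself reduces to reading per-token log-probabilities from the fine-tuned model; a minimal sketch with the transformers library (model name illustrative) is:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def mean_logprob(abstract: str) -> float:
    """Mean log-probability per token of an abstract under the model."""
    ids = tok(abstract, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # predict token t+1 from token t
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()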
We curated a dataset of 6000 abstracts from PubMed, published between 2017–2020, focusing on Materials
Science and Chemistry [7]. The models were fine-tuned over 200 steps using this dataset. We compared
highly cited papers (>200 citations) with those of average citation counts. Our preliminary findings suggest
that higher-cited papers exhibit less “typical” language use, with mean log probabilities of -2.24 ± 0.32 for
highly cited works compared to -1.79 ± 0.3 for average papers. However, the calculated p-value of 0.07
indicates that these results are not statistically significant at the conventional 0.05 level.
Full training with more steps on larger models, as well as more experimentation and method optimization,
would yield more reliable results and be of modern relevance. Our documented code with step-by-step
instructions is available in the repository [8].
References
[1] K. M. Jablonka, P. Schwaller, A. Ortega-Guerrero, B. Smit, Nat. Mach. Intell., 2024, 6, 161–169.
[2] D. A. Boiko, R. MacKnight, B. Kline, G. Gomes, Nature, 2023, 624, 570–578.
[3] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, Proc. 2023 Conf. Empir. Methods Nat. Lang. Process.,
2023, 2511–2522.
[4] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, B. Bossan, PEFT: State-of-the-art Parameter-
Efficient Fine-Tuning Methods; GitHub: https://github.com/huggingface/peft, 2022.
[5] P. Zhang, G. Zeng, T. Wang, W. Lu, TinyLlama: An Open-Source Small Language Model;
arXiv:2401.02385.
[6] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T.
Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, L. Zettlemoyer,
OPT: Open Pre-trained Transformer Language Models; arXiv:2205.01068.
[7] National Center for Biotechnology Information (NCBI) [Internet]. Bethesda (MD): National Library
of Medicine (US), National Center for Biotechnology Information; https://www.ncbi.nlm.nih.gov/,
1998.
[8] A. Al-Feghali, S. Zhang, G-Peer-T ; GitHub: https://github.com/alxfgh/G-Peer-T, 2024.
25 ChemQA: Evaluating Chemistry Reasoning Capabilities of Multi-Modal Foundation Models
Authors: Ghazal Khalighinejad, Shang Zhu, Xuefeng Liu
25.1 Introduction
Current foundation models exhibit impressive capabilities when prompted with text and image inputs in the
chemistry domain. However, it is essential to evaluate their performance on text alone, image alone, and a
combination of both to fully understand their strengths and limitations. In chemistry, visual representations
often enhance comprehension. For instance, determining the number of carbons in a molecule is easier for
humans when provided with an image rather than SMILES annotations. This visual advantage underscores
the need for models to effectively interpret both types of data. To address this, we propose ChemQA—a
benchmark dataset containing problems across five question-and-answering (QA) tasks. Each example is
presented with isomorphic representations: visual (images of molecules) and textual (SMILES). ChemQA
enables a detailed analysis of how different representations impact model performance.
We observe from our results that models perform better when given both text and visual inputs compared
to when they are prompted with image-only inputs. Their accuracy significantly decreases when provided
with only visual information, highlighting the importance of multimodal inputs for complex reasoning tasks
in chemistry.
• Counting Numbers of Carbons and Hydrogens in Organic Molecules: adapted from the 600
PubChem molecules created by [3], evenly divided into validation and evaluation datasets.
– Example: Given a molecule image or its SMILES notation, identify the number of carbons and
hydrogens.
• Calculating Molecular Weights in Organic Molecules: adapted from the 600 PubChem molecules
created by [3], evenly divided into validation and evaluation datasets.
– Example: Given a molecule image or its SMILES notation, calculate its molecular weight.
• Name Conversion: From SMILES to IUPAC: adapted from the 600 PubChem molecules created
by [3], evenly divided into validation and evaluation datasets.
– Example: Convert a given SMILES string or a molecule image to its IUPAC name.
• Molecule Captioning and Editing: inspired by [3], adapted from the dataset provided in [1],
following the same training, validation, and evaluation splits.
– Example: Given a molecule image or its SMILES notation, find the most relevant description of
the molecule.
• Retro-synthesis Planning: inspired by [3], adapted from the dataset provided in [4], following the
same training, validation, and evaluation splits.
– Example: Given a molecule image or its SMILES notation, find the most likely reactants that can
produce the molecule.
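The dataset is published on the Hugging Face Hub [5] and can be loaded in the usual way; split names may differ on the Hub.

from datasets import load_dataset

chemqa = load_dataset("shangzhu/ChemQA")
print(chemqa)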
Figure 32: Performance of Gemini Pro, GPT-4 Turbo, and Claude3 Opus on text, visual, and text+visual
representations. The plot shows that models achieve higher accuracy with combined text and visual inputs
compared to visual-only inputs.
References
[1] Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. Translation between
molecules and natural language. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Pro-
ceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 375–413,
Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi:
10.18653/v1/2022.emnlp-main.26. URL https://aclanthology.org/2022.emnlp-main.26.
[2] Deqing Fu, Ghazal Khalighinejad, Ollie Liu, Bhuwan Dhingra, Dani Yogatama, Robin Jia, and Willie
Neiswanger. Isobench: Benchmarking multimodal foundation models on isomorphic representations,
2024. URL https://arxiv.org/abs/2404.01266.
[3] Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest,
and Xiangliang Zhang. What can large language models do in chemistry? a comprehensive benchmark
on eight tasks, 2023. URL https://arxiv.org/abs/2305.18365.
[4] Ross Irwin, Spyridon Dimitriadis, Jiazhen He, and Esben Jannik Bjerrum. Chemformer: a pre-trained
transformer for computational chemistry. Machine Learning: Science and Technology, 3(1):015022, Jan-
uary 2022. doi: 10.1088/2632-2153/ac3ffb. URL https://dx.doi.org/10.1088/2632-2153/ac3ffb.
[5] Shang Zhu, Xuefeng Liu, and Ghazal Khalighinejad. ChemQA: a multimodal question-and-answering
dataset on chemistry reasoning. https://huggingface.co/datasets/shangzhu/ChemQA, 2024.
26 LithiumMind - Leveraging Language Models for Understanding Battery Performance
Authors: Xinyi Ni, Zizhang Chen, Rongda Kang, Sheng-Lun Liao, Pengyu Hong, Sandeep
Madireddy
26.1 Introduction
In this project, we explore multiple applications of Large Language Models (LLMs) in analyzing the perfor-
mance of lithium metal batteries. Central to our investigation is the concept of Coulombic efficiency (CE),
a critical measure in battery technology that assesses the efficiency of electron transfer during a battery’s
charge and discharge cycles. Improving the CE is essential for advancing the adoption of high-energy-density
lithium metal batteries, which are crucial for next-generation energy storage solutions. A key focus of our
analysis is the role of liquid electrolyte engineering, which is instrumental in determining the battery’s
cyclability and improving the CE.
We explored two methods of integrating LLMs as supplemental tools in battery research. First, we utilize LLMs as information extractors to distill structured knowledge essential for electrolyte design from a vast body of papers. Second, we introduce an LLM-powered chatbot specifically tailored to respond to
inquiries related to lithium metal batteries. We believe these applications of LLMs may potentially enhance
our capabilities in battery innovation, streamlining research processes and increasing the accessibility of
specialized knowledge. Our code is available at: https://github.com/KKbeckang/LiGPT-Beta.
Preliminary Results The LLM extracted 334 CE-electrolyte pairs from 71 papers, while the original paper found 152 pairs. Since it is difficult to verify our results without the help of human experts, we filtered the extracted pairs by their Coulombic efficiency values and found 46 matches to the original dataset. We
classified these verifiable data into three categories: correct (the extracted data exactly matches the human-
labeled data); incorrect (the types or amounts of the solvent/salt do not match the labels); and unknown
(the extracted data provides more details than the human-labeled data). For example, the label only shows
the types of the salts, but the extracted data contains not only the correct types but also the mix ratio of
different salts. The results are shown in Table 6:
Table 6: Results
Future Work One major challenge is the low recall of the extracted information, as only 46 out of 152
labeled pieces of information were retrieved. Upon investigating the papers, we found that many of the Coulombic efficiency values were recorded in tables and figures, which were dropped during PDF parsing. It is
necessary to introduce multimodal LLMs to further investigate those papers.
General Q/A Building In this section, we detail our comprehensive Q/A pipeline designed for the
exploration of lithium metal battery-related research. We began by sourcing and downloading 71 research
papers pertinent to lithium metal batteries. The information extracted from these papers was encoded into
vectors and stored in Chroma databases. To best reflect the specialized language of the field, we created two
distinct databases: one utilizing MaterialsBERT for materials science content, and another using ChemBERT
for chemical context.
During the retrieval phase, we employ LOTR (MergerRetriever) to enhance the precision of document
retrieval. Upon receiving a user query, the system retrieves relevant document segments from each database.
It then removes any redundant results from the merged outputs and selects the top 10 most pertinent
document chunks. Finally, both the selected context and the user query are processed by GPT-4 Turbo to
generate an informed and accurate response. This pipeline exemplifies a robust integration of state-of-the-art
technologies to facilitate advanced research interrogation and knowledge discovery in the domain of lithium
metal batteries.
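A condensed sketch of the retrieval step, assuming LangChain's MergerRetriever and two pre-built Chroma stores (materials_db, chem_db, both placeholder names), is:

from langchain.retrievers import MergerRetriever

lotr = MergerRetriever(retrievers=[
    materials_db.as_retriever(search_kwargs={"k": 10}),  # MaterialsBERT embeddings
    chem_db.as_retriever(search_kwargs={"k": 10}),       # ChemBERT embeddings
])
docs = lotr.get_relevant_documents(user_query)  # merged, then deduplicated downstream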
Knowledge Graph-based Q/A Building We built a knowledge graph with the extracted information
using Neo4j. The knowledge graph consists of four node types (electrolyte, solvent, salt, and reference) connected by three relationship types:
• (electrolyte)-[:VOLUME]->(solvent)
• (electrolyte)-[:CONCENTRATION]->(salt)
• (electrolyte)-[:CITED]->(reference)
The built knowledge graph can be accessed and visualized using the Neo4j web application.
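A hypothetical ingestion snippet for one such relationship (names and values are placeholders) is:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(
        "MERGE (e:Electrolyte {name: $e}) "
        "MERGE (s:Solvent {name: $s}) "
        "MERGE (e)-[:VOLUME {value: $v}]->(s)",
        e="LiFSI in DME", s="DME", v="1.0",
    )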
References
[1] Kim, S.C., et al. ”Data-driven electrolyte design for lithium metal anodes.” Proceedings of the National
Academy of Sciences 120.10 (2023): e2214357120. https://www.pnas.org/doi/full/10.1073/pnas.
2214357120.
27 KnowMat: Transforming Unstructured Material Science Literature into Structured Knowledge
Authors: Hasan M. Sayeed, Ramsey Issa, Trupti Mohanty, Taylor Sparks
27.1 Introduction
The rapid growth of materials science has led to an explosion of scientific literature, often presented in
unstructured formats that pose significant challenges for systematic data extraction and analysis. To address
this, we developed KnowMat, a novel tool designed to transform complex, unstructured material science
literature into structured knowledge. Leveraging advanced Large Language Models (LLMs) such as GPT-3.5,
GPT-4, and Llama 3, KnowMat automates the extraction of critical information from scientific texts, converting
them into structured JSON formats. This tool not only enhances the efficiency of data processing but also
facilitates deeper insights and innovation in material science research. KnowMat’s user-friendly interface
allows researchers to input material science papers, which are then parsed and analyzed to extract key
insights. The tool’s versatility in handling multiple input files and its capacity for customization through subfield-specific prompts make it an indispensable resource for researchers aiming to streamline their workflow.
Additionally, KnowMat’s ability to integrate with other tools and platforms, along with its support for
various LLMs, ensures that it remains adaptable to the evolving needs of the research community.
27.2 Method
Data Parsing and Extraction The KnowMat workflow begins with parsing unstructured text from
material science literature using a tool called Unstructured [1]. This tool reads the input file, separates
out the sections, and stores everything in a machine-readable format. This initial step involves identifying
relevant sections and extracting pertinent information related to material compositions, properties, and
experimental conditions.
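A minimal sketch of this parsing step with the unstructured library [1] (file name illustrative) is:

from unstructured.partition.auto import partition

elements = partition(filename="paper.pdf")             # detect and split sections
reference_text = "\n\n".join(el.text for el in elements)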
Customizable Prompts KnowMat provides field-specific prompts for several fields, and it offers the
flexibility for users to customize these prompts further or create their own prompts for new fields. This
feature ensures that the extracted data is both relevant and comprehensive, tailored to the specific needs of
the researcher. The interface allows users to define the scope and focus of the extraction process effectively.
Integration and Interoperability To enhance usability and interoperability, KnowMat supports seam-
less integration with other tools and platforms. Extracted results can be easily exported in CSV format,
enabling straightforward data sharing and further analysis. The tool’s flexibility extends to its compatibility
with various LLMs, including both subscription-based models like GPT-3.5 and GPT-4 [2], and open-source
models like Llama 3 [3]. This ensures that researchers can select the most suitable LLM for their specific
requirements.
Figure 34: KnowMat Workflow. The graphical abstract illustrates the KnowMat workflow, which begins
with parsing input files in XML or PDF formats. The Large Language Model (LLM) powered by engineered
prompts processes the reference text to extract key information, such as material composition, properties,
and measurement conditions, and converts it into a structured JSON output.
In conclusion, KnowMat represents a significant advancement in the field of knowledge extraction from
scientific literature. By converting unstructured material science texts into structured formats, it provides
researchers with powerful tools to unlock insights and drive innovation in their fields.
References
[1] https://unstructured.io/.
[2] https://platform.openai.com/docs/models
[3] https://llama.meta.com/llama3/
28 Ontosynthesis
Authors: Qianxiang Ai, Jiaru Bai, Kevin Shen, Jennifer D’Souza, Elliot Risch
28.1 Introduction
Organic synthesis is often described in unstructured text without a standard taxonomy, which makes syn-
thesis data less searchable and less compatible with downstream data-driven tasks (e.g., retrosynthesis,
condition recommendation) compared to structured records. The specificities and writing styles of these
descriptions also vary, ranging from simple sentences about chemical transformations to long paragraphs
that include procedure details. This leads to ambiguities, unidentified missing information, challenges in
comparing different syntheses, and can impede proper experimental reproduction.
In last year’s hackathon, we fine-tuned an open-source LLM to extract data structured in the Open
Reaction Database schema from synthesis procedures [1]. While this method has proved to be successful
for patent text, it relies on existing unstructured-structured data pairs and does not generalize well to non-
patent writing styles. The dependency of fine-tuning on existing data makes it less useful, especially when
considering new writing styles, or newly developed data structures or ontologies are preferred.
In this project, we explore the potential of LLMs in structured data extraction without fine-tuning.
Specifically, given an ontology (formally defined concepts and relationships) for organic synthesis, we aim to
extract structured information as Resource Description Framework (RDF) graphs from unstructured text
using LLMs with zero-shot prompting. RDF is a World Wide Web Consortium (W3C) standard that serves
as a foundation for the Semantic Web and expressing meaning and relationships between data. While LLMs
can create ontologies on the fly for a given piece of text, “grounding” to a pre-specified ontology allows
standardizing the extracted data and reasoning with existing axioms. The extraction workflow is wrapped in
a web application which also allows visualization of the extracted results. We showcased the capability of our
application with case studies where RDFs were extracted from reaction descriptions of different complexities
and specificities.
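Because the output is standard RDF, it can be validated and inspected with ordinary tooling; for example, a minimal check with rdflib (assuming the LLM returns Turtle syntax in a string llm_output) is:

from rdflib import Graph

g = Graph()
g.parse(data=llm_output, format="turtle")  # parse the LLM-extracted RDF
for s, p, o in g:
    print(s, p, o)  # enumerate extracted triples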
28.3 Application
We collected a set of eight reaction descriptions taken from patents, journal articles (main text or supporting
information), and lab notebooks. Each of them is assigned a complexity rating and a specificity rating
using three-point scales. Based on these test cases, we found our workflow was able to produce valid RDF
graphs representing the chemical reactions, even for multi-step reactions including many elements. Expert
inspections indicate the resulting RDF graphs better represent the unstructured text when OntoReaction is
used as the target ontology compared to the larger Synthesis Operation Ontology (the latter contains more
classes and object properties).
Since the extracted data is in RDF format, they can be readily visualized using interactive graph libraries.
Using dash-cytoscape [5], we created an interface application to the extraction workflow. The interface
allows submitting unstructured text as input to the extraction workflow with a user-provided OpenAI API
key, retrieving and interactively visualizing the extracted knowledge graph, as well as displaying the extracted
RDF text. A file-based caching routine is used to store previous extraction results. All code and test cases
are available in the project GitHub repository [2].
Figure 35
28.4 Acknowledgements
Q.A. acknowledges support by the National Institutes of Health under award number U18TR004149. The
content is solely the responsibility of the authors and does not necessarily represent the official views of the
National Institutes of Health. J. D. acknowledges the SCINEXT project (BMBF, German Federal Ministry
of Education and Research, Grant ID: 01lS22070).
References
[1] Ai, Q.; Meng, F.; Shi, J.; Pelkie, B.; Coley, C. W. Extracting Structured Data from Organic Synthesis
Procedures Using a Fine-Tuned Large Language Model. ChemRxiv April 8, 2024. https://doi.org/
10.26434/chemrxiv-2024-979fz.
[2] Ontosynthesis, 2024. https://github.com/qai222/ontosynthesis (accessed 2024-07-07).
[3] Bai, J.; Mosbach, S.; Taylor, C. J.; Karan, D.; Lee, K. F.; Rihm, S. D.; Akroyd, J.; Lapkin, A. A.; Kraft,
M. A Dynamic Knowledge Graph Approach to Distributed Self-Driving Laboratories. Nat. Commun.
2024, 15 (1), 462. https://doi.org/10.1038/s41467-023-44599-9.
[4] Ai, Q.; Klein, C. Synthesis Operation Ontology. GitHub. https://github.com/qai222/
ontosynthesis/blob/main/ontologies/soo/soo.md (accessed 2024-07-07).
[5] Dash-Cytoscape: A Component Library for Dash Aimed at Facilitating Network Visualization in
Python, Wrapped around Cytoscape.Js. https://dash.plotly.com/cytoscape (accessed 2024-07-06).
29 Knowledge Graph RAG for Polymer Simulation
Authors: Jiale Shi, Weijie Zhang, Dandan Tang, Chi Zhang
Figure 36: Creating Knowledge Graph Retrieval-Augmented Generation (KGRAG) for Polymer Simulation.
Molecular modeling and simulations have become essential tools in polymer science and engineering,
offering predictive insights into macromolecular structure, dynamics, and material properties. However, the
complexity of polymer simulations poses challenges in model development/selection, computational method
choices, advanced sampling techniques, and data analysis. While literature [1] provides guidelines to en-
sure the validity and reproducibility of these simulations, these resources are often presented in massive,
unstructured text formats, making it difficult for new learners to systematically and comprehensively un-
derstand the correct approaches for model selection, simulation execution, and post-processing analysis.
Therefore, this study presents a novel approach to address these challenges by implementing a Knowledge
Graph Retrieval-Augmented Generation (KGRAG) system for building an AI chatbot focused on polymer
simulation guidance and education.
Our team utilized the large language model GPT-3.5 Turbo [2] and Microsoft’s GraphRAG [3, 4] frame-
work to extract polymer simulation-related entities and relationships from unstructured documents, con-
structing a comprehensive KGRAG, as shown in Figure 36, where the nodes are colored by their degrees.
Those nodes with high degrees include “Polymer Simulation”, “Atomistic Model,” “CG Model,” “Force
Field,” and “Polymer Informatics,” which are all the keywords about polymer simulation and modeling,
illustrating the effective performance of entity extraction. We then used KGRAG’s query engine to answer questions. For comparative analysis, we also implemented a baseline Retrieval-Augmented Generation (RAG) system using LlamaIndex [5]. When human experts compared the responses from the baseline RAG and KGRAG, they found that KGRAG demonstrates substantial improvements in question-answering
performance when analyzing complex and high-level information about polymer simulation. This improve-
ment is attributed to KGRAG’s ability to capture semantic concepts and uncover hidden connections within
the data, providing more accurate, logical, and insightful responses compared to traditional vector-based
RAG methods.
This study contributes to the growing field of data-driven approaches in polymer science by offering a
powerful tool for knowledge extraction and synthesis. Our KGRAG system shows promise in enhancing the
understanding of massive unstructured polymer simulation guidance in the relevant literature, potentially
improving the validity and reproducibility of these polymer simulations, and accelerating the development
of new polymer simulation methods. We found that the quality of prompts is crucial for effective entity
extraction and knowledge graph construction. Therefore, our future work will focus on optimizing prompts
for entity extraction, relationship building, and knowledge graph construction to further improve the system’s
performance and applicability in polymer simulation research.
29.2 Authors
• Jiale Shi, Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge,
Massachusetts 02139, United States ([email protected])
• Weijie Zhang, Department of Chemistry, University of Virginia, Charlottesville, Virginia 22904, United
States ([email protected])
References
[1] Gartner, T. E., III; Jayaraman, A. Modeling and Simulations of Polymers: A Roadmap. Macromolecules
2019, 52 (3), 755-786. DOI: 10.1021/acs.macromol.8b01836.
[2] GPT-3.5 Turbo. 2024. https://platform.openai.com/docs/models/gpt-3-5-turbo (accessed
2024/07/10).
30 Synthetic Data Generation and Insightful Machine Learning
for High Entropy Alloy Hydrides
Authors: Tapashree Pradhan, Devi Dutta Biswajeet
The generation of synthetic data in materials science and chemistry is traditionally performed by machine
learning interatomic potentials (MLIPs) that approximate the first-principles functional form used to compute wave functions of known electronic configurations [1]. Synthetic data is a component of the active-learning
feedback loop that is utilized to retrain these potentials. The purpose of this active-learning component is
to incorporate a wider range of physical conditions into the potential’s application domain. However, the
initial cost of generating the data needed to train MLIPs is a major setback for complex chemistries such as high entropy alloys (HEAs).
The potential application of high entropy alloys in hydrogen storage [2] demands acceleration in the
computation of surface and mechanical properties of the alloys by better approximation of the potential
landscape. The use of Large Language Models (LLMs) in the generation of synthetic data to tackle this
bottleneck problem poses an alternative to the traditional MLIPs. LLMs like ChatGPT, Llama3, Claude,
and Gemini [3] learn from text-embeddings of the training data, capturing inherent trends between semantic
or numerical relationships of the text and making them suitable for learning certain complex relationships
in materials physics that might be present in the language itself.
The current work aims to build LLM applications working in conjunction with an external database of
high entropy alloy hydrides via Retrieval-Augmented Generation (RAG) [4] to populate synthetic data for
predictive modeling later. The inbuilt RAG feature of GPT-4 enables us to write a few prompts to make
a synthetic data generator utilizing a custom dataset of HEA hydrides. The work also utilizes OpenAI’s
API [5] and the text-embedding-3-large model [6] to configure custom generators that can be fine-tuned via
prompts for synthetic data generation.
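A minimal sketch of the embedding step for the retrieval component, using the OpenAI API [5] and the text-embedding-3-large model [6] (function name is a placeholder), is:

from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    """Embed dataset rows for retrieval-augmented generation."""
    response = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return [item.embedding for item in response.data]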
The development of the entire product is aimed at a web-based application that allows users to upload
their datasets and instruct the GPT model to generate more entries that can serve as training data for
predictive ML models like hydrogen capacity predictors. The term “Insightful Machine Learning” refers to
a sequential pipeline starting with (a) a reference database that serves as the retrieval component of an
LLM, (b) the generation of synthetic data and features, and (c) obtaining insights on the physics of the problem from a chatbot with multiple retrieval components, including the predictive model. Figure 37 shows the flowchart of the pipeline, which is currently at the prototype stage. The code to
generate synthetic data is available for use and modification.
References
[1] Focassio, B., Freitas, M., Schleder, G.R. (2024). Performance assessment of universal machine learning
interatomic potentials: Challenges and directions for materials’ surfaces. ACS Applied Materials &
Interfaces.
[2] Marques, F., Balcerzak, M., Winkelmann, F., Zepon, G., Felderhoff, M. (2021). Review and outlook on
high-entropy alloys for hydrogen storage. Energy & Environmental Science, 14(10), 5191-5227.
[3] Jain, S.M. (2022). Hugging face: Introduction to transformers for NLP with the hugging face library and
models to solve problems. Berkeley, CA: Apress.
Figure 37: Insightful machine learning for HEA hydrides
[4] Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances
in Neural Information Processing Systems, 33, 9459-9474.
[5] OpenAI. (2020). OpenAI GPT-3: Language models are few-shot learners. Retrieved from https://
openai.com/blog/openai-api.
[6] OpenAI. (2023). New embedding models and API updates. Retrieved from https://openai.com/
index/new-embedding-models-and-api-updates.
31 ChemSense: Are large language models aligned with human chemical preference?
Authors: Martiño Rı́os-Garcı́a, Nawaf Alampara, Mara Schilling-Wilhelmi, Abdelrahman
Ibrahim, Kevin Maik Jablonka
31.1 Introduction
Generative AI models are revolutionizing molecular design by learning from vast chemical datasets to create
novel compounds [1]. The challenge in molecular design goes beyond just creating new structures. A key issue
is ensuring the designed molecules have specific useful properties, e.g., high solubility, high synthesizability,
etc. LLMs that can be conditioned on the desired property seem to be a promising solution [4]. However,
current frontier models often lack an innate chemical understanding, which can lead to impractical or even
dangerous molecular designs [5].
A factor that distinguishes many successful chemists is their chemical intuition. This intuition describes, for instance, a preference for certain compounds that is grounded not in knowledge that can easily be conveyed but rather in tacit knowledge accumulated over years of experience. If models could possess
this chemical intuition, they would be more useful for real-world scientific applications.
In this work, we introduce ChemSense, in which we explore how well frontier models are aligned with human experts in chemistry. By aligning AI with human knowledge and preferences, one can aim to create tools that propose feasible, safe, and desirable molecular structures, bridging the gap between theoretical capabilities and practical scientific needs. Moreover, ChemSense would help us understand how the alignment of frontier models emerges with dataset scale and size.
Figure 38: Comparison of the alignment of the different LLMs with the SMILES (left) and IUPAC (right)
molecular representations. For both representations, note that the random baseline is at a fixed value of 0.5
since all the questions were binary and the datasets are balanced.
To choose the 900 samples, only questions in which both molecules could be converted into IUPAC names (using the chemistry name resolver) were selected. In this way, we ensure some homogeneity in the test dataset.
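The filtering step can be sketched as follows; we assume the NCI/CADD Chemical Identifier Resolver as the name resolver (the exact service used is not specified here), queried over HTTP with the requests package.

```python
# Hedged sketch: keep a question only if both molecules resolve to IUPAC names.
# The resolver endpoint is an assumption; the study's exact tool may differ.
from urllib.parse import quote
import requests

def smiles_to_iupac(smiles: str):
    url = f"https://cactus.nci.nih.gov/chemical/structure/{quote(smiles)}/iupac_name"
    resp = requests.get(url, timeout=10)
    return resp.text if resp.ok else None

def both_resolvable(smiles_a: str, smiles_b: str) -> bool:
    return smiles_to_iupac(smiles_a) is not None and smiles_to_iupac(smiles_b) is not None
```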
LLM evaluation To study the chemical preference of current LLMs, some of the models performing best on the ChemBench benchmark [5] were prompted with a simple instruction prompt inspired by the question used to collect the original data from the Novartis scientists. Additionally, to study how the molecular representation could affect the model, each of the 900 questions was given to the model using different molecular representations.
To ensure correctly formatted answers, the OpenAI models as well as Claude-3 were constrained to answer only “A” or “B” using the Instructor package (see the sketch below). The Llama models and Mixtral 8x7B, in contrast, were used in an unconstrained setup through the Groq API service and were instead further prompted to answer only “A” or “B”.
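A minimal sketch of the constrained setup, assuming the Instructor package with an OpenAI model; the prompt text below is illustrative, not the exact Novartis-inspired instruction used in the study.

```python
# Hedged sketch: force the model to answer only "A" or "B" via a Pydantic
# schema that Instructor validates (and retries) against.
from typing import Literal

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Preference(BaseModel):
    answer: Literal["A", "B"]  # only these two values pass validation

client = instructor.from_openai(OpenAI())

def ask_preference(mol_a: str, mol_b: str) -> str:
    result = client.chat.completions.create(
        model="gpt-4o",
        response_model=Preference,
        messages=[{
            "role": "user",
            # illustrative wording, not the study's actual prompt
            "content": f"Which molecule would an expert chemist prefer?\n"
                       f"A: {mol_a}\nB: {mol_b}\nAnswer A or B.",
        }],
    )
    return result.answer
```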
31.4 Results
Comparison of chemical preference of LLMs We compare the accuracy of the preference predictions of the different models and representations (Figure 38). The accuracy of all models ranges from 49% to 61%, where 50% is the accuracy one would obtain with a random choice. The GPT-4o model achieves the highest accuracy of all models and performs best with the SMILES and IUPAC representations. This might be explained by the widespread use of both representations and, therefore, their high occurrence in the model's training data. We observe the same trend in the GPT-4 and GPT-3.5 Turbo predictions. For the other LLMs, the representation seems to have a smaller impact on accuracy, with values that are presumably indistinguishable from random.
References
[1] Bilodeau, Camille et al. (2022). “Generative models for molecular discovery: Recent advances and
challenges”. In: Wiley Interdisciplinary Reviews: Computational Molecular Science 12.5, e1608.
[2] Chennakesavalu, Shriram et al. (2024). Energy Rank Alignment: Using Preference Optimization to
Search Chemical Space at Scale. DOI: 10.48550/ARXIV.2405.12961. URL: https://arxiv.org/abs/
2405.12961.
[3] Choung, Oh-Hyeon et al. (Oct. 2023). “Extracting medicinal chemistry intuition via preference machine learning”. In: Nature Communications 14.1. ISSN: 2041-1723. DOI: 10.1038/s41467-023-42242-1. URL: http://dx.doi.org/10.1038/s41467-023-42242-1.
[4] Jablonka, Kevin Maik et al. (Feb. 2024). “Leveraging large language models for predictive chemistry”.
In: Nature Machine Intelligence 6.2, pp. 161–169. ISSN: 2522-5839. DOI: 10.1038/s42256-023-00788-1.
URL: http://dx.doi.org/10.1038/s42256-023-00788-1.
[5] Mirza, Adrian et al. (2024). “Are large language models superhuman chemists?” In: arXiv preprint.
DOI: 10.48550/arXiv.2404.01475. arXiv: 2404.01475 [cs.LG].
[6] Yang, Kaiqi et al. (2024). Are Large Language Models (LLMs) Good Social Predictors? DOI:
10.48550/ARXIV.2402.12620. URL: https://arxiv.org/abs/2402.12620.
32 GlossaGen
Authors: Magdalena Lederbauer, Dieter Plessers, Philippe Schwaller
Academic articles, particularly reviews, and grant proposals would greatly benefit from a glossary explain-
ing complex jargon and terminology. However, the manual creation of such glossaries is a time-consuming and
repetitive task. To address this challenge, we developed GlossaGen, an innovative framework that leverages
large language models to automatically generate glossaries from PDF or TeX files, streamlining the process
for academics. The generated glossary is not only a list of terms and definitions but also visualized as a
knowledge graph, illustrating the intricate relationships between various technical concepts (see Figure 39).
Figure 39: Overview of (left) the graphical user interface (GUI) prototype and (right) the generated Neo4J knowledge graph.

Our results demonstrate that LLMs can greatly accelerate glossary creation, increasing the likelihood that authors will include a helpful glossary without the need for tedious manual effort. Additionally, an analysis of our test case by a zeolite domain expert revealed that LLMs produce good results, with about 70-80% of explanations requiring little to no manual changes.
The project’s codebase was developed as a Python package on GitHub using a template [1] and DSPy [2]
as an LLM framework. This modular approach facilitates seamless collaboration and easy incorporation of
new features.
To overcome the limitations of LLMs in directly processing PDFs, a prevalent format for scientific pub-
lications, we implemented a pre-processing step that converts papers into manageable text sequences. This
step involves extracting textual information using PyMuPDF [3], automatically obtaining the title and DOI,
and chunking the text into smaller sections. This chunking preserves context and makes it easier for the
models to handle the input.
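A minimal sketch of this pre-processing step, assuming PyMuPDF; the file name, chunk size, and overlap are illustrative values, and DOI extraction is omitted.

```python
# Hedged sketch: extract text with PyMuPDF and split it into overlapping
# chunks; chunk_chars and overlap are illustrative, not GlossaGen's settings.
import fitz  # PyMuPDF

def extract_chunks(pdf_path: str, chunk_chars: int = 3000, overlap: int = 300):
    doc = fitz.open(pdf_path)
    title = doc.metadata.get("title", "")  # title from the PDF metadata, if any
    text = "\n".join(page.get_text() for page in doc)
    chunks = [text[i:i + chunk_chars]
              for i in range(0, len(text), chunk_chars - overlap)]
    return title, chunks
```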
We used GPT-3.5-Turbo [4] and GPT-4-Turbo [5] to extract scientific terms and their definitions from
the text chunks. Employing Typed Predictors [6] and Chain-of-Thought prompting [7] ensures the outputs
are well-structured and contextually accurate, guiding the model to produce precise definitions through a
simulated reasoning process. Post-processing involved identifying and removing duplicate entries, ensuring
each term appears only once in the final glossary. Figure 40 shows details of the GlossaryGenerator class that was used to process documents into the corresponding glossaries.

Figure 40: Overview of the GlossaryGenerator class, responsible for processing text chunks and extracting relevant, correctly formatted terms and definitions.

We selected a review article on zeolites [8] (shown in Figure 39) as a test publication to manually tune and evaluate the pipeline's output.
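The extraction step might look roughly as follows in DSPy; the signature, field names, and line-based output format are our own assumptions and simplify what the actual GlossaryGenerator class (Figure 40) does.

```python
# Hedged sketch: a DSPy Chain-of-Thought module that extracts term/definition
# pairs from a chunk; assumes an LM has been configured via dspy.settings.
import dspy

class ExtractGlossary(dspy.Signature):
    """Extract scientific terms and concise definitions from a text chunk."""
    chunk = dspy.InputField(desc="a chunk of text from a scientific article")
    glossary = dspy.OutputField(desc="one 'term: definition' pair per line")

extract = dspy.ChainOfThought(ExtractGlossary)  # adds a reasoning step

def build_glossary(chunks):
    glossary = {}
    for chunk in chunks:
        for line in extract(chunk=chunk).glossary.splitlines():
            if ":" in line:
                term, definition = line.split(":", 1)
                glossary.setdefault(term.strip(), definition.strip())  # de-duplicate
    return glossary
```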
The obtained glossary is transformed into an ontology that defines nodes and relationships for the knowledge graph. For instance, relationships like ’MATERIAL – exhibits → PROPERTY’ illustrate how different terms are interconnected. The knowledge graph is constructed from the processed text chunks using the Neo4J library [9] and Graph Maker [10]. We developed a user-friendly front-end interface with Gradio [11],
as shown in Figure 39. This interface allows users to interact with the glossary, making it easier to navigate
and customize the information.
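Loading the ontology into Neo4J might look roughly like the sketch below, using the official neo4j Python driver; the URI, credentials, and the flat (Term)-[:RELATES_TO]->(Term) schema are placeholders rather than GlossaGen's actual ontology.

```python
# Hedged sketch: MERGE glossary terms and a typed relationship into Neo4j.
# Connection details and schema are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_relationship(tx, source: str, relation: str, target: str):
    tx.run(
        "MERGE (a:Term {name: $source}) "
        "MERGE (b:Term {name: $target}) "
        "MERGE (a)-[:RELATES_TO {type: $relation}]->(b)",
        source=source, target=target, relation=relation,
    )

with driver.session() as session:
    session.execute_write(add_relationship, "MATERIAL", "exhibits", "PROPERTY")
```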
The quick prototyping provided us with several ideas for future work. We can improve the glossary
output by fine-tuning the LLM, incorporating retrieval-augmented generation, and parsing article images.
Additionally, the user experience can be enhanced by allowing users to input specific terms for glossary
explanations as a backup when the LLM omits certain terms. Integration with LaTeX would broaden
usability, envisioning commands like \glossary analogous to \bibliography. We also consider connecting the knowledge graph directly to the user interface and enhancing its ontology creation feature. Overall, this
rapidly developed prototype, with numerous future possibilities, demonstrates the potential of LLMs to assist
researchers in their scientific outreach.
References
[1] Copier template: Available at https://github.com/copier-org/copier.
[2] DSPy: Khattab, O.; Singhvi, A.; Maheshwari, P.; Zhang, Z.; Santhanam, K.; Vardhamanan, S.; Haq, S.;
Sharma, A.; Joshi, T. T.; Moazam, H.; Miller, H.; Zaharia, M.; Potts, C. DSPy: Compiling Declarative
Language Model Calls into Self-Improving Pipelines. preprint arXiv:2310.03714. 2023.
[3] PyMuPDF: Available at https://github.com/pymupdf/PyMuPDF.
[4] GPT-3.5-Turbo: OpenAI. Available at
https://platform.openai.com/docs/models/gpt-3-5-turbo.
[5] GPT-4-Turbo: OpenAI. Available at
https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4.
[8] Test review article on zeolites: Rhoda, H. M.; Heyer, A. J.; Snyder, B. E. R.; Plessers, D.; Bols, M.
L.; Schoonheydt, R. A.; Sels, B. F.; Solomon, E. I. Second-Sphere Lattice Effects in Copper and Iron
Zeolite Catalysis. Chem. Rev. 2022, 122, 12207–12243.
[9] Neo4J: Documentation at https://neo4j.com/.