1 s2.0 S2352711020300492 Main
SoftwareX
journal homepage: www.elsevier.com/locate/softx
Article info
Article history:
Received 21 February 2020
Received in revised form 2 April 2020
Accepted 9 April 2020
Keywords:
Battery
Cycling experiments
Python
Data management
Machine-learning

Abstract
Battery evaluation and early prediction software package (BEEP) provides an open-source Python-based framework for the management and processing of high-throughput battery cycling data-streams. BEEP's features include file-system based organization of raw cycling data and metadata received from cell testing equipment, validation protocols that ensure the integrity of such data, parsing and structuring of data into Python-objects ready for analytics, featurization of structured cycling data to serve as input for machine-learning, and end-to-end examples that use processed data for anomaly detection and featurized data to train early-prediction models for cycle life. BEEP is developed in response to the software and expertise gap between cell-level battery testing and data-driven battery development.
© 2020 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Code metadata
1. Motivation and significance

Energy storage in Li-ion batteries revolutionized the portable electronics industry and is now defining the future of vehicle electrification. The growing consumer adoption of electric vehicles (EVs) and the potential for positive environmental benefits have spurred academic and industrial interest in improving the capacity, energy and power density, durability and safety of Li-ion cells, as well as lowering the manufacturing costs [1]. Transitioning to a data-driven research paradigm shows great potential to accelerate battery development (a traditionally slow and tedious process) in areas including the optimization of the chemistry of electrodes, electrolytes, additives [2–5] or formation [6], and designing state-of-health (SoH) [7–10], state-of-charge (SoC) [10–12], early prediction models [9,10,13] or advanced battery management systems (BMSs) [14–16]. An increase in the volume of standardized cycling data can open the door to improvement of existing approaches to data-driven prognostics [9,13,15,17] or to design of more complex algorithms capable of delivering accurate health predictions.

For the broader adoption of data-driven battery development, reusable high-throughput battery testing data [13,18] and software tools for processing and analysis of such data are essential. While the hardware for automated battery cycling is accessible, the field still lacks open software for both acquisition and management of cycling data and preparing the data for analytics.

∗ Correspondence to: 4440 El Camino Real, Los Altos, CA 94022, USA.
E-mail address: [email protected] (M. Aykol).
https://doi.org/10.1016/j.softx.2020.100506
2352-7110/© 2020 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
2 P. Herring, C. Balaji Gopal, M. Aykol et al. / SoftwareX 11 (2020) 100506
Community-driven software development can be effective in filling this gap [19] and can yield reliable and reusable tools (as experienced in computational materials science [20,21]). Such software libraries can form the basis for advanced development capabilities for the expert and lower the barrier for the novice to set foot in data-centric battery research.

The repetitive nature of battery experiments defines the requirements for such a tool to be useful to battery researchers. Experiments consist of repeated application of "cycling protocols" (which prescribe how the battery should be charged and discharged) to a user-supplied battery cell by the hardware. Such cycling experiments can take from a few hours to several months to complete. During these experiments, many types of information, such as time, capacity, voltage, cycle number and temperature are recorded with high sampling frequency, and the size of raw data can grow rapidly. Besides, the naming conventions for different projects, data, metadata and protocol files can vary among vendors or members of a research group. Hence, a scalable data management and processing system is required. In addition, the data structures can be complicated due to the cyclic nature of the experiments. For example, raw data may need to be grouped over one axis and interpolated over another. Data formats and storage technologies (e.g. databases, file systems) also vary among different cycler hardware. Researchers could benefit from standardized data formats, alongside programmatic interfaces to such databases as needed. Data streams should be validated against human errors, equipment errors or failures, and environmental circumstances to ensure their integrity. Organized, processed, and validated data are the key ingredient for a data-driven research pipeline, and can be used in unsupervised modeling (e.g. for anomaly detection) or followed by a "featurization" step that computes engineered features from the data [13,22]. Featurized cycling data can be used in training predictive models (e.g. for failure prediction).

To the best of our knowledge, there is no open software that satisfies the requirements or features summarized above, which are expected to be useful for enabling wider adoption of data-driven approaches in battery research. The battery experimentation and early prediction Python library, BEEP, aims to fill this gap. Since it is built on common Python libraries such as NumPy, SciPy, scikit-learn and pandas, and adopts common data interchange formats like JSON, we expect BEEP to make this transition to data-driven research easier for individual researchers and provide useful building blocks for battery research platforms developed by research groups [23].

2. Software description

BEEP consists of six main modules: collate, validate, structure, featurize, generate_protocol and run_model. While the modules can be used independently, this order of execution is typical for the data management and processing steps delivered by the BEEP framework, and therefore the output of the main methods in each module is the input for the next. The default functionality of each module can also be directly accessed from the command-line, where input arguments are provided as JSON-strings. Almost all BEEP classes are serializable and can be stored as such objects. Here we explain the main functionalities delivered with each module in BEEP.

2.1. Collate

The collate module is used for standardization of raw cycler files and metadata as well as organization of the standardized files. BEEP follows a name-based convention for file storage and all paths are defined in reference to the BEEP_EP_ROOT environment variable. If collate is called from the command line, the module locates the raw data files, parses the metadata, and collates files according to a combination of protocol, channel number, and date, organizing them in '/data-share/collated_cycler_files'. This functionality is handled mainly by the function process_files_json which can also be called directly. The output is a JSON string that contains ids, paths and names for raw cycler files, paths for the collated cycler files, cycling protocols corresponding to each file, channel number and the date the original file was generated.

2.2. Validate

As in any experimental process, erroneous or corrupted data can be produced as a consequence of instrument failures (e.g. power outage), changes in environmental conditions (e.g. temperature), software glitches or human errors (e.g. misconfigured protocols) in any stage of battery cycling experiments. If unnoticed, such data may contaminate the analytical process or misguide the research. To address this issue, BEEP provides a validate module, where the ValidatorBeep class validates collated cycling data against researcher-defined schemas prescribed in yaml files (examples can be found in VALIDATION_SCHEMA_DIR) or as dictionary-based rule definitions of the external Python library cerberus [24]. The validation schemas can include data-types, min/max values, ranges, non-allowed values or complex rules via cerberus, which adopts a convenient, dictionary based schema definition. A fast, lightweight version is provided as SimpleValidatorBeep which does dataframe-based validation (restricted to type, min/max and non-allowed) and supports the cerberus syntax for interchangeable use. Validation stores the list of files being validated and the results in JSON format, at DEFAULT_VALIDATION_RECORDS.

2.3. Structure

Battery cycling tests accumulate information in a tabular form, containing thousands to millions of rows, and produce large data files that need to be structured for analytics. The structure module contains two classes that serve this purpose: RawCyclerRun and ProcessedCyclerRun. The first class supports parsing and indexing of raw data into appropriate integer (e.g. step, cycle index) and float (e.g. time, current, voltage, charge capacity, temperature) columns in a dataframe, and provides methods to identify diagnostic cycles, and deliver summary statistics and metadata. This class can interpolate target variables over other variables and return interpolated data-containers of the same structure (e.g. interpolating variables on a consistent voltage scale), which is useful for machine-learning models. ProcessedCyclerRun provides project-specific structuring of raw data from RawCyclerRun, for which example schemas are provided in the conversion_schemas folder for various types of hardware. Input data needed for structuring exist in datafiles of almost any cycling hardware, often recorded with different naming conventions. This library of conversion schemas can be expanded to other formats and provide a centralized resource for the community to be used with other battery cyclers and instruments. The ProcessedCyclerRun class produces a rich, serializable object, which flexibly allows addition of fields and data as needed.
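
As an illustration of interpolating variables onto a consistent voltage scale, the following minimal sketch resamples the capacity of two cycles onto a shared voltage grid with NumPy. It is not BEEP's implementation: the function names interpolate_cycle and interpolate_run, the dictionary layout of a cycle, and the sample values are assumptions made for this example only.

```python
import numpy as np


def interpolate_cycle(voltage, capacity, v_grid):
    """Linearly interpolate capacity onto a shared voltage grid (hypothetical helper)."""
    voltage = np.asarray(voltage, dtype=float)
    capacity = np.asarray(capacity, dtype=float)
    # np.interp expects increasing x values, so sort the samples by voltage first.
    order = np.argsort(voltage)
    return np.interp(v_grid, voltage[order], capacity[order])


def interpolate_run(cycles, v_min, v_max, n_points):
    """Stack every cycle on the same voltage grid: one row per cycle."""
    v_grid = np.linspace(v_min, v_max, n_points)
    rows = [interpolate_cycle(c["voltage"], c["capacity"], v_grid) for c in cycles]
    return v_grid, np.vstack(rows)


# Two discharge cycles sampled at different voltages (made-up values).
cycles = [
    {"voltage": [3.5, 3.2, 2.9, 2.5], "capacity": [0.0, 0.4, 0.8, 1.0]},
    {"voltage": [3.5, 3.0, 2.5], "capacity": [0.0, 0.6, 0.9]},
]
v_grid, matrix = interpolate_run(cycles, v_min=2.5, v_max=3.5, n_points=5)
print(matrix.shape)
```

Because every cycle ends up with the same number of columns regardless of how it was sampled, the resulting matrix can be passed directly to models that expect fixed-length inputs.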
Fig. 1. Code snippets demonstrating the raw data file handling, processing and featurization.
Fig. 2. Code snippets demonstrating model prediction using existing (stored) model and training of a new model using featurized data.
Fig. 3. Predicted vs. actual cycle life at different capacity fade thresholds. The capacity fade threshold is shown as a percentage of initial nominal capacity in each
panel.
safety values included in the file; setting these values incorrectly can result in damage to the cycling system or the cell under test, either of which might be catastrophic. However, the benefits of this capability are far reaching, such as the elimination of manual test creation, which is time-consuming and error-prone, and the ability to create and run protocols in cyclers with minimal human input. These features enable automated control and selection of experiments by active learning systems, as demonstrated in Ref. [25]. Future developments include the ability to convert from one system's protocol file to another's (within the appropriate hardware constraints) and supporting more systems.

2.7. Other features

BEEP scripts can be run locally on a machine with appropriate access to data and adequate compute/memory. But dealing with the computational workload of data processing, model fitting and display rendering can become a roadblock in large battery cycling experiments. Hence, components of BEEP are modularized for easy deployment to cloud-based services that can scale up or down. Additional cloud infrastructure can provide messaging and coordination of various scripts to deal with large data loads and processing-heavy tasks. For such application environments, we designed most of the scripts to be containerized and run via command-line arguments, and messaging between components to be achieved with event streaming.

Currently, BEEP assumes that data are arriving in a flat-file format, with one file per test. There are, however, database-centered cycler systems that do not store or export data in this form, which improves performance but makes data less accessible. Scripts that enable integration with such systems and output of flat files ingestible by BEEP are available at https://github.com/TRI-AMDD/beep-integration-scripts. There are two caveats to such scripts. First, there are assumptions about the way that
the tests are run, such as the user will not duplicate a test name
on a channel, each test runs to completion on a single channel,
etc. These assumptions stem from common practices, but the
scripts have not been tested against a large number of conditions.
Second, while for most cases the data files are less than a few
hundred megabytes, sufficient memory must be available to fit
large tests into memory. Further development might reduce the
memory requirement and provide more robust extraction of the
data, e.g. with improved SQL queries.
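
The memory caveat can be illustrated with a minimal sketch that streams rows from a database into one flat file per test, holding only a bounded number of rows in memory at a time. This is not the code in the integration scripts: the SQLite backend, the table name cycler_data, its columns, and the function export_tests_to_flat_files are hypothetical stand-ins chosen for this example.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_tests_to_flat_files(conn, out_dir, chunk_size=10_000):
    """Write one CSV per test, reading at most chunk_size rows at a time."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical schema: a single table holding the rows of every test.
    names = [row[0] for row in conn.execute(
        "SELECT DISTINCT test_name FROM cycler_data ORDER BY test_name")]
    written = []
    for name in names:
        cursor = conn.execute(
            "SELECT test_time, cycle_index, voltage, current_amp "
            "FROM cycler_data WHERE test_name = ? ORDER BY test_time",
            (name,),
        )
        path = out_dir / f"{name}.csv"
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow([col[0] for col in cursor.description])  # header row
            while True:
                rows = cursor.fetchmany(chunk_size)  # bounded memory use
                if not rows:
                    break
                writer.writerows(rows)
        written.append(path)
    return written


# Small in-memory database with made-up rows to exercise the export.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cycler_data (test_name TEXT, test_time REAL, "
    "cycle_index INTEGER, voltage REAL, current_amp REAL)")
conn.executemany(
    "INSERT INTO cycler_data VALUES (?, ?, ?, ?, ?)",
    [
        ("test_A", 0.0, 1, 3.3, 1.0),
        ("test_A", 1.0, 1, 3.4, 1.0),
        ("test_A", 2.0, 1, 3.5, 1.0),
        ("test_B", 0.0, 1, 3.2, -1.0),
    ],
)
written = export_tests_to_flat_files(conn, tempfile.mkdtemp(), chunk_size=2)
print([path.name for path in written])
```

Fetching in chunks trades a little speed for a memory footprint that no longer depends on the size of the largest test. A real cycler database would need its own queries and file naming rules.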
3. Illustrative example