
METHOD
10.1029/2023GC011324

Geochemistry π: Automated Machine Learning Python Framework for Tabular Data

J. ZhangZhou1, Can He2, Jianhao Sun3, Jianming Zhao1, Yang Lyu1, Shengxin Wang4, Wenyu Zhao1, Anzhou Li1, Xiaohui Ji5, and Anant Agarwal6

J. ZhangZhou and Can He are co-first authors.

1Key Laboratory of Geoscience Big Data and Deep Resource of Zhejiang Province, School of Earth Sciences, Zhejiang University, Hangzhou, China; 2School of Computing, National University of Singapore, Singapore, Singapore; 3School of Earth Sciences, China University of Geosciences, Wuhan, China; 4School of Earth Sciences, Lanzhou University, Lanzhou, China; 5School of Information Engineering, China University of Geosciences, Beijing, China; 6Department of Data Science, Nissan Motor Corporation, Yokohama, Japan

Key Points:
• Open-source Python framework for machine learning applications in geochemistry
• Automated pipeline for tabular data
• Question-and-answer format obviates the need for coding experience

Abstract: Although machine learning (ML) has brought new insights into geochemistry research, its implementation is laborious and time-consuming. Here, we announce Geochemistry π, an open-source automated ML Python framework. Geochemists only need to provide tabulated data and select the desired options to clean data and run ML algorithms. The process operates in a question-and-answer format, and thus does not require that users have coding experience. After either automatic or manual parameter tuning, the automated Python framework provides users with performance and prediction results for the trained ML model. Based on the scikit-learn library, Geochemistry π has established a customized automated process for implementing classification, regression, dimensionality reduction, and clustering algorithms. The Python framework enables extensibility and portability by constructing a hierarchical pipeline architecture that separates data transmission from the algorithm application. The AutoML module is constructed using the Cost-Frugal Optimization and Blended Search Strategy hyperparameter search methods from the A Fast and Lightweight AutoML Library, and the model parameter optimization process is accelerated by the Ray distributed computing framework. The MLflow library is integrated into ML lifecycle management, which allows users to compare multiple trained models at different scales and manage the data and diagrams generated. In addition, the front-end and back-end frameworks are separated to build the web portal, which demonstrates the ML model and data science workflow through a user-friendly web interface. In summary, Geochemistry π provides a Python framework for users and developers to accelerate their data mining efficiency with both online and offline operation options.

Supporting Information: Supporting Information may be found in the online version of this article.

Correspondence to: J. ZhangZhou and C. He, [email protected]; [email protected]

Citation: ZhangZhou, J., He, C., Sun, J., Zhao, J., Lyu, Y., Wang, S., et al. (2024). Geochemistry π: Automated machine learning Python framework for tabular data. Geochemistry, Geophysics, Geosystems, 25, e2023GC011324. https://doi.org/10.1029/2023GC011324

Received 1 NOV 2023; Accepted 18 JAN 2024
Author Contributions:
Conceptualization: J. ZhangZhou, Can He
Funding acquisition: J. ZhangZhou
Software: J. ZhangZhou, Can He, Jianhao Sun, Jianming Zhao, Yang Lyu, Wenyu Zhao, Anzhou Li, Xiaohui Ji, Anant Agarwal
Writing – original draft: J. ZhangZhou, Can He, Yang Lyu
Writing – review & editing: J. ZhangZhou, Jianming Zhao, Xiaohui Ji, Anant Agarwal

Plain Language Summary: Geochemistry π is a helpful tool for scientists who work with geochemical data. One of its standout features is its simplicity. Scientists can use the tool to perform machine learning (ML) on tabular data by answering a series of questions about what they want to discover. The tool does the rest by using advanced ML techniques to uncover insights from the data. Even scientists without coding skills can use Geochemistry π effectively. The tool is built on a reliable library called scikit-learn, ensuring that it works well with different ML methods. It is also flexible, allowing researchers to customize it to fit their specific needs. Geochemistry π separates data processing from ML tasks, making it adaptable and expandable. It includes features for continuous training and managing the entire ML process. To prove its effectiveness, Geochemistry π was tested against previous geochemical studies in areas such as regression, classification, clustering, and dimension reduction. The results showed that it could replicate the findings of these studies accurately. Accessible through a web portal or command line, Geochemistry π is a valuable asset for geochemists and researchers looking to analyze large geochemical data sets.

© 2024 The Authors. Geochemistry, Geophysics, Geosystems published by Wiley Periodicals LLC on behalf of American Geophysical Union. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.

1. Introduction

In recent decades, the rapid development of analytical instruments has produced astronomical amounts of geochemical data. Furthermore, thanks to community databases, humans and machines can access large amounts of data effectively through the Findable, Accessible, Interoperable, and Reusable principle (Chamberlain et al., 2021; Farrell et al., 2021; Goldstein et al., 2014; Klöcking et al., 2023; Lehnert et al., 2000). Under these circumstances, geochemical data mining has become a practical approach for elucidating the processes occurring on Earth's surface and within its interior, at present and throughout geological history, such as tephrochronology

and tephra studies (Bolton et al., 2020; Lubbers et al., 2023), hydrological studies (Shaughnessy et al., 2021; Wen et al., 2021), magmatic processes (Boschetty et al., 2022; Cortés et al., 2007; Costa et al., 2023; Keller et al., 2015), thermobarometry (Higgins et al., 2022; Jorgenson et al., 2022; Petrelli et al., 2020), tectonic discrimination (Doucet et al., 2022; Petrelli & Perugini, 2016; Ueki et al., 2018), sedimentary composition reflecting crust composition over time (Lipp et al., 2021; Ptáček et al., 2020), and long-term secular cooling of the mantle (Keller & Schoene, 2012).

Figure 1. Activity diagram of Geochemistry π.

In addition to images (e.g., geochemical maps, electron microscope images), a significant amount of important
geochemical information is stored in tabular data, such as the concentrations and speciation of chemical compounds, elemental concentrations, and isotopic ratios. When applied to these data sets, machine learning (ML) can reveal deep structural patterns within the data, thereby bringing new geochemical insights (Chicchi et al., 2023; He
et al., 2022; Morrison et al., 2017; Petrelli & Perugini, 2016; Prabhu et al., 2021; Qin et al., 2022; Stracke
et al., 2022; Tao et al., 2021; Wen et al., 2021). Although flourishing, ML implementation is laborious and time‐
consuming for most geochemists because they must, for example, locate codes from scikit‐learn (Pedregosa
et al., 2011), modify codes to fit their unique data set(s), and tune the model's hyperparameters.
Compared to the common application of ML in other geoscientific disciplines (e.g., geohazards, seismology;
Bergen et al., 2019; Li et al., 2023; Reichstein et al., 2019), geochemical data‐driven research has fallen behind. A
recent search on the Web of Science for publications from 2018 to 2022 reveals this gap, with the proportion of
publications using ML in petrology and geochemistry increasing from 0.24% to 1.35% (total publications in 2018
and 2022: 9360 and 11,101, respectively) compared to a more substantial increase from 1.48% to 6.90% in the field
of earth and planetary science (total publications in 2018 and 2022: 22,669 and 31,179, respectively). This gap can
be attributed, in part, to the absence of convenient tools designed specifically for geochemical data mining.

Here, we introduce Geochemistry π, a novel, automated, open‐source Python framework designed to simplify the
implementation of ML on tabular data, eliminating the need for prior coding experience (see Figure 1). Users
merely need to provide their data in .csv or .xlsx spreadsheet formats and make selections in a straightforward


question‐and‐answer format. Geochemistry π can seamlessly handle various tasks, including data cleaning,
standardization, regression, classification, clustering, and dimensionality reduction.

The software offers benefits to the community in two significant ways. First, for geochemists who are proficient in
coding and employing ML, Geochemistry π can serve as a benchmark for validating their results. Second, for
those geochemists lacking coding expertise, Geochemistry π lowers the entry barriers to utilizing ML effectively.
In the sections that follow, we will introduce the software from both user and developer perspectives, including
benchmark tests to replicate ML applications from prior studies.

2. Geochemistry π From the User's Side


Before implementing an ML model, users must define their scientific problem and prepare two data sets in a
tabular format (either .csv or .xlsx files): one for training the ML models and the other for the application of the
trained ML models. Then, users will be led through question-and-answer prompts in Geochemistry π to select
the software options required to achieve their goals. Results will be generated and preserved in the results folder as
images and spreadsheets, including sample statistics, ML model performance, map plots, and predictions based
on the application data set.
Geochemistry π is compatible with MacOS, Windows, and Linux operating systems. Users can install the
software through the command line using the “pip install geochemistrypi” command. In Jupyter Notebook and
Google Colab environments, commands are the same, but preceded by an exclamation point; for example, in the
case of installation, the relevant command is “!pip install geochemistrypi.” All source codes are available on
GitHub (https://github.com/ZJUEarthData/geochemistrypi), with a detailed operational manual catering to both
users and developers (https://geochemistrypi.readthedocs.io/en/latest/). This resource serves to facilitate smooth
navigation and instruction.
Operation of the software begins with the initiation command "geochemistrypi data-mining --training InputData.xlsx --application ApplicationData.xlsx" executed at the command line, where "InputData.xlsx" refers to the path of the input data file and "ApplicationData.xlsx" refers to the path of the application data file. Users must assign a name to each modeling instance, here called an "experiment," and differentiate various trials within the modeling instance by using a second-level name called a "run" (Figure S1 in Supporting Information S1). Next, users select
experimental data from the input data set (Figure S2 in Supporting Information S1), meaning the selected
experimental data are read from the input tabular file, and relevant information is written into another data file
specifically for ML use. The chosen data files and corresponding distribution images are automatically generated.
Users are free to define header names in the input data set, and Geochemistry π reads the data accordingly. During data pre-processing, users are guided through how missing values in the data set should be
handled (Figure S3 in Supporting Information S1), accompanied by an automatic hypothesis testing process.
Then, feature engineering is implemented to further refine the data.
Geochemistry π offers data preparation and then four operational modes: regression, classification, clustering, and
dimension reduction. Upon selecting either regression or classification mode, users begin partitioning their selected
experimental data into feature and label data sets (Figure S4 in Supporting Information S1) for model training. This
phase applies feature scaling techniques (Figure S5 in Supporting Information S1) and feature selection techniques
(Figure S6 in Supporting Information S1) to the feature data set. After that, users need to split the feature and label data sets into train and test sets for model evaluation (Figure S7 in Supporting Information S1). Subsequently, users
can opt for one or multiple models within the chosen mode (Figure S8 in Supporting Information S1), which may
entail either automated or manual hyperparameter tuning (Figure S9 in Supporting Information S1). The confir-
mation steps encompass the automated generation of data, images, models, metrics, and parameter files, all
seamlessly synchronized with the progression of the model's execution. All results will be stored in the “geo-
pi_output” folder, including the trained ML model performance, and the prediction results of the application data.
The operation is broken into the following five steps, with steps 1–2 dedicated to data preparation, steps 3–4 to
ML model training, and step 5 to predictions using the trained ML models:
1. Data selection

To begin, users load their prepared data as the input data into Geochemistry π from tabular files (in either .csv or .
xlsx format) (Figure S1 in Supporting Information S1) to train the model. The software prompts users to select


which columns of data will be utilized for subsequent tasks. Users respond by specifying the range or specific
columns to be used. Geochemistry π provides descriptive statistics of the selected data, with particular emphasis
on highlighting missing values that require attention.
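The column selection and inspection performed in this step can be approximated outside the software with a few lines of pandas; the column names and values below are invented for illustration, not taken from a real data set:

```python
import pandas as pd

# Hypothetical tabular input; Geochemistry π reads such tables from .csv or .xlsx files.
df = pd.DataFrame({
    "SiO2": [51.2, 49.8, None, 50.5],
    "MgO": [7.9, None, 8.3, 8.1],
    "CaO": [11.0, 10.4, 10.9, None],
})

# Columns chosen for the subsequent ML task.
selected = df[["SiO2", "MgO"]]

print(selected.describe())     # descriptive statistics of the selection
print(selected.isna().sum())   # missing values that require attention in step (2)
```

The `isna().sum()` summary mirrors the software's emphasis on highlighting missing values before pre-processing.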

2. Missing values and standardization


Dealing with missing values in geochemical data offers several strategies and options. The first strategy involves leaving the missing values as-is and proceeding with the models that can handle missing values in regression, classification, and clustering tasks. However, this strategy restricts the use of dimension reduction on data with missing values and may negatively impact model performance. Geochemistry π asks users whether they want to impute the missing values. Users can choose "No" to proceed with the limited set of models that tolerate missing values, such as XGBoost. The second strategy is to drop rows with missing values, but this may lead to a significant loss of data if too many features are chosen. Therefore, users must carefully consider the statistical results from step (1). The third strategy involves choosing "Yes" to impute missing values with the mean, the median, the most frequent value, or a constant value such as zero (Figure S3 in Supporting Information S1). Users simply type the corresponding number for the desired imputation method in Geochemistry π. The choice among these three options relies on the user's understanding of the data set and the circumstances that led to the missing values. If users wish to augment their data set with additional features based on data characteristics, they can choose the feature engineering option in Geochemistry π. Before training the ML algorithm, users perform the data partitioning step (Figure S4 in Supporting Information S1) and can further scale the feature data set through standardization (Figure S5 in Supporting Information S1) or filter out less important features through feature selection (Figure S6 in Supporting Information S1).
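The imputation choices and the optional standardization map directly onto scikit-learn components, on which Geochemistry π is built. A minimal sketch (the array values are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with missing entries.
X = np.array([[51.2, 7.9], [49.8, np.nan], [np.nan, 8.3]])

# Strategy 3: impute with the mean; "median", "most_frequent", or
# "constant" (e.g., fill_value=0) are the other prompted options.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# Optional feature scaling: each column to zero mean and unit standard deviation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
```

In the software these choices are made by typing a number at a prompt rather than by writing such code.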
3. Selecting ML algorithms

Geochemistry π provides four operational modes: regression, classification, clustering, and dimension reduction.
Users select their preferred mode by typing 1, 2, 3, or 4 in the software. Following the mode selection, the
software presents a list of ML algorithms from which users can choose one or all algorithms to train their ML
models (Figure S8 in Supporting Information S1).

4. Options for parameter optimization


Users have the flexibility to optimize model parameters through automated methods or manual input. In the
automated mode, details of the automation method are covered in Section 3.2. Alternatively, users can choose
“No” when asked if automated optimization is required. In this case, users manually input hyperparameter values
as prompted by the software, such as learning rate, minimum split loss, and maximum tree depth.
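Manual tuning amounts to passing explicit hyperparameter values to the underlying estimator. As an illustration, scikit-learn's gradient boosting is used below as a stand-in (the values are arbitrary; the exact prompts in Geochemistry π depend on the chosen model, and for XGBoost the minimum split loss corresponds to its gamma parameter):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data standing in for a geochemical feature/label table.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Manually chosen hyperparameter values standing in for the prompted inputs.
model = GradientBoostingRegressor(
    learning_rate=0.1,           # step-size shrinkage
    max_depth=3,                 # maximum tree depth
    min_impurity_decrease=0.0,   # roughly analogous to XGBoost's minimum split loss (gamma)
    n_estimators=100,
    random_state=0,
)
model.fit(X, y)
preds = model.predict(X[:5])
```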
5. Applying the trained ML model on the application data
Once supervised ML models are well trained, users proceed to apply them to application data. Similar to step (1), users select a tabular data set in the appropriate format as the application data. It is crucial to ensure that the column names in the application data perfectly match those in the input data used for training the ML model. Any mismatch could lead to issues, a consideration that applies both to Geochemistry π and to Python-based ML approaches in general. Once the training procedure is finished, the trained ML model is applied automatically to make predictions based on the application data, enabling the extraction of valuable insights and conclusions.
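The column-name requirement in step (5) can be checked programmatically before inference; the column names, values, and the simple linear model below are placeholders for illustration only:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training table and labels (e.g., pressure in kbar).
train = pd.DataFrame({"SiO2": [50.1, 49.2, 51.0], "MgO": [8.0, 8.4, 7.7]})
labels = pd.Series([5.2, 6.1, 4.8])

# Hypothetical application data for batch prediction.
application = pd.DataFrame({"SiO2": [50.6], "MgO": [8.1]})

# Any mismatch in column names would make the predictions meaningless.
assert list(application.columns) == list(train.columns), "column names must match"

model = LinearRegression().fit(train, labels)
predictions = model.predict(application)  # batch inference on new data
```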

3. Geochemistry π From the Developer's Side


3.1. Workflow Overview
Geochemistry π is an open‐source automated ML framework with specialized design patterns to facilitate the
application of ML techniques to data mining of tabular data. The framework consists of three key components:
continuous training, ML lifecycle management, and model inference. Geochemistry π implements continuous training and model inference through a customized automated ML training pipeline, and ML lifecycle management through a customized storage mechanism (Figure S12 in Supporting Information S1).

After users upload their tabular data into Geochemistry π, a customized automated ML pipeline powered by
hyperparameter optimization solutions from a fast and lightweight AutoML library (FLAML) (Wang, Wu,
Huang, & Amin, 2021) implements data mining techniques through a question‐and‐answer format (Figure S10 in
Supporting Information S1). After that, previously trained models will be automatically applied to newly provided


application data sets due to the auto‐built transformer pipeline. Every operation performed on the data and the
correspondingly generated artifacts are logged by MLflow (Zaharia et al., 2018) and the tracking mechanism
designed in this study. Finally, users can review the results of their experiments, including descriptions of ex-
periments and runs, details about trained model parameters, evaluation metrics, and any generated artifacts.

3.2. Customized Automated ML Pipeline


Our customized automated ML pipeline (Figure S11 in Supporting Information S1) consists of the following eight
main phases, presented in a sequential order:
1. Experiment setup. It consists of data loading, dependency checking and the configuration of experiment and
run metadata.
2. Data extraction. Users select relevant data from either their own or built‐in data sets. This step ensures that the
specified range of data is used for the analysis.
3. Data preparation. It includes imputation for missing data, feature engineering to create new features from
existing ones, feature and label partitioning, feature scaling, feature selection, and train and test set splitting.
These steps refine the raw data into a data set suitable for subsequent data analysis and model training.
4. Data analysis. This includes a descriptive statistics report, statistical plotting (correlation, distribution, log distribution, and probability plots), and a statistical analysis section that combines Monte Carlo simulation with hypothesis testing to better understand the data structure and characteristics.
5. Geochemical toolkit demonstration. Currently, we offer a world map projection to display global sample
distributions. This feature is useful for gaining insights into the spatial aspects of geochemical data. In the
future, more geochemical analysis techniques will be incorporated into this toolkit.
6. Model training. This covers ML model selection and automatic or manual hyperparameter tuning. We offer four kinds of ML models (regression, classification, clustering, and dimension reduction) to ensure model versatility and cover a broad range of analytical tasks. For a more in-depth explanation of the hyperparameter search, please refer to the latter part of Section 3.2.
7. Model evaluation. This comprises cross‐validation and artifact production. The former provides an unbiased
estimate of how well a model can be generalized, whereas the latter provides crucial indicators of the model's
effectiveness and potential for enhancement. This phase ensures the robustness of trained models.
8. Model inference. In the model application section, the trained ML model is deployed into the production
environment to make batch predictions on new data. This step allows the model to be used in real‐world
applications.
The use of MLflow to track ML models with unique identifiers is crucial for managing the lifecycles of ML
models in our software. Geochemistry π integrates MLflow to streamline the end‐to‐end ML lifecycle. MLflow
treats the user's scientific problem as an experiment and treats different trial‐and‐error iterations using our
automated ML pipeline as runs. Each run in an experiment logs parameters, metrics, artifacts, and setup infor-
mation, which researchers can use to easily compare different models, hyperparameters, and data pre‐processing
techniques and identify the most effective approaches. Furthermore, each run packages the trained models into
reproducible artifacts that can then be easily deployed across various environments.

In Geochemistry π, the storage mechanism (Figure S12 in Supporting Information S1) consists of two compo-
nents: the “geopi_tracking” folder and the “geopi_output” folder. MLflow uses the “geopi_tracking” folder as the
store for visualized operations in the web interface, which researchers cannot modify directly. The "geopi_output" folder is a regular folder aligning with MLflow's storage structure, which researchers can freely access and modify.
Overall, this unique storage mechanism is purpose‐built to track each experiment and its corresponding runs in
order to create an organized and coherent record of researchers' scientific explorations.
Geochemistry π includes an AutoML module that leverages two hyperparameter search methods, Cost-Frugal Optimization (CFO) and BlendSearch (Blended Search Strategy), from the FLAML library to locate the optimal combination of hyperparameters for the selected ML algorithms at low cost. CFO is known for its efficiency in
minimizing the computational cost of hyperparameter tuning, while BlendSearch combines multiple optimization
strategies to improve the chances of finding the best hyperparameters (Wang, Wu, Weimer, & Zhu, 2021; Wu
et al., 2005). To further expedite the parameter optimization process, the AutoML module harnesses the power of
the Ray framework for distributed and parallel computing, which is particularly advantageous when dealing with
complex models or large data sets (Philipp et al., 2018).


3.3. Design Pattern and Hierarchical Pipeline Architecture


Geochemistry π adopts the "Abstract Factory" software design pattern (Figure S13 in Supporting Information S1), which serves as the foundational framework upon which our advanced automated ML capabilities are built.
This pattern provides an interface for creating families of related or dependent objects without specifying their
concrete classes. It further allows for the interchangeability of concrete implementations without affecting the
client code.

In the context of Geochemistry π, this design pattern acts as a blueprint for creating the diverse components of
automated ML workflows. The framework is a four‐layer hierarchical pipeline architecture that promotes the
creation of workflow objects through a set of model selection interfaces. The critical layers of this architecture are
as follows:

1. Layer 1: the realization of ML model‐associated functionalities with specific dependencies or libraries.


2. Layer 2: the abstract components of the ML model workflow classes, covering regression, classification, clustering, and dimension reduction.
3. Layer 3: the scikit‐learn API‐style model selection interface implements the creation of ML model workflow
objects.
4. Layer 4: the customized automated ML pipeline operated at the command line or through a web interface with
a complete data‐mining process.
This pattern‐driven architecture offers users a standardized and intuitive way to create an ML model workflow
class in Layer 2 by using a unified and consistent approach to object creation in Layer 3. Furthermore, it ensures
the interchangeability of different model applications, allowing for seamless transitions between methodologies
in Layer 1. Overall, this hierarchical architecture unlocks new dimensions of flexibility and extensibility,
empowering researchers to effortlessly contribute to our open‐source software or make their customized
products.
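For illustration, the layered factory idea can be sketched in plain Python; the class and function names below are invented for this sketch and do not mirror Geochemistry π's actual internals:

```python
from abc import ABC, abstractmethod

class WorkflowBase(ABC):
    """Layer 2: abstract component of an ML model workflow."""
    @abstractmethod
    def fit(self, X, y=None):
        ...

class RegressionWorkflow(WorkflowBase):
    """Layer 1: concrete realization bound to a specific library."""
    def fit(self, X, y=None):
        return f"trained regression model on {len(X)} samples"

class ClusteringWorkflow(WorkflowBase):
    def fit(self, X, y=None):
        return f"clustered {len(X)} samples"

def model_selection(mode: str) -> WorkflowBase:
    """Layer 3: factory interface. The client (the Layer 4 pipeline) never
    names a concrete class, so implementations stay interchangeable."""
    factories = {"regression": RegressionWorkflow, "clustering": ClusteringWorkflow}
    return factories[mode]()

workflow = model_selection("regression")
result = workflow.fit([[1.0], [2.0], [3.0]])
```

Swapping the concrete class behind `model_selection` changes the methodology without touching the calling pipeline, which is the interchangeability the Abstract Factory pattern provides.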
Geochemistry π adopts object‐oriented programming principles to construct its core functionalities. It leverages
specialized design patterns to build a comprehensive automated ML framework from modular and shareable
components, with command line and web interface software. The web application is implemented via a frontend‐
backend separation architecture (Figure S14 in Supporting Information S1) with authentication to support
multiple users. Additionally, it utilizes the SQLAlchemy store and file system store for data storage. Those
characteristics make Geochemistry π a potential candidate for a large‐scale ML Operations application at
architectural Level 1 (Figure 2), which aims to streamline and automate the lifecycle of ML models from
development to deployment.

4. Benchmarking: Product Test


To benchmark Geochemistry π, we compare application results with those of Petrelli et al. (2020) for regression,
Qin et al. (2022) for classification, Stracke et al. (2022) for clustering, and Tao et al. (2021) for dimension
reduction. In the following examples, Geochemistry π was used at the command line on a Windows operating
system. The summary of computational efficiency is provided in Table S1 in Supporting Information S1. More
detailed worked examples can be found in Code Documentation in Supporting Information S1.

4.1. Test Exercise for Regression


Petrelli et al. (2020) applied six regression algorithms to clinopyroxene and melt chemical data to obtain ML‐
based geothermobarometers. The algorithms include ordinary least squares linear regression, ML‐based sto-
chastic gradient boosting, extremely randomized trees (Extra‐Trees), random forests, k‐nearest neighbors, and
decision trees. To test the performance of regression models in Geochemistry π, we used the same data sets to train the same six models with automatic parameter tuning, evaluating model performance using the coefficient of determination (R2) and the root-mean-square error (RMSE).

The data implemented by Geochemistry π derives from Table S3 in Petrelli et al. (2020). The training data set
comprises the major element compositions of melt (SiO2, TiO2, Al2O3, FeOt, MnO, MgO, CaO, Na2O, K2O,
Cr2O3, P2O5, and H2O) and clinopyroxene (SiO2, TiO2, Al2O3, FeOt, MnO, MgO, CaO, Na2O, K2O, and
Cr2O3), whereas the prediction data set includes pressure or temperature. The data was split into 70% for

training and 30% for testing. In the data pre-processing options in Geochemistry π, we first filled missing values with zeroes. Then, we scaled the training data by standardization, to a mean of zero and a standard deviation of one.

Figure 2. System Architecture of Geochemistry π.
Table S2 in Supporting Information S1 reports the R2 and RMSE values obtained for the six regression algorithms
using the major element compositions of clinopyroxene and melt data for pressure or temperature estimations.
Using Geochemistry π, Extra‐Trees achieved the highest R2 values (0.94 and 0.95) and the lowest RMSE scores
(2.3 kbar and 40 K) when estimating pressure and temperature, respectively. These results align with those of
Petrelli et al. (2020), who also found Extra‐Trees to be the best model for the data and obtained comparable
respective R2 values of 0.91 and 0.94 and RMSE values of 2.6 kbar and 40 K (Table S2 in Supporting
Information S1).
Figure 3 presents a comparative analysis between the Extra-Trees regression of Petrelli et al. (2020) and that of Geochemistry π. This analysis focuses on a thermometer model and uses the training, test, and validation data sets provided by Petrelli et al. (2020). The training and test data sets are employed for model training, while the validation data set corresponds to the application data defined within the Geochemistry π framework; notably, the validation data set in this context contains known temperature values. The automated hyperparameter tuning offered by Geochemistry π achieves model performance comparable to that of the manual tuning by Petrelli et al. (2020), albeit with different hyperparameters (Figures 3a, 3b, 3d, and 3e). However, the validation results of Petrelli et al. (2020) outperform those obtained using Geochemistry π (Figures 3c and 3f). To understand this discrepancy, we examined the Code Documentation in Supporting Information S1: Petrelli et al. (2020) specified values for four hyperparameters (namely, n_estimators, criterion, max_features, and random_state), leaving the remaining hyperparameters at the scikit-learn defaults, whereas Geochemistry π allows a more extensive set of hyperparameters to be modified (including n_estimators, max_depth, max_features, min_samples_split, min_samples_leaf, bootstrap, max_samples, and oob_score). This divergence in hyperparameter configurations implies that models with different hyperparameters may exhibit similar overall performance yet differ in their predictive capabilities.

Figure 3. Regression results from (a–c) Petrelli et al. (2020) and (d–f) produced in this study using Geochemistry π in automated mode. Results are plotted using the (a, d) training, (b, e) test, and (c, f) validation data sets, respectively.
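As an illustration of automated tuning over this wider hyperparameter set, a randomized search with scikit-learn might look as follows. This is a sketch on synthetic data, not Geochemistry π's own AutoML engine; the search ranges are invented, and max_samples and oob_score are omitted because they apply only when bootstrap=True:

```python
# Illustrative automated hyperparameter search over an Extra-Trees regressor.
# Ranges and data are placeholders; Geochemistry pi uses its own AutoML cores.
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)

param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": [None, 10, 20, 40],
    "max_features": ["sqrt", "log2", 1.0],
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 5),
    "bootstrap": [True, False],
}
search = RandomizedSearchCV(
    ExtraTreesRegressor(random_state=42),
    param_distributions,
    n_iter=10,           # number of sampled configurations
    cv=3,                # 3-fold cross-validation
    scoring="r2",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

Saving search.best_params_ alongside the scores is what makes such runs reproducible and comparable across tools.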

4.2. Test Exercise for Classification


Qin et al. (2022) applied the XGBoost classification algorithm to clinopyroxene major element analyses (SiO2, TiO2, Al2O3, Cr2O3, FeOT, CaO, MgO, MnO, and Na2O; n = 21,605) (Figure 4a) and trace element analyses (Sc, Ti, V, Cr, Ni, Rb, Sr, Y, Zr, Nb, Ba, La, Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu, Hf, Ta, Pb, Th, and U; n = 2,967) (Figure 4d) to identify mantle metasomatism. In this example, run in the automated mode of Geochemistry π, we used their "Labeled data set" for training, with automated parameter adjustment for optimization. Classification performance is evaluated by calculating both the Accuracy and F1 scores (see definitions in Qin et al. (2022)).

We executed nine variations of the XGBoost algorithm, each using a distinct set of features in the training data: MajorI-9, comprising all nine major elements; MajorI-4, comprising four major elements (Al2O3, FeOT, CaO, MgO); MajorI-2a, comprising two major elements (CaO, Al2O3); MajorI-2b, comprising two alternate major elements (FeOT, MgO); TraceI-30, comprising all 30 trace elements; TraceI-25, comprising all trace elements except Rb, Sr, Ba, Pb, and U; TraceI-4, comprising four trace elements (La, Yb, Eu, Ti); TraceI-2a, comprising two trace elements (Eu, Ti); and TraceI-2b, comprising two alternate trace elements (La, Yb). The prediction targets were taken from the column "TRUE_VALUE." The input data sets were divided into 70% for training
and 30% for validation. Qin et al. (2022) did not process the missing values and applied XGBoost directly, with the confusion matrix results presented in Figure 4a. As part of the data pre-processing in Geochemistry π, missing values were either treated as in Qin et al. (2022) (results in Figures 4b and 4e) or imputed using the mean value of the element across all analyses (results in Figures 4c and 4f). The slightly better performance with mean-value imputation might reflect a better representation of the data structure, given that the missing values are real but unanalyzed.

Figure 4. Classification results from (a, d) Qin et al. (2022) and (b, c, e, and f) generated in this study using Geochemistry π in automated mode. Missing values are handled differently: (a, b, d, and e) as null and (c, f) as mean values. The hyperparameters are summarized in Table S4 in Supporting Information S1.
Table S3 in Supporting Information S1 reports the F1 and Accuracy scores of the XGBoost classification algorithm for each training data set. Across the nine models implemented in Geochemistry π, the F1 and Accuracy scores for the training sets span 0.81–0.96 and 0.76–0.96, respectively, closely paralleling the results of Qin et al. (2022; 0.85–0.97 and 0.82–0.97, respectively). When applied to the testing sets, Geochemistry π attained F1 and Accuracy scores of 0.81–0.96 and 0.76–0.95, respectively, again aligning closely with those of Qin et al. (2022; 0.82–0.97 and 0.77–0.95, respectively).

4.3. Test Exercise for Clustering


Stracke et al. (2022) applied the t-distributed stochastic neighbor embedding (t-SNE) algorithm to a compiled data set of Sr-Nd-Pb isotope ratios (87Sr/86Sr, 143Nd/144Nd, 206Pb/204Pb, 207Pb/204Pb, and 208Pb/204Pb) from mid-ocean ridge basalts and ocean island basalts (see their Table S1 in Supporting Information S1, n = 6,723). They used a perplexity parameter value of 48 when applying the t-SNE method and categorized the data into 16 clusters using the single-linkage agglomerative method (Figure 5a).

To assess the performance of the clustering models in Geochemistry π, we used the same data set and parameter values to perform clustering, with manual parameter adjustment (Figure 5b). Upon comparing the results, we observed disparities between the clustering outcomes of Geochemistry π and those of Stracke et al. (2022), likely attributable to variations in the data set and parameter settings used in the code. To address this, we fixed the random state to a value of 42, in contrast to the unset random state of Figures 5a and 5b. With this modification, the Jupyter Notebook of Stracke et al. (2022) and Geochemistry π yielded mutually consistent results (Figures 5c and 5d).
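The role of the random state can be illustrated with a minimal t-SNE plus single-linkage agglomerative clustering pipeline in scikit-learn; the 5-D data below are a synthetic stand-in for the isotope ratios, while the perplexity and clustering settings follow the values quoted above:

```python
# Minimal t-SNE + single-linkage agglomerative clustering pipeline, mirroring
# the settings quoted above (perplexity = 48, n_clusters = 16, linkage =
# "single"). The 5-D data are a synthetic stand-in for Sr-Nd-Pb ratios.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))

# Fixing random_state makes the embedding reproducible across runs and tools;
# leaving it unset yields a different map on every execution.
emb = TSNE(n_components=2, perplexity=48.0, init="pca",
           random_state=42).fit_transform(X)
labels = AgglomerativeClustering(n_clusters=16, linkage="single").fit_predict(emb)

print(emb.shape)               # (300, 2)
print(np.unique(labels).size)  # 16
```

Because t-SNE is stochastic, two tools can only be expected to agree point-for-point when they share both the input data and the random seed, which is exactly the reconciliation described above.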


Figure 5. t-SNE "maps" depicting the 5D MORB-OIB isotopic data from Stracke et al. (2022). These maps were generated with a perplexity value of 48 and an agglomerative clustering approach (n_clusters = 16, linkage = single). Panels (a, c) illustrate results obtained using distinct random state values, namely (a) none and (c) 42, as generated by the Jupyter Notebook from Stracke et al. (2022). Panels (b, d) display t-SNE "maps" for the same two random state values, namely (b) none and (d) 42, as produced by Geochemistry π.

4.4. Test Exercise for Dimensional Reduction


Tao et al. (2021) characterized the high-dimensional data structure of TOC-normalized concentrations of 123 polar lipid biomarkers from three habitats: riverine soils, river sediments, and marine sediments. They applied three unsupervised dimensional reduction approaches (principal component analysis, PCA; nonmetric multidimensional scaling, nMDS; and t-SNE) using the "factoextra," "vegan," and "Rtsne" packages, respectively, in R (version 4.0.2) with RStudio. To test Geochemistry π, we used the data set from their Table S1 in Supporting Information S1, again with a manual parameter adjustment process to achieve optimization.

Figures 6a–6c depict the outcomes of the three models as presented by Tao et al. (2021), based on the data set from their Table S1. The PCA representation produced with Geochemistry π exhibits a distribution akin to that of Tao et al. (2021), differing by a simple 90-degree rotation (Figure 6d). The Dim2 values are largely negative in our results, whereas they are largely positive in those of Tao et al. (2021).

Figure 6. Dimensional reduction maps of total polar lipid biomarker compounds, illustrating the differences among three habitats of soil and sediment samples in the Yellow River and Bohai Sea. Results are shown from (a–c) Tao et al. (2021) using RStudio and (d–i) this study using Geochemistry π (Python) in (d–f) manual or (g–i) automated mode. The axes are unitless and represent higher-dimensional structural relationships in the data. (a, d, and g) Principal component analysis (PCA), (b, e, and h) nonmetric multidimensional scaling (nMDS), and (c, f, and i) t-distributed stochastic neighbor embedding (t-SNE). The hyperparameters are summarized in Table S5 in Supporting Information S1.

According to Tao et al. (2021), compared to
linear approaches like PCA, non-linear dimensional reduction techniques such as t-SNE and nMDS have superior capacities for unveiling environmental sample classifications (Figures 6b and 6c). Nonetheless, the overall effectiveness in distinguishing sample habitats remains limited, which can be attributed to the varied sources and intricate secondary alteration of the polar lipid biomarker components during transport, and to their contributions to the observed data structure. Our benchmark tests indicated notable disparities between the Python and RStudio results (Figure 6). Additionally, varying hyperparameters led to different outcomes in the t-SNE and nMDS analyses (Figures 6e, 6f, 6h, and 6i).
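The rotations and sign flips noted above are expected behavior rather than errors: PCA scores are defined only up to the sign (and ordering) of each component, so different implementations can return mirrored maps of the same structure. A small sketch on synthetic stand-in data:

```python
# PCA scores are defined only up to the sign of each component, so maps from
# different implementations can appear mirrored or rotated while encoding the
# same structure. Synthetic correlated stand-in data below.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))  # correlated features

scores = PCA(n_components=2).fit_transform(X)
mirrored = scores * np.array([1.0, -1.0])  # flip the sign of Dim2

# Pairwise sample distances -- and hence the apparent groupings -- are
# unchanged by the flip.
d_orig = np.linalg.norm(scores[0] - scores[1])
d_flip = np.linalg.norm(mirrored[0] - mirrored[1])
print(np.isclose(d_orig, d_flip))  # True
```

The same argument applies to t-SNE and nMDS maps, whose axes carry no intrinsic meaning, so only the relative arrangement of samples should be compared across tools.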
We speculate that the discrepancies between the RStudio and Geochemistry π (Python) results stem from differences in the underlying model functions or parameters. Geochemistry π uses scikit-learn version 1.1.3. One notable distinction is how the t-SNE initialization parameter is handled: the R implementation defaults to "PCA" initialization, whereas scikit-learn uses "random" initialization by default. Note that "PCA" initialization cannot be applied to precomputed distances, which is a significant consideration during data preparation and algorithm configuration. Furthermore, for nMDS, the R implementation accepts non-square matrices directly as inputs, whereas the Python implementation requires that non-square matrices first be converted into square (distance) matrices before processing.
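The initialization difference can be made explicit in scikit-learn by setting the init argument; a sketch on synthetic data:

```python
# The default initialization differs between implementations: scikit-learn's
# TSNE historically defaulted to "random", while the R Rtsne workflow applies
# a PCA-based start. Setting init explicitly makes runs comparable across
# tools. As noted in the text, init="pca" cannot be combined with
# metric="precomputed".
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

emb_pca = TSNE(n_components=2, perplexity=30.0, init="pca",
               random_state=0).fit_transform(X)
emb_rand = TSNE(n_components=2, perplexity=30.0, init="random",
                random_state=0).fit_transform(X)
print(emb_pca.shape, emb_rand.shape)
```

Even with the same random_state, the two initializations generally yield different embeddings, which is one concrete source of the cross-tool disparities described above.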

5. Conclusions
Geochemistry π provides an open‐source, user‐friendly Python framework for conducting ML with minimal
coding requirements. Accessible via a web portal or as a command‐line executable, Geochemistry π streamlines
ML tasks by simplifying data input and ML model selection through intuitive question‐and‐answer prompts.
Users can effortlessly load data in .csv or .xlsx spreadsheet formats and instruct Geochemistry π to automatically
train ML models and generate predictions.

Built upon the scikit‐learn library, Geochemistry π is intentionally designed for extendability and portability,
ensuring its versatility in diverse applications. The framework adopts a hierarchical pipeline architecture,
effectively separating data processing from algorithm implementation. With two AutoML cores, continuous
training capabilities, and comprehensive ML lifecycle management, Geochemistry π offers a robust platform for
conducting geochemical ML experiments.
Benchmarking tests conducted in this study largely reproduced results from previous geochemical ML research, encompassing regression, classification, clustering, and dimensional reduction tasks. While Geochemistry π demonstrated consistency with prior studies in most cases, some discrepancies arose, possibly attributable to differences in model functions and hyperparameters. Consequently, we recommend Geochemistry π as a benchmarking tool capable of producing consistent results while allowing users to save hyperparameters in spreadsheets for reference.
Geochemistry π is open-source software, affording users the transparency to inspect its source code. Its advantages include providing a benchmarking tool for users experienced in coding and lowering the entry barrier to ML applications for non-coders. However, users are encouraged to acquire a fundamental understanding of ML principles to interpret results effectively. Limitations of Geochemistry π include less flexibility than direct coding with Python and scikit-learn, the absence of certain functionalities available in packages for other languages, and computational efficiency that may be on par with or lower than using Python directly. As we continue to develop Geochemistry π, we remain committed to addressing user feedback and improving the software, and we welcome contributions from those interested in enhancing this open-source tool.

Data Availability Statement


The software developed in this work has been open-sourced on GitHub (https://github.com/ZJUEarthData) and archived on Zenodo (https://doi.org/10.5281/zenodo.10509049; ZhangZhou et al., 2024). The data for the product tests can be downloaded from the same Zenodo archive (https://doi.org/10.5281/zenodo.10509049).


Acknowledgments
The authors appreciate the discussions with Yu Qi and Tien Nguyen. We express our gratitude to Maurizio Petrelli, Ben Qin, Andrew Stracke, and Ke-yu Tao for generously providing both the data and the hyperparameters necessary to replicate their previously published results. We thank Robert Dennen for polishing the language of the paper. J. ZhangZhou acknowledges support from the Fundamental Research Funds for the Central Universities (K20220232) and NSFC (Grant 42072066). The authors acknowledge the cloud computing resources provided by the Deep-time Digital Earth (DDE) science program, which has been instrumental in supporting the web-based version of our software.

References
Bergen, K., Johnson, P., Hoop, M., & Beroza, G. (2019). Machine learning for data-driven discovery in solid Earth geoscience. Science, 363(6433), eaau0323. https://doi.org/10.1126/science.aau0323
Bolton, M. S., Jensen, B. J., Wallace, K., Praet, N., Fortin, D., Kaufman, D., & De Batist, M. (2020). Machine learning classifiers for attributing tephra to source volcanoes: An evaluation of methods for Alaska tephras. Journal of Quaternary Science, 35(1–2), 81–92. https://doi.org/10.1002/jqs.3170
Boschetty, F. O., Ferguson, D. J., Cortés, J. A., Morgado, E., Ebmeier, S. K., Morgan, D. J., et al. (2022). Insights into magma storage beneath a frequently erupting arc volcano (Villarrica, Chile) from unsupervised machine learning analysis of mineral compositions. Geochemistry, Geophysics, Geosystems, 23(4), e2022GC010333. https://doi.org/10.1029/2022gc010333
Chamberlain, K. J., Lehnert, K. A., McIntosh, I. M., Morgan, D. J., & Wörner, G. (2021). Time to change the data culture in geochemistry. Nature Reviews Earth & Environment, 2(11), 737–739. https://doi.org/10.1038/s43017-021-00237-w
Chicchi, L., Bindi, L., Fanelli, D., & Tommasini, S. (2023). Frontiers of thermobarometry: GAIA, a novel deep learning-based tool for volcano plumbing systems. Earth and Planetary Science Letters, 620, 118352. https://doi.org/10.1016/j.epsl.2023.118352
Cortés, J. A., Palma, J. L., & Wilson, M. (2007). Deciphering magma mixing: The application of cluster analysis to the mineral chemistry of crystal populations. Journal of Volcanology and Geothermal Research, 165(3–4), 163–188. https://doi.org/10.1016/j.jvolgeores.2007.05.018
Costa, S., Caricchi, L., Pistolesi, M., Gioncada, A., Masotta, M., Bonadonna, C., & Rosi, M. (2023). A data driven approach to mineral chemistry unveils magmatic processes associated with long-lasting, low-intensity volcanic activity. Scientific Reports, 13(1), 1314. https://doi.org/10.1038/s41598-023-28370-0
Doucet, L. S., Tetley, M. G., Li, Z. X., Liu, Y., & Gamaleldien, H. (2022). Geochemical fingerprinting of continental and oceanic basalts: A machine learning approach. Earth-Science Reviews, 233, 104192. https://doi.org/10.1016/j.earscirev.2022.104192
Farrell, Ú. C., Samawi, R., Anjanappa, S., Klykov, R., Adeboye, O. O., Agic, H., et al. (2021). The sedimentary geochemistry and paleoenvironments project. Geobiology, 19(6), 545–556. https://doi.org/10.1111/gbi.12462
Goldstein, S. L., Hofmann, A. W., & Lehnert, K. A. (2014). Requirements for the publication of geochemical data, version 1.0. Interdisciplinary Earth Data Alliance (IEDA). Retrieved from https://www.earthchem.org
He, Y., Zhou, Y., Wen, T., Zhang, S., Huang, F., Zou, X., et al. (2022). A review of machine learning in geochemistry and cosmochemistry: Method improvements and applications. Applied Geochemistry, 140, 105273. https://doi.org/10.1016/j.apgeochem.2022.105273
Higgins, O., Sheldrake, T., & Caricchi, L. (2022). Machine learning thermobarometry and chemometry using amphibole and clinopyroxene: A window into the roots of an arc volcano (Mount Liamuiga, Saint Kitts). Contributions to Mineralogy and Petrology, 177(1), 10. https://doi.org/10.1007/s00410-021-01874-6
Jorgenson, C., Higgins, O., Petrelli, M., Bégué, F., & Caricchi, L. (2022). A machine learning-based approach to clinopyroxene thermobarometry: Model optimization and distribution for use in Earth sciences. Journal of Geophysical Research: Solid Earth, 127(4), e2021JB022904. https://doi.org/10.1029/2021JB022904
Keller, C. B., & Schoene, B. (2012). Statistical geochemistry reveals disruption in secular lithospheric evolution about 2.5 Gyr ago. Nature, 485(7399), 490–493. https://doi.org/10.1038/nature11024
Keller, C. B., Schoene, B., Barboni, M., Samperton, K. M., & Husson, J. M. (2015). Volcanic–plutonic parity and the differentiation of the continental crust. Nature, 523(7560), 301–307. https://doi.org/10.1038/nature14584
Klöcking, M., Wyborn, L., Lehnert, K., Ware, B., Prent, A., Profeta, L., et al. (2023). Community recommendations for geochemical data, services and analytical capabilities in the 21st century. Geochimica et Cosmochimica Acta, 351, 192–205. https://doi.org/10.1016/j.gca.2023.04.024
Lehnert, K., Su, Y., Langmuir, C. H., Sarbas, B., & Nohl, U. (2000). A global geochemical database structure for rocks. Geochemistry, Geophysics, Geosystems, 1(5), 1012. https://doi.org/10.1029/1999GC000026
Li, Y. E., O'Malley, D., Beroza, G., Curtis, A., & Johnson, P. (2023). Machine learning developments and applications in Solid-Earth geosciences: Fad or future? Journal of Geophysical Research: Solid Earth, 128(1), e2022JB026310. https://doi.org/10.1029/2022JB026310
Lipp, A. G., Roberts, G. G., Whittaker, A. C., Gowing, C. J. B., & Fernandes, V. M. (2021). Source region geochemistry from unmixing downstream sedimentary elemental compositions. Geochemistry, Geophysics, Geosystems, 22(10), e2021GC009838. https://doi.org/10.1029/2021GC009838
Lubbers, J., Loewen, M., Wallace, K., Coombs, M., & Addison, J. (2023). Probabilistic source classification of large tephra producing eruptions using supervised machine learning: An example from the Alaska-Aleutian arc. Geochemistry, Geophysics, Geosystems, 24(11), e2023GC011037. https://doi.org/10.1029/2023GC011037
Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., et al. (2018). Ray: A distributed framework for emerging AI applications. In A. Arpaci-Dusseau & G. Voelker (Eds.), 13th USENIX symposium on operating systems design and implementation (OSDI 18) (pp. 561–577). https://doi.org/10.48550/arXiv.1712.05889
Morrison, S., Liu, C., Eleish, A., Prabhu, A., Li, C., Ralph, J., et al. (2017). Network analysis of mineralogical systems. American Mineralogist, 102(8), 1588–1596. https://doi.org/10.2138/am-2017-6104CCBYNCND
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830. Retrieved from https://arxiv.org/abs/1201.0490v4
Petrelli, M., Caricchi, L., & Perugini, D. (2020). Machine learning thermo-barometry: Application to clinopyroxene-bearing magmas. Journal of Geophysical Research: Solid Earth, 125(9), e2020JB020130. https://doi.org/10.1029/2020JB020130
Petrelli, M., & Perugini, D. (2016). Solving petrological problems through machine learning: The study case of tectonic discrimination using geochemical and isotopic data. Contributions to Mineralogy and Petrology, 171(10), 81. https://doi.org/10.1007/s00410-016-1292-2
Prabhu, A., Morrison, S. M., Eleish, A., Zhong, H., Huang, F., Golden, J. J., et al. (2021). Global Earth Mineral Inventory: A data legacy. Geoscience Data Journal, 8(1), 74–89. https://doi.org/10.1002/gdj3.106
Ptáček, M., Dauphas, N., & Greber, N. (2020). Chemical evolution of the continental crust from a data-driven inversion of terrigenous sediment compositions. Earth and Planetary Science Letters, 539, 116090. https://doi.org/10.1016/j.epsl.2020.116090
Qin, B., Huang, F., Huang, S., Python, A., Chen, Y., & ZhangZhou, J. (2022). Machine learning investigation of clinopyroxene compositions to evaluate and predict mantle metasomatism worldwide. Journal of Geophysical Research: Solid Earth, 127(5), e2021JB023614. https://doi.org/10.1029/2021JB023614
Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J., Carvalhais, N., & Prabhat (2019). Deep learning and process understanding for data-driven Earth system science. Nature, 566(7743), 195–204. https://doi.org/10.1038/s41586-019-0912-1
Shaughnessy, A. R., Gu, X., Wen, T., & Brantley, S. L. (2021). Machine learning deciphers CO2 sequestration and subsurface flowpaths from stream chemistry. Hydrology and Earth System Sciences, 25(6), 3397–3409. https://doi.org/10.5194/hess-25-3397-2021
Stracke, A., Willig, M., Genske, F., Béguelin, P., & Todd, E. (2022). Chemical geodynamics insights from a machine learning approach. Geochemistry, Geophysics, Geosystems, 23(10), e2022GC010606. https://doi.org/10.1029/2022GC010606
Tao, K., Xu, Y., Wang, Y., Wang, Y., & He, D. (2021). Source, sink and preservation of organic matter from a machine learning approach of polar lipid tracers in sediments and soils from the Yellow River and Bohai Sea, eastern China. Chemical Geology, 582, 120441. https://doi.org/10.1016/j.chemgeo.2021.120441
Ueki, K., Hino, H., & Kuwatani, T. (2018). Geochemical discrimination and characteristics of magmatic tectonic settings: A machine-learning-based approach. Geochemistry, Geophysics, Geosystems, 19(4), 1327–1347. https://doi.org/10.1029/2017gc007401
Wang, C., Wu, Q., Huang, S., & Amin, S. (2021). Economical hyperparameter optimization with blended search strategy. In S. Mohamed & K. Hofmann (Eds.), The ninth international conference on learning representations (ICLR 2021) (pp. 1–17). Retrieved from https://www.microsoft.com/en-us/research/publication/economical-hyperparameter-optimization-with-blended-search-strategy/
Wang, C., Wu, Q., Weimer, M., & Zhu, E. E. (2021). FLAML: A fast and lightweight AutoML library. In J. Konečný (Ed.), Proceedings of the 4th MLSys conference (pp. 1–17). https://doi.org/10.48550/arXiv.1911.04706
Wen, T., Liu, M., Woda, J., Zheng, G., & Brantley, S. (2021). Detecting anomalous methane in groundwater within hydrocarbon production areas across the United States. Water Research, 200, 117236. https://doi.org/10.1016/j.watres.2021.117236
Wu, Q., Wang, C., & Huang, S. (2021). Frugal optimization for cost-related hyperparameters. arXiv:2005.01571 [cs.LG]. https://doi.org/10.48550/arXiv.2005.01571
Zaharia, M., Chen, A., Davidson, A., Ghodsi, A., Hong, S., Konwinski, A., et al. (2018). Accelerating the machine learning lifecycle with MLflow. IEEE Data Engineering Bulletin, 41(4), 39–45. Retrieved from https://api.semanticscholar.org/CorpusID:83459546
ZhangZhou, J., He, C., Sun, J., Zhao, J., Lyu, Y., Wang, S., et al. (2024). ZJUEarthData/geochemistrypi: v0.5.0 (v0.5.0) [Software]. Zenodo. https://doi.org/10.5281/zenodo.10509049
