Geochem Geophys Geosyst - 2024 - ZhangZhou - Geochemistry Automated Machine Learning Python Framework For Tabular Data
10.1029/2023GC011324
Geochemistry π: Automated Machine Learning Python Framework for Tabular Data
J. ZhangZhou1, Can He2, Jianhao Sun3, Jianming Zhao1, Yang Lyu1, Shengxin Wang4, Wenyu Zhao1, Anzhou Li1, Xiaohui Ji5, and Anant Agarwal6
1Key Laboratory of Geoscience Big Data and Deep Resource of Zhejiang Province, School of Earth Sciences, Zhejiang University, Hangzhou, China, 2School of Computing, National University of Singapore, Singapore, Singapore, 3School of Earth Sciences, China University of Geosciences, Wuhan, China, 4School of Earth Sciences, Lanzhou University, Lanzhou, China, 5School of Information Engineering, China University of Geosciences, Beijing, China, 6Department of Data Science, Nissan Motor Corporation, Yokohama, Japan
J. ZhangZhou and Can He are co‐first authors.
Key Points:
• Open‐source Python framework for machine learning applications in geochemistry
• Automated pipeline for tabular data
• Question‐and‐answer format obviates the need for coding experience
Supporting Information:
Supporting Information may be found in the online version of this article.
Correspondence to:
J. ZhangZhou and C. He, [email protected]; [email protected]
Citation:
ZhangZhou, J., He, C., Sun, J., Zhao, J., Lyu, Y., Wang, S., et al. (2024). Geochemistry π: Automated machine learning Python framework for tabular data. Geochemistry, Geophysics, Geosystems, 25, e2023GC011324. https://doi.org/10.1029/2023GC011324
Received 1 NOV 2023
Accepted 18 JAN 2024
Abstract Although machine learning (ML) has brought new insights into geochemistry research, its implementation is laborious and time‐consuming. Here, we announce Geochemistry π, an open‐source automated ML Python framework. Geochemists only need to provide tabulated data and select the desired options to clean data and run ML algorithms. The process operates in a question‐and‐answer format, and thus does not require that users have coding experience. After either automatic or manual parameter tuning, the automated Python framework provides users with performance and prediction results for the trained ML model. Based on the scikit‐learn library, Geochemistry π has established a customized automated process for implementing classification, regression, dimensionality reduction, and clustering algorithms. The Python framework enables extensibility and portability by constructing a hierarchical pipeline architecture that separates data transmission from the algorithm application. The AutoML module is constructed using the Cost‐Frugal Optimization and Blended Search Strategy hyperparameter search methods from A Fast and Lightweight AutoML Library (FLAML), and the model parameter optimization process is accelerated by the Ray distributed computing framework. The MLflow library is integrated into ML lifecycle management, which allows users to compare multiple trained models at different scales and manage the data and diagrams generated. In addition, the front‐end and back‐end frameworks are separated to build the web portal, which demonstrates the ML model and data science workflow through a user‐friendly web interface. In summary, Geochemistry π provides a Python framework for users and developers to accelerate their data mining efficiency with both online and offline operation options.
Plain Language Summary Geochemistry π is a helpful tool for scientists who work with geochemical data. One of its standout features is its simplicity. Scientists can use the tool to perform machine learning (ML) on the tabular data by answering a series of questions about what they want to discover. The tool does the rest by using advanced ML techniques to uncover insights from the data. Even scientists without coding skills can use Geochemistry π effectively. This tool is built on a reliable library called scikit‐learn, ensuring that it works well with different ML methods. It is also flexible, allowing researchers to customize it to fit their specific needs. Geochemistry π separates data processing from ML tasks, making it adaptable and expandable. It includes features for continuous training and managing the entire ML process. To prove its effectiveness, Geochemistry π was tested against previous geochemical studies in areas such as regression, classification, clustering, and dimensional reduction. The results showed that it could replicate the findings of these studies accurately. Accessible through a web portal or command line, Geochemistry π is a valuable asset for geochemists and researchers looking to analyze large geochemical data sets.
Author Contributions:
Conceptualization: J. ZhangZhou, Can He
Funding acquisition: J. ZhangZhou
Software: J. ZhangZhou, Can He, Jianhao Sun, Jianming Zhao, Yang Lyu, Wenyu Zhao, Anzhou Li, Xiaohui Ji, Anant Agarwal
Writing – original draft: J. ZhangZhou, Can He, Yang Lyu
Writing – review & editing: J. ZhangZhou, Jianming Zhao, Xiaohui Ji, Anant Agarwal
ZHANGZHOU ET AL. 1 of 14
15252027, 2024, 1, Downloaded from https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2023GC011324 by Ub Frankfurt/Main Universitaet, Wiley Online Library on [09/07/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
Geochemistry, Geophysics, Geosystems 10.1029/2023GC011324
and tephra studies (Bolton et al., 2020; Lubbers et al., 2023), hydrological studies (Shaughnessy et al., 2021; Wen
et al., 2021), magmatic processes (Boschetty et al., 2022; Cortés et al., 2007; Costa et al., 2023; Keller
et al., 2015), thermobarometry (Higgins et al., 2022; Jorgenson et al., 2022; Petrelli et al., 2020), tectonic discrimination
(Doucet et al., 2022; Petrelli & Perugini, 2016; Ueki et al., 2018), sedimentary composition reflecting
crust composition over time (Lipp et al., 2021; Ptáček et al., 2020), and long‐term secular cooling of the mantle
(Keller & Schoene, 2012).
In addition to images (e.g., geochemical maps, electron microscope images), a significant amount of important
geochemical information is stored in tabular data, such as the concentrations and speciation of chemical compounds,
elemental concentrations, and isotopic ratios. When applied to these data sets, machine learning (ML) can
reveal deep structural patterns within the data, thereby bringing new geochemical insights (Chicchi et al., 2023; He
et al., 2022; Morrison et al., 2017; Petrelli & Perugini, 2016; Prabhu et al., 2021; Qin et al., 2022; Stracke
et al., 2022; Tao et al., 2021; Wen et al., 2021). Although its use is flourishing, ML implementation is laborious and
time‐consuming for most geochemists because they must, for example, locate code from scikit‐learn (Pedregosa
et al., 2011), modify that code to fit their unique data set(s), and tune the model's hyperparameters.
Compared to the common application of ML in other geoscientific disciplines (e.g., geohazards, seismology;
Bergen et al., 2019; Li et al., 2023; Reichstein et al., 2019), geochemical data‐driven research has fallen behind. A
recent search on the Web of Science for publications from 2018 to 2022 reveals this gap, with the proportion of
publications using ML in petrology and geochemistry increasing from 0.24% to 1.35% (total publications in 2018
and 2022: 9360 and 11,101, respectively) compared to a more substantial increase from 1.48% to 6.90% in the field
of earth and planetary science (total publications in 2018 and 2022: 22,669 and 31,179, respectively). This gap can
be attributed, in part, to the absence of convenient tools designed specifically for geochemical data mining.
Here, we introduce Geochemistry π, a novel, automated, open‐source Python framework designed to simplify the
implementation of ML on tabular data, eliminating the need for prior coding experience (see Figure 1). Users
merely need to provide their data in .csv or .xlsx spreadsheet formats and make selections in a straightforward
question‐and‐answer format. Geochemistry π can seamlessly handle various tasks, including data cleaning,
standardization, regression, classification, clustering, and dimensionality reduction.
The software offers benefits to the community in two significant ways. First, for geochemists who are proficient in
coding and employing ML, Geochemistry π can serve as a benchmark for validating their results. Second, for
those geochemists lacking coding expertise, Geochemistry π lowers the entry barriers to utilizing ML effectively.
In the sections that follow, we will introduce the software from both user and developer perspectives, including
benchmark tests to replicate ML applications from prior studies.
To begin, users load their prepared data as the input data into Geochemistry π from tabular files (in either .csv or .xlsx
format) (Figure S1 in Supporting Information S1) to train the model. The software prompts users to select
which columns of data will be utilized for subsequent tasks. Users respond by specifying the range or specific
columns to be used. Geochemistry π provides descriptive statistics of the selected data, with particular emphasis
on highlighting missing values that require attention.
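The data loading and inspection step described above can be sketched with pandas. This is a minimal illustration under stated assumptions, not the framework's own code: the in-memory table, its column names, and its values are hypothetical stand-ins for a user's .csv file, and the positional column range mimics a "select columns 1 to 3" style answer.

```python
import io

import pandas as pd

# Hypothetical stand-in for a user's .csv file (illustrative values only).
# A real run would read a file from disk with pd.read_csv or pd.read_excel.
csv_text = """SiO2,TiO2,Al2O3,MgO
50.1,1.2,15.3,7.9
48.7,,14.8,8.4
51.3,0.9,,7.1
49.5,1.1,15.0,
"""
data = pd.read_csv(io.StringIO(csv_text))

# Select a range of columns by position (here the first three columns).
selected = data.iloc[:, 0:3]

# Descriptive statistics of the selected columns.
summary = selected.describe()

# Count missing values per column so they can be handled before training.
missing_counts = selected.isna().sum()

print(summary)
print(missing_counts)
```

Reporting the missing-value counts next to the descriptive statistics makes it clear which columns need imputation or removal before a model is trained.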
Geochemistry π provides four operational modes: regression, classification, clustering, and dimension reduction.
Users select their preferred mode by typing 1, 2, 3, or 4 in the software. Following the mode selection, the
software presents a list of ML algorithms from which users can choose one or all algorithms to train their ML
models (Figure S8 in Supporting Information S1).
After users upload their tabular data into Geochemistry π, a customized automated ML pipeline powered by
hyperparameter optimization solutions from a fast and lightweight AutoML library (FLAML) (Wang, Wu,
Huang, & Amin, 2021) implements data mining techniques through a question‐and‐answer format (Figure S10 in
Supporting Information S1). After that, previously trained models are automatically applied to newly provided
application data sets by means of the auto‐built transformer pipeline. Every operation performed on the data and the
correspondingly generated artifacts are logged by MLflow (Zaharia et al., 2018) and the tracking mechanism
designed in this study. Finally, users can review the results of their experiments, including descriptions of experiments
and runs, details about trained model parameters, evaluation metrics, and any generated artifacts.
In Geochemistry π, the storage mechanism (Figure S12 in Supporting Information S1) consists of two components:
the “geopi_tracking” folder and the “geopi_output” folder. MLflow uses the “geopi_tracking” folder as the
store for visualized operations in the web interface, which researchers cannot modify directly. The “geopi_output”
folder is a regular folder aligning with MLflow's storage structure, which researchers can work with directly.
Overall, this unique storage mechanism is purpose‐built to track each experiment and its corresponding runs in
order to create an organized and coherent record of researchers' scientific explorations.
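To make the idea of an experiment/run record concrete, the following sketch writes one run's parameters and metrics into a nested folder. It is only an illustration using the Python standard library: apart from the folder name "geopi_output", every name here (the `log_run` helper, the experiment and run names, the file names, and the metric value) is hypothetical, and the layout is not claimed to match the framework's or MLflow's actual on-disk structure.

```python
import json
import tempfile
from pathlib import Path


def log_run(root: Path, experiment: str, run: str, params: dict, metrics: dict) -> Path:
    """Write one run's parameters and metrics under root/experiment/run (hypothetical layout)."""
    run_dir = root / experiment / run
    (run_dir / "artifacts").mkdir(parents=True, exist_ok=True)
    (run_dir / "parameters.json").write_text(json.dumps(params, indent=2))
    (run_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))
    return run_dir


with tempfile.TemporaryDirectory() as tmp:
    output_root = Path(tmp) / "geopi_output"
    # Placeholder experiment, run, parameter, and metric values for illustration.
    run_dir = log_run(
        output_root,
        "example_experiment",
        "example_run_1",
        params={"n_estimators": 100},
        metrics={"r2": 0.9},
    )
    logged_files = sorted(p.name for p in run_dir.iterdir())
    logged_params = json.loads((run_dir / "parameters.json").read_text())

print(logged_files)
print(logged_params)
```

Keeping one folder per run means that parameters, metrics, and artifacts from different runs never overwrite each other and can be compared later.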
Geochemistry π includes an AutoML module that leverages two hyperparameter search methods, Cost‐Frugal
Optimization (CFO) and BlendSearch (Blended Search Strategy), from the FLAML library to locate the optimal
combination of hyperparameters for the selected ML algorithms at low cost. CFO is known for its efficiency in
minimizing the computational cost of hyperparameter tuning, while BlendSearch combines multiple optimization
strategies to improve the chances of finding the best hyperparameters (Wang, Wu, Weimer, & Zhu, 2021; Wu
et al., 2005). To further expedite the parameter optimization process, the AutoML module harnesses the power of
the Ray framework for distributed and parallel computing, which is particularly advantageous when dealing with
complex models or large data sets (Philipp et al., 2018).
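The general shape of a budgeted hyperparameter search can be sketched as follows. Note that this example swaps in scikit‐learn's plain randomized search (`RandomizedSearchCV`) as a stand‐in; it is not CFO or BlendSearch, it does not use FLAML or Ray, and the synthetic data and search ranges are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for a geochemical table.
X, y = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)

# Hypothetical search space; the ranges are illustrative only.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 2, 4],
}

# Random search (a stand-in for CFO/BlendSearch) with a small trial budget.
search = RandomizedSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of sampled configurations
    cv=3,            # 3-fold cross-validation per configuration
    scoring="r2",
    random_state=0,
    n_jobs=1,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Cost‐aware methods such as CFO differ from this stand‐in in how the next configuration is chosen, but the inputs (a search space, a metric, and a budget) and the output (a best configuration) have the same form.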
In the context of Geochemistry π, this design pattern acts as a blueprint for creating the diverse components of
automated ML workflows. The framework is a four‐layer hierarchical pipeline architecture that promotes the
creation of workflow objects through a set of model selection interfaces. The critical layers of this architecture are
as follows:
The data implemented by Geochemistry π derives from Table S3 in Petrelli et al. (2020). The training data set
comprises the major element compositions of melt (SiO2, TiO2, Al2O3, FeOt, MnO, MgO, CaO, Na2O, K2O,
Cr2O3, P2O5, and H2O) and clinopyroxene (SiO2, TiO2, Al2O3, FeOt, MnO, MgO, CaO, Na2O, K2O, and
Cr2O3), whereas the prediction data set includes pressure or temperature. The data were split into 70% for
training and 30% for testing. In the data pre‐processing options in Geochemistry π, we first filled missing values
with zeroes. Then, we standardized the training data to a mean of zero and a standard deviation of one.
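The pre‐processing sequence described above (zero filling, a 70/30 split, and standardization) can be sketched with scikit‐learn. The data below are synthetic and the random seeds are arbitrary; fitting the scaler on the training split and reusing it on the test split is an assumption of this sketch rather than a statement about the framework's internals.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic feature table with some missing entries (illustrative values only).
X = rng.normal(loc=10.0, scale=3.0, size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.normal(size=100)

# Step 1: fill missing values with zeroes.
X_filled = np.nan_to_num(X, nan=0.0)

# Step 2: split into 70% training and 30% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_filled, y, test_size=0.3, random_state=0
)

# Step 3: standardize with statistics learned from the training split,
# then apply the same transformation to the test split.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

After scaling, each training column has a mean of zero and a standard deviation of one, while the test columns are only approximately so because they reuse the training statistics.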
Table S2 in Supporting Information S1 reports the R² and RMSE values obtained for the six regression algorithms
using the major element compositions of clinopyroxene and melt data for pressure or temperature estimations.
Using Geochemistry π, Extra‐Trees achieved the highest R² values (0.94 and 0.95) and the lowest RMSE scores
(2.3 kbar and 40 K) when estimating pressure and temperature, respectively. These results align with those of
Petrelli et al. (2020), who also found Extra‐Trees to be the best model for the data and obtained comparable
respective R² values of 0.91 and 0.94 and RMSE values of 2.6 kbar and 40 K (Table S2 in Supporting
Information S1).
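For readers unfamiliar with the two scores, the sketch below trains an Extra‐Trees regressor and computes R² and RMSE on a held‐out test split. The data are synthetic, so the printed numbers are unrelated to the values in Table S2; the hyperparameters shown are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for a thermobarometry table.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# R2: fraction of the target variance explained by the predictions.
r2 = r2_score(y_test, y_pred)
# RMSE: root of the mean squared prediction error, in the target's own units.
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))

print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

Because RMSE carries the unit of the target, it is reported in kbar for pressure and in K for temperature, whereas R² is unitless.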
Figure 3. Regression results from (a–c) Petrelli et al. (2020) and (d–f) this study, produced using Geochemistry π in automated mode. Results are plotted using the (a, d) training, (b, e) test, and (c, f) validation data sets, respectively.
Figure 3 presents a comparative analysis between the Extra‐Trees regression presented by Petrelli et al. (2020) and that of
Geochemistry π. This analysis focuses on a thermometer model and utilizes training, test, and validation data sets
provided by Petrelli et al. (2020). The training and test data sets are employed for model training, while the
validation data set corresponds to the application data defined within the Geochemistry π framework. It is worth
noting that the validation data set in this context contains known temperature values. The automated hyperparameter
tuning offered by Geochemistry π demonstrates comparable model performance to that achieved
through manual tuning by Petrelli et al. (2020), albeit with different hyperparameters (Figures 3a, 3b, 3d, and 3e).
However, the validation data results obtained by Petrelli et al. (2020) outperform those achieved using
Geochemistry π (Figures 3c and 3f). To understand the underlying reasons for this discrepancy, we examined
the Code Documentation in Supporting Information S1 and found that Petrelli et al. (2020)
provided values for four specific hyperparameters (namely, n_estimators, criterion, max_features, and random_state),
while the remaining hyperparameters employed default values from the scikit‐learn library. In
contrast, Geochemistry π offers flexibility in modifying a more extensive set of hyperparameters (including
n_estimators, max_depth, max_features, min_samples_split, min_samples_leaf, bootstrap, max_samples, and
oob_score) instead of relying solely on default values from scikit‐learn. This divergence in hyperparameter
configurations implies that models with different hyperparameters may exhibit similar overall performance but
potentially yield variations in their predictive capabilities.
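The distinction between explicitly set hyperparameters and those left at library defaults can be inspected directly in scikit‐learn. In the sketch below, the four values in `explicit` are hypothetical placeholders and are not the settings used by Petrelli et al. (2020); the defaults that are printed depend on the installed scikit‐learn version.

```python
from sklearn.ensemble import ExtraTreesRegressor

# Placeholder values for the four hyperparameters named in the text
# (hypothetical; not the published settings).
explicit = {
    "n_estimators": 300,
    "criterion": "squared_error",
    "max_features": 0.5,
    "random_state": 42,
}

defaults = ExtraTreesRegressor().get_params()
configured = ExtraTreesRegressor(**explicit).get_params()

# Hyperparameters that were never passed keep the library default.
left_at_default = sorted(name for name in configured if name not in explicit)
# Hyperparameters whose configured value differs from the library default.
changed = sorted(name for name in configured if configured[name] != defaults[name])

print("left at default:", left_at_default)
print("differ from default:", changed)
```

Listing the untouched hyperparameters in this way shows how many settings (for example max_depth, min_samples_leaf, and bootstrap) silently follow the library version when only a few values are specified.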
Figure 4. Classification results from (a, d) Qin et al. (2022) and (b, c, e, and f) generated in this study using Geochemistry π in automated mode. Missing values are handled differently: (a, b, d, and e) as null and (c, f) as mean values. The hyperparameters are summarized in Table S4 in Supporting Information S1.
We executed nine variations of the XGBoost algorithm, each using distinct sets of features in the training data:
MajorI‐9, comprising all nine major elements; MajorI‐4, comprising four major elements (Al2O3, FeOT, CaO,
MgO); MajorI‐2a, comprising two major elements (CaO, Al2O3); MajorI‐2b, comprising two alternate major
elements (FeOT, MgO); TraceI‐30, comprising all 30 trace elements; TraceI‐25, comprising all trace
elements except Rb, Sr, Ba, Pb, and U; TraceI‐4, comprising four trace elements (La, Yb, Eu, Ti); TraceI‐2a,
comprising two trace elements (Eu, Ti); and TraceI‐2b, comprising two alternate trace elements (La, Yb). The
prediction data were from the column “TRUE_VALUE.” The input data sets were divided into 70% for training
and 30% for validation. Qin et al. (2022) did not process the missing values and applied XGBoost directly,
with the confusion matrix results presented in Figure 4a. As part of the data pre‐processing in Geochemistry π,
missing values were either treated in the same way as by Qin et al. (2022) (results in Figures 4b and 4e) or imputed using the
mean value of the element across all analyses (results in Figures 4c and 4f). The slightly better performance with
mean value imputation might reflect a better representation of the data structure given their true but not analyzed
values.
Table S3 in Supporting Information S1 reports the F1 and Accuracy scores of the XGBoost classification algorithms
by training data set. Across the nine models implemented in Geochemistry π, the F1 and Accuracy
scores for the training sets span the ranges of 0.81–0.96 and 0.76–0.96, respectively, closely paralleling the results
of Qin et al. (2022; 0.85–0.97 and 0.82–0.97, respectively). When applied to the testing sets, Geochemistry π
attained F1 and Accuracy scores of 0.81–0.96 and 0.76–0.95, respectively, again aligning closely with the results
of Qin et al. (2022; 0.82–0.97 and 0.77–0.95, respectively).
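As a reminder of how the two scores are defined, the following sketch computes Accuracy and F1 for a small hand‐made binary example. The labels are invented for illustration, and the binary F1 used here is an assumption: the averaging scheme behind the multi‐class scores in Table S3 is not stated in this section.

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented labels: 8 samples, with class 1 treated as the positive class.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

# Accuracy: 6 of the 8 predictions match the true labels.
accuracy = accuracy_score(y_true, y_pred)

# F1: harmonic mean of precision (4/5) and recall (4/5) for class 1.
f1 = f1_score(y_true, y_pred)

print(accuracy)        # 0.75
print(round(f1, 3))    # 0.8
```

Accuracy counts all correct predictions equally, whereas F1 balances false positives against false negatives, which is why the two scores can differ for the same model.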
Figure 5. t‐SNE “maps” depicting 5D MORB‐OIB isotopic data from Stracke et al. (2022). These maps were generated with a
perplexity value of 48 and utilized an agglomerative clustering approach (n_clusters = 16, linkage = single). Panels (a, c)
illustrate results obtained using distinct random state values, namely (a) none and (c) 42, as generated by the Jupyter
Notebook from Stracke et al. (2022). Panels (b, d) display t‐SNE “maps” corresponding to the same two random state values,
namely (b) none and (d) 42, as produced by Geochemistry π.
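The settings quoted in the caption of Figure 5 can be tried on synthetic data with scikit‐learn. This is only a sketch under stated assumptions: the 5D data are random numbers rather than isotopic ratios, the clustering is applied to the 5D data rather than to the embedding (the caption does not specify which), and `init="random"` is set explicitly so that the behavior does not depend on the library version's default.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic 5D data standing in for the isotopic table (illustrative only).
X = rng.normal(size=(200, 5))


def embed(random_state):
    """Project X to 2D with t-SNE using a perplexity of 48."""
    tsne = TSNE(n_components=2, perplexity=48, init="random", random_state=random_state)
    return tsne.fit_transform(X)


map_fixed = embed(42)     # fixed seed
map_free = embed(None)    # seed drawn afresh on each call

# Agglomerative clustering with the settings quoted in the caption.
clustering = AgglomerativeClustering(n_clusters=16, linkage="single")
labels = clustering.fit_predict(X)

print(map_fixed.shape, map_free.shape, len(set(labels)))
```

A fixed random state is intended to make repeated maps comparable, whereas leaving it unset allows each run to start from a different initialization and therefore to produce a differently arranged map.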
Figure 6. Dimensional reduction maps of total polar lipid biomarker compounds, illustrating the difference among three habitats of soil and sediment samples in the Yellow River and Bohai Sea. Results are shown from (a–c) Tao et al. (2021) by R Studio and (d–i) produced in this study using Geochemistry π based on Python using manual (d–f) or automated mode (g–i). The axes are unitless and represent the higher‐dimensional structural data relationships. (a, d, and g) Principal component analysis, (b, e, and h) nonmetric multidimensional scaling (nMDS), and (c, f, and i) t‐distributed stochastic neighbor embedding (t‐SNE). The hyperparameters are summarized in Table S5 in Supporting Information S1.
Figures 6a–6c depict the outcomes of the three models utilizing the data set from Table S1 of Tao et al. (2021).
The principal component analysis (PCA) representation using Geochemistry π exhibits a distribution akin to that
of Tao et al. (2021), with a simple 90‐degree rotation (Figure 6d). The Dim2 values are largely negative in our
results, whereas they are largely positive in those of Tao et al. (2021). According to Tao et al. (2021), compared to
linear approaches like PCA, non‐linear dimensional reduction techniques such as t‐SNE and nMDS have superior
capacities for unveiling environmental sample classifications (Figures 6b and 6c). Nonetheless, the overall
effectiveness in distinguishing sample habitats remains limited, which can be attributed to the varied sources and
intricate secondary changes of the polar lipid biomarker components during transplantation and their contributions
to the observed data structure. Our benchmark tests indicated notable disparities in results when comparing
Python and R Studio (Figure 6). Additionally, varying hyperparameters led to different outcomes in t‐SNE and
nMDS analyses (as evidenced in Figures 6e, 6f, 6h, and 6i).
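A difference in the sign of Dim2 between two PCA implementations is at least consistent with a general property of PCA: each principal axis is defined only up to its sign. The sketch below illustrates this property on synthetic data; it does not claim to reproduce the cause of the difference seen in Figure 6, and all data and names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic correlated data standing in for a compositional table.
X = rng.normal(size=(60, 6)) @ rng.normal(size=(6, 6))

scores = PCA(n_components=2).fit_transform(X)

# Flip the sign of the second axis, as a different implementation might.
flipped = scores * np.array([1.0, -1.0])


def pairwise_distances(Z):
    """Euclidean distance between every pair of rows in Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# The map is mirrored, but distances between samples and the variance
# along each axis are unchanged.
same_geometry = np.allclose(pairwise_distances(scores), pairwise_distances(flipped))
same_variance = np.allclose(scores.var(axis=0), flipped.var(axis=0))

print(same_geometry, same_variance)
```

Because sample‐to‐sample distances are preserved, mirrored PCA maps carry the same information even though they look different at first glance.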
We speculate that the discrepancy between R Studio and Geochemistry π Python codes is due to the different
functions or parameters in the models. Geochemistry π utilizes scikit‐learn version 1.13. There is a notable
distinction in how the t‐SNE algorithm's initialization parameter is handled in R Studio compared to scikit‐learn.
R Studio defaults to using “PCA” for initialization, while scikit‐learn uses a “random” initialization method by
default. It's important to point out that when employing “PCA” for initialization in t‐SNE, it's not feasible to apply
it to precomputed distances. This is a significant factor to consider during data preparation and algorithm
configuration. Furthermore, when it comes to nMDS, R Studio demonstrates flexibility by accepting non‐square
matrices directly as inputs. In contrast, Python's implementation necessitates an additional step where non‐square
matrices must first be converted into square matrices before they can be processed.
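Both constraints mentioned above can be checked in scikit‐learn and SciPy. The sketch below converts a non‐square samples‐by‐features table into a square distance matrix and then shows that t‐SNE accepts precomputed distances with random initialization but rejects PCA initialization. The synthetic data, the Euclidean metric, and the perplexity value are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Non-square table: 40 samples by 6 features (synthetic values).
X = rng.random(size=(40, 6))

# Convert the table into a square, symmetric sample-by-sample distance matrix.
D = squareform(pdist(X, metric="euclidean"))

# Random initialization works with precomputed distances.
tsne_random = TSNE(
    n_components=2, perplexity=10, init="random", metric="precomputed", random_state=0
)
embedding = tsne_random.fit_transform(D)

# PCA initialization is rejected when distances are precomputed.
tsne_pca = TSNE(
    n_components=2, perplexity=10, init="pca", metric="precomputed", random_state=0
)
try:
    tsne_pca.fit_transform(D)
    pca_init_allowed = True
except ValueError as err:
    pca_init_allowed = False
    print(err)

print(D.shape, embedding.shape, pca_init_allowed)
```

The same square matrix `D` is the form of input expected by precomputed‐dissimilarity nMDS in Python, which is the extra conversion step referred to in the text.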
5. Conclusions
Geochemistry π provides an open‐source, user‐friendly Python framework for conducting ML with minimal
coding requirements. Accessible via a web portal or as a command‐line executable, Geochemistry π streamlines
ML tasks by simplifying data input and ML model selection through intuitive question‐and‐answer prompts.
Users can effortlessly load data in .csv or .xlsx spreadsheet formats and instruct Geochemistry π to automatically
train ML models and generate predictions.
Built upon the scikit‐learn library, Geochemistry π is intentionally designed for extensibility and portability,
ensuring its versatility in diverse applications. The framework adopts a hierarchical pipeline architecture,
effectively separating data processing from algorithm implementation. With two AutoML cores, continuous
training capabilities, and comprehensive ML lifecycle management, Geochemistry π offers a robust platform for
conducting geochemical ML experiments.
Benchmarking tests conducted in this study largely reproduced results from previous geochemical ML research,
encompassing regression, classification, clustering, and dimensional reduction tasks. While Geochemistry π
demonstrated consistency with prior studies in most cases, some discrepancies arose, possibly attributable to
model function and hyperparameters. Consequently, we recommend Geochemistry π as a valuable benchmarking
tool capable of producing consistent results while allowing users to save hyperparameters in spreadsheets for
reference.
Geochemistry π is open‐source software, affording users the transparency to inspect its source code. Its
advantages include providing a benchmarking tool for users experienced in coding and lowering the entry barrier
for ML applications among non‐coders. However, users are encouraged to acquire a fundamental understanding
of ML principles to interpret results effectively. Some limitations of Geochemistry π include less flexibility
compared to direct coding with Python and scikit‐learn, as well as the absence of certain functionalities available
in other coding language packages. Additionally, computational efficiency may be on par with or lower than
using Python directly. As we continue to develop Geochemistry π, we remain committed to addressing user
feedback and improving the software. We welcome contributions from those interested in enhancing this open‐
source tool.
Acknowledgments References
The authors appreciate the discussions
with Yu Qi and Tien Nguyen. We express Bergen, K., Johnson, P., Hoop, M., & Beroza, G. (2019). Machine learning for data‐driven discovery in solid Earth geoscience. Science,
our gratitude to Maurizio Petrelli, Ben Qin, 363(6433), eaau0323. https://doi.org/10.1126/science.aau0323
Andrew Stracke, and Ke‐yu Tao for Bolton, M. S., Jensen, B. J., Wallace, K., Praet, N., Fortin, D., Kaufman, D., & De Batist, M. (2020). Machine learning classifiers for attributing
tephra to source volcanoes: An evaluation of methods for Alaska tephras. Journal of Quaternary Science, 35(1–2), 81–92. https://doi.org/10.1002/jqs.3170
generously providing both the data and the hyperparameters necessary to replicate their previously published results. We thank Robert Dennen for polishing the language of the paper. J. ZhangZhou acknowledges support from the Fundamental Research Funds for the Central Universities K20220232 and NSFC (Grant 42072066). The authors acknowledge the cloud computing resources provided by the Deep‐time Digital Earth (DDE) science program, which has been instrumental in supporting the web‐based version of our software.
Boschetty, F. O., Ferguson, D. J., Cortés, J. A., Morgado, E., Ebmeier, S. K., Morgan, D. J., et al. (2022). Insights into magma storage beneath a frequently erupting arc volcano (Villarrica, Chile) from unsupervised machine learning analysis of mineral compositions. Geochemistry, Geophysics, Geosystems, 23(4), e2022GC010333. https://doi.org/10.1029/2022gc010333
Chamberlain, K. J., Lehnert, K. A., McIntosh, I. M., Morgan, D. J., & Wörner, G. (2021). Time to change the data culture in geochemistry. Nature Reviews Earth & Environment, 2(11), 737–739. https://doi.org/10.1038/s43017-021-00237-w
Chicchi, L., Bindi, L., Fanelli, D., & Tommasini, S. (2023). Frontiers of thermobarometry: GAIA, a novel deep learning‐based tool for volcano plumbing systems. Earth and Planetary Science Letters, 620, 118352. https://doi.org/10.1016/j.epsl.2023.118352
Cortés, J. A., Palma, J. L., & Wilson, M. (2007). Deciphering magma mixing: The application of cluster analysis to the mineral chemistry of crystal populations. Journal of Volcanology and Geothermal Research, 165(3–4), 163–188. https://doi.org/10.1016/j.jvolgeores.2007.05.018
Costa, S., Caricchi, L., Pistolesi, M., Gioncada, A., Masotta, M., Bonadonna, C., & Rosi, M. (2023). A data driven approach to mineral chemistry unveils magmatic processes associated with long‐lasting, low‐intensity volcanic activity. Scientific Reports, 13(1), 1314. https://doi.org/10.1038/s41598-023-28370-0
Doucet, L. S., Tetley, M. G., Li, Z. X., Liu, Y., & Gamaleldien, H. (2022). Geochemical fingerprinting of continental and oceanic basalts: A machine learning approach. Earth‐Science Reviews, 233, 104192. https://doi.org/10.1016/j.earscirev.2022.104192
Farrell, Ú. C., Samawi, R., Anjanappa, S., Klykov, R., Adeboye, O. O., Agic, H., et al. (2021). The sedimentary geochemistry and paleoenvironments project. Geobiology, 19(6), 545–556. https://doi.org/10.1111/gbi.12462
Goldstein, S. L., Hofmann, A. W., & Lehnert, K. A. (2014). Requirements for the publication of geochemical data, version 1.0. Interdisciplinary Earth Data Alliance (IEDA). Retrieved from https://www.earthchem.org
He, Y., Zhou, Y., Wen, T., Zhang, S., Huang, F., Zou, X., et al. (2022). A review of machine learning in geochemistry and cosmochemistry: Method improvements and applications. Applied Geochemistry, 140, 105273. https://doi.org/10.1016/j.apgeochem.2022.105273
Higgins, O., Sheldrake, T., & Caricchi, L. (2022). Machine learning thermobarometry and chemometry using amphibole and clinopyroxene: A window into the roots of an arc volcano (Mount Liamuiga, Saint Kitts). Contributions to Mineralogy and Petrology, 177(1), 10. https://doi.org/10.1007/s00410-021-01874-6
Jorgenson, C., Higgins, O., Petrelli, M., Bégué, F., & Caricchi, L. (2022). A machine learning‐based approach to clinopyroxene thermobarometry: Model optimization and distribution for use in Earth sciences. Journal of Geophysical Research: Solid Earth, 127(4), e2021JB022904. https://doi.org/10.1029/2021JB022904
Keller, C. B., & Schoene, B. (2012). Statistical geochemistry reveals disruption in secular lithospheric evolution about 2.5 Gyr ago. Nature, 485(7399), 490–493. https://doi.org/10.1038/nature11024
Keller, C. B., Schoene, B., Barboni, M., Samperton, K. M., & Husson, J. M. (2015). Volcanic–plutonic parity and the differentiation of the continental crust. Nature, 523(7560), 301–307. https://doi.org/10.1038/nature14584
Klöcking, M., Wyborn, L., Lehnert, K., Ware, B., Prent, A., Profeta, L., et al. (2023). Community recommendations for geochemical data, services and analytical capabilities in the 21st century. Geochimica et Cosmochimica Acta, 351, 192–205. https://doi.org/10.1016/j.gca.2023.04.024
Lehnert, K., Su, Y., Langmuir, C. H., Sarbas, B., & Nohl, U. (2000). A global geochemical database structure for rocks. Geochemistry, Geophysics, Geosystems, 1(5), 1012. https://doi.org/10.1029/1999GC000026
Li, Y. E., O’Malley, D., Beroza, G., Curtis, A., & Johnson, P. (2023). Machine learning developments and applications in Solid‐Earth geosciences: Fad or future? Journal of Geophysical Research: Solid Earth, 128(1), e2022JB026310. https://doi.org/10.1029/2022JB026310
Lipp, A. G., Roberts, G. G., Whittaker, A. C., Gowing, C. J. B., & Fernandes, V. M. (2021). Source region geochemistry from unmixing downstream sedimentary elemental compositions. Geochemistry, Geophysics, Geosystems, 22(10), e2021GC009838. https://doi.org/10.1029/2021GC009838
Lubbers, J., Loewen, M., Wallace, K., Coombs, M., & Addison, J. (2023). Probabilistic source classification of large tephra producing eruptions using supervised machine learning: An example from the Alaska‐Aleutian arc. Geochemistry, Geophysics, Geosystems, 24(11), e2023GC011037. https://doi.org/10.1029/2023GC011037
Morrison, S., Liu, C., Eleish, A., Prabhu, A., Li, C., Ralph, J., et al. (2017). Network analysis of mineralogical systems. American Mineralogist, 102(8), 1588–1596. https://doi.org/10.2138/am-2017-6104CCBYNCND
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit‐learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830. Retrieved from https://arxiv.org/abs/1201.0490v4
Petrelli, M., Caricchi, L., & Perugini, D. (2020). Machine learning thermo‐barometry: Application to clinopyroxene‐bearing magmas. Journal of Geophysical Research: Solid Earth, 125(9), e2020JB020130. https://doi.org/10.1029/2020JB020130
Petrelli, M., & Perugini, D. (2016). Solving petrological problems through machine learning: The study case of tectonic discrimination using geochemical and isotopic data. Contributions to Mineralogy and Petrology, 171(10), 81. https://doi.org/10.1007/s00410-016-1292-2
Philipp, M., Robert, N., Stephanie, W., Alexey, T., Richard, L., Eric, L., et al. (2018). Ray: A distributed framework for emerging AI applications. In A. Arpaci‐Dusseau & G. Voelker (Eds.), 13th USENIX symposium on operating systems design and implementation (OSDI 18) (pp. 561–577). https://doi.org/10.48550/arXiv.1712.05889
Prabhu, A., Morrison, S. M., Eleish, A., Zhong, H., Huang, F., Golden, J. J., et al. (2021). Global earth mineral inventory: A data legacy. Geoscience Data Journal, 8(1), 74–89. https://doi.org/10.1002/gdj3.106
Ptáček, M., Dauphas, N., & Greber, N. (2020). Chemical evolution of the continental crust from a data‐driven inversion of terrigenous sediment compositions. Earth and Planetary Science Letters, 539, 116090. https://doi.org/10.1016/j.epsl.2020.116090
Qin, B., Huang, F., Huang, S., Python, A., Chen, Y., & ZhangZhou, J. (2022). Machine learning investigation of clinopyroxene compositions to evaluate and predict mantle metasomatism worldwide. Journal of Geophysical Research: Solid Earth, 127(5), e2021JB023614. https://doi.org/10.1029/2021JB023614
Reichstein, M., Camps‐Valls, G., Stevens, B., Jung, M., Denzler, J., Carvalhais, N., & Prabhat (2019). Deep learning and process understanding for data‐driven Earth system science. Nature, 566(7743), 195–204. https://doi.org/10.1038/s41586-019-0912-1
Shaughnessy, A. R., Gu, X., Wen, T., & Brantley, S. L. (2021). Machine learning deciphers CO2 sequestration and subsurface flowpaths from stream chemistry. Hydrology and Earth System Sciences, 25(6), 3397–3409. https://doi.org/10.5194/hess-25-3397-2021
Stracke, A., Willig, M., Genske, F., Béguelin, P., & Todd, E. (2022). Chemical geodynamics insights from a machine learning approach. Geochemistry, Geophysics, Geosystems, 23(10), e2022GC010606. https://doi.org/10.1029/2022GC010606
Tao, K., Xu, Y., Wang, Y., Wang, Y., & He, D. (2021). Source, sink and preservation of organic matter from a machine learning approach of polar lipid tracers in sediments and soils from the Yellow River and Bohai Sea, eastern China. Chemical Geology, 582, 120441. https://doi.org/10.1016/j.chemgeo.2021.120441
Ueki, K., Hino, H., & Kuwatani, T. (2018). Geochemical discrimination and characteristics of magmatic tectonic settings: A machine‐learning‐based approach. Geochemistry, Geophysics, Geosystems, 19(4), 1327–1347. https://doi.org/10.1029/2017gc007401
Wang, C., Wu, Q., Huang, S., & Saied, A. (2021). Economical hyperparameter optimization with blended search strategy. In S. Mohamed & K. Hofmann (Eds.), The ninth international conference on learning representations (ICLR 2021) (pp. 1–17). Retrieved from https://www.microsoft.com/en-us/research/publication/economical-hyperparameter-optimization-with-blended-search-strategy/
Wang, C., Wu, Q., Weimer, M., & Zhu, E. E. (2021). FLAML: A fast and lightweight AutoML library. In J. Konečný (Ed.), Proceedings of the 4th MLSys conference (pp. 1–17). https://doi.org/10.48550/arXiv.1911.04706
Wen, T., Liu, M., Woda, J., Zheng, G., & Brantley, S. (2021). Detecting anomalous methane in groundwater within hydrocarbon production areas across the United States. Water Research, 200, 117236. https://doi.org/10.1016/j.watres.2021.117236
Wu, Q., Wang, C., & Huang, S. (2020). Frugal optimization for cost‐related hyperparameters. arXiv:2005.01571 [cs.LG]. https://doi.org/10.48550/arXiv.2005.01571
Zaharia, M., Chen, A., Davidson, A., Ghodsi, A., Hong, S., Konwinski, A., et al. (2018). Accelerating the machine learning lifecycle with MLflow. IEEE Data Engineering Bulletin, 41(4), 39–45. Retrieved from https://api.semanticscholar.org/CorpusID:83459546
ZhangZhou, J., He, C., Sun, J., Zhao, J., Lyu, Y., Wang, S., et al. (2024). ZJUEarthData/geochemistrypi: v0.5.0 (v0.5.0) [Software]. Zenodo. https://doi.org/10.5281/zenodo.10509049