CalCOFI Machine Learning Model
Abstract—Monitoring chlorophyll concentrations is essential for understanding the health of marine ecosystems and the changes occurring in them. This research aimed to develop a machine learning model for predicting chlorophyll concentrations in oceanic waters, using data obtained from the CalCOFI bottle surveys. The algorithm estimates chlorophyll concentrations from environmental parameters such as temperature, salinity, and nutrient levels. Data cleansing procedures were combined with regression methods, feature engineering, and hyperparameter tuning to train the model and reduce errors, reaching an R² value of 0.7889. This method has considerable potential for monitoring marine ecosystems, providing an effective instrument for real-time chlorophyll level detection. By improving the monitoring of phytoplankton health, the model can help identify ecological imbalances promptly, facilitate sustainable fisheries management, and bolster conservation initiatives. The approach is applicable to many locales and datasets, making it a useful asset for worldwide maritime monitoring initiatives. The results illustrate the capability of data-driven models in advancing marine science and environmental conservation.
Index Terms—chlorophyll detection, marine ecosystem monitoring, machine learning, CalCOFI bottle data, oceanographic measures

I. INTRODUCTION
Marine environments are essential for sustaining ecological equilibrium and fostering global biodiversity. Chlorophyll level is a crucial indicator of phytoplankton biomass and primary productivity, vital for assessing marine health. Precise monitoring of chlorophyll concentrations is crucial for identifying biological alterations, such as harmful algal blooms, which can damage marine ecosystems and fisheries [1]. Additionally, chlorophyll data help in evaluating the effects of climate change on ocean productivity, facilitating informed decision-making for conservation initiatives [2].
Conventional techniques for assessing chlorophyll concentrations, including satellite remote sensing and water sampling, have demonstrated efficacy but are frequently constrained by spatial and temporal resolution or practical limitations [3]. Recent advances in machine learning have shown the ability to improve on traditional methods by utilizing large datasets and recognizing intricate patterns within environmental variables [4]. Machine learning algorithms have shown notable efficacy in handling non-linear interactions among physical and chemical parameters, making them suitable for forecasting chlorophyll concentrations [5].
This research aims to create a machine learning model for predicting chlorophyll concentrations with data from the California Cooperative Oceanic Fisheries Investigations (CalCOFI) bottle surveys. The CalCOFI dataset, which includes variables such as temperature, salinity, and nutrient concentrations, provides a valuable resource for analyzing the determinants of chlorophyll dynamics. We aim to achieve high prediction accuracy and reliability through the integration of data preprocessing, cross-validation, feature engineering, and hyperparameter tuning. This method enhances real-time monitoring of marine ecosystems and supports sustainable fisheries management and conservation initiatives.

II. LITERATURE REVIEW
Chlorophyll-a (Chl-a) is a key pigment found in phytoplankton and plays a crucial role in photosynthesis. Monitoring Chl-a concentration is essential for understanding water quality, assessing marine ecosystem health, and detecting harmful algal blooms. Traditional methods for Chl-a estimation, such as direct sampling and satellite-based remote sensing, often face limitations due to environmental factors like cloud cover, atmospheric interference, and the inability to capture data
during polar nights. In recent years, machine learning (ML) techniques have gained significant attention as they offer the potential to overcome these challenges by leveraging large datasets and improving the accuracy and efficiency of Chl-a predictions across various water bodies.
Machine learning models have shown remarkable potential in predicting Chl-a concentrations by integrating diverse datasets and identifying complex patterns that traditional empirical methods struggle to capture. Madani et al. (2024) introduced a machine learning-based approach to generate a continuous solar-induced chlorophyll fluorescence (SIF) dataset for the Arctic Ocean [6]. They employed Random Forest models trained on environmental parameters such as Chl-a concentration, sea surface temperature (SST), and sea surface salinity (SSS), effectively extending the SIF dataset back to 2004. This study provided valuable insights into phytoplankton activity and its responses to changing climatic conditions. Similarly, Chusnah et al. (2023) utilized multi-satellite imagery to develop high-resolution models for Chl-a concentration estimation in inland water bodies [7]. Their approach combined Sentinel-3 OLCI data with Sentinel-2 MSI imagery using Random Forest algorithms to enhance spatial resolution and achieve high prediction accuracy, with R² values of 0.873 and 0.822. These findings highlight the potential of integrating multiple data sources for more precise and reliable Chl-a estimation.
Deep learning has emerged as a powerful tool for Chl-a estimation, capable of capturing intricate spatial and temporal patterns. Yao et al. (2023) explored deep learning models such as ConvLSTM, CNN-LSTM, and Self-Attention ConvLSTM (SA-ConvLSTM) to forecast Chl-a levels in the Yellow Sea and Bohai Sea [8]. The SA-ConvLSTM model achieved the highest accuracy with a Pearson correlation coefficient of 0.887, demonstrating its ability to account for dynamic oceanographic changes. Zeng et al. (2023) proposed a hybrid model that combines a 1D CNN for feature extraction with Support Vector Regression (SVR) for prediction, yielding an R² value of 0.892 [9]. These studies underscore the significance of deep learning models in improving the precision of Chl-a predictions.
Combining different machine learning approaches has been shown to enhance prediction accuracy further. A study focusing on the Venice Lagoon integrated Random Forest and Multi-Layer Perceptron (MLP) models with the SHYFEM-BFM biogeochemical framework, offering valuable insights into how Chl-a concentrations might evolve under different climate change scenarios [10]. The hybrid approach allowed for improved long-term forecasting by combining data-driven insights with process-based models. In addition, optimized ensemble models such as Random Forest, Gradient Boosting, and Extra Trees have been used to predict phytoplankton absorption coefficients across different wavelengths. The Extra Trees model, in particular, demonstrated exceptional accuracy with an R² value of 0.9033 at 510 nm, showcasing the effectiveness of ensemble learning techniques in capturing complex ecological interactions [11].
The integration of satellite data with machine learning models has significantly improved Chl-a estimation by combining multiple data sources to enhance spatial and temporal resolution. Researchers studying the Barents Sea applied Sentinel-2 MSI data with in situ measurements, using a neural network model called Ocean Color Network (OCN). This approach achieved a 51.7% reduction in errors compared to traditional empirical methods, demonstrating the potential of data fusion techniques for more accurate and robust monitoring [12]. Another notable study employed MODIS/Aqua data and several ML models, including Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost). The Differential Evolution-based SVM (DE-SVM) model outperformed conventional methods, achieving an R² value of 0.926 [13]. These advancements underscore the importance of combining satellite observations with ML algorithms to enhance the reliability of Chl-a monitoring across diverse aquatic environments.
While machine learning has significantly advanced Chl-a estimation, several challenges remain. One of the primary issues is the scarcity of high-quality in situ data, which is crucial for training and validating ML models. Additionally, the generalization of models across different geographic regions and varying environmental conditions presents a significant hurdle. The lack of explainability and interpretability in complex ML models also raises concerns about their widespread adoption in environmental management.
Future research should focus on developing more interpretable ML models, integrating additional environmental variables such as nutrient levels and ocean currents, and expanding datasets to enhance the robustness of predictions. The incorporation of advanced techniques like transfer learning and domain adaptation could further improve the adaptability of models to new regions and changing environmental conditions. To our knowledge, no machine learning model has yet used CalCOFI bottle data to predict chlorophyll levels. We therefore built a regression model to predict chlorophyll levels accurately, which can help in real-time monitoring of marine ecosystems. The application of machine learning in Chl-a estimation has transformed the field by providing accurate, scalable, and cost-effective solutions for monitoring marine ecosystems. The integration of satellite data, deep learning techniques, and hybrid modeling approaches has significantly improved prediction accuracy and resolution. As technology continues to evolve, machine learning will play a crucial role in advancing environmental monitoring and informing policy decisions aimed at preserving aquatic ecosystems.

III. METHODOLOGY
This part of the paper gives a detailed overview of the steps taken to build the models, including data collection, data preprocessing, exploratory data analysis, model training, feature engineering, cross-validation, ensemble modeling, and hyperparameter tuning. Several tools were used during the model-building process. Figure 1 shows the proposed methodology of the study. Data manipulation and analysis were done using Pandas [14] while Seaborn [15] and Matplotlib [16] helped
to visualize the data. Scikit-learn [17] was used to develop machine learning models and Optuna [18] was applied for hyperparameter optimization. Computations were run on Google Colab [19] and Kaggle [20].
[...] 247,000. Subsequently, physical attributes such as temperature, salinity, and depth, along with chemical components like O2ml_L, STheta, O2Sat, PO4uM, SiO3uM, and NO3uM, and Sta_ID (to track location), were selected for analysis.
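The attribute selection described above can be sketched in Pandas as follows. This is a minimal illustration, not the authors' code: the helper name `select_attributes`, the inclusion of ChlorA as the target column, and the tiny synthetic frame (with placeholder columns `extra_a` and `extra_b`) are assumptions made only for the example.

```python
import pandas as pd

# Attribute names follow the CalCOFI bottle table as cited in the text.
# ChlorA is included here on the assumption that it is the prediction target.
SELECTED = ["ChlorA", "Depthm", "T_degC", "Salnty", "O2ml_L", "STheta",
            "O2Sat", "PO4uM", "SiO3uM", "NO3uM", "Sta_ID"]


def select_attributes(bottle: pd.DataFrame) -> pd.DataFrame:
    """Keep only the attributes chosen for analysis, in a fixed order."""
    missing = [c for c in SELECTED if c not in bottle.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return bottle.loc[:, SELECTED].copy()


# Synthetic stand-in data (not real survey values); extra_a and extra_b
# are hypothetical columns that the selection step should discard.
demo = pd.DataFrame({name: [0.0, 1.0] for name in SELECTED + ["extra_a", "extra_b"]})
demo["Sta_ID"] = ["station-1", "station-2"]

subset = select_attributes(demo)
print(subset.columns.tolist())
```

Raising an error on a missing column, rather than silently skipping it, makes a renamed or misspelled attribute visible before any modeling starts.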
During preprocessing, it was observed that several attributes
contained missing values. Attributes with 5–10% or 10–20%
missing data were handled using techniques such as K-Nearest
Neighbors (KNN), mean imputation, and interpolation. How-
ever, PO4uM, SiO3uM, and NO3uM had more than 50%
missing values and were subsequently removed, as imputing
such a large proportion of missing data would not yield reliable
results. The final set of selected attributes after these preprocessing
steps is ChlorA, Depthm, STheta, O2Sat, O2ml_L,
T_degC, Salnty, and Sta_ID.
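A minimal sketch of the missing-value handling described above, under stated assumptions: the 50% drop threshold comes from the text, but which imputation technique was applied to which attribute is not specified, so the example simply demonstrates KNN imputation, mean imputation, and linear interpolation side by side on synthetic values. The helper names and `n_neighbors=2` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds max_missing."""
    frac_missing = df.isna().mean()
    return df.loc[:, frac_missing <= max_missing]


def knn_impute(df: pd.DataFrame, n_neighbors: int = 2) -> pd.DataFrame:
    """Fill gaps in numeric columns from the n_neighbors most similar rows."""
    num_cols = df.select_dtypes(include="number").columns
    out = df.copy()
    out[num_cols] = KNNImputer(n_neighbors=n_neighbors).fit_transform(out[num_cols])
    return out


# Synthetic example values, not real CalCOFI measurements.
demo = pd.DataFrame({
    "T_degC": [10.5, np.nan, 10.1, 9.8, 9.6, 9.4],
    "Salnty": [33.4, 33.5, np.nan, 33.6, 33.7, 33.8],
    "O2ml_L": [5.9, 5.7, 5.5, 5.2, 5.0, 4.8],
    "PO4uM": [np.nan, np.nan, np.nan, np.nan, 1.2, np.nan],  # > 50% missing
})

kept = drop_sparse_columns(demo)                       # PO4uM is removed
knn_filled = knn_impute(kept)                          # KNN imputation
mean_filled = kept.fillna(kept.mean(numeric_only=True))  # mean imputation
interpolated = kept.interpolate(method="linear", limit_direction="both")
```

Interpolation only makes sense when consecutive rows are ordered meaningfully (for example by depth within a cast), whereas KNN and mean imputation do not depend on row order.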
Outliers were addressed using the Interquartile Range (IQR)
method, which was appropriate given the large, complex, and
diverse nature of the dataset. Figure 2 illustrates that filtering
the data resulted in a narrower and smoother distribution for
each variable, suggesting that the removal of noise and outliers
significantly improved data quality.
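The IQR filtering step can be illustrated with the short sketch below. The multiplier `k = 1.5` is the conventional default and is an assumption here, since the text does not state the value used; the function name and the sample values are likewise hypothetical.

```python
import pandas as pd


def iqr_filter(df: pd.DataFrame, columns: list, k: float = 1.5) -> pd.DataFrame:
    """Keep rows lying within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons align the per-column bounds with the DataFrame columns.
    inside = ((df[columns] >= lower) & (df[columns] <= upper)).all(axis=1)
    return df.loc[inside]


# Synthetic values; the last ChlorA reading is an obvious outlier.
demo = pd.DataFrame({
    "ChlorA": [0.10, 0.12, 0.15, 0.11, 0.13, 9.50],
    "T_degC": [10.2, 10.4, 10.1, 10.3, 10.5, 10.2],
})

cleaned = iqr_filter(demo, ["ChlorA", "T_degC"])
print(len(demo), "->", len(cleaned), "rows")
```

Because a row is dropped when any one of its columns falls outside the fences, filtering many columns at once can remove a large share of the data; checking the row count before and after is a useful sanity check.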
[...] the model achieved an R² score of 0.7876 with the following hyperparameters: 490 estimators, a max_depth of 44, min_samples_split of 2, min_samples_leaf of 1, and max_features set to None. Despite this, the best performing trial remains Trial 45, which achieved a higher R² score of 0.7889. This indicates that the hyperparameters from Trial 45 still provided the best model performance. The optimization process continues to explore different hyperparameter configurations to improve predictive accuracy. This tuning improved the overall model R² value from 0.7852 to 0.7889.
Among the five models tested, Random Forest outperformed the others with the highest R² value of 0.7852, alongside the lowest MAE (0.0561) and MSE (0.0090), demonstrating its ability to handle complex relationships between physical and chemical variables and reduce overfitting by combining multiple decision trees. While an ensemble stacking model combining Random Forest, SVR, and Gradient Boosting Regressor achieved an R² of 0.71, it did not surpass Random Forest's performance. Feature engineering revealed that Depth had the strongest influence on chlorophyll levels, followed by Potential Density, with feature importance values of 0.4 and 1.15, respectively. Hyperparameter tuning using Optuna led to a slight improvement in model performance, achieving an R² of 0.7889 in Trial 45, compared to 0.7876 in Trial 96, reflecting the optimal configuration for the Random Forest model.

Fig. 7. Feature importance from Random Forest

V. FUTURE SCOPE
Adding more sophisticated elements such as an attention mechanism in the future could improve the performance of the model further. Commonly used in disciplines such as computer vision and natural language processing, the attention mechanism could enable the model to concentrate on the most critical data points or capture how the data evolve over time [22]. Furthermore, improving the way the data are represented could help the model learn better, producing higher accuracy and simpler interpretation.

VI. CONCLUSION
This paper aimed to build a machine learning model for chlorophyll level detection using CalCOFI data to monitor marine ecology. The model successfully estimated chlorophyll levels from environmental parameters such as temperature, salinity, oxygen level, oxygen saturation, and potential density. Results show that model performance was improved through proper data processing, feature engineering, and hyperparameter optimization. In particular, the Random Forest Regressor proved highly effective at capturing relationships within the data and predicting chlorophyll concentration. Despite the complexity of marine ecosystems and variability in environmental factors, our model achieved a high level of accuracy, highlighting the potential of data-driven approaches in environmental monitoring. However, deep learning could further improve model performance, as the dataset is diverse, complex, and large. In the future, implementing more advanced deep learning models and attention mechanisms could make the model more accurate. In conclusion, the findings of this study contribute to the growing body of research on machine learning applications in marine science and environmental monitoring, providing a robust framework for future investigations into the dynamics of marine ecosystems.

REFERENCES
[1] A. W. Griffith and C. J. Gobler, “Harmful algal blooms: A climate change co-stressor in marine and freshwater ecosystems,” Harmful Algae, vol. 91, pp. 1–15, Mar. 2019. [Online]. Available: https://doi.org/10.1016/j.hal.2019.03.008
[2] V. G. Dvoretsky, V. V. Vodopianova, and A. S. Bulavina, “Effects of Climate Change on Chlorophyll a in the Barents Sea: A Long-Term Assessment,” Biology, vol. 12, no. 1, p. 119, Jan. 2023. [Online]. Available: https://doi.org/10.3390/biology12010119
[3] E. T. Harvey, S. Kratzer, and P. Philipson, “Satellite-based water quality monitoring for improved spatial and temporal retrieval of chlorophyll-a in coastal waters,” Remote Sensing of Environment, vol. 158, Mar. 2015. [Online]. Available: https://doi.org/10.1016/j.rse.2014.11.017
[4] D. B. Olawade, O. Z. Wada, A. O. Ige, B. I. Egbewole, A. Olojo, and B. I. Oladapo, “Artificial intelligence in environmental monitoring: Advancements, challenges, and future directions,” Hygiene and Environmental Health Advances, vol. 12, p. 100114, Dec. 2024. [Online]. Available: https://doi.org/10.1016/j.heha.2024.100114
[5] J. W. Han, T. Kim, S. Lee, T. Kang, and J. K. Im, “Machine learning and explainable AI for chlorophyll-a prediction in Namhan River Watershed, South Korea,” Ecological Indicators, vol. 2024, p. 112361. [Online]. Available: https://doi.org/10.1016/j.ecolind.2024.112361
[6] N. Madani, N. C. Parazoo, M. Manizza, and A. Chatterjee, “A machine learning approach to produce a continuous solar-induced chlorophyll fluorescence dataset for understanding ocean productivity,” March 2024. [Online]. Available: https://doi.org/10.22541/essoar.171164956.61516407/v1
[7] W. N. Chusnah, H. J. Chu, Tatas et al., “Machine-learning-estimation of high-spatiotemporal-resolution chlorophyll-a concentration using multi-satellite imagery,” Sustain. Environ. Res., vol. 33, no. 11, 2023. [Online]. Available: https://doi.org/10.1186/s42834-023-00170-1
[8] L. Yao, X. Wang, J. Zhang, and X. Yu, “Prediction of sea surface chlorophyll-a concentrations based on deep learning and time-series remote sensing data,” Remote Sensing, vol. 15, no. 18, p. 4486, Sep. 2023. [Online]. Available: https://doi.org/10.3390/rs15184486
[9] D. Kim, K. J. Lee, S. M. Jeong, M. S. Song, B. J. Kim, J. Park, and T. Y. Heo, “Real-time chlorophyll-a forecasting using machine learning framework with dimension reduction and hyperspectral data,” Environ. Res., 2024, Art. no. 119823. [Online]. Available: https://doi.org/10.1016/j.envres.2024.119823
[10] F. Zennaro, E. Furlan, D. Canu, L. Aveytua Alcazar, G. Rosati, C. Solidoro, S. Aslan, and A. Critto, “Venice lagoon chlorophyll-a evaluation under climate change conditions: A hybrid water quality machine learning and biogeochemical-based framework,” Environmental Science and Pollution Research, vol. 2024, pp. 1-12, 2024. [Online]. Available: https://doi.org/10.1016/j.envres.2024.119823
[11] M. S. Alam, S. P. Tiwari, and S. M. Rahman, “Optimized ensemble machine learning models for predicting phytoplankton absorption coefficients,” IEEE Access, vol. 12, pp. 5760-5769, 2024, doi: 10.1109/ACCESS.2024.3350328.
[12] M. Asim, C. Brekke, A. Mahmood, T. Eltoft, and M. Reigstad, “Improving Chlorophyll-A Estimation From Sentinel-2 (MSI) in the Barents Sea Using Machine Learning,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 5529-5549, 2021, doi: 10.1109/JSTARS.2021.3074975.
[13] K. Chen, J. Zhang, Y. Zheng, and X. Xie, “A Study on Global Oceanic Chlorophyll-a Concentration Inversion Model for MODIS Using Machine Learning Algorithms,” IEEE Access, vol. 12, pp. 128843-128859, 2024, doi: 10.1109/ACCESS.2024.3456481.
[14] W. McKinney, “Pandas: A fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation library,” 2010. [Online]. Available: https://pandas.pydata.org/
[15] M. Waskom, “Seaborn: statistical data visualization,” Journal of Open Source Software, vol. 6, no. 60, p. 3021, 2021. [Online]. Available: https://seaborn.pydata.org/
[16] J. D. Hunter, “Matplotlib: A 2D graphics environment,” Computing in Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007. [Online]. Available: https://matplotlib.org/
[17] F. J. Pedregosa et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. [Online]. Available: https://scikit-learn.org/
[18] T. Akiba et al., “Optuna: A Next-generation Hyperparameter Optimization Framework,” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019. [Online]. Available: https://optuna.org/
[19] Google, “Google Colaboratory: A free Jupyter notebook environment that requires no setup and runs entirely in the cloud,” [Online]. Available: https://colab.research.google.com/
[20] Kaggle, “Kaggle: Your Home for Data Science,” [Online]. Available: https://www.kaggle.com/
[21] California Cooperative Oceanic Fisheries Investigations (CalCOFI), “CalCOFI Bottle Database,” [Online]. Available: https://calcofi.org/data/oceanographic-data/bottle-database/. [Accessed: 25-Jan-2025].
[22] D. Hu, “An introductory survey on attention mechanisms in NLP problems,” in Intelligent Systems and Applications, Advances in Intelligent Systems and Computing, vol. 295, pp. 432-448, Jan. 2020, doi: 10.1007/978-3-030-29513-4_31.
[23] A. Lamba, P. Cassey, R. R. Segaran, and L. P. Koh, “Deep learning for environmental conservation,” Curr. Biol., vol. 29, no. 20, pp. R1156-R1164, Oct. 2019, doi: 10.1016/j.cub.2019.08.016.