1Department of Industrial Engineering, University of San Jose - Recoletos, Cebu City, 6000, Philippines
2Advanced Science and Technology Institute, DOST, Quezon City 1101, Philippines
Abstract
Technological advancements enable the use of intelligent systems in various fields such as operations and management, as well as in agriculture. In agriculture, smart systems create a positive impact on fruit classification, specifically for export needs. Even with their wide application, there have been limited studies on their utilization in coconut exportation. Another identified challenge is that gathering a large dataset is difficult, expensive, time-consuming, and prone to errors due to human inconsistencies. Because of this, the majority of the initial datasets in existing studies are unbalanced and sometimes inadequate for significant analysis and results. In this study, an investigation is conducted using an existing dataset of coconut acoustic signals to generate more data and balance out the samples across the three maturity levels. Data synthesis and data augmentation techniques are utilized along with combining techniques for acoustic signals, feature extraction, and data clustering. The data points under the three maturity levels were clustered appropriately using the summing method and the Mel Frequency Cepstral Coefficient feature. All data synthesizers failed to generate quality synthetic data. However, both audiomentation and procedural audio generation were able to produce quality augmented data after validation using linear layers, 1-dimensional convolution, and a long short-term memory model as the audio classification technique. The new dataset was then fed to the same models, and the results yielded a significant increase in classification performance for all models. The study can be further improved by incorporating other coconut species.
Keywords: signal processing, artificial neural network, support vector machine, random forest model
1. Introduction
The majority of the global coconut supply comes from tropical countries where coconut trees normally thrive. The three largest coconut-producing countries—Indonesia, the Philippines, and India—contributed 78% of the global coconut production, and the remaining 22% came from the rest of the coconut-producing countries (Burns, et al., 2020). In the Philippines, coconut is one of the greatest contributors to general economic activity in terms of agricultural land and labor (Moreno, et al., 2020). It is also considered a predictor of general economic activity because coconut is a major source of income in the country (Moreno, et al., 2020). The country has average earnings of 91.4 billion PHP (1.7 billion USD) and an annual coconut production of 14.7 million metric tons in nut terms from 2014 to 2019 (Philippine Coconut Authority, 2018). About 25% to 33% of the country's population is dependent on the coconut industry for their livelihood.
For commercial purposes, coconuts are classified by the fruit's maturity level (Mahayothee, et al., 2016; Javel, et al., 2018; Terdwongworakul, et al., 2009). Maturity is the prime determinant of the fruit's maximum economic value and consumption (Chen, et al., 2021). Most commonly, coconuts are classified into three levels, i.e., premature, mature, and overmature (Burns, et al., 2020; Terdwongworakul, et al., 2009; Gatchalian & De Leon, 1994; Mahayothee, et al., 2016; Javel, et al., 2018). Traditionally, coconuts are sorted manually into their maturity levels, which poses many drawbacks from human-related constraints such as inconsistency, variability, and subjectivity, among others (Zhang, et al., 2014; ElMasry, et al., 2012; Hameed, et al., 2018; Elmasry, et al., 2012). When dealing with a large volume of coconuts to be exported or delivered for industrial processing as a fresh product, the traditional technique of manually sorting fruits is no longer feasible (Gatchalian & De Leon, 1994). Behera et al. (2020) reported that due to the lack of skilled workers and human subjectivity in classifying coconuts, 30% to 35% of harvested coconuts are wasted. With the advancement of technology, manual fruit classification is gradually being replaced with mechanical and intelligent classification systems.
Classification learning models are data-driven and are highly dependent on their training data to yield high classification accuracy (Salamon & Bello, 2017; Himanen, et al., 2019). These models fall under big data approaches, which are known for dealing with extensively large databases for accurate predictions and classification (Wang, et al., 2021). The cascade of multiple hidden layers in a neural network architecture learns simple to complex features of the training set in its raw form (Shinde & Shah, 2018; Assen, et al., 2020; Janiesch, et al., 2021; Ning & You, 2019). This is how the training data directly affects the performance of a classification model architecture (Shaoa, et al., 2019). Commonly, these models require a large amount of labeled data for good results (Krizhevsky, et al., 2012). However, training samples are usually unbalanced because real-world high-quality data collection is challenging, costly, and time-consuming (Yasar & Laskowski, 2023; Shaked, 2023; Maguolo, et al., 2021; Shaoa, et al., 2019).

An imbalanced dataset has been a recurring problem in training models; it restricts the accuracy and stability of a model, resulting in poor fruit classification performance (Zheng, et al., 2021). This can be resolved with the use of artificially generated data produced by data generation methods (Zheng, et al., 2021; Kong, et al., 2021; Park, et al., 2018). In the case of sound and acoustic signals, the most common application domains for sound classification are speech recognition (Chowanda, et al., 2023; Yuan, et al., 2023; Chotirat & Meesad, 2021; Nallanthighal, et al., 2021), music classification (Yu, et al., 2020; Hizlisoy, et al., 2021; Pendyala, et al., 2022; Ghatas, et al., 2022), and environmental sound recognition.
Data generation is most frequently applied to image processing (Le, et al., 2019), speech patterns (Xia, et al., 2014; Siriwardena, et al., 2022), and musical notes (Engel, et al., 2017). There are two types of data generation: data synthesis and data augmentation. Data synthesis usually involves generative models and can produce synthetic data without the original dataset, while data augmentation focuses on slight manipulations of existing data to generate more (Awan, 2022). Several studies in the literature use data synthesis or data augmentation for sound classification and acoustic signal processing. For data synthesis, these include Kong et al. (2021) proposing a diffusion model for waveform generation through the Markov chain, Binkowski et al. (2020) introducing a generative adversarial network (GAN) for tasks that involve text and speech, and Yamamoto et al. (2020) designing a distillation-free and fast waveform generation method using GAN. For data augmentation, these include Bryan (2020) estimating reverberation time and direct-to-reverberant ratio from speech using deep convolutional neural networks (CNN), Siriwardena et al. (2022) using audio data augmentation for speech inversion, and Sun et al. (2022) classifying animal sounds using CNN with data augmentation. Although these studies provide significant insights and results, the majority deal with speech analysis and processing, while some deal with environmental sounds such as animal sounds. Evidently, there is a limited variety of applications for data generation in other domains, such as agriculture.
The objective of this study is to improve the existing methods of classifying coconuts into their maturity levels with the use of data generation methods, specifically for mass fruit exportation. In addition, data generation methods are investigated to find which methods and techniques apply to the nature of the dataset used in the study—acoustic signals. By creating an improved classification system, the study will aid the agricultural industry and coconut exportation companies in improving their sorting performance and in producing significantly fewer fruits that are misclassified and wasted due to manual methods. This study made use of the dataset available from Caladcad et al. (2022) as the initial dataset to be employed in the data generation methods. The newly generated dataset is then deployed to the three machine learning algorithms (MLAs) used in the study of Caladcad et al. (2020) for coconut maturity level classification.
2. Materials and methods
2.1. Initial coconut dataset
The initial dataset in this study is from a recently published dataset of coconuts based on a tapping system in Caladcad et al. (2022). The raw data are acoustic signals from the three coconut maturity levels—premature, mature, and overmature. Each signal is composed of 16-bit information, a sampling frequency of 44.1 kHz, and 132,300 time-series data points. From the 129 coconut samples gathered, there are 8 premature, 36 mature, and 85 overmature samples. About 66% of the samples are overmature coconuts, which is more than half of the total samples gathered. The coconut samples were gathered during the postharvest season, resulting in the dataset's significant imbalance across maturity levels; addressing the collection itself is beyond the scope of this study. There is also an insufficient amount of available data prior to data generation, specifically for premature and mature samples. This dataset serves as the primary dataset from which more data are reproduced using data generation methods.
2.2. Audio data generation framework

Presented in Fig. 1 is the proposed audio data generation framework. It follows a systematic approach for conducting data analytics in data generation for acoustic signals and serves as the basis of the study to understand and analyze acoustic signals and make informed decisions. The steps in the framework are summarized as follows: (1) pre-processing, (2) data generation, and (3) data validation. Note that the methods and steps in this section are implemented using Python-based libraries, including Librosa, NumPy, and SciPy.
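For illustration, a signal from the dataset can be loaded as follows. This is a minimal sketch; the file name and folder layout are hypothetical, not the dataset's actual structure.

import librosa

# Hypothetical file path; the actual dataset layout may differ
signal, sr = librosa.load("coconut_samples/premature_001.wav", sr=44100, mono=True)
print(signal.shape)  # a 3-second clip at 44.1 kHz gives (132300,)
print(sr)            # 44100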
Before proceeding to audio data generation, pre-processing is conducted first. This involves combining the acoustic signals, feature extraction, clustering, and cleaning the data per maturity level. Pre-processing techniques such as feature extraction, data clustering, and data cleaning before data generation are important for improving the dataset's quality (Maharana, et al., 2022). Artificially generated data will primarily affect the accuracy, prediction, and classification abilities of the learning models, which is why ensuring that the dataset is of quality is crucial (Maharana, et al., 2022).
Figure 1. The proposed audio data generation framework
Firstly, the initial dataset underwent initial cleaning, which is summarized in Table 1. Both missing data and wrongly labeled data were deleted. The initial dataset is composed of 6.20%, 27.91%, and 65.89% premature, mature, and overmature samples, respectively, and all maturity levels have equal duration. There is a clear domination in the number of samples under the overmature level.
In the previous study by Caladcad et al. (2020), each audio signal is treated as one sample; three knocks are made on the three ridges of one coconut, resulting in three audio signals per coconut sample. However, it is not possible to cluster the data per ridge knock and, consequently, not possible to proceed with audio data generation. To remedy this, combination methods, namely extending and summing, are explored. Extending is defined as simply merging audio signals into one by mere extension, while summing is adding the values of all audio signals relative to the time signal (Pulakka & Alku, 2011). As a result of the extending method, the three audio signals under one coconut sample were merged. Shown in Fig. 2 are two samples under the premature level as a result of extending. This also extended the audio duration from three seconds to nine seconds, since one audio signal is three seconds long for each ridge knock. On the other hand, the audio duration remained the same when the audio signals were summed, as shown in Fig. 3.
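The two combination methods can be sketched as follows, assuming the three ridge-knock signals are already loaded as equal-length NumPy arrays; the variable names and random placeholders are illustrative only.

import numpy as np

# Placeholders for the three ridge-knock signals of one coconut (3 s at 44.1 kHz each)
ridge1, ridge2, ridge3 = (np.random.randn(132300) for _ in range(3))

# Extending: concatenate the knocks end to end (3 s -> 9 s)
extended = np.concatenate([ridge1, ridge2, ridge3])

# Summing: add the signals sample by sample (duration stays 3 s)
summed = ridge1 + ridge2 + ridge3

print(extended.shape, summed.shape)  # (396900,) (132300,)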
Figure 2. Two samples of acoustic signals by extending ridges under premature level

Figure 3. Two samples of acoustic signals by summing ridges under mature level
After combining, the next step is to cluster the data to form clear distinctions between the data points of the maturity levels. Data clustering was done through feature engineering using the following features: (1) spectrogram (Ngo, et al., 2020; Dennis, et al., 2011; Harmanny, et al., 2014), (2) Mel Filterbank Energy (MFE) (Tak, et al., 2017; Madikeri & Murthy, 2011), and (3) Mel Frequency Cepstral Coefficient (MFCC) (Paul S, et al., 2021; A, et al., 2018; KS, et al., 2021). Data clustering and feature extraction go together in this process since the extracted features were used for clustering the data. The combining method and feature pair that creates clear clusters is used in the succeeding steps of data generation, as sketched below.
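A minimal sketch of this feature-plus-clustering step is given below, using Librosa's MFCC and scikit-learn's KMeans; the feature summarization (mean over time) and parameter values are assumptions for illustration, not the study's exact settings.

import librosa
import numpy as np
from sklearn.cluster import KMeans

def mfcc_feature(signal, sr=44100, n_mfcc=13):
    # Summarize a signal as the mean MFCC vector over time
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Placeholder summed signals, one per coconut sample
summed_signals = [np.random.randn(132300) for _ in range(30)]
features = np.stack([mfcc_feature(s) for s in summed_signals])

# Three clusters, one per expected maturity level
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(labels)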
Shown in Fig. 4 is the result of clustering the data points using the spectrogram feature and the extending method. As presented, no clusters are formed using these two methods. Data points overlapped with each other and did not form any distinction between the maturity levels. Therefore, the spectrogram feature and extending method were not used to generate more data for this type of dataset.
Figure 4. Data clustering using the spectrogram feature of extended ridges
On the other hand, shown in Figs. 5 and 6 are the data clustering results using the MFE and MFCC features, respectively, with the summing method. Fig. 5 shows results similar to Fig. 4 in that the data points could not be clustered according to their maturity levels. The same results appeared when the time-series, MFE, and MFCC feature extraction methods were paired with the extending method, as well as when the time-series and spectrogram feature extraction methods were paired with the summing method. However, the use of the MFCC feature with summed ridges shows a clear clustering of the three maturity levels. In the figure, Cluster 0 indicates the premature samples, Cluster 1 the mature samples, and Cluster 2 the overmature samples. With this, the summed ridges and MFCC feature are used for the succeeding processes.
Figure 5. Data clustering using the MFE feature of summed ridges

Figure 6. Data clustering using the MFCC feature of summed ridges
The last process under the pre-processing step is data cleaning. This is the removal of outliers from each cluster to preserve the quality of the samples prior to the implementation of data generation methods. Outliers are data points that deviate from their cluster. Removing them also prevents inconsistencies that could reduce the quality of the newly generated dataset and avoids erroneous conclusions and classifications (Maharana, et al., 2022). Summarized in Table 2 is a comparison of the number of data samples per maturity level between the initial dataset and the processed data after pre-processing. In the table, only 1 sample remained under the premature level, which hinders the process of generating more data. To remedy this, false data points are added under the premature level for the purpose of data generation only. These are samples that do not belong to the premature level but fall into the premature cluster when clustered. Note that the false data points are removed after generating more data; they are not included during the validation of the dataset's quality and further processes. After the pre-processing step, there are 11 premature, 13 mature, and 21 overmature samples.
Table 2. Summary of original vs. processed data after pre-processing

Maturity level   Initial dataset   Cleaned data   False data points added   Total processed
Premature        8                 1              10                        11
Mature           36                13             0                         13
Overmature       85                21             0                         21
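To illustrate the outlier-removal idea, the sketch below drops points that lie unusually far from their cluster centroid; the two-standard-deviation threshold is an assumed rule for illustration, not the study's criterion.

import numpy as np
from sklearn.cluster import KMeans

# Placeholder per-sample feature vectors (e.g., mean MFCCs from the earlier sketch)
features = np.random.randn(45, 13)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# Distance of each point to its own cluster centroid
dists = np.linalg.norm(features - km.cluster_centers_[km.labels_], axis=1)

# Keep points within two standard deviations of the mean distance (assumed rule)
keep = dists < dists.mean() + 2 * dists.std()
cleaned = features[keep]
print(features.shape, "->", cleaned.shape)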
Two different audio data generation methods were explored: data synthesis and data augmentation. For data synthesis, the available speech synthesizer repository from the study of Kong et al. (2021) was adopted, along with an autoencoder (AE) and a variational autoencoder (VAE) (Anwar, 2021). Kong et al. (2021) proposed DiffWave, a high-quality neural vocoder and waveform synthesizer that uses a Markov chain to convert a white noise signal into a structured waveform. AE, on the other hand, is a special type of neural network model that learns efficient codings of unlabeled data to ignore signal noise (Bandyopadhyay, 2021), while VAE is similar to AE but addresses the non-regularized latent space (Anwar, 2021). Both have been successfully applied to vast amounts of unlabeled data as prominent data synthesizers (Tschannen, et al., 2018).
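For context, below is a minimal PyTorch sketch of a VAE of the kind described, operating on flattened MFCC inputs; the dimensions echo the parameters in Table 3, but the layer layout is an assumption, not the exact repository model.

import torch
import torch.nn as nn

class VAE(nn.Module):
    # Minimal VAE over flattened MFCC inputs (13 coefficients x 300 frames)
    def __init__(self, input_dim=13 * 300, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample a latent vector while keeping gradients
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

vae = VAE()
optimizer = torch.optim.Adam(vae.parameters(), lr=0.001)
x = torch.randn(8, 13 * 300)  # placeholder batch
recon, mu, logvar = vae(x)
# Reconstruction loss plus KL divergence to a standard normal prior
loss = nn.functional.mse_loss(recon, x) - 0.5 * torch.mean(
    1 + logvar - mu.pow(2) - logvar.exp()
)
loss.backward()
optimizer.step()
print(loss.item())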
Applying it to the processed dataset presented in Table 2, shown in Table 3 is the summary of the parameters used in training the model. The model only generated 11 premature and 13 mature synthetic data points. This is the same amount of data as the processed dataset in Table 2, thus only doubling the samples for the premature and mature levels. All of the generated audio was noise instead of the expected knocking sounds. Even several parameter changes, specifically to the noise schedule, did not improve the generated audio at all. With these results, the speech synthesizer from the study of Kong et al. (2021) failed to generate additional quality data for this dataset and, therefore, cannot be used for this application.
Proceeding to other data synthesis methods, two synthesizers available in a public repository were investigated: AE and VAE. The parameters used for both methods are summarized in Table 3, and their loss graphs are presented in Figs. 7 and 8 for AE and VAE, respectively. As shown in the table, both models failed to generate quality synthetic data. This is further evident in the loss values presented in the loss graphs. In Fig. 7, it can be seen that the loss values of the AE model did not approach zero at all. Similarly, the loss values of the VAE did not approach zero; the values behave erratically with sporadic changes across the epochs.
Table 3. Summary of parameters and results used for all data synthesizers

Data synthesizer                 Parameters                                            Result
DiffWave (Kong, et al., 2021)    learning_rate=2e-4                                    Failed
                                 sample_rate=44100
                                 residual_channels=64
                                 dilation_cycle_length=10
                                 opt = torch.optim.Adam(model.parameters())
Autoencoder                      batch_size = 64                                       Failed
                                 input_size = fixed_length * 128
                                 hidden_size = 256
                                 output_size = fixed_length * 128
                                 optimizer = optim.AdamW(model.parameters(),
                                     lr=0.001, weight_decay=0.01)
                                 num_epochs = 200
Variational autoencoder          input_dim = 13 * 300                                  Failed
                                 latent_dim = 64
                                 optimizer = optim.Adam(vae.parameters(), lr=0.001)
                                 num_epochs = 1000
Figure 7. Loss graph of the AE model

Figure 8. Loss graph of the VAE model
From the three deep generative models from both speech and nonspeech repositories, the quality of the produced data does not qualify as additional data to augment the initial dataset. The use of DiffWave from Kong et al. (2021) generated synthetic data that were all static noise. Likewise, no usable data was synthesized using the AE and VAE models because their loss values never converged to zero. With this, data synthesis is not a fit method for generating additional quality data of the knocking sounds of coconut fruits as the basis for classifying the fruits' maturity levels.
Under data augmentation, two methods were examined, namely, audiomentation and procedural generation. Pitch, audio duration, and background noise are some of the audio signal features manipulated in the audiomentation method to generate more data (Salamon & Bello, 2017). On the other hand, the procedural generation method utilizes frequency filters; it creates more data by manipulating the frequencies of the audio signals using these filters (Lundberg, 2020). Both methods exploit the original data with minor changes to increase the volume and diversity of the training set, which is why they are referred to as data augmentation methods. All deformation techniques and filters used, with their corresponding parameters, are summarized in Table 4.
Table 4. Deformation techniques and filters used for audiomentation and procedural audio generation

Data augmentation method      Parameters
Audiomentation                stretch_factor = random.uniform(0.8, 1.2)
                              shift_factor = random.randint(-1000, 1000)
                              pitch_factor = random.randint(-3, 3)
                              compression_factor = random.uniform(0.1, 0.5)
                              noise_factor = random.uniform(0, 0.05)
                              shift_factor = random.uniform(-0.1, 0.1)
                              filter_factor = random.randint(10, 90)
                              audio_sep = harmonic_percussive_separation((audio, sr))
                              audio_vibrato = vibrato((audio, sr))
Procedural audio generation   apply_time_varying_lowpass_filter(audio_data, sr):
                              filter_order = 6
                              window_size = 1024
                              overlap = 0.5
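The sketch below illustrates both augmentation styles: audiomentation-style deformations (time stretch, pitch shift, additive noise) using Librosa, and a procedural-generation-style low-pass filter using SciPy. The parameter ranges echo Table 4, but the composition of steps and the cutoff frequency are assumptions for illustration.

import random
import numpy as np
import librosa
from scipy.signal import butter, sosfilt

def audiomentation(signal, sr=44100):
    # Randomized deformations with ranges echoing Table 4
    stretched = librosa.effects.time_stretch(signal, rate=random.uniform(0.8, 1.2))
    pitched = librosa.effects.pitch_shift(stretched, sr=sr, n_steps=random.randint(-3, 3))
    noise = np.random.randn(len(pitched)) * random.uniform(0, 0.05)
    return pitched + noise

def procedural_lowpass(signal, sr=44100, cutoff_hz=4000, order=6):
    # Frequency filtering with a Butterworth low-pass (illustrative cutoff)
    sos = butter(order, cutoff_hz, btype="low", fs=sr, output="sos")
    return sosfilt(sos, signal)

signal = np.random.randn(132300)  # placeholder 3-second signal
print(audiomentation(signal).shape, procedural_lowpass(signal).shape)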
A summary of the generated data from both the audiomentation and procedural generation methods, compared with the initial dataset, is shown in Table 5. Both methods were able to produce 2,025 data samples under the premature level, for a total of 4,050 samples. Similar results were achieved for the mature level, with a total of 4,050 samples and each method producing an equal number. Both methods produced 2,925 samples for the overmature level, for a total of 5,850. The total amount of data under each method is 6,975, for a grand total of 13,950 data samples. Note that the false data points have already been removed and are not included in these numbers. Compared to the initial dataset, the samples per maturity level are almost balanced, with only the overmature level having more samples than the other two.
Table 5. Augmented data using data augmentation methods in comparison with the initial dataset

Maturity level   Audiomentation   Procedural audio generation   Total    Initial dataset (Caladcad, et al., 2020)
Premature        2,025            2,025                         4,050    24
Mature           2,025            2,025                         4,050    108
Overmature       2,925            2,925                         5,850    255
Total            6,975            6,975                         13,950   387
The augmented data were then processed to validate their quality and trainability. For validation, one-dimensional convolution, long short-term memory (LSTM), and linear layers are used as the audio classification module, as sketched below. In Fig. 9, the loss graph during the training phase of the module is presented. With 20 epochs and a batch size of 32, the model slowly approaches a loss value of 0, achieving its lowest loss value of 0.0862 at epoch 16. Although the loss slowly increased at the end of the training phase, the module still achieved a training accuracy of 96.12%, a testing accuracy of 93.98%, an average precision of 93.32%, and an average recall of 93.72%, which demonstrates the quality of the dataset. The newly generated dataset is then used as data input to improve the classifying algorithms.
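A minimal PyTorch sketch of such a validation module is given below, combining 1-dimensional convolution, an LSTM, and linear layers over MFCC sequences; the layer sizes are assumptions rather than the study's exact configuration.

import torch
import torch.nn as nn

class AudioClassifier(nn.Module):
    # Conv1D -> LSTM -> linear layers over MFCC sequences, x: (batch, n_mfcc, time)
    def __init__(self, n_mfcc=13, n_classes=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, 32, kernel_size=5, padding=2), nn.ReLU()
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.head = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, n_classes))

    def forward(self, x):
        h = self.conv(x)                      # (batch, 32, time)
        h, _ = self.lstm(h.transpose(1, 2))   # (batch, time, 64)
        return self.head(h[:, -1, :])         # classify from the last time step

model = AudioClassifier()
x = torch.randn(32, 13, 300)  # placeholder batch of 32 clips, 300 MFCC frames each
logits = model(x)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (32,)))
print(logits.shape, loss.item())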
One of the limitations of this study, as previously mentioned, is the amount of data available prior to data generation. There are only 8 premature samples and 36 mature samples against the 85 overmature samples in the initial dataset. When divided, the testing set is small, which may not be enough to represent the data distribution. With that, the newly generated dataset serves as the data input to the proposed pipeline, as shown in Fig. 10. The three algorithms—artificial neural network (ANN), support vector machine (SVM), and random forest (RF)—from the study of Caladcad et al. (2020) are still used. The pipeline comprises the training and testing phases. Ten-fold cross-validation is still implemented with a 90/10 data partition, in which 90% of the newly generated dataset is used for the training phase and 10% for the testing phase. The machine learning algorithms are then evaluated under the testing phase using the following metrics:
Accuracy = (number of correctly classified samples) / (total number of samples)   (1)

F1-score = 2 / ((1 / precision) + (1 / recall))   (2)
The accuracy is the ratio of correctly classified samples to the total number of samples, as shown in Eq. 1, while the F1-score assesses the weighted average of precision and recall, as shown in Eq. 2. Additionally, a normalized confusion matrix shows the actual versus predicted classification performance of each model. The performance of the machine learning algorithms with the augmented data is compared to their previous performance in the study of Caladcad et al. (2020) without the addition of augmented data. The same parameters from the study of Caladcad et al. (2020) are still implemented in the models, as sketched below.
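A minimal scikit-learn sketch of this evaluation pipeline is shown below; the model hyperparameters are placeholders rather than the settings carried over from Caladcad et al. (2020).

import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Placeholder feature matrix and maturity labels for the generated dataset
X, y = np.random.randn(300, 13), np.random.randint(0, 3, 300)

models = {
    "ANN": MLPClassifier(max_iter=500),
    "SVM": SVC(),
    "RF": RandomForestClassifier(),
}
for name, model in models.items():
    # Ten-fold cross-validation reporting accuracy and macro F1-score
    scores = cross_validate(model, X, y, cv=10, scoring=["accuracy", "f1_macro"])
    print(name, scores["test_accuracy"].mean(), scores["test_f1_macro"].mean())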
Figure 10. The proposed pipeline

3. Results and discussion

The use of data generation methods balanced out, if not completely, the samples across the three maturity levels. The ML models are evaluated with the newly generated dataset as their data input to assess the improvement of the classifying models in predicting the maturity levels of the coconut. Presented in Fig. 11 are the confusion matrices of ANN, RF, and SVM with the new dataset. The figure shows the ratio of each model's predicted maturity level to the actual maturity level. For ANN, the model correctly classified 97.10%, 88.64%, and 91.62% of the premature, mature, and overmature samples, respectively. Furthermore, the RF model correctly predicted the maturity level of 94.44% of the premature, 93.68% of the mature, and 98.63% of the overmature samples. Lastly, the SVM correctly classified 91.06% of the premature, 75.76% of the mature, and 89.92% of the overmature samples.
Figure 11. Confusion matrices using the newly generated dataset of (a) ANN, (b) RF, and (c) SVM
Comparing the above results to the study of Caladcad et al. (2020), a summary is presented in Table 6. The table shows the percentages of correctly classified samples per maturity level by the ML models with and without the addition of generated data. It can be observed that there is a significant difference in the performance of the ML models. For instance, the ANN model before data generation could only classify less than 45% of both the premature and mature samples. The RF model prior to data generation could only classify 25% of the premature samples and 59% of the mature samples, while the SVM could only predict 38% of the samples from both maturity levels. The difference in the models' performance for classifying both premature and mature samples ranges from 34.68% to 69.44%. On the other hand, only in classifying overmature samples did the models before data generation outperform the models trained and tested with the newly generated dataset. This is because the majority of the samples in the initial dataset were overmature, so the models were able to distinctly classify overmature samples against premature and mature samples. Nonetheless, the results of the ML models with the newly generated dataset were not far from the previous results; the difference in the percentages only ranges from 1.37% to 10.08%.
Table 6. Performance comparison of ML models with and without data generation for correctly classified samples

             Correctly classified samples per maturity level (%)
ML Models    Premature    Mature    Overmature
ANN*         38.00        44.00     100.00
ANN          97.10        88.64     91.62
RF*          25.00        59.00     100.00
RF           94.44        93.68     98.63
SVM*         38.00        38.00     100.00
SVM          91.06        75.76     89.92
*The results are taken from the study of Caladcad et al. (2020) using the dataset without generated data.
The ML models are also evaluated on their accuracies and F1-scores. The results are summarized in Table 7, along with the performance of the ML models before data generation. The accuracy and F1-score of ANN with the generated dataset are both 92%, while those of RF are both 96%. The SVM model has the lowest accuracy and F1-score at 87% and 86%, respectively, compared to the other models. However, when compared to the ML models' performance before data generation, all models' accuracy and F1-score greatly improved. For accuracy, the percentage difference ranges from 7% to 12.52%, with the RF model having the greatest improvement and SVM the least among the three models. For the F1-score, the difference ranges from 9.33% to 14.65%, with the RF model still having the greatest improvement and SVM remaining the least.
Table 7. Comparison of ML models with and without data generation using performance indicators

ML Models   Indicator      Without data generation*   With data generation
RF          Accuracy (%)   83.48                      96.00
            F1-score (%)   81.35                      96.00
SVM         Accuracy (%)   80.00                      87.00
            F1-score (%)   76.67                      86.00
*The results are taken from the study of Caladcad et al. (2020) using the dataset without generated data.
Generally, the performance of the ML models in classifying coconut samples into their maturity levels significantly improved with the use of the newly generated dataset. The balanced distribution and the increase in the number of samples across maturity levels greatly contributed to this improvement. Unlike in the previous study, all three ML models can now distinctly classify not just the overmature samples but the samples in all three maturity levels. Still, even with the addition of generated data, the RF model continues to outperform both the ANN and SVM models in terms of overall performance, having the highest accuracy (96%) and F1-score (96%) among the three ML models.
4. Conclusion
This study investigated several data generation methods to enlarge an existing dataset of coconut acoustic signals and improve the performance of ML models. Data synthesizers failed to produce quality synthetic data, while data augmentation techniques successfully generated quality data that were added to the initial dataset. With the application of data generation, the dataset increased to about 35 times the size of the original dataset. The performance of all three ML models significantly increased over the prior study when data augmentation was implemented. This serves as a basis for integrating ML in developing a noninvasive classification system for coconut fruits for mass exportation.
References
A, S., Thomas, A., & Mathew, D. (2018). Study of MFCC and IHC Feature Extraction Methods With Probabilistic Acoustic Models for Speaker Biometric Applications. Procedia Computer Science.

Anwar, A. (2021). Difference between AutoEncoder (AE) and Variational AutoEncoder (VAE). Retrieved May 2023, from https://towardsdatascience.com/difference-between-autoencoder-ae-and-variational-autoencoder-vae-ed7be1c038f2

Assen, M., Lee, S. J., & De Cecco, C. N. (2020). Artificial intelligence from A to Z: From neural network to legal framework. European Journal of Radiology.

Awan, A. (2022). A Complete Guide to Data Augmentation. Retrieved May 2023, from https://www.datacamp.com/tutorial/complete-guide-data-augmentation

Bandyopadhyay, H. (2021). Autoencoders in Deep Learning: Tutorial & Use Cases. Retrieved May 2023.

Behera, S., Rath, A., Mahapatra, A., & Sethy, P. (2020). Identification, classification & grading of fruits using machine learning & computer intelligence: a review. Journal of Ambient Intelligence and Humanized Computing.

Binkowski, M., Donahue, J., Dieleman, S., Clark, A., Elsen, E., Casagrande, N., . . . Simonyan, K. (2020). High Fidelity Speech Synthesis with Adversarial Networks. doi:10.48550/arXiv.1909.11646

Bryan, N. J. (2020). Impulse Response Data Augmentation and Deep Neural Networks for Blind Room Acoustic Parameter Estimation. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Burns, D., Johnston, E.-L., & Walker, M. J. (2020). Authenticity and the Potability of Coconut Water - a Critical Review.

Caladcad, J., Cabahug, S., Catamco, M., Villaceran, P., Cosgafa, L., Cabizares, K., . . . Piedad, E. (2020). Determining Philippine coconut maturity level using machine learning algorithms based on acoustic signal. Computers and Electronics in Agriculture, 172, 105327. doi:10.1016/j.compag.2020.105327

Caladcad, J., Piedad, E., Cabahug, S., Catamco, M., Villaceran, P., Cosgafa, L., . . . Hermosilla, M. (2022). Acoustic Signal Dataset: Tall Coconut Fruit Species. Mendeley Data. doi:10.17632/hxh8kd3snj.1

Chen, X., Zhou, G., Chen, A., Pu, L., & Chen, W. (2021). The fruit classification algorithm based on the multi-optimization convolutional neural network. Multimedia Tools and Applications, 20, 11313-11330.

Chotirat, S., & Meesad, P. (2021). Part-of-Speech tagging enhancement to natural language processing for … Heliyon. doi:10.1016/j.heliyon.2021.e08216

Chowanda, A., Iswanto, I., & Andangsari, E. (2023). Exploring deep learning algorithm to model emotions recognition from speech. Procedia Computer Science, 216, 706-713. doi:10.1016/j.procs.2022.12.187

Dennis, J., Tran, H., & Li, H. (2011). Spectrogram Image Feature for Sound Event Classification in Mismatched Conditions. IEEE Signal Processing Letters, 18(2), 130-133. doi:10.1109/LSP.2010.2100380

ElMasry, G., Cubero, S., Molto, E., & Blasco, J. (2012). In-line sorting of irregular potatoes by using automated computer-based machine vision system. Journal of Food Engineering, 112(1-2), 30-38.

Elmasry, G., Kamruzzaman, M., Sun, D.-W., & Allen, P. (2012). Principles and Applications of Hyperspectral Imaging in Quality Evaluation of Agro-Food Products: A Review. Critical Reviews in Food Science and Nutrition.

Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., & Simonyan, K. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. In Proceedings of the 34th International Conference on Machine Learning.

Gatchalian, M. M., & De Leon, S. (1994). Measurement of Young Coconut (Cocos nucifera, L.) Maturity …

Ghatas, Y., Fayek, M., & Hadhoud, M. (2022). A hybrid deep learning approach for musical difficulty estimation of piano symbolic music. Alexandria Engineering Journal, 61(12), 10183-10196. doi:10.1016/j.aej.2022.03.060

Hameed, K., Chai, D., & Rassau, A. (2018). A comprehensive review of fruit and vegetable classification techniques. Image and Vision Computing.

Harmanny, R., de Wit, J., & Prémel Cabic, G. (2014). Radar Micro-Doppler Feature Extraction Using the Spectrogram and the Cepstrogram. In Proceedings of 2014 11th European Radar Conference. doi:10.1109/eurad.2014.6991233

Himanen, L., Geurts, A., Foster, A., & Rinke, P. (2019). Data-Driven Materials Science: Status, Challenges, and Perspectives. Advanced Science. doi:10.1002/advs.201900808

Hizlisoy, S., Yildirim, S., & Tufekci, Z. (2021). Music emotion recognition using convolutional long short term memory deep neural networks. Engineering Science and Technology, an International Journal, 24(3), 760-767. doi:10.1016/j.jestch.2020.10.009

Janiesch, C., Zschech, P., & Heinrich, K. (2021). Machine learning and deep learning. Electronic Markets.

Javel, I. M., Bandala, A. A., Salvador, R. C., Bedruz, R. R., Dadios, E. P., & Vicerra, R. P. (2018). Coconut Fruit Maturity Classification using Fuzzy Logic. 2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM).

Kong, Z., Ping, W., Huang, J., Zhao, K., & Catanzaro, B. (2021). DiffWave: A Versatile Diffusion Model for Audio Synthesis.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems (NIPS), 1097-1105.

KS, D., MD, R., & G, S. (2021). Comparative performance analysis for speech digit recognition based on MFCC and vector quantization. Global Transitions Proceedings, 2(2), 513-519. doi:10.1016/j.gltp.2021.08.013

Le, T.-T., Lin, C.-Y., & Piedad, E. (2019). Deep learning for noninvasive classification of clustered horticultural crops – A case for banana fruit tiers. Postharvest Biology and Technology, 156, 110922.

Lundberg, A. (2020). Data-Driven Procedural Audio: Procedural Engine Sounds Using Neural Audio Synthesis.

Madikeri, S. R., & Murthy, H. A. (2011). Mel Filter Bank Energy-Based Slope Feature and Its Application to Speaker Recognition. In 2011 National Conference on Communications (NCC). doi:10.1109/ncc.2011.5734713

Maguolo, G., Paci, M., Nanni, L., & Bonan, L. (2021). Audiogmenter: a MATLAB toolbox for audio data augmentation. Applied Computing and Informatics. doi:10.1108/ACI-03-2021-0064

Maharana, K., Mondal, S., & Nemade, B. (2022). A review: Data pre-processing and data augmentation techniques. Global Transitions Proceedings.

Mahayothee, B., et al. (2016). Phenolic Compounds, Antioxidant Activity, and Medium Chain Fatty Acids Profiles of Coconut Water and Meat at Different Maturity Stages. International Journal of Food Properties, 19(9), 2041-2051.

Moreno, M. L., Kuwornu, J. K., & Szabo, S. (2020). Overview and Constraints of the Coconut Supply Chain in the Philippines. International Journal of Fruit Science, 20(sup2), 1-18.

Nallanthighal, V., Mostaani, Z., Härmä, A., Strik, H., & Magimai-Doss, M. (2021). Deep learning architectures for estimating breathing signal and respiratory parameters from speech recordings.

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics.

Ngo, D., Hoang, H., Nguyen, A., Ly, T., & Pham, L. (2020). Sound Context Classification Basing on Join …

Ning, C., & You, F. (2019). Optimization under Uncertainty in the Era of Big Data and Deep Learning: When Machine Learning Meets Mathematical Programming. Computers & Chemical Engineering.

Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., & Kim, Y. (2018). Data Synthesis based on Generative Adversarial Networks.

Pascua, A. M. (2017). Impact Damage Threshold of Young Coconut (Cocos nucifera L.). International …

Paul S, B., Glittas, A., & Gopalakrishnan, L. (2021). A low latency modular-level deeply integrated MFCC feature extraction architecture for speech recognition. Integration, 76, 69-75. doi:10.1016/j.vlsi.2020.09.002

Pendyala, V. S., Yadav, N., Kulkarni, C., & Vadlamudi, L. (2022). Towards building a Deep Learning based Automated Indian Classical Music Tutor for the Masses. Systems and Soft Computing, 4, 200042. doi:10.1016/j.sasc.2022.200042

Philippine Coconut Authority. (2018). Coconut Statistics. Retrieved May 2022, from https://pca.gov.ph/index.php/resources/coconut-statistics

Pulakka, H., & Alku, P. (2011). Bandwidth Extension of Telephone Speech Using a Neural Network and a Filter Bank Implementation for Highband Mel Spectrum. IEEE Transactions on Audio, Speech, and Language Processing.

Salamon, J., & Bello, J. (2017). Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification. IEEE Signal Processing Letters, 24(3), 279-283. doi:10.1109/LSP.2017.2657381

Shaked, S. (2023). Why Use Synthetic Data vs Real Data? Retrieved May 2023, from https://www.datomize.com/why-use-synthetic-data-versus-real-data/

Shaoa, S., Wang, P., & Yan, R. (2019). Generative adversarial networks for data augmentation in machine fault diagnosis. Computers in Industry.

Shinde, P. P., & Shah, S. (2018). A Review of Machine Learning and Deep Learning Applications. In Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA).

Siriwardena, Y. M., Attia, A., Sivaraman, G., & Espy-Wilson, C. (2022). Audio Data Augmentation for Acoustic-to-Articulatory Speech Inversion using Bidirectional Gated RNNs. Audio and Speech Processing.

Sun, Y., Maeda, T., Solis-Lemus, C., Pimentel-Alarcon, D., & Burivalova, Z. (2022). Classification of animal sounds in a hyperdiverse rainforest using convolutional neural networks with data augmentation.

Tak, R. N., Agrawal, D. M., & Patil, H. A. (2017). Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification. Pattern Recognition and Machine Intelligence, 317-325. doi:10.1007/978-3-319-69900-4_40

Terdwongworakul, A., Chaiyapong, S., Jarimopas, B., & Meeklangsaen, W. (2009). Physical properties of fresh young Thai coconut for maturity sorting. Biosystems Engineering, 103, 208-216.

Tschannen, M., Bachem, O., & Lucic, M. (2018). Recent Advances in Autoencoder-Based Representation Learning.

Wang, Q., Velasco, L., Breitung, B., & Presser, V. (2021). High-Entropy Energy Materials in the Age of Big Data: A Critical Guide to Next-Generation Synthesis and Applications. Advanced Energy Materials.

Xia, X.-J., Ling, Z.-H., Jiang, Y., & Dai, L.-R. (2014). HMM-based unit selection speech synthesis using log likelihood ratios derived from perceptual data. Speech Communication, 63-64, 27-37. doi:10.1016/j.specom.2014.04.002

Yamamoto, R., Song, E., & Kim, J.-M. (2020). Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. Audio and Speech Processing.

Yasar, K., & Laskowski, N. (2023). Synthetic data. Retrieved May 2023, from https://www.techtarget.com/searchcio/definition/synthetic-data

Yu, Y., Luo, S., Liu, S., Qiao, H., Liu, Y., & Feng, L. (2020). Deep attention based music genre classification. Neurocomputing.

Yuan, B., Xie, H., Wang, Z., Xu, Y., Zhang, H., Liu, J., . . . Wu, J. (2023). The domain-separation language network dynamics in resting state support its flexible functional segregation and integration during language processing. NeuroImage. doi:10.1016/j.neuroimage.2023.120132

Zhang, B., Huang, W., Li, J., Zhao, C., Fan, S., Wu, J., & Liu, C. (2014). Principles, developments and applications of computer vision for external quality inspection of fruits and vegetables: A review. Food Research International, 62, 326-343.

Zheng, T., Song, L., Wang, J., Teng, W., Xu, X., & Ma, C. (2021). Data synthesis using dual discriminator conditional generative adversarial networks for imbalanced fault diagnosis of rolling bearings.