AI-Driven Feature Extraction and Classification Algorithm for Large-Scale High-Dimensional Data
Puwen An1, Yiding Huang1, Chengkuan Liu1, Xianyue Chen1, Laizhuo Xiang2
1 Dalian University of Technology, Dalian, Liaoning, 116024 China
2 China Railway Guangzhou Group Co., Ltd. Guangzhou EMU Depot, Guangzhou, Guangdong, 510088 China
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract—In this paper, an AI-driven feature extraction and classification algorithm is studied to address the challenges of large-scale high-dimensional data processing. The proposed algorithm uses a self-attention mechanism and a Transformer network to capture the dependencies among the internal features of each sample in the feature extraction stage, and realizes end-to-end classification in the classification stage. Experimental validation on the MNIST dataset substantiates that the introduced algorithm surpasses conventional techniques across varying numbers of self-attention heads, exhibiting enhanced generalization capability. Furthermore, an assessment of the algorithm's efficacy across diverse data volumes reveals a consistent improvement in performance as the training dataset grows. The findings offer new perspectives and methodologies for tackling feature extraction and classification challenges within extensive high-dimensional datasets, contributing to both theoretical foundations and practical applications.

Keywords—AI, feature extraction and classification, large-scale high-dimensional data

I. INTRODUCTION

With the rapid development of information technology and the diversification of data generation methods, large-scale high-dimensional data has become an important resource in all fields of society. These data arise in many forms, from user behavior data on the Internet to remote sensing data in scientific research, and their high dimensionality and large scale bring unprecedented challenges to traditional data processing and analysis. In this context, how to extract effective features from large-scale high-dimensional data and classify them accurately has become a focus of both academia and industry.

Traditional methods face many challenges when dealing with large-scale high-dimensional data. One of the most prominent is the curse of dimensionality: when the dimensionality of the data becomes very high, traditional feature extraction, feature selection, and classification algorithms often cannot cope effectively [1-2]. In addition, owing to the complexity and diversity of the data, traditional algorithms are limited in their ability to extract latent features and classify them accurately. The rapid development of AI in recent years, however, provides new opportunities to solve these problems. As a branch of AI, deep learning has shown great potential in processing large-scale high-dimensional data [3]. By constructing deep neural networks, especially models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and autoencoders, more abstract, high-level feature representations can be learned directly from data, thereby achieving accurate classification [4-6]. In addition, emerging methods such as reinforcement learning, transfer learning, and meta-learning have provided new ideas and technical means for processing large-scale high-dimensional data.

This paper discusses the research status, methods, and applications of AI-driven feature extraction and classification algorithms for large-scale high-dimensional data. Through an in-depth discussion of the related theories, technologies, and practices, it reveals the advantages and limitations of AI in processing large-scale high-dimensional data and looks forward to future research directions and application prospects, so as to contribute to the development of this field.

II. OVERVIEW OF LARGE-SCALE HIGH-DIMENSIONAL DATA

With the continuous progress of science and technology and the rising level of informatization, large-scale high-dimensional data has become the norm in all fields of today's society. Because of its large scale and high dimensionality, such data poses unprecedented challenges to data processing and analysis.

Large-scale high-dimensional data usually refers to data sets containing a large number of samples with high-dimensional features. Here, "large-scale" refers to the huge size of the data set, which may contain billions of samples or more; "high-dimensional" means that the feature dimension of each sample is very high, possibly comprising hundreds or even thousands of features.

In biomedicine, fields such as genomics, proteomics, and neuroscience generate large volumes of high-dimensional data, including gene expression data, protein interaction network data, and brain imaging data. On the Internet and in social media, user behavior data, social network data, and text data also exhibit large-scale, high-dimensional characteristics [7]. In engineering and scientific research, remote sensing data, meteorological data, and seismic data are often large-scale and high-dimensional as well.

Large-scale high-dimensional data is therefore applied across many fields: data analysis and pattern recognition in biomedicine, genomics, proteomics, and medical imaging; user behavior analysis, recommendation systems, and public opinion analysis on the Internet and in social media [8]; and data processing and pattern recognition in engineering and scientific research, including remote sensing data analysis, meteorological prediction, and geological exploration.
Fig. 1. Feature extractor based on self-attention mechanism
Assume that the input data is a feature-vector sequence X = (x1, x2, ..., xn) of dimension d, where n is the length of the sample sequence. The self-attention representation is computed as

Z = \mathrm{softmax}\left( \frac{X W_Q (X W_K)^{T}}{\sqrt{d}} \right) X W_V        (1)

where W_Q, W_K, and W_V are the linear transformation matrices for the query, key, and value, respectively. Through the self-attention mechanism, a new feature representation Z is obtained in which the feature vector at each position takes the information of all positions in the sequence into account.

After the representation sequence Z is obtained in the feature extraction stage, it is fed into the Transformer network for classification (Figure 2). In this study, a Transformer classifier is designed whose structure is similar to the standard Transformer model, except that the output layer uses a softmax function for classification.
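The paper provides no reference implementation; the following is a minimal PyTorch sketch of Eq. (1), assuming that W_Q, W_K, and W_V are learned bias-free linear layers. The class name SelfAttentionExtractor and the tensor shapes are illustrative, not the authors' code.

```python
import math
import torch
import torch.nn as nn

class SelfAttentionExtractor(nn.Module):
    """Minimal sketch of Eq. (1): Z = softmax(XW_Q (XW_K)^T / sqrt(d)) XW_V.

    Assumes the projections W_Q, W_K, W_V are learned bias-free linear
    maps; the paper does not specify bias terms, so none are used here.
    """

    def __init__(self, d: int):
        super().__init__()
        self.d = d
        self.W_Q = nn.Linear(d, d, bias=False)
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (batch, n, d) -- a length-n sequence of d-dimensional features.
        Q, K, V = self.W_Q(X), self.W_K(X), self.W_V(X)
        # Scaled dot-product attention weights over all sequence positions.
        A = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        # Each output position is a weighted mix of every position's value.
        return A @ V  # Z: (batch, n, d)
```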
The calculation process of the Transformer network is as follows:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V        (2)

\mathrm{MultiHead} = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\, W^{O}        (3)

\mathrm{Transformer}(X) = \mathrm{LayerNorm}\big( X + \mathrm{MultiHead}(X W_Q, X W_K, X W_V) \big)        (4)
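Eqs. (2)-(4) can be realized with PyTorch's built-in nn.MultiheadAttention, which internally performs the per-head attention of Eq. (2) and the concatenation with W^O of Eq. (3). A sketch follows; the residual connection X + MultiHead(...) inside Eq. (4) follows the standard Transformer and is an assumption here, since the extracted formula does not show the operator.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Sketch of Eqs. (2)-(4): multi-head self-attention plus LayerNorm."""

    def __init__(self, d: int, h: int = 4):
        super().__init__()
        # embed_dim must be divisible by num_heads (e.g., d=256, h=4).
        self.mha = nn.MultiheadAttention(embed_dim=d, num_heads=h)
        self.norm = nn.LayerNorm(d)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # nn.MultiheadAttention in PyTorch 1.8 expects (n, batch, d) inputs.
        Xs = X.transpose(0, 1)
        # Self-attention: queries, keys and values all come from X;
        # per-head attention (Eq. (2)) and W^O projection (Eq. (3)) happen inside.
        attn_out, _ = self.mha(Xs, Xs, Xs)
        # Eq. (4): LayerNorm over the (assumed) residual sum.
        return self.norm(X + attn_out.transpose(0, 1))
```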
\mathrm{Output}(X) = \mathrm{softmax}(X W^{O})        (5)

Here, Q, K, and V are the sequences obtained by linearly transforming the query, key, and value, respectively; W^O, W_Q, W_K, and W_V are linear transformation matrices; h is the number of attention heads; d_k is the dimension of each head; and Concat denotes the concatenation of the multi-head attention results.

Through the Transformer network, end-to-end feature extraction and classification can be realized without manually designed features or classifiers, which makes the algorithm more automated and intelligent.
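Putting the pieces together, a hypothetical end-to-end classifier, reusing the two sketches above, might look as follows. The mean-pooling step and the class name are assumptions; the paper does not specify how the sequence representation is reduced to a single prediction before the softmax output of Eq. (5).

```python
import torch
import torch.nn as nn

class TransformerClassifier(nn.Module):
    """End-to-end sketch: Eq. (1) extractor, Eqs. (2)-(4) block,
    and the softmax output layer of Eq. (5)."""

    def __init__(self, d: int = 256, h: int = 4, num_classes: int = 10):
        super().__init__()
        self.extractor = SelfAttentionExtractor(d)  # Eq. (1)
        self.block = TransformerBlock(d, h)         # Eqs. (2)-(4)
        self.W_O = nn.Linear(d, num_classes, bias=False)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        Z = self.block(self.extractor(X))  # (batch, n, d)
        z = Z.mean(dim=1)                  # assumed mean-pooling over positions
        # Eq. (5); for training with nn.CrossEntropyLoss one would instead
        # return the logits self.W_O(z) and let the loss apply log-softmax.
        return torch.softmax(self.W_O(z), dim=-1)
```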
V. EXPERIMENT AND RESULT ANALYSIS

To assess the efficacy of the proposed AI-based feature extraction and classification approach, a series of experiments is conducted and the outcomes are examined in detail. The benchmark MNIST dataset of handwritten digits serves as the testbed, providing 60,000 training images and 10,000 test images, each a 28x28-pixel grayscale image. The proposed algorithm is compared against conventional CNN and logistic regression techniques on standard performance metrics.

Experimental environment: CPU: Intel Core i7-8700K @ 3.70 GHz; GPU: NVIDIA GeForce RTX 2080 Ti; memory: 32 GB DDR4; storage: SSD. Software environment: operating system: Ubuntu 20.04 LTS; deep learning framework: PyTorch 1.8.1; Python version: 3.8.5.

Images are normalized so that their pixel intensities lie in the range 0 to 1. The self-attention mechanism is configured with four heads, each with a hidden dimensionality of 256. The batch size is set to 64 and the learning rate to 0.001, with the Adam optimizer used for training. Under the specified computational environment, training takes roughly 30 minutes to an hour, depending on the model's complexity and the size of the dataset.
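A minimal sketch of this training setup under the stated hyperparameters (four heads, hidden size 256, batch size 64, learning rate 0.001, Adam). Treating each 28x28 image as a sequence of 28 row vectors lifted to the hidden size via a learned embedding is an additional assumption, as is the epoch count; the paper does not specify either.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# transforms.ToTensor() already scales MNIST pixels into [0, 1],
# matching the stated preprocessing.
train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

embed = torch.nn.Linear(28, 256)   # assumed: lift each 28-pixel row to d = 256
model = TransformerClassifier(d=256, h=4, num_classes=10)
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(embed.parameters()), lr=0.001)
loss_fn = torch.nn.NLLLoss()       # model outputs probabilities, so use NLL

for epoch in range(10):            # epoch count is an assumption
    for images, labels in train_loader:
        X = embed(images.squeeze(1))   # (64, 1, 28, 28) -> (64, 28, 256)
        probs = model(X)               # softmax outputs, Eq. (5)
        loss = loss_fn(torch.log(probs + 1e-9), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```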
Figure 3 illustrates how the algorithm's performance varies with the number of self-attention heads. As the number of heads increases, classification accuracy first improves and then declines, with the best performance and peak accuracy attained at four heads. This shows that, in the current design, choosing an appropriate number of self-attention heads yields the best performance while avoiding excessive model complexity and computational overhead.
With too few self-attention heads, the algorithm may fail to fully capture the intricate relationships and subtle attributes within the data, leading to reduced classification accuracy. As more heads are added, the algorithm exploits the information in the dataset more thoroughly, improving its classification precision. An excessive number of heads, however, introduces superfluous parameters and raises model complexity and computational cost, ultimately detracting from performance.
The proposed algorithm is then compared with the conventional CNN and logistic regression approaches. Empirical findings indicate that it substantially outperforms these traditional methods on the MNIST dataset. Table I summarizes the comparative results.
TABLE I. PERFORMANCE COMPARISON OF ALGORITHMS
Model                      | Accuracy | Precision | Recall | F1 score
Traditional CNN            | 0.98     | 0.98      | 0.98   | 0.98
Logistic regression        | 0.92     | 0.92      | 0.92   | 0.92
The proposed AI algorithm  | 0.99     | 0.99      | 0.99   | 0.99
Observations indicate that the conventional CNN performs well on the MNIST dataset, with accuracy, precision, recall, and F1 score all reaching 0.98. Logistic regression lags behind, scoring approximately 0.92 on the same measures. The proposed AI algorithm surpasses both across all criteria, achieving accuracy, precision, recall, and F1 score of 0.99 each, a notable improvement over the established methods.

Figure 4 shows how the generalization ability of the algorithm changes with the scale of the training data: as the training set grows, classification accuracy rises gradually.
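For reference, the four metrics reported in Table I can be computed with scikit-learn as sketched below. Macro-averaging over the ten digit classes is an assumption, since the paper does not state the averaging scheme.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize(y_true, y_pred):
    """Return the four metrics reported in Table I for one model."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"Accuracy": acc, "Precision": prec, "Recall": rec, "F1 score": f1}
```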
When the volume of training data is limited, the algorithm may fail to adequately learn the traits and patterns of the samples and is prone to over-fitting, resulting in diminished accuracy on the test set. As the training set grows, the algorithm has access to richer sample information and can better learn the true distribution of the samples, reducing the risk of over-fitting and improving both classification precision and generalization. Beyond a certain scale, however, additional data mainly introduces noise and redundancy and increases model complexity, so the performance gains gradually weaken.

Comparing the experimental results, we find that the algorithm achieves its best performance and highest classification accuracy at a training data scale of 30,000. This shows that, in the current design, choosing an appropriate training data scale yields the best generalization ability, avoiding both under-fitting and over-fitting.
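A plausible protocol for this data-scale experiment, sketched under the assumption that nested random subsets of the training split are drawn with torch.utils.data.Subset (reusing train_set from the training sketch above); the specific scales and seed are illustrative.

```python
import torch
from torch.utils.data import DataLoader, Subset

# Hypothetical protocol: train a fresh model on nested random subsets of
# the MNIST training split of increasing size (the paper reports a peak
# in generalization at a scale of 30,000), evaluating each on the test set.
generator = torch.Generator().manual_seed(0)  # fixed seed for reproducibility
order = torch.randperm(len(train_set), generator=generator)

for scale in (5000, 10000, 30000, 60000):
    loader = DataLoader(Subset(train_set, order[:scale].tolist()),
                        batch_size=64, shuffle=True)
    # ... retrain the model on `loader`, then measure test-set accuracy ...
```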
VI. CONCLUSION

In this study, an AI-driven feature extraction and classification algorithm is proposed to meet the challenges of large-scale high-dimensional data processing. By introducing a self-attention mechanism and a Transformer network, an algorithm is designed that effectively captures the dependencies among the internal features of each sample in the feature extraction stage and realizes end-to-end classification in the classification stage. Empirical validation on the MNIST dataset corroborates that the proposed algorithm significantly outperforms conventional methods across varying numbers of self-attention heads, demonstrating superior generalization capability.
Furthermore, the investigation assesses the algorithm's performance under diverse data scales, revealing a consistent improvement in performance as the training data grows. In summary, our findings offer an innovative conceptual framework and solution for tackling feature extraction and classification challenges in large-scale, high-dimensional datasets, bearing substantial theoretical and applied relevance. Future work may refine and extend this algorithm to foster adaptability across a wider array of application contexts.

REFERENCES

[1] Wei, T., Liu, W. L., Zhong, J., & Gong, Y. J. (2020). Multiclass classification on high dimension and low sample size data using genetic programming. IEEE Transactions on Emerging Topics in Computing, 2020(99), 1-1.
[2] Shi, X., Qin, P., Zhu, J., Zhai, M., & Shi, W. (2020). Feature extraction and classification of lower limb motion based on sEMG signals. IEEE Access, 2020(99), 1-1.
[3] Hammad, M., Zhang, S., & Wang, K. (2019). A novel two-dimensional ECG feature extraction and classification algorithm based on convolution neural network for human authentication. Future Generation Computer Systems, 101(Dec.), 180-196.
[4] Zhang, J., Mei, K., Zheng, Y., & Fan, J. (2019). Exploiting mid-level semantics for large-scale complex video classification. IEEE Transactions on Multimedia, 2019(10), 21.
[5] Xue, Y., Zhao, Y., & Slowik, A. (2020). Classification based on brain storm optimization with feature selection. IEEE Access, 2020(99), 1-1.
[6] Zhou, H., Yu, K. M., Chen, Y. C., & Hsu, H. P. (2021). A hybrid feature selection method RFSTL for manufacturing quality prediction based on a high dimensional imbalanced dataset. IEEE Access, 2021(99), 1-1.
[7] Bai, Y., Zhang, Q., Lu, Z., & Zhang, Y. (2019). SSDC-DenseNet: a cost-effective end-to-end spectral-spatial dual-channel dense network for hyperspectral image classification. IEEE Access, 2019(99), 1-1.
[8] Alsenan, S. A., Alturaiki, I. M., & Hafez, A. M. (2020). Feature extraction methods in quantitative structure-activity relationship modeling: a comparative study. IEEE Access, 2020(99), 1-1.
[9] Gao, Y., Zhang, G., Zhang, C., Wang, J., & Zhao, Y. (2021). Federated tensor decomposition-based feature extraction approach for industrial IoT. IEEE Transactions on Industrial Informatics, 2021(99), 1-1.
[10] Wickramasinghe, C. S., Marino, D. L., & Manic, M. (2021). ResNet autoencoders for unsupervised feature learning from high-dimensional data: deep models resistant to performance degradation. IEEE Access, 2021(99), 1-1.
[11] Li, P., He, X., Cheng, X., Gao, X., & Li, Z. (2019). Object extraction from very high-resolution images using a convolutional neural network based on a noisy large-scale dataset. IEEE Access, 2019(99), 1-1.
[12] Huang, W., Yue, B., Chi, Q., & Liang, J. (2019). Integrating data-driven segmentation, local feature extraction and Fisher kernel encoding to improve time series classification. Neural Processing Letters, 49(1), 43-66.
[13] Gao, Y., Zhong, P., Tang, X., Hu, H., & Xu, P. (2021). Feature extraction of laser welding pool image and application in welding quality identification. IEEE Access, 2021(99), 1-1.