
2020 2nd International Conference on Artificial Intelligence and Advanced Manufacture (AIAM)

Big Data Processing and Application Research

Penglin Gao, Zhaoming Han, Fucheng Wan*
Key Laboratory of China's Ethnic Languages and Intelligent Processing of Gansu Province
Northwest Minzu University
Lanzhou, China
e-mail: [email protected], [email protected], [email protected]

978-1-7281-9986-3/20/$31.00 ©2020 IEEE | DOI: 10.1109/AIAM50918.2020.00031

Abstract—Nowadays, big data has become a constantly extended and widely mentioned term. It can excavate, describe and utilize large amounts of structured, unstructured and semi-structured data to obtain more information. With the rapid increase of data, big data has become more and more diverse, and big data technology has emerged as a consequence. This paper reviews the literature on big data and its related technologies, such as Hadoop and MapReduce. It discusses the life cycle of big data, that is, big data acquisition, preprocessing, storage and analysis. It then expounds representative applications of big data. Finally, based on the above study, this paper summarizes the development of big data.

Keywords—Big Data, Hadoop, Big data analysis, Big data application

I. INTRODUCTION

In recent years, big data has become a key topic of technical discussion among academia and computer practitioners. In the past few decades, companies generally relied on relational data to make decisions and judgments, ignoring the large amount of poorly structured data contained in network information, articles, documents and social media. With the rapid expansion of the computer industry, data is growing explosively and carries a huge amount of useful information that can be excavated. Therefore, more and more companies are entering the field of big data analysis. At present, the webpage data that Baidu processes every day has already reached 10 PB-100 PB. According to a research report by the analysis firm IDC, the amount of data in the world increased from 0.8 ZB (1 ZB = 1000 EB = 1,000,000 PB) in 2009 to 35 ZB in 2020, an increase of 44 times over 10 years, with an average annual growth rate of 40% [1].

The development of big data has been very rapid. Like the Internet, which developed rapidly and became extremely influential, it is a new technological revolution for all industries in society, and its influence is very obvious. The popularization of its related technologies is a breakthrough in science and technology [2].

This article gives the characteristics and definitions of big data in the second section. The third section reviews the big data literature and surveys the related technologies. The fourth section introduces the life cycle of big data. Section 5 expounds and analyzes representative applications of big data. Finally, Section 6 gives a brief summary.

II. BIG DATA CONCEPT

In this section, the concepts and characteristics of big data are given and described as follows.

A. Definition of Big Data

It is reported that Gartner proposed the term "big data" to describe the large amounts of data derived from multiple data sources [3]. Through the continuous development of the times we have entered the information age, and information is naturally an inseparable part of our daily lives. We are exposed to more and more data through different channels, and data assets have become more and more important in this era. If so much data cannot be used, it is useless and worthless information to us, so we need big data technology to process it. Conversely, without the vast data produced by the development of the times, big data technology would not have been born. Big data technology and big data itself can be said to be mutually dependent.

Big data is "a large collection of data", and these data are valuable: for example, the large amounts of user consumption data collected by e-commerce platforms, the user browsing data recorded by search engines, and the national demographic information collected by governments. As the ability to generate data has greatly improved in recent years, billions of pieces of information have flooded around us, and the concept of big data has emerged from this. As our statistical tools and technologies become more powerful, they can process and analyze these billions of records at the same time, without having to randomly select "samples" from the information for feature analysis, so the results are more accurate. Big data has great potential value.

B. Characteristics of Big Data

Big data usually has four characteristics, Volume, Variety, Velocity and Value, which we call the 4Vs. Although big data does not denote a specific amount of data, the term is often used when talking about very large bodies of data, much of which cannot be easily integrated. In general, the four characteristics of big data can be briefly summarized as follows:

• Volume. As time goes by, the storage unit has changed from GB in the past to TB, and even to the current levels of PB and EB. The newly incoming data has become so large that it is difficult to process; that is big data. Therefore, the volume of big data is always very large, but it does not actually require a specific amount to qualify: if the stored old data or the newly incoming data become so large that they are difficult to process, they can be called big data.

• Variety. Variety refers to the number of sources or incoming vectors of the data. A wide range of data sources determines the diversity of big data forms. The data can come from social media, networks, user logs, etc. The diversity of data also means the diversity of databases.

• Velocity. Velocity refers to the speed at which data comes in, as well as the speed at which data is analyzed and processed. In today's big data era, many events need to be analyzed in real time. The generation, analysis and processing of large amounts of data require a large amount of server resources. Therefore, big data requires very high data processing speed.

• Value. The core feature of big data is value. The value of big data lies in the ability to draw useful conclusions from a mass of seemingly unrelated data through analysis and processing.

Big data has four characteristics: Volume, Variety, Velocity and Value. The 4V model of big data is shown in figure 1.

Fig. 1. The 4V model

III. BIG DATA RELATED TECHNOLOGIES

As the understanding of big data becomes clearer, this section introduces the technical aspects, including cloud computing, MapReduce and Hadoop. Big data analysis generally uses cloud computing and requires a Hadoop-like distributed platform, using MapReduce to process the data.

A. Big Data Cloud Computing

Big data analysis is realized through cloud computing. Big data work collects and processes large amounts of unstructured, semi-structured or structured data for analysis and mining. Cloud computing is the integration and virtualization of hardware resources, while big data processing is the accurate and efficient processing of data. The relationship between cloud computing and big data is close and mutually reinforcing. The emergence and development of big data has accelerated the development of cloud computing, while cloud computing provides a platform for storing and analyzing big data. The parallel computing capabilities of cloud computing can improve the efficiency of big data collection and analysis, and distributed storage technology based on cloud computing can effectively manage big data. Cloud computing offers computing resources as a service to support the mining of big data and provides the value information needed for real-time interactive massive data query and analysis.

B. Hadoop/MapReduce

Apache Hadoop is a highly fault-tolerant distributed computing framework that can perform distributed processing of big data. Through Hadoop, the computing resources of a large number of cheap machines can be organized to solve problems of massive data processing that cannot be solved by a single machine. It has high reliability, high scalability, high efficiency and high fault tolerance. The core of Hadoop is the MapReduce computing framework.

Hadoop/MapReduce is a simple-to-use software framework. Applications written with it can run on large clusters composed of thousands of commodity machines and process terabyte-scale data sets in parallel in a reliable, fault-tolerant manner. Hadoop abstracts MapReduce into two phases, the Map phase and the Reduce phase, and each phase uses key/value pairs as the input and output of the process.

The entire working process of MapReduce is shown in figure 2. It contains the following four independent entities:

• Client: used to submit MapReduce jobs.
• JobTracker: used to coordinate the running of the job.
• TaskTracker: used to process the tasks into which the job is divided.
• HDFS: used to share job files among the other entities.

The work proceeds in an orderly manner as follows: job submission, job initialization, task assignment, task execution, progress update, status update and job completion. The process by which Hadoop runs a MapReduce job using the classic framework is shown in figure 2.

Fig. 2. Hadoop runs a MapReduce job using the classic framework
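The two-phase structure described above can be illustrated with a minimal in-memory word-count sketch (plain Python rather than the actual Hadoop Java API; the function names and sample documents are illustrative only): the map phase emits key/value pairs, a shuffle step groups the values by key as the framework does between phases, and the reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) key/value pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the list of values for one key into a single result.
    return (key, sum(values))

documents = ["big data is big", "data needs processing"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}
```

In real Hadoop the shuffle step is performed by the framework between the TaskTrackers, and the map and reduce functions run on different machines; this sketch only shows the key/value contract between the two phases.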

IV. BIG DATA LIFE CYCLE

In this section, we focus on the life cycle of big data from four aspects: big data collection, big data preprocessing, big data storage, and big data processing and analysis.

A. Big Data Collection

Big data collection refers to the collection and storage of data from different places, and there are accordingly countless ways to collect big data. In this section, we choose two common methods to introduce.

Network data collection: with the help of APIs or network tools exposed on the Internet, the data on web pages is collected, unified and structured into local data.

Database collection: tools such as Sqoop and other ETL tools are used to collect data stored in the traditional mainstream relational databases, such as MySQL and Oracle.

B. Big Data Preprocessing

The amount of collected data that needs to be analyzed is very large and messy, and the data is generally incomplete, inconsistent "dirty" data. Analysis and mining therefore cannot be performed on it directly, or the results would not be as expected, so the data must be preprocessed before analysis.

Data cleaning: cleaning tools are used to remove errors and abnormalities in the collected and integrated data and to remove useless information.

Data integration: data scattered across several sources is integrated into a unified data collection through certain logical or physical rules. By integrating interrelated distributed heterogeneous data sources, users can access these sources in a transparent manner, making the data more operable and valuable to those who access it.

Data conversion: data conversion converts or merges data to form a description suitable for data processing. This includes removing noise from the data, summarizing or aggregating the data, using more abstract (higher-level) concepts to replace low-level data objects, and constructing new attributes from existing attribute sets to help the processing.

Data reduction: on the basis of preserving the originality of the data as far as possible, the data volume is simplified to obtain a smaller data set.

C. Big Data Storage

Big data storage refers to the process of using storage media to keep the collected data in the form of a database. Generally, it can be roughly divided into tree file systems and object storage, and object storage has a tendency to gradually replace the traditional tree file system. Object storage supports parallel data structures, and every file has a unique ID. It is easier and more convenient to handle a large number of objects in a parallel file system structure than in a vertical file system structure.

D. Big Data Analysis

The core part of the big data life cycle is data analysis. Big data is commonly subjected to visual analysis, predictive analysis, data mining, text analysis and statistical analysis to extract and refine information. In this section, we focus on visual analysis, data mining and predictive analysis.

Visual analysis: the graphical display of data, an analysis means to convey and communicate information clearly and effectively [4]. It is mainly used in massive data association analysis, that is, performing association analysis on dispersed heterogeneous data with the aid of a visual data analysis platform and producing a complete analysis chart.

Data mining: data mining refers to the technical means of finding potential connection patterns between data through data mining algorithms, creating data mining models, and extracting effective information from the data.

Predictive analysis: predictive analysis combines a variety of advanced analysis functions, such as data mining, machine learning and text analysis, to achieve the purpose of predicting uncertain events.
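The cleaning and conversion steps described above can be sketched as a small pipeline (plain Python; the record layout and field names here are hypothetical, not from the paper): cleaning drops incomplete or malformed records, and conversion replaces a low-level numeric value with a higher-level attribute.

```python
raw_records = [
    {"user": "u1", "spend": "120.5"},
    {"user": "u2", "spend": None},     # incomplete record -> cleaned out
    {"user": "u3", "spend": "abc"},    # malformed value   -> cleaned out
    {"user": "u4", "spend": "80"},
]

def clean(records):
    # Data cleaning: keep only records whose fields are present and parseable.
    out = []
    for r in records:
        try:
            out.append({"user": r["user"], "spend": float(r["spend"])})
        except (TypeError, ValueError, KeyError):
            continue
    return out

def convert(records):
    # Data conversion: derive a more abstract attribute from the raw value.
    for r in records:
        r["tier"] = "high" if r["spend"] >= 100 else "low"
    return records

processed = convert(clean(raw_records))
print(processed)
# [{'user': 'u1', 'spend': 120.5, 'tier': 'high'},
#  {'user': 'u4', 'spend': 80.0, 'tier': 'low'}]
```

In practice these steps run over billions of records on a distributed platform such as Hadoop, but the logical operations per record are the same as in this sketch.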

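The contrast between a tree file system and object storage can be illustrated with a toy flat store (plain Python; the class and method names are invented for illustration): each object is addressed by a globally unique ID rather than by a position in a directory tree, which is what makes parallel access to large numbers of objects straightforward.

```python
import uuid

class ObjectStore:
    """Toy flat object store: objects are addressed by unique ID, not by path."""

    def __init__(self):
        self._objects = {}

    def put(self, data, metadata=None):
        # Assign a globally unique ID instead of a place in a directory tree.
        object_id = str(uuid.uuid4())
        self._objects[object_id] = {"data": data, "metadata": metadata or {}}
        return object_id

    def get(self, object_id):
        return self._objects[object_id]["data"]

store = ObjectStore()
oid = store.put(b"sensor reading 42", metadata={"source": "device-7"})
print(store.get(oid))  # b'sensor reading 42'
```

Because every object is independent and identified by its ID, many such put/get operations can proceed in parallel without coordinating over a shared directory hierarchy, which is the property the section attributes to object storage.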

V. BIG DATA APPLICATIONS

There is no doubt that big data is slowly changing our lives, and big data technology has begun to be widely used in various fields. Major e-commerce platforms such as Taobao and Jingdong use big data to analyze user information and deliver targeted advertisements to stimulate consumption. "Smart cities" have been implemented in many places [5]: through big data, government departments can perceive the development and changing needs of society and provide citizens with public services and resource allocation in a more scientific, precise and rational manner. In medical treatment, big data assists clinical decision-making, standardizes diagnosis and treatment paths, and improves efficiency through clinical data comparison, real-time statistical analysis, remote patient data analysis and medical behavior analysis.

VI. CONCLUSION

After 15 years of development, big data technology has gradually matured and is widely used in various fields. In recent years, the development of cloud computing, artificial intelligence and other technologies, changes in the underlying chips and storage devices, and the popularity of video applications have all brought new requirements to big data technology. In the future, big data technology will continue to develop along the directions of heterogeneous computing, batch-stream integration, cloudification, AI compatibility, in-memory computing, and so on. The maturity of 5G and IoT applications will bring massive amounts of video and IoT data, and as data volumes increase, big data will slowly change our lives. In the era of big data, everything can be quantified and everything can be analyzed.

ACKNOWLEDGMENT

During this period, my teacher's (corresponding author: Fucheng Wan) encouragement gave me the motivation to move forward. The mixed feeling is like that of every experiment that finally yields a result: uncontrollable and full of expectation.

REFERENCES

[1] Fucheng Wan, "Extracting Algorithm for the Optimum Solution Answer Oriented Towards the Restricted Domain," Quarterly Journal of Indian Pulp and Paper Technical Association, vol. 12, 2018.
[2] P. Russom, "Big data analytics," TDWI Best Practices Rep., vol. 19, no. 4, pp. 1-34, 2011.
[5] Wu Youzheng, Zhao Jun, Duan Xiangyu, et al., "Survey of question-and-answer retrieval technology and evaluation research," Chinese Information Journal, vol. 19, no. 3, pp. 1-13, 2005.
[3] Guo Zhimao, Zhou Aoying, "Overview of data quality and data cleaning research," Journal of Software, vol. 13, no. 11, pp. 2076-2082, 2002.
[4] Xu Zongben, Zhang Wei, Liu Lei, et al., "The scientific principles and development prospects of data science and big data," Technology for Development, vol. 10, no. 1, pp. 66-75, 2014.
[5] D. J. Watts, Six Degrees: The Science of a Connected Age. New York: W. W. Norton, 2004.
