
ARTIFICIAL INTELLIGENCE

BIG DATA: LEARNING WITH MILLIONS OF DATA



INDEX

1. INTRODUCTION
2. DATA LIFE CYCLE
3. INFORMATION COLLECTION
3.1 Types of Data
3.2 Collection System
4. INFORMATION STORAGE
4.1 Text files
4.2 Databases
4.3 HDFS
5. INFORMATION ANALYSIS
5.1 Parallel Programming through MapReduce
5.2 Parallel computing in the cloud
5.3 Data Analysis
6. INFORMATION EXPLOITATION
7. CONCLUSIONS
8. REFERENCES


1. INTRODUCTION
In the last decade, there has been a revolution in the way human beings interact with technology due to the emergence of social networks and smartphones. Social networks have changed the way humans interact with each other by giving us the ability to create different types of content and distribute it easily. This led to the massive creation of information in the form of texts, comments and images that fed the contents of this type of application and was stored in the data centers of the large companies that own the applications. The use of these applications was boosted by the appearance of smartphones, which allowed a phone to be used in a similar way to a computer, exponentially increasing the use of such applications. In addition, this new type of device not only boosted the use of social networks; it also gave rise to hundreds of new types of applications that exploited the features it offered, such as GPS positioning, which made it possible to offer functionality based on the location of the device. Companies began to store huge amounts of information in order to develop techniques to analyze it, increase the functionality they could offer their users and, ultimately, increase their profits.

The use of all this information has been one of the biggest technological challenges for organizations, both technology giants and small businesses, in finding a pragmatic approach to the capture, analysis and use of information about their customers, products and services. At the beginning, interaction with clients was simple, since it involved offering a simple service, often at a local level in their own markets (countries). As technology began to evolve exponentially, markets became global with the disappearance of local barriers, so companies needed to create some kind of competitive advantage in order to keep their customers and expand their services to a new global market in which they had to compete with similar applications and services created in different local markets. In order to offer new functionalities and services, companies began to look for ways to efficiently exploit existing data, improve the processes for collecting this data and build systems that could use all this information to deliver these new functionalities. Figure 1 presents an estimate of the data generated by the most used applications in 2018 in just 60 seconds.


Figure 1. Amount of data generated per minute by the most used apps on the Internet.
Copyright © - @LoriLewis and @OfficiallyChadd

As can be seen in Figure 1, the amount of information generated in only 60 seconds is very high. In one minute, 187 million emails are sent or received, 38 million messages are sent on WhatsApp and 3.7 million searches are made in the Google search engine. This involves the generation of a large amount of information that must be stored, processed, analyzed and used in order to extract value from the data.


This process can be considered the life cycle of data and begins with so-called "Big Data", which describes the large volume of data, both structured and unstructured, created daily by today's society. Once these data are generated, they must be stored, processed, analyzed and used to complete the life cycle of the data and extract some kind of value from them. Based on these definitions, we can define a series of basic characteristics of the concept of "Big Data". When the term "Big Data" was coined, three basic characteristics or magnitudes (Volume, Velocity and Variety), known as the 3Vs, were defined to describe the meaning of this term, but in our opinion it is necessary to include at least a fourth magnitude concerning the Truthfulness of the information we are using.

▪ Volume: Volume refers to the amount of data that is generated every second. It is considered the most important characteristic associated with the concept of "Big Data", as it refers to the massive amounts of information that are generated and stored so that they can be analyzed to obtain some kind of value.
▪ Velocity: Velocity refers to the speed at which information is created, stored and processed in real time. There are currently many processes in which time is a fundamental element, such as fraud detection in banking transactions or the monitoring of certain events on social networks.
▪ Variety: Variety refers to the format, type and sources of information. Data currently comes in many formats, from the data stored in databases to any type of text document, emails, sensor data, audio, video, images, publications in our social media profiles, blog articles, the click sequences we make when browsing a particular website, etc., all of which must be stored and processed in a similar way regardless of format, type and source.
▪ Truthfulness: Truthfulness refers to the degree of uncertainty of the data, that is, the degree of reliability of the information that has been generated or obtained. Millions of data points are currently generated per second, but not all have the same degree of truthfulness. The values generated by a perfectly calibrated sensor are not the same as the tweets written by a person trying to propagate false information on social networks, so it is very important to define systems that allow us to identify the reliability of the information.


2. DATA LIFE CYCLE


Data is the most important element of the concept of "Big Data", but in our opinion this concept goes further than the simple generation of data in a massive way. "Big Data" can be described as a process of collecting, storing and subsequently analyzing and exploiting data at a massive level in order to extract value. This process can be identified as the life cycle of the data, which can be described using the diagram presented in Figure 2. Each phase of this process is described in detail throughout this document.

Collection → Storage and processing → Analysis → Exploitation (a cyclical process)

Figure 2. Data Life Cycle


3. INFORMATION COLLECTION
The process of data collection is the first phase of the data life cycle and probably the most important of all, since the quality and quantity of the data collected in this phase determine whether the other phases can be applied correctly. The different types of data and the most common collection systems are described in the following sections.

3.1 Types of Data

Information obtained through different types of systems and/or devices can be classified in different ways based on how it is stored, that is, whether the data is stored according to some type of structure or precise meaning that allows its content to be identified accurately. In other words, the separation between data types is partly based on their meaning or labeling.

3.1.1. Structured Data

The concept of structured data refers to data sets that have a defined structure, format and length, the most common format of this data type being an alphanumeric character string, such as a customer's full name or billing address. That is, the precise meaning of each piece of data has been correctly defined by a human and always keeps a precise structure. For example, the profile information of the users of a website is always composed of a certain number of fields, which are always labeled with a certain meaning. This type of data has been stored by companies since the first storage systems appeared and is commonly stored in some type of database, normally of a relational type. It is estimated that this type of data corresponds to 20% of the total data currently stored. Depending on the process used to generate it, two types of structured data can be distinguished.


Data Generated by Computers

Computer-generated data are those structured data generated by a machine without any human
intervention. Examples of such structured data include:

▪ Operating or log data: these are the different operating or activity data generated by services, applications, networks, etc. This type of information is stored in log files that are usually kept locally on the devices where the services or applications are executed. Log information usually amounts to huge volumes, which is often very costly for organizations to retain, so it is usually deleted after a period of time. Nevertheless, it is very useful information that can be used to identify security breaches or execution errors (a parsing sketch is shown after Figure 3).
▪ Sensor data: data generated by different types of sensors (accelerometers, gyroscopes, radio frequency identification systems, global positioning systems, etc.), which usually include information about the type of sensor and the measurements it has obtained. For example, one of the most popular sensors currently in use is the radio frequency identification device (RFID)[1], which allows information to be stored through the use of labels or cards containing RFID transponders. Figure 3 presents an example of an RFID tag, which consists of a small microchip that stores the device's information and a transmitter antenna through which that information is read. In addition, this type of label can include other types of sensors to collect more information.

Components: microchip (stores the product's information), contacts, capacitor and transmission antenna (low-frequency antennas transmit radio waves up to 2 meters; medium- and high-frequency antennas, up to 100 meters).
Figure 3. Example of an RFID tag
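As a small illustration of how this kind of computer-generated record can be turned into structured fields, the following is a minimal Python sketch using the standard re module; the log line and its format are invented for the example.

```python
import re

# Hypothetical web-server log line; the format is invented for this example.
log_line = '192.168.0.1 - [12/Mar/2018:10:05:03] "GET /index.html" 200'

# Named groups label each field so the record gains a precise structure.
pattern = r'(?P<ip>\S+) - \[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]+)" (?P<status>\d+)'

match = re.match(pattern, log_line)
if match:
    record = match.groupdict()   # {'ip': ..., 'timestamp': ..., 'request': ..., 'status': ...}
    print(record)
```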


Data Generated by Humans

Human-generated data are data generated through the use of some type of device and usually correspond to user-specific data, such as personal data, banking data, etc. Examples of such structured data include:

▪ Input data: this type of data corresponds to data entered through some type of interface. The most typical examples are the data collected through the different forms of mobile or web applications.
▪ Click-through data: click-through data is generated every time a user clicks a link on a website. For many years it has been used to identify user behavior when interacting with applications and to apply certain changes in order to attract more users or change the way applications work.
▪ Game play data: this type of data corresponds to the actions performed by players in online video games, collected in order to understand how humans play and to generate more complex behaviors for the agents of the game.

3.1.2. Unstructured Data

The concept of unstructured data refers to data sets that do not follow any specific format but have some organization due to their storage format or collection process. The most common unstructured data are images, which can represent any type of content but are identified only by their format. This type of data can be stored in different ways, from a database to a set of files in a directory. It is estimated that this type of data corresponds to 80% of the total data currently stored. Many people believe that the term unstructured data is used erroneously, because the documents used to generate this type of data often do have a structure. For example, a text document is considered unstructured data, but it has a well-defined structure: it is divided into sections that in turn contain paragraphs made up of words that contain letters. However, we do not really know what those sections, paragraphs or words mean. In other words, we do not know whether they correspond to a song, a dialogue, a story, a comment or anything else, which makes them unstructured data with no precise meaning. They are the most common data and the most difficult to analyze, as they are general data that normally do not have any classification or labeling. Like structured data, they can be classified into two groups depending on whether they are generated by a machine or a human.


Data Generated by Computers

Computer-generated data are those data of an unstructured type generated by a machine without
any human intervention. Examples of such unstructured data include:

▪ High resolution images: this type of data corresponds to high-resolution imagery such as that from meteorological satellites or military surveillance satellites. These systems generate hundreds of millions of images, each of which contains different information that is difficult to extract and has usually been analyzed manually by humans.
▪ Scientific data: this type of data corresponds to data on physical quantities collected directly by some device, such as atmospheric data, high energy physics data, etc.
▪ Low resolution images: this type of image corresponds to information obtained through traditional security systems, such as video surveillance cameras that capture images and videos.
▪ Radar or sonar data: this corresponds to information collected by vehicle-mounted radar or sonar systems rather than by any specific application, such as meteorological, oceanographic and seismic data.

Data Generated by Humans

Human-generated data are those unstructured data generated by humans; this information is not generated through any specific interface. Examples of such unstructured data include:

▪ Text data: this type of data corresponds to all the information contained in physical and/or electronic documents within an organization, for example, all paper documentation prior to computerization, surveys, legal documents, emails, etc. This type of information is not only stored in digital format but also in physical format, which is leading many companies to digitize all this information using automatic or semi-automatic systems.
▪ Application content: this data corresponds to all the information generated by the different applications that does not have any structure. Most of it corresponds to the information generated on social networks and includes different types of format such as images, plain text, emojis, video, etc.
▪ Mobile data: this type of data corresponds to information produced by mobile devices, such as telephone calls (when these are recorded for some reason), text messages (SMS) and the location information of mobile devices obtained by triangulation from telephone antennas.


3.1.3. Semi-structured Data

Semi-structured data are data that cannot be classified in either of the above groups because they have some kind of defined structure, but one that is variable. For example, the structure of a web page is formed by tags based on the HTML language that have a precise meaning, but their number, order and content vary from page to page, so although all of them are HTML data, they have a similar but variable structure depending on their content. The most common formats used to define semi-structured data are:
▪ The HyperText Markup Language (HTML)[2][3] is a language for defining the basic structure of a web page and is used to define the content of a website. This language is combined with two other languages that describe the appearance/presentation of a web page (CSS) and its functionality (JavaScript). HTML is based on a series of tags (<head>, <title>, <body>, <header>, <article>, <section>, <p>, <div>, <span>, <img>, etc.) that allow the contents of the website to be marked or labeled so that they are displayed in a specific way in the web browser.
▪ The Extensible Markup Language (XML)[4] is a general purpose markup language that uses tags arranged hierarchically to identify the structure and meaning of information, where there is no general set of tags. That is, the tags used to represent the information contained in an XML file are defined by the creators of the files. This type of format allows the creation of any tag-based language and is considered one of the main languages used to share general information between different web applications. Some examples are XHTML, MathML, XSLT, RSS and RDF. Figure 4 shows an example of an XML file.
▪ The JavaScript Object Notation (JSON)[5] is a language based on the literal notation of objects used by the JavaScript scripting language. Due to its simplicity it has become the current standard for information transfer, surpassing XML, in part because it can be parsed using the JavaScript eval function present in almost all web browsers, simplifying the syntax analysis process required by any language. Figure 4 shows an example of a JSON file (a small sketch follows Figure 4).


Figure 4. Example of the same information represented by different types of languages.
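Since the figure itself is not reproduced here, the following is a minimal sketch of the same idea using Python's standard json module; the product record is invented for the example.

```python
import json

# A hypothetical record expressed as a semi-structured JSON document:
# every attribute is labeled, but the set of attributes may vary between documents.
document = """
{
  "product": {
    "id": "A-001",
    "name": "RFID tag",
    "price": 0.35,
    "tags": ["sensor", "identification"]
  }
}
"""

record = json.loads(document)         # parse the JSON text into Python objects
print(record["product"]["name"])      # -> RFID tag
print(json.dumps(record, indent=2))   # serialize it back to JSON text
```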

3.2 Collection System

In general, the information collection process is considered the most important and complex phase of the data life cycle. In addition, many of the applications currently being developed have a pressing need for information in order to function correctly, which further increases the importance of the collection process. There are multiple ways to collect information for our applications, but in this topic we will only discuss the most common ones.

3.2.1. Traditional Methods

The traditional methods of information collection are those used in a general way and correspond to the use of log files, forms or basic functionalities for collecting information. This is the most common approach and usually works perfectly in applications that do not need the data in order to function properly; once the data have been collected, they can be used to improve the user experience or to offer new functionalities.


3.2.2. Web Scraping

Web scraping[6] is a technique for obtaining information by extracting content from web pages. This technique consists of simulating the browsing process of a human on a web page by means of an automatic system, a so-called robot, which downloads all the content of each web page and then uses the links within the content to load new pages. It can be considered a recursive process within each domain that starts with the main page (home) and navigates through the internal links of the web page, in a similar way to how a search algorithm would work, where the root node would be the home page and the successor nodes would be the links of each page. The web scraping process is closely related to the content indexing system used by search engines, which perform a crawling process using robots, so-called spiders, that collect information about the links presented on websites.

Link list → Program (scraper) → Content → Database

Figure 5. Basic operation of a web scraper.

The web scraping process does not normally perform only indexing; it also transforms the content of the website into structured data that can be stored and subsequently analyzed. Figure 5 presents the basic operation of web scraping, which consists of a program or scraper (robot) deployed on multiple computers, each of which analyzes a set of web pages defined in a link list. This program has a specific processing system that extracts certain information from the website, which is combined into some kind of structured format and stored in some kind of database. These types of systems are very effective and easy to build but may incur illegalities when collecting information from websites protected by some type of copyright or creative rights.
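As an illustration of the process described above, the following is a minimal Python sketch of such a scraper; it assumes the third-party requests and BeautifulSoup (bs4) libraries are installed, and the start URL and the extracted field are hypothetical.

```python
from collections import deque

import requests
from bs4 import BeautifulSoup

def scrape(start_url, max_pages=10):
    """Crawl one domain breadth-first and turn each page into a small structured record."""
    queue, seen, records = deque([start_url]), {start_url}, []
    while queue and len(records) < max_pages:
        url = queue.popleft()
        html = requests.get(url, timeout=10).text        # download the page content
        soup = BeautifulSoup(html, "html.parser")        # parse the HTML
        title = soup.title.string if soup.title else ""  # extract one field of interest
        records.append({"url": url, "title": title})     # structured record to be stored
        for link in soup.find_all("a", href=True):       # internal links = successor nodes
            href = link["href"]
            if href.startswith(start_url) and href not in seen:
                seen.add(href)
                queue.append(href)
    return records

# Hypothetical usage: print(scrape("https://example.com"))
```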


3.2.3. Message queues

The exponential growth in the use of certain applications caused performance problems when processing and storing the information entered by users, since database insertion services were not able to support the number of requests per second produced by users. In order to solve this problem, message queues were developed. These are asynchronous information communication systems between services (producer and consumer) used in microservice architectures, where there is no main server, in order to avoid possible bottlenecks. These communication systems use two types of microservices: a producer that inserts messages into the queue and a consumer that extracts and/or removes messages from the queue. The messages are thus stored temporarily in the queue until they are consumed and removed from it. Figure 6 shows the basic structure of a message queue.

Producer → Message queue → Consumer

Figure 6. Operation of a message queue

This type of architecture makes it possible to create decoupled collection systems based on low-complexity services, using a temporary buffer for lightweight storage of messages in which some services write information (producers) that is then processed by other services (consumers), which insert this information into the storage structures. This type of communication system is used regularly for the insertion of mass information generated manually or automatically. The two most commonly used message queues are Kafka[7] and RabbitMQ[8].
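A minimal sketch of the producer/consumer pattern, using only Python's standard queue and threading modules as a stand-in for a real broker such as Kafka or RabbitMQ:

```python
import queue
import threading

message_queue = queue.Queue()            # temporary buffer between the two services

def producer(n_messages=5):
    """Simulates a service that inserts messages into the queue."""
    for i in range(n_messages):
        message_queue.put(f"event-{i}")  # e.g. information entered by a user
    message_queue.put(None)              # sentinel: nothing more will be produced

def consumer():
    """Simulates a service that extracts messages and writes them to storage."""
    while True:
        message = message_queue.get()    # blocks until a message is available
        if message is None:
            break
        print(f"storing {message}")      # here the message would be inserted into a database

threading.Thread(target=producer).start()
consumer()
```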


4. INFORMATION STORAGE
Storage of information is the second phase of the data life cycle and consists of storing information in a "physical" manner, that is, on disk, so that it can be handled easily.

4.1 Text files

Text files are the traditional method of storing information; they simulate the "workings of books" but in digital format. The basic structure of a text file consists of a sequence of characters terminated by a single end-of-file mark.

The characters are in turn divided into separate fragments by the line-ending marks of the file, and each of these lines can be considered a basic piece of information. This is the format in which all information is ultimately stored in computers and, in the case of Big Data, it is usually used to store information files and the logs of applications, servers and databases.

4.2 Databases

A database[9][10] is a "repository" that allows large amounts of information to be stored in an organized manner so that it can then be easily accessed.

More formally, a database can be defined as a system composed of a set of data stored on disk that allows direct access to them, together with a set of programs that manipulate that set of data. Depending on how these data are related to each other, two types of databases can be distinguished.

4.2.1. Relational databases

A relational database[11] is a set of structured data between which a series of predefined relationships exist. The data are organized in a set of tables representing the different entities defined in the database, each formed by a set of rows and columns, where each column represents a data type or attribute of the table and each row corresponds to a record with specific values for each of the attributes.


Each of these rows is uniquely identified by a special attribute, called the primary key. In addition, the tables are related to each other through the use of attributes, called foreign keys, that take the value of the primary key of another table.

Entities: Buyer (name, last name, address), Item (name, price), Seller (name, last name), Sent (date sent, address sent) and Invoice (date paid, amount, billing address), linked by relationships such as sells, contains, makes, receives and is related to.

Figure 7. Example of a relational database for the sale and purchase of goods

To represent the structure of each of the tables and to interact with them through operations (queries, stored procedures, triggers, etc.), the Structured Query Language (SQL)[12] is used, which is supported by all relational database engines. This language allows you to define the structure of the tables and their relationships as well as the different insertion, deletion, update and search operations (a minimal sketch follows the list below). Some examples of relational databases are:

▪ MySQL is an open source relational database management system (RDBMS) and is considered the world's most popular database for web development environments.

▪ PostgreSQL is an open source object-relational database management system (ORDBMS). In addition to the basic functionalities of a database management system, it allows stored procedures to be executed in different programming languages.


▪ Oracle is an object-relational database management system (ORDBMS) for use on enterprise servers.

▪ Microsoft SQL Server is a relational database management system developed by Microsoft. It is available in different editions called Express, Web, Standard and Enterprise.

▪ Amazon Aurora is a relational database management system compatible with MySQL and PostgreSQL, developed by Amazon.

▪ MariaDB is a database engine derived from and compatible with MySQL, developed by the original MySQL developers.
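As a minimal sketch of defining a table and operating on it with SQL, the following uses Python's built-in sqlite3 module; the table and the data are invented for the example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")   # throwaway in-memory relational database
cursor = connection.cursor()

# Define the structure of a table: each column has a type and "id" is the primary key.
cursor.execute("CREATE TABLE buyer (id INTEGER PRIMARY KEY, name TEXT, address TEXT)")

# Insertion and search operations expressed in SQL.
cursor.execute("INSERT INTO buyer (name, address) VALUES (?, ?)", ("Ana", "Madrid"))
cursor.execute("SELECT name, address FROM buyer WHERE name = ?", ("Ana",))
print(cursor.fetchall())                   # -> [('Ana', 'Madrid')]

connection.close()
```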

4.2.2. Nonrelational databases


A nonrelational database[13] (Not only SQL, NoSQL) is a set of structured data with flexible schemas, where there is no implicit relationship between the different schemas that represent the data. This type of database began to be used in the late 2000s because of the performance problems of relational databases when the number of stored records became very high. Unlike relational databases, this type of database does not follow a single representation model; there are different types:

▪ Document stores: databases that store semi-structured data in the form of documents, each of which may have a different structure, which makes it possible to store information in a more natural way. Examples of such databases include IBM Lotus Domino, MongoDB and SimpleDB.

▪ Graph databases: databases that store data structured as a graph, where the nodes represent information entities and the edges the relationships that exist between the nodes. This type of structure allows graph theory to be used to traverse the database. Examples of such databases include Neo4j, AllegroGraph, ArangoDB and InfiniteGraph.

▪ Key-value stores: databases that store semi-structured data using a key-value method, where the key is a unique identifier and the value is a set of information. Both keys and values can be anything from a simple string to a complex object. Examples of such databases include DynamoDB, Apache Cassandra and Redis.


In this type of database (NoSQL), the information is usually stored as a JSON document. That is, the information is stored as attributes/properties in a single document identified by a unique key, which allows this type of database to manage large amounts of independent information.
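A minimal sketch of this idea, using a plain Python dictionary as a stand-in for a document store; the user documents and keys are invented for the example.

```python
import json

# Each document is identified by a unique key and may have a different structure.
store = {
    "user:1": {"name": "Ana", "city": "Madrid"},
    "user:2": {"name": "Luis", "interests": ["music", "running"]},
}

def get_document(key):
    """Retrieve a document by its unique key and return it as JSON text."""
    return json.dumps(store[key])

print(get_document("user:2"))   # -> {"name": "Luis", "interests": ["music", "running"]}
```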

4.3 HDFS

The Hadoop Distributed File System (HDFS) is a distributed, scalable and portable file system for storing and handling large files, in which the block size (64 MB by default) is much larger than in the file systems used by operating systems (512 bytes, 1 KB, 2 MB) in order to minimize the time spent in the reading process. The information stored in this file system follows the "write once, read many" pattern: files are normally written only once, by some kind of batch process, and are read multiple times by different analysis algorithms. This distribution of information is achieved by a distributed architecture in which files are divided into blocks that are distributed among the different nodes of the architecture (the blocks of the same file can be stored on different nodes). An HDFS file system is composed of a cluster of nodes that store information in a distributed manner using a master-slave architecture consisting of two node types:

▪ DataNode: each of the nodes on which the information blocks of the different files are stored and from which they are retrieved at the request of the master node. These are the slave nodes of the architecture, since they execute the storage and retrieval commands defined by the NameNode. In order to offer high availability, the data blocks are replicated on multiple DataNodes (by default on 3 nodes).
▪ NameNode: the master node of the cluster, responsible for managing all the DataNodes. It holds a namespace to uniquely identify each of the data nodes and manages the distribution of information by keeping track of the location of the different blocks of the stored files. There is only one such node in an HDFS cluster.


NameNode managing three DataNodes; a data file is split into Block 1 and Block 2, and each block is replicated on two DataNodes.
Figure 8. Basic structure of a 4-node Hadoop cluster

Figure 8 presents the basic architecture of a Hadoop cluster formed by 4 nodes: 1 NameNode and 3 DataNodes. In addition, the distribution of the blocks of a data file is described graphically; in this case each block is stored on two nodes.
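As a small worked example of how block size and replication determine what the cluster stores, the following sketch uses the 64 MB default block size and the 3x replication mentioned above; the 200 MB file size is invented.

```python
import math

def hdfs_blocks(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, total block copies stored across the DataNodes)."""
    blocks = math.ceil(file_size_mb / block_size_mb)   # the file is split into fixed-size blocks
    return blocks, blocks * replication                # each block is replicated on several nodes

# A hypothetical 200 MB file: 4 blocks, 12 block copies spread over the cluster.
print(hdfs_blocks(200))   # -> (4, 12)
```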


5. INFORMATION ANALYSIS
Once the information has been collected, processed and stored in the different storage systems, it is possible to exploit it in order to extract some kind of knowledge that can only be obtained and/or calculated by combining the information at a global or partial level. This type of process for exploiting large amounts of information began to be developed in the 2000s by companies such as Google, which needed to process large amounts of information to calculate the PageRank of web pages, whose number had grown exponentially. The PageRank calculation process needs to perform multiplication operations on very large matrices in order to calculate the value for each web page, which was practically impossible to do in a traditional environment because of the large amount of information involved. In order to solve this type of problem, the MapReduce[14] programming model was developed, which made it possible to perform mathematical operations on very large volumes of information by parallelizing independent operations.

This computation model proved extremely useful for solving problems that, due to their computational complexity or the size of the information they used, could not be solved by traditional techniques. The ease of solving new problems through this new "paradigm" led to the rapid spread of this type of programming model and to the emergence of different programming frameworks that allowed algorithms to be deployed, in a very simple way, on small clusters of in-house computers. This in turn led to the emergence of new technologies related to the creation, collection and manipulation of information, as well as to the use of Machine Learning algorithms on very large sets of information, allowing them to demonstrate their great potential for creating learning models.

5.1 Parallel Programming through MapReduce

MapReduce[14] is a programming model strongly oriented towards parallel execution distributed across multiple computers, which allows large volumes of information to be processed: given a very large set of input information (from a file, a database or any other data source), it is capable of generating an output data set of undetermined size by combining or manipulating the input information using two well-known functional programming functions: Map and Reduce.


Figure 9. How MapReduce works

Figure 9 is a graphic description of how MapReduce works. This process consists of two main phases (actually there are four, but the first and third are transparent to the developer): the division phase (Splitting), the mapping phase (Map), the grouping phase (Shuffling) and the reduction phase (Reduce). The splitting phase divides the input set into fragments that can be processed individually on different computers. The shuffling phase groups the information generated by the mapping process on the basis of a key so that it can be used by the reduction process. This process was designed for structured data in the form of tuples of the type (key, value), so that both the input information and the output information are represented in this format.


Input → Division (Splitting) → Mapping (Map) → Grouping (Shuffling) → Reduction (Reduce) → Output

Figure 10. Example of MapReduce operation to count the number of occurrences of each word.

5.1.1. Map function

The Map function is a mapping function that receives as input a tuple of the form (key, value) and generates as output a list of tuples of the form (key, value), where the output keys and values may be different from the input ones. Given a set of input tuples, the map function divides the elements of the set and applies to each of them an operation that results in a list of output tuples. The great advantage of the mapping process is that it can be parallelized across different processes in order to minimize the mapping time. In the example presented in Figure 10, a MapReduce process is applied to count the number of occurrences of each word in a set of phrases. The input is a set of phrases (each one corresponds to an input tuple), and the map function is applied individually to each of the phrases of the input set. In general, the mapping process consists of simplifying the semantics of the information.

5.1.2. Reduce function

Once the map operation has been executed, the Reduce operation is applied. It groups all the tuples from the different lists generated by the Map operation that contain the same key, creating a set of lists in which all the values share the same key. These lists are then reduced by some type of operation that decreases the size of the sets, and finally they are all merged to produce the general output of the MapReduce process.


In the example presented in Figure 10, the similar words of each of the lists generated by the Map operation are grouped, and then an operation is applied that calculates the sum of the values of the tuples of each set in order to obtain the number of occurrences of each word. Finally, all the lists are combined into a final list that corresponds to the result of the operation.
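A minimal sketch of this word-count example in plain Python, making the map, shuffle and reduce steps explicit; the input phrases are invented.

```python
from collections import defaultdict

phrases = ["big data big value", "data is the new oil", "big data"]

# Map: every phrase produces a list of (word, 1) tuples.
mapped = [(word, 1) for phrase in phrases for word in phrase.split()]

# Shuffle: group the values of all tuples that share the same key.
groups = defaultdict(list)
for word, value in mapped:
    groups[word].append(value)

# Reduce: sum the values of each group to obtain the occurrences of each word.
result = {word: sum(values) for word, values in groups.items()}
print(result)   # -> {'big': 3, 'data': 3, 'value': 1, 'is': 1, 'the': 1, 'new': 1, 'oil': 1}
```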

5.1.3. MapReduce-based frameworks

Due to the power of the MapReduce parallel programming model, different development frameworks began to appear that allowed algorithms to be implemented in a simple way using this operating model. The first algorithms implemented on these frameworks consisted of basic mathematical operations, but as the use of this type of framework became standardized, many of the Machine Learning algorithms described in the previous two chapters were included in order to produce learning models on large data sets. The most commonly used frameworks based on MapReduce are:

▪ Apache Hadoop: a framework[15] for creating distributed applications based on MapReduce that stores the information in physical memory (disk) using the HDFS file system.
▪ Apache Spark: a framework[16] for creating distributed applications based on MapReduce that stores information in volatile memory (RAM). Normally this framework relies on some kind of physical storage from which it obtains the input information and in which it stores the output information (see the sketch after this list).
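As an illustration, the same word count expressed with Spark's Python API; a minimal sketch assuming a local PySpark installation and a hypothetical input file phrases.txt.

```python
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")             # run Spark on the local machine

counts = (sc.textFile("phrases.txt")                # hypothetical input: one phrase per line
            .flatMap(lambda line: line.split())     # Map: emit one element per word
            .map(lambda word: (word, 1))            # turn each word into a (key, value) tuple
            .reduceByKey(lambda a, b: a + b))       # Shuffle + Reduce: sum the values per key

print(counts.collect())                             # bring the result back to the driver
sc.stop()
```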

The deployment of this type of technology involved the use of complex infrastructure in large clusters of computers that would enable the distribution of processing operations and data storage.


5.2 Parallel computing in the cloud

In the beginning, all the infrastructure needed to deploy this type of MapReduce-based technology was built on networks of "local" computers. In other words, organizations had their own private computer networks on which they executed their parallel processes. In the long run this entailed a very high economic cost for organizations, because they had to invest in infrastructure and personnel in order to maintain a computer network that was often not fully used, since parallel processes were not running 24 hours a day. This major disadvantage meant that most companies were unable to access this type of technology because of its high cost. The need for infrastructure, not only for computing but also for storage and the deployment of servers or services, was a great business opportunity for the large technology giants, which had large infrastructures and could offer small and medium-sized enterprises storage and data manipulation services in "the cloud", allowing companies to rent the infrastructure needed to deploy their processes temporarily at low cost.

"Cloud computing" (Cloud computing) is a paradigm that allows to provide a set of shared services
(applications, storage, servers, deployment platforms, etc) through a common network, which is
usually the Internet.
This kind of paradigm allowed a democratization of "Big Data" since any company could access
this type of services very cheaply without having to deploy its own physical infrastructure. Despite
the great advantages that this type of paradigm offered, many companies have decided not to
adopt it completely, in part because of the type of information they manipulate. For example,
public health services store personal information of all their patients whose privacy must be
maintained based on a set of state laws, which means that this type of information must be stored
based on a number of criteria in order to maintain your privacy. This type of situation led to
variations in this paradigm leading to different cloud implementation models:


Private | Public | Hybrid

Figure 11. Cloud implementation models: private, public and hybrid.

▪ Private clouds are an on-demand infrastructure managed for a single customer or company that has full control over the cloud. In other words, the company owns all the physical infrastructure and has total control over access to and use of its resources (servers, network and disk). This type of cloud can be managed by a company external to the one that owns it, but the owner retains total control over access to services. Private clouds make it possible to maintain the privacy of the information stored in the cloud, but usually have a higher cost, since the infrastructure has to be maintained by the company itself.
▪ Public clouds are an infrastructure maintained and managed by a service provider that does not have any kind of link with the company and/or users who access the services offered by the cloud, beyond the contractual link created when the provider rents its services to users. In this type of cloud, the information and applications of the different users (clients) are distributed in a shared way among the physical components of the architecture, without any knowledge of the other users who are using the services. This is the most common type of cloud due to its scalability: a company puts an on-demand infrastructure at the service of its customers at a certain cost, which is quite low since the maintenance and management of the infrastructure is transparent to the end user.


▪ Hybrid clouds are a combination of the previous two, in which different users own one part of the cloud and share other parts in a controlled way. These clouds try to offer the best of both cases but have a big problem related to security and to accessing several types of cloud at once, which increases their implementation complexity. Despite these problems, they are quite useful for deploying simple applications that do not require synchronization, or that need to store information with different levels of privacy.

Regardless of the implementation model, the cloud can offer a set of services, as presented in Figure 12, distributed at different levels, ranging from the execution of cloud-hosted applications to full control of the physical infrastructure of the servers that make up the cloud. Depending on the level of control over the different elements of the cloud, three types of service model can be distinguished.

Business apps | Development tools | Operating system | Server | Security system | Database
Figure 12. Different types of service models offered by the cloud

▪ Software as a Service (SaaS) is the most limited of the service models offered by the cloud. This model allows users to access and use a suite of cloud-hosted applications over the Internet. It typically relies on a web browser to access the service, while all the underlying infrastructure, the middleware, the operating system, the software and the application data are stored and managed by the service provider, which guarantees the availability and security of the application and its data. This model is usually the simplest and most economical, because it provides a series of basic services without the user having to worry about the maintenance, deployment or development of the elements that make the different services work correctly. An example of this model is the cloud-hosted messaging systems provided by Microsoft or Google.


▪ Platform as a Service (PaaS) is a more complete model than the previous one that provides a full development and deployment environment in the cloud, allowing anything from simple cloud-based applications to much more complex business applications developed by the company contracting the service to be deployed. The PaaS model allows you to configure middleware, development and test tools, different business intelligence (BI) services, database management systems, etc. That is, this model allows you to control the entire software life cycle of an application: building, testing, deployment, administration and updating. It frees the company contracting the service from the whole process of acquiring and paying for software licenses for the different development or deployment tools, as well as from the physical infrastructure on which the service runs. In other words, this model makes it possible to acquire servers deployed in the cloud, where the user has partial control of the server but does not have to manage anything related to the operating system or the physical architecture. An example of this model is the servers that can be purchased through Google Cloud.

▪ Infrastructure as a Service (IaaS) is the most complete model and consists of providing an on-demand computing infrastructure that is provisioned and managed over the Internet. That is, it is a way of acquiring a fully controlled server without assuming the purchase and administration of physical servers: the service provider manages the physical infrastructure and its security, while the user manages the software infrastructure, being responsible for installing, configuring and managing their own software (operating systems, middleware and applications).

5.3 Data Analysis

All these cloud deployment models, service models and programming models are the tools that allow us to build software capable of analyzing the large amount of data we have stored in the different storage systems. These tools allow us to analyze the information following two strategies:

▪ Model generation: the generation of models is based on applying Machine Learning techniques, such as those described in topics 5 and 6 of this course, to large volumes of information in order to build learning models that enable our applications to deploy automatic decision-making systems, improving the user experience and/or predicting information that users can use to interact differently with the services offered (a minimal sketch appears after this list).


▪ Business Intelligence: Business Intelligence (BI) is based on generating knowledge by analyzing large volumes of information in order to present it in a more compact way, facilitating the decision-making process of human beings. In other words, this strategy seeks to present all the available information visually, through graphs, images and diagrams, reflecting the most important elements so as to facilitate human decision-making: for example, whether to increase investment in a certain type of product, hire more staff to strengthen some area of the company, etc.
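A minimal sketch of the model-generation strategy, assuming the third-party scikit-learn library and a tiny invented data set (nothing like the volumes discussed in this topic):

```python
from sklearn.linear_model import LinearRegression

# Invented training data: house size in square meters -> price in thousands of euros.
X = [[50], [80], [120], [200]]
y = [150, 240, 360, 600]

model = LinearRegression().fit(X, y)   # build a learning model from the stored data
print(model.predict([[100]]))          # predict the price of an unseen 100 m2 house
```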


6. INFORMATION EXPLOITATION
This is the last phase of the data life cycle and involves exploiting the different products that have been generated throughout the previous process. That is, it looks for a way to use the different learning models or the knowledge generated in the analysis phase in order to improve some aspect of the products offered or of the functioning of the company. The exploitation of these products is usually carried out in at least three different ways.

▪ Application programming interface: the different learning models produced during the analysis phase can be easily integrated into our products by using an API. An application programming interface (API) is a set of methods or functions offered by a library or service that can be used by another service. Models can be seen as functions we have learned which, given an input defined in a certain format, are able to generate an output defined in another format. For example, if we have managed to build a model that estimates the price of a house in Spain, we will be able to obtain a price if we supply our model with the information of a house. Based on all this, our learning models can be offered as a service located in our public or private cloud that exposes an API which other services can access (a minimal serving sketch appears at the end of this section).
▪ Data display: there are many situations in which, because of the large amount of information or its complexity, it is very difficult to draw simple conclusions from the data. A very useful way to extract these conclusions is through a visual representation of the data, which can highlight certain features that are not visible if a sufficient amount of data is not combined or if the data are not presented visually. An example of a data visualization system is the application of Circos to genomics, which allows genomes to be compared and the relationships between them to be displayed visually. Circos is a data visualization software that represents the 24 chromosomes of the human genome (including the X and Y sex chromosomes) as a circular ring and positions around it the genes related to certain diseases. Figure 13 is an example of this type of visualization, in which the gene map of each chromosome is presented: the data placed on the outside of the chromosome ring highlight the genes involved in certain diseases such as cancer, diabetes and glaucoma, while the data placed within the ring link genes related to a certain disease that are in the same biochemical pathway (gray) and show the degree of similarity for a subset of the genome (colored).


Figure 13. Representation of the relationships between different genes with different types of patients. Copyright ©
Genome Res

▪ Management dashboard: a dashboard is a simplified representation of a set of indicators that gives a general idea of the behavior of an area or a process. In other words, it graphically represents the trend or state of a set of indicators (Key Performance Indicators, KPIs) considered relevant to the management of the process or area being analyzed. The idea is to visualize in a simple way all the indicators (KPIs), comparing them with their respective target values (Key Goal Indicators, KGIs).


This type of visualization provides an overview of the state of the process or area, making it possible to identify the actions necessary to change and/or maintain the results presented in the dashboard. In addition, it is very important that dashboards provide extra functionality for detailed analysis and more specific data queries. Figure 14 is an example of a dashboard. Some of the most widely used BI systems today are Tableau[17], Looker[18] and Microsoft Power BI[19].

Figure 14. Example of a dashboard
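A minimal sketch of exposing a learning model through an API, assuming the third-party Flask library; the /predict endpoint and the house-price function are invented placeholders for a real trained model.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_price(house):
    """Stand-in for a trained model: price as a simple function of the house size."""
    return 3.0 * house["square_meters"]

@app.route("/predict", methods=["POST"])
def predict():
    house = request.get_json()                       # input defined in an agreed JSON format
    return jsonify({"price": predict_price(house)})  # output defined in another format

if __name__ == "__main__":
    app.run(port=5000)   # other services can now POST house data to /predict
```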


7. CONCLUSIONS
In this topic, the concept of "Big Data" has been presented in a simple way as the life cycle of data. Throughout the topic we have tried to convey the importance of information, which in our opinion is the most important concept in "Big Data", because it is the large amount of information that we humans generate every second that has created this concept, term or science. It is not entirely clear how "Big Data" will influence our lives in the coming years, and that is something to keep in mind. The great advantage of "Big Data" is that it is able to extract information about the majority decisions that humans make, that is, it is able to identify the behaviors of the majority of the population that generates information. This information can be very useful for eliminating harmful behaviors in our society, but it can also help certain companies to manipulate our behavior by introducing certain stimuli that shift it towards an expected result in order to obtain higher profits. So the big challenge facing "Big Data" at the moment is how to identify the limits on using all the information that humans generate continuously, and whether the systems built with this information are anything more than better systems that will ultimately suggest that we choose only among those options that are valid from the point of view of the entity that controls the data. In spite of everything, "Big Data" will become one of the most useful resources of humanity, since it allows us to see "clearly" in an immense universe of data.


8. REFERENCES
[1] Dargan G., Johnson B., Panchalingam M. and Stratis C. (2004), The Use of Radio Frequency Identification as a Replacement for Traditional Barcoding, IBM.

[2] Duckett J. (2011), HTML & CSS: Design and Build Web Sites, John Wiley & Sons Inc. ISBN 9781118008188

[3] HTML language (World Wide Web Consortium): https://www.w3.org/html/

[4] XML language (World Wide Web Consortium): https://www.w3.org/XML

[5] JSON: https://www.json.org/

[6] Vanden Broucke S. and Baesens B. (2018), Practical Web Scraping for Data Science: Best Practices and Examples, Apress. ISBN 978-1-4842-3582-9

[7] Kafka: https://kafka.apache.org

[8] RabbitMQ: https://www.rabbitmq.com

[9] de Miguel A. and Piattini M. (1990), Fundamentos y Modelos de Bases de Datos (2nd edition), RA-MA. ISBN 9788478973613

[10] Silberschatz A., Korth H. and Sudarshan S. (2014), Fundamentos de Bases de Datos (6th edition), McGraw-Hill Interamericana de España S.L. ISBN 9788448190330

[11] Piattini M., Marcos E., Calero C. and Vela B. (2006), Tecnología y Modelos de Bases de Datos, RA-MA. ISBN 9788478977338

[12] Godoc E. (2014), SQL – Fundamentos del lenguaje, ENI. ISBN 9782746091245

[13] Sullivan D. (2015), NoSQL for Mere Mortals, Addison-Wesley. ISBN 9780134023212

[14] Dean J. and Ghemawat S. (2004), MapReduce: simplified data processing on large clusters, Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, volume 6, pp. 1-10.

[15] Apache Hadoop (official website): https://hadoop.apache.org/

[16] Apache Spark (official website): https://spark.apache.org/

[17] Tableau (official website): https://www.tableau.com

[18] Looker Business Intelligence (official website): https://looker.com/


[19] Microsoft Power BI (official website): https://powerbi.microsoft.com/es-es/
