
Big Data Analytics

(Course Code: AI732PE)

Dr. A. Pramod Kumar
Associate Professor,
Dept. of AIML.,
CMR Engineering College (Autonomous),
Email: [email protected]
Contact:9000159660
COURSE OUTCOMES
Upon successful completion of this course, student will be able to:

❑ CO1: Outline the basic big data concepts


❑ CO2: Simulate and apply various big data technologies like Hadoop, Map
Reduce, Spark, Impala, Pig and Hive.
❑ CO3: Categorize and summarize the Big Data and its importance in Business
domains.
❑ CO4: Differentiate various learning approaches in machine learning to process
data, and to interpret the concepts of ML algorithms and test cases
❑ CO5: Develop the numerous features of data for visualization in association with
Tableau, QlikView and D3.
Course Assessment

Course Title: BIG DATA ANALYTICS              Course Type: Integrated
Course Code: AI732PE        Credits: 3        Class: IV Year, I Semester

Course Structure:
TLP        Credits    Contact Hours    Work Load
Theory        3             4               4
Practice      0             0               0
Tutorial      -             -               -
Total         3             4               4

Total Number of Classes per Semester: Theory 42, Practical 0
Assessment Weightage: CIE 30%, SEE 70%

Course Lead:
Course Instructors: Theory - Dr. A. Pramod Kumar; Practice - Dr. A. Pramod Kumar
COURSE SYLLABUS
UNIT I: Data Management Maintain Healthy, Safe & Secure Working Environment:
Data Management: Design Data Architecture and manage the data for analysis, understand various
sources of Data like Sensors/signal/GPS etc. Data Management, Data Quality (noise, outliers, missing
values, duplicate data) and Data Preprocessing. Export all the data onto Cloud ex. AWS/Rackspace etc.
Maintain Healthy, Safe & Secure Working Environment: Introduction, workplace safety, Report
Accidents & Emergencies, Protect health & safety as you work, course conclusion, and assessment.

UNIT- II: Big Data Tools & Provide Data/Information in Standard Formats :
Big Data Tools: Introduction to Big Data tools like Hadoop, Spark, Impala etc., Data ETL process,
Identify gaps in the data and follow-up for decision making.
Provide Data/Information in Standard Formats: Introduction, Knowledge Management, and
Standardized reporting & compliances, Decision Models, course conclusion. Assessment

UNIT- III: Big Data Analytics :


Big Data Analytics: Run descriptives to understand the nature of the available data, collate all the
data sources to suffice the business requirement, run descriptive statistics for all the variables and
observe the data ranges, outlier detection and elimination.

UNIT- IV: Machine Learning Algorithms:


Machine Learning Algorithms: Hypothesis testing and determining the multiple analytical
methodologies, Train Model on 2/3 sample data using various Statistical/Machine learning
algorithms, Test model on 1/3 sample for prediction etc.
Unit V: Data Visualization:
Data Visualization: Prepare the data for Visualization, Use tools like Tableau, QlikView and D3, Draw
insights out of Visualization tool. Product Implementation.
TEXT BOOKS
❑ Michael Minelli, Michelle Chambers and Ambiga Dhiraj, “Big Data, Big Analytics: Emerging
Business Intelligence and Analytic Trends for Today's Businesses”, Wiley, 2013.
❑ Arvind Sathi, “Big Data Analytics: Disruptive Technologies for Changing the Game”, 1st Edition,
IBM Corporation, 2012.
❑ Davy Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science - Big Data,
Machine Learning, and More, Using Python Tools”, Dreamtech Press, 2016.
❑ EMC Education Services, “Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing
and Presenting Data”, Wiley Publishers, 2015.
REFERENCE BOOKS:
❑ Tan, Steinbach and Kumar, “Introduction to Data Mining”, Addison Wesley, 2006.
❑ Cay Horstmann, “Big Java”, 4th Edition, John Wiley & Sons, Inc.
❑ M. Zaki and W. Meira, “Data Mining and Analysis: Fundamental Concepts and Algorithms”,
Cambridge University Press, 2014.
NPTEL/SWAYAM/MOOCS:
❑ https://onlinecourses.nptel.ac.in/noc23_cs112/preview
E books
❑ https://bmsce.ac.in/Content/IS/Big_Data_Analytics_-_Unit_1.pdf
❑ https://mrcet.com/downloads/digital_notes/IT/(R17A0528)%20BIG%20DATA%20ANALYTICS.pdf
E-RESOURCES:
1. http://freevideolectures.com/Course/3613/Big-Data-and-Hadoop/18
2. http://www.comp.nus.edu.sg/~ooibc/mapreduce-survey.pdf
UNIT-III

Big Data Analytics

Run descriptives to understand the nature of the available data
Nowadays, Big Data and Data Science have become high-profile keywords. They are
extensively researched, and this means the data has to be processed and studied with
scrutiny. One of the techniques used to analyse this data is descriptive analysis.
Big data analytics is the process of examining large data sets containing a variety of data
types -- i.e., big data -- to uncover hidden patterns, unknown correlations, market trends,
customer preferences and other useful business information.
The analytical findings can lead to more effective marketing, new revenue opportunities,
better customer service, improved operational efficiency, competitive advantages over
rival organizations and other business benefits.
The primary goal of big data analytics is to help companies make more informed business
decisions by enabling data scientists, predictive modelers and other analytics professionals
to analyze large volumes of transaction data, as well as other forms of data that may be
untapped by conventional business intelligence (BI) programs.
Relational and transactional databases based on the SQL language have clearly dominated the
market for data storage and data manipulation over the past 20 years.
In moving from relational databases to Big Data, practitioners had to face five major weaknesses
of relational databases: the scaling of processing, the scaling of data, redundancy, velocity, and
the variety and complexity of data.
Descriptive analytics is a branch of data analytics that involves summarizing and
interpreting historical data to understand patterns
As a relational database, it provides a set of functionalities to access data across several
entities (tables) through complex queries. It also provides referential integrity to ensure the
constant validity of the links between entities.
Such mechanisms are extremely costly and complex to implement in a distributed
architecture, considering that it is necessary to ensure that all data that are linked together
are hosted on the same node. Moreover, it implies the definition of a static data model
or schema, which is not well suited to the velocity of web data.

As a transactional database, they must respect the ACID constraints, i.e. the Atomicity of
updates, the Consistency of the database, and the Isolation and Durability of transactions.
These constraints are perfectly applicable in a centralized architecture, but much more
complex to ensure in a distributed architecture.
NoSQL and Big Data (the CAP properties):
Coherence (consistency): all the nodes of the system have to see
exactly the same data at the same time.
Availability: the system must stay up and running
even if one of its nodes fails.
Partition tolerance: each sub-network (partition) must be able to
operate autonomously.
In descriptive analysis, we describe our data with the help of
various representative methods such as
charts, graphs, tables, Excel files, etc.
Most of the time it is performed on small data sets, and this analysis helps us a lot
to predict some future trends based on the current findings. Some measures that
are used to describe a data set are measures of central tendency and measures of
variability or dispersion.
Types of Descriptive Statistics
1. Measures of central tendency
2. Measures of variability
3. Measures of frequency distribution

Measures of Central Tendency


It represents the whole set of data by a single value. It gives us the location of the
central points. There are three main measures of central tendency
Mean
It is the sum of observations divided by the total number of observations. It is also
defined as the average, i.e., the sum divided by the count.

Mean (x̄) = (x1 + x2 + … + xn) / n = Σx / n, where x = observations and n = number of terms.


Mode
It is the value that has the highest frequency in the given data set. The data set
may have no mode if the frequency of all data points is the same. we can have
more than one mode if we encounter two or more data points having the same
frequency.
Median
It is the middle value of the data set. It splits the data into two halves. If the
number of elements in the data set is odd then the center element is the median
and if it is even then the median would be the average of two central elements.
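
To make these three measures concrete, here is a minimal sketch using Python's standard statistics module; the numbers are made up purely for illustration.

# Mean, median and mode of a small made-up data set, using the standard library.
import statistics

data = [12, 15, 11, 15, 18, 21, 15, 11, 19]

mean_value = statistics.mean(data)      # sum of observations divided by their count
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # value with the highest frequency

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)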
Measure of Variability
Measures of variability are also termed measures of dispersion, as they help to gain
insights about the dispersion or the spread of the observations at hand.
Range
The range describes the difference between the largest and smallest data point in
our data set. The bigger the range, the more the spread of data and vice versa.
Range = Largest data value – smallest data value
Variance
It is defined as an average squared deviation from the mean. It is calculated by
finding the difference between every data point and the average which is also
known as the mean, squaring them, adding all of them, and then dividing by the
number of data points present in our data set.

Variance (σ²) = Σ(x − μ)² / N, where x = observation under consideration, N = number of terms, and μ = mean.
Standard Deviation
It is defined as the square root of the variance. It is calculated by finding the
mean, subtracting each observation from the mean, squaring the results, averaging
them, and taking the square root.

Standard deviation (σ) = √( Σ(x − μ)² / N ), where x = observation under consideration, N = number of terms, and μ = mean.
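
The measures of variability can be computed the same way; a minimal Python sketch on made-up numbers, using the population formulas given above:

# Range, population variance and standard deviation of a small made-up sample.
import statistics

data = [12, 15, 11, 15, 18, 21, 15, 11, 19]

data_range = max(data) - min(data)      # largest data value - smallest data value
variance = statistics.pvariance(data)   # average squared deviation from the mean (divides by N)
std_dev = statistics.pstdev(data)       # square root of the variance

print("Range:", data_range)
print("Variance:", variance)
print("Standard deviation:", std_dev)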
Measures of Frequency Distribution
Measures of frequency distribution help us gain valuable insights into the
distribution and characteristics of the dataset. Common measures include count
(frequency), relative frequency, and cumulative frequency.
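
A short Python sketch of these frequency measures on a made-up categorical sample, using only the standard library:

# Count, relative and cumulative frequency of a made-up categorical sample.
from collections import Counter

observations = ["red", "blue", "red", "green", "blue", "red", "blue", "blue"]

counts = Counter(observations)          # count frequency of each category
total = sum(counts.values())

cumulative = 0
for category, count in counts.most_common():
    relative = count / total            # relative frequency
    cumulative += count                 # cumulative frequency
    print(category, count, round(relative, 2), cumulative)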

Collate all the data sources to suffice the business requirement


Key-value store Concept
This technology can address a large volume of data due to the simplicity of its
data model. Each object is identified by a unique key, and access to the data is
only possible through this key.
The structure of the object is free. This model only provides the four basic
operations to Create, Read, Update and Delete (CRUD) an object from its key.
These databases typically expose an HTTP REST API as a façade so that they can
interoperate with any language.
This simple approach has the benefit of providing exceptional performance in read
and write access, and a large scalability of data.
However, it provides only limited querying facilities, since data can only be
retrieved by key, and not by content.
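
As a rough illustration of this contract (a sketch only, not the API of any particular key-value product), the class below imitates the key-only CRUD model with an in-memory Python dictionary:

# In-memory sketch of the key-value model: every object is reachable only
# through its unique key, via the four CRUD operations.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def create(self, key, value):
        if key in self._data:
            raise KeyError(f"key already exists: {key}")
        self._data[key] = value

    def read(self, key):
        return self._data[key]          # access is by key only, never by content

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(f"unknown key: {key}")
        self._data[key] = value

    def delete(self, key):
        del self._data[key]

store = KeyValueStore()
store.create("user:42", {"name": "Alice", "city": "Hyderabad"})
print(store.read("user:42"))
store.update("user:42", {"name": "Alice", "city": "Chennai"})
store.delete("user:42")

Because access goes through the key alone, any content-based query would require scanning every value, which is exactly the querying limitation noted above.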

Column-based databases Concept

Column-based databases store data in grids, in which the column is the basic
entity that represents a data field.
Columns can be grouped together through the concept of column families. Rows of
the grid correspond to records and are identified by a unique key, as in the
key-value model.
Some providers also include in their model the concept of version as a third
dimension of the grid.
The organization of the database in grids can appear similar to the tables of
relational databases.
While the columns of a relational table are static and present for each record, this
is not the case in column-oriented databases, so it is possible to dynamically
add a column to a table with no cost in terms of storage space.

These databases are designed to store up to several million columns, which can
be fields of an entity or one-to-many relationships.
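
A rough Python sketch of the grid idea (illustrative only, not the API of any real column store): a row key maps to column families, each holding a dynamic set of columns:

# Sketch of the column-family grid: row key -> column family -> column -> value.
# Columns are dynamic, so different rows may carry different columns at no storage cost.
rows = {
    "user:1": {
        "profile": {"name": "Alice", "city": "Hyderabad"},
        "activity": {"last_login": "2024-01-10"},
    },
    "user:2": {
        "profile": {"name": "Bob"},                        # fewer columns than user:1
        "activity": {"last_login": "2024-01-12", "logins": 37},
    },
}

# Adding a new column to one row does not affect any other row.
rows["user:2"]["profile"]["city"] = "Chennai"
print(rows["user:2"]["profile"])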
Document-based databases Concept
Document-based databases are similar to key-value stores, except that the value
associated with the key can be a structured and complex object rather than a
simple type.
These complex objects are generally structured in XML or JSON.
This approach allows queries to be run on the content of the
documents, and not only through the key of the record.
The simplicity and flexibility of this data model make it particularly applicable
to Content Management Systems (CMS).
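
The following sketch (made-up documents, no specific product implied) shows the key difference from a plain key-value store: queries can filter on document content, not only on the key:

# Sketch of the document model: values are structured, JSON-like documents,
# so queries can filter on content rather than only on the key.
documents = {
    "article:1": {"title": "Big Data Basics", "tags": ["hadoop", "intro"], "views": 120},
    "article:2": {"title": "Spark Streaming", "tags": ["spark"], "views": 340},
    "article:3": {"title": "Hive for Analysts", "tags": ["hive", "sql"], "views": 95},
}

def find(predicate):
    """Return the documents whose content satisfies the given predicate."""
    return {key: doc for key, doc in documents.items() if predicate(doc)}

popular = find(lambda doc: doc["views"] > 100)      # content-based query
print(popular)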
Graph databases Concept
The graph paradigm is a data model in which entities are nodes and associations
between entities are arcs or relationships. Both nodes and relationships are
characterized by a set of properties.
This category of databases is typically designed to address the complexity of
data more than its volume;
they are applied in cartography, social networks, and more generally in network
modelling.
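
A minimal Python sketch of the graph model, with properties on both nodes and relationships and a simple traversal helper (names and properties are invented for illustration):

# Sketch of the graph model: nodes and relationships, both carrying properties.
nodes = {
    "alice": {"label": "Person", "age": 30},
    "bob":   {"label": "Person", "age": 34},
    "acme":  {"label": "Company"},
}

# Each relationship is (source node, relation type, target node, properties).
relationships = [
    ("alice", "FRIEND_OF", "bob", {"since": 2015}),
    ("alice", "WORKS_AT", "acme", {"role": "analyst"}),
    ("bob", "WORKS_AT", "acme", {"role": "engineer"}),
]

def neighbours(node, relation):
    """Traverse the outgoing relationships of a given type from a node."""
    return [target for source, rel, target, _ in relationships
            if source == node and rel == relation]

print(neighbours("alice", "WORKS_AT"))   # ['acme']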
MapReduce
MapReduce is a programming model or pattern within the Hadoop framework that
is used to access big data stored in the Hadoop File System (HDFS). It is a core
component, integral to the functioning of the Hadoop framework.
MapReduce is a processing technique and a program model for distributed
computing based on Java. MapReduce is responsible for processing the data files.

▪ Map
➢ Iterate over a large number of records
➢ Extract data of interest
➢ Shuffle and sort intermediate results
▪ Reduce
➢ Aggregate intermediate results
➢ Generate final output
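
To tie the two phases together, here is a small, self-contained word-count sketch written in plain Python. It runs on a single machine and only illustrates the map, shuffle/sort and reduce steps of the programming model; it is not Hadoop code.

# Word count expressed in MapReduce style: map -> shuffle/sort -> reduce.
from collections import defaultdict

records = ["big data needs big tools", "map reduce maps and reduces data"]

# Map: iterate over the records and emit intermediate (key, value) pairs.
intermediate = []
for record in records:
    for word in record.split():
        intermediate.append((word, 1))

# Shuffle and sort: group all intermediate values by key.
groups = defaultdict(list)
for key, value in sorted(intermediate):
    groups[key].append(value)

# Reduce: aggregate each group into the final output.
word_counts = {key: sum(values) for key, values in groups.items()}
print(word_counts)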
Components of MapReduce
1. PayLoad: The applications implement Map and Reduce functions and form the core of
the job
2. MR Unit: Unit test framework for MapReduce
3. Mapper: Mapper maps the input key/value pairs to the set of intermediate key/value
pairs
4. Name Node: Node that manages the HDFS is known as the Name Node
5. Data Node: Node where the data is present before processing takes place
6. Master Node: Node where the JobTracker runs and which accepts job requests from the
clients
7. Slave Node: Node where the Map and Reduce program runs
8. Job Tracker: Schedules jobs and tracks the assigned jobs to the task tracker
9. Task Tracker: Tracks the task and updates the status to the job tracker
10. Job: A program that is an execution of a Mapper and Reducer across a dataset
11. Task: An execution of Mapper and Reducer on a piece of data
12. Task Attempt: A particular instance of an attempt to execute a task on a Slave Node
Abstraction of MapReduce
1. Hive - Query engine (uses SQL-like queries)
2. Pig - dataflow scripting platform, developed at Yahoo
3. Sqoop - SQL + Hadoop, used to transfer data between Hadoop and relational databases
4. Oozie - workflow scheduler, developed at Yahoo, used for automating/scheduling jobs
MapReduce Job workflow
Hadoop API
Hadoop MapReduce is a software framework for easily writing applications
which process vast amounts of data in parallel on large clusters (thousands of
nodes) of commodity hardware in a reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that
Hadoop programs perform:
1. The Map Task: This is the first task, which takes input data and converts it into
a set of data, where individual elements are broken down into tuples (key/value
pairs).
2. The Reduce Task: This task takes the output from a map task as input and
combines those data tuples into a smaller set of tuples. The reduce task is always
performed after the map task.

The framework takes care of scheduling tasks, monitoring them and re-executes
the failed tasks.
The MapReduce framework consists of a single master Job Tracker and one slave
Task Tracker per cluster-node.
The master is responsible for resource management, tracking resource
consumption/availability and scheduling the jobs component tasks on the slaves,
monitoring them and re-executing the failed tasks.
The slave TaskTrackers execute the tasks as directed by the master and provide
task-status information to the master periodically.
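
As a hedged illustration of how such Map and Reduce tasks are often written in practice, the two scripts below follow the Hadoop Streaming convention of reading lines from standard input and emitting tab-separated key/value pairs on standard output. The file names mapper.py and reducer.py are only example names, and Python being available on the cluster nodes is an assumption.

# mapper.py -- example Map task in Hadoop Streaming style:
# read raw text lines from stdin and emit "word<TAB>1" pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- example Reduce task in Hadoop Streaming style:
# input arrives sorted by key, so counts for each word can be accumulated in sequence.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")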
Run descriptive statistics for all the variables and observe the data ranges
Called the “simplest class of analytics”, descriptive analytics allows you to
condense big data into smaller, more useful bits of information
It has been estimated that more than 80% of business analytics (e.g. social
analytics) are descriptive.
Some social data could include the number of posts, fans, followers, page views,
check-ins, pins, etc. It would appear to be an endless list if we tried to list them all.
Outlier detection and elimination.
An outlier is a data point that significantly deviates from the rest of the data. It
can be either much higher or much lower than the other data points, and its
presence can have a significant impact on the results of machine learning
algorithms. They can be caused by measurement or execution errors. There are
two main types of outliers
Global outliers: these are isolated data points that are far away from the main
body of the data. They are often easy to identify and remove.
Contextual outliers: these are data points that are unusual in a specific context
but may not be outliers in a different context. They are often more difficult to
identify and may require additional information or domain knowledge to
determine their significance
• Data that do not conform to the normal and expected patterns are outliers.
• Outlier detection has a wide range of applications in various domains, including finance,
security, and intrusion detection in cyber security.
• The criteria for what constitutes an outlier depend on the problem domain.
• It typically involves large amounts of data, which may be unstructured.
Outlier Detection Methods
Outlier detection plays a crucial role in ensuring the quality and accuracy of
machine learning models. By identifying and removing or handling outliers
effectively, we can prevent them from biasing the model and reducing its
performance.
Statistical Methods:
Z-Score: This method measures how many standard deviations a data point lies from
the mean and flags points whose z-score exceeds a certain threshold (typically 3 or -3)
as outliers.
Interquartile Range (IQR): IQR identifies outliers as data points falling outside
the range defined by Q1-k*(Q3-Q1) and Q3+k*(Q3-Q1), where Q1 and Q3 are
the first and third quartiles, and k is a factor (typically 1.5).
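
A compact sketch of both statistical rules on made-up numbers, assuming NumPy is available:

# Flag outliers with the Z-score rule (|z| > 3) and the IQR rule (k = 1.5).
import numpy as np

data = np.array([10, 11, 12, 11, 10, 12, 13, 11, 12, 10, 11, 12, 13, 11, 120])  # 120 is the injected outlier

# Z-score rule: standardize and keep points whose |z| exceeds the threshold.
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 3]

# IQR rule: keep points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = data[(data < lower) | (data > upper)]

print("Z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)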
Distance-Based Methods:
K-Nearest Neighbors (KNN): KNN identifies outliers as data points whose K
nearest neighbors are far away from them.
Local Outlier Factor (LOF): This method calculates the local density of data
points and identifies outliers as those with significantly lower density compared
to their neighbors.
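
Assuming scikit-learn is available (an assumption, not something the notes prescribe), LOF can be applied in a few lines; fit_predict labels outliers as -1:

# Local Outlier Factor: points with much lower local density than their
# neighbours are labelled -1 (outlier); the rest are labelled 1 (inlier).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(100, 2))    # dense, made-up cluster
extreme = np.array([[8.0, 8.0], [9.0, -7.0]])             # two isolated points
X = np.vstack([normal, extreme])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                               # -1 = outlier, 1 = inlier
print("Detected outliers:", X[labels == -1])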
Clustering-Based Methods:
Density-Based Spatial Clustering of Applications with Noise
(DBSCAN): DBSCAN clusters data points based on their density and
identifies outliers as points not belonging to any cluster.
Hierarchical clustering: It involves building a hierarchy of clusters by iteratively
merging or splitting clusters based on their similarity.
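
A short sketch of the DBSCAN approach, again assuming scikit-learn; the noise label -1 is read here as the outlier set:

# DBSCAN: points that do not fall inside any dense cluster get the noise label -1,
# which is treated here as the outlier set.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
cluster_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
cluster_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
stray = np.array([[2.5, 2.5], [10.0, -3.0]])              # far from both clusters
X = np.vstack([cluster_a, cluster_b, stray])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print("Outliers (noise points):", X[labels == -1])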
Other Methods:
Isolation Forest: Isolation forest randomly isolates data points by splitting
features and identifies outliers as those isolated quickly and easily.
One-class Support Vector Machines (OCSVM): A one-class SVM learns a
boundary around the normal data and identifies outliers as points falling outside
that boundary.
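
A minimal Isolation Forest sketch, also assuming scikit-learn; as with LOF, fit_predict marks outliers with -1:

# Isolation Forest: points that are isolated by very few random splits
# are given the anomaly label -1.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 2))         # made-up "normal" data
X = np.vstack([X, [[6.0, 6.0], [-7.0, 5.0]]])             # two injected anomalies

iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(X)                               # -1 = outlier, 1 = inlier
print("Detected outliers:", X[labels == -1])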
Outlier Removal
This involves identifying and removing outliers from the dataset before training
the model.
Thresholding: Outliers are identified as data points exceeding a certain threshold
(e.g., Z-score > 3).
Distance-based methods: Outliers are identified based on their distance from
their nearest neighbors.
Clustering: Outliers are identified as points not belonging to any cluster or
belonging to very small clusters.
