Introduction to Big Data Analytics

1. Introduction
Big data refers to larger, more complex data sets, especially those from new data sources. These data sets are so voluminous that traditional data processing software simply cannot manage them. But these massive volumes of data can be used to address business problems that could not have been tackled before.
Big Data is a collection of data that is huge in volume and grows exponentially with time. Its size and complexity are so great that no traditional data management tool can store or process it efficiently.
The New York Stock Exchange is an example of Big Data: it generates about one terabyte of new trade data per day.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

Evolution of Big Data


The evolution of big data has been a dynamic journey shaped by technological
advancements, changing business needs, and societal shifts. Here's a timeline
highlighting key milestones in the evolution of big data:

1. Early Days (1950s-1980s):


 Big data's origins can be traced back to the early days of computing
when mainframe computers were used to process large volumes of
data for scientific and government applications.
 Databases like IBM's Information Management System (IMS) and
relational databases emerged during this period, laying the groundwork
for data storage and management.
2. Data Warehousing (1980s-1990s):
 The concept of data warehousing gained prominence as organizations
sought to centralize their data for analysis and reporting.
 Technologies like Online Analytical Processing (OLAP) and Enterprise
Data Warehouses (EDW) enabled businesses to store and analyze
vast amounts of structured data.
3. Internet Era (1990s-2000s):
 The proliferation of the internet led to an explosion of digital data.
Websites, emails, and online transactions generated massive volumes
of unstructured and semi-structured data.
 Search engines like Google and Yahoo pioneered techniques for
indexing and searching web content, paving the way for scalable data
processing and retrieval.
4. Hadoop and MapReduce (2000s):
 Apache Hadoop, an open-source framework for distributed storage and
processing of big data, emerged as a game-changer.
 Inspired by Google's MapReduce paper, Hadoop introduced a
scalable, fault-tolerant architecture that could handle petabytes of data
across clusters of commodity hardware.
5. Emergence of NoSQL Databases (2000s-2010s):
 Traditional relational databases struggled to handle the variety and
volume of big data. NoSQL databases, designed for non-relational,
distributed data models, gained traction.
 Technologies like MongoDB, Cassandra, and HBase offered flexible,
schema-less data storage options suitable for big data use cases.
6. Cloud Computing (2010s-present):
 Cloud computing platforms like Amazon Web Services (AWS),
Microsoft Azure, and Google Cloud Platform democratized access to
scalable computing and storage resources.
 Organizations leveraged cloud services to deploy big data solutions
without the upfront costs and complexities associated with managing
on-premises infrastructure.
7. Real-time Analytics and AI (2010s-present):
 With the rise of IoT devices, social media, and mobile applications, the
demand for real-time analytics surged.
 Technologies like Apache Spark, Kafka, and Flink emerged to enable
real-time data processing and stream analytics.
 Artificial Intelligence (AI) and Machine Learning (ML) techniques
became integral to big data analytics, offering predictive insights and
automation capabilities.
8. Ethical and Regulatory Considerations (2010s-present):
 As big data applications proliferated, concerns about data privacy and security grew, prompting regulations such as the EU's General Data Protection Regulation (GDPR) and a greater emphasis on responsible data governance.

Types Of Big Data


Following are the types of Big Data:

1. Structured
2. Unstructured
3. Semi-structured

Structured
Any data that can be stored, accessed, and processed in a fixed format is termed ‘structured’ data. Over time, computer science has achieved great success in developing techniques for working with this kind of data (where the format is known in advance) and in deriving value from it.

An ‘Employee’ table in a database is an example of Structured Data.
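For instance, here is a minimal sketch of such a table using Python's built-in sqlite3 module; the table name, columns, and sample rows are illustrative assumptions, not taken from the text above:

import sqlite3

# Structured data: every row follows the same fixed schema,
# so it fits naturally into a relational table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, "
    "name TEXT, department TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Finance", 62000.0), (2, "Ravi", "Sales", 48000.0)],
)
# Because the format is known in advance, querying is straightforward.
for row in conn.execute("SELECT name, salary FROM employee WHERE salary > 50000"):
    print(row)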


Unstructured
Any data whose form or structure is unknown is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.

Examples Of Unstructured Data

The output returned by ‘Google Search’.

Semi-structured
Semi-structured data can contain both forms of data. Semi-structured data appears structured in form, but it is not defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.

Examples Of Semi-Structured Data

Personal data stored in an XML file.
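As a small illustration, the sketch below parses such a file with Python's standard-library ElementTree; the tags, field names, and values are hypothetical, invented for the example:

import xml.etree.ElementTree as ET

# Hypothetical semi-structured personal data: tags mark the fields,
# but no rigid schema is enforced by the storage layer.
xml_data = """
<people>
  <person id="1"><name>Asha</name><email>asha@example.com</email></person>
  <person id="2"><name>Ravi</name></person>
</people>
"""

root = ET.fromstring(xml_data)
for person in root.findall("person"):
    name = person.findtext("name")
    email = person.findtext("email", default="(not given)")  # field may be absent
    print(person.get("id"), name, email)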

1.1. Characteristics Of Data


Big data can be described by the following characteristics:

 Volume
 Variety
 Velocity
 Variability

(i) Volume – The name Big Data itself relates to an enormous size. The size of data plays a crucial role in determining its value, and whether a particular dataset can actually be considered Big Data depends on its volume. Hence, ‘Volume’ is one characteristic that must be considered when dealing with Big Data solutions.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
(iii) Velocity – The term ‘velocity’ refers to the speed at which data is generated. How fast the data is generated and processed to meet demands determines the real potential of the data.

Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

(iv) Variability – This refers to the inconsistency the data can show at times, which hampers the process of handling and managing the data effectively.

1.2. Challenges with Big Data


The challenges in Big Data are the real implementation hurdles. They require immediate attention and must be handled, because if they are not, the technology may fail, which can lead to unpleasant results. Big Data challenges include storing and analysing extremely large, fast-growing data.

1. Sharing and Accessing Data:


 Perhaps the most frequent challenge in big data efforts is the
inaccessibility of data sets from external sources.
 Sharing data can cause substantial challenges.
 It includes the need for inter- and intra-institutional legal documents.
2. Privacy and Security:
 This is another of the most important challenges with Big Data, with sensitive, conceptual, technical, as well as legal significance.
 Most organizations are unable to maintain regular checks due to the large amounts of data generated. However, security checks and monitoring should be performed in real time, as this is most beneficial.
3. Analytical Challenges:
 Big data poses some huge analytical challenges, raising questions such as: how do you deal with a problem when the data volume gets too large?
 Or how do you find the important data points?
4. Fault tolerance:
 Fault tolerance is another technical challenge, and fault-tolerant computing is extremely hard, involving intricate algorithms.
5. Scalability:
 Big data projects can grow and evolve rapidly. The scalability issue of Big Data has led towards cloud computing.
 This raises challenges such as how to run and execute various jobs so that the goal of each workload is achieved cost-effectively.
1.3 Why Big Data

Big data refers to the massive volume of structured and unstructured data that
inundates businesses on a day-to-day basis. There are several reasons why big data
is important:

1. Insights Generation: Big data provides valuable insights into customer behaviors, market trends, and operational patterns that were previously inaccessible or difficult to obtain.
2. Decision Making: With the help of big data analytics, organizations can make
data-driven decisions rather than relying solely on intuition or past
experiences.
3. Competitive Advantage: Companies that effectively harness big data can
gain a competitive edge by optimizing processes, improving customer
experiences, and identifying new business opportunities.
4. Innovation: Big data fuels innovation by enabling organizations to develop
new products, services, and business models based on a deep understanding
of data.
5. Cost Reduction: By analyzing big data, companies can identify inefficiencies
in their operations and streamline processes, leading to cost savings.
2.1 Introduction to Big Data Analytics
Big data analytics refers to the process of examining large and varied datasets,
typically referred to as "big data," to uncover hidden patterns, unknown correlations,
market trends, customer preferences, and other useful information that can help
organizations make more informed decisions.

Big data analytics involves using advanced analytics techniques such as predictive
analytics, data mining, machine learning, and statistical analysis to extract insights
from massive datasets. These datasets are often too large or complex for traditional
data processing applications to handle.

The primary goals of big data analytics are to:

1. Gain insights: Discover patterns, correlations, and trends within large datasets that can help organizations understand their operations, customers, and market dynamics better.
2. Make data-driven decisions: Use the insights gained from big data
analytics to inform strategic and operational decisions, optimize processes,
and improve outcomes.
3. Improve efficiency and effectiveness: Identify opportunities to
streamline operations, improve resource allocation, enhance customer
experiences, and increase productivity.
Big data analytics is widely used across various industries, including finance,
healthcare, retail, manufacturing, telecommunications, and many others, to drive
innovation, improve competitiveness, and create value from data.
2.2 Classification of Analytics
Analytics can be classified into several categories based on different criteria such as
the type of data being analysed, the methods used for analysis, and the goals of the
analysis. Here are some common classifications:

1. Descriptive Analytics: This type of analytics focuses on summarizing historical data to understand what has happened in the past. It provides insights into past performance and often involves basic statistical analysis and data visualization techniques.
2. Diagnostic Analytics: Diagnostic analytics aims to identify the reasons
behind past outcomes or events. It involves analysing historical data to
determine why certain outcomes occurred, often through root cause analysis
or correlation analysis.
3. Predictive Analytics: Predictive analytics uses historical data to forecast
future outcomes or trends. It involves applying statistical models and machine
learning algorithms to identify patterns in data and make predictions about
future events.
4. Prescriptive Analytics: Prescriptive analytics goes beyond predicting
future outcomes to recommend actions that can be taken to achieve desired
outcomes. It involves using optimization and simulation techniques to
generate actionable insights and recommendations.
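As a toy illustration of the difference between descriptive and predictive analytics, the sketch below first summarizes historical figures and then fits a simple forecasting model; the sales numbers and the use of scikit-learn are assumptions made for the example:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly sales figures (illustrative data only).
sales = np.array([120, 135, 150, 148, 170, 182], dtype=float)
months = np.arange(len(sales)).reshape(-1, 1)

# Descriptive analytics: summarize what has already happened.
print("mean:", sales.mean(), "max:", sales.max())

# Predictive analytics: fit a model on history and forecast the next month.
model = LinearRegression().fit(months, sales)
print("forecast for next month:", model.predict([[len(sales)]])[0])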

2.3 Why is Big Data Analytics Important



Big data analytics helps organisations harness their data and use it to identify new
opportunities. That, in turn, leads to smarter business moves, more efficient
operations, higher profits and happier customers. Businesses that use big data with
advanced analytics gain value in many ways, such as:

1. Reducing cost. Big data technologies like cloud-based analytics can significantly reduce costs when it comes to storing large amounts of data (for example, a data lake). Plus, big data analytics helps organisations find more efficient ways of doing business.
2. Making faster, better decisions. The speed of in-memory analytics –
combined with the ability to analyse new sources of data, such as streaming
data from IoT – helps businesses analyse information immediately and make
fast, informed decisions.
3. Developing and marketing new products and services. Being able to
gauge customer needs and customer satisfaction through analytics empowers
businesses to give customers what they want, when they want it. With big
data analytics, more companies have an opportunity to develop innovative
new products to meet customers’ changing needs.

2.4 Data Science

Data science is a multidisciplinary field that uses scientific methods, algorithms, processes,
and systems to extract knowledge and insights from structured and unstructured data. It
combines aspects of statistics, computer science, and domain expertise to analyse complex
datasets and solve real-world problems. Data scientists employ various techniques such as
data mining, machine learning, predictive analytics, and data visualization to uncover
patterns, trends, and correlations in data that can inform decision-making and drive
innovation in various industries.

2.5 Responsibilities of a Data Scientist

The responsibilities of a data scientist can vary depending on the organization and
the specific role, but here are some common tasks and responsibilities:

1. Data Collection and Cleaning: Gathering data from various sources such as
databases, APIs, or web scraping. Cleaning and preprocessing the data to
remove errors, missing values, and inconsistencies.
2. Data Analysis and Exploration: Exploring the data to understand patterns,
trends, and relationships. Using statistical methods and visualization
techniques to gain insights and identify potential areas for further
investigation.
3. Model Development: Developing machine learning models and algorithms to
solve specific business problems or make predictions. This involves selecting
appropriate models, feature engineering, hyperparameter tuning, and
evaluating model performance.
4. Model Deployment: Deploying models into production environments, which
may involve working with software engineers to integrate models into existing
systems or develop new applications.
5. Testing and Validation: Testing the performance of models using validation techniques such as cross-validation or holdout validation (see the sketch after this list). Ensuring that models are robust and generalize well to new data.
6. Communication and Visualization: Communicating findings and insights to
stakeholders through reports, presentations, or interactive dashboards.
Visualizing data and model results in a clear and understandable way.
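A minimal sketch of the cross-validation mentioned in point 5, using scikit-learn; the synthetic dataset and choice of model are assumptions for illustration only:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data standing in for a real business dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold cross-validation: train on four folds, test on the held-out fold,
# repeating so every fold serves once as the test set. The spread of scores
# indicates how well the model generalizes to new data.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())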

Overall, data scientists play a crucial role in extracting meaningful insights from data to
inform decision-making and drive business value.

2.6 Terminologies Used in Big Data Environment


In a big data environment, there are several terminologies commonly used to
describe various concepts, technologies, and processes. Here are some of the key
ones:

1. Big Data: Refers to large volumes of data, both structured and unstructured, that inundate a business on a day-to-day basis.
2. Hadoop: An open-source framework used for distributed storage and
processing of large datasets across clusters of computers.
3. MapReduce: A programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster (see the sketch after this list).
4. Data Warehouse: A central repository of integrated data from one or more
disparate sources, used for reporting and data analysis.
5. Data Lake: A storage repository that holds a vast amount of raw data in its
native format until it is needed.
6. ETL (Extract, Transform, Load): The process of extracting data from various
sources, transforming it to fit operational needs, and loading it into a data
warehouse or data lake.
7. NoSQL: A type of database that provides a mechanism for storage and
retrieval of data that is modeled in means other than the tabular relations
used in relational databases.
8. SQL (Structured Query Language): A domain-specific language used in
programming and designed for managing data held in a relational database
management system or for stream processing in a relational data stream
management system.
9. Data Mining: The process of discovering patterns in large data sets involving
methods at the intersection of machine learning, statistics, and database
systems.
10. Machine Learning: A subset of artificial intelligence that uses statistical
techniques to enable computer systems to learn from and make predictions or
decisions based on data.
11. Data Visualization: The graphical representation of information and data to
communicate complex information clearly and efficiently.
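To make the MapReduce entry above concrete, here is a minimal single-process sketch of the classic word-count pattern in plain Python; a real MapReduce framework such as Hadoop runs the map and reduce phases in parallel across a cluster, which this illustration does not attempt:

from collections import defaultdict

# Map phase: emit (key, value) pairs -- here, (word, 1) for every word.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

# Shuffle phase: group values by key (the framework does this in real MapReduce).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine all values for each key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big tools", "data drives decisions"]
print(reduce_phase(shuffle(map_phase(lines))))
# -> {'big': 2, 'data': 2, 'needs': 1, 'tools': 1, 'drives': 1, 'decisions': 1}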