
Introduction – Session 1

– Session 2 –

Table of Contents
What is big data

• Understanding the basic concept of big data

Why is big data important

• What are the reasons to keep big data

How is big data handled

• What are basic processing steps needed

Big data for Machine learning

• How is big data handled for ML

Introduction to cloud

• Understand the fundamental concept behind cloud-based big data handling
What is big data ?
Big data refers to extremely large and complex sets of data that cannot be easily processed or analyzed using
traditional data processing tools and methods. It often involves structured, unstructured, and semi-structured data
from various sources and can be used to uncover insights, patterns, and trends that can inform decision-making
and improve business operations.

Volume in Big Data
In the context of big data, volume refers to the sheer amount of data that is being generated and collected from various sources. This data can come from social media, internet searches, sensors, mobile devices, and other sources. The volume of data is usually measured in terabytes, petabytes, or even exabytes. Managing and analyzing such large volumes of data requires specialized tools and techniques, such as distributed computing, cloud computing, and machine learning algorithms.

Veracity in Big Data
Veracity is one of the key characteristics of big data and refers to the accuracy, consistency, and trustworthiness of the data. When dealing with large and complex datasets, it is common to encounter data quality issues such as missing or incomplete data, errors, inconsistencies, and biases. Veracity is concerned with the degree to which the data can be trusted and relied upon to make informed decisions. It is important to ensure the veracity of data before analyzing it to avoid making incorrect or unreliable conclusions.

Velocity in Big Data
Velocity is one of the key characteristics of big data and refers to the speed at which data is being generated, collected, and processed. With the advent of the internet and social media, data is being generated and shared at an unprecedented rate, making it difficult to keep up. Velocity is important because the faster data can be analyzed, the quicker insights can be gained and decisions can be made. Technologies such as real-time analytics, stream processing, and in-memory computing have emerged to handle the high velocity of data.
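To make veracity concrete, here is a minimal sketch of the kind of quality checks worth running before trusting a dataset, using pandas; the dataset and column names (age, email) are hypothetical, invented for the example.

```python
import pandas as pd

# Hypothetical dataset; the columns and values are illustrative only.
df = pd.DataFrame({
    "age": [34, None, 29, 210, 45],  # 210 is an implausible value
    "email": ["a@x.com", "b@y.com", "b@y.com", None, "c@z.com"],
})

# Missing or incomplete data: count nulls per column.
print(df.isna().sum())

# Inconsistencies: flag records that share the same email address.
print("duplicate emails:", df.duplicated(subset=["email"]).sum())

# Errors: flag values outside a plausible range before drawing conclusions.
print("implausible ages:", (df["age"] > 120).sum())
```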
Variety in Big Data
Variety is one of the key characteristics of big data and refers to the diversity of data types and sources. Technologies such as Hadoop, NoSQL databases, and data lakes have emerged to handle this variety of data types and sources.

Value in Big Data
Value is one of the key goals of big data and refers to the benefits that can be gained from analyzing large and complex datasets. By uncovering insights, patterns, and trends in data, organizations can make informed decisions. The value of big data is often measured in terms of return on investment (ROI) and can include increased revenue, reduced costs, improved customer satisfaction, and more.

Example: big data in healthcare
One example of a big data use case is in the healthcare industry. With the increasing amount of medical data being generated and collected from various sources, big data analytics can help improve patient outcomes and reduce costs. For instance, the Mount Sinai Health System in New York City has implemented a big data platform to analyze patient data in real time and provide personalized treatment plans.

Specifications:
Volume: The platform can handle over 2.5 petabytes of data, including electronic medical records, genomic data, and other sources.
Velocity: The platform can process and analyze data in real time, allowing for faster diagnosis and treatment decisions.
Variety: The platform can handle a variety of data types, including structured and unstructured data from sources such as medical devices, wearables, and social media.
Veracity: The platform uses advanced analytics and machine learning algorithms to improve data quality and accuracy, reducing errors and inconsistencies.

The big data platform at Mount Sinai has helped improve patient outcomes and reduce costs by providing personalized treatment plans and identifying potential health risks before they become serious. For example, the platform has helped identify patients at risk of developing sepsis, a life-threatening infection, enabling early intervention to prevent its onset. The platform has also helped reduce hospital readmissions and unnecessary medical procedures, resulting in cost savings for both patients and the healthcare system.
Why is Big Data important ?
Big data is important for several reasons, including:

1. Improved decision-making: By analyzing large and complex datasets, organizations can gain insights and make informed decisions that can improve business operations, customer satisfaction, and overall performance.

2. Increased efficiency and productivity: Big data technologies can automate and streamline processes, reducing manual labor and improving efficiency.

3. Competitive advantage: Organizations that can effectively analyze and utilize big data can gain a competitive advantage by identifying new opportunities, improving customer experiences, and optimizing operations.

4. Better customer insights: Big data analytics can provide deeper insights into customer behavior and preferences, allowing organizations to tailor their products and services to meet customer needs.

5. Innovation and discovery: Big data can help drive innovation and discovery by identifying new trends and patterns that were previously unknown.

6. Cost savings: Big data technologies can help reduce costs by improving efficiency, reducing waste, and optimizing operations.

7. Improved risk management: Big data analytics can help identify potential risks and threats, allowing organizations to take proactive measures to mitigate them.
Why is Big Data important ?
Here are a few more reasons why big data is important:

8. Personalization: Big data analytics can help personalize experiences for customers, such as recommending
products or services based on their preferences and behavior.

9. Predictive analytics: Big data can be used to develop predictive models that can forecast future trends and
outcomes, allowing organizations to make proactive decisions and strategies.

10. Improved supply chain management: Big data can be used to optimize supply chain operations, such as
predicting demand, identifying potential bottlenecks, and improving inventory management.

11. Better fraud detection: Big data analytics can help identify potential fraudulent activity, such as credit card
fraud, by analyzing patterns and anomalies in data.

12. Improved healthcare outcomes: Big data can be used to analyze patient data and develop personalized
treatment plans, as well as identify potential health risks before they become serious.

13. Environmental sustainability: Big data can be used to monitor and analyze environmental data, such as air
and water quality, to identify potential issues and improve sustainability efforts.

Overall, big data has the potential to impact almost every aspect of modern life, from business operations to
healthcare to the environment. As such, it is becoming increasingly important for organizations to effectively collect,
store, process, and analyze data in order to stay competitive and improve outcomes.
How is Big Data Handled ?
1. Data sources: Big data can come from a variety of sources, including social media, sensors, mobile devices, and other sources.

2. Data ingestion: Data must be ingested into the big data system for processing. This can be done through tools such as Apache Kafka, Flume, or NiFi (a minimal ingestion sketch follows this list).

3. Data storage: Big data must be stored in a way that allows for easy access and processing. This can include traditional databases as well as specialized big data storage solutions such as Hadoop Distributed File System (HDFS), Apache Cassandra, or Amazon S3.

4. Data processing: Big data processing involves the use of tools and technologies such as Apache Hadoop, Apache Spark, and MapReduce to analyze and extract insights from large and complex datasets.

5. Data analysis: Once the data has been processed, it can be analyzed using a variety of tools such as machine learning algorithms, data visualization tools, and statistical models.

6. Data visualization: Data visualization tools such as Tableau, Power BI, or Matplotlib can be used to create visual representations of the data, making it easier to understand and interpret.

7. Data governance: Big data must be governed to ensure that it is accurate, consistent, and secure. This involves policies and procedures for data quality, privacy, and security.

8. Data security: Big data must be protected against potential security threats such as cyber attacks, data breaches, and unauthorized access. This can involve tools such as firewalls, access controls, and encryption.
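As an illustration of step 2, here is a minimal ingestion sketch using the kafka-python client; the broker address (localhost:9092), topic name (sensor-events), and event fields are assumptions made for the example.

```python
import json
from kafka import KafkaProducer  # kafka-python package

# Minimal ingestion sketch; broker address and topic name are hypothetical.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"sensor_id": 42, "temperature": 21.7}  # example sensor reading
producer.send("sensor-events", value=event)     # push the event into the pipeline
producer.flush()                                # block until delivery is confirmed
```

A downstream consumer (or a stream processor such as Spark Structured Streaming) would then read from the same topic, which is what decouples fast producers from slower analysis jobs.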
How is Big Data Handled ?
Cloud infrastructure
Big data handling can be done on-premise or in the cloud. Cloud infrastructure such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform can provide scalable and cost-effective solutions for big data handling.

Data lifecycle management
Big data must be managed throughout its lifecycle, including data creation, storage, processing, analysis, and disposal. Here are the typical steps involved in big data lifecycle management:

1. Data creation: Big data is created from various sources such as sensors, social media, mobile devices, or other applications.

2. Data ingestion: Data must be ingested into the big data system for processing. This can be done through tools such as Apache Kafka, Flume, or NiFi.

3. Data storage: Big data must be stored in a way that allows for easy access and processing. This can include traditional databases as well as specialized big data storage solutions such as Hadoop Distributed File System (HDFS), Apache Cassandra, or Amazon S3.

4. Data processing: Big data processing involves the use of tools and technologies such as Apache Hadoop, Apache Spark, and MapReduce to analyze and extract insights from large and complex datasets (a minimal Spark sketch follows this list).

5. Data analysis: Once the data has been processed, it can be analyzed using a variety of tools such as machine learning algorithms, data visualization tools, and statistical models.

6. Data visualization: Data visualization tools such as Tableau, Power BI, or Matplotlib can be used to create visual representations of the data, making it easier to understand and interpret.

7. Data governance: Big data must be governed to ensure that it is accurate, consistent, and secure. This involves policies and procedures for data quality, privacy, and security.

8. Data security: Big data must be protected against potential security threats such as cyber attacks, data breaches, and unauthorized access. This can involve tools such as firewalls, access controls, and encryption.

9. Data archiving: As big data volumes grow, it is important to archive data that is no longer needed for analysis to free up storage space and maintain system performance.

10. Data disposal: At the end of its lifecycle, big data must be disposed of in a secure and compliant manner to avoid potential data breaches or regulatory issues.
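As an illustration of steps 4 and 5, here is a minimal PySpark sketch; the input path and the sensor-event schema are hypothetical, and on a real cluster the path would typically be an HDFS or S3 URI.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal processing/analysis sketch; the path and fields are hypothetical,
# e.g. "hdfs:///data/events" or "s3a://bucket/events" on a real cluster.
spark = SparkSession.builder.appName("big-data-processing").getOrCreate()

events = spark.read.json("events/*.json")

# Aggregate in parallel across the cluster: average reading per sensor.
summary = (events
           .groupBy("sensor_id")
           .agg(F.avg("temperature").alias("avg_temperature")))

summary.show()
spark.stop()
```

The same code runs unchanged in local mode and on a cluster; Spark distributes the groupBy/aggregate across however many executors are available, which is what makes it suitable for datasets far larger than one machine's memory.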
How is big data handled for Machine learning tasks ?
Machine Learning algorithms can help overcome the challenges of analyzing such large and complex datasets by automatically detecting patterns in the data. Overall, Big Data and Machine Learning are complementary fields: together they can help machines learn how to recognize patterns in complex datasets and make valuable predictions.
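One common pattern for ML on datasets too large to fit in memory is incremental (out-of-core) training. The sketch below uses scikit-learn's partial_fit; the synthetic mini-batches and the toy label rule are stand-ins for batches streamed from disk or a data lake.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Out-of-core learning sketch: train incrementally on mini-batches
# instead of loading the full dataset. Data here is synthetic.
clf = SGDClassifier()
classes = np.array([0, 1])  # must be declared up front for partial_fit

rng = np.random.default_rng(0)
for _ in range(100):                           # stand-in for batches streamed from storage
    X_batch = rng.normal(size=(1_000, 20))
    y_batch = (X_batch[:, 0] > 0).astype(int)  # toy label rule for the sketch
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.normal(size=(1_000, 20))
y_test = (X_test[:, 0] > 0).astype(int)
print("held-out accuracy:", clf.score(X_test, y_test))
```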
Introduction to cloud-based big data handling

Cloud based big data handling and Analytics

Cloud | Data storage         | Data processing                           | Data analysis                    | Data visualization
AWS   | Amazon S3            | Amazon EMR                                | Amazon SageMaker                 | Amazon QuickSight
Azure | Azure Blob Storage   | Microsoft Azure HDInsight                 | Microsoft Azure Machine Learning | Microsoft Power BI
GCP   | Google Cloud Storage | Google Cloud Dataproc                     | Google Cloud AI Platform         | Google Data Studio
Tools | –                    | Apache Hadoop, Apache Spark, or MapReduce | –                                | Tableau, Power BI, or Matplotlib
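As a small example of the storage column, here is a minimal sketch using boto3 to put an object into Amazon S3; the bucket name and file are hypothetical, and AWS credentials are assumed to be configured (for example via environment variables or an instance role).

```python
import boto3

# Minimal S3 storage sketch; bucket name and key are hypothetical.
s3 = boto3.client("s3")

# Upload a local file into the bucket under a raw-data prefix.
s3.upload_file("events.json", "my-big-data-bucket", "raw/events.json")

# List what landed under the raw/ prefix.
response = s3.list_objects_v2(Bucket="my-big-data-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```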
Cloud based big data handling and Analytics

AWS Redshift is a fast, scalable, and fully managed data warehouse that makes it easy to analyze large amounts of data.

AWS Kinesis is a fully managed service that processes and analyzes streaming data in real time at massive scale.

AWS Elastic Beanstalk is a fully managed service that makes it easy to deploy and run applications in a variety of languages.
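For instance, a producer might push records into a Kinesis stream with a few lines of boto3; the stream name and record fields here are hypothetical, and credentials are assumed to be configured.

```python
import json
import boto3

# Minimal Kinesis sketch; the stream name is hypothetical.
kinesis = boto3.client("kinesis")

record = {"sensor_id": 42, "temperature": 21.7}
kinesis.put_record(
    StreamName="sensor-stream",
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=str(record["sensor_id"]),  # determines which shard receives the record
)
```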
Thank you !
Introduction

Ashutosh Vyas
Solutions Lead – Data Science & Quantum Computing
Harman DTS India Pvt. Ltd.
“He has 8+ years of experience in the data science domain. He has worked on multiple projects involving pattern recognition, time series forecasting, regression modelling, NLP, classification, and optimization across the Life Sciences, Finance, FMCG, and Media domains. He completed his M.Tech in 2015 from IIIT-B. He has expertise in Bayesian and frequentist methods of machine learning and has been working on quantum ML and quantum optimization for the past 4 years. He works with an ethos of developing customer-centric and robust solutions.”
