PPT 1.1.3
By: Urvashi
Chapter 1: Understanding Big Data and the 5 V’s
Introduction to Big Data – Definition and Characteristics; The 5 V’s of Big Data – Volume: Data at scale, Velocity: Real-time data processing, Variety: Structured, semi-structured, unstructured data, Veracity: Uncertainty and trustworthiness in data, Value: Transforming data into insights; Challenges and Opportunities in Big Data; Big Data Use Cases in Real-World Applications
Chapter 2: Big Data Architecture
Fundamentals of Big Data Architecture: Data ingestion, storage, processing and visualization layers; Streaming Data in Big Data: Tools such as Spark, Apache Kafka and Flink; Real-World Big Data Architecture: Lambda and Kappa Architectures, Hybrid Architecture for batch and real-time processing
Chapter 3: The Hadoop Ecosystem
Introduction to the Hadoop Ecosystem; HDFS (Hadoop Distributed File System): Architecture and Functionality; MapReduce Programming Model: Workflow and Applications; YARN (Yet Another Resource Negotiator): Resource Management; Tools in the Ecosystem: Pig, HBase, Flume, and Oozie; Data Processing with Hadoop: ETL, Analytics and Reporting.
Course Outcomes
4. Veracity in Big Data
Definition:
Veracity in big data pertains to the uncertainty, inconsistency, and
inaccuracies inherent in data sources. It reflects the degree of
confidence or trust that users can place in the data they work with.
High-veracity data is clean, consistent, and reliable, while low-veracity
data may suffer from errors, biases, or ambiguities.
Sources of Veracity Problems
Several factors contribute to challenges in maintaining data veracity, including:
• Data Inconsistencies: Inconsistent formats, duplicate records, or outdated
information.
• Data Noise: Inclusion of irrelevant or meaningless data that hinders analysis.
• Human Errors: Mistakes during data entry or manipulation.
• Biases: Systemic errors introduced by data collection processes or inherent
biases in algorithms.
• Unverified Sources: Use of unreliable data sources or lack of validation
mechanisms.
Importance of Veracity
Ensuring veracity is crucial as decisions based on low-quality data can
lead to:
• Faulty analytics and inaccurate insights.
• Reduced stakeholder trust in big data systems.
• Financial losses and reputational damage.
• Ethical and compliance risks.
Methods to Address Veracity Challenges
• Data Cleaning and Preprocessing
o Removing duplicates and inconsistencies.
o Standardizing formats and correcting errors.
• Data Validation
o Using automated tools to cross-check data accuracy.
o Validating data from third-party sources before integration.
• Metadata Management
o Maintaining detailed metadata to provide context, origin, and changes to
the data.
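The cleaning and validation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the records, the field names (id, email, signup) and the two accepted date formats are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are illustrative only.
RAW_RECORDS = [
    {"id": "1", "email": " Alice@Example.com ", "signup": "2024-01-05"},
    {"id": "1", "email": "alice@example.com", "signup": "05/01/2024"},
    {"id": "2", "email": "bob@example.com", "signup": "2024-02-30"},
    {"id": "3", "email": "carol@example.com", "signup": "2024-03-10"},
]

# Assumed input formats: ISO (year-month-day) and day/month/year.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def standardize_date(text):
    """Return the date as ISO 8601 text, or None if no assumed format parses."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(records):
    """Standardize formats, drop duplicates, and set aside invalid rows."""
    seen, valid, rejected = set(), [], []
    for rec in records:
        row = {
            "id": rec["id"].strip(),
            "email": rec["email"].strip().lower(),
            "signup": standardize_date(rec["signup"].strip()),
        }
        if row["signup"] is None:
            # Validation failed: keep the original row for later review.
            rejected.append(rec)
            continue
        key = (row["id"], row["email"], row["signup"])
        if key in seen:
            # Exact duplicate after standardization: skip it.
            continue
        seen.add(key)
        valid.append(row)
    return valid, rejected


valid_rows, rejected_rows = clean(RAW_RECORDS)
print(len(valid_rows), "valid rows,", len(rejected_rows), "rejected")
```

Note that the two records for id 1 only become recognizable as duplicates after their formats are standardized, which is why standardization runs before de-duplication here. The row with the impossible date is set aside rather than silently corrected.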
Methods to Address Veracity Challenges (continued)
• Advanced Analytics
o Leveraging machine learning models to detect anomalies and
inconsistencies in datasets.
• Transparent Governance
o Implementing policies for data collection, handling, and quality assurance.
o Periodically auditing data for reliability.
• Crowdsourcing Veracity
o Engaging users to identify and correct errors in large datasets, especially for
open data projects.
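As a rough illustration of anomaly detection, the sketch below flags outliers with a modified z-score based on the median and the median absolute deviation (MAD). This is a simple statistical stand-in, not a machine learning model; the function name, the threshold of 3.5 and the sample readings are assumptions chosen for the example.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indexes of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (or most of them) are identical; nothing can be scored.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data is roughly normally distributed.
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical sensor readings with one suspicious value.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 55.0, 20.2]
flagged = mad_outliers(readings)
print("Suspicious indexes:", flagged)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value distorts them far less. Flagged values are candidates for review, not proven errors.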
5. Value in Big Data
Definition:
Value refers to the ability to transform data into insights that
an organization can act on. Big Data has become a cornerstone of
decision-making and innovation across industries: by
analyzing massive datasets, organizations can uncover
insights that drive growth, improve efficiency, and
enhance customer experiences. This section outlines how
value is created from Big Data, the challenges in extracting
it, and the technologies that support it.
Value Creation from Big Data
The core value of Big Data lies in its potential to transform raw information into actionable insights.
This value can be categorized into several areas:
a. Operational Efficiency
• Streamlining processes and reducing waste.
• Predictive maintenance in manufacturing using sensor data.
c. Competitive Advantage
• Identifying emerging trends ahead of competitors.
• Innovating new products and services based on data insights.
Challenges in Extracting Value
While Big Data holds immense promise, several challenges must be addressed to
maximize its value:
• Data Quality: Ensuring data accuracy, completeness, and consistency.
• Data Privacy and Security: Protecting sensitive information against breaches.
• Skill Gaps: Developing expertise in data analytics and related technologies.
• Infrastructure Costs: Investing in storage, processing, and tools for Big Data.
• Ethical Concerns: Preventing biases and ensuring fairness in data usage.
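One way to make the data quality challenge measurable is a completeness score: the share of required fields that actually hold a value. The sketch below is one possible definition, assumed for illustration; the function name, field names and sample records are hypothetical.

```python
def completeness(records, required_fields):
    """Fraction of required fields that are present and non-empty."""
    total = len(records) * len(required_fields)
    if total == 0:
        # No fields to check, so nothing is missing.
        return 1.0
    filled = sum(
        1
        for rec in records
        for field in required_fields
        if rec.get(field) not in (None, "")
    )
    return filled / total


# Hypothetical customer records with two missing values.
sample = [
    {"customer": "A17", "country": "IN", "spend": 120.0},
    {"customer": "B02", "country": "", "spend": 75.5},
    {"customer": "C33", "country": "US", "spend": None},
]
score = completeness(sample, ["customer", "country", "spend"])
print(f"Completeness: {score:.0%}")
```

A score like this covers only completeness; accuracy and consistency need their own checks, such as comparison against a trusted source or cross-field rules.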
Technologies Driving Big Data Value
The following technologies facilitate the collection, storage, analysis, and
visualization of Big Data:
• Artificial Intelligence (AI) and Machine Learning (ML): Automating data analysis
to uncover patterns and predictions.
• Cloud Computing: Scalable storage and processing capabilities.
• Internet of Things (IoT): Generating real-time data from connected devices.
• Blockchain: Ensuring data integrity and traceability.
• Data Visualization Tools: Simplifying complex datasets into actionable insights.