DA Unit 1
Data Management
Data is being generated in huge volumes, at a very fast rate, and in many different types, so we need a mechanism for managing it.
Data Management in Data Analytics
1. Data Collection: Gathering data from various sources like websites, sensors,
databases, etc.
2. Data Storage: Storing data in databases, data warehouses, or data lakes, ensuring
that it's easy to retrieve.
3. Data Cleaning: Removing errors, duplicates, and inconsistencies from the data to
ensure it's accurate and usable (a brief sketch of steps 1-3 follows this list).
4. Data Security: Protecting data from unauthorized access and ensuring privacy.
5. Data Governance: Setting rules and policies for managing data throughout its
lifecycle.
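To make the first three steps above concrete, here is a minimal sketch in Python using pandas and SQLite. The file name, table name, and column names are made up for the example; it is an illustration of the idea, not a prescribed workflow.

import sqlite3
import pandas as pd

# 1. Data Collection: read raw records from a (hypothetical) exported CSV file.
raw = pd.read_csv("customers_raw.csv")

# 3. Data Cleaning: drop exact duplicates and rows missing an email address,
# and remove stray whitespace from names.
clean = raw.drop_duplicates()
clean = clean.dropna(subset=["email"])
clean["name"] = clean["name"].str.strip()

# 2. Data Storage: write the cleaned records into a local SQLite database
# so they are easy to retrieve later.
with sqlite3.connect("analytics.db") as conn:
    clean.to_sql("customers", conn, if_exists="replace", index=False)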
Data Architecture
Data Architecture defines how data is collected, stored, and processed. It provides a
blueprint for managing data and ensuring that it flows efficiently from collection to
analysis.
Important components:
1. Data Sources: Where data originates (e.g., social media, sensors, databases).
2. Data Pipelines: The process of moving data from sources to storage and analytics
tools.
3. Data Storage:
Databases: Structured storage, good for transactional data (e.g., SQL databases).
Data Lakes: Store both structured and unstructured data for more flexible use.
4. Data Integration: Combining data from different sources into a unified view (a short sketch follows this list).
5. Analytics Layer: Tools and software that process and analyze the data (e.g., SQL
queries, machine learning algorithms).
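To illustrate the Data Integration component, below is a minimal, hypothetical sketch in Python/pandas that combines customer records from two sources into one unified view. The sources, values, and column names are invented for the example.

import pandas as pd

# Customer data from a CRM export (hypothetical source 1).
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Ben", "Chloe"],
})

# Order data from a transactional database (hypothetical source 2).
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 1, 3],
    "amount": [250.0, 99.5, 30.0],
})

# Data Integration: join the two sources on the shared customer_id key
# to get a single unified view of customers and their orders.
unified = crm.merge(orders, on="customer_id", how="left")
print(unified)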
In data architecture, there are three main levels of modeling: conceptual, logical, and
physical. These layers represent how data is structured and managed, from abstract
ideas to detailed implementation. Each layer builds on the previous one, adding more
specifics as you move down.
1. Conceptual Model
The conceptual model is the high-level, abstract view of the data and how it relates to
the business or organization. This model focuses on what data is required, without
worrying about how it is stored or implemented. It provides a broad overview for
stakeholders to understand the data landscape.
No technical details: The model does not specify how the data will be stored (e.g.,
database type, data structure).
Key elements:
Entities: Major objects of interest, like "Customer", "Order", "Product".
Relationships: A Customer can place multiple Orders, and an Order can contain
multiple Products.
Example: At this level the model only states that the business deals with Customers, Orders, and Products and how they relate; nothing is said about how any of this is stored.
2. Logical Model
The logical model adds more detail to the conceptual model by describing how the
data will be structured without being tied to a specific technology (e.g., database
type). It focuses on what types of data are stored and the rules governing the
relationships, but it still remains technology-agnostic.
Purpose: To define data structures in more detail (attributes, data types) and show
how they relate logically. It's mainly for data architects or analysts.
More detailed, but no physical implementation: Defines fields, data types, and
relationships, but does not specify how this will be implemented physically.
Key elements:
Entities and Attributes: Entities (e.g., Customer) are defined with their attributes (e.g.,
Customer Name, Customer ID).
Primary and Foreign Keys: Identifies how tables/entities are connected, like linking a
"Customer ID" to an "Order".
Example:
Customer (CustomerID [PK], CustomerName)
Order (OrderID [PK], CustomerID [FK])
The CustomerID foreign key in Order links each Order to the Customer who placed it.
3. Physical Model
The physical model translates the logical model into an actual implementation on a
specific database system (e.g., MySQL or Oracle). This model is concerned with the
technical details of how data will be stored and accessed, including specific hardware
and software configurations.
Purpose: To map the data model onto the physical storage system, defining exactly
how data will be stored, indexed, and accessed. It's for database administrators and
engineers.
Key elements:
Tables and Columns: Actual database tables and columns that store the data.
Example:
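A minimal sketch of how the Customer and Order entities from the logical model might be implemented physically, here assuming SQLite accessed through Python's sqlite3 module. The column names and the index are illustrative choices, not the only possible design.

import sqlite3

conn = sqlite3.connect("shop.db")
cur = conn.cursor()

# Physical tables implementing the Customer and Order entities
# (the table is named "orders" because ORDER is a reserved word in SQL).
cur.execute("""
CREATE TABLE IF NOT EXISTS customer (
    customer_id INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL
)
""")
cur.execute("""
CREATE TABLE IF NOT EXISTS orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date TEXT
)
""")

# A physical-level concern: index the foreign key to speed up lookups
# of all orders belonging to a given customer.
cur.execute("CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders(customer_id)")
conn.commit()
conn.close()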
Alongside the architecture itself, two further concerns matter for analytics:
Data Quality: High-quality data is essential for accurate analysis; managing it means keeping data accurate, complete, consistent, and up to date (covered in detail later in this unit).
Data Access: Ensuring that analysts can access the data they need through proper tools (e.g., dashboards, SQL databases).
Sources of Data
Data can come from various sources, and understanding these sources is crucial in
data analytics. Each type of data source has different characteristics, formats, and
methods of collection. Here’s an overview of some common sources of data:
1. Sensor Data
Sensors are devices that detect and measure physical properties like temperature,
light, pressure, sound, or motion, and convert them into signals for analysis.
Characteristics:
Examples:
2. Signal Data
Signals refer to any transmitted data that carries information, often in the form of
electrical or electromagnetic waves.
Characteristics:
Frequency and Amplitude: These are the key properties of signals, often analyzed for
patterns.
Time-series data: Signal data is usually recorded over time, showing changes in
intensity or frequency.
Examples:
Sound Signals: Microphones convert sound waves into data that can be processed.
Applications: Speech and audio analysis, telecommunications, and monitoring of signal quality.
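Since signal data is usually a noisy time series, a common first step is to smooth it. Below is a minimal sketch using NumPy with made-up sample values; the simple moving average stands in for more sophisticated filtering.

import numpy as np

# A short, made-up signal: a clean pattern plus random noise.
rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 4 * np.pi, 200))
signal = clean + rng.normal(scale=0.3, size=clean.size)

# Smooth with a 10-sample moving average (a very simple low-pass filter).
window = 10
kernel = np.ones(window) / window
smoothed = np.convolve(signal, kernel, mode="same")

print(signal[:5])
print(smoothed[:5])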
3. GPS Data
Global Positioning System (GPS) data is used to track location by receiving signals
from satellites.
Characteristics:
Latitude and Longitude: GPS data provides the geographic coordinates of a location.
Real-time tracking: Data can be continuously updated to provide real-time
positioning.
Accuracy: The precision of GPS data can vary, but modern systems can be accurate
to within a few meters.
Examples:
Smartphone Location Tracking: Used in maps and navigation apps (e.g., Google
Maps).
Fitness Devices: Tracking routes and distances covered during a run or cycling
session.
Applications: Navigation and route planning, fleet and delivery tracking, and fitness or activity tracking.
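A common calculation on GPS data is the distance between two latitude/longitude points. Here is a small sketch using the standard haversine formula; the coordinates are illustrative values only.

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points on Earth, in kilometres.
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Two illustrative points a fitness device might record along a run.
print(round(haversine_km(28.6139, 77.2090, 28.7041, 77.1025), 2), "km")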
4. Transactional Data
Transactional data refers to data generated from business transactions like sales,
purchases, and other operations.
Characteristics:
Historical: Transactional data is often stored over time for analysis of trends or
patterns.
Examples: Point-of-sale purchases, online orders, and bank payments.
Applications: Sales trend analysis, inventory management, and fraud detection.
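Because transactional data is historical and time-stamped, a typical analysis is aggregating it over time. Below is a minimal pandas sketch with made-up sales records that totals revenue per day.

import pandas as pd

# Made-up transactional records: one row per sale.
sales = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-03-01 09:15", "2024-03-01 17:40",
        "2024-03-02 11:05", "2024-03-02 13:30", "2024-03-02 19:55",
    ]),
    "amount": [120.0, 45.5, 300.0, 25.0, 80.0],
})

# Aggregate transactions by calendar day to study the sales trend.
daily_revenue = sales.groupby(sales["timestamp"].dt.date)["amount"].sum()
print(daily_revenue)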
5. Social Media Data
Social media data is the information generated from platforms like Facebook, Twitter,
Instagram, and LinkedIn.
Characteristics:
Examples:
Applications:
6. Web Data
Web data refers to information that can be extracted from websites, typically through
web scraping or APIs.
Characteristics:
Semi-structured or Unstructured: Web data is often in HTML or JSON format,
requiring processing to extract useful information.
Dynamic: Websites update frequently, and data may change over time.
Public or Private: Some web data is publicly available (e.g., public blogs), while others
require authentication (e.g., personal account details).
Examples:
Web traffic data: Information on user visits, page views, and clicks.
E-commerce data: Product prices, reviews, and ratings from online shopping
platforms.
Social media APIs: Data pulled from platforms like Twitter via their APIs.
Applications: Price monitoring, sentiment analysis of reviews, and website traffic analytics.
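Web data is usually pulled programmatically. The sketch below uses Python's requests library against a placeholder URL; the endpoint and the JSON field names are hypothetical, and real APIs typically also require authentication.

import requests

# Hypothetical endpoint returning product data as JSON; a real API would
# differ and usually needs an API key or token.
url = "https://example.com/api/products"

response = requests.get(url, timeout=10)
response.raise_for_status()          # fail loudly on HTTP errors
products = response.json()           # parse the JSON payload

# Extract the fields of interest (names assumed for this sketch).
for item in products:
    print(item.get("name"), item.get("price"), item.get("rating"))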
7. Machine-Generated Data
Machine-generated data is produced automatically by computers, servers, and devices without direct human input.
Characteristics:
Examples:
Network Data: Data about traffic patterns and usage from routers and switches.
IoT Data: Information collected from Internet of Things (IoT) devices, like smart
appliances.
Applications:
8. Survey Data
Survey data is information collected directly from people through questionnaires, polls, or interviews.
Characteristics:
Examples:
Applications:
Data Quality
Data Quality refers to the condition or level of excellence of data, determining how
well it can meet the needs of its intended use, whether for analysis, decision-making,
or operational processes. High-quality data is accurate, complete, reliable, and
relevant to its purpose. Poor data quality can lead to incorrect conclusions, inefficient
operations, or poor decision-making.
1. Accuracy
Data must correctly reflect the real-world entities and values it represents.
Example: Inaccurate data may include a wrong address or incorrect spelling of a
name.
Importance: Incorrect data can lead to errors in analysis and business decisions.
2. Completeness
Data must include all required values; no critical fields should be left blank.
Example: Missing customer phone numbers in a contact list would make follow-ups
impossible.
3. Consistency
Data must be represented the same way across records, systems, and datasets.
Importance: Inconsistent data can create confusion and lead to inaccurate results.
4. Timeliness
Data must be up to date and available when it is needed.
Importance: Outdated data can lead to decisions based on conditions that no longer
exist.
5. Validity
Data must conform to the correct formats and fall within the acceptable range or
domain.
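As a small illustration of validity checks, the sketch below verifies that values conform to an expected format and fall within an acceptable range; the rules themselves (the email pattern, the age limits) are just example choices.

import re

# Example records to validate (made up).
records = [
    {"email": "asha@example.com", "age": 29},
    {"email": "not-an-email", "age": 210},
]

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

for rec in records:
    format_ok = bool(EMAIL_PATTERN.match(rec["email"]))   # correct format?
    range_ok = 0 <= rec["age"] <= 120                      # acceptable range?
    print(rec, "valid" if format_ok and range_ok else "invalid")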
Noise
1. Definition: Noise is any unwanted or extraneous data that does not represent actual
information or patterns within the dataset.
2. Causes:
Human errors: In manual data entry, incorrect or inconsistent inputs can generate
noise.
3. Example:
In audio data, random static or background sounds recorded during an interview are
forms of noise.
4. Impact:
Distorts analysis: Noise can obscure true patterns or relationships in data, leading to
misleading results.
Outliers
Outliers in data quality refer to data points that significantly differ from the rest of the
observations in a dataset. They can skew analysis and lead to misleading conclusions
if not properly identified and addressed.
1. Definition: Outliers are observations that lie outside the general distribution of the
dataset. They are typically much higher or lower than the majority of the data points.
2. Causes:
Data entry errors: Mistakes made during data collection or input can lead to outlier
values (e.g., typing "999" instead of "99").
Rare events: Legitimate but uncommon occurrences that differ from the norm (e.g., a
sudden spike in sales due to a promotional event).
Natural variability: In some cases, outliers may simply be extreme values that
naturally occur within the data distribution.
3. Example:
In a dataset of student test scores, if most students score between 60 and 80, a score
of 30 or 100 could be considered an outlier.
In real estate data, a property priced at $10 million in a neighborhood where most
properties range from $300,000 to $500,000 would be an outlier.
4. Impact:
Skews statistical analysis: Outliers can distort metrics like mean and standard
deviation, leading to inaccurate interpretations.
May indicate important information: While often seen as problematic, outliers can also
represent valuable insights or anomalies worth investigating (e.g., fraud detection).
Identifying and analyzing outliers is crucial in data quality assessment, as they can
significantly influence the results and interpretations drawn from a dataset.
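One common way to flag outliers is the interquartile-range (IQR) rule. The sketch below applies it to test scores like those in the example above; the 1.5 x IQR cutoff is a widely used convention, not a fixed law.

import numpy as np

# Test scores, mostly between 60 and 80, with one suspiciously low value.
scores = np.array([62, 65, 68, 70, 71, 73, 75, 78, 80, 30])

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as potential outliers.
outliers = scores[(scores < lower) | (scores > upper)]
print("bounds:", lower, upper)
print("outliers:", outliers)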
Duplicate Data
Duplicate data refers to instances where identical or nearly identical records appear
multiple times within a dataset. This issue can lead to inflated metrics, confusion, and
inaccuracies in data analysis and reporting.
1. Definition: Duplicate data consists of repeated entries for the same entity or record,
leading to redundancy within a dataset.
2. Causes:
Manual entry errors: Data entered multiple times by users due to oversight or lack of
checks.
System integration: When merging data from different systems or sources without
proper deduplication checks, duplicates can arise.
Data migration issues: During data transfers between databases, identical records
may not be properly filtered out.
Varying formats: Different representations of the same record (e.g., different spellings
of a name) can cause entries to be seen as distinct when they are actually duplicates.
3. Example:
A customer database that includes two records for the same individual, for example one entry for "John Smith" and another for "J. Smith" with the same contact details.
4. Impact:
Confusion in data analysis: Having multiple records for the same entity can
complicate analysis and reporting, making it difficult to draw accurate conclusions.
Increased storage costs: Duplicates consume unnecessary storage space and can
lead to higher operational costs.
Customer experience issues: For businesses, duplicates can result in poor customer
experiences (e.g., receiving multiple communications or promotions).
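A minimal deduplication sketch in pandas: it first normalises the varying formats mentioned above (case, stray spaces) so that near-identical records become identical, then drops the repeats. The sample records are made up.

import pandas as pd

customers = pd.DataFrame({
    "name":  ["John Smith", "john smith ", "Priya Rao"],
    "email": ["JOHN@EXAMPLE.COM", "john@example.com", "priya@example.com"],
})

# Normalise formatting so duplicates of the same person look identical.
customers["name"] = customers["name"].str.strip().str.lower()
customers["email"] = customers["email"].str.strip().str.lower()

# Drop repeated entries, keeping the first occurrence of each customer.
deduped = customers.drop_duplicates(subset=["name", "email"], keep="first")
print(deduped)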
Inconsistent Data
Inconsistent data refers to data entries that do not match or align across different
records, databases, or datasets, leading to discrepancies and potential confusion
during analysis. This issue can significantly undermine the reliability of data-driven
decision-making.
1. Definition: Inconsistent data occurs when the same data point is represented
differently across various datasets or within the same dataset, resulting in conflicts or
contradictions.
2. Causes:
Varying formats: Different formats for the same type of data can lead to
inconsistencies (e.g., dates represented as MM/DD/YYYY in one dataset and
DD/MM/YYYY in another).
3. Example:
A customer database may have entries where the same customer's name appears as
"John Smith," "john smith," and "J. Smith," leading to discrepancies when analyzing
customer records.
In a sales dataset, the total sales for a month might be reported differently across two
reports due to differing calculation methods or data entry errors.
4. Impact:
Reduced trust in data: Stakeholders may become skeptical of the data's reliability,
leading to hesitation in making decisions based on analysis.
Ensuring data consistency is crucial for maintaining data quality, as it helps create a
reliable foundation for analysis and decision-making. Regular audits, standardization
processes, and data governance practices can help mitigate the issue of inconsistent
data.
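A small sketch of the standardisation idea: it converts dates recorded in two different formats into one canonical form and normalises name casing, so the same customer is represented consistently. The example values and formats are assumed.

import pandas as pd

# Two extracts of the same data, one using MM/DD/YYYY and one using DD/MM/YYYY.
us_style = pd.DataFrame({"name": ["John Smith"], "signup": ["03/04/2024"]})
eu_style = pd.DataFrame({"name": ["john smith"], "signup": ["04/03/2024"]})

# Parse each extract with its own declared format, producing one canonical type.
us_style["signup"] = pd.to_datetime(us_style["signup"], format="%m/%d/%Y")
eu_style["signup"] = pd.to_datetime(eu_style["signup"], format="%d/%m/%Y")

# Standardise the name representation before combining the extracts.
combined = pd.concat([us_style, eu_style], ignore_index=True)
combined["name"] = combined["name"].str.title().str.strip()
print(combined)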
Missing Values
Missing values in data quality refer to the absence of data for one or more fields in a
dataset. This issue can pose significant challenges for data analysis, as incomplete
data can lead to biased or inaccurate results.
1. Definition: Missing values occur when no value is recorded for one or more fields of a record.
2. Causes:
Data entry errors: Mistakes during manual data entry can result in blank fields.
Optional fields: Data fields that are not mandatory may remain unfilled, leading to
gaps in the dataset.
3. Example:
In a medical dataset, if patients fail to report their weight or height during a checkup,
those fields will be left empty.
4. Impact:
Biased analysis: Missing values can lead to biased results if the absence of data is
not random (e.g., if certain groups are more likely to have missing values).
Reduced statistical power: In statistical analysis, missing data can reduce the sample
size, leading to less reliable results and increased variability.
Impact on machine learning models: Algorithms may struggle with missing values,
resulting in poor model performance or requiring additional preprocessing steps to
manage the gaps.
Addressing missing values is crucial for maintaining data quality, as it ensures that
analyses are based on complete and accurate datasets. Various strategies, such as
imputation or exclusion of missing data, can be employed to manage missing values,
depending on the context and analysis requirements.
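Two common ways of handling missing values, dropping incomplete rows or imputing a typical value, are sketched below with a made-up patient table; which strategy is appropriate depends on the analysis.

import pandas as pd

patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "weight_kg":  [70.0, None, 82.5, None],
})

# Option 1: exclude records with missing weight (shrinks the sample).
complete_only = patients.dropna(subset=["weight_kg"])

# Option 2: impute the missing weights with the mean of the observed values.
imputed = patients.copy()
imputed["weight_kg"] = imputed["weight_kg"].fillna(imputed["weight_kg"].mean())

print(complete_only)
print(imputed)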
Data Processing
Data processing in data analytics refers to the series of steps involved in collecting,
organizing, transforming, and analyzing data to extract meaningful insights. This
process is crucial for converting raw data into a format that can be easily understood
and utilized for decision-making.
1. Data Collection:
Methods: Surveys, web scraping, APIs, data logs, and data warehousing.
2. Data Cleaning:
Tasks: Handling missing values, removing duplicate records, correcting errors, and
ensuring data consistency.
3. Data Transformation:
Definition: Modifying data to fit a specific format or structure required for analysis.
4. Data Integration:
Methods: Merging datasets, joining tables, and using data warehouses or lakes to
centralize data.
5. Data Storage:
Definition: Organizing and storing data in a structured manner for easy access and
retrieval.
Types: Relational databases (SQL), NoSQL databases, data warehouses, and data
lakes.
6. Data Analysis:
Definition: Applying statistical techniques, SQL queries, or machine learning models to the prepared data to find patterns and answer questions.
7. Data Visualization:
Definition: Presenting data and analysis results in a visual format to make insights
more accessible and understandable.
Tools: Charts, graphs, dashboards, and interactive visualizations using tools like
Tableau, Power BI, and matplotlib (Python); a small sketch follows this list.
8. Data Interpretation:
Definition: Drawing conclusions from the analysis results and translating them into actionable insights for decision-making.
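A minimal visualization sketch using matplotlib, which the tools list above mentions; the monthly sales figures are made up for the example.

import matplotlib.pyplot as plt

# Made-up monthly sales totals to visualise.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

plt.bar(months, sales)               # bar chart of sales per month
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.tight_layout()
plt.show()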
Why data processing matters:
Improves Data Quality: Ensures that data is accurate, consistent, and reliable for
analysis.
Facilitates Insight Extraction: Transforms raw data into meaningful insights that can
drive decision-making.
Types of Data Processing
1. Batch Processing
Definition: Collecting and processing large volumes of data at once rather than
continuously. Data is processed in groups or batches at scheduled intervals.
Characteristics:
Latency: Typically high; results are available only after the batch is processed.
Example: A retail company processes sales data at the end of each day to generate
reports on sales performance.
2. Real-Time Processing
Definition: Processing data immediately as it arrives, so that results are available with minimal delay.
Characteristics:
Use Cases: Fraud detection, stock trading, monitoring social media activity.
Technologies: Often utilizes Apache Kafka, Apache Flink, or Apache Storm.
3. Stream Processing
Definition: Continuously processing data record by record (or in small windows) as it flows in from its sources.
Characteristics:
Use Cases: Sensor data monitoring, live sports updates, user interaction tracking.
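To illustrate the stream-processing idea without any framework, the sketch below keeps a running average that is updated as each new sensor reading arrives, rather than waiting for a complete batch; the readings are made up.

def running_average(readings):
    # Update the average incrementally as each value arrives in the stream.
    count, mean = 0, 0.0
    for value in readings:
        count += 1
        mean += (value - mean) / count
        yield value, mean

# Simulated stream of temperature readings arriving one at a time.
stream = [21.5, 21.7, 22.0, 25.4, 22.1]
for reading, avg in running_average(stream):
    print(f"reading={reading:.1f}  running average={avg:.2f}")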
4. Distributed Processing
Definition: Splitting processing work across multiple machines (nodes) that operate in parallel.
Characteristics:
Fault Tolerance: If one node fails, others can take over the processing tasks.
Use Cases: Large-scale data analytics, scientific simulations, and machine learning
tasks that require significant computational resources.
5. Cloud Processing
Definition: Using remote cloud infrastructure to store and process data on demand, rather than locally managed servers.
Characteristics:
Accessibility: Data and applications can be accessed from anywhere with an internet
connection.
Example: A business uses cloud processing services like AWS Lambda or Google
Cloud Functions to run analytics tasks without managing physical servers.