DA(Unit-1)
Unit-1
Sources and Nature of Data
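• Data in data analytics comes from various sources and can be
categorized based on its nature and origin. The categories below
overlap: sensor readings, for example, are both time-series and
machine-generated data.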
1. Structured Data
• Definition: Data that is organized and formatted in a specific way,
often in tables with rows and columns.
• Sources: Relational databases, spreadsheets, CSV files.
• Examples: SQL databases, Excel spreadsheets.
2. Unstructured Data
• Definition: Data that lacks a predefined data model or is not
organized in a structured manner.
• Sources: Text documents, images, videos, social media posts.
• Examples: Text files, PDFs, images, videos.
3. Semi-Structured Data
• Definition: Data that is not fully structured but contains some level of
organization, often in the form of tags or elements.
• Sources: Web APIs, configuration files, application logs.
• Examples: JSON files, XML files, log files.
4. Time-Series Data
• Definition: Data points recorded in time order, often (but not
always) at regular intervals.
• Sources: Sensor data, financial market data, weather data.
• Examples: Stock prices, temperature records, IoT sensor data.
5. Geospatial Data
• Definition: Data that includes information about the geographic
location of objects or events.
• Sources: GPS data, maps, satellite imagery.
• Examples: Location tracking data, maps, satellite images.
6. Big Data
• Definition: Extremely large and complex datasets that cannot be
easily managed or processed using traditional data processing tools.
• Sources: Social media, sensors, Internet of Things (IoT) devices.
• Examples: Large-scale social media data, sensor data from smart
cities, IoT-generated data.
7. Transactional Data
• Definition: Data generated as a result of transactions or interactions.
• Sources: E-commerce transactions, financial transactions.
• Examples: Purchase records, banking transactions.
8. Web and Social Media Data
• Definition: Data collected from websites and social media platforms.
• Sources: Web scraping, social media APIs.
• Examples: Tweets, Facebook posts, web pages.
9. Machine-Generated Data
• Definition: Data generated by machines or devices without human
intervention.
• Sources: Sensors, servers, applications, network devices.
• Examples: Sensor readings, system logs.
10. Human-Generated Data
• Definition: Data created and input by human users.
• Sources: Surveys, feedback forms, manual data entry.
• Examples: Survey responses, user reviews.
11. Publicly Available Data
• Definition: Data that is accessible to the public.
• Sources: Open data repositories, government datasets.
• Examples: Census data, published public health statistics.
Structured Data
• Structured data in data analytics refers to information that is highly
organized and formatted in a way that is easily understandable by
both machines and humans.
• This type of data follows a specific schema or data model, typically
arranged in rows and columns within a relational database or a similar
tabular format.
• Structured data is foundational in data analytics, providing a reliable
and efficient way to store and analyze information.
• It is particularly well-suited for scenarios where data relationships and
integrity are critical, such as in business applications and traditional
relational database systems.
Key characteristics of structured data
Format and Organization:
• Tabular Structure: Structured data is often organized in tables with
rows and columns, where each row represents a record or entry, and
each column represents a specific attribute or field.
• Fixed Schema: The data follows a predefined and fixed schema,
meaning the types and structure of data are well-defined in advance.
Data Types:
• Homogeneous Data Types: Within a column, data types are usually
consistent. For example, a column might contain only numerical
values, dates, or text.
• Well-Defined Data Formats: Each column has a specific data format,
such as integers, floating-point numbers, dates, or strings.
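A minimal sketch of a fixed schema with one data type per column, using only the Python standard library. The `Order` fields and the CSV content are hypothetical, chosen for illustration. Each column is converted to its declared type on load, so a value that does not fit (for example, text in the `amount` column) raises an error instead of entering the table.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date


# Hypothetical schema: every field has exactly one declared type.
@dataclass
class Order:
    order_id: int
    customer: str
    amount: float
    order_date: date


# Hypothetical CSV content; in practice this would be read from a file.
RAW_CSV = """order_id,customer,amount,order_date
1,Asha,250.00,2024-01-05
2,Ravi,99.50,2024-01-06
"""


def load_orders(text):
    """Parse CSV text and convert each column to its declared type."""
    orders = []
    for row in csv.DictReader(io.StringIO(text)):
        orders.append(
            Order(
                order_id=int(row["order_id"]),        # integer column
                customer=row["customer"],             # string column
                amount=float(row["amount"]),          # floating-point column
                order_date=date.fromisoformat(row["order_date"]),  # date column
            )
        )
    return orders


orders = load_orders(RAW_CSV)
for order in orders:
    print(order)
```

Relational databases enforce this kind of typing themselves; the sketch only mimics the idea for a CSV file, which carries no type information of its own.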
Examples of Structured Data:
• Relational Databases: Most commonly associated with structured
data, relational databases such as MySQL, PostgreSQL, or Microsoft
SQL Server store data in tables with defined relationships.
• Spreadsheets: Excel sheets or CSV files are examples of structured
data where information is organized into rows and columns.
• Tables in HTML: Data presented in tables on web pages follows a
structured format.
Querying and Analysis:
• SQL Queries: Structured Query Language (SQL) is commonly used to
query and manipulate structured data. SQL allows users to retrieve,
update, and analyze data stored in relational databases.
• Aggregation and Join Operations: Techniques like aggregations and
joins are frequently used to derive meaningful insights by combining
and summarizing structured data.
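The join and aggregation operations described above can be sketched with Python's built-in `sqlite3` module and an in-memory database. The `customers` and `orders` tables and their rows are hypothetical sample data, not taken from any real system.

```python
import sqlite3

# In-memory database, so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 99.5);
""")

# Join: combine each order with its customer.
# Aggregation: count the orders and sum the amounts per customer.
totals = conn.execute("""
    SELECT c.name, COUNT(o.order_id) AS n_orders, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()

for name, n_orders, total in totals:
    print(name, n_orders, total)
```

The query returns one row per customer: Asha with 2 orders totalling 350.0, and Ravi with 1 order totalling 99.5.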
Scalability and Efficiency:
• Efficient Storage: Because column types are known in advance,
databases can store structured data compactly and are designed to
handle large volumes of it.
• Indexing: Indexing is often applied to columns, making data retrieval
faster and more efficient.
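A small sketch of the effect of indexing, assuming SQLite through Python's `sqlite3` module. The `orders` table, its generated rows, and the index name `idx_orders_customer` are hypothetical. The code asks SQLite for its query plan before and after the index is created. The exact wording of the plan differs between SQLite versions, so it is printed rather than quoted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
# 1000 hypothetical orders spread over 100 customers.
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def query_plan(connection):
    """Return SQLite's description of how it would run the lookup."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT amount FROM orders WHERE customer_id = 42"
    ).fetchall()
    # The last column of each plan row is the human-readable detail text.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn)   # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn)    # the plan should now mention the index

print("before:", plan_before)
print("after: ", plan_after)
```

An index speeds up lookups on the indexed column at the cost of extra storage and slightly slower inserts and updates, so it is usually added only to columns that are filtered or joined on frequently.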
Use Cases:
• Business Applications: Structured data is commonly used in business
applications, such as customer relationship management (CRM)
systems, enterprise resource planning (ERP) systems, and financial
databases.
• Reporting and Business Intelligence: Structured data is well-suited for
generating reports and conducting business intelligence analyses due
to its organized and predictable nature.
Challenges:
• Rigidity: The fixed schema can be a limitation when dealing with
evolving or unanticipated data structures.
• Limited Representation: Structured data may struggle to represent
complex relationships or unstructured information.
Semi-Structured Data
• Semi-structured data in data analytics refers to information that does
not conform to the structure of traditional relational databases, yet
exhibits some level of organization.
• Unlike structured data, which is organized into fixed tables with
predefined schemas, semi-structured data allows for more flexibility
in terms of data representation.
• Semi-structured data strikes a balance between the rigidity of
structured data and the flexibility of unstructured data, making it
suitable for scenarios where the structure of the data is not fixed but
still requires some level of organization and representation.
Key characteristics of semi-structured data
Flexible Structure:
• No Fixed Schema: Semi-structured data does not adhere to a rigid,
predefined schema. It allows for variations in the structure of the
data, making it more adaptable to changing requirements.
• Self-Describing: Semi-structured data often includes metadata or tags
that describe the structure and meaning of the data.
Data Formats:
• Common Formats: Semi-structured data is often represented in
formats that provide some level of organization but do not enforce a
strict schema.
• Examples: JSON (JavaScript Object Notation), XML (eXtensible
Markup Language), YAML (YAML Ain't Markup Language).
Hierarchy and Nesting:
• Nested Structures: Semi-structured data can have nested or
hierarchical structures, allowing for the representation of complex
relationships between entities.
• Example: In JSON, objects can contain arrays or other objects,
creating a hierarchical structure.
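A short sketch of nesting in JSON, using Python's built-in `json` module. The customer and order record is hypothetical. Objects contain arrays, which in turn contain objects, and the code walks down the hierarchy to reach the values.

```python
import json

# Hypothetical record: an object containing an object and an array of objects.
RAW_JSON = """
{
  "customer": {"id": 1, "name": "Asha"},
  "orders": [
    {"order_id": 10, "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"order_id": 11, "items": [{"sku": "A1", "qty": 5}]}
  ]
}
"""

record = json.loads(RAW_JSON)

# Navigate the hierarchy: object -> object -> value.
customer_name = record["customer"]["name"]

# Navigate two levels of arrays: orders -> items -> qty.
total_qty = sum(
    item["qty"]
    for order in record["orders"]
    for item in order["items"]
)

print(customer_name, total_qty)
```

Representing the same record in a relational database would typically need three tables (customers, orders, order items) linked by keys.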
Use of Tags or Labels:
• Key-Value Pairs: Semi-structured data often uses key-value pairs to
represent information, providing a way to label and organize data
elements.
• Tags and Attributes: XML uses tags and attributes to label and
organize data, allowing for a more flexible structure.
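A minimal sketch of tags and attributes in XML, read with Python's built-in `xml.etree.ElementTree` module. The `library` document is hypothetical. The second `book` element omits the `lang` attribute and the `year` tag, which illustrates the flexible structure: the document is still valid, and the reader decides how to handle the missing parts.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the two <book> elements do not have the same shape.
RAW_XML = """
<library>
  <book id="b1" lang="en">
    <title>Data Basics</title>
    <year>2021</year>
  </book>
  <book id="b2">
    <title>Logs in Practice</title>
  </book>
</library>
"""

root = ET.fromstring(RAW_XML.strip())

books = []
for book in root.findall("book"):
    books.append({
        "id": book.get("id"),                  # attribute value
        "lang": book.get("lang", "unknown"),   # default when attribute is absent
        "title": book.findtext("title"),       # text inside the <title> tag
        "year": book.findtext("year"),         # None when the tag is absent
    })

for entry in books:
    print(entry)
```

Each element is turned into key-value pairs, so the output also shows how tagged data maps onto the key-value representation.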
Query and Analysis:
• Query Languages: Some relational databases can query semi-structured
data through SQL extensions (for example, JSON functions), but it is
also common to use query languages or tools designed for the specific
format, such as XPath for XML or JSONPath for JSON.
• Schema-on-Read: Structured data is usually checked against its
schema when it is written (schema-on-write). Semi-structured data
often takes a schema-on-read approach, where the schema is applied
only when the data is read or queried.
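A minimal sketch of schema-on-read, assuming raw records are stored as one JSON document per line. The records, the `READ_SCHEMA` mapping, and the `read_with_schema` helper are hypothetical names introduced for illustration. The raw lines are stored without any checks; field selection, type conversion, and handling of missing fields happen only when the data is read.

```python
import json

# Raw records stored as-is: fields vary, and "age" is sometimes a string.
RAW_LINES = [
    '{"user": "asha", "age": "31", "city": "Pune"}',
    '{"user": "ravi", "age": 27}',
    '{"user": "meera", "tags": ["new"]}',
]

# The schema belongs to the reader, not to the storage layer.
READ_SCHEMA = {"user": str, "age": int, "city": str}


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields, cast types, fill gaps."""
    rows = []
    for line in lines:
        raw = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = raw.get(field)
            # Missing fields become None; extra fields are ignored.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, READ_SCHEMA)
for row in rows:
    print(row)
```

A different reader could apply a different schema to the same raw lines (for example, one that keeps `tags`), which is the main benefit of deferring the schema.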
Examples of Semi-Structured Data:
• JSON: JSON is widely used for representing semi-structured data. It
allows for nested structures and is commonly used in web
development and data interchange.
• XML: XML provides a hierarchical structure using tags and attributes,
making it suitable for representing semi-structured data with complex
relationships.
• YAML: YAML is a human-readable data serialization format that is
often used for configuration files and data exchange, offering a more
concise syntax compared to XML or JSON.
Use Cases:
• Web Development: Semi-structured data formats like JSON are
commonly used in web development for data exchange between the
server and the client.
• Configuration Files: YAML is often used for configuration files due to
its human-readable and concise syntax.
• Data Interchange: Semi-structured data is suitable for scenarios
where the structure of the data is not fully known in advance or may
evolve over time.
Challenges:
• Interoperability: Different semi-structured data formats may require
different parsing and processing techniques, leading to
interoperability challenges.
• Complexity: While semi-structured data allows for flexibility, it can
also introduce complexity in terms of understanding and managing
the data structure.
Unstructured Data
• Unstructured data in data analytics refers to information that lacks a
predefined data model or a specific organizational structure.
• Unlike structured data, which is organized into tables with well-defined
columns and rows, unstructured data does not follow a rigid format.
• Unstructured data is often characterized by its diverse and free-form nature,
making it challenging to analyze using traditional database management and
analysis tools.
• Unstructured data is a valuable source of information, and advancements in
technologies like natural language processing, computer vision, and machine
learning have enabled organizations to derive meaningful insights from this
type of data.
• The ability to analyze unstructured data effectively is therefore
increasingly important for decision-making and for understanding
complex information.
Key characteristics of unstructured data
Lack of Formal Structure:
• No Predefined Schema: Unstructured data does not adhere to a
predefined schema or data model. It may include a wide variety of
data types, and the relationships between data elements are not
explicitly defined.
• Varied Formats: Unstructured data can be in the form of text, images,
audio, video, social media posts, emails, and more.
Diverse Content:
• Textual Content: This includes documents, articles, emails, and any
other textual information that is not organized in a tabular structure.
• Media Files: Images, audio, and video files fall under unstructured
data, as their content is not organized into fields that can be
queried directly.
• Social Media Feeds: Data from social media platforms, such as
tweets, comments, and posts, is unstructured and often contains
informal language.
Complex Relationships:
• Implicit Relationships: Relationships between data elements in
unstructured data are often implicit and may require advanced
analytics techniques to uncover.
• Context-Dependent: Understanding the context and relationships
within unstructured data may involve natural language processing
(NLP) and machine learning.
Analysis Challenges:
• Text Mining and NLP: Analyzing unstructured textual data requires
techniques such as text mining and natural language processing to
extract meaningful insights.
• Image and Video Analysis: Processing and analyzing unstructured
data in the form of images or videos involve computer vision
techniques.
• Speech Recognition: Transcribing and analyzing unstructured data in
the form of spoken words (audio) may require speech recognition
algorithms.
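A very small sketch of a first text-mining step: turning free text into term counts that can then be analyzed like structured data. The sample text and the stop-word list are hypothetical, and real text mining would normally use a dedicated NLP library rather than this simple regular expression.

```python
import re
from collections import Counter

# Hypothetical customer comment.
TEXT = (
    "The delivery was late. The product is great, "
    "but the delivery was really late!"
)

# Hypothetical stop-word list: common words that carry little meaning.
STOP_WORDS = {"the", "was", "is", "but", "a", "an", "and", "really"}


def term_frequencies(text):
    """Lowercase the text, split it into words, and count non-stop-words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)


freqs = term_frequencies(TEXT)
top_terms = freqs.most_common(2)

print(freqs)
print(top_terms)
```

The two most frequent terms here are "delivery" and "late" (two occurrences each), which already hints at the topic of the comment.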
Examples of Unstructured Data:
• Text Documents: Word documents, PDFs, and other text-based files
without a defined structure.
• Multimedia Files: Images, videos, and audio recordings that lack a
structured format.
• Social Media Posts: Data from platforms like Twitter, Facebook, and
Instagram, which often include unstructured text and multimedia
content.
• Emails: Email messages and attachments that may contain a mix of
structured and unstructured content.
Use Cases:
• Sentiment Analysis: Analyzing social media posts or customer reviews
to determine sentiment towards a product or service.
• Image Recognition: Identifying objects, people, or scenes within
images using computer vision techniques.
• Speech-to-Text Conversion: Converting spoken words in audio files to
text for analysis.
• Document Classification: Categorizing unstructured textual
documents based on their content.
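A toy sketch of sentiment analysis that uses a word list (lexicon) instead of a trained model. The `POSITIVE` and `NEGATIVE` word sets and the reviews are hypothetical. Production systems generally rely on machine learning models, because a plain word count cannot handle negation ("not good") or sarcasm.

```python
import re

# Hypothetical sentiment lexicon.
POSITIVE = {"great", "good", "love", "excellent", "fast"}
NEGATIVE = {"bad", "late", "poor", "broken", "slow"}


def sentiment(text):
    """Label a text by counting positive and negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical customer reviews.
reviews = [
    "Great phone, fast delivery, love it",
    "Arrived late and the screen was broken",
    "It is a phone",
]
labels = [sentiment(review) for review in reviews]

for review, label in zip(reviews, labels):
    print(label, "-", review)
```

The three reviews are labelled positive, negative, and neutral respectively. The same pattern (text in, category out) also describes document classification, with topic categories in place of sentiment labels.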
Storage Challenges:
• Large Volumes: Unstructured data can be voluminous, and storing
and managing it efficiently may require scalable storage solutions.
• Data Variety: Text, images, audio, and video each need different
storage, indexing, and retrieval methods.
Data Integration:
• Integration Challenges: Integrating unstructured data with structured
data sources can be challenging due to the differences in data formats
and structures.
• Data Lakes: Unstructured data is often stored in data lakes, which
provide a flexible and scalable repository for various data types.