Lecture 1 Introduction to Data engineering

Data engineering involves designing systems for managing large-scale data, which is increasingly important due to the rapid growth of online data, projected to reach 181 zettabytes by 2025. The document outlines the evolution of the web from static to dynamic and intelligent platforms, leading to the emergence of data engineering and data science fields. It also explains the characteristics of big data, including its volume, velocity, variety, and veracity, and distinguishes between structured, semi-structured, and unstructured data.

Uploaded by

genesiskalya

Lecture 1: Introduction to data engineering

What is data engineering: Data engineering is the practice of designing and building
systems for collecting, preparing, storing, and analysing data at scale (big data).
Why is data engineering important: there is an explosion of data online, and this data needs
to be harnessed for productive use (this large body of data is called big data).
Example: the data generated in the year 2023 is estimated at 120 zettabytes (120 × 10^21 bytes).
Reference: https://explodingtopics.com/blog/data-generated-per-day retrieved on
31/10/2023

Here is a year-by-year breakdown of the data generated from 2010, with a forecast up to 2025:

| Year  | Data generated  | Change over previous year | Change over previous year (%) |
|-------|-----------------|---------------------------|-------------------------------|
| 2010  | 2 zettabytes    | -                         | -                             |
| 2011  | 5 zettabytes    | ↑ 3 zettabytes            | ↑ 150%                        |
| 2012  | 6.5 zettabytes  | ↑ 1.5 zettabytes          | ↑ 30%                         |
| 2013  | 9 zettabytes    | ↑ 2.5 zettabytes          | ↑ 38.46%                      |
| 2014  | 12.5 zettabytes | ↑ 3.5 zettabytes          | ↑ 38.89%                      |
| 2015  | 15.5 zettabytes | ↑ 3 zettabytes            | ↑ 24%                         |
| 2016  | 18 zettabytes   | ↑ 2.5 zettabytes          | ↑ 16.13%                      |
| 2017  | 26 zettabytes   | ↑ 8 zettabytes            | ↑ 44.44%                      |
| 2018* | 33 zettabytes   | ↑ 7 zettabytes            | ↑ 26.92%                      |
| 2019* | 41 zettabytes   | ↑ 8 zettabytes            | ↑ 24.24%                      |
| 2020* | 64.2 zettabytes | ↑ 23.2 zettabytes         | ↑ 56.59%                      |
| 2021* | 79 zettabytes   | ↑ 14.8 zettabytes         | ↑ 23.05%                      |
| 2022* | 97 zettabytes   | ↑ 18 zettabytes           | ↑ 22.78%                      |
| 2023* | 120 zettabytes  | ↑ 23 zettabytes           | ↑ 23.71%                      |
| 2024* | 147 zettabytes  | ↑ 27 zettabytes           | ↑ 22.5%                       |
| 2025* | 181 zettabytes  | ↑ 34 zettabytes           | ↑ 23.13%                      |
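
The two change columns follow directly from the yearly totals: the absolute change is the difference from the previous year, and the percentage change divides that difference by the previous year's total. A minimal Python sketch, using a few rows copied from the table (the function name `yearly_change` is only illustrative):

```python
# Yearly totals in zettabytes, copied from the table (subset of rows).
data_generated = {2010: 2, 2011: 5, 2012: 6.5, 2023: 120, 2024: 147, 2025: 181}


def yearly_change(totals, year):
    """Return (absolute change, percentage change) relative to the previous year."""
    previous = totals[year - 1]
    change = totals[year] - previous
    return change, round(change / previous * 100, 2)


for year in (2011, 2012, 2024, 2025):
    change, percent = yearly_change(data_generated, year)
    print(f"{year}: +{change} ZB ({percent}%)")
```

The function only works for years whose previous year is also present in the dictionary, which is why the loop skips 2010 and 2023.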

The table above was extracted from the reference website using data scraping, which is a
data engineering technique used at the data ingestion stage.
The spreadsheet formula below (the Google Sheets IMPORTHTML function) was used to scrape the data from the website; the last argument, 2, selects the second table on the page.
=IMPORTHTML("https://explodingtopics.com/blog/data-generated-per-day","table",2)
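
For readers working outside a spreadsheet, the same idea (pull the n-th HTML table out of a page) can be sketched in Python with only the standard library. This is an illustration under assumptions, not the method used for these notes: the class name `TableExtractor` and the inline `SAMPLE_HTML` are hypothetical stand-ins, and a real run would first download the page's HTML.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a downloaded page that contains two tables.
SAMPLE_HTML = """
<table><tr><th>Unit</th><th>Bytes</th></tr>
<tr><td>Zettabyte</td><td>10^21</td></tr></table>
<table><tr><th>Year</th><th>Data Generated</th></tr>
<tr><td>2010</td><td>2 zettabytes</td></tr>
<tr><td>2011</td><td>5 zettabytes</td></tr></table>
"""


class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []       # all tables found so far
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.tables[-1][-1][-1] += data.strip()


parser = TableExtractor()
parser.feed(SAMPLE_HTML)
second_table = parser.tables[2 - 1]  # the index in the formula is 1-based
for row in second_table:
    print(row)
```

The sketch ignores nested tables and cell spans, which a production scraper would have to handle.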

A brief background of how the data explosion on the web started
• 1989 to 2005 → Web 1.0 (static web, for information only) → in this era there was no
data input by users
• 2004 to current → Web 2.0 (dynamic, collaborative web) → a database-driven web. Lots of
data started being generated in Web 2.0 because users were allowed to post data
(the next concern was how to make good use of the data)
• To answer the question of how to make good use of the data, Web 3.0 was born
(also called the semantic web → intelligent web)
• Web 3.0 is still being developed, on the way to Web 4.0
• As a consequence, the field of data engineering was born, and this in turn led to the
field of data science
• Data scientists build machine learning models for business intelligence, and to do
so they must work together with data engineers. The third expert in a data science
project is a domain expert, i.e. an expert from the business area where the machine
learning model will be deployed
• A data science project requires expertise in
o The domain of the problem
o Data engineering
o Data science
• The term BIG DATA was coined to refer to the large amount of data on the internet
What is big data: the main features or characteristics of big data are the 4 V’s of big data
(VOLUME, VELOCITY, VARIETY AND VERACITY)
• Large volume of data
• High velocity (data that changes very frequently)
• Many varieties (lots of data types including text, numbers, pictures, videos etc.)
• Low veracity (data that is neither very clean nor accurate; it cannot be used as it is and hence
needs to be cleaned and transformed into a usable format → data wrangling)
Hence big data is data that is characterized by large volume, high velocity, many
varieties and low veracity.
Challenge of storage: Big data comes with the challenge of storage. This is because the data
is not structured in the table format needed for storage in a relational DBMS.
Big data is in two categories:
• Semi-structured data
• Unstructured data

1. Structured data: data elements already well organized into a structure that
is analyzable
a. Tabulated data is structured:
i. Entities (tables)
ii. Records (rows)
iii. Fields (columns)

To achieve the above data structure we use data modelling techniques, mainly applied in
RDBMS (relational database management systems).
RDBMS have preconfigured database schemas for data storage.

What is a DB schema: it is a blueprint that gives the structure of how data is organized in an
RDBMS.
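
To make the idea of a schema concrete, the sketch below uses Python's built-in `sqlite3` module to declare a table and store two records in it. The table name `student` and its columns are hypothetical examples, not taken from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# The schema: one entity (table) with three fields (columns).
conn.execute(
    "CREATE TABLE student ("
    " student_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " campus TEXT NOT NULL)"
)

# Each record (row) must supply the fields the schema prescribes.
conn.executemany(
    "INSERT INTO student (student_id, name, campus) VALUES (?, ?, ?)",
    [(1, "Kennedy", "Town"), (2, "Amina", "Main")],
)

# A record that does not fit the schema (campus is missing) is rejected.
try:
    conn.execute("INSERT INTO student (student_id, name) VALUES (3, 'Brian')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

fields = [column[1] for column in conn.execute("PRAGMA table_info(student)")]
records = conn.execute("SELECT * FROM student ORDER BY student_id").fetchall()
print(fields)
print(records)
print("incomplete record rejected:", rejected)
```

The rejected insert is the point of the example: the schema is fixed before any data arrives, and the database enforces it.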
Semi-structured and unstructured data, on the other hand, cannot be confined to a schema, hence
such data is stored in NoSQL databases like MongoDB, Apache Cassandra etc.
NoSQL databases are schema-less, i.e. they have no prescribed schema for data storage,
making them flexible enough to store big data, which is unstructured.
Big data is stored in different file formats.
There are several file formats: JSON, XML, CSV, Apache Parquet etc.
Big data comes in two main categories, namely unstructured and semi-structured. The above
file formats store semi-structured data.
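
The difference between a fixed tabular layout and a self-describing format shows up as soon as records stop sharing the same fields. The following sketch writes the same made-up records as CSV and as JSON using Python's standard library; the records and field names are assumptions for illustration.

```python
import csv
import io
import json

# Made-up records: the second one has an extra field the first lacks.
records = [
    {"name": "Kennedy", "campus": "Town"},
    {"name": "Amina", "campus": "Main", "phone": "0700-000000"},
]

# CSV needs one header for all rows, so missing fields become empty cells.
header = ["name", "campus", "phone"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# JSON keeps each record's own keys, so no padding is needed.
json_text = json.dumps(records, indent=2)

print(csv_text)
print(json_text)
```

In the CSV output the first record gets an empty `phone` cell, while in the JSON output it simply has no `phone` key.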
2. Semi-structured data: data that is structured using big data storage file
formats such as XML, JSON etc.

Assume you want to send the following unstructured message on the web. Use
XML to structure the message for web transportation.

From SM Karume
To Kennedy

RE: reminder for supervision meeting

This is to remind you of the supervision meeting scheduled today at 2:00 in my town
campus office

Semi-structured message for big data storage and transportation

<NOTE>
<From> SM Karume </From>
<To> Kennedy </To>
<Heading> reminder for supervision meeting</Heading>
<Body>
This is to remind you of the supervision meeting scheduled today at 2:00 in my
town campus office
</Body>
</NOTE>
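
Once the message is wrapped in tags, a program can pick out each part by name. A short sketch with Python's standard `xml.etree.ElementTree` module, using a well-formed copy of the note as input (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the note; every opening tag has a matching closing tag.
note_xml = """
<NOTE>
  <From>SM Karume</From>
  <To>Kennedy</To>
  <Heading>reminder for supervision meeting</Heading>
  <Body>This is to remind you of the supervision meeting scheduled today at 2:00 in my town campus office</Body>
</NOTE>
"""

root = ET.fromstring(note_xml.strip())

# Turn the child elements into a tag -> text mapping.
note = {child.tag: child.text.strip() for child in root}

print(root.tag)
print(note["From"], "->", note["To"])
print(note["Heading"])
```

If a closing tag is missing or written without the slash, `ET.fromstring` raises a `ParseError`, which is why every element must be closed properly.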

Difference between Structured, Semi-structured and Unstructured data

Big data is characterized by huge volume, high velocity, and a wide variety of data. Data comes in 3
types: structured data, semi-structured data, and unstructured data.

Categories of data (structured vs semi-structured vs unstructured data)

1) What is structured data?
• Structured data is generally tabular data that is represented by columns and rows in
a database (predefined format).
• Databases that hold tables in this form are called relational databases.
• The mathematical term “relation” refers to a set of data organized in a tabular format.
• In structured data, every row in a table has the same set of columns.
• Rows in tabular structured data are referred to as records
• Columns in tabular structured data are referred to as fields
• A table in a tabular structure is called an entity
• Each data record in a table has the same number of fields

• SQL (Structured Query Language) is the programming language used for structured data.
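
As a small illustration of querying structured data with SQL, the sketch below joins two tables using Python's built-in `sqlite3` module. The tables `lecturer` and `meeting` and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lecturer (lecturer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE meeting (
        meeting_id INTEGER PRIMARY KEY,
        lecturer_id INTEGER REFERENCES lecturer (lecturer_id),
        topic TEXT
    );
    INSERT INTO lecturer VALUES (1, 'SM Karume'), (2, 'J Otieno');
    INSERT INTO meeting VALUES
        (10, 1, 'supervision'), (11, 1, 'proposal review'), (12, 2, 'supervision');
    """
)

# Count each lecturer's meetings by joining the two tables on the shared key.
rows = conn.execute(
    """
    SELECT l.name, COUNT(m.meeting_id) AS meetings
    FROM lecturer AS l
    JOIN meeting AS m ON m.lecturer_id = l.lecturer_id
    GROUP BY l.name
    ORDER BY l.name
    """
).fetchall()
print(rows)
```

The join works because both tables have the same fields in every record, so the shared `lecturer_id` column can be matched row by row.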

2) What is semi-structured data?

• Semi-structured data is information that does not consist of structured relational
data (it cannot be stored directly in a relational database) but still has some form of
structure to it.
• Semi-structured data consists of documents held in files such as:
o JavaScript Object Notation (JSON) format
o XML file formats
• It also includes data held in key-value stores and graph databases.

Example of JSON file
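
As a hypothetical illustration of the format (the key names are assumptions), the supervision reminder from the XML exercise could be expressed as a JSON document. The sketch uses Python's standard `json` module to produce the JSON text and read it back.

```python
import json

# Hypothetical JSON version of the reminder note; nested objects and lists
# are allowed, which a flat table cannot express directly.
note = {
    "from": "SM Karume",
    "to": ["Kennedy"],
    "heading": "reminder for supervision meeting",
    "meeting": {"time": "2:00", "place": "town campus office"},
}

json_text = json.dumps(note, indent=2)  # Python dict -> JSON text
print(json_text)

restored = json.loads(json_text)        # JSON text -> Python dict
print(restored["meeting"]["place"])
```

Each value is labelled by its key, so the structure travels with the data instead of living in a separate schema.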

3) What is unstructured data?
• Unstructured data is information that either is not organized in a pre-defined
manner or does not have a pre-defined data model.
• Unstructured information is typically text-heavy but may contain data such as
numbers, dates, and facts as well.
• Videos, audio, and binary data files might not have a specific structure. They are
referred to as unstructured data.
• A lot of data today is unstructured

1. Structured data –

Structured data is data whose elements are addressable for effective analysis. It has
been organized into a formatted repository that is typically a database. It covers
all data which can be stored in an SQL database in a table with rows and columns.
Such data has relational keys and can easily be mapped into pre-designed fields. Today,
this is the most commonly processed kind of data in development and the simplest way to manage
information. Example: Relational data.

2. Semi-structured data –

Semi-structured data is information that does not reside in a relational database but
that has some organizational properties that make it easier to analyze. With some
processing, you can store it in a relational database (this can be very hard for
some kinds of semi-structured data), but semi-structured formats exist to save
space. Example: XML data.

3. Unstructured data –
Unstructured data is data which is not organized in a predefined manner or does
not have a predefined data model; thus it is not a good fit for a mainstream relational
database. For unstructured data, there are alternative platforms for storing and
managing it. It is increasingly prevalent in IT systems and is used by organizations in
a variety of business intelligence and analytics applications. Example: Word, PDF,
text, media logs.

Differences between Structured, Semi-structured and Unstructured data:

| Properties | Structured data | Semi-structured data | Unstructured data |
|------------|-----------------|----------------------|-------------------|
| Technology | It is based on relational database tables | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data |
| Transaction management | Matured transaction and various concurrency techniques | Transaction is adapted from DBMS, not matured | No transaction management and no concurrency |
| Version management | Versioning over tuples, rows, tables | Versioning over tuples or graph is possible | Versioned as a whole |
| Flexibility | It is schema dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is more flexible and there is absence of schema |
| Scalability | It is very difficult to scale the DB schema | Its scaling is simpler than structured data | It is more scalable |
| Robustness | Very robust | New technology, not very widespread | - |
| Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible |
