
Lecture 1 Introduction to Data engineering

Data engineering involves designing systems for managing large-scale data, which is increasingly important due to the rapid growth of online data, projected to reach 181 zettabytes by 2025. The document outlines the evolution of the web from static to dynamic and intelligent platforms, leading to the emergence of data engineering and data science fields. It also explains the characteristics of big data, including its volume, velocity, variety, and veracity, and distinguishes between structured, semi-structured, and unstructured data.

Uploaded by

genesiskalya

Lecture 1: Introduction to data engineering

What is data engineering: Data engineering is the practice of designing and building
systems for collecting, preparing, storing, and analysing data at scale (big data).
Why is data engineering important: there is an explosion of data online, and this data needs
to be harnessed for productive use (this is called big data).
Example: the data generated in 2023 is estimated at 120 zettabytes (120 × 10^21 bytes).
Reference: https://explodingtopics.com/blog/data-generated-per-day retrieved on
31/10/2023

Here is a breakdown of how much data was generated each year from 2010, with forecasts up to 2025:

Year     Data generated      Change over previous year    Change over previous year (%)
2010     2 zettabytes        -                            -
2011     5 zettabytes        ↑ 3 zettabytes               ↑ 150%
2012     6.5 zettabytes      ↑ 1.5 zettabytes             ↑ 30%
2013     9 zettabytes        ↑ 2.5 zettabytes             ↑ 38.46%
2014     12.5 zettabytes     ↑ 3.5 zettabytes             ↑ 38.89%
2015     15.5 zettabytes     ↑ 3 zettabytes               ↑ 24%
2016     18 zettabytes       ↑ 2.5 zettabytes             ↑ 16.13%
2017     26 zettabytes       ↑ 8 zettabytes               ↑ 44.44%
2018*    33 zettabytes       ↑ 7 zettabytes               ↑ 26.92%
2019*    41 zettabytes       ↑ 8 zettabytes               ↑ 24.24%
2020*    64.2 zettabytes     ↑ 23.2 zettabytes            ↑ 56.59%
2021*    79 zettabytes       ↑ 14.8 zettabytes            ↑ 23.05%
2022*    97 zettabytes       ↑ 18 zettabytes              ↑ 22.78%
2023*    120 zettabytes      ↑ 23 zettabytes              ↑ 23.71%
2024*    147 zettabytes      ↑ 27 zettabytes              ↑ 22.5%
2025*    181 zettabytes      ↑ 34 zettabytes              ↑ 23.13%
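
The two "change" columns can be recomputed from the yearly totals. The sketch below is one possible way to do it in Python; the names data_zb and yearly_changes are chosen here for illustration and are not part of the lecture.

```python
# Data generated per year in zettabytes (values copied from the table above).
data_zb = {
    2010: 2, 2011: 5, 2012: 6.5, 2013: 9, 2014: 12.5, 2015: 15.5,
    2016: 18, 2017: 26, 2018: 33, 2019: 41, 2020: 64.2, 2021: 79,
    2022: 97, 2023: 120, 2024: 147, 2025: 181,
}


def yearly_changes(series):
    """Return (year, absolute change, percent change) for each year after the first."""
    years = sorted(series)
    rows = []
    for prev, curr in zip(years, years[1:]):
        delta = series[curr] - series[prev]
        pct = 100 * delta / series[prev]  # change relative to the previous year
        rows.append((curr, round(delta, 2), round(pct, 2)))
    return rows


changes = yearly_changes(data_zb)
for year, delta, pct in changes:
    print(f"{year}: +{delta} ZB (+{pct}%)")
```

Each percentage is the absolute change divided by the previous year's total, so 2013 works out as 2.5 / 6.5 ≈ 38.46%.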

The table above was extracted from the reference website above using data scraping, which is a
data engineering technique used at the data ingestion stage.
The following Google Sheets formula was used to scrape the data from the website:
=IMPORTHTML("https://explodingtopics.com/blog/data-generated-per-day","table",2)
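
The same idea can be sketched in Python using only the standard library. This is an illustration, not the method used in the lecture: it parses a small inline HTML sample (SAMPLE_HTML, made up here) instead of fetching the live page, and the names TableParser and import_html_table are hypothetical. It does not handle nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> in an HTML document as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []    # all tables found so far
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Made-up sample page with two tables; the second one mimics the table above.
SAMPLE_HTML = """
<table><tr><th>Ignored</th></tr><tr><td>first table</td></tr></table>
<table>
  <tr><th>Year</th><th>Data Generated</th></tr>
  <tr><td>2010</td><td>2 zettabytes</td></tr>
  <tr><td>2011</td><td>5 zettabytes</td></tr>
</table>
"""


def import_html_table(html, index):
    """Return the index-th table (1-based, like IMPORTHTML) as a list of rows."""
    parser = TableParser()
    parser.feed(html)
    return parser.tables[index - 1]


rows = import_html_table(SAMPLE_HTML, 2)
print(rows)
```

The index argument plays the same role as the final 2 in the formula: it selects the second table on the page.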

A brief background of how the data explosion on the web started
• 1989 to 2005: Web 1.0 (static web, for information only). In this era there was no
data input by users.
• 2004 to present: Web 2.0 (dynamic, collaborative, database-driven web). Lots of
data started being generated in Web 2.0 because users were allowed to post data.
The next concern was how to make good use of that data.
• To answer the question of how to make good use of the data, Web 3.0 was born
(also called the semantic web or intelligent web).
• Web 3.0 is still being developed, on the way to Web 4.0.
• As a consequence, the field of data engineering was born, and this in turn led to the
field of data science.
• Data scientists build machine learning models for business intelligence, and to do
so they must work together with data engineers. The third expert in a data science
project is a domain expert, i.e. an expert from the business area where the machine
learning model will be deployed.
• A data science project requires expertise in:
o The domain of the problem
o Data engineering
o Data science
• The term BIG DATA was coined to refer to the large amount of data on the internet.
What is big data: the main characteristics of big data are the 4 V’s
(VOLUME, VELOCITY, VARIETY AND VERACITY):
• Large volume (very large amounts of data)
• High velocity (data that is generated and changes very frequently)
• Many varieties (lots of data types, including text, numbers, pictures, videos, etc.)
• Low veracity (data that is neither clean nor accurate; it cannot be used as it is and
needs to be cleaned and transformed into a usable format, a process called data wrangling)
Hence big data is data that is characterized by large volume, high velocity, many
varieties and low veracity.
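
To make the low-veracity point concrete, here is a minimal data wrangling sketch in Python. The records and the cleaning rules (trim whitespace, normalise name casing, drop rows with a missing name or a non-numeric age) are assumptions made up for illustration; real pipelines choose their rules per dataset.

```python
# Hypothetical raw records with typical quality problems:
# stray whitespace, inconsistent casing, missing or non-numeric values.
raw_records = [
    {"name": " Alice ", "age": "34"},
    {"name": "BOB", "age": "unknown"},
    {"name": "", "age": "27"},
    {"name": "carol", "age": " 41 "},
]


def clean(records):
    """Return only usable records, with names normalised and ages as integers."""
    cleaned = []
    for rec in records:
        name = rec["name"].strip().title()
        age_text = rec["age"].strip()
        # Drop records that cannot be repaired with the simple rules above.
        if not name or not age_text.isdigit():
            continue
        cleaned.append({"name": name, "age": int(age_text)})
    return cleaned


clean_records = clean(raw_records)
print(clean_records)
```

Two of the four records survive; the other two are dropped because they lack a usable age or name.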
Challenge of storage: Big data comes with the challenge of storage. This is because the data
is not structured in a table format suitable for storage in a relational DBMS.
Big data is in two categories:
• Semi-structured data
• Unstructured data

1. Structured data: data elements that are already well organized into a structure that
is analyzable
a. Tabulated data is structured:
i. Entities (tables)
ii. Records (rows)
iii. Fields (columns)

To achieve the above data structure we use data modelling techniques, mainly applied in
RDBMS (relational database management systems).
RDBMS have preconfigured database schemas for data storage.

What is a DB schema: it is a blueprint that gives the structure of how data is organized in an
RDBMS.
Semi-structured and unstructured data, on the other hand, cannot be confined to a fixed schema, hence
such data is stored in NoSQL databases like MongoDB, Apache Cassandra, etc.
NoSQL databases are schemaless, i.e. they have no prescribed schema for data storage,
making them flexible enough to store big data, which is unstructured.
Big data is stored in different file formats.
There are several file formats: JSON, XML, CSV, Apache Parquet, etc.
Big data comes in two main categories, namely unstructured and semi-structured. The above
file formats store semi-structured data.
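
As an illustration of how the same data looks in several of these formats, the sketch below serialises one record to JSON, CSV and XML with the Python standard library. The record and its field names are hypothetical. Apache Parquet is left out because it needs a third-party library.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# A hypothetical record, used only to compare the formats.
record = {"id": 1, "name": "Kennedy", "course": "Data engineering"}

# JSON: keys and values travel together in one self-describing document.
json_text = json.dumps(record)

# CSV: a header line names the fields, then one line per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
writer.writeheader()
writer.writerow(record)
csv_text = buffer.getvalue()

# XML: every field is wrapped in an opening and a closing tag.
root = ET.Element("student")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
xml_text = ET.tostring(root, encoding="unicode")

print(json_text)
print(csv_text)
print(xml_text)
```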
2. Semi-structured data: data that is structured using big data storage file
formats such as XML, JSON, etc.

Assume you want to send the following unstructured message on the web. Use
XML to structure the message for web transportation

From SM Karume
To Kennedy

RE: reminder for supervision meeting


This is to remind you of the supervision meeting scheduled today at 2:00 in my town
campus office

Semi-structured message for big data storage and transportation

<NOTE>
<From> SM Karume </From>
<To> Kennedy </To>
<Heading> reminder for supervision meeting </Heading>
<Body>
This is to remind you of the supervision meeting scheduled today at 2:00 in my
town campus office
</Body>
</NOTE>
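
Once the message is well formed (every opening tag has a matching closing tag), a program can read its fields by name. A minimal sketch using Python's standard xml.etree.ElementTree module; the variable names are chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# The supervision-meeting note, written as well-formed XML.
note_xml = """<NOTE>
  <From>SM Karume</From>
  <To>Kennedy</To>
  <Heading>reminder for supervision meeting</Heading>
  <Body>This is to remind you of the supervision meeting scheduled today at 2:00 in my town campus office</Body>
</NOTE>"""

root = ET.fromstring(note_xml)

# Map each child tag to its text, e.g. note["To"] gives the recipient.
note = {child.tag: child.text.strip() for child in root}

print(note["From"], "->", note["To"])
print(note["Heading"])
```

If a closing tag is missing, ET.fromstring raises a ParseError instead of returning a tree.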

Difference between Structured, Semi-structured and Unstructured data

Big data is characterized by huge volume, high velocity, and a wide variety of data. Data falls into 3
types: structured data, semi-structured data, and unstructured data.

Categories of data (structured vs semistructured vs unstructured data)


1) What is structured data?
• Structured data is generally tabular data that is represented by columns and rows in
a database (predefined format).
• Databases that hold tables in this form are called relational databases.
• The mathematical term “relation” refers to a set of data organized in a tabular format.
• In structured data, every row in a table has the same set of columns.
• Rows in tabular data are referred to as records.
• Columns in tabular data are referred to as fields.
• A table in a tabular structure is called an entity.
• Each data record in a table has the same number of fields.

• SQL (Structured Query Language) is the programming language used to query structured data.
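
The terms above (entity, record, field, schema) can be seen together in a small example. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the student table and its rows are hypothetical.

```python
import sqlite3

# An in-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema: one entity (table) with three fields (columns).
conn.execute(
    "CREATE TABLE student ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, campus TEXT NOT NULL)"
)

# Two records (rows); every record has the same set of fields.
conn.executemany(
    "INSERT INTO student (id, name, campus) VALUES (?, ?, ?)",
    [(1, "Kennedy", "Town"), (2, "Amina", "Main")],
)

# An SQL query over the structured data.
rows = conn.execute(
    "SELECT name FROM student WHERE campus = ? ORDER BY id", ("Town",)
).fetchall()
print(rows)

conn.close()
```

Because the schema is fixed up front, an INSERT that omits a NOT NULL field is rejected by the database.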

2) What is semi-structured data?

• Semi-structured data is information that does not consist of structured relational
data (it cannot be stored directly in a relational database) but still has some form of
structure to it.
• Semi-structured data consists of documents held in files such as:
o JavaScript Object Notation (JSON) format
o XML file formats
• It also includes data held in key-value stores and graph databases.

Example of JSON file
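
A minimal sketch of what such a JSON document might look like, reusing the supervision-meeting note from earlier. The key names and the "tags" list are assumptions added for illustration. Python's standard json module is used to load it.

```python
import json

# A hypothetical JSON document; keys are named freely and values may be nested.
note_json = """
{
  "from": "SM Karume",
  "to": "Kennedy",
  "heading": "reminder for supervision meeting",
  "body": "This is to remind you of the supervision meeting scheduled today at 2:00 in my town campus office",
  "tags": ["supervision", "meeting"]
}
"""

# json.loads turns the text into ordinary Python dictionaries and lists.
note = json.loads(note_json)

print(note["to"])
print(note["tags"])
```

Unlike a row in a relational table, another document in the same file could omit "tags" or add new keys without any schema change.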

3) What is unstructured data?
• Unstructured data is information that either is not organized in a pre-defined
manner or does not have a pre-defined data model.
• Unstructured information is typically text-heavy but may contain data such as
numbers, dates, and facts as well.
• Videos, audio, and binary data files might not have a specific structure. They are
referred to as unstructured data.
• A lot of data today is unstructured.
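
Because unstructured text has no fields, any fact inside it has to be searched for. As a rough illustration, the sketch below pulls the meeting time out of the free-text message used earlier with a regular expression. The pattern is an assumption that only covers times written as H:MM or HH:MM.

```python
import re

# Free text: the time is buried in the sentence, not stored in a named field.
message = (
    "This is to remind you of the supervision meeting scheduled today "
    "at 2:00 in my town campus office"
)

# Matches times such as 2:00 or 14:30 (hours 0-23, minutes 00-59).
TIME_PATTERN = re.compile(r"\b([01]?\d|2[0-3]):([0-5]\d)\b")

match = TIME_PATTERN.search(message)
meeting_time = match.group(0) if match else None
print(meeting_time)
```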

1. Structured data –

Structured data is data whose elements are addressable for effective analysis. It has
been organized into a formatted repository, typically a database. It covers
all data that can be stored in an SQL database in tables with rows and columns.
Such data has relational keys and can easily be mapped into pre-designed fields. Today,
structured data is the most commonly processed form of data and the simplest to
manage. Example: relational data.

2. Semi-Structured data –

Semi-structured data is information that does not reside in a relational database but
has some organizational properties that make it easier to analyze. With some
processing, it can be stored in a relational database (this can be very hard for
some kinds of semi-structured data), but the semi-structured form exists to save
space. Example: XML data.

3. Unstructured data –
Unstructured data is data that is not organized in a predefined manner or does
not have a predefined data model, so it is not a good fit for a mainstream relational
database. There are alternative platforms for storing and managing unstructured
data. It is increasingly prevalent in IT systems and is used by organizations in
a variety of business intelligence and analytics applications. Example: Word, PDF,
text, media logs.

Differences between Structured, Semi-structured and Unstructured data:


Properties | Structured data | Semi-structured data | Unstructured data
Technology | It is based on relational database table | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data
Transaction management | Matured transaction and various concurrency techniques | Transaction is adapted from DBMS, not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, tables | Versioning over tuples or graph is possible | Versioned as a whole
Flexibility | It is schema dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is more flexible and there is absence of schema
Scalability | It is very difficult to scale DB schema | Its scaling is simpler than structured data | It is more scalable
Robustness | Very robust | New technology, not very spread | -
Query performance | Structured query allow complex joining | Queries over anonymous nodes are possible | Only textual queries are possible
