16 basic terms so you can talk to data engineers
Alex Pongpech
1. Airflow
● Apache Airflow is an open-source workflow management platform.
● It started at Airbnb in October 2014 as a solution to manage the company's increasingly complex workflows.
● Airflow uses directed acyclic graphs (DAGs) to manage workflow orchestration: each node is a task, and each edge says which task must finish before another can start.
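To make the DAG idea concrete, here is a minimal sketch in plain Python. It does not use Airflow itself: the task names (`extract`, `clean`, and so on) are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for a scheduler, only to show how dependencies determine a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
    "archive": {"clean"},
}

def run_order(dag):
    """Return one valid execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())

order = run_order(workflow)
print(order)
```

Because the graph is acyclic, such an order always exists; a cycle (task A waiting on B while B waits on A) would make the workflow impossible to schedule.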
2. Batch Processing
● Processing large amounts of data at once, as a single job, rather than record by record as the data arrives.
● Usage
○ The extract, transform, load (ETL) step in populating data warehouses.
○ Performing bulk operations on digital images, such as resizing, conversion, watermarking, or otherwise editing a group of image files.
○ Converting computer files from one format to another. For example, a batch job may convert proprietary and legacy files to common standard formats for end-user queries and display.
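The file-conversion use case can be sketched as a small batch job. This is only an illustration under assumptions: the file names and contents are made up, and the "files" are held in memory as strings rather than read from disk.

```python
import csv
import io
import json

def convert_batch(csv_files):
    """Convert every CSV document in the batch to a JSON string in one run.

    csv_files maps a (hypothetical) file name to its CSV text.
    """
    converted = {}
    for name, text in csv_files.items():
        rows = list(csv.DictReader(io.StringIO(text)))
        converted[name.replace(".csv", ".json")] = json.dumps(rows)
    return converted

# Hypothetical legacy files collected for one batch run.
batch = {
    "orders.csv": "id,amount\n1,9.50\n2,12.00\n",
    "users.csv": "id,name\n1,Ann\n",
}
result = convert_batch(batch)
```

The whole group of files is handled in one pass, which is what distinguishes a batch job from handling each file the moment it appears.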
3. Cold Data Storage
● Storing old data that is rarely used on low-power servers. The trade-off is that retrieving the data takes longer.
4. Cluster
● Several computers or virtual machines (nodes) grouped together to perform a single task.
5. Data Lakes
● A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
● You can store your data as-is, without having to first structure the data, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
● A data lake is therefore not just a place to store data: analytics run directly on what is stored.
6. Data Ingestion
● Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. To ingest something is to "take something in or absorb something."
● Data can be streamed in real time or ingested in batches.
7. Data Pipeline
● A data pipeline is a broader term that encompasses ETL as a subset.
● It refers to a system for moving data from one system to another. The data may or may not be transformed, and it may be processed in real time (streaming) instead of in batches.
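A minimal sketch of that idea, assuming made-up event records and using a plain list as the destination system: the same pipeline can move data unchanged or pass it through an optional transform stage.

```python
def source(records):
    """Yield records one by one from the source system."""
    yield from records

def drop_invalid(records):
    """Optional transform stage: skip records without a user id."""
    for record in records:
        if record.get("user_id") is not None:
            yield record

def sink(records, destination):
    """Write each record to the destination system (here, a plain list)."""
    for record in records:
        destination.append(record)
    return destination

# Hypothetical events in the source system.
events = [
    {"user_id": 1, "action": "click"},
    {"user_id": None, "action": "view"},
    {"user_id": 2, "action": "buy"},
]

moved_as_is = sink(source(events), [])                    # no transformation
moved_filtered = sink(drop_invalid(source(events)), [])   # with a transform stage
```

Because each stage is a generator, records flow through one at a time, so the same structure works for batch input or for a continuous feed.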
8. Data Streaming
● Processing data in small chunks, one at a time, rather than processing all data at once.
● Streaming is necessary for processing unbounded (infinite) event streams, which never reach a point where "all the data" is available.
● It's also useful for processing large amounts of data, because only one chunk is held in memory at a time, which helps avoid running out of memory during processing.
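The memory argument can be illustrated with a short sketch. The sensor stream below is hypothetical and simply generates random numbers forever; the point is that the consumer keeps only a count and a sum, never the full history.

```python
import itertools
import random

def sensor_readings(seed=0):
    """Hypothetical unbounded event stream: yields readings forever."""
    rng = random.Random(seed)
    while True:
        yield rng.uniform(15.0, 25.0)

def running_mean(stream):
    """Consume a stream one event at a time, keeping only a count and a sum in memory."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

# Take the first 1000 running means; the source itself never ends.
means = list(itertools.islice(running_mean(sensor_readings()), 1000))
```

A batch version would have to wait for the stream to finish before computing the mean, which for an infinite stream never happens.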
9. Distributed Data Processing
● Breaking up data into partitions so that large amounts of data can be processed by many machines simultaneously.
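As a rough single-machine sketch of partitioned processing, the example below splits a list of lines into partitions and counts words in each one independently. Worker threads stand in for separate machines here, which is an assumption made purely for illustration; a real cluster would use a framework such as Hadoop or Spark.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n_parts):
    """Split data into n_parts roughly equal slices."""
    return [data[i::n_parts] for i in range(n_parts)]

def count_words(lines):
    """Work done independently on one partition."""
    return sum(len(line.split()) for line in lines)

def distributed_word_count(lines, n_workers=4):
    parts = partition(lines, n_workers)
    # Each worker thread stands in for a separate machine in a cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, parts))
    # Combine the partial results into the final answer.
    return sum(partial_counts)

lines = ["the quick brown fox", "jumps over", "the lazy dog"] * 100
total = distributed_word_count(lines)
```

The pattern is always the same: partition, process each partition independently, then combine the partial results.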
10. Extract, Transform, Load (ETL)
● A process in databases and data warehousing: extracting data from various sources, transforming it to fit operational needs, and loading it into the target database.
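The three ETL steps can be sketched with the standard-library `sqlite3` module. Everything specific here is an assumption for illustration: the two source systems, the sample rows, and the `purchases` table name.

```python
import sqlite3

def extract():
    """Extract: hypothetical rows gathered from two source systems."""
    crm = [("ann@example.com ", "12.50"), ("BOB@example.com", "7")]
    shop = [("cy@example.com", "n/a")]
    return crm + shop

def transform(rows):
    """Transform: normalise e-mails, parse amounts, drop unparseable rows."""
    clean = []
    for email, amount in rows:
        try:
            clean.append((email.strip().lower(), float(amount)))
        except ValueError:
            continue
    return clean

def load(rows, conn):
    """Load: write the already-cleaned rows into the target table."""
    conn.execute("CREATE TABLE purchases (email TEXT, amount REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
loaded = conn.execute("SELECT email, amount FROM purchases ORDER BY email").fetchall()
```

Note that the row with the unparseable amount never reaches the database: in ETL, only transformed data is loaded.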
11. Extract, Load, Transform (ELT)
● In contrast to ETL, in ELT models the data is not transformed on entry to the data lake, but stored in its original raw format. Transformation happens later, inside the target system, when the data is needed.
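A small ELT sketch, again using the standard-library `sqlite3` module with made-up rows and table names: the raw records are loaded untouched, and the transformation is expressed afterwards as a view over them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: store the records exactly as received, as untyped text.
conn.execute("CREATE TABLE raw_purchases (email TEXT, amount TEXT)")
raw_rows = [
    ("ann@example.com ", "12.50"),
    ("BOB@example.com", "7"),
    ("cy@example.com", "n/a"),
]
conn.executemany("INSERT INTO raw_purchases VALUES (?, ?)", raw_rows)

# Transform: done later, inside the target system, as a view over the raw data.
conn.execute("""
    CREATE VIEW purchases AS
    SELECT lower(trim(email)) AS email, CAST(amount AS REAL) AS amount
    FROM raw_purchases
    WHERE amount GLOB '[0-9]*'
""")

raw_count = conn.execute("SELECT COUNT(*) FROM raw_purchases").fetchone()[0]
clean = conn.execute("SELECT email, amount FROM purchases ORDER BY email").fetchall()
```

All three raw rows stay available, including the one with the bad amount, so a different transformation can be applied later without going back to the source.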
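12. Hadoop Distributed File System (HDFS)
● Hadoop Distributed File System (HDFS) is the primary data storage layer used by Hadoop applications.
● It employs a NameNode and DataNode architecture: the NameNode keeps the file system metadata (which blocks make up each file and where they are stored), while the DataNodes store the actual data blocks.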
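The division of labour between the two node types can be shown with a toy in-memory model. This is not the HDFS API: the class and method names are invented for illustration, the block size is tiny on purpose, and real HDFS features such as block replication are left out.

```python
class DataNode:
    """Stores the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

class NameNode:
    """Keeps only metadata: which blocks make up a file and which DataNode holds each."""
    def __init__(self, datanodes, block_size=4):
        self.datanodes = datanodes
        self.block_size = block_size  # tiny on purpose; real HDFS blocks are far larger
        self.files = {}

    def write(self, path, data):
        locations = []
        for i in range(0, len(data), self.block_size):
            block_id = f"{path}#{i // self.block_size}"
            # Round-robin placement across the available DataNodes.
            node = self.datanodes[len(locations) % len(self.datanodes)]
            node.blocks[block_id] = data[i:i + self.block_size]
            locations.append((block_id, node))
        self.files[path] = locations

    def read(self, path):
        # Look up block locations, then fetch each block from its DataNode.
        return "".join(node.blocks[block_id] for block_id, node in self.files[path])

nodes = [DataNode("dn1"), DataNode("dn2")]
namenode = NameNode(nodes)
namenode.write("/logs/a.txt", "hello hdfs!")
```

Reading the file back means asking the NameNode where the blocks are and then collecting them from the DataNodes, which is the essence of the architecture.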
13. Metadata
● Metadata is "data that provides information about other data".
● Many distinct types of metadata exist, including descriptive metadata, structural metadata, administrative metadata, reference metadata and statistical metadata.
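As a small illustration, the sketch below collects a few kinds of metadata about a CSV document. The function name, the sample file, and the way each field is assigned to a metadata type are assumptions made for this example, not a standard classification.

```python
import csv
import io

def describe_csv(name, text):
    """Collect a few kinds of metadata about a CSV document without returning its data."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return {
        "descriptive": {"name": name},
        "structural": {"columns": header, "column_count": len(header)},
        "statistical": {"row_count": len(body)},
        "administrative": {"size_bytes": len(text.encode("utf-8"))},
    }

# Hypothetical file contents.
meta = describe_csv("orders.csv", "id,amount\n1,9.50\n2,12.00\n")
```

None of the values in `meta` are order data themselves; they only describe the file that holds the data.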
14. Real-Time Processing
● Real-time data processing is the processing of data within a short time period, providing near-instantaneous output.
● The processing is done as the data arrives, so it needs a continuous stream of input data in order to provide a continuous output.
15. Scalable
Scalable hardware or software can expand to support
increasing workloads. This capability allows computer
equipment and software programs to grow over time, rather
than needing to be replaced.
16. Workflow Orchestration
● An orchestration workflow, which is based on a Business Process Manager business process definition, defines a logical flow of activities or tasks from a Start event to an End event to accomplish a specific service.
References
https://thoughtbot.com/blog/a-glossary-for-data-engineering
https://datafloq.com/abc-big-data-glossary/
https://www.bmc.com/blogs/workflow-orchestration/
Wikipedia