Data-Centric Computing
Data-centric computing prioritizes data as the key driver for insights and innovation. The course
develops hands-on experience with large, real-world data and explores techniques and tools to derive
meaningful insights. Students gain practical experience through data exploration and predictive modeling.
Name of the Course Coordinator: Dr. R.J. Aarthi / Associate Professor
Contact Hrs.: 45
Course Offering Department/School: Department of CSE / School of Computing
Total Marks: 100
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO1 3 2 2 2
CO2 2 3 2 2
CO3 3 3 2 2 2
CO4 3 3 2 2 2
2
CO5 3 2 3 1 2
Definition and significance of data-centric computing - Differences between data-centric and
compute-centric approaches - Impact of data-centric computing - Advantages and challenges - Use cases
in industry: AI, big data analytics, IoT - Limitations.
Overview of data types and representations - File systems and storage management techniques:
Traditional storage architectures – Hard Disk Drive (HDD), Solid State Drive (SSD) - Distributed
storage systems – Hadoop Distributed File System (HDFS), limitations of HDFS - Ceph Distributed
Storage – Reliable Autonomic Distributed Object Store (RADOS) in cloud.
Data collection and integration from heterogeneous data sources: methods and challenges – Data
integration tools: ETL tools - Data cleaning and preprocessing for data quality - Data normalization
and transformation techniques - Handling imbalanced data.
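As an illustration of the normalization and imbalanced-data topics in this module, the following is a minimal Python sketch and not part of the official course material. It shows min-max normalization and plain random oversampling, a simpler stand-in for SMOTE-style methods (it duplicates existing minority rows instead of synthesizing new ones). The function names `min_max_normalize` and `random_oversample` and all sample values are hypothetical.

```python
import random


def min_max_normalize(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def random_oversample(rows, label_of, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Add copies of existing rows to reach the target class size.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Made-up sample data for illustration only.
ages = [18, 30, 42, 66]
normalized_ages = min_max_normalize(ages)

records = [("a", 0), ("b", 0), ("c", 0), ("d", 0), ("e", 1)]
balanced_records = random_oversample(records, label_of=lambda r: r[1])
```

After running the sketch, `normalized_ages` holds values between 0 and 1, and `balanced_records` contains four rows of each class.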
Introduction to big data and distributed computing - Frameworks: Apache Hadoop: MapReduce
programming model - Apache Spark: In-memory distributed computing, Basics of parallel processing
and concurrency, Distributed system concepts: Data partitioning and replication, Consensus algorithms -
Paxos, Raft, Fault tolerance and scalability.
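The MapReduce programming model named in this module can be sketched in plain Python. This is a single-process illustration under simplifying assumptions, not Hadoop code: there is no cluster, no fault tolerance, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return (key, sum(values))


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Made-up input for illustration only.
documents = ["data drives insight", "data drives innovation"]
counts = word_count(documents)
```

In a real framework the map and reduce calls run in parallel on different machines, and the shuffle step moves data across the network. The sketch keeps only the data flow between the three phases.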
Overview of data-centric languages: Python for data manipulation - NumPy, pandas, R for statistical
computing - Query languages - SQL and their extensions for big data - HiveQL, SparkSQL - Tools and
Technologies Overview- Case studies in Agriculture, Education, Healthcare.
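As a small illustration of the query-language topic in this module, the sketch below runs a SQL aggregation from Python. It uses the standard-library `sqlite3` module as an assumption for convenience, not HiveQL or SparkSQL, although the same `GROUP BY` pattern carries over to those dialects. The table name `yields`, its columns, and the rows are made up.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yields (crop TEXT, region TEXT, tonnes REAL)")

# Hypothetical agriculture data for illustration only.
conn.executemany(
    "INSERT INTO yields VALUES (?, ?, ?)",
    [
        ("rice", "north", 10.0),
        ("rice", "south", 14.0),
        ("wheat", "north", 8.0),
    ],
)

# Aggregate total tonnes per crop, sorted by crop name.
totals = conn.execute(
    "SELECT crop, SUM(tonnes) FROM yields GROUP BY crop ORDER BY crop"
).fetchall()
conn.close()
```

Here `totals` is a list of `(crop, total_tonnes)` tuples, one per crop.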
S.No SUMMARY OF COURSE CONTENT Hrs Assignment Alignment to COs
1 Introduction to Data-Centric Computing: Definition and significance of data-centric computing 1 CO1
2 Explanation of compute-centric computing 1 CO1
3 Differences between data-centric and compute-centric approaches 1 CO1
4 Impact of data-centric computing 1 CO1
5 Advantages of data-centric computing in the real-time data world 1 CO1
6 Challenges in data-centric computing 1 CO1
7 Use cases in industry relevant to data-centric and compute-centric computing 1 CO1
8 Use cases in industry: AI, big data analytics, IoT 1 CO1
9 Limitations of data-centric computing 1 CO1
10 Data Storage Representation: Discussion on various data types and data structures to handle them 1 CO2
11 Overview of data types and representations 1 CO2
12 File systems and storage management techniques: Traditional storage architectures – Hard Disk Drive (HDD), Solid State Drive (SSD) 1 CO2
13 Distributed storage systems and their necessity 1 CO2
14 Hadoop Distributed File System – HDFS 1 CO2
16 Ceph Distributed Storage CO2 RB2 T1 2
17 Reliable Autonomic Distributed Object Store – RADOS in cloud CO2 RB2 T1 2
18 How Ceph distributed storage can be implemented using RADOS CO2 RB2 T2 3
19 Data Preprocessing Techniques: Discussion and types CO3 RB2 T1 2
20 Data Collection and Integration from Heterogeneous Sources of data CO3 RB1 T1 2
21 Data collection methods and challenges CO3 RB1 T1 2
22 Data Integration Tools: ETL tools CO3 RB2 T3 2
23 Data cleaning and Preprocessing for data quality CO3 RB2 T1 2
24 Data normalization and transformation techniques CO3 TB1 T1 2
25 Handling imbalanced data – SMOTE techniques CO3 TB1 T1 2
26 Big Data Frameworks, Parallel and Distributed Computing: Introduction to big data and distributed computing CO4 WB1 T2 2
27 Frameworks: Apache Hadoop: MapReduce programming model CO4 WB1, RB2 T2 2
28 Apache Spark: In-memory distributed computing CO4 RB1 T1 3
29 Basics of parallel processing and concurrency CO4 TB1 T1 3
30 Distributed system concepts: Data partitioning and replication CO4 TB1 T1 3
31 Consensus algorithms - Paxos, Raft CO4 TB1 T1 3
32 Fault tolerance and scalability CO4 TB1 T1 3
33 Introduction to Data-Centric Programming Paradigms CO5 WB1 T1 3
Assignments:
T1 Black Board
T3 Video Lectures
TB1 Designing Data-Intensive Applications, 2nd Edition, by Martin Kleppmann, Chris Riccomini,
Publisher(s): O'Reilly Media, Inc.
Assessment Pattern:
There are four Continuous Learning Assessments (CLA) for the subject: CLA 1 for 30 marks,
CLA 2 for 30 marks, CLA 3 for 30 marks, and CLA 4 for 10 marks.
CO WEIGHTAGE
COs Weightage (Theory)
CO1 20%
CO2 20%
CO3 20%
CO4 20%
CO5 20%
THEORY
CLA 1 covers Module I and the first half of Module II, for 30 marks.
CLA 2 covers the second half of Module II and Module III, for 30 marks.
CO   CLA 1  CLA 2  CLA 3  CLA 4
CO1  20     -      -      -
CO2  10     10     -      -
CO3  -      20     -      -
CO4  -      -      15     05
CO5  -      -      15     05
Final Examination – Weightage 50%
Evaluation Policy
EXAMS    Total Marks split up    WEIGHTAGE    TOTAL MARKS
Continuous Internal Assessment Theory (CLA 1, CLA 2, CLA 3, CLA 4)    100    50% of Average    100 Marks
End Semester Exam Theory    100
CO1 20
CO2 20
CO3 20
CO4 20
CO5 20
TEXTBOOKS
REFERENCE BOOKS
1. “Data Centric Artificial Intelligence: A Beginner’s Guide”, by Parikshit N. Mahalle, Gitanjali R.
Shinde, Yashwant S. Ingle, Namrata N. Wasatkar, Springer
2. “A Data-Centric Introduction to Computing”, by Kathi Fisler, Shriram Krishnamurthi, Benjamin S.
Lerner, Joe Gibbs Politz, MIT Press
1. https://www.jenkins.io/user-handbook.pdf