Big Data Syllabus
Prerequisite
Co-requisite
Anti-requisite
Course Objectives:
Course Outcomes
CO1 Describe what Data Science is and the skill sets needed to be a data scientist.
CO2 Explain in basic terms what Statistical Inference means. Identify probability distributions
CO3 Explain the significance of exploratory data analysis (EDA) in data science. Apply basic
CO4 Describe the Data Science Process and how its components interact. Use APIs and other
CO5 Identify and explain fundamental mathematical and algorithmic ingredients that constitute a
principal component analysis). Build their own recommendation system using existing
components.
1. Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge
3. Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data
Mining
4. Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning, Second
Edition.
5. Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. (Note: this is a book
currently being written by the three authors. The authors have made the first draft of their notes for the
book available online. The material is intended for a modern theoretical course in computer science.)
6. Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and
7. Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques, Third Edition. ISBN
0123814790. 2011.
What is Data Science? - Big Data and Data Science hype – and getting past the hype - Why now? –
Populations and samples - Statistical modelling, probability distributions, fitting a model - Intro to
R.
Unit-2 . Exploratory Data Analysis and the Data Science Process 8 hours
Exploratory Data Analysis and the Data Science Process - Basic tools (plots, graphs and summary
statistics) of EDA - Philosophy of EDA - The Data Science Process - Case Study: RealDirect
(online real estate firm) - Three Basic Machine Learning Algorithms - Linear Regression -
k-Nearest Neighbors (k-NN) - k-means.
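
As an illustration of the k-NN topic listed above, the following is a minimal sketch in plain Python, not course material. The function name `knn_predict`, the toy points and the labels are all assumptions made up for this example; it classifies a query point by majority vote among its k closest training points using Euclidean distance.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(points, labels, (1.1, 1.0)))
print(knn_predict(points, labels, (5.1, 5.0)))
```

Sorting all distances is fine for a small example; on large data sets a spatial index or an approximate neighbour search would usually replace the full sort.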
Motivating application: Filtering Spam - Why Linear Regression and k-NN are poor choices for
Filtering Spam - Naive Bayes and why it works for Filtering Spam - Data Wrangling: APIs and
other tools for scraping the Web - Feature Generation and Feature Selection (Extracting Meaning
(brainstorming, role of domain expertise, and place for imagination) - Feature Selection algorithms
graphs.
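
To make the spam-filtering topic of this unit concrete, here is a small sketch of a Naive Bayes text classifier in plain Python. It is an illustration under stated assumptions: the function names, the four training messages and the fixed class names "spam" and "ham" are invented for this example, and add-one (Laplace) smoothing is one common choice among several.

```python
import math
from collections import Counter


def train_naive_bayes(messages, labels):
    """Count words per class. Labels are assumed to be 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocabulary


def classify(text, word_counts, class_counts, vocabulary):
    """Return the class with the highest log posterior score."""
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Start from the log prior, log P(class).
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add log P(word | class) with add-one smoothing.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training messages, made up for illustration.
messages = [
    "win cash prize now",
    "cheap prize claim now",
    "meeting agenda for monday",
    "lunch on monday",
]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(messages, labels)

print(classify("claim your cash prize", *model))
print(classify("agenda for monday lunch", *model))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing keeps an unseen word from forcing a class probability to zero.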
Basic principles, ideas and tools for data visualization - Examples of inspiring (industry) projects -
Exercise: create your own visualization of a complex dataset Discussions on privacy, security,
The advances and the latest trends in the course as well as the latest applications of the areas
Discussion of recent papers published in IEEE and ACM transactions, Web of Science and
SCOPUS indexed journals, and high-impact-factor conferences and symposiums.
Discussion on some of the latest products available in the market based on the areas covered in the
Prerequisite
Co-requisite
Anti-requisite
COURSE OBJECTIVES:
Understanding Data Science Process and learning techniques, tools, Statistical Methodologies and
COURSE OUTCOMES:
CO2 Students should learn various techniques for big data analytics.
CO3 Students are able to identify real-time problems and to design solutions using
Introduction – distributed file system – Big Data and its importance, Four V's in Big Data, Drivers for Big
Data, Big Data analytics, Big Data applications. Algorithms using MapReduce, Matrix-Vector
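
As an illustration of how a matrix-vector product can be phrased in the map and reduce style named in this unit, the sketch below simulates both phases in plain Python on one machine. It is not Hadoop code: the function names and the sample matrix are assumptions for this example, the matrix is given as sparse (row, column, value) triples, and the vector is assumed to fit in memory at every mapper.

```python
from collections import defaultdict


def map_phase(matrix_entries, vector):
    """Emit one (row, partial product) pair per matrix entry (i, j, m_ij)."""
    for i, j, value in matrix_entries:
        yield i, value * vector[j]


def reduce_phase(pairs):
    """Sum all partial products that share the same row key."""
    sums = defaultdict(float)
    for row, partial in pairs:
        sums[row] += partial
    return dict(sums)


# Hypothetical 2x3 matrix [[1, 2, 0], [0, 3, 4]] as sparse triples.
entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0)]
vector = [1.0, 2.0, 3.0]

result = reduce_phase(map_phase(entries, vector))
print(result)
```

In a real cluster the framework would group the emitted pairs by key between the two phases; here the dictionary inside `reduce_phase` plays that role.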
Big Data – Apache Hadoop & Hadoop EcoSystem – Moving Data in and out of Hadoop –
Hadoop Architecture, Hadoop Storage: HDFS, Common Hadoop Shell commands, Anatomy of File
Write and Read, NameNode, Secondary NameNode, and DataNode, Hadoop MapReduce paradigm,
Map and Reduce tasks, Job, TaskTrackers - Cluster Setup – SSH & Hadoop Configuration – HDFS
ecosystem components - Schedulers - Fair and Capacity, Hadoop 2.0 New Features - NameNode High
Architecture and Installation, Comparison with Traditional Database, HiveQL - Querying Data - Sorting
and Aggregating, MapReduce Scripts, Joins & Subqueries, HBase concepts - Advanced Usage, Schema
Design, Advanced Indexing - Pig, ZooKeeper - how it helps in monitoring a cluster, HBase uses
Unit VI 5 hours
The advances and the latest trends in the course as well as the latest applications of the areas covered in
the course.
Discussion of recent papers published in IEEE and ACM transactions, Web of Science and
SCOPUS indexed journals, and high-impact-factor conferences and symposiums.
Discussion on some of the latest products available in the market based on the areas covered in the
course and
3. Chris Eaton, Dirk deRoos et al., “Understanding Big Data”, McGraw Hill, 2012.
5. Vignesh Prajapati, “Big Data Analytics with R and Hadoop”, Packt Publishing, 2013.
6. Tom Plunkett, Brian Macdonald et al, “Oracle Big Data Handbook”, Oracle Press, 2014.