MET CS777 Summer2-2022 Big-Data-Analytics
Course Description
This course is an introduction to large-scale data analytics. Big Data analytics is the study of how
to extract actionable, non-trivial knowledge from massive data sets. This class will
focus both on the cluster computing software tools and programming techniques used by data
scientists and on the important mathematical and statistical models used to learn from
large-scale data. On the tools side, we will cover the basic systems and techniques
for storing large volumes of data, and modern cluster computing systems based on MapReduce
patterns, such as Hadoop MapReduce, Apache Spark, and Flink.
Students will implement data mining algorithms and execute them on real cloud systems like
Amazon AWS, Google Cloud, or Microsoft Azure by using educational accounts. On the data
mining models side, this course will cover the main standard supervised and unsupervised
models and will introduce improvement techniques on the model side.
Course Prerequisites
● We expect you to have a solid background in Python programming and understand basic
statistics and machine learning. The following classes are required/recommended: MET
CS 521, MET CS 544 and MET CS 555, or MET CS 677.
● If you do not have the required/recommended courses, you need the instructor's
consent.
● This class includes topics from Cloud Computing, Parallel Processing, and Machine
Learning, which makes it very dense for a six-week online course.
● To implement the assignments, students need excellent knowledge of the
Python programming language and some basic Linux knowledge. Assignments are very
time-consuming, so you should take this course only if you can devote at least 20 hours per
week to it.
Boston University Metropolitan College
Learning Objectives
By completing this course, you will be able to:
● Explain the main challenges of Big Data Processing.
● Run a Big Data Processing pipeline on Google Cloud (or Amazon AWS).
● Implement Big Data code in Apache Spark (in PySpark).
● Run Supervised and Unsupervised machine learning on Large-Scale Data.
Laptop Requirement
Students should have a personal laptop. We will use laptops to write Python programs and take
quizzes in the classroom. Laptops are also required for the final exam.
Materials
Required Book
There is no required textbook for the class. There are detailed lecture notes, and all class
material will be conveyed during the lecture.
Recommended Books
ISBN-13: 978-0262018029
Han, J., Kamber, M., Pei, J. (2009). Data mining: Concepts and
techniques (3rd ed.).
Morgan Kaufmann.
ISBN-13: 978-9380931913
Perrin, J. (2020). Spark in action (2nd ed.). (Covers Apache Spark 3 with
examples in Java, Python, and Scala.)
Manning Publications.
Damji, J., Wenig, B., Das, T., Lee, D. (2020). Learning Spark (2nd ed.).
O'Reilly Media Inc.
Ramcharan, K., Sundar, K., Alla, S. (2020). Applied data science using
PySpark: Learn the end-to-end predictive model-building cycle.
Apress.
GitHub
This course has a GitHub repository (https://github.com/trajanov/BigDataAnalytics) with all of
the course code examples.
Course website
This course will use the Blackboard Learn site. Students are required to have a BU ID and
password to log in. If you do not have a BU ID yet, note that obtaining one takes some time, so be sure to
start this process well before class starts. The Blackboard site is https://learn.bu.edu
Furthermore, you should cite your sources. Add a comment to your code that includes the
URL(s) you consulted when constructing your solution. This turns out to be very helpful when
you’re looking at something you wrote a while ago, and you need to remind yourself what you
were thinking.
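As a minimal sketch of what such a citation comment might look like: the function `top_words` is a hypothetical example and is not taken from any course assignment, and the URL is simply one documentation page a student might have consulted.

```python
from collections import Counter

# Sources consulted (hypothetical example of a citation comment):
#   https://docs.python.org/3/library/collections.html#collections.Counter
#   Used for the idea of counting words with Counter.most_common().


def top_words(text, n=3):
    """Return the n most frequent lowercase words in text with their counts."""
    words = text.lower().split()
    return Counter(words).most_common(n)


if __name__ == "__main__":
    print(top_words("spark hadoop spark flink spark hadoop", 2))
```

Placing the citation next to the code it informed, rather than only at the top of the file, makes it easier to trace each idea back to its source later.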
Grading Criteria
Please check the Study Guide in the syllabus for Live Classroom dates and specific due dates for
assignments and assessments.
Grading Structure and Distribution
The grade for the course is determined by the following:
Activity Percentages
Assignments
Weekly homework assignments focus on applying the theory learned in the week’s module to a set of
data: you will implement data processing and machine learning algorithms in Apache Spark
(PySpark) and use them to analyze that data. You
will use Google Cloud to run your Spark code on large data sets. Free usage credits for
Google Cloud will be provided through Education accounts.
Due Time: At the end of each module (Please check the Study Guide or the Syllabus for the
specific due date).
Where to submit: The "Assignments" section in the left-hand course menu.
Weekly Quizzes
Quizzes will evaluate students' understanding of concepts presented in the previous week’s
module. Students should ensure adequate preparation. Doing well on a quiz will not be
possible without first reviewing the course material in depth, attempting to understand all
examples, and testing yourself. There are five quizzes.
A (Excellent) 95–100
B (Good) 83–86.99
Unacceptable 0
Instructor Biography
Prof. Dimitar Trajanov, Ph.D., is a Visiting Research Professor at Boston
University and Head of the Department of Information Systems and
Network Technologies at the Faculty of Computer Science and
Engineering, Ss. Cyril and Methodius University in Skopje. From
March 2011 until September 2015, he was the founding Dean of the
Faculty of Computer Science and Engineering. During his tenure, the
Faculty became the largest technical faculty in Macedonia.
Dimitar Trajanov is the leader of the Regional Social Innovation Hub,
established in 2013 in cooperation with the United Nations
Development Programme. His professional experience includes working as a data science
consultant for one of the largest pharmaceutical companies, a data science consultant for
UNDP in North Macedonia, and a software architect at a couple of startups. Dimitar Trajanov is
the author of more than 170 journal and conference papers and seven books. He has been
involved in more than 70 research and industry projects.