Module 1
INTRODUCTION TO DATA SCIENCE
Exploring the Fascinating World of Data Science
In a world awash with information, data science has emerged as a
powerful discipline that harnesses the insights hidden within vast troves of
data. From uncovering patterns in customer behavior to predicting global
trends, this field promises to transform how we understand and interact
with the world around us.
Understanding Big Data and Data Science
When it comes to big data and data science, there's a lot of hype. However, it's
important for college students to know that not everything they hear is
true. This presentation aims to explain what big data and data science
really are. We will explore the challenges, opportunities, and real-world
uses of these fields, helping students make informed decisions as they
explore the world of data.
Big Data and Data Science Hype-
Why is data science so hyped? And what is eyebrow-raising about Big Data and data science?
Let’s count the ways:
• There’s a lack of definitions around the most basic terminology. What is “Big Data” anyway? What does “data
science” mean? What is the relationship between Big Data and data science? Is data science the science of Big
Data? Is data science only the stuff going on in companies like Google and Facebook and tech companies?
Why do many people refer to Big Data as crossing disciplines (astronomy, finance, tech, etc.) and to data
science as only taking place in tech? Just how big is big? Or is it just a relative term? These terms are so
ambiguous, they’re well-nigh meaningless.
• There’s a distinct lack of respect for the researchers in academia and industry labs who have been working on
this kind of stuff for years, and whose work is based on decades (in some cases, centuries) of work by
statisticians, computer scientists, mathematicians, engineers, and scientists of all types. From the way the
media describes it, machine learning algorithms were just invented last week and data was never “big” until
Google came along. This is simply not the case. Many of the methods and techniques we’re using—and the
challenges we’re facing now—are part of the evolution of everything that’s come before. This doesn’t mean
that there’s not new and exciting stuff going on, but we think it’s important to show some basic respect for
everything that came before.
Understanding Big Data and Data Science
What is "Big Data"?
The term "Big Data" is often used to describe large, complex datasets that are difficult to process using traditional methods. However, the exact meaning of "big" varies and can be subjective.
What is "Data Science"?
"Data Science" is a broad term that encompasses various activities, including statistical analysis and machine learning. It is not always clear how it relates to other fields like statistics and computer science.
The Connection
While big data and data science are related, their relationship is not well-defined. It is unclear whether data science is solely focused on analyzing big data or includes other data-driven activities. Clarifying these definitions is important for establishing a more coherent and respected field.
Respecting the Past
1 Collaboration
Working together with experts from different areas is important for solving real-world problems and making a positive impact with data-driven insights.
2 Ethics
Following strong ethical guidelines is crucial to address concerns about privacy, bias, and responsible use of data and algorithms. This builds trust and credibility.
• This data tells us a lot about how people behave and how society works.
• With new technology, we can use this data to learn new things and come up with new ideas in many different industries.
• ... our lives.
• At the same time, computers are getting cheaper and more powerful, so we can process and analyze all this data on a large scale.
• This perfect combination of lots of data and advanced technology is creating a new way of making decisions based on data and making data science a really important field.
Datafication: Turning Life into Data
3 Collaborative Mindset
... analytical abilities.
4 An Emerging Role
... bring meaningful change.
• The world we live in is complex, random, and uncertain. At the same time, it’s one big data-generating
machine.
• As we commute to work on subways and in cars, as our blood moves through our bodies, as we’re shopping,
emailing, procrastinating at work by browsing the Internet and watching the stock market, as we’re building
things, eating things, talking to our friends and family about things, while factories are producing products,
this all at least potentially produces data.
• Imagine spending 24 hours looking out the window, and for every minute, counting and recording the number
of people who pass by. Or gathering up everyone who lives within a mile of your house and making them tell
you how many email messages they receive every day for the next year.
• Imagine heading over to your local hospital and rummaging around in the blood samples looking for patterns
in the DNA. That all sounded creepy, but it wasn’t supposed to. The point here is that the processes in our
lives are actually data-generating processes.
• We’d like ways to describe, understand, and make sense of these processes, in part because as scientists we
just want to understand the world better, but many times, understanding these processes is part of the solution
to problems we’re trying to solve.
• Data represents the traces of the real-world processes, and exactly which traces we gather are decided by our
data collection or sampling method. You, the data scientist, the observer, are turning the world into data, and
this is an utterly subjective, not objective, process.
• After separating the process from the data collection, we can see clearly that there are two sources of
randomness and uncertainty. Namely, the randomness and uncertainty underlying the process itself, and the
uncertainty associated with your underlying data collection methods.
• Once you have all this data, you have somehow captured the world, or certain traces of the world. But you
can’t go walking around with a huge Excel spreadsheet or database of millions of transactions, look at it,
and, with a snap of a finger, understand the world and the process that generated it.
• “This overall process of going from the world to the data, and then from the data back to the world, is the
field of statistical inference.”
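The two sources of uncertainty described above (the randomness of the process itself, and the uncertainty added by how we collect data) can be illustrated with a small simulation of the "counting people from the window" example. This is only a sketch under stated assumptions: the rate TRUE_RATE, the binomial stand-in for the arrival process, and the choice to observe 60 of the 1,440 minutes are all hypothetical and do not come from the slides.

```python
import random
import statistics


def simulate_minute_counts(rate_per_minute, minutes, rng):
    """Simulate the process: number of people passing the window each minute.

    Assumption: each minute offers 100 independent chances for a person
    to pass, each with probability rate/100 (a simple binomial stand-in
    for a random arrival process).
    """
    chances = 100
    p = rate_per_minute / chances
    return [sum(rng.random() < p for _ in range(chances)) for _ in range(minutes)]


rng = random.Random(42)
TRUE_RATE = 3.0  # hypothetical "true" parameter; unknown in a real study

# Uncertainty source 1: the process itself is random.
full_day = simulate_minute_counts(TRUE_RATE, minutes=24 * 60, rng=rng)

# Uncertainty source 2: data collection. We only record 60 of the 1,440 minutes.
observed = rng.sample(full_day, k=60)

# Inference: go from the data back to a statement about the process.
estimate = statistics.mean(observed)
std_error = statistics.stdev(observed) / len(observed) ** 0.5

print(f"mean over the full day : {statistics.mean(full_day):.2f}")
print(f"estimate from sample   : {estimate:.2f} (+/- {2 * std_error:.2f})")
```

Changing the seed or the sample size shows both effects: a new seed changes what the process produced, while a smaller sample widens the uncertainty of the estimate even for the same day of data.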
Needed Statistical Inference
1 Data Collection
The processes in our lives are data-generating. We gather traces of these processes through data collection and sampling methods, which are subjective and introduce uncertainty.
2 Statistical Inference
The overall process of going from the real world to data and then back to understanding the world is the field of statistical inference, which allows us to draw conclusions about the processes that generated the data.
3 Statistical Modeling
To make sense of the data, we create statistical models that represent our understanding of the underlying processes. These models use parameters to capture the relationships in the data.
Populations and Samples
Population Sample Sampling
Statistical modeling-
• Before you get too involved with the data and start coding, it’s useful to draw a picture of what you think the
underlying process might be with your model. What comes first? What influences what? What causes what?
What’s a test of that?
• But different people think in different ways. Some prefer to express these kinds of relationships in terms of
math. The mathematical expressions will be general enough that they have to include parameters, but the
values of these parameters are not yet known.
What is a Model?
1 Representing Reality
A model is an attempt to understand and represent the nature of reality through a particular lens, such as architectural, biological, or mathematical.
2 Expressing Relationships
Models can be expressed using mathematical equations or diagrams that capture the relationships between variables and parameters.
3 Estimating Parameters
The parameters in a model are unknown values that need to be estimated using the observed data. This process of fitting the model is a key step in statistical inference.
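As one possible illustration of estimating parameters, the sketch below assumes a simple linear model y = b0 + b1 * x with two unknown parameters and fits it by ordinary least squares. The model form, the helper name fit_line, the "true" values, and the simulated data are all assumptions made for the example; the slides do not prescribe a particular model or fitting method.

```python
import random


def fit_line(xs, ys):
    """Least-squares estimates of b0 (intercept) and b1 (slope) in y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


rng = random.Random(0)
TRUE_B0, TRUE_B1 = 2.0, 0.5  # hypothetical parameters; unknown in practice

# Simulated observations: the model plus random noise from the process.
xs = [float(i) for i in range(50)]
ys = [TRUE_B0 + TRUE_B1 * x + rng.gauss(0, 1.0) for x in xs]

b0_hat, b1_hat = fit_line(xs, ys)
print(f"estimated intercept: {b0_hat:.2f}")
print(f"estimated slope    : {b1_hat:.2f}")
```

The estimates will be close to, but not exactly equal to, the values used to generate the data, because the noise in the observations carries over into the fitted parameters.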
Probability Distributions-