UNIT - 1
Data science:
Data science is the domain of study that deals with vast volumes of data using modern tools and
techniques to find unseen patterns, derive meaningful information, and make business decisions.
Data science uses complex machine learning algorithms to build predictive models.
The data used for analysis can come from many different sources and be presented in various formats.
Data science is about the extraction, preparation, analysis, visualization, and maintenance of information. It is a cross-disciplinary field which uses scientific methods and processes to draw insights from data.
Data science’s lifecycle consists of five distinct stages, each with its own tasks:
Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction. This stage involves gathering raw structured and unstructured data.
Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, and Data Architecture. This stage covers taking the raw data and putting it in a form that can be used.
Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization. Data scientists take the prepared data and examine its patterns, ranges, and biases to determine how useful it will be in predictive analysis.
Analyze: Exploratory/Confirmatory Analysis, Predictive Analysis, Regression, Text Mining, Qualitative Analysis. In this stage the actual analyses are performed on the data.
Communicate: Data Reporting, Data Visualization, Business Intelligence, and Decision Making. In this final step, analysts prepare the analyses in easily readable forms such as charts, graphs, and reports.
1962: American mathematician John W. Tukey first articulated the data science dream. In his now-
famous article “The Future of Data Analysis,” he foresaw the inevitable emergence of a new field
nearly two decades before the first personal computers. While Tukey was ahead of his time, he was not
alone in his early appreciation of what would come to be known as “data science.”
1977: The theories and predictions of “pre” data scientists like Tukey and Naur became more
concrete with the establishment of The International Association for Statistical Computing (IASC),
whose mission was “to link traditional statistical methodology, modern computer technology, and the
knowledge of domain experts in order to convert data into information and knowledge.”
1980s and 1990s: Data science began taking more significant strides with the emergence of the first
Knowledge Discovery in Databases (KDD) workshop and the founding of the International Federation
of Classification Societies (IFCS).
1994: Business Week published a story on the new phenomenon of “Database Marketing.” It described
the process by which businesses were collecting and leveraging enormous amounts of data to learn
more about their customers, competition, or advertising techniques.
1990s and early 2000s: We can clearly see that data science has emerged as a recognized and
specialized field. Several data science academic journals began to circulate, and data science
proponents like Jeff Wu and William S. Cleveland continued to help develop and expound upon the
necessity and potential of data science.
2000s: Technology made enormous leaps by providing nearly universal access to internet connectivity,
communication, and (of course) data collection.
2005: Big data enters the scene. With tech giants such as Google and Facebook uncovering large
amounts of data, new technologies capable of processing them became necessary. Hadoop rose to the
challenge, and later on Spark and Cassandra made their debuts.
2014: Due to the increasing importance of data, and organizations’ interest in finding patterns and
making better business decisions, demand for data scientists began to see dramatic growth in different
parts of the world.
2015: Machine learning, deep learning, and Artificial Intelligence (AI) officially enter the realm of data
science.
2018: New regulations in the field were perhaps among the biggest factors in the evolution of data science.
2020s: We are seeing additional breakthroughs in AI, machine learning, and an ever-increasing demand for qualified professionals in Big Data.
The major job roles in Data Science include:
Data Analyst
Data Engineers
Database Administrator
Machine Learning Engineer
Data Scientist
Data Architect
Statistician
Business Analyst
Data and Analytics Manager
1. Data Analyst
Data analysts are responsible for a variety of tasks, including visualization, munging, and processing of massive amounts of data. They also have to perform queries on databases from time to time. One of the most important skills of a data analyst is optimization. Typical responsibilities include:
Extracting data from primary and secondary sources using automated tools
Developing and maintaining databases
Performing data analysis and making reports with recommendations
To become a data analyst: SQL, R, SAS, and Python are some of the sought-after technologies
for data analysis.
2. Data Engineers
Data engineers build and test scalable Big Data ecosystems for the businesses so that the data scientists
can run their algorithms on the data systems that are stable and highly optimized. Data engineers also
update the existing systems with newer or upgraded versions of the current technologies to improve the
efficiency of the databases.
3. Database Administrator
The job profile of a database administrator is largely self-explanatory: they are responsible for the proper functioning of all the databases of an enterprise and grant or revoke access to them for the employees of the company depending on their requirements.
4. Machine Learning Engineer
Machine learning engineers are in high demand today. However, the job profile comes with its challenges. Apart from having in-depth knowledge of powerful technologies such as SQL and REST APIs, machine learning engineers are also expected to perform A/B testing, build data pipelines, and implement common machine learning algorithms such as classification and clustering.
5. Data Scientist
Data scientists have to understand the challenges of business and offer the best solutions using data analysis and data processing. For instance, they are expected to perform predictive analysis and run a fine-toothed comb through unstructured/disorganized data to offer actionable insights. To do this, a data scientist needs to be proficient in R, SQL, Python, and other complementary technologies.
6. Data Architect
A data architect creates the blueprints for data management so that the databases can be easily
integrated, centralized, and protected with the best security measures. They also ensure that the data
engineers have the best tools and systems to work with.
7. Statistician
A statistician, as the name suggests, has a sound understanding of statistical theories and data organization. Not only do they extract and offer valuable insights from data clusters, but they also help create new methodologies for engineers to apply.
8. Business Analyst
The role of business analysts is slightly different from that of other data science jobs. While they do have a
good understanding of how data-oriented technologies work and how to handle large volumes of data,
they also separate the high-value data from the low-value data.
Data Science Project Workflow
Data Science workflows tend to happen in a wide range of domains and areas of expertise such as biology, geography, finance, or business, among others. This means that Data Science projects can take on very different challenges and focuses, resulting in very different methods and data sets being used. A Data Science project will have to go through five key stages: defining a problem, data processing, modelling, evaluation, and deployment.
Defining a problem
The first stage of any Data Science project is to identify and define a problem to be solved. Without a clearly defined problem, it can be difficult to know how to tackle the problem.
For a Data Science project this can include deciding what method to use, such as whether classification, regression, or clustering is appropriate. Also, without a clearly defined problem, it can be hard to determine what your measure of success would be.
Without a defined measure of success, you can never know when your project is complete or is good
enough to be used in production.
A challenge with this is being able to define a problem small enough that it can be solved/tackled
individually.
Data Processing
Once you have your problem, how you are going to measure success, and an idea of the methods you
will be using, you can then go about performing the all-important task of data processing. This is often
the stage that will take the longest in any Data Science project and can regularly be the most important
stage.
There are a variety of tasks that need to occur at this stage depending on what problem you are going
to tackle. The first is often finding ways to create or capture data that doesn’t exist yet.
Once you have created this data, you then need to collect it somewhere and in a format that is useful for
your model. This will depend on what method you will be using in the modelling phase but it will
involve figuring out how you will feed the data into your model.
The final part of this is to perform any pre-processing steps to ensure that the data is clean enough for the modelling method to work. This may involve removing outliers (or choosing to keep them), handling null values (deciding whether a null is itself a meaningful measurement or whether it should be imputed with, for example, the average), and standardising the measures.
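As an illustration, here is a minimal pre-processing sketch in Python using pandas and scikit-learn; the column names (age, income) and the outlier threshold are purely hypothetical examples.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with a missing value and an extreme outlier
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29],
    "income": [30000, 45000, 52000, 1_000_000, 38000],
})

# Impute the missing age with the column average
df["age"] = df["age"].fillna(df["age"].mean())

# Remove an obvious outlier (or you could choose to keep it)
df = df[df["income"] < 500_000].copy()

# Standardise the measures so they are on a comparable scale
scaler = StandardScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])
print(df)
```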
Modelling
The next part, and often the most fun and exciting part, is the modelling phase of the Data Science
project. The format this will take will depend primarily on what the problem is and how you defined
success in the first step, and secondarily on how you processed the data.
Unfortunately, this is often the part that will take the least amount of time of any Data Science project, especially as many frameworks and libraries exist, such as sklearn, statsmodels, and tensorflow, that can be readily utilised.
You should have selected the method that you will be using to model your data in the defining a
problem stage, and this may include simple graphical exploration, regression, classification or
clustering.
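For instance, a minimal sketch of the modelling step using scikit-learn's LinearRegression; the toy feature matrix X and target y below are invented values.

```python
from sklearn.linear_model import LinearRegression

# Toy feature matrix (hours studied) and target (exam score)
X = [[1], [2], [3], [4], [5]]
y = [52, 58, 65, 70, 78]

model = LinearRegression()
model.fit(X, y)                 # learn the relationship from the data
print(model.predict([[6]]))     # predict the score for 6 hours of study
```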
Evaluation
Once you have created and implemented your models, you then need to know how to evaluate them. Again, this goes back to the problem formulation stage where you will have defined your measure of success, but this is often one of the most important stages.
Depending on how you processed your data and set up your model, you may have a holdout dataset or testing data set that can be used to evaluate your model. On this dataset, you are aiming to see how well your model performs in terms of both accuracy and reliability.
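A minimal sketch of evaluating a model on a holdout set with scikit-learn; the choice of the iris dataset and a decision tree here is purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 25% of the data purely for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Evaluate how well the model performs on data it has never seen
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```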
Deployment
Finally, once you have robustly evaluated your model and are satisfied with the results, then you can
deploy it into production. This can mean a variety of things such as whether you use the insights from
the model to make changes in your business, whether you use your model to check whether changes
that have been made were successful, or whether the model is deployed somewhere to continually
receive and evaluate live data.
Applications of data science in various fields
1. In Search Engines
One of the most visible applications of Data Science is in search engines. When we want to search for something on the internet, we mostly use search engines like Google, Yahoo, or Bing. Data Science is used to return relevant results faster.
2. In Transport
Data Science has also entered the transport field, for example in driverless cars. With the help of driverless cars, it becomes possible to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm, and with the help of Data Science techniques the data is analyzed: what the speed limit is on highways, busy streets, narrow roads, etc., and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in financial industries, which constantly face issues of fraud and risk of losses.
For example, Data Science is a major part of the stock market, where it is used to examine past behavior from historical data in order to predict future outcomes.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions similar to our choices based on our past data, and we also get recommendations for the most bought, most rated, and most searched products. This is all done with the help of Data Science.
5. In Health Care
In the healthcare industry, Data Science acts as a boon. Data Science is used for:
Detecting tumors.
Drug discovery.
6. Image Recognition
Currently, Data Science is also used in image recognition. For example, when we upload a photo with a friend on Facebook, Facebook suggests tagging who is in the picture. This is done with the help of machine learning and Data Science.
7. Targeting Recommendation
Targeted recommendation is one of the most important applications of Data Science. Whatever a user searches for on the internet, they will then see related advertisements almost everywhere.
For example, suppose I want a mobile phone, so I search for it on Google and then decide to buy it offline. Data Science helps the companies that pay for advertisements for that phone: everywhere on the internet, in social media, on websites, and in apps, I will see recommendations for the phone I searched for, which nudges me towards buying it online.
8. Airline Route Planning
With the help of Data Science, the airline sector is also growing: for example, it becomes easier to predict flight delays. It also helps to decide whether to fly directly to the destination or take a halt in between; for instance, a flight can take a direct route from Delhi to the U.S.A. or it can halt in between before reaching the destination.
9. In Gaming
In most games where a user plays against a computer opponent, data science concepts are used together with machine learning: with the help of past data, the computer improves its performance. Many games, such as chess and EA Sports titles, use Data Science concepts.
10. Medicine and Drug Development
The process of creating medicine is very difficult and time-consuming and has to be done with full discipline, because it is a matter of someone's life. Without Data Science it takes a lot of time, resources, and money to develop a new medicine or drug, but with the help of Data Science it becomes easier, because the probability of success can be estimated from biological data and factors. Algorithms based on data science can forecast how a compound will react in the human body without lab experiments.
11. Delivery Logistics
Various logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science helps these companies find the best route for the shipment of their products, the best time for delivery, the best mode of transport to reach the destination, etc.
12. Autocomplete
The autocomplete feature is an important application of Data Science: the user types only a few letters or words, and the system offers to complete the rest of the line.
Data Security
Data security is the process of protecting corporate data and preventing data loss through unauthorized access. This includes protecting your data from attacks that can encrypt or destroy data, such as ransomware, as well as attacks that can modify or corrupt your data. Data security also ensures data is available to anyone in the organization who has access to it. Some industries require a high level of data security to comply with data protection regulations. For example, organizations that process payment card information must use and store payment card data securely, and healthcare organizations in the USA must secure protected health information (PHI) in line with the HIPAA standard.
Accidental Exposure
A large percentage of data breaches are not the result of a malicious attack but are caused by negligent or accidental exposure of sensitive data. It is common for an organization’s employees to share, grant access to, lose, or mishandle valuable data, either by accident or because they are not aware of security policies.
Phishing and Other Social Engineering Attacks
Social engineering attacks are a primary vector used by attackers to access sensitive data. They involve manipulating or tricking individuals into providing private information or access to privileged accounts.
Phishing is a common form of social engineering. It involves messages that appear to be from a trusted source but are in fact sent by an attacker.
Insider Threats
Insider threats are employees who inadvertently or intentionally threaten the security of an
organization’s data. There are three types of insider threats:
Non-malicious insider—these are users who can cause harm accidentally, via negligence, or because they are unaware of security procedures.
Malicious insider—these are users who actively attempt to steal data or cause harm to the
organization for personal gain.
Compromised insider—these are users who are not aware that their accounts or credentials were
compromised by an external attacker. The attacker can then perform malicious activity, pretending to
be a legitimate user.
Ransomware
Ransomware is a major threat to data in companies of all sizes. Ransomware is malware that infects
corporate devices and encrypts data, making it useless without the decryption key.
Attackers display a ransom message asking for payment to release the key, but in many cases, even
paying the ransom is ineffective and the data is lost.
Data Loss in the Cloud
Many organizations are moving data to the cloud to facilitate easier sharing and collaboration. However, when data moves to the cloud, it is more difficult to control and prevent data loss. Users access data from personal devices and over unsecured networks.
SQL Injection
SQL injection (SQLi) is a common technique used by attackers to gain illicit access to databases, steal
data, and perform unwanted operations. It works by adding malicious code to a seemingly innocent
database query.
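As a minimal illustration, using Python's built-in sqlite3 module and an invented users table, of the difference between a query that is vulnerable to SQL injection and a parameterized one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

user_input = "alice' OR '1'='1"   # malicious input crafted by an attacker

# Vulnerable: the input is pasted straight into the query text,
# so the attacker's OR clause matches every row.
vulnerable = conn.execute(
    f"SELECT * FROM users WHERE name = '{user_input}'").fetchall()

# Safer: a parameterized query treats the input as a plain value.
safe = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)).fetchall()

print("vulnerable query returned:", vulnerable)  # leaks the row
print("parameterized query returned:", safe)     # returns nothing
```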
Data Discovery and Classification
Modern IT environments store data on servers, endpoints, and cloud systems. Visibility over data
flows is an important first step in understanding what data is at risk of being stolen or misused.
To properly protect your data, you need to know the type of data, where it is, and what it is used for.
Data discovery and classification tools can help.
Data detection is the basis for knowing what data you have. Data classification allows you to create
scalable security solutions, by identifying which data is sensitive and needs to be secured.
Data Masking
Data masking lets you create a synthetic version of your organizational data, which you can use for
software testing, training, and other purposes that don’t require the real data.
The goal is to protect data while providing a functional alternative when needed.
Data Encryption
Data encryption is a method of converting data from a readable format (plaintext) to an unreadable encoded format (ciphertext). Only after the encrypted data has been decrypted using the decryption key can it be read or processed.
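A minimal sketch of symmetric encryption in Python, assuming the third-party cryptography package is available; the key handling and message are simplified examples.

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # the secret encryption/decryption key
fernet = Fernet(key)

# Plaintext -> ciphertext (example message only)
ciphertext = fernet.encrypt(b"customer card number: 4111-1111")
print(ciphertext)                    # unreadable without the key

plaintext = fernet.decrypt(ciphertext)  # only possible with the key
print(plaintext.decode())
```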
Password Hygiene
One of the simplest best practices for data security is ensuring users have unique, strong passwords.
Without central management and enforcement, many users will use easily guessable passwords or use
the same password for many different services.
Password spraying and other brute force attacks can easily compromise accounts with weak
passwords.
Organizations must also put in place strong authentication methods for web-based systems. It is highly recommended to enforce multi-factor authentication when any user, whether internal or external, requests sensitive or personal data.
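Closely related to password hygiene, passwords should never be stored in plain text. A minimal sketch of salted password hashing using Python's standard library (the password string is only an example):

```python
import hashlib
import os

password = "correct horse battery staple"   # example only

# A random salt makes identical passwords hash to different values
salt = os.urandom(16)

# Derive a slow, salted hash instead of storing the password itself
digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
print(salt.hex(), digest.hex())
```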
Data Analysis Tools
A Data Scientist is responsible for extracting, manipulating, pre-processing, and generating predictions out of data. In order to do so, they require various statistical tools and programming languages.
1. R
R is a programming language used for data manipulation and graphics. Originating in 1995, it is a popular tool among data scientists and analysts. It is the open-source version of the S language widely used for research in statistics. According to data scientists, R is one of the easier languages to learn, as there are numerous packages and guides available for users.
2. Python
Python is another widely used language among data scientists, created by Dutch programmer Guido van Rossum. It’s a general-purpose programming language, focusing on readability and simplicity. If you are not a programmer but are looking to learn, this is a great language to start with. It’s easier than other general-purpose languages.
3. Keras
Keras is a deep learning library written in Python. It runs on top of TensorFlow, allowing for fast experimentation. Keras was developed to make building deep learning models easier and to help users treat their data intelligently and efficiently.
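A minimal sketch of defining a small Keras model, assuming TensorFlow is installed; the layer sizes and the binary-classification setup are arbitrary illustrative choices.

```python
from tensorflow import keras

# A tiny feed-forward network for a binary classification task
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```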
4. SAS
It is one of those data science tools which are specifically designed for statistical operations. SAS is closed-source proprietary software used by large organizations to analyze data. SAS uses the base SAS programming language for performing statistical modeling. It is widely used by professionals and companies working on reliable commercial software.
5. Apache Spark
Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R. Spark provides an optimized engine that supports general execution graphs. It also has abundant high-level tools for structured data processing, machine learning, graph processing, and streaming. Spark can either run alone or on an existing cluster manager.
6. MATLAB
MATLAB is a multi-paradigm numerical computing environment that is widely used across scientific disciplines. In Data Science, MATLAB is used for simulating neural networks and fuzzy logic.
7. Jupyter
Project Jupyter is an open-source tool based on IPython that helps developers build open-source software and experience interactive computing. Jupyter supports multiple languages like Julia, Python, and R. It is a web-application tool used for writing live code, visualizations, and presentations. Jupyter is a widely popular tool designed to address the requirements of Data Science.
8. Matplotlib
Matplotlib is a plotting and visualization library developed for Python. It is one of the most popular tools for generating graphs from analyzed data and is mainly used for plotting complex graphs.
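A minimal Matplotlib sketch; the monthly sales figures are invented.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]          # invented example values

plt.plot(months, sales, marker="o")   # line chart of sales per month
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Monthly sales")
plt.show()
```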
9. Scikit-learn
Scikit-learn is a Python library used for implementing machine learning algorithms. It is a simple, easy-to-use tool that is widely used for analysis and data science.
10. TensorFlow
TensorFlow has become a standard tool for Machine Learning. It is widely used for advanced machine learning algorithms like Deep Learning. Developers named TensorFlow after tensors, which are multidimensional arrays. It is an open-source and ever-evolving toolkit known for its performance and high computational abilities.
Knowledge and skills for data science professionals
Data science is a multi-disciplinary field that requires a combination of knowledge and technical skills.
Professionals in this field are expected to be proficient in areas such as statistics, programming, machine learning, and data visualization. Here’s a comprehensive breakdown of the essential knowledge and skills for data science professionals:
2. Programming Skills
Python: The most widely used programming language in data science for its libraries, simplicity, and flexibility. Libraries like Pandas, NumPy, and Scikit-learn are essential.
R: Another popular language for statistical analysis and data visualization, often used in
academia and research.
SQL: Strong understanding of SQL to query, manipulate, and analyze data stored in relational
databases.
3. Data Wrangling & Preprocessing
Data Cleaning: Handling missing values, duplicates, inconsistent data, and errors.
Feature Engineering: Creating new features, transforming variables, encoding categorical data,
and scaling features.
Data Transformation: Manipulating and transforming data into a format suitable for analysis,
such as reshaping data or aggregating features.
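A minimal feature-engineering sketch with pandas; the city and salary columns are invented examples.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi"],
    "salary": [40000, 65000, 52000],
})

# Encode the categorical column as one-hot indicator features
df = pd.get_dummies(df, columns=["city"])

# Create a new feature by transforming an existing variable
df["salary_in_lakhs"] = df["salary"] / 100_000

# Scale the salary feature to the 0-1 range
df["salary_scaled"] = (df["salary"] - df["salary"].min()) / (df["salary"].max() - df["salary"].min())
print(df)
```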
Hadoop & Spark: Experience with distributed computing frameworks for processing large
datasets across multiple machines.
NoSQL Databases: Knowledge of non-relational databases such as MongoDB and Cassandra,
which are often used for unstructured or semi-structured data.
5. Data Visualization & Communication
Data Visualization Tools: Proficiency in creating visualizations using tools like Matplotlib,
Seaborn, Plotly, or libraries in R (ggplot2).
Storytelling with Data: Ability to translate technical findings into clear, understandable
insights for decision-makers through storytelling.
7. Cloud Computing
Cloud Platforms: Knowledge of cloud services like AWS, Google Cloud Platform, or
Microsoft Azure for hosting data and running models at scale.
Data Pipelines & Automation: Familiarity with tools like Apache Airflow, AWS Lambda, or
Google Cloud Functions for automating workflows.
8. Soft Skills
Problem-Solving: Ability to define problems, choose the right methodologies, and implement
practical solutions.
Collaboration: Ability to work in teams, communicate effectively with stakeholders, and
present data-driven insights to non-technical audiences.
Time Management: Managing multiple projects and deadlines, especially in complex or large-
scale data analysis tasks.
9. Ethics & Privacy
Data Privacy: Understanding of data privacy laws (GDPR, CCPA) and ethical considerations
in handling sensitive information.
Fairness & Bias: Awareness of biases in algorithms and ensuring fairness and transparency in
model development and deployment.
Statistical and Mathematical Reasoning
Statistical and mathematical reasoning are foundational to data science because they guide how data is
analyzed, interpreted, and modeled to derive meaningful insights and predictions. These areas ensure
that data scientists approach problems rigorously and avoid drawing misleading conclusions. Here’s an
overview of how statistical and mathematical reasoning are used in data science:
Statistical reasoning provides the tools for making inferences from data, testing hypotheses, and
understanding the uncertainty involved in conclusions.
Key Components of Statistical Reasoning:
1. Probability Theory:
o Basic Concepts: Understanding the likelihood of an event or outcome occurring, which
is critical for making predictions, especially in uncertain environments.
o Distributions: Knowledge of common probability distributions (e.g., Normal, Poisson,
Binomial) is essential for modeling and understanding the behavior of different datasets.
o Bayesian Inference: A statistical method that updates the probability for a hypothesis as
more evidence becomes available, central to machine learning, especially in Bayesian
models.
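As a small worked illustration of Bayesian updating, here is Bayes' theorem applied to a diagnostic-test example; the prevalence and test-accuracy numbers are invented.

```python
# Bayes' theorem: P(disease | positive test)
p_disease = 0.01            # prior: 1% prevalence (invented)
p_pos_given_disease = 0.95  # test sensitivity (invented)
p_pos_given_healthy = 0.05  # false-positive rate (invented)

p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

posterior = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {posterior:.3f}")  # roughly 0.16
```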
2. Hypothesis Testing:
o Null and Alternative Hypotheses: Formulating hypotheses that can be tested with data.
o p-values: Determining the statistical significance of results by comparing the p-value against a significance threshold (e.g., 0.05). A p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.
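A minimal hypothesis-testing sketch with SciPy's two-sample t-test; the two samples are invented.

```python
from scipy import stats

# Invented samples: page-load times for two versions of a web page
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
group_b = [11.2, 11.5, 11.1, 11.4, 11.3, 11.6]

# Two-sample t-test: null hypothesis is that the means are equal
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Compare the p-value against the 0.05 significance threshold
print("significant" if p_value < 0.05 else "not significant")
```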
4. Regression Analysis:
o Linear Regression: Modeling relationships between variables to make predictions.
o Multiple Regression: Extending simple linear regression to include multiple predictor
variables.
o Assumptions in Regression: Understanding assumptions like linearity,
homoscedasticity (constant variance of errors), and independence of errors.
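A minimal multiple-regression sketch using NumPy's least-squares solver; the house-price data values are invented.

```python
import numpy as np

# Invented data: predict house price from size (sq. m) and age (years)
X = np.array([[50, 10], [70, 5], [90, 20], [120, 2], [60, 15]], dtype=float)
y = np.array([150, 230, 260, 400, 180], dtype=float)

# Add an intercept column and solve the least-squares problem
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("intercept and coefficients:", coef)

# Predict the price of a new 80 sq. m, 8-year-old house
print("prediction:", np.array([1, 80, 8]) @ coef)
```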
5. Model Evaluation:
o Accuracy, Precision, Recall, F1-Score: Evaluating classification models based on
various metrics that assess the balance between correct predictions and false
positives/negatives.
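A minimal sketch of these classification metrics with scikit-learn; the label vectors are invented.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # invented ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # invented model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))
```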
Mathematics provides the underlying structures for many data science algorithms, helping to optimize models, solve problems, and make accurate predictions.
Key Components of Mathematical Reasoning:
1. Linear Algebra:
o Matrices and Vectors: Key mathematical tools for manipulating large datasets. Many
machine learning algorithms (like linear regression, PCA) rely heavily on matrix
operations for calculations.
o Eigenvalues and Eigenvectors: Used in methods like Principal Component Analysis
(PCA) to reduce the dimensionality of the data and identify key features.
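A minimal PCA sketch with scikit-learn, which relies on exactly these eigendecomposition ideas; the three-feature data set is randomly generated for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Invented data with three features, two of which are strongly correlated
rng = np.random.default_rng(0)
x = rng.normal(size=100)
data = np.column_stack([x,
                        2 * x + rng.normal(scale=0.1, size=100),
                        rng.normal(size=100)])

# Reduce the 3 features to the 2 directions of greatest variance
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)
print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", reduced.shape)   # (100, 2)
```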
2. Calculus:
o Optimization: Most machine learning algorithms, such as gradient descent, require
optimization techniques to minimize or maximize a function
o Gradient Descent: An iterative optimization algorithm for finding the minimum of a
function. It's central to many machine learning algorithms, particularly deep learning.
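A minimal gradient-descent sketch that minimizes the simple quadratic function f(x) = (x - 3)^2; the learning rate and starting point are arbitrary choices.

```python
# Minimize f(x) = (x - 3)**2 whose derivative is f'(x) = 2 * (x - 3)
x = 0.0             # arbitrary starting point
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (x - 3)
    x = x - learning_rate * gradient   # move against the gradient

print(x)  # converges towards the minimum at x = 3
```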
3. Optimization Theory:
o Convex Optimization: Understanding convex functions and optimization problems is
key to many machine learning models where the objective function needs to be
minimized.
o Constrained Optimization: In scenarios where constraints (e.g., boundaries on
variables) must be satisfied during optimization.
5. Graph Theory:
o Graph-based Models: Many machine learning models use graph-based structures (e.g.,
in recommendation systems or social network analysis).
o Networks and Trees: Graph theory helps understand hierarchical data structures, such
as decision trees or network-based algorithms (e.g., PageRank).
6. Information Theory:
o Entropy: Measures the uncertainty or disorder in a dataset. Entropy is used in decision
trees to determine the best splits.
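A minimal sketch of computing entropy for a class distribution; the class proportions are invented.

```python
import math

# Invented class proportions at a candidate decision-tree split
p = [0.5, 0.25, 0.25]

# Shannon entropy: H = -sum(p_i * log2(p_i))
entropy = -sum(pi * math.log2(pi) for pi in p if pi > 0)
print(entropy)  # 1.5 bits
```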
Machine learning
Machine learning is a subset of Artificial Intelligence (AI) that enables computers to learn from data
and make predictions without being explicitly programmed. If you're new to this field, this tutorial will
provide you with a comprehensive understanding of machine learning, its types, algorithms, tools, and
practical applications.
Supervised Learning: Trains models on labeled data to predict or classify new, unseen data.
Unsupervised Learning: Finds hidden patterns or structures in unlabeled data.
Reinforcement Learning: Learns through trial and error to maximize rewards, ideal for decision-making tasks.
1. Supervised Learning
In supervised learning, the algorithm learns from labeled training data, which means each training
example is paired with a correct output (label). The goal is to make predictions based on this learned
relationship.
Common Algorithms:
Linear Regression: Used for predicting continuous values. It models the relationship between a
dependent variable and one or more independent variables.
Logistic Regression: Used for binary classification tasks. It estimates the probability that an
input belongs to a certain class.
Decision Trees: A tree-like model that makes decisions based on feature values, where each
internal node represents a feature, and each leaf node represents an output label.
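A minimal supervised-learning sketch using scikit-learn's LogisticRegression; the tiny labeled dataset is invented.

```python
from sklearn.linear_model import LogisticRegression

# Invented labeled data: [hours studied, hours slept] -> passed exam (1) or not (0)
X = [[1, 4], [2, 5], [3, 6], [8, 7], [9, 8], [10, 7]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression().fit(X, y)
print(clf.predict([[4, 6]]))        # predicted class for a new student
print(clf.predict_proba([[4, 6]]))  # estimated class probabilities
```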
2. Unsupervised Learning
Unsupervised learning involves learning from unlabeled data. In this case, the algorithm tries to find
hidden patterns or intrinsic structures within the input data.
Common Algorithms:
K-Means Clustering: A method for partitioning data into k clusters by minimizing the variance
within each cluster.
Hierarchical Clustering: Builds a tree of clusters based on similarity.
Principal Component Analysis (PCA): A technique for reducing the dimensionality of the
data by transforming it into a set of linearly uncorrelated variables, called principal components.
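A minimal unsupervised-learning sketch using k-means clustering from scikit-learn; the 2-D points are invented.

```python
from sklearn.cluster import KMeans

# Invented 2-D points that form two loose groups
points = [[1, 2], [1, 3], [2, 2], [8, 8], [9, 9], [8, 9]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # coordinates of the two cluster centres
```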
Applications of Unsupervised Learning: customer segmentation, anomaly detection, recommendation systems, and dimensionality reduction for visualization.
3. Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by
interacting with an environment. The goal is to maximize cumulative rewards over time by learning
from the consequences of its actions.
Common Algorithms:
Q-learning: A model-free algorithm that uses the Q-table to learn the optimal policy.
Deep Q-Networks (DQN): A combination of Q-learning and deep learning that uses neural
networks to approximate the Q-value function.
Monte Carlo Tree Search (MCTS): Used in decision-making tasks like game-playing (e.g.,
AlphaGo).
Policy Gradient Methods: Directly learn the policy function without needing a value function.
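A minimal sketch of the tabular Q-learning update rule on a toy problem; the state/action sizes, rewards, and hyperparameters are invented.

```python
import numpy as np

n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))      # Q-table of action values
alpha, gamma = 0.1, 0.9                  # learning rate and discount factor

# One invented experience tuple: (state, action, reward, next_state)
state, action, reward, next_state = 0, 1, 5.0, 2

# Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a')
Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
print(Q)
```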
Applications of Reinforcement Learning:
Game playing: Teaching agents to play games like chess, Go, or video games.
Robotics: Enabling robots to learn actions through trial and error in real-world environments.
Autonomous vehicles: Helping self-driving cars learn optimal driving strategies.