What is data science? Compare data science and information science.
Data science is an interdisciplinary field that combines statistics, programming, and domain
knowledge to extract insights from data. A typical data science workflow involves the
following steps:
1. Data Collection: Gathering data from different sources (databases, APIs, sensors,
etc.).
2. Data Cleaning and Preprocessing: Preparing raw data for analysis by handling
missing values, outliers, and inconsistencies.
3. Exploratory Data Analysis (EDA): Analyzing the data to find patterns, trends, or
anomalies.
4. Machine Learning & Modeling: Building predictive models to make forecasts or
decisions based on historical data.
5. Data Visualization: Presenting data insights visually to make it easier for stakeholders
to understand.
The 5 Vs of Big Data:
1. Volume – Refers to the vast amount of data generated from various sources like
social media, sensors, and transactions.
2. Velocity – Describes the speed at which data is generated, processed, and analyzed
in real-time or near real-time.
3. Variety – Refers to the different forms data takes, such as structured, semi-structured,
and unstructured data (text, images, logs, etc.).
4. Veracity – Refers to the quality, accuracy, and trustworthiness of the data.
5. Value – Represents the usefulness and insights gained from big data to drive
business and operational improvements.
Data wrangling, also known as data munging, is the process of cleaning, transforming,
and organizing raw data into a structured format that is suitable for analysis. Here’s a
breakdown of what it involves and why it’s needed:
Why it's needed:
Quality Assurance: Raw data is often messy and unstructured, which can lead to
unreliable analysis. Data wrangling ensures accuracy and consistency.
Efficiency: Preprocessing data reduces the time and effort needed for analysis,
enabling quicker insights and responses.
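A minimal sketch of what wrangling can look like in practice, assuming pandas is available
(the column names and values here are invented for illustration):

import pandas as pd

# Raw, messy data: a missing price, an obvious outlier, inconsistent city labels
raw = pd.DataFrame({
    "price": [10.0, None, 999.0, 12.5],
    "city": ["NY", "ny ", "NY", "NY"],
})

clean = raw.copy()
clean["city"] = clean["city"].str.strip().str.upper()            # fix inconsistencies
clean["price"] = clean["price"].fillna(clean["price"].median())  # handle missing values
clean = clean[clean["price"] < 100]                              # drop the outlier row
print(clean)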
Data Science Life Cycle:
1. Business Understanding
It involves understanding the problem and defining the objective of the project.
2. Data Exploration
In this step, the available data is examined to understand its structure and quality.
It includes initial data analysis, missing value treatment, and outlier detection.
3. Data Mining
This step involves gathering, extracting, and transforming data from various sources.
4. Predictive Modeling
This step involves building and evaluating models that make predictions based on the
prepared data.
5. Data Cleaning
This step involves cleaning and preprocessing the data for accurate analysis.
6. Feature Engineering
This step focuses on creating new features from existing data to improve model
performance.
Data Discretization:
Data discretization is the process of converting continuous data (numerical data) into
discrete categories or intervals. This process is often used in data preprocessing to
make data easier to analyze or to convert it into a format suitable for machine learning
algorithms, particularly those that require categorical input. For example, continuous
data like age or income can be grouped into intervals or bins (e.g., "18-25", "26-35",
etc.), making it easier to apply certain algorithms.
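As a minimal sketch of discretization, assuming pandas is available (the bin edges and
labels are illustrative):

import pandas as pd

ages = pd.Series([19, 23, 31, 45, 52, 67])

# pd.cut converts continuous values into labeled discrete intervals (bins)
age_groups = pd.cut(
    ages,
    bins=[18, 25, 35, 50, 70],
    labels=["18-25", "26-35", "36-50", "51-70"],
)
print(age_groups)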
Types of Regression (a short code sketch follows the list):
1. Linear Regression:
Description: Linear regression is the simplest form of regression, where the
relationship between the dependent variable and independent variables is
assumed to be linear. It tries to fit a straight line (y = mx + b) that best
represents the relationship between the variables.
Example: Predicting a person's salary based on their years of experience. The
relationship between salary and experience is assumed to be linear.
2. Multiple Linear Regression:
Description: Multiple linear regression is an extension of linear regression that
involves more than one independent variable. It models the relationship
between two or more predictors and a dependent variable.
Example: Predicting house prices based on features such as the number of
bedrooms, square footage, location, and age of the house.
3. Ridge Regression (L2 Regularization):
Description: Ridge regression is a type of linear regression that adds a penalty
term to the cost function to shrink the coefficients and reduce model
complexity. This helps to avoid overfitting, particularly in cases where the
independent variables are highly correlated.
Example: Predicting product sales while ensuring that no single feature (such as
price) has too much influence over the prediction, particularly when there is
multicollinearity between features.
4. Lasso Regression (L1 Regularization):
Description: Lasso regression, similar to ridge regression, is another form of
regularized linear regression. It adds a penalty to the absolute values of the
coefficients, which can lead to some coefficients being reduced to zero. This
results in automatic feature selection.
Example: Predicting customer churn while automatically selecting the most
important features (such as customer age or account type) and removing
irrelevant features.
5. Logistic Regression:
Description: Logistic regression is used when the dependent variable is
categorical, often for binary outcomes (0 or 1, yes or no). It models the
probability of an event occurring, using a logistic (sigmoid) function to squeeze
the predicted values between 0 and 1.
Example: Predicting whether a customer will buy a product (1) or not (0) based
on factors like age, income, and previous purchases.
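A short sketch of the regression types above using scikit-learn. The data is synthetic and
the alpha values are arbitrary; this assumes scikit-learn and NumPy are installed:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic example: salary predicted from years of experience
X = rng.uniform(0, 20, size=(100, 1))                    # years of experience
y = 30000 + 2500 * X.ravel() + rng.normal(0, 5000, 100)  # salary with noise

linear = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty shrinks coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty can zero coefficients out
print(linear.coef_, ridge.coef_, lasso.coef_)

# Logistic regression needs a categorical target: buy (1) vs. not buy (0)
y_class = (X.ravel() > 10).astype(int)
logit = LogisticRegression().fit(X, y_class)
print(logit.predict_proba([[12.0]]))  # class probabilities at 12 years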
Dimensionality reduction is the process of reducing the number of input variables (features)
in a dataset while preserving the essential information. This is achieved by transforming the
data into a lower-dimensional space. Dimensionality reduction techniques are particularly
useful when dealing with high-dimensional data (i.e., when the number of features is large), as
it can help improve the efficiency and effectiveness of data analysis or machine learning
algorithms.
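As a minimal sketch of one common technique, Principal Component Analysis (PCA),
assuming scikit-learn is installed (the data here is random and purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # 200 samples, 10 features (high-dimensional)

# Project the data onto the 2 directions of maximum variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (200, 2)
print(pca.explained_variance_ratio_)   # share of variance each component keeps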
Feature engineering is the process of using domain knowledge, creativity, and statistical
techniques to transform raw data into meaningful features that can improve the performance
of machine learning models. It's a crucial step in the data preprocessing pipeline because the
quality of the features directly influences the performance of the machine learning algorithm.
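A minimal sketch of feature engineering with pandas (all column names and values are
invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "total_spend": [120.0, 80.0, 300.0],
    "num_orders": [3, 2, 10],
    "signup_date": pd.to_datetime(["2021-01-15", "2022-06-01", "2020-11-20"]),
})

# Derived features that may be more informative than the raw columns
df["avg_order_value"] = df["total_spend"] / df["num_orders"]
df["tenure_days"] = (pd.Timestamp("2023-01-01") - df["signup_date"]).dt.days
print(df)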
Data Integration:
Data integration is the process of combining data from different sources into a unified view,
making it easier to analyze and derive insights. This is crucial when data is spread across
multiple systems, databases, or formats. The goal of data integration is to merge diverse
datasets into a cohesive dataset that allows for efficient querying and analysis.
1. Data Extraction: Extracting data from multiple heterogeneous sources like databases,
spreadsheets, or APIs.
2. Data Transformation: Cleaning and converting data into a consistent format (can include
normalization, standardization, and more).
3. Data Loading: Loading the transformed data into a target system or database for
analysis. Together, these three steps are often referred to as the ETL (Extract,
Transform, Load) process.
Example:
Merging customer records from a CRM database with order history from a sales system
into a single table keyed on customer ID gives analysts one unified view of each customer.
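A minimal ETL-style sketch of that example with pandas (the source names and columns
are hypothetical):

import pandas as pd

# Extract: data from two heterogeneous sources
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Cara"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [120.0, 80.0, 300.0]})

# Transform: aggregate order amounts per customer
spend = orders.groupby("customer_id", as_index=False)["amount"].sum()

# Load: a single unified view keyed on customer_id, ready for analysis
unified = customers.merge(spend, on="customer_id", how="left")
print(unified)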
Data Transformation:
Data transformation refers to the process of converting data from its original format into a
format that is suitable for analysis or machine learning. Transformation includes cleaning,
structuring, or enriching the data to make it consistent, relevant, and easier to analyze.
Example:
For a machine learning model predicting house prices, transforming the "Date of Sale" feature
by converting it into "Year of Sale" or creating a new "Season" feature (e.g., Winter, Spring) can
help improve the model's predictive power.
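A minimal sketch of that transformation in pandas (the dates are invented; the
month-to-season mapping assumes the Northern Hemisphere):

import pandas as pd

df = pd.DataFrame({"date_of_sale": pd.to_datetime(
    ["2021-01-10", "2021-04-22", "2021-07-30", "2021-10-05"])})

df["year_of_sale"] = df["date_of_sale"].dt.year

# Map each month to a season label
seasons = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Autumn", 10: "Autumn", 11: "Autumn"}
df["season"] = df["date_of_sale"].dt.month.map(seasons)
print(df)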
Tools & Techniques:
Information Science – Uses tools like Tableau, Power BI, SQL, and dashboards.
Data Science – Uses Python, R, machine learning, and AI.