Oral Answers DSBDA


A1.

1. Standard methods of data collection include:

- Surveys and Questionnaires: Gathering information from individuals through predefined questions.
- Interviews: Conducting face-to-face or remote conversations to collect data.
- Observations: Directly observing and recording behaviors or events.
- Existing Data: Using data that has already been collected for another purpose.
- Sensors: Utilizing devices to collect data automatically, such as in IoT applications.

2. The need for data preprocessing arises from the fact that real-world data is
often incomplete, inconsistent, or contains errors. Preprocessing helps in:
- Handling missing data.
- Removing noise from data.
- Standardizing data formats.
- Addressing inconsistencies and errors in data.
- Preparing data for specific analysis tasks.
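
A minimal sketch of a few of these steps using pandas; the column names and fill strategy are hypothetical, for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value and inconsistent formatting
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "city": ["Pune", "pune", "Mumbai", None],
})

df["age"] = df["age"].fillna(df["age"].mean())  # handle missing numeric data
df["city"] = df["city"].str.lower()             # standardize text format
df = df.dropna(subset=["city"])                 # drop rows still missing a city
print(df)
```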

3. Comparing the performance of data analysis using random data and preprocessed data depends on the context and goals of the analysis. In general, preprocessed data is expected to perform better because it has been cleaned and prepared for analysis, which can lead to more accurate and reliable results compared to random, raw data.

4. Methods of data reduction include:

- Sampling: Selecting a representative subset of the data for analysis.
- Aggregation: Combining multiple data points into a single representation.
- Dimensionality Reduction: Reducing the number of variables in the dataset by extracting important features or creating new ones.
- Data Cube Aggregation: Summarizing data in multidimensional arrays for analysis.
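
A minimal sketch of two of these techniques, sampling and dimensionality reduction, assuming pandas and scikit-learn are available (the data is randomly generated for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical dataset: 1,000 rows and 5 numeric columns
df = pd.DataFrame(np.random.rand(1000, 5), columns=list("abcde"))

# Sampling: keep a 10% representative subset of the rows
sample = df.sample(frac=0.1, random_state=42)

# Dimensionality reduction: project the 5 columns onto 2 principal components
reduced = PCA(n_components=2).fit_transform(df)

print(sample.shape, reduced.shape)  # (100, 5) (1000, 2)
```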

5. Ways to check duplicates in data include:

- Using built-in functions: Many programming languages and tools have functions to detect and remove duplicates, such as Python's `pandas` library.
- Manual inspection: For smaller datasets, visually inspecting the data to identify duplicates.
- Hashing: Calculating hashes of data points and comparing them to identify duplicates.
- Database queries: Using SQL queries to find duplicates in database tables.
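
A minimal sketch of duplicate detection with pandas, using a small hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical dataset containing one repeated row
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Asha", "Meena"],
    "score": [85, 90, 85, 78],
})

print(df.duplicated())        # True for rows that repeat an earlier row
print(df.drop_duplicates())   # keeps only the first occurrence of each row
```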

A3.
1. **Measures of data dispersion** quantify the spread or variability of the data points in a dataset. They include:

- Range: The difference between the maximum and minimum values in a dataset.
- Variance: The average of the squared differences from the Mean.
- Standard Deviation: The square root of the variance, providing a measure of how spread out the values are around the Mean.
- Interquartile Range (IQR): The range of the middle 50% of the data, calculated as the difference between the 75th and 25th percentiles.
- Mean Absolute Deviation (MAD): The average of the absolute differences from the Mean.
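
A minimal sketch computing these measures with NumPy on hypothetical sample values:

```python
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42])  # hypothetical sample values

data_range = data.max() - data.min()      # range
variance = data.var()                     # average squared deviation from the mean
std_dev = data.std()                      # square root of the variance
q75, q25 = np.percentile(data, [75, 25])
iqr = q75 - q25                           # interquartile range
mad = np.abs(data - data.mean()).mean()   # mean absolute deviation

print(data_range, variance, std_dev, iqr, mad)
```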

2. **Mean, Mode, and Median** are different measures of central tendency:

- **Mean**: The arithmetic average of a set of values, calculated by summing all values and dividing by the number of values.
- **Mode**: The value that appears most frequently in a dataset.
- **Median**: The middle value in a dataset when the values are arranged in ascending order. If there is an even number of values, the median is the average of the two middle values.
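
For example, with Python's built-in `statistics` module on hypothetical values:

```python
import statistics

values = [2, 3, 3, 5, 7, 10]  # hypothetical values

print(statistics.mean(values))    # arithmetic average: 5
print(statistics.mode(values))    # most frequent value: 3
print(statistics.median(values))  # average of the two middle values: 4.0
```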

3. **Standard Deviation (Std. Deviation)**: A measure of the amount of variation or dispersion of a set of values. It is calculated as the square root of the variance and provides a standardized way to understand how spread out the values in a dataset are around the Mean.

4. **Standard Error**: The standard deviation of the sampling distribution of a statistic, such as the Mean. It measures the accuracy of the estimate of the population parameter and is calculated as the standard deviation of the sample divided by the square root of the sample size.
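In symbols, for a sample with standard deviation \( s \) and size \( n \), the standard error of the mean is \( SE = \frac{s}{\sqrt{n}} \).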

5. **Importance of Measures of Data Dispersion**:

- They provide insights into the variability and spread of data points, which is
crucial for understanding the reliability and stability of the data.
- They help in comparing and interpreting data sets, especially in identifying
outliers and understanding the distribution of data.
- They are used in statistical analysis to make inferences and draw conclusions
about populations based on sample data, providing a basis for hypothesis testing
and estimation.

A4.
1. **Use of Data Regression**: Data regression is used to model the
relationship between a dependent variable and one or more independent
variables. It helps in understanding how the value of the dependent variable
changes when one or more independent variables are varied. Regression
analysis is used for prediction, forecasting, and understanding the relationships
between variables in various fields such as economics, biology, engineering,
and social sciences.

2. **Special Characteristic of Linear Regression**: Linear regression assumes that there is a linear relationship between the independent variable(s) and the dependent variable. This means that the change in the dependent variable is proportional to the change in the independent variable(s). The model equation for simple linear regression is of the form \( y = mx + b \), where \( y \) is the dependent variable, \( x \) is the independent variable, \( m \) is the slope of the line, and \( b \) is the y-intercept.

3. **Example Application of Linear Regression**: One example of the application of linear regression is in predicting house prices based on features such as size (in square feet), number of bedrooms, and number of bathrooms. In this case, the dependent variable is the house price, and the independent variables are the size, number of bedrooms, and number of bathrooms. By fitting a linear regression model to historical data on house prices and these features, you can predict the price of a new house based on its size, number of bedrooms, and number of bathrooms.
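
A minimal sketch of this idea with scikit-learn, using small made-up numbers purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: [size in sq ft, bedrooms, bathrooms] and prices
X = np.array([
    [1400, 3, 2],
    [1600, 3, 2],
    [1700, 4, 3],
    [1100, 2, 1],
    [2350, 4, 3],
])
y = np.array([245000, 312000, 279000, 199000, 405000])

model = LinearRegression().fit(X, y)

# Predicted price for a new 1500 sq ft house with 3 bedrooms and 2 bathrooms
print(model.predict(np.array([[1500, 3, 2]])))
print(model.coef_, model.intercept_)  # per-feature slopes and the intercept
```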

4. **Statistical Knowledge for Applying Linear Regression**: To apply linear regression, you need to have a basic understanding of statistics, including concepts such as correlation, covariance, and least squares estimation. You also need to understand how to interpret the coefficients of the regression equation and how to assess the goodness of fit of the model.

5. **RMSE (Root Mean Squared Error)**: RMSE is a metric used to evaluate the performance of a regression model. It is the square root of the average of the squared differences between the predicted values and the actual values. RMSE is expressed in the same units as the dependent variable and provides an indication of how well the model is able to predict the actual values. Lower RMSE values indicate better model performance, with 0 indicating a perfect fit.
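
In symbols, for \( n \) observations with actual values \( y_i \) and predictions \( \hat{y}_i \): \( RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \). A minimal sketch with hypothetical values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical predicted values

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)
```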

A6.

1. **Basic Principle of Naïve Bayes**:
Naïve Bayes is a probabilistic machine learning algorithm based on Bayes' theorem, which is used for classification. The basic principle is to calculate the probability of a data point belonging to a certain class based on the features of the data point. It assumes that the features are conditionally independent given the class, which is a strong assumption but simplifies the calculation.
2. **Advantages and Disadvantages of Naïve Bayes**:
- **Advantages**:
- Simple and easy to implement.
- Works well with large datasets.
- Computationally efficient.
- Can handle many features.
- Often performs well in practice, especially for text classification tasks.
- **Disadvantages**:
- Assumes independence of features, which may not always hold true.
- Requires a relatively large amount of training data to estimate the
parameters accurately.
- Can be sensitive to the presence of irrelevant features.

3. **Desired Characteristics of Data for Naïve Bayes**:
- The features should be independent of each other given the class.
- The features should be categorical or continuous, but they are often discretized for Naïve Bayes.
- Sufficient training data should be available to estimate the probabilities accurately.
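
A minimal sketch of a Naïve Bayes classifier using scikit-learn's Gaussian variant on a built-in dataset (the split ratio and random seed are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Small built-in dataset split into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Gaussian Naive Bayes models each continuous feature per class with a normal distribution
model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out test set
```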

4. **Confusion Matrix**:
A confusion matrix is a table that is often used to describe the performance of
a classification model on a set of test data for which the true values are known.
It allows the visualization of the performance of an algorithm and helps in
understanding how well the algorithm is performing.

5. **TP, FP, TN, FN**:
- **True Positive (TP)**: The number of correctly predicted positive instances (e.g., correctly identified spam emails).
- **False Positive (FP)**: The number of incorrectly predicted positive instances (e.g., non-spam emails classified as spam).
- **True Negative (TN)**: The number of correctly predicted negative instances (e.g., correctly identified non-spam emails).
- **False Negative (FN)**: The number of incorrectly predicted negative instances (e.g., spam emails classified as non-spam).
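
A minimal sketch extracting these four counts with scikit-learn, using hypothetical labels (1 = spam, 0 = not spam):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
```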

A7.

1. **Text Analysis**:
Text analysis, also known as text mining or text analytics, is the process of
deriving meaningful information from natural language text. It involves
extracting patterns, trends, and insights from unstructured text data, which can
be used for various applications such as sentiment analysis, topic modeling, and
document categorization.

2. **Applications of Text Analysis**:
- **Sentiment Analysis**: Sentiment analysis is a text analysis technique that involves determining the sentiment or emotion expressed in a piece of text. It is commonly used in social media monitoring, customer feedback analysis, and brand reputation management.
- **Information Retrieval**: Information retrieval involves finding relevant documents or information within a large collection of text data. Search engines use text analysis techniques to index and retrieve relevant documents in response to user queries.

3. **Text Preprocessing**:
Text preprocessing is the process of cleaning and preparing text data for
analysis. It involves several steps, including:
- **Lowercasing**: Converting all text to lowercase to ensure consistency.
- **Tokenization**: Splitting text into individual words or tokens.
- **Removing Stopwords**: Removing common words (e.g., "the," "is,"
"and") that do not contribute much meaning.
- **Stemming or Lemmatization**: Reducing words to their base or root form
(e.g., "running" to "run").
- **Removing Special Characters**: Removing non-alphanumeric characters
like punctuation marks.
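
A minimal sketch of these steps, assuming the NLTK library is installed (recent NLTK versions may additionally require the punkt_tab resource for tokenization):

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer model and the stopword list
nltk.download("punkt")
nltk.download("stopwords")

text = "The runners were running quickly, and the crowd was cheering!"

text = text.lower()                      # lowercasing
text = re.sub(r"[^a-z0-9\s]", "", text)  # remove special characters
tokens = word_tokenize(text)             # tokenization
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
tokens = [PorterStemmer().stem(t) for t in tokens]   # stemming
print(tokens)
```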

4. **POS Tagging**:
POS tagging (Part-of-Speech tagging) is the process of assigning grammatical
tags to words in a text based on their role and context. POS tags indicate
whether a word is a noun, verb, adjective, etc., and can help in understanding
the syntactic structure of a sentence.
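
A minimal sketch with NLTK (the tagger resource name varies slightly across NLTK versions):

```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))  # (word, tag) pairs such as ('fox', 'NN') and ('jumps', 'VBZ')
```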

5. **TF/IDF**:
TF/IDF (Term Frequency-Inverse Document Frequency) is a statistical
measure used to evaluate the importance of a word in a document relative to a
collection of documents (corpus). It is calculated as the product of two terms:
- **Term Frequency (TF)**: The frequency of a term (word) in a document,
normalized by the total number of terms in the document. It reflects how often a
term occurs in a document.
- **Inverse Document Frequency (IDF)**: The logarithmically scaled inverse
fraction of the documents that contain the term. It measures the rarity of a term
across the documents in the corpus.
TF/IDF is often used in information retrieval and text mining to rank the
importance of terms in a document.
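
In symbols, a common formulation is \( \text{tfidf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)} \), where \( N \) is the number of documents in the corpus and \( \text{df}(t) \) is the number of documents containing term \( t \). A minimal sketch with scikit-learn, which uses a smoothed variant of IDF, on a tiny made-up corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny hypothetical corpus of three documents
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make great pets",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Rows are documents, columns are terms, values are TF-IDF weights
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```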

A8.
