Oral Answers DSBDA


A1.

1. Standard methods of data collection include:

- Surveys and Questionnaires: Gathering information from individuals through predefined questions.
- Interviews: Conducting face-to-face or remote conversations to collect data.
- Observations: Directly observing and recording behaviors or events.
- Existing Data: Using data that has already been collected for another purpose.
- Sensors: Utilizing devices to collect data automatically, such as in IoT applications.

2. The need for data preprocessing arises from the fact that real-world data is
often incomplete, inconsistent, or contains errors. Preprocessing helps in:
- Handling missing data.
- Removing noise from data.
- Standardizing data formats.
- Addressing inconsistencies and errors in data.
- Preparing data for specific analysis tasks.
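
A minimal sketch of a few of these steps using pandas; the column names and fill strategy are hypothetical, for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value and inconsistent formatting
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "city": ["Pune", "pune", "Mumbai", None],
})

df["age"] = df["age"].fillna(df["age"].mean())  # handle missing numeric data
df["city"] = df["city"].str.lower()             # standardize text format
df = df.dropna(subset=["city"])                 # drop rows still missing a city
print(df)
```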

3. Comparing the performance of data analysis using random data and preprocessed data depends on the context and goals of the analysis. In general, preprocessed data is expected to perform better because it has been cleaned and prepared for analysis, which can lead to more accurate and reliable results compared to random, raw data.

4. Methods of data reduction include:

- Sampling: Selecting a representative subset of the data for analysis.
- Aggregation: Combining multiple data points into a single representation.
- Dimensionality Reduction: Reducing the number of variables in the dataset by extracting important features or creating new ones.
- Data Cube Aggregation: Summarizing data in multidimensional arrays for analysis.
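
A minimal sketch of two of these techniques, sampling and dimensionality reduction, assuming pandas and scikit-learn are available (the data is randomly generated for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical dataset: 1,000 rows and 5 numeric columns
df = pd.DataFrame(np.random.rand(1000, 5), columns=list("abcde"))

# Sampling: keep a 10% representative subset of the rows
sample = df.sample(frac=0.1, random_state=42)

# Dimensionality reduction: project the 5 columns onto 2 principal components
reduced = PCA(n_components=2).fit_transform(df)

print(sample.shape, reduced.shape)  # (100, 5) (1000, 2)
```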

5. Ways to check duplicates in data include:

- Using built-in functions: Many programming languages and tools have functions to detect and remove duplicates, such as Python's `pandas` library.
- Manual inspection: For smaller datasets, visually inspecting the data to identify duplicates.
- Hashing: Calculating hashes of data points and comparing them to identify duplicates.
- Database queries: Using SQL queries to find duplicates in database tables.
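
A minimal sketch of duplicate detection with pandas, using a small hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical dataset containing one repeated row
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Asha", "Meena"],
    "score": [85, 90, 85, 78],
})

print(df.duplicated())        # True for rows that repeat an earlier row
print(df.drop_duplicates())   # keeps only the first occurrence of each row
```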

A3.
1. **Measures of data dispersion** quantify the spread or variability of the data points in a dataset. They include:

- Range: The difference between the maximum and minimum values in a dataset.
- Variance: The average of the squared differences from the Mean.
- Standard Deviation: The square root of the variance, providing a measure of how spread out the values are around the Mean.
- Interquartile Range (IQR): The range of the middle 50% of the data, calculated as the difference between the 75th and 25th percentiles.
- Mean Absolute Deviation (MAD): The average of the absolute differences from the Mean.
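
A minimal sketch computing these measures with NumPy on hypothetical sample values:

```python
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42])  # hypothetical sample values

data_range = data.max() - data.min()      # range
variance = data.var()                     # average squared deviation from the mean
std_dev = data.std()                      # square root of the variance
q75, q25 = np.percentile(data, [75, 25])
iqr = q75 - q25                           # interquartile range
mad = np.abs(data - data.mean()).mean()   # mean absolute deviation

print(data_range, variance, std_dev, iqr, mad)
```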

2. **Mean, Mode, and Median** are different measures of central tendency:

- **Mean**: The arithmetic average of a set of values, calculated by summing all values and dividing by the number of values.
- **Mode**: The value that appears most frequently in a dataset.
- **Median**: The middle value in a dataset when the values are arranged in ascending order. If there is an even number of values, the median is the average of the two middle values.
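
For example, with Python's built-in `statistics` module on hypothetical values:

```python
import statistics

values = [2, 3, 3, 5, 7, 10]  # hypothetical values

print(statistics.mean(values))    # arithmetic average: 5
print(statistics.mode(values))    # most frequent value: 3
print(statistics.median(values))  # average of the two middle values: 4.0
```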

3. **Standard Deviation (Std. Deviation)**: A measure of the amount of variation or dispersion of a set of values. It is calculated as the square root of the variance and provides a standardized way to understand how spread out the values in a dataset are around the Mean.

4. **Standard Error**: The standard deviation of the sampling distribution of a statistic, such as the Mean. It measures the accuracy of the estimate of the population parameter and is calculated as the standard deviation of the sample divided by the square root of the sample size.
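In symbols, for a sample with standard deviation \( s \) and size \( n \), the standard error of the mean is \( SE = \frac{s}{\sqrt{n}} \).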

5. **Importance of Measures of Data Dispersion**:

- They provide insights into the variability and spread of data points, which is
crucial for understanding the reliability and stability of the data.
- They help in comparing and interpreting data sets, especially in identifying
outliers and understanding the distribution of data.
- They are used in statistical analysis to make inferences and draw conclusions
about populations based on sample data, providing a basis for hypothesis testing
and estimation.

A4.
1. **Use of Data Regression**: Data regression is used to model the
relationship between a dependent variable and one or more independent
variables. It helps in understanding how the value of the dependent variable
changes when one or more independent variables are varied. Regression
analysis is used for prediction, forecasting, and understanding the relationships
between variables in various fields such as economics, biology, engineering,
and social sciences.

2. **Special Characteristic of Linear Regression**: Linear regression assumes that there is a linear relationship between the independent variable(s) and the dependent variable. This means that the change in the dependent variable is proportional to the change in the independent variable(s). The model equation for simple linear regression is of the form \( y = mx + b \), where \( y \) is the dependent variable, \( x \) is the independent variable, \( m \) is the slope of the line, and \( b \) is the y-intercept.

3. **Example Application of Linear Regression**: One example of the application of linear regression is in predicting house prices based on features such as size (in square feet), number of bedrooms, and number of bathrooms. In this case, the dependent variable is the house price, and the independent variables are the size, number of bedrooms, and number of bathrooms. By fitting a linear regression model to historical data on house prices and these features, you can predict the price of a new house based on its size, number of bedrooms, and number of bathrooms.
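
A minimal sketch of this idea with scikit-learn, using small made-up numbers purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: [size in sq ft, bedrooms, bathrooms] and prices
X = np.array([
    [1400, 3, 2],
    [1600, 3, 2],
    [1700, 4, 3],
    [1100, 2, 1],
    [2350, 4, 3],
])
y = np.array([245000, 312000, 279000, 199000, 405000])

model = LinearRegression().fit(X, y)

# Predicted price for a new 1500 sq ft house with 3 bedrooms and 2 bathrooms
print(model.predict(np.array([[1500, 3, 2]])))
print(model.coef_, model.intercept_)  # per-feature slopes and the intercept
```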

4. **Statistical Knowledge for Applying Linear Regression**: To apply linear regression, you need to have a basic understanding of statistics, including concepts such as correlation, covariance, and least squares estimation. You also need to understand how to interpret the coefficients of the regression equation and how to assess the goodness of fit of the model.

5. **RMSE (Root Mean Squared Error)**: RMSE is a metric used to evaluate the performance of a regression model. It is the square root of the average of the squared differences between the predicted values and the actual values. RMSE is expressed in the same units as the dependent variable and provides an indication of how well the model is able to predict the actual values. Lower RMSE values indicate better model performance, with 0 indicating a perfect fit.
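
In symbols, for \( n \) observations with actual values \( y_i \) and predictions \( \hat{y}_i \): \( RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \). A minimal sketch with hypothetical values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical predicted values

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)
```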

A6.

1. **Basic Principle of Naïve Bayes**:
Naïve Bayes is a probabilistic machine learning algorithm based on Bayes' theorem, which is used for classification. The basic principle is to calculate the probability of a data point belonging to a certain class based on the features of the data point. It assumes that the features are conditionally independent given the class, which is a strong assumption but simplifies the calculation.
2. **Advantages and Disadvantages of Naïve Bayes**:
- **Advantages**:
- Simple and easy to implement.
- Works well with large datasets.
- Computationally efficient.
- Can handle many features.
- Often performs well in practice, especially for text classification tasks.
- **Disadvantages**:
- Assumes independence of features, which may not always hold true.
- Requires a relatively large amount of training data to estimate the
parameters accurately.
- Can be sensitive to the presence of irrelevant features.

3. **Desired Characteristics of Data for Naïve Bayes**:
- The features should be independent of each other given the class.
- The features should be categorical or continuous, but they are often discretized for Naïve Bayes.
- Sufficient training data should be available to estimate the probabilities accurately.
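
A minimal sketch of a Naïve Bayes classifier using scikit-learn's Gaussian variant on a built-in dataset (the split ratio and random seed are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Small built-in dataset split into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Gaussian Naive Bayes models each continuous feature per class with a normal distribution
model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out test set
```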

4. **Confusion Matrix**:
A confusion matrix is a table that is often used to describe the performance of
a classification model on a set of test data for which the true values are known.
It allows the visualization of the performance of an algorithm and helps in
understanding how well the algorithm is performing.

5. **TP, FP, TN, FN**:
- **True Positive (TP)**: The number of correctly predicted positive instances (e.g., correctly identified spam emails).
- **False Positive (FP)**: The number of incorrectly predicted positive instances (e.g., non-spam emails classified as spam).
- **True Negative (TN)**: The number of correctly predicted negative instances (e.g., correctly identified non-spam emails).
- **False Negative (FN)**: The number of incorrectly predicted negative instances (e.g., spam emails classified as non-spam).
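
A minimal sketch extracting these four counts with scikit-learn, using hypothetical labels (1 = spam, 0 = not spam):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
```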

A7.

1. **Text Analysis**:
Text analysis, also known as text mining or text analytics, is the process of
deriving meaningful information from natural language text. It involves
extracting patterns, trends, and insights from unstructured text data, which can
be used for various applications such as sentiment analysis, topic modeling, and
document categorization.

2. **Applications of Text Analysis**:
- **Sentiment Analysis**: Sentiment analysis is a text analysis technique that involves determining the sentiment or emotion expressed in a piece of text. It is commonly used in social media monitoring, customer feedback analysis, and brand reputation management.
- **Information Retrieval**: Information retrieval involves finding relevant documents or information within a large collection of text data. Search engines use text analysis techniques to index and retrieve relevant documents in response to user queries.

3. **Text Preprocessing**:
Text preprocessing is the process of cleaning and preparing text data for
analysis. It involves several steps, including:
- **Lowercasing**: Converting all text to lowercase to ensure consistency.
- **Tokenization**: Splitting text into individual words or tokens.
- **Removing Stopwords**: Removing common words (e.g., "the," "is,"
"and") that do not contribute much meaning.
- **Stemming or Lemmatization**: Reducing words to their base or root form
(e.g., "running" to "run").
- **Removing Special Characters**: Removing non-alphanumeric characters
like punctuation marks.
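
A minimal sketch of these steps, assuming the NLTK library is installed (recent NLTK versions may additionally require the punkt_tab resource for tokenization):

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer model and the stopword list
nltk.download("punkt")
nltk.download("stopwords")

text = "The runners were running quickly, and the crowd was cheering!"

text = text.lower()                      # lowercasing
text = re.sub(r"[^a-z0-9\s]", "", text)  # remove special characters
tokens = word_tokenize(text)             # tokenization
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
tokens = [PorterStemmer().stem(t) for t in tokens]   # stemming
print(tokens)
```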

4. **POS Tagging**:
POS tagging (Part-of-Speech tagging) is the process of assigning grammatical
tags to words in a text based on their role and context. POS tags indicate
whether a word is a noun, verb, adjective, etc., and can help in understanding
the syntactic structure of a sentence.
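
A minimal sketch with NLTK (the tagger resource name varies slightly across NLTK versions):

```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))  # (word, tag) pairs such as ('fox', 'NN') and ('jumps', 'VBZ')
```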

5. **TF/IDF**:
TF/IDF (Term Frequency-Inverse Document Frequency) is a statistical
measure used to evaluate the importance of a word in a document relative to a
collection of documents (corpus). It is calculated as the product of two terms:
- **Term Frequency (TF)**: The frequency of a term (word) in a document,
normalized by the total number of terms in the document. It reflects how often a
term occurs in a document.
- **Inverse Document Frequency (IDF)**: The logarithmically scaled inverse
fraction of the documents that contain the term. It measures the rarity of a term
across the documents in the corpus.
TF/IDF is often used in information retrieval and text mining to rank the
importance of terms in a document.
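
In symbols, a common formulation is \( \text{tfidf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)} \), where \( N \) is the number of documents in the corpus and \( \text{df}(t) \) is the number of documents containing term \( t \). A minimal sketch with scikit-learn, which uses a smoothed variant of IDF, on a tiny made-up corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny hypothetical corpus of three documents
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make great pets",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Rows are documents, columns are terms, values are TF-IDF weights
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```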

A8.
