Oral Answers DSBDA
2. The need for data preprocessing arises from the fact that real-world data is
often incomplete, inconsistent, or contains errors. Preprocessing helps in:
- Handling missing data.
- Removing noise from data.
- Standardizing data formats.
- Addressing inconsistencies and errors in data.
- Preparing data for specific analysis tasks.
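Two of the steps above (handling missing data and standardizing formats) can be sketched in plain Python on a small made-up dataset; the function names and the example values are illustrative only:

```python
def impute_missing(values, missing=None):
    """Replace missing entries with the mean of the observed values."""
    observed = [v for v in values if v is not missing]
    mean = sum(observed) / len(observed)
    return [mean if v is missing else v for v in values]

def min_max_scale(values):
    """Rescale values to the [0, 1] range so features are comparable."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [25, None, 35, 40, None]     # toy column with missing entries
clean = impute_missing(ages)        # missing ages replaced by the mean
scaled = min_max_scale(clean)       # rescaled to [0, 1]
```

In practice libraries such as pandas provide these operations directly; the sketch just shows the idea behind them.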
A3.
1. **Measures of data dispersion** quantify the spread or variability of
data points in a dataset; common examples are the range, variance, standard
deviation, and interquartile range. They matter because:
- They provide insights into the variability and spread of data points, which is
crucial for understanding the reliability and stability of the data.
- They help in comparing and interpreting data sets, especially in identifying
outliers and understanding the distribution of data.
- They are used in statistical analysis to make inferences and draw conclusions
about populations based on sample data, providing a basis for hypothesis testing
and estimation.
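The standard library's `statistics` module can compute the common dispersion measures on a small illustrative sample:

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9]        # toy sample

rng = max(data) - min(data)         # range: difference of extremes
var = statistics.variance(data)     # sample variance
std = statistics.stdev(data)        # sample standard deviation
```

Note that `statistics.variance` uses the sample formula (dividing by n - 1); `statistics.pvariance` gives the population version.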
A4.
1. **Use of Data Regression**: Data regression is used to model the
relationship between a dependent variable and one or more independent
variables. It helps in understanding how the value of the dependent variable
changes when one or more independent variables are varied. Regression
analysis is used for prediction, forecasting, and understanding the relationships
between variables in various fields such as economics, biology, engineering,
and social sciences.
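For the simplest case, one dependent and one independent variable, the ordinary least-squares fit can be written out by hand; the data points below are made up for illustration:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a*x + b (simple linear regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y over variance of x.
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx                 # intercept from the means
    return a, b

xs = [1, 2, 3, 4]
ys = [2.1, 4.1, 5.9, 8.0]           # roughly y = 2x
a, b = linear_fit(xs, ys)
```

Once fitted, `a * x + b` predicts the dependent variable for new values of x, which is the prediction/forecasting use mentioned above.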
A6.
4. **Confusion Matrix**:
A confusion matrix is a table that is often used to describe the performance of
a classification model on a set of test data for which the true values are known.
It allows the visualization of the performance of an algorithm and helps in
understanding how well the algorithm is performing.
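A confusion matrix is just a table of counts of (true label, predicted label) pairs, which can be built directly from two label lists; the spam/ham data here is invented for illustration:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count each (true label, predicted label) pair in the test set."""
    return Counter(zip(y_true, y_pred))

y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham",  "ham", "ham", "spam"]
cm = confusion_matrix(y_true, y_pred)
# cm[("spam", "spam")] -> true positives for "spam" (2 here)
# cm[("spam", "ham")]  -> spam wrongly predicted as ham (1 here)
```

From these counts, metrics such as accuracy, precision, and recall follow directly.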
A7.
1. **Text Analysis**:
Text analysis, also known as text mining or text analytics, is the process of
deriving meaningful information from natural language text. It involves
extracting patterns, trends, and insights from unstructured text data, which can
be used for various applications such as sentiment analysis, topic modeling, and
document categorization.
3. **Text Preprocessing**:
Text preprocessing is the process of cleaning and preparing text data for
analysis. It involves several steps, including:
- **Lowercasing**: Converting all text to lowercase to ensure consistency.
- **Tokenization**: Splitting text into individual words or tokens.
- **Removing Stopwords**: Removing common words (e.g., "the," "is,"
"and") that do not contribute much meaning.
- **Stemming or Lemmatization**: Reducing words to their base or root form
(e.g., "running" to "run").
- **Removing Special Characters**: Removing non-alphanumeric characters
like punctuation marks.
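Most of the steps above (lowercasing, removing special characters, tokenization, stopword removal) can be chained into one small pipeline; the stopword list here is a tiny illustrative sample, and stemming/lemmatization is omitted since it needs a library such as NLTK:

```python
import re

STOPWORDS = {"the", "is", "and", "a", "of"}   # tiny illustrative list

def preprocess(text):
    text = text.lower()                        # lowercasing
    text = re.sub(r"[^a-z0-9\s]", "", text)    # remove special characters
    tokens = text.split()                      # tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

preprocess("The cat IS on the mat!")   # -> ['cat', 'on', 'mat']
```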
4. **POS Tagging**:
POS tagging (Part-of-Speech tagging) is the process of assigning grammatical
tags to words in a text based on their role and context. POS tags indicate
whether a word is a noun, verb, adjective, etc., and can help in understanding
the syntactic structure of a sentence.
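A toy dictionary-lookup tagger illustrates the idea of assigning a grammatical tag to each token; real taggers (e.g. `nltk.pos_tag`) use context and trained models rather than a fixed lexicon, so this is a simplification:

```python
# Hypothetical mini-lexicon for illustration only.
TOY_LEXICON = {"dog": "NOUN", "barks": "VERB", "loud": "ADJ", "the": "DET"}

def toy_pos_tag(tokens):
    """Tag each token by lexicon lookup; unknown words get 'UNK'."""
    return [(t, TOY_LEXICON.get(t.lower(), "UNK")) for t in tokens]

toy_pos_tag(["The", "dog", "barks"])
# -> [('The', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')]
```

The limitation is obvious: a word like "barks" can be a noun or a verb depending on context, which is exactly what statistical taggers resolve.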
5. **TF/IDF**:
TF/IDF (Term Frequency-Inverse Document Frequency) is a statistical
measure used to evaluate the importance of a word in a document relative to a
collection of documents (corpus). It is calculated as the product of two terms:
- **Term Frequency (TF)**: The frequency of a term (word) in a document,
normalized by the total number of terms in the document. It reflects how often a
term occurs in a document.
- **Inverse Document Frequency (IDF)**: The logarithmically scaled inverse
fraction of the documents that contain the term. It measures the rarity of a term
across the documents in the corpus.
TF/IDF is often used in information retrieval and text mining to rank the
importance of terms in a document.
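The TF and IDF definitions above can be computed directly; the corpus below is invented, and the IDF here uses the plain log(N / df) form (libraries such as scikit-learn apply smoothed variants):

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of term over document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log of (corpus size / docs with term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [["big", "data", "analytics"],
          ["data", "mining"],
          ["text", "mining", "tools"]]
# "data" occurs in 2 of 3 documents, so its IDF is log(3/2);
# in the first document its TF is 1/3.
score = tf_idf("data", corpus[0], corpus)
```

A term that appears in every document gets IDF = log(1) = 0, which is how TF/IDF discounts common words.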
A8.