**DSBDA Viva Answers**

**Assignment No. 1**

1. **Explain Data Frame with Suitable Example**
- A data frame is a two-dimensional, tabular data structure with labeled rows and columns, widely used in data analysis. It's a core object in the pandas library (Python) and in the R language.
- Example in Python:
```python
import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})
```

2. **What is the Limitation of the Label Encoding Method?**


- Label encoding assigns an arbitrary integer to each category, which can create a false ordinal relationship between categorical values; models that interpret the encoded integers numerically may learn orderings that don't exist.
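
A minimal sketch of the pitfall using scikit-learn's `LabelEncoder` (the city values are made up):
```python
from sklearn.preprocessing import LabelEncoder

cities = ['New York', 'Chicago', 'Los Angeles', 'Chicago']
encoded = LabelEncoder().fit_transform(cities)
print(encoded)  # [2 0 1 0] -- alphabetical: Chicago=0, Los Angeles=1, New York=2
# A numeric model would treat New York (2) as "greater than" Los Angeles (1),
# an ordering that does not exist; one-hot encoding avoids this.
```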

3. **What is the Need for Data Normalization?**


- Data normalization ensures uniformity in scale across features, so that variables with large ranges don't dominate the result; it reduces bias and improves performance in scale-sensitive algorithms like K-means and neural networks.
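
A minimal sketch using scikit-learn's `MinMaxScaler` to rescale two features with very different ranges (the numbers are made up):
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales: age and income
X = np.array([[25, 50_000], [30, 120_000], [35, 80_000]], dtype=float)
X_scaled = MinMaxScaler().fit_transform(X)  # each column rescaled to [0, 1]
print(X_scaled)  # both features now contribute comparably to distances
```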

4. **What are the Different Techniques for Handling Missing Data?**


- Common techniques include (see the sketch after this list):
- Imputation with mean, median, or mode.
- Dropping rows or columns with missing values.
- Using algorithms that handle missing data natively.
- Forward/backward filling for time series data.
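
A short pandas sketch of the imputation, dropping, and filling techniques above (the small DataFrame is illustrative):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'temp': [21.0, np.nan, 23.5, np.nan], 'city': ['A', 'B', 'A', 'B']})

df_imputed = df.fillna({'temp': df['temp'].mean()})  # impute with the mean
df_dropped = df.dropna()                             # drop rows with missing values
df_ffilled = df.ffill()                              # forward fill (time series)
print(df_imputed, df_dropped, df_ffilled, sep='\n')
```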

5. **How Can We Read the Data through the Pandas Library?**


- Common functions to read data with pandas:
- `pd.read_csv()`: Reads CSV files.
- `pd.read_excel()`: Reads Excel files.
- `pd.read_sql()`: Reads SQL query results.
- `pd.read_json()`: Reads JSON data.

6. **Different Types of Library and Their Uses**


- Common data-related libraries and their uses:
- **Pandas**: Data manipulation and analysis.
- **NumPy**: Numerical operations with arrays.
- **Matplotlib/Seaborn**: Data visualization.
- **Scikit-learn**: Machine learning algorithms.
- **TensorFlow/PyTorch**: Deep learning frameworks.

7. **What is Meant by Data Preprocessing?**


- Data preprocessing involves transforming raw data into a suitable format for analysis or
machine learning, including tasks like cleaning, normalization, and feature engineering.

8. **What is Meant by Outliers? How Can We Work on It?**
- Outliers are data points that deviate significantly from other observations. Methods to handle them include (see the sketch after this list):
- Removing outliers.
- Replacing them with a central tendency measure (like mean).
- Using robust statistical techniques to reduce their impact.
- Applying transformations to mitigate their effects.
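
A minimal sketch of the removal and replacement options using the IQR rule (column and values are made up):
```python
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 95]})  # 95 is an outlier
q1, q3 = df['value'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df['value'] < q1 - 1.5 * iqr) | (df['value'] > q3 + 1.5 * iqr)

df_removed = df[~mask]   # option 1: drop the outliers
df_replaced = df.copy()
df_replaced.loc[mask, 'value'] = df.loc[~mask, 'value'].mean()  # option 2: replace with mean
```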

---

**Assignment No. 2**


1. **How to Identify and Handle Null Values?**
- Identify null values with pandas functions like `isnull()` or `isna()` (often combined with `.sum()`), then handle them by dropping the affected rows/columns with `dropna()` or imputing them with a central value via `fillna()`.

2. **What is the Purpose of Data Transformation?**


- Data transformation standardizes or normalizes data, making it compatible with certain
algorithms, reducing skewness, and ensuring more reliable analysis.

3. **Explain the Methods to Detect Outliers**


- Common methods for detecting outliers (see the sketch after this list):
- Statistical methods like Z-score or Interquartile Range (IQR).
- Visual methods such as box plots, scatter plots, or histograms.
- Isolation Forest or clustering algorithms.
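
A sketch of the Z-score and Isolation Forest methods above (the series and thresholds are illustrative):
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

s = pd.Series([10, 12, 11, 13, 12, 95])

# Z-score method: flag points more than 2 standard deviations from the mean
# (3 is another common threshold)
z = (s - s.mean()) / s.std()
print(s[np.abs(z) > 2])  # flags 95

# Isolation Forest: fit_predict returns -1 for predicted outliers
labels = IsolationForest(contamination=0.2, random_state=0).fit_predict(s.to_frame())
print(s[labels == -1])   # most likely flags 95 as well
```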

4. **Write the Algorithm to Display the Statistics of Null Values Present in the Dataset**
```python
import pandas as pd
df = pd.read_csv('file.csv')
null_counts = df.isnull().sum()
print("Null values in each column:")
print(null_counts)
```

5. **Write an Algorithm to Replace Outlier Value with the Mean of the Variable**
```python
import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
mean_value = df['Column_Name'].mean()
std_dev = df['Column_Name'].std()

# Set a threshold to define outliers, e.g., z-score > 3
z_scores = (df['Column_Name'] - mean_value) / std_dev
outlier_threshold = 3
df.loc[np.abs(z_scores) > outlier_threshold, 'Column_Name'] = mean_value
```

---

**Assignment No. 3**

1. **What are the Measures of Central Tendency?**
- Measures of central tendency describe the center of a dataset:
- **Mean**: The average of the data.
- **Median**: The middle value when data is sorted.
- **Mode**: The most frequent value in the data.

2. **What are the Measures of Dispersion?**


- Measures of dispersion describe the spread or variability of a dataset:
- **Variance**: Average of the squared deviations from the mean.
- **Standard Deviation**: The square root of variance.
- **Range**: The difference between the maximum and minimum values.
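
A quick pandas sketch computing the measures from questions 1 and 2 (the series is made up):
```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Central tendency
print(s.mean())           # 5.0
print(s.median())         # 4.5
print(s.mode().iloc[0])   # 4 (most frequent)

# Dispersion
print(s.var())            # sample variance
print(s.std())            # standard deviation
print(s.max() - s.min())  # range: 7
```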

3. **What is the Difference Between Range and Variance?**


- **Range**: A simple measure of dispersion; it is the difference between the highest and lowest
values in a dataset.
- **Variance**: A more complex measure of dispersion that considers how far each value is from
the mean.

4. **What is Meant by Hypothesis Testing?**


- Hypothesis testing is a statistical method to test assumptions or hypotheses about a population
parameter, often using significance tests and p-values.
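
A minimal sketch of a one-sample t-test with SciPy (the sample and the hypothesized mean of 5.0 are made up):
```python
from scipy import stats

sample = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4]
# H0: the population mean equals 5.0
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)  # reject H0 at the 5% level only if p_value < 0.05
```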

5. **What is Type 1 and Type 2 Error?**


- **Type 1 Error**: Incorrectly rejecting a true null hypothesis (false positive).
- **Type 2 Error**: Failing to reject a false null hypothesis (false negative).

---

**Assignment No. 4**


1. **What is Regression?**
- Regression is a statistical method for estimating the relationship between a dependent variable
and one or more independent variables, often used for prediction and analysis.

2. **Difference Between Linear and Logistic Regression**


- **Linear Regression**: Models the relationship between variables assuming a linear
relationship; used for continuous outcomes.
- **Logistic Regression**: Models the relationship between variables assuming a logistic
function; used for binary or categorical outcomes.
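
A side-by-side sketch with scikit-learn on toy data:
```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5]])

# Linear regression: continuous target, predicts a real-valued estimate
y_cont = np.array([1.2, 1.9, 3.1, 3.9, 5.2])
print(LinearRegression().fit(X, y_cont).predict([[6]]))

# Logistic regression: binary target, predicts class probabilities
y_bin = np.array([0, 0, 0, 1, 1])
print(LogisticRegression().fit(X, y_bin).predict_proba([[6]]))
```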

3. **What are the Different Types of Logistic Regression?**


- Types of logistic regression include:
- **Binary Logistic Regression**: Predicts binary outcomes.
- **Multinomial Logistic Regression**: Predicts categorical outcomes with more than two
classes.
- **Ordinal Logistic Regression**: Predicts ordered categorical outcomes.

4. **What are the Different Types of Linear Regression?**
- **Simple Linear Regression**: One dependent variable and one independent variable.
- **Multiple Linear Regression**: One dependent variable with multiple independent variables.
- **Polynomial Regression**: Uses polynomial functions to model relationships.

5. **How to Compute SST, SSE, SSR, MSE, RMSE, R Square**


- SST: Total sum of squares.
- SSE: Sum of squares due to error.
- SSR: Sum of squares due to regression.
- MSE: Mean squared error.
- RMSE: Root mean squared error.
- R-Square: Coefficient of determination, indicating the proportion of variance explained by the
regression model.
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
X = df[['feature1', 'feature2']]
y = df['target']

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

SST = np.sum((y - np.mean(y)) ** 2)       # total sum of squares
SSE = np.sum((y - y_pred) ** 2)           # sum of squares due to error (residuals)
SSR = np.sum((y_pred - np.mean(y)) ** 2)  # sum of squares due to regression

MSE = mean_squared_error(y, y_pred)
RMSE = np.sqrt(MSE)
R_Square = model.score(X, y)  # equals SSR / SST (= 1 - SSE / SST) for OLS with an intercept
```

---

**Assignment No. 5**


1. **How to Evaluate a Classification Model?**
- Common methods for evaluating classification models include:
- Confusion matrix.
- Accuracy, precision, recall, F1-score.
- ROC curve and area under the ROC curve (AUC).

2. **How to Evaluate a Regression Model?**


- Evaluation metrics for regression models include:
- Mean squared error (MSE).
- Root mean squared error (RMSE).
- Mean absolute error (MAE).
- R-Square.

3. **What is Accuracy, Precision, Recall, and F1-Score?**
- **Accuracy**: The proportion of correct predictions among total predictions.
- **Precision**: The ratio of true positives to all predicted positives.
- **Recall**: The ratio of true positives to all actual positives.
- **F1-Score**: A harmonic mean of precision and recall.
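
A sketch computing the metrics from questions 1 and 3 with `sklearn.metrics` (the labels are made up):
```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total = 6/8 = 0.75
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```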

---

**Assignment No. 6**


1. **What is Bayes Theorem?**
- Bayes theorem calculates the probability of a hypothesis given observed evidence, updating a prior belief with the likelihood of that evidence.
- Formula: \( P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} \).
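
A worked numeric example with assumed probabilities for a diagnostic test:
```python
# Assumed numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate
p_d = 0.01                # P(Disease), the prior
p_pos_given_d = 0.95      # P(Positive | Disease), the likelihood
p_pos_given_not_d = 0.05  # P(Positive | No Disease)

# Law of total probability: P(Positive)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes theorem: posterior P(Disease | Positive)
posterior = p_pos_given_d * p_d / p_pos
print(round(posterior, 3))  # ~0.161 -- low despite the accurate test
```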

2. **Different Key Terms of Bayes Theorem?**


- **Prior probability**: The probability of an event before new evidence.
- **Posterior probability**: The updated probability after new evidence.
- **Likelihood**: The probability of observing specific evidence given a particular hypothesis.

3. **What is Meant by Likelihood Probability?**


- Likelihood probability is the probability of observed data given certain assumptions or a model.
It’s often used to evaluate how well a model fits the data.

---

**Assignment No. 7**


1. **What is POS Tagging?**
- POS (Part of Speech) tagging is the process of assigning parts of speech, such as noun, verb,
adjective, etc., to words in a text.
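
A minimal sketch with NLTK (the required model downloads are noted in the comments; exact names can vary across NLTK versions):
```python
import nltk
# One-time model downloads (names vary by NLTK version):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))  # e.g. [('The', 'DT'), ('quick', 'JJ'), ...]
```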

2. **What is Meant by Lemmatization?**


- Lemmatization is the process of reducing words to their base or root form while considering
grammar and context. It aims to return words to their canonical form.

3. **What is Meant by Stemming?**


- Stemming involves reducing words to their basic form by removing affixes without considering
the specific context. It’s simpler than lemmatization.
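
A sketch contrasting the two with NLTK (requires the WordNet data; download names can vary by version):
```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# One-time download for the lemmatizer: nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                  # 'studi' -- crude suffix stripping
print(lemmatizer.lemmatize("studies"))          # 'study' -- a real dictionary form
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'  -- uses grammatical context
```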

4. **What is Meant by Tokenization?**


- Tokenization involves breaking text into smaller units, such as words or sentences, for analysis
or processing.

5. **Why Remove Stop Words in Text Analysis?**


- Stop words are common words that do not contribute significant meaning to text (like "the," "is,"
"and"). Removing them reduces noise and improves the focus on meaningful words.

6. **Advantages and Disadvantages of TF-IDF**
- **Advantages**:
- Emphasizes important words in a text by reducing the impact of common terms.
- **Disadvantages**:
- Doesn’t consider word semantics.
- May overemphasize less common terms.

7. **Steps of a Text Analysis Model Using TF-IDF**


- Tokenize the text.
- Remove stop words.
- Apply lemmatization or stemming.
- Calculate TF-IDF for each token.
- Use TF-IDF features for further analysis or modeling.
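
A sketch of these steps with scikit-learn's `TfidfVectorizer` (the documents are made up); it bundles tokenization and stop-word removal, while stemming or lemmatization would need a custom preprocessor:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Tokenization and English stop-word removal are built in; pass a custom
# tokenizer if stemming or lemmatization is also required.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary after stop-word removal
print(X.toarray())                         # TF-IDF matrix: one row per document
```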

---

**Assignment No. 8**


1. **Explain Scatter Plot**
- A scatter plot is a graph showing individual data points, typically used to visualize relationships
between two variables. Useful for identifying trends and outliers.
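
A minimal matplotlib example (the height/weight values are made up):
```python
import matplotlib.pyplot as plt

height = [150, 160, 165, 170, 175, 180]
weight = [50, 56, 61, 66, 72, 79]
plt.scatter(height, weight)  # each point is one (height, weight) observation
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()
```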

2. **What is Univariate, Bivariate, Multivariate Plot?**


- **Univariate Plot**: Analysis of one variable (e.g., histogram, box plot).
- **Bivariate Plot**: Analysis of two variables (e.g., scatter plot).
- **Multivariate Plot**: Analysis of more than two variables (e.g., 3D scatter plot, pair plot).

3. **What is CM?**
- CM typically stands for confusion matrix, used to evaluate classification models by showing
true positives, true negatives, false positives, and false negatives.

4. **What is a Heatmap? What Does 0 & 1 Mean in a Heatmap?**


- A heatmap is a visual representation of data using color gradients. It can represent different
types of data, such as correlation matrices or categorical relationships.
- In a correlation heatmap, 1 indicates perfect positive correlation (e.g., a variable with itself on the diagonal) and 0 indicates no linear correlation; in binary heatmaps, 0 and 1 simply encode the two conditions.
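
A common sketch: a correlation-matrix heatmap with seaborn (random data for illustration):
```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(20, 3), columns=['a', 'b', 'c'])

# Correlation values run from -1 to 1; annot=True prints the value in each cell
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()
```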

5. **How to Handle Text Data?**


- Techniques for handling text data include:
- Tokenization.
- Removing stop words.
- Lemmatization or stemming.
- Feature extraction with methods like TF-IDF or bag-of-words.

---

**Assignment No. 9**

1. **What is the Use of Statistics in Data Science?**
- Statistics is used to understand and analyze data, make inferences, and validate models. It
provides foundational techniques for data science and machine learning.

2. **How to Analyze a Single Feature?**


- Different ways to analyze a single feature include:
- Descriptive statistics like mean, median, and mode.
- Visualization methods like histograms, box plots, or density plots.
- Normality tests such as Shapiro-Wilk or Kolmogorov-Smirnov.

3. **What is the Interquartile Range (IQR)?**


- The interquartile range (IQR) is the difference between the third and first quartiles (Q3 - Q1), used to measure data spread and identify outliers. Points more than 1.5 times the IQR below Q1 or above Q3 are commonly flagged as outliers.

4. **What is a Z-Score?**
- A Z-score represents the number of standard deviations a data point is from the mean. It’s used
to identify outliers and standardize data.

5. **Which Measure of Central Tendency Do Outliers Affect?**


- Outliers primarily affect the mean, since it's sensitive to extreme values. The median is more
robust to outliers, representing the central value in an ordered dataset.

---

**Assignment No. 10**


1. **How to Create a Histogram?**
- Creating a histogram with Python:
```python
import matplotlib.pyplot as plt
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
plt.hist(data, bins=5)
plt.show()
```

2. **Difference Between Histogram and Bar Graph?**
- **Histogram**: Used to represent continuous data, with bars representing frequency in specific
ranges (bins).
- **Bar Graph**: Used to represent categorical data, with bars representing each category.

3. **What is a Density Plot?**


- A density plot is a graph that represents the distribution of a variable, often as a smooth curve,
used to estimate and visualize data distributions.

4. **What Do You Understand by Density Plot?**


- A density plot shows the estimated distribution of a continuous variable, typically computed with kernel density estimation (KDE), providing a smooth alternative to a histogram.
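
A minimal seaborn sketch of a KDE-based density plot (the data are made up):
```python
import seaborn as sns
import matplotlib.pyplot as plt

data = [1.2, 1.9, 2.1, 2.4, 2.5, 2.7, 3.1, 3.3, 4.0, 4.8]
sns.kdeplot(data)  # kernel density estimate: a smoothed view of the distribution
plt.show()
```
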
---

**Assignment No. 11, 12**


1. **What is HDFS (Hadoop Distributed File System)?**
- HDFS is a distributed file system designed for large-scale data storage and processing,
offering scalability and redundancy across multiple nodes.

2. **What is MapReduce?**
- MapReduce is a programming model in Hadoop for distributed data processing. It consists of
"Map" tasks for parallel processing and "Reduce" tasks for aggregating results.

3. **What is the Purpose of Pig in HDFS Architecture?**


- Apache Pig provides a high-level platform for MapReduce programming with a scripting
language called Pig Latin, simplifying complex data processing tasks.

4. **Steps to Install HDFS**


- Download and install Hadoop.
- Configure core-site.xml and hdfs-site.xml for cluster setup.
- Format the NameNode.
- Start the NameNode and DataNodes.

5. **How Does MapReduce Work?**


- In the "Map" phase, input data is processed in parallel to generate intermediate key-value
pairs.
- In the "Reduce" phase, intermediate results are aggregated to produce the final output.

6. **Steps to Install Hadoop for Distributed Environment**


- Download and install Hadoop.
- Configure Hadoop environment variables.
- Set up cluster configurations in core-site.xml, hdfs-site.xml, mapred-site.xml.
- Start the Hadoop services for distributed processing.

7. **Steps to Install Scala**


- Download and install Scala from its official website.
- Configure Scala environment variables.
- Verify the installation with a simple Scala script or REPL.

---
