ASM using R: 2 Marks Answer Keys

The document outlines various statistical concepts and methods, including basic statistical functions in R, types of tests for comparing means, data mining, probability, and hypothesis testing. It also discusses advanced topics such as regression analysis, dimension reduction, and predictive modeling. Key statistical techniques like t-tests, ANOVA, correlation analysis, and logistic regression are explained, along with their applications and assumptions.


a) Four basic statistical functions in R:

1. mean(x): Calculates the arithmetic mean of a numeric vector x.


2. sd(x): Calculates the standard deviation of a numeric vector x.
3. cor(x, y): Calculates the correlation coefficient between two numeric vectors x and y.
4. summary(x): Provides a summary of the data in x, including minimum, 1st quartile,
median, mean, 3rd quartile, and maximum values.
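A minimal sketch of these four functions applied to small, made-up numeric vectors (the values are purely illustrative):
Code snippet
x <- c(4, 8, 15, 16, 23, 42)   # illustrative data
y <- c(1, 2, 3, 4, 5, 6)

mean(x)       # arithmetic mean of x
sd(x)         # standard deviation of x
cor(x, y)     # Pearson correlation between x and y
summary(x)    # minimum, quartiles, median, mean, maximum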
b) Two types of tests to compare means of two samples:
1. t-test: Used to compare the means of two samples, assuming they come from
normally distributed populations with equal variances (a sketch follows this list).
2. ANOVA (Analysis of Variance): Typically used to compare the means of more than two
groups, but it can also be applied to two groups, where it is equivalent to the t-test.
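A minimal sketch of the two-sample t-test in base R, using made-up values for the two samples:
Code snippet
group1 <- c(5.1, 4.8, 6.0, 5.5, 5.9)   # illustrative sample 1
group2 <- c(6.2, 6.8, 5.9, 7.1, 6.5)   # illustrative sample 2

# Independent two-sample t-test assuming equal variances
t.test(group1, group2, var.equal = TRUE)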
c) Data Mining:
Data mining is the process of discovering patterns in large data sets involving methods at the
intersection of machine learning, statistics, and database systems. It extracts information
from a data set and transforms it into an understandable structure for further use.
d) Probability and Mutually Exclusive Events:
• Probability: The measure of the likelihood that an event will occur. It is a number
between 0 and 1, where 0 indicates impossibility and 1 indicates certainty.
• Mutually Exclusive Events: Two events are mutually exclusive if they cannot both
occur at the same time. For example, flipping a coin cannot result in both heads and
tails simultaneously.
e) Cross Tabulation:
Cross tabulation is a statistical method used to analyze the relationship between two
categorical variables. It displays the frequency distribution of the variables and their joint
occurrences in a tabular format.
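A minimal sketch of a cross tabulation in base R, using two made-up categorical vectors:
Code snippet
gender <- c("M", "F", "F", "M", "F", "M")          # illustrative values
smoker <- c("Yes", "No", "Yes", "No", "No", "Yes")

table(gender, smoker)      # joint frequency table of the two variables
xtabs(~ gender + smoker)   # formula-based alternative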
f) Correlation Analysis:
Correlation analysis is a statistical method used to measure the strength and direction of the
linear relationship between two variables. It helps to understand how changes in one
variable are associated with changes in another.
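A minimal sketch using the built-in mtcars data (chosen purely for illustration): the correlation between car weight and fuel efficiency:
Code snippet
cor(mtcars$wt, mtcars$mpg)        # Pearson correlation coefficient (negative here)
cor.test(mtcars$wt, mtcars$mpg)   # adds a significance test and confidence interval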
g) Predictive Modeling:
Predictive modeling is a statistical technique used to predict future outcomes based on
historical data. It involves building a model that learns patterns from the data and uses them
to make predictions on new, unseen data.
h) Factor Analysis:
Factor analysis is a statistical method used to reduce the dimensionality of a dataset by
identifying underlying latent variables or factors. It helps to identify groups of correlated
variables and understand the underlying structure of the data.
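A minimal sketch using base R's factanal() on the built-in ability.cov covariance matrix (the dataset choice is only for illustration):
Code snippet
fa <- factanal(factors = 2, covmat = ability.cov)   # maximum-likelihood factor analysis
fa$loadings                                         # how strongly each variable loads on each factor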
a) Null and Alternate Hypothesis:
• Null Hypothesis (H₀): A statement that assumes there is no significant difference or
relationship between the variables being studied.
• Alternate Hypothesis (H₁): A statement that contradicts the null hypothesis,
suggesting a significant difference or relationship exists.
b) ROC Curve:
A Receiver Operating Characteristic (ROC) curve is a graphical plot used to illustrate the
diagnostic ability of a binary classifier system as its discrimination threshold is varied. It plots
the true positive rate (sensitivity) against the false positive rate (1-specificity) for various
threshold settings.
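A minimal sketch, assuming the pROC package is installed; the logistic model on the built-in mtcars data is purely illustrative:
Code snippet
library(pROC)   # install.packages("pROC") if not already available

fit  <- glm(vs ~ mpg, data = mtcars, family = binomial)   # a simple binary classifier
prob <- predict(fit, type = "response")                   # predicted probabilities

roc_obj <- roc(mtcars$vs, prob)   # true labels vs predicted probabilities
plot(roc_obj)                     # sensitivity vs 1 - specificity across thresholds
auc(roc_obj)                      # area under the ROC curve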
c) Significance of Correlation Analysis:
Correlation analysis helps to:
• Measure the strength of the relationship: It quantifies how strongly two variables are
related.
• Determine the direction of the relationship: It indicates whether the relationship is
positive (both variables increase together) or negative (one increases while the other
decreases).
• Identify potential causal relationships: While correlation does not imply causation, it
can suggest potential causal links that can be further investigated.
d) Clustering in Data Mining:
Clustering is a technique used to group similar data points together. It identifies patterns and
structures within data, allowing for better understanding, visualization, and decision-making.
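A minimal sketch of k-means clustering on the built-in iris measurements (dataset and number of clusters chosen only for illustration):
Code snippet
set.seed(1)                                           # for reproducible cluster assignments
km <- kmeans(iris[, 1:4], centers = 3)                # group the flowers into 3 clusters
table(Cluster = km$cluster, Species = iris$Species)   # how clusters align with known species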
e) One-way ANOVA:
One-way ANOVA (Analysis of Variance) is a statistical technique used to compare the means
of three or more independent groups. It determines whether there is a significant difference
between the means of these groups.
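A minimal sketch using the built-in PlantGrowth data, which has three treatment groups:
Code snippet
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)    # F statistic and p-value for the overall group effect
TukeyHSD(fit)   # pairwise group comparisons if the overall test is significant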
f) Seasonality in Time Series Data:
Seasonality refers to patterns in data that repeat over a fixed period, such as yearly,
quarterly, monthly, weekly, or daily. It is often caused by factors like weather, holidays, or
social trends.
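A minimal sketch using the built-in AirPassengers monthly series, whose yearly seasonal pattern can be separated from the trend:
Code snippet
plot(decompose(AirPassengers))   # splits the series into trend, seasonal, and random components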
g) Dimension Reduction:
Dimension reduction is a technique used to reduce the number of features or variables in a
dataset while preserving as much information as possible. This can improve model
performance, reduce computational cost, and enhance interpretability.
h) Two types of tests to compare means of two samples:
1. t-test: Used to compare the means of two independent samples, assuming they
come from normally distributed populations with equal variances.
2. ANOVA (Analysis of Variance): While typically used for comparing means of more
than two groups, it can also be used to compare the means of two groups.

a) Define Probability and give an example.


Probability is the mathematical measure of the likelihood that an event will occur. It is
expressed as a number between 0 and 1, where 0 indicates impossibility and 1 indicates
certainty.
Example:
• Tossing a coin: The probability of getting a head when tossing a fair coin is 1/2 or 0.5.
This means that if you toss the coin many times, you would expect to get heads
about half the time.
b) State assumptions of multiple regression analysis
Multiple regression analysis relies on several key assumptions to ensure the validity and
reliability of the model's results. These assumptions are:
1. Linearity: The relationship between the dependent variable and each independent
variable should be linear. This means that a change in an independent variable leads
to a proportional change in the dependent variable.
2. Independence of Errors: The errors (residuals) in the model should be independent
of each other. This means that the error in one observation does not influence the
error in another observation.
3. Homoscedasticity: The variance of the errors should be constant across all levels of
the independent variables. This means that the spread of the errors is consistent
throughout the range of the data.
4. Normality of Errors: The errors should be normally distributed. This assumption is
important for hypothesis testing and confidence interval estimation.
5. No Multicollinearity: The independent variables should not be highly correlated with
each other. Multicollinearity can make it difficult to estimate the individual effects of
the independent variables on the dependent variable.
It's important to check these assumptions before interpreting the results of a multiple
regression analysis. Violations of these assumptions can lead to biased and unreliable
estimates.
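A minimal sketch of how these assumptions are commonly checked in R, using a model fitted to the built-in mtcars data (illustrative only):
Code snippet
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

par(mfrow = c(2, 2))
plot(fit)   # residual diagnostics: linearity, homoscedasticity, normality, influential points

cor(mtcars[, c("wt", "hp", "disp")])   # pairwise predictor correlations (multicollinearity screen)
# car::vif(fit) reports variance inflation factors if the car package is installed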
c) Autocorrelation in Time Series
Autocorrelation refers to the correlation between a time series and a lagged version of itself.
In simpler terms, it measures the relationship between a variable's current value and its past
values.
• Positive Autocorrelation: If the current value of a variable is positively correlated
with its past values, it indicates a trend. For instance, if stock prices tend to rise after
previous rises, it shows positive autocorrelation.
• Negative Autocorrelation: If the current value is negatively correlated with past
values, it suggests a cyclical pattern. For example, if economic growth is followed by a
period of recession, it exhibits negative autocorrelation.
Understanding autocorrelation is crucial in time series analysis as it helps identify patterns,
make accurate forecasts, and select appropriate modeling techniques.
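A minimal sketch using the built-in AirPassengers series (illustrative only):
Code snippet
acf(AirPassengers)    # correlogram: correlation of the series with its own lags
pacf(AirPassengers)   # partial autocorrelations, controlling for shorter lags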
d) Two Methods of Dimension Reduction
Dimensionality reduction is a technique used to reduce the number of features (variables) in
a dataset while preserving the essential information. This is important because high-
dimensional data can lead to the curse of dimensionality, where models become less
accurate and computationally expensive.
Two common methods of dimension reduction are:
1. Principal Component Analysis (PCA): PCA transforms a dataset into a new
coordinate system, where the first few principal components capture most of the
variance in the data. By selecting only the most important principal components, we
can reduce the dimensionality (a sketch follows this list).
2. Feature Selection: This involves selecting a subset of the original features that are
most relevant to the target variable. Techniques like filter methods (e.g., correlation
analysis), wrapper methods (e.g., forward selection, backward elimination), and
embedded methods (e.g., L1 regularization) can be used for feature selection.
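A minimal sketch of PCA (method 1 above) with base R's prcomp(), using the built-in mtcars data purely for illustration:
Code snippet
pca <- prcomp(mtcars, scale. = TRUE)   # standardize variables before extracting components
summary(pca)                           # proportion of variance explained by each component
head(pca$x[, 1:2])                     # observations projected onto the first two components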
f) Models that are Both Regression and Classification
While regression and classification are distinct tasks, some models can be adapted to
perform both:
1. Logistic Regression:
o Fundamentally a classification model, it predicts the probability of a binary
outcome.
o By interpreting the output as a continuous probability, it can be used for
regression-like tasks.
2. Neural Networks:
o Highly flexible models that can be trained for both regression and
classification.
o The output layer determines the task: a single node for regression, multiple
nodes for multi-class classification.
3. Support Vector Machines (SVMs):
o Primarily a classification model, SVMs can also be used for regression by
modifying the loss function.
o Support Vector Regression (SVR) is an extension for regression tasks.
g) Null and Alternative Hypothesis
In hypothesis testing, we make claims about a population parameter.
• Null Hypothesis (H₀): This is the default assumption, often a statement of no effect
or no difference.
• Alternative Hypothesis (H₁): This is the claim we want to test, often the opposite of
the null hypothesis.
Example:
• Null Hypothesis (H₀): The mean height of a population is 170 cm.
• Alternative Hypothesis (H₁): The mean height of the population is not 170 cm.
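A minimal sketch of how this example could be tested in R with a one-sample t-test (the height values are made up):
Code snippet
heights <- c(168, 172, 171, 169, 174, 167, 173)   # illustrative sample
t.test(heights, mu = 170)   # tests H0: mean = 170 against H1: mean != 170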
h) Properties of the Normal Distribution
The normal distribution, often called the bell curve, is a fundamental probability distribution
with the following properties:
1. Symmetry: The distribution is symmetric about the mean.
2. Mean, Median, and Mode: The mean, median, and mode are equal.
3. Standard Deviation: The standard deviation determines the spread of the
distribution.
4. Area Under the Curve: The total area under the curve is equal to 1.
5. Empirical Rule: Approximately 68% of the data falls within one standard deviation of
the mean, 95% within two standard deviations, and 99.7% within three standard
deviations.
The normal distribution is widely used in statistics and probability theory due to its
simplicity and its frequent appearance in natural phenomena.
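The empirical rule can be verified with the standard normal CDF in R:
Code snippet
pnorm(1) - pnorm(-1)   # ≈ 0.683, probability within one standard deviation of the mean
pnorm(2) - pnorm(-2)   # ≈ 0.954, within two standard deviations
pnorm(3) - pnorm(-3)   # ≈ 0.997, within three standard deviations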

e) Sketch classification table in logistic regression


Classification Table in Logistic Regression
A classification table, also known as a confusion matrix, is a common tool to evaluate the
performance of a classification model, including logistic regression. It summarizes the
prediction results on a test dataset, comparing the predicted class labels to the actual class
labels.
Here's a typical classification table:

                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)

Explanation of terms:
• True Positive (TP): Correctly predicted positive cases.
• True Negative (TN): Correctly predicted negative cases.
• False Positive (FP): Incorrectly predicted positive cases (Type I error).
• False Negative (FN): Incorrectly predicted negative cases (Type II error).
From this table, we can calculate various performance metrics:
• Accuracy: Overall correctness of the model.
o Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision: Proportion of positive predictions that are correct.
o Precision = TP / (TP + FP)
• Recall (Sensitivity): Proportion of actual positive cases correctly identified.
o Recall = TP / (TP + FN)
• Specificity: Proportion of actual negative cases correctly identified.
o Specificity = TN / (TN + FP)
• F1-score: Harmonic mean of precision and recall.
o F1-score = 2 * (Precision * Recall) / (Precision + Recall)
By analyzing these metrics, we can assess the model's performance and make informed
decisions about its suitability for a particular application.
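A minimal sketch of building a classification table and the metrics above in base R, using made-up actual and predicted labels:
Code snippet
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)   # illustrative true classes
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)   # illustrative model predictions

cm <- table(Actual = actual, Predicted = predicted)   # the classification table
cm

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["0", "1"]; FN <- cm["1", "0"]

(TP + TN) / sum(cm)   # accuracy
TP / (TP + FP)        # precision
TP / (TP + FN)        # recall (sensitivity)
TN / (TN + FP)        # specificity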
a) Enlist basic statistical functions in R.
Here are some fundamental statistical functions in R:
• mean(x): Calculates the mean (average) of the values in the vector x.
• median(x): Finds the median value of the vector x.
• sd(x): Computes the standard deviation of the values in x.
• var(x): Calculates the variance of the values in x.
• summary(x): Provides a summary of the data in x, including quartiles, mean, median,
min, and max.
• cor(x, y): Computes the correlation between two vectors x and y.
• table(x): Creates a frequency table for categorical data in x.
• hist(x): Plots a histogram of the data in x.
• boxplot(x): Creates a boxplot to visualize the distribution of x.
• t.test(x, y): Performs a t-test to compare means of two groups.
• anova(model): Produces an analysis-of-variance table for a fitted model object, which
can be used to test factor effects or to compare nested models.
b) What is the difference between parametric and non-parametric tests?
Parametric tests assume that the data comes from a specific probability distribution (like the
normal distribution) and that certain parameters (like the mean and standard deviation) are
known or can be estimated. Examples include t-tests and ANOVA.
Non-parametric tests make fewer assumptions about the data distribution. They are often
used when the data is not normally distributed or when the sample size is small. Examples
include the Wilcoxon rank-sum test and the Kruskal-Wallis test.
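A minimal sketch of the two non-parametric tests named above, run on built-in datasets chosen only for illustration:
Code snippet
wilcox.test(extra ~ group, data = sleep)           # Wilcoxon rank-sum test: two groups
kruskal.test(weight ~ group, data = PlantGrowth)   # Kruskal-Wallis test: three or more groups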
c) Define predictive analytics.
Predictive analytics is a field of data mining that uses statistical models, machine learning
algorithms, and other techniques to predict future outcomes based on historical data. It
helps organizations make informed decisions and anticipate future trends.
d) Explain pbinom() function in R.
The pbinom() function in R calculates the cumulative distribution function (CDF) of the
binomial distribution. It gives the probability of getting at most a certain number of
successes in a given number of trials with a specified probability of success.
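For example, the probability of at most 3 heads in 10 tosses of a fair coin:
Code snippet
pbinom(3, size = 10, prob = 0.5)                       # P(X <= 3) ≈ 0.172
pbinom(3, size = 10, prob = 0.5, lower.tail = FALSE)   # P(X > 3), the upper tail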
e) How do you interpret the p-value in hypothesis testing?
The p-value is the probability of observing a test statistic as extreme or more extreme than
the one calculated from the sample data, assuming the null hypothesis is true.
• If the p-value is less than the significance level (usually 0.05), we reject the null
hypothesis.
• If the p-value is greater than the significance level, we fail to reject the null
hypothesis.
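A minimal sketch of extracting and interpreting a p-value in R, using the built-in sleep data (illustrative only):
Code snippet
result <- t.test(extra ~ group, data = sleep)   # two-sample t-test
result$p.value                                  # the p-value itself
result$p.value < 0.05   # TRUE: reject H0 at the 5% level; FALSE: fail to reject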
f) Write a function to get a list of all the packages installed in R.
Code snippet
get_installed_packages <- function() {
  # installed.packages() returns a matrix with one row per installed package;
  # its row names are the package names
  installed_packages <- rownames(installed.packages())
  return(installed_packages)
}
g) Write a function to obtain the transpose of a matrix in R.
Code snippet
transpose_matrix <- function(x) {
  # t() returns the transpose: rows of x become columns of the result
  t(x)
}
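For example:
Code snippet
m <- matrix(1:6, nrow = 2)   # a 2 x 3 matrix
transpose_matrix(m)          # returns the 3 x 2 transpose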
h) What is the purpose of regression analysis in R?
Regression analysis is a statistical method used to model the relationship between a
dependent variable and one or more independent variables. In R, it helps understand how
changes in the independent variables affect the dependent variable, and it can be used for
prediction and inference.
a) Define NULL and Alternate hypothesis.
In hypothesis testing, we make claims about a population parameter.
• Null Hypothesis (H₀): This is the default assumption, often a statement of no effect or
no difference.
• Alternative Hypothesis (H₁): This is the claim we want to test, often the opposite of
the null hypothesis.
Example:
• Null Hypothesis (H₀): The mean height of a population is 170 cm.
• Alternative Hypothesis (H₁): The mean height of the population is not 170 cm.
b) Define statistical modeling.
Statistical modeling involves using mathematical and statistical techniques to represent real-
world phenomena. It helps us understand, predict, and make decisions based on data.
Statistical models can be simple or complex, depending on the nature of the data and the
research question.
c) What is adjusted R² in regression analysis?
Adjusted R² is a modified version of the R² statistic that adjusts for the number of predictors
in a regression model. It penalizes the addition of unnecessary predictors that might not
significantly improve the model's fit. A higher adjusted R² indicates a better-fitting model,
even when more predictors are added.
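A minimal sketch of reading both R² values from a fitted model, using the built-in mtcars data (illustrative only):
Code snippet
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$r.squared       # ordinary R-squared
summary(fit)$adj.r.squared   # adjusted R-squared, penalized for the number of predictors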
d) Explain Unlist() function.
The unlist() function in R is used to convert a list into a vector. It flattens the list by
combining all its elements into a single vector. This is useful when you want to perform
operations on the individual elements of a list as if they were a single vector.
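For example:
Code snippet
my_list <- list(a = 1:3, b = c(10, 20))
unlist(my_list)   # a single named numeric vector: a1 a2 a3 b1 b2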
e) Explain aov() function.
The aov() function in R is used to perform analysis of variance (ANOVA), which is a statistical
technique to compare means of multiple groups. It helps determine if there are significant
differences between the means of the groups.
f) What is logistic regression?
Logistic regression is a statistical method used to model the probability of a binary outcome
(e.g., success or failure, yes or no) based on one or more predictor variables. It is widely
used in fields like healthcare, finance, and marketing.
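A minimal sketch of logistic regression with glm(), predicting whether a car has a manual transmission from its weight (built-in mtcars data, illustrative only):
Code snippet
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)                                                    # coefficients on the log-odds scale
predict(fit, newdata = data.frame(wt = 3), type = "response")   # predicted probability for wt = 3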
g) Define Predictive analytics.
Predictive analytics is a field of data mining that uses statistical models, machine learning
algorithms, and other techniques to predict future outcomes based on historical data. It
helps organizations make informed decisions and anticipate future trends.
h) How many predictor variables must be used in multiple regression?
Multiple regression requires at least two predictor variables (a model with a single
predictor is simple regression). Beyond that minimum, the number can vary with the
complexity of the model and the research question: there is no fixed rule, and you can use
as many predictors as necessary to explain the variation in the dependent variable.
However, adding too many predictors can lead to overfitting, so it's important to balance the
model's complexity with its predictive power.
