ASM using R: 2 Marks Answer Keys

The document outlines various statistical concepts and methods, including basic statistical functions in R, types of tests for comparing means, data mining, probability, and hypothesis testing. It also discusses advanced topics such as regression analysis, dimension reduction, and predictive modeling. Key statistical techniques like t-tests, ANOVA, correlation analysis, and logistic regression are explained, along with their applications and assumptions.


a) Four basic statistical functions in R:

1. mean(x): Calculates the arithmetic mean of a numeric vector x.


2. sd(x): Calculates the standard deviation of a numeric vector x.
3. cor(x, y): Calculates the correlation coefficient between two numeric vectors x and y.
4. summary(x): Provides a summary of the data in x, including minimum, 1st quartile,
median, mean, 3rd quartile, and maximum values.
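A minimal sketch of these four functions applied to small, made-up numeric vectors (the values are purely illustrative):
Code snippet
x <- c(4, 8, 15, 16, 23, 42)   # illustrative data
y <- c(1, 2, 3, 4, 5, 6)

mean(x)       # arithmetic mean of x
sd(x)         # standard deviation of x
cor(x, y)     # Pearson correlation between x and y
summary(x)    # minimum, quartiles, median, mean, maximum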
b) Two types of tests to compare means of two samples:
1. t-test: Used to compare the means of two samples, assuming they come from
normally distributed populations with equal variances (a sketch follows this list).
2. ANOVA (Analysis of Variance): Typically used to compare the means of more than two
groups, but it can also be applied to two groups, where it is equivalent to the t-test.
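A minimal sketch of the two-sample t-test in base R, using made-up values for the two samples:
Code snippet
group1 <- c(5.1, 4.8, 6.0, 5.5, 5.9)   # illustrative sample 1
group2 <- c(6.2, 6.8, 5.9, 7.1, 6.5)   # illustrative sample 2

# Independent two-sample t-test assuming equal variances
t.test(group1, group2, var.equal = TRUE)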
c) Data Mining:
Data mining is the process of discovering patterns in large data sets involving methods at the
intersection of machine learning, statistics, and database systems. It extracts information
from a data set and transforms it into an understandable structure for further use.
d) Probability and Mutually Exclusive Events:
• Probability: The measure of the likelihood that an event will occur. It is a number
between 0 and 1, where 0 indicates impossibility and 1 indicates certainty.
• Mutually Exclusive Events: Two events are mutually exclusive if they cannot both
occur at the same time. For example, flipping a coin cannot result in both heads and
tails simultaneously.
e) Cross Tabulation:
Cross tabulation is a statistical method used to analyze the relationship between two
categorical variables. It displays the frequency distribution of the variables and their joint
occurrences in a tabular format.
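A minimal sketch of a cross tabulation in base R, using two made-up categorical vectors:
Code snippet
gender <- c("M", "F", "F", "M", "F", "M")          # illustrative values
smoker <- c("Yes", "No", "Yes", "No", "No", "Yes")

table(gender, smoker)      # joint frequency table of the two variables
xtabs(~ gender + smoker)   # formula-based alternative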
f) Correlation Analysis:
Correlation analysis is a statistical method used to measure the strength and direction of the
linear relationship between two variables. It helps to understand how changes in one
variable are associated with changes in another.
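A minimal sketch using the built-in mtcars data (chosen purely for illustration): the correlation between car weight and fuel efficiency:
Code snippet
cor(mtcars$wt, mtcars$mpg)        # Pearson correlation coefficient (negative here)
cor.test(mtcars$wt, mtcars$mpg)   # adds a significance test and confidence interval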
g) Predictive Modeling:
Predictive modeling is a statistical technique used to predict future outcomes based on
historical data. It involves building a model that learns patterns from the data and uses them
to make predictions on new, unseen data.
h) Factor Analysis:
Factor analysis is a statistical method used to reduce the dimensionality of a dataset by
identifying underlying latent variables or factors. It helps to identify groups of correlated
variables and understand the underlying structure of the data.
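A minimal sketch using base R's factanal() on the built-in ability.cov covariance matrix (the dataset choice is only for illustration):
Code snippet
fa <- factanal(factors = 2, covmat = ability.cov)   # maximum-likelihood factor analysis
fa$loadings                                         # how strongly each variable loads on each factor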
a) Null and Alternate Hypothesis:
• Null Hypothesis (H₀): A statement that assumes there is no significant difference or
relationship between the variables being studied.
• Alternate Hypothesis (H₁): A statement that contradicts the null hypothesis,
suggesting a significant difference or relationship exists.
b) ROC Curve:
A Receiver Operating Characteristic (ROC) curve is a graphical plot used to illustrate the
diagnostic ability of a binary classifier system as its discrimination threshold is varied. It plots
the true positive rate (sensitivity) against the false positive rate (1-specificity) for various
threshold settings.
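A minimal sketch, assuming the pROC package is installed; the logistic model on the built-in mtcars data is purely illustrative:
Code snippet
library(pROC)   # install.packages("pROC") if not already available

fit  <- glm(vs ~ mpg, data = mtcars, family = binomial)   # a simple binary classifier
prob <- predict(fit, type = "response")                   # predicted probabilities

roc_obj <- roc(mtcars$vs, prob)   # true labels vs predicted probabilities
plot(roc_obj)                     # sensitivity vs 1 - specificity across thresholds
auc(roc_obj)                      # area under the ROC curve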
c) Significance of Correlation Analysis:
Correlation analysis helps to:
• Measure the strength of the relationship: It quantifies how strongly two variables are
related.
• Determine the direction of the relationship: It indicates whether the relationship is
positive (both variables increase together) or negative (one increases while the other
decreases).
• Identify potential causal relationships: While correlation does not imply causation, it
can suggest potential causal links that can be further investigated.
d) Clustering in Data Mining:
Clustering is a technique used to group similar data points together. It identifies patterns and
structures within data, allowing for better understanding, visualization, and decision-making.
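A minimal sketch of k-means clustering on the built-in iris measurements (dataset and number of clusters chosen only for illustration):
Code snippet
set.seed(1)                                           # for reproducible cluster assignments
km <- kmeans(iris[, 1:4], centers = 3)                # group the flowers into 3 clusters
table(Cluster = km$cluster, Species = iris$Species)   # how clusters align with known species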
e) One-way ANOVA:
One-way ANOVA (Analysis of Variance) is a statistical technique used to compare the means
of three or more independent groups. It determines whether there is a significant difference
between the means of these groups.
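A minimal sketch using the built-in PlantGrowth data, which has three treatment groups:
Code snippet
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)    # F statistic and p-value for the overall group effect
TukeyHSD(fit)   # pairwise group comparisons if the overall test is significant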
f) Seasonality in Time Series Data:
Seasonality refers to patterns in data that repeat over a fixed period, such as yearly,
quarterly, monthly, weekly, or daily. It is often caused by factors like weather, holidays, or
social trends.
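A minimal sketch using the built-in AirPassengers monthly series, whose yearly seasonal pattern can be separated from the trend:
Code snippet
plot(decompose(AirPassengers))   # splits the series into trend, seasonal, and random components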
g) Dimension Reduction:
Dimension reduction is a technique used to reduce the number of features or variables in a
dataset while preserving as much information as possible. This can improve model
performance, reduce computational cost, and enhance interpretability.
h) Two types of tests to compare means of two samples:
1. t-test: Used to compare the means of two independent samples, assuming they
come from normally distributed populations with equal variances.
2. ANOVA (Analysis of Variance): While typically used for comparing means of more
than two groups, it can also be used to compare the means of two groups.

a) Define Probability and give an example.


Probability is the mathematical measure of the likelihood that an event will occur. It is
expressed as a number between 0 and 1, where 0 indicates impossibility and 1 indicates
certainty.
Example:
• Tossing a coin: The probability of getting a head when tossing a fair coin is 1/2 or 0.5.
This means that if you toss the coin many times, you would expect to get heads
about half the time.
b) State assumptions of multiple regression analysis
Multiple regression analysis relies on several key assumptions to ensure the validity and
reliability of the model's results. These assumptions are:
1. Linearity: The relationship between the dependent variable and each independent
variable should be linear. This means that a change in an independent variable leads
to a proportional change in the dependent variable.
2. Independence of Errors: The errors (residuals) in the model should be independent
of each other. This means that the error in one observation does not influence the
error in another observation.
3. Homoscedasticity: The variance of the errors should be constant across all levels of
the independent variables. This means that the spread of the errors is consistent
throughout the range of the data.
4. Normality of Errors: The errors should be normally distributed. This assumption is
important for hypothesis testing and confidence interval estimation.
5. No Multicollinearity: The independent variables should not be highly correlated with
each other. Multicollinearity can make it difficult to estimate the individual effects of
the independent variables on the dependent variable.
It's important to check these assumptions before interpreting the results of a multiple
regression analysis. Violations of these assumptions can lead to biased and unreliable
estimates.
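A minimal sketch of how these assumptions are commonly checked in R, using a model fitted to the built-in mtcars data (illustrative only):
Code snippet
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

par(mfrow = c(2, 2))
plot(fit)   # residual diagnostics: linearity, homoscedasticity, normality, influential points

cor(mtcars[, c("wt", "hp", "disp")])   # pairwise predictor correlations (multicollinearity screen)
# car::vif(fit) reports variance inflation factors if the car package is installed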
c) Autocorrelation in Time Series
Autocorrelation refers to the correlation between a time series and a lagged version of itself.
In simpler terms, it measures the relationship between a variable's current value and its past
values.
• Positive Autocorrelation: If the current value of a variable is positively correlated
with its past values, it indicates a trend. For instance, if stock prices tend to rise after
previous rises, it shows positive autocorrelation.
• Negative Autocorrelation: If the current value is negatively correlated with past
values, it suggests a cyclical pattern. For example, if economic growth is followed by a
period of recession, it exhibits negative autocorrelation.
Understanding autocorrelation is crucial in time series analysis as it helps identify patterns,
make accurate forecasts, and select appropriate modeling techniques.
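A minimal sketch using the built-in AirPassengers series (illustrative only):
Code snippet
acf(AirPassengers)    # correlogram: correlation of the series with its own lags
pacf(AirPassengers)   # partial autocorrelations, controlling for shorter lags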
d) Two Methods of Dimension Reduction
Dimensionality reduction is a technique used to reduce the number of features (variables) in
a dataset while preserving the essential information. This is important because high-
dimensional data can lead to the curse of dimensionality, where models become less
accurate and computationally expensive.
Two common methods of dimension reduction are:
1. Principal Component Analysis (PCA): PCA transforms a dataset into a new
coordinate system, where the first few principal components capture most of the
variance in the data. By selecting only the most important principal components, we
can reduce the dimensionality (a sketch follows this list).
2. Feature Selection: This involves selecting a subset of the original features that are
most relevant to the target variable. Techniques like filter methods (e.g., correlation
analysis), wrapper methods (e.g., forward selection, backward elimination), and
embedded methods (e.g., L1 regularization) can be used for feature selection.
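A minimal sketch of PCA (method 1 above) with base R's prcomp(), using the built-in mtcars data purely for illustration:
Code snippet
pca <- prcomp(mtcars, scale. = TRUE)   # standardize variables before extracting components
summary(pca)                           # proportion of variance explained by each component
head(pca$x[, 1:2])                     # observations projected onto the first two components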
f) Models that are Both Regression and Classification
While regression and classification are distinct tasks, some models can be adapted to
perform both:
1. Logistic Regression:
o Fundamentally a classification model, it predicts the probability of a binary
outcome.
o By interpreting the output as a continuous probability, it can be used for
regression-like tasks.
2. Neural Networks:
o Highly flexible models that can be trained for both regression and
classification.
o The output layer determines the task: a single node for regression, multiple
nodes for multi-class classification.
3. Support Vector Machines (SVMs):
o Primarily a classification model, SVMs can also be used for regression by
modifying the loss function.
o Support Vector Regression (SVR) is an extension for regression tasks.
g) Null and Alternative Hypothesis
In hypothesis testing, we make claims about a population parameter.
• Null Hypothesis (H₀): This is the default assumption, often a statement of no effect
or no difference.
• Alternative Hypothesis (H₁): This is the claim we want to test, often the opposite of
the null hypothesis.
Example:
• Null Hypothesis (H₀): The mean height of a population is 170 cm.
• Alternative Hypothesis (H₁): The mean height of the population is not 170 cm.
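A minimal sketch of how this example could be tested in R with a one-sample t-test (the height values are made up):
Code snippet
heights <- c(168, 172, 171, 169, 174, 167, 173)   # illustrative sample
t.test(heights, mu = 170)   # tests H0: mean = 170 against H1: mean != 170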
h) Properties of the Normal Distribution
The normal distribution, often called the bell curve, is a fundamental probability distribution
with the following properties:
1. Symmetry: The distribution is symmetric about the mean.
2. Mean, Median, and Mode: The mean, median, and mode are equal.
3. Standard Deviation: The standard deviation determines the spread of the
distribution.
4. Area Under the Curve: The total area under the curve is equal to 1.
5. Empirical Rule: Approximately 68% of the data falls within one standard deviation of
the mean, 95% within two standard deviations, and 99.7% within three standard
deviations.
The normal distribution is widely used in statistics and probability theory due to its
simplicity and its frequent appearance in natural phenomena.
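The empirical rule can be verified with the standard normal CDF in R:
Code snippet
pnorm(1) - pnorm(-1)   # ≈ 0.683, probability within one standard deviation of the mean
pnorm(2) - pnorm(-2)   # ≈ 0.954, within two standard deviations
pnorm(3) - pnorm(-3)   # ≈ 0.997, within three standard deviations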

e) Sketch classification table in logistic regression


Classification Table in Logistic Regression
A classification table, also known as a confusion matrix, is a common tool to evaluate the
performance of a classification model, including logistic regression. It summarizes the
prediction results on a test dataset, comparing the predicted class labels to the actual class
labels.
Here's a typical classification table:

                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)

Explanation of terms:
• True Positive (TP): Correctly predicted positive cases.
• True Negative (TN): Correctly predicted negative cases.
• False Positive (FP): Incorrectly predicted positive cases (Type I error).
• False Negative (FN): Incorrectly predicted negative cases (Type II error).
From this table, we can calculate various performance metrics:
• Accuracy: Overall correctness of the model.
o Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision: Proportion of positive predictions that are correct.
o Precision = TP / (TP + FP)
• Recall (Sensitivity): Proportion of actual positive cases correctly identified.
o Recall = TP / (TP + FN)
• Specificity: Proportion of actual negative cases correctly identified.
o Specificity = TN / (TN + FP)
• F1-score: Harmonic mean of precision and recall.
o F1-score = 2 * (Precision * Recall) / (Precision + Recall)
By analyzing these metrics, we can assess the model's performance and make informed
decisions about its suitability for a particular application.
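A minimal sketch of building a classification table and the metrics above in base R, using made-up actual and predicted labels:
Code snippet
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)   # illustrative true classes
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)   # illustrative model predictions

cm <- table(Actual = actual, Predicted = predicted)   # the classification table
cm

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["0", "1"]; FN <- cm["1", "0"]

(TP + TN) / sum(cm)   # accuracy
TP / (TP + FP)        # precision
TP / (TP + FN)        # recall (sensitivity)
TN / (TN + FP)        # specificity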
a) Enlist basic statistical functions in R.
Here are some fundamental statistical functions in R:
• mean(x): Calculates the mean (average) of the values in the vector x.
• median(x): Finds the median value of the vector x.
• sd(x): Computes the standard deviation of the values in x.
• var(x): Calculates the variance of the values in x.
• summary(x): Provides a summary of the data in x, including quartiles, mean, median,
min, and max.
• cor(x, y): Computes the correlation between two vectors x and y.
• table(x): Creates a frequency table for categorical data in x.
• hist(x): Plots a histogram of the data in x.
• boxplot(x): Creates a boxplot to visualize the distribution of x.
• t.test(x, y): Performs a t-test to compare means of two groups.
• anova(model): Produces an analysis-of-variance table for a fitted model object, which
can be used to test factor effects or to compare nested models.
b) What is the difference between parametric and non-parametric tests?
Parametric tests assume that the data comes from a specific probability distribution (like the
normal distribution) and that certain parameters (like the mean and standard deviation) are
known or can be estimated. Examples include t-tests and ANOVA.
Non-parametric tests make fewer assumptions about the data distribution. They are often
used when the data is not normally distributed or when the sample size is small. Examples
include the Wilcoxon rank-sum test and the Kruskal-Wallis test.
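A minimal sketch of the two non-parametric tests named above, run on built-in datasets chosen only for illustration:
Code snippet
wilcox.test(extra ~ group, data = sleep)           # Wilcoxon rank-sum test: two groups
kruskal.test(weight ~ group, data = PlantGrowth)   # Kruskal-Wallis test: three or more groups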
c) Define predictive analytics.
Predictive analytics is a field of data mining that uses statistical models, machine learning
algorithms, and other techniques to predict future outcomes based on historical data. It
helps organizations make informed decisions and anticipate future trends.
d) Explain pbinom() function in R.
The pbinom() function in R calculates the cumulative distribution function (CDF) of the
binomial distribution. It gives the probability of getting at most a certain number of
successes in a given number of trials with a specified probability of success.
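For example, the probability of at most 3 heads in 10 tosses of a fair coin:
Code snippet
pbinom(3, size = 10, prob = 0.5)                       # P(X <= 3) ≈ 0.172
pbinom(3, size = 10, prob = 0.5, lower.tail = FALSE)   # P(X > 3), the upper tail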
e) How do you interpret the p-value in hypothesis testing?
The p-value is the probability of observing a test statistic as extreme or more extreme than
the one calculated from the sample data, assuming the null hypothesis is true.
• If the p-value is less than the significance level (usually 0.05), we reject the null
hypothesis.
• If the p-value is greater than the significance level, we fail to reject the null
hypothesis.
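A minimal sketch of extracting and interpreting a p-value in R, using the built-in sleep data (illustrative only):
Code snippet
result <- t.test(extra ~ group, data = sleep)   # two-sample t-test
result$p.value                                  # the p-value itself
result$p.value < 0.05   # TRUE: reject H0 at the 5% level; FALSE: fail to reject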
f) Write a function to get a list of all the packages installed in R.
Code snippet
get_installed_packages <- function() {
  # installed.packages() returns a matrix with one row per installed package;
  # its row names are the package names
  installed_packages <- rownames(installed.packages())
  return(installed_packages)
}
g) Write a function to obtain the transpose of a matrix in R.
Code snippet
transpose_matrix <- function(x) {
  # t() returns the transpose: rows of x become columns of the result
  t(x)
}
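For example:
Code snippet
m <- matrix(1:6, nrow = 2)   # a 2 x 3 matrix
transpose_matrix(m)          # returns the 3 x 2 transpose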
h) What is the purpose of regression analysis in R?
Regression analysis is a statistical method used to model the relationship between a
dependent variable and one or more independent variables. In R, it helps understand how
changes in the independent variables affect the dependent variable, and it can be used for
prediction and inference.
a) Define NULL and Alternate hypothesis.
In hypothesis testing, we make claims about a population parameter.
• Null Hypothesis (H₀): This is the default assumption, often a statement of no effect or
no difference.
• Alternative Hypothesis (H₁): This is the claim we want to test, often the opposite of
the null hypothesis.
Example:
• Null Hypothesis (H₀): The mean height of a population is 170 cm.
• Alternative Hypothesis (H₁): The mean height of the population is not 170 cm.
b) Define statistical modeling.
Statistical modeling involves using mathematical and statistical techniques to represent real-
world phenomena. It helps us understand, predict, and make decisions based on data.
Statistical models can be simple or complex, depending on the nature of the data and the
research question.
c) What is adjusted R² in regression analysis?
Adjusted R² is a modified version of the R² statistic that adjusts for the number of predictors
in a regression model. It penalizes the addition of unnecessary predictors that might not
significantly improve the model's fit. A higher adjusted R² indicates a better-fitting model,
even when more predictors are added.
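A minimal sketch of reading both R² values from a fitted model, using the built-in mtcars data (illustrative only):
Code snippet
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$r.squared       # ordinary R-squared
summary(fit)$adj.r.squared   # adjusted R-squared, penalized for the number of predictors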
d) Explain Unlist() function.
The unlist() function in R is used to convert a list into a vector. It flattens the list by
combining all its elements into a single vector. This is useful when you want to perform
operations on the individual elements of a list as if they were a single vector.
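For example:
Code snippet
my_list <- list(a = 1:3, b = c(10, 20))
unlist(my_list)   # a single named numeric vector: a1 a2 a3 b1 b2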
e) Explain aov() function.
The aov() function in R is used to perform analysis of variance (ANOVA), which is a statistical
technique to compare means of multiple groups. It helps determine if there are significant
differences between the means of the groups.
f) What is logistic regression?
Logistic regression is a statistical method used to model the probability of a binary outcome
(e.g., success or failure, yes or no) based on one or more predictor variables. It is widely
used in fields like healthcare, finance, and marketing.
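A minimal sketch of logistic regression with glm(), predicting whether a car has a manual transmission from its weight (built-in mtcars data, illustrative only):
Code snippet
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)                                                    # coefficients on the log-odds scale
predict(fit, newdata = data.frame(wt = 3), type = "response")   # predicted probability for wt = 3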
g) Define Predictive analytics.
Predictive analytics is a field of data mining that uses statistical models, machine learning
algorithms, and other techniques to predict future outcomes based on historical data. It
helps organizations make informed decisions and anticipate future trends.
h) How many predictor variables must be used in multiple regression?
Multiple regression requires at least two predictor variables (a model with a single
predictor is simple regression). Beyond that minimum, the number can vary with the
complexity of the model and the research question: there is no fixed rule, and you can use
as many predictors as necessary to explain the variation in the dependent variable.
However, adding too many predictors can lead to overfitting, so it's important to balance the
model's complexity with its predictive power.
