ds_imp_qs
Unit 2
1. Explain normal curve and z-score.
2. Using the standard normal curve table, find the proportion of the total area identified with each of
the following statements.
a. above a z score of 1.8
b. between the mean and a z score of 1.65
c. between z scores of 0 and -1.96
3. Describe the types of variable with an example for each.
4. Suppose a hospital tested the age and body fat data for randomly selected adults with the following
result:
Age 23 27 39 49 50 52 54 56 57 58 60
%fat 9.5 17.8 31.4 27.2 31.2 34.6 42.5 33.4 30.2 34.1 41
Draw the boxplots for age.
5. Find the mean, median, mode, variance, standard deviation and skewness for the given data:
Marks            0–10  10–20  20–30  30–40  40–50  50–60  60–70  70–80
No. of Students   10     40     20      0     10     40     16     14
6. The number of friends reported by Facebook users is summarized in the following frequency
distribution.
Friends F
400 – above 2
350 – 399 5
300 – 349 12
250 – 299 17
200 – 249 23
150 – 199 49
100 – 149 27
50 – 99 29
0 – 49 36
Total 200
a. What is the shape of this distribution?
b. Find the relative frequencies
c. Find the approximate percentile rank of the interval 300 – 349
d. Convert to a histogram
e. Why would it not be possible to convert to a stem and leaf display?
7. Perform an exploratory data analysis for the following data with different types of plots:
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University
of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for cancer.
Data attributes:
Age of patient at the time of operation (numerical)
Patient’s year of operation (year – 1900, numerical)
Number of positive axillary nodes detected (numerical)
Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died
within five years.
8.
a. Classify the following data into their types: i. ethnic group ii. age iii. family size iv.
academic major v. IQ score vi. net worth vii. third-place finish viii. gender ix. temperature x.
education. Also write brief notes on each type.
b. Differentiate discrete and continuous variables.
c. Explain the types of data.
d. Define median with an example.
e. Compare and contrast qualitative data and quantitative data with an example.
f. List the differences between a discrete variable and a continuous variable with an example.
9. What is a frequency distribution? Customers who have purchased a particular product rated the
usability of the product on a 10-point scale, ranging from 1 (poor) to 10 (excellent) as follows:
3 7 2 7 8
3 1 4 10 3
2 5 3 5 8
9 7 6 3 7
8 9 7 3 6
Construct a frequency distribution for the above data.
10. What is a relative frequency distribution? The GRE scores for a group of graduate school applicants
are distributed as follows:
GRE Score Frequency
725 – 749 1
700 – 724 3
675 – 699 14
650 – 674 30
625 – 649 34
600 – 624 42
575 – 599 30
550 – 574 27
525 – 549 13
500 – 524 4
475 – 499 2
Total 200
Explain the procedure to convert a frequency distribution into a relative frequency distribution and
convert the data presented in the above table to a relative frequency distribution.
11. What is a z-score? Outline the steps to obtain a z-score.
12. Express each of the following scores as a z-score: First, Mary’s intelligence quotient is 135, given a
mean of 100 and standard deviation 15. Second, Mary obtained a score of 470 in the Competitive
Examination conducted in April 2022, given a mean of 500 and a standard deviation of 100.
13. What is the mode? Can there be distributions with no mode or with more than one mode? The owner of a
new car conducts six gas mileage tests and obtains the following results, expressed in miles per gallon:
26.3, 28.7, 27.4, 26.6, 27.4 and 26.9. Find the mode for these data.
14. What is the median? Outline the steps to find the median, and find the median for each of the following:
a first set of five scores: 2, 8, 2, 7, 6; and a second set of six scores: 3, 8, 9, 3, 1, 8.
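A minimal Python sketch for checking the answers to questions 12 to 14 above, assuming only the standard library. The helper name z_score is illustrative and is not part of the question paper.

```python
import statistics


def z_score(x, mean, std_dev):
    """Return how many standard deviations x lies above or below the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev


# Question 12: Mary's IQ and her competitive examination score.
iq_z = z_score(135, 100, 15)
exam_z = z_score(470, 500, 100)

# Question 13: gas mileage results; multimode returns every most-frequent value.
mileage_modes = statistics.multimode([26.3, 28.7, 27.4, 26.6, 27.4, 26.9])

# Question 14: the median is the middle score after sorting (odd count),
# or the average of the two middle scores (even count).
median_first = statistics.median([2, 8, 2, 7, 6])
median_second = statistics.median([3, 8, 9, 3, 1, 8])

print(round(iq_z, 2), round(exam_z, 2))
print(mileage_modes, median_first, median_second)
```

A positive z-score places the score above the mean and a negative one places it below, so the two results for Mary can be compared even though the original scales differ.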
Unit 3
1. Explain scatter plot. Categorize the different types of relationships using scatter plots.
2. Describe range and variance
3. Explain the correlation coefficient
4. Explain, with an example, how the least squares equation minimizes the total of all squared
prediction errors.
5. Each of the following pairs represents the number of licensed drivers (X) and the number of cars (Y)
for seven houses in my neighborhood:
Drivers (X) Cars (Y)
5 4
4 3
2 2
2 2
3 2
1 1
2 2
a. Construct a scatterplot to verify a lack of pronounced curvilinearity.
b. Determine the least squares equation for these data. Calculate r, SSy and SSx.
c. Determine the standard error of estimate, Sy/x, given that n = 7.
6. In studies dating back over 100 years, it has been well established that regression toward the mean
occurs between the heights of fathers and the heights of their adult sons. Indicate whether each of the
following statements is true or false, and give a reason.
a. Sons of tall fathers will tend to be shorter than their fathers.
b. Sons of short fathers will tend to be taller than the mean for all sons.
c. Every son of a tall father will be shorter than their fathers.
d. Taken as a group, adult sons are shorter than their fathers
e. Fathers of tall sons will tend to be taller than their sons but shorter than the mean for all
fathers.
f. Fathers of short sons will tend to be taller than their sons but shorter than the mean of all
fathers.
7. Interpret the value of r² in correlation-based analysis.
8. Assume that an r of -.80 describes the strong negative relationship between years of heavy smoking
(X) and life expectancy (Y). Assume, furthermore, that the distributions of heavy smoking and life
expectancy have the following means and sums of squares: mean of X = 5, mean of Y = 60, SSx = 35, SSy = 70.
a. Determine the least squares regression equation for predicting life expectancy from years of
heavy smoking
b. Determine the standard error of estimate, Sy/x, assuming that the correlation of -.80 was
based on n = 50 pairs of observations.
c. Supply a rough interpretation of Sy/x
d. Predict the life expectancy for John, who has smoked heavily for 8 years.
e. Predict the life expectancy for Katie, who has never smoked heavily.
9. a. Suppose Helen sent 10 greeting cards to her friends and received 8 cards back. What kind of
relationship is this? Explain briefly.
b. What is a percentile rank? Give an example.
c. Define multiple regression.
d. Define regression toward the mean.
e. What is the use of a scatter plot?
f. Define correlation coefficient.
10. Calculate the correlation coefficient for the heights (in inches) of fathers (x) and their sons (y) with
the data presented below:
x 66 68 68 70 71 72 72
y 68 70 69 72 72 72 74
11. The values of x and their corresponding values of y are given below:
x 0.5 1.5 2.5 3.5 4.5 5.5 6.5
y 2.5 3.5 5.5 4.5 6.5 8.5 10.5
a. Find the least squares regression line y = ax + b.
b. Estimate the value of y when x = 10.
12. Consider the following dataset with one response variable y and two predictor variables x1 and x2.
y 140 155 159 179 192 200 212 215
x1 60 62 67 70 71 71 75 78
x2 22 25 24 20 15 14 14 11
Fit a multiple linear regression model to this dataset.
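For the question above that asks for the least squares regression line y = ax + b, the calculation can be sketched in plain Python as follows. This is only one way to check a hand computation. The function name least_squares and the variable names are assumptions made for this sketch.

```python
import math


def least_squares(xs, ys):
    """Return (slope, intercept, r) for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and sum of cross-products around the means.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sp_xy / ss_x
    intercept = mean_y - slope * mean_x
    r = sp_xy / math.sqrt(ss_x * ss_y)
    return slope, intercept, r


xs = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
ys = [2.5, 3.5, 5.5, 4.5, 6.5, 8.5, 10.5]

a, b, r = least_squares(xs, ys)
y_at_10 = a * 10 + b  # estimate of y when x = 10

print(f"a = {a:.4f}, b = {b:.4f}, r = {r:.4f}, y(10) = {y_at_10:.4f}")
```

Note that x = 10 lies outside the observed range of x (0.5 to 6.5), so the estimate is an extrapolation and should be treated with caution.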
Unit 4
1. Explain grouping in Python with an example.
2. Explain data indexing and operations on missing data with suitable code and examples.
3. Describe pivot tables in detail.
4. Imagine you have a series of data that represents the amount of precipitation each day for a year in
a given city. Load the daily rainfall statistics for the city of Tirupati in 2021, which are given in the
CSV file Tirupatirainfall2021.csv. Using pandas, generate a histogram of the rainy days and identify the
days that have high rainfall.
5. Consider an e-commerce organization such as Amazon that stores its regional sales as NorthSales,
SouthSales, WestSales, EastSales.csv files. The organization wants to combine the north and west region
sales, and the south and east region sales, to find the aggregate sales of these collaborating regions.
Help them do so using Python code.
6.
a. List the attributes of a NumPy array. Give an example for each.
b. Create a data frame from the key-data pairs A-10, B-20, A-40, C-5, B-10, C-10. Find the sum
for each key and display the result grouped by key.
c. What are the key properties of Pearson Correlation Coefficient?
d. Summarize some built-in Pandas aggregations.
e. State the advantages of using NumPy arrays.
f. Outline two types of NumPy UFuncs.
7. What is an aggregate function? Elaborate on the aggregate functions in NumPy.
8. What is broadcasting? Explain the rules of broadcasting with an example.
9. Elaborate on the mapping between Python operators and Pandas methods.
10. Why is NumPy faster than lists? List and explain the categories of basic array manipulation methods
with examples.
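For the key-data question in this unit (pairs A-10, B-20, A-40, C-5, B-10, C-10), a minimal pandas sketch of the grouping step could look like the following. The column names "key" and "data" are assumptions, since the question does not name them.

```python
import pandas as pd

# Build the data frame from the key-data pairs given in the question.
df = pd.DataFrame(
    {
        "key": ["A", "B", "A", "C", "B", "C"],
        "data": [10, 20, 40, 5, 10, 10],
    }
)

# Split the rows by key, sum the data values within each group,
# and combine the results into one Series indexed by key.
sums = df.groupby("key")["data"].sum()

print(sums)
```

The same split-apply-combine pattern works with other built-in aggregations such as mean(), min(), max() and count().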
Unit 5
1. Explain the different types of joins in Python.
2. Explain various features of matplotlib platform used for data visualization and illustrate its
challenges.
3. How are text and image annotations done using Python? Give an example of your own with
appropriate Python code.
4. Appraise the following: a. Histograms b. Binning c. Density; with appropriate Python code.
5.
a. What is the purpose of the errorbar function in matplotlib? Give an example.
b. Showcase three-dimensional drawing in matplotlib with corresponding Python code.
c. Explain partial sort.
d. Give a summary about the comparison operators.
e. State the two possible options in Python notebook used to embed graphics directly in the
notebook.
f. How does the plt.scatter function differ from the plt.plot function?
6. Explain various visualization charts such as line plots, scatter plots and histograms using
matplotlib with an example.
7. Outline any two types of three-dimensional plots in matplotlib with an example.
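For the partial sort question in this unit, one possible illustration uses NumPy's np.partition. This is a sketch under the assumption that "partial sort" refers to NumPy partitioning; the sample values and the choice of k are made up for the example.

```python
import numpy as np

values = np.array([7, 2, 3, 1, 6, 5, 4])
k = 3

# np.partition rearranges the array so that the element at index k is in its
# final sorted position: everything before it is less than or equal to it and
# everything after it is greater than or equal to it. The order within the two
# halves is not guaranteed.
partitioned = np.partition(values, k)
smallest_k = partitioned[:k]

print(partitioned)
print(np.sort(smallest_k))
```

Because only the boundary element is placed exactly, a partial sort is typically cheaper than a full sort when only the k smallest values are needed.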