
IT-3006(DA)-CS_END_MAY_2023

Uploaded by

girivinayak0

SPRING END SEMESTER EXAMINATION-2023

6th Semester, B.Tech


DATA ANALYTICS (IT-3006)
Evaluation Scheme and Solution

1. Answer the following questions.


(a) Explain the similarity and difference between JSON and BSON with
suitable examples.
[Evaluation Scheme] Full mark for the correct answer. 0.5 mark for
similarity and 0.5 for difference. No step-wise mark to be awarded.
[Solution]
Similarity: Both represent data in a semi-structured format.
Difference: BSON is a binary format that is not human-readable, whereas
JSON is a human-readable text format.
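As a suitable example, the same document can be written in both encodings. The sketch below is a minimal illustration and not a full BSON implementation: `bson_encode_strings` is a hypothetical helper that handles only string-valued fields (BSON element type 0x02), and the sample document `{"hello": "world"}` is an assumption chosen for brevity.

```python
import json
import struct


def bson_encode_strings(doc):
    """Encode a flat dict of str -> str as BSON (string fields only)."""
    body = b""
    for key, value in doc.items():
        data = value.encode("utf-8") + b"\x00"      # NUL-terminated UTF-8 string
        body += (
            b"\x02"                                  # element type 0x02 = string
            + key.encode("utf-8") + b"\x00"          # field name as a C string
            + struct.pack("<i", len(data))           # little-endian int32 length
            + data
        )
    total = 4 + len(body) + 1                        # length prefix + body + final NUL
    return struct.pack("<i", total) + body + b"\x00"


document = {"hello": "world"}          # assumed sample document
json_text = json.dumps(document)       # text form
bson_bytes = bson_encode_strings(document)  # binary, length-prefixed form

print(json_text)
print(bson_bytes)
```

Both outputs carry the same key-value pair; the JSON form is text, while the BSON form is a length-prefixed byte string.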
(b) What is the difference between univariate, bivariate, and multivariate
analysis?
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
should be awarded based on the partial correctness of the solution.
[Solution]
Univariate analysis deals with data that consist of only one variable. It
involves measures of central tendency (mean, median and mode) and
measures of dispersion or spread (range, minimum, maximum, quartiles,
variance and standard deviation), and is presented using frequency
distribution tables, histograms, pie charts, frequency polygons and bar
charts.
Bivariate analysis deals with data that consist of two variables. It
involves comparisons, relationships, causes and explanations.
Multivariate analysis deals with data that consist of more than two
variables. It involves techniques such as regression analysis, path
analysis, factor analysis and multivariate analysis of variance (MANOVA).
(c) Consider the dataset below, which contains the number of hours of study
and the actual score received by 3 students in data analytics, along with
the score predicted by linear regression. Calculate R^2.
#   Number of hrs   Actual score   Predicted score
1   2               74             72
2   3               80             83
3   4               76             79
[Evaluation Scheme] Full mark for the correct answer. No step-wise mark
to be awarded.
[Solution]
Mean of actual score = 76.67, which is rounded to 77
SSR = Sum of squares regression = (72 - 77)^2 + (83 - 77)^2 + (79 - 77)^2 =
25 + 36 + 4 = 65
SSE = Sum of squares error = (72 - 74)^2 + (83 - 80)^2 + (79 - 76)^2 = 4 + 9 +
9 = 22
SST = Sum of squares total = SSR + SSE = 65 + 22 = 87
R^2 = SSR / SST = 65/87 = 0.747, and such a value indicates a moderately
well-fitted model.
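The arithmetic above can be checked with a short sketch. It follows the solution's convention (mean rounded to 77, SST taken as SSR + SSE). The identity SST = SSR + SSE is guaranteed only for a least-squares fit with an intercept, so, as an addition not found in the solution, the sketch also evaluates 1 - SSE / sum((actual - mean)^2) for comparison; the two figures need not agree for these predicted scores.

```python
actual = [74, 80, 76]
predicted = [72, 83, 79]

mean_actual = sum(actual) / len(actual)   # 76.67
rounded_mean = round(mean_actual)         # 77, as used in the solution

# Convention used in the solution: SST = SSR + SSE
ssr = sum((p - rounded_mean) ** 2 for p in predicted)
sse = sum((p - a) ** 2 for p, a in zip(predicted, actual))
r2_solution = ssr / (ssr + sse)

# Comparison only (not in the solution): SST measured directly on the actual scores
sst_direct = sum((a - mean_actual) ** 2 for a in actual)
r2_direct = 1 - sse / sst_direct

print(f"SSR={ssr}, SSE={sse}, R^2 (SSR/SST)={r2_solution:.3f}")
print(f"R^2 (1 - SSE/SST, direct SST)={r2_direct:.3f}")
```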
(d) A time series model is mathematically represented as Y_t = f(T_t, S_t, C_t, I_t),
where Y_t is the time series value at time t, and T_t, S_t, C_t and I_t are
the trend, seasonal, cyclic and irregular component values at time t
respectively. Represent the model for each of the following cases:
(1) When the amplitude of the seasonal and irregular variations does not
change as the level of the trend rises or falls.
(2) When the amplitude of both the seasonal and irregular variations
increases as the level of the trend rises.
[Evaluation Scheme] Full mark for the correct answer. 0.5 mark for the 1st
part and 0.5 for the other part. No step-wise mark to be awarded.
[Solution]
(1) When the amplitude of the seasonal and irregular variations does not
change as the level of the trend rises or falls, the time series follows the
additive model, represented by Y_t = T_t + S_t + C_t + I_t
(2) When the amplitude of both the seasonal and irregular variations
increases as the level of the trend rises, the time series follows the
multiplicative model, represented by Y_t = T_t * S_t * C_t * I_t
(e) Suppose hierarchical clustering is to be applied to segment the students,
and the following sample has been collected. Create the proximity matrix
for the sample below. The mark is out of 20 in the mid semester.
Roll No   Sex      Section   Mark
1         Male     CSE-1     10
2         Female   IT-1      17
3         Male     CSSE-1    18
4         Female   CSCE-1    20
[Evaluation Scheme] Full mark for the correct answer. No step-wise mark
to be awarded.
[Solution]
Since the student dataset has 4 observations, a 4 x 4 proximity matrix is
to be created, in which the diagonal elements are 0 because the distance
of a point from itself is always 0. Applying the Euclidean distance formula
to the Mark attribute, d(i, j) = √((mark_i - mark_j)^2), for example
d(1, 2) = √((10 - 17)^2) = 7, the matrix looks as follows:
Roll No   1    2    3    4
1         0    7    8    10
2         7    0    1    3
3         8    1    0    2
4         10   3    2    0
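The matrix can be reproduced with a minimal sketch, assuming (as the solution does) that only the numeric Mark attribute enters the distance; the names `marks` and `euclidean` are illustrative.

```python
import math

# Roll No -> Mark (the categorical attributes are not used in the distance)
marks = {1: 10, 2: 17, 3: 18, 4: 20}


def euclidean(a, b):
    """Euclidean distance between two one-dimensional points."""
    return math.sqrt((a - b) ** 2)


rolls = sorted(marks)
matrix = [[euclidean(marks[i], marks[j]) for j in rolls] for i in rolls]

for roll, row in zip(rolls, matrix):
    print(roll, row)
```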

(f) Consider the following dataset, wherein TID represents transaction ID and
G to O represents individual products. In the dataset, 1 represents a
transaction that includes the specific products. For instance, TID 1
includes all products and TID 3 includes M and O product. Calculate
Confidence({G, A} => {M}).
TID G A M
1 1 1 1
2 1 0 1
3 0 0 1
4 0 1 0
5 1 1 1
6 1 1 0
[Evaluation Scheme] Full mark for the correct answer. No step-wise mark
to be awarded.
[Solution]
Confidence({G, A} => {M}) = Support(G, A, M)/Support(G, A) = [2/6] /
[3/6] = 0.667
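A minimal sketch of the same calculation, with each transaction stored as the set of products it contains; the helper names `support` and `confidence` are illustrative.

```python
from fractions import Fraction

# TID -> products present (columns G, A, M of the table)
transactions = {
    1: {"G", "A", "M"},
    2: {"G", "M"},
    3: {"M"},
    4: {"A"},
    5: {"G", "A", "M"},
    6: {"G", "A"},
}


def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent):
    """Support of the combined itemset divided by support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)


conf_ga_m = confidence({"G", "A"}, {"M"})
print(conf_ga_m, float(conf_ga_m))
```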
(g) Consider the decagon, which has 10 sides. Three sides are marked 1, two
sides are marked 2, one side is marked 3, two sides are marked 4, and two
sides are marked 5. Draw a graph representing the occurrence of each mark
versus its probability.
[Evaluation Scheme] Full mark for the correct answer. No step-wise mark
to be awarded.
[Solution]
Probability(side marked as 1) = 3 / 10 = 0.3
Probability(side marked as 2) = 2/10 = 0.2
Probability(side marked as 3) = 1 / 10 = 0.1
Probability(side marked as 4) = 2/10 = 0.2
Probability(side marked as 5) = 2/10 = 0.2
The graph looks as follows:
[Figure: graph of each mark (1 to 5) versus its probability]

(h) Consider the following dataset. Consider support count is represented with
SC. Calculate (SC({E}) + SC({A, B}) + SC({C, D})) / (SC({A, B, C, E})
+ SC({A, B, C, D, E}))
Transaction Itemset
T1 A, B
T2 B, D
T3 B, C
T4 A, B, D
T5 A, C
T6 B, C
T7 A, B, C, E

[Evaluation Scheme] Full mark for the correct answer. No step-wise mark
to be awarded.
[Solution]
SC({E}) = 1, SC({A, B}) = 3, SC({C, D}) = 0, SC({A, B, C, E}) = 1, and
SC({A, B, C, D, E}) = 0
Numerator = (SC({E}) + SC({A, B}) + SC({C, D})) = 1 + 3 + 0 = 4
Denominator = SC({A, B, C, E}) + SC({A, B, C, D, E}) = 1 + 0 = 1
Therefore, (SC({E}) + SC({A, B}) + SC({C, D})) / (SC({A, B, C, E}) +
SC({A, B, C, D, E})) = 4/1 = 4
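The support counts can be verified with a short sketch; `support_count` is an illustrative helper name.

```python
transactions = {
    "T1": {"A", "B"},
    "T2": {"B", "D"},
    "T3": {"B", "C"},
    "T4": {"A", "B", "D"},
    "T5": {"A", "C"},
    "T6": {"B", "C"},
    "T7": {"A", "B", "C", "E"},
}


def support_count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for items in transactions.values() if set(itemset) <= items)


numerator = support_count("E") + support_count("AB") + support_count("CD")
denominator = support_count("ABCE") + support_count("ABCDE")
result = numerator / denominator
print(numerator, denominator, result)
```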
(i) A bloom filter with a size of 1000 slots is used to store the information of
100 data stream items using 4 hash functions. Calculate the false positive
probability.
[Evaluation Scheme] Full mark for the correct answer. No step-wise mark
to be awarded.
[Solution]
n = size of bloom filter = 1000
m = number of expected elements to be inserted = 100
k = number of hash functions = 4

False positive probability = (1 - e^(-km/n))^k
e^(-km/n) = (1/2.718)^(4*100/1000) = 0.670. So, (1 - 0.670)^4 = 0.330^4 ≈ 0.012
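A minimal sketch of the standard approximation (1 - e^(-km/n))^k, using the solution's naming (n = number of slots, m = number of inserted elements, k = number of hash functions); the function name is illustrative.

```python
import math


def false_positive_probability(n_slots, m_items, k_hashes):
    """Approximate Bloom filter false positive rate: (1 - e^(-k*m/n))^k."""
    fraction_zero = math.exp(-k_hashes * m_items / n_slots)  # share of slots still 0
    return (1 - fraction_zero) ** k_hashes


p = false_positive_probability(n_slots=1000, m_items=100, k_hashes=4)
print(f"{p:.4f}")
```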
(j) What is the probability that a slot is hashed in a bloom filter where n is the
size and k is the number of hash functions?
[Evaluation Scheme] Full mark for the correct answer. No step-wise mark
to be awarded.
[Solution]
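Probability that a given slot is hashed by one hash function = 1/n, so the
probability that it is not hashed by any of the k hash functions is
(1 - 1/n)^k. Hence the probability that the slot is hashed by at least one
of the k hash functions is 1 - (1 - 1/n)^k, assuming independent and
uniform hash functions.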

2. (a) Consider the following dataset. Draw the MapReduce process to find the number
of customers from each city followed by each state, both in the chronological
order.
ID   Name                 City        State
1    Sujay Lila           Ambikapur   Chhattisgarh
2    Geetha Choudhary     Bhilai      Chhattisgarh
3    Anandi D'Cruz        Bilaspur    Chhattisgarh
4    Surendra Nagarkar    Cuttack     Odisha
5    Balwinder Nagarkar   Bangalore   Karnataka
6    Nitin Nibhanupudi    Mangalore   Karnataka
7    Dinesh Sharma        Cuttack     Odisha
8    Raj Chaudhri         Bilaspur    Chhattisgarh
9    Govind Kumar         Mysore      Karnataka
10   Jayanta Begam        Ambikapur   Chhattisgarh
[Evaluation Scheme] Full mark for the correct answer. 2 marks for city
and 2 marks for state. Step-wise mark can be awarded based on the partial
correctness of the solution.
[Solution]
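MapReduce process to find the number of customers from each city:
[Figure: MapReduce process for the count of customers per city]
MapReduce process to find the number of customers from each state:
[Figure: MapReduce process for the count of customers per state]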
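The two jobs can be sketched in plain Python as a map phase that emits (key, 1) pairs, a shuffle phase that groups the pairs by key, and a reduce phase that sums each group. This is a single-process illustration, not a Hadoop job; the function names are hypothetical, and sorting the keys alphabetically during the shuffle is an assumption about the ordering asked for in the question.

```python
from collections import defaultdict

CUSTOMERS = [
    (1, "Sujay Lila", "Ambikapur", "Chhattisgarh"),
    (2, "Geetha Choudhary", "Bhilai", "Chhattisgarh"),
    (3, "Anandi D'Cruz", "Bilaspur", "Chhattisgarh"),
    (4, "Surendra Nagarkar", "Cuttack", "Odisha"),
    (5, "Balwinder Nagarkar", "Bangalore", "Karnataka"),
    (6, "Nitin Nibhanupudi", "Mangalore", "Karnataka"),
    (7, "Dinesh Sharma", "Cuttack", "Odisha"),
    (8, "Raj Chaudhri", "Bilaspur", "Chhattisgarh"),
    (9, "Govind Kumar", "Mysore", "Karnataka"),
    (10, "Jayanta Begam", "Ambikapur", "Chhattisgarh"),
]


def map_phase(records, field):
    """Emit one (key, 1) pair per record; field is the tuple index of the key."""
    return [(record[field], 1) for record in records]


def shuffle_phase(pairs):
    """Group the emitted values by key and order the keys alphabetically."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Sum the list of ones collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_by(records, field):
    return reduce_phase(shuffle_phase(map_phase(records, field)))


city_counts = count_by(CUSTOMERS, 2)
state_counts = count_by(CUSTOMERS, 3)
print(city_counts)
print(state_counts)
```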

(b) A retail company wants to enhance their customer experience by analysing
the customer reviews for different products, so that they can inform the
corresponding vendors and manufacturers about the product defects and
shortcomings. You have been tasked to analyse the complaints filed under
each product & the total number of complaints filed based on the
geography, type of product, etc. You also have to figure out the complaints
which have no timely response. Discuss and then model your views
concerning descriptive, diagnostic and predictive analytics.
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
Descriptive analytics model – the model should use historical and current
data to answer the question “what has happened” using data analytics
techniques such as box plots. A few examples are as follows:
(1) Find the number of complaints by geography and type of product.
(2) Which geography contributed the maximum number of negative
review comments?
(3) Which product type has the maximum number of positive review
comments?
Diagnostic analytics model – the model should use historical and current
data to answer the question “why did it happen” using data analytics
techniques such as drill-through and root cause analysis with a fishbone
diagram. A few examples are as follows:
(1) Why is the number of complaints high for the Asian geography and
food and beverages products?
(2) Why did males in the Asian geography provide the maximum number
of negative review comments?
(3) Which features of the beauty care product type explain why it
collected the maximum number of positive review comments?
Predictive analytics model – the model should use historical and current
data to answer the question “what will happen in the future” using data
analytics techniques such as regression, clustering and classification. A
few examples are as follows:
(1) What would be the total number of complaints for the Asian geography
and food and beverages products by the end of this quarter?
(2) What is the expected number of negative review comments from the
Asian geography by the end of this month?
(3) What would be the sales by the end of this year?

3. (a) In the population, the average IQ is 100 with a standard deviation of 15. A
team of scientists wants to test a new medication to see if it has either a
positive or negative effect on intelligence or no effect at all. A sample of
30 participants who have taken the medication has a mean of 140. Using
hypothesis testing, find the answer to the question i.e., did the medication
affect intelligence? The z value (i.e., critical value) from statistical table is
found to be 1.96. The solution must mention the null (H0) and alternative
hypotheses (Ha).
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
Step 1: Set up the null and alternative hypotheses
H0: the medication does not affect intelligence (µ = 100)
Ha: the medication affects intelligence (µ ≠ 100)
Step 2: Determine the type of test to use
Since the population standard deviation is known and the sample size is
30, the z-test is used.
Step 3: Calculate the test statistic z using the formula
z = (x̄n - µ0) / (σ / √n)
where x̄n is the sample mean, µ0 is the population mean under the null
hypothesis, σ is the population standard deviation, and n is the sample
size. Using the data given in the question:
µ0 = 100, σ = 15, n = 30, x̄n = 140
Plugging the values into the formula: z = ((140 - 100) / 15) * √30 = 14.606
Step 4: The critical z value is provided in the question (1.96), so there is
no need to look it up in the z table.
Step 5: Draw the conclusion
The calculated test statistic is greater than the critical value obtained
from statistical tables (i.e., 14.606 > 1.96). Therefore the null hypothesis
is rejected. This means that the medication does affect intelligence.
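The test statistic can be recomputed with a minimal sketch of a two-sided one-sample z-test, using the figures from the question; the variable names are illustrative.

```python
import math

mu0 = 100       # population mean IQ
sigma = 15      # population standard deviation
n = 30          # sample size
x_bar = 140     # sample mean after medication
critical = 1.96 # two-sided critical value given in the question

z = (x_bar - mu0) / (sigma / math.sqrt(n))

# Two-sided test: compare |z| with the critical value
mean_differs_from_100 = abs(z) > critical
print(f"z = {z:.3f}, mean differs from 100: {mean_differs_from_100}")
```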
(b) Find the relationship of salary between millennials (between the ages of
18 and 34), gen X (between the ages of 35 and 50) and baby boomers
(aged 51 and above) in the sample below by plotting multiple boxplots in
one graph.
Gender Age Salary
Male 20 81600
Female 55 61600
Male 38 64300
Female 25 71900
Male 58 76300
Male 45 68200
Female 30 60900
Female 49 78600
Male 60 81700
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
The data points (i.e., salary) for millennials = {81600, 71900, 60900}
The data points (i.e., salary) for gen X = {64300, 68200, 78600}
The data points (i.e., salary) for baby boomers = {61600, 76300, 81700}

Plotting multiple boxplots in one graph means drawing the millennials, gen
X and baby boomers boxplots side by side in the same graphic.
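A minimal sketch that groups the sample by generation and computes the minimum, median and maximum salary that each box would display. Quartiles are left out because, with only three points per group, their values depend on the interpolation method chosen. The function name `generation` is illustrative.

```python
from statistics import median

ROWS = [
    ("Male", 20, 81600), ("Female", 55, 61600), ("Male", 38, 64300),
    ("Female", 25, 71900), ("Male", 58, 76300), ("Male", 45, 68200),
    ("Female", 30, 60900), ("Female", 49, 78600), ("Male", 60, 81700),
]


def generation(age):
    """Map an age to its generation label (the sample has no one under 18)."""
    if 18 <= age <= 34:
        return "millennials"
    if 35 <= age <= 50:
        return "gen X"
    return "baby boomers"


groups = {}
for _gender, age, salary in ROWS:
    groups.setdefault(generation(age), []).append(salary)

summary = {
    name: (min(salaries), median(salaries), max(salaries))
    for name, salaries in groups.items()
}
for name, (low, mid, high) in summary.items():
    print(f"{name}: min={low}, median={mid}, max={high}")
```

The three salary lists in `groups` are what would be handed to a plotting library to draw the side-by-side boxplots.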

4. (a) A consumer electronics company has adopted an aggressive policy to
increase sales of a newly launched product. The company has invested in
advertisements as well as employed salesmen for increasing sales rapidly.
Below dataset presents the sales, the number of employed salesmen, and
advertisement expenditure for 4 randomly selected months. Develop a
regression model to predict the impact of advertisement and the number of
salesmen on sales.
Month No 1 2 3 4
Sales 5000 5200 5700 6300
Salesmen 25 35 15 27
Advertisement 180 250 150 240
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
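One way such a model could be fitted is ordinary least squares on Sales = b0 + b1 * Salesmen + b2 * Advertisement, obtained by solving the normal equations (X'X) b = X'y. The sketch below is an assumption about the approach rather than a reproduction of the marking scheme's working; `solve` is a small illustrative Gauss-Jordan routine that uses exact fractions to avoid rounding.

```python
from fractions import Fraction

sales = [5000, 5200, 5700, 6300]
salesmen = [25, 35, 15, 27]
advertisement = [180, 250, 150, 240]


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    size = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for r in range(size):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[-1] for row in m]


# Design matrix rows: (1, salesmen, advertisement)
rows = [(1, s, a) for s, a in zip(salesmen, advertisement)]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * y for r, y in zip(rows, sales)) for i in range(3)]

b0, b1, b2 = solve(xtx, xty)


def predict(n_salesmen, advert):
    return b0 + b1 * n_salesmen + b2 * advert


print(f"Sales = {float(b0):.2f} + {float(b1):.2f} * Salesmen "
      f"+ {float(b2):.2f} * Advertisement")
```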

(b) Explain non-linear regression with a suitable example. Subsequently,
state the second degree (quadratic), third degree (cubic) and n degree
polynomial mathematical models. In general, what techniques are
applied to determine the right degree of the model?
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
In the case of linear and multiple linear regression, the dependent variable
is linearly dependent on the independent variable(s). In several
situations, however, the relationship is not so simple, and the two
variables might be related in a non-linear way. This may be the case when
the results of the correlation analysis show no linear relationship even
though the variables are still closely related. If the data analysis shows
that there is a non-linear (also known as curvilinear) association between
the two variables, then a non-linear regression model needs to be developed.
Imagine a dataset whose scatter plot looks as follows:
[Figure: scatter plot showing a non-linear pattern]
Non-linear data can be handled in 2 ways:
- Use a polynomial rather than a linear regression model.
- Transform the data and then use a linear regression model.
The polynomial mathematical models are represented below:
Second degree: y = β0 + β1 x + β2 x^2 + e
Third degree: y = β0 + β1 x + β2 x^2 + β3 x^3 + e
n degree: y = β0 + β1 x + β2 x^2 + β3 x^3 + ... + βn x^n + e
To determine the right degree of the model, 2 approaches are followed:
Forward Selection: This method increases the degree step by step until the
model is significant enough to define the best possible model.
Backward Elimination: This method decreases the degree step by step until
the model is significant enough to define the best possible model.

5. (a) Consider the following dataset consisting of 6 observations that depicts
automobile battery sales. Using Simple Exponential Smoothing, calculate
the forecasted value of month 7 by calculating the smoothed observation
(S_t) for each month and the mean of the squared errors. The smoothing
constant is 0.5 and the S_1 value is 20.
Month No Actual
1 20
2 22
3 21
4 18
5 17
6 23
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
S_t = α * Y_(t-1) + (1 - α) * S_(t-1), where α = 0.5 and S_1 = 20

S_2 = 0.5 * 20 + 0.5 * 20 = 10 + 10 = 20
S_3 = 0.5 * 22 + 0.5 * 20 = 11 + 10 = 21
S_4 = 0.5 * 21 + 0.5 * 21 = 10.5 + 10.5 = 21
S_5 = 0.5 * 18 + 0.5 * 21 = 9 + 10.5 = 19.5
S_6 = 0.5 * 17 + 0.5 * 19.5 = 8.5 + 9.75 = 18.25
S_7 = 0.5 * 23 + 0.5 * 18.25 = 11.5 + 9.125 = 20.625

Month   Actual (Y_t)   Forecast (S_t)   Error   Squared error
1       20             20               0       0
2       22             20               2       4
3       21             21               0       0
4       18             21               -3      9
5       17             19.5             -2.5    6.25
6       23             18.25            4.75    22.5625
7                      20.625

Sum of squared errors = 0 + 4 + 0 + 9 + 6.25 + 22.5625 = 41.8125
Mean squared error = 41.8125 / 6 = 6.969
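The recurrence S_t = α * Y_(t-1) + (1 - α) * S_(t-1) can be run directly as a check on the hand calculation. A minimal sketch, with illustrative variable names, that treats S_1 = 20 as given and averages the squared errors over the 6 observed months:

```python
actual = [20, 22, 21, 18, 17, 23]   # Y_1 .. Y_6
alpha = 0.5
smoothed = [20.0]                   # S_1 is given

# Each observed Y_(t-1) produces the next smoothed value S_t, up to S_7
for y_prev in actual:
    smoothed.append(alpha * y_prev + (1 - alpha) * smoothed[-1])

errors = [y - s for y, s in zip(actual, smoothed)]   # months 1 to 6
squared_errors = [e ** 2 for e in errors]
mse = sum(squared_errors) / len(squared_errors)
forecast_month_7 = smoothed[6]

for month, s in enumerate(smoothed, start=1):
    print(f"S_{month} = {s}")
print(f"MSE = {mse:.3f}, forecast for month 7 = {forecast_month_7}")
```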

(b) Consider the following dataset capturing monthly sales of actual vs.
predicted of an Indian B2C (business to customer) firm. The sales figures
are in lakh and presented in INR.

Month No 1 2 3 4
Actual 112 113 122 120
Predicted 113 115 121 119
As a data consultant, the B2C firm hires you for the following and you
need to justify your response.
(1) Determine the hybrid error, where the hybrid error is defined as 0.3 *
MSE + 0.25 * RMSE.
(2) Determine MAPE.
[Evaluation Scheme] Full mark for the correct answer. 2 marks for hybrid
error and rest 2 marks for MAPE calculation. Step-wise mark can be
awarded based on the partial correctness of the solution.
[Solution]
The hybrid error calculation is as follows:
Month No 1 2 3 4
Actual 112 113 122 120
Predicted 113 115 121 119
Error -1 -2 1 1
Squared Error 1 4 1 1
Sum of Squared Errors = 1 + 4 + 1 + 1 = 7
Mean Square Error (MSE) = 7 / 4 = 1.75
Root Mean Square Error (RMSE) = √MSE = √1.75 = 1.323
So, hybrid error = 0.3 * 1.75 + 0.25 * 1.323 = 0.856

The MAPE calculation is as follows:
Month No                          1        2        3        4
Actual                            112      113      122      120
Predicted                         113      115      121      119
| Predicted - Actual |            1        2        1        1
| Predicted - Actual | / Actual   0.0089   0.0177   0.0082   0.0083

SUM(| Predicted - Actual | / Actual) = 0.0089 + 0.0177 + 0.0082 + 0.0083 = 0.0431
MAPE = (100/4) * 0.0431 ≈ 1.08 (in percent)
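Both error measures can be recomputed without intermediate rounding. A minimal sketch with illustrative variable names, taking the error as Actual - Predicted as in the table:

```python
import math

actual = [112, 113, 122, 120]
predicted = [113, 115, 121, 119]

errors = [a - p for a, p in zip(actual, predicted)]
mse = sum(e ** 2 for e in errors) / len(errors)
rmse = math.sqrt(mse)
hybrid_error = 0.3 * mse + 0.25 * rmse

# MAPE: mean of |Predicted - Actual| / Actual, expressed in percent
mape = 100 / len(actual) * sum(abs(p - a) / a for a, p in zip(actual, predicted))

print(f"MSE = {mse}, RMSE = {rmse:.4f}, hybrid error = {hybrid_error:.4f}")
print(f"MAPE = {mape:.4f} %")
```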

6. (a) Consider the following transactional data in which minimum support is 2
and minimum confidence is 50%. Find frequent itemsets and generate
association rules for them by illustrating it with step-by-step process.
Transactions List of items
T1 I1, I2, I3
T2 I2, I3, I4
T3 I4, I5
T4 I1, I2, I4
T5 I1, I2, I3, I5
T6 I1, I2, I3, I4
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
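A level-wise (Apriori-style) search is one way to carry out the step-by-step process: count 1-itemsets, keep those meeting the minimum support count, join the survivors into larger candidates, prune candidates that have an infrequent subset, and repeat. The sketch below is a minimal illustration under that assumption; the function names are hypothetical and it is not presented as the marking scheme's own working.

```python
from itertools import combinations

TRANSACTIONS = {
    "T1": {"I1", "I2", "I3"},
    "T2": {"I2", "I3", "I4"},
    "T3": {"I4", "I5"},
    "T4": {"I1", "I2", "I4"},
    "T5": {"I1", "I2", "I3", "I5"},
    "T6": {"I1", "I2", "I3", "I4"},
}
MIN_SUPPORT = 2       # minimum support count
MIN_CONFIDENCE = 0.5  # 50%


def support_count(itemset):
    return sum(1 for items in TRANSACTIONS.values() if itemset <= items)


def frequent_itemsets():
    """Return {itemset: support count} for every frequent itemset."""
    items = sorted({i for t in TRANSACTIONS.values() for i in t})
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i])) >= MIN_SUPPORT]
    found, k = {}, 1
    while level:
        for itemset in level:
            found[itemset] = support_count(itemset)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent
        level = [c for c in candidates
                 if all(frozenset(s) in found for s in combinations(c, k - 1))
                 and support_count(c) >= MIN_SUPPORT]
    return found


def association_rules(found):
    """Return (antecedent, consequent, confidence) for rules meeting the threshold."""
    rules = []
    for itemset, count in found.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                confidence = count / found[antecedent]
                if confidence >= MIN_CONFIDENCE:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


frequent = frequent_itemsets()
rules = association_rules(frequent)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```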

(b) Consider the following dataset.
Basket Product 1 Product 2 Product 3
1 Milk Cheese
2 Milk Apple Cheese
3 Apple Banana
4 Milk Cheese
5 Apple Banana
6 Milk Cheese Banana
Calculate Support, Confidence and Lift for the followings:
(1) Apple, Milk
(2) (Apple, Milk) => Cheese
(3) Milk => Cheese
(4) (Apple, Cheese) => Milk
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]

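The support formula written out would look something like:
Support(A) = (number of baskets containing A) / (total number of baskets)
The confidence formula written out would look something like:
Confidence(A => B) = Support(A and B) / Support(A)
The lift formula written out would look something like:
Lift(A => B) = Confidence(A => B) / Support(B)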
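These three measures can be evaluated on the six baskets with a short sketch. It assumes the standard definitions (support as a fraction of baskets, confidence as support of the whole rule over support of the antecedent, lift as confidence over support of the consequent); the function names are illustrative.

```python
from fractions import Fraction

BASKETS = [
    {"Milk", "Cheese"},
    {"Milk", "Apple", "Cheese"},
    {"Apple", "Banana"},
    {"Milk", "Cheese"},
    {"Apple", "Banana"},
    {"Milk", "Cheese", "Banana"},
]


def support(items):
    """Fraction of baskets containing every item in items."""
    return Fraction(sum(1 for b in BASKETS if set(items) <= b), len(BASKETS))


def confidence(antecedent, consequent):
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)


print("Support(Apple, Milk) =", support({"Apple", "Milk"}))
for lhs, rhs in [({"Apple", "Milk"}, {"Cheese"}),
                 ({"Milk"}, {"Cheese"}),
                 ({"Apple", "Cheese"}, {"Milk"})]:
    print(sorted(lhs), "=>", sorted(rhs),
          "support", support(lhs | rhs),
          "confidence", confidence(lhs, rhs),
          "lift", lift(lhs, rhs))
```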

7. (a) Consider the following hypothetical dataset concerning student
characteristics and whether or not each student should be hired. Use the
Naive Bayes Classifier to determine whether or not someone with poor GPA
and lots of effort should be hired.
Name GPA Effort Hirable?
Sarah Poor Lots Yes
Dana Average Some No
Alex Average Some No
Annie Average Some Yes
Emily Excellent Lots Yes
Pete Excellent Lots No
John Excellent Lots No
Kathy Poor Some No
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
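A Naive Bayes score for each class is the class prior multiplied by the conditional probability of each observed feature value given that class; the class with the larger score is predicted. The sketch below is a minimal illustration that estimates the probabilities by counting, without smoothing (an assumption), and uses exact fractions; `class_score` is a hypothetical helper name.

```python
from fractions import Fraction

# (GPA, Effort, Hirable?)
DATA = [
    ("Poor", "Lots", "Yes"),        # Sarah
    ("Average", "Some", "No"),      # Dana
    ("Average", "Some", "No"),      # Alex
    ("Average", "Some", "Yes"),     # Annie
    ("Excellent", "Lots", "Yes"),   # Emily
    ("Excellent", "Lots", "No"),    # Pete
    ("Excellent", "Lots", "No"),    # John
    ("Poor", "Some", "No"),         # Kathy
]


def class_score(gpa, effort, label):
    """P(label) * P(gpa | label) * P(effort | label), estimated by counting."""
    rows = [r for r in DATA if r[2] == label]
    prior = Fraction(len(rows), len(DATA))
    p_gpa = Fraction(sum(1 for r in rows if r[0] == gpa), len(rows))
    p_effort = Fraction(sum(1 for r in rows if r[1] == effort), len(rows))
    return prior * p_gpa * p_effort


score_yes = class_score("Poor", "Lots", "Yes")
score_no = class_score("Poor", "Lots", "No")
prediction = "Yes" if score_yes > score_no else "No"
posterior_yes = score_yes / (score_yes + score_no)  # normalised probability of Yes

print("score(Yes) =", score_yes, " score(No) =", score_no)
print("prediction:", prediction, " P(Yes | Poor, Lots) =", posterior_yes)
```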

(b) Demonstrate a step-by-step process of agglomerative hierarchical
clustering with the following dataset. In addition, illustrate the merges
with a dendrogram (keep the threshold as 5). Use Manhattan distance for
the construction of the matrix.
Roll Mark
1 80
2 90
3 65
4 75
5 95
6 55
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]

8. (a) Design an optimised algorithm for the updation of an element in a Bloom
filter.
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
The steps of the optimised algorithm are as follows:
1. Clear the bloom filter.
2. Insert all the elements into the bloom filter except the element to
be updated.
3. Insert the updated value into the bloom filter.
The insert function code is as follows.
insert(e)
begin
    /* Loop over all k hash functions */
    for j : 1 . . . k do
        m ← hj(e)      // apply the jth hash function on e
        Bm ← bf[m]     // retrieve val at mth pos from Bloom filter bf
        if Bm == 0 then
            /* Bloom filter had zero bit at index m */
            bf[m] ← 1
        end if
    end for
end

The clear function code is as follows.
clear()
begin
    for i : 1 . . . n do   // n is the size of the bloom filter
        bf[i] ← 0
    end for
end
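The clear-and-reinsert steps can be sketched as runnable Python. This is a minimal illustration, not a production design: the class and function names are hypothetical, the two hash functions are arbitrary stand-ins chosen only for the demonstration, and the caller is assumed to keep the list of stored elements, since a Bloom filter cannot enumerate its contents.

```python
class BloomFilter:
    def __init__(self, size, hash_functions):
        self.size = size
        self.hash_functions = hash_functions
        self.bits = [0] * size

    def clear(self):
        """Reset every slot to 0."""
        self.bits = [0] * self.size

    def insert(self, element):
        """Set the slot selected by each hash function."""
        for h in self.hash_functions:
            self.bits[h(element) % self.size] = 1

    def might_contain(self, element):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[h(element) % self.size] for h in self.hash_functions)


def update(bloom, elements, old, new):
    """Replace old with new: clear, reinsert the rest, then insert new."""
    kept = [e for e in elements if e != old]
    bloom.clear()
    for e in kept:
        bloom.insert(e)
    bloom.insert(new)
    return kept + [new]


# Stand-in hash functions, assumed for this demonstration only
bf = BloomFilter(11, [lambda x: x, lambda x: x // 11])
stored = [25, 15, 35]
for value in stored:
    bf.insert(value)

stored = update(bf, stored, old=15, new=18)
print(stored, bf.bits)
```

Clearing first is what makes the update safe: resetting only the bits of the old element could erase bits shared with elements that remain in the filter.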
(b) Consider a Bloom Filter of size 11, with integers as stream elements and
two hash functions as follows:
- H1(x) = take the odd-numbered bits from the right in the binary
representation of x. Subsequently, treat them as an integer i; the result
is i modulo 11.
- H2(x) = same, but take the even-numbered bits.

(1) Find the filter after the insertion of elements 25, 15 and 35.
(2) Check whether the element y=18 exists in the bloom filter or not. Is it
the case of False Positive or False Negative? Explain.
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
should be awarded based on the partial correctness of the solution.
[Solution]
Step 1: Initialization of the bloom filter (slots 0 to 10)
0 0 0 0 0 0 0 0 0 0 0
Step 2:
- Insertion of 25:
(25)_10 = (11001)_2
Taking the odd-numbered bits from the right of (11001)_2 gives 101, which
is treated as the integer i = 101. So H1(25) = 101 mod 11 = 2
Taking the even-numbered bits from the right of (11001)_2 gives 01, i.e.
i = 1. So H2(25) = 1 mod 11 = 1
The revised bloom filter is as follows:
0 1 1 0 0 0 0 0 0 0 0
- Insertion of 15:
(15)_10 = (1111)_2
Taking the odd-numbered bits from the right of (1111)_2 gives 11. So
H1(15) = 11 mod 11 = 0
Taking the even-numbered bits from the right of (1111)_2 gives 11. So
H2(15) = 11 mod 11 = 0
The revised bloom filter is as follows:
1 1 1 0 0 0 0 0 0 0 0
- Insertion of 35:
(35)_10 = (100011)_2
Taking the odd-numbered bits from the right of (100011)_2 gives 100. So
H1(35) = 100 mod 11 = 1
Taking the even-numbered bits from the right of (100011)_2 gives 101. So
H2(35) = 101 mod 11 = 2
The revised bloom filter is as follows:
1 1 1 0 0 0 0 0 0 0 0
Step 3:
Membership test of 18
(18)_10 = (10010)_2
Taking the odd-numbered bits from the right of (10010)_2 gives 001, i.e.
i = 1. So H1(18) = 1 mod 11 = 1
Taking the even-numbered bits from the right of (10010)_2 gives 10. So
H2(18) = 10 mod 11 = 10
Since slot 10 of the bloom filter is 0, it is concluded that 18 definitely
does not exist in the bloom filter. This is neither a false positive nor a
false negative: a bloom filter can report false positives but never false
negatives, and here it correctly reports an element that was never inserted.
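The worked steps can be replayed in code. The sketch below deliberately mirrors the convention used in the solution, where the selected bits are written in the order they are picked from the right and the resulting digit string is read as an ordinary integer before taking modulo 11 (for example 101 mod 11 = 2). That reading is taken from the worked answer, not from a general definition of these hash functions; `picked_bits` is an illustrative helper name.

```python
SIZE = 11


def picked_bits(x, start):
    """Bits of x taken from the right: start=0 gives positions 1, 3, 5, ...;
    start=1 gives positions 2, 4, 6, ... (written in picking order)."""
    from_right = bin(x)[2:][::-1]
    return from_right[start::2]


def h1(x):
    # Digit string read as an integer, following the worked solution
    return int(picked_bits(x, 0) or "0") % SIZE


def h2(x):
    return int(picked_bits(x, 1) or "0") % SIZE


bloom = [0] * SIZE
for element in (25, 15, 35):
    bloom[h1(element)] = 1
    bloom[h2(element)] = 1

y = 18
present = bool(bloom[h1(y)] and bloom[h2(y)])
print(bloom)
print(f"h1({y}) = {h1(y)}, h2({y}) = {h2(y)}, possibly present: {present}")
```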
