IT-3006(DA)-CS_END_MAY_2023
SST = Sum of squares total = SSR + SSE = 65 + 22 = 87
R² = SSR / SST = 65/87 = 0.747, and such a value indicates a moderately
good fit of the model.
(d) A time series model is mathematically represented as Yt = f(Tt, St, Ct, It),
where Yt is the time series value at time t, and Tt, St, Ct, and It are the
trend, seasonal, cyclic and irregular component values at time t respectively.
Represent the model for each of the following cases:
(1) When the amplitude of the seasonal and irregular variations does not
change as the level of the trend rises or falls.
(2) When the amplitude of both the seasonal and irregular variations
increases as the level of the trend rises.
[Evaluation Scheme] Full mark for the correct answer. 0.5 mark for the 1st
part and 0.5 for the other part. No step-wise mark to be awarded.
[Solution]
(1) When the amplitude of the seasonal and irregular variations does not
change as the level of the trend rises or falls, the time series follows the
additive model, represented by Yt = Tt + St + Ct + It
(2) When the amplitude of both the seasonal and irregular variations
increases as the level of the trend rises, the time series follows the
multiplicative model, represented by Yt = Tt * St * Ct * It
(e) Suppose hierarchical clustering is to be applied to segment the students,
and the following sample has been collected. Create the proximity matrix for
the sample below. The mark is out of 20 in the mid-semester examination.
Roll No  Sex     Section  Mark
1        Male    CSE-1    10
2        Female  IT-1     17
3        Male    CSSE-1   18
4        Female  CSCE-1   20
[Evaluation Scheme] Full mark for the correct answer. No step-wise mark
to be awarded.
[Solution]
Since the student dataset has 4 observations, a 4 × 4 proximity matrix
is to be created, wherein the diagonal elements are 0 because the distance of
a point from itself is always 0. Applying the Euclidean distance formula to
the marks, the matrix looks as follows:
Roll No  1  2                  3                  4
1        0  √((10−17)²) = 7    √((10−18)²) = 8    √((10−20)²) = 10
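Only the first row of the matrix is shown above. As a minimal sketch, assuming (as in that row) that the distance is computed on the Mark column alone and that the categorical Sex and Section columns are ignored, the full symmetric matrix can be generated as follows. The variable names are illustrative.

```python
import math

# Marks of roll numbers 1 to 4, taken from the sample table.
marks = [10, 17, 18, 20]

# One-dimensional Euclidean distance between every pair of students.
# Assumption: only the Mark column contributes to the distance.
proximity = [
    [math.sqrt((a - b) ** 2) for b in marks]
    for a in marks
]

for roll_no, row in enumerate(proximity, start=1):
    print(roll_no, row)
```

The matrix is symmetric with a zero diagonal, so only the upper (or lower) triangle needs to be computed by hand.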
(f) Consider the following dataset, wherein TID represents the transaction ID
and G to O represent individual products. In the dataset, 1 indicates that
the transaction includes the specific product. For instance, TID 1
includes all products and TID 3 includes the M and O products. Calculate
Confidence({G, A} => {M}).
TID G A M
1 1 1 1
2 1 0 1
3 0 0 1
4 0 1 0
5 1 1 1
6 1 1 0
[Evaluation Scheme] Full mark for the correct answer. No step-wise mark
to be awarded.
[Solution]
Confidence({G, A} => {M}) = Support(G, A, M)/Support(G, A) = [2/6] /
[3/6] = 0.667
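The same calculation can be checked with a short sketch. The transactions are transcribed from the G, A and M columns of the table; the helper names `support` and `confidence` are illustrative.

```python
# Each transaction lists the products whose column holds a 1.
transactions = {
    1: {"G", "A", "M"},
    2: {"G", "M"},
    3: {"M"},
    4: {"A"},
    5: {"G", "A", "M"},
    6: {"G", "A"},
}

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Confidence(antecedent => consequent) = support(both) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

result = confidence({"G", "A"}, {"M"})
print(round(result, 3))
```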
(g) Consider a decagon, which has 10 sides. Three sides are marked 1, two
sides are marked 2, one side is marked 3, two sides are marked 4, and two
sides are marked 5. Draw a graph representing each mark versus its
probability of occurrence.
[Evaluation Scheme] Full mark for the correct answer. No step-wise mark
to be awarded.
[Solution]
Probability(side marked as 1) = 3/10 = 0.3
Probability(side marked as 2) = 2/10 = 0.2
Probability(side marked as 3) = 1/10 = 0.1
Probability(side marked as 4) = 2/10 = 0.2
Probability(side marked as 5) = 2/10 = 0.2
The graph looks as follows:
(h) Consider the following dataset. Let SC denote the support count.
Calculate (SC({E}) + SC({A, B}) + SC({C, D})) / (SC({A, B, C, E})
+ SC({A, B, C, D, E}))
Transaction Itemset
T1 A, B
T2 B, D
T3 B, C
T4 A, B, D
T5 A, C
T6 B, C
T7 A, B, C, E
[Evaluation Scheme] Full mark for the correct answer. No step-wise mark
to be awarded.
[Solution]
SC({E}) = 1, SC({A, B}) = 3, SC({C, D}) = 0, SC({A, B, C, E}) = 1, and
SC({A, B, C, D, E}) = 0
Numerator = (SC({E}) + SC({A, B}) + SC({C, D})) = 1 + 3 + 0 = 4
Denominator = SC({A, B, C, E}) + SC({A, B, C, D, E}) = 1 + 0 = 1
Therefore, (SC({E}) + SC({A, B}) + SC({C, D})) / (SC({A, B, C, E}) +
SC({A, B, C, D, E})) = 4/1 = 4
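A minimal sketch that reproduces the support counts from the transaction table; the function name `sc` simply mirrors the SC notation of the question.

```python
# Transactions T1 to T7 from the table.
transactions = [
    {"A", "B"},
    {"B", "D"},
    {"B", "C"},
    {"A", "B", "D"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "E"},
]

def sc(itemset):
    """Support count: number of transactions that contain the whole itemset."""
    return sum(1 for items in transactions if set(itemset) <= items)

numerator = sc("E") + sc("AB") + sc("CD")
denominator = sc("ABCE") + sc("ABCDE")
ratio = numerator / denominator
print(numerator, denominator, ratio)
```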
(i) A bloom filter with a size of 1000 slots is used to store the information of
100 data stream items using 4 hash functions. Calculate the false positive
probability.
[Evaluation Scheme] Full mark for the correct answer. No step-wise mark
to be awarded.
[Solution]
n = size of bloom filter = 1000
m = number of expected elements to be inserted = 100
k = number of hash functions = 4
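The remaining arithmetic can be sketched as follows. This assumes the commonly used approximation p ≈ (1 − e^(−k·m/n))^k, written with the symbols defined above (n slots, m inserted items, k hash functions); the function name is illustrative and the formula is an assumption about the intended method.

```python
import math

def false_positive_probability(n_slots, m_items, k_hashes):
    """Approximate Bloom filter false positive rate: (1 - e^(-k*m/n))^k."""
    return (1 - math.exp(-k_hashes * m_items / n_slots)) ** k_hashes

p = false_positive_probability(n_slots=1000, m_items=100, k_hashes=4)
print(round(p, 4))
```

Under this approximation the false positive probability comes out at roughly 0.0118, i.e. about 1.2%.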
2. (a) Consider the following dataset. Draw the MapReduce process to find the number
of customers from each city followed by each state, both in the chronological
order.
ID Name City State
1 Sujay Lila Ambikapur Chhattisgarh
2 Geetha Choudhary Bhilai Chhattisgarh
3 Anandi D'Cruz Bilaspur Chhattisgarh
4 Surendra Nagarkar Cuttack Odisha
5 Balwinder Nagarkar Bangalore Karnataka
6 Nitin Nibhanupudi Mangalore Karnataka
7 Dinesh Sharma Cuttack Odisha
8 Raj Chaudhri Bilaspur Chhattisgarh
9 Govind Kumar Mysore Karnataka
10 Jayanta Begam Ambikapur Chhattisgarh
[Evaluation Scheme] Full mark for the correct answer. 2 marks for city
and 2 marks for state. Step-wise mark can be awarded based on the partial
correctness of the solution.
[Solution]
MapReduce process to find the number of customers from each city:
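The process can be sketched as a small in-memory simulation of the map, shuffle/sort and reduce phases. This is not a real Hadoop job: the function names are illustrative, and sorting the keys alphabetically during the shuffle is an assumption about the ordering the question asks for.

```python
from collections import defaultdict

# (ID, Name, City, State) records from the dataset.
customers = [
    (1, "Sujay Lila", "Ambikapur", "Chhattisgarh"),
    (2, "Geetha Choudhary", "Bhilai", "Chhattisgarh"),
    (3, "Anandi D'Cruz", "Bilaspur", "Chhattisgarh"),
    (4, "Surendra Nagarkar", "Cuttack", "Odisha"),
    (5, "Balwinder Nagarkar", "Bangalore", "Karnataka"),
    (6, "Nitin Nibhanupudi", "Mangalore", "Karnataka"),
    (7, "Dinesh Sharma", "Cuttack", "Odisha"),
    (8, "Raj Chaudhri", "Bilaspur", "Chhattisgarh"),
    (9, "Govind Kumar", "Mysore", "Karnataka"),
    (10, "Jayanta Begam", "Ambikapur", "Chhattisgarh"),
]

def map_phase(records, field_index):
    """Map: emit a (key, 1) pair for the chosen field of every record."""
    return [(record[field_index], 1) for record in records]

def shuffle_phase(pairs):
    """Shuffle/sort: group the emitted values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Reduce: sum the list of ones for every key."""
    return {key: sum(values) for key, values in groups.items()}

city_counts = reduce_phase(shuffle_phase(map_phase(customers, 2)))
state_counts = reduce_phase(shuffle_phase(map_phase(customers, 3)))
print(city_counts)
print(state_counts)
```

The same three phases are reused for both questions; only the key emitted by the mapper changes from City to State.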
comments?
Diagnostic analytics model: the model should use historical and current
data to answer the question “why did it happen?” using data analytics
techniques such as drill-through and root cause analysis (e.g., with a
fishbone diagram). A few examples are as follows:
(1) Why is the number of complaints by Asian geography and food
and beverages product
(2) Why did males of the Asian geography provide the maximum number of
negative review comments?
(3) Why have the features of the beauty care product type
collected the maximum number of positive review comments?
Predictive analytics model: the model should use historical and current
data to answer the question “what will happen in the future?” using data
analytics techniques such as regression, clustering and classification. A
few examples are as follows:
(1) What would be the total number of complaints by Asian geography
and food and beverages product by the end of this quarter?
(2) What is the expected number of negative review comments from the
Asian geography by the end of this month?
(3) What would be the sales by the end of this year?
3. (a) In the population, the average IQ is 100 with a standard deviation of 15. A
team of scientists wants to test a new medication to see if it has either a
positive or negative effect on intelligence, or no effect at all. A sample of
30 participants who have taken the medication has a mean of 140. Using
hypothesis testing, answer the question: did the medication
affect intelligence? The z value (i.e., critical value) from the statistical
table is found to be 1.96. The solution must mention the null (H0) and
alternative (Ha) hypotheses.
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
Step 1: Set up the null and alternative hypotheses
H0: the medication does not affect intelligence (µ = 100)
Ha: the medication affects intelligence (µ ≠ 100)
Step 2: Determine the type of test to use
Since the population standard deviation is known and the sample size is 30,
the z-test is used. The effect may be positive or negative, so the test is
two-tailed.
Step 3: Calculate the test statistic z using the formula
z = (x̄n − µ0) / (σ / √n)
where x̄n is the sample mean, µ0 is the population mean under the null
hypothesis, σ is the population standard deviation, and n is the sample size.
Using the data given in the question:
µ0 = 100, σ = 15, n = 30, x̄n = 140
Plugging the values into the formula: ((140 − 100) / 15) * √30 = 14.606
Step 4: In the question, the critical z value is provided (1.96), and hence
there is no need to look it up in the z table.
Step 5: Draw the conclusion
The calculated test statistic z is more than the critical value
obtained from statistical tables (i.e., 14.606 > 1.96). Therefore the null
hypothesis is rejected. This means that the medication administered does
affect intelligence.
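The test can be verified with a few lines of code. This is a minimal sketch using the standard one-sample z statistic, z = (x̄ − µ0) / (σ / √n), and the two-tailed critical value 1.96 given in the question; the function and variable names are illustrative.

```python
import math

def z_statistic(sample_mean, mu0, sigma, n):
    """One-sample z statistic: (sample mean - hypothesised mean) / standard error."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

z = z_statistic(sample_mean=140, mu0=100, sigma=15, n=30)
critical_value = 1.96  # two-tailed critical value supplied in the question

# Reject H0 (no effect) when |z| exceeds the critical value.
reject_null = abs(z) > critical_value
print(round(z, 3), reject_null)
```

A rejected null hypothesis here corresponds to the conclusion that the medication has an effect on intelligence.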
(b) Find the relationship in salary between millennials (between the ages of
18 and 34), gen X (between the ages of 35 and 50) and baby boomers
(aged 51 and above) in the sample below by plotting multiple boxplots in
one graph.
Gender Age Salary
Male 20 81600
Female 55 61600
Male 38 64300
Female 25 71900
Male 58 76300
Male 45 68200
Female 30 60900
Female 49 78600
Male 60 81700
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
The data points (i.e., salary) for millennials = {81600, 71900, 60900}
The data points (i.e., salary) for gen X = {64300, 68200, 78600}
The data points (i.e., salary) for baby boomers = {61600, 76300, 81700}
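The grouping, and the minimum, median and maximum that each boxplot would display, can be sketched as follows. With only three points per group the quartiles depend on the convention used, so they are deliberately left out; the helper name `generation` is illustrative.

```python
import statistics

# (Gender, Age, Salary) rows from the sample.
sample = [
    ("Male", 20, 81600), ("Female", 55, 61600), ("Male", 38, 64300),
    ("Female", 25, 71900), ("Male", 58, 76300), ("Male", 45, 68200),
    ("Female", 30, 60900), ("Female", 49, 78600), ("Male", 60, 81700),
]

def generation(age):
    """Assign an age to the generation bands defined in the question."""
    if 18 <= age <= 34:
        return "millennials"
    if 35 <= age <= 50:
        return "gen X"
    return "baby boomers"

salaries = {}
for _gender, age, salary in sample:
    salaries.setdefault(generation(age), []).append(salary)

# Minimum, median and maximum salary per generation.
summary = {
    group: (min(values), statistics.median(values), max(values))
    for group, values in salaries.items()
}
for group, stats in summary.items():
    print(group, stats)
```

Comparing the medians gives a first reading of the boxplots: baby boomers have the highest median salary in this sample and gen X the lowest.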
4. (a) A consumer electronics company has adopted an aggressive policy to
increase sales of a newly launched product. The company has invested in
advertisements as well as employed salesmen for increasing sales rapidly.
Below dataset presents the sales, the number of employed salesmen, and
advertisement expenditure for 4 randomly selected months. Develop a
regression model to predict the impact of advertisement and the number of
salesmen on sales.
Month No 1 2 3 4
Sales 5000 5200 5700 6300
Salesmen 25 35 15 27
Advertisement 180 250 150 240
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
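One way to approach the problem is ordinary least squares on the model Sales = b0 + b1 * Salesmen + b2 * Advertisement. The sketch below is an assumption about the intended method, not the marking scheme's worked solution: it builds the normal equations (XᵀX) b = Xᵀy and solves them exactly with fractions. The helper `solve_linear_system` is illustrative.

```python
from fractions import Fraction

sales = [5000, 5200, 5700, 6300]
salesmen = [25, 35, 15, 27]
advertisement = [180, 250, 150, 240]

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination using exact fractions."""
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick a row with a non-zero entry in this column as the pivot.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]

# Design matrix with an intercept column.
X = [[1, s, a] for s, a in zip(salesmen, advertisement)]

# Normal equations: (X^T X) b = X^T y.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
xty = [sum(row[i] * y for row, y in zip(X, sales)) for i in range(3)]
b0, b1, b2 = solve_linear_system(xtx, xty)

residuals = [
    y - (b0 + b1 * s + b2 * a)
    for y, s, a in zip(sales, salesmen, advertisement)
]
print(float(b0), float(b1), float(b2))
```

With only four observations and three coefficients the fit has a single residual degree of freedom, so the coefficients should be read as an illustration of the method rather than a reliable estimate.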
(b) Explain non-linear regression with a suitable example. Subsequently,
state the second-degree (quadratic), third-degree (cubic) and nth-degree
polynomial mathematical models. In general, what techniques are
applied to determine the right degree of the model?
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
In the case of linear and multiple linear regression, the dependent variable
is linearly dependent on the independent variable(s). In several
situations, however, the relationship is not so simple, and the two
variables might be related in a non-linear way. This may be the case where
the results from the correlation analysis show no linear relationship but
the variables are still closely related. If the data analysis shows that
there is a non-linear (also known as curvilinear) association between the
two variables, then a non-linear regression model needs to be developed.
Imagine a dataset whose scatter plot looks as follows:
enough to define the best possible model.
Backward Elimination: This method decreases the degree until it is
significant enough to define the best possible model.
(b) Consider the following dataset capturing the actual vs. predicted
monthly sales of an Indian B2C (business-to-customer) firm. The sales
figures are in lakh INR.
Month No 1 2 3 4
Actual 112 113 122 120
Predicted 113 115 121 119
The B2C firm hires you as a data consultant for the following tasks, and
you need to justify your response.
(1) Determine the hybrid error, where the hybrid error is defined as 0.3 *
MSE + 0.25 * RMSE.
(2) Determine MAPE.
[Evaluation Scheme] Full mark for the correct answer. 2 marks for hybrid
error and rest 2 marks for MAPE calculation. Step-wise mark can be
awarded based on the partial correctness of the solution.
[Solution]
The hybrid error calculation is as follows:
Month No 1 2 3 4
Actual 112 113 122 120
Predicted 113 115 121 119
Error -1 -2 1 1
Squared Error 1 4 1 1
Sum of Square Error = 1 + 4 +1 + 1 = 7
Mean Square Error (MSE) = 7/ 4 = 1.75
Root Mean Square Error (RMSE) = √MSE = √1.75 = 1.323
So, hybrid error = 0.3 * 1.75 + 0.25 * 1.323 = 0.856
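Both requested metrics can be checked with a short sketch. The hybrid error follows the definition in the question; for MAPE the usual definition, the mean of |actual − predicted| / actual expressed as a percentage, is assumed.

```python
import math

actual = [112, 113, 122, 120]
predicted = [113, 115, 121, 119]
n = len(actual)

errors = [a - p for a, p in zip(actual, predicted)]

# Mean squared error and its square root.
mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)

# Hybrid error as defined in the question.
hybrid_error = 0.3 * mse + 0.25 * rmse

# Mean absolute percentage error (assumed standard definition), in percent.
mape = 100 / n * sum(abs(e) / a for e, a in zip(errors, actual))

print(round(hybrid_error, 3), round(mape, 3))
```

Under these definitions the MAPE is a little above 1%, which means the predictions deviate from the actual sales by about 1% on average.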
can be awarded based on the partial correctness of the solution.
[Solution]
(b) Consider the following dataset.
Basket Product 1 Product 2 Product 3
1 Milk Cheese
2 Milk Apple Cheese
3 Apple Banana
4 Milk Cheese
5 Apple Banana
6 Milk Cheese Banana
Calculate Support, Confidence and Lift for the following:
(1) Apple, Milk
(2) (Apple, Milk) => Cheese
(3) Milk => Cheese
(4) (Apple, Cheese) => Milk
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
The support formula written out would look something like:
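As a minimal sketch, the usual definitions are assumed here: support(X) = (number of baskets containing X) / (total number of baskets), confidence(X => Y) = support(X ∪ Y) / support(X), and lift(X => Y) = confidence(X => Y) / support(Y). The function names are illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Baskets 1 to 6 from the dataset.
baskets = [
    {"Milk", "Cheese"},
    {"Milk", "Apple", "Cheese"},
    {"Apple", "Banana"},
    {"Milk", "Cheese"},
    {"Apple", "Banana"},
    {"Milk", "Cheese", "Banana"},
]

def support(items):
    """Fraction of baskets that contain every item in items."""
    return Fraction(sum(1 for b in baskets if items <= b), len(baskets))

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """confidence(antecedent => consequent) / support(consequent)."""
    return confidence(antecedent, consequent) / support(consequent)

rules = [
    ({"Apple", "Milk"}, {"Cheese"}),
    ({"Milk"}, {"Cheese"}),
    ({"Apple", "Cheese"}, {"Milk"}),
]
print("support(Apple, Milk) =", support({"Apple", "Milk"}))
for lhs, rhs in rules:
    print(sorted(lhs), "=>", sorted(rhs),
          support(lhs | rhs), confidence(lhs, rhs), lift(lhs, rhs))
```

A lift above 1 indicates that the antecedent and consequent appear together more often than would be expected if they were independent.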
7. (a) Consider the following hypothetical dataset concerning student
characteristics and whether or not each student should be hired. Use the
Naive Bayes Classifier to determine whether or not someone with a poor GPA
and lots of effort should be hired.
Name GPA Effort Hirable?
Sarah Poor Lots Yes
Dana Average Some No
Alex Average Some No
Annie Average Some Yes
Emily Excellent Lots Yes
Pete Excellent Lots No
John Excellent Lots No
Kathy Poor Some No
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
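A minimal sketch of the Naive Bayes calculation for GPA = Poor and Effort = Lots. It uses plain relative frequencies without Laplace smoothing, which is an assumption about the intended method; the function name `score` is illustrative.

```python
from fractions import Fraction

# (Name, GPA, Effort, Hirable?) rows from the dataset.
students = [
    ("Sarah", "Poor", "Lots", "Yes"),
    ("Dana", "Average", "Some", "No"),
    ("Alex", "Average", "Some", "No"),
    ("Annie", "Average", "Some", "Yes"),
    ("Emily", "Excellent", "Lots", "Yes"),
    ("Pete", "Excellent", "Lots", "No"),
    ("John", "Excellent", "Lots", "No"),
    ("Kathy", "Poor", "Some", "No"),
]

def score(gpa, effort, label):
    """Unnormalised posterior: P(label) * P(gpa | label) * P(effort | label)."""
    in_class = [s for s in students if s[3] == label]
    prior = Fraction(len(in_class), len(students))
    p_gpa = Fraction(sum(1 for s in in_class if s[1] == gpa), len(in_class))
    p_effort = Fraction(sum(1 for s in in_class if s[2] == effort), len(in_class))
    return prior * p_gpa * p_effort

scores = {label: score("Poor", "Lots", label) for label in ("Yes", "No")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The class with the larger unnormalised score is the prediction; dividing each score by their sum gives the posterior probabilities.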
(b) Demonstrate the step-by-step process of agglomerative hierarchical
clustering with the following dataset. In addition, illustrate the merges
with a dendrogram (keep the threshold as 5). Use Manhattan distance for the
construction of the proximity matrix.
Roll Mark
1 80
2 90
3 65
4 75
5 95
6 55
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
8. (a) Design an optimised algorithm for updating an element in a Bloom
filter.
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
can be awarded based on the partial correctness of the solution.
[Solution]
The steps of the optimised algorithm are as follows:
1. Clear the bloom filter.
2. Insert all the elements into the bloom filter except the element to
be updated.
3. Insert the updated value into the bloom filter.
The insert function is as follows:
insert(e)
begin
/* Loop over all k hash functions */
for j : 1 . . . k do
m ← hj(e) // apply the j-th hash function to e
if bf[m] == 0 then
/* Bloom filter bf had a zero bit at index m */
bf[m] ← 1 // set the bit in the Bloom filter itself
end if
end for
end
(1) Find the filter after the insertion of elements 25, 15 and 35.
(2) Check whether the element y = 18 exists in the bloom filter or not. Is it
a case of a false positive or a false negative? Explain.
[Evaluation Scheme] Full mark for the correct answer. Step-wise mark
should be awarded based on the partial correctness of the solution.
[Solution]
Step 1: Initialization of the bloom filter (slots 0 to 10)
0 0 0 0 0 0 0 0 0 0 0
Step 2:
- Insertion of 25:
(25)10 = (11001)2
Taking the bits at odd positions from the right in (11001)2 gives 101, so
H1(101) = 101 mod 11 = 2
Taking the bits at even positions from the right in (11001)2 gives 1, so
H2(1) = 1 mod 11 = 1
The revised bloom filter is as follows:
0 1 1 0 0 0 0 0 0 0 0
- Insertion of 15:
(15)10 = (1111)2
Taking the bits at odd positions from the right in (1111)2 gives 11, so
H1(11) = 11 mod 11 = 0
Taking the bits at even positions from the right in (1111)2 gives 11, so
H2(11) = 11 mod 11 = 0
The revised bloom filter is as follows:
1 1 1 0 0 0 0 0 0 0 0
- Insertion of 35:
(35)10 = (100011)2
Taking the bits at odd positions from the right in (100011)2 gives 100, so
H1(100) = 100 mod 11 = 1
Taking the bits at even positions from the right in (100011)2 gives 101, so
H2(101) = 101 mod 11 = 2
The revised bloom filter is as follows:
1 1 1 0 0 0 0 0 0 0 0
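Step 3:
Membership test of 18
(18)10 = (10010)2
Taking the bits at odd positions from the right in (10010)2 gives 1, so
H1(1) = 1 mod 11 = 1
Taking the bits at even positions from the right in (10010)2 gives 10, so
H2(10) = 10 mod 11 = 10
Since slot 10 of the bloom filter is 0, it is concluded that 18 definitely
does not exist in the bloom filter. This is neither a false positive nor a
false negative: a bloom filter cannot produce false negatives, so a 0 bit
at any hashed slot guarantees that the element was never inserted.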
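The worked example can be reproduced with a short sketch. The hash functions are inferred from the arithmetic above and are therefore an assumption: the bits at odd (H1) or even (H2) positions, counted from the right, are written down in the order encountered, the resulting digit string is read as a decimal number, and the result is taken mod 11. The names `pick_bits`, `insert` and `might_contain` are illustrative.

```python
SIZE = 11  # number of slots in the filter, indexed 0 to 10

def pick_bits(value, start):
    """Bits of value at positions start, start+2, ... counted from the right (1 = LSB)."""
    bits_lsb_first = bin(value)[2:][::-1]
    return bits_lsb_first[start - 1::2]

def h1(value):
    # Odd-position bits, digit string read as a decimal number, mod 11.
    return int(pick_bits(value, 1)) % SIZE

def h2(value):
    # Even-position bits, digit string read as a decimal number, mod 11.
    return int(pick_bits(value, 2)) % SIZE

def insert(bf, value):
    """Set the slots addressed by both hash functions."""
    for h in (h1, h2):
        bf[h(value)] = 1

def might_contain(bf, value):
    """False means definitely absent; True means possibly present."""
    return all(bf[h(value)] == 1 for h in (h1, h2))

bloom = [0] * SIZE
for element in (25, 15, 35):
    insert(bloom, element)

print(bloom)
print(might_contain(bloom, 18))
```

Because a lookup returns False as soon as one addressed slot is 0, a negative answer is always correct; only positive answers can be wrong.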