MODEL EXAM II Answer Key - For Merge

Uploaded by

devi
Copyright
© © All Rights Reserved

MODEL EXAM II – SET 1

Sub. Name: Foundations of Data Science        Branch / Year / SEM: IT / II / III
Sub. Code: CS3352                             Date:
Duration: 3 hours                             Marks: 100

ANSWER ALL QUESTIONS

Part A (10x2=20 Marks)


1. What are the properties of the correlation coefficient r?
The two properties are:
• The sign of r indicates the type of linear relationship, whether positive or negative.
• The numerical value of r, without regard to sign, indicates the strength of the linear relationship.

2. What is regression?
• Regression is a statistical method for determining the relationship between a dependent variable
and one or more other variables, known as independent variables.
• A regression line is a line that describes the behavior of a set of data. It is used for
forecasting.

3. Give the least squares regression equation.

• Y′ = bX + a
• Y′ represents the predicted value.
• X represents the known value.
• a and b represent numbers calculated from the original correlation analysis.
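A minimal sketch of how a and b can be obtained, assuming the standard least squares formulas b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and a = Ȳ − bX̄. The data values and the helper name least_squares are hypothetical and used only for illustration:

```python
def least_squares(x, y):
    """Return (b, a) for the line Y' = bX + a fitted by least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares of the deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return b, a

# Hypothetical sample data (not taken from the exam)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b, a = least_squares(x, y)
predicted = b * 6 + a  # predicted Y' for a known X of 6
print(b, a, predicted)
```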

4. What is the interpretation of R²?

• The squared correlation coefficient R² provides us with not only a key interpretation of the
correlation coefficient but also a measure of predictive accuracy that supplements the standard
error of estimate, Sy|x.
• The coefficient of determination, often denoted as R-squared (R²), is a statistical measure that
represents the proportion of the variance in the dependent variable that is explained by the
independent variables in a regression model.
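The "proportion of variance explained" reading can be checked numerically. The sketch below assumes a simple linear regression fitted with np.polyfit on hypothetical data, computes R² as 1 − SS_residual / SS_total, and compares it with the square of the correlation coefficient:

```python
import numpy as np

# Hypothetical sample data (not taken from the exam)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Fit Y' = bX + a by least squares (degree-1 polynomial)
b, a = np.polyfit(x, y, 1)
y_pred = b * x + a

ss_res = np.sum((y - y_pred) ** 2)    # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in Y
r_squared = 1 - ss_res / ss_tot

# For simple linear regression, R² equals the squared correlation coefficient
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```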

5. Difference between correlation and regression.

Correlation                                Regression
1. Relationship between variables          1. One variable affects the other
2. Variables move together                 2. Cause and effect
3. X and Y can be interchanged             3. X and Y cannot be interchanged
4. Data represented by a single point      4. Data represented by a line

6. What are NumPy and Pandas?

• NumPy is a general-purpose array-processing package that provides a high-performance
multidimensional array object and tools for working with it. It is the fundamental package for
scientific computing with Python. It provides an N-dimensional array object and many sophisticated
(broadcasting) functions.
• Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the
NumPy package, and its key data structure is the DataFrame. DataFrames allow you to store and
manipulate tabular data in rows of observations and columns of variables. Because Pandas is built
on top of NumPy, much of NumPy’s structure is used or replicated in Pandas.

7. What are the attributes of a NumPy array?

• ndim: number of dimensions
• shape: size of each dimension
• size: total number of elements in the array
• dtype: data type of the array elements
• itemsize: size in bytes of each array element
• nbytes: total size in bytes of the array
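These attributes can be inspected directly on any array. A small sketch, assuming a 3×4 array of 64-bit integers chosen only for illustration:

```python
import numpy as np

# A 3x4 array of 64-bit integers (the shape and dtype are arbitrary choices)
x = np.arange(12, dtype=np.int64).reshape(3, 4)

print("ndim:    ", x.ndim)      # number of dimensions
print("shape:   ", x.shape)     # size of each dimension
print("size:    ", x.size)      # total number of elements
print("dtype:   ", x.dtype)     # data type of the elements
print("itemsize:", x.itemsize)  # bytes per element
print("nbytes:  ", x.nbytes)    # total bytes = size * itemsize
```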

8. What is a Python list?

• Elements can belong to different data types.
• No need to explicitly import a module for declaration.
• Cannot directly handle arithmetic operations.
• Preferred for shorter sequences of data items.
• The entire list can be printed without any explicit looping.
• Consumes more memory, to allow easy addition of elements.

9. What are universal functions?

• A universal function (or ufunc for short) is a function that operates on ndarrays in an
element-by-element fashion.
• It is a “vectorized” wrapper for a function that takes a fixed number of specific inputs and
produces a fixed number of specific outputs.
• These functions include standard trigonometric functions, functions for arithmetic operations,
functions for handling complex numbers, and statistical functions.
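A short sketch of ufuncs in use, with an arbitrary example array. np.sqrt and np.add are both ufuncs, and the + operator on arrays is shorthand for np.add:

```python
import numpy as np

x = np.array([0.0, 1.0, 4.0, 9.0])

roots = np.sqrt(x)          # unary ufunc applied element by element
shifted = np.add(x, 1)      # binary ufunc; equivalent to x + 1
total = np.add.reduce(x)    # ufunc method: repeatedly applies add, i.e. a sum

print(roots, shifted, total)
print(isinstance(np.sqrt, np.ufunc))
```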

10. What is fancy indexing?

• With NumPy array fancy indexing, an array can be indexed with another NumPy array, a Python list,
or a sequence of integers, whose values select elements in the indexed array.
• Fancy indexing is like simple indexing, except that arrays of indices are passed in place of single
scalars.
• This allows us to very quickly access and modify complicated subsets of an array’s values.
• When using fancy indexing, the shape of the result reflects the shape of the index arrays, not the
shape of the array being indexed.
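The points above can be sketched as follows; the array values and index choices are arbitrary:

```python
import numpy as np

x = np.array([51, 92, 14, 71, 60, 20, 82, 86, 74, 74])

# Index with a Python list: selects elements 3, 7 and 2
picked_list = x[[3, 7, 2]]

# Index with a 2-D array: the result takes the shape of the index array
ind = np.array([[3, 7],
                [4, 5]])
picked_2d = x[ind]

# Fancy indexing can also be used to modify a subset of values
y = x.copy()
y[[0, 2]] = 0

print(picked_list, picked_2d, y, sep="\n")
```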
PART-B
11. A) i) Calculate Karl Pearson’s coefficient of correlation for the following data.
X   28  45  40  38  35  33  40  32  36  33
Y   23  34  33  34  30  26  28  31  36  35
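One way to check the calculation is the short sketch below. It assumes the usual deviation form of Karl Pearson’s formula, r = Σ(X − X̄)(Y − Ȳ) / √(Σ(X − X̄)² · Σ(Y − Ȳ)²); the function name pearson_r is a hypothetical helper, not part of the question:

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's correlation coefficient from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

X = [28, 45, 40, 38, 35, 33, 40, 32, 36, 33]
Y = [23, 34, 33, 34, 30, 26, 28, 31, 36, 35]

r = pearson_r(X, Y)
print(round(r, 3))  # about 0.52, a moderate positive correlation
```

With X̄ = 36 and Ȳ = 31, the intermediate sums are Σ(X − X̄)(Y − Ȳ) = 97, Σ(X − X̄)² = 216 and Σ(Y − Ȳ)² = 162.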

ii) A teacher is interested in studying the relationship between the performance in Statistics and
Economics of a class of 20 students. For this, the teacher compiles the scores in these subjects from the
last semester examination. Some data of this type are presented in the table. Calculate the correlation
coefficient for the data.
12. B) i. Find the Karl Pearson correlation coefficient for the following paired data.
Wages            100  101  102  102  100   99   97   98   96   95
Cost of living    98   99   99   97   95   92   95   94   90   91

Solution:
X = wages, Y = cost of living
X̄ = (100 + 101 + 102 + 102 + 100 + 99 + 97 + 98 + 96 + 95)/10 = 990/10 = 99
Ȳ = (98 + 99 + 99 + 97 + 95 + 92 + 95 + 94 + 90 + 91)/10 = 950/10 = 95
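The remaining steps can be sketched in code. This assumes the deviation form of Karl Pearson’s formula, r = Σxy / √(Σx² · Σy²), where x = X − X̄ and y = Y − Ȳ; the variable names are illustrative:

```python
from math import sqrt

wages = [100, 101, 102, 102, 100, 99, 97, 98, 96, 95]
cost_of_living = [98, 99, 99, 97, 95, 92, 95, 94, 90, 91]

mean_x = sum(wages) / len(wages)                    # 99
mean_y = sum(cost_of_living) / len(cost_of_living)  # 95

# Deviations from the means
dx = [x - mean_x for x in wages]
dy = [y - mean_y for y in cost_of_living]

sum_dx2 = sum(d * d for d in dx)                 # Σx² = 54
sum_dy2 = sum(d * d for d in dy)                 # Σy² = 96
sum_dxdy = sum(a * b for a, b in zip(dx, dy))    # Σxy = 61

r = sum_dxdy / sqrt(sum_dx2 * sum_dy2)           # 61 / 72
print(round(r, 3))  # about 0.847, a strong positive correlation
```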
ii)
13. A) i)

ii) A sample of 12 fathers and their eldest sons gave the following data about their heights in inches.
Calculate the coefficient of rank correlation.
Father  65  63  67  64  68  62  70  66  68  67  69  71
Sons    68  66  68  65  69  66  68  65  71  67  68  70
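A sketch of the rank correlation calculation, assuming the table reads Father = 65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71 and Sons = 68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70. Because some heights repeat, it assumes the textbook tie-corrected form of Spearman’s formula, ρ = 1 − 6(Σd² + Σm(m² − 1)/12) / (n(n² − 1)), where m is the size of each group of tied values. The helper names are hypothetical:

```python
from collections import Counter

def average_ranks(values):
    """Rank values from smallest to largest; tied values share the average rank."""
    counts = Counter(values)
    first_pos = {}
    for pos, v in enumerate(sorted(values), start=1):
        first_pos.setdefault(v, pos)  # first 1-based position of each value
    return [first_pos[v] + (counts[v] - 1) / 2 for v in values]

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every group of m equal values."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values())

father = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]
sons = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]

n = len(father)
rank_f = average_ranks(father)
rank_s = average_ranks(sons)

sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_f, rank_s))  # Σd²
cf = tie_correction(father) + tie_correction(sons)          # correction for ties
rho = 1 - 6 * (sum_d2 + cf) / (n * (n * n - 1))
print(sum_d2, cf, round(rho, 3))
```

Ranking from largest to smallest instead would give the same result, as long as both series are ranked in the same direction.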
12. b) i)
12. b)ii)
13. a) i)
15. a) Explain data wrangling.
• Data wrangling is the process of transforming data from its original “raw” form into a more
digestible format and organizing data sets from various sources into a single coherent whole for
further processing.
• Data wrangling covers the following processes:
   • Getting data from the various sources into one place.
   • Piecing the data together according to the determined setting.
   • Cleaning the data of noise and of erroneous or missing elements.

Data wrangling has 6 iterative steps:

1. Discovering: We must understand what is in our data, which will inform how we want to
analyse it. How we wrangle customer data, for example, may be informed by where the customers
are located, what they bought, or what promotions they received.
2. Structuring: This means organising the data, which is necessary because raw data comes
in many different shapes and sizes. A single column may turn into several rows for easier
analysis. One column may become two. Data is moved around to make computation and
analysis easier.
3. Cleaning: What happens when errors and outliers skew our data? We clean the data. What
happens when state data is entered as AP, Andhra Pradesh, or Arunachal Pradesh? Null
values are changed and standard formatting is implemented, ultimately increasing data
quality.
4. Enriching: Here we take stock of our data and strategize about how additional data
might augment it. Questions asked during this data wrangling step might be: What new
types of data can I derive from what I already have? What other information would
better inform my decision making about this current data?
5. Validating: Validation rules are repetitive programming sequences that verify data
consistency, quality, and security. Examples of validation include ensuring uniform
distribution of attributes that should be distributed normally (e.g. birth dates) or
confirming the accuracy of fields through a check across data.
6. Publishing: Analysts prepare the wrangled data for use downstream, whether by a
particular user or by software, and document any particular steps taken or logic used to
wrangle the data. Data wrangling gurus understand that implementation of insights relies
upon the ease with which the data can be accessed and utilized by others.
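The cleaning and validating steps might look like the following in Pandas. This is only a sketch under stated assumptions: the table, its column names, and the choice to fill missing amounts with the median are all hypothetical, and a real pipeline would use rules suited to its own data.

```python
import pandas as pd

# Hypothetical raw customer data with inconsistent formatting and missing values
raw = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "state": ["Andhra Pradesh", "andhra pradesh ", "Tamil Nadu", None],
    "amount": [120.0, None, 80.0, 50.0],
})

clean = raw.copy()

# Cleaning: standardise the formatting of state names and replace null values
clean["state"] = clean["state"].str.strip().str.title()
clean["state"] = clean["state"].fillna("Unknown")
clean["amount"] = clean["amount"].fillna(clean["amount"].median())

# Validating: simple rule checks on the cleaned data
assert clean["amount"].notna().all()
assert (clean["amount"] >= 0).all()

print(clean)
```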

b) Briefly explain hierarchical indexing with an example.


PART-C

16. a) Explain NumPy’s structured arrays.

Structured Data: NumPy’s Structured Arrays

Imagine that we have several categories of data on a number of people (say, name, age, and weight),
and we’d like to store these values for use in a Python program. It would be possible to store these in
three separate arrays:

In[2]: name = ['Alice', 'Bob', 'Cathy', 'Doug']
       age = [25, 45, 37, 19]
       weight = [55.0, 85.5, 68.0, 61.5]

But this is a bit clumsy. There’s nothing here that tells us that the three arrays are related; it would be
more natural if we could use a single structure to store all of this data. NumPy can handle this through
structured arrays, which are arrays with compound data types. Recall that previously we created a
simple array using an expression like this:

In[3]: x = np.zeros(4, dtype=int)

We can similarly create a structured array using a compound data type specification:
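A minimal sketch of such a compound data type specification, reusing the name, age, and weight values from the text. The field names and the format codes 'U10', 'i4', and 'f8' are assumptions chosen for illustration:

```python
import numpy as np

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# Compound dtype: a Unicode string (max 10 chars), a 4-byte int, an 8-byte float
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})

# Fill each field from the corresponding list
data['name'] = name
data['age'] = age
data['weight'] = weight

print(data[0])                       # first record, with all three fields
young_names = data[data['age'] < 30]['name']
print(young_names)                   # names of people younger than 30
```

Keeping the three fields in one array means that a single Boolean mask filters all of them together.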
More Advanced Compound Types

It is possible to define even more advanced compound types. For example, you can create a type where
each element contains an array or matrix of values. Here, we’ll create a data type with a mat component
consisting of a 3×3 floating-point matrix:

In[14]: tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
        X = np.zeros(1, dtype=tp)
        print(X[0])
        print(X['mat'][0])
(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]

Now each element in the X array consists of an id and a 3×3 matrix.

b) Illustrate and manipulate the Pandas DataFrame object with an example program.

The Pandas DataFrame Object

The fundamental structure in Pandas is the DataFrame. Like the Series object, the DataFrame can be
thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

DataFrame as a generalized NumPy array

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a
two-dimensional array with both flexible row indices and flexible column names. Just as you might
think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can
think of a DataFrame as a sequence of aligned Series objects, i.e. they share the same index. To
demonstrate this, let’s first construct a new Series listing the area of each of the five states discussed in
the previous section:

In[18]: area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
                     'Florida': 170312, 'Illinois': 149995}
        area = pd.Series(area_dict)
        area
Out[18]: California    423967
         Florida       170312
         Illinois      149995
         New York      141297
         Texas         695662
         dtype: int64

Now that we have this along with the population Series from before, we can use a dictionary to
construct a single two-dimensional object containing this information:

In[19]: states = pd.DataFrame({'population': population, 'area': area})
        states
Out[19]:               area  population
         California  423967    38332521
         Florida     170312    19552860
         Illinois    149995    12882135
         New York    141297    19651127
         Texas       695662    26448193

Like the Series object, the DataFrame has an index attribute that gives access to the index labels:

In[20]: states.index
Out[20]: Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

Additionally, the DataFrame has a columns attribute, which is an Index object holding the column
labels:

In[21]: states.columns
Out[21]: Index(['area', 'population'], dtype='object')

Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where
both the rows and columns have a generalized index for accessing the data.

DataFrame as specialized dictionary

Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps
a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for
the 'area' attribute returns the Series object containing the areas we saw earlier:

In[22]: states['area']
Out[22]: California    423967
         Florida       170312
         Illinois      149995
         New York      141297
         Texas         695662
         Name: area, dtype: int64

Constructing DataFrame objects

A Pandas DataFrame can be constructed in a variety of ways.

From a single Series object. A DataFrame is a collection of Series objects, and a single-column
DataFrame can be constructed from a single Series:

In[23]: pd.DataFrame(population, columns=['population'])
Out[23]:             population
         California    38332521
         Florida       19552860

         Florida     170312    19552860
         Illinois    149995    12882135
         New York    141297    19651127
         Texas       695662    26448193

From a two-dimensional NumPy array. Given a two-dimensional array of data, we can create a
DataFrame with any specified column and index names. If omitted, an integer index will be used for
each:

In[27]: pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])
Out[27]:         foo       bar
         a  0.865257  0.213169
         b  0.442759  0.108267
         c  0.047110  0.905718

From a NumPy structured array. We covered structured arrays in “Structured Data: NumPy’s
Structured Arrays”. A Pandas DataFrame operates much like a structured array, and can be created
directly from one:

In[28]: A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
        A
Out[28]: array([(0, 0.0), (0, 0.0), (0, 0.0)],
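The pieces above can be combined into one short, self-contained example program. This is a sketch: the population and area figures are the ones quoted in the outputs above, while the derived density column, the threshold of 100, and the variable names dense and largest are assumptions added for illustration:

```python
import pandas as pd

# Two Series sharing the same index (state names)
population = pd.Series({'California': 38332521, 'Florida': 19552860,
                        'Illinois': 12882135, 'New York': 19651127,
                        'Texas': 26448193})
area = pd.Series({'California': 423967, 'Florida': 170312,
                  'Illinois': 149995, 'New York': 141297,
                  'Texas': 695662})

# Construct the DataFrame from a dictionary of Series
states = pd.DataFrame({'population': population, 'area': area})

# Manipulate it: add a derived column, filter rows, and sort
states['density'] = states['population'] / states['area']
dense = states[states['density'] > 100]                         # Boolean-mask selection
largest = states.sort_values('area', ascending=False).index[0]  # state with largest area

print(states)
print(dense.index.tolist())
print(largest)
```

Dictionary-style column access (states['area']) and array-style Boolean masking work on the same object, which reflects the two views of a DataFrame described above.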
