MODEL EXAM II Answer Key - For Merge
Sub. Name: Foundations of Data Science    Branch / Year / SEM: IT / II / III
Sub. Code: CS3352    Date:
Duration: 3 hours    Marks: 100
2. What is regression?
Regression is a statistical method used to determine the relationship between a dependent variable and a series of other variables known as independent variables.
A regression line is a line used to describe the behavior of a set of data; it is used in forecasting procedures.
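A minimal sketch of fitting such a regression line with NumPy's least-squares polynomial fit; the x/y values below are made up purely for illustration:

import numpy as np

# Hypothetical paired observations (made up for illustration)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit of the regression line y = a*x + b
a, b = np.polyfit(x, y, deg=1)
print(f"regression line: y = {a:.2f}x + {b:.2f}")

# The fitted line can then be used for forecasting
print("forecast at x = 6:", a * 6 + b)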
ii) A teacher is interested in studying the relationship between the performance in Statistics and Economics of a class of 20 students. For this, he compiles the scores on these subjects in the last semester examination. Some data of this type are presented in the table. Calculate the correlation coefficient for the data.
12. B) i. Find Karl Pearson correlation coefficient for the following paired data.
Wages:           100  101  102  102  100   99   97   98   96   95
Cost of living:   98   99   99   97   95   92   95   94   90   91
Solution:
X = wages, Y = cost of living
x̄ = (100 + 101 + 102 + 102 + 100 + 99 + 97 + 98 + 96 + 95)/10 = 990/10 = 99
ȳ = (98 + 99 + 99 + 97 + 95 + 92 + 95 + 94 + 90 + 91)/10 = 950/10 = 95
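The remaining steps of the hand computation can be cross-checked with a short NumPy sketch; np.corrcoef is used here only as a check on the manual formula:

import numpy as np

x = np.array([100, 101, 102, 102, 100, 99, 97, 98, 96, 95])  # wages
y = np.array([98, 99, 99, 97, 95, 92, 95, 94, 90, 91])       # cost of living

# Karl Pearson correlation: r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² · Σ(y - ȳ)²)
dx, dy = x - x.mean(), y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(r)                        # 61/72 ≈ 0.847
print(np.corrcoef(x, y)[0, 1])  # same value via NumPy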
ii)
13. A) i)
ii) A sample of 12 fathers and their elder sons gave the following data about their heights in inches. Calculate the coefficient of rank correlation.
Father:  65  63  67  64  68  62  70  66  68  67  69  71
Son:     68  66  68  65  69  66  68  65  71  67  68  70
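A quick way to verify the hand computation (including the tie correction, since both series contain repeated heights) is the sketch below; scipy.stats.spearmanr assigns average ranks to tied values:

import numpy as np
from scipy.stats import spearmanr

father = np.array([65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71])
son    = np.array([68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70])

# Spearman rank correlation with average ranks for tied heights
rho, _ = spearmanr(father, son)
print(rho)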
15. a) Explain data wrangling.
Data wrangling is the process of transforming data from its original “raw” form into a more digestible format and organizing data sets from various sources into a single coherent whole for further processing.
Data wrangling covers the following processes:
Getting data from the various sources into one place.
Piecing the data together according to the determined setting.
Cleaning the data of noise and of erroneous or missing elements.
1. Discovering: We must understand what is in our data, which will inform how we want to
analyse it. How we wrangle customer data, for example, may be informed by where they
are located, what they bought, or what promotions they received.
2. Structuring: This means organising the data, which is necessary because raw data comes in many different shapes and sizes. A single column may turn into several rows for easier analysis, or one column may become two; data is rearranged to make computation and analysis easier.
3. Cleaning: What happens when errors and outliers skew our data? We clean the data. What happens when state data is entered as AP, Andhra Pradesh, or Arunachal Pradesh? Null values are changed and standard formatting is implemented, thereby increasing data quality (see the short sketch after this list).
4. Enriching: Here we take stock of our data and strategize about how other, additional data might augment it. Questions asked during this data wrangling step might be: what new types of data can I derive from what I already have, or what other information would better inform my decision making about this current data?
5. Validating: Validation rules are repetitive programming sequences that verify data consistency, quality, and security. Examples of validation include ensuring uniform distribution of attributes that should be distributed normally (e.g., birth dates) or confirming the accuracy of fields through a check across the data.
6. Publishing: Analysts prepare the wrangled data for use downstream, whether by a particular user or software, and document any particular steps taken or logic used to wrangle the data. Data wrangling gurus understand that implementation of insights relies upon the ease with which the data can be accessed and utilized by others.
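A minimal pandas sketch of the cleaning, structuring, and validating steps described above; the column names, state mapping, and values here are assumptions made purely for illustration:

import pandas as pd

# Hypothetical raw customer data with inconsistent state codes and a missing value
raw = pd.DataFrame({
    'customer': ['C1', 'C2', 'C3', 'C4'],
    'state': ['AP', 'Andhra Pradesh', 'Arunachal Pradesh', None],
    'amount': ['1,200', '950', '1,050', '800'],
})

# Cleaning: standardise state names and handle the null value
state_map = {'AP': 'Andhra Pradesh'}
clean = raw.assign(
    state=raw['state'].replace(state_map).fillna('Unknown'),
    # Structuring: convert the amount strings into numbers for analysis
    amount=raw['amount'].str.replace(',', '').astype(float),
)

# Validating: a simple rule checking that no amounts are negative
assert (clean['amount'] >= 0).all()
print(clean)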
Structured Data: NumPy’s Structured Arrays
Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we’d like to store these values for use in a Python program. It would be possible to store these in three separate arrays:
In[2]: name = ['Alice', 'Bob', 'Cathy', 'Doug']
       age = [25, 45, 37, 19]
       weight = [55.0, 85.5, 68.0, 61.5]
But this is a bit clumsy. There’s nothing here that tells us that the three arrays are related; it would be more natural if we could use a single structure to store all of this data. NumPy can handle this through structured arrays, which are arrays with compound data types. Recall that previously we created a simple array using an expression like this:
In[3]: x = np.zeros(4, dtype=int)
We can similarly create a structured array using a compound data type specification:
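The compound dtype specification itself is not reproduced above; a minimal sketch of what it could look like for the same name/age/weight example (the field names and formats chosen here are assumptions):

import numpy as np

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# Compound dtype: a 10-character Unicode name, a 4-byte integer age, an 8-byte float weight
data = np.zeros(4, dtype={'names':   ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})

# Fill the structured array from the three separate lists
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)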
More Advanced Compound Types
It is possible to define even more advanced compound types. For example, you can create a type where each element contains an array or matrix of values. Here, we’ll create a data type with a mat component consisting of a 3×3 floating-point matrix:
In[14]: tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
        X = np.zeros(1, dtype=tp)
        print(X[0])
        print(X['mat'][0])
(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]
Now each element in the X array consists of an id and a 3×3 matrix.
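Fields of such an array can be read and written by name; a brief self-contained continuation of the example above (the values assigned here are arbitrary):

import numpy as np

tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(1, dtype=tp)

# Fields of a structured array are accessed by name
X['id'] = [1]
X['mat'][0] = np.eye(3)   # replace the zero matrix with a 3×3 identity matrix
print(X['id'][0])         # 1
print(X['mat'][0])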
b) Illustrate and manipulate the Pandas DataFrame object with an example program.
The Pandas DataFrame Object
The fundamental structure in Pandas is the DataFrame. Like the Series object, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.
DataFrame as a generalized NumPy array
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects, i.e., they share the same index. To demonstrate this, let’s first construct a new Series listing the area of each of the five states discussed in the previous section:
In[18]: area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
                     'Florida': 170312, 'Illinois': 149995}
        area = pd.Series(area_dict)
        area
Out[18]: California    423967
         Florida       170312
         Illinois      149995
         New York      141297
         Texas         695662
         dtype: int64
Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:
In[19]: states = pd.DataFrame({'population': population, 'area': area})
        states
Out[19]:               area  population
         California  423967    38332521
         Florida     170312    19552860
         Illinois    149995    12882135
         New York    141297    19651127
         Texas       695662    26448193
Like the Series object, the DataFrame has an index attribute that gives access to the index labels:
In[20]: states.index
Out[20]: Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:
In[21]: states.columns
Out[21]: Index(['area', 'population'], dtype='object')
Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.
DataFrame as specialized dictionary
Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for the 'area' attribute returns the Series object containing the areas we saw earlier:
In[22]: states['area']
Out[22]: California    423967
         Florida       170312
         Illinois      149995
         New York      141297
         Texas         695662
         Name: area, dtype: int64
Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways.
From a single Series object. A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:
In[23]: pd.DataFrame(population, columns=['population'])
Out[23]:             population
         California    38332521
         Florida       19552860
         Illinois      12882135
         New York      19651127
         Texas         26448193
From a dictionary of Series objects. As we saw above (In[19]), a DataFrame can be constructed from a dictionary of Series objects as well:
In[26]: pd.DataFrame({'population': population, 'area': area})
Out[26]:               area  population
         California  423967    38332521
         Florida     170312    19552860
         Illinois    149995    12882135
         New York    141297    19651127
         Texas       695662    26448193
From a two-dimensional NumPy array. Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will be used for each:
In[27]: pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])
Out[27]:         foo       bar
         a  0.865257  0.213169
         b  0.442759  0.108267
         c  0.047110  0.905718
From a NumPy structured array. We covered structured arrays in “Structured Data: NumPy’s Structured Arrays” earlier. A Pandas DataFrame operates much like a structured array, and can be created directly from one:
In[28]: A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
        A
Out[28]: array([(0, 0.0), (0, 0.0), (0, 0.0)],
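Since the question also asks for manipulation of the DataFrame, the following is a short self-contained sketch along the lines of the examples above; the derived 'density' column and the selection/sort steps are illustrative assumptions:

import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})

# Construct the DataFrame from a dictionary of Series objects
states = pd.DataFrame({'population': population, 'area': area})

# Manipulation: add a derived column, then sort and filter rows
states['density'] = states['population'] / states['area']   # population per unit area
print(states.sort_values('density', ascending=False))
print(states[states['population'] > 20_000_000])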