Data Science HW1
Farhad Kabir
7084590528
MBA 2023 Core
Assignment 1 (40 points total)
Problem 1. (10 Points)
Companies vary greatly in size. This variation can hide how well a company is performing.
Rather than looking at the raw profit numbers, analysts consider financial ratios that adjust
for the size of the company. A popular ratio is the return on assets, defined as:
Return on Assets = Net Income / Total Assets
Net income is another name for profits, and the total assets of a company is the value of
everything it owns that is used to produce profits. The return on assets indicates how much
profit the company generates relative to the amount that it invested to make that profit. A
company with losses rather than profits has a negative return on assets. The data set
“Company.csv” gives the total assets (in Million $), net income (in Million $), and the
number of employees reported by 167 retailers in the United States.
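As a rough illustration of the ratio described above, the sketch below computes return on assets and the share of loss-making companies. The three companies and their figures are hypothetical and are not taken from “Company.csv”.

```python
# Hypothetical figures in million $; these are NOT rows from Company.csv.
companies = [
    {"name": "Retailer A", "net_income": 120.0, "total_assets": 1500.0},
    {"name": "Retailer B", "net_income": -30.0, "total_assets": 400.0},
    {"name": "Retailer C", "net_income": 8.0, "total_assets": 50.0},
]


def return_on_assets(net_income, total_assets):
    """Return on assets = net income / total assets."""
    return net_income / total_assets


roa = {
    c["name"]: return_on_assets(c["net_income"], c["total_assets"])
    for c in companies
}

# A company with a loss has a negative return on assets.
loss_share = sum(1 for value in roa.values() if value < 0) / len(roa)

for name, value in roa.items():
    print(f"{name}: ROA = {value:.2%}")
print(f"Share of companies with a loss: {loss_share:.2%}")
```

Dividing by total assets is what makes a small retailer comparable with a very large one.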
a) Report the following summary statistics for the variable Total Assets.
Loss : 23.35%
c) What is the shape of the distribution of the variable “return on assets”? Answer the
question without creating a histogram. Select one:
A. Right Skewed
B. Left Skewed
C. Symmetrical
Explain your answer in one short sentence.
Ans. B. Left Skewed: the mean is less than the median, so most values lie
near the upper end while a tail of low values pulls the mean down.
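The mean-versus-median comparison used in this answer can be sketched as a small check. It is only a rule of thumb, not a formal skewness test, and the sample values are hypothetical.

```python
from statistics import mean, median


def skew_direction(values):
    """Rule of thumb: compare the mean with the median.

    A mean below the median suggests a left (negative) skew, a mean above
    the median suggests a right (positive) skew.
    """
    m, med = mean(values), median(values)
    if m < med:
        return "left"
    if m > med:
        return "right"
    return "symmetric"


# Hypothetical return-on-assets values (in %): one large loss, many modest gains.
sample = [-40, 2, 4, 5, 6, 7, 8]
print(mean(sample), median(sample), skew_direction(sample))
```

A histogram, as requested in part d), remains the more reliable way to confirm the shape.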
GSBA 524 Data Science for Business Md. Farhad Kabir
7084590528
MBA 2023 Core
d) Create a histogram to verify your answer in c). Use 20 bins and include a graph in
your solution.
e) Suppose that the total asset value of Wal-Mart is accidentally recorded as 180.66.
Which measure (mean vs median) is less influenced by the recording error?
Explain why in one short sentence (no computation is needed).
Ans. The median is less influenced than the mean: it depends only on the order of the
values, so misrecording one very large value such as Wal-Mart's barely moves it.
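A minimal sketch of why one recording error affects the two measures differently. The asset values are hypothetical; only the misrecorded figure 180.66 comes from the question.

```python
from statistics import mean, median

# Hypothetical total assets (million $); the last entry plays the role of a
# very large retailer. Only the value 180.66 is taken from the question.
correct = [1.2, 3.5, 4.1, 6.8, 180_660.0]
misrecorded = [1.2, 3.5, 4.1, 6.8, 180.66]

print("mean:  ", mean(correct), "->", mean(misrecorded))
print("median:", median(correct), "->", median(misrecorded))
```

The middle value is the same in both lists, so the median does not change, while the mean drops sharply.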
Problem 2.
The data set “CreditCard.csv” records the sales transactions for 400 customers. It includes the
time of transaction, the gender and region of customer, the payment method, the payment
amount, and the number of items purchased.
a) Report % of customers who pay using a Visa card.
Ans. 38%
b) To see how Gender and Payment method are related to each other, we examine the joint
distribution of Gender and Payment method. Use Pivot Table in Excel or Radiant to
create a two-way table to describe the joint distribution of Gender and Payment method.
Report % of the grand total.
c) Given that a customer is a female, what is the distribution of her payment method?
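The two-way table in b) and the conditional distribution in c) can also be sketched outside Excel or Radiant. The records below are hypothetical and the payment-method labels are assumptions; they are not taken from “CreditCard.csv”.

```python
from collections import Counter

# Hypothetical (gender, payment method) records; NOT rows from CreditCard.csv.
records = [
    ("Female", "Visa"), ("Female", "Visa"), ("Female", "Cash"),
    ("Male", "Visa"), ("Male", "Mastercard"), ("Male", "Cash"),
    ("Female", "Mastercard"), ("Male", "Cash"),
]

counts = Counter(records)
total = len(records)

# Joint distribution: every cell as a share of the grand total.
joint = {cell: n / total for cell, n in counts.items()}

# Conditional distribution of payment method given that the customer is female:
# divide each "Female" cell by the female row total instead of the grand total.
female_total = sum(n for (gender, _), n in counts.items() if gender == "Female")
given_female = {
    method: n / female_total
    for (gender, method), n in counts.items()
    if gender == "Female"
}

for (gender, method), share in sorted(joint.items()):
    print(f"{gender:6s} {method:10s} {share:.2%} of grand total")
for method, share in sorted(given_female.items()):
    print(f"P({method} | Female) = {share:.2%}")
```

The only difference between the two tables is the denominator: the grand total for the joint distribution, the row total for the conditional one.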
d) Find the following two conditional probabilities: (i) The conditional probability that a
customer spends more than $120 given that s/he is from the west.
Ans: 65 customers are from the West and spend more than $120, so the conditional
probability is 53.72%.
EXCEL FUNCTION =IF(AND(H2>120,D2="West"),1,0)
(ii) The conditional probability that a customer is from the west given that s/he spends
more than $120.
Ans. 65 customers are from the West and spend more than $120, so the conditional
probability is 30.09%.
EXCEL FUNCTION =IF(AND(D2="West",H2>120),1,0)
Hint: use the “IF” function in Excel to create a new variable based on the value of the
variable “Total Cost”. Then use Radiant/Pivot Table to create two-way tables.
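The two conditional probabilities share the same numerator, the 65 customers who are from the West and spend more than $120, and differ only in the denominator. In the sketch below the two denominators (121 and 216) are not stated in the answers; they are inferred from the reported percentages and should be treated as assumptions.

```python
# Joint count reported in the answers above.
joint_count = 65        # from the West AND Total Cost > 120

# Assumed denominators, inferred from the reported percentages
# (65 / 0.5372 is about 121 and 65 / 0.3009 is about 216).
west_count = 121        # customers from the West
over_120_count = 216    # customers who spend more than $120

# P(spend > 120 | West) = P(spend > 120 and West) / P(West)
p_over_given_west = joint_count / west_count

# P(West | spend > 120) = P(spend > 120 and West) / P(spend > 120)
p_west_given_over = joint_count / over_120_count

print(f"P(>120 | West) = {p_over_given_west:.2%}")
print(f"P(West | >120) = {p_west_given_over:.2%}")
```

Swapping the conditioning event changes the answer because the denominator changes, even though the joint count stays at 65.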
Problem 3. (10 Points)
The data set “Salary.csv” contains information on 220 employees in a company: their
salaries, years of experience, and ranks.
a) Report the following summary statistics for the variable Salary for male and female
employees, respectively: Mean, SD, IQR.
b) Compute the correlation between Salary and YearsExper (for all employees). Hint: You
can compute the correlation in Radiant.
Correlation
Data : Salary
Method : Pearson, Spearman
Variables : Salary, Exp
Null hyp. : variables x and y are not correlated
Alt. hyp. : variables x and y are correlated
Correlation matrix:
     Salary
Exp    0.31

p.values:
     Salary
Exp    0.00
Correlation
Data : Salary
Method : Kendall
Variables : Salary, Exp
Null hyp. : variables x and y are not correlated
Alt. hyp. : variables x and y are correlated
Correlation matrix:
     Salary
Exp    0.21

p.values:
     Salary
Exp    0.00
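The two outputs above use different correlation measures, which is why the coefficients differ. The sketch below computes a Pearson coefficient and a simplified Kendall coefficient (tau-a, which ignores ties) by hand on hypothetical data. It is not Radiant's implementation, and Radiant's handling of ties may differ.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) pairs / all pairs, ties ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical years of experience and salaries (in thousands); not from Salary.csv.
years = [1, 2, 3, 4, 5]
salary = [30, 35, 33, 50, 60]

print("Pearson:", round(pearson(years, salary), 2))
print("Kendall tau-a:", kendall_tau_a(years, salary))
```

Pearson measures linear association on the raw values, while Kendall only counts whether pairs of observations are ordered the same way, so it is typically smaller in magnitude on the same data.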
Observation: Female employees' salaries are lower than male employees' salaries, but the
gap is insignificant. There is one outlier in the female category.
e) Create a scatter plot of Salary vs YearsExper. Include a graph in your solution and
comment on any interesting patterns that you observe.
Ans.
f) Do you still think that there is a gender bias? Note: no definitive conclusion can be
reached at this stage. We need to use multiple regression models to do a rigorous study.
Just briefly comment based on the above plots and your intuitions (in 2-3 short
sentences).
Ans.
Based on the data, the salary gap cannot be attributed to gender bias, since salary is
proportionately higher for employees with more years of experience. This resembles
Simpson's paradox, because the outcome changes once the two data sets are
combined.
Problem 4.
A medical study compares the success rates of two treatments for kidney stones.
The tables below show the numbers of treatments for treatments involving both small and
large kidney stones, where Treatment A includes all open surgical procedures and Treatment
B is percutaneous nephrolithotomy (which involves only a small puncture).
a) What are the overall success rates for treatments A and B, respectively?
Ans.
Treatment    Overall Success Rate
A            78.00%
B            82.57%
b) For small stones, what are the success rates for treatment A and B, respectively? Which
treatment is better?
Ans.
Treatment    Small stones' Success Rate
A            93.10%
B            86.67%
In terms of success rate, A is better; in terms of frequency, B is better.
c) For large stones, what are the success rates for treatment A and B, respectively? Which
treatment is better?
Ans.
Treatment    Large stones' Success Rate
A            73.00%
B            68.75%
In terms of both success rate and frequency, A is better.
d) You must have found that the conclusion in a) contradicts those in b) and c). Briefly
explain the phenomenon (in 2-3 short sentences).
Ans. This is an example of Simpson's paradox. In the combined data, Treatment B
appears better, but a lurking variable, stone size, emerges once the success rates are
computed separately for small and large stones: Treatment A is better in each group.
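The reversal can be reproduced with a short sketch. The original tables did not survive in this copy, so the counts below are an assumption: they are the figures commonly quoted for this kidney stone study, and they reproduce every percentage reported in a) to c).

```python
# Assumed (successes, treatments) per stone size; the original tables are missing.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, treatments):
    """Success rate as a fraction."""
    return successes / treatments


overall = {}
for treatment, groups in data.items():
    successes = sum(s for s, _ in groups.values())
    treatments = sum(n for _, n in groups.values())
    overall[treatment] = rate(successes, treatments)

for treatment, groups in data.items():
    small = rate(*groups["small"])
    large = rate(*groups["large"])
    print(f"{treatment}: small {small:.2%}, large {large:.2%}, "
          f"overall {overall[treatment]:.2%}")
```

Under these assumed counts, Treatment A is used mostly on large stones, which are harder to treat, while Treatment B is used mostly on small stones. That imbalance pulls A's overall rate down even though A has the higher rate within each stone size.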
-------------------------------------------END OF ASSIGNMENT---------------------------------