Data Science HW1

This document contains an assignment on data analysis and visualization. It includes the following key points: 1) The student is asked to analyze financial and transactional data from several companies to calculate metrics like return on assets and payment method distributions. 2) Charts like histograms, boxplots, and scatterplots are created to visualize relationships between variables like salary, experience, and gender. 3) Correlations are calculated and conditional probabilities are found to understand connections between variables. 4) Preliminary analysis suggests that while females may earn less on average, salary seems more strongly linked to years of experience rather than gender alone. Further regression is needed to study gender bias rigorously.

Uploaded by

Farhad Kabir

GSBA 524 Data Science for Business Md. Farhad Kabir
7084590528
MBA 2023 Core
Assignment 1 (40 points total)
Problem 1. (10 Points)
Companies vary greatly in size. This variation can hide how well a company is performing.
Rather than looking at the raw profit numbers, analysts consider financial ratios that adjust
for the size of the company. A popular ratio is the return on assets, defined as:

Return on Assets = Net Income/Total Assets

Net income is another name for profits, and the total assets of a company is the value of
everything it owns that is used to produce profits. The return on assets indicates how much
profit the company generates relative to the amount that it invested to make that profit. A
company with losses rather than profits has a negative return on assets. The data set
“Company.csv” gives the total assets (in Million $), net income (in Million $), and the
number of employees reported by 167 retailers in the United States.
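
As a minimal sketch of the ratio defined above, the snippet below computes return on assets for a few companies. The company names and figures are hypothetical and are not taken from Company.csv; `return_on_assets` is an illustrative helper name.

```python
# Hypothetical records: (company, net income in $M, total assets in $M).
# These values are illustrative only and are not taken from Company.csv.
companies = [
    ("Retailer A", 120.0, 1500.0),
    ("Retailer B", -30.0, 600.0),
    ("Retailer C", 45.0, 300.0),
]


def return_on_assets(net_income: float, total_assets: float) -> float:
    """Return on Assets = Net Income / Total Assets."""
    return net_income / total_assets


roa = {name: return_on_assets(ni, ta) for name, ni, ta in companies}
for name, value in roa.items():
    print(f"{name}: ROA = {value:.2%}")
```

Retailer B has the lowest ratio because its net income is negative, which matches the statement that a company with losses has a negative return on assets.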

a) Report the following summary statistics for the variable Total Assets.

MEAN         STANDARD DEVIATION    RANGE      IQR
5,286.940    16,119.786            180,561    2,691.5
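
The four statistics in the table can be sketched in Python as follows. The data are hypothetical, and the quartile rule (`method="inclusive"`) is an assumption: Excel and Radiant may use a different interpolation rule and report a slightly different IQR.

```python
import statistics

# Hypothetical total-asset values in $M (not the Company.csv data).
total_assets = [120.0, 340.0, 560.0, 980.0, 1500.0, 2600.0, 4100.0, 52000.0]

mean = statistics.mean(total_assets)
sd = statistics.stdev(total_assets)  # sample standard deviation (divides by n - 1)
value_range = max(total_assets) - min(total_assets)

# Quartiles depend on the interpolation rule; "inclusive" is one common choice
# and may not match Excel or Radiant exactly.
q1, _median, q3 = statistics.quantiles(total_assets, n=4, method="inclusive")
iqr = q3 - q1

print(f"mean={mean:.3f} sd={sd:.3f} range={value_range:.1f} iqr={iqr:.1f}")
```

With one very large value in the sample, the standard deviation and range are driven by that value, while the IQR stays close to the spread of the typical companies.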

b) Report the % of companies that incurred losses in the reported year.


I have helped you create a new variable “Profit or Loss” using the “IF” function in
Excel based on the value of the variable “Net Income”. You can use Radiant to
compute the frequencies based on this new variable.
This is how the IF function in Excel works:

Ans. Loss: 23.35%
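
The "Profit or Loss" variable and its frequency can be sketched in Python as below. The data are hypothetical, and the Excel formula in the comment is an assumed form of the IF rule, since the original illustration did not survive in this copy. For reference, 23.35% of 167 companies corresponds to 39 companies (39/167).

```python
# Hypothetical net-income values in $M (not the Company.csv data).
net_income = [12.5, -3.0, 40.1, 0.8, -15.2, 7.7, 22.0, -1.1]

# Equivalent of an Excel formula such as =IF(B2<0,"Loss","Profit");
# the cell reference B2 is an assumption, not taken from the assignment.
profit_or_loss = ["Loss" if x < 0 else "Profit" for x in net_income]

loss_share = profit_or_loss.count("Loss") / len(profit_or_loss)
print(f"Loss: {loss_share:.2%}")
```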

c) What is the shape of the distribution of the variable “return on assets”? Answer the
question without creating a histogram. Select one:
A. Right Skewed
B. Left Skewed
C. Symmetrical
Explain your answer in one short sentence.

Ans. B. Left Skewed. The mean is less than the median, so most values lie
toward the higher end and the tail extends to the left.

d) Create a histogram to verify your answer in c). Use 20 bins and include a graph in
your solution.

e) Suppose that the total asset value of Wal-Mart is accidentally recorded as 180.66.
Which measure (mean vs median) is less influenced by the recording error?
Explain why in one short sentence (no computation is needed).

Ans. The median is less influenced than the mean, because only a few companies,
such as Wal-Mart, have such extreme values, and the median depends on the middle
of the ordered data rather than on the extremes.
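
A small sketch of this effect, using hypothetical data: the largest value is misrecorded as 180.66, and the shift in the mean is compared with the shift in the median. The "true" value 180660.0 is an assumption chosen for illustration only.

```python
import statistics

# Hypothetical total assets in $M; the last value stands in for a very large
# retailer. 180660.0 is an assumed "true" value, not taken from the data set.
correct = [150.0, 420.0, 900.0, 1300.0, 2700.0, 180660.0]

# Same data with the largest value misrecorded as 180.66.
misrecorded = correct[:-1] + [180.66]

mean_shift = abs(statistics.mean(correct) - statistics.mean(misrecorded))
median_shift = abs(statistics.median(correct) - statistics.median(misrecorded))

print(f"mean shifts by {mean_shift:.2f}, median shifts by {median_shift:.2f}")
```

In this six-value sample the median still moves a little because the misrecorded value changes position in the ordering; with 167 companies the change in the median would typically be much smaller.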

Problem 2. (10 Points)

The data set “CreditCard.csv” records the sales transactions for 400 customers. It includes the
time of transaction, the gender and region of customer, the payment method, the payment
amount, and the number of items purchased.
a) Report % of customers who pay using a Visa card.
Ans. 38%
b) To see how Gender and Payment method are related to each other, we examine the joint
distribution of Gender and Payment method. Use Pivot Table in Excel or Radiant to
create a two-way table to describe the joint distribution of Gender and Payment method.
Report % of the grand total.

c) Given that a customer is a female, what is the distribution of her payment method?

Payment method of female    Probability
Cash                        17.95%
Visa                        38.46%
Master                      43.59%
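
The difference between the joint distribution in b) and the conditional distribution in c) can be sketched as follows. The records are hypothetical and are not the CreditCard.csv data; the point is only which total each cell is divided by.

```python
from collections import Counter

# Hypothetical (gender, payment method) records, not the CreditCard.csv data.
records = [
    ("Female", "Cash"), ("Female", "Visa"), ("Female", "Master"),
    ("Female", "Master"), ("Male", "Cash"), ("Male", "Visa"),
    ("Male", "Visa"), ("Male", "Master"), ("Male", "Cash"), ("Female", "Visa"),
]

joint_counts = Counter(records)
total = len(records)

# Joint distribution: each cell as a share of the grand total.
joint = {cell: n / total for cell, n in joint_counts.items()}

# Conditional distribution of payment method given Female:
# divide each Female cell by the Female row total, not the grand total.
female_total = sum(n for (g, _), n in joint_counts.items() if g == "Female")
given_female = {
    method: n / female_total
    for (g, method), n in joint_counts.items()
    if g == "Female"
}

for method, p in given_female.items():
    print(f"P({method} | Female) = {p:.2%}")
```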

d) Find the following two conditional probabilities: (i) The conditional probability that a
customer spends more than $120 given that s/he is from the west.

Ans. 53.72% (65 customers flagged by the formula below)
Excel function: =IF(AND(H2>120,D2="West"),"1","0")

(ii) The conditional probability that a customer is from the west given that s/he spends
more than $120.
Ans. 30.09% (65 customers flagged by the formula below)
Excel function: =IF(AND(D2="West",H2>120),"1","0")

Hint: use the “IF” function in Excel to create a new variable based on the value of the
variable “Total Cost”. Then use Radiant/Pivot Table to create two-way tables.
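
The two answers in d) share the same numerator (65 customers who are from the West and spend more than $120) and differ only in the denominator. The sketch below makes that explicit. The denominators 121 and 216 are not stated in the original; they are back-calculated from the reported percentages and should be treated as assumptions.

```python
# Counts: 65 is reported above; 121 and 216 are NOT stated in the original and
# are back-calculated here from the reported percentages (an assumption).
west_and_over_120 = 65
west_total = 121        # assumed number of customers from the West
over_120_total = 216    # assumed number of customers spending more than $120

# P(spend > 120 | West): restrict attention to customers from the West.
p_over_given_west = west_and_over_120 / west_total

# P(West | spend > 120): restrict attention to customers spending more than $120.
p_west_given_over = west_and_over_120 / over_120_total

print(f"P(>120 | West) = {p_over_given_west:.2%}")
print(f"P(West | >120) = {p_west_given_over:.2%}")
```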

Problem 3. (10 Points)

The data set “Salary.csv” contains information on 220 employees in a company: their
salaries, years of experience, and ranks.

a) Report the following summary statistics for the variable Salary for male and female
employees, respectively: Mean, SD, IQR.

Gender    Mean       SD        IQR
Male      144.110    12.394    19.00
Female    140.467    12.496    15.50
Total     142.868    12.521    18.00

b) Compute the correlation between Salary and YearsExper (for all employees). Hint: You
can compute the correlation in Radiant.
Correlation
Data      : Salary
Method    : Pearson, Spearman
Variables : Salary, Exp
Null hyp. : variables x and y are not correlated
Alt. hyp. : variables x and y are correlated

Correlation matrix:
    Salary
Exp 0.31

p.values:
    Salary
Exp 0.00

Correlation
Data      : Salary
Method    : Kendall
Variables : Salary, Exp
Null hyp. : variables x and y are not correlated
Alt. hyp. : variables x and y are correlated

Correlation matrix:
    Salary
Exp 0.21

p.values:
    Salary
Exp 0.00
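
The outputs above report different values for different methods because each method measures association differently. As a rough sketch on hypothetical data (not Salary.csv), the functions below compute a Pearson correlation and a Kendall tau-a. The tau-a form assumes no ties; Radiant may use a tie-corrected variant, so this is not a reproduction of its calculation.

```python
from itertools import combinations
from math import sqrt

# Hypothetical (years of experience, salary) pairs, not the Salary.csv data.
exper = [1, 3, 4, 6, 8, 10, 12]
salary = [128, 135, 131, 142, 139, 150, 147]


def pearson(x, y):
    """Pearson correlation: co-variation scaled by the spread of each variable."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs; assumes no ties."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    return score / len(pairs)


r = pearson(exper, salary)
tau = kendall_tau_a(exper, salary)
print(f"Pearson r = {r:.2f}, Kendall tau-a = {tau:.2f}")
```

Pearson uses the actual distances between values, while Kendall only counts whether pairs of observations are ordered the same way in both variables.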

c) Create a Side-by-Side boxplot of Salary by Gender (comparative boxplots). Include a
graph in your solution and comment on any interesting patterns that you observe.
Ans.

Observation: Female employees' salaries are lower than male employees' salaries, but the
gap is small. There is one outlier in the female category.

d) Create a Side-by-Side boxplot of YearsExper by Gender. Include a graph in your solution
and comment on any interesting patterns that you observe.
Ans.

Observation: Female employees have fewer years of experience than male employees, and the
gap is substantial.

e) Create a scatter plot of Salary vs YearsExper. Include a graph in your solution and
comment on any interesting patterns that you observe.
Ans.

Observation: Salary tends to be higher for employees with more years of experience.

f) Do you still think that there is a gender bias? Note: no definitive conclusion can be
reached at this stage. We need to use multiple regression models to do a rigorous study.
Just briefly comment based on the above plots and your intuitions (in 2-3 short
sentences).
Ans.
Based on the data, the salary gap cannot be attributed to gender bias alone, since salary
is higher for employees with more years of experience and female employees have fewer
years of experience. This may be a case of Simpson’s paradox, since the combined data
suggest a different conclusion than the data broken down by experience.

Problem 4. (10 points)

A medical study compares the success rates of two treatments for kidney stones.

The tables below show the numbers of treatments for treatments involving both small and
large kidney stones, where Treatment A includes all open surgical procedures and Treatment
B is percutaneous nephrolithotomy (which involves only a small puncture).

a) What are the overall success rates for treatments A and B, respectively?
Ans.
Treatment    Overall Success Rate
A            78.00%
B            82.57%

b) For small stones, what are the success rates for treatment A and B, respectively? Which
treatment is better?
Ans.
Treatment    Small stones’ Success Rate
A            93.10%
B            86.67%
In terms of success rate, A is better; in terms of frequency, B is better.

c) For large stones, what are the success rates for treatment A and B, respectively? Which
treatment is better?
Ans.
Treatment    Large stones’ Success Rate
A            73.004%
B            68.75%
Both in terms of success rate and in terms of frequency, A is better.

d) You must have found that the conclusion in a) contradicts with those in b) and c). Briefly
explain the phenomenon (in 2-3 short sentences).
Ans. This is an example of Simpson’s paradox. Overall, Treatment B appears better, but
a lurking variable (stone size) emerges once the success rates are computed separately
for small and large stones: within each group, Treatment A has the higher success rate.
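
The reversal can be checked with a short calculation. The counts below are an assumption: the assignment's own tables did not survive in this copy, so the sketch uses the counts from the widely cited kidney stone study, which reproduce every percentage reported in a) through c).

```python
# (successes, treatments) per treatment and stone size. ASSUMPTION: these are the
# counts from the widely cited kidney stone study; the assignment's own tables
# are missing here, but these counts reproduce the rates reported above.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes: int, treatments: int) -> float:
    """Success rate as a fraction of treatments."""
    return successes / treatments


overall = {}
for treatment, groups in counts.items():
    successes = sum(s for s, _ in groups.values())
    treatments = sum(n for _, n in groups.values())
    overall[treatment] = rate(successes, treatments)
    for size, (s, n) in groups.items():
        print(f"{treatment} {size}: {rate(s, n):.2%} of {n} treatments")
    print(f"{treatment} overall: {overall[treatment]:.2%}")
```

Under these assumed counts, Treatment A was applied mostly to large stones (263 of 350 treatments), which have lower success rates for both treatments, while Treatment B was applied mostly to small stones (270 of 350). That imbalance is what pulls A's overall rate below B's even though A is better within each group.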

-------------------------------------------END OF ASSIGNMENT---------------------------------
