Assignment 1 B2019010
ASSIGNMENT-1
YoutubeLikes – Number of likes of the YouTube trailers
YoutubeDislikes – Number of dislikes of the YouTube trailers
Use Python Code to answer the following questions.
# -*- coding: utf-8 -*-
"""
Created on Mon Jan 27 00:14:44 2020
"""
"""===============DATASET1=================="""
import pandas as pd
import numpy as np
import seaborn as sn
%matplotlib inline
1. How many records are present in the dataset? Print the metadata information of the dataset.
bollywood.info()
bollywood.shape
This study source was downloaded by 100000783676552 from CourseHero.com on 10-28-2021 06:02:17 GMT -05:00
https://www.coursehero.com/file/60784145/Assignment-1-B2019010docx/
2. How many movies got released in each genre? Which genre had highest number of releases?
Sort number of releases in each genre in descending order.
bollywood["Genre"].value_counts()
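A minimal sketch of how the genre with the most releases could be read off the counts directly. The small `movies` frame and its genre labels are hypothetical stand-ins for the assignment data, which is not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for the bollywood DataFrame; labels are illustrative only.
movies = pd.DataFrame(
    {"Genre": ["Comedy", "Drama", "Comedy", "Action", "Drama", "Comedy"]}
)

# value_counts() already sorts the counts in descending order by default.
genre_counts = movies["Genre"].value_counts()

# idxmax() returns the label that carries the largest count.
top_genre = genre_counts.idxmax()

print(genre_counts)
print("Genre with the most releases:", top_genre)
```

Because `value_counts()` sorts in descending order, the first row of its output is also the answer to the "highest number of releases" part of the question.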
3. How many movies in each genre got released in different release times like long weekend,
festive season, etc.? (Note: Do a cross tabulation between Genre and ReleaseTime.)
pd.crosstab(bollywood["Genre"],bollywood["ReleaseTime"])
4. In which month of the year is the maximum number of movie releases seen? (Note: Extract a new
column called month from ReleaseDate column.)
bollywood["Month"]=pd.DatetimeIndex(bollywood["Release Date"]).month
print(bollywood[["MovieName","Month"]])
bollywood["Month"].value_counts()
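The month extraction and the "busiest month" lookup can be sketched on a few rows. The dates below are made up for illustration, and `pd.to_datetime(...).dt.month` is used here as an equivalent alternative to `pd.DatetimeIndex(...).month`.

```python
import pandas as pd

# Hypothetical release dates; the real column values are not shown in the assignment.
releases = pd.DataFrame(
    {"Release Date": ["2014-01-10", "2014-01-24", "2014-03-07", "2013-01-18"]}
)

# Parse the strings as dates and keep only the month number (1-12).
releases["Month"] = pd.to_datetime(releases["Release Date"]).dt.month

# Count releases per month, then pick the month with the largest count.
month_counts = releases["Month"].value_counts()
busiest_month = month_counts.idxmax()

print(month_counts)
print("Month with the most releases:", busiest_month)
```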
5. Which month of the year typically sees most releases of high budgeted movies, that is, movies
with budget of 25 crore or more?
bollywood[bollywood["Budget"]>=25]["Month"].value_counts()
6. Which are the top 10 movies with maximum return on investment (ROI)? Calculate return on
bollywood["ROI"]= (bollywood["BoxOfficeCollection"]-bollywood["Budget"]) / bollywood["Budget"]
bollywood[["MovieName","ROI"]].sort_values("ROI",ascending=False)[0:10]
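As a worked check of the ROI formula used above, here is a small sketch on three made-up movies. The names and figures are hypothetical, and `nlargest` is shown as an alternative to sorting and slicing.

```python
import pandas as pd

# Hypothetical budgets and collections (same units for both columns).
films = pd.DataFrame(
    {
        "MovieName": ["A", "B", "C"],
        "Budget": [10.0, 50.0, 20.0],
        "BoxOfficeCollection": [30.0, 60.0, 10.0],
    }
)

# ROI = (collection - budget) / budget, computed row by row.
films["ROI"] = (films["BoxOfficeCollection"] - films["Budget"]) / films["Budget"]

# nlargest(n, column) returns the n rows with the highest values in that column.
top = films.nlargest(2, "ROI")[["MovieName", "ROI"]]
print(top)
```

Movie A earns 30 on a budget of 10, so its ROI is (30 - 10) / 10 = 2.0, while movie C loses half its budget and gets an ROI of -0.5.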
7. Do the movies have higher ROI if they get released on festive season or long weekend?
Calculate the average ROI for different release times.
bollywood.groupby("ReleaseTime")["ROI"].mean()
8. Draw a histogram and distribution plot to find out the distribution of movie budgets. Interpret
the plot to conclude whether most movies are high or low budgeted movies.
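No code survives for this question, so the following is only a sketch of the numbers a budget histogram would display. The budget values and the bin edges are assumptions chosen for illustration; 25 crore is the high-budget threshold from question 5.

```python
import numpy as np
import pandas as pd

# Hypothetical budgets in crore; the real column is bollywood["Budget"].
budgets = pd.Series([5, 8, 10, 12, 15, 18, 20, 30, 60, 120], name="Budget")

# Bin counts are exactly what a histogram draws as bar heights.
bin_edges = [0, 25, 50, 75, 100, 125]
counts, edges = np.histogram(budgets, bins=bin_edges)

# Share of movies below the 25 crore threshold.
share_low = (budgets < 25).mean()

# Positive skewness indicates a long right tail (a few very expensive movies).
skewness = budgets.skew()

print(dict(zip(edges[:-1].tolist(), counts.tolist())))
print("Share below 25 crore:", share_low)
print("Skewness:", skewness)
```

On the real data, `bollywood["Budget"].plot.hist()` would draw the same counts as bars. A tall first bar with a long tail to the right would indicate that most movies are low budgeted.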
9. Compare the distribution of ROIs between movies of the comedy genre and the drama genre. Which
genre typically sees higher ROIs?
bollywood.groupby("Genre")["ROI"].sum().plot.bar()
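A per-genre sum grows with the number of movies in the genre, so a typical ROI is easier to judge from the mean or median. The sketch below compares only the two genres named in the question. The ROI values are invented, and the exact genre labels in the real data are an assumption.

```python
import pandas as pd

# Hypothetical ROI values; the labels "Comedy" and "Drama" are assumed spellings.
roi = pd.DataFrame(
    {
        "Genre": ["Comedy", "Comedy", "Comedy", "Drama", "Drama", "Drama"],
        "ROI": [0.5, 1.0, 3.0, -0.2, 0.4, 0.9],
    }
)

# Keep only the two genres being compared.
subset = roi[roi["Genre"].isin(["Comedy", "Drama"])]

# Summarise each genre's ROI distribution with count, mean and median.
summary = subset.groupby("Genre")["ROI"].agg(["count", "mean", "median"])
print(summary)
```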
10. Is there a correlation between Box office collection and YouTube likes? Is the correlation
positive or negative?
corr_bollywood=bollywood[["BoxOfficeCollection","YoutubeLikes"]].corr()
sn.heatmap(corr_bollywood,annot=True)
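To answer the "positive or negative" part with a single number, the Pearson coefficient between two columns can be taken directly with `Series.corr`. The figures below are hypothetical and only illustrate how to read the sign.

```python
import pandas as pd

# Hypothetical values; the real columns come from the bollywood DataFrame.
sample = pd.DataFrame(
    {
        "BoxOfficeCollection": [10, 20, 30, 40],
        "YoutubeLikes": [100, 220, 290, 410],
    }
)

# Pearson correlation between the two columns (Series.corr defaults to Pearson).
r = sample["BoxOfficeCollection"].corr(sample["YoutubeLikes"])

# The sign of r gives the direction of the relationship.
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.3f} ({direction})")
```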
11. Which genre of movies typically sees more YouTube likes? Draw a boxplot for each genre of
movies to compare.
sn.boxplot(x="Genre",y = "YoutubeLikes", data=bollywood)
features=["Budget","YoutubeViews","YoutubeLikes","YoutubeDislikes"]
sn.pairplot(bollywood[features],height=2)
The dataset contains a retrospective sample of males in a heart-disease high-risk region of the
Western Cape, South Africa. There are roughly two controls per case of Coronary Heart
Disease (CHD). Many of the CHD-positive men have undergone blood pressure reduction
treatment and other programs to reduce their risk factors after their CHD event. In some
cases, the measurements were made after these treatments. These data are taken from a larger
dataset, described in Rousseauw et al. (1983), South African Medical Journal. It is a tab
separated file (csv) and contains the following columns (source: http://www-stat.stanford.edu)
13. How many records are present in the dataset? Print the metadata information of the dataset.
"""================DATASET 2===================="""
SAheart = pd.read_csv("SAheart.csv")
SAheart.info()
14. Draw a bar plot to show the number of persons having CHD or not, compared by whether they
have a family history of the disease or not.
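# Recode chd to 1 for "Si" and 0 for anything else in one vectorized step.
# Assigning element by element via SAheart["chd"][i] is chained assignment and may not update the frame.
SAheart["chd"] = (SAheart["chd"] == "Si").astype(int)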
sn.barplot(x="famhist",y="chd",data=SAheart)
pd.crosstab(SAheart["famhist"],SAheart["chd"]).plot.bar()
15. Does age have any correlation with sbp? Choose an appropriate plot to show the relationship.
corr_heart=SAheart[["sbp","age"]].corr()
sn.heatmap(corr_heart,annot=True)
16. Compare the distribution of tobacco consumption for persons having CHD and not having
CHD. Can you interpret the effect of tobacco on having coronary heart disease?
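# d0: persons without CHD, d1: persons with CHD
d0 = SAheart[SAheart["chd"]==0]
d1 = SAheart[SAheart["chd"]==1]
# distplot is deprecated in newer seaborn releases; histplot/kdeplot are its replacements
sn.distplot(d0["tobacco"])
sn.distplot(d1["tobacco"])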
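Alongside the overlaid distribution plots, a per-group summary gives numbers to support the interpretation. This is a sketch on invented values, assuming `chd` is coded 0/1 and `tobacco` is numeric.

```python
import pandas as pd

# Hypothetical rows; chd is assumed to be coded 0 (no CHD) / 1 (CHD).
heart = pd.DataFrame(
    {
        "chd": [0, 0, 0, 1, 1, 1],
        "tobacco": [0.0, 1.0, 2.0, 3.0, 5.0, 10.0],
    }
)

# Mean and median tobacco consumption for each CHD group.
tobacco_by_chd = heart.groupby("chd")["tobacco"].agg(["mean", "median"])
print(tobacco_by_chd)
```

A higher centre for the CHD group, as in this toy sample, would be consistent with heavier tobacco use among CHD cases. It does not by itself establish a causal effect.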
17. How are the parameters sbp, obesity, age and ldl correlated? Choose the right plot to show the
relationships.
corr_heart=SAheart[["sbp","age","obesity","ldl"]].corr()
sn.heatmap(corr_heart,annot=True)
18. Derive new column called agegroup from age column where persons falling in different age
ranges are categorized as below:
(0-15):Young
(15-35):adults
(35-55):mid
(55-):old
SAheart["agegroup"]=pd.cut(SAheart.age,bins=[0,14,34,54,99],labels=["Young","Adults","Mid","Old"])
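`pd.cut` uses right-closed intervals by default, so the edges 14, 34 and 54 put whole-number ages 15, 35 and 55 into the next group up. The sketch below checks the boundary ages explicitly. The `ages` series is illustrative and not taken from the dataset.

```python
import pandas as pd

# Boundary ages chosen to show which side of each edge they fall on.
ages = pd.Series([14, 15, 34, 35, 54, 55])

# Default right=True gives the intervals (0, 14], (14, 34], (34, 54], (54, 99].
groups = pd.cut(
    ages,
    bins=[0, 14, 34, 54, 99],
    labels=["Young", "Adults", "Mid", "Old"],
)

print(pd.DataFrame({"age": ages, "agegroup": groups}))
```

With these bins an age of exactly 0 would fall outside the first interval and become NaN. Passing `include_lowest=True` is one way to include it if such values occur.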
19. Find out number of chd cases in different age categories. Do a barplot and sort them in the
order of age groups.
# chd is coded 0/1, so sum() counts CHD cases; count() would count every record in the group
SAheart.groupby("agegroup")["chd"].sum().plot.bar()
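The difference between counting records and counting CHD cases per age group can be seen on a toy frame. The rows below are hypothetical, and `chd` is assumed to be coded 0/1.

```python
import pandas as pd

# Ordered categories keep the groups in age order on the bar plot's x-axis.
order = ["Young", "Adults", "Mid", "Old"]
df = pd.DataFrame(
    {
        "agegroup": pd.Categorical(
            ["Adults", "Adults", "Mid", "Mid", "Mid", "Old"],
            categories=order,
            ordered=True,
        ),
        "chd": [0, 1, 1, 0, 1, 1],
    }
)

# observed=False keeps empty categories (here "Young") in the result.
grouped = df.groupby("agegroup", observed=False)["chd"]
cases = grouped.sum()      # number of CHD cases per group
records = grouped.count()  # number of rows per group, cases or not

print(pd.DataFrame({"cases": cases, "records": records}))
```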
20. Draw a box plot to compare distributions of ldl for different age groups.
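No code survives for this question. As a sketch, the quartiles below are the values a box plot draws as the box edges and the median line. The `ldl` numbers are invented and only two age groups are shown.

```python
import pandas as pd

# Hypothetical ldl readings for two age groups.
heart = pd.DataFrame(
    {
        "agegroup": ["Adults", "Adults", "Adults", "Old", "Old", "Old", "Old"],
        "ldl": [2.0, 3.0, 4.0, 4.0, 5.0, 6.0, 7.0],
    }
)

# First quartile, median and third quartile of ldl for each age group.
quartiles = heart.groupby("agegroup")["ldl"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

On the real data, a call in the style used for question 11, `sn.boxplot(x="agegroup", y="ldl", data=SAheart)`, would draw these quartiles as one box per age group.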