
Assignment 1 B2019010

The document describes a Python programming assignment that involves analyzing a dataset called "bollywood.csv" containing box office and social media data for Bollywood movies from 2013-2015. Students are asked to use Python code to answer 12 questions that involve exploring the dataset, calculating metrics like return on investment, and visualizing relationships between variables through plots and charts. The assignment aims to help students practice working with datasets and answering business questions through Python programming and data analysis.

Uploaded by

bhushan

Programming with Python

ASSIGNMENT-1

Use the Bollywood Dataset to Answer Questions 1 to 12.


The data file bollywood.csv contains box office collection and social media promotion
information about movies released in the 2013-2015 period. The columns and their
descriptions are as follows.
- SlNo
- Release Date
- MovieName – Name of the movie
- ReleaseTime – Mentions special time of release. LW (Long weekend), FS (Festive
  Season), HS (Holiday Season), N (Normal)
- Genre – Genre of the film such as Romance, Thriller, Action, Comedy, etc.
- Budget – Movie creation budget
- BoxOfficeCollection – Box office collection
- YoutubeViews – Number of views of the YouTube trailers
- YoutubeLikes – Number of likes of the YouTube trailers
- YoutubeDislikes – Number of dislikes of the YouTube trailers

Use Python Code to answer the following questions.

# -*- coding: utf-8 -*-
"""
Created on Mon Jan 27 00:14:44 2020

@author: Aviral Mehta
"""

"""===============DATASET1=================="""
import pandas as pd
import numpy as np
import seaborn as sn

%matplotlib inline

import matplotlib.pyplot as plt

# Load the dataset; the answers below refer to it as `bollywood`.
bollywood = pd.read_csv("bollywood.csv")

1. How many records are present in the dataset? Print the metadata information of the dataset.

bollywood.info()   # column names, dtypes and non-null counts

bollywood.shape    # (number of records, number of columns)

This study source was downloaded by 100000783676552 from CourseHero.com on 10-28-2021 06:02:17 GMT -05:00

https://www.coursehero.com/file/60784145/Assignment-1-B2019010docx/
2. How many movies got released in each genre? Which genre had the highest number of releases?
Sort the number of releases in each genre in descending order.

bollywood["Genre"].value_counts()
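
As a minimal sketch of how `value_counts()` answers this question, the toy frame below stands in for bollywood.csv. Its rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for bollywood.csv (illustrative rows only).
toy = pd.DataFrame({
    "MovieName": ["A", "B", "C", "D", "E", "F"],
    "Genre": ["Comedy", "Drama", "Comedy", "Action", "Comedy", "Drama"],
})

# value_counts() returns the counts sorted in descending order by default,
# so no separate sort step is needed.
genre_counts = toy["Genre"].value_counts()

# idxmax() gives the label with the largest count.
top_genre = genre_counts.idxmax()

print(genre_counts)
print("Most releases:", top_genre)
```

Because the result is already sorted, the first entry of `genre_counts` is also the genre with the highest number of releases.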

3. How many movies in each genre got released in different release times like long weekend,
festive season, etc.? (Note: Do a cross tabulation between Genre and ReleaseTime.)

pd.crosstab(bollywood["Genre"],bollywood["ReleaseTime"])
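
The cross tabulation can be sketched on a small made-up frame as follows. The rows are hypothetical, and `margins=True` is an optional extra that adds row and column totals.

```python
import pandas as pd

# Hypothetical rows; ReleaseTime codes follow the column description
# (LW, FS, HS, N).
toy = pd.DataFrame({
    "Genre": ["Comedy", "Drama", "Comedy", "Action", "Drama", "Comedy"],
    "ReleaseTime": ["N", "FS", "LW", "N", "N", "N"],
})

# Rows are genres, columns are release times, cells are movie counts.
# margins=True appends totals under the label "All".
table = pd.crosstab(toy["Genre"], toy["ReleaseTime"], margins=True)
print(table)
```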


4. In which month of the year is the maximum number of movie releases seen? (Note: Extract a new
column called Month from the Release Date column.)

bollywood["Month"]=pd.DatetimeIndex(bollywood["Release Date"]).month

print(bollywood[["MovieName","Month"]])

bollywood["Month"].value_counts()
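
A hedged sketch of the month extraction on made-up rows is shown below. The date layout (day, abbreviated month, two-digit year) is an assumption about the file and should be adjusted to the format the data actually uses.

```python
import pandas as pd

# Hypothetical rows; the date strings assume a "dd-Mon-yy" layout.
toy = pd.DataFrame({
    "MovieName": ["A", "B", "C", "D"],
    "Release Date": ["18-Apr-14", "04-Jan-13", "25-Apr-14", "02-Oct-15"],
})

# Parsing with an explicit format avoids day/month ambiguity.
dates = pd.to_datetime(toy["Release Date"], format="%d-%b-%y")
toy["Month"] = dates.dt.month

# Count releases per month and pick the busiest one.
month_counts = toy["Month"].value_counts()
busiest_month = month_counts.idxmax()
print(month_counts)
print("Busiest month:", busiest_month)
```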

5. Which month of the year typically sees most releases of high budgeted movies, that is, movies
with budget of 25 crore or more?

bollywood[bollywood["Budget"]>=25]["Month"].value_counts()


6. Which are the top 10 movies with maximum return on investment (ROI)? Calculate return on
investment (ROI) as (BoxOfficeCollection-Budget) / Budget.

bollywood["ROI"]=(bollywood["BoxOfficeCollection"]-bollywood["Budget"])/bollywood["Budget"]
bollywood[["MovieName","ROI"]].sort_values("ROI",ascending=False)[0:10]
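
The ROI formula and the top-N selection can be checked on a few made-up movies. The figures below are illustrative only, and `nlargest` is used here as an equivalent alternative to sorting and slicing.

```python
import pandas as pd

# Hypothetical budgets and collections (same units for both columns).
toy = pd.DataFrame({
    "MovieName": ["A", "B", "C", "D"],
    "Budget": [10.0, 50.0, 20.0, 40.0],
    "BoxOfficeCollection": [40.0, 60.0, 10.0, 120.0],
})

# ROI = (BoxOfficeCollection - Budget) / Budget
toy["ROI"] = (toy["BoxOfficeCollection"] - toy["Budget"]) / toy["Budget"]

# nlargest(n, column) returns the n rows with the highest values,
# already ordered from highest to lowest.
top = toy.nlargest(2, "ROI")[["MovieName", "ROI"]]
print(top)
```

A negative ROI, as for movie C here, means the movie collected less than its budget.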

7. Do movies have a higher ROI if they are released in a festive season or on a long weekend?
Calculate the average ROI for different release times.
bollywood.groupby("ReleaseTime")["ROI"].mean()
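
As a sketch under made-up numbers, the group average can be reported together with the median and the group size. This helps judge whether a high mean comes from only one or two movies.

```python
import pandas as pd

# Hypothetical ROI values per release time.
toy = pd.DataFrame({
    "ReleaseTime": ["FS", "FS", "N", "N", "N", "LW"],
    "ROI": [1.0, 3.0, 0.5, -0.5, 1.5, 2.0],
})

# agg() computes several statistics per group in one pass.
summary = toy.groupby("ReleaseTime")["ROI"].agg(["mean", "median", "count"])
print(summary)
```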

8. Draw a histogram and distribution plot to find out the distribution of movie budgets. Interpret
the plot to conclude whether most movies are high or low budgeted movies.

plt.hist(bollywood["Budget"],bins=5) # plt.hist uses 10 bins by default; 5 are used here
sn.distplot(bollywood["Budget"]) # deprecated since seaborn 0.11; sn.histplot(..., kde=True) is the replacement
9. Compare the distribution of ROIs between movies with comedy genre and drama. Which
genre typically sees higher ROIs?

bollywood.groupby("Genre")["ROI"].mean().plot.bar() # mean rather than sum, so genres with more releases are not favoured
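
A numeric way to compare the two distributions is sketched below on made-up values. It assumes the genre labels are exactly "Comedy" and "Drama"; labels in the real file may differ in case or spacing and would then need cleaning first.

```python
import pandas as pd

# Hypothetical ROI values; "Action" is included to show the filtering step.
toy = pd.DataFrame({
    "Genre": ["Comedy", "Comedy", "Comedy", "Drama", "Drama", "Drama", "Action"],
    "ROI": [0.5, 1.5, 4.0, 0.2, 0.4, 0.6, 2.0],
})

# Keep only the two genres being compared.
subset = toy[toy["Genre"].isin(["Comedy", "Drama"])]

# describe() reports count, mean, std, min, quartiles and max per genre,
# which summarises each distribution rather than a single total.
stats = subset.groupby("Genre")["ROI"].describe()
print(stats)
```

The median (the `50%` column) is less sensitive than the mean to a single very successful movie, so it is a reasonable basis for saying which genre "typically" does better.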

10. Is there a correlation between Box office collection and YouTube likes? Is the correlation
positive or negative?

corr_bollywood=bollywood[["BoxOfficeCollection","YoutubeLikes"]].corr()
sn.heatmap(corr_bollywood,annot=True)
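
The sign and strength of the correlation can also be read as a single number. The sketch below uses made-up values and `Series.corr`, which computes the Pearson coefficient by default.

```python
import pandas as pd

# Hypothetical collections and trailer likes.
toy = pd.DataFrame({
    "BoxOfficeCollection": [10.0, 20.0, 30.0, 40.0, 50.0],
    "YoutubeLikes": [100, 300, 250, 500, 600],
})

# Pearson correlation coefficient between the two columns, in [-1, 1].
r = toy["BoxOfficeCollection"].corr(toy["YoutubeLikes"])

# The sign of r gives the direction of the relationship.
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(round(r, 3), direction)
```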

11. Which genre of movies typically sees more YouTube likes? Draw a boxplot for each genre of
movies to compare.

sn.boxplot(x="Genre",y = "YoutubeLikes", data=bollywood)


12. Which of the variables among Budget, BoxOfficeCollection, YoutubeViews, YoutubeLikes,
YoutubeDislikes are highly correlated? Note: Draw pair plot or heatmap.

features=["Budget","BoxOfficeCollection","YoutubeViews","YoutubeLikes","YoutubeDislikes"]
sn.pairplot(bollywood[features],height=2)
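
To name the most strongly correlated pair without reading it off a plot, the correlation matrix can be scanned pair by pair. This is a sketch on made-up data with only three of the columns, not the assignment's result.

```python
from itertools import combinations

import pandas as pd

# Hypothetical values for three of the variables.
toy = pd.DataFrame({
    "Budget": [10, 20, 30, 40, 50],
    "YoutubeViews": [1000, 2100, 2900, 4200, 5000],
    "YoutubeLikes": [50, 40, 70, 20, 60],
})

corr = toy.corr()

# Visit each unordered pair of columns once and record its coefficient.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# Rank by absolute value so strong negative correlations count too.
best_pair = max(pairs, key=lambda p: abs(pairs[p]))
print(best_pair, round(pairs[best_pair], 3))
```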

Use the SAheart Dataset to Answer Questions 13 to 20.


The dataset SAheart.data is taken from the link below:
https://web.stanford.edu/~hastie/ElemStatLearn//datasets/SAheart.data

The dataset contains a retrospective sample of males in a heart-disease high-risk region of the
Western Cape, South Africa. There are roughly two controls per case of Coronary Heart
Disease (CHD). Many of the CHD-positive men have undergone blood pressure reduction
treatment and other programs to reduce their risk factors after their CHD event. In some
cases, the measurements were made after these treatments. These data are taken from a larger
dataset, described in Rousseauw et al. (1983), South African Medical Journal. It is a tab
separated file (csv) and contains the following columns (source: http://www-stat.stanford.edu)

sbp – Systolic blood pressure
tobacco – Cumulative tobacco (kg)
ldl – Low density lipoprotein cholesterol
adiposity
famhist – Family history of heart disease (PRESENT / ABSENT)
typea – Type-A behaviour
obesity
alcohol – Current alcohol consumption
age – Age at onset
chd – Response, coronary heart disease

13. How many records are present in the dataset? Print the metadata information of the dataset.

"""================DATASET 2===================="""
SAheart = pd.read_csv("SAheart.csv")
SAheart.info()

14. Draw a bar plot to show the number of persons having CHD or not, broken down by whether
they have a family history of the disease.

# Recode chd to 1 where it equals "Si" and to 0 otherwise (vectorized form of the original loop).
SAheart["chd"]=(SAheart["chd"]=="Si").astype(int)

sn.barplot(x="famhist",y="chd",data=SAheart)

pd.crosstab(SAheart["famhist"],SAheart["chd"]).plot.bar()

15. Does age have any correlation with sbp? Choose an appropriate plot to show the relationship.

corr_heart=SAheart[["sbp","age"]].corr()
sn.heatmap(corr_heart,annot=True)

16. Compare the distribution of tobacco consumption for persons having CHD and not having
CHD. Can you interpret the effect of tobacco on having coronary heart disease?

d0 = SAheart[SAheart["chd"]==1] # persons with CHD
d1 = SAheart[SAheart["chd"]==0] # persons without CHD
sn.distplot(d0["tobacco"])
sn.distplot(d1["tobacco"])

17. How are the parameters sbp, obesity, age and ldl correlated? Choose the right plot to show the
relationships.

corr_heart=SAheart[["sbp","age","obesity","ldl"]].corr()
sn.heatmap(corr_heart,annot=True)
18. Derive new column called agegroup from age column where persons falling in different age
ranges are categorized as below:
(0-15):Young
(15-35):adults
(35-55):mid
(55-):old
SAheart["agegroup"]=pd.cut(SAheart.age,bins=[0,14,34,54,99],labels=["Young","Adults","Mid","Old"])
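
The binning can be checked at the boundary ages with a small sketch. It assumes each range includes its lower bound and excludes its upper bound (so 15 is an adult and 35 is mid), which is one reading of the ranges listed in the question. `right=False` and the open-ended last bin are choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Ages chosen to sit on both sides of every boundary.
ages = pd.Series([14, 15, 34, 35, 54, 55, 64])

# right=False makes each bin closed on the left: [0, 15), [15, 35), ...
# np.inf keeps the last bin open-ended instead of capping it at 99.
agegroup = pd.cut(
    ages,
    bins=[0, 15, 35, 55, np.inf],
    right=False,
    labels=["Young", "Adults", "Mid", "Old"],
)
print(pd.DataFrame({"age": ages, "agegroup": agegroup}))
```

For whole-number ages this gives the same grouping as the bins [0,14,34,54,99] with the default right-closed intervals.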

19. Find out the number of CHD cases in different age categories. Do a barplot and sort them in the
order of age groups.
SAheart.groupby("agegroup")["chd"].sum().plot.bar() # chd is 0/1, so sum() counts cases; count() would count all persons
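
The difference between counting persons and counting CHD cases per age group is sketched below on made-up rows. It assumes `chd` has already been recoded to 0/1 and that `agegroup` is an ordered categorical, as `pd.cut` produces.

```python
import pandas as pd

# Hypothetical persons; no one falls in the "Young" group.
toy = pd.DataFrame({
    "agegroup": pd.Categorical(
        ["Adults", "Adults", "Mid", "Mid", "Mid", "Old"],
        categories=["Young", "Adults", "Mid", "Old"],
        ordered=True,
    ),
    "chd": [0, 1, 1, 0, 1, 1],
})

# observed=False keeps empty categories, so the result stays in age order
# with a zero for "Young".
grouped = toy.groupby("agegroup", observed=False)["chd"]

cases = grouped.sum()      # number of CHD cases (sum of the 0/1 flags)
persons = grouped.count()  # number of persons, with or without CHD
print(pd.DataFrame({"cases": cases, "persons": persons}))
```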

20. Draw a box plot to compare distributions of ldl for different age groups.
sn.boxplot(x="agegroup",y = "ldl", data=SAheart)


