PA Assignment 1 Oct2021
ASSIGNMENT 1
Due on 19 December 2021 (Sunday), 23:59 hrs
Individual/Team/Both: Individual
WARNING
1. OBJECTIVE
This assignment assesses the student’s ability to apply relevant programming concepts
to develop a simple application using the numpy package of the Python programming
language.
2. QUESTIONS
The sinking of the Titanic was a tragedy in 1912. Data analytics and data science are
often applied to improve future outcomes, based on past events.
For this assignment, we will be using a modified, anonymized dataset from the Titanic
Passenger Data. This dataset is a subset of the full original data, with
missing or unusual values already cleaned or adjusted.
Column details:
• passenger_id: (anonymized, unique id numbers assigned to each passenger)
• pclass: represents passenger cabin class, (1 - first class, 2 - second class, 3 -
third class)
• survived: (0 - no, 1 - yes)
• gender: (0 - female, 1 - male)
• age: (numerical age of passenger)
• sibsp: (number of siblings and/or spouses the passenger had onboard together
with them)
• parch: (number of parents and/or children the passenger had onboard together
with them)
• fare: (assume in $US based on 1912 pricing)
Answer the following questions based on the "titanic_mod.csv" data file by writing Python
code.
1) Read the provided data file into the Jupyter Notebook using suitable file opening
functions, and perform the following tasks:
i) Print a list (Python data structure) of all the column header names of the
dataset (the column names are in the first line of the data file).
ii) Print the first 5 column header names, followed by the first 5 rows of the data.
For ii), your output should clearly display the column names and the required
rows of data as per the example below.
Hint: Use Python's open() and readline() to open the provided file, and to read the
file's column headers and rows of the data, line by line.
Reminder:
Use of non-basic Python libraries such as csv or pandas will incur a 50% penalty.
(8 marks)
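The hint above can be sketched as follows. This is a minimal illustration only, not a model answer: the sample text, the temporary file and the helper name read_header_and_rows are hypothetical stand-ins for "titanic_mod.csv", and the values are made up. Showing only the first 5 columns of each row is one possible reading of part ii).

```python
import os
import tempfile

# Hypothetical sample standing in for "titanic_mod.csv"; the values are made up.
SAMPLE_TEXT = (
    "passenger_id,pclass,survived,gender,age,sibsp,parch,fare\n"
    "1,3,0,1,22,1,0,7.25\n"
    "2,1,1,0,38,1,0,71.28\n"
    "3,3,1,0,26,0,0,7.93\n"
)


def read_header_and_rows(path, row_count):
    """Return the header names and up to row_count data rows."""
    with open(path, "r") as data_file:
        # The first line holds the comma-separated column names.
        header_names = data_file.readline().strip().split(",")
        rows = []
        for _ in range(row_count):
            line = data_file.readline()
            if not line:  # readline() returns "" at end of file
                break
            rows.append(line.strip().split(","))
    return header_names, rows


# Write the sample to a temporary file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(SAMPLE_TEXT)
    sample_path = tmp.name

header_names, first_rows = read_header_and_rows(sample_path, 2)
os.remove(sample_path)

print(header_names)        # part i): all column names as a list
print(header_names[:5])    # part ii): first 5 column names
for row in first_rows:
    print(row[:5])         # assumption: show the matching 5 columns
```

With the real file, the path and the row count of 5 would replace the sample values used here.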
2) Using relevant numpy functions, load the data file into an array, excluding the first
row of column headers. Display the contents and properties of the array as per the
example below. You may need to use numpy’s set_printoptions() function to
achieve the desired display.
Hint: Use numpy’s genfromtxt() function to load data from a text or csv file. Read
the documentation and consider carefully what parameters to use when calling
genfromtxt().
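A minimal sketch of this loading step is shown here, under stated assumptions: the in-memory sample text replaces the real file, the choice of delimiter and skip_header is one plausible set of genfromtxt() parameters, and the properties printed (shape, dimensions, dtype) are a guess at what the expected display contains.

```python
import io

import numpy as np

# Hypothetical sample standing in for "titanic_mod.csv"; the values are made up.
SAMPLE_TEXT = (
    "passenger_id,pclass,survived,gender,age,sibsp,parch,fare\n"
    "1,3,0,1,22,1,0,7.25\n"
    "2,1,1,0,38,1,0,71.28\n"
    "3,3,1,0,26,0,0,7.93\n"
)

# Avoid scientific notation and limit decimals when the array is printed.
np.set_printoptions(suppress=True, precision=2)

# skip_header=1 drops the line of column names; genfromtxt() also accepts
# a file path in place of the StringIO object used here.
passenger_data = np.genfromtxt(
    io.StringIO(SAMPLE_TEXT), delimiter=",", skip_header=1
)

print(passenger_data)
print("shape:", passenger_data.shape)
print("dimensions:", passenger_data.ndim)
print("dtype:", passenger_data.dtype)
```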
3) Write a user-defined function that has two parameters: a column index and a
passenger age. The function counts the occurrences of that age in the given
column of the numpy array and returns the total.
Then print the 3 most frequent passenger ages in the dataset. For each age,
include its proportion of the entire passenger manifest as a percentage, to
3 decimal places.
(12 marks)
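One way to approach this is sketched below. It assumes the columns appear in the order given under "Column details" (so age is column index 4) and uses a small made-up array in place of the loaded data. The function name count_age_occurrences is hypothetical, and ties are broken here in favour of the lower age, which the question does not specify.

```python
import numpy as np

# Made-up rows in the assumed column order:
# passenger_id, pclass, survived, gender, age, sibsp, parch, fare
passenger_data = np.array([
    [1, 3, 0, 1, 22, 1, 0, 7.25],
    [2, 1, 1, 0, 38, 1, 0, 71.28],
    [3, 3, 1, 0, 22, 0, 0, 7.93],
    [4, 1, 1, 1, 35, 1, 0, 53.10],
    [5, 3, 0, 1, 22, 0, 0, 8.05],
    [6, 2, 1, 1, 38, 0, 2, 30.00],
])
AGE_COLUMN = 4  # assumption based on the column list


def count_age_occurrences(column_index, age):
    """Count how many rows hold the given age in the given column."""
    return int(np.count_nonzero(passenger_data[:, column_index] == age))


# Distinct ages with their counts, then the 3 largest counts.
ages, counts = np.unique(passenger_data[:, AGE_COLUMN], return_counts=True)
top_positions = np.argsort(-counts, kind="stable")[:3]
top_ages = [float(ages[position]) for position in top_positions]

for age in top_ages:
    occurrences = count_age_occurrences(AGE_COLUMN, age)
    share = occurrences / len(passenger_data) * 100
    print(f"Age {age:.0f}: {occurrences} passengers ({share:.3f}%)")
```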
Please print out the following values amongst passengers (when appropriate, to 2
decimal places):
(12 marks)
Print out the difference between mean fare paid by males that survived, and mean
fare paid by males that did not, appropriately formatted.
(8 marks)
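The comparison of mean fares can be sketched with boolean masks, as below. The array is made up, and the column indices assume the order listed under "Column details"; the output format is only one example of "appropriately formatted".

```python
import numpy as np

# Made-up rows in the assumed column order:
# passenger_id, pclass, survived, gender, age, sibsp, parch, fare
passenger_data = np.array([
    [1, 3, 0, 1, 22, 1, 0, 7.25],
    [2, 1, 1, 0, 38, 1, 0, 71.28],
    [3, 3, 1, 0, 22, 0, 0, 7.93],
    [4, 1, 1, 1, 35, 1, 0, 53.10],
    [5, 3, 0, 1, 22, 0, 0, 8.05],
    [6, 2, 1, 1, 38, 0, 2, 30.00],
])
SURVIVED_COLUMN, GENDER_COLUMN, FARE_COLUMN = 2, 3, 7  # assumed indices

# One boolean value per row; gender 1 is male, survived 1 is yes.
is_male = passenger_data[:, GENDER_COLUMN] == 1
has_survived = passenger_data[:, SURVIVED_COLUMN] == 1

mean_fare_survived = passenger_data[is_male & has_survived, FARE_COLUMN].mean()
mean_fare_died = passenger_data[is_male & ~has_survived, FARE_COLUMN].mean()
fare_difference = mean_fare_survived - mean_fare_died

print(f"Mean fare, male survivors:     ${mean_fare_survived:.2f}")
print(f"Mean fare, male non-survivors: ${mean_fare_died:.2f}")
print(f"Difference:                    ${fare_difference:.2f}")
```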
6) A research think-tank has tasked you with automating some of the common queries
that their members make about the Titanic dataset.
Write a simple Python program for the user to query the data based on his/her given
inputs. When a user enters an option from 0 to 3, the program will process the
option accordingly.
After the option has been processed, the program will display the main menu again
and the process is repeated until the user chooses to exit.
(10 marks)
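The repeat-until-exit behaviour described above might be structured as in the sketch below. Everything here is illustrative: the option labels and handlers are placeholders, treating option 0 as the exit choice is an assumption, and the read_input parameter exists only so the loop can be driven by scripted input instead of the keyboard.

```python
def run_menu(handlers, read_input=input):
    """Show the menu until the exit option is chosen; return valid choices."""
    choices_made = []
    while True:
        print("Main menu")
        for option in sorted(handlers):
            print(f"  {option}. {handlers[option][0]}")
        print("  0. Exit")  # assumption: option 0 exits

        choice = read_input("Enter an option (0-3): ").strip()
        try:
            option = int(choice)
        except ValueError:
            option = None  # non-numeric input is treated as invalid

        if option == 0:
            print("Goodbye.")
            return choices_made
        if option in handlers:
            choices_made.append(option)
            handlers[option][1]()  # run the action for this option
        else:
            print("Invalid option, please try again.")


# Placeholder labels and actions; the real options come from the assignment.
handlers = {
    1: ("Option 1 (placeholder)", lambda: print("Option 1 selected")),
    2: ("Option 2 (placeholder)", lambda: print("Option 2 selected")),
    3: ("Option 3 (placeholder)", lambda: print("Option 3 selected")),
}

# Scripted input in place of a keyboard: a valid, an invalid, a valid, exit.
scripted_inputs = iter(["2", "9", "1", "0"])
choices = run_menu(handlers, read_input=lambda prompt: next(scripted_inputs))
```

Calling run_menu(handlers) with no second argument would read from the keyboard through input().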
For the Compute Correlation option, display a numbered list of all the column
header names and prompt the user to input the numbers representing the two
quantities for the computation of correlation. The computed correlation should be
rounded off to 3 decimal places, as per the sample run below.
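A sketch of the correlation step follows, using numpy's corrcoef(). The array is made up, the column order is assumed to match "Column details", and the two fixed choices stand in for the numbers a user would type at the prompt. The helper name column_correlation is hypothetical.

```python
import numpy as np

HEADER_NAMES = ["passenger_id", "pclass", "survived", "gender",
                "age", "sibsp", "parch", "fare"]

# Made-up rows in the assumed column order above.
passenger_data = np.array([
    [1, 3, 0, 1, 22, 1, 0, 7.25],
    [2, 1, 1, 0, 38, 1, 0, 71.28],
    [3, 3, 1, 0, 22, 0, 0, 7.93],
    [4, 1, 1, 1, 35, 1, 0, 53.10],
    [5, 3, 0, 1, 22, 0, 0, 8.05],
    [6, 2, 1, 1, 38, 0, 2, 30.00],
])


def column_correlation(data, first_index, second_index):
    """Return the Pearson correlation of two columns, rounded to 3 places."""
    # corrcoef() returns a 2x2 matrix; the off-diagonal entry is the value.
    matrix = np.corrcoef(data[:, first_index], data[:, second_index])
    return round(float(matrix[0, 1]), 3)


# Numbered list of column names, starting from 1 for the user's benefit.
for number, name in enumerate(HEADER_NAMES, start=1):
    print(f"{number}. {name}")

# Fixed choices standing in for user input (1-based menu numbers).
first_choice, second_choice = 2, 8
result = column_correlation(passenger_data, first_choice - 1, second_choice - 1)
print(f"Correlation between {HEADER_NAMES[first_choice - 1]} "
      f"and {HEADER_NAMES[second_choice - 1]}: {result:.3f}")
```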
8) In the absence of actual lifeboat data, survivor age can be used to gauge if certain
demographics were allowed on the lifeboats first.
Prompt the user to enter the passenger class number, before displaying the
corresponding rows of the 20 oldest survivors for that passenger class, in order
from oldest to youngest.
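The filter-then-sort logic can be sketched as below. The data is made up, the column indices assume the order under "Column details", and the class number is passed as an argument here rather than prompted for. The helper name oldest_survivors is hypothetical.

```python
import numpy as np

# Made-up rows in the assumed column order:
# passenger_id, pclass, survived, gender, age, sibsp, parch, fare
passenger_data = np.array([
    [1, 3, 0, 1, 22, 1, 0, 7.25],
    [2, 1, 1, 0, 38, 1, 0, 71.28],
    [3, 3, 1, 0, 22, 0, 0, 7.93],
    [4, 1, 1, 1, 35, 1, 0, 53.10],
    [5, 3, 0, 1, 22, 0, 0, 8.05],
    [6, 1, 1, 1, 54, 0, 2, 30.00],
])
PCLASS_COLUMN, SURVIVED_COLUMN, AGE_COLUMN = 1, 2, 4  # assumed indices


def oldest_survivors(data, passenger_class, limit=20):
    """Return up to limit survivor rows of one class, oldest first."""
    in_class = data[:, PCLASS_COLUMN] == passenger_class
    has_survived = data[:, SURVIVED_COLUMN] == 1
    selected = data[in_class & has_survived]
    # Sorting on the negated ages gives descending order.
    order = np.argsort(-selected[:, AGE_COLUMN], kind="stable")
    return selected[order][:limit]


np.set_printoptions(suppress=True, precision=2)
first_class_rows = oldest_survivors(passenger_data, 1)
print(first_class_rows)
```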
9) It was reported that while generally women were allowed onto lifeboats first,
researchers are also keen to identify female survivors with larger numbers of family
members on-board (not including themselves).
Write a simple lambda function to calculate a new numpy array column containing
each passenger's non-self family members on-board, by adding the count of sibling
and/or spouses, to the count of parents and/or children, for each passenger.
Append this column to the existing numpy 2-D array of values (you may need to
use numpy.reshape() before appending) and display the top 20 rows of female
survivors, ordered by highest to lowest by non-self family member count primarily,
and in case of a tie, by highest to lowest fare secondarily.
(15 marks)
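The steps above can be sketched as follows, again on made-up data with the column order assumed from "Column details". numpy.lexsort() is used here for the two-level ordering; it is one option among several, and it treats the last key passed as the primary key.

```python
import numpy as np

# Made-up rows in the assumed column order:
# passenger_id, pclass, survived, gender, age, sibsp, parch, fare
passenger_data = np.array([
    [2, 1, 1, 0, 38, 1, 0, 71.28],
    [3, 3, 1, 0, 26, 0, 0, 7.93],
    [7, 2, 1, 0, 27, 0, 2, 11.13],
    [8, 1, 1, 0, 30, 1, 0, 90.00],
    [1, 3, 0, 1, 22, 1, 0, 7.25],   # male, excluded
    [9, 3, 0, 0, 40, 3, 2, 20.00],  # did not survive, excluded
])
SURVIVED_COLUMN, GENDER_COLUMN = 2, 3           # assumed indices
SIBSP_COLUMN, PARCH_COLUMN, FARE_COLUMN = 5, 6, 7

# Lambda adding siblings/spouses to parents/children, element by element.
count_family = lambda sibsp, parch: sibsp + parch

family_counts = count_family(
    passenger_data[:, SIBSP_COLUMN], passenger_data[:, PARCH_COLUMN]
)
# Reshape the 1-D result into a single column before appending.
family_column = family_counts.reshape(-1, 1)
extended_data = np.append(passenger_data, family_column, axis=1)
FAMILY_COLUMN = extended_data.shape[1] - 1  # index of the new last column

is_female = extended_data[:, GENDER_COLUMN] == 0
has_survived = extended_data[:, SURVIVED_COLUMN] == 1
female_survivors = extended_data[is_female & has_survived]

# Negated keys give descending order; the LAST key is the primary one.
order = np.lexsort((
    -female_survivors[:, FARE_COLUMN],    # secondary: fare, high to low
    -female_survivors[:, FAMILY_COLUMN],  # primary: family count, high to low
))
top_rows = female_survivors[order][:20]

np.set_printoptions(suppress=True, precision=2)
print(top_rows)
```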
3. SCOPE
Marks may be deducted from programs that exhibit one or more of the
following undesirable characteristics (not necessarily an exhaustive list):
• Nondescript or irrelevant choice of variable names
• Lack of accompanying documentation/comments for complex code segments
Note:
• You are expected to follow the naming conventions introduced in this module.
• You should think carefully about what input, if any, is required for each option.
• You are allowed to customize your own output.
• You are required to present and explain your program to your tutor
before submission. The program need not be complete; for any incomplete
parts, a discussion of how you would proceed to complete them will suffice.
• Marks will be deducted if you are not able to show your understanding of the
program during the presentation.
Data File:
• The data file for your program is available in the Learning Materials of the
Programming for Analytics module on PolyMall.
Links to Documentation:
• Section 7.2 of https://docs.python.org/3/tutorial/inputoutput.html on Reading and
Writing Files.
• https://numpy.org/devdocs/user/how-to-io.html on reading text and csv files.
6. DELIVERABLES
• Submit your solution file to your MS Teams Class Notebook by 19 Dec 2021,
23:59 hrs.
7. ASSESSMENT
The performance criteria for grading the assignment are described below. Marks
are awarded for the program code and for the student’s understanding of the work,
as assessed during the presentation: the completed parts of the program and the
discussion of any incomplete parts.
A Grade
B Grade
C Grade
D Grade
== END OF DOCUMENT ==