TM351 Data Management and Analysis TMA01 Fall 2018 - Q-03
Branch:
Name:
StudentID:
Group number:
Instructor:
TM351 Data management and analysis
TMA01 Fall 2018 TMA01 MG
(Cut-off date will be announced)
1. Preamble:
This section contains general rules and guidelines for completing and submitting your TMA.
This TMA is provided as a Jupyter notebook. You should answer all questions inside this notebook. You will
also work through any additional course notebooks and submit them separately via a different LMS link. The
TMA requires that you demonstrate an understanding of course concepts and techniques, and an ability to
apply these to sample problems. Your tutor will be following a detailed marking scheme, but he or she will
particularly look for the following:
Using the e-library and other external sources. When asked to do so, you need to search the e-library and
the internet to identify relevant material. In particular, you are urged to use the following sources, all of which
are freely available to AOU students:
For this TMA, you will be required to submit two different items:
1. a ".rar" or ".zip" compressed directory containing the solved course notebooks you were able to
complete, along with any required data sets.
2. a second ".rar" or ".zip" compressed directory containing this notebook and any required data sets.
Please note that all notebooks you submit must be solved before you submit them. This means that you
should run all cells and show all outputs before you submit the notebook.
Submit your TMA to the LMS system on (or preferably before) the cut-off date. Your tutor will mark your script
and post the grades to the LMS.
1.3 Plagiarism
All work you submit must be yours and in your own words. Your tutor has tools available to him/her to
allow the detection of plagiarism from the Internet as well as from other colleagues. Furthermore, you may be
quizzed on the work you submitted and/or asked to demonstrate that it is indeed your own work. If you copy
material that is not your own and submit it as your own you are committing plagiarism. Plagiarism is a serious
offence and if a case of plagiarism is detected, the Arab Open University will apply severe penalties and
disciplinary procedures.
If you wish to quote other materials, including the TM351 learning materials, then you must clearly
acknowledge the source according to accepted rules of citation and referencing.
Note that it is not enough to simply list a reference at the end of the document without explicitly stating
which parts of your text quote it. Proper citation of external sources must be included. Also,
quoting should be used only sparingly, for example to stress a certain point using the words of a well-recognized
guru. Large amounts of material copied into your TMA will not be accepted, even if properly quoted. If
you need to refer to a large amount of external material, you can simply refer to the source.
You can discuss the TMA with your tutor. Your tutor will help explain unclear points in the TMA and will direct
you to useful and appropriate material in the course. However, you should not expect your tutor to supply you
with answers to TMA questions. Remember that answering the TMA is ultimately your responsibility, not your
tutor’s. In addition, working the TMA and overcoming its difficulties by yourself will help you do well in the
final examination.
Sharing knowledge and information and holding discussions with your colleagues about the course material
is called group learning and is encouraged by the Arab Open University. However, in the end, you should
complete the TMA by yourself and answer it in your own words. Collaborating in answering TMA
questions is not allowed and is not the same as group learning. You are also not allowed to use the course
Question 1 (20 marks)
The National Archives manages the website http://www.legislation.gov.uk/,
which publishes all UK legislation. For each piece of legislation, the site provides the complete text
as well as explanatory notes that explain the act to readers who are not legally qualified. Using the
explanatory notes for the data protection act, as well as your course notes, the e-library and/or other external
references, answer the following questions:
i. What, in your opinion, are the main matters provided for in the act (list four)? [4 marks]
ii. Explain what is meant by each of the following data protection principles according to the new General
Data Protection Regulation (GDPR) principles: [4 marks]
Data minimisation
Accuracy
Security
Accountability
iii. According to the new General Data Protection Regulation (GDPR) principles, explain under what circumstances
individuals have the right to have personal information erased. List at least 4 circumstances. [4 marks]
iv. Identify your local branch-country law that is equivalent to the UK data protection act. Provide information
for at least the following aspects of the law: [4 marks]
the name of the law and the name of the authority or authorities that are responsible for
enforcing it
at least two references in the Harvard style, including a link to the text of the law or to articles describing
it
when was the law enacted (provide a date)
A brief overview of the law
v. Critically compare the UK data protection act to its local equivalent as identified above. You should discuss
and compare at least the following aspects: [4 marks]
Partially completed notebooks will not be counted. All outputs must be shown.
Please demonstrate your active interaction with each notebook by including your own additions and/or
extensions to the code and/or your own additional comments. Use a double hash sign '##' to distinguish
your comments from those already provided in the notebook.
Your tutor may quiz you on the contents of the notebooks you provide.
Question 2 Answers:
This question refers to assessing the issues of data and its analysis as discussed in Part 1.
The website https://catalog.data.gov/dataset/crime-data-from-2010-to-present
publishes crime data for the city of Los
Angeles from 2010 to the present. After downloading and exploring this data set, answer the following
questions about it, writing all your answers in the cell below:
a. Carefully examine the site and the data it provides and give evidence-based assessments of the following
issues. Give a score of 1-10 for each issue, giving reasons for your mark:
Its origin
Its documentation
Its curation
The quality of its maintenance
b. Give evidence-based assessments of the following data quality attributes. Give a score of 1-10 for each
attribute, giving reasons for your mark: [6 marks]
Accuracy
Validity
Reliability
Timeliness
Consistency
Provenance
Please note that scores provided with no evidence or justification will receive no marks.
Question 3 Answers:
This question refers to the data analysis pipeline as discussed in parts 2-5. You are required to perform the
following steps in your notebook, making sure that you fully document each step.
ii. Explore the data sets and answer the following questions about each of them: [10 marks]
Is there any missing data in the two sets? Give examples. How does each set handle the missing
data? Give examples. [4 marks]
Is the data dirty in any way? Give examples of each type of dirtiness that exists in the data sets. Are
there any ambiguities in understanding the contents of the data sets? Give examples. [4 marks]
How is the data encoded in the two sets? Support your answer with evidence. [2 marks]
iii. Combine the data in both sets from within the notebook into a single dataframe by joining them on
their common zip codes. Ignore zip codes that appear in only one data set but not in the other. To complete this
task, you must first count the number of foreclosures of each type (single family, multi-family, ...) in each zip
code. The end result of this step should be a single dataframe with the columns shown below. Display the
first 10 lines of this combined dataframe. [5 marks]
Zip code - Total Males - Total Females - Total Households - Average Household size - total number of
single family foreclosures - total number of multi-family foreclosures
iv. Using a suitable visualisation technique, display the relationship between the average household size in a
certain zip code area and the total number of single family foreclosures and the total number of multi-family
foreclosures. What are your conclusions? [5 marks]
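The counting, joining and plotting steps described in parts iii and iv can be sketched as follows. This is only an illustration under stated assumptions: the two small dataframes, all column names (`zip_code`, `property_type`, `avg_household_size`, and so on) and all values are hypothetical stand-ins, not the real data sets, and a scatter plot is just one of several suitable visualisations.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the two data sets; real column names and values will differ.
census = pd.DataFrame({
    "zip_code": ["90001", "90002", "90003", "90004"],
    "total_males": [28000, 25000, 33000, 30000],
    "total_females": [29000, 26000, 34000, 31000],
    "total_households": [13000, 12000, 16000, 22000],
    "avg_household_size": [4.4, 4.2, 4.1, 2.8],
})
foreclosures = pd.DataFrame({
    "zip_code": ["90001", "90001", "90001", "90002", "90003", "90003", "90099"],
    "property_type": ["Single Family", "Single Family", "Multi-Family",
                      "Single Family", "Multi-Family", "Multi-Family", "Single Family"],
})

# Count foreclosures of each type per zip code: one row per zip, one column per type.
counts = (
    pd.crosstab(foreclosures["zip_code"], foreclosures["property_type"])
    .rename(columns={"Single Family": "single_family_foreclosures",
                     "Multi-Family": "multi_family_foreclosures"})
    .reset_index()
)

# An inner join keeps only the zip codes that appear in both data sets.
combined = census.merge(counts, on="zip_code", how="inner")
print(combined.head(10))

# Scatter plot of household size against each foreclosure count.
fig, ax = plt.subplots()
ax.scatter(combined["avg_household_size"], combined["single_family_foreclosures"],
           label="Single family")
ax.scatter(combined["avg_household_size"], combined["multi_family_foreclosures"],
           marker="x", label="Multi-family")
ax.set_xlabel("Average household size")
ax.set_ylabel("Number of foreclosures")
ax.legend()
```

Zip codes should be kept as strings on both sides before merging, since a mismatch between integer and string keys would make the join fail or drop rows.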
Question 4 Answers:
1. Download this file by clicking on the link above, and save the file to the data directory in the folder that
has this TMA file. Then, convert this set into a pandas DataFrame structure and display the first 5 lines
for viewing. [4 marks] If you get a warning such as "Columns (x,y) have mixed types", then you should
re-import the file into a pandas DataFrame using the dtype and the low_memory options. [5 marks]
2. For all crimes committed in the WHEATON police district that involved more than 4 victims, list the
NIBRS code, the number of victims, the police district name, Crime Name2, and the Location of the
crime. Display all results. [5 marks]
3. For all crimes involving more than 3 victims, display the police district name, and the total number of
crimes committed in that district. Display the results [5 marks]
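The three steps above can be sketched on a tiny in-memory sample. This is a hedged illustration, not the expected solution: the CSV text, the exact column headers (`NIBRS Code`, `Victims`, `Police District Name`, `Crime Name2`, `Location`) and every value are assumptions made for the example. With the real file, `pd.read_csv` would be pointed at the downloaded path instead of `io.StringIO`.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the downloaded crime file.
csv_text = """NIBRS Code,Victims,Police District Name,Crime Name2,Location
13B,5,WHEATON,Simple Assault,LOC A
23H,1,WHEATON,All Other Larceny,LOC B
120,6,SILVER SPRING,Robbery,LOC C
13A,4,ROCKVILLE,Aggravated Assault,LOC D
90Z,7,WHEATON,All Other Offenses,LOC E
"""

# Step 1: state dtypes explicitly for columns that mix digits and letters,
# and pass low_memory=False so pandas infers types from the whole file at once.
crimes = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"NIBRS Code": str, "Police District Name": str},
    low_memory=False,
)
print(crimes.head(5))

# Step 2: crimes in the WHEATON district with more than 4 victims.
columns = ["NIBRS Code", "Victims", "Police District Name", "Crime Name2", "Location"]
wheaton = crimes.loc[
    (crimes["Police District Name"] == "WHEATON") & (crimes["Victims"] > 4),
    columns,
]
print(wheaton)

# Step 3: number of crimes per district among crimes with more than 3 victims.
per_district = (
    crimes[crimes["Victims"] > 3]
    .groupby("Police District Name")
    .size()
    .reset_index(name="Total Crimes")
)
print(per_district)
```

Reading `NIBRS Code` as a string keeps a code such as `120` from being parsed as a number, which is the kind of mixed-type column that triggers the warning mentioned in step 1.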
Question 5 Answers: