
TM351 Data Management and Analysis TMA01 Fall 2018 - Q-03

This document provides instructions for completing a TMA assignment. Students must submit two zip files - one containing solved Jupyter notebooks from the course and one containing the main TMA notebook with answers. The TMA focuses on data protection laws and requires students to analyze a data set on crime in Los Angeles. It instructs students to properly cite sources, avoid plagiarism, and complete the work independently while being allowed to discuss concepts with peers.

Please complete the information below :

Branch:

Name:

StudentID:

Group number:

Instructor:
TM351 Data management and analysis
TMA01 Fall 2018 TMA01 MG
(Cut-off date will be announced)

1. Preamble:

This section contains general rules and guidelines for completing and submitting your TMA.

1.1 General guidelines

This TMA is provided as a Jupyter notebook. You should answer all questions inside this notebook. You will
also work through any additional course notebooks and submit them separately via a different LMS link. The
TMA requires that you demonstrate an understanding of course concepts and techniques, and an ability to
apply these to sample problems. Your tutor will be following a detailed marking scheme, but he or she will
particularly look for the following:

that all work is your own
that you have provided references in the proper format whenever required
that you have used the course concepts, terminology and prescribed software

Using the e-library and other external sources. When asked to do so, you need to search the e-library and
the internet to identify relevant material. In particular, you are urged to use the following sources, all of which
are freely available to AOU students:

1. References provided in your course materials
2. AOU’s subscribed e-library, accessible through the LMS, which includes a number of different resources
3. Google books
4. Google scholar

1.2 Submitting your TMA

For this TMA, you will be required to submit two different items:

1. a ".rar" or ".zip" compressed directory containing the solved course notebooks you were able to
complete, along with any required data sets.
2. a second ".rar" or ".zip" compressed directory containing this notebook and any required data sets.

Please note that all notebooks you submit must be solved before you submit them. This means that you
should run all cells and show all outputs before you submit the notebook.

Submit your TMA to the LMS system on (or preferably before) the cut-off date. Your tutor will mark your script
and post the grades to the LMS.

1.3 Plagiarism

All work you submit must be yours and in your own words. Your tutor has tools available to him/her to
allow the detection of plagiarism from the Internet as well as from other colleagues. Furthermore, you may be
quizzed on the work you submitted and/or asked to demonstrate that it is indeed your own work. If you copy
material that is not your own and submit it as your own you are committing plagiarism. Plagiarism is a serious
offence and if a case of plagiarism is detected, the Arab Open University will apply severe penalties and
disciplinary procedures.

1.3.1 Quoting and Referencing.

If you wish to quote other materials, including the TM351 learning materials, then you must clearly
acknowledge the source according to accepted rules of citation and referencing.

Note that it is not enough to simply post a reference at the end of the document without explicitly stating
which parts of your reference are being quoted. Proper citation of external sources must be included. Also,
quoting should only be used in a limited fashion; to stress a certain point using the words of a well-recognized guru, for
example. Large amounts of material copied into your TMA will not be accepted, even if properly quoted. If
you need to refer to a large amount of external material, you can simply refer to the source.

1.3.2 Getting help and collaborating with colleagues.

You can discuss the TMA with your tutor. Your tutor will help explain unclear points in the TMA and will direct
you to useful and appropriate material in the course. However, you should not expect your tutor to supply you
with answers to TMA questions. Remember that answering the TMA is ultimately your responsibility, not your
tutor’s. In addition, working the TMA and overcoming its difficulties by yourself will help you do well in the
final examination.

Sharing knowledge and information and holding discussions with your colleagues about the course material
is called group learning and is encouraged by the Arab Open University. However, in the end, you should
complete the TMA by yourself and answer the TMA in your own words. Collaborating in answering TMA
questions is not allowed and is not the same as group learning. You are also not allowed to use the course
Question 1 (20 marks)
The National Archives manages the website http://www.legislation.gov.uk/,
which publishes all UK legislation. For each piece of legislation, the site provides the complete text
as well as explanatory notes that explain the act to readers who are not legally qualified. Using the
explanatory notes for the Data Protection Act, as well as your course notes, the e-library and/or other external
references, answer the following questions:

i. What, in your opinion, are the main matters provided for in the act (list four)? [4 marks]

ii. Explain what is meant by each of the following data protection principles according to the new General
Data Protection Regulation (GDPR) principles: [4 marks]

Data minimisation
Accuracy
Security
Accountability

iii. According to the new General Data Protection Regulation (GDPR) principles, explain under what circumstances
individuals have the right to have personal information erased. List at least 4 circumstances. [4 marks]

iv. Identify the law in your local branch country that is equivalent to the UK Data Protection Act. Provide information
for at least the following aspects of the law: [4 marks]

The name of the law and the name of the authority or authorities responsible for
enforcing it
At least two references in the Harvard style, including a link to the text of the law or to articles describing
it
When the law was enacted (provide a date)
A brief overview of the law

v. Critically compare the UK data protection act to its local equivalent as identified above. You should discuss
and compare at least the following aspects: [4 marks]

How the term "personal data" is defined
Principles of the law
Individual rights
Types of processing (General, law enforcement, intelligence services, ...)

*write your answer in the markdown cell below*


Question 1 Answers:
Question 2 (30 marks)
Complete and submit all the following Jupyter notebooks in the form of a "solved" .rar or .zip file: [30 marks]

0.1 Scribble pad
1.1 IPython Boot camp
1.2 IPython Boot camp
1.3 IPython Boot camp
1.4 IPython Boot camp
1.5 IPython Boot camp
2.2.0 Data file formats, file encodings
2.1 Pandas dataFrames
2.2.1 Data file formats -CSV
2.2.2 Data file formats - JSON
2.2.3 Data file formats - other
3.1 Cleaning data
3.2 Selecting and projecting, sorting and limiting
3.3 Combining data from multiple data sets
3.4 Handling missing data
4.1 Crosstabs and pivot tables
4.2 Descriptive statistics in pandas
4.3 Simple visualisations in pandas
4.4 Activity 4.4 Walkthrough
4.5 Split-apply-combine with SQL and pandas
4.6 Introducing regular expressions
4.7 Reshaping data with pandas
--- show a screenshot from OpenRefine
8.1 Movies dataset
9.1 SQL DDL
9.2 SQL DML
9.3 SQL views
10.7 Outer join operations
11.1 SQL set operations
11.2 SQL subqueries

Please note that:

You will receive 1 mark for each completed notebook, including your own scribble pad notebook and a
screenshot of the OpenRefine tool, for a total of 30 marks.

Partially completed notebooks will not be counted. All outputs must be shown.
Please demonstrate your active interaction with each notebook by including your own additions and/or
extensions to the code and/or your own additional comments. Use a double hash sign '##' to distinguish
your comments from those already provided in the notebook.
Your tutor may quiz you on the contents of the notebooks you provide.
Question 2 Answers:

*Submit your completed notebooks and screenshots through the LMS*

<font color = red> Model answer </font>


<font color = red> Students will submit their solved notebooks through special local LMS links that the
instructors in each branch will provide. </font> Award 1 mark for each completed and "solved" notebook /
OpenRefine screenshot for a total of 30 marks. The only requirements for this question are that the
notebooks should be complete and should represent the student's own work.

Question 3 (10 Marks)


Assessing data

This question refers to assessing the issues of data and its analysis as discussed in Part 1.
The website https://catalog.data.gov/dataset/crime-data-from-2010-to-present
publishes crime data for the city of Los
Angeles from 2010 to the present. With respect to this data set, answer the following questions after
downloading and exploring it, writing all your answers in the cell below:

a. Carefully examine the site and the data it provides and give evidence-based assessments of the following
issues. Give a score of 1-10 for each issue, giving reasons for your mark:

Trust in the data with regards to: [4 marks]

Its origin
Its documentation
Its curation
The quality of its maintenance

b. Give evidence-based assessments of the following data quality attributes. Give a score of 1-10 for each
attribute, giving reasons for your mark: [6 marks]

Accuracy
Validity
Reliability
Timeliness
Consistency
Provenance

Please note that scores provided with no evidence or justification will receive no marks.
Question 3 Answers:

Question 4 (25 marks)


Data Analysis Pipeline:

This question refers to the data analysis pipeline as discussed in parts 2-5. You are required to perform the
following steps in your notebook, making sure that you fully document each step.

i. Capture the data sets from the two sites https://catalog.data.gov/dataset/2015-registered-foreclosure-properties and
https://catalog.data.gov/dataset/2010-census-populations-by-zip-code, which contain data on property foreclosures and demographic data in the
City of Los Angeles, respectively, from within the notebook. You should place each data set in a pandas
dataframe and display the first 10 lines of each dataframe. This task uses the techniques explained in Part 2.
[5 marks]
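
As a rough illustration of the capture step, the sketch below reads CSV text into a pandas dataframe and shows its first rows. The column names and values are made-up placeholders, not the real data; `load_csv` is a hypothetical helper, and in the notebook its argument would be a local path or download link for each data set rather than an in-memory string.

```python
import io

import pandas as pd

# Made-up stand-in for a downloaded CSV file (placeholder columns and values).
SAMPLE_CSV = """Zip Code,Total Males,Total Females
90001,100,110
90002,200,210
90003,300,310
"""


def load_csv(source):
    """Read CSV data from a path, URL or file-like object into a DataFrame."""
    # Read zip codes as strings so they are treated as labels, not numbers.
    return pd.read_csv(source, dtype={"Zip Code": str})


population = load_csv(io.StringIO(SAMPLE_CSV))

# head(10) returns at most the first 10 rows of the dataframe.
print(population.head(10))
```

Reading the zip code column as a string is a design choice assumed here: it keeps the join key identical in both data sets and avoids losing leading zeros.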

ii. Explore the data sets and answer the following questions about each of them: [10 marks]

Is there any missing data in the two sets? Give examples. How does each set handle the missing
data? Give examples. [4 marks]
Is the data dirty in any way? Give examples of each type of dirtiness that exists in the data set. Are
there any ambiguities in understanding the contents of the data set? Give examples. [4 marks]
How are the data encoded in the two sets? Support your answer with evidence. [2 marks]
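
One possible way to start looking for missing data is sketched below, using a small made-up sample (the column names and values are assumptions, not taken from the real files). It counts the values pandas treats as missing, then re-reads the same text with default missing-value detection switched off to see how the gaps were actually written in the file.

```python
import io

import pandas as pd

# Made-up sample: one empty field and one "N/A" marker.
SAMPLE = """Zip Code,Property Type,Lender
90001,Single Family,Bank A
90002,,Bank B
90003,Multi-Family,N/A
"""

# Default parsing: empty fields and markers such as "N/A" become NaN.
df = pd.read_csv(io.StringIO(SAMPLE), dtype={"Zip Code": str})
missing_per_column = df.isna().sum()
print(missing_per_column)

# Raw view: keep every field as the text that appears in the file.
raw = pd.read_csv(io.StringIO(SAMPLE), dtype=str, keep_default_na=False)
print(raw[df.isna().any(axis=1)])
```

Comparing the two views gives evidence of how a data set represents missing values (blank fields, marker strings, and so on).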

iii. Combine the data in both sets from within the notebook into a single dataframe by joining them based on
the common zip codes. Ignore zip codes that appear in only one data set but not in the other. To complete this
task, you must first count the number of foreclosures of each type (single family, multi-family, ..) in each zip
code. The end result of this step would be a single data frame with the columns shown below. Display the
first 10 lines of this combined data frame. [5 marks]

Zip code - Total Males - Total Females - Total Households - Average Household size - total number of
single family foreclosures - total number of multi-family foreclosures
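
The count-then-join step can be sketched as follows. This is only an illustration on toy dataframes: the column names ("Property Type" and so on) and all values are assumptions, and `pd.crosstab` is one of several ways to count foreclosures per type. An inner merge keeps only the zip codes present in both dataframes.

```python
import pandas as pd

# Toy foreclosure records (assumed column names, made-up values).
foreclosures = pd.DataFrame({
    "Zip Code": ["90001", "90001", "90001", "90002", "90004"],
    "Property Type": ["Single Family", "Single Family", "Multi-Family",
                      "Multi-Family", "Single Family"],
})

# Toy census records (assumed column names, made-up values).
population = pd.DataFrame({
    "Zip Code": ["90001", "90002", "90003"],
    "Total Males": [100, 200, 300],
    "Total Females": [110, 210, 310],
    "Total Households": [50, 80, 120],
    "Average Household Size": [4.2, 5.1, 5.0],
})

# Count foreclosures of each type per zip code: one row per zip code,
# one column per property type, 0 where a type does not occur.
counts = pd.crosstab(foreclosures["Zip Code"],
                     foreclosures["Property Type"]).reset_index()
counts.columns.name = None
counts = counts.rename(columns={
    "Single Family": "Single family foreclosures",
    "Multi-Family": "Multi-family foreclosures",
})

# Inner join: zip codes found in only one dataframe are dropped.
combined = population.merge(counts, on="Zip Code", how="inner")
print(combined.head(10))
```

In this toy example, zip codes 90003 (census only) and 90004 (foreclosures only) do not appear in the combined dataframe.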

iv. Using a suitable visualisation technique, display the relationship between the average household size in a
certain zip code area and the total number of single family foreclosures and the total number of multi-family
foreclosures. What are your conclusions? [5 marks]
Question 4 Answers:

Question 5 (15 marks)


In this question, you will write Python code that will interact with a pandas dataframe using pandasql.

The following URL, https://data.montgomerycountymd.gov/api/views/icn6-v9z3/rows.csv,
refers to a CSV file of crime data in
Montgomery County in the State of Maryland from 2010 to present. In this question, you will perform a series
of steps to download and query this dataset using pandas SQL. Answer the following questions in a
sequence of separate code cells:

1. Download this file by clicking on the link above, and save the file to the data directory in the folder that
has this TMA file. Then, convert this set into a pandas DataFrame structure and display the first 5 lines
for viewing. [4 marks] If you get a message: Columns(x,y) have inconsistent types, then you should
re-import the file into a pandas dataframe using the dtype and the low_memory options. [5 marks]
2. For all crimes committed in WHEATON Police district and that involved more than 4 victims, list the
NIBRS code, the number of victims, the police district name, Crime Name2, and the Location of the
crime. Display all results [5 marks]
3. For all crimes involving more than 3 victims, display the police district name, and the total number of
crimes committed in that district. Display the results [5 marks]
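
The general shape of these steps can be sketched on a tiny made-up sample. Two assumptions are worth stating: the column names below are guessed from the wording of the questions and may not match the real file exactly, and the standard-library `sqlite3` module is swapped in for pandasql so that the sketch needs no extra package. The SQL text itself would look much the same when run through pandasql.

```python
import io
import sqlite3

import pandas as pd

# Made-up sample rows; column names are assumptions based on the question text.
SAMPLE = """NIBRS Code,Victims,Police District Name,Crime Name2,Location
13A,5,WHEATON,Aggravated Assault,"(39.0, -77.0)"
23H,1,WHEATON,All Other Larceny,"(39.1, -77.1)"
120,6,ROCKVILLE,Robbery,"(39.2, -77.2)"
13B,2,ROCKVILLE,Simple Assault,"(39.3, -77.3)"
13A,7,WHEATON,Aggravated Assault,"(39.4, -77.4)"
"""

# dtype pins a column that mixes numeric-looking and text codes to strings;
# low_memory=False makes pandas infer types from the whole file at once.
crimes = pd.read_csv(io.StringIO(SAMPLE), dtype={"NIBRS Code": str},
                     low_memory=False)
print(crimes.head(5))

# Copy the dataframe into an in-memory SQLite table so it can be queried.
conn = sqlite3.connect(":memory:")
crimes.to_sql("crimes", conn, index=False)

# Column names containing spaces must be double-quoted in SQL.
wheaton = pd.read_sql_query(
    """SELECT "NIBRS Code", Victims, "Police District Name",
              "Crime Name2", Location
       FROM crimes
       WHERE "Police District Name" = 'WHEATON' AND Victims > 4""",
    conn,
)

per_district = pd.read_sql_query(
    """SELECT "Police District Name", COUNT(*) AS total_crimes
       FROM crimes
       WHERE Victims > 3
       GROUP BY "Police District Name"
       ORDER BY "Police District Name" """,
    conn,
)
conn.close()

print(wheaton)
print(per_district)
```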

Question 5 Answers:
