A2RIB_T4

This assignment focuses on data wrangling and basic statistics using R. Students are required to perform various tasks including loading datasets, cleaning data, calculating statistics, and interpreting results. Submissions must include a PDF and R script, with specific naming conventions and deadlines.

Uploaded by

e.stephenson

Assignment 2

Instructions (please read carefully)

General information: This assignment focuses on data wrangling and basic
univariate/bivariate statistics.

Submitting the assignment: Please use this Word file as your starting point. Add your
answers in the boxes below the questions. Please also copy-paste the R code that you use if
the question asks you to do so. Once you have completed it, convert this Word document to
PDF and submit the PDF as well as the R script that you used to arrive at the answers in
Canvas under Assignments -> Assignment 2.

Remember to upload the files to Canvas on Thursday before 12h (noon).

Please name the PDF document and the R script “A2RIB_TeamName”. For example, if Team
A submitted the files, they would be named A2RIB_TA.pdf and A2RIB_TA.R.

To check:

1. Set the working directory as your main folder (under Session -> Set Working
Directory).
2. Consult the R instructional videos and the Analysis: “Data-Wrangling A Key Skill”
chapter.
3. Make sure you download the necessary data for this assignment (provided in Canvas).
4. Make sure you install the packages that we introduced in this session, namely
“skimr”, “janitor”, and “kableExtra”.
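
The package and working-directory checks above can be sketched in R as follows. This is only a minimal sketch: the object names pkgs and missing_pkgs are illustrative, and the install line is left commented out because it needs an internet connection.

```r
# Packages named in the checklist above
pkgs <- c("skimr", "janitor", "kableExtra")

# Check which of them are not installed yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
missing_pkgs

# Install only the missing ones (uncomment to run)
# install.packages(missing_pkgs)

# Show the working directory R is currently using
getwd()
```

If getwd() does not point to your main folder, set it via Session -> Set Working Directory as described in step 1.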

Questions

Methodology

1. From the master theses shared on Canvas (module: Master theses examples), identify
and comment on the execution of one particular research design that the student used.
Note: you may pick any of the master theses.

2. From the master theses shared on Canvas, screenshot an example of summary
statistics (e.g., in text or table form). Explain what the summary statistics are
telling you.

R section

1. Load the vehicle_data.csv file into an object named d. Note that you will have to add a
few extra arguments to the function that reads the file to make it work. Explain why you
needed to add the extra arguments. Copy in the code you used to answer this question.
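
As a general illustration of why a read function may need extra arguments, the sketch below writes a small made-up file and reads it back. The file contents, the semicolon separator, and the extra first line are assumptions chosen for the example; they say nothing about how vehicle_data.csv is actually laid out.

```r
# Hypothetical example file: semicolon-separated, with one non-data line on top
tmp <- tempfile(fileext = ".csv")
writeLines(c("some export note", "name;km", "a;100", "b;200"), tmp)

# Inspect the raw lines first to see which arguments are needed
readLines(tmp, n = 3)

# sep sets the column separator, skip drops leading lines,
# header = TRUE uses the first remaining line as column names
demo <- read.csv(tmp, sep = ";", skip = 1, header = TRUE)
str(demo)
```

Looking at the raw lines with readLines() before importing is usually the quickest way to decide which arguments a given file requires.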

Data cleaning

2. Remove the first 5 rows of d and rename “q3” to “Year” and “q88” to
“Transmission”. Also remove the CNG-type cars from the dataset. How many rows
does the data have now, and how many of these rows are Petrol cars? Copy in the code
you used to answer this question.

3. In the d dataset, create a new variable called internal. It should have
the value “fast_sell” if the car has been driven more than 50,000 kilometers and is a
diesel car; otherwise it should have the value “slow_sell”. What percentage of cars are
fast_sell and what percentage are slow_sell? Copy in the code you used to answer this
question.

Before you start with the next question, familiarize yourself with the function round(). Its
usage should be clear from the name, but you can find more info in the help documentation.
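
A minimal sketch of round() on made-up numbers (the vector x and the proportion are illustrative only):

```r
x <- c(3.14159, 2.71828, 45.678)

# Second argument is the number of decimal places to keep
two_dec <- round(x, 2)
two_dec

# Without a second argument, round() rounds to whole numbers
whole <- round(x)
whole

# Typical use: turn a proportion into a percentage with 2 decimals
pct <- round(100 * (7 / 9), 2)
pct
```

See ?round for the full help page.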

4. What percentage of cars are diesel cars and sold by an individual? Round the numbers
to 2 decimal places. Copy in the code you used to answer this question.

5. The last 4 variables in your dataset d, starting with “perauth…”, are from an
authenticity measure that was asked of the car owners. In essence, car owners were
asked how authentic the drive feels. What is the Cronbach’s alpha for these 4
variables? What does Cronbach’s alpha measure? Explain. Copy in the code you used
to answer this question.

6. What are the average selling prices as a function of fuel type, seller type, and
transmission? Round the numbers to 2 decimals and make a nice table. Copy in the
code you used to answer this question. Screenshot your table and copy it in as well.

7. What is the correlation between selling price and kilometers driven? Provide an
interpretation of the correlation. What is the p-value of the correlation? Copy in the
code you used to answer this question.

8. What is the correlation between selling price and car year (i.e., year of car make)?
Provide an interpretation of the results.
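
For the correlation questions above, one base-R option that reports both the coefficient and its p-value is cor.test(). The sketch below uses R's built-in mtcars data purely for illustration; it is not the assignment data, and the object names res, r, and p are arbitrary.

```r
# Built-in mtcars data, used only as a stand-in for illustration
res <- cor.test(mtcars$mpg, mtcars$wt)  # Pearson correlation by default

r <- unname(res$estimate)  # correlation coefficient
p <- res$p.value           # p-value for H0: the true correlation is 0

round(r, 2)
p
```

The sign of r gives the direction of the relationship, its absolute value the strength, and the p-value indicates whether the correlation differs reliably from zero.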

For this next part, download the housing.csv dataset. This dataset provides information on
median house prices for California districts derived from the 1990 census. The dataset
variables are the following:

longitude: A measure of how far west a house is; a higher value is farther west
latitude: A measure of how far north a house is; a higher value is farther north
housingMedianAge: Median age of a house within a block; a lower number is a newer
building
totalRooms: Total number of rooms within a block
totalBedrooms: Total number of bedrooms within a block
population: Total number of people residing within a block
households: Total number of households, a group of people residing within a home unit, for a
block
medianIncome: Median income for households within a block of houses (measured in tens
of thousands of US Dollars)
medianHouseValue: Median house value for households within a block (measured in US
Dollars)
oceanProximity: Location of the house relative to the ocean/sea

9. Load the housing.csv dataset into R and call it d1. Get a brief overview of the data
and describe it: what are its specifics, and what are the variable types? Use functions
like skim(), summary()… Are there any particularities in the data that we should
attend to?

10. Remove all rows that have NAs. Once you have done that, calculate the average
median house value for blocks where the total number of people residing within the
block is higher than 1000. Copy in the code that you used.

11. Create a rough estimate of price per square meter by creating a new variable that
divides median house value by total rooms. Split this newly created variable into a
“low” category if the number is lower than 3 and a “high” category otherwise. What is
the average median house value for low? Copy in the code that you used.

12. Does this value (the one you obtained in question 11) change depending on ocean
proximity? Copy in the code that you used.

13. What is the correlation between median income and median house value? Provide an
interpretation of the results. Copy in the code that you used.

14. What is the correlation between median income, median house value, total bedrooms,
and population? You can use the correlation() function from the correlation package
to help you create a correlation matrix: https://easystats.github.io/correlation/ Note:
there are now 4 variables you are looking at. Provide an interpretation of the results.
Copy in the code that you used and screenshot the correlation matrix you obtained.
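
A correlation matrix for several variables can be sketched as follows. The example again uses the built-in mtcars data and four arbitrarily chosen columns as stand-ins, not the housing variables. Base R's cor() is used for the matrix; the correlation() call runs only if the correlation package is installed.

```r
# Four numeric columns from the built-in mtcars data (illustration only)
vars <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# Base-R correlation matrix, rounded to 2 decimals
cor_mat <- round(cor(vars, use = "complete.obs"), 2)
cor_mat

# If the correlation package is available, it also reports p-values
if (requireNamespace("correlation", quietly = TRUE)) {
  print(correlation::correlation(vars))
}
```

Each cell of the matrix is the pairwise correlation between the row and column variable, so the diagonal is always 1 and the matrix is symmetric.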
