Census 8
1. Introduction
This project works with fictitious census data for a hypothetical town, with the aim of
addressing the fundamental land-use and resource-management questions facing the local
authority. Census data is a powerful tool for governments because it reveals the qualitative
and quantitative details of a population, which are essential for identifying areas of growth
and shaping future policy and infrastructure. The sample data resembles a census that might
have been taken in the historical past: details of household members, including marital
status, age, occupation and religion, were recorded.
In this project, we focus on cleaning and analysing the data to inform decisions on two
primary issues: what should be built on a plot of land that currently has no dwellings, and
which area the town should invest in. Concretely, the analysis will answer the question of
what should be built: high-density housing, low-density housing, a train station, a religious
building, an emergency medical centre, or some other infrastructure. It will also evaluate
the merits of investing in employment and training opportunities, care for the elderly,
education, or general physical infrastructure.
By analysing age distribution, employment rates, religious affiliation, marital status,
household occupancy, the number of university students, and birth and death rates, this
project provides evidence-based advice on further development and investment in the town.
The result is a coordinated, well-researched plan that addresses the current and future
needs of the town's inhabitants in a balanced way.
2. Data cleaning
In this project, data cleaning was a critical step to ensure the accuracy and reliability of the
analysis. The dataset initially contained several missing values and inconsistencies, which
needed to be addressed before conducting any meaningful analysis. The data cleaning process
involved handling missing values across various columns, converting data types where
necessary, and standardizing certain entries to maintain consistency.
The dataset had 8,291 entries and 12 columns, with missing values spread across multiple
columns. Here's an overview of the initial data quality:
The House Number, Street, First Name, Surname, Age, Gender, and Occupation
columns had 41 missing values each.
The Relationship to Head of House column had 644 missing values.
The Marital Status column had 2,142 missing values.
The Infirmity column had 8,223 missing values, indicating that most entries were
missing for this attribute.
The Religion column had 4,814 missing values.
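The overview above can be reproduced with a short missing-value summary. The following is a minimal sketch, assuming the census has been loaded into a pandas DataFrame; the function name `missing_value_summary` and the tiny sample frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def missing_value_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Tiny illustrative frame (hypothetical values, not the census data).
sample = pd.DataFrame({
    "Age": ["34", None, "61", "7"],
    "Religion": [None, None, "Christian", None],
    "Infirmity": [None, None, None, None],
})
summary = missing_value_summary(sample)
print(summary)
```

Note that `isna()` only counts true missing values; blank or whitespace-only strings would need to be converted to missing values first if the raw file contains them.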
To address these issues, the following data cleaning steps were implemented:
1. House Number: Missing values were filled with the median of the existing house
numbers, as this would provide a reasonable estimate for a numerical field.
2. Street: Missing street names were replaced with the mode (most frequent) street
name, on the assumption that the most common street is the likeliest match.
3. First Name and Surname: Missing names were filled with the placeholder
'Unknown', acknowledging that these are critical identifiers but were not available for
some entries.
4. Age: The Age column, which was initially of type object, was converted to a numeric
type to facilitate analysis. Any non-numeric values were coerced into NaN (missing
values), which were then filled with the median age of the dataset to approximate the
age of individuals where it was missing.
5. Relationship to Head of House: Missing values in this categorical field were filled
with the mode, as it represents the most common relationship in the dataset.
6. Marital Status: Similarly, the mode was used to fill missing values in the marital
status field, ensuring consistency in demographic data.
7. Gender: Missing gender entries were filled with the mode, aligning with the most
common gender recorded.
8. Occupation: Missing values were filled with the most common occupation so that
work-related analysis could use complete records.
9. Infirmity: As noted above, most cells in this column were empty, so all blank cells
were coded as 'Unknown'. This preserves the recorded infirmities while making the
missing information explicit.
10. Religion: The mode was used to fill missing values in the religion column, providing
a common religious affiliation in cases where the data was absent.
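The ten steps above can be sketched in pandas roughly as follows. This is an illustrative reconstruction rather than the project's actual code: the column names are taken from the text, the function name `impute_census` is hypothetical, and converting House Number to a numeric type before taking the median is an assumption.

```python
import pandas as pd

MEDIAN_COLS = ["House Number", "Age"]
MODE_COLS = ["Street", "Relationship to Head of House", "Marital Status",
             "Gender", "Occupation", "Religion"]
PLACEHOLDER_COLS = ["First Name", "Surname", "Infirmity"]


def impute_census(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values by median, mode or the 'Unknown' placeholder."""
    out = df.copy()
    for col in MEDIAN_COLS:
        # Non-numeric entries are coerced to NaN, then filled with the median.
        out[col] = pd.to_numeric(out[col], errors="coerce")
        out[col] = out[col].fillna(out[col].median())
    for col in MODE_COLS:
        modes = out[col].mode()
        if not modes.empty:  # skip columns that have no observed values
            out[col] = out[col].fillna(modes.iloc[0])
    for col in PLACEHOLDER_COLS:
        out[col] = out[col].fillna("Unknown")
    return out


# Small hypothetical sample, not the census data.
sample = pd.DataFrame({
    "House Number": [1, 3, None],
    "Street": ["High Street", None, "High Street"],
    "First Name": ["Ann", None, "Raj"],
    "Surname": ["Lee", "Cole", None],
    "Age": ["30", "forty", "50"],
    "Relationship to Head of House": ["Head", None, "Head"],
    "Marital Status": ["Married", "Married", None],
    "Gender": ["Female", None, "Female"],
    "Occupation": ["Teacher", "Teacher", None],
    "Infirmity": [None, None, None],
    "Religion": [None, "Christian", "Christian"],
})
cleaned = impute_census(sample)
print(cleaned)
```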
To ensure the data is ready for modelling, further cleaning steps were added to check for and
handle invalid values in the Age and Gender columns. I also added one more column, Age
Group, in which each group spans a range of 4 (for example, 35-39). These steps were as follows:
1. Handling Invalid Age Values:
o A check was performed to identify any Age values that were unrealistic,
specifically those less than 0 or greater than 120. Such values were replaced
with pd.NA (missing values) to indicate that the age data was not valid.
o After addressing the invalid ages, any missing values in the Age column were
filled with the median age of the dataset to maintain consistency in
demographic analysis.
2. Standardizing Gender Values:
o A gender mapping dictionary was created to standardize the various
representations of gender in the dataset. For example, 'f', 'female', and 'F' were
all mapped to 'Female', while 'm', 'male', and 'M' were mapped to 'Male'.
o The Gender column was then updated using this mapping to ensure
consistency across entries.
o Any remaining empty values in the Gender column were replaced with pd.NA,
and subsequently, all pd.NA values were filled with 'Unknown' to maintain
completeness in the data.
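The two validation steps above might look like the sketch below. It makes some assumptions that differ slightly from the description: invalid ages are blanked with `Series.where` (which yields NaN rather than pd.NA), the gender lookup is made case-insensitive so that 'F' and 'M' need no separate entries, and any unrecognised gender value is treated as 'Unknown'. The function names are hypothetical.

```python
import pandas as pd

# Keys are lower-case; values are compared after stripping and lower-casing.
GENDER_MAP = {"f": "Female", "female": "Female", "m": "Male", "male": "Male"}


def clean_age(age: pd.Series) -> pd.Series:
    """Blank out ages outside 0-120, then fill gaps with the median age."""
    age = pd.to_numeric(age, errors="coerce")
    age = age.where(age.between(0, 120))  # out-of-range values become NaN
    return age.fillna(age.median())


def normalise_gender(value) -> str:
    """Map one raw gender entry to 'Female', 'Male' or 'Unknown'."""
    if isinstance(value, str):
        return GENDER_MAP.get(value.strip().lower(), "Unknown")
    return "Unknown"  # None, NaN and other non-text entries


# Hypothetical sample values, not the census data.
ages = clean_age(pd.Series(["25", "-3", "150", "abc", "40", "61"]))
genders = pd.Series(["f", "Female", "M", " male ", "", None]).map(normalise_gender)
print(ages.tolist())
print(genders.tolist())
```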
3. Population Demographics
The age group distribution shows a population concentrated around middle age. The largest
share falls within the 35-39 age group, which comprises 8.91% of the total. The percentage
gradually declines as age increases: the 65-69 age group represents 3.53% and the 70-74 age
group 2.36%. The share continues to shrink with advancing age, down to the 100+ age group,
which accounts for just 0.21% of the total population. The decline is most pronounced after
65 years, the usual pattern as older cohorts thin out, so the elderly categories hold a
steadily decreasing proportion of residents.
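Percentages like these can be derived by binning ages and normalising the counts. The sketch below assumes five-year bands (0-4, 5-9, ..., 95-99) plus a 100+ band, which matches the groups quoted in the text; the function name `age_group_percentages` and the sample ages are hypothetical.

```python
import pandas as pd


def age_group_percentages(age: pd.Series) -> pd.Series:
    """Percentage of people in each five-year age band, plus a 100+ band."""
    edges = list(range(0, 101, 5)) + [float("inf")]  # 0, 5, ..., 100, inf
    labels = [f"{lo}-{lo + 4}" for lo in range(0, 100, 5)] + ["100+"]
    # right=False makes each band include its lower edge, e.g. [35, 40).
    groups = pd.cut(age, bins=edges, labels=labels, right=False)
    shares = groups.value_counts(normalize=True, sort=False) * 100
    return shares.sort_index().round(2)


# Hypothetical ages, not the census data.
pct = age_group_percentages(pd.Series([2, 35, 36, 37, 38, 67, 72, 101]))
print(pct[pct > 0])
```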
3.7 Commuters
The town has a total of 533 university students and 5,370 employed residents. Because there
are no universities located within the town, every university student must commute to
attend their educational institution. Among the employed, occupations that are likely to
involve commuting include roles in IT technical support, health professions, and specialized
technology fields. For instance, IT technical support staff, health physicists and brewing
technologists may need to work in specialized facilities or offices situated outside the
town. Similarly, multimedia programmers and other specialized professionals might need to
commute to access the resources or workplaces specific to their jobs.
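One way to estimate commuter numbers from the Occupation column is a simple keyword match. This is only a sketch: the label 'University Student' and the keyword list are assumptions drawn from the occupations named above, not values confirmed from the dataset, and the function name `flag_commuters` is hypothetical.

```python
import re

import pandas as pd

# Assumed occupation keywords, taken from the examples in the text.
COMMUTER_KEYWORDS = ["it technical support", "health physicist",
                     "brewing technologist", "multimedia programmer"]


def flag_commuters(occupation: pd.Series,
                   student_label: str = "University Student") -> pd.DataFrame:
    """Flag students and specialist occupations as likely commuters."""
    occ = occupation.fillna("").str.strip().str.lower()
    is_student = occ == student_label.lower()
    pattern = "|".join(re.escape(k) for k in COMMUTER_KEYWORDS)
    is_specialist = occ.str.contains(pattern, regex=True)
    return pd.DataFrame({
        "student": is_student,
        "specialist": is_specialist,
        "likely_commuter": is_student | is_specialist,
    })


# Hypothetical occupations, not the census data.
flags = flag_commuters(pd.Series([
    "University Student", "IT technical support officer",
    "Farmer", None, "Multimedia programmer",
]))
print(flags["likely_commuter"].sum())
```

A substring match of this kind would also catch entries such as retired specialists, so the resulting count is better read as an upper estimate.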
(a) What should be built on an unoccupied plot of land that the local
government wishes to develop?
To determine the most suitable development for the unoccupied plot of land, we'll consider
the following options:
1. High-Density Housing
o Justification: This is suitable if the population is expanding and there is a
need to accommodate more residents.
o Analysis
Population Growth: The birth rate (11.82 per 1,000) is slightly higher
than the death rate (11.70 per 1,000), indicating a stable population
with a slight increase.
Household Occupancy: There are significant instances of high
household occupancy (e.g., households with 8-9 individuals).
Overcrowding is a concern, suggesting a need for more housing units
to alleviate pressure on existing homes.
2. Low-Density Housing
o Justification: This should be considered if the population is affluent and
there's a demand for larger family homes.
o Analysis
3. Train Station
o Justification: If a significant number of commuters are present, a train station
can reduce road congestion.
o Analysis
4. Religious Building
o Justification: If there is a high demand for religious services not met by
existing facilities.
o Analysis
5. General Infrastructure
o Justification: If the town is expanding, general services need more
investment.
o Analysis
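The birth and death rates quoted under High-Density Housing imply only a very small natural increase. The quick check below assumes that the 8,291 census entries approximate the town's population and ignores migration; the function name `natural_change` is hypothetical.

```python
def natural_change(birth_rate_per_1000: float,
                   death_rate_per_1000: float,
                   population: int) -> tuple[float, float]:
    """Return (net rate per 1,000, net change in people per year)."""
    net_rate = birth_rate_per_1000 - death_rate_per_1000
    return net_rate, net_rate * population / 1000


# Rates from the text; population assumed equal to the number of entries.
net_rate, net_people = natural_change(11.82, 11.70, 8291)
print(f"Net rate: {net_rate:.2f} per 1,000; about {net_people:.1f} people per year")
```

Under these assumptions the natural increase is roughly one person per year, so any case for more housing rests mainly on household overcrowding rather than on births outpacing deaths.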
5. Bibliography
Clark, D. (2018). Life expectancy in the UK by gender 2018 | Statista. [online] Statista.
Available at: https://www.statista.com/statistics/281671/life-expectancy-united-kingdom-uk-by-gender/.
GOV.UK. (n.d.). Key Findings, Statistical Digest of Rural England. [online] Available at:
https://www.gov.uk/government/statistics/key-findings-statistical-digest-of-rural-england/key-findings-statistical-digest-of-rural-england.