Capstone Project Final Report

This report analyzes potential locations for a specialty coffee shop in Jakarta, Indonesia. The client wants a location with less competition, close to their central Jakarta supplier, and in an area with adequate population. The report clusters Jakarta districts using K-Means based on venue data. It finds 4 clusters and maps them, then analyzes each cluster's common venues. Recommended locations fulfill the client's criteria of proximity, population, and cluster type with less competition.

Uploaded by

Hajid Naufal
CAPSTONE PROJECT

THE BATTLE OF NEIGHBORHOOD


PROJECT REPORT
Hajid Naufal Atthousi, 2020
1. Introduction

1.1. Background
The coffee industry in Indonesia has considerable potential. As a tropical country,
Indonesia is well suited to coffee cultivation, so cultivating and managing Indonesian
specialty coffee is a strategic step that should continue to be developed. Coffee-drinking
culture has spread to many countries, not only Indonesia. Demand for coffee is very high,
which makes the coffee business more profitable, and it has become a business worth
taking seriously.

Jakarta is one of the Indonesian cities with a large number of coffee shops. From 2013 until
the end of 2018, coffee shops spread to every corner of the capital, including around
offices, schools, and campuses. This report is aimed at a client interested in opening a
coffee shop in Jakarta, Indonesia.

1.2. Business Problem


The client is interested in opening a specialty coffee shop in Jakarta and is optimistic about
his homegrown specialty coffee, but he has trouble deciding on a location. His first concern
is to find a place with less competition, so that the business can grow at a stable pace
without fighting over customers, whether against other coffee shops or against other kinds
of cafes and restaurants. His second concern is that the place should not be far from his
supplier in central Jakarta, to minimize the time spent collecting supplies. His third concern
is that the place should have an adequate population. So, which place should I recommend
for his coffee shop?

1.3. Target interest


A private client who wants insight into the best location to build a coffee shop in Jakarta,
given his concerns.

2. Data Acquisition and pre-processing

2.1. Data choice


To solve the problem, I need precise data that gives the population of each district. The
data should also list the neighborhoods within each district, since the last section uses it
to look at each neighborhood's distance from central Jakarta (the supplier's place) and its
population.
So, I will use the following data:
• A dataset from Jakarta Open Data. I chose this dataset because it is the most up to date
on the site. It consists of:
- The names of districts and neighborhoods
- The population distribution by gender (male and female)
- The population distribution by age (from 0 to above 75 years old, in bands such as
0-4)
- The cities, districts and neighborhoods that those population figures belong to
• Latitude and longitude from the geopy.geocoders package, attached to each record
• A list of venues retrieved in real time from the Foursquare API

2.2. Data acquisition, cleaning and pre-processing


For the dataset from Jakarta Open Data, I sort the data by district. Before summing the
population into a new column, I drop the 0-4 age band, since that group is outside the
target market (in case growth hacking is needed). I do not drop the band above 75 because,
in my personal experience, some people of that age in Indonesia do still drink coffee. I also
rename the columns so the client can understand what the dataframe means. After that, I
group the dataframe by district, applying a join function to the neighborhood names and
summing the population for each district. This is how the first 5 rows of the dataframe
look after these steps:
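
The grouping step described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the column names (district, neighborhood, age_0_4 and so on) and the sample values are hypothetical, not the actual headers or figures of the Jakarta Open Data file.

```python
import pandas as pd

# Hypothetical sample rows; column names and values are illustrative only,
# not taken from the Jakarta Open Data file.
raw = pd.DataFrame({
    "district": ["A", "A", "B"],
    "neighborhood": ["A1", "A2", "B1"],
    "age_0_4": [100, 200, 50],
    "age_5_9": [300, 400, 150],
    "age_10_14": [500, 600, 250],
})

# Leave out the 0-4 age band, then sum the remaining bands per row.
age_cols = [c for c in raw.columns if c.startswith("age_") and c != "age_0_4"]
raw["population"] = raw[age_cols].sum(axis=1)

# One row per district: neighborhood names joined, populations summed.
by_district = (
    raw.groupby("district", as_index=False)
       .agg(neighborhoods=("neighborhood", ", ".join),
            population=("population", "sum"))
)
print(by_district)
```

Keeping the neighborhood names as a joined string is one possible choice; a list per district would work just as well if the names need to be processed again later.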

Next, I use the geocoder package and call the geocoder.arcgis function in a single for loop
to retrieve the latitude and longitude of every district, append the results to a list, and
add them to the dataframe as new latitude and longitude columns. The result is shown in
the dataframe below.

To make the later analysis easier, I compute the approximate distance from the supplier's
location to each district using the haversine formula. The supplier's latitude and longitude
are (-6.171009, 106.852772). Here's the dataframe after applying the haversine formula to
each district. This is the final pre-processed dataframe that is used in the next steps.
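
For reference, the haversine distance can be computed with the standard library alone. The sketch below is not the notebook's code: the function name haversine_km, the Earth radius constant and the example district coordinates are assumptions made for illustration; only the supplier coordinates come from the report.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant

SUPPLIER = (-6.171009, 106.852772)  # supplier coordinates quoted in the report


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Hypothetical district coordinates, for illustration only.
example_district = (-6.20, 106.87)
distance = haversine_km(*SUPPLIER, *example_district)
print(f"{distance:.2f} km")
```

The result is a straight-line approximation, so actual travel distance by road will be longer.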

To make sure that my dataframe can be plotted on a map, I use the folium package to build
a map from the current dataframe. The coordinates of Jakarta can be found by using
Nominatim. Here's what the map looks like with my current dataframe. Each label on the
map carries the full dataframe description for its district.

After confirming that the map looks fine, the next step is to get the nearby venues using
the Foursquare API. By passing in the query needed to make the call, I can get the nearby
venues for each district. The table below shows the total number of venues returned for
some districts.
This dataset is used to get the average frequency of each venue category within each
district in the exploratory data analysis section.

3. Methodology

3.1. Exploratory data analysis

3.1.1. One hot encoding for venues dataset


The Jakarta venues dataframe needs to be one-hot encoded. The encoded data is then
averaged with the mean function to get the frequency of each venue category.
The pandas get_dummies function is used to produce the one-hot encoding. This is
how the dataset looks after one-hot encoding is applied.

3.1.2. Getting average frequency for each venue category


The one-hot encoded dataset is averaged to get the frequency of each venue
category within each district. This average frequency dataset is used later in the
modeling phase. Here's how the dataset looks.
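
The two steps above (one-hot encoding, then averaging per district) can be sketched as below. The venues table is a hypothetical stand-in: the district names, the category column name and the rows are assumptions, not output from Foursquare.

```python
import pandas as pd

# Hypothetical venues table; names are illustrative, not Foursquare output.
venues = pd.DataFrame({
    "district": ["A", "A", "A", "B", "B"],
    "category": ["Coffee Shop", "Pizza Place", "Coffee Shop",
                 "Seafood Restaurant", "Pizza Place"],
})

# One-hot encode the category column: one indicator column per category.
onehot = pd.get_dummies(venues["category"], dtype=float)
onehot.insert(0, "district", venues["district"])

# Averaging the indicators per district gives each category's share of venues.
freq = onehot.groupby("district").mean().reset_index()
print(freq)
```

Because every venue contributes exactly one indicator, the frequencies in each district row add up to 1.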
3.1.3. Checking top 5 venues for general overview
To get a general overview of the frequencies, a for loop over the districts can be
used. This step supports the later analysis, once K-Means has finished generating
its result, and improves my understanding of why each cluster ends up labeled the
way it is.
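
A loop of this kind might look like the sketch below. The frequency table and the helper name top_venues are hypothetical, and n is set to 2 only because the sample has four categories; the report itself looks at the top 5.

```python
import pandas as pd

# Hypothetical frequency table (rows: districts, columns: venue categories).
freq = pd.DataFrame(
    {"Coffee Shop": [0.4, 0.0], "Pizza Place": [0.1, 0.5],
     "Seafood Restaurant": [0.0, 0.3], "Donut Shop": [0.5, 0.2]},
    index=["A", "B"],
)


def top_venues(freq_table, n=5):
    """Return {district: [(category, frequency), ...]} for the n largest frequencies."""
    return {
        district: list(row.nlargest(n).items())
        for district, row in freq_table.iterrows()
    }


# Print a short overview per district.
for district, ranked in top_venues(freq, n=2).items():
    print(district)
    for category, value in ranked:
        print(f"  {category}: {value:.2f}")
```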

3.2. Modeling
Once the analysis is done and the data is ready for the model, the next step is to apply an
unsupervised machine learning technique, namely clustering. The algorithm I chose is
K-Means. Based on the venue frequencies, I found K-Means preferable to DBSCAN for this
problem. From what I have researched, DBSCAN does not work well on datasets with large
differences in density. If you look at my notebook, specifically at the results of the code
above, you will notice that some districts have very low densities compared with other
districts.

3.2.1. Finding best K by using silhouette method


Before running the K-Means model, I first search for the best K. This can be done
with either the elbow method or the silhouette method. For this problem, I chose
the silhouette method. The graphs below show the silhouette results for this
problem. They indicate that the best K is 4.
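
As an illustration of the silhouette method, the sketch below scores several values of K with scikit-learn and keeps the one with the highest average silhouette. The data is synthetic (four well-separated groups generated with NumPy), standing in for the district frequency matrix; the range of K values and the random seeds are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the district-by-category frequency matrix:
# four well-separated groups, so the number of groups is known in advance.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]], dtype=float)
X = np.vstack([c + 0.2 * rng.standard_normal((15, 2)) for c in centers])

# Fit K-Means for each candidate K and record the mean silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The best K is the one with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real venue data the scores are usually much closer together than on this synthetic example, so it is worth inspecting the whole curve rather than only the maximum.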

3.2.2. K-Means algorithm for clustering


With the best K in hand, I pass it to the K-Means algorithm provided by scikit-learn's cluster
module. The dataset fitted by the algorithm is the average frequency dataset for the districts
of Jakarta.
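
A minimal sketch of the fitting step, assuming a frequency table with a district column followed by one column per venue category. The districts and values are hypothetical, and K is set to 2 only because the sample is tiny; the report uses K = 4.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical frequency table; districts and values are illustrative only.
freq = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "Coffee Shop": [0.8, 0.7, 0.1, 0.0],
    "Pizza Place": [0.2, 0.3, 0.9, 1.0],
})

k = 2  # the report uses k = 4; 2 suits this tiny example

# Fit on the numeric frequency columns only.
features = freq.drop(columns="district")
model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)

# Attach the cluster label of each district to the table.
clustered = freq.assign(cluster=model.labels_)
print(clustered)
```

Fixing random_state keeps the grouping reproducible between runs, although the numeric label assigned to each cluster is arbitrary.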

4. Results
To make the result easier for the client to read, I plot a map that shows each cluster and
its description. Several steps are needed to build this map.

4.1. Constructing dataframe that shows top common venues for each district
This step is needed to get an idea of the most common venues within each district. In this
case I return the 7 most common venues. Later, in the result analysis section, this
dataframe is used alongside the earlier for loop that returns the average frequencies for
each district. Here's how the dataframe looks.
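
One way to build such a dataframe is sketched below. The helper most_common, the column names common_venue_1, common_venue_2 and so on, and the sample frequencies are all assumptions for illustration; n is 3 here only because the sample has four categories, where the report uses 7.

```python
import pandas as pd

# Hypothetical frequency table; districts and values are illustrative only.
freq = pd.DataFrame({
    "district": ["A", "B"],
    "Coffee Shop": [0.4, 0.0],
    "Pizza Place": [0.1, 0.5],
    "Donut Shop": [0.5, 0.2],
    "Seafood Restaurant": [0.0, 0.3],
})


def most_common(freq_table, n):
    """One row per district with its n most frequent categories as columns."""
    rows = []
    for _, row in freq_table.iterrows():
        # Rank the categories of this district from most to least frequent.
        ranked = row.drop("district").astype(float).sort_values(ascending=False)
        entry = {"district": row["district"]}
        for rank, category in enumerate(ranked.index[:n], start=1):
            entry[f"common_venue_{rank}"] = category
        rows.append(entry)
    return pd.DataFrame(rows)


common = most_common(freq, n=3)
print(common)
```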

4.2. Appending cluster labels to the previous dataframe


The cluster labels generated by K-Means are appended to the previous dataframe. This
column is used to build the map and, later on, for the cluster analysis. This is how the first
rows of the dataframe look.
4.3. Generate cluster map for visualization
This map visualizes the cluster of each district in the dataframe, together with its
description. It helps the client to easily understand how the clusters are spread across
Jakarta.
5. Analysis and Discussion
In this section, I analyze the districts within each cluster. After the analysis, I pick some
suitable places and compare their distance and population in order to address the client's
second and third concerns.

5.1. Cluster analysis


I can look up the districts within each cluster in the last dataframe to see the result.
Following the silhouette method, there are 4 clusters (labels 0, 1, 2, 3), none of them empty.
Here are the districts in cluster label 0. (Please note that I only show the district column
and its top venues.)

These are the first 5 rows of the list of districts in cluster label 1. There are 38 districts
in total in this cluster.

This is the list of districts in cluster label 2.

This is the district in cluster label 3.



Readers may ask why so many districts fall into cluster label 1. The answer lies in the
frequencies. I will give the answer alongside the clusters that I pick for further analysis.
In this case, I pick cluster labels 0, 2 and 3, since those clusters appear to have less
competition if my client wants to open a specialty coffee shop.

5.2. Frequency check


I mentioned earlier that the frequencies show why the clusters are labeled the way they
are. The code below checks the frequencies within each district. This time I pick clusters
0, 2 and 3 so that the reader gets a general idea of what I mean. I also provide
screenshots of the results.

You will notice that the districts in cluster 2 (CIRACAS, MATRAMAN, PASAR MINGGU, PASAR
REBO) were grouped around one of their most recurring venues, the pizza place. Some
districts in cluster 1 also have pizza places, but their frequencies may differ in the second
most recurring venue (or perhaps the first). The CILINCING and TANJUNG PRIOK districts,
on the other hand, ended up in different clusters: their first and second venue frequencies
are quite distinct from those of the other clusters. That is why, in my opinion, they sit in
clusters of their own. If I were to describe each label, it would be:
• Label 0 : Districts with moderate competition, with Asian restaurants and donut shops
as the main competitors
• Label 1 : Districts with moderate to high competition, with a variety of unique
venues as the main competitors
• Label 2 : Districts with low to moderate competition, with pizza places as the main
competitors
• Label 3 : Districts with low competition, with seafood restaurants as the main
competitors

5.3. Picking suitable place


Based on the client's first concern, I pick the places with less competition, that is, the
places with a lower frequency of cafes, restaurants and other types of venues. From the
results above, the MATRAMAN district (cluster 2) and the CILINCING district (cluster 3)
have less competition. The other districts in cluster 2 are dropped, since the map shows
that MATRAMAN is closer to the supplier's place than the other districts in the same
cluster.

The next step is to compare the distance from the supplier for the MATRAMAN and
CILINCING districts, as distance is the client's second concern. Here's the comparison
for the two districts.

From the results above, there is a clear winner: the MATRAMAN district. Next, I move on
to the client's third concern, which is a place with an adequate population. By going back
to the very first dataframe, I can obtain the population of each neighborhood in the
MATRAMAN district. This time I group the data by neighborhood and sum the population.
Here's how the data looks.

5.4. Plot it into graph


To make the result easier for the client to understand, I use a bar graph to plot the
figures above so that the comparison can be read visually. The graph is shown below.
Hopefully, the client now has insight into the most suitable location to build a
specialty coffee shop.

6. Conclusion
In this project, I analyzed the frequency of venues within each district in Jakarta and used the
K-Means algorithm to cluster those districts. The algorithm is useful for clustering and for
plotting a cluster map that helps the client better understand the market competition within
each district.
Based on the results, I recommend that the client open a specialty coffee shop in the UTAN KAYU
SELATAN neighborhood, which is in the MATRAMAN district. The reasons are:

• The MATRAMAN district has less competition than other districts.
• The MATRAMAN district is closer to the supplier's place than the other districts that also
have less competition.
• The UTAN KAYU SELATAN neighborhood in the MATRAMAN district is the recommended place
to open the specialty coffee shop because its population is the highest among the
neighborhoods in the MATRAMAN district.

7. Future direction
Some improvements could be made to this project to obtain a better model and
analysis:
• The analysis above will give different results if the Google Maps API is used instead.
Personally, I think Google Maps has a more comprehensive dataset for Indonesia than
Foursquare, but the price is too high for a one-time project like this.
• If you use the Google Maps API, find that the results show only small differences in
density, and want more accurate results (to see whether there are clusters within
clusters), DBSCAN might be the preferred choice.
• The elbow method can also be used to find the optimum K for K-Means. It may produce
a slightly different result, but it is worth trying. You can also set the number of
initializations in the KMeans function (n_init in scikit-learn; the default was 10 when
this report was written). If you want to experiment with the code, you can tweak this
parameter alongside the random state.
