Capstone Project Final Report
1.1. Background
The coffee industry in Indonesia has considerable potential. As a tropical country,
Indonesia is well suited to coffee cultivation. Therefore, the cultivation and
management of Indonesian specialty coffee is a strategic step that must continue to be
developed. Coffee-drinking culture has spread to many countries, not only
Indonesia. Demand for coffee is very high, which makes the coffee business
more profitable. Coffee has become a business to be reckoned with
because the beverage is in such high demand.
Jakarta is one of the cities in Indonesia with a large number of coffee shops. From 2013 until the
end of 2018, coffee shops spread around every corner of the capital,
including areas near offices, schools, and campuses. This report is
targeted at a client interested in opening a coffee shop in Jakarta, Indonesia.
Next, I use the geocoder package, calling the geocoder.arcgis function in a single for loop
to retrieve the latitude and longitude of every location. Each result is appended to a
list, and the lists become new latitude and longitude columns for the districts. The
result is shown in the dataframe below.
To simplify later analysis, I compute the approximate distance from the supplier's
location to each district using the haversine formula. The supplier's latitude
and longitude are (-6.171009, 106.852772). Here's the dataframe after applying the haversine
formula to each district. This is the final pre-processed dataframe that will
be used in the next step.
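As a minimal sketch of the distance step, the haversine formula can be written with the
standard library alone. The supplier coordinates come from this report; the Earth radius
constant, the function name haversine_km, and the example district coordinates are my
own assumptions for illustration, not values from the notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant

# Supplier coordinates as given in the report.
SUPPLIER = (-6.171009, 106.852772)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Hypothetical district coordinates, for illustration only.
example_district = (-6.2, 106.87)
distance = haversine_km(*SUPPLIER, *example_district)
print(f"{distance:.2f} km")
```

Applying the same function to every row of the district dataframe would give the
distance column described above.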
To make sure that the dataframe can be plotted on a map, I use the folium package
to build a map from it. The coordinates of Jakarta can be found
using Nominatim. Here's what the map looks like with my current dataframe. Each label on
the map contains the full dataframe description for that district.
After confirming that the map renders correctly, the next step is to get the data for
nearby venues using the Foursquare API. By passing in the query needed to make the call, I
can get the nearby venues for each district. The table below shows the total number of venues
returned for some districts.
This dataset will be used to get the average frequency of each venue category within
each district in the exploratory data analysis section.
3. Methodology
3.2. Modeling
With the analysis done and the data ready to feed into the model, the next step
is to apply an unsupervised machine learning technique, namely clustering. The algorithm
I chose is K-Means. Based on the venue frequencies, I found that K-Means is
preferable to DBSCAN for this problem. From what I have researched, DBSCAN
does not work well with datasets that have large differences in density. If you
look at my notebook, specifically at the results of the code above, you will notice that
some districts have very low densities compared with other districts.
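To illustrate the modeling step, here is a small sketch that assumes scikit-learn's
KMeans and a made-up frequency matrix. The district names, the values, and the choice of
two clusters are hypothetical; the actual notebook works on the real venue frequencies
and produces four labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frequency matrix: one row per district, one column per
# venue category (values are made up for illustration).
districts = ["A", "B", "C", "D"]
frequencies = np.array([
    [0.50, 0.10, 0.00],
    [0.45, 0.15, 0.00],
    [0.00, 0.10, 0.60],
    [0.05, 0.05, 0.55],
])

# n_clusters=2 suits this toy data; the report uses four labels (0 to 3).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(frequencies)

for name, label in zip(districts, labels):
    print(name, label)
```

Fixing random_state keeps the label assignment reproducible between runs, which matters
when the labels are later described one by one.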
4. Results
To make the results easier for the client to read, I will plot a map that shows
each cluster and its description. Several steps are needed to produce this map.
4.1. Constructing a dataframe that shows the top common venues for each district
This step is necessary to get an overview of the most common venues within each district. In
this case I return the 7 most common venues. Later, in the result analysis section, this
dataframe is used alongside the earlier for loop that returns the average frequency for
each district. Here's how the dataframe looks.
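The two ideas in this step, average category frequency per district and the top N
categories per district, can be sketched with pandas as follows. The venue rows, the
column names, and the helper top_venues are hypothetical and only illustrate the
technique; TOP_N is set to 2 here because the toy data has only three categories, whereas
the report uses 7.

```python
import pandas as pd

# Hypothetical venue rows (district, category); values are made up.
venues = pd.DataFrame({
    "District": ["A", "A", "A", "B", "B", "B", "B"],
    "Category": ["Pizza Place", "Pizza Place", "Coffee Shop",
                 "Seafood Restaurant", "Coffee Shop",
                 "Seafood Restaurant", "Seafood Restaurant"],
})

# One-hot encode the category, then average per district: each cell becomes
# the share of that district's venues that fall in the category.
onehot = pd.get_dummies(venues["Category"], dtype=float)
onehot.insert(0, "District", venues["District"])
freq = onehot.groupby("District").mean()


def top_venues(row, n):
    """Return the n category names with the highest frequency in a row."""
    return row.sort_values(ascending=False).index[:n].tolist()


TOP_N = 2  # the report uses 7
top = {district: top_venues(row, TOP_N) for district, row in freq.iterrows()}
print(top)
```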
These are the first 5 rows of the list of districts with cluster label 1. There are 38 districts
in total in this cluster.
You will notice that the districts in cluster 2 (CIRACAS, MATRAMAN, PASAR MINGGU, PASAR REBO)
tend to be grouped by one of their most recurring venues, which is the pizza place. Although
some districts in cluster 1 also have pizza places, the frequency of their second
most recurring venue (or perhaps the first) might differ. On the other hand, the CILINCING and
TANJUNG PRIOK districts are also in a different cluster: their first and second venue
frequencies are quite distinct from those of the other clusters. That is why, in my opinion,
they were clustered separately. If I were to give a description for each label, it would be:
• Label 0 : Districts with a moderate level of competition, with Asian and donut shops as
the main competitors
• Label 1 : Districts with a moderate to high level of competition, with various unique
venues as the main competitors
• Label 2 : Districts with a low to moderate level of competition, with pizza places as the
main competitors
• Label 3 : Districts with a low level of competition, with seafood restaurants as the main
competitors
The next step is to compare the distance from the supplier for the MATRAMAN and CILINCING
districts, as distance is the client's second concern. Here's the comparison of distance for
the two districts.
From the results above, we have a clear winner, which is MATRAMAN district. After this, I
move on to the client's third concern, which is a location with an adequate population. By
iterating over the very first dataframe again, I can obtain the population of each
neighborhood in MATRAMAN district. This time I group the data by neighborhood and sum the
population. Here's how the data looks.
6. Conclusion
In this project, I analyzed the frequency of venues within each district in Jakarta and used the
K-Means algorithm to cluster those districts. Clustering the districts and plotting them on a
cluster map helps the client gain a better understanding of the market
competition within each district.
Based on the results, I recommend that the client open a specialty coffee shop in the UTAN KAYU
SELATAN neighborhood, which is in MATRAMAN district. The reasons are:
7. Future direction
From this project, there are some improvements that can be made to gain a better model and
analysis:
• The analysis above would give different results if you used the Google Maps API instead.
Personally, I think Google Maps has a more comprehensive data set for Indonesia than
Foursquare, but it is too expensive for a one-time project
like this.
• If you use the Google Maps API and find that the results have few differences in
density, and you want more accurate results (to see whether there are clusters
within clusters), DBSCAN might be the preferred choice.
• The elbow method can also be used to find the optimum K for K-Means. This may produce
a slightly different result, but it is worth trying. You can also set the number of
initializations (n_init) in the KMeans function; the default is 10. If you want to
experiment with the code, you can tweak this variable alongside the random state.
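As a rough sketch of the elbow method mentioned in the last point, the loop below fits
K-Means for several values of K and records the inertia (the within-cluster sum of
squared distances). It assumes scikit-learn's KMeans and uses synthetic data as a
stand-in for the district frequency matrix, so the numbers it prints say nothing about
Jakarta.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the district frequency matrix: three loose groups
# of points (made-up data, not the report's).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

ks = range(1, 7)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 2))
```

The K at which the inertia stops dropping sharply is the "elbow". Here n_init and
random_state are the two arguments to tweak when experimenting.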