Kaggle Survey Story
The Kaggle team set out to conduct an industry-wide survey that presents a truly
comprehensive view of the state of data science and machine learning. The survey
was live for one week in October, and after cleaning the data we finished with 23,859
responses, a 49% increase over last year!
There's a lot to explore here. The results include raw numbers about who is working
with data, what’s happening with machine learning in different industries, and the
best ways for new data scientists to break into the field. We've published the data in
as raw a format as possible without compromising anonymization, which makes it an
unusual example of a survey dataset.
Tell a data story about a subset of the data science community represented in this
survey, through a combination of narrative text and data exploration. A “story”
could be defined any number of ways, and that’s deliberate. The challenge is to
deeply explore (through data) the impact, priorities, or concerns of a specific group of
data science and machine learning practitioners. That group can be defined in the
macro (for example: anyone who does most of their coding in Python) or the micro
(for example: female data science students studying machine learning in master's
programs). This is an opportunity to be creative and tell the story of a community you
identify with or are passionate about!
Survey Methodology
● This survey received 23,859 usable responses from 147 countries and
territories. If a country or territory had fewer than 50 respondents, we
grouped it into a category named “Other” for anonymity.
● We excluded respondents who were flagged by our survey system as “Spam”.
● Most of our respondents were found through Kaggle channels, such as
our email list, discussion forums, and social media channels.
● The survey was live from October 22nd to October 29th. We allowed
respondents to complete the survey at any time during that window. The
median response time for those who participated in the survey was 15-20
minutes.
● Not every question was shown to every respondent. You can learn more
about the different segments we used in the schema.csv file.
● To protect the respondents’ identities, the answers to multiple-choice questions
are stored in a separate data file from the open-ended responses. We do not
provide a key to match up the multiple-choice and free-form responses.
Further, the free-form responses have been randomized column-wise, so the
responses that appear on the same row did not necessarily come from the
same survey-taker.
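The two anonymization steps described above (grouping small countries into “Other” and shuffling free-form columns independently) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Kaggle's actual pipeline: the function names `group_rare_countries` and `shuffle_columns`, the toy data, and the fixed random seed are all hypothetical.

```python
import random
from collections import Counter


def group_rare_countries(countries, min_count=50, other_label="Other"):
    """Replace countries with fewer than min_count respondents by other_label."""
    counts = Counter(countries)
    return [c if counts[c] >= min_count else other_label for c in countries]


def shuffle_columns(rows, seed=None):
    """Shuffle each column independently so values on a row no longer align."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


# Toy data (hypothetical): 60 respondents from one country, 3 from another.
countries = ["India"] * 60 + ["Iceland"] * 3
grouped = group_rare_countries(countries)
print(Counter(grouped))

# Toy free-form answers: one row per respondent, one column per question.
free_text = [["a1", "b1"], ["a2", "b2"], ["a3", "b3"]]
shuffled = shuffle_columns(free_text, seed=0)
print(shuffled)
```

One practical consequence for analysis: because each free-form column is shuffled on its own, per-column summaries remain valid, but any comparison across free-form columns on the same row is meaningless.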
Package 2.
1. Create a heatmap-style matrix plot in which rows represent age groups and
columns represent countries (the top 5 by number of respondents), with each
cell colored by the percentage of respondents.
2. Create a population pyramid of the online platforms on which the respondents
began or completed data science courses, grouped by whether the
respondent is younger than 30 years or not.
3. Create a map visualization showing how many respondents are located in
each country. You can use plotly for this task.
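As a starting point for the first task, the percentage matrix behind the heatmap could be computed as follows. This is only a sketch under assumptions: the task does not say what the percentages are relative to, so the code assumes each cell is the share of a country's respondents who fall in that age group. The function names, the `(age group, country)` record layout, and the toy data are hypothetical, and matplotlib is used here purely as one possible plotting choice.

```python
from collections import Counter


def age_country_percentages(records, top_n=5):
    """Return (age_groups, countries, matrix).

    Rows are age groups, columns are the top_n countries by respondent
    count, and each cell is the percentage of that country's respondents
    who fall in that age group (an assumed normalization).
    """
    country_counts = Counter(country for _, country in records)
    countries = [c for c, _ in country_counts.most_common(top_n)]
    # Labels such as "18-21" or "22-24" sort correctly as plain strings.
    age_groups = sorted({age for age, country in records if country in countries})
    pair_counts = Counter(records)
    matrix = [
        [100.0 * pair_counts[(age, c)] / country_counts[c] for c in countries]
        for age in age_groups
    ]
    return age_groups, countries, matrix


def plot_heatmap(age_groups, countries, matrix):
    """Draw the matrix as a heatmap; matplotlib is imported only when called."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(matrix, cmap="viridis", aspect="auto")
    ax.set_xticks(range(len(countries)))
    ax.set_xticklabels(countries)
    ax.set_yticks(range(len(age_groups)))
    ax.set_yticklabels(age_groups)
    fig.colorbar(image, ax=ax, label="% of country's respondents")
    return fig


# Toy (age group, country) pairs; hypothetical data, not survey results.
records = (
    [("18-21", "India")] * 3
    + [("22-24", "India")] * 2
    + [("22-24", "United States")] * 2
    + [("25-29", "United States")] * 2
)
age_groups, countries, matrix = age_country_percentages(records, top_n=2)
print(countries)
print(matrix)
```

With the real data, the records would be built from the age and country columns of the multiple-choice file, and `plot_heatmap(age_groups, countries, matrix)` would render the figure. Normalizing by the overall respondent count instead of per country is an equally defensible reading of the task and only changes the denominator.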