Kaggle Survey Story

Uploaded by

Minh Học

Description

The Kaggle team set out to conduct an industry-wide survey that presents a truly
comprehensive view of the state of data science and machine learning. The survey
was live for one week in October, and after cleaning the data we finished with 23,859
responses, a 49% increase over last year!
There's a lot to explore here. The results include raw numbers about who is working
with data, what’s happening with machine learning in different industries, and the
best ways for new data scientists to break into the field. We've published the data in
as raw a format as possible without compromising anonymization, which makes it an
unusual example of a survey dataset.
Tell a data story about a subset of the data science community represented in this
survey, through a combination of both narrative text and data exploration. A “story”
could be defined any number of ways, and that’s deliberate. The challenge is to
deeply explore (through data) the impact, priorities, or concerns of a specific group of
data science and machine learning practitioners. That group can be defined in the
macro (for example: anyone who does most of their coding in Python) or the micro
(for example: female data science students studying machine learning in master's
programs). This is an opportunity to be creative and tell the story of a community you
identify with or are passionate about!

Survey Methodology
● This survey received 23,859 usable responses from 147 countries and
territories. If a country or territory had fewer than 50 respondents, we
placed them in a group named “Other” for anonymity.
● We excluded respondents who were flagged by our survey system as “Spam”.
● Most of our respondents were found through Kaggle channels, like
our email list, discussion forums and social media channels.
● The survey was live from October 22nd to October 29th. We allowed
respondents to complete the survey at any time during that window. The
median response time for those who participated in the survey was 15-20
minutes.
● Not every question was shown to every respondent. You can learn more
about the different segments we used in the schema.csv file.
● To protect the respondents’ identities, the answers to multiple-choice questions
are stored in a separate data file from the open-ended
responses. We do not provide a key to match up the multiple-choice and
free-form responses. Further, the free-form responses have been randomized
column-wise, so the responses that appear on the same row did not
necessarily come from the same survey-taker.
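
The practical effect of the column-wise randomization can be illustrated with a minimal sketch. It assumes each column is shuffled independently; the function name, the seed and the toy answers are hypothetical, and Kaggle's actual procedure is not described beyond the bullet above.

```python
import random

def shuffle_columns(rows, seed=0):
    """Return a copy of `rows` in which every column is shuffled independently.

    Hypothetical helper for illustration only; not Kaggle's actual code.
    """
    rng = random.Random(seed)
    # Transpose the row-oriented table into a list of columns.
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)  # each column gets its own permutation
    # Transpose back into rows.
    return [list(row) for row in zip(*columns)]

# Toy free-form answers: one row per (hypothetical) survey-taker.
rows = [
    ["a1", "b1", "c1"],
    ["a2", "b2", "c2"],
    ["a3", "b3", "c3"],
    ["a4", "b4", "c4"],
]
shuffled = shuffle_columns(rows)

# Every column still holds the same set of answers, but a row no longer
# corresponds to a single survey-taker.
for i in range(len(rows[0])):
    assert sorted(r[i] for r in shuffled) == sorted(r[i] for r in rows)
```

Under this assumption, per-question frequency counts on the free-form file remain valid, while any analysis that compares two free-form columns row by row is not meaningful.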

For the questions, see this notebook:


https://www.kaggle.com/paultimothymooney/2018-kaggle-machine-learning-data-science-survey
Package 1.
1. Create a table, where
a. rows: age groups
b. columns: the 5 countries with the most respondents
c. values: percentage of the respondents from that age group and from
that country
2. Create a Venn diagram showing the number of respondents using the
following programming languages on a regular basis: Python, R, SQL, Java
3. Create a Sankey diagram, which has 3 levels: education level, gender and job
profession (in this order). The bands are simply created from the frequency
values. You can use plotly for this task.
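
For the Sankey diagram in item 3, one possible way to turn frequency values into bands is sketched below. It counts how often each pair of adjacent levels occurs together and returns the node labels and link lists. The field names ("education", "gender", "job") and the toy records are hypothetical, not the survey's actual column names.

```python
from collections import Counter

def sankey_links(records, levels):
    """Build node labels and (source, target, value) link lists.

    `levels` is the ordered list of record keys, e.g. education -> gender -> job.
    A band's width is the number of records sharing both adjacent values.
    """
    labels = []   # node labels, in order of first appearance
    index = {}    # (level, value) -> node position in `labels`

    def node_id(level, value):
        key = (level, value)
        if key not in index:
            index[key] = len(labels)
            labels.append(value)
        return index[key]

    counts = Counter()
    for rec in records:
        # One link per pair of adjacent levels.
        for left, right in zip(levels, levels[1:]):
            counts[(node_id(left, rec[left]), node_id(right, rec[right]))] += 1

    source = [s for s, _ in counts]
    target = [t for _, t in counts]
    value = list(counts.values())
    return labels, source, target, value

# Toy records with hypothetical field names.
records = [
    {"education": "Master's", "gender": "Female", "job": "Data Scientist"},
    {"education": "Master's", "gender": "Male", "job": "Data Scientist"},
    {"education": "Bachelor's", "gender": "Male", "job": "Student"},
    {"education": "Master's", "gender": "Male", "job": "Data Scientist"},
]
labels, source, target, value = sankey_links(records, ["education", "gender", "job"])

# With plotly installed, these lists could be passed on like this:
# import plotly.graph_objects as go
# fig = go.Figure(go.Sankey(node=dict(label=labels),
#                           link=dict(source=source, target=target, value=value)))
# fig.show()
```

Keying nodes by (level, value) rather than by value alone keeps the three levels separate even if the same label were to appear in two of them.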

Package 2.
1. Create a heatmap-style matrix plot whose rows represent age groups and
whose columns represent countries (top 5 by number of respondents), with
cells colored by the percentage of respondents.
2. Create a population pyramid of the online platforms on which respondents
began or completed data science courses, grouped by whether the
respondent is younger than 30 years or not.
3. Create a map visualization showing how many respondents are located in
each country. You can use plotly for this task.
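
The age-group-by-country percentages behind the heatmap in Package 2, item 1 (and the table in Package 1, item 1) could be computed along the following lines. This is a sketch under a stated assumption: the brief does not fix the denominator, so each percentage here is taken within its country and every column sums to 100. The field names ("age", "country") and the toy records are hypothetical.

```python
from collections import Counter

def age_country_percentages(records, top_n=5):
    """Return (top countries, {age group: {country: percentage}}).

    Assumption: percentages are computed within each country, i.e. the
    denominator is that country's total number of respondents.
    """
    country_totals = Counter(r["country"] for r in records)
    top = [c for c, _ in country_totals.most_common(top_n)]

    # Count respondents per (age group, country) cell for the top countries.
    cell = Counter(
        (r["age"], r["country"]) for r in records if r["country"] in top
    )
    # Plain string sort; adequate for labels such as "18-21", "22-24".
    ages = sorted({age for age, _ in cell})

    table = {
        age: {c: 100.0 * cell[(age, c)] / country_totals[c] for c in top}
        for age in ages
    }
    return top, table

# Toy records with hypothetical field names and values.
records = (
    [{"age": "18-21", "country": "India"}] * 3
    + [{"age": "22-24", "country": "India"}]
    + [{"age": "22-24", "country": "USA"}] * 2
    + [{"age": "18-21", "country": "China"}]
)
top, table = age_country_percentages(records, top_n=2)

for age, row in table.items():
    print(age, row)
```

If the intended denominator is instead the total number of respondents, or the age group's total, only the division inside the dictionary comprehension needs to change.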
