24 Ultimate Data Science Projects To Boost Your Knowledge and Skills
This article was originally published on October 26, 2016 and updated with new
projects on May 30, 2018.
Introduction
Data science projects offer you a promising way to kick-start your career in this field.
Not only do you get to learn data science by applying it, you also get projects to
showcase on your CV! Nowadays, recruiters evaluate a candidate’s potential by
their work and don’t put a lot of emphasis on certifications. Telling them how much
you know counts for little if you have nothing to show them! That’s where
most people struggle and miss out.
You might have worked on several problems before, but if you can’t make your work
presentable and easy to explain, how on earth would someone know what you are
capable of? That’s where these projects will help you. Think of the time you’ll spend
on these projects as training sessions. The more time you spend practicing,
the better you’ll become!
We’ve made sure to give you a taste of a variety of problems from different
domains. We believe everyone must learn to work smartly with huge amounts of
data, so large datasets are included. We’ve also made sure all the datasets are
open and free to access.
Useful Information
To help you decide where to begin, we’ve divided this list into 3 levels, namely:
1. Beginner Level: This level comprises data sets which are fairly easy to
work with, and don’t require complex data science techniques. You can solve
them using basic regression or classification algorithms. Also, these data sets
have enough open tutorials to get you going. In this list, we have also
provided tutorials to help you get started. You can also check out AV’s
‘Introduction to Data Science’ course along with this!
2. Intermediate Level: This level comprises data sets which are more
challenging in nature. It consists of medium and large data sets which require some
serious pattern recognition skills. Also, feature engineering will make a
difference here. There is no limit on the use of ML techniques; everything
under the sun can be put to use.
3. Advanced Level: This level is best suited for people who understand
advanced topics like neural networks, deep learning, recommender systems,
etc. High-dimensional datasets are also featured here. Also, this is the time
to get creative. See the creativity the best data scientists bring to their work and
code.
Table of Contents
1. Beginner Level
o Iris Data
o Loan Prediction Data
o Bigmart Sales Data
o Boston Housing Data
o Time Series Analysis Data
o Wine Quality Data
o Turkiye Student Evaluation Data
o Heights and Weights Data
2. Intermediate Level
o Black Friday Data
o Human Activity Recognition Data
o Siam Competition Data
o Trip History Data
o Million Song Data
o Census Income Data
o Movie Lens Data
o Twitter Classification Data
3. Advanced Level
o Identify your Digits
o Urban Sound Classification
o Vox Celebrity Data
o ImageNet Data
o Chicago Crime Data
o Age Detection of Indian Actors Data
o Recommendation Engine Data
o VisualQA Data
Beginner Level
1. Iris Data Set
This is probably the most versatile, easy, and resourceful dataset in the pattern
recognition literature. Nothing could be simpler than the Iris dataset for learning
classification techniques. If you are totally new to data science, this is your starting line.
The data has only 150 rows and 4 columns.
Problem: Use classification and clustering techniques to deal with the data.
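To make the classification task concrete, here is a minimal sketch of a k-nearest-neighbours classifier in plain Python. It assumes the usual Iris layout (four measurements in cm plus a species label) and uses only six rows as a tiny illustrative subset, so treat it as a starting point rather than a full solution. The function name `knn_predict` and the choice of k = 3 are our own assumptions.

```python
from collections import Counter
from math import dist

# Six rows in the classic Iris layout:
# (sepal length, sepal width, petal length, petal width) in cm, plus species.
# This is a tiny illustrative subset, not the full 150-row dataset.
TRAIN = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]


def knn_predict(features, train=TRAIN, k=3):
    """Predict a species by majority vote among the k closest training rows."""
    # Sort training rows by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda row: dist(features, row[0]))[:k]
    # Count the labels of the k nearest rows; on a tie, the label that
    # appeared first (i.e. the closer neighbour) wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_predict((5.0, 3.4, 1.5, 0.2)))
    print(knn_predict((6.5, 3.0, 5.5, 1.8)))
```

On the full dataset you would hold out part of the 150 rows for testing and compare a few values of k before trusting the result.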
If you’re new to the world of data science, Analytics Vidhya has curated a
comprehensive course, ‘Introduction to Data Science’, aimed at beginners! The course
covers the basics of Python before moving on to statistics and, finally,
various modelling techniques.
Intermediate Level
1. Black Friday Dataset
This dataset comprises sales transactions captured at a retail store. It’s a classic
dataset for exploring and expanding your feature engineering skills and your day-to-day
understanding of multiple shopping experiences. This is a regression problem.
The dataset has 550,069 rows and 12 columns.
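As a hedged illustration of what feature engineering for a regression problem can look like, the sketch below builds a group-mean feature and scores it with RMSE. The rows are made-up toy values, not records from the dataset, and the field names `category` and `purchase` are hypothetical placeholders, as is the choice of RMSE as the metric.

```python
from collections import defaultdict
from math import sqrt

# Made-up toy transactions, NOT rows from the real dataset.
# The field names are hypothetical placeholders.
train = [
    {"category": "A", "purchase": 100.0},
    {"category": "A", "purchase": 140.0},
    {"category": "B", "purchase": 300.0},
    {"category": "B", "purchase": 340.0},
]
test = [
    {"category": "A", "purchase": 130.0},
    {"category": "B", "purchase": 310.0},
    {"category": "C", "purchase": 200.0},  # category unseen in training
]


def fit_group_means(rows, key, target):
    """Return (mean target per group, overall mean target)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[target]
        counts[row[key]] += 1
    means = {group: sums[group] / counts[group] for group in sums}
    global_mean = sum(sums.values()) / len(rows)
    return means, global_mean


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


means, global_mean = fit_group_means(train, "category", "purchase")
actual = [row["purchase"] for row in test]

# Baseline 1: always predict the overall training mean.
rmse_global = rmse(actual, [global_mean] * len(test))
# Baseline 2: predict the group mean, falling back to the overall mean
# for groups that never appeared in training.
rmse_group = rmse(actual, [means.get(row["category"], global_mean) for row in test])

print(f"global-mean RMSE: {rmse_global:.2f}")
print(f"group-mean RMSE:  {rmse_group:.2f}")
```

Simple baselines like these give you a number to beat before you move on to richer features and models.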
Problem: Identify the tweets which are hate tweets and which are not.
Problem: Predict the time taken to solve a problem given the current status of the
user.
8. VisualQA Dataset
VisualQA is a dataset containing open-ended questions about images. These
questions require an understanding of computer vision and language. There is an
automatic evaluation metric for this problem. The dataset has 265,016 images, 3
questions per image and 10 ground truth answers per question.
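The article does not spell out the evaluation metric. One commonly quoted form for VQA-style benchmarks gives an answer full credit when at least 3 of the 10 human annotators gave the same answer, and partial credit otherwise. The sketch below assumes that simplified rule; the function names and the lowercase/strip normalisation are our own illustrative choices, not the official evaluation code.

```python
def normalize(answer):
    """Very light normalisation: trim whitespace and lowercase."""
    return answer.strip().lower()


def vqa_accuracy(predicted, human_answers):
    """Score one prediction against the ground truth answers.

    Assumed rule: accuracy = min(number of matching human answers / 3, 1).
    """
    target = normalize(predicted)
    matches = sum(1 for answer in human_answers if normalize(answer) == target)
    return min(matches / 3.0, 1.0)


if __name__ == "__main__":
    # Hypothetical annotations for a single yes/no question.
    humans = ["yes", "Yes", "no", "no", "no", "no", "no", "no", "no", "no"]
    print(vqa_accuracy("no", humans))
    print(vqa_accuracy("yes", humans))
    print(vqa_accuracy("maybe", humans))
```

Averaging this score over all questions gives a single number for comparing models.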
End Notes
Out of the 24 datasets listed above, you should start by finding the one that matches
your skillset. For example, if you are a beginner in machine learning, avoid taking up
advanced level data sets from the get-go. Don’t bite off more than you can chew and
don’t feel overwhelmed by how much you still have to do. Instead, focus on making
step-wise progress.
Once you complete 2 to 3 projects, showcase them on your resume and your GitHub
profile (very important!). Lots of recruiters these days hire candidates by checking
their GitHub profiles. Your goal shouldn’t be to do all the projects, but to pick
a few based on the problem to be solved, the domain and the dataset size. If
you want to look at a complete project solution, take a look at this article.
Did you find this article useful? Have you already built any projects on these
datasets? Do share your experience, learnings and suggestions in the comments
section below.
Website: https://www.analyticsvidhya.com/blog/2018/05/24-ultimate-data-science-projects-to-boost-your-knowledge-and-skills/?utm_source=linkedin.com&utm_medium=social