Cracking the Data Science Interview: Unlock insider tips from industry experts to master the data science field
Cracking the Data Science Interview - Leondra R. Gonzalez
Cracking the Data Science Interview
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Niranjan Naikwadi
Publishing Product Manager: Nitin Nainani
Senior Editor: Hayden Edwards
Technical Editor: Simran Haresh Udasi
Copy Editor: Safis Editing
Project Coordinator: Aishwarya Mohan
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Prashant Ghare
Marketing Coordinators: Vinishka Kalra
First published: March 2024
Production reference: 1160224
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB
ISBN 978-1-80512-050-6
www.packtpub.com
Foreword
The data science landscape is ever-evolving and has been that way since its conception. Though it is a rewarding field with many opportunities, navigating it can be a challenge, especially when you’re just getting started.
During my career, I have found that various companies can interpret data science differently depending on their business needs or understanding of data science. When I first began my data science journey in 2015, I was employed as a health data analyst with a start-up. It was there that I was exposed to data science, as my role was not purely data analytics or data science, but a mixture somewhere in between. I wanted to continue learning and advancing, but I did not know where to focus my energy to gain the information needed to thrive in this field. So, I curated a list of lessons I needed to learn in order to be competent enough to enter and advance in the field. I learned Python, data science with Python, R programming, linear algebra, and calculus, and as time went on, it became more and more daunting, the list of lessons becoming even longer than what was required for a graduate degree. Unfortunately, even after all of my hard work, during interviews, I found there were still concepts that I was unaware of. This has been the issue that I, as well as others, have noted with this field – there is so much information, but it can be unclear where to begin and what information is necessary to know.
On top of this, the data science interview is universally dreaded and challenging for various reasons that I have already alluded to. For instance, candidates are usually unsure of what that particular company considers data science. Plus, take-home assignments can take hours to complete – and once that time has been invested in completing the assignment, the company may choose to not offer feedback or, even worse, disappear completely when they’ve decided they aren’t interested. After experiencing this devastating outcome more than once, I became highly selective in what companies I chose to do a take-home assignment for. Many companies had a habit of immediately asking candidates to complete a take-home assignment before an interview, which I have learned rarely works in the candidate’s favor.
This book will address and outline the concepts that are necessary to begin or progress in a data science role. Because this field is ever-evolving, our understanding of these concepts will continue to evolve as well. However, this book can be used as a reference by those who are experienced in the field, or by those who are in data science-adjacent roles and want to keep their knowledge current. This book includes the information candidates need to be successful during a data science interview, and it removes some of the guesswork about what companies are expecting.
It is widely accepted that data science candidates should have an online portfolio to showcase their talent and application of knowledge. For this reason, there is information on how to build a portfolio and create a resume that will get you noticed. Salary and benefits negotiation is also outlined to streamline the process for you; a process that many of us had to learn completely uninformed in the past is now shared for the benefit of others.
We are certain that you will find this book helpful in your data science journey. Cheers!
Angela Baltes, PhD
Data Scientist, UnitedHealth Group
Contributors
About the authors
Leondra R. Gonzalez is a senior data and applied scientist at Microsoft with a decade of experience in data science, analytics, and corporate strategy. In addition to her work as a data scientist, Leondra has led teams in the entertainment, media, and advertising space to produce advanced e-commerce models for top brands, including NBC Peacock, First Aid Beauty, Procter & Gamble, HBO Max, Toyota, Whirlpool, and Tubi.
Academically, Leondra graduated from Carnegie Mellon University’s Heinz College of Information Systems Management with a master’s in entertainment industry management, with a focus on business analytics; Quantic School of Business and Technology with an MBA, including a specialization in statistics; and Otterbein University with a bachelor’s in music and business. Leondra is currently pursuing a PhD in information technology with a specialization in artificial intelligence at the University of the Cumberlands, and she has researched deep learning architectures as a PhD computer science apprentice at Google.
To my loving husband, Chris, my parents, my sister, and my unborn son who kicked my bump every day while writing this book.
Aaren Stubberfield is a senior data scientist for Microsoft’s digital advertising business and the author of three popular courses on DataCamp. He graduated with an MS in predictive analytics and has over 10 years of experience in various data science and analytical roles, focused on finding insights for business-related questions.
With his experience, he has led numerous teams of data scientists and has been instrumental in the successful completion of many projects. Aaren’s technical skills include the use of AI, such as LLMs, as well as Python and various other tools necessary for the execution of data science projects.
I want to thank the people who have been close to me and supported me, especially my wife, Pam, and my family.
About the reviewer
Vishal Kumar, a seasoned data scientist, has over seven years of experience with a premium credit card company, where he has made indelible contributions to the realms of AI and ML. He has a master’s degree in statistics from Delhi University.
Throughout his career, he has garnered a plethora of accolades, stemming from his adeptness in constructing cutting-edge decision science tools that have steered various organizations’ success. His commitment to continuous learning is evidenced by his embrace of new technologies, such as generative AI, to stay at the forefront of the ever-evolving data science landscape.
Beyond his professional pursuits, his creativity extends into his personal life, as he likes to paint and play ukulele.
Table of Contents
Preface
Part 1: Breaking into the Data Science Field
1
Exploring Today’s Modern Data Science Landscape
What is data science?
Exploring the data science process
Data collection
Data exploration
Data modeling
Model evaluation
Model deployment and monitoring
Dissecting the flavors of data science
Data engineer
Dashboarding and visual specialist
ML specialist
Domain expert
Reviewing career paths in data science
The traditionalist
Domain expert
Off-the-beaten path-er
Tackling the experience bottleneck
Academic experience
Work experience
Understanding expected skills and competencies
Hard (technical) skills
Soft (communication) skills
Exploring the evolution of data science
New models
New environments
New computing
New applications
Summary
References
2
Finding a Job in Data Science
Searching for your first data science job
Preparing for the road ahead
Finding job boards
Beginning to build a standout portfolio
Applying for jobs
Constructing the Golden Resume
The perfect resume myth
Understanding automated resume screening
Crafting an effective resume
Formatting and organization
Using the correct terminology
Prepping for landing the interview
Moore’s Law
Research, research, research
Branding
References
Part 2: Manipulating and Managing Data
3
Programming with Python
Using variables, data types, and data structures
Answers
Indexing in Python
Using string operations
Initializing a string
String indexing
Answers
Answers
Using Python control statements, loops, and list comprehensions
Conditional statements such as if, elif, and else
Loop statements such as for and while
List comprehension
Using user-defined functions
Breaking down the user-defined function syntax
Doing stuff with user-defined functions
Getting familiar with lambda functions
Creating good functions
Answers
Handling files in Python
Opening files with pandas
Answers
Wrangling data with pandas
Handling missing data
Selecting data
Sorting data
Merging data
Aggregation with groupby()
Summary
References
4
Visualizing Data and Data Storytelling
Understanding data visualization
Bar charts
Line charts
Scatter plots
Histograms
Density plots
Quantile-quantile plots (Q-Q plots)
Box plots
Pie charts
Surveying tools of the trade
Power BI
Tableau
Shiny
ggplot2 (R)
Matplotlib (Python)
Seaborn (Python)
Developing dashboards, reports, and KPIs
Developing charts and graphs
Bar chart – Matplotlib
Bar chart – Seaborn
Scatter plot – Matplotlib
Scatter plot – Seaborn
Histogram plot – Matplotlib
Histogram plot – Seaborn
Applying scenario-based storytelling
Summary
5
Querying Databases with SQL
Introducing relational databases
Mastering SQL basics
The SELECT statement
The WHERE clause
The ORDER BY clause
Aggregating data with GROUP BY and HAVING
The GROUP BY statement
The HAVING clause
Creating fields with CASE WHEN
Analyzing subqueries and CTEs
Subqueries in the SELECT clause
Subqueries in the FROM clause
Subqueries in the WHERE clause
Subqueries in the HAVING clause
Distinguishing common table expressions (CTEs) from subqueries
Merging tables with joins
Inner joins
Left and right join
Full outer join
Multi-table joins
Calculating window functions
OVER, ORDER BY, PARTITION, and SET
LAG and LEAD
ROW_NUMBER
RANK and DENSE_RANK
Using date functions
Approaching complex queries
Process and answer
Summary
6
Scripting with Shell and Bash Commands in Linux
Introducing operating systems
Navigating system directories
Introducing basic command-line prompts
Understanding directory types
Filing and directory manipulation
Scripting with Bash
Introducing control statements
Creating functions
Processing data and pipelines
Using pipes
Using cron
Summary
7
Using Git for Version Control
Introducing repositories (repos)
Creating a repo
Cloning an existing remote repository
Creating a local repository from scratch
Linking local and remote repositories
Detailing the Git workflow for data scientists
Using Git tags for data science
Understanding Git tags
Using tagging as a data scientist
Understanding common operations
Summary
Part 3: Exploring Artificial Intelligence
8
Mining Data with Probability and Statistics
Describing data with descriptive statistics
Measuring central tendency
Measuring variability
Introducing populations and samples
Defining populations and samples
Representing samples
Reducing the sampling error
Understanding the Central Limit Theorem (CLT)
The CLT
Demonstrating the assumption of normality
Shaping data with sampling distributions
Probability distributions
Uniform distribution
Normal and student’s t-distributions
The binomial distribution
The Poisson distribution
Exponential distribution
Geometric distribution
The Weibull distribution
Testing hypotheses
Understanding one-sample t-tests
Understanding two-sample t-tests
Understanding paired sample t-tests
Understanding ANOVA and MANOVA
Chi-squared test
A/B tests
Understanding Type I and Type II errors
Type I error (false positive)
Type II error (false negative)
Striking a balance
Summary
References
9
Understanding Feature Engineering and Preparing Data for Modeling
Understanding feature engineering
Avoiding data leakage
Handling missing data
Scaling data
Applying data transformations
Introducing data transformations
Logarithm transformations
Power transformations
Box-Cox transformations
Exponential transformations
Engineering categorical data and other features
One-hot encoding
Label encoding
Target encoding
Calculated fields
Performing feature selection
Types of feature selection
Recursive feature elimination
L1 regularization
Tree-based feature selection
The variance inflation factor
Working with imbalanced data
Understanding imbalanced data
Treating imbalanced data
Reducing the dimensionality
Principal component analysis
Singular value decomposition
t-SNE
Autoencoders
Summary
10
Mastering Machine Learning Concepts
Introducing the machine learning workflow
Problem statement
Model selection
Model tuning
Model predictions
Getting started with supervised machine learning
Regression versus classification
Linear regression – regression
Logistic regression
k-nearest neighbors (k-NN)
Random forest
Extreme Gradient Boosting (XGBoost)
Getting started with unsupervised machine learning
K-means
Density-based spatial clustering of applications with noise (DBSCAN)
Other clustering algorithms
Evaluating clusters
Summarizing other notable machine learning models
Understanding the bias-variance trade-off
Tuning with hyperparameters
Grid search
Random search
Bayesian optimization
Summary
11
Building Networks with Deep Learning
Introducing neural networks and deep learning
Weighing in on weights and biases
Introduction to weights
Introduction to biases
Activating neurons with activation functions
Common activation functions
Choosing the right activation function
Unraveling backpropagation
Gradient descent
What is backpropagation?
Loss functions
Gradient descent steps
The vanishing gradient problem
Using optimizers
Optimization algorithms
Network tuning
Understanding embeddings
Word embeddings
Training embeddings
Listing common network architectures
Common networks
Tools and packages
Introducing GenAI and LLMs
Unveiling language models
Transformers and self-attention
Transfer Learning
GPT in action
Summary
12
Implementing Machine Learning Solutions with MLOps
Introducing MLOps
A model pipeline overview
Understanding data ingestion
Learning the basics of data storage
Reviewing model development
Packaging for model deployment
Identifying requirements
Virtual environments
Tools and approaches for environment management
Deploying a model with containers
Using Docker
Validating and monitoring the model
Validating the model deployment
Model monitoring
Thinking about governance
Using Azure ML for MLOps
Summary
Part 4: Getting the Job
13
Mastering the Interview Rounds
Mastering early interactions with the recruiter
Mastering the different interview stages
The hiring manager stage
The technical interview
Coding questions, step by step
The panel stage
Summary
References
14
Negotiating Compensation
Understanding the compensation landscape
Negotiating the offer
Negotiation considerations
Responding to the offer
Maximum negotiable compensation and situational value
Summary
Final words
Index
Other Books You May Enjoy
Preface
In today’s dynamic technological landscape, the demand for skilled professionals in artificial intelligence (AI) and data science roles has surged, and the data science job market is increasingly saturated by various levels of data science and AI employees. This book is a comprehensive guide, crafted to equip both aspiring and seasoned individuals with the essential tools and knowledge required to navigate the intricacies of data science interviews. Whether you’re stepping into the AI realm for the first time or aiming to elevate your expertise, this book offers a holistic approach to mastering the fundamental and cutting-edge facets of the field.
The chapters within this book span a wide spectrum of critical subjects, from programming with Python and SQL to statistical analysis, pre-modeling and data cleaning concepts, machine learning (ML), deep learning, Large Language Models (LLMs), and generative AI. We aim to provide a comprehensive review and update on the foundational concepts while also delving into the latest advancements. In an era marked by the disruptive potential of language models and generative AI, it’s imperative to continually hone your skills. This book serves as a compass, guiding you through the intricacies of these transformative technologies, ensuring you’re poised to tackle the challenges and harness the opportunities they present.
Moreover, beyond technical prowess, we delve into the art of interviewing for AI roles, offering guidance on how to ace interviews and negotiate compensation effectively. Additionally, crafting a standout résumé tailored for data science roles is a crucial step, and our guide offers insights into writing compelling résumés that capture attention in a competitive job market. As AI reshapes industries and innovation accelerates, now is the ideal time to embark on or advance in your data science journey. We invite you to dive into this comprehensive resource and embark on your path to mastering the dynamic world of data science and AI.
Who this book is for
If you are a seasoned or young professional who needs to brush up on your technical skills, or you are looking to break into the exciting world of the data science industry, then this book is for you.
What this book covers
In Chapter 1, Exploring Today’s Modern Data Science Landscape, we begin our journey with a brief but valuable overview of the contemporary landscape of data science and AI.
In Chapter 2, Finding a Job in Data Science, we will introduce data science roles and their various categories.
In Chapter 3, Programming with Python, you will familiarize yourself with the most common and useful tasks and operations in the Python language.
In Chapter 4, Visualizing Data and Data Storytelling, you will learn techniques for telling engaging data stories.
In Chapter 5, Querying Databases with SQL, you will dive into the world of databases, understanding their design and how to query them to acquire data.
In Chapter 6, Scripting with Shell and Bash Commands in Linux, you will boost your operating system skills with the power of Bash and shell commands, enabling you to interface with multiple technologies either locally or in the cloud.
In Chapter 7, Using Git for Version Control, we explore the most useful commands in Git for project collaboration and reproducibility.
In Chapter 8, Mining Data with Probability and Statistics, you will understand some of the most relevant topics in probability and statistics that serve as the foundation for many ML models and assumptions.
In Chapter 9, Understanding Feature Engineering and Preparing Data for Modeling, you will use your understanding of descriptive statistics to create clean, machine-legible datasets.
In Chapter 10, Mastering Machine Learning Concepts, you will learn about the most used ML algorithms, their assumptions, how they work, and how to best evaluate their performance.
In Chapter 11, Building Networks with Deep Learning, we take a step further into building and evaluating neural networks in various applications while also touching base on the latest advancements in AI.
In Chapter 12, Implementing Machine Learning Solutions with MLOps, we will review the data science process, tools, and strategies to effectively design and implement an end-to-end ML solution.
In Chapter 13, Mastering the Interview Rounds, you will learn the best techniques to successfully navigate the technical and non-technical factors at every stage of the interview process.
In Chapter 14, Negotiating Compensation, you will learn to optimize your earning potential.
To get the most out of this book
To get the most out of this book, you should have a basic knowledge of Python, SQL, and statistics. However, you will also benefit from this book if you have familiarity with other analytical languages, such as R. By brushing up on critical data science concepts such as SQL, Git, statistics, and deep learning, you’ll be well-equipped to crack through the interview process.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: The split() method can be used to split s into individual words: words = s.split().
A block of code is set as follows:
x = 5
print(type(x))  # <class 'int'>
Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: The increased computing power and the development of advanced algorithms, especially in machine learning (ML) and deep learning (DL), have made it possible to efficiently process and analyze massive amounts of data.
Tips or important notes
Appear like this.
Special Note
The prevalence of accessible AI technology has exploded over the past few months, particularly over the course of writing this book. We encourage you to use AI during your educational journey, leveraging tools such as ChatGPT to test your newly acquired skills. Long gone are the days when you had to browse Stack Overflow for hours to find an answer to your specific question. Now, the power of asking for help is right at your fingertips.
Even we, the authors of this book, leveraged generative AI to aid in minor editorial tasks and creating code examples. However, rest assured that humans wrote the content and laid out what is covered in the book! In this new era, we just wanted to make our readers aware of how we used the tool.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Share Your Thoughts
Once you’ve read Cracking the Data Science Interview, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Download a free PDF copy of this book
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there. You can also get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below
https://packt.link/free-ebook/978-1-80512-050-6
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits to your email directly
Part 1: Breaking into the Data Science Field
In the first part of this book, you will learn about the data science profession as it exists in the modern day, and how this relates to your endeavors in the field. This will serve as an introduction to various career paths and help to set expectations in terms of the skills and competencies required to be successful.
This part includes the following chapters:
Chapter 1, Exploring Today’s Modern Data Science Landscape
Chapter 2, Finding a Job in Data Science
1
Exploring Today’s Modern Data Science Landscape
If you’ve picked up this book, chances are that you’ve already heard of data science. It’s arguably one of the fastest-growing, most discussed professions within the tech and STEM space, all while maintaining its relative edge and mystique. That is, many people have heard of data scientists, but very few know what they do, how a data scientist produces value, or how to break into the field from scratch.
In this chapter, we will ground the definition of data science in a practical description. Then, we will discuss what most data science jobs entail, while spending some time describing the distinction between different flavors of data science. We’ll then dive into the various paths into data science and what makes it so challenging to land your first job. We’ll finish the chapter with an overview of the non-negotiable competencies expected of data scientists.
By the end of this chapter, you will have a firm understanding of what a modern data scientist does, which path into the field best fits your journey, the barriers to expect along the way, and which skills you should master.
In this chapter, we will cover the following topics:
What is data science?
Exploring the data science process
Dissecting the flavors of data science
Reviewing career paths in data science
Tackling the experience bottleneck
Understanding expected skills and competencies
Exploring the evolution of data science
What is data science?
To begin, let’s offer a definition of data science. According to Wikipedia, data science "is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms, and systems to extract or extrapolate knowledge and insights from noisy, structured, and unstructured data"[1]. It encompasses various techniques, procedures, and tools to process, analyze, and visualize data, enabling businesses and organizations to make data-driven decisions and predictions. The primary goal of data science is to identify patterns, relationships, and trends within data to support decision-making and create actionable insights.
You are not alone in your interest in data science. Harvard Business Review called data scientist the sexiest job of the 21st century [2], and stories of data scientists earning six-figure salaries are not uncommon. Data scientists are often looked at as oracles within an organization, answering complex business questions such as "If we increase our offering to this group of customers, can we increase our revenues?" or "What are the common causes of customer churn?"
Within organizations, the demand for the skills of data scientists has continued to grow. In 2022, the U.S. Bureau of Labor Statistics estimated that the number of jobs for data scientists would increase by roughly 36% over the following 10 years [3]. This growth in demand is being fueled by several factors, which are shown here:
Figure 1.1: Reasons for the increased demand for data scientists
The first is the proliferation of data. The exponential growth of data generated by digital devices, social media, and various other sources has made it essential for organizations to harness this data for decision-making and innovation. This data growth is expected to continue in the future, with the International Data Corporation (IDC) expecting that by 2025, we will generate 175 zettabytes of data annually [4]. That is a staggering amount of data!
Organizations want to take advantage of this explosion in data availability to generate insights for decision-making. As the world becomes more interconnected and complex, the need for evidence-based decision-making has grown, leading to an increased demand for skilled data scientists who can transform data into actionable insights. Organizations and businesses increasingly rely on data-driven insights to gain a competitive edge in the market, optimize operations, and improve customer experiences.
Finally, transforming data into insights would not be possible without advances in computational power and in the available tools and platforms. Increased computing power and the development of advanced algorithms, especially in machine learning (ML) and deep learning (DL), have made it possible to efficiently process and analyze massive amounts of data. In addition, the development of open source tools, libraries, and platforms has made data science more accessible to a broader audience, fostering the growth of the profession.
Data science is therefore still an evolving field, and it is expected to grow in parallel with computational and technological advancements (such as generative AI). Furthermore, as companies embrace the digital age and look to make the most of their data and capitalize on its underlying insights for a competitive advantage, the demand for data scientists will also expand.
However, although data science is often regarded and described as a monolithic function, you’ll soon learn that it’s a multi-faceted discipline that often varies by team, department, or even company. Naturally, the data scientist job profile is also an ever-evolving description, but we will cover all our bases for the most common tasks.
Exploring the data science process
Performing data science work is often an iterative process, where the data scientist needs to return to earlier steps if they run into challenges. There are many ways to categorize the data science process, but it often includes:
Data collection
Data exploration
Data modeling
Model evaluation
Model deployment and monitoring
Let’s briefly touch on each step and discuss what’s expected of the data scientist during them.
Data collection
Data collection and preprocessing involves gathering data from various sources (such as databases, APIs, and web scraping), then cleaning and transforming the data to prepare it for analysis. This step involves dealing with missing, inconsistent, or noisy data and converting it into a structured format. Depending on the organization, a team of data engineers may support this step of the data science process; however, it is common for the data scientist to manage this process as well. This requires them to have intimate knowledge of the data sources and the ability to write Structured Query Language (SQL) queries, code that can query databases, or custom tools such as web scrapers to gather the needed data.
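To make this step concrete, here is a minimal sketch of collecting rows with a SQL query and then cleaning them. It uses only Python's standard library. The customers table, its columns, and the sample rows are hypothetical stand-ins for a real data source, and dropping incomplete rows is just one of several reasonable ways to handle missing values.

```python
import sqlite3

# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT, monthly_spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "north", 120.0), (2, " South ", 95.0), (3, "north", None), (4, None, 42.0)],
)

# Collect: pull the raw rows with a SQL query.
rows = conn.execute("SELECT id, region, monthly_spend FROM customers").fetchall()
conn.close()


def clean(row):
    """Normalize one raw row; return None if it has missing values."""
    customer_id, region, spend = row
    if region is None or spend is None:
        return None
    # Fix inconsistent formatting such as stray spaces and mixed case.
    return {"id": customer_id, "region": region.strip().lower(), "monthly_spend": spend}


# Preprocess: keep only the rows that survive cleaning.
cleaned = [c for c in (clean(r) for r in rows) if c is not None]
print(cleaned)
```

In practice, you might impute missing values rather than drop them; the right choice depends on the data and the question being asked.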
Data exploration
Data exploration involves conducting exploratory data analysis (EDA) to better understand the data, detect anomalies, and identify relationships between variables. The key to this step is to look for correlations and understand the distribution of the data. This involves using descriptive statistics and visualization techniques to summarize the data and gain insights; therefore, the data scientist should be able to use summary statistics, program descriptive visualizations, or utilize reporting tools such as Power BI or Tableau to create robust charts.
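As a small illustration of EDA, the following sketch computes summary statistics, a correlation, and a simple anomaly flag using Python's standard library. The ad_spend and sales values are made up for the example, and the rule of flagging values more than two standard deviations from the mean is an assumed convention rather than the only option.

```python
import statistics

# Hypothetical sample: advertising spend and resulting sales for eight periods.
ad_spend = [10.0, 12.0, 15.0, 18.0, 20.0, 22.0, 24.0, 25.0]
sales = [100.0, 110.0, 130.0, 150.0, 160.0, 170.0, 180.0, 400.0]


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


summary = describe(sales)
print(summary)
print("correlation:", round(pearson(ad_spend, sales), 3))

# Simple anomaly flag: values more than 2 standard deviations from the mean.
outliers = [v for v in sales if abs(v - summary["mean"]) > 2 * summary["stdev"]]
print("possible outliers:", outliers)
```

A flagged value is a prompt for investigation, not proof of an error; it may be a data entry mistake or a genuinely unusual period.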
Data modeling
Using what was learned in the data exploration step, data modeling is the step in which the data scientist builds predictive or descriptive models using ML and statistical techniques that identify patterns and relationships in the data. Here, the data scientist selects the appropriate algorithms, trains the models on historical data, and validates their performance.
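The train-then-validate idea can be sketched with a very small model. The example below fits a straight line by ordinary least squares in plain Python and checks it against held-out data. The history values are invented, and holding out the last two observations is a simplification; a real project would more likely use a library such as scikit-learn and a random or time-aware split.

```python
# Hypothetical historical data: (years_of_tenure, monthly_spend).
history = [(1, 52.0), (2, 61.0), (3, 69.0), (4, 82.0),
           (5, 88.0), (6, 101.0), (7, 108.0), (8, 121.0)]

# Hold out the last two observations for validation.
train, valid = history[:6], history[6:]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)


def predict(x):
    """Predict monthly spend for a given tenure using the fitted line."""
    return slope * x + intercept


# Validate: mean absolute error on data the model has not seen.
mae = sum(abs(predict(x) - y) for x, y in valid) / len(valid)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, validation MAE={mae:.2f}")
```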
Model evaluation
Model evaluation and optimization involves assessing the performance of models using metrics such as accuracy, root mean squared error (RMSE), precision, recall, area under the ROC curve (AUC), or F1 scores. Based on these evaluations, data scientists may refine the models or try alternative algorithms to improve their performance. Understanding the underlying reasons behind a model’s predictions is crucial for building trust in its results and ensuring that it aligns with domain knowledge. Therefore, the data scientist must be sure the model serves the organizational or business goal. Here, the data scientist needs to be able to communicate their findings to both technical and non-technical audiences.
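Several of these metrics are simple enough to compute by hand, which is a useful exercise before relying on a library. The sketch below implements accuracy, precision, recall, F1, and RMSE in plain Python. The function names and the sample labels are illustrative assumptions, and binary labels are assumed to use 1 for the positive class.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def rmse(y_true, y_pred):
    """Root mean squared error for numeric predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical labels and predictions for eight customers (1 = churned).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(classification_metrics(actual, predicted))
print("RMSE:", round(rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]), 3))
```

Which metric matters most depends on the business goal: for example, recall is often prioritized when missing a positive case is costly, while precision matters more when false alarms are expensive.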
Model deployment and monitoring
Model deployment and monitoring involves implementing the models in real-world applications, monitoring their performance, and maintaining them to ensure their continued accuracy and relevance. For example, the data scientist might work with a data engineering team or use tools such as containers to implement the model. Once deployed, the data scientist may also need to develop dashboards to monitor the model’s performance over time and flag stakeholders if it goes outside the expected performance range.
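One simple form of monitoring is to track accuracy over a rolling window of recent predictions and raise a flag when it drops below an agreed threshold. The sketch below shows that idea in plain Python. The AccuracyMonitor class, the window size, and the 0.75 threshold are all hypothetical choices for illustration; a production setup would typically feed a dashboard or alerting system instead of a list.

```python
from collections import deque


class AccuracyMonitor:
    """Track rolling accuracy over the most recent `window` predictions."""

    def __init__(self, window, min_accuracy):
        self.outcomes = deque(maxlen=window)  # oldest results drop off automatically
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether a prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def rolling_accuracy(self):
        """Share of correct predictions in the window, or None if empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """True only when the window is full and accuracy is below the threshold."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.rolling_accuracy() < self.min_accuracy


# Hypothetical stream of (predicted, actual) pairs arriving after deployment.
stream = [(1, 1), (0, 0), (1, 1), (1, 1), (1, 0), (0, 1), (1, 0)]
monitor = AccuracyMonitor(window=4, min_accuracy=0.75)

alerts = []
for step, (predicted, actual) in enumerate(stream):
    monitor.record(predicted, actual)
    if monitor.needs_attention():
        alerts.append(step)  # in practice: notify stakeholders

print("steps that triggered an alert:", alerts)
```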
As you can see, data science is a profession that incorporates many data-related tasks – particularly those that involve the acquisition, prepping, and delivery of data in one format or another. While data modeling makes up most of the glitz and glamour associated with the job, it is really everything else that takes up roughly 80% of the gig. This does not include non-data-related tasks, such as interfacing with stakeholders, gathering requirements, debugging software, checking emails, and research. However, those tasks are not necessarily unique to data scientists.
Now that you understand the common tasks associated with the job, let’s explore the different types or flavors of data science.
Dissecting the flavors of data science
Now that we have defined some of the critical aspects of the role of a data scientist, it is clear that the role often covers many different skills. Data scientists are frequently asked to perform a variety of data-related tasks, including designing database tables to collect data, programming ML algorithms, understanding statistics, and creating stunning visuals to help explain interesting findings to others, but it is difficult for any single person to master all of these skill areas.
Therefore, we often see data scientists who are particularly skilled in one or two areas and have basic competencies in the others. Their talents could be considered T-shaped: they are proficient across many areas, like the horizontal line of a T, while they have deep knowledge and expertise in a few areas, like the vertical portion of the letter:
Figure 1.2: Example of the ‘T of Competencies’
While this example shows someone who is adequate in data engineering and visualization principles but exceptional in ML, you can expect to see every possible combination of skills among data scientists. These competencies are often aligned with a person’s unique experiences or interests. Perhaps they were a statistics major and took a liking to ML, or perhaps they’re a former business intelligence (BI) engineer with considerable experience in data extraction, transformation, and loading (ETL), allowing them to grasp data engineering concepts much faster.
Whatever the reason, it’s natural for someone to grasp some concepts better than others. This is important to remember as you navigate this book. While you are not expected to specialize in every facet of data science, you are expected to master the fundamentals. However, you will almost certainly discover your T of Competencies – a trinity of top skill sets that will solidify your identity in the data science space.
While there are countless combinations of skill proficiencies, let’s review some of the most common that you will encounter:
The data engineer
The dashboarding and visual specialist
The ML specialist
The domain expert
Let’s take a look at these now.
Data engineer
As we discussed earlier, data engineering is a crucial aspect of the data science process that involves data collection, storage, processing, and management. It focuses on designing, developing, and maintaining scalable data infrastructure, ensuring the availability of high-quality data for analysis and modeling. Data engineers are best known for overseeing the ETL processes behind data pipelines. On some data science teams, especially within smaller organizations, the data engineering responsibilities sit within the data science team. Therefore, the data scientist specializing in this area can help support team projects with data collection and storage, understanding the needs of the ML process, such as structuring the data so that it can be fed efficiently to a DL algorithm.
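To give a feel for what ETL means in practice, here is a deliberately tiny sketch using Python's standard library: it extracts rows from a CSV export, transforms them, and loads them into a database table. The CSV contents, the orders table, and its columns are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Extract: a hypothetical CSV export, embedded here so the sketch is self-contained.
raw_csv = """order_id,amount,currency
1001,19.99,usd
1002,,usd
1003,5.00,USD
"""
records = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: drop rows with no amount, cast types, normalize currency codes.
transformed = [
    (int(r["order_id"]), float(r["amount"]), r["currency"].upper())
    for r in records
    if r["amount"]
]

# Load: write the cleaned rows into a table that analysts or models can query.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
)
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", transformed)
warehouse.commit()

total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(f"loaded {len(transformed)} rows, total amount {total:.2f}")
```

Real pipelines add scheduling, logging, and error handling around these three stages, but the extract, transform, and load structure stays the same.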
Data engineers have a wealth of tools to choose from. No single data engineer is expected to know all of these technologies, especially at the same level of competency. In fact, the more senior the engineer, the more competent they are in their tools of choice. Furthermore, this is not a comprehensive list. However, you can expect to see the following on