0% found this document useful (0 votes)

40 views

How To Use NLP in Python A Practical Step-by-Step ExampleTo Find Out The In-Demand Skills For Data SC

The document describes a step-by-step example of using natural language processing (NLP) in Python to analyze job postings for data scientists and identify in-demand skills. It loads and cleans data from over 2,600 job postings, forms lists of keywords for tools, skills, and education levels, and streamlines job descriptions using NLP techniques like tokenization. The goal is to summarize the popular tools, skills, and minimum education requirements mentioned by employers.

Uploaded by

Juanito Alimaña

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

40 views

How To Use NLP in Python A Practical Step-by-Step ExampleTo Find Out The In-Demand Skills For Data SC

Uploaded by

Juanito Alimaña

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 12

www.justintodata.

com /use-nlp-in-python-practical-step-by-step-example/

How to use NLP in Python: a Practical Step-by-Step Example To find

out the In-Demand Skills for Data Scientists with NLTK
⋮ 28/2/2020

1/12
In this article, we present a step-by-step NLP application on job postings.

It is the technical explanation of the previous article, in which we summarized the in-demand skills for data scientists.
We provided the top tools, skills, and minimum education required most often by employers.

If you want to see a practical example using Natural Language Toolkit (NLTK) package with Python code, this post is
for you.

Let’s dive right in.

The table of contents is below for your convenience.

Preparation: Scraping the Data

We obtained the job postings for “data scientists” for 8 different cities. We’ve downloaded the data into separate files
for each of the cities.

The 8 cities included in this analysis are Boston, Chicago, Los Angeles, Montreal, New York, San Francisco, Toronto,
and Vancouver. The variables are job_title, company, location, and job_description.

We are not going into details for this process within this article.

We are ready for the real analysis! We’ll summarize the popular tools, skills, and minimum education required by the
employers from this data.

Step #1: Loading and Cleaning the Data

First, we load and combine the data files of the 8 cities into Python.

from collections import Counter

import nltk
import string
from nltk.tokenize import word_tokenize
import math
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)
# load in the data.
df_list = []
cities = ['boston', 'chicago', 'la', 'montreal', 'ny', 'sf', 'toronto', 'vancouver']
for city in cities:
df_tmp = pd.read_pickle('data_scientist_{}.pkl'.format(city))
df_tmp['city'] = city
df_list.append(df_tmp)
df = pd.concat(df_list).reset_index(drop=True)
# make the city names nicer.
msk = df['city'] == 'la'
df.loc[msk, 'city'] = 'los angeles'
msk = df['city'] == 'ny'
df.loc[msk, 'city'] = 'new york'
msk = df['city'] == 'sf'
df.loc[msk, 'city'] = 'san francisco'
view raw load_jobs_data.py hosted with ❤ by GitHub

We remove duplicate rows/job postings with the same job_title, job_description,and city features.

# If it's the same job description in the same city, for the same job title, we consider it duplicate.
print(df.shape)
df = df.drop_duplicates(subset=['job_description', 'city', 'job_title'])
print(df.shape)

2/12
view raw load_jobs_data_dedup.py hosted with ❤ by GitHub

Now we have a dataset of 5 features and 2,681 rows.

Step #2: Forming the Lists of Keywords

Before searching in the job descriptions, we need lists of keywords that represent the tools/skills/degrees.

For this analysis, we use a simple approach to forming the lists. The lists are based on our judgment and the content
of the job postings. You may use more advanced approaches if the task is more complicated than this.

For the list of keywords of tools, we initially come up with a list based on our knowledge of data science. We know
that the popular tools for data scientists include Python, R, Hadoop, Spark, and more. We have a decent knowledge
of the field. So this initial list is good to have covered many tools mentioned in the job postings.

Then we look at random job postings and add tools that are not on the list yet. Often these new keywords remind us
to add other related tools as well.

After this process, we have a keyword list that covers most of the tools mentioned in the job postings.

Next, we separate the keywords into a single-word list and a multi-word list. We need to match these two lists of
keywords to the job description in different ways.

With simple string matches, the multi-word keyword is often unique and easy to identify in the job description.

The single-word keyword, such as “c” is referring to C programming language in our article. But “c” is also a common
letter that is used in many words including “can”, “clustering”. We need to process them further (through tokenization)
to match only when there is a single letter “c” in the job descriptions.

Below are our lists of keywords for tools coded in Python.

# got these keywords by looking at some examples and using existing knowledge.
tool_keywords1 = ['python', 'pytorch', 'sql', 'mxnet', 'mlflow', 'einstein', 'theano', 'pyspark', 'solr', 'mahout',
'cassandra', 'aws', 'powerpoint', 'spark', 'pig', 'sas', 'java', 'nosql', 'docker', 'salesforce', 'scala', 'r',
'c', 'c++', 'net', 'tableau', 'pandas', 'scikitlearn', 'sklearn', 'matlab', 'scala', 'keras', 'tensorflow', 'clojure',
'caffe', 'scipy', 'numpy', 'matplotlib', 'vba', 'spss', 'linux', 'azure', 'cloud', 'gcp', 'mongodb', 'mysql', 'oracle',
'redshift', 'snowflake', 'kafka', 'javascript', 'qlik', 'jupyter', 'perl', 'bigquery', 'unix', 'react',
'scikit', 'powerbi', 's3', 'ec2', 'lambda', 'ssrs', 'kubernetes', 'hana', 'spacy', 'tf', 'django', 'sagemaker',
'seaborn', 'mllib', 'github', 'git', 'elasticsearch', 'splunk', 'airflow', 'looker', 'rapidminer', 'birt', 'pentaho',
'jquery', 'nodejs', 'd3', 'plotly', 'bokeh', 'xgboost', 'rstudio', 'shiny', 'dash', 'h20', 'h2o', 'hadoop', 'mapreduce',
'hive', 'cognos', 'angular', 'nltk', 'flask', 'node', 'firebase', 'bigtable', 'rust', 'php', 'cntk', 'lightgbm',
'kubeflow', 'rpython', 'unixlinux', 'postgressql', 'postgresql', 'postgres', 'hbase', 'dask', 'ruby', 'julia', 'tensor',
# added r packages doesn't seem to impact the result
'dplyr','ggplot2','esquisse','bioconductor','shiny','lubridate','knitr','mlr','quanteda','dt','rcrawler','caret','rmarkdown',
'leaflet','janitor','ggvis','plotly','rcharts','rbokeh','broom','stringr','magrittr','slidify','rvest',
'rmysql','rsqlite','prophet','glmnet','text2vec','snowballc','quantmod','rstan','swirl','datasciencer']
# another set of keywords that are longer than one word.
tool_keywords2 = set(['amazon web services', 'google cloud', 'sql server'])
view raw word_setting1.py hosted with ❤ by GitHub

We get lists of keywords for skills by following a similar process as tools.

# hard skills/knowledge required.

skill_keywords1 = set(['statistics', 'cleansing', 'chatbot', 'cleaning', 'blockchain', 'causality', 'correlation',
'bandit', 'anomaly', 'kpi',
'dashboard', 'geospatial', 'ocr', 'econometrics', 'pca', 'gis', 'svm', 'svd', 'tuning', 'hyperparameter', 'hypothesis',
'salesforcecom', 'segmentation', 'biostatistics', 'unsupervised', 'supervised', 'exploratory',

3/12
'recommender', 'recommendations', 'research', 'sequencing', 'probability', 'reinforcement', 'graph',
'bioinformatics',
'chi', 'knn', 'outlier', 'etl', 'normalization', 'classification', 'optimizing', 'prediction', 'forecasting',
'clustering', 'cluster', 'optimization', 'visualization', 'nlp', 'c#',
'regression', 'logistic', 'nn', 'cnn', 'glm',
'rnn', 'lstm', 'gbm', 'boosting', 'recurrent', 'convolutional', 'bayesian',
'bayes'])
# another set of keywords that are longer than one word.
skill_keywords2 = set(['random forest', 'natural language processing', 'machine learning', 'decision tree',
'deep learning', 'experimental design',
'time series', 'nearest neighbors', 'neural network', 'support vector machine', 'computer vision', 'machine
vision', 'dimensionality reduction',
'text analytics', 'power bi', 'a/b testing', 'ab testing', 'chat bot', 'data mining'])
view raw word_setting2.py hosted with ❤ by GitHub

For education level, we use a different procedure.

Because we are looking for the minimum required education level, we need a numeric value to rank the education
degree. For example, we use 1 to represent “bachelor” or “undergraduate”, 2 to represent “master” or “graduate”, and
so on.

In this way, we have a ranking of degrees by numbers from 1 to 4. The higher the number, the higher the education
level.

degree_dict = {'bs': 1, 'bachelor': 1, 'undergraduate': 1,

'master': 2, 'graduate': 2, 'mba': 2.5,
'phd': 3, 'ph.d': 3, 'ba': 1, 'ma': 2,
'postdoctoral': 4, 'postdoc': 4, 'doctorate': 3}
degree_dict2 = {'advanced degree': 2, 'ms or': 2, 'ms degree': 2, '4 year degree': 1, 'bs/': 1, 'ba/': 1,
'4-year degree': 1, 'b.s.': 1, 'm.s.': 2, 'm.s': 2, 'b.s': 1, 'phd/': 3, 'ph.d.': 3, 'ms/': 2,
'm.s/': 2, 'm.s./': 2, 'msc/': 2, 'master/': 2, 'master\'s/': 2, 'bachelor\s/': 1}
degree_keywords2 = set(degree_dict2.keys())
view raw word_setting3.py hosted with ❤ by GitHub

Step #3: Streamlining the Job Descriptions using NLP Techniques

In this step, we streamline the job description text. We make the text easier to understand by computer programs;
and hence more efficient to match the text with the lists of keywords.

The job_description feature in our dataset looks like this.

df['job_description'].iloc[12]
view raw job_description_print.py hosted with ❤ by GitHub

Tokenizing the Job Descriptions

Tokenization is a process of parsing the text string into different sections (tokens). It is necessary since the computer
programs understand the tokenized text better.

We must explicitly split the job description text string into different tokens (words) with delimiters such as space (“ ”).
We use the word_tokenize function to handle this task.

word_tokenize(df['job_description'].iloc[12])
view raw word_tokenize.py hosted with ❤ by GitHub

4/12
After this process, the job description text string is partitioned into tokens (words) as below. The computer can read
and process these tokens easier.

For instance, the single-word keyword “c” can only match with tokens (words) “c”, rather than with other words “can”
or “clustering”.

Please read on for the Python code. We combine tokenization with the next few procedures together.

Parts of Speech (POS) Tagging the Job Descriptions

The job descriptions are often long. We want to keep the words that are informative for our analysis while filtering out
others. We use POS tagging to achieve this.

The POS tagging is an NLP method of labeling whether a word is a noun, adjective, verb, etc. Wikipedia explains it
well:

POS tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part
of speech, based on both its definition and its context — i.e., its relationship with adjacent and related
words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age
children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Thanks to the NLTK, we can use this tagger with Python.

Applying this technique on the lists of keywords, we can find tags related to our analysis.

Below, we POS tag the list of keywords for tools as a demonstration.

from nltk import pos_tag

from nltk.stem import PorterStemmer
pos_tag(tool_keywords1)
view raw pos_tag_example.py hosted with ❤ by GitHub

Different combinations of letters represent the tags. For instance, NN stands for nouns and singular words such as
“python”, JJ stands for adjective words such as “big”. The full list of representations is here.

5/12
As we can see, the tagger is not perfect. For example, “sql” is tagged as “JJ” — adjective. But it is still good enough
to help us filtering for useful words.

We use this list of tags of all the keywords as a filter for the job descriptions. We keep only the words from the job
descriptions that have these same tags of keywords. For example, we would keep the words from job descriptions
with tags “NN” and “JJ”. By doing this, we filter out the words from the job descriptions such as “the”, “then” that are
not informative for our analysis.

At this stage, we have streamlined job descriptions that are tokenized and shortened.

Stay patient! We only need to process them a little more.

Step #4: Final Processing of the Keywords and the Job Descriptions
In this step, we process both the lists of keywords and the job descriptions further.

Stemming the Words

Word stemming is the process of reducing inflected (or sometimes derived) words to their word stem,
base, or root form — generally a written word form.

The stemming process allows computer programs to identify the words of the same stem despite their different look.
In this way, we can match words as long as they have the same stem. For instance, the words “models”, “modeling”
both have the same stem of “model”.

We stem both the lists of keywordsand the streamlined job descriptions.

Lowercasing the Words

Lastly, we standardize all the words by lowercasing them. We only lowercase the job descriptions since the lists of
keywords are built in lowercase.

As mentioned in the previous sections, the Python code used in the previous procedures is below.

from nltk import pos_tag

6/12
from nltk.stem import PorterStemmer
ps = PorterStemmer()
# process the job description.
def prepare_job_desc(desc):
# tokenize description.
tokens = word_tokenize(desc)
# Parts of speech (POS) tag tokens.
token_tag = pos_tag(tokens)
# Only include some of the POS tags.
include_tags = ['VBN', 'VBD', 'JJ', 'JJS', 'JJR', 'CD', 'NN', 'NNS', 'NNP', 'NNPS']
filtered_tokens = [tok for tok, tag in token_tag if tag in include_tags]
# stem words.
stemmed_tokens = [ps.stem(tok).lower() for tok in filtered_tokens]
return set(stemmed_tokens)
df['job_description_word_set'] = df['job_description'].map(prepare_job_desc)
# process the keywords
tool_keywords1_set = set([ps.stem(tok) for tok in tool_keywords1]) # stem the keywords (since the job
description is also stemmed.)
tool_keywords1_dict = {ps.stem(tok):tok for tok in tool_keywords1} # use this dictionary to revert the
stemmed words back to the original.
skill_keywords1_set = set([ps.stem(tok) for tok in skill_keywords1])
skill_keywords1_dict = {ps.stem(tok):tok for tok in skill_keywords1}
degree_keywords1_set = set([ps.stem(tok) for tok in degree_dict.keys()])
degree_keywords1_dict = {ps.stem(tok):tok for tok in degree_dict.keys()}
view raw data_preprocessing.py hosted with ❤ by GitHub

Now only the words (tokens) in the job descriptions that are related to our analysis remain. An example of a final job
description is below.

df['job_description_word_set'].iloc[10]
view raw final_job_description.py hosted with ❤ by GitHub

7/12
Finally, we are ready for keyword matching!

Step #5: Matching the Keywords and the Job Descriptions

To see if a job description mentions specific keywords, we match the lists of keywords and the final streamlined job
descriptions.

Tools/Skills

As you may recall, we built two types of keyword lists — the single-word list and the multi-word list. For the single-
word keywords, we match each keyword with the job description by the set intersection function. For the multi-word
keywords, we check whether they are sub-strings of the job descriptions.

Education

For the education level, we use the same method as tools/skills to match keywords. Yet, we only keep track of the
minimum level.

For example, when the keywords “bachelor” and “master” both exist in a job description, the bachelor’s degree is the
minimum education required for this job.

The Python code with more details is below.

tool_list = []
skill_list = []
degree_list = []
msk = df['city'] != '' # just in case you want to filter the data.
num_postings = len(df[msk].index)
for i in range(num_postings):

8/12
job_desc = df[msk].iloc[i]['job_description'].lower()
job_desc_set = df[msk].iloc[i]['job_description_word_set']
# check if the keywords are in the job description. Look for exact match by token.
tool_words = tool_keywords1_set.intersection(job_desc_set)
skill_words = skill_keywords1_set.intersection(job_desc_set)
degree_words = degree_keywords1_set.intersection(job_desc_set)
# check if longer keywords (more than one word) are in the job description. Match by substring.
j=0
for tool_keyword2 in tool_keywords2:
# tool keywords.
if tool_keyword2 in job_desc:
tool_list.append(tool_keyword2)
j += 1
k=0
for skill_keyword2 in skill_keywords2:
# skill keywords.
if skill_keyword2 in job_desc:
skill_list.append(skill_keyword2)
k += 1
# search for the minimum education.
min_education_level = 999
for degree_word in degree_words:
level = degree_dict[degree_keywords1_dict[degree_word]]
min_education_level = min(min_education_level, level)
for degree_keyword2 in degree_keywords2:
# longer keywords. Match by substring.
if degree_keyword2 in job_desc:
level = degree_dict2[degree_keyword2]
min_education_level = min(min_education_level, level)
# label the job descriptions without any tool keywords.
if len(tool_words) == 0 and j == 0:
tool_list.append('nothing specified')
# label the job descriptions without any skill keywords.
if len(skill_words) == 0 and k == 0:
skill_list.append('nothing specified')
# If none of the keywords were found, but the word degree is present, then assume it's a bachelors level.
if min_education_level > 500:
if 'degree' in job_desc:
min_education_level = 1
tool_list += list(tool_words)
skill_list += list(skill_words)
degree_list.append(min_education_level)
view raw word_setting4.py hosted with ❤ by GitHub

Step #6: Visualizing the Results

We summarize the results with bar charts.

For each particular keyword of tools/skills/education levels, we count the number of job descriptions that match them.
We calculate their percentage among all the job descriptions as well.

For the lists of tools and skills, we are only presenting the top 50 most popular ones. For the education level, we
summarize them according to the minimum level required.

The detailed Python code is below.

Top Tools In-Demand

# create the list of tools.

df_tool = pd.DataFrame(data={'cnt': tool_list})
df_tool = df_tool.replace(tool_keywords1_dict)

9/12
# group some of the categories together.
msk = df_tool['cnt'] == 'h20'
df_tool.loc[msk, 'cnt'] = 'h2o'
msk = df_tool['cnt'] == 'aws'
df_tool.loc[msk, 'cnt'] = 'amazon web services'
msk = df_tool['cnt'] == 'gcp'
df_tool.loc[msk, 'cnt'] = 'google cloud'
msk = df_tool['cnt'] == 'github'
df_tool.loc[msk, 'cnt'] = 'git'
msk = df_tool['cnt'] == 'postgressql'
df_tool.loc[msk, 'cnt'] = 'postgres'
msk = df_tool['cnt'] == 'tensor'
df_tool.loc[msk, 'cnt'] = 'tensorflow'
df_tool_top50 = df_tool['cnt'].value_counts().reset_index().rename(columns={'index': 'tool'}).iloc[:50]
view raw visualization_tools_list.py hosted with ❤ by GitHub
# visualize the tools.
layout = dict(
title='Tools For Data Scientists',
yaxis=dict(
title='% of job postings',
tickformat=',.0%',
)
)
fig = go.Figure(layout=layout)
fig.add_trace(go.Bar(
x=df_tool_top50['tool'],
y=df_tool_top50['cnt']/num_postings
))
iplot(fig)
view raw visualization_tools.py hosted with ❤ by GitHub

Top 50 Tools for Data Scientists

Top Skills In-Demand

# create the list of skills/knowledge.

df_skills = pd.DataFrame(data={'cnt': skill_list})
df_skills = df_skills.replace(skill_keywords1_dict)
# group some of the categories together.
msk = df_skills['cnt'] == 'nlp'
df_skills.loc[msk, 'cnt'] = 'natural language processing'
msk = df_skills['cnt'] == 'convolutional'
df_skills.loc[msk, 'cnt'] = 'convolutional neural network'
msk = df_skills['cnt'] == 'cnn'
df_skills.loc[msk, 'cnt'] = 'convolutional neural network'
msk = df_skills['cnt'] == 'recurrent'
df_skills.loc[msk, 'cnt'] = 'recurrent neural network'
msk = df_skills['cnt'] == 'rnn'
df_skills.loc[msk, 'cnt'] = 'recurrent neural network'
msk = df_skills['cnt'] == 'knn'
df_skills.loc[msk, 'cnt'] = 'nearest neighbors'

10/12
msk = df_skills['cnt'] == 'svm'
df_skills.loc[msk, 'cnt'] = 'support vector machine'
msk = df_skills['cnt'] == 'machine vision'
df_skills.loc[msk, 'cnt'] = 'computer vision'
msk = df_skills['cnt'] == 'ab testing'
df_skills.loc[msk, 'cnt'] = 'a/b testing'
df_skills_top50 = df_skills['cnt'].value_counts().reset_index().rename(columns={'index': 'skill'}).iloc[:50]
view raw visualization_skills_list.py hosted with ❤ by GitHub
# visualize the skills.
layout = dict(
title='Skills For Data Scientists',
yaxis=dict(
title='% of job postings',
tickformat=',.0%',
)
)
fig = go.Figure(layout=layout)
fig.add_trace(go.Bar(
x=df_skills_top50['skill'],
y=df_skills_top50['cnt']/num_postings
))
iplot(fig)
view raw visualization_skills.py hosted with ❤ by GitHub

Top 50 Skills for Data Scientists

Minimum Education Required

# create the list of degree.

df_degrees = pd.DataFrame(data={'cnt': degree_list})
df_degrees['degree_type'] = ''
msk = df_degrees['cnt'] == 1
df_degrees.loc[msk, 'degree_type'] = 'bachelors'
msk = df_degrees['cnt'] == 2
df_degrees.loc[msk, 'degree_type'] = 'masters'
msk = df_degrees['cnt'] == 3
df_degrees.loc[msk, 'degree_type'] = 'phd'
msk = df_degrees['cnt'] == 4
df_degrees.loc[msk, 'degree_type'] = 'postdoc'
msk = df_degrees['cnt'] == 2.5
df_degrees.loc[msk, 'degree_type'] = 'mba'
msk = df_degrees['cnt'] > 500
df_degrees.loc[msk, 'degree_type'] = 'not specified'
df_degree_cnt = df_degrees['degree_type'].value_counts().reset_index().rename(columns={'index':
'degree'}).iloc[:50]
view raw visualization_degree_list.py hosted with ❤ by GitHub
# visualize the degrees.
layout = dict(
title='Minimum Education For Data Scientists',
yaxis=dict(
title='% of job postings',
tickformat=',.0%',

11/12
)
)
fig = go.Figure(layout=layout)
fig.add_trace(go.Bar(
x=df_degree_cnt['degree'],
y=df_degree_cnt['degree_type']/num_postings
))
iplot(fig)
view raw visualization_degree.py hosted with ❤ by GitHub

Minimum Education Level for Data Scientists

We did it!

We hope you found this article helpful. Leave a comment to let us know your thoughts.

Again, if you want to see the detailed results, read What are the In-Demand Skills for Data Scientists.

12/12

Pagsulat NG Lathalain PDF
100% (10)
Pagsulat NG Lathalain PDF
30 pages
Developing Microsoft Media Foundation Applications
No ratings yet
Developing Microsoft Media Foundation Applications
385 pages
"The Impossible" - Worksheet
100% (2)
"The Impossible" - Worksheet
1 page
I Analyzed 2k Data Scientist and Data Engineer Jobs and This Is What I Found - by Khuyen Tran - Towards AI
No ratings yet
I Analyzed 2k Data Scientist and Data Engineer Jobs and This Is What I Found - by Khuyen Tran - Towards AI
17 pages
Nazi and The Barber
No ratings yet
Nazi and The Barber
14 pages
CV Nagaraj 3 4 2023.pdf 1680525267971
No ratings yet
CV Nagaraj 3 4 2023.pdf 1680525267971
3 pages
Data Science ML Full Stack Roadmap
No ratings yet
Data Science ML Full Stack Roadmap
35 pages
Data Science & ML Using Python
No ratings yet
Data Science & ML Using Python
5 pages
Andres Limon Alcocer English
No ratings yet
Andres Limon Alcocer English
1 page
Datasciendeusingpython 6 Weeks
No ratings yet
Datasciendeusingpython 6 Weeks
7 pages
Data Science Programming In Python
From Everand
Data Science Programming In Python
Anita Raichand
No ratings yet
ayush ai project file final[1][1]
No ratings yet
ayush ai project file final[1][1]
15 pages
Full Stack Roadmap
No ratings yet
Full Stack Roadmap
25 pages
Sample Template - Advance Data Science Students
No ratings yet
Sample Template - Advance Data Science Students
3 pages
AI Using Python
No ratings yet
AI Using Python
9 pages
Technology Skills
No ratings yet
Technology Skills
6 pages
Machine Learning and Data Science Master
No ratings yet
Machine Learning and Data Science Master
19 pages
Course Curriculum: Internship On Python
No ratings yet
Course Curriculum: Internship On Python
29 pages
ALL Course Curriculum
No ratings yet
ALL Course Curriculum
39 pages
Bhavnesh Baghel's Resume
No ratings yet
Bhavnesh Baghel's Resume
2 pages
4aeee7-Ba25-Ff2e-30d7-63d306a7270 Open Ai Playground Example Prompts - Google Sheets
No ratings yet
4aeee7-Ba25-Ff2e-30d7-63d306a7270 Open Ai Playground Example Prompts - Google Sheets
8 pages
Action PlanJournaling
No ratings yet
Action PlanJournaling
7 pages
Let Us Create Super Ai by Chat GPT and Muwanguz David
No ratings yet
Let Us Create Super Ai by Chat GPT and Muwanguz David
133 pages
Python Data Science Projects
No ratings yet
Python Data Science Projects
14 pages
Datascienceusing Python Training
No ratings yet
Datascienceusing Python Training
11 pages
Python Itinerary
No ratings yet
Python Itinerary
4 pages
M Resume
No ratings yet
M Resume
2 pages
Tue+Sep+20+23 56 35+GMT+05 00+2022
No ratings yet
Tue+Sep+20+23 56 35+GMT+05 00+2022
1 page
Profile 1
No ratings yet
Profile 1
7 pages
Practical 1to10
No ratings yet
Practical 1to10
32 pages
Data Science Lab-KTU
No ratings yet
Data Science Lab-KTU
5 pages
Summer Training Report - Ishan Patwal
No ratings yet
Summer Training Report - Ishan Patwal
21 pages
IT Lab PPT Pratham Chouhan CSE174
No ratings yet
IT Lab PPT Pratham Chouhan CSE174
40 pages
Numpy Module
No ratings yet
Numpy Module
10 pages
Dnyaneshwar Ds
No ratings yet
Dnyaneshwar Ds
2 pages
Data-Science-and-Machine-Learning
No ratings yet
Data-Science-and-Machine-Learning
30 pages
Python NLP
No ratings yet
Python NLP
15 pages
Generative AI Tghjraining in Hyderabad
No ratings yet
Generative AI Tghjraining in Hyderabad
22 pages
SocrAI Day 3
No ratings yet
SocrAI Day 3
43 pages
1732869290817
No ratings yet
1732869290817
25 pages
Shi008 Booklet 80hours Machine Learning Using Python 082022
No ratings yet
Shi008 Booklet 80hours Machine Learning Using Python 082022
9 pages
Log Book
No ratings yet
Log Book
32 pages
ML Lab Manual
No ratings yet
ML Lab Manual
90 pages
Data Science Toc Srinivas
No ratings yet
Data Science Toc Srinivas
4 pages
Data Science Full Stack Roadmap
No ratings yet
Data Science Full Stack Roadmap
25 pages
Resume - Shrugal Tayal - IIITD
No ratings yet
Resume - Shrugal Tayal - IIITD
3 pages
A Z Cheatsheet Python DA
No ratings yet
A Z Cheatsheet Python DA
7 pages
Data Science Student Schedule
No ratings yet
Data Science Student Schedule
7 pages
Certified Professional Diploma in Data Science-1
No ratings yet
Certified Professional Diploma in Data Science-1
43 pages
Artificial-Intelligence-Syllabus-shshbs
No ratings yet
Artificial-Intelligence-Syllabus-shshbs
8 pages
syllabus
No ratings yet
syllabus
7 pages
Stat and Machine Learning Python PDF
No ratings yet
Stat and Machine Learning Python PDF
300 pages
Poly
100% (1)
Poly
108 pages
Data Science & AIML Coursework
No ratings yet
Data Science & AIML Coursework
10 pages
Data Science Machine Learning 17054
No ratings yet
Data Science Machine Learning 17054
27 pages
20191120122749-Data Science Certification Training
No ratings yet
20191120122749-Data Science Certification Training
4 pages
Aryan_Resume_6thJuly
No ratings yet
Aryan_Resume_6thJuly
2 pages
Artificial Intelligence 3171105 Lab Manual
No ratings yet
Artificial Intelligence 3171105 Lab Manual
38 pages
Data Science with Generative AI outline
No ratings yet
Data Science with Generative AI outline
12 pages
Alexander Daniel Roman.docx
No ratings yet
Alexander Daniel Roman.docx
3 pages
01-Nhan Dang Mau
No ratings yet
01-Nhan Dang Mau
39 pages
OnkarPramodKurle (3 0)
No ratings yet
OnkarPramodKurle (3 0)
7 pages
Essential n8n Playbook
From Everand
Essential n8n Playbook
Leandro Calado
No ratings yet
Python Advanced Programming: The Guide to Learn Python Programming. Reference with Exercises and Samples About Dynamical Programming, Multithreading, Multiprocessing, Debugging, Testing and More
From Everand
Python Advanced Programming: The Guide to Learn Python Programming. Reference with Exercises and Samples About Dynamical Programming, Multithreading, Multiprocessing, Debugging, Testing and More
Marcus Richards
No ratings yet
Bai Tap Word Form Lop 6
100% (1)
Bai Tap Word Form Lop 6
2 pages
Aesthetics and Style in The Songs of George Enescu - P. Boire (2003)
No ratings yet
Aesthetics and Style in The Songs of George Enescu - P. Boire (2003)
5 pages
Essay Writing
100% (1)
Essay Writing
18 pages
Individual Assignment - Resume and Cover Letter
No ratings yet
Individual Assignment - Resume and Cover Letter
6 pages
Presentation Basics OPTIX RTN (605, 610, 620)
100% (8)
Presentation Basics OPTIX RTN (605, 610, 620)
74 pages
5th Grade Spelling Words Week 01
100% (1)
5th Grade Spelling Words Week 01
10 pages
cbse_4
No ratings yet
cbse_4
17 pages
Week 1 - Speaking Task Assignment - How Much Do You Remember
No ratings yet
Week 1 - Speaking Task Assignment - How Much Do You Remember
5 pages
Guia Ingles
No ratings yet
Guia Ingles
13 pages
Localization and Contextualization
100% (1)
Localization and Contextualization
16 pages
GENIVI - HARMAN Shared State Rendering - TechBrief - 20180414
No ratings yet
GENIVI - HARMAN Shared State Rendering - TechBrief - 20180414
2 pages
Presentation Style
No ratings yet
Presentation Style
3 pages
JN0-664-Demo
No ratings yet
JN0-664-Demo
9 pages
ghasiram kotwal
No ratings yet
ghasiram kotwal
49 pages
MyMaor Weekly Tzav
No ratings yet
MyMaor Weekly Tzav
2 pages
Siemens PLC With TIA Portal V12 Prof SP1 en
No ratings yet
Siemens PLC With TIA Portal V12 Prof SP1 en
52 pages
PFREE - Emerging - Topic 3 and 4 - Week 3
No ratings yet
PFREE - Emerging - Topic 3 and 4 - Week 3
27 pages
@iptvturkiyebymetin1453 Lider-Server TK
No ratings yet
@iptvturkiyebymetin1453 Lider-Server TK
28 pages
Tutorial: Create An Excel Dashboard: Download The Example Dashboard
No ratings yet
Tutorial: Create An Excel Dashboard: Download The Example Dashboard
12 pages
Using Adverbs Appropriately
No ratings yet
Using Adverbs Appropriately
16 pages
Class XII Poet and Pancakes Q n A
No ratings yet
Class XII Poet and Pancakes Q n A
7 pages
Lecture 7 State Space Canonical Forms
No ratings yet
Lecture 7 State Space Canonical Forms
50 pages
Language Profile - Gondi Language
100% (1)
Language Profile - Gondi Language
7 pages
Paper 3 - Interlocutor Booklet
No ratings yet
Paper 3 - Interlocutor Booklet
4 pages
HTML Project
43% (7)
HTML Project
43 pages
QC Script
No ratings yet
QC Script
2 pages

How To Use NLP in Python A Practical Step-by-Step ExampleTo Find Out The In-Demand Skills For Data SC

Uploaded by

How To Use NLP in Python A Practical Step-by-Step ExampleTo Find Out The In-Demand Skills For Data SC

Uploaded by

www.justintodata.

How to use NLP in Python: a Practical Step-by-Step Example To find

Let’s dive right in.

The table of contents is below for your convenience.

Preparation: Scraping the Data

Step #1: Loading and Cleaning the Data

from collections import Counter

Now we have a dataset of 5 features and 2,681 rows.

Related article: Data Cleaning Techniques in Python: the Ultimate Guide

Step #2: Forming the Lists of Keywords

Below are our lists of keywords for tools coded in Python.

We get lists of keywords for skills by following a similar process as tools.

# hard skills/knowledge required.

For education level, we use a different procedure.

degree_dict = {'bs': 1, 'bachelor': 1, 'undergraduate': 1,

Step #3: Streamlining the Job Descriptions using NLP Techniques

The job_description feature in our dataset looks like this.

Tokenizing the Job Descriptions

Parts of Speech (POS) Tagging the Job Descriptions

Thanks to the NLTK, we can use this tagger with Python.

Below, we POS tag the list of keywords for tools as a demonstration.

from nltk import pos_tag

Stay patient! We only need to process them a little more.

Stemming the Words

We stem both the lists of keywordsand the streamlined job descriptions.

Lowercasing the Words

from nltk import pos_tag

Step #5: Matching the Keywords and the Job Descriptions

The Python code with more details is below.

Step #6: Visualizing the Results

The detailed Python code is below.

Top Tools In-Demand

# create the list of tools.

Top 50 Tools for Data Scientists

Top Skills In-Demand

# create the list of skills/knowledge.

Top 50 Skills for Data Scientists

Minimum Education Required

# create the list of degree.

Minimum Education Level for Data Scientists

You might also like