Using Word Embedding to Build a Job Search Engine

Introduction

The COVID-19 pandemic caused the unemployment rate to rise higher than it did during the Great Recession of 2008. With more people looking for work, it's important to have an efficient platform where users can search for jobs that match their unique skill set, company interests, and experience level. According to the 2017 Jobvite Recruiting Funnel Benchmark Report, career sites and job boards like LinkedIn and Indeed account for 46% of hires.

In this post, we'll compare a BM25 baseline against search built on word embeddings (FastText via flair) and BERT-based sentence embeddings (Sentence Transformers).

Encoder structure of BERT
BERT’s word masking approach

Data

We will use a dataset of job postings scraped from LinkedIn. I used phantombuster.com and a Python script (built on the Beautiful Soup and Selenium packages) to scrape the postings, then combined the results into one dataset.
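The scraping script itself isn't shown here; as a rough sketch, under the assumption that each scrape produces its own CSV export, combining them into the df_jobs.csv file read below might look like this (the scraped/jobs_*.csv paths are hypothetical):

import glob
import pandas as pd

# Combine the per-search CSV exports into a single dataset
# (hypothetical file layout; adjust the glob pattern to your own exports)
csv_files = glob.glob("scraped/jobs_*.csv")
df_jobs = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
df_jobs = df_jobs.drop_duplicates(subset=["jobTitle", "companyName", "jobLocation"])
df_jobs.to_csv("df_jobs.csv", index=False)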

import pandas as pd
import tqdm

df_jobs = pd.read_csv("df_jobs.csv")

# Append the structured fields (location, title, company, function, industries)
# to the free-text description so one text field carries everything we search over
for index in tqdm.tqdm(df_jobs.index):
    df_jobs.loc[index, 'jobDescription'] = (
        df_jobs['jobDescription'][index]
        + "\nThis job is located in " + df_jobs['jobLocation'][index] + ".\n"
        + "The title of this job is " + df_jobs['jobTitle'][index] + " at " + df_jobs['companyName'][index] + ".\n"
        + "The job function is: " + df_jobs['jobFunctions'][index] + ".\n"
        + "This job is in the following industries: " + df_jobs['jobIndustries'][index] + "."
    )

Methods

Before we build our search engine, let's build a BM25 baseline using the rank-bm25 package. We can use three sample queries to test both the baseline and the embedding-based search engine.
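Here are the three queries, collected into the query_list variable that the preprocessing code below expects (they're the same queries used later with the embedding models):

query_list = [
    "Medical assistant looking for work near New York City at a large hospital, 10 years of experience in patient care, including injections, CPR, and EKG testing",
    "Web developer with experience in html, javascript, python, and Git, looking for work at startup in San Francisco, have portfolio website",
    "New college grad with internship experience looking for software engineer position at large tech company"
]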

import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
# Read in a list of stopwords
stopwords_file = open("stoplist.txt", "r")
stopwords_list = stopwords_file.read().splitlines()

df_jobs['jobDescription'] = df_jobs['jobDescription'].astype(str)
jobs = df_jobs['jobDescription'].tolist()
# Keep the raw, untokenized descriptions so we can map BM25 results back to them
untokenized = jobs

# Tokenize on word characters (drops punctuation)
tokenized_df = []
for job in jobs:
    tokenized_df.append(nltk.RegexpTokenizer(r"\w+").tokenize(job))

# Convert everything to lowercase
tokenized_lower = []
for sent in tokenized_df:
    curr_sent = []
    for word in sent:
        curr_sent.append(word.lower())
    tokenized_lower.append(curr_sent)

# Lemmatize, passing the part-of-speech tag when WordNet supports it
lemmatizer = WordNetLemmatizer()
tokenized_lemm = []
for sent in tokenized_lower:
    curr_sent = []
    for word, tag in pos_tag(sent):
        postag = tag[0].lower()
        if postag not in ['a', 'r', 'n', 'v']:
            postag = None
        if not postag:
            lemma = word
        else:
            lemma = lemmatizer.lemmatize(word, postag)
        curr_sent.append(lemma)
    tokenized_lemm.append(curr_sent)

# Remove stopwords
tok_no_stop = []
for job in tokenized_lemm:
    curr_job = []
    for word in job:
        if word not in stopwords_list:
            curr_job.append(word)
    tok_no_stop.append(curr_job)

# Build the BM25 index
from rank_bm25 import BM25Okapi
bm25 = BM25Okapi(tok_no_stop)
# Run the queries through the same preprocessing as above (before querying the index)
# Tokenize on word characters
tokenized_query = []
for query in query_list:
    tokenized_query.append(nltk.RegexpTokenizer(r"\w+").tokenize(query))

# Convert everything to lowercase
q_lower = []
for sent in tokenized_query:
    curr_sent = []
    for word in sent:
        curr_sent.append(word.lower())
    q_lower.append(curr_sent)

# Lemmatize
lemmatizer = WordNetLemmatizer()
q_tokenized_lemm = []
for sent in q_lower:
    curr_sent = []
    for word, tag in pos_tag(sent):
        postag = tag[0].lower()
        if postag not in ['a', 'r', 'n', 'v']:
            postag = None
        if not postag:
            lemma = word
        else:
            lemma = lemmatizer.lemmatize(word, postag)
        curr_sent.append(lemma)
    q_tokenized_lemm.append(curr_sent)

# Remove stopwords
q_tok_no_stop = []
for query in q_tokenized_lemm:
    curr_query = []
    for word in query:
        if word not in stopwords_list:
            curr_query.append(word)
    q_tok_no_stop.append(curr_query)

# Get query results
results_list = []
for query in q_tok_no_stop:
    result = bm25.get_top_n(query, untokenized, n=15)
    results_curr = []
    for j in result:
        results_curr.append([j])
    results_list.append(results_curr)

# Get indices of returned jobs
indices_results = []
for results_15 in results_list:
    curr_result_indices = []
    for single_result in results_15:
        s_result = single_result[0]
        curr_result_indices.append(untokenized.index(s_result))
    indices_results.append(curr_result_indices)
from flair.data import Sentence

jobs = df_jobs['jobDescription'].tolist()
# The same three sample queries defined for the BM25 baseline above
queries = query_list

# Wrap every job description and query in a flair Sentence object
joined_list = jobs + queries
sentences_list = []
for item in joined_list:
    sentences_list.append(Sentence(item))
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
from flair.embeddings import TransformerWordEmbeddings
from flair.embeddings import TransformerDocumentEmbeddings
# Initialize the word embeddings
crawl_embedding = WordEmbeddings('crawl')
# Initialize the document embeddings
document_embeddings = DocumentPoolEmbeddings([crawl_embedding])
# Get the document embeddings
document_embeddings.embed(sentences_list)
import torch
import numpy as np
# Create a matrix to store our cosine similarity values.
# The columns will represent our queries and the rows will be
# job postings.
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
cosine_matrix = np.empty((0, len(queries)), float)

for job in range(len(jobs)):
    curr_row = []
    # The queries sit at the end of sentences_list, after all of the job postings
    for query in range(len(jobs), len(joined_list)):
        curr_similarity = cos(sentences_list[job].embedding, sentences_list[query].embedding)
        curr_row.append(curr_similarity.item())
    cosine_matrix = np.vstack((cosine_matrix, np.array(curr_row)))
# Find the 15 rows with the largest cosine similarity in each column
index_of_top_jobs_per_query = []
for column in cosine_matrix.T:
    curr_top_jobs = column.argsort()[-15:][::-1]
    index_of_top_jobs_per_query.append([curr_top_jobs])
list_of_fasttext_top_15 = list(np.array(index_of_top_jobs_per_query).flatten())

# Collect all the relevant jobs in a new data frame
df_jobs_relevant = df_jobs.iloc[list_of_fasttext_top_15, :]
df_jobs_relevant['jobDescription'] = df_jobs_relevant['jobDescription'].astype(str)
jobs_new = df_jobs_relevant['jobDescription'].tolist()
joined_list2 = jobs_new + queries

# Load the pretrained Sentence Transformers model
from sentence_transformers import SentenceTransformer
model_distilbert = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
# Get the sentence (document) embeddings
embeddings_distilbert = model_distilbert.encode(joined_list2, show_progress_bar=True, convert_to_tensor=True)
# Calculate cosine similarity of the DistilBERT embeddings
cosine_matrix2 = np.empty((0, len(queries)), float)
for job in range(len(jobs_new)):
    curr_row = []
    for query in range(len(jobs_new), len(joined_list2)):
        curr_similarity = cos(embeddings_distilbert[job], embeddings_distilbert[query])
        curr_row.append(curr_similarity.item())
    cosine_matrix2 = np.vstack((cosine_matrix2, np.array(curr_row)))

# Find the 15 rows with the largest cosine similarity in each column
index_of_top_jobs_sentence_trans = []
for column in cosine_matrix2.T:
    curr_top_jobs = column.argsort()[-15:][::-1]
    index_of_top_jobs_sentence_trans.append([curr_top_jobs])

Results

Let’s view some snippets of the top few results I got for each of our three sample queries.
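A minimal sketch for printing a short snippet of the top-ranked postings for each query, assuming the queries, df_jobs, and index_of_top_jobs_per_query objects built above (the same pattern works for indices_results from BM25 or index_of_top_jobs_sentence_trans from DistilBERT):

import numpy as np

# Print the three highest-ranked postings per query (sketch)
for q, top_indices in zip(queries, index_of_top_jobs_per_query):
    print("\nQuery:", q[:60], "...")
    for rank, idx in enumerate(np.array(top_indices).flatten()[:3], start=1):
        snippet = df_jobs['jobDescription'].iloc[int(idx)][:200]
        print(f"  {rank}. {snippet}...")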

DCG formula
Formula for Precision
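For reference, the two metrics behind those figures are typically defined as follows, where rel_i is the relevance label of the result at rank i and we score the top k results:

\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}

\mathrm{Precision@}k = \frac{\left|\{\text{relevant results in the top } k\}\right|}{k}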
Results Comparing Different Search Engine Approaches

What’s Next

Using Sentence Transformers, we can fine-tune the model that produces our BERT embeddings on a dataset of query-document pairs labeled for relevance, with a cosine similarity loss function. Since we measure query-document distance with cosine similarity, these labels should fall between 0 and 1, where 1 indicates a perfect match. If we have such ground-truth labels, we can also evaluate the search engine with the package's built-in functions for Mean Reciprocal Rank (MRR), Recall@k, and Normalized Discounted Cumulative Gain (NDCG). The Sentence Transformers documentation includes a tutorial on this training setup; a rough sketch follows.
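A minimal sketch of that fine-tuning step, assuming a small hand-labeled list of (query, job description, score) triples called train_triples (the example data below is made up):

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Hypothetical labeled pairs: (query, job description, similarity score in [0, 1])
train_triples = [
    ("web developer with javascript experience", "We are hiring a front-end engineer ...", 0.9),
    ("medical assistant in New York", "Seeking a data analyst for our finance team ...", 0.1),
]

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
train_examples = [InputExample(texts=[q, d], label=score) for q, d, score in train_triples]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Fine-tune the encoder so that cosine similarity matches the labels
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)

For evaluation, the package's InformationRetrievalEvaluator takes dictionaries of queries, corpus documents, and relevant-document IDs, and reports MRR@k, Recall@k, and NDCG@k on held-out data.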
