Using Word Embedding to Build a Job Search Engine

Dorsa Massihpour
10 min read · Dec 16, 2020

Introduction

The COVID-19 pandemic has pushed the unemployment rate higher than it rose during the Great Recession of 2008. Especially now, as more people are trying to find jobs, it’s important to have an efficient platform where users can search for jobs specific to their unique skillset, company interests, and experience level. According to the 2017 Jobvite Recruiting Funnel Benchmark Report, career sites and job boards like LinkedIn and Indeed account for 46% of hires.

Most job search engines today let users search through job postings with a few keywords and narrow the results with various filters, such as location and experience level. However, these engines often limit the scope of the search, especially since filters are usually restricted to pre-set options. What if we had a search engine that let users enter as much information as they wanted, even a summary of their resume, and still produced a meaningful ranking of job postings? In this post, we will create such a job search engine using word embeddings.

Our engine will be composed of two layers of semantic search using FastText embeddings and BERT document embeddings: the first layer uses FastText embeddings to retrieve the top job postings, and the second uses BERT embeddings for the retrieved postings to rank them in order of relevance.

Word Embeddings

The BERT model is one of the most powerful word embedding models available today. It can capture deep levels of complexity among words, such as polysemy (different meanings of the same word) and anaphora (references back to something introduced earlier in a text). This means the embedding for a word can change depending on its context. For example, in the phrases “the boy is reading a book” and “the family wanted to book a trip”, the word “book” has a different meaning in each case.
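
To see this in action with the tools we will use later, here is a quick sketch (separate from the search engine itself) that embeds both sentences with Flair’s TransformerWordEmbeddings wrapper around BERT and compares the two vectors for “book”; the model name and variable names are just for illustration:

import torch
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# Contextual BERT embeddings for individual words
bert_embedding = TransformerWordEmbeddings('bert-base-uncased')

sent1 = Sentence("the boy is reading a book")
sent2 = Sentence("the family wanted to book a trip")
bert_embedding.embed(sent1)
bert_embedding.embed(sent2)

# Pull out the vector for "book" in each sentence
book1 = [token.embedding for token in sent1 if token.text == "book"][0]
book2 = [token.embedding for token in sent2 if token.text == "book"][0]

# The similarity is less than 1 because the two contexts differ
print(torch.nn.functional.cosine_similarity(book1, book2, dim=0))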

BERT uses feed forward neural networks along with self-attention layers. It is a bidirectional model that can take in an entire text at once and learn the context of words from both the left and right directions at the same time.

Encoder Structure of BERT

Specifically, 15% of the words are masked and the model tries to predict these masked words based on the position and context of the known words. It then builds vector representations of each word, where even the same word in a different context has a different vector representation. The self-attention layers help the model keep track of the surrounding words that are most important in predicting the masked word.

BERT’s word masking approach
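
To see masked-word prediction in action, here is a minimal sketch using the fill-mask pipeline from the Hugging Face transformers library (an extra dependency that is not part of our search engine; the example sentence is just for illustration):

from transformers import pipeline

# Load BERT with its masked-language-modeling head
unmasker = pipeline('fill-mask', model='bert-base-uncased')

# BERT suggests context-appropriate fillers for the [MASK] token
for prediction in unmasker("The family wanted to [MASK] a trip."):
    print(prediction['token_str'], round(prediction['score'], 3))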

Since we have job postings, some with multiple paragraphs, we will need to use the Sentence BERT model, which uses siamese BERT networks and a pooling operation on the individual word embeddings in a text to create document embeddings.

The BERT model is computationally expensive, so it is not feasible to rely on it entirely for a search engine, which is why we will use fastText embeddings to first retrieve the most relevant job postings. FastText embeddings are trained with a skip-gram model, which predicts the surrounding words for a given word, enriched with character n-gram (subword) information.
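
We will load pre-trained fastText vectors through Flair rather than train our own, but to make the skip-gram idea concrete, here is a toy training run with gensim; the tiny corpus and the hyperparameters are made up for illustration:

from gensim.models import FastText

# A tiny made-up corpus of tokenized "job postings"
corpus = [["senior", "python", "developer", "remote"],
          ["registered", "nurse", "hospital", "new", "york"],
          ["data", "scientist", "machine", "learning", "san", "francisco"]]

# sg=1 selects the skip-gram objective; character n-grams are on by default
model = FastText(sentences=corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=10)

# Every word gets a vector built from the word and its character n-grams
print(model.wv["developer"][:5])
print(model.wv.most_similar("nurse", topn=3))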

Data

We will use a dataset of job postings scraped from LinkedIn. I used phantombuster.com and a Python script to scrape the postings and then combined them into one dataset. The Python script uses the Beautiful Soup and Selenium packages.

Here are a few sample rows from our dataset:

We have lots of information, and we want all of it included in the job description text that we will feed to our models, so we append fields such as the location, title, company, job function, and industry to the job description column using the following for loop:

import pandas as pd
import tqdm

df_jobs = pd.read_csv("df_jobs.csv")

for index in tqdm.tqdm(df_jobs.index):
    df_jobs.loc[index, 'jobDescription'] = (
        df_jobs['jobDescription'][index] + "\n"
        + "This job is located in " + df_jobs['jobLocation'][index] + ".\n"
        + "The title of this job is " + df_jobs['jobTitle'][index]
        + " at " + df_jobs['companyName'][index] + ".\n"
        + "The job function is: " + df_jobs['jobFunctions'][index] + ".\n"
        + "This job is in the following industries: " + df_jobs['jobIndustries'][index] + "."
    )

Let’s take a look at a sample job description:
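
A quick way to print one augmented description (the row index here is arbitrary):

# Inspect the augmented description for a single posting
print(df_jobs.loc[0, 'jobDescription'])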

Methods

Before we build our search engine, let’s build a BM25 baseline using the rank-bm25 package. We can use three sample queries to test our baseline and search engine against:
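
Here they are as a Python list (query_list is the name the baseline code below expects):

query_list = ["Medical assistant looking for work near New York City at a large hospital, 10 years of experience in patient care, including injections, CPR, and EKG testing",
              "Web developer with experience in html, javascript, python, and Git, looking for work at startup in San Francisco, have portfolio website",
              "New college grad with internship experience looking for software engineer position at large tech company"]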

To build the BM25 baseline, we need to do some preprocessing: tokenize on word characters (dropping punctuation), convert everything to lowercase, lemmatize, and remove stopwords.

import nltk
from nltk import pos_tag
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from rank_bm25 import BM25Okapi
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Read in a list of stopwords
with open("stoplist.txt", "r") as stopwords_file:
    stopwords_list = stopwords_file.read().splitlines()

df_jobs['jobDescription'] = df_jobs['jobDescription'].astype(str)
jobs = df_jobs['jobDescription'].tolist()

tokenizer = RegexpTokenizer(r"\w+")
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Tokenize on word characters, convert to lowercase,
    # lemmatize, and remove stopwords
    tokens = [word.lower() for word in tokenizer.tokenize(str(text))]
    lemmas = []
    for word, tag in pos_tag(tokens):
        postag = tag[0].lower()
        # WordNet only handles adjectives, adverbs, nouns, and verbs
        postag = postag if postag in ['a', 'r', 'n', 'v'] else None
        lemmas.append(lemmatizer.lemmatize(word, postag) if postag else word)
    return [word for word in lemmas if word not in stopwords_list]

# Preprocess the postings and build the inverted index
tok_no_stop = [preprocess(job) for job in jobs]
bm25 = BM25Okapi(tok_no_stop)

# Run the queries through the same preprocessing as the postings
q_tok_no_stop = [preprocess(query) for query in query_list]

# Get the top 15 postings for each query
results_list = []
for query in q_tok_no_stop:
    result = bm25.get_top_n(query, jobs, n=15)
    results_list.append([[j] for j in result])

# Get the indices of the returned jobs
indices_results = []
for results_15 in results_list:
    curr_result_indices = []
    for single_result in results_15:
        curr_result_indices.append(jobs.index(single_result[0]))
    indices_results.append(curr_result_indices)

We can later annotate these results to evaluate whether our proposed search engine produced better results than this baseline. For example, we can annotate each result per query on a scale of 0 (completely irrelevant job) to 4 (perfect job match).

Now let’s build our proposed search engine. We do not need to do much preprocessing of our data because the Python packages we will use, Flair and Sentence Transformers, take care of that step for us. First we load our dataset and create a list combining our job descriptions and queries. We then pass each job and query through Flair’s Sentence() function to get the right format for input into the fastText model.

from flair.data import Sentence

jobs = df_jobs['jobDescription'].tolist()
queries = ["Medical assistant looking for work near New York City at a large hospital, 10 years of experience in patient care, including injections, CPR, and EKG testing",
           "Web developer with experience in html, javascript, python, and Git, looking for work at startup in San Francisco, have portfolio website",
           "New college grad with internship experience looking for software engineer position at large tech company"]

joined_list = jobs + queries

# Wrap each job posting and query in a Flair Sentence object
joined = []
for item in joined_list:
    joined.append(Sentence(item))

Next, we can load a pre-trained fastText embedding model trained on data from Common Crawl and Wikipedia. We will use Flair’s DocumentPoolEmbeddings class, which averages the fastText word embeddings in a job posting to give us a document embedding for each posting.

from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

# Initialize the word embeddings (fastText vectors trained on web crawl data)
crawl_embedding = WordEmbeddings('crawl')
# Initialize the document embeddings (mean pooling over the word embeddings)
document_embeddings = DocumentPoolEmbeddings([crawl_embedding])
# Get the document embeddings for every posting and query
document_embeddings.embed(joined)

We can now compute the cosine similarity between all of the job posting embeddings and the query embeddings using PyTorch, and retrieve the indices of the top 15 postings for each of our three sample queries.

import torch
import numpy as np

# Cosine similarity between two 1-D embedding vectors
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)

# Create a matrix to store our cosine similarity values:
# one row per job posting, one column per query
cosine_matrix = np.empty((0, len(queries)), float)

for job in range(len(jobs)):
    curr_row = []
    for query in range(len(jobs), len(joined)):
        curr_similarity = cos(joined[job].embedding, joined[query].embedding)
        curr_row.append(curr_similarity.item())
    cosine_matrix = np.vstack((cosine_matrix, np.array(curr_row)))

# Find the 15 rows with the highest cosine values in each column
index_of_top_jobs_per_query = []
for column in cosine_matrix.T:
    curr_top_jobs = column.argsort()[-15:][::-1]
    index_of_top_jobs_per_query.append([curr_top_jobs])
list_of_fasttext_top_15 = list(np.array(index_of_top_jobs_per_query).flatten())

Next, we can move on to ranking these top job postings using BERT sentence/document embeddings. We will use the Sentence Transformers package and load a pre-trained distilBERT model, which is a computationally faster variation of BERT. After we get the embeddings, we can again calculate the cosine similarity between the queries and jobs to get the ranking of job postings.

from sentence_transformers import SentenceTransformer

# Collect all the relevant jobs in a new data frame
df_jobs_relevant = df_jobs.iloc[list_of_fasttext_top_15, :].copy()
df_jobs_relevant['jobDescription'] = df_jobs_relevant['jobDescription'].astype(str)
jobs_new = df_jobs_relevant['jobDescription'].tolist()
joined_list2 = jobs_new + queries

# Load the pretrained model
model_distilbert = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
# Get the document embeddings
embeddings_distilbert = model_distilbert.encode(joined_list2, show_progress_bar=True, convert_to_tensor=True)

# Calculate cosine similarity of the BERT embeddings
cosine_matrix2 = np.empty((0, len(queries)), float)
for job in range(len(jobs_new)):
    curr_row = []
    for query in range(len(jobs_new), len(joined_list2)):
        curr_similarity = cos(embeddings_distilbert[job], embeddings_distilbert[query])
        curr_row.append(curr_similarity.item())
    cosine_matrix2 = np.vstack((cosine_matrix2, np.array(curr_row)))

# Rank the retrieved postings for each query
index_of_top_jobs_sentence_trans = []
for column in cosine_matrix2.T:
    curr_top_jobs = column.argsort()[-15:][::-1]
    index_of_top_jobs_sentence_trans.append([curr_top_jobs])

Results

Let’s view some snippets of the top few results I got for each of our three sample queries.

I have also gone through and annotated the results returned by our search engine, the BM25 baseline, and the original fastText-only ranking on the 0–4 scale described above. Let’s evaluate our search engine against these two baselines using two common metrics: Normalized Discounted Cumulative Gain@10 (NDCG@10) and precision@10.

NDCG@10 evaluates the quality of rankings by looking at the relevance labels (‘rel’ in the formula below) for the top 10 results and first computing the Discounted Cumulative Gain@10 (DCG@10):

DCG formula

Then we order all results (or as many as we have labels for) in descending order according to their labels, so that the best results come first, and use the same formula to calculate what is called the Ideal Discounted Cumulative Gain@10 (IDCG@10). Dividing DCG@10 by IDCG@10 gives the NDCG@10.

Precision@10 represents the proportion of the top 10 results that are relevant. In our case, I considered jobs labelled with a 3 or 4 as relevant.

Formula for Precision
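
To make both metrics concrete, here is a small sketch that computes NDCG@10 and precision@10 from a list of 0–4 relevance labels; it uses the common exponential-gain form of DCG, and the example labels are made up:

import numpy as np

def dcg_at_k(labels, k=10):
    # DCG@k = sum of (2^rel - 1) / log2(rank + 1) over the top k results
    labels = np.asarray(labels[:k], dtype=float)
    ranks = np.arange(1, len(labels) + 1)
    return np.sum((2 ** labels - 1) / np.log2(ranks + 1))

def ndcg_at_k(labels, k=10):
    # Normalize by the DCG of the ideal (descending-label) ordering
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

def precision_at_k(labels, k=10, relevant=3):
    # Fraction of the top k results labelled as relevant (3 or 4)
    return np.mean(np.asarray(labels[:k]) >= relevant)

# Made-up relevance labels for the top 10 results of one query
example_labels = [4, 3, 0, 2, 4, 1, 3, 0, 0, 2]
print(ndcg_at_k(example_labels), precision_at_k(example_labels))
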
Results Comparing Different Search Engine Approaches

As expected, both the fastText-only approach and our two-layered search engine consistently outperformed BM25 on NDCG@10 and precision@10. Looking at precision@10, our two-layered approach also beat the fastText-only ranking, which suggests that the BERT embeddings contributed valuable contextual information beyond what the fastText embeddings captured.

According to NDCG@10, our search engine performed better on the second query, but the fastText-only ranking performed better on the first and third queries. This is an interesting result, given that BERT embeddings usually outperform static word embeddings like fastText, which cannot distinguish between different contexts of a word. It is important to note that our dataset was relatively small, and the NLI and STS-B datasets that the Sentence BERT model was trained on contain a different type of data (image captions, news headlines, and user forum posts). Using a larger jobs dataset and fine-tuning the BERT model with plenty of labelled training data would likely improve its performance.

What’s Next

Using Sentence Transformers, we can fine-tune the model behind our BERT embeddings on a dataset of labelled query-job pairs with a cosine similarity loss function. Since we measure document-query distance using cosine similarity, these labels must be between 0 and 1, where 1 indicates a perfect match. If we have such ground-truth labels for our data, we can also evaluate our search engine using the package’s built-in functions to calculate Mean Reciprocal Rank (MRR), Recall@k, and Normalized Discounted Cumulative Gain (NDCG). The Sentence Transformers documentation walks through this training setup.
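
As a rough sketch of what that fine-tuning setup could look like with Sentence Transformers (the training pairs and the output path here are invented placeholders):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# Invented (query, job description) pairs with 0-1 relevance labels
train_examples = [
    InputExample(texts=["Web developer with javascript experience",
                        "We are hiring a front-end developer (JavaScript, React)..."], label=0.9),
    InputExample(texts=["Medical assistant near New York City",
                        "Seeking a warehouse operations manager in Dallas..."], label=0.1)]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# Fine-tune so that the cosine similarity of the embeddings tracks the labels
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          warmup_steps=10,
          output_path='finetuned-job-search-model')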

We can also experiment with the different pre-trained models available through the Sentence Transformers library. These include other model types, such as RoBERTa and XLNet, with different numbers of hidden layers and parameters, trained on different datasets. The newly released Longformer model has shown strong performance on longer sequences of text, such as our job postings. If you have access to a machine with plenty of memory or a GPU, this would be a great model to test out.
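
For example, a sketch of turning a long posting into a single document embedding with the allenai/longformer-base-4096 checkpoint from the transformers library might look like this (mean pooling is just one reasonable choice, not the only one):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/longformer-base-4096')
model = AutoModel.from_pretrained('allenai/longformer-base-4096')

def embed_document(text):
    # Longformer handles sequences of up to 4096 tokens
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=4096)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one document vector
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vector = embed_document(df_jobs.loc[0, 'jobDescription'])
print(vector.shape)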
