What is Semantic Search?

Posted May 21st, 2015

Semantic Search: The new trend in CV/resume search technology

The promise of this technology is very alluring. You no longer need to be an expert in crafting complex and arcane Boolean queries; now you can just write down roughly what you mean and get the computer to work out what you really want to see returned. Not only does this save a considerable amount of time and effort, it also spares each recruiter from having to learn a complex Boolean query syntax and a set of search strategies and tricks that used to come only from hard experience.

The promise that semantic search technologies provide is easier, faster and more accurate searches for "free." Sounds too good to be true? In some cases, the promises are little more than hot air from tech companies' marketing departments. But there is a group of technologies which, when correctly applied, can give significant improvements to the search experience, both in ease of generating queries and in ease of learning/training.

This article, by DaXtra CTO Steve Finch, seeks to separate the wheat from the chaff when it comes to the cluster of technologies which together form "Semantic Search."

A Brief History of Candidate Search

Ever since computers have been applied to processing data for hundreds of thousands of candidates, the problem of finding particular candidates on the basis of their skills, knowledge and experience has been of primary concern.

Coded Search

Early attempts to address this problem involved coding candidates with their skills and experience. Most first-generation recruitment management systems allowed recruiters to code their candidates from a list of skills stored in the system.  When done well, this is a very powerful search tool. The main problem is that manual coding of candidates is very rarely done well.

The reasons for this are not hard to find:

  1. Lack of time: Recruiters rarely have time to code candidates at all if they are working at maximum efficiency on winning placements.
  2. Lack of expertise and consistency: Even where they do code candidates, recruiters tend to do so incompletely and inconsistently. It is natural to be focussed on the particular skills you are placing the candidate for, for example, and to ignore (or not even see) skills that are not relevant to the placement at hand.  Also finding the right skill code in a list of thousands is difficult. Inexperience at this skilled data-librarian type task leads to many miscodings and duplicate or inappropriate codes being entered into skills libraries.
  3. Lack of maintenance of the code lists: Skills lists need to be maintained and expanded as the skills of candidates change.  If this is not done regularly then either new skills relevant to the business have no codes (and therefore cannot be coded or searched for), or without active management multiple codes get added for the same (or related) skills.  Most often, without active maintenance, a combination of the two occurs with some highly relevant skills being omitted, and many duplicate entries polluting the list making it harder to find relevant candidates.
  4. Lack of reward:  Recruiters are very rarely rewarded for coding candidates, so, being human, they tend to spend the minimum amount of effort they can get away with on this task.

The end result of all this is a large set of very sketchily coded candidates that results in an inefficient search process with many relevant candidates being missed. It's also hard or impossible to find candidates with any skill that is not in the skills library.

This is not to say that coded search, when it's done well, is not a very useful tool for any CRM system. However, in our experience at DaXtra in dealing with hundreds of agencies, coding is very rarely done well.
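To make the failure mode concrete, here is a minimal sketch of coded search in Python. The skill codes, candidate names and the `coded_search` helper are all invented for illustration and do not describe any particular recruitment system. It shows how a duplicate code and an uncoded skill both cause relevant candidates to be missed.

```python
# Hypothetical candidate codings (illustrative data only).
candidates = {
    "alice": {"JAVA", "SQL"},
    "bob": {"JAVA_PROG"},   # a duplicate code for the same skill
    "carol": {"SQL"},       # her Java experience was never coded
}


def coded_search(required_codes, candidates):
    """Return candidates whose assigned codes include every required code."""
    required = set(required_codes)
    return sorted(
        name for name, codes in candidates.items() if required <= codes
    )


# Only alice is found, although bob and carol may be just as relevant.
java_hits = coded_search(["JAVA"], candidates)
print(java_hits)
```

The search itself is trivial and exact; all of the weakness lies in the quality of the codes that were entered by hand.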

Full Text Search

Coded search has the drawback that the codes have to be created and associated with candidates by recruiters. The other technologies discussed here do not share this drawback. Full Text Search systems allow users to construct complex queries, essentially searching for words that actually appear in the text the candidates themselves have written. This has the advantage that, since computers can automatically index resume text without any action from recruiters, it's available for searching whenever a candidate's resume is loaded onto the CRM.

Using Full Text Search, recruiters can craft queries which, in theory, pull back the candidates they want to see simply by finding candidates who write certain words in their CVs or resumes.  However, although this is a very powerful tool, the complex Boolean syntax required to craft accurate queries is notoriously difficult to learn. Even once it has been learned, queries which are both accurate and comprehensive require a great deal of time and experience to get right. Indeed, crafting Boolean queries is more of an art than a science. These queries tend either to return huge numbers of candidates, most of which are irrelevant to the task at hand and need to be laboriously filtered out, or to return very few, missing many relevant candidates.
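As a rough illustration of what happens behind a Boolean query, the sketch below builds a tiny inverted index in Python and answers two queries with set operations. The resume snippets, the tokenizing pattern and the `build_index` helper are assumptions made for this example; production search engines are far more elaborate.

```python
import re
from collections import defaultdict

# Illustrative resume snippets; the ids and text are made up.
resumes = {
    1: "Java developer with SQL and Spring experience",
    2: "Barista trained in Java coffee roasting",
    3: "Python programmer, some SQL",
}


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9.#+]+", text.lower()):
            index[word].add(doc_id)
    return index


index = build_index(resumes)

# java AND sql  ->  set intersection
java_and_sql = index["java"] & index["sql"]

# (java OR python) AND NOT coffee  ->  union, then set difference
java_or_python_not_coffee = (index["java"] | index["python"]) - index["coffee"]

print(java_and_sql, java_or_python_not_coffee)
```

Note that a bare query for "java" would also match resume 2, the barista, which is why the second query needs an explicit NOT clause.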

Some of the problems of full text search are:

  1. Synonyms: Multiple ways of saying the same thing: To write a good query you have to know all the ways candidates express their skills and not just the way you, or the requirement text, might express them. Would a candidate write ".NET" or "dotnet?" Would they write "VMware" or "VM-ware?"  "Software Developer" or "Programmer?" It takes time and experience to learn all the different ways that skills or job titles can be written.
  2. Homonyms: One word, multiple meanings:  Sometimes a single word can have multiple meanings depending on the context in which it occurs. For example, the word "Java" can be an island in Indonesia, a type of coffee or a programming language. The word "Sail" has a general meaning relating to boats, but not in the context "Full Sail University." "MD" can have a multitude of meanings (Managing Director, a medical qualification or a US state) depending on its context. All this means that when crafting queries, recruiters might find many unexpectedly irrelevant results returned because words are being used with unanticipated meanings.
  3. Difficult concepts to express: Some concepts are very hard to express in a full text search system. For example, having financial industry experience might need to be expanded to a query requiring the candidate to have worked in one of hundreds or thousands of financial institutions, because candidates often do not choose to write something like "I have financial industry experience."
  4. Impossible concepts to express: Even worse is trying to express a concept like "at least two years of experience in X."  Such concepts are impossible to accurately encode in Boolean queries unless this information is explicitly encoded in the resume, which is very rare, to say the least!

These difficulties are not insurmountable. A skilled and experienced recruiter can craft queries which return large numbers of relevant results. A seasoned recruiter will know what cannot be encoded in Boolean queries and will therefore have to be ascertained manually. However, reaching such a level of skill takes a lot of time and experience, and recruiters often do not have the time or inclination to learn to excel in this task. So something easier is required.

Ranked Retrieval

A variant of full text search is Natural Language Search. This is similar to Full Text Search, but rather than requiring the learning of the complex Boolean query syntax, a computer analyses a plain text query a user writes into a query box and attempts to construct a query automatically.

Since this is very difficult to do well, the computer will usually err on the side of returning too many results, but will attempt to rank these results in order of relevance to the query. It does this by paying attention to the context in which terms appear in documents, the number of times they appear and how rare a term is — rare terms being more informative than common ones — all other things being equal.
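The weighting described above, where frequent matches count for more and rare terms are more informative, is commonly implemented as some variant of TF-IDF. The following Python sketch is one simple variant, with the smoothing constant, the sample CVs and the `rank` function chosen purely for illustration; it does not model the context in which terms appear.

```python
import math
import re
from collections import Counter

# Illustrative CV text; the ids and content are made up.
docs = {
    "cv_a": "java developer java architect sql",
    "cv_b": "project manager with some sql",
    "cv_c": "sql database administrator",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Score each document by summed TF-IDF of the query terms, best first."""
    counts_by_doc = {d: Counter(tokenize(t)) for d, t in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, counts in counts_by_doc.items():
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents mention the term.
            df = sum(1 for c in counts_by_doc.values() if term in c)
            if df == 0:
                continue
            idf = math.log(n_docs / df) + 1.0  # rare terms weigh more
            score += counts[term] * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank("java sql", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 2))
```

Every document is returned because each mentions "sql", but the one that also mentions the rarer term "java" is ranked first.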

Ranked Retrieval systems have been around for many years. They form the basis of the well-known web search engines Bing and Google, as well as being embedded in all modern recruitment CRM systems.

Although Ranked Retrieval systems do not require users to learn the complex Boolean syntax of Full Text Search systems, they do sacrifice some of the power that Boolean queries provide. Consequently, relevant candidates can be missed either by not being returned at all, or by being so far down the list of returned candidates that they are not looked at.

Semantic Search

The shortcomings of existing search technologies leave us grasping for a better way of searching. And into this fertile field has come a set of technologies that can be broadly and informally described as "Semantic Search Technologies." These technologies claim to address some of the shortcomings of Full Text Search and Coded Search detailed above. The goal is to make it easier to write more powerful queries which return all and only the relevant candidates in your database.

"Semantic" means "of or relating to meaning," according to the dictionary. Defining "meaning" is a very large subject, one that philosophers have studied from Socrates to the present day. Predictably, we are still a long way from semantic search technologies that solve the general problem of identifying what we really mean. There are, however, many distinct technologies that can help us get a little closer to finding what we want more accurately, faster and with less learning required.

  1. Term Correction.  When I mistype a word in a query in Google, it sometimes seems that Google knows what I want to see better than I do!  Google can do this because it knows, from its logs and indexes, what people most frequently search for. If it finds queries containing terms that are rare, or not in its dictionaries, then Google can apply algorithms which quite accurately guess the intended word(s) from the mistyped ones. Similar algorithms are used in spelling correction systems in word processors, and these systems genuinely improve the search experience.  Conversely, when documents are indexed, they can also be analyzed for spelling and tokenization errors, and appropriate corrections made to them as they are indexed.
  2. Term Expansion.  Also called query expansion. This is where synonyms and closely related terms from a thesaurus are added to the terms explicitly mentioned in the query, so that candidates who say the same thing in a slightly different way from the original query are also returned.  For example, "Chief Financial Officer," "CFO" and "Financial Director" may all be closely related terms.  Ordinarily, a query for any one of these would not return the others, but in a semantic search the variants would be added to the query, usually automatically, so that all variants are returned. Although sometimes this works well, we have found that it often decreases the accuracy of the returned results. One reason for this is the ambiguity of language (see Term Disambiguation below), where the same tokens can mean completely different things.  For example, "MD" might usually mean "Managing Director" in the UK, but in the US it is far more likely to mean either a medical doctor or the state of Maryland. If you have "Director" as a synonym for "Adobe Director," don't be surprised if your queries for web developers start returning a lot of managers. There are ways to optimize term expansion, but these usually rely on a deep linguistic and statistical analysis of your particular data to see, for example, which terms "MD" usually correlates with in your data set.
  3. Term Clustering/Correlation techniques.  These techniques apply complex statistical vector algorithms to automatically identify correlated terms.  They go by names like "Latent Semantic Analysis," "Bayesian Analysis," "Singular Value Decomposition" and many similar names.  Although the details of the algorithms vary greatly, the idea is always to identify which terms tend to occur together, and to use this information to build statistical queries which hopefully behave like a better version of term expansion. The problem with this approach is that although it sometimes works quite well, it can often infer spurious correlations which decrease the accuracy of the results. Because of the complexity of these algorithms, it is often impossible to explain why a certain candidate was returned and another was not.
  4. Term Disambiguation.  The word "Director" can mean senior manager, or it could be the name of a software package or a specialist in making movies. The acronym "CNA" can refer to a nurse or an IT specialist, among others. "BA" can be a degree, an airline code, a company, a Business Analyst, and so on. Matching one sense of a word when we mean another decreases the accuracy of our return set, and conversely working out which sense we actually mean increases the overall accuracy of our return set. The fact that many terms are inherently ambiguous is one of the main confounding problems when applying Term Expansion techniques.
  5. Term Inference and Selection.  Another technology which has been deployed for some time in some search systems is term inference and selection. This is where terms related to the query are shown or suggested to the user, either as they are constructing their query, or after the query has been run and the results analyzed. In Google, for example, as you type your query, Google attempts to guess what you are searching for and suggests multiple related terms, or entire queries, that you may wish to search for. This is more an application of the Term Expansion and Term Clustering techniques above, but it can allow users to select or deselect additional terms during query construction, thereby speeding it up. Some technologies allow users to navigate through returned results by clustering terms after the query has been run, rather than during the query construction process.
  6. Contextual Validation. Normally, when we put a skill or job title in a query, we intend the candidates returned to have that skill or job title (or related skills or titles). Often, however, this is not the case if we rely, as most search systems do, on simply matching the terms in the query to terms in the text of the CV. For example, a search for "Finance Director" will most likely return many more people who "reported to the Finance Director" than actual Finance Directors. The same logic applies to excluding from consideration the job titles of referees, words in candidate addresses, and so on. Analyzing resumes using a parser to accurately identify which terms the candidate is using to refer to their skills, and which terms are merely referring to incidental context, can be expected to increase the accuracy of the return set.
  7. Linguistic Analysis (Resume Parsing). Extending the idea of Contextual Validation, if we have an accurate parser available to us then we can make extensive use of the information that it identifies from each resume. For example, some parsers will identify the key skills of each candidate and attribute to each a score, or an estimate of the extent of experience. This information is an excellent means of identifying which candidates not only have a particular skill but have recent and/or extensive experience of using it. The highly detailed semantic information extracted by modern parsers supports much more powerful search algorithms than general text search alone.
  8. Ranking. Perhaps one of the most powerful methods for exploiting semantic information relates to the order in which results are presented to the user. In general, whether a candidate is relevant to a query is rarely an all-or-nothing decision; rather, some candidates are more suitable for a position than others. All of the above technologies can be combined in sophisticated ranking algorithms which aim to present the most relevant candidates at the top of the list of returned candidates, while returning as many potentially relevant candidates as possible for the truly dedicated recruiter to filter manually.
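Items 6 and 7 in the list above can be sketched in a few lines of Python. The `ParsedResume` structure, its field names and the sample candidates are hypothetical stand-ins for whatever a real resume parser produces; the point is only to contrast a plain text match with a search over parsed fields.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedResume:
    """Hypothetical parser output; real parsers expose richer structures."""
    name: str
    raw_text: str
    titles_held: list = field(default_factory=list)
    skill_years: dict = field(default_factory=dict)


resumes = [
    ParsedResume(
        name="Dana",
        raw_text="Finance Director at Acme. 6 years of SAP.",
        titles_held=["Finance Director"],
        skill_years={"SAP": 6.0},
    ),
    ParsedResume(
        name="Eli",
        raw_text="Accountant. Reported to the Finance Director. Used SAP briefly.",
        titles_held=["Accountant"],
        skill_years={"SAP": 0.5},
    ),
]


def text_match(phrase, resumes):
    """Plain full-text match: any mention of the phrase counts."""
    return [r.name for r in resumes if phrase.lower() in r.raw_text.lower()]


def title_match(title, resumes):
    """Contextual validation: match only titles the candidate actually held."""
    wanted = title.lower()
    return [r.name for r in resumes if wanted in (t.lower() for t in r.titles_held)]


def min_experience(skill, years, resumes):
    """Use parsed experience estimates to express 'at least N years of X'."""
    return [r.name for r in resumes if r.skill_years.get(skill, 0.0) >= years]


print(text_match("Finance Director", resumes))   # both candidates mention it
print(title_match("Finance Director", resumes))  # only the actual director
print(min_experience("SAP", 2, resumes))
```

The last function also shows how a concept like "at least two years of experience in X," which cannot be expressed in a Boolean text query, becomes a simple filter once a parser has estimated experience per skill.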

The main thing to take away from this analysis is that there are a number of different types of technology available that can improve the user's search experience. These types of technology are not perfect and, incorrectly applied, some approaches (especially term expansion) can harm search usability and experience as often as they help. But if these technologies are correctly combined into a single search system, you can get search tools that ease the process of constructing queries and navigating through result sets to find the best matches for your query. None of this, however, is a search panacea. Although these technologies are an aid to search, under the hood they still construct complex Boolean queries, and the underlying technology of search systems, the inverted index, has not changed in principle for 40 years.
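To illustrate the point that these aids still end up producing Boolean queries, the Python sketch below applies term correction and term expansion to a short list of typed terms and emits a Boolean query string. The vocabulary, the thesaurus, the similarity cutoff and the output syntax are all assumptions made for this example; the only library call used is the standard library's `difflib.get_close_matches`.

```python
import difflib

# Hypothetical vocabulary and thesaurus; real systems derive these from data.
VOCABULARY = ["cfo", "financial director", "chief financial officer", "java", "sql"]
THESAURUS = {
    "cfo": ["chief financial officer", "financial director"],
}


def correct_term(term, vocabulary=VOCABULARY):
    """Replace an unknown term with its closest vocabulary entry, if any."""
    term = term.lower()
    if term in vocabulary:
        return term
    close = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return close[0] if close else term


def expand_term(term, thesaurus=THESAURUS):
    """Return the term plus any synonyms listed for it."""
    return [term] + thesaurus.get(term, [])


def to_boolean_query(terms):
    """Correct and expand each term, then AND the OR-groups together."""
    groups = []
    for term in terms:
        variants = expand_term(correct_term(term))
        quoted = " OR ".join(f'"{v}"' for v in variants)
        groups.append(f"({quoted})")
    return " AND ".join(groups)


# The user types "CFO" and the misspelling "jva".
query = to_boolean_query(["CFO", "jva"])
print(query)
```

The generated string is exactly the kind of query a skilled recruiter would otherwise write by hand, and it inherits the same risks: a careless thesaurus entry widens the OR group and lets irrelevant candidates in.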

— By Steve Finch, CTO, DaXtra Technologies

 

Tags: DaXtra Blog, semantic search