Query expansion is a technique in which search engines augment the query with words related to the words of the original query. For example, a query like “good blogs” might be changed to “(good OR excellent OR interesting) (blogs OR blogging OR weblogs)”. This is similar to stemming, but goes one step further, in that even words that do not share a stem with the query words might be added to the query for better relevance. In technical terms, query expansion improves recall, and if it involves spelling correction as well, it may also improve precision. This retrieval technique is sometimes also called query reformulation or query refinement.
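To make the “good blogs” example concrete, here is a minimal sketch of dictionary-based expansion. The RELATED_WORDS table and the expand_query function are my own illustrative names, not anything a real search engine exposes, and a production system would draw on far larger resources.

```python
# Hypothetical related-word dictionary, made up for this example.
RELATED_WORDS = {
    "good": ["excellent", "interesting"],
    "blogs": ["blogging", "weblogs"],
}


def expand_query(query, related=RELATED_WORDS):
    """Rewrite each term as an OR-group of itself and its related words."""
    groups = []
    for term in query.lower().split():
        alternatives = [term] + related.get(term, [])
        if len(alternatives) == 1:
            # No related words known: keep the term unchanged.
            groups.append(term)
        else:
            groups.append("(" + " OR ".join(alternatives) + ")")
    return " ".join(groups)


print(expand_query("good blogs"))
# (good OR excellent OR interesting) (blogs OR blogging OR weblogs)
```

Terms that are missing from the dictionary pass through untouched, so the expanded query never matches fewer documents than the original one.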
A recent patent awarded to Google (originally filed in January of 2009, awarded in September 2011) goes even further, by changing the ranking algorithm to use related queries as part of the retrieval process. The patent was awarded to Simon Tong, Mark Pearson, and Google co-founder Sergey Brin.
Here is what the patent starts with:
Systems and methods that improve search rankings for a search query by using data associated with queries related to the search query are described. In one aspect, a search query is received, a related query related to the search query is determined, an article (such as a web page) associated with the search query is determined, and a ranking score for the article based at least in part on data associated with the related query is determined. Several algorithms and types of data associated with related queries useful in carrying out such systems and methods are described.
What this means is that during a search, a composite ranking score based on several queries is generated and the results are ranked by this composite score. This is information retrieval at its best!
The patent mentions the use of a related query database. This database stores, for each query, the queries related to it, so the related queries for whatever the user entered can be looked up. There could be billions of queries in this database. How are related queries found?
…for the query “infinity auto,” relationships to queries “infiniti,” “luxury car,” “quality luxury car,” and “Japanese quality luxury car” may be defined if a user inputs these queries immediately following the initial query “infinity auto.”
So if a user searching for “infinity auto” is not satisfied by the results or wants to broaden the search, the user may re-enter a reformulated query in the same browser session to something like “luxury car”. If this reformulation happens several times through several user sessions, the related query database will contain “luxury car” as a related query to “infinity auto”.
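Here is a rough sketch of how such a database could be built from session logs. The session format, the build_related_queries name, and the min_count threshold are all my assumptions; the patent itself also allows a relationship that was seen only once.

```python
from collections import Counter, defaultdict


def build_related_queries(sessions, min_count=2):
    """Map each query to the queries that users typed immediately after it.

    sessions: list of query lists, one list per user session, in input order.
    min_count: hypothetical threshold; a follow-up query must be seen at
               least this many times before it is stored as related.
    """
    pair_counts = Counter()
    for queries in sessions:
        # Pair every query with the one entered right after it.
        for first, following in zip(queries, queries[1:]):
            if first != following:
                pair_counts[(first, following)] += 1

    related = defaultdict(dict)
    for (first, following), count in pair_counts.items():
        if count >= min_count:
            related[first][following] = count
    return dict(related)


sessions = [
    ["infinity auto", "luxury car"],
    ["infinity auto", "infiniti"],
    ["infinity auto", "luxury car", "quality luxury car"],
]
related = build_related_queries(sessions)
print(related)
# {'infinity auto': {'luxury car': 2}}
```

With the default threshold only “luxury car” survives, because it followed “infinity auto” in two sessions. Lowering min_count to 1 would also keep “infiniti” and “quality luxury car”.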
The patent mentions the following ways of finding related queries:
Examples of related queries include having been input as consecutive search queries by users previously (whether once or multiple times), queries input by a user within a defined time range (e.g., 30 minutes), a misspelling relationship, a numerical relationship, a mathematical relationship, a translation relationship, a synonym, antonym, or acronym relationship, or other human-conceived or human-designated association, and any computer- or algorithm-determined relationship…
Consecutive search queries are what I described above with the “infinity auto” example.
Misspelling, synonym, antonym, and acronym relationships are self-explanatory. My best guess is that a translation relationship links queries that are translations of each other. Note that this need not involve different languages. See this paper by Google in which they use concepts of machine translation for query expansion.
(Note that in typical query expansion, there is no related query database. Instead, there are dictionaries, which could be word to word or phrase to phrase.)
Once the related query database is complete, it is time to use it for retrieving search results for a query! They mention several ways of doing this.
In some embodiments, determining an article associated with the search query may comprise determining that the article is associated with both the search query and the related query….
The related query signal function, is a set of instructions processed by the related query processor, determines a weighted value for each document in the initial search result depending upon the number of times other users have previously clicked or otherwise selected the particular document as a part of the initial search result, and upon the number of times other users have previously clicked or otherwise selected the particular document as part of search results for other queries related to the search query….
….a sixth example of a related query function illustrates the use of a ranking score previously generated by the search engine for all queries, “Score (q, d);” and a ranking score previously generated by the search engine for a particular document “d” in all related queries, “Score (q’_m, d);”
So, they count clicks on a result for the original query, and also for a few related queries. The final score of the result is a weighted combination of these click counts. As they say later on, they can also do a full (cached) retrieval for the original query AND the related queries and then take a weighted combination of the scores that the algorithm outputs. (The weight could be the strength of the relationship between a query and a related query. This weight could simply be calculated from the semantic distance between the two queries, or from more sophisticated models based on click behavior.)
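A minimal sketch of the click-based variant might look like this. The click_score function, the shape of the click table, the weights, and the .example document names are all invented for illustration; the patent describes several such functions and does not prescribe this exact one.

```python
def click_score(doc, query, related_weights, clicks):
    """Weighted click count for doc over the query and its related queries.

    related_weights: {related_query: weight}, hypothetical relationship strengths.
    clicks: {(query, doc): click count}, a hypothetical click-log summary.
    """
    # Clicks on the document for the original query count in full.
    score = clicks.get((query, doc), 0)
    # Clicks for related queries are discounted by the relationship strength.
    for related_query, weight in related_weights.items():
        score += weight * clicks.get((related_query, doc), 0)
    return score


clicks = {
    ("infinity auto", "infiniti-cars.example"): 10,
    ("luxury car", "infiniti-cars.example"): 40,
    ("infinity auto", "infinity-audio.example"): 12,
}
weights = {"luxury car": 0.5, "infiniti": 0.8}

print(click_score("infiniti-cars.example", "infinity auto", weights, clicks))   # 30.0
print(click_score("infinity-audio.example", "infinity auto", weights, clicks))  # 12.0
```

On the original query alone the audio page has slightly more clicks (12 versus 10), but the clicks the car page collected for “luxury car” push it ahead once related queries are counted.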
Let me sum this up in a few sentences. First, a set of related queries, called the related query database, is created. Then, these related queries are used in the usual search process. This is done by retrieving a document for the original query and noting the score the ranking algorithm gave the document. Then, the same thing is done for a few related queries, and the ranking scores for the document are noted for all those related queries. The final score of the document is a weighted combination of the scores output by the ranking algorithm for the original query and the related queries. The weights used in the weighted combination could be the strengths of the relationships between the query and its related queries. Sorting the documents by their final scores gives us the final ranking.
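The summary above can be sketched end to end as follows. Everything here is an assumption made for the example: toy_score is a stand-in for the real ranking algorithm (it just measures term overlap), and the documents and weights are made up.

```python
def toy_score(query, doc_text):
    """Stand-in ranking function: fraction of query terms found in the document."""
    terms = query.lower().split()
    words = set(doc_text.lower().split())
    return sum(term in words for term in terms) / len(terms)


def rank_documents(query, related_weights, documents, score=toy_score):
    """Sort document ids by Score(q, d) plus the weighted Score(q', d)
    over each related query q'."""
    def final_score(doc_id):
        text = documents[doc_id]
        total = score(query, text)
        for related_query, weight in related_weights.items():
            total += weight * score(related_query, text)
        return total

    return sorted(documents, key=final_score, reverse=True)


documents = {
    "pool-page": "infinity pool auto cleaner",
    "car-page": "infiniti luxury car dealer auto",
}
related_weights = {"luxury car": 0.5, "infiniti": 0.8}

print(rank_documents("infinity auto", related_weights, documents))
# ['car-page', 'pool-page']
```

The pool page matches both words of “infinity auto”, so it would win on the original query alone. The car page matches only “auto”, but its scores for the related queries lift it to the top, which is the topic-oriented behavior the patent is after.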
To understand the impact this has on SEO, it is enough to understand that related queries are in fact semantically related to the original query! By retrieving based not on one query but on several semantically related queries, the search becomes more topic oriented rather than matching only on the few words in the original query. This is a kind of semantic search! It would be interesting to compare this method with techniques like Latent Semantic Indexing.
Art J. Adams