A new kind of Semantic Search from Google?

by ringostarr on October 25, 2011

in Patents and Papers, Search Engines, Search Relevance

 

Query expansion is a technique in which a search engine augments a query with words related to the words of the original query. For example, a query like “good blogs” might be changed to “(good OR excellent OR interesting) (blogs OR blogging OR weblogs)”. This is similar to stemming but goes one step further: words that do not share a stem with the query words may also be added to improve relevance. In technical terms, query expansion improves recall, and if it involves spelling correction as well, it may also improve precision. The technique is sometimes also called query reformulation or query refinement.
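To make the “good blogs” example concrete, here is a minimal sketch of dictionary-based query expansion. The `SYNONYMS` table and the `expand_query` function are made up for illustration; a real engine would learn its related words from data rather than hard-code them.

```python
from typing import Dict, List

# Hypothetical word-to-word dictionary (an assumption for this sketch).
SYNONYMS: Dict[str, List[str]] = {
    "good": ["excellent", "interesting"],
    "blogs": ["blogging", "weblogs"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> str:
    """Rewrite each term as an OR-group of the term and its related words."""
    groups = []
    for term in query.lower().split():
        alternatives = [term] + synonyms.get(term, [])
        if len(alternatives) == 1:
            # No related words known: keep the term unchanged.
            groups.append(term)
        else:
            groups.append("(" + " OR ".join(alternatives) + ")")
    return " ".join(groups)


print(expand_query("good blogs"))
# (good OR excellent OR interesting) (blogs OR blogging OR weblogs)
```

Terms without an entry in the dictionary pass through untouched, so the expanded query can only match more documents than the original, which is why expansion mainly helps recall.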

A recent patent awarded to Google (originally filed in January 2009, awarded in September 2011) goes even further, by changing the ranking algorithm to use related queries as part of the retrieval process. The patent was awarded to Simon Tong, Mark Pearson, and Google co-founder Sergey Brin.

Here is what the patent starts with:

Systems and methods that improve search rankings for a search query by using data associated with queries related to the search query are described. In one aspect, a search query is received, a related query related to the search query is determined, an article (such as a web page) associated with the search query is determined, and a ranking score for the article based at least in part on data associated with the related query is determined. Several algorithms and types of data associated with related queries useful in carrying out such systems and methods are described.

What this means is that, at search time, a composite ranking score based on several queries is generated, and the results are ranked by this composite score. This is information retrieval at its best!

The patent mentions the use of a related query database. This database contains all queries that are related to the query that the user entered. There could be billions of queries in this database. How are related queries found?

…for the query “infinity auto,” relationships to queries “infiniti,” “luxury car,” “quality luxury car,” and “Japanese quality luxury car” may be defined if a user inputs these queries immediately following the initial query “infinity auto.”

So if a user searching for “infinity auto” is not satisfied by the results or wants to broaden the search, the user may re-enter a reformulated query in the same browser session to something like “luxury car”. If this reformulation happens several times through several user sessions, the related query database will contain “luxury car” as a related query to “infinity auto”.
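As a rough sketch of how such a database might be populated, the snippet below counts how often one query immediately follows another across sessions. The function name, the session format, and the `min_count` threshold are my own assumptions; the patent itself allows a relationship after a single occurrence.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def build_related_queries(
    sessions: List[List[str]], min_count: int = 2
) -> Dict[str, Dict[str, int]]:
    """Map each query to the queries that immediately followed it,
    keeping only pairs seen at least `min_count` times."""
    pair_counts: Counter = Counter()
    for queries in sessions:
        # Pair every query with the one entered right after it.
        for first, following in zip(queries, queries[1:]):
            if first != following:
                pair_counts[(first, following)] += 1

    related: Dict[str, Dict[str, int]] = defaultdict(dict)
    for (first, following), count in pair_counts.items():
        if count >= min_count:
            related[first][following] = count
    return dict(related)


# Toy session logs: each inner list is one user's queries, in order.
sessions = [
    ["infinity auto", "luxury car"],
    ["infinity auto", "infiniti"],
    ["infinity auto", "luxury car", "quality luxury car"],
]
related_db = build_related_queries(sessions)
print(related_db)
# {'infinity auto': {'luxury car': 2}}
```

With the threshold at 2, only “luxury car” survives as a related query for “infinity auto”, because the other reformulations were each seen once.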

The patent mentions the following ways of finding related queries:

Examples of related queries include having been input as consecutive search queries by users previously (whether once or multiple times), queries input by a user within a defined time range (e.g., 30 minutes), a misspelling relationship, a numerical relationship, a mathematical relationship, a translation relationship, a synonym, antonym, or acronym relationship, or other human-conceived or human-designated association, and any computer- or algorithm-determined relationship…

Consecutive search queries are what I described above with the “infinity auto” example.
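The time-range relationship from the same list can be sketched in a similar way. This is only an illustration: the log format of (timestamp in seconds, query) tuples and the function name are assumptions, and the 30 minutes is the example value from the patent text.

```python
from typing import List, Set, Tuple

WINDOW_SECONDS = 30 * 60  # the patent's example range of 30 minutes


def pairs_within_window(
    log: List[Tuple[float, str]], window: float = WINDOW_SECONDS
) -> Set[Tuple[str, str]]:
    """Return (earlier, later) query pairs that one user issued
    no more than `window` seconds apart."""
    ordered = sorted(log)  # sort by timestamp
    pairs = set()
    for i, (t_first, q_first) in enumerate(ordered):
        for t_later, q_later in ordered[i + 1:]:
            if t_later - t_first > window:
                break  # later entries are even further away in time
            if q_later != q_first:
                pairs.add((q_first, q_later))
    return pairs


# Toy log for a single user: (seconds since first query, query text).
user_log = [
    (0.0, "infinity auto"),
    (600.0, "luxury car"),
    (4000.0, "japanese quality luxury car"),
]
window_pairs = pairs_within_window(user_log)
print(window_pairs)
# {('infinity auto', 'luxury car')}
```

The third query arrives more than 30 minutes after both earlier ones, so it is not linked to either of them.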

Misspelling, synonym, antonym, and acronym relationships are quite obvious. A translation relationship, as best I can guess, holds between queries that are translations of each other. Note that this need not involve different languages. See this paper by Google in which they use concepts of machine translation for query expansion.

(Note that in typical query expansion, there is no related query database. Instead, there are dictionaries, which could be word to word or phrase to phrase.)

Once the related query database is complete, it is time to use it for retrieving search results for a query! The patent mentions several ways of doing this.

In some embodiments, determining an article associated with the search query may comprise determining that the article is associated with both the search query and the related query….

The related query signal function, is a set of instructions processed by the related query processor, determines a weighted value for each document in the initial search result depending upon the number of times other users have previously clicked or otherwise selected the particular document as a part of the initial search result, and upon the number of times other users have previously clicked or otherwise selected the particular document as part of search results for other queries related to the search query….

…a sixth example of a related query function illustrates the use of a ranking score previously generated by the search engine for all queries, “Score(q, d);” and a ranking score previously generated by the search engine for a particular document “d” in all related queries, “Score(q’_m, d);”

So, they count clicks on a result for the original query, and also for a few related queries. The final score of the result is a weighted combination of these click counts. As they say later on, they can also do a full (cached) retrieval for the original query AND the related queries and then take a weighted combination of the scores that the ranking algorithm outputs. (The weight could be the strength of the relationship between a query and a related query. It could be calculated simply from the semantic distance between the two queries, or from more sophisticated models based on click behavior.)
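Here is a small sketch of the click-based variant. It is not the patent's actual formula, just one plausible weighted sum; the click counts, the weights, the document ID, and the `click_score` function are all invented for illustration.

```python
from typing import Dict, Tuple

ClickLog = Dict[Tuple[str, str], int]  # (query, document) -> click count


def click_score(
    doc: str,
    query: str,
    related_weights: Dict[str, float],
    clicks: ClickLog,
    original_weight: float = 1.0,
) -> float:
    """Weighted sum of clicks on `doc` for the query and its related queries."""
    total = original_weight * clicks.get((query, doc), 0)
    for related_query, weight in related_weights.items():
        total += weight * clicks.get((related_query, doc), 0)
    return total


# Hypothetical click counts for one document.
clicks = {
    ("infinity auto", "infiniti.example"): 10,
    ("luxury car", "infiniti.example"): 4,
    ("infiniti", "infiniti.example"): 20,
}
# Hypothetical relationship strengths for the related queries.
related_weights = {"luxury car": 0.5, "infiniti": 0.25}

score = click_score("infiniti.example", "infinity auto", related_weights, clicks)
print(score)  # 1.0*10 + 0.5*4 + 0.25*20 = 17.0
```

A document that is popular for the related queries gets a boost even if few users clicked it for the exact query that was typed.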

Let me sum this up in a few sentences. First, a set of related queries is collected in what the patent calls the related query database. Then, these related queries are used in the usual search process: a document is retrieved for the original query, and the score the ranking algorithm gave the document is noted. The same is done for a few related queries, and the document's ranking scores for all those related queries are noted as well. The final score of the document is a weighted combination of the scores output by the ranking algorithm for the original query and the related queries. The weights used in the combination could be the strengths of relationship between the query and its related queries. Sorting the documents by their final scores gives the final ranking.
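The whole pipeline can be sketched in a few lines. This is a toy illustration under my own assumptions, not Google's implementation: `rerank`, `toy_score`, the document names, and all the numbers are invented, and `toy_score` stands in for whatever the real ranking algorithm outputs as Score(q, d).

```python
from typing import Callable, Dict, List, Tuple


def rerank(
    query: str,
    documents: List[str],
    related: Dict[str, float],
    score: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Combine Score(q, d) with weighted Score(q', d) for each related
    query q', then sort documents by the combined score, best first."""
    final = []
    for doc in documents:
        combined = score(query, doc)
        for related_query, weight in related.items():
            combined += weight * score(related_query, doc)
        final.append((doc, combined))
    return sorted(final, key=lambda pair: pair[1], reverse=True)


# Stand-in for the base ranking algorithm (made-up numbers).
TOY_SCORES = {
    ("infinity auto", "doc_a"): 0.9,
    ("infinity auto", "doc_b"): 0.8,
    ("luxury car", "doc_a"): 0.1,
    ("luxury car", "doc_b"): 0.7,
}


def toy_score(query: str, doc: str) -> float:
    return TOY_SCORES.get((query, doc), 0.0)


ranking = rerank(
    "infinity auto", ["doc_a", "doc_b"], {"luxury car": 0.5}, toy_score
)
print([doc for doc, _ in ranking])  # ['doc_b', 'doc_a']
```

On the original query alone doc_a would win (0.9 vs. 0.8), but doc_b also scores well for the related query “luxury car”, so it ends up first once the related score is mixed in.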

To understand the impact this has on SEO, it is enough to note that related queries are, in effect, semantically related words! By retrieving on several semantically related queries instead of one, the search becomes more topic-oriented rather than matching only a few words of the original query. This is a kind of semantic search! It would be interesting to compare this method with techniques like Latent Semantic Indexing.

Art J. Adams

 

2 comments on “A new kind of Semantic Search from Google?”

  1. Nice analysis, and I think that people should definitely be aware of how their queries might be expanded upon by Google.

    The patent and paper from Google are pretty interesting.

    I’d recommend that you use a different way of linking to patents though. Either of the following formats will work with a patent, and since they use the patent number, that particular patent will always show up when someone clicks on the link:

    http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=1&f=G&l=50&d=PALL&S1=08024326&OS=PN/08024326&RS=PN/08024326

    or

    http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=08024326

    Thanks.

    Bill

    • ringostarr said:

      Thanks, Bill for the patent linking suggestion! I find USPTO’s links to be very convoluted!

      I have read a few articles on your blog and I must say, you write really well. We share the same interest in the technical aspects of search engines.
