Last time we discussed how to approach a semantic search query and rank higher in the SERPs for it. Today I’m writing a follow-up to that discussion, focusing on how to use web search query logs to write semantic web copy and rank in semantic search. Why today? Because it’s Google Wednesday, the focus category for every Wednesday of the week.
Here is what you can learn in today’s article:
- What are web search query logs
- 5 Things you absolutely need to know about them before anything else
- What research on web search query logs tells us
- How to make the best out of what’s available out there
- How this helps you in writing semantic web copy
- How semantic web copy helps you rank in semantic search
- And more cool stuff...
Research Related to Web Search Query Logs and Findings
Let’s start with the geeky stuff (*prepare your glasses and switch geek mode “ON”*). Several research studies have looked at how query logs can be used and what type of information we can extract from query log data sheets. But before that…
5 Things You Need to Know About Web Search Query Logs Before Anything Else
I suggest you read them carefully, otherwise you won’t know what hit you. For real!
- Web Search Query Logs consist of large data sets on user searches for specific keywords and phrases on search engines. These query logs are data sheets with string variables holding the information recorded at the moment a user performs a search on Google, e.g. the session ID, the time-stamp, the keywords the user typed in the search box, the clicks, the URLs clicked and more. This covers only user behavior on the search engine itself, not what happens after a user opens a URL. So we can distinguish between before-click data and after-click data.
- Web Search Query Logs from search engines are often not public. You might find some query logs on the Internet, but search engines will not hand you real-time query logs for free. On special occasions, though, some search engines allow public access to tiny query log samples, like Yahoo! (you can check it out here). But don’t get too excited.
- Web Search Query Logs are different from the query logs on your website. Basically, the query logs of your website are data sets on what the user has done while visiting your website. You can find some of these query log attributes in your Google Analytics dashboard, in your Visitor Stats boards and in your SQL datasets (though I wouldn’t recommend going GEEK mode on your own without professional advice).
- Web Search Query Logs cannot be used directly by the general public, but they can be used indirectly. Here’s how: when search engines (well, the programs and people behind them) analyze the query log datasets, they get an idea of what the searcher’s behavior is all about and what “search term” suggestions they should display to help the user in his search. We cannot directly access the query logs, but we can make use of the search term suggestions built on them. So if we type in “web search we…”, the search engine will generate a couple of search term suggestions. These are based on web search query logs that recorded which queries searchers entered most often. See the image below for details (*Picture time, yay*)
- Web Search Query Logs can sometimes be made available to a targeted audience at specific conferences or workshops. To give you a specific example, at a Microsoft conference workshop back in 2008, some workshop participants were granted access to what they called a “web search query logs excerpt” from MSN. You can find the entire article with brief explanations in this link. But in case you don’t have the time to read it all, I will summarize it for you.
Topics discussed in the workshop: web mining, how to rank, mining semantic relationships, analyzing and correcting biased data, clustering and grouping log data by factors such as topic, task, geo location, time, generative models, tasks improved with the click data and information retrieval.
The Shared Dataset: an RFP 2006 dataset was shared with some of the workshop participants during a workshop held in 2008. It consisted of 15 million queries sampled over 1 month, mostly English queries from the US website. Each query included specific attributes such as session ID, time-stamp, query string, number of results on SERPs and results page number. For each clicked result, the data also included the URL, the associated query, the position in the SERPs and the time-stamp. A confidentiality agreement was signed before the workshop, prohibiting participants from redistributing the data or publishing detailed excerpts from it.
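To make the record layout above a bit more concrete, here is a minimal Python sketch. The class names, field names and sample rows are all my own assumptions for illustration; they are not the real format of the MSN excerpt. The small `suggest` function is a hypothetical helper showing the general idea behind search term suggestions: count which logged queries start with what the user has typed so far. It is a toy, not how any search engine actually does it.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryRecord:
    """One search request, with the attributes listed for the shared dataset."""
    session_id: str
    timestamp: str      # kept as plain text; a real log defines its own format
    query: str
    result_count: int   # number of results on the SERP
    results_page: int   # which results page was shown


@dataclass(frozen=True)
class ClickRecord:
    """One clicked result, tied back to the query that produced it."""
    query: str
    url: str
    position: int       # rank of the clicked result in the SERP
    timestamp: str


def suggest(records, prefix, limit=3):
    """Return the most frequent logged queries that start with `prefix`."""
    prefix = prefix.lower()
    counts = Counter(
        r.query.lower() for r in records if r.query.lower().startswith(prefix)
    )
    return [query for query, _ in counts.most_common(limit)]


# Made-up sample rows, for illustration only.
sample_log = [
    QueryRecord("s1", "2006-05-01 10:00:00", "web search weather", 120, 1),
    QueryRecord("s2", "2006-05-01 10:02:00", "web search weather", 120, 1),
    QueryRecord("s3", "2006-05-01 10:05:00", "web search websites", 300, 1),
    QueryRecord("s4", "2006-05-01 10:07:00", "semantic search", 80, 1),
]
sample_click = ClickRecord(
    "web search weather", "http://example.com/weather", 1, "2006-05-01 10:00:09"
)

print(suggest(sample_log, "web search we"))
# ['web search weather', 'web search websites']
```

The most frequent matching query comes first, which is the same intuition as the suggestions you see while typing in a search box.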
Okay, so now we’re back on…
Research and Findings: The Cool Stuff
Many authors and researchers have been interested in how search engines work and in the behavior of the searcher – the user performing a search. As I mentioned above, data on web search query logs isn’t usually available to the public eye, although there have been some “data leaks” in the past. But think about it: no search engine would like to “sell off” its most precious treasure, user search data. This data actually feeds into the ranking factors, so it’s only natural that access isn’t free. Otherwise, chaos would descend on us all. Imagine companies learning about the specific 200+ factors Google uses to determine the ranks for the 1st page. And speaking of 1st page rankings, there is also a study I will share with you explaining why EVERYONE is so obsessed with ranking in the top 10 positions in the SERPs. So here it goes…
A Clustering Approach on Extracting User Interests Based On Web Search Query Logs
This particular study explains how to work out what the user interests are by analyzing specific query logs. The guys who worked on this paper, which you can freely access here (link: http://liris.cnrs.fr/Documents/Liris-4685.pdf), are from German and French universities: Lyes Limam, David Coquil, Harald Kosch and Lionel Brunie.
The study’s aim (objective) was to enhance search query log analysis while taking into consideration the query terms’ semantic properties. The paper also discusses how to extract a global semantic representation of specific web search query logs (*goldmine, I tell you, goldmine!*) and shows how to use the data generated to semantically extract user interests (Bingo!). You’ll also find here the taxonomies of query terms based on generalization or specialization semantic relations.
I mentioned taxonomies of search queries in my previous article.
Plus, there’s also a part about functions to measure semantic distance between terms (yes, I’m crying, it’s blissful). The authors also explain how they define a query terms clustering algorithm, later applied to the log representation to extract user interests.
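If you are wondering what “semantic distance between terms” and “query terms clustering” could look like in practice, here is a tiny Python sketch. Big warning: this is NOT the algorithm from the paper. The taxonomy (`PARENT`), the edge-counting distance and the greedy grouping are simplified stand-ins I made up to illustrate the general idea of grouping query terms that sit close together in a generalization/specialization hierarchy.

```python
# Hypothetical toy taxonomy: each term maps to its more general parent term.
PARENT = {
    "laptop": "computer",
    "desktop": "computer",
    "computer": "electronics",
    "phone": "electronics",
    "sedan": "car",
    "suv": "car",
    "car": "vehicle",
}


def ancestors(term):
    """Return the chain from `term` up to its root, starting with the term itself."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def semantic_distance(a, b):
    """Count taxonomy edges between two terms; infinity if they share no ancestor."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            return steps_a + chain_b.index(node)
    return float("inf")


def cluster_terms(terms, max_distance=2):
    """Greedy single pass: a term joins the first cluster whose seed is close enough."""
    clusters = []
    for term in terms:
        for cluster in clusters:
            if semantic_distance(term, cluster[0]) <= max_distance:
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters


print(cluster_terms(["laptop", "sedan", "desktop", "suv", "phone"]))
# [['laptop', 'desktop'], ['sedan', 'suv'], ['phone']]
```

Each resulting group can be read as one rough “user interest”: computers, cars and phones in this made-up example.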
The cool part is that the dataset consists of large real-life logs of a popular search engine. Guess what search engine we’re talking about? Anyway, if you want to perform an analysis on a large query log dataset to determine user interests, this is a great method to do it. You could also use this for the dataset you have on the visitors of your own website. Next research worth mentioning here is…
An Extensive Analysis of a Large Dataset from Specific Search Engine Web Search Query Logs
This one offers more in-depth information than the first study I mentioned above. The paper was written by Craig Silverstein and Monika Henzinger from Google Inc., Hannes Marais from Compaq Systems Research and Michael Moricz from Doublebill.Com, Inc., back in 1999. But the age of this paper is sort of irrelevant, as it points out why an upgrade to Web 3.0 and Semantic Web search was needed, why Google is making its algorithm changes in the present and so much more.
The dataset came from an AltaVista search engine query log of approximately 1 billion entries logged over a period of 6 weeks. What does this mean? Well, it’s almost 285 million user sessions, each an attempt to fill a single information need. Individual queries, query duplication and query sessions were all analyzed, along with correlations between log entries, in order to see how terms interact inside a search query.
What did they discover? Well, the future of search queries, we could say: results showed that web users behave quite differently from how they had been perceived up until that point. And here comes the interesting part, explaining the obsession of companies with ranking in the first 10 results (1st page results) for short keywords. The user data showed that web users typed in short queries and mostly looked at the first 10 results only. The correlation analysis showed that the most highly correlated items were constituents of phrases. This indicated that search engines should consider terms as parts of phrases even when the user did not explicitly ask for that. The entire study can be found here (link: http://infoscience.epfl.ch/record/99356/files/SilversteinHMM99).
Which leads us to the present of Semantic Search, if you allow me to add this personal note.
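A quick bit of arithmetic on those round figures: roughly 1 billion entries spread over almost 285 million sessions works out to about 3.5 log entries per session. And to show what “grouping log entries into sessions” can mean, here is a minimal Python sketch. The 5-minute cutoff, the `(user_id, timestamp, query)` layout and the sample rows are assumptions I picked for illustration, not details taken from the dataset itself.

```python
from datetime import datetime, timedelta

# Assumed cutoff: a silence longer than this starts a new session.
SESSION_GAP = timedelta(minutes=5)


def split_sessions(entries, gap=SESSION_GAP):
    """Group (user_id, timestamp, query) entries into per-user sessions."""
    sessions = []
    last_seen = {}  # user_id -> (timestamp of last entry, that user's open session)
    for user_id, ts, query in sorted(entries, key=lambda e: (e[0], e[1])):
        previous = last_seen.get(user_id)
        if previous is not None and ts - previous[0] <= gap:
            session = previous[1]       # close enough in time: same session
        else:
            session = []                # first entry or long silence: new session
            sessions.append(session)
        session.append(query)
        last_seen[user_id] = (ts, session)
    return sessions


# Made-up sample rows, for illustration only.
entries = [
    ("u1", datetime(2000, 1, 1, 9, 0), "jaguar"),
    ("u1", datetime(2000, 1, 1, 9, 2), "jaguar car"),
    ("u1", datetime(2000, 1, 1, 11, 0), "weather"),
    ("u2", datetime(2000, 1, 1, 9, 1), "cheap flights"),
]

sessions = split_sessions(entries)
print(sessions)
# [['jaguar', 'jaguar car'], ['weather'], ['cheap flights']]
print(round(1_000_000_000 / 285_000_000, 1))
# 3.5
```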
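The finding about phrase constituents can be illustrated with a small Python sketch. Please note that the score below (queries containing both terms divided by queries containing either term) is a simple stand-in I chose for illustration, not the correlation statistic used in the study, and the sample queries are made up.

```python
from collections import Counter
from itertools import combinations


def term_pair_scores(queries):
    """Score each term pair by how often the two terms appear together.

    Score = queries containing both terms / queries containing either term.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for query in queries:
        terms = sorted(set(query.lower().split()))
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))
    scores = {}
    for (a, b), both in pair_counts.items():
        either = term_counts[a] + term_counts[b] - both
        scores[(a, b)] = both / either
    return scores


# Made-up sample queries, for illustration only.
queries = [
    "new york hotels",
    "new york pizza",
    "cheap hotels",
    "new york",
]

scores = term_pair_scores(queries)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
# ('new', 'york') 1.0
```

In this toy log, “new” and “york” always show up together, so they score highest, which is exactly the kind of pair you would want to treat as one phrase even when the user typed no quotes.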
Quick Advice on Semantic Web Copy
Okay, so we are here. Finally. Well, some of the tips on semantic search have been shared above. But that’s not all. In order to write semantic web copy, you have to understand the needs of the user performing the search and have a deep understanding of his search behavior. Some of the data and research available out there covers exactly that and gives us an idea of what people want and how they act.
One thing you must keep in mind is that the user performing a search query does not want any bullshit (*pardon the expression*) stuffed in his eyes. And that is what happened at some point with search engines: the SERPs did not reflect the user’s needs, a lot of spam and irrelevant results were displayed, and that made the searcher very angry. Hence the changes in Google’s ranking algorithms. Other search engines just kept on going with the same algorithm, but Google wanted more quality, relevancy and consistency, which are imperative for the health of query logs and unbiased data.
So in order to write semantic web copy, you need to keep in mind all the things mentioned above plus one more: the emotion in query logs. Ask yourself how emotional the users’ needs are and generate content based on empathy. Check out this presentation on SlideShare explaining how emotional the needs of the user are. At some point, I will get back to this topic and further explore the Emotion in Query Logs. For now, c’est fini – French for “That’s it, folks!”.
How Emotional Are Users’ Needs? Emotion in Query Logs from Marina Santini