The emergence of semantic search technologies holds a great deal of promise…
One of the biggest benefits anticipated from the wide adoption of a semantics-based approach to content discovery is the ability to ask a “search engine” a basic question and get back a specific answer.
Not a list of sites, but an actual specific answer.
While that ‘search experience’ is very appealing on a conceptual level, it starts becoming somewhat muddled when it comes to delivering practical implementations. There are three aspects to this new world of search that present challenges and will require greater thought and discussion.
First – Many questions don’t have simple answers:
Factual information can have a deeper context that is difficult to express in a simple question/answer framework. Consider the question “What is the population in New York City?” You may end up with several different answers – and all of them could potentially be correct.
One site may quote numbers reported directly from the most recent census (i.e., the ‘official’ numbers). Others may offer more recent estimates of the same figure, which are potentially more ‘accurate’. Still others may include or exclude unofficial demographic segments – like the homeless or illegal immigrants – or estimate them using different formulae. They all contain a dimension of ‘truth’, but you’d need to understand the context each came from to appreciate it.
But none of that subtlety is easily expressed via a simple specific answer…
Second – It’s not clear what the correct answer is:
The fact that a source provides an answer to a specific question doesn’t mean that it is the best answer or even a correct answer (What? There’s inaccurate information on the web?!?) That means that these new “search engines” will need to choose – from potentially many different sources and many different answers – a ‘correct’ answer to return. Current methods for site ranking don’t translate well into ranking factual accuracy. They were designed to measure site relevance based on popularity, not accuracy – and there is at best a weak correlation between the two.
Another approach that could be suggested as a solution here is the application of a weighted model based on a ‘wisdom of crowds’ philosophy. This model would hold that the correct answer is likely to be the most repeated answer. While that has some rational basis when sources are selected more or less at random, it may not apply to analysis of content on the web. For a ‘crowd’ based model to work well, the individual sources should not be influenced by one another. They need to remain discrete contributing entities to the final answer, or you end up with “groupthink”. Unfortunately, the web is a giant echo chamber, and answers on one site – right or wrong – can propagate to hundreds or thousands of other sites. This will give that particular ‘answer’ a disproportionate influence in the aggregated determination of a response.
As I said before, ‘popular’ isn’t necessarily ‘accurate’…
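To make the echo chamber problem concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the site names, the answers "A", "B" and "C", and above all the assumption that we already know which sites copied their answer from which other site. In practice, working out that provenance is the hard part.

```python
from collections import Counter

# Hypothetical claims: (site, answer, copied_from).
# copied_from is None when the site arrived at its answer independently.
claims = [
    ("site-1", "A", None),
    ("site-2", "A", None),
    ("site-3", "C", None),
    ("site-4", "B", None),
    ("site-5", "B", "site-4"),  # echoes site-4
    ("site-6", "B", "site-4"),  # echoes site-4
    ("site-7", "B", "site-5"),  # echoes an echo
]

def naive_vote(claims):
    """Most repeated answer wins, no matter where it came from."""
    counts = Counter(answer for _, answer, _ in claims)
    return counts.most_common(1)[0][0]

def independent_vote(claims):
    """Count only the sites that did not copy their answer from another site."""
    counts = Counter(answer for _, answer, origin in claims if origin is None)
    return counts.most_common(1)[0][0]

print(naive_vote(claims))        # the echoed answer dominates
print(independent_vote(claims))  # each independent source counts once
```

With this made-up data the naive count picks "B", because one original source was repeated three times, while counting only independent sources picks "A". Neither result says anything about which answer is actually right; the sketch only shows how much copying can distort a ‘most repeated’ rule.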
Regardless of the general approach taken to discriminate between multiple potential answers, it will also need to be able to deal well with ‘disjunctive’ information sets. Disjunctive information is information that breaks from the past in some way. Any process that bases its selection exclusively on historical factors will ignore the dynamic nature of some content. The most current answers will – by definition – have the fewest historical references and links. Answers to questions like “What are the known side effects of…?” may have highly relevant updates that will be important to include in a response, but would be deemphasized using a purely backward-looking heuristic. Addressing this would be critical in domains with highly dynamic content flow.
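One way to picture the difference is to compare a purely historical score with one that discounts older references. This is only an illustrative sketch: the two documents, their link counts, and the 90-day half-life are all invented, and exponential decay is just one possible way to favor recent material, not a description of how any real search engine works.

```python
# Hypothetical documents: (name, inbound_links, age_in_days).
documents = [
    ("old-answer", 1000, 720),
    ("new-update", 20, 10),
]

def historical_score(links, age_days):
    """Purely backward-looking: accumulated links are all that matter."""
    return links

def recency_adjusted_score(links, age_days, half_life_days=90.0):
    """Halve the weight of a document's links for every half-life of age.

    The half-life value is an arbitrary assumption for illustration.
    """
    return links * 0.5 ** (age_days / half_life_days)

def best(documents, score):
    """Return the name of the highest scoring document."""
    return max(documents, key=lambda d: score(d[1], d[2]))[0]

print(best(documents, historical_score))        # the long-established page
print(best(documents, recency_adjusted_score))  # the recent update
```

Under the historical score the long-established page wins easily; with the decay applied, the recent update comes out ahead. A real system would need to tune that trade-off by domain, since favoring recency too heavily would surface unvetted claims just as readily as important updates.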
Third – It breaks the current commercial foundations of the web:
The current commercial framework on the web is largely built on either a subscription model or an advertising/sponsorship model. Subscription models place content behind a paywall and require payment to access it. Advertising models are based on generating a meaningful and sustainable level of traffic to a specific site. Neither of these approaches fits comfortably in a search-engine-based ‘question/answer’ model.
Subscription-based content isn’t broadly available for mining by search engines. People tend to view the sites they subscribe to as special sources, and will go to them directly to access specific types of information. The value search engines bring to subscription sites today is a link back to a login or sign-up page based on fairly broad metatags. That wouldn’t work in a question/answer model.
Advertising-based approaches depend on driving traffic to a site to generate revenue. Content is created to address a specific audience. That audience can discover it via search tools and visit the site – generating traffic. Sites can even buy specific search terms to improve their visibility in certain searches and hopefully see an uptick in visits.
Unfortunately, the ‘Question/Answer’ model takes the opposite approach. It attempts to deliver an answer directly to a user without requiring someone to actually visit the site it came from. In fact, if an answer is selected based on a statistical methodology, there may not even be a specific site responsible for providing the answer – it actually may come from ‘everyone’ in the tracked cloud.
Finding a way to share revenue in this model could be difficult. It could end up looking like the (thankfully) failed ‘piracy tax’ that was proposed on DAT tapes and blank CDs. It would have added a fee to the sale of these recordable media that would then be distributed to specific artists using a vague allocation methodology. The lesson here is that any solution that diffuses the relationship between performance and compensation is inherently inequitable and ultimately unworkable.
Using web content in this way – essentially harvesting and repackaging information from millions or billions of web sites – raises significant copyright/IP issues as well. And these issues, like the web itself, exist on a global scale. Finding a solution will require moving beyond the parochial and politically deficient requirements of individual jurisdictions, and embracing a simpler global framework that is easy enough for everyone to use, but specific enough to address the genuine concerns of content creators. This could end up being a catalyst for the broad adoption of an enhanced version of the current Creative Commons framework – something long overdue in the online world.
While the issues discussed here are not insignificant, there is enormous value in finding broadly acceptable models for working through them. These are foundational components of our next move forward on the web, and there is a great deal we will learn in the process. Determining how to address the ‘answer selection’ challenge will push the boundaries of social search and discovery methodologies, as well as accelerate progress in top-down semantic analysis. Establishing a commercial and legal framework for dealing with content sharing at this granular level will create a surge of creativity and innovation in cooperative computing and social interaction that would easily dwarf the accomplishments of social pioneers like Facebook.
The innovation horizon on the web just keeps getting broader and broader…
This post is an expanded consideration of a subject I touched on in a comment on a previous post.