Avoiding "A.I. Redux"…

Fred Wilson has a great post on his blog this morning about the semantic web (Making The Web Smarter). Beyond the mention of my company InfoNgen, it also provided an interesting perspective on how the web is evolving in practice. This is a subject I’m passionate about, so I couldn’t pass up the opportunity to throw in my two cents.

With InfoNgen, I spend a great deal of time thinking about potentially new and innovative ways to analyze and classify content – including a broad range of web based content. Without a doubt, the research going on around the semantic web is some of the most interesting in this field. While there has been some really exciting progress in applying this research to many constrained information domains, creating this self-describing, intelligent network of information on an “internet wide scale” is still an incredibly daunting task.

And as Fred points out, it isn’t one we are making a lot of progress in.

I am struck by the similarities between the efforts happening here and the work that took place from the ’70s to the early ’90s in the field of artificial intelligence. In computer circles, A.I. was the cutting edge discipline of its day. Until the arrival of the Internet, it was a magnet for creative engineers and scientific talent. People saw it as the next great revolution in technology. Encouraged by successes like chess playing computers that could beat grand masters and medical expert systems that demonstrated real value in clinical situations, expectations were high that we would soon see computers that would be able to interact with us conversationally – personal assistants that could carry out spoken directions and provide us with relevant advice and information. This video – done by Apple in 1987 – is a great example of what people were hoping computers would soon be able to do for them:

It’s more than 15 years later, and we’re still a very long way off from the promise shown in this video.

Today’s efforts to create the foundation of the semantic web are in some ways like a reemergence of artificial intelligence – but now repackaged for a web centric world. Many of the concepts and technical disciplines that were sitting behind A.I. – Bayesian inference, natural language processing, weighted decision trees, classifiers, and knowledge bases just to name a few – are now in some form or fashion powering various commercial and open efforts to realize the semantic web. And while they do share a common set of technologies, that doesn’t mean they need to share a common fate.

But to be successful, things will need to start coming together in a different way.

This time around, these technologies will need to leverage the core social fabric inherent in the web architecture. Analysis needs to be pushed out to the edge and become an integral and interactive part of the content creation process. This would not only be able to suggest tags or other meta level markups, but also offer potential summaries for quick display, highlight ambiguous terms or content blocks for refinement, and suggest unique topical terms that could be included in the content to improve discoverability. The human generated editorial insights that exist in trusted content sets need to be leveraged to mine for relationships in other content sets that exist more broadly. (Fair use/copyright law will need to be updated and clarified to keep up with innovations in this area.) Most importantly, the creation of public databases, taxonomies, and ontologies needs to become a priority for open source efforts, potentially leveraging a DBpedia style model of publication and quality control. Freely available datasets will be the fuel that powers many of these efforts going forward. Overall, any successful approach here needs to blend the things people do well with technologies that can amplify and extend them, producing something neither could accomplish well on its own.

With all of that said, I’m not naive. I don’t believe we will ever have a truly global, harmoniously classified semantic web. There are simply too many perspectives to rationalize in a way everyone can agree on, and too many people looking to game the process for their own gain. The Utopian model discussed academically is really an idealized goal that isn’t achievable on a practical level. But I strongly believe that it will be possible to offer to the broad web community the same improved web experience currently provided by vertically focused solution providers like InfoNgen. Meaningful progress at this level will require more than the isolated technological breakthroughs of any single company or organization. Though it can be anchored around the same core semantic concepts, getting the scale and scope needed to succeed here will require some kind of cooperative framework to share and enhance the currently disconnected efforts and innovations that are taking place today. Without having some mutually beneficial relationship exist between the various commercial and open sourced initiatives, it is likely that the global semantic web will end up hitting the same kind of wall that the original efforts in A.I. did.

While a technical discussion of the various solutions in this space may be interesting, the end goal of the semantic web is to make it easier for individuals and organizations to discover and apply information that is relevant to them. This means that access to content needs to become more flexible, and conform to the variety of ways people may think about it and want to consume it. This is in sharp contrast to the rigid way publishers have traditionally wanted to package and present it.

None of this will be easy, but getting publishers to embrace this kind of change may be the biggest challenge of all.

BING: Microsoft's New "Decision Engine"…

Microsoft’s new search engine “BING” certainly looks interesting.

BING is a lightweight semantic search service integrating the technology Microsoft got when they acquired PowerSet with their current LIVE search platform. It seems to be trying to address the key frustration people have with traditional web search tools – namely the lack of structure in the results that are returned. Outside of common topics, it can sometimes take a fair amount of digging through pages of headlines for people today to find what they are looking for.

Microsoft sees this disaffection with the search status-quo as the approach they can use to go after Google. Their intention is pretty clear from this video introducing BING:

Technically, getting BING to work as promised will be a huge challenge for Microsoft. People search for all kinds of things. After you get beyond the more scripted result sets seen in these demonstration searches, how well will BING really perform? Can Microsoft’s approach really scale up to cover a meaningful percentage of the web and a broad enough set of subject domains to attract a large following? While I really like BING at an aspirational level, I can’t ignore the many “product visions” from Microsoft that far over-sold what ultimately got delivered in their final products.

Remember the promise of “Longhorn” aka Vista?:

But even assuming BING can live up to its billing on a technical level, it will probably have another issue to deal with: the limits of what Microsoft (or any search vendor) can do with the content they crawl. Unlike the more traditional approach to web search, BING seems to mine various sites for more detailed information, pulling it together into more thematic views. The richness of these views could potentially obviate the need for people to click back to the source sites to get the information they want – something that would certainly be frowned upon by those content creators counting on receiving click-thru traffic. The high level of content extraction required here is a new area in web search that has yet to establish any accepted “terms of engagement” between all of the involved parties.

With all of this said, Microsoft may finally be on the road to having a viable answer to Google’s dominance. BING seems to be a big step up from their current LIVE search, and is probably better aligned with how people would like to experience web search than Google presently is. They will need to aggressively market it, which is something Microsoft appears more than capable of doing. And at only about 8% market share in web search today, moving the needle a meaningful amount probably won’t be that difficult for them to accomplish. The key to getting advertisers to follow will be building up and sustaining some momentum around whatever market share gains they make. That’s what will make BING successful in the long term.

But all of this assumes that BING delivers on the promise – that the results BING returns are highly relevant to the searches being done and easy for a user to navigate.

And at this point, that’s still a really big assumption…

The Semantic Web: Starting Small…

Jump starting the Semantic Web is no small task…

There are many people looking for Google (or the ‘next Google’) to begin applying semantic principles to the creation of a new, incredibly rich index of the web. While Google certainly has the technical wherewithal and available cash to make a credible move in this direction, I think that realizing the true power of the Semantic Web will require a somewhat different approach than that taken in the past. It needs to be a lot more distributed in every dimension.

It needs to become social…

Today, most people use a folder paradigm to organize and structure their own information. If they can’t remember where they put a file, they can search for it by some general things – date, size, file name, file type, or included text. The entire process of local discovery is primitive, inefficient, silo-ed, and incredibly frustrating for anyone lacking a rigid approach to organization.

If the semantic web is really going to take off on a large scale, it needs to happen first on a small scale. Semantics need to become an everyday part of the way individuals deal with information at a personal level.

Every piece of content a person touches has embedded semantic detail – contacts, companies, products, locations, dates, times, and topics. These are the incredibly valuable reference points people would like to use to find things, and even more importantly, to connect things together. To make that happen, semantic analysis and indexation need to start happening on a personal level. They need to become a fundamental component of how people interact with their own emails, office documents and even the web pages they visit.

The Semantic Web needs to begin on people’s own computers…

As opposed to starting off by having one massive organization crawling the entire web, a better approach would be having millions of copies of a small distributed component crawling individuals’ own file systems, providing them with a rich semantic experience of their personal information first.

This semantic component would be able to download multiple relevant ontologies from the web and use them to deeply index (in a private and secure way) all of the information found on users’ personal systems. These ontologies would also be used to index any web sites a user visits, allowing them to create deep contextual links between information they use on the web and the information on their desktops.
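To make the idea of a local semantic index a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny ONTOLOGY dictionary stands in for ontologies downloaded from the web, the file names and contents are invented, and matching is done on whole lowercase words rather than with any real semantic analysis.

```python
from collections import defaultdict

# Hypothetical ontology: entity type -> known terms (stand-in for a downloaded ontology).
ONTOLOGY = {
    "company": {"infongen", "microsoft"},
    "location": {"boston", "seattle"},
}

def build_index(documents, ontology=ONTOLOGY):
    """Map (entity_type, term) -> set of document names that mention the term."""
    index = defaultdict(set)
    for name, text in documents.items():
        words = set(text.lower().split())
        for entity_type, terms in ontology.items():
            for term in terms & words:
                index[(entity_type, term)].add(name)
    return index

def related(index, doc_name):
    """Return the documents that share at least one entity with doc_name."""
    linked = set()
    for docs in index.values():
        if doc_name in docs:
            linked |= docs
    linked.discard(doc_name)
    return linked

# Invented personal content: an email, an office document, and a visited web page.
docs = {
    "email.txt": "Meeting with Microsoft in Seattle",
    "notes.doc": "Seattle conference agenda",
    "page.html": "InfoNgen product overview",
}
index = build_index(docs)
print(related(index, "email.txt"))
```

The point of the sketch is the shape of the data: once entities are indexed, the email and the notes are linked through a shared location without anyone filing them in the same folder.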

And it can also leverage the social dimension of the web…

As users share files, they could also share all of the enhanced metadata they’ve generated. Beyond that, they could even share the ontologies they used to create that metadata, letting recipients opt in to including them in their own ontology set. As these connections between individuals are made, sharing can also start to take place around the web indices each has generated from their own web crawling. Ranking/relevance scoring can also be introduced, helping to statistically tune the published ontologies.

And maintenance of the ontologies could even be Open-Sourced…

This would allow the ontologies being used to be fine tuned (and even extended) on a social level – by the people consuming the information. Without a doubt, people with a vested interest in an information domain are better able to capture the subtle details of related ontologies than those looking at it from a more operational perspective. This could bring a level of scale and attention to the maintenance of ontologies that simply wouldn’t be possible using a more traditional centralized model.

This approach has the power to profoundly change the way individuals manage and share the information they have, and discover new information that may be relevant to them. And it can happen without waiting for one of the ‘Big Boys’ to manifest a fully realized ‘Semantic Web’ on a global scale.

And that could be the jump start the Semantic Web needs to take off…

As for the ‘Googles’ of the world, they would still have a critical role to play. The approach I’ve described here is highly distributed, and effectively creates millions of localized ‘semantic islands’. The large search providers could become the semantic backbone that ties these islands together. They could become the clearing house for the certification and distribution of the ontologies that power this approach, and collectively manage their development. They could also become the rolled-up index of all of the individual web crawls done at a local level – the global semantic index that powers broad searching needs. They could even provide semantic indexation for commercial sites, and integrate it with the searches done at a local level. There will be plenty for them to do.

But the one thing they shouldn’t do is try to own it all…

While it may not look like it today, the world is going to move away from traditional search as the model for finding information. Instead, discovery will take place by using the context of something you’re looking at or working on as a magnet for additional related information. If you see a name in a PDF file, you could use it to pull a phone number from your own Outlook, a bio from their Facebook page, and headlines about them or their company from all over the web.

In a single window. With a single click. Without doing a single search.

But to get there, we’ll need to start small.

With the Semantic Web, we should keep thinking global, but start acting local…

NOTE: This post is a minor update to an article that I published on this blog about a year and a half ago. I decided to revisit it after my partner at InfoNgen, Isaak Karaev, sent me a link to an article describing the EU-funded Nepomuk – a European effort to implement pretty much what I had outlined here.

This is definitely an exciting development!

Cutter Associates' Technology Alliance Conference…

For those not familiar with Cutter Associates, they are a premium provider of objective analysis and consulting services in the financial marketplace. I had the chance to deliver the keynote talk today at their Technology Alliance Conference in Boston.

This conference explores a broad range of issues related to the operational infrastructure financial firms need to support. I decided to focus my talk on some of the significant trends that I believe will shape the way that firms will discover information in the future. Of the seven big trends that I covered, there are three key ones that I’d like to share with you here:

    Discovery Will Become Personal – It will become increasingly important for individuals to be able to discover information using personal taxonomies that reflect their unique perspective on the key topics that they need to follow. These personal taxonomies will complement the shared global taxonomies that are provided broadly, and create a more effective and efficient way for people to discover and organize the information that is really relevant to them.
    Text Search Will Become Secondary – Though it’s central to the way the web is mined today, text search will fade in importance as a tool for information discovery. It is simply too imprecise and delivers way too much noise in the results it returns. I believe that it will be replaced by tools that provide more thematic based discovery. These tools will be based on weighted, non-Boolean matching, rules based qualifications, and statistical analysis. These approaches will make information on complex concepts much easier to find in the future.
    Discovery Will Become Pervasive – While most discussions around content discovery focus on the web, effective discovery actually needs to embrace ALL of the content sets you have available to you. This includes the content on your own desktop and email, as well as in corporate file repositories and data stores that you may have access to. Having a contextually rich framework that encompasses all of these sources will allow a new discovery model to evolve that transcends the silo limited approach most people need to deal with today.
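The "weighted, non-Boolean matching" mentioned in the second trend can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the THEME_WEIGHTS table, its weights, and the threshold are all hypothetical, and a negative weight is used as a crude stand-in for a rules based disqualifier.

```python
import re
from collections import Counter

# Hypothetical theme definition: term -> weight.
# A negative weight acts as a simple rule-based penalty.
THEME_WEIGHTS = {
    "merger": 3.0,
    "acquisition": 3.0,
    "takeover": 2.0,
    "shareholders": 1.0,
    "rumor": -1.5,
}

def theme_score(text, weights=THEME_WEIGHTS):
    """Return a graded relevance score rather than a yes/no Boolean match."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # Each term contributes its weight once, so repetition alone cannot inflate the score.
    return sum(weight for term, weight in weights.items() if counts[term] > 0)

def matches_theme(text, threshold=4.0):
    """A document qualifies when its accumulated weight reaches the threshold."""
    return theme_score(text) >= threshold

print(theme_score("The merger was approved by shareholders"))
print(matches_theme("Rumor of a takeover"))
```

Unlike a Boolean query, no single term is required or sufficient here. A document qualifies because enough weighted evidence accumulates, which is what makes complex concepts easier to capture.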

At the heart of each of the trends I discussed in the keynote is the creation of more detailed and more personalized context that can power new approaches to information filtering. The core technologies required to create this contextual backdrop are actually all available now. They can be leveraged effectively in many of the most challenging information discovery domains firms are struggling with today.

The future of content discovery is a lot closer than most people realize…

The Future Of Competitive Intelligence…

My company InfoNgen was invited to present at SLA 2008…

In a prior post, I wrote about some of the key points I made during a presentation to competitive intelligence professionals at this year’s Special Libraries Association Conference in Seattle.

As a followup, I’ve produced a video of that presentation that is now available over on our companion video blog The DIGITALedge.TV.

At just over 14 minutes long, it’s not the ideal ‘web video’ length. However, the presentation does cover a lot of ground, and provides a clear overview of the semantic based capabilities InfoNgen is currently bringing to the marketplace. Analyzing trends across information sets – covered in this video – is an especially interesting component of what we do, and it will continue to grow in importance over time.

This feels like a marketplace on the verge of taking off…

Back From SLA 2008…

I was fortunate to be able to speak at this year’s SLA conference…

The SLA – Special Libraries Association – is a professional organization representing the interests of information specialists around the world. These professionals, known as “Special Librarians”, are:

…information resource experts who collect, analyze, evaluate, package, and disseminate information to facilitate accurate decision-making in corporate, academic, and government settings.

I had the chance to share my views on what the future of competitive intelligence may look like, and the technology and tools that would be needed to support it. Competitive Intelligence (CI) is a fast-growing component of corporate and academic research. When done well, it can offer an organization significant strategic advantages. When done poorly, an organization can find itself off balance in the marketplace, wasting resources, focus and time by being totally reactive.

My talk centered on the critical attributes of a modern CI workflow. I summarized them into four key trends:

  • Information lives in many places – the web; professional services; email; corporate file servers; your own desktop. You need to use a single tool to discover content across all of them. There is no way to effectively manage discrete content silos if you are forced to use a set of disconnected tools, profiles, filters, and taxonomies.
  • Information discovery needs to center around concepts and context. Once that is established, filtering by more traditional means becomes viable. The concepts and context I refer to here need to be personal. You need to be able to classify and discover content from your own perspective, and organize it into a structure that makes sense for you and your business. “One size fits all” taxonomies don’t cut it anymore – you simply end up seeing the world the same way as everyone else.
  • Discovery isn’t just about raw information – it’s also about trends. You need to be able to see changes to information over time, and explore the relationships various pieces of information have with each other. Finding trends provides focus and unique insight, and is a key component to maximizing the value of the information assets at your disposal. This is especially true when looking at trends around custom themes or topics that reflect your own interests and perspective.
  • Establishing a culture of collaboration and information sharing is critical for any modern organization. It needs to exist both internally and externally (with clients and partners). It needs to be more than just a slogan or ideal. It requires an investment in both tools and training. And it also demands a decentralized approach – people need to be able to “self organize” around the work they do and the materials they share to do it. Some of the most timely and insightful information an organization has sits in people’s heads. Giving them a better way than email to share it, discuss it, and preserve it as a discoverable information asset will have a considerable payback.
    While this talk was targeted at Competitive Intelligence librarians, the core points I made really apply to any organization that depends on information flow to conduct their business. And these days, that probably covers most of them.

    It was clear from the feedback I got after the talk that this conceptual approach resonated.

    The need is there for a new set of tools…

    I hope to be putting a video of this presentation up on the site a little later. (Now available at The DIGITALedge.TV) Semantic analysis and automated understanding are at the heart of what we do at InfoNgen, and are areas I am passionate about.

    They represent the future of the information search and discovery world…

    Just The Facts? Maybe Not…

    The emergence of semantic search technologies holds a great deal of promise…

    One of the biggest benefits anticipated by the wide adoption of a semantic based approach to content discovery is the ability to ask a basic question to a “search engine” and get back a specific answer.

    Not a list of sites, but an actual specific answer.

    While that ‘search experience’ is very appealing on a conceptual level, it starts becoming somewhat muddled when it comes to delivering practical implementations. There are three aspects to this new world of search that present challenges and will require greater thought and discussion.

    First – Many questions don’t have simple answers:

    Factual information can have a deeper context that is difficult to express in a simple question/answer framework. Consider the question “What is the population in New York City?” You may end up with several different answers – and all of them could potentially be correct.

    How?…

    One site may quote numbers reported directly from the most recent census (i.e. the ‘official’ numbers). Others may be more recent estimations of the same, and potentially more ‘accurate’. Others may include or exclude unofficial demographic segments – like the homeless or illegal immigrants – or estimate them using different formulae. They all contain a dimension of ‘truth’, but you’d need to understand the context each came from to appreciate it.

    But none of that subtlety is easily expressed via a simple specific answer…

    Second – It’s not clear what the correct answer is:

    The fact that a source provides an answer to a specific question doesn’t mean that it is the best answer or even a correct answer (What? There’s inaccurate information on the web?!?) That means that these new “search engines” will need to choose – from potentially many different sources and many different answers – a ‘correct’ answer to return. Current methods for site ranking don’t translate well into ranking factual accuracy. They were designed to measure site relevance based on popularity, not accuracy – and there is at best a weak correlation between the two.

    Another approach that could be suggested as a solution here is the application of a weighted model based on a ‘wisdom of crowds’ philosophy. This model would hold that the correct answer is likely to be the most repeated answer. While that may have some rational basis behind it in a more random selection of sources, it may not apply to analysis of content on the web. For a ‘crowd’ based model to work well, the individual sources should not be influenced by one another. They need to remain discrete contributing entities to the final answer, or you end up with “groupthink”. Unfortunately, the web is a giant echo chamber, and answers on one site – right or wrong – can propagate to hundreds or thousands of other sites. This will give that particular ‘answer’ a disproportionate influence in the aggregated determination of a response.

    As I said before, ‘popular’ isn’t necessarily ‘accurate’…
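The echo chamber effect is easy to demonstrate with a small Python sketch. The data here is entirely invented, and the sketch assumes something a real system would not get for free: that for each site we already know whether its answer was produced independently or copied from another site (the hypothetical origin field).

```python
from collections import Counter

# Invented observations: (site, answer, origin).
# origin is the site the answer was copied from, or None if produced independently.
observations = [
    ("site-a", "answer X", None),
    ("site-b", "answer X", None),
    ("site-c", "answer Y", None),
    ("site-d", "answer Y", "site-c"),
    ("site-e", "answer Y", "site-c"),
    ("site-f", "answer Y", "site-d"),
]

def naive_vote(obs):
    """Most repeated answer wins; copies count as much as originals."""
    return Counter(answer for _, answer, _ in obs).most_common(1)[0][0]

def independent_vote(obs):
    """Count only answers from independent sources, ignoring copies."""
    votes = Counter(answer for _, answer, origin in obs if origin is None)
    return votes.most_common(1)[0][0]

print(naive_vote(observations))
print(independent_vote(observations))
```

With copies counted, "answer Y" wins four to two. With only independent sources counted, "answer X" wins two to one. Working out which sources are actually independent is the hard part, and the sketch simply assumes it away.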

    Regardless of the general approach taken to discriminate between multiple potential answers, it will also need to be able to deal well with ‘disjunctive’ information sets. Disjunctive information is information that breaks from the past in some way. Any process that biases its selection exclusively using historical factors will ignore the dynamic nature of some content. The most current answers will – by definition – have the fewest historical references and links. Answers to questions like “What are the known side effects of…?” may have highly relevant updates that will be important to include in a response, but would be deemphasized using a purely backward-looking heuristic. Addressing this would be critical in domains with highly dynamic content flow.
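One possible way to keep a ranking from being purely backward-looking is to blend a popularity signal with a freshness signal that decays over time. The Python sketch below is an assumption on my part rather than an established method: the function name, the 30 day half life, the freshness weight, and the two sample documents are all hypothetical.

```python
import math

def blended_score(inbound_links, age_days, half_life_days=30.0, freshness_weight=5.0):
    """Combine historical popularity with a freshness bonus that halves every half_life_days."""
    popularity = math.log1p(inbound_links)           # dampened link count
    freshness = 0.5 ** (age_days / half_life_days)   # 1.0 when new, decaying toward 0
    return popularity + freshness_weight * freshness

# Invented example: a long-established page versus a two day old update.
established = blended_score(inbound_links=200, age_days=720)
update = blended_score(inbound_links=3, age_days=2)
print(update > established)
```

On links alone the established page would always win. With the freshness term included, the recent update can surface ahead of it for a while, and then fades back unless it earns references of its own.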

    Third – It breaks the current commercial foundations of the web:

    The current commercial framework on the web is largely built on either a subscription model or an advertising/sponsorship model. Subscription models place content behind a firewall and require payment to access it. Advertising models are based on generating a meaningful and sustainable level of traffic to a specific site. Neither of these approaches fits comfortably in a search engine based ‘question/answer’ model.

    Subscription based content isn’t broadly available for mining by search engines. People tend to view the sites they subscribe to as special sources, and will visit them uniquely to access specific types of information. The value search engines bring to subscription sites today is a link back to a login or sign up page based on fairly broad metatags. That wouldn’t work in a question/answer model.

    Advertising based approaches depend on driving traffic to a site to generate revenue. Content is created to address a specific audience. They can discover it via search tools and visit the site – generating traffic. Sites can even buy specific search terms to improve their visibility in certain searches and hopefully see an uptick in visits.

    Unfortunately, the ‘Question/Answer’ model takes the opposite approach. It attempts to deliver an answer directly to a user without requiring someone to actually visit the site it came from. In fact, if an answer is selected based on a statistical methodology, there may not even be a specific site responsible for providing the answer – it actually may come from ‘everyone’ in the tracked cloud.

    Finding a way to share revenue in this model could be difficult. It could end up looking like the (thankfully) failed ‘piracy tax’ that was proposed on DAT tapes and blank CDs. It would have added a fee to the sale of these recordable media that would then be distributed to specific artists using a vague allocation methodology. The lesson here is that any solution that diffuses the relationship between performance and compensation is inherently inequitable and ultimately unworkable.

    Using web content in this way – essentially harvesting and repackaging information from millions or billions of web sites – raises significant copyright/IP issues as well. And these issues, like the web itself, exist on a global scale. Finding a solution will require moving beyond the parochial and politically deficient requirements of individual jurisdictions, and embracing a simpler global framework that is easy enough for everyone to use, but specific enough to address the genuine concerns of content creators. This could end up being a catalyst for the broad adoption of an enhanced version of the current Creative Commons framework – something long overdue in the online world.

    —–

    While the issues discussed here are not insignificant, there is enormous value in finding broadly acceptable models for working through them. These are foundational components of our next move forward on the web, and there is a great deal we will learn in the process. Determining how to address the ‘answer selection’ challenge will push the boundaries of social search and discovery methodologies, as well as accelerate progress in top down semantic analysis. Establishing a commercial and legal framework for dealing with content sharing at this granular level will create a surge of creativity and innovation in cooperative computing and social interaction that would easily dwarf the accomplishments of social pioneers like Facebook.

    The innovation horizon on the web just keeps getting broader and broader…

    This post is an expanded consideration of a subject I touched on in a comment on a previous post.

    An Update On re:SEARCH…

    It’s been a while since I introduced my upcoming web show re:SEARCH…

    Though preproduction on re:SEARCH has taken a bit longer than originally planned, things are now moving along. Set construction should be finishing up in a couple of weeks (yes – even virtual sets have construction requirements), and we should start shooting episodes later this month.

    I plan to cover a lot of ground in the series, weaving fundamental concepts with more advanced topics. It will combine underlying technical concepts with practical applications, and will highlight various sites and tools that can help in the process of search and discovery.

    If there are specific search related topics you would like to see addressed in the show, just shoot me an email or leave a comment below. I want to make re:SEARCH as interesting and relevant to everyone as possible, so I welcome your suggestions.

    I hope you’ll join me when it launches…