Tuesday, October 14, 2008

Giving chance a chance, or the usefulness of serendipity

A post on The Scholarly Kitchen, entitled ‘Citation Controversy’, and particularly its reference to the principle of least effort, sparked the train of thought leading to this post.

Scientific articles have references, which represent the connection of the article to other articles, and thus to other knowledge. Articles in Wikipedia often have references, too, although it is not rare to see the message “This article or section is missing citations”. The ‘Principle of least effort’ article in Wikipedia carries this message (on the date of posting this), ironically demonstrating the principle, I think. Authors are often quite parsimonious when it comes to adding references to articles. And when references have been added to an article, there is rarely a thorough check on whether they include all, or enough, of the appropriate ones. The omission of obvious references may be picked up by reviewers, but the omission of less obvious ones is easily missed. One of the sad things about omitting references is that it may reduce serendipity.

I have a suggestion for ‘Wikipedians’ who wish to add appropriate references and links to Wikipedia articles, in particular to articles in the areas of health and life science, and so encourage serendipitous discovery. I advise them to go to what I informally call 'wikimore', an enhancement layer where they will find that the text of Wikipedia articles is enriched with highlighted concepts. By clicking on a number of those highlighted concepts and adding them to a search query, you can search for appropriate articles to refer to, say in Google Scholar or in Wikipedia itself, and, when found, add those references to the Wikipedia article, as a good Wikipedian would.

For instance, by clicking on the concepts ‘information seeking behavior’, ‘design’ and ‘library’, and subsequently searching in Google Scholar, I find this article:

Comparing faculty information seeking in teaching and research: Implications for the design of digital libraries, by Christine L. Borgman et al., in the Journal of the American Society for Information Science and Technology, Vol. 56, No. 6. (2005), pp. 636-657. DOI: 10.1002/asi.20154.

An interesting sentence from that article: “…faculty are more likely to encounter useful teaching resources while seeking research resources than vice versa.” In my view this demonstrates the drawback of a least effort approach (I like to call it the ‘laziness principle’), which by its very nature militates against serendipity. And yet serendipity is one of the most important routes to real breakthroughs in knowledge and understanding. A quote from an article by M.K. Stoskopf: "it should be recognized that serendipitous discoveries are of significant value in the advancement of science and often present the foundation for important intellectual leaps of understanding".

I’m not sure if the article I found (one among many others) would be a good reference to add to the Wikipedia article on the ‘principle of least effort’. But I do hope you can see that, starting from a Wikipedia article, wikimore lets you embark on a journey of serendipitous discovery even better than you can without its enhancement layer. With wikimore, i.e. the concept web enhancement as applied to Wikipedia, every concept that is recognized in the text is itself a link to further information: a ‘reference’, if you wish.

And while you’re at it, you might want to take a look at the ‘knowlet’ of ‘information seeking behavior’, and explore the concepts with which information seeking behavior is connected in the life and medical science area.

Happy exploring!

Jan Velterop

Open Access Day

Though I haven’t posted for a while on The Parachute, today, on Open Access Day, I feel I should.

Unfettered access to scientific research results is in my view one of the ‘infrastructural’ provisions that enable science to function optimally. So why isn’t open access universal and what can be done to make it so?

After all, open access is easy. Just as I am posting this entry on a blog – open and freely available to any reader, anywhere, any time – I can post a scientific article. It is increasingly unlikely that there are many scientific researchers in the world who don’t have the possibility to publish their articles on a blog or in an open repository. And I use the word ‘publishing’ advisedly. The notion that publishing is something that happens in journals is rather outdated since the emergence of the Web. (Isn’t it interesting, by the way, that our word ‘text’ is derived from the Latin ‘textus’ which means ‘web’?)

Actually, I have to correct myself here. Journals do publish, but they are not needed for the act of publishing by itself. Publishing can easily be done by the authors. The significance of journals lies not so much in the scientific content of their articles, but in the metadata of those articles. And by metadata I mean not so much the information about volume, issue, page number, et cetera – though that is useful for unambiguous citation – but in particular the information indicating that, and when, the article has been peer-reviewed (and often enough improved) in the course of a given journal’s editorial process. The role of a journal is to formalize an article, to affix the ‘label’ of the journal to it, indicating not only that it has been peer-reviewed, but also slotting it into what might be called a ‘pecking order’ of scientific publications. One only has to consider the weight attributed to a journal’s Impact Factor to get a sense of how important that pecking order is, or is at least perceived to be.

One of the reasons we do not have universal open access yet is that we keep on confusing the two: publishing (i.e. making public) on the one hand, and formalizing (i.e. affixing a scientific ‘credibility’ label) on the other.

Journal publishers, although still called ‘publishers’, are, in the Web era, mainly in the business of organizing the latter: affixing the label. That is no sinecure, as anyone who has done it will confirm. And as long as it is deemed necessary in the scientific ego-system – in order to get recognition, tenure, funding – it needs to be done. But it should not be confused with making research results openly and freely available.

Journal publishers have been in this business for decades, maybe even centuries. In the print world, publishing and formalizing were completely interwoven, possibly without anyone realizing it. The publishers were paid for their efforts by both readers and authors, though in different ways. Readers paid for access to the information via subscriptions, and authors for affixing the journal label to their articles by transferring their copyright exclusively to the publisher. That exclusively transferred copyright was worth a lot, because it enabled publishers to sell access to their journals, since anyone who didn't hold the copyright (which after copyright transfer included the authors) was prevented from disseminating articles, at least on any significant scale.

But we live in the Web world now, no longer in the exclusively print world. The value of copyright to publishers has decreased significantly since authors either started to ignore it – no doubt encouraged by the opportunities the Web offers for wide dissemination – or were forced to limit the exclusivity of their copyright transfer, for instance because of mandates to make their articles openly available within a given period of time (within a year, for instance, in the case of the NIH mandate).

Given that open access is a great good to science and society as a whole (I treat this as an axiom), what to do?

Two options for researchers, not mutually exclusive:
  1. Publish research articles freely and openly on the Web, on blogs, in repositories, et cetera, especially in those that allow public comments, and let laying the articles open to such public comments take the place of peer-review. This option may realistically be available only to tenured, established scientists and the very young ones with an independent and iconoclastic frame of mind.
  2. Publish in the ‘traditional’ journal system, but choose journals that accept payment for organizing the peer-review and formalization process, and then make the article in question freely available with full open access immediately upon acceptance, and back this up by depositing a copy of the article in an open repository. This option may realistically be available only to funded scientists, but those who are not able to source funding for it can always resort to option 1.

A few remarks to conclude. There are indications – so far anecdotal – that ‘informal’ publications are gradually being taken more seriously by the science community, and that helps the popularity of the first option. There are also indications that in some disciplines even the new and relevant scientific literature is becoming so overwhelming in size that proper, manageable ways need to be found to get an overview of the state of knowledge, which progresses daily. The analogy, if you wish, is that of a dependable weather report as opposed to just knowing the general climate, supplemented by looking out of the window.

And lastly, isn't it fitting that this week, at the Frankfurt Book Fair, the worldwide publishers' jamboree, the inclusion of open access publishing into the mainstream of science publishing is being presented? I'm referring of course to the take-over of BioMed Central by decidedly mainstream publisher Springer.

Jan Velterop

Monday, June 09, 2008

Open Access and WikiProfessional

One of the first WikiProfessional instances is WikiProteins. An article in Genome Biology describes it in great detail. The lead author of that article, Barend Mons, reacts to the post by Euan Adie on Nature’s Nascent blog (“WikiProteins is a croc”, later changed to “WikiProteins – a more critical look”). Because it is important to understand the open access nature of the WikiProfessional project, I am reproducing Barend's reaction to the blog entry in its entirety here.

Jan Velterop

Although the rather sour blog post by Euan is quite an exception among the overall positive reactions we receive on the beta site of WikiProteins, I feel that a matter-of-fact reaction from the lead author of the article in Genome Biology that announced it is warranted. Here it is.

First of all, on authorship: Jimmy [Wales] was instrumental in making the initial contacts between me and Gerard Meijssen, who was then working on WiktionaryZ, now Omegawiki. He also gave invaluable advice on several aspects of the system and he therefore deserves as much of an authorship acknowledgement as the average senior author/professor who ‘conceived of the study’. See also Gerard Meijssen’s blog about that.

On the interface etc., we all know this is beta and we struggled for a long time to make it as ‘good’ as it is. Obviously a flat file is easier than managing a relational database and therefore the interface can never be ‘really easy’. I agree with Peter-Jan [one of the commentators on the Nascent blog entry] that constructive criticism would have been more useful.

Criticism of the commercial nature (as it were) of a company on a blog made available by another commercial company – one that makes money on others’ scientific contributions for as long as we have been studying nature – is a bit peculiar as well. With the involvement of Amos Bairoch, Michael Ashburner, Mark Musen, Abel Packer, Roberto Pacheco, Matt Cockerill and many others in this process, not to mention Jan Velterop’s reputation, it seems to me that the OA nature of the projects is sufficiently safeguarded. With my personal background in malaria, working for 15 years with colleagues in developing countries, I also built a public track record in pushing free access to information for developing countries.

The content in WikiProfessional applications is completely freely available under the Creative Commons Attribution license (we are working on making author credits more clearly visible). The Knowlets are indeed proprietary as we create added value and apply algorithms that by themselves now have taken several million dollars to develop. It has proven exceedingly difficult to get sufficient public funding for this project, which has been carefully internationally discussed and prepared for several years. Bill Melton and Al Berkeley are to be highly commended for taking the risk to fund the vision.

Also the Knowlet space is in Open Access for non-commercial use. I sincerely hope that seasoned investors like Bill and Al would be more imaginative than trying to monetize this site – and the others still to come – by ads only.

On potential fear of competition: let me tell everyone up-front that the authors on the paper have every intention to connect all information on important concepts via WikiProfessional, not trying to put it behind any barrier or to compete with anyone. Some may see us as a competitor to IHOP or Wikipedia pages on biomedical concepts for instance, which is not true, as you will soon see.

We are planning to add locally maintained databases on genes, such as www.dmd.nl, to the appropriate concept page in WikiProteins, much more prominently placed than today (now an indirect link via SwissProt data), and also locally maintained databases on single gene mutations, such as the growing number of Leiden Open Variation Databases (LOVDs). We have a project starting to map all concepts in WikiProfessional, including all biomedical concept pages, to the corresponding pages in Wikipedia and other emerging wikis. People who find the WikiProfessional interface too difficult will soon be able to contribute to their own wiki of choice and their contributions will be seen in WikiProfessional anyway.

We collectively ‘own’ the basic data and anyone is free to ‘add value’ to these and make that ‘added value’ freely available to all or just for public not-for-profit use. Knewco is just one of the companies that derives value from the data and has decided to make the added value available to the scientific community for free.

I cannot wait until Nature is Open Access as well, at least as far as the scientific articles are concerned. Then it will be easier to make full use of Nature content for the benefit of the scientific community.

One more point on equity and access: the collaboration with our Brazilian colleagues, with whom I co-developed and signed the Salvador Declaration on Open Access, referred to in the supplementary data of the Genome Biology paper, will soon result in crossing the language barrier to Spanish and Portuguese. The record for my beloved ‘malaria’ in Omegawiki will show you our ambition as to how many languages we would like to support with on-the-fly indexing. For free.

I hope these further explanations take away at least the worst of Euan’s fears. I see in today’s version of the blog not only that he has changed the original title of the contribution, but also a more balanced reaction to Peter-Jan Roes.

However, Euan, if you still feel that some of your comments were justified and not yet properly addressed, please substantiate your claims; it would be highly appreciated if, in the process, you gave some constructive criticism. You would really help the community – and us – by doing that. Let’s keep discussing this project to make it better.

Friday, May 30, 2008

The meanings of 'free'

I've received questions about Knewco's WikiProfessional: how free is it, and is it free as in 'free beer' or free as in 'free speech'?

Life's never simple: it's a combination of both.

WikiProfessional's million minds approach does rely on user input. That's nothing new in science – in fact, the whole scientific knowledge edifice relies on user input. The user-generated content in WikiProfessional is indeed free as in 'free speech'. The relationship-concept matrix (the knowlet-database, dynamic, relational, and constantly recalculated, reacting to any infusion of new knowledge) is also free to users, but free as in 'free beer'. It took considerable effort to develop and build it – and to maintain it – so it actually is (will be) paid for, by advertising and sponsorships we hope. The users 'pay' as in 'paying' a visit, and 'paying' attention, which we can then use to attract appropriate advertisers. (For some reason we haven't quite figured out yet how to survive on plain air, and we need to generate income to sustain our activities.)

It is important to distinguish the knowlet part and the wiki part in the WikiProfessional database. Knewco (the Knowledge Navigation and Expert Wiki Company) owns the first one and the knowlet is patented. In due time, there will be feeds available from the knowlet database to whoever wants them (or pays for them; this might typically be a premium service).

The wiki part of the database on the other hand contains publicly as well as privately available authority and community contributions. We don't 'have' those; we just use those, as anyone else can do, at least with regard to the public ones (one has to approach the 'owners', authorities – NLM, Swissprot/Uniprot, etc. – for these authoritative databases). With respect to the community annotations and contributions, those are freely available under a CC-BY licence (Creative Commons Attribution Licence), and eventually we may have this available in a suitable form for downloading. There may be a potentially fruitful collaboration with Open Progress with regard to standardizing the download/exchange format.

Meanwhile, go to WikiProfessional, use the system, give us feedback, register and contribute, and work with us on spreading scientific knowledge via collaborative intelligence.

Jan Velterop

Wednesday, May 28, 2008

A rose by any other name

"Doctors often exude an air of omniscience, but in truth they are surprisingly ignorant."
Thus began an article in this week’s Economist. Harsh language, but many a doctor, or other professional, including scientists, will recognize himself or herself in these words. The article in The Economist isn’t specifically about that, but the sense of information overload is surely a major contributory factor to this 'surprising ignorance'. After all, a lot of the information one gets to digest is ambiguous, redundant, fragmented or inconsistent, to name a few problems. As Herbert Simon, an American political scientist, once observed: “What information consumes is rather obvious: it consumes attention. Hence a wealth of information creates a poverty of attention.” The problem of the information glut in a nutshell.

Today saw the launch of an attempt to combat this abundance, redundancy, fragmentation and inconsistency: WikiProfessional.

The idea is that the combined efforts of a ‘million minds’ would be able, in a collaborative intelligence exercise, to refine a system that 'distills' the essence of established knowledge as well as points to new knowledge that has a high likelihood of being established soon. What it all entails is explained in an open access article in Genome Biology.

The concept (so to speak) is so far optimized for the life sciences and medicine, but there is no reason why it shouldn’t work in other areas as well, and in languages other than English. It is based on concepts, and those are of course valid in any language. It’s just that the words or descriptions used for them are different. As Shakespeare already noted in Romeo and Juliet: "What's in a name? That which we call a rose by any other name would smell as sweet."

Just imagine what that means. One of the beauties of the concept approach (as opposed to the keyword approach) is that search terms in one language could, for instance, yield search results in another. Think of Chinese researchers searching with Chinese terms for English literature (they can read English, but may find it more difficult to come up with search terms in English, in the same way that I find it sometimes easier to search with Dutch terms), yet getting served up with English search results. Things like that. Wonderful.
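To make the difference with keyword search a little more tangible, here is a minimal sketch of the idea, and no more than that. It assumes that terms in any language have already been mapped to language-independent concept identifiers and that documents have been annotated with those identifiers. The identifiers, the tiny term table, the sample documents and the function name `concept_search` are all made up for illustration; they do not describe how WikiProfessional actually works.

```python
# Hypothetical term table: words in any language point to the same
# language-independent concept identifier (the IDs are invented).
TERM_TO_CONCEPT = {
    "malaria": "C001",
    "疟疾": "C001",      # Chinese for malaria
    "mosquito": "C002",
    "mug": "C002",       # Dutch for mosquito
    "蚊子": "C002",      # Chinese for mosquito
}

# Hypothetical English documents, annotated with the concepts they mention.
DOCUMENTS = [
    {"title": "Malaria transmission by mosquitoes", "concepts": {"C001", "C002"}},
    {"title": "Mosquito breeding habitats", "concepts": {"C002"}},
    {"title": "Protein folding basics", "concepts": {"C003"}},
]


def concept_search(query_terms, documents=DOCUMENTS, index=TERM_TO_CONCEPT):
    """Return titles of documents that contain every concept named in the query."""
    # Translate the query words into concept identifiers, ignoring unknown words.
    wanted = {index[t.lower()] for t in query_terms if t.lower() in index}
    if not wanted:
        return []
    # A document matches when it carries all the requested concepts.
    return [d["title"] for d in documents if wanted <= d["concepts"]]
```

Calling `concept_search(["疟疾"])` or `concept_search(["mug"])` matches English documents although the query contains no English word, because the matching happens on the concept identifiers rather than on the words themselves.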

(I have to declare an interest: I’m running Knewco, the company behind WikiProfessional).

Jan Velterop

Sunday, May 25, 2008

Wiki temperatures

In the Chronicle of Higher Education Jeffrey Young reports about a 'frozen' Wikipedia being more academically useful for students than the current version, which can be – and is – edited all the time, sometimes resulting in a lot of heat. There is something tremendously attractive in having unfettered editing possibilities, but also in having stable, authoritative articles in such an extremely useful web resource as the Wikipedia. In an academic environment, one would ideally have both. WikiProfessional, which is specifically conceived for the academic and professional environment, actually gives both. On the one hand it presents stable, vetted and authoritative knowledge, yet on the other hand it gives the utterly useful and necessary option for knowledge to be supplemented and annotated in real time by anyone wishing to do so. Both the authoritative version, and community annotations and additions, are presented side-by-side. Only when annotations and additions are deemed acceptable by the professional or academic community in question – peer-reviewed in one way or another – are they elevated to the level of 'received knowledge'.

For open access, WikiProfessional presents a nice additional opportunity: 'annotations' can be links to particularly appropriate and relevant articles. And if such links were made to freely available versions of the articles in question, this would give WikiProfessional some of the functionality of a federated repository, not just enhancing an article's exposure and findability, but at the same time putting it in the right context in the Concept Web. This, in turn, may well further increase the chances of such an article being cited.

Jan Velterop

Thursday, May 15, 2008

Dealing with abundance – getting more out of the science literature than you thought possible

Open access is adding to the abundance of scientific information available to us. It is to be expected that this abundance will be growing fast, with the growth of open access. This is good, because only comprehensive and unfettered access to the science literature will make it possible for us to be truly abreast of the scientific progress that's being made.

On the other hand, however, it will present us with even more challenges than we already face in terms of being able to deal with all that information. In certain disciplines, reading all the papers relevant to our research topic means digesting thousands of papers per year – enough to fill our entire working time. Without assistance from the processing capabilities and speed of computers, we cannot hope to keep up with emerging trends in our chosen fields.

Few scientists can properly cope with mushrooming information, and were they to read all the articles relevant to them, they would find that they almost always contain a very large amount of information already known to them. That redundant information is usually provided for the sole purpose of context and readability. The amount of actual new information is often surprisingly small and could have been conveyed in one or two sentences if the context were clear. Yet the essence of the scientific discourse is captured in those few sentences. The surrounding text of articles is, if you wish, the packaging in which the essence is transported, analogous to the mass of fluffy stuff that surrounds a breakable item being shipped: emballage.

At Knewco, the company that I now work for, we aim to provide an environment for concentrating this scientific discourse – 'distilling' it from the abundance of sources, if you wish – and make it more productive by making it computer-processable. Very few scientists can read and digest all the articles and database entries that they would need to read and digest in order to synthesize the essence of the knowledge they need. So what we do is to enable and foster collaborative intelligence between machine processing power and human brainpower. Knewco 'distills' information to the essence of knowledge content from millions of documents, enriching it in the process with linked concepts and context.

This is not the same as making it possible to locate the one right document out of the abundance available. It is identifying 'atoms' of knowledge about a given concept from the literature and combining these atoms into 'molecules' of knowledge (we call those "knowlets" – a knowlet connects facts). Just as a graph can give you in one glance the essence of an enormous array of numbers, the knowlet gives you the essence of an enormous amount of scientific literature. It's like reading out of a picture instead of text. And as "a picture is worth more than a thousand words", a knowlet could be said to be worth more than the text of a thousand articles. Knowledge redesigned, as it were.
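For readers who think more easily in data structures, the 'atoms into molecules' idea can be caricatured in a few lines. This is only an illustrative sketch under my own simplifying assumption that an 'atom' is a single statement linking two concepts, and that a 'molecule' is the weighted set of everything a concept is linked to. The sample atoms and the name `build_knowlets` are invented for the example; the actual knowlet algorithms are proprietary and certainly more sophisticated than this.

```python
from collections import Counter, defaultdict

# Hypothetical 'atoms': one (concept, concept) pair per statement found in a source.
ATOMS = [
    ("malaria", "Plasmodium falciparum"),
    ("malaria", "mosquito"),
    ("Plasmodium falciparum", "malaria"),
    ("malaria", "chloroquine"),
    ("mosquito", "malaria"),
]


def build_knowlets(atoms):
    """Combine pairwise atoms into one weighted neighbourhood per concept."""
    knowlets = defaultdict(Counter)
    for a, b in atoms:
        # Treat the association as symmetric and count every supporting statement.
        knowlets[a][b] += 1
        knowlets[b][a] += 1
    return knowlets


knowlets = build_knowlets(ATOMS)
```

Here `knowlets["malaria"]` holds every concept that malaria has been linked to, with a count of supporting statements, so a single lookup stands in for reading all the sources from which the atoms were taken.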

Perhaps more importantly, since a knowlet is a computer artifact, it can be used to identify related information, predict trends and intersections in data (see it as a kind of topology of knowledge), be used in combination with other knowlets of more complex concepts, and be updated in real time to keep information current up to the minute.

For technology of this kind to be optimally effective for scientific knowledge discovery, access to the literature is not sufficient by itself. It goes without saying that the source documents must be computer-readable to be optimally usable. Publishers as well as repositories may wish to take this to heart if they are serious about helping to speed up the pace of scientific progress.

Jan Velterop

Friday, March 14, 2008

Onwards from open access

As many of my readers will already know, I have recently decided to leave my position of Director of Open Access at Springer for that of CEO of Knewco Inc. Several reactions that I have since received indicate to me that my move is not necessarily understood by everyone, and I’ve even seen speculations that my leaving open access might mean that it is not going anywhere at Springer.

Let me say the following to that. First of all, OA has developed some very solid roots within Springer and I am most confident that OA is being further developed with alacrity by my successors at Springer.

Secondly, I don’t feel that I am leaving open access. Open access is not some club that one is a member of or not; it is a 'thought form' that one adheres to. And open access is only one of the ways in which the speed, efficiency and quality of scientific discovery can be enhanced.

Looking back on my career, I feel that my motives haven’t changed much. When I was working on IDEAL/APPEAL* (at Academic Press) in 1994-95 and later, I did this on the premise that there must be better ways to disseminate the research papers published in journals than just via relatively small numbers of subscriptions. The IDEAL concept (derided at first, but then imitated by just about all publishers, and often nicknamed BigDeal) was brought about by the realisation that if access to electronic journal articles could be pooled by larger numbers of institutions, then for the same publisher’s income – the same cost therefore to the academic community – the articles would be accessible to vastly more researchers. If ever the cliché win-win was appropriate, it was here.

Open access logically follows on from that. The challenge was – still is – to find appropriate economic models to sustain professional scientific publishing with open access. The recently agreed arrangements between Springer and the Max Planck Gesellschaft, the UKB (all the Dutch universities plus the Royal Library), and Göttingen University, may point to a way forward. All articles from these institutions in Springer journals are published with open access under these arrangements.

If the underlying motive is, however, to get the most out of the scientific knowledge that has been gathered, which it is in my case, then moving on from open access to the semantic web – the concept web, if you wish – feels, at least to me, an entirely logical step. Not all knowledge after all is captured in journal articles. There is much more besides those, in databases, for instance, and in less formal web conversations. (A case can even be made that journal publishing ‘destroys’ data, for instance by reducing them to simple pixels in graphs, taking away the underlying richness of the data). Also, the connections between knowledge fragments are not always easily made purely by reading journal articles, in many areas a problem exacerbated by the sheer numbers of articles published. And all relevant. We are in a situation of overwhelming – and growing – abundance of scientific information, and methods that deal with that abundance are clearly needed. This is what Knewco people are working on, and I am very excited to join them.

Jan Velterop

*IDEAL: International Desktop Electronic Access Library – APPEAL: Academic Press Print and Electronic Access Licence