This more than evolutionary service started in 1997 as a pilot project on autonomous citation indexing. It has finally received some well-deserved NSF funding, which makes it feasible to enhance and run this superb service for at least four more years.



Much too often, interesting pilot projects fade out after the initial grant money runs out. Luckily, earlier this summer the NSF awarded a $1.2 million grant to the Penn State University School of Information Sciences and Technology (IST) and the University of Kansas to enhance and improve the original project. It is a much-deserved grant in light of the direct utility and the inspirational value of CiteSeer. It is also a minor one compared to the millions that the ill-conceived and very poorly implemented PubScience project received a few years ago in the form of a congressional appropriation; that project produced hardly anything novel and left absolutely nothing usable after its few years of existence.

Beyond the huge multidisciplinary commercial citation indexes, Web of Science and Scopus, which I reviewed earlier, there are a few other information retrieval projects which offer novel, powerful open access services related to the scientific literature of specific disciplines or groups of disciplines. These include arXiv.org, operated by Cornell University, which covers astronomy, physics, mathematics, computer science and quantitative biology; the Astrophysics Data System (ADS), sponsored by NASA and run by the Harvard-Smithsonian Center for Astrophysics; and the Research Papers in Economics (RePEc) archive in its various flavors, operated by various not-for-profit organizations. Faculty at Southampton University have made invaluable contributions by developing the essential software for self-archiving, collection analysis, and citation analysis, not to mention their relentless and effective evangelizing for the self-archiving movement.

Then there is the multidisciplinary Google Scholar service, but behind its pretty (inter)face there is such grave brain damage that many of its hits are misses. No wonder, as the Google Scholar software is unable to carry out correctly even the most elementary Boolean OR operation. For the term deception it reports 86,000 hits, for deceptive the hit count is 47,700, and for the search deception OR deceptive it finds 68,600 hits, fewer than for deception alone, although an OR search can never return fewer hits than either of its terms. The situation is worse when it comes to reporting the citedness scores. Before you include Google Scholar in your Thanksgiving blessing, do your own testing to find out what is behind the appealing façade, or check out my recent illustrated story book for some examples I prepared for an interview with an editor of The Scientist about counting on Google Scholar’s hit counts and citation counts.
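The hit counts quoted above can be checked with simple set arithmetic: the result of A OR B must be at least as large as the larger of the two individual result sets, and no larger than their sum. Here is a minimal sketch of that sanity check in Python; the function name is my own, for illustration only, and the three numbers are the ones reported above.

```python
def or_count_is_plausible(count_a: int, count_b: int, count_a_or_b: int) -> bool:
    """Return True if a reported hit count for `A OR B` is logically possible.

    The union of two result sets can never be smaller than the larger set
    and can never be larger than the two sets added together.
    """
    lower_bound = max(count_a, count_b)
    upper_bound = count_a + count_b
    return lower_bound <= count_a_or_b <= upper_bound


# Hit counts reported by Google Scholar at the time of the test
deception = 86_000
deceptive = 47_700
deception_or_deceptive = 68_600

print(or_count_is_plausible(deception, deceptive, deception_or_deceptive))
```

Any search service can be probed the same way: run the two single-term searches, run the OR search, and compare the three counts.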



CiteSeer (which was likely the model for Google Scholar) started out in 1997 with this good name, then switched to ResearchIndex, then switched back to the original. It currently offers its services, including sophisticated citation searching options, based on nearly 1 million documents.

The documents were collected and processed from the open access Web. They are self-archived papers, in their preprint and/or reprint versions. CiteSeer stands out by offering the full text of (almost) all the documents. The size of the database in and by itself is impressive, and the instant access to the source documents makes it immensely useful. This instant access concept certainly limited the scope of the database, but it is already huge, and it has grown at an impressive rate in the past 8 years.

Beyond the instant access, another filter was applied in collecting the computer science related papers. It also reduced the scope of the collection, but certainly increased its quality. Only papers in PDF and PostScript formats have been collected. In computer science, where these two formats are by far the most common, this is not as restrictive as it may sound. Including papers in HTML and Word formats could have increased the size of the collection, but it would have lowered its quality by picking up from the open Web far less relevant papers, such as those posted by undergraduate students in introductory distance education computer science courses offered by one of the purely online universities.

The vast majority of the source documents in CiteSeer are conference papers, considered by many computer scientists to be the most precious type of information source, primarily by virtue of their currency and their accounts of novel, experimental techniques, which are less favored by editors of scholarly journals.

The content of this database can best be described by following a simple search on the topic of citation indexing. The space between the query words implies exact phrase searching, and the search finds 57 articles. The items on the result list are sorted in decreasing order of citedness. Clicking on the title of a paper brings up a much enhanced bibliographic record. Beyond the traditional content of author, title, source name, and other publication data, it offers many (a little too many) additional links to the full text of the document at a variety of locations and in different file formats. This page also offers informative excerpts from a variety of lists about the cited, citing and otherwise related papers, along with their citedness indicators, before making the complete lists available.

This is an awesomely information-rich but very dense page, and it badly needs some illustrated help information to enlighten and guide novice users. Some labels and snippets are self-explanatory, but others are enigmatic. I can’t even start to explain all the features in this space, but the article itself [cs3] is only a click away (especially if you choose the PDF version) and provides good, detailed background for those who don’t want to click on the links and explore them on their own unprepared.

The references cited by this article appear in the order of their citedness, not in the order in which they are presented in the original. I find this very useful (and very rare), as this ranking immediately provides a hint as to which of them may be the most relevant for the topic. It would be very helpful if the citing articles were also listed in decreasing order of citedness.

You must, of course, approach this with a grain of salt, as citedness frequency also depends on the age of the documents. An older document has had a longer time and a higher chance to acquire citations than a recent one. Then again, these talented researchers (the authors of the source article) could easily use (maybe behind the scenes) a relative citedness score for the citing and cited documents as I described here.
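To illustrate the age effect, here is a minimal sketch in Python of one possible age-adjusted ranking. It assumes that relative citedness simply means citations per year since publication; that definition is my stand-in for this example and may well differ from the score referred to above. The `Paper` class, the function names, and the two sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    year: int        # publication year
    citations: int   # raw citedness count


def citations_per_year(paper: Paper, current_year: int) -> float:
    """Age-adjusted citedness: raw citations divided by years in print.

    Assumption: the publication year itself counts as one full year,
    so the divisor is never zero.
    """
    age = max(current_year - paper.year + 1, 1)
    return paper.citations / age


def rank_by_relative_citedness(papers: list, current_year: int) -> list:
    """Sort papers by citations per year, highest first."""
    return sorted(
        papers,
        key=lambda p: citations_per_year(p, current_year),
        reverse=True,
    )


# Hypothetical records, invented for the illustration
papers = [
    Paper("Older paper", 1995, 110),
    Paper("Recent paper", 2003, 45),
]

for p in rank_by_relative_citedness(papers, current_year=2005):
    print(p.title, round(citations_per_year(p, 2005), 1))
```

With raw counts the older paper would lead the list; once each count is divided by the years in print, the recent paper moves ahead of it.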

The citedness scores may not be as high as in Web of Science and Scopus because CiteSeer analyzes “only” the nearly 1 million papers it has collected, whereas the two commercial citation indexing databases have citation-enhanced records for about 37 million and 27.5 million source documents, respectively. (For WoS the number refers to the 1945-2005 edition.) Then again, I did not find in CiteSeer any phantom citing papers, or grossly deflated hit counts with often misidentified cited documents for topical searches, as I did (and you would) in Google Scholar, which leads me to the software issues.

But before the transition, let me emphasize one quintessential advantage of CiteSeer: you also get access to the source documents (with some exceptions) with no fuss, no muss, even if your library doesn’t have a link resolver, because CiteSeer has a copy of the source document. This is true of Google Scholar only to a far lesser extent.



CiteSeer has ultra-highbrow software, way beyond what end-users would see directly. Actually, what the end-users see may not be as tender an interface as you see in most Web-wide search engines, and it has no help information (which is a sin). This may make it look user-unfriendly or, if you don’t need the sensitivity talk, more “tough love” in style.

What it lacks in user-friendliness, it makes up for in smartness, especially in selecting high-quality sources, and in normalizing and standardizing the terribly inconsistent, incomplete, and inaccurate citations prevalent in every scholarly field.

You can see the latter directly if you click on the Check button as you look up the page of the citing references. It displays the many variant and incomplete citation formats which CiteSeer correctly identifies as referring to the source document in question. If you want to see the full list from here, just click. I have been using CiteSeer for a long time, but I have never seen a misidentified source document. In preparing the test for this Gale review, I spotted one citing reference which was not collocated with the other ones which cited the same document. I failed to capture it then and could not reconstruct the search.
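To give a feel for what consolidating variant citations involves, here is a deliberately simplified sketch in Python. It is not CiteSeer’s algorithm, which is not described in this review; it merely groups citation strings whose word overlap (Jaccard similarity) exceeds a threshold. The function names, the 0.4 threshold, and the sample citations are all invented for the illustration.

```python
import re


def tokens(citation: str) -> frozenset:
    """Lowercased alphabetic words longer than two letters.

    Punctuation, years, page numbers, and initials are dropped,
    because they vary the most between citation styles.
    """
    return frozenset(w for w in re.findall(r"[a-z]+", citation.lower()) if len(w) > 2)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Share of words two citations have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_variants(citations: list, threshold: float = 0.4) -> list:
    """Greedily group citations that look like variants of the same reference.

    Each citation is compared with the first member of every existing group
    and joins the first group it resembles closely enough.
    """
    groups = []  # list of (token set of first member, member citations)
    for citation in citations:
        words = tokens(citation)
        for representative, members in groups:
            if jaccard(representative, words) >= threshold:
                members.append(citation)
                break
        else:
            groups.append((words, [citation]))
    return [members for _, members in groups]


# Hypothetical citation strings, made up for this example
citations = [
    "Smith, J., Lee, K. A Study of Widget Alignment. Proceedings of the Widget Conference, 1999.",
    "J. Smith and K. Lee, A study of widget alignment, Proc. Widget Conf., 1999",
    "Smith & Lee (1999) Study of widget alignment",
    "Brown, A. Indexing gadgets at scale. Journal of Gadgets, 2001.",
]

for group in group_variants(citations):
    print(len(group), "variant(s):", group[0])
```

A production system would need far more than word overlap (author name parsing, year and venue matching, tolerance for typos), which is exactly why the accuracy I observed in CiteSeer is so impressive.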

CiteSeer has perfected, within reasonable limits, the process of recognizing and consolidating matching records for incomplete and/or partially erroneous citations. It can also locate the references in the full text (not merely in the footnotes) for many of the documents, in about 60-65% of the cases in my test. You can see this highly sophisticated feature, too, if you click on the Context button of the record of the cited document. The reference to the cited document appears in boldface, with parts of the preceding and following text of the paragraph. If the document is referred to more than once in the citing document, each reference is shown in the Context (also known as Details) format. There are many other gems in CiteSeer which are worth polishing, so the latest news about the NSF funding is especially encouraging.
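The Context display can be pictured with a small sketch in Python. It assumes the citing document uses a bracketed marker such as [3] for the reference and simply cuts out a window of text around every occurrence; how CiteSeer actually locates references in the full text is not covered here. The function name, the window size, and the sample sentence are hypothetical.

```python
import re


def citation_contexts(text: str, marker: str, window: int = 60) -> list:
    """Return a snippet of surrounding text for every occurrence of a marker.

    The marker is wrapped in ** ** as a plain-text stand-in for boldface.
    """
    contexts = []
    for match in re.finditer(re.escape(marker), text):
        start = max(match.start() - window, 0)
        end = min(match.end() + window, len(text))
        before = text[start:match.start()]
        after = text[match.end():end]
        contexts.append(f"...{before}**{marker}**{after}...")
    return contexts


# Made-up passage in which the same reference is cited twice
text = (
    "Autonomous indexing was introduced earlier [3]. Later work extended "
    "the approach to other fields, again building on [3] for the parsing step."
)

for snippet in citation_contexts(text, "[3]", window=40):
    print(snippet)
```

A document cited twice yields two snippets, which mirrors the repeated entries you see in the Context format.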

The original project was developed at the NEC Research Institute, which deserves credit for it. At that time all three researchers worked for NEC. Since then Steve Lawrence has left for Google (I could not trace Bollacker), and Lee Giles has gone to the School of IST at Penn State University. I believe it is the youngest of the library and information science and technology programs in the country. But young does not mean immature. Actually, at least three of the most mature of the researchers specializing in the analysis of Web-wide search engines now work for IST. Beyond Giles, Amanda Spink and Jim Jansen are members of the IST School; they published (along with the outstanding, long-time Rutgers professor Tefko Saracevic) the most insightful, fact-laden articles about users’ search strategies and tactics, based on several projects that were exceptionally large in terms of user population.

IST is one of the recipients of the grant. In light of past performance, that group is a guarantee that the funding for the project known as Next Generation CiteSeer will be well used.

Of course, I regret all the more that only a relatively small amount was awarded for this project. It showed a working example of the revolutionary new method of autonomous citation indexing, which is done without human indexing, does not require the enormously expensive journal subscription and processing investments, and can be ported to other disciplines. It needs powerful brains and time to do the demanding system analysis, programming, implementation and monitoring tasks. I am sure that the substantial research and development will create a superb tool for the next generation of researchers beyond computer science, and will complement the commercial indexing services, which have much stronger journal coverage for a much longer period of time and a far wider scope.

As for Google Scholar, I hope it is not becoming the jack of all trades, master of none, and that it will be as good at handling the finely structured data served to it on a silver platter by many publishers, and at understanding the essence and nuances of citation indexing, as the generic Google software has been at handling the gigantic, unstructured hodge-podge of the generic WWW. This may need the 20% free time of Steve Lawrence, one of the developers of CiteSeer, and now an employee of Google, Inc.

