Publisher: Penn State School of Information Science & Technology
URL: http://citeseer.ist.psu.edu/
Cost: free
Tested: October 20-25, 2005
This more-than-evolutionary service started in 1997 as a pilot project on autonomous citation indexing. It has finally received some well-deserved NSF funding, which makes it feasible to enhance and run this superb service for at least four more years.

THE CONTEXT

Much too often, interesting pilot projects fade out after the initial grant money runs out. Luckily, earlier this summer the NSF awarded a $1.2 million grant to the Penn State University School of Information Sciences and Technology (IST) and the University of Kansas to enhance and improve the original project. The grant is much deserved in light of the direct utility and the inspirational value of CiteSeer. It is a minor sum compared to the millions that the ill-conceived and very poorly implemented PubScience project received a few years ago in the form of a congressional appropriation; that project produced hardly anything novel and left absolutely nothing usable after its few years of existence.

Beyond
the huge multidisciplinary commercial citation indexes, Web of Science and Scopus, which I reviewed earlier, there are a few other information retrieval projects which offer novel, powerful open access services related to scientific literature in specific disciplines or groups of disciplines. These include arXiv.org, operated by Cornell University, which covers astronomy, physics, mathematics, computer science and quantitative biology; the Astrophysics Data System (ADS), sponsored by NASA and run by the Harvard-Smithsonian Center for Astrophysics; and the Research Papers in Economics (RePEc) archive in its various flavors, operated by various not-for-profit organizations. Faculty at Southampton University have made invaluable contributions in developing the essential software for self-archiving, collection analysis, and citation analysis, not to mention their relentless and effective evangelizing to promote the self-archiving movement.

Then
there is the multidisciplinary Google Scholar service, but behind its pretty (inter)face there is such grave brain damage that many of its hits are misses. No wonder, as the Google Scholar software is unable to carry out even the most elementary Boolean OR operation. For the term deception it reports 86,000 hits, for deceptive the hit count is 47,700, and for the search deception OR deceptive it finds 68,600 hits, fewer than for deception alone. The situation is worse when it comes to reporting the citedness scores. Before you include Google Scholar in your Thanksgiving blessing (Tim, is it blessing?), do your own testing to find out what is behind the appealing façade, or check out my recent illustrated storybook for some examples I prepared for an interview with an editor of The Scientist about counting on Google Scholar’s hit counts and citation
counts.

THE CONTENT

CiteSeer (which was likely the model for Google Scholar) started out in 1997 with this good name, then switched to ResearchIndex, then switched back to the original. It currently offers its services, including sophisticated citation searching options, based on nearly 1 million documents. The documents were collected and processed from the open access Web. They are self-archived papers, in preprint and/or reprint versions. CiteSeer stands out by offering the full text of (almost) all the documents. The size of the database in and of itself is impressive, and the instant access to the source documents makes it immensely useful. This instant access concept certainly limited the scope of the database, but it is already huge, and has grown at an impressive rate in the past 8 years.

Beyond the instant access, there was another filter applied in collecting the computer science related papers. It also reduced the scope of the collection, but certainly increased its quality. Only papers in PDF and PostScript formats have been collected. In computer science, where these two formats are by far the most common, this is not as restrictive as it may sound. The inclusion of papers in HTML and Word formats could have increased the size of the collection, but it would have lowered its quality by picking up from the open Web far less relevant papers, such as those posted by undergraduate students in introductory distance education computer science courses offered by one of the purely online universities.

The vast majority of the source documents in CiteSeer are conference papers, considered by many computer scientists to be the most precious type of information source, primarily by virtue of their currency and their accounts of novel, experimental techniques which are less favored by editors of scholarly journals. The content of this database can best be described by following a simple search on the topic of citation indexing.
The space between the query words implies exact phrase searching, and the search finds 57 articles. The items on the result list are sorted in decreasing order of citedness. Clicking on the title of a paper brings up a much enhanced bibliographic record. Beyond the traditional content of author, title, source name, and other publication data, it offers many (a little too many) additional links to the full text of the document in a variety of locations and file formats. This page also offers informative excerpts from a variety of lists of the cited, citing and otherwise related papers, with their citedness indicators, before making the complete lists available. This is an awesomely information-rich but very dense page, and it badly needs some illustrated help information to enlighten and guide novice users. Some labels and snippets are self-explanatory, but others are enigmatic. I can’t even start to explain all the features in this space, but the article itself [cs3] is only a click away (especially if you choose the PDF version) and provides good, detailed background for those who don’t want to click on the links and explore them on their own unprepared.

The references cited by this article appear in order of their citedness, not in the order in which they are presented in the original. I find this very useful (and very rare), as this ranking immediately provides a hint as to which of them may be the most relevant for the topic. It would be very helpful if the citing articles were also listed in decreasing order of citedness. You must, of course, take this with a grain of salt, as citation frequency also depends on the age of the documents. An older document has had a longer time, and so a higher chance, to acquire citations than a recent one. Then again, these talented researchers (the authors of the source article) could easily use (maybe behind the scenes) a relative citedness score for the citing and cited documents as I described here.
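One simple way to offset the age bias is to divide each citation count by the number of years the paper has been available. The sketch below only illustrates that idea. It is not the relative citedness score referred to above, nor anything CiteSeer is known to compute; the `Paper` records, the field names and the `citations_per_year` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    year: int        # publication year
    citations: int   # raw citation count


def citations_per_year(paper: Paper, current_year: int) -> float:
    """Age-normalized citedness: raw count divided by years in print.

    A paper published in the current year counts as one year old,
    which also avoids division by zero.
    """
    age = max(current_year - paper.year + 1, 1)
    return paper.citations / age


def rank_by_relative_citedness(papers, current_year):
    """Return the papers sorted by decreasing citations per year."""
    return sorted(
        papers,
        key=lambda p: citations_per_year(p, current_year),
        reverse=True,
    )


# Hypothetical records, invented for illustration only.
papers = [
    Paper("Older paper", 1997, 90),
    Paper("Recent paper", 2004, 40),
]

ranked = rank_by_relative_citedness(papers, current_year=2005)
for p in ranked:
    print(p.title, citations_per_year(p, 2005))
```

With these made-up numbers the older paper has more raw citations, yet the recent one ranks first once the counts are spread over the years each paper has had to collect them.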
The citedness scores may not be as high as in Web of Science and Scopus because CiteSeer analyzes “only” the nearly 1 million papers it has collected, whereas the two commercial citation indexing databases have citation-enhanced records for about 37 million and 27.5 million source documents, respectively. (For WoS the number refers to the 1945-2005 edition.) Then again, I did not find in CiteSeer any phantom citing papers, or grossly deflated hit counts with often misidentified cited documents for topical searches, as I did (and you would) in Google Scholar, which leads me to the software issues. But before the transition, let me emphasize one quintessential advantage of CiteSeer: you also get access to the source documents (with some exceptions) with no fuss, no muss, even if your library doesn’t have a link resolver, because CiteSeer has a copy of the source document. This is true of Google Scholar only to a far lesser extent.

THE SOFTWARE

CiteSeer has ultra high-brow software, way beyond what end-users would see directly. Actually, what the end-users see may not be as tender an interface as you see in most web-wide search engines, and it has no help information (which is a sin). This may make it look user-unfriendly or, if you don’t need the sensitivity talk, more “tough love” in style. What it lacks in user-friendliness it makes up for in smartness, especially in selecting high quality sources, and in normalizing and standardizing the terribly inconsistent, incomplete, and inaccurate citations prevalent in every scholarly field.

You
can see the latter directly if you click on the Check button as you look up the page of the citing references. It displays the many variant and incomplete citation formats which CiteSeer correctly identifies as the source document in question. If you want to see the full list from here, just click.
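The idea of recognizing variant citations as the same document can be pictured with a toy sketch. It is far cruder than whatever CiteSeer actually does, which this review does not describe: it only groups titles that become identical after lowercasing and stripping punctuation. The `title_key` and `group_variants` helpers and all the sample titles are invented for illustration.

```python
import re
from collections import defaultdict


def title_key(title: str) -> str:
    """Lowercase the title and turn every run of non-alphanumeric
    characters into a single space, so that differences in case,
    punctuation and spacing disappear."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def group_variants(titles):
    """Group title strings that share the same normalized key."""
    groups = defaultdict(list)
    for title in titles:
        groups[title_key(title)].append(title)
    return dict(groups)


# Invented titles as they might appear in different reference lists.
cited_titles = [
    "Automatic Indexing of Research Papers: A Case Study",
    "Automatic indexing of research papers - a case study.",
    "AUTOMATIC INDEXING OF RESEARCH PAPERS, A CASE STUDY",
    "Searching the Open Web for Preprints",
]

groups = group_variants(cited_titles)
for key, variants in groups.items():
    print(len(variants), key)
```

Exact matching on a normalized key would miss truncated titles, misspellings and citations that omit the title altogether, which are precisely the incomplete and erroneous cases a real system has to handle.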
I have been using CiteSeer for a long time, but I have never seen a misidentified source document. In preparing for the test for this Gale review, I spotted one citing reference which was not collocated with the other ones which cited the same document. I failed to capture it then and could not reconstruct the search.

CiteSeer has perfected, within reasonable limits, the process of recognizing and consolidating matching records for incomplete and/or partially erroneous citations. It can also locate the references in the full text (not merely in the footnotes) for many of the documents, in about 60-65% of the cases in my test. You can also see this feature, which again is a highly sophisticated one, if you click on the Context button of the record of the cited document. The reference to the cited document appears in boldface with parts of the preceding and following text of the paragraph. If the document is referred to more than once in the citing document, each reference is shown in the Context (also known as Details) format.

There are many other gems in CiteSeer which are worth polishing; therefore the latest news about the NSF funding is especially encouraging. The original project was developed at the NEC Research Institute, which deserves credit for it. At that time all three researchers worked for NEC. Since then Steve Lawrence has left for Google (I could not trace Bollacker), and Lee Giles went to the School of IST at Penn State University. I believe it is the youngest of the library and information science and technology programs in the country. But young does not mean immature. Actually, at least three of the most mature of the researchers specializing in the analysis of Web-wide search engines now work for IST. Beyond Giles, Amanda Spink and Jim Jansen are members of the IST School who published (along with the outstanding, long-time Rutgers professor Tefko Saracevic) the most insightful, fact-laden articles about users’ search strategies and tactics, based on several projects that were exceptionally large in terms of user population. IST is one of the recipients of the grant.
In light of past performance, that group is a guarantee that the funds for the project known as Next Generation CiteSeer will be well used.

Of course, I regret even more that only a relatively small amount was awarded for this project. CiteSeer showed a working example of the revolutionary new method of autonomous citation indexing, which is done without human indexing, does not require enormously expensive journal subscription and processing investments, and can be ported to other disciplines. It needs powerful brains and time to do the demanding system analysis, programming, implementation and monitoring tasks. I am sure that the substantial research and development will create a superb tool for the next generation of researchers beyond computer science, and complement the commercial indexing services, which have much stronger journal coverage for a much longer period of time and a far wider scope.

As for Google Scholar, I hope it is not becoming the jack of all trades, master of none, and will be as good in handling finely structured data served to it on a silver platter by many publishers, and in understanding the essence and nuances of citation indexing, as the generic Google software has been in handling the gigantic, unstructured hodge-podge of the generic WWW. This may need the 20% free time of Steve Lawrence, one of the developers of CiteSeer, and now an employee of Google, Inc.
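The Boolean OR anomaly reported earlier in this review can be stated as a simple bounds check, which anyone can apply when doing their own testing. The sketch below assumes the reported hit counts are meant as exact counts of matching records; the function name is hypothetical, and the three numbers are the ones quoted above.

```python
def or_count_is_plausible(count_a: int, count_b: int, count_a_or_b: int) -> bool:
    """Check the set-theoretic bounds on the hit count of an OR query.

    The union of two result sets can be no smaller than the larger set
    and no larger than the two sets combined.
    """
    lower = max(count_a, count_b)
    upper = count_a + count_b
    return lower <= count_a_or_b <= upper


# Hit counts quoted in this review for Google Scholar.
deception = 86_000
deceptive = 47_700
deception_or_deceptive = 68_600

print(or_count_is_plausible(deception, deceptive, deception_or_deceptive))
```

The check fails for these figures because the OR search returns fewer hits than the search for deception alone.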