Peter's Digital Reference Shelf

Title: Scopus revisited

Publisher: Elsevier

URL: http://www.scopus.com

Cost: to be negotiated

Tested: continuously

The good and enhanced software features, especially the swift and generous output functions, keep the rivalry with the Web of Science/Web of Knowledge systems alive, but unresolved problems of source coverage undermine the most prominent claim of Scopus about its broad source base.

THE CONTEXT

We know that competition favors the customers, and we can see the benefit of the neck-and-neck competition between Elsevier's Scopus and the Web of Science/Web of Knowledge systems of Thomson's Institute for Scientific Information. Beyond these two information industry heavyweights there are others who have hopped on the bandwagon of citation-based and citation-enhanced searching systems. They mostly focus on the academic literature of a specific discipline, such as the CiteSeer system (computer and information science and technology), SMEAL Search (business literature), EconPapers (economics), and PubMed Central (health literature). All these are full-text databases, which include the cited references in the source, which in turn provide the backbone of the web of citing/cited references within their mostly open access collections.

The American Psychological Association enhanced PsycINFO, its indexing/abstracting database of 2,000,000 records, with a total of 200 million cited references in 510,000 records as of May 2006. From 2001 onward almost all of the records are enhanced by cited references where applicable. Before that year the enhancement of records is selective, and for some years very selective. It is odd that of the articles published in 1986 only 36 were enhanced, while for 1987 and 1985 the numbers of enhanced records are 957 and 742, respectively, still small numbers in light of the total yearly intake of about 50,000 records in those years.

It is another question that few of the aggregators bring the best out of PsycINFO and the other citation-enhanced databases they host. CSA, WoK, and Ebsco show promising examples in this category.

Of the digital facilitators, HighWire Press went the furthest by creating a sophisticated yet visually pleasing and informative citation system based on the cited references in the 3.4 million full-text articles in journals hosted by HighWire Press. This multidisciplinary digital archive offers open access to more than 1.2 million articles courtesy of the publishers and HighWire Press.

The Japan Science and Technology Agency (JST) recently enhanced and upgraded its full-text digital archive of publications by Japanese researchers. The multidisciplinary archive, which has a strong science focus, is split into two databases, J-STAGE and journal@archive. It has a quarter million scholarly papers and offers open access to more than 200,000 articles, mostly in English. Nearly 30,000 of them have citedness scores derived from, and links to, 43,000 papers within the archive. I got these statistics from Mitsutoshi Wada. These two databases are now prime resources for learning about research projects conducted in Japan.

Google Scholar is also a multidisciplinary database. It does not provide any tangible information about its source coverage, time span, number of records, or number of cited references, and it dispenses grossly inflated hit counts and phantom citation counts. It can't reliably identify matching citing and cited reference pairs. Sure, the service is free, and it can direct the users to some open access full-text materials courtesy of the publishers and of the authors who self-archived their papers. More about its pros and cons as of mid-2006 in an upcoming review.

Suffice it to say that Google Scholar can be fine for casual searching and for tracing related materials, as long as you don't get too overjoyed, don't lose your mind, don't declare Google Scholar equivalent to either WoS or Scopus, and don't take its counts seriously, as Pauly and Stergiou did in their scarily incompetent research report. In turn, their "findings" will likely be quoted by those who are equally incapable of understanding the serious deficiencies of Google Scholar but keep blogging and chatting about it without substance, like entertainment industry reporters.

MSN launched the Live Academic database in April but has not enhanced it with citedness scores, which is good because the database has serious problems with more elementary issues, as I discussed in my review.

THE CONTENT

The Scopus Web site consists of the Scopus database itself, which is well integrated with two other components: a database of more than 13 million patent records, and a database of nearly 220 million records for other Web pages. I discuss here only the Scopus component itself, but the other two components are also noteworthy because of their aggregation and integration in the Scopus service. Much of the content of these two databases is also available independently and freely through Scirus, which keeps getting better and better.

Database size and record types

Elsevier claims that Scopus now has 27.7 million abstract records, about 1.7 million more than when I tested it in the summer of 2004. My test confirmed this number, producing 27,758,413 hits. However, not all of these have abstracts: the 27.7 million figure refers to the total number of indexing records, and my test found close to 20 million records with abstracts.

Elsevier does not reveal how many records have been enhanced by cited references, but my test indicates that more than 9,476,000 records have these important value-added data elements. It is a realistic number considering that not all scholarly articles cite other works, and most papers in trade publications do not cite other works.

Elsevier makes it clear that only records for primary documents published since 1996 have been enhanced with cited references. These represent about 12.7 million records within Scopus. Actually, about 7,000 records for pre-1996 papers also have cited references. As for the total number of cited references in the database, Elsevier claims to have 245 million, which seems to be a reasonable number. The indexing/abstracting records go back to 1965, with a smattering of records for pre-1965 publications. It must be borne in mind, however, that the most recent 10 years provide almost half of the records in the Scopus database.
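As a rough plausibility check on these figures, the arithmetic can be sketched in a few lines of Python. The inputs are the numbers quoted in this section (my test count of 9,476,000 enhanced records, the roughly 12.7 million records for documents published since 1996, and the 245 million cited references claimed by Elsevier). The variable names are mine, and the per-record average assumes that practically all cited references sit in the enhanced records.

```python
# Figures quoted in the review; the variable names are illustrative only.
enhanced_records = 9_476_000      # records found with cited references in my test
records_since_1996 = 12_700_000   # records for documents published since 1996
cited_references = 245_000_000    # total cited references claimed by Elsevier

# Share of the post-1996 records that carry cited references.
# (The roughly 7,000 enhanced pre-1996 records are ignored as negligible.)
enhanced_share = enhanced_records / records_since_1996

# Average length of the reference list among the enhanced records.
refs_per_enhanced_record = cited_references / enhanced_records

print(f"Enhanced share of post-1996 records: {enhanced_share:.1%}")
print(f"Cited references per enhanced record: {refs_per_enhanced_record:.1f}")
```

Roughly three quarters of the post-1996 records carrying a reference list of about 26 items on average does not look out of line for a database dominated by journal papers, which is consistent with calling the 245 million figure reasonable.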

Document types and topical focus

The core of Scopus itself is still represented by journal papers (including research papers, review papers, short notes, and letters to the editors). There are records for 38,00 conference papers (Scopus still refers to them erroneously as conference reviews), a very useful source type in some disciplines, like computer science. There are records for 14,500 other research reports, and for 20,000 books. The composition by document type did not change significantly.

As for the topical coverage, medicine, engineering and biochemistry are dominant. This is understandable, as many of the source records come from Elsevier's largest indexing/abstracting databases: EMBASE, Compendex, and BIOBASE. Records of physics and astronomy literature gained a larger share in the segment of Scopus covering the past few years, along with papers covering materials science, chemistry and agriculture. The social sciences represent less than 3% of the records (not counting records for the literature of psychology, business and economics), and their share also grew in the past 2 years. Arts and humanities remained at a low 0.3% share.

The source base

The most current blurb of Scopus claims that it includes more than 15,000 peer-reviewed journals, including 500 open access journals, 700 conference proceedings, 600 trade publications, and 125 book series. This is a rather ill-composed blurb which undermines the claim of coverage. Conference papers often do not go through the peer review process. Trade publications don't use peer review. They are edited by the journal editor, which is often better than peer review. The same is true for books. There is nothing wrong with including trade publications alongside academic publications (after all, they are the journals which professionals who don't have to go for tenure can openly read, use and enjoy). ISI also includes journals meant for practicing professionals (if I may avoid the pejorative, almost dissing connotation of the term trade journals as uttered by some budding and veteran academics).

The claimed coverage of 15,000 serial publications requires much clarification and qualification; the number cannot be taken at face value. It is a very impressive number, as WoS lists 8,700 journals, but you should not jump to conclusions. For the past 10 years the broader source coverage of Scopus is undoubtedly reflected in citation counts that are typically higher than in WoS for articles published after 1996, but this is partially offset by the typically much lower citation counts in Scopus for pre-1996 publications.

In many disciplines, learning about articles published more than 10 years ago is essential for understanding the research and developments in a particular area.

The breadth of coverage alone is not a decisive factor. The period of the coverage, its retrospectivity, and its consistency must be looked at carefully for a reasonable and educated evaluation. There are significant differences between the policies of ISI and Elsevier in this regard. Once ISI picks up a journal, it keeps covering it completely and consistently. It does drop journals which fail to meet the standards set for inclusion, but it does not do so on a whim. Scopus coverage proved to be capricious, to put it mildly, in my test searches.

The results have shown significant gaps of coverage in Scopus. Sometimes entire volumes are missing, which is obvious from simply looking up the source index. The coverage list for the Annual Review of Information Science and Technology (ARIST), an example that is easy to relate to for most readers of this column, shows an appalling picture. It is the periodical with the highest impact factor in the library and information science category. Still, the entire volumes for 2005 and 2002 seem to be missing. Other numbers are also suspicious. Usually there are about 12-14 chapters in every volume of ARIST.

The high number (24) for 2003 is realistic because volumes 37 and 38 were both published in 2003. Unfortunately, the volume numbers and publication years were royally messed up. You can find different numbering/publication year notations at the society's page and at the site of Information Today, Inc., which has been publishing this excellent series for a long time.

The numbers of chapters for 1996-2001 are low, with records missing for a couple of chapters. But the real disappointment comes when you look at the pre-1995 coverage. All you have are records for 10 chapters in 1987, and one each for 1981 and 1982. This is below abysmal for such a high-clout periodical.
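The kind of check I did by eyeballing the source index can be sketched as a small Python routine. This is only a minimal illustration, not anything Scopus offers: the function name and the threshold parameter are my own, the counts are the pre-1995 ARIST numbers quoted above, and the expected minimum of 12 chapters per volume is the lower end of the usual 12-14.

```python
def coverage_report(counts_by_year, first_year, last_year, expected_min):
    """Label each year 'missing', 'low', or 'ok' based on its record count."""
    report = {}
    for year in range(first_year, last_year + 1):
        count = counts_by_year.get(year, 0)  # a year absent from the index counts as 0
        if count == 0:
            report[year] = "missing"
        elif count < expected_min:
            report[year] = "low"
        else:
            report[year] = "ok"
    return report


# Pre-1995 ARIST record counts as reported in this review.
arist_counts = {1981: 1, 1982: 1, 1987: 10}

report = coverage_report(arist_counts, 1981, 1994, expected_min=12)
for year, status in report.items():
    print(year, status)
```

For the 1981-1994 span every single year comes out as either low or missing, which is the roller-coaster picture described above in tabular form.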

Earlier, the source list was not updated immediately, so I also did a direct search for the periodical title in the bibliographic database. The result list just reconfirmed the above impression. If you try hard and use the title abbreviation, Scopus will cough up one more record, for a chapter in the 1974 volume.

The Journal of the American Society for Information Science & Technology shows a disturbing pattern, too. It seems to be missing the entire volume for 2000, and has a puny number of records for the pre-2000 years. But wait, the title of the journal was just JASIS until 2001, when it changed to JASIST by finally adding the word Technology. So why are there records under JASIST before 2000? There is no rhyme or reason, but luckily the volume for 2000 is not missing. It is under JASIS, which oddly does not come up in the source directory listing but is displayed when you search for it in the search mode.

The coverage of the Bulletin of ASIS shows a very disheartening picture again. Add to this the sorry coverage of the Proceedings of ASIS, with two papers from a single volume, and you realize why I am very skeptical when reading PR statements about the source base of a database. This is nothing new: I illustrated the absurdity of coverage claims before in another database, and in many others in my book about the Content Evaluation of Textual Databases. A top-tier database cannot afford this kind of source coverage without weakening its credibility.

THE SOFTWARE

Elsevier did not rest on the laurels it deserved for the excellent, compact, easy-to-scan presentation of the results, and for sorting millions of records in a few seconds, an unprecedented feature. Scopus is faster than ever: it can sort more than 27 million records in less than 10 seconds in case you want to find out which are the top 1,000 or 10,000 most cited records in the entire database.

The cumbersome advanced mode search was somewhat improved by showing the still cryptic search field tags on the search page, eliminating the annoying step of activating the help file and scrolling down to the field-tag section.

The most important new software enhancements relate directly to citation pattern analysis. One of the features is called citation tracking. You can mark any item on the result list and get the citations mapped out year by year. The tracking is restricted to a 10-year span, which is somewhat limiting; the restriction is probably motivated more by the fact that Scopus adds cited references only to records going back 10 years than by anything else.

You can do such a mapping by using the RANK command in DIALOG, but this is much easier and more pleasing visually. An annoying "feature" of Citation Tracker is that even if you stay within the 10-year limit, the software believes that you exceeded the time span limit and gives you an error message.

Author searching has improved in a big way. After you type in the name of the author, a list pops up which shows variants of the same name. Although some of them seem to be the same, clicking on the icon to look up the details will show that affiliation data is missing in one author record but not in the other. The most common format is given an author identifier and is considered by Scopus to be the authentic master record for the name of the author, but searchers may add the other variants to the search after an educated and informed decision. Clicking on the details icon shows the information collected by Scopus from its records and offers the person the chance to comment on the data. The process may need some additional clarification, but it is a very attentive feature by the designers and a welcome addition to the repertoire of the "caring" searcher (pardon the smarmy term). Showing the co-authors of the author as an option reflects the same intelligence in design. It is icing on the cake that the citation tracker feature can be launched from here, but it is far more than just icing that the self-citations can be eliminated from the results at the click of a button.
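To make the idea of eliminating self-citations concrete, here is a minimal Python sketch. It is not how Scopus implements the button, which I cannot see from the outside. It assumes one common definition (a citation is a self-citation when the citing and the cited paper share at least one author), and the author identifiers and function names are made up for the illustration.

```python
def is_self_citation(cited_author_ids, citing_author_ids):
    """True if the citing and cited papers share at least one author identifier."""
    return not set(cited_author_ids).isdisjoint(citing_author_ids)


def count_citations(cited_author_ids, citing_papers, exclude_self=False):
    """Count citing papers, optionally skipping self-citations.

    citing_papers is a list in which each item is the list of author
    identifiers of one citing paper.
    """
    count = 0
    for citing_author_ids in citing_papers:
        if exclude_self and is_self_citation(cited_author_ids, citing_author_ids):
            continue
        count += 1
    return count


# Hypothetical identifiers, not real Scopus author IDs.
cited = ["id-1", "id-2"]
citing = [["id-1", "id-9"], ["id-3"], ["id-4", "id-5"], ["id-2"]]

all_count = count_citations(cited, citing)
without_self = count_citations(cited, citing, exclude_self=True)
print(all_count, without_self)
```

Comparing identifiers rather than name strings is the point of the sketch: with name variants floating around, a string comparison would miss self-citations, which is one more reason the author identifier matters.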

I don't see any reason to use the current year and the previous 2 years as the default for the citation tracking, as Scopus is incredibly fast with mega-sets of results. (It is obviously an oversight that the end of the default range is > 2006. Only Google Scholar can produce phantom citations from 2007 as of mid-May 2006, and I am not talking about citations from manuscripts of books and articles which are known to be scheduled for publication in 2007.)

Even the current year is rarely informative for citedness, as citations start coming in for most mortals 2-3 years after publication (if they come at all), but this is a minor issue. There are other novel features on the way for visualizing the search results, but not even the Scopus reps could get the Java applets working in the exhibit hall when I contacted them just before submitting the manuscript.

Deploying this sorting know-how to sort the often lengthy cited reference lists in the detailed records of the articles by the number of times each reference was cited would appeal to many users. The instantly produced, excellent result analysis panel, which shows the distribution of hits by journals, authors, publication years, document types and major disciplinary categories, would be even better if the entire journal list could be sorted alphabetically.

Again, there is no limit on the size of the result set. You can get the profile of the results of a topical search showing the most productive journals, authors and years of publication. Searching by single-word journal names (Science, Blood, etc.) without picking up every journal which has the word science or blood in its name (as is possible in WoS) would be very welcome.

The export features are useful, but I can't understand why Scopus settled for a tailor-made version of RefWorks (an attractive bibliography manager program) which does not save the citation counts with the articles, the most valuable extra feature of the Scopus records. Luckily, the results of the citation tracking service can be exported in CSV format, which in turn can be read directly by Excel into a spreadsheet.
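A CSV export is just as easy to process outside a spreadsheet. The Python sketch below totals the citations per year with the standard csv module. The column layout (a Title column followed by one column per citing year) and the two sample rows are invented for the illustration; the actual Scopus export may well be laid out differently, so the column names would need to be adjusted.

```python
import csv
import io

# Made-up sample in an assumed layout: one row per document,
# one column per citing year. Not real Scopus data.
sample_export = """Title,2004,2005,2006
Paper A,3,5,1
Paper B,0,2,4
"""


def citations_per_year(csv_file, title_column="Title"):
    """Sum the citation counts in every year column of an exported table."""
    totals = {}
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            if column == title_column:
                continue  # skip the descriptive column
            totals[column] = totals.get(column, 0) + int(value or 0)
    return totals


totals = citations_per_year(io.StringIO(sample_export))
print(totals)
```

With a real export you would pass an opened file instead of the in-memory sample, for example `open("export.csv", newline="")`, where the file name is again only a placeholder.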

There are some software features where Scopus could learn from WoS. Browsing the index of cited references and of cited journal names would make the searches more efficient, and would compensate for the sloppy citing habits of us authors, which will haunt us as citation-based and citation-enhanced systems appear on a larger scale.

No matter how smart and swift the software is, it cannot compensate for certain types of database content deficiencies, such as those caused by omissions of articles, issues, and entire volumes from core journals which should have complete and consistent coverage. If we users can so easily get an X-ray of the entire database in Scopus, then the developers certainly can get a more comprehensive, more detailed and visual report about the gaps and the roller-coaster coverage of journals. Such a report could serve as a map for fixing this problem, and would give teeth to the proud claim of the exceptionally broad journal base, too.