Peter's Digital Reference Shelf

Title: Scopus revisited
Publisher: Elsevier
Cost: to be negotiated
Tested: continuously

The good and enhanced software features, especially the swift and generous
output functions, keep the rivalry with the Web of Science/Web of Knowledge
systems alive, but unresolved problems of source coverage undermine the most
prominent claim of Scopus about its broad source base.

THE CONTEXT

We
know that competition favors the customers, and we can see the benefit of
the neck-and-neck competition between Elsevier’s Scopus and the Web of
Science/Web of Knowledge systems of Thomson’s Institute for Scientific
Information. Beyond these two information industry heavyweights there are
others who have hopped on the bandwagon of citation-based and
citation-enhanced searching systems. They mostly focus on the academic
literature of a specific discipline, such as the CiteSeer system (computer
and information science and technology), SMEAL Search (business
literature), EconPapers (economics) and PubMed Central (health literature).
All these are full-text databases that include the cited references of the
source documents, which in turn provide the backbone of the web of
citing/cited references within their mostly open access collections.

The
American Psychological Association enhanced PsycINFO, its 2,000,000-record
indexing/abstracting database as of May, 2006, by a total of 200 million
cited references in 510,000 records. From 2001 onward almost all of the
records are enhanced by cited references where applicable. Before that year
the enhancement of records is selective, and for some years very selective.
It is odd that only 36 records were enhanced for the articles published in
1986, while for 1987 and 1985 the numbers of enhanced records are 957 and
742, respectively, still small numbers in light of the total yearly intake
of about 50,000 records in those years.

It is another question that few of the aggregators bring the best out of
PsycINFO and a few of the other citation-enhanced databases they host. CSA,
WoK and Ebsco show promising examples in this category.

Of
the digital facilitators, HighWire Press went the furthest by creating a
sophisticated yet visually pleasing and informative citation system
based on the cited references in the 3.4 million full-text articles in
journals hosted by HighWire Press. This multidisciplinary digital archive
offers open access to more than 1.2 million articles courtesy of the
publishers and HighWire Press.

The
Japan Science and Technology Agency (JST) recently enhanced and upgraded
its full-text digital archive of publications by Japanese researchers. The
multidisciplinary archive is split into two databases, J-STAGE and
journal@archive (with a strong science focus). It has a quarter million
scholarly papers and offers open access to more than 200,000 articles,
mostly in English. Nearly 30,000 of them have citedness scores from, and
links to, 43,000 papers within the archive. I got these statistics from
Mitsutoshi Wada. These two databases are now prime resources for learning
about research projects conducted in Japan.

Google
Scholar is also a multidisciplinary database. It does not provide any
tangible information about its source coverage, time span, number of
records or number of cited references, and it dispenses grossly inflated hit
counts and phantom citation counts. It can’t reliably identify matching
pairs of citing and cited references. Sure, the service is free, and it can
direct the users to some open access full-text materials courtesy of the
publishers and of the authors who self-archived their papers. More about
its pros and cons as of mid-2006 in an upcoming review.

Suffice
it to say that Google Scholar can be fine for casual searching and for
tracing related materials, as long as you don’t get too overjoyed, don’t
lose your mind, don’t declare Google Scholar equivalent to either WoS or
Scopus, and don’t take its counts seriously, as did Pauly and Stergiou in
their scarily incompetent research report. In turn, their “findings” will
likely be quoted by those who are equally incapable of understanding the
serious deficiencies of Google Scholar, but keep blogging and chatting about
it without substance, like entertainment industry reporters.

MSN
launched the Live Academic database in April, but has not enhanced it with
citedness scores, which is good because the database has serious problems
with more elementary issues, as I discussed in my review.

THE CONTENT

The
Scopus web site consists of the Scopus database itself, which is well
integrated with two other components: a database of more than 13 million
patent records, and a database of nearly 220 million records for other Web
pages. I discuss here only the Scopus component itself, but the other two
components are also noteworthy because of their aggregation and integration
in the Scopus service. Much of the content of these two databases is also
available independently and freely through Scirus, which keeps getting
better and better.

Database size and record types

Elsevier
claims that Scopus now has 27.7 million abstract records, about 1.7
million more than when I tested it in the summer of 2004. My test
confirmed this number, producing 27,758,413 hits. However, not all of
these have abstracts. Again, my test confirmed close to 20 million records
with abstracts. The 27.7 million figure refers to the total number of
indexing records. Elsevier does not reveal how many records have been
enhanced by cited references, but my test indicates that more than
9,476,000 records have these important value-added data elements. It is a
realistic number considering that not all scholarly articles cite other
works, and most papers in trade publications do not. Elsevier makes it
clear that only records for primary documents published since 1996 have
been enhanced with cited references. Documents published since 1996 account
for about 12.7 million records within Scopus. Actually, about 7,000 records
for pre-1996 papers also have cited references. As for the total number of
cited references in the database, Elsevier claims to have 245 million,
which seems to be a reasonable number.
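
A back-of-the-envelope check can make this plausibility judgment concrete. The sketch below simply divides the claimed total of cited references by the number of reference-enhanced records found in the test search. The variable names are illustrative, and treating the two figures as directly comparable is an assumption.

```python
# Figures quoted in the text above; the variable names are illustrative.
claimed_cited_references = 245_000_000  # total claimed by Elsevier
records_with_references = 9_476_000     # "more than" this many, per the test search

# Average number of cited references per enhanced record.
# Because 9,476,000 is a lower bound, this average is an upper bound.
avg_refs_per_record = claimed_cited_references / records_with_references

print(f"at most about {avg_refs_per_record:.1f} cited references per enhanced record")
```

An average of roughly 26 references per enhanced record does not look out of line for scholarly reference lists, which is consistent with calling the claimed total reasonable.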
The indexing/abstracting records go back to 1965, with a smattering
of records for pre-1965 publications. It must be borne in mind, however,
that the most recent 10 years provide almost half of the records in the
Scopus database.

Document types and topical focus

The
core of Scopus itself is still represented by journal papers (including
research papers, review papers, short notes and letters to the editor).
There are records for 38,00 conference papers (Scopus still refers to them
erroneously as conference reviews), a very useful source type in some
disciplines, such as computer science. There are records for 14,500 other
research reports, and for 20,000 books. The composition by document type
has not changed significantly.

As
for the topical coverage, medicine, engineering and biochemistry are
dominant. This is understandable, as many of the source records come from
Elsevier’s largest indexing/abstracting databases: EMBASE, Compendex and
BIOBASE. In the segment of Scopus covering the past few years, the
literature of physics and astronomy got a larger share, along with
materials science, chemistry and agriculture. The social sciences represent
less than 3% of the records (not counting records for the literature of
psychology, business and economics), and their share also grew in the past
2 years. Arts and humanities remained at a low 0.3% share.

The source base

The
most current blurb of Scopus claims that it includes more than 15,000 peer
reviewed journals, including 500 open access journals, 700 conference
proceedings, 600 trade publications and 125 book series. This is a rather
ill-composed blurb which undermines the claim of coverage. Conference
papers often do not go through the peer review process. Trade publications
don’t use peer review; they are edited by the journal editor, which is often
better than peer review. The same is true for books. There is nothing wrong
with including trade publications alongside academic ones (after all, they
are the journals which professionals who don’t have to go for tenure can
openly read, use and enjoy). ISI also includes journals meant for
practicing professionals (if I may avoid the pejorative, almost dissing
connotation of the term trade journals as uttered by some budding and
veteran academics).

As
for the coverage of 15,000 serial publications, it is a very impressive
number, but it requires much clarification and qualification and cannot be
taken at face value. WoS lists 8,700 journals, but you should not jump to
conclusions. For the past 10 years the broader source coverage of Scopus is
undoubtedly reflected in citation counts that are typically higher than in
WoS for articles published after 1996, but this is partially offset by the
typically much lower citation counts for the pre-1996 publications in
Scopus. In many disciplines, learning about articles published more than 10
years ago is essential for understanding the research and developments in a
particular area.

The
width of coverage alone is not a decisive factor. The period of the
coverage, its retrospectivity and its consistency must be looked at
carefully for a reasonable and educated evaluation. There are significant
differences between the policies of ISI and Elsevier in this regard. Once
ISI picks up a journal, it keeps covering it completely and consistently.
It does drop journals which fail to meet the standards set for inclusion,
but it does not do so on a whim. Scopus coverage proved to be capricious, to
put it mildly, in my test searches.

Results
have shown significant gaps of coverage in Scopus. Sometimes entire
volumes are missing, which is obvious from simply looking up the source
index. A look at the coverage list of the Annual Review of Information
Science and Technology (ARIST), an example that is easy to relate to for
most readers of this column, shows an appalling picture. ARIST has the
highest impact factor among periodical publications in the library and
information science category. Still, the entire volume seems to be missing
for both 2005 and 2002. Other numbers are also suspicious. Usually there
are about 12-14 chapters in every volume of ARIST.

The
high number (24) for 2003 is realistic because volumes 37 and 38 were both
published in 2003. Unfortunately, the volume and publication year
information was royally messed up. You can find different
numbering/publication year notations at the society’s page and at the site
of Information Today, Inc., which has been publishing this excellent series
for a long time. The numbers of chapters for 1996-2001 are low, with records
missing for a couple of chapters. But the real disappointment comes when
you look at the pre-1995 coverage. All you have are records for 10 chapters
in 1987, and one each for 1981 and 1982. This is below abysmal for such a
high-clout periodical. Earlier the source list was not immediately updated,
so I also did a direct search for the periodical title in the bibliographic
database. The result list just reconfirmed the above impression. If you try
hard and use the title abbreviation, Scopus will cough up one more record,
for a chapter in the 1974 volume.

The
Journal of the American Society for Information Science & Technology
shows a disturbing pattern, too. It seems to be missing the entire volume
for 2000, and has a puny number of records for the pre-2000 years. But wait,
the title of the journal was just JASIS until 2001, when it changed to
JASIST by finally adding the word Technology. So why are there records
under JASIST before 2000? There is no rhyme or reason, but luckily the
volume for 2000 is not missing. It is under JASIS, which oddly does not
come up in the source directory listing but is displayed when searching for
it in the search mode.

The coverage of the Bulletin of ASIS shows a very disheartening picture
again. Add to this the sorry coverage of the Proceedings of ASIS, with two
papers from a single volume, and you realize why I am very skeptical when
reading PR statements about the source base of a database. This is nothing
new; I illustrated the absurdity of coverage claims before in another
database, and in many others in my book about the Content Evaluation of
Textual Databases. A top-tier database cannot afford this kind of source
coverage without weakening its credibility.

THE SOFTWARE

Elsevier
did not rest on its laurels, which it deserved for the excellent, compact,
easy-to-scan presentation of the results, and for sorting millions of
records in a few seconds, an unprecedented feature. Scopus is faster than
ever: it can sort more than 27 million records in less than 10 seconds in
case you want to find out which are the top 1,000 or 10,000 most cited
records in the entire database.

The cumbersome advanced search mode was somewhat improved by showing the
still cryptic search field tags on the search page, eliminating the
annoying step of activating the help file and scrolling down to the
field-tag section.

The
most important new software enhancements relate directly to citation
pattern analysis. One of the features is called citation tracking. You can
mark any item on the result list and get the citations mapped out year by
year. The map is limited to a 10-year span, which is somewhat restrictive,
and is probably motivated more by the fact that Scopus adds cited
references to records going back 10 years than by anything else. You can do
such a mapping by using the RANK command in DIALOG, but this is much easier
and more pleasing visually. An annoying “feature” of Citation Tracker is
that even if you stay within the 10-year limit, the software believes that
you exceeded the time span limit and gives you an error message.
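
The kind of year-by-year map that citation tracking produces can be illustrated with a small sketch. This is not how Scopus implements the feature. The function name `citations_by_year` and the sample data are hypothetical, and the year window is assumed to be inclusive at both ends, so 1997 through 2006 counts as exactly ten years.

```python
from collections import Counter


def citations_by_year(citing_years, start_year, end_year):
    """Count citations per year within an inclusive year window."""
    if end_year < start_year:
        raise ValueError("end_year must not precede start_year")
    counts = Counter(y for y in citing_years if start_year <= y <= end_year)
    # Include years with zero citations so the table has no holes.
    return {year: counts.get(year, 0) for year in range(start_year, end_year + 1)}


# Hypothetical publication years of the papers citing one marked item.
sample_citing_years = [1995, 1998, 1999, 1999, 2001, 2003, 2003, 2003, 2005]

table = citations_by_year(sample_citing_years, 1997, 2006)
for year, n in table.items():
    print(year, n)
```

Citations falling outside the window (the 1995 one in the sample) are simply left out of the table, and years without citations are listed with a zero.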
Author searching has improved in a big way. After typing in the name of the
author, a list pops up which shows variants of the same name. Although some
of them seem to be the same, clicking on the icon to look up the details
will show affiliation data missing in one author record but not the other.
The most common format is given an author identifier and is considered by
Scopus to be the authentic master record for the name of the author, but
searchers may add the other variants to the search after an educated and
informed decision. Clicking on the details icon will show the information
collected by Scopus from its records, and offers the person the chance to
comment on the data. The process may need some additional clarification,
but it is a very attentive feature by the designers, and a welcome addition
to the repertoire of the “caring” searcher (pardon the smarmy term).
Showing the co-authors of the author as an option reflects the same
intelligence in design. It is icing on the cake that the citation tracker
feature can be launched, but it is far more than just icing that
self-citations can be eliminated from the results by the click of a button.

I
don’t see any reason to use the current year and the previous 2 years as
the default for citation tracking, as Scopus is incredibly fast with
mega-sets of results. (It is obviously an oversight that the end of the
default range is > 2006. Only Google Scholar can produce phantom citations
from 2007 as of mid-May 2006, and I am not talking about citations from
manuscripts of books and articles which are known to get published in
2007.) Even the current year is rarely informative for citedness, as
citations start coming in for most mortals 2-3 years after publication, if
they come at all, but this is a minor issue. There are other novel features
on the way for visualizing the search results, but not even the Scopus reps
could get the Java applets working in the exhibit hall when I contacted
them just before submitting the manuscript.

Deploying
the sort know-how to sort the often lengthy cited reference lists in the
detailed record of the articles by the number of times the references were
cited would appeal to many users. The instantly produced, excellent result
analysis panel showing the distribution of hits by journals, authors,
publication years, document types and major disciplinary categories would
be even better if the entire journal list could be sorted alphabetically.
Again, there is no limit on the size of the result set. You can get the
profile of the result of a search topic showing the most productive
journals, authors and years of publication. Searching by single-word
journal names (Science, Blood, etc.) without picking up every journal which
has the word science or blood in its name (as is possible in WoS) would be
very welcome.

The
export features are useful, but I can’t understand why Scopus settled
for a tailor-made version of RefWorks (an attractive bibliography manager
program) which does not save the citation counts with the articles, the
most valuable extra feature of the Scopus records. Luckily, the results of
the citation tracking service can be exported in CSV format, which in turn
can be read directly by Excel into a spreadsheet.

There
are some software features where Scopus could learn from WoS. Browsing the
index of cited references and the cited journal names would make the
searches more efficient and would compensate for the bad habits of us
authors in being sloppy when citing works, habits which will haunt us as
citation-based and citation-enhanced systems appear on a larger scale.

No matter how smart and swift the software is, it cannot compensate for
certain types of database content deficiencies, such as those caused by
omissions of articles, issues and entire volumes from core journals which
should have complete and consistent coverage. If we users can so easily get
an X-ray of the entire database in Scopus, then the developers certainly
can get a more comprehensive, more detailed and visual report about gaps
and roller-coaster coverage of journals. They could help by providing a map
for fixing this problem, and thereby give teeth to the proud claim of the
exceptionally broad journal base, too.
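
A gap report of the kind suggested here could start from nothing more than per-year record counts for each serial. The following is a minimal sketch under that assumption. The function name `coverage_gaps`, the `min_expected` threshold and the sample counts are all hypothetical; the numbers loosely echo the kind of pattern described for ARIST but are not actual Scopus data.

```python
def coverage_gaps(records_per_year, first_year, last_year, min_expected=1):
    """Return the years in [first_year, last_year] whose record count
    falls below min_expected, including years absent from the data."""
    return [
        year
        for year in range(first_year, last_year + 1)
        if records_per_year.get(year, 0) < min_expected
    ]


# Hypothetical per-year record counts for one serial (illustrative numbers).
sample_counts = {1996: 9, 1997: 10, 1998: 8, 1999: 11, 2000: 9,
                 2001: 10, 2003: 24, 2004: 13}

missing = coverage_gaps(sample_counts, 1996, 2005)
thin = coverage_gaps(sample_counts, 1996, 2005, min_expected=12)

print("years with no records:", missing)
print("years with fewer than 12 records:", thin)
```

Running the same check with a threshold close to the usual number of items per volume flags the thinly covered years as well as the entirely missing ones, which is the roller-coaster pattern a visual report would make obvious.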