This presentation gives the results of an investigation we carried out this year into how well Primo covers our scientific journals.
Joke Hendriks from the Science Faculty helped me with checking journal titles in Primo, and Bert Gazendam from the Central Library helped with the global approach and especially the statistics.
The reason for the investigation was that, following a survey on Primo usage in 2013, we wanted more insight into the content coverage of the bibliographical databases. From the survey we learned that both our customers and our information specialists were not sure about what exactly is covered. Often they cannot find the materials they are looking for.
We needed answers to questions like:
- What content is indexed by Primo? By the way, it is called CataloguePlus at the UvA.
- Are there alternatives to the existing mix of information resources which might give better coverage?
- We also regularly get requests or offers for new databases. This year we evaluated SCOPUS again and there is continuing demand from the Science Department to acquire it. But does it add anything to Primo? It would be nice if we were able to match new databases against what is already covered.
- Moreover there was a feeling that we had too many databases. At that time we offered something like 760 different databases in a Metalib application. Users couldn’t see the wood for the trees. It seemed necessary to take a critical look at the databases we offer.
So our goal was to improve the coverage of our collection, either by Primo or any combination of alternative bibliographical databases.
Additional pressure was put on the Primo working group by the discovery that the usage of several databases declined sharply after the introduction of Primo in September 2012. The red arrows indicate the start of Primo. These graphs show the decline in use of three EBSCO databases in terms of abstract downloads, searches and full-text downloads. The decline is exaggerated by the phasing out of Metalib, but it indicates a real change in the usage of, for example, ASP. This was formerly one of the showpieces of Metalib, but has now turned into a sleeping beauty.
I work as an information specialist in the Science Library, in a new building in the south-east of Amsterdam. Since I do not work at the Library Systems department, I have no access to the Primo configuration. As an information specialist I have some general idea of how it works, but in general we regard it as a kind of black box.
Now, our library happens to look out on the Equinix plant in Amsterdam. I learned that the Ex Libris Cloud Center for Europe is housed at Equinix in Amsterdam. So it is possible that we are looking out on a very large black or silver box, with Primo residing somewhere inside it.
The UvA is one of the largest general research universities in Europe, with 30,000 students and 5,000 employees.
The University Library collection budget for 2014 is several million euros.
About 75 % of the collection budget is spent on journals. This includes the costs of digital library portals from publishers.
For the bibliographical databases we pay around € 500,000. This excludes Primo.
We think the primary goal of a library information system should be to bring our customers to the journals in our collection in an easy way. The question is whether Primo meets this demand.
Information specialists often teach their students not to use Primo/CataloguePlus for a real bibliographic search. Instead it is better to search several specialized databases. Often Primo/CataloguePlus is not even mentioned during information literacy courses.
But is this really good advice?
Our results indicate that the picture is more balanced.
We did a TURF analysis on our data. TURF is short for Total Unduplicated Reach and Frequency. It is used in media or market analysis to calculate the optimal ranking of questions in a survey based on the proportion of unique responses. This can also be applied to titles in bibliographical databases.
Suppose you have a collection of 6 bibliographical databases with the titles they index; this is the top row. They are imported into the SPSS extension for TURF analysis, which is built into the newest versions.
The analysis starts and shortly thereafter it shows the results.
SPSS calculates for each database the number of unique titles it adds to the total collection. This is done by iteration: first it selects the database with the largest number of unique titles and puts it at rank 1, then it selects the next database for the remaining collection, and so on. The proportions each database adds are summed, and in this way you get the total number of indexed titles. This is called the “Reach” and is equal to the coverage. These are the blue blocks above the line.
The procedure also calculates the portion each database adds to the total collection of titles. This is called the “Frequency”. These are the green blocks beneath the line.
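The iteration just described can be sketched in a few lines of Python. This is only a greedy illustration of the ranking step under our own assumptions, not the SPSS TURF extension itself; the function name greedy_reach, the database names and the title identifiers are all made up for the example.

```python
def greedy_reach(databases, collection):
    """Rank databases by the number of not-yet-covered titles each one adds.

    databases:  dict mapping a database name to the set of titles it indexes
    collection: set of titles whose coverage we want to measure
    Returns a list of (name, titles_added, cumulative_reach) tuples.
    """
    remaining = dict(databases)
    covered = set()
    ranking = []
    while remaining:
        # Pick the database that adds the most new titles (ties: alphabetical).
        name = max(sorted(remaining),
                   key=lambda n: len((remaining[n] & collection) - covered))
        added = (remaining.pop(name) & collection) - covered
        if not added:
            break  # no remaining database adds anything new
        covered |= added
        ranking.append((name, len(added), len(covered) / len(collection)))
    return ranking


# Hypothetical toy data: 10 titles and 4 databases.
collection = {f"t{i}" for i in range(1, 11)}
databases = {
    "A": {"t1", "t2", "t3", "t4"},
    "B": {"t3", "t4", "t5"},
    "C": {"t5", "t6"},
    "D": {"t1"},
}
ranking = greedy_reach(databases, collection)
for name, added, reach in ranking:
    print(f"{name}: +{added} titles, reach {reach:.0%}")
```

In this toy case database B never enters the ranking, because everything it indexes is already covered once A and C have been selected. That is the same effect that pushes a large but overlapping database far down the list.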
So, when we apply this to real data for 37 bibliographical databases in our collection, we get this graph. The x-axis gives the group size, the y-axis the percentage of reach and frequency. The blue line is the reach or coverage, the green line is the frequency or the total number of titles added. The coverage reaches a maximum of about 64 %, and the frequency, or number of titles added, eventually reaches 100 %.
So if you search all these databases you will never get a coverage greater than 64 %. Additionally, you have to collect the results from all these different databases, deduplicate them and save them in a single list. This is a lot of work.
Instead, when you begin your search with Primo, you start with a coverage above 80 %, and in 3 to 4 steps you reach a coverage of about 92 %. The next two databases after Primo are both in EBSCOhost and can be searched simultaneously. So searching in Primo saves you a lot of work and gives you better coverage.
How did we get to this estimate of Primo coverage? Primo can be seen as a large collection of databases, which can be divided into three groups: bibliographical databases, collection portals of large publishers or aggregators, and collections of smaller publishers. Which collections are actually indexed can be derived from the Primo Central Collection List.
We used exports from Scholarly Stats and SFX to determine the titles that are part of those collections. Scholarly Stats is a service which delivers usage data for around 35 platforms representing about 95 % of our journal collection. To obtain an accurate estimate we checked the titles with publisher title lists and in CataloguePlus.
After that, the results were matched on ISSN with the title list from Ulrich’s XML Data Service.
Ulrich’s XML Data Service operates as a SOAP service, which in our case is handled by a PHP script on a little Windows 2012 server and parsed into a MySQL database; see the photograph.
The latest export we used contained more than 386,000 journal titles.
A nice feature of Ulrich’s is that it gives, for each journal, the indexing and abstracting services by which it is indexed. We found 11 indexing and abstracting services in Ulrich’s that match the PCI list. Unfortunately Ulrich’s gives no information about the period covered.
For each title, information is also given about title changes and former ISSNs.
It is important to note that Ulrich’s only gives the indexing data for bibliographical databases, not for publisher platforms like ACM, ScienceDirect or Wiley. So information about the titles in Primo from publishers came from Scholarly Stats and SFX, and information about titles indexed by bibliographical databases came from Ulrich’s.
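Matching on ISSN, including former ISSNs, could look roughly like the sketch below. It is an illustration only: the record layout (title, issn, former_issns) is a hypothetical stand-in and not the actual Ulrich’s XML schema or our MySQL tables, and the journal names are invented. The check-digit rule is the standard ISSN one (weights 8 down to 2, modulo 11, with X standing for 10).

```python
import re


def normalize_issn(raw):
    """Return the ISSN as 'NNNN-NNNC', or None if malformed or the check digit fails."""
    s = re.sub(r"[^0-9Xx]", "", raw or "").upper()
    if not re.fullmatch(r"\d{7}[\dX]", s):
        return None
    total = sum(int(d) * w for d, w in zip(s[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return f"{s[:4]}-{s[4:]}" if s[7] == expected else None


def build_index(reference_records):
    """Map every current and former ISSN to the title of its reference record."""
    index = {}
    for rec in reference_records:
        for raw in [rec["issn"], *rec.get("former_issns", [])]:
            issn = normalize_issn(raw)
            if issn:
                index.setdefault(issn, rec["title"])
    return index


def match_titles(collection_issns, index):
    """Split the collection list into matched and unmatched ISSNs."""
    matched, unmatched = {}, []
    for raw in collection_issns:
        issn = normalize_issn(raw)
        if issn in index:
            matched[raw] = index[issn]
        else:
            unmatched.append(raw)
    return matched, unmatched


# Hypothetical reference records and collection list.
reference = [
    {"title": "Journal A", "issn": "0378-5955", "former_issns": ["0317-8471"]},
    {"title": "Journal B", "issn": "2049-3630"},
]
matched, unmatched = match_titles(["03785955", "0317-8471", "1234-5678", ""],
                                  build_index(reference))
print(matched)
print(unmatched)
```

Titles with a malformed ISSN or no ISSN at all end up in the unmatched list, so they can be checked by hand instead of being silently dropped.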
To estimate the coverage of the collection, we had to estimate the number of titles. Unfortunately there is no complete list of journals the UvA subscribes to. Based on Scholarly Stats and SFX data we came to an estimate of around 24,000 to 25,000 journal titles that are accessible and that we are probably paying for. Of these titles 90 % matched with our Ulrich’s data; these are the 22,750 titles in the center. The missing 10 % are journals with an odd ISSN or with no ISSN. This could be, for example, a journal archive which has no ISSN itself, although the journals in it do.
Additionally, there are around 10,000 Open Access titles.
In the picture you see the different ways these journals are covered by Primo and the bibliographical databases:
1) Around 29,000 titles are part of a journal collection from a publisher or collection platform,
2) around 25,000 titles are indexed by a bibliographical index,
3) about 48,000 titles are indexed by our bibliographical databases.
There is a gigantic overlap. So we think our journals are quite well covered.
In this graph you can see that a large proportion of the titles are indexed by the bibliographical databases in Primo. This is the left column.
The middle column shows the number of titles in one of the larger collection platforms.
To the right the number of titles on a title list of one of the smaller publishers covered by Primo.
Shaded areas show titles which are indexed by more than one source:
- Yellow-red: titles indexed by both a publisher and a collection platform.
- Green-red: titles indexed by all three sources.
The majority of titles is covered by one of the 11 bibliographical databases, and they overlap with the titles from the collection platforms and publishers. So, by relatively few bibliographical databases a large number of journals is covered, and you need only a few collections from publishers to achieve a large title coverage.
In our implementation a number of resources have not been activated, because they affected relevance ranking in a negative or unclear manner. We also learned that some databases only index selected articles which match their subject area, not complete title lists. This is done, for example, by PubMed and MLA/IB.
So, we corrected for this and we estimated the following coverage for our subscribed journals:
- 88 % if all resources were activated, 91 % if all OA journals are included.
- 77 % when the databases which are turned off are excluded, 83 % incl. OA.
- 75 % if the databases which are activated by subscription are also excluded, 82 % incl. OA.
Coverage is defined as the proportion of journals indexed in Primo through the resources mentioned in the PCI list.
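As a small illustration of this definition, the sketch below computes coverage for different sets of activated resources. The resource names and title identifiers are invented, and the function coverage is our own helper for the example, not something provided by Primo.

```python
def coverage(collection, resources, active):
    """Share of collection titles indexed by at least one active resource.

    collection: set of subscribed titles
    resources:  dict mapping a resource name to the set of titles it indexes
    active:     iterable of resource names that are switched on
    """
    indexed = set().union(*(resources[name] for name in active))
    return len(collection & indexed) / len(collection)


# Hypothetical toy data: 8 subscribed titles, 3 resources.
collection = {f"j{i}" for i in range(1, 9)}
resources = {
    "R1": {"j1", "j2", "j3", "j4"},
    "R2": {"j4", "j5", "j6"},
    "R3": {"j7", "not-subscribed"},
}

all_on = coverage(collection, resources, ["R1", "R2", "R3"])
r2_off = coverage(collection, resources, ["R1", "R3"])
print(f"all resources active: {all_on:.1%}")
print(f"R2 turned off:        {r2_off:.1%}")
```

Turning a resource off only lowers the coverage by the titles that no other active resource indexes, which is why the scenarios above differ by fewer points than the size of the excluded databases might suggest.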
We also wondered to what extent these databases share titles. To determine this, a Principal Component Analysis was done. In this table the column headers give the discovered components or factors calculated from the total variation. The left column gives the name of the database. (Very small.)
Databases scoring high on a component are highlighted. The first component probably has something to do with Medicine and Healthcare, the second with Exact Science, the third with Library and Information Science, and so on. As can be seen, there is no database with high scores on multiple components. So every database is more or less specialized for its own target group. We concluded that there is not much correlation between the databases and that the overlap is small.
This graph was made because we wanted to know if adding SCOPUS would improve the coverage of our collection. The x-axis gives the main suppliers of journal titles and the y-axis the number of titles in their collections. The red segment in each bar indicates the number of titles indexed by Primo as well as SCOPUS. The blue portion shows the titles which are indexed by Primo alone. The purple part at the top indicates the number of titles which are not indexed by either Primo or SCOPUS. In between there is a tiny little green portion which represents the number of titles SCOPUS would add.
So this was one indication that adding SCOPUS to the collection would not add much to our coverage.
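The four segments in each bar come down to simple set arithmetic. The sketch below shows the idea with invented title identifiers; the function name overlap_segments and the labels are our own choices for the illustration.

```python
def overlap_segments(collection, primo, scopus):
    """Split a supplier's title list into the four segments of the bar chart."""
    return {
        "both": len(collection & primo & scopus),           # red
        "primo_only": len((collection & primo) - scopus),   # blue
        "scopus_only": len((collection & scopus) - primo),  # green: what SCOPUS would add
        "neither": len(collection - primo - scopus),        # purple
    }


# Hypothetical toy data for one supplier.
collection = {"a", "b", "c", "d", "e", "f", "g", "h"}
primo = {"a", "b", "c", "d", "e", "x"}
scopus = {"d", "e", "f", "y"}

segments = overlap_segments(collection, primo, scopus)
print(segments)
```

The four counts always add up to the size of the supplier's title list, which is a useful sanity check when the lists come from different exports.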
This is the result of a TURF analysis of SCOPUS with the other bibliographical databases, excluding Primo. The coverage reached would be around 69 %.
This is the result of a TURF analysis with a basic Primo configuration, SCOPUS and the other bibliographical databases. SCOPUS shows up in 5th place, adding little to the coverage eventually reached, 92 %. So despite its clear interface, the acquisition of SCOPUS was put on hold for the time being.
Here is another way of representing the data from a TURF analysis. The number of unique titles added by each database is represented by the blue line. This number falls quite rapidly and then tapers off. The rise in total costs is given by the red line. Adding the last 9 titles to the indexed collection costs about 100,000 euro. So you can ask yourself: is this really worth it?
Dutch as we are, the previous graph made us wonder what coverage might be reached if we only used the cheapest databases, for example databases with a cost per title comparable to that of Primo. There are about 13 databases costing less than 4 euro per indexed title, including some free databases like PubMed. If you only offered those databases, without Primo, you would reach a coverage of about 50 %.
Adding Primo Basic would improve the coverage to about 90 %. So with this collection of databases, together costing around 90,000 euro, you would reach a coverage only 2 percentage points lower than that reached with all the expensive databases costing around € 500,000. This excludes the costs for Primo itself.
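The selection of cheap databases can be sketched as follows. All prices, database names and title numbers here are invented for the illustration and are not our actual figures; cheap_selection is a hypothetical helper, and "cost per title" is taken to mean the annual cost divided by the number of collection titles the database indexes.

```python
def cheap_selection(collection, databases, max_cost_per_title):
    """Select databases costing less than max_cost_per_title per indexed title.

    databases: dict mapping a name to (annual_cost, set_of_indexed_titles)
    Returns (selected_names, total_cost, coverage_of_collection).
    """
    selected, total_cost, covered = [], 0, set()
    for name in sorted(databases):
        cost, titles = databases[name]
        indexed = titles & collection
        if not indexed:
            continue  # indexes nothing we hold, so cost per title is undefined
        if cost / len(indexed) < max_cost_per_title:
            selected.append(name)
            total_cost += cost
            covered |= indexed
    return selected, total_cost, len(covered) / len(collection)


# Hypothetical toy data: 20 titles, one free, one cheap and one expensive database.
collection = set(range(1, 21))
databases = {
    "FreeDB": (0, {1, 2, 3, 4, 5}),
    "CheapDB": (30, set(range(4, 14))),    # 10 titles, 3 euro per title
    "PriceyDB": (200, set(range(12, 17))), # 5 titles, 40 euro per title
}

selected, total_cost, cheap_coverage = cheap_selection(collection, databases, 4)
print(selected, total_cost, f"{cheap_coverage:.0%}")
```

Running the TURF ranking on the selected databases only, with and without Primo, then gives the kind of comparison described above.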
So far we have only looked at the coverage of titles, but this also raised the question of how completely they are indexed. Are only authors and titles indexed, or also subject headings, abstracts and full text? This is a subject for further research, but we took a preliminary glimpse at it.
Take, for example, the journal Contemporary Psychoanalysis, published by Taylor & Francis and also indexed in Web of Science. According to the Primo Central Collection List one would expect it to be indexed by keywords/subject headings and full text for records from T&F, and by subject headings/keywords for records from Web of Science.
But if you search for articles you will only find records from Web of Science. One of our findings was that records from Web of Science never show subject headings and abstracts, although these are given in Web of Science itself.
To determine whether this is a general pattern we checked a number of journals in CataloguePlus.
This table shows the results of a check of 272 journals. The percentage of titles for which we would expect subject headings or full text is given by the blue columns. The red columns show the actual percentage of titles found. The results indicate that the number of records with keywords is somewhat lower than expected. This can be explained by older journals not having subject headings or keywords. The percentage of titles with an abstract is about the same as the percentage of titles with keywords. Full-text indexing seems very rare and much lower than expected from the PCI list. This is a subject for further research. Possibly we do not interpret the PCI list correctly, so that the occurrence of text snippets in the records does not mean that the records are being indexed in full text.
- This way of looking at databases will play a role in future decisions about new databases, and it already did in the case of SCOPUS.
- It might be more useful to acquire journal content than additional bibliographical databases, because the collection already seems well covered. It is an advantage when a journal collection is indexed in Primo Central.
- Primo should get more attention in our library skills courses as a one-stop tool to get to our journal subscriptions.
- Ulrich’s Journal Data is a great framework to map all kinds of journal information onto.
- If possible, efforts must be made to supplement Primo with databases and publishers not yet indexed. The fact that ProQuest will be included as of October this year is good news.
When I searched for information about this subject I could not find much in our collection. Ulrich’s gives a subject indication for all titles in its database. This table shows a selection of the 35 largest subject areas in Ulrich’s. The redder the cell, the more titles are not covered. Obviously we are missing something on Agriculture and Earth sciences, but the largest gap is in titles about Abstracting and Indexing services!
This presentation will also be uploaded to the IGeLU website.
Are there any questions?
This presentation can be downloaded together with all the other presentations of the conference at: