Research Article
Corresponding author: Quentin Groom ( quentin.groom@plantentuinmeise.be ) Academic editor: Alan Paton
© 2017 Jorick Vissers, Frederik Van den Bosch, Ann Bogaerts, Christine Cocquyt, Jérôme Degreef, Denis Diagre, Myriam De Haan, Sofie De Smedt, Henry Engledow, Damien Ertz, Régine Fabri, Sandrine Godefroid, Nicole Hanquart, Patricia Mergen, Anne Ronse, Marc Sosef, Tariq Stévart, Piet Stoffelen, Sonia Vanderhoeven, Quentin Groom.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Vissers J, Bosch FV, Bogaerts A, Cocquyt C, Degreef J, Diagre D, de Haan M, De Smedt S, Engledow H, Ertz D, Fabri R, Godefroid S, Hanquart N, Mergen P, Ronse A, Sosef M, Stévart T, Stoffelen P, Vanderhoeven S, Groom Q (2017) Scientific user requirements for a herbarium data portal. PhytoKeys 78: 37-57. https://doi.org/10.3897/phytokeys.78.10936
Digitizing herbaria and making them available online will greatly facilitate access to plant collections around the world. This will improve the efficiency of taxonomy and help reduce inequalities between scientists. The Botanic Garden Meise, Belgium, is currently digitizing 1.2 million specimens including label data. In this paper we describe the user requirements analysis conducted for a new herbarium web portal. The aim was to identify the required functionality, but also to assist in the prioritization of software development and data acquisition. The Garden conducted the analysis in cooperation with Clockwork, the digital engagement agency of Ordina. Potential users from universities, research institutions, science-policy initiatives and the Botanic Garden Meise were consulted through a series of interactive interviews. Although digital herbarium data have many potential stakeholders, we focused on the needs of taxonomists, ecologists and historians, who are currently the primary users of the Meise herbarium data portal. The three categories of user had similar needs: all wanted as much specimen data as possible, and wanted those data to be interlinked with other digital resources within and outside the Garden. Many users wanted an interactive system that they could comment on, or correct online, particularly if such corrections and annotations could be used to rank the reliability of data. Many requirements depend on the quality of the digitized data associated with each specimen. The essential data fields are the taxonomic name, geographic location, country, collection date, collector name and collection number. All researchers also valued linkage between biodiversity literature and specimens. Nevertheless, to verify digitized data the researchers still want access to high quality images, even if fully transcribed label information is provided.
The only major point of disagreement is the level of access users should have and what they should be allowed to do with the data and images. Not all of the user requirements are feasible given the current technical and regulatory landscape; however, the potential of these suggestions is discussed. Currently, there is no off-the-shelf solution to satisfy all these user requirements, but the intention of this paper is to guide other herbaria who are prioritising their investment in digitization and online web functionality.
Botanic garden, collections, database, data sharing, digitization, science infrastructure
A quiet revolution is happening in the way herbarium specimens are being accessed and used. Botanic gardens, museums and universities, all over the world, are digitally imaging herbarium specimens, transcribing their details while geolocating their origin (
Improving access to biological collections is a policy goal of many governments and organizations. For example, Article 17 of the Convention on Biological Diversity focuses on the exchange of information, and Target 19 of the Aichi Biodiversity Targets relates to biodiversity knowledge exchange (
In addition to their traditional role in plant taxonomy, improved access promotes new lines of research and applications for herbarium specimens. Such data can be used to monitor environmental changes, such as changes of plant phenology that result from climate warming (Vellend 2013,
The Botanic Garden Meise holds around 3.5 million herbarium specimens from around the world, with important historical collections, and a clear focus on Central Africa and Latin America, as well as additional significant collections from Belgium and Europe. In 2002, the Garden started imaging and cataloguing its collections with two small pilot studies within the EU funded framework of the European Network for Biodiversity Information: namely The Albertine Rift Project (
For the Garden this project presents many opportunities. It will update and extend the current digitalisation infrastructure, image storage and web portal. In addition, the project will raise the profile of the Garden; demonstrate the importance of the collections and make the general public more aware of the collection’s existence. It will also increase awareness of the Garden’s research initiatives and its relevance to conservation, science and society.
As part of the DOE! project, the Garden will update its herbarium portal. However, before redesigning the portal and making key decisions on data management, the Garden decided to conduct a user requirements analysis to establish the needs of scientists and other user groups and help prioritise investment. This prioritisation is necessary as funding is limited and we wish to fulfil the demands of the diverse users. Many different people and organizations interact with the Botanic Garden Meise and may access its data portal for different reasons (Fig.
In parallel with the user requirements analysis, the Garden has also developed a data management plan to clarify its position on the management of, and access to, digital herbarium data. This plan highlights system requirements for the herbarium data portal, which may not be directly visible to the users, but are nevertheless important to support the accessibility and citability of data. The plan will also clarify the issues of licensing and data embargoes, so that the limitations on the use of data are clear.
Having completed the process of user requirement analyses, we considered that the insights gained should be shared with other herbaria and museums. To this end, we present here the process and outcomes of this user requirements analysis. We identify similarities and differences in user expectations. We also identify requirements that users considered to be important, and suggested they be prioritised.
The mass digitization of herbaria around the world presents enormous challenges and opportunities for science. Ultimately, its success will be judged on the impact these efforts have on scientific progress and on society in general. An important part of this effort is to empower users with the tools and data they need to make an impact.
The Botanic Garden Meise contracted Clockwork, the digital engagement agency of Ordina, to conduct the user requirements analysis. Clockwork has extensive knowledge in the field of user experience and digital design and their lack of knowledge of botanical research was considered an asset as it helped to provide a fresh perspective on the user requirements for the herbarium portal.
Before consulting stakeholders a small team, comprising staff of the Garden, met to identify potential stakeholders and decided on those to be consulted in a requirements analysis as discussed in the introduction (Fig.
The second step of the preparatory phase was a “market analysis” where Clockwork and a core team from the Garden conducted a survey of the online landscape of herbarium tools and resources. The information gathered at this phase was used to inform ourselves of the current state of the art, so that questions could be framed and the opinions of the stakeholders could be contextualized.
Participants were recruited from scientists and historians of the Botanic Garden as well as from Belgian universities and scientific institutions. Externally, the recruitment was made by invitation in order to have some control over the representation of participants. Within the Garden, participants were recruited from a mixture of invited staff and self-selected volunteers. An effort was made to recruit a diverse range of participants, taking scientific discipline, gender, language and origin into account. Scientists were recruited broadly from within their fields, ranging from those interested in tropical or temperate regions to those studying vascular plants or cryptogams. In total, 23 participants were recruited: 12 taxonomists, 7 ecologists and 4 historians. These included 10 women and 8 external researchers.
The approach was for pairs of participants with similar jobs to be interviewed together. Each pair was presented with different task-scenarios related to their daily work and for which they would apply information from the herbarium. These tasks were selected to be representative of the tasks of each type of scientist identified during the market analysis workshop (Table
Task scenarios presented to participants in the user requirements interviews.
Researcher type | Task scenario |
---|---|
Historical based scientists | 1. Write a biography on a collector called Joseph Bequaert, on his voyages, taxonomic interests and the people he worked with. |
Historical based scientists | 2. Contrast the traditional uses of the genus Solanum in Africa and South America, then comment on the impact of modern Solanaceous introductions to traditional agriculture. |
Historical based scientists | 3. The garden has received a large collection of photographs on glass plates. There are some limited details that come with the collection, but you would like to improve the metadata associated with each image. |
Ecological based scientists | 4. You need to start gathering data for a species distribution model of Agrimonia eupatoria. Where would you get the data and ensure that they are free from errors? |
Ecological based scientists | 5. You need to create a red-list for Belgium. Look for the necessary data and determine the status. |
Ecological based scientists | 6. An alien species is gradually moving northward in Europe. Before it has naturalized in Belgium you need to write an impact assessment for decision makers. |
Taxonomic based scientists | 7. You find a specimen in another collection's herbarium. You can't read the signature but have the locality and the date. How would you figure out who the collector was? |
Taxonomic based scientists | 8. You think you have discovered a new species in the herbarium collection. How would you verify that it has not already been described? |
Taxonomic based scientists | 9. You are writing a revision of a large genus. You need to create a distribution map of each species. |
After the participants had listed the platforms and sources that they would use, we focused on the reasons why they chose these over other sources and platforms. They received a template for each source or platform they listed, up to a maximum of three. On this template, participants were asked to describe their experience of the source or platform: its key functionalities, its strengths, its weaknesses and whether anything could be improved or was missing.
Then, the focus moved to the current virtual herbarium of the Botanic Garden Meise in order to get an overview of its strengths and weaknesses according to the participants.
Before the participants were able to access and explore the current virtual herbarium, they were asked three questions:
Did you know that you can access a part of the Garden’s herbarium collection through the virtual herbarium on the website?
Have you ever accessed or used the virtual herbarium before? If so, what was your reason for using the virtual herbarium?
Did it meet your expectations? If not, why not?
During the exploration of the current virtual herbarium, they were also asked:
When you look back at the task that we worked on during this interview, do you think you could have used this virtual herbarium to fulfil one of your steps?
Why, or why not? What is missing in order to help you fulfil your task?
The final step of the interview focused on consolidating the input that was given by the participants during the previous steps. The goal of this step was to stimulate the participants to convert their feedback on external platforms as well as the Garden’s virtual herbarium into concrete requirements for the future virtual herbarium portal.
The key question that was asked to the participants at the end of the interview was “My ideal virtual herbarium should...”. Participants wrote down elements, functionalities, integrations, data-links, etc., they deemed to be crucial for the future virtual herbarium.
The interviews were conducted by two user experience researchers, both with backgrounds in qualitative data analysis. One of the researchers moderated the interview, while the other noted the feedback provided by each participant.
The interview notes were then enriched and digitized. The feedback of the participants was initially analysed by reading each comment and interpreting the underlying meaning. Affinity diagramming was used, as a qualitative content analysis method, over two iterations to establish core themes and then establish sub-themes (
The final step in the analysis approach consisted of manual clustering of closely related needs and insights. These clusters helped to identify overlap among the requirements that were proposed by the different types of researcher. These overlaps served as the basis for defining common user requirements.
A detailed list of all user requirements is provided in the supplementary data; the requirements are summarized below. An important distinction was between the requirements for data and those for functionality. This is pertinent because user requirements for functionality can be addressed in portal development, whereas data requirements need to be addressed in a long-term digitization strategy.
The majority of researchers that participated in the interviews explicitly mentioned the need for as much data from a specimen as possible. They felt that the quality of their analysis was strongly dependent on the amount of data that might be interrelated and analysed. The most important source of these data is the specimen itself. Taxonomists, historians and ecologists highly valued the ability to consult the physical specimen in the herbarium or failing that, a high quality zoomable image. The information that can be derived from these specimens goes beyond the data that is written on the labels. Taxonomists for example study the physical specimen anatomically, microscopically and genetically, while historians derive information from details such as the layout of the label, the handwriting, and even the type of paper used to display the specimen and the label. Finally, ecologists are interested in field notes that the collector added to the specimen, as these often contain information about the habitat, substrate and other environmental data. The detailed information on a herbarium specimen includes its label information, annotations and even the way it is mounted. Even when scientists are provided with a virtual herbarium they will often need to consult the physical specimen for further details.
The common data elements identified across the different researcher categories are listed in Table
A summary of the data elements mentioned by the different researcher types, showing which data elements researchers had in common and which were unique. This does not mean that any particular data element is not of interest to another group, only that it did not arise in the series of interviews. Details of these data elements can be found in the supplementary information. The full list of common data elements is listed in Table
Common data elements identified across the different researcher categories as being important to their work.
Key data elements | ‘nice to have’ |
---|---|
Current name and classification of the specimen | Abiotic factors related to the specimen |
The location where it was collected (ideally coordinates) | Information about the habitat of the collected specimen |
Country | Ecological information on the location where it was collected |
Date of collecting | Information on meteorology |
Name of the collector | Description of characteristics on both macro- and micro level |
Collection number given by collector | Being able to measure the length of leaves, flowers, … on the high resolution image via an intuitive tool that makes it possible to draw lines |
High resolution photo of the physical specimen (to get access to the metadata on the label that was not digitized in the database) |
Throughout the interviews, it became clear that the requirements for data go beyond the data that can be accessed directly via the herbarium. When the participants were asked to describe their ideal virtual herbarium, all three types of researcher repeatedly mentioned the value of creating links between data in different databases within as well as outside the Botanic Garden. The heterogeneity of data sources increases the effort required to find relevant data and also risks data being overlooked.
The Garden’s internal databases include those of the library, preserved plant collections, seed bank, living plant collection and photograph collections. Connections between these databases would facilitate research and simplify access to resources. For example, historians strive to reconstruct the sources of collected knowledge and data by looking for links between people, specimens, locations and collections. They would like links between herbarium data, gazetteers, biographies of collectors and the library catalogue. Taxonomists would like data and pictures of the Garden’s living collections alongside the dried specimens from the herbarium. Ecologists would like links to field notes of the collector to provide a deeper understanding of the habitat, plant stage, and other factors related to the specimen.
The main reason to link to external databases and platforms is to facilitate finding relevant literature and additional data. Suggestions for useful linkages included: the Biodiversity Heritage Library (www.biodiversitylibrary.org) for literature; JSTOR (www.jstor.org) for type specimens; the Global Biodiversity Information Facility (GBIF; www.gbif.org) for plant distribution; and the African Plant Database (www.ville-ge.ch/musinfo/bd/cjb/africa), The Plant List (www.theplantlist.org) and IPNI (www.ipni.org/index.html) for nomenclatural information. Other sorts of data and information can be provided through links to botanical illustrations, photographs and maps (historic and modern). Links to other herbaria, particularly to duplicate specimens, were considered important. This would assist curation and verification of material through taxonomic revisions in other herbaria. Scientists ideally would like a single shared and interactive portal for all herbarium specimen information.
Taxonomists attach importance to correct identification and correct names. However, they were not the only scientists to stress this. It was suggested that a simple nomenclatural overview for each specimen, listing the current name, related synonyms and common names, would be valuable.
This list could be used to search for and collect relevant data even if the accepted name is different in other databases. Using a smart search engine that shows all relevant data, material and literature for a particular taxon would be very valuable, by reducing the risk of overlooking data due to synonymy.
Finally, all three types of researcher expressed value in being able to track name changes on a specimen.
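The synonym-aware search that participants asked for can be illustrated with a minimal Python sketch. It is not the Garden's implementation: the `NameRecord` structure, the function names and the taxon names are invented placeholders, and a real portal would draw accepted names and synonyms from a nomenclatural source rather than a hand-built list.

```python
from dataclasses import dataclass, field


@dataclass
class NameRecord:
    """One accepted name with its known synonyms (hypothetical structure)."""
    accepted: str
    synonyms: set = field(default_factory=set)


def build_name_index(records):
    """Map every known name, lower-cased, to its accepted name."""
    index = {}
    for rec in records:
        index[rec.accepted.lower()] = rec.accepted
        for syn in rec.synonyms:
            index[syn.lower()] = rec.accepted
    return index


def search_specimens(query, specimens, index):
    """Return specimens filed under the queried name or any of its synonyms."""
    accepted = index.get(query.strip().lower())
    if accepted is None:
        return []  # unknown name: nothing to resolve
    return [s for s in specimens if index.get(s["name"].lower()) == accepted]


# Placeholder names, not real taxa.
records = [NameRecord("Exemplum album", {"Exemplum candidum"})]
index = build_name_index(records)
specimens = [
    {"id": "A1", "name": "Exemplum candidum"},
    {"id": "A2", "name": "Exemplum album"},
    {"id": "A3", "name": "Aliud rubrum"},
]
hits = search_specimens("exemplum album", specimens, index)
```

Here a query on the accepted name also returns the specimen still filed under the synonym, which is the behaviour that reduces the risk of overlooking data.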
Many researchers within the Garden mentioned that they had to request complex data extracts from the curator. Improving querying and extraction of data would save curatorial staff and scientists’ time. Participants were enthusiastic about the idea of accessing the current virtual herbarium via a user-friendly online interface. Even though the current web portal lacks some functionality, all of the participants appreciated its speed and liked being able to search for data with a few clicks instead of typing complex search queries.
A specific functionality requested by the taxonomists was the ability to define a bounding box or polygon to select specimens from an area. This would help them plan field trips, but also could help them judge the ecological conditions of the area. It might also be useful for creating simple summaries such as a checklist of trees or endangered species of an area.
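As an illustration of this kind of area selection, the sketch below filters georeferenced specimens by a bounding box or a polygon using a standard ray-casting test. The field names `lat` and `lon` and all function names are assumptions made for the example. The sketch treats coordinates as planar and ignores polygons that cross the antimeridian; a production system would more likely rely on a spatial database or GIS library.

```python
def in_bounding_box(lat, lon, south, west, north, east):
    """True if the point lies inside the box (edges included)."""
    return south <= lat <= north and west <= lon <= east


def in_polygon(lat, lon, polygon):
    """Ray-casting test; polygon is a list of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which this edge crosses the point's latitude.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside


def select_in_area(specimens, bbox=None, polygon=None):
    """Keep georeferenced specimens that fall inside the given area.

    bbox is (south, west, north, east); polygon is a list of (lat, lon).
    Specimens without coordinates are skipped, not guessed at.
    """
    selected = []
    for s in specimens:
        lat, lon = s.get("lat"), s.get("lon")
        if lat is None or lon is None:
            continue
        if bbox is not None and not in_bounding_box(lat, lon, *bbox):
            continue
        if polygon is not None and not in_polygon(lat, lon, polygon):
            continue
        selected.append(s)
    return selected


specimens = [
    {"id": "a", "lat": 5.0, "lon": 5.0},
    {"id": "b", "lat": 15.0, "lon": 5.0},
    {"id": "c", "lat": None, "lon": None},  # not yet georeferenced
]
square = [(0, 0), (0, 10), (10, 10), (10, 0)]
in_box = select_in_area(specimens, bbox=(0, 0, 10, 10))
in_poly = select_in_area(specimens, polygon=square)
```

Specimens that have not been georeferenced are silently excluded here, which is one reason the data requirements and the functional requirements cannot be separated.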
All of the participants wanted to be able to filter and sort the results of their queries, and then to download the dataset in a usable, spreadsheet-compatible format. They felt that herbaria are public sources of data and that the virtual herbarium should support them in retrieving the right data. However, the actual analysis of these datasets should be conducted by the scientists outside of the virtual herbarium environment.
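A filter, sort and download step of this kind might look like the following sketch, which uses only the Python standard library. The column names are hypothetical and loosely follow the key data elements identified in the interviews; dates are assumed to be ISO-formatted strings so that they sort chronologically.

```python
import csv
import io

# Hypothetical export columns, based on the key data elements.
FIELDS = ["catalog_id", "taxon", "country", "collector", "collection_date"]


def filter_sort(records, country=None, sort_key="collection_date"):
    """Filter by country (if given) and sort by one field."""
    rows = [r for r in records if country is None or r.get("country") == country]
    return sorted(rows, key=lambda r: r.get(sort_key) or "")


def to_csv(records):
    """Serialise records as CSV text that a spreadsheet can open."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


records = [
    {"catalog_id": "X2", "taxon": "T b", "country": "BE",
     "collector": "C", "collection_date": "1950-01-01"},
    {"catalog_id": "X1", "taxon": "T a", "country": "BE",
     "collector": "C", "collection_date": "1900-05-02"},
    {"catalog_id": "X3", "taxon": "T c", "country": "CD",
     "collector": "D", "collection_date": "1920-01-01"},
]
export = to_csv(filter_sort(records, country="BE"))
```

The export deliberately stops at a plain file: in line with the participants' view, any analysis happens outside the portal.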
The scientists that participated in the interviews were, in principle, open to the idea of moving their personal datasets to a central database. The main reason why they create local datasets is to be able to work with their data within a comparatively simple environment, while the central database is often too rigid.
The idea of using the online portal as a tool to insert data centrally was received quite positively by the different types of researcher. “We have to digitize our data somewhere, so we might as well do it directly in the central database and get the opportunity to relate our data with other data and extract it in a usable format for analysis”.
Furthermore, it would be convenient to link materials, photographs, data, etc. stored in a central database. Firstly, this database would enable the scientists to access their data and other material remotely. Secondly, a centrally managed database would lower the risk of catastrophic loss. Finally, this database could take care of standardization of their data, including taxon names, collector names, country names, etc.
Researchers pointed out that unrestricted direct editing could introduce errors. Opening up the system to uncontrolled data input could reduce the quality of information, potentially harming the data significantly. To protect the reliability of the herbarium data, openness should be balanced by transparency about how the data are derived.
In the researchers’ opinion data editing rights should only be granted to approved users through username and password control. But, even then, such edits should always go through a review process before overwriting the existing data. Edits would be sent to a validator or data manager, who reviews them. In the meantime, pending adjustments could already be made visible to other users while they are still under consideration. As such, visitors can already benefit from the new data, knowing that it is provisional and not yet validated.
In order to streamline the process of data validation, researchers should be able to take on the role of the validator. As a validator they could subscribe to updates about changes to specific parts of the database linked to their field of expertise. This would enable them to keep track of what happens to data connected to their own work, and also bring their expertise into the validation process.
The need for transparency also reflects on what researchers expect to see after the validation process. Based on the interviews the outcome of this validation process should be made visible via a data history feature. Users should be able to track back what happened to specific data in the past. Incorporating a history of changes would help researchers understand the evolution of data, which in turn could lead to more informed decisions on future modifications. A number of taxonomists and historians mentioned that this history of data could even serve as a starting point for future research projects. The history of data described above would provide transparency on the origin of data, which in turn provides an indication of its reliability. Several participants also suggested adding a clearly marked reliability factor to validated data.
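The editing workflow described by the participants (proposed edits, review by a validator, provisional display and a history of changes) can be summarised in a small data model. This is only a sketch under stated assumptions: the class names, status values and the shape of the history entries are hypothetical, and authentication, notifications to subscribed validators and persistence are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Edit:
    """A proposed change to one field of a specimen record."""
    field_name: str
    new_value: str
    author: str
    status: str = "pending"  # becomes "approved" or "rejected"
    reviewer: Optional[str] = None


@dataclass
class SpecimenRecord:
    values: dict
    edits: list = field(default_factory=list)
    history: list = field(default_factory=list)

    def propose(self, field_name, new_value, author):
        """Queue an edit; validated values are not touched yet."""
        edit = Edit(field_name, new_value, author)
        self.edits.append(edit)
        return edit

    def review(self, edit, reviewer, approve):
        """A validator accepts or rejects a pending edit."""
        if edit.status != "pending":
            raise ValueError("edit has already been reviewed")
        edit.reviewer = reviewer
        if approve:
            old = self.values.get(edit.field_name)
            self.values[edit.field_name] = edit.new_value
            # Keep old and new values so the change can be traced later.
            self.history.append(
                (edit.field_name, old, edit.new_value, edit.author, reviewer)
            )
            edit.status = "approved"
        else:
            edit.status = "rejected"

    def view(self):
        """Validated values plus pending proposals, flagged as provisional."""
        pending = [
            (e.field_name, e.new_value) for e in self.edits if e.status == "pending"
        ]
        return {"validated": dict(self.values), "provisional": pending}


record = SpecimenRecord({"country": "Unknown"})
edit = record.propose("country", "Belgium", author="user_a")
before = record.view()
record.review(edit, reviewer="validator_b", approve=True)
after = record.view()
```

Before review, the proposal is visible only under `provisional`; after approval, it replaces the validated value and the previous value is retained in the history.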
There was a remarkable difference of opinion among the scientists on whether data should be open or partially closed to external users. One of the main reasons for closing data was the fear that external scientists would “steal” the data, ideas and expertise and publish on it first. This concern is particularly present in the case of new specimens, collected during recent expeditions, or for specimens currently being used in research projects.
The idea of locking away certain specimens was mentioned a few times. Some scientists want to be able to embargo specimen data for a period or the duration of a project. Others believe it would suffice to make some data hidden from external users. In this case, users with an internal account would still be able to see all data. Data could be hidden by simply marking information ‘internal only’.
In contrast, some researchers believed that hiding specimen data goes against the Garden’s role as a public institution. They felt strongly that data should be shared with all those working on biodiversity, regardless of whether the person works inside or outside the Garden. Only for newly collected data did they agree on the need for a temporary embargo.
Opinion was also divided about the ability to download data sets. Some researchers were opposed to making this option easy for external users. For them, it should be mandatory for external people to identify themselves before being able to download data sets from the platform.
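The two access mechanisms that participants mentioned, a time-limited embargo and fields marked ‘internal only’, could be combined along the lines of the sketch below. The record layout and the names `embargo_until`, `internal_only` and `public_view` are assumptions made for illustration; the interviews did not settle which policy, if any, should be adopted.

```python
from datetime import date


def public_view(record, today, is_internal=False):
    """Return the fields a user may see, or None while under embargo.

    record is a dict with:
      "data"          : field name -> value
      "embargo_until" : date or None (hypothetical embargo end date)
      "internal_only" : set of field names hidden from external users
    """
    data = record["data"]
    if is_internal:
        return dict(data)  # internal accounts see everything
    embargo = record.get("embargo_until")
    if embargo is not None and today < embargo:
        return None  # the whole record is withheld until the embargo ends
    hidden = record.get("internal_only", set())
    return {k: v for k, v in data.items() if k not in hidden}


record = {
    "data": {"taxon": "T", "locality": "field site"},
    "embargo_until": date(2030, 1, 1),
    "internal_only": {"locality"},
}
during = public_view(record, date(2029, 6, 1))
afterwards = public_view(record, date(2030, 1, 1))
internal = public_view(record, date(2029, 6, 1), is_internal=True)
```

Under these assumptions an external user sees nothing during the embargo and a reduced record afterwards, while an internal user always sees the full record.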
There are several methods to produce user requirements, including prototyping, observing users, analysing pre-existing systems, focus groups and surveys. We engaged an external agency to combine their expertise in creating user requirements with our botanical expertise. In selecting a suitable subcontractor the methodology was an important criterion as we wanted a consultative approach so that the stakeholders at the Botanic Garden were engaged with the process. Nevertheless, owing to the time and costs of such an approach we did have to limit our investigation to researchers living in Belgium. A useful follow-up to this study would be to repeat this exercise in a tropical country where the benefits of data repatriation could be analysed.
The user requirements exercise demonstrated, to our own surprise, that researchers of different disciplines had similar needs, both in terms of their data requirements and the functionality of a web portal. Furthermore, delivering all these requirements would be a significant challenge, even for large institutions with sufficient IT resources. It is clear that development has to be prioritised and requirements need to be rated on their cost-effectiveness.
Transcription of label information is one of the most time-consuming aspects of digitization. Furthermore, geolocating specimens considerably increases the skill and effort required. Ideally all specimens would be transcribed, catalogued, photographed and geolocated, but decisions need to be made on the best way to achieve this, both from the perspective of cost and user needs. Is it better to have a little data from every specimen or all the data from some of the specimens? Users had broad data requirements, wanting as much of the label information as possible. So for most users it is better to transcribe the whole label of fewer specimens. This also makes the transcribed data more useful for a wide variety of research topics. However, if minimal data were recorded that enable users to find specimens comparatively easily, the image could be consulted directly to gain the additional information. So having an image available, even without much of its metadata, will support full transcription in the future and is a cost-effective way to disseminate information. Users anticipate consulting individual specimens even where the same digitized data are available online. Given the limitations on the rate of transcription, the most appropriate strategy would be to consult researchers as to which specimens to prioritise for transcription, but then completely transcribe the label information on those specimens.
The importance of linking herbarium data to internal and external databases was a requirement of all users. For example, in the case of ecology, linking taxonomy to trait data can be used to assign taxa to functional groups and facilitate modelling of ecosystems based upon these functions (
A technical requirement related to linking is the need to ensure persistence of these links. Herbarium portals therefore need to provide a permanent URI to a specimen (
Although it was not mentioned by the scientists, interlinking may reduce redundancy between databases and therefore reduce the curatorial effort of maintaining data such as taxonomic names and citations. Linking databases would make it possible to automate the process of updating names when a duplicate receives a new name in another herbarium. These links could also facilitate the exchange of georeferencing information and other details of the specimen. It will be necessary to make the origin of these data clear online, both to credit the sources and to give an indication of its reliability.
Linking of databases potentially brings together a large amount of complicated data that needs to be summarized succinctly. This need for data consolidation was a general requirement of participants. Currently, they often start from sources that consolidate data at the level of species and genus in a handy overview. Two good examples of platforms that provide such summaries are Tropicos (www.tropicos.org) and the GBIF (www.gbif.org). The types of data that are consolidated by these platforms consist of: an overview of names and their related literature; a hierarchical classification; the distribution of the taxon shown on a map; descriptions of the organism; links to other sources that contain related data; links to publications and images of preserved and living specimens.
From the requirements it is clear that it is impossible to separate the user requirements for an internal collections database from those for an online portal. For example, an online commenting system would either need a workflow to integrate these comments back into the main database, or the web portal would be just one view of the institution’s main database, with all the security and capacity implications that architecture would have. There are various competing herbarium database systems, including several bespoke solutions. Examples include BG-BASE, DaRWIN, DINA-Web, BRAHMS, JACQ and Specify. It is safe to say that none of these solutions meets all the user and system requirements detailed here. Certainly, the Botanic Garden Meise is not alone in the struggle to maintain legacy systems and create modern interfaces with obsolete technology. The lack of suitable alternatives eliminates the possibility of satisfying many user requirements with one simple software change. Rather it seems we must aim for incremental change, whilst trying to ensure these investments are at the same time future-proof for when new solutions become available. The best way to do this is to ensure that the data are maintained in standard formats and conform to standard controlled vocabularies.
One of the greatest concerns within the current information technology landscape in biology is orphaned data (
All researchers thought that feedback systems for data would be a valuable addition to an online portal. A good example of where feedback is used to good effect is on the Encyclopedia of Life website (http://eol.org). Here, changes and comments are displayed on taxon pages. Such a system would satisfy the researchers’ requirement to annotate records with their comments. However, translating these annotations into corrections that can be applied to the master database is an administrative challenge, because in contentious cases it is difficult to judge the authority and priority of edits. A compromise could be to implement a review system, whereby users can rate entries in addition to commenting. In this way potentially problematic entries can be flagged for review. Yet, these problems only need to be resolved when a user wishes to use a datum.
The most contentious subject among scientists is whether and how data are shared. This is a subject of much debate within the research community (Reichman 2011,
Most, if not all, of these user requirements will be familiar to curators, taxonomists and others who regularly work with herbarium data. Nevertheless, it is valuable to record these requirements for several reasons. We need to deliver as many of the requirements as possible, but also keep a record of our progress. Prioritisation is also critically important to make effective use of the available budget. Furthermore, it is useful to communicate our needs to other institutions because fulfilment of some of the user requirements requires cooperation and adoption of common standards by many institutions.
Researchers have high expectations of biodiversity informatics, both for the software and the data that have been digitized. User requirements are similar for different types of researcher and we should prioritize access to core data fields in an easily searchable and usable format. Nevertheless, the most useful way to prioritize the transcription of label information is to work on data that are required immediately for research, but always to transcribe the whole label. Furthermore, though researchers appreciate simple access to digital images and data, they still value access to the original specimens.
The authors would like to thank all those who took part in the user requirements exercise including Tim Adriaens, Petra De Block, Steven Dessein, Bram D’hondt, Koen Es, Paul Goetghebeur, Ivan Hoste, Pierre Meerts, Salvator Ntore, Els Ryken, Filip Vandelook and Paul Van Wambeke. Also, Sven Bellanger is thanked for his work on the illustrations. Funding was provided by the Vlaamse Regering for the project ‘Digitale Ontsluiting Erfgoedcollecties’.
Gathered needs per type of researcher
Data type: form