RAINBIO: a mega-database of tropical African vascular plants distributions

Abstract The tropical vegetation of Africa is characterized by high levels of species diversity but is undergoing important shifts in response to ongoing climate change and increasing anthropogenic pressures. Although our knowledge of plant species distribution patterns in the African tropics has been improving over the years, it remains limited. Here we present RAINBIO, a unique comprehensive mega-database of georeferenced records for vascular plants in continental tropical Africa. The geographic focus of the database is the region south of the Sahel and north of Southern Africa, and the majority of data originate from tropical forest regions. RAINBIO is a compilation of 13 datasets, either publicly available or personal. Numerous in-depth data quality checks, both automatic and manual (the latter by several African flora experts), were undertaken for georeferencing, standardization of taxonomic names, and identification and merging of duplicated records. The resulting RAINBIO data allow exploration and extraction of distribution data for 25,356 native tropical African vascular plant species, which represents ca. 89% of all known plant species in the area of interest. Habit information is also provided for 91% of these species.


Introduction
Improving our understanding of the distribution of biodiversity has been suggested as "one of the most significant objectives for ecologists and biogeographers" (Gaston 2000). Indeed, fundamental understanding of biodiversity patterns and inference of conservation assessments leading to wise and sustainable management of biodiversity at various scales are heavily dependent on our knowledge of species distributions. For tropical regions especially, we have had an incomplete understanding of species distributions, which causes a major problem for ecological and conservation research (Bini et al. 2006, Feeley and Silman 2011). There has been a global lack of tropical biodiversity data availability (Collen et al. 2008, Feeley and Silman 2011), although this is increasingly being improved (e.g., Enquist et al. 2009; http://bien.nceas.ucsb.edu/bien/, www.gbif.org, ter Steege et al. 2016). The tropical vegetation of Africa contains high levels of species diversity but is subject to important shifts in response to ongoing climate change and increasing anthropogenic pressures (Blach-Overgaard et al. 2015; Hansen et al. 2013; Lewis et al. 2009; McClean et al. 2005). Even though our knowledge of plant species distribution patterns in the African tropics has been improving over the years (Linder et al. 2012; Stropp et al. 2016), it remains limited (Küper et al. 2006), which calls for initiatives to collate African biodiversity data.
Here, we present RAINBIO, a unique comprehensive database of georeferenced records of vascular plants (Tracheophyta) in sub-Saharan tropical Africa and north of Southern Africa, including Gulf of Guinea islands, Cape Verde and Zanzibar archipelagos (Fig. 1). Until recently, distribution data on tropical African plants were scattered among institutions and individual researchers and not compiled into a single comprehensive database. A recent analysis of African (including Madagascar) vascular plant species occurrences available via the Global Biodiversity Information Facility portal (GBIF, www.gbif.org) resulted in 934,676 herbarium records after data filtering for 57 countries (Stropp et al. 2016). However, over half of these specimens (512,680) belonged to South Africa alone, with Madagascar and Tanzania having the second and third most specimens, respectively (Stropp et al. 2016). This study underlined the lack of high quality data for tropical Africa, especially the forested regions. Several resources, often accessible via the internet, offer access to a large number of occurrences thanks to recent efforts to digitize and georeference herbarium specimens (e.g. TROPICOS, Oever and Gofferje 2012, Heerlien et al. 2014). Additionally, researchers on tropical African botany have created their own "working" datasets for their plant groups or regions of interest (e.g. Blach-Overgaard et al. 2010; Droissart et al. 2012; Wieringa and Sosef 2011). These datasets have the advantage of having updated specimen identifications and generally more accurate georeferencing compared to the larger institutional datasets. RAINBIO is a compilation of thirteen datasets and should be seen as a readily workable dataset because we applied several quality filters, checked the data quality (both georeferencing and taxonomy) and identified and merged duplicate records.

General description and purpose
The first target of the RAINBIO project (African RAIN forest community dynamics: implications for tropical BIOdiversity conservation and climate change mitigation), funded by CESAB (CEntre de Synthèse et d'Analyse sur la Biodiversité) of the FRB (Fondation pour la Recherche sur la Biodiversité, France), is to compile a state-of-the-art dataset on plant species distribution across tropical Africa. RAINBIO uses large publicly available datasets and smaller "non public"/private databases. The resulting RAINBIO mega-database allows the exploration and extraction of distributional data for 25,356 species (29,664 taxa including infraspecific taxa: subspecies and varieties) across continental tropical Africa. It is the first step towards a standardization of plant occurrences in this region and also contributes towards achieving Target 1 of the first Objective of the Global Strategy for Plant Conservation, "an online flora of all known plants", adopted by the Convention on Biological Diversity (Paton et al. 2008).

Datasets
Two datasets are provided in CSV format as well as an R.data workspace (http://rainbio.cesab.org/). For the latter, an R script is provided for exploring and mapping occurrences.
The database made available here represents a subset of available fields (see below). The actual RAINBIO database follows the Darwin Core standard (Wieczorek et al. 2012). Users interested in fields not provided here (see details at http://rainbio.cesab.org/) are invited to contact the first or last author.
The RAINBIO database is subject to future updates. Users interested in having an updated version of the database are invited to contact the first or the last author.

Collectors and owners of the data
RAINBIO is a compilation of thirteen datasets (more details on these sources at the end of the article) of three kinds: (i) extensive 'public' databases of several herbaria institutes (BR, BRLU, K, LISC, MO, and WAG (incl. AMD, L & U as well); acronyms according to Thiers (continuously updated)), (ii) personal databases collated by individual researchers (focusing on a given taxonomic group or a given geographic area) and (iii) other sources of plant occurrences such as silica-gel collections or vegetation plot inventories. The WAG dataset includes a series of personal datasets (like ii) compiled for taxonomical revisions of over 35 genera in different families. Occurrences are thus supported by specimens deposited in herbaria (586,920 records), silica-dried specimens (13,510 records) or observations from plot inventories (13,443 records).

Methods of data collection
The workflow for building the database involved numerous steps of cleaning, standardizing and quality checks described below. These steps were essentially built up in PostgreSQL and PostGIS scripts. Several other cleaning and checking steps were run using the R statistical software (R Core Team 2015).

Georeferencing verification processes
We performed two quality control checks on the geographical coordinates of the records. First, we checked whether the documented country of each record corresponded to the country in which the record was georeferenced (Fig. 2).
If false, we checked whether the georeferenced record fell within a country neighbouring the documented country (Fig. 2A). If true, the occurrence was classified as 'Neighbour' and the nearest distance between the occurrence and the border of the documented country was calculated. Records with a distance of more than 5 km were discarded while records with a distance less than 5 km were retained. The logic behind this is that records could well be from a country neighbouring the one provided by the coordinate, because either the map or the coordinate is not precise enough.
Second, we checked if the occurrence fell within an ocean (Fig. 2C). If true, the nearest distance between the occurrence and the coastline was calculated. If the distance was greater than 5 km, the record was discarded. If the distance was less than 5 km the record was retained. Again, the logic behind this was that the coordinate or the map may not be precise enough.
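If false, i.e. if the record fell neither in the documented country, nor in a neighbouring country, nor in an ocean, the record was discarded.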
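The decision logic of these two checks can be sketched as follows. This is a minimal illustration in Python and not the project's PostGIS implementation: the spatial queries (which country a point falls in, and its distance to a border or coastline) are assumed to have been computed beforehand, and the `GeoCheck` structure, the `check_record` function and the category labels other than 'Neighbour' are hypothetical names introduced here. The text does not say how a distance of exactly 5 km was handled; the sketch discards it.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

MAX_DISTANCE_KM = 5.0  # tolerance stated in the text


@dataclass
class GeoCheck:
    """Pre-computed spatial facts about one record (hypothetical structure)."""
    documented_country: str            # country written on the label
    located_country: Optional[str]     # country containing the point; None if in an ocean
    neighbours: FrozenSet[str]         # countries bordering the documented country
    distance_km: Optional[float]       # distance to the documented border or to the coastline


def check_record(rec: GeoCheck) -> Tuple[str, bool]:
    """Return (category, keep) for one record."""
    if rec.located_country == rec.documented_country:
        return ("Match", True)
    # Assumption: a missing distance or a distance of exactly 5 km is not retained.
    within_tolerance = rec.distance_km is not None and rec.distance_km < MAX_DISTANCE_KM
    if rec.located_country in rec.neighbours:
        return ("Neighbour", within_tolerance)
    if rec.located_country is None:
        return ("Ocean", within_tolerance)
    # Neither the documented country, a neighbour, nor an ocean.
    return ("Mismatch", False)
```

Keeping the geometry outside the function makes the rule set easy to test independently of any GIS library.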

Taxonomic backbone and standardization of taxonomic names
To resolve problems such as spelling errors and/or synonymies linked to heterogeneous taxonomic datasets, we first relied on the taxonomic backbone table used by the Naturalis Biodiversity Center herbaria (AMD, L, U & WAG). The structure of this table provides links among taxa names allowing the standardization of species name spelling and synonyms. We then submitted this taxon list (30,147 names) to the online "Taxonomic Name Resolution Service" (TNRS, Boyle et al. 2013). This tool compares submitted names to names from four different sources (TROPICOS (http://www.tropicos.org/), USDA (http://plants.usda.gov), the Global Compositae Checklist (www.compositae.org/checklist) and The NCBI Handbook (http://www.ncbi.nlm.nih.gov/guide/taxonomy/)). The program returns a name match with the taxonomic status (accepted or not) and an overall matching score (a value between 0 (no match) and 1 (perfect match)). From this, two lists were produced: one identifying misspelled names and one identifying potential synonyms. The first list was generated by filtering out names with a taxonomic status as "accepted", an overall score below 1 and no partial match (i.e. both genus and species names are matched). The second list was created by filtering out (i) names with an overall score of 1 and whose submitted name was different from the accepted name, and (ii) names with an overall score under 1 and whose matched name was different from the accepted name. For both lists, we further screened the database names manually for the presence of the matched name (for the list of misspelled names) and for the presence of the accepted name (for the list of synonyms). The remaining and/or unresolved names were then scrutinized against the African Plant Checklist and Database (Klopper et al. 2007, African Plants Database 2015) and the World Checklist of Selected Plant Families (Govaerts et al. 2009) to assess their status.
Overall, if we consider records that passed the different georeferencing quality checks (see above), 3,114 species names (3,806 taxa) were excluded from our database after these different standardization procedures.
Family names for angiosperms were standardized following the Angiosperm Phylogeny Group III system (APG III 2009).
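The construction of the two lists can be illustrated with a short Python sketch. It assumes that "filtering out" means extracting the names that meet the stated conditions, and it uses hypothetical field names (`submitted`, `matched`, `accepted`, `status`, `score`, `partial`) that do not reproduce the actual TNRS output columns.

```python
def split_name_matches(rows):
    """Split name-resolution results into (misspelled, synonyms).

    Each row is a dict with hypothetical keys:
      submitted, matched, accepted : name strings
      status                       : taxonomic status of the matched name
      score                        : overall matching score in [0, 1]
      partial                      : True if only part of the name was matched
    """
    misspelled, synonyms = [], []
    for r in rows:
        # List 1: accepted name, imperfect score, genus and species both matched.
        if r["status"] == "accepted" and r["score"] < 1 and not r["partial"]:
            misspelled.append(r["submitted"])
        # List 2: (i) perfect score but submitted name differs from the accepted name,
        #         (ii) imperfect score and matched name differs from the accepted name.
        perfect_but_not_accepted = r["score"] == 1 and r["submitted"] != r["accepted"]
        imperfect_and_not_accepted = r["score"] < 1 and r["matched"] != r["accepted"]
        if perfect_but_not_accepted or imperfect_and_not_accepted:
            synonyms.append(r["submitted"])
    return misspelled, synonyms
```

Both lists are only candidates: as described above, each was subsequently screened manually against the names already present in the database.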

Workflow to identify and merge record duplicates
The database is a compilation of both extensive 'public' databases compiled by herbarium institutes and smaller personal databases focusing on either a given taxonomic group or a given geographic area. Despite their limited number of records, the latter have been compiled by experts and therefore the quality of georeferencing and identification is generally better. A major issue was that most records in personal databases were duplicated within the large herbarium databases. Likewise, there was overlap in specimen data among major herbarium databases because specimens have often been collected in several duplicates that were later distributed among herbaria. It was important to identify and merge these duplicates because each could carry a different identification and/or georeference. Hence, the identification of duplicate records had to be carried out in order to select the most accurate information in cases where duplicate records contained conflicting data.
When duplicates with different identifications were encountered, the following procedure was followed to identify the most reliable record:
- if the identification varied between an institutional and a personal database, we chose the identification recorded in the personal database (see the description of the datasets below).
- if a personal database was not available, we chose the identification with the most recent date of identification.
- if identification dates were similar or not given, we chose the identification at the lowest taxonomic rank (e.g. genus, species, subspecies, etc.). For example, if one record was identified to the infra-specific level while another was identified to the genus level, then the former was chosen.
- if after these steps no single record was identified, a random one was chosen.
When duplicates with different coordinates were identified, the following successive steps were undertaken to identify the most reliable georeferencing:
- if only one of the records passed the quality check for country described above, those coordinates were chosen.
- if the coordinates came from an institutional and a personal database, the chosen georeferencing was the one from a personal database (see the description of the datasets below).
- if none was chosen by the previous step, the chosen georeferencing was the one with the highest precision of the geographical coordinates using a precision code calculated for the project from 1 to 8 (see Table 1).
- if after these steps no single record was identified, a random one was chosen.
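As an illustration, the cascade used for conflicting identifications might be sketched in Python as below; the cascade for coordinates follows the same pattern with different criteria. The record fields (`source`, `det_year`, `rank`) and the `RANK_DEPTH` table are hypothetical. Two points that the text leaves open are resolved by assumption: the date step is skipped when any candidate lacks a date, and "similar" dates are treated as identical years.

```python
import random

# Higher value = lower (more precise) taxonomic rank. Hypothetical table.
RANK_DEPTH = {"genus": 0, "species": 1, "subspecies": 2, "variety": 2}


def pick_identification(duplicates, rng=random):
    """Choose the most reliable identification among duplicate records."""
    # Step 1: prefer identifications from personal databases when there are any.
    personal = [d for d in duplicates if d["source"] == "personal"]
    pool = personal or list(duplicates)

    # Step 2: most recent identification date (assumption: only if all are dated).
    if all(d.get("det_year") is not None for d in pool):
        newest = max(d["det_year"] for d in pool)
        pool = [d for d in pool if d["det_year"] == newest]

    # Step 3: lowest taxonomic rank among the remaining candidates.
    if len(pool) > 1:
        deepest = max(RANK_DEPTH[d["rank"]] for d in pool)
        pool = [d for d in pool if RANK_DEPTH[d["rank"]] == deepest]

    # Step 4: random choice if several candidates remain.
    return pool[0] if len(pool) == 1 else rng.choice(pool)
```

Passing a seeded `random.Random` instance as `rng` makes the final tie-break reproducible.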

Identification of introduced and cultivated taxa
Because we wanted to work only with natural occurrences of indigenous species, we had to, as far as possible, identify and discard specimens collected from planted and/or cultivated individuals and those from introduced species.
The first step in this process was to screen the text in the locality field of the specimen records. We first built a preliminary list of locality descriptions by searching for a list of keywords (e.g. 'Botanical garden'). Of this preliminary list of 898 locality descriptions we selected 653 that most likely correspond to ex situ living collections. All records collected from these localities (1,427) were then discarded.
In order to differentiate between native species and cultivated or other introduced taxa, the following procedure was adopted. We expected to find most cultivated or introduced taxa among those with few collections (these taxa are in fact rarely collected in the field). We therefore first extracted all species with fewer than eleven records. Then, GBIF occurrences were used to document the distribution outside of the area covered by the RAINBIO database: for each species, we verified whether occurrences were available on GBIF and, if that was the case, we downloaded GBIF occurrences using the rgbif package (Chamberlain et al. 2016). Species appearing as mostly collected outside of the geographical coverage of the RAINBIO database were selected and manually checked to confirm they were truly introduced/cultivated, resulting in a list of 1,658 cultivated/introduced species. This list was further completed by using information provided by the African Plant Checklist and Database (Klopper et al. 2007) and the World Checklist of Selected Plant Families (Govaerts et al. 2009). The final list of cultivated/introduced species identified within the RAINBIO database comprised 1,635 species, which corresponds fairly well to the ca. 2,100 naturalized non-African species in the whole of Africa as calculated by Kleunen et al. (2015). All records belonging to those taxa were discarded (but see http://rainbio.cesab.org/ for the list of those species).
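A keyword screen of this kind could look like the following Python sketch. Only 'Botanical garden' is quoted in the text; the other keywords, like the function name, are hypothetical placeholders. The output is a preliminary list intended for manual review, in line with the selection of 653 out of 898 descriptions reported above.

```python
# Only "botanical garden" comes from the text; the others are illustrative assumptions.
KEYWORDS = ("botanical garden", "jardin botanique", "arboretum")


def flag_localities(localities, keywords=KEYWORDS):
    """Return the distinct locality descriptions containing any keyword (case-insensitive)."""
    flagged = {
        loc for loc in localities
        if any(k in loc.lower() for k in keywords)
    }
    return sorted(flagged)
```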
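The pre-selection of candidate introduced species can be sketched as below. This is an illustrative Python outline, not the R/rgbif workflow actually used: external occurrences are assumed to be already downloaded as (latitude, longitude) pairs, `in_region` is a user-supplied test for the RAINBIO coverage, and the 0.5 threshold is an assumed reading of "mostly collected outside". The result is a list for manual checking, not a final classification.

```python
def candidates_for_review(record_counts, external_points, in_region,
                          max_records=10, outside_share=0.5):
    """List rarely collected species whose external occurrences lie mostly outside the region.

    record_counts   : dict species -> number of records in the database
    external_points : dict species -> list of (lat, lon) from an external source
    in_region       : callable (lat, lon) -> bool, True inside the area of interest
    max_records     : "fewer than eleven records" means at most 10
    outside_share   : hypothetical threshold for "mostly outside"
    """
    flagged = []
    for species, n in record_counts.items():
        if n > max_records:
            continue
        points = external_points.get(species, [])
        if not points:
            continue  # nothing available externally, cannot be assessed this way
        outside = sum(1 for lat, lon in points if not in_region(lat, lon))
        if outside / len(points) > outside_share:
            flagged.append(species)
    return sorted(flagged)
```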

Geographic coverage
Records of the RAINBIO database are localized in continental Africa, excluding Madagascar and Indian Ocean islands, but including Gulf of Guinea islands, Cape Verde and Zanzibar archipelagos, representing 51 different countries. All records fall within an area delimited between -34.8328 and 37.1094 degrees of latitude, and between -25.33 and 51.4 degrees of longitude.
The geographic coverage of the RAINBIO database, i.e. where record density is significant, is a region broadly delimited by ecoregions (sensu Olson et al. 2001) south of the Sahel and north of Southern Africa. The most significant amount of data originates from the tropical forest regions (Fig. 1).
Table 1. Accuracy code given to georeferenced records and corresponding uncertainty in degrees.
In 2007, the total number of Angiosperm taxa in an area broadly corresponding to the geographic coverage of the RAINBIO database was estimated to be 32,424 by the African Plant Checklist and Database (Klopper et al. 2007, African Plants Database 2015). The RAINBIO database comprises 29,013 angiosperm taxa. We can therefore estimate that the RAINBIO database includes information for approximately 89% of all known species in the area of interest.

Habit data
We provide habit for almost all species recorded in the RAINBIO database (available for 23,111 species or 91% of all species). Information was gathered at the species level and was initially taken from the Naturalis Herbarium Collections database.
This information was then completed by relying on the field description of herbarium specimens: keywords for seven specific habits (tree, shrub, herb, liana, epiphyte, mycoheterotroph and parasitic) were searched for in the description field of all specimens. For example, for the 'tree' habit, the keywords were "Tree", "tree", "Arbre", "arbre", "Arbor", "arbor". If one of these keywords was found in the description field of a specimen, the record was tagged for the 'tree' habit. The tags for each habit were then summed for each species. This procedure resulted, for example, in twenty tags for the species Acacia adenocalyx, among which nine concerned the 'shrub' category and seven the 'liana' category. For each species the habit with the highest number of tags was chosen. If this habit represented less than half of the tags, the second ranked habit was considered as a secondary habit. For Acacia adenocalyx, this procedure therefore resulted in the choice of 'shrub' habit as the primary habit and 'liana' habit as a secondary habit. Erect palm-like plants (e.g. Palms, Dracaena, Pandanus) are included as 'shrub' or 'tree' according to literature.
The results obtained through this procedure were compared to the information obtained through the Naturalis Herbarium Collections database. Results were mostly congruent, validating our procedure. Mismatches between both sources and species with missing habit were finally manually checked and added by using information provided by the African Plant Checklist and Database (Klopper et al. 2007, African Plants Database 2015), the World Checklist of Selected Plant Families (Govaerts et al. 2009) and by checking specimens of such species.
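The tagging and ranking rule can be illustrated with a brief Python sketch. Only the 'tree' keywords are given in the text, so the keyword table contains that habit alone; the function names and the plain substring matching are assumptions made for illustration.

```python
from collections import Counter

# Keywords quoted in the text for the 'tree' habit; other habits would be added likewise.
HABIT_KEYWORDS = {
    "tree": ("Tree", "tree", "Arbre", "arbre", "Arbor", "arbor"),
}


def tag_description(description):
    """Return the set of habits whose keywords occur in a specimen description."""
    return {
        habit for habit, keywords in HABIT_KEYWORDS.items()
        if any(k in description for k in keywords)
    }


def choose_habits(tag_counts):
    """Return (primary, secondary) habit from per-species tag counts.

    The most frequent habit is primary; if it accounts for less than half
    of all tags, the second most frequent habit is kept as secondary.
    """
    ranked = Counter(tag_counts).most_common()
    if not ranked:
        return (None, None)
    primary, top = ranked[0]
    total = sum(tag_counts.values())
    secondary = ranked[1][0] if top < total / 2 and len(ranked) > 1 else None
    return (primary, secondary)
```

With nine 'shrub' and seven 'liana' tags out of twenty, as reported for Acacia adenocalyx, `choose_habits` returns 'shrub' as primary and 'liana' as secondary, however the remaining four tags are distributed.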

Temporal coverage
Collecting years range from 1782 to 2015.

Description of the thirteen datasets
The thirteen datasets that contributed to RAINBIO are described below, sorted according to the total number of records provided.