SIFlore, a dataset of geographical distribution of vascular plants covering five centuries of knowledge in France: Results of a collaborative project coordinated by the Federation of the National Botanical Conservatories

Abstract More than 20 years ago, the French Muséum National d'Histoire Naturelle (MNHN, Secretariat of the Fauna and Flora) published the first part of an atlas of the flora of France at a 20 km spatial resolution, accounting for 645 taxa (Dupont 1990). Since then, no work on this scale has addressed flora distribution at the national level, despite the obvious need for a better understanding. In 2011, in response to this need, the Fédération des Conservatoires Botaniques Nationaux (FCBN, http://www.fcbn.fr) launched an ambitious collaborative project involving eleven national botanical conservatories of France. The project aims to establish a formal procedure and standardized system for data hosting, aggregation and publication for four areas: flora, fungi, vegetation and habitats. In 2014, the first phase of the project led to the development of the national flora dataset: SIFlore. As it includes about 21 million records of flora occurrences, this is currently the most comprehensive dataset on the distribution of vascular plants (Tracheophyta) in the French territory. SIFlore contains occurrence information for about 15,454 plant taxa (indigenous and alien) in metropolitan France and Reunion Island, from 1545 until 2014. The data records were originally collated from inventories, checklists, literature and herbarium records. SIFlore was developed by assembling flora datasets from the regional to the national level. At the regional level, source records are managed by the national botanical conservatories, which are responsible for flora data collection and validation. In order to present our results, the FCBN developed a geoportal that allows the SIFlore dataset to be publicly viewed. This portal is available at: http://siflore.fcbn.fr.
As the FCBN belongs to the Information System for Nature and Landscapes (SINP), a governmental program, the dataset is also accessible through the websites of the National Inventory of Natural Heritage (http://www.inpn.fr) and the Global Biodiversity Information Facility (http://www.gbif.fr). SIFlore is regularly updated with additional data records. It is also planned to expand the scope of the dataset to include information about taxon biology, phenology, ecology, chorology, frequency, conservation status and seed banks. A map showing an estimation of the dataset completeness (based on the Jackknife 1 estimator) is presented and included as a numerical appendix.
Purpose: SIFlore aims to make the data of the flora of France available at the national level for conservation, policy management and scientific research. Such a dataset will provide enough information to allow for macro-ecological reviews of species distribution patterns and, coupled with climatic or topographic datasets, the identification of determinants of these patterns. This dataset can be considered as the primary indicator of the current state of knowledge of flora distribution across France. At a policy level, and in the context of global warming, this should promote the adoption of new measures aiming to improve and intensify flora conservation and surveys.

Taxonomic coverage
Note: The taxonomic and nomenclatural reference for the first edition of the SIFlore dataset is the fifth edition of the taxonomic repository for the fauna, flora and fungi of metropolitan France and overseas territories, TAXREF, which was developed within the framework of a convention between the French ministry of ecology, the MNHN, the FCBN and Tela Botanica. The overall methodological framework underlying the TAXREF repository is presented in Gargominy et al. (2014).
The version originally used for data aggregation is TAXREF v5.0, posted online on July 20th, 2012.
At the time of writing, the current version of TAXREF was v8.0, posted online on December 1st, 2014. On the http://siflore.fcbn.fr web atlas, data are available under the taxonomic and nomenclatural reference TAXREF v5.0. For practical reasons, data have been automatically linked to TAXREF v8.0 on the GBIF website, which may generate some taxonomic errors. This work was carried out under the responsibility of the Service du patrimoine naturel (SPN/MNHN). In order to prevent any error of interpretation, amended taxa were tagged in the Darwin core archive within the field datageneralization as follows: "taxon initially attached to the Taxonomic repository TAXREF v5. This taxon has undergone changes since".
Note that the taxonomic coverage of La Reunion Island was incomplete in this first version of the SIFlore dataset, as TAXREF v5.0 did not include most of the La Reunion Island taxa. This issue was addressed in TAXREF v7 and subsequent versions, with the integration of the "Index des Trachéophytes de La Réunion" (ITR, Boullet et al. 2012). Consequently, SIFlore completeness will be improved in the near future by including more taxa originating from La Réunion.
General taxonomic coverage description: The taxonomic coverage of this dataset spans the phylum Tracheophyta (vascular plants) present in metropolitan France (excluding the departments of Alsace and Lorraine) and La Reunion Island. Taxa are first identified at the species level and, if appropriate, subspecies level. The largest number of data records belongs to the Asteraceae family (2,317,247 records), followed by Poaceae (2,220,479 records), Fabaceae (1,436,758 records) and Rosaceae (1,264,321 records). The families with the fewest records are Escalloniaceae, Malpighiaceae and Schizaeaceae, with one data record each (Figure 1).

General geographic description
This national dataset collates species occurrences from metropolitan France and Reunion Island. Four different climates are covered in the metropolitan area: oceanic, continental, mediterranean and alpine. Reunion Island is an overseas department and region of France, located in the southwest Indian Ocean, in the Mascarene archipelago, to the east of Madagascar. The island has a tropical climate.
This geographical zone is characterized by a large range in altitude, from 0 to 4,810 m above sea level, and extends over an area of 516,499 km² representing 74% of Metropolitan France, its overseas departments and other territories.

Reference grids
In France, the Muséum National d'Histoire Naturelle (MNHN) recommends the use of standardized grids for species distribution maps. The MNHN reference grid is defined according to the French official map projection systems: Lambert 93 in Metropolitan France (grid name "L93_10X10") and UTM 40 S in La Reunion Island (grid name "grille_10km_ZEE_974").
Lambert-93 (EPSG 2154) is a Lambert conformal conic projection using the RGF93 geodetic system (compatible with WGS84) and defined by two standard parallels: 44°N and 49°N. The central meridian is 3°E of the Greenwich meridian and the latitude of origin is 46°30'N. The false easting and false northing are 700,000 m and 6,600,000 m, respectively.
The Universal Transverse Mercator projection, zone 40 South (UTM 40S), is a transverse variant of the Mercator projection. It is a cylindrical, conformal projection using the RGR92 geodetic system (compatible with WGS84). The central meridian is 57°E and the latitude of origin is 0°. The false easting and false northing are 500,000 m and 10,000,000 m, respectively.
Grids are composed of cells of 10 km by 10 km. Continental and maritime metropolitan France (954,500 sq. km) is divided into 9,546 cells and the terrestrial area of La Reunion Island (2,512 sq. km) is divided into 34 cells (the maritime area is not reported, as it is disproportionately large relative to the terrestrial area).
All SIFlore records are georeferenced through the code of the corresponding grid square.
French municipalities repository: The official geographic boundaries of the municipalities and associated data (BD CARTO®) were provided by the National Geographic Institute (IGN) based on the National Institute of Statistics and Economic Studies (Insee) database. Each municipality is referenced by an official geographic code (code Insee) and its name. The records in SIFlore are also georeferenced through the code of the corresponding municipality.
The BD CARTO® version originally used for data aggregation was published in 2011. The current version is BD CARTO® 3.1, which was revised in 2013.
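As a rough illustration of how a geographic position relates to a 10 km grid cell, the sketch below applies the standard forward formulas of the Lambert conformal conic projection (two standard parallels) with the Lambert-93 parameters quoted above, on the GRS80 ellipsoid used by RGF93, and then floors the projected coordinates to a 10 km cell. The function names and the cell label format (for example "E065N686") are assumptions made for this example and are not the official MNHN cell codes; an operational workflow would more likely rely on an established projection library.

```python
import math

# GRS80 ellipsoid, on which the RGF93 geodetic system is based.
A = 6378137.0                      # semi-major axis (m)
F_INV = 298.257222101              # inverse flattening
E2 = 2 / F_INV - (1 / F_INV) ** 2  # first eccentricity squared
E = math.sqrt(E2)

# Lambert-93 parameters as quoted in the text.
LAT1, LAT2 = math.radians(44.0), math.radians(49.0)  # standard parallels
LAT0, LON0 = math.radians(46.5), math.radians(3.0)   # origin latitude, central meridian
FALSE_E, FALSE_N = 700_000.0, 6_600_000.0


def _m(lat):
    return math.cos(lat) / math.sqrt(1 - E2 * math.sin(lat) ** 2)


def _t(lat):
    s = E * math.sin(lat)
    return math.tan(math.pi / 4 - lat / 2) / ((1 - s) / (1 + s)) ** (E / 2)


# Cone constant and scaling factor of the conic projection.
N_CONE = (math.log(_m(LAT1)) - math.log(_m(LAT2))) / (math.log(_t(LAT1)) - math.log(_t(LAT2)))
F_CONE = _m(LAT1) / (N_CONE * _t(LAT1) ** N_CONE)
R0 = A * F_CONE * _t(LAT0) ** N_CONE


def to_lambert93(lon_deg, lat_deg):
    """Project geographic coordinates (degrees) to Lambert-93 easting/northing (m)."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    r = A * F_CONE * _t(lat) ** N_CONE
    theta = N_CONE * (lon - LON0)
    return FALSE_E + r * math.sin(theta), FALSE_N + R0 - r * math.cos(theta)


def cell_index(easting, northing, size=10_000):
    """Column and row of the grid cell containing the point (10 km cells by default)."""
    return math.floor(easting / size), math.floor(northing / size)


def cell_label(easting, northing):
    """Hypothetical label built from the cell indices; not the official cell code."""
    col, row = cell_index(easting, northing)
    return f"E{col:03d}N{row:03d}"
```

By construction, the projection origin (3°E, 46°30'N) maps to the false easting and northing, which gives a simple sanity check: `to_lambert93(3.0, 46.5)` should return 700,000 m and 6,600,000 m.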

Temporal coverage
The oldest record in the dataset is from the year 1545 and the most recent records are from 2014. Most records (69.6%) were obtained after 2000 (Figure 2). Records for which the date of collection is unknown are registered with the year 1500.
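Because unknown dates are stored as the year 1500, any temporal summary has to treat that value as a missing-date marker and not as a real observation year. A minimal sketch of this handling is given below; the function names are illustrative, and whether the published percentages include or exclude undated records is not stated here.

```python
from typing import Iterable, Optional

UNKNOWN_YEAR = 1500  # sentinel value for an unknown collection date, as stated in the text


def observation_year(raw_year: int) -> Optional[int]:
    """Return None for the sentinel year, otherwise the year itself."""
    return None if raw_year == UNKNOWN_YEAR else raw_year


def share_after(years: Iterable[int], threshold: int = 2000) -> float:
    """Share of dated records observed strictly after `threshold`."""
    known = [y for y in map(observation_year, years) if y is not None]
    if not known:
        return 0.0
    return sum(1 for y in known if y > threshold) / len(known)
```

For example, `share_after([1500, 1545, 2001, 2014])` ignores the undated record and reports two of the three dated records as falling after 2000.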

Data collection
Primary data are collected by both professionals (from CBN and other organisations) and volunteers. As data originate from various sources (field inventories, scientific literature and herbaria), this task involves different professions, such as botanists and archivists. Data are entered into CBN databases and then follow a validation process. To ensure dataset homogeneity, data records are extracted from CBN databases and provided to the national system in a simplified format that is compatible with the SINP format (edited by MNHN). This format is also based on the Data Specification on Species distribution, produced by the INSPIRE Thematic Working Group Species distribution (http://inspire.ec.europa.eu/).
For each observation, key variables are recorded, such as a unique code for the record, the valid scientific name of the taxon and its identifier in the French national taxonomic repository (TAXREF). Additional details are also provided, including observation and data transmission dates, the collector's name, the source basis of record and geospatial information (municipality name and code, and the square code of the national grid). The grid system used for species inventories in France is defined by the Muséum National d'Histoire Naturelle: this is a grid composed of cells of 10 km by 10 km. Information about municipalities is collected from the Insee national repository. When available, additional information is collected, such as bibliographic and herbarium references or the primary source of the data record.
The evolution of the dataset is described in the Dataset history section. Once standardized, data are checked for consistency before being incorporated into the PostgreSQL/PostGIS national database by using Talend Open Studio for Data Integration. At the end of the process, data are posted on the FCBN geoportal.
Sampling description: Most records originate from field inventories (88.7%), with other records identified in scientific literature (11%) and in herbaria (0.2%). The protocols for collection vary over time and between collection sites, but also in response to other projects launched. However, in this first version, SIFlore does not include information on the data collection procedures. It is expected that a simplified description of the different protocols used by each CBN will be provided in the near future.
Nevertheless, according to Vallet et al. (2012), it is possible to assess survey completeness of regional floristic inventories despite heterogeneous sources and protocols, through the use of the Jackknife 1, a non-parametric estimator. This estimator was calculated for each 10 km by 10 km cell of the French map, aiming to mitigate sampling bias as effectively as possible. Thus, occurrences were all generalized to the rank of species (merging all infraspecific taxa into this one rank) and only recent data were taken into account (from 1990). Cells with fewer than 250 species were excluded from the analyses, as they were considered undersampled and therefore over-represented (Figure 3). The resulting map (Figure 4) should allow users to interpret their results appropriately.
Note that the completeness of the La Reunion Island survey could not be assessed in this work. Indeed, as mentioned above, in 2012, at the time of working on the first version of SIFlore, many La Reunion taxa were missing from the French national taxonomic repository for fauna, flora and fungi, TAXREF v5. As the analysis only takes into account the species that correspond across the two taxonomic references (ITR and TAXREF v5), all of the La Reunion cells were undersampled, according to the Jackknife calculation requirements.
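To make the completeness measure more concrete, the sketch below computes the first-order jackknife richness estimate in its usual incidence-based form, S_jack1 = S_obs + Q1 (m - 1) / m, where S_obs is the number of observed species, Q1 the number of species found in exactly one sampling unit and m the number of sampling units. Completeness is then taken as the ratio of observed to estimated richness. How sampling units were defined within each cell by Vallet et al. (2012) is not described here, so the grouping of records into units, the function names and the example data are assumptions.

```python
from collections import Counter


def jackknife1(samples):
    """First-order jackknife richness estimate.

    `samples` is a list of sets of species names, one set per sampling unit
    (the definition of a sampling unit is an assumption of this sketch).
    """
    m = len(samples)
    if m == 0:
        raise ValueError("at least one sampling unit is required")
    incidence = Counter(species for unit in samples for species in set(unit))
    s_obs = len(incidence)
    uniques = sum(1 for count in incidence.values() if count == 1)
    return s_obs + uniques * (m - 1) / m


def completeness(samples):
    """Ratio of observed richness to the jackknife 1 estimate for one cell."""
    s_obs = len(set().union(*samples))
    estimate = jackknife1(samples)
    return s_obs / estimate if estimate else 0.0


# Toy cell with three sampling units; species "c" and "d" are each seen once.
cell = [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}]
print(jackknife1(cell), completeness(cell))
```

In this toy cell, four species are observed and two of them occur in a single unit, so the estimate is 4 + 2 × 2/3 and the completeness is 0.75. A filter such as the 250-species threshold mentioned above would be applied before calling these functions.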
Quality control description: Quality control is implemented at different levels, under different responsibilities, throughout the data collection and validation process. Following digitalization, the dataset is first checked by regional data administrators, in order to ensure compliance with the survey protocols. The records are individually reviewed according to specified criteria, including the accuracy of the scientific name and the correctness of the geographical position entered, according to known chorology.
At the national level, a second step is carried out to ensure conformity of the data to the national standards, before compilation. Non-compliant data are rejected and an audit report is sent to the data provider. FCBN is currently working on an additional quality control step in order to ensure global coherence and to provide a relevance score for the distribution map of each species.
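The national conformity step can be pictured as a filter that separates compliant records from rejected ones and collects the reasons for rejection into a report for the data provider. The sketch below shows only this general pattern; the field names and the single rule used (all required fields must be filled in) are hypothetical and do not reproduce the actual national standard.

```python
# Hypothetical field names, chosen for illustration only.
REQUIRED_FIELDS = ("record_id", "scientific_name", "taxref_id", "grid_cell", "insee_code")


def audit(records):
    """Split records into accepted ones and an audit report of rejected ones."""
    accepted, report = [], []
    for record in records:
        missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
        if missing:
            # Rejected records are listed with the reason for rejection.
            report.append({"record_id": record.get("record_id"), "missing": missing})
        else:
            accepted.append(record)
    return accepted, report
```

The accepted list would go on to compilation, while the report would be returned to the regional provider so that the source database can be corrected there, in keeping with the principle that the federation does not alter records itself.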

Dataset history
In 1975, the Botanical Conservatory of Brest was created, with the support of the Ministry of Environment. This was the first establishment in the world entirely devoted to flora conservation. In 1988, the label "conservatoire botanique national" was legally recognized in France. There are currently eleven national botanical conservatories (CBN) in France.
As they are mandated to share their expertise with the national and local authorities, CBN have built knowledge databases in which information is structured to allow data sharing.
CBN have operational and managerial autonomy. Consequently, their databases differ in terms of structural design and the information contained. Nevertheless, there is an urgent need to provide information on flora distribution at the national level and, in particular, to define the flora conservation priorities more clearly. Furthermore, FCBN is involved in establishing the IUCN Red List of Threatened Flora at the national level, and in evaluating the conservation status of natural habitats and wild flora, according to the Council Directive 92/43/EEC of 21 May 1992 on the conservation of natural habitats and of wild fauna and flora. For these reasons, it was decided to pool all CBN flora records and also to apply a subsidiarity relationship between the FCBN and CBN. This means that data must only be handled by national botanical conservatories, and that the federation may only facilitate networking and ensure data aggregation and management, without altering the records in any way.
[Figure caption fragment, figure number not recoverable: data from 1990 to 2013. The number of records in each cell was used as an estimator of the sampling effort. The ratio between the observed and estimated richness of species measures the completeness of inventory in each surveyed cell (Vallet et al. 2012).]
In 2010, a working group was created to implement this project. It includes botanists and data managers from across the CBN, and project facilitators. Due to the heterogeneity of the CBN databases, data could not be aggregated easily and quickly without first defining a data exchange standard. As such, a standardized format was proposed and, in 2011, 10 taxa were selected for a first trial run. Based on this pilot, the data standard was refined. In January 2013, the aggregation of regional flora datasets was officially launched. Data were transmitted through various channels under the scenarios prepared by the working group. Data were then compiled and integrated into a PostgreSQL/PostGIS database using an extract, transform and load (ETL) system. In May 2013, 18 million data records were aggregated. In January 2014, an additional 3 million records were integrated. Meanwhile, a geoportal was developed to respond to the needs of the CBN and their partners for improving their understanding of flora distribution, as well as allowing the CBN to share their expertise with the public. The portal was finally published on the FCBN's website in February 2014.

Dataset description
The SIFlore dataset is a custom-made SQL view of the global database hosted by the FCBN. Only Tracheophyta data are shown. In the current version of the database, the key information provided for each record includes: unique identifier of the data record in SIFlore, identifier that points to the data in the original database file, source institution and database identifier, scientific name, valid identifier in the French national taxonomic repository (TAXREF), taxon rank, location of sighting (grid cell and municipality code), date that the flora was sighted and name of the data collector.
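The idea of exposing the dataset as a view restricted to vascular plants can be sketched as follows. The example uses an in-memory SQLite database from the Python standard library as a stand-in for the PostgreSQL/PostGIS database; the table, view and column names, as well as the sample rows and identifiers, are hypothetical and do not reflect the actual FCBN schema.

```python
import sqlite3


def build_demo():
    """Create a toy global table and a view limited to Tracheophyta records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE observation (
            record_id TEXT PRIMARY KEY,
            scientific_name TEXT,
            taxref_id INTEGER,   -- placeholder identifiers, not real TAXREF codes
            phylum TEXT,
            grid_cell TEXT,
            insee_code TEXT,
            obs_year INTEGER
        );
        CREATE VIEW siflore_view AS
            SELECT record_id, scientific_name, taxref_id,
                   grid_cell, insee_code, obs_year
            FROM observation
            WHERE phylum = 'Tracheophyta';
        """
    )
    conn.executemany(
        "INSERT INTO observation VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            ("r1", "Quercus robur", 1, "Tracheophyta", "E065N686", "00000", 2010),
            ("r2", "Amanita muscaria", 2, "Basidiomycota", "E065N686", "00000", 2011),
        ],
    )
    return conn


conn = build_demo()
rows = conn.execute(
    "SELECT scientific_name FROM siflore_view ORDER BY record_id"
).fetchall()
print(rows)
```

Only the vascular plant record is visible through the view, while the fungus record remains in the underlying table. This mirrors the description above, in which the global database also hosts other groups but the published dataset shows Tracheophyta only.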