
Scientific data integration and aggregation is a challenging, time-consuming and error-prone process. Marine data (e.g. biodiversity data, data about fisheries and stocks) is scattered across heterogeneous databases that have no standard structure and were not designed to interoperate (database silos); the guidelines for populating these databases are heterogeneous as well.
The main challenge in BlueBRIDGE was to integrate and aggregate marine data coming from heterogeneous data sources in a semantic-rich way to build the Global Record of Stocks and Fisheries VRE.
Although the above processes were carried out in the marine domain in the context of the BlueBRIDGE project, these practices can be applied to any other domain.
The BlueBRIDGE best practice
To map, integrate and aggregate marine scientific data, BlueBRIDGE has adopted a six-step process:
The key features of these steps that guarantee the efficient integration and aggregation of heterogeneous marine data are the following:
|
Why this is considered a best practice
Best Practice Analysis |
|
Validation |
The output of the marine data modelling and aggregation best practice described above is the Global Record of Stocks and Fisheries VRE (GRSF VRE). The GRSF VRE, and therefore the full modelling and aggregation process, has been validated by FAO (FIRMS Database), the University of Washington (RAM Database) and SFP (FishSource Database). The feedback was positive, as the users (experts) reported that: 1) they can answer queries that they could not answer before; 2) they can browse and discover complete sets of data about stocks and fisheries, which was not possible before; 3) the data is of better quality, providing them with “hints” as to how to improve their own databases. |
Innovation |
The best practice contributed to delivering the first Global Record of Stocks and Fisheries. Scientists adopted it as an iterative methodology for improving the final result, with the overall aim of delivering a high-quality product. For example, once a draft version of the registry was available, scientists adapted the competency queries to resolve issues and capture more concepts, which in turn triggered all the subsequent steps needed to reconstruct the registry.
As a result, the final version of the registry can be considered as a knowledge base containing a coherent set of facts on stocks and fisheries that can be used to carry out advanced stock and fishery assessment activities. |
Success Factors |
The main factors that made the practice successful are the following:
|
Sustainability |
Curating and maintaining the integrated dataset is a continuous activity. The semantic warehouse relies on data coming from different data sources and, as such, it might require adaptations as the data in those sources change. These adaptations involve changes to the mappings, as well as changes to the algorithms that integrate the data.
In addition, competency queries might change. As new user requirements emerge, mappings and algorithms that integrate data need to be updated to capture them.
These activities should be carried out by a person familiar with mapping technologies and software. Although we cannot estimate the effort and cost involved (this clearly depends on the complexity and the number of new requests), we are confident that the benefits will be substantial. |
Replicability and/or up-scaling |
The proposed practice can be used by: a) institutions that perform data mapping or data aggregation activities; b) companies that exploit integrated data for profit (e.g. companies exploiting marine resources); c) non-profit organisations that use scientific data for decision making and for predicting potentially disastrous scenarios (e.g. FAO predicting the depletion of a marine resource).
The organisations involved in the context of the BlueBRIDGE project were:
Although the best practice was tailored to integrating data in the marine domain, it can be applied to any other domain with little or no modification (only the semantic model might need to be updated). |
Lessons Learnt
Collating information is both difficult and time-consuming, as the information is scattered across different databases and is modelled using different formats and standards. The proposed best practice defines the steps required to integrate data from different sources, so that it can be used to answer queries that could not be answered before (as they required combining knowledge from heterogeneous data sources), and to assess the connectivity of the resulting semantic warehouse. Thanks to this practice, scientists can easily discover large and complete datasets to assist them in their research; investors have central access to rich information to support their decisions to invest in a specific field (e.g. harvesting tuna in the North Atlantic); industry can acquire and analyse data to predict the sustainability of its businesses and improve its profits; and general users can obtain valid information in their domain of interest.
The main lessons learned during the application of the best practice in the context of BlueBRIDGE were:
- Collaboration with the providers of the data sources is critical. Without their support in understanding the semantics, clarifying their data and validating the mappings, it is extremely difficult to attain successful aggregation.
- The use of a central semantic model is mandatory to overcome the heterogeneity that exists among the schemata of the original data sources. Furthermore, it removes the complexity of updating technical components when the schemata of the original sources change (in this case only the mappings need to be updated and all the technical components remain unchanged).
- Collection of the query requirements (the competency queries) should always be the first step. A wrong estimation of the querying capabilities of the resulting system can lead to a completely wrong design incapable of serving the query needs of the marine community.
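
The second and third lessons can be illustrated with a minimal Python sketch. It is not the BlueBRIDGE implementation: the source names (source_a, source_b), the field names and the sample rows are all hypothetical, and a plain dataclass stands in for the central semantic model. The sketch only shows the design idea that each source is tied to the central model through its own mapping, and that a competency query is written once against the central model.

```python
from dataclasses import dataclass

# Central model (hypothetical): downstream code depends only on this shape.
@dataclass(frozen=True)
class StockRecord:
    species: str
    area: str
    source: str

# One mapping per source: central field -> source field (names are made up).
MAPPINGS = {
    "source_a": {"species": "sci_name", "area": "fao_area"},
    "source_b": {"species": "taxon", "area": "region"},
}

def to_central(source, row, mappings=MAPPINGS):
    """Translate one source row into the central model via that source's mapping."""
    m = mappings[source]
    return StockRecord(species=row[m["species"]], area=row[m["area"]], source=source)

def integrate(rows_by_source, mappings=MAPPINGS):
    """Map every row of every source into the central model."""
    return [
        to_central(source, row, mappings)
        for source, rows in rows_by_source.items()
        for row in rows
    ]

def sources_reporting(records, species, area):
    """Example competency query: which sources report this species in this area?"""
    return sorted({r.source for r in records if r.species == species and r.area == area})

# Illustrative rows only; they are not taken from any real database.
rows_by_source = {
    "source_a": [{"sci_name": "Thunnus thynnus", "fao_area": "27"}],
    "source_b": [
        {"taxon": "Thunnus thynnus", "region": "27"},
        {"taxon": "Gadus morhua", "region": "21"},
    ],
}
records = integrate(rows_by_source)
print(sources_reporting(records, "Thunnus thynnus", "27"))  # ['source_a', 'source_b']

# If source_b renames "region" to "area_code", only its mapping changes;
# to_central, integrate and the competency query stay untouched.
new_mappings = {**MAPPINGS, "source_b": {"species": "taxon", "area": "area_code"}}
renamed = {"source_b": [{"taxon": "Gadus morhua", "area_code": "21"}]}
print(integrate(renamed, new_mappings)[0].area)  # 21
```

Writing the competency query before the mappings, as the third lesson recommends, makes it clear early on which central fields every source must be able to supply.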