
BlueBRIDGE organized a webinar on 28 June 2017 at 11.10 am CEST to highlight the importance of semantic data integration, the current approaches to integration, and the main case studies of marine data from the BlueBRIDGE project. It showcased a process for constructing a semantic warehouse of marine data and how such a warehouse can be exploited.
This webinar was of particular interest to data scientists working with marine data, engineers working on semantic data integration solutions, and researchers interested in semantic data integration approaches. Around 40 participants attended the webinar.
Below you can find the webinar recording and presentation.
Click here to access the material (or view it on Slideshare)
Download the webinar recording
Webinar Description
Every day we produce huge amounts of data of various types and for various purposes; however, these data are not integrated. Data integration aims at combining data residing in different sources and providing users with a unified view of these data. The unified view enables answering queries and discovering insights that cannot be obtained from individual sources. Data integration is significant in a variety of situations and of paramount importance for e-science, especially for large-scale scientific questions such as global warming, invasive species spread, and resource depletion.
This webinar described the motivation for semantic integration of data, its difficulties, the related requirements and tasks, the main approaches for integration, and then it focused on case studies of marine data coming from the ongoing BlueBRIDGE project as well as from previous projects in the same area (the completed EU research infrastructure project iMarine). A process for constructing semantic warehouses of marine data was introduced, a process that comprises various steps including ontology-engineering, schema mapping, entity matching, provenance management, and quality testing. Finally, the webinar discussed the exploitation of such semantic warehouses, as well as future steps and open challenges.
Webinar info:
- Presenter: Yannis Tzitzikas, Associate Professor, Computer Science Department, University of Crete & Affiliated Researcher of FORTH-ICS
- Duration: 1 hour
- Start date: 28 June 2017
- Start time: 11.10 am CEST
- Timezone: Central European Summer Time (CEST)
Speaker Profile
Yannis Tzitzikas is currently Associate Professor of Information Systems in the Computer Science Department at the University of Crete (Greece) and Affiliated Researcher of the Information Systems Laboratory at FORTH-ICS (Greece). Before joining the University of Crete and FORTH-ICS he was a postdoctoral fellow at the University of Namur (Belgium) and an ERCIM postdoctoral fellow at ISTI-CNR (Pisa, Italy) and at VTT Technical Research Centre of Finland. He conducted his undergraduate and graduate studies (MSc, PhD) in the Computer Science Department at the University of Crete. In parallel, he was a member of the Information Systems Lab of FORTH-ICS for about 8 years, where he conducted basic and applied research around semantic network-based information systems within several EU-funded research projects. His research interests fall in the intersection of the following areas: Information Systems, Information Indexing and Retrieval, Conceptual Modeling, Knowledge Representation and Reasoning, and Collaborative Distributed Applications. Currently his research focuses on: exploratory searching (principles, techniques, applications), semantic data management (comparison functions, knowledge evolution, indexes, visualization, integration), and methodologies and technologies for building advanced information systems for digital preservation.
The results of his research (mainly conducted with students) have been published in more than 90 papers in refereed international conferences and journals, and he has received two best paper awards (at CIA’2003 and ISWC’07). He has supervised more than 20 Diploma and 15 MSc theses, and he actively participates in EU projects, having secured more than 700K Euros in funding over the last 4 years. Finally, he regularly participates in the scientific committees of several international conferences and journals.
Questions & Answers
Q1 Are these ideas also harmonised with the INSPIRE directive (datamodels, libraries, codelists)?
Answer: Yes, and we actually support the metadata formats that have been adopted by the community, i.e. we extract data from formats and standards of this kind: ASFIS, AphiaIDs, ISO-2, ISO-3, ISSCFG, RFB etc.
Q2 Is there a specific mechanism to check data quality?
Answer: As mentioned in the presentation, we use the competency queries for checking the quality of the integrated dataset, together with the connectivity metrics (their values, plots over time, 3D visualization). I could add that the integration of datasets also helped us to spot errors in the constituent datasets. In addition, the data are integrated with their provenance information preserved, which makes it evident to the user where data of “poor” or “rich” quality has been derived from. Furthermore, for the case of GRSF, there is an admin VRE where the experts check the data and validate the records before making them public.
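To make the two kinds of checks in this answer more concrete, here is a minimal Python sketch. It is not the project's implementation: the tuple layout, the property names, the source names, and the choice of a Jaccard-style overlap as the connectivity measure are all assumptions made for illustration.

```python
# Hypothetical warehouse contents: (subject, property, object, source) tuples.
# All identifiers and values below are invented for illustration.
warehouse = [
    ("species:A", "hasCommonName", "name one", "sourceX"),
    ("species:A", "livesIn", "area:1", "sourceY"),
    ("species:B", "hasCommonName", "name two", "sourceX"),
]


def competency_query(triples, subject, properties):
    """Return the properties from `properties` that are missing for `subject`.

    An empty result means the warehouse can answer the question
    "give me all of these properties for this subject".
    """
    present = {p for s, p, _, _ in triples if s == subject}
    return [p for p in properties if p not in present]


def common_entity_ratio(triples, source_a, source_b):
    """Assumed connectivity measure: Jaccard overlap of the entities
    described by two sources (entities in both / entities in either)."""
    a = {s for s, _, _, src in triples if src == source_a}
    b = {s for s, _, _, src in triples if src == source_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


missing_for_b = competency_query(warehouse, "species:B", ["hasCommonName", "livesIn"])
overlap = common_entity_ratio(warehouse, "sourceX", "sourceY")
```

Tracking a value such as `overlap` after each reconstruction of the warehouse is one simple way to plot connectivity over time.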
Q3 Is it expensive to update the meaning of a data triple in a warehouse?
Answer: Since we start from an ontology, MarineTLO, we have tackled the interpretation problem from the beginning: we do not first collect data and then try to interpret them. Instead we first fix the meaning and then populate the ontology. If a mistake has taken place in the ontology itself, the update is not expensive due to the monotonicity of the ontology (a change in a class or a property does not require changes to the sub/super classes/properties). If a mistake took place during the transformation of data, then we can exploit the SPARQL language (which is a declarative query and update language) to restructure the information properly.
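As an illustration of the kind of declarative restructuring mentioned in this answer, the following Python sketch composes a SPARQL 1.1 Update request of the DELETE/INSERT ... WHERE form that moves every use of one property to another. The function name and the example.org property IRIs are hypothetical and do not come from MarineTLO; the resulting string would still have to be sent to a SPARQL endpoint that accepts updates.

```python
def rename_property_update(old_iri: str, new_iri: str) -> str:
    """Build a SPARQL 1.1 Update request that rewrites every triple using
    `old_iri` as its property so that it uses `new_iri` instead."""
    return (
        f"DELETE {{ ?s <{old_iri}> ?o }}\n"
        f"INSERT {{ ?s <{new_iri}> ?o }}\n"
        f"WHERE  {{ ?s <{old_iri}> ?o }}"
    )


# Hypothetical IRIs, used only to show the shape of the request.
update = rename_property_update(
    "http://example.org/wrongProperty",
    "http://example.org/rightProperty",
)
print(update)
```

Because the request is declarative, a single statement of this kind covers all affected triples, whatever their number.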
Q4 How do you handle disagreements between data sources?
Answer: Since we keep the provenance of all data, we actually host the encountered disagreements in the semantic warehouse. At the query layer one can exploit this information or enforce a policy for resolving disagreements. In our context, the disagreements are due to a few errors in the datasets, and when we spot such problems we inform the organizations responsible for the datasets so that they can fix them (consequently the quality of the contents of the semantic warehouse also improves over time, i.e. after each reconstruction).
Q5 How can the wider community help further build the MarineTLO for other social and economic domains?
Answer: The general process is domain-independent and one could follow the same approach for other domains (social, economic and others). The process (as illustrated in the slides) starts with the competency queries, ontological modeling, and so on. MarineTLO is based on conceptual modeling principles that have also proved successful in other domains, e.g. the CIDOC CRM ISO standard for the cultural domain. Therefore we can safely say that this approach is applicable in other domains too, since the top level of MarineTLO’s classes and properties is constructed based on a cross-domain (universal) discipline.
Q6 Is there any chance to include EU-Nomen - Pan European Species Directories Infrastructure, to support INSPIRE species mapping?
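The idea of hosting disagreements and resolving them at the query layer can be sketched as follows. This is an illustrative example under stated assumptions, not the warehouse's actual mechanism: the statement layout, the property, the values, the source names, and the "preferred source" policy are all hypothetical.

```python
from collections import defaultdict

# Hypothetical provenance-tagged statements: (subject, property, value, source).
statements = [
    ("species:A", "maxLengthCm", 120, "sourceX"),
    ("species:A", "maxLengthCm", 210, "sourceY"),
    ("species:B", "maxLengthCm", 35, "sourceX"),
    ("species:B", "maxLengthCm", 35, "sourceY"),
]


def values_by_source(stmts):
    """Index values as {(subject, property): {source: value}}."""
    index = defaultdict(dict)
    for subject, prop, value, source in stmts:
        index[(subject, prop)][source] = value
    return dict(index)


def find_disagreements(stmts):
    """Keep only the (subject, property) pairs whose sources report different values."""
    return {
        key: per_source
        for key, per_source in values_by_source(stmts).items()
        if len(set(per_source.values())) > 1
    }


def resolve(stmts, preferred_sources):
    """Assumed policy: take the value from the first preferred source that has one."""
    resolved = {}
    for key, per_source in values_by_source(stmts).items():
        for source in preferred_sources:
            if source in per_source:
                resolved[key] = per_source[source]
                break
    return resolved


conflicts = find_disagreements(statements)
chosen = resolve(statements, ["sourceY", "sourceX"])
```

Here `conflicts` is what one would report back to the dataset owners, while `chosen` is what a query layer enforcing the policy would return.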
Answer: We are open to suggestions and we would like to widen the coverage of the semantic warehouse. We will look at it.