Section: Application Domains
Data Journalism
One of today’s major challenges in data science is to design techniques and algorithms that allow analysts to efficiently infer useful information and knowledge by inspecting heterogeneous information sources, from structured data to unstructured content. We take data journalism as an emblematic use case, which stands at the crossroads of multiple research fields: content analysis, data management, knowledge representation and reasoning, visualization, and human-machine interaction. We are particularly interested in the issues raised by the design of data and knowledge management systems that support data journalism. These systems include an ontology that typically expresses domain knowledge, heterogeneous data sources, and mappings that relate these data sources, each expressed with its own vocabulary and querying capabilities, to a (possibly virtual) factbase expressed in the ontological vocabulary. Ontologies play a central role, as they act both as a mediation layer that glues together pieces of knowledge extracted from heterogeneous data sources and as an inference layer that allows new knowledge to be derived. In the context of data journalism, these ontologies raise challenging requirements that we need to take into account:
- The wide range of topics addressed in journalism requires a rich top-level ontology, although very specific ontologies may be needed to handle specialized knowledge (e.g., detailed knowledge of finance to handle the Panama Papers).
- In data journalism, each piece of knowledge carries different timestamps: temporal information represented within the data (for instance, when an event actually takes place) and temporal information about the data itself (for instance, when this event is recorded or validated in the system). Temporal relations (such as Allen's interval relations) can be used to express constraints between timestamps and ensure the consistency of the (virtual) knowledge base.
- In data journalism, each piece of knowledge has an identified source. The analysis of conflicting knowledge in the (virtual) knowledge base has to take the reliability of the source into account.
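To make the temporal and provenance requirements concrete, here is a minimal sketch in Python. It assumes closed time intervals with a strictly earlier start than end, and it annotates each fact with a validity interval, a recording date, and a source. The names `Fact`, `RELIABILITY`, `WITHIN`, and `check_within`, the numeric reliability scores, and the sample data are all hypothetical and serve only as illustration. The constraint shown (a directorship must fall within the lifetime of the company) is one possible use of Allen's relations, not a rule taken from an actual system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Interval:
    start: date
    end: date

    def __post_init__(self):
        if not self.start < self.end:
            raise ValueError("an interval needs start < end")


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the one Allen relation (out of 13) that holds between a and b."""
    if a.end < b.start:
        return "before"
    if b.end < a.start:
        return "after"
    if a.end == b.start:
        return "meets"
    if b.end == a.start:
        return "met-by"
    # From here on, the two intervals share interior points.
    if a.start == b.start and a.end == b.end:
        return "equals"
    if a.start == b.start:
        return "starts" if a.end < b.end else "started-by"
    if a.end == b.end:
        return "finishes" if a.start > b.start else "finished-by"
    if a.start > b.start and a.end < b.end:
        return "during"
    if a.start < b.start and a.end > b.end:
        return "contains"
    return "overlaps" if a.start < b.start else "overlapped-by"


@dataclass(frozen=True)
class Fact:
    statement: str
    valid: Interval   # when the fact holds in the world
    recorded: date    # when the fact was recorded in the system
    source: str       # where the fact comes from


# Hypothetical reliability scores in [0, 1], one per source.
RELIABILITY = {"company_registry": 0.9, "leaked_spreadsheet": 0.5}

# Relations under which `inner` lies entirely within `outer`.
WITHIN = {"during", "starts", "finishes", "equals"}


def check_within(inner: Fact, outer: Fact):
    """Return None if consistent, else the fact from the less reliable source."""
    if allen_relation(inner.valid, outer.valid) in WITHIN:
        return None
    return min((inner, outer), key=lambda f: RELIABILITY[f.source])


company = Fact("Acme Ltd exists",
               Interval(date(2005, 1, 1), date(2015, 1, 1)),
               date(2016, 4, 3), "company_registry")
director = Fact("J. Doe directs Acme Ltd",
                Interval(date(2003, 6, 1), date(2010, 1, 1)),
                date(2016, 4, 10), "leaked_spreadsheet")

suspect = check_within(director, company)
print(allen_relation(director.valid, company.valid))  # overlaps
print(suspect.statement)  # J. Doe directs Acme Ltd
```

In this sketch the directorship starts before the company exists, so the constraint is violated, and the fact from the less reliable source is the first one flagged for review. A real system would likely need a richer notion of reliability than a single score per source.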
Besides pure knowledge representation and reasoning issues, querying such systems raises issues at the crossroads of data and knowledge management. In particular, the notion of mappings has to be revisited in the light of the reasoning capabilities enabled by the ontology. More generally, the consistency and the efficiency of the system cannot be ensured by considering its components (i.e., the ontology, the data sources, and the mappings) in isolation; this requires studying the interactions between these components and considering the system as a whole.
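The interplay between mappings and ontology can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the two toy sources, the mapping functions `map_registry` and `map_notes`, the property names, and the choice of sub-property axioms as the only kind of ontological rule. The sketch materializes inferred facts by forward chaining. Query rewriting would be another option, and nothing above indicates which strategy a given system uses.

```python
# Two hypothetical sources, each with its own vocabulary and shape.
registry_rows = [{"officer": "J. Doe", "firm": "Acme Ltd"}]
newsroom_notes = [("A. Smith", "shareholder", "Acme Ltd")]


def map_registry(rows):
    """Mapping: registry rows -> triples in the ontological vocabulary."""
    for row in rows:
        yield (row["officer"], "directorOf", row["firm"])


def map_notes(notes):
    """Mapping: newsroom notes -> triples in the ontological vocabulary."""
    for person, role, firm in notes:
        if role == "shareholder":
            yield (person, "ownsSharesIn", firm)


# Toy ontology: each property is declared a sub-property of another one.
SUBPROPERTY = {"directorOf": "linkedTo", "ownsSharesIn": "linkedTo"}


def saturate(facts):
    """Apply the sub-property axioms until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(facts):
            parent = SUBPROPERTY.get(p)
            if parent is not None and (s, parent, o) not in facts:
                facts.add((s, parent, o))
                changed = True
    return facts


def query(facts, predicate):
    """Return all (subject, object) pairs related by `predicate`."""
    return sorted((s, o) for s, p, o in facts if p == predicate)


mapped = set(map_registry(registry_rows)) | set(map_notes(newsroom_notes))

print(query(mapped, "linkedTo"))
# []
print(query(saturate(mapped), "linkedTo"))
# [('A. Smith', 'Acme Ltd'), ('J. Doe', 'Acme Ltd')]
```

Neither source mentions `linkedTo`, so the query returns nothing on the mapped facts alone and returns both people once the ontology is applied. This is a small instance of why the mappings and the ontology have to be studied together rather than in isolation.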