Keywords
Computer Science and Digital Science
- A3.1.1. Modeling, representation
- A3.1.7. Open data
- A3.2.1. Knowledge bases
- A3.2.2. Knowledge extraction, cleaning
- A3.2.4. Semantic Web
- A3.2.5. Ontologies
- A3.2.6. Linked data
- A3.3.2. Data mining
- A3.4.1. Supervised learning
- A3.4.2. Unsupervised learning
- A3.4.5. Bayesian methods
- A3.4.8. Deep learning
- A3.5.2. Recommendation systems
- A4.8. Privacy-enhancing technologies
- A8.1. Discrete mathematics, combinatorics
- A9.1. Knowledge
- A9.2. Machine learning
- A9.6. Decision support
- A9.8. Reasoning
- A9.10. Hybrid approaches for AI
Other Research Topics and Application Domains
- B1.1. Biology
- B2.3. Epidemiology
- B2.4.1. Pharmaco kinetics and dynamics
- B2.4.2. Drug resistance
- B3.5. Agronomy
- B3.6. Ecology
- B3.6.1. Biodiversity
- B5.2.1. Road vehicles
1 Team members, visitors, external collaborators
Research Scientist
- Amedeo Napoli [CNRS, Emeritus, HDR]
Faculty Members
- Miguel Couceiro [Team leader, Univ de Lorraine, Professor, HDR]
- Alexandre Blansché [Univ de Lorraine, Associate Professor]
- Baya Lydia Boudjeloud-Assala [Univ de Lorraine, Associate Professor, HDR]
- Brieuc Conan-Guez [Univ de Lorraine, Associate Professor]
- Adrien Coulet [Univ de Lorraine, Associate Professor, HDR]
- Sébastien Da Silva [Univ de Lorraine, Associate Professor]
- Alain Gely [Univ de Lorraine, Associate Professor]
- Jean-François Mari [Univ de Lorraine, Emeritus, HDR]
- Justine Reynaud [Univ de Lorraine, ATER, until Aug 2020]
- Yannick Toussaint [Univ de Lorraine, Professor, HDR]
Post-Doctoral Fellows
- Alexandre Bazin [Univ de Lorraine]
- Nyoman Juniarta [Inria]
PhD Students
- Nacira Abbas [Inria]
- Guilherme Alves Da Silva [Inria]
- Laurine Huber [Univ de Lorraine]
- Tatiana Makhalova [Inria]
- Pierre Monnin [Univ de Lorraine, until Dec 2020]
- Claire Theobald [CNRS]
- Laura Alejandra Zanella Calzada [Univ de Lorraine]
- Georgios Zervakis [Inria]
Technical Staff
- Jérémie Nevin [Inria, Engineer, until Jun 2020]
Interns and Apprentices
- Manar Amezrine [Univ de Lorraine, from Jun 2020 until Sep 2020]
- Srilakshmi Balard [Univ de Lorraine, from Mar 2020 until Aug 2020]
- Clement Bellanger [Univ de Lorraine, from Mar 2020 until Aug 2020]
- Vaishnavi Bhargava [Univ de Lorraine, from Jan until Jun 2020]
- Walid Hafiane [Univ de Lorraine, from Mar 2020 until Aug 2020]
- Emilie Leroy [Univ de Lorraine, from Jun 2020 until Aug 2020]
- Esteban Marquer [Inria, from Feb 2020 until Jun 2020]
- Yi Ting Tsai [Univ de Lorraine, from Apr 2020 until Sep 2020]
- Charles Vernerey [Univ de Lorraine, from Apr 2020 until Aug 2020]
Administrative Assistant
- Emmanuelle Deschamps [Inria, until Jun 2020]
External Collaborator
- Florence Le Ber [Ecole Nationale du Génie de l'Eau et de l'Environnement de Strasbourg, until Jun 2020]
2 Overall objectives
2.1 Introduction
The three main scientific objectives of the Orpailleur Team are: (1) Fundamentals of Knowledge Discovery in Databases (KDD), (2) KDD in practice, and (3) Explanations and Fairness in KDD. KDD aims at discovering intelligible and reusable patterns in possibly large and complex databases. From an operational point of view, the KDD process is based on three main steps: (i) selection and preparation of the data, (ii) data mining, (iii) interpretation of the discovered patterns. Moreover, the KDD process is iterative, interactive, and “exploratory”, i.e. controlled by an expert of the data domain, called the analyst.
The KDD process –as implemented in the Orpailleur team– is based on data mining methods which are either symbolic or subsymbolic (numerical). Symbolic methods are based on pattern mining, Formal Concept Analysis (FCA), and their extensions. Subsymbolic methods are based on classifiers such as random forests, SVMs, and neural networks. A main objective is then to mine complex and possibly large data by combining symbolic and subsymbolic data mining methods, improving the applicability and explainability of KDD and making KDD more “reliable”. In this way, domain knowledge may improve and guide the KDD process, leading to “Knowledge Discovery guided by Domain Knowledge” (KDDK). The discovered patterns can be interpreted as knowledge units and reused for problem-solving activities in knowledge systems. Thus knowledge discovery can be considered a key task in knowledge engineering, with an impact on various semantic activities, e.g. information retrieval, recommendation, and ontology engineering.
Recent advances in KDD and in Machine Learning (ML) are mostly due to the success of deep learning methods in recognition tasks. However, deep learning and, more generally, subsymbolic data mining methods rely on complex models whose outputs and proposed decisions, however accurate, cannot be easily explained to the layman. Accordingly, we are studying how to design hybrid and more interpretable KDD methods, combining complex subsymbolic models and explainable symbolic models. In addition, explanations can be used for assessing algorithmic fairness, which is a critical point especially in decision support systems with societal and scientific impacts, and which has become another main topic in the team.
Regarding applications, the team mainly works in life sciences, i.e. agronomy, biology, chemistry, medicine, and pharmacogenomics, as well as in astronomy and the web of data. Text mining is of major interest for the team and is a foundational task in many applications. Indeed, it has gained a lot of attention in recent years, especially with the progress brought by deep learning systems. The Orpailleur team has a rich experience in KDD and intends to improve and extend this experience, paying more attention to the impact of knowledge discovery in the real world. This should lead to the design of “green”, sustainable, explainable, and fair data mining systems.
3 Research program
3.1 Fundamentals of Knowledge Discovery in Databases
Keywords: knowledge discovery in databases, knowledge discovery in databases guided by domain knowledge, pattern mining, data exploration, formal concept analysis, hybrid mining.
The Orpailleur team has an original and productive research activity in knowledge discovery, and especially in developing hybrid –symbolic and subsymbolic– data mining methods. The team has an important activity in pattern mining, in Formal Concept Analysis and its extensions, in text mining, as well as in the search for explanations and the study of the fairness of KDD algorithms. Moreover, the team combines in an original way knowledge discovery and knowledge engineering, by mining the web of data for ontology engineering, knowledge base completion, and link key discovery. KDD is a main task in many research projects, as shown by the large range of applications in which the team is involved.
Advances in data and knowledge engineering have emphasized the need for hybrid mining tools working on complex data. Accordingly, the Orpailleur team is developing knowledge discovery methods based on pattern mining, FCA and its extensions, as well as subgroup discovery and redescription mining, to be used in real-sized applications on complex data including multi-valued attributes, n-ary relations, sequences, trees, and graphs. One main aspect is to make pattern mining more useful and to take into account domain knowledge, user preferences, and constraints to guide the discovery process. In addition, the combination of symbolic and subsymbolic data mining methods should lead to more applicable, explainable, and reliable methods, as shown in 12.
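To make the FCA machinery concrete, the following minimal sketch (toy context, not team data or team code) enumerates the formal concepts of a small binary context by closing the set of object descriptions under intersection; actual tools such as Coron or LatViz rely on far more efficient algorithms.

```python
# A minimal sketch: enumerating the formal concepts of a small binary context
# by closing the set of object intents under intersection. Toy data only.

context = {
    "o1": {"a", "b"},
    "o2": {"a", "c"},
    "o3": {"a", "b", "c"},
    "o4": {"b"},
}
all_attributes = set().union(*context.values())

# Every concept intent is an intersection of object intents (or the full set).
intents = {frozenset(all_attributes)} | {frozenset(m) for m in context.values()}
changed = True
while changed:                      # close under pairwise intersection
    changed = False
    for x in list(intents):
        for y in list(intents):
            z = x & y
            if z not in intents:
                intents.add(z)
                changed = True

# A formal concept is (extent, intent): the extent gathers all objects
# whose description contains the intent.
concepts = []
for intent in sorted(intents, key=len):
    extent = {g for g, m in context.items() if intent <= m}
    concepts.append((extent, set(intent)))

for extent, intent in concepts:
    print(sorted(extent), sorted(intent))
```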
3.2 KDD in Practice
Keywords: knowledge discovery in life sciences, text mining, exploration of web of data, knowledge engineering, ontology engineering from texts and web resources.
Life Sciences constitute a challenging domain for KDD. Biological data are complex from many points of view, e.g. huge size, high dimensionality, deep interconnections, and quick evolution. Analyzing such data for discovering interesting patterns is an important issue in domains such as health, environment, and agronomy, for solving problems such as disease understanding, drug discovery, pharmacovigilance, and pharmacogenomics. Moreover, many ontologies are available which can be used to improve KDD in biology, and, in turn, KDD can help improve the quality and completion of ontologies, especially thanks to text mining. In particular, we work on the mining of real-world texts by combining subsymbolic and symbolic data mining methods.
From a knowledge discovery perspective, text mining is a particular process aimed at extracting interesting units from texts, which can be further used for ontology engineering, e.g. the design or completion of a knowledge base. In parallel, knowledge discovery in the web of data, i.e. mining sets of RDF (Resource Description Framework) triples, is a major task for discovering concept definitions in terms of necessary and sufficient conditions. Both operations, text mining and mining the web of data, can be considered complementary.
3.3 Explanations and Fairness in KDD
Keywords: explainable artificial intelligence, production of explanations, fairness of data mining algorithms, fairness and explanations in decision making.
The output of KDD may be of different types, e.g. rules, classes, and concepts, which can be reused for solving problems and decision making. Then, elements supporting an algorithmic decision should be available as well 76, 78, 79. In practice, subsymbolic KDD methods are more flexible and better suited to deal with the complexity of tasks such as recognition. However, symbolic KDD methods are more often used in pattern mining and in domains where the learned model should be understandable by human experts, or be related to domain ontologies for knowledge representation, reasoning, or decision making needs. Thus we investigate different ways of providing explanations and facilitating the interpretation of the outputs of subsymbolic KDD methods. Besides explainability, one main limitation that keeps KDD systems from being more widely deployed is their tendency to be biased. Indeed, since these systems are based on a KDD loop or generalize from a training set, they reproduce biases initially present in the data, which may lead to inaccurate or unfair results. At present, this direction of investigation is of major interest and is among the main topics studied in the team.
4 Application domains
4.1 Life Sciences: Agronomy, Biology, Chemistry, and Medicine
Keywords: knowledge discovery in life sciences, biology, chemistry, medicine, pharmacogenomics and precision medicine.
One major application domain which is currently investigated by the Orpailleur team is related to life sciences, with particular emphasis on biology, medicine, and chemistry. The understanding of biological systems provides complex problems for computer scientists, and the developed solutions bring new research ideas or possibilities for biologists and for computer scientists as well. Indeed, the interactions between researchers in biology and researchers in computer science improve not only knowledge about systems in biology, chemistry, and medicine, but knowledge about computer science as well.
Knowledge discovery is gaining more and more interest and importance in life sciences for mining either homogeneous databases such as protein sequences and structures, or heterogeneous databases for discovering interactions between genes and the environment, or between genetic and phenotypic data, especially for public health and precision medicine (pharmacogenomics). Pharmacogenomics is a main challenge in the Orpailleur team as it considers a large panel of complex data ranging from biological to medical data, and various kinds of encoded domain knowledge ranging from texts to formal ontologies.
Along the same line as biological data, chemical data present important challenges w.r.t. knowledge discovery, for example for mining collections of molecular structures and collections of chemical reactions in organic chemistry. The mining of such collections is an important task for various reasons, including the challenge of graph mining and industrial needs (especially in drug design, pharmacology, and toxicology). Molecules and chemical reactions are complex data that can be modeled as labeled graphs. Graph mining and Formal Concept Analysis methods play an important role in this application domain and can be used in an efficient and well-founded way 73.
4.2 Text Mining
Keywords: text mining, knowledge discovery from texts, text annotation, ontology engineering from texts.
The objective of a text mining process is to extract useful knowledge units from large collections of texts 64. The text mining process shows specific characteristics due to the fact that texts are complex objects written in natural language. The information in a text is expressed in an informal way, following linguistic rules, making text mining a difficult task. A text mining process has to take into account –as much as possible– paraphrases, ambiguities, specialized vocabulary and terminology. This is why the preparation of texts for text mining is usually dependent on linguistic resources and methods.
From a knowledge discovery perspective, text mining aims at extracting “interesting units” (nouns and relations) from texts with the help of domain knowledge encoded within a knowledge base (the process is roughly similar for text annotation). Text mining is especially useful in the context of the semantic web for ontology engineering. In the Orpailleur team, we work on the mining of real-world texts in application domains such as biology and medicine, using subsymbolic and symbolic data mining methods. Accordingly, the text mining process may be used to enrich and extend linguistic resources, while, in turn, linguistic and ontological resources can be exploited to guide a text mining process.
4.3 Knowledge Systems and Web of Data
Keywords: combining knowledge engineering and knowledge discovery, web of data, semantic web, ontology design, link keys.
The web of data constitutes a good platform for experimenting ideas on the combination of knowledge discovery and knowledge engineering (KE).
A software agent may be able to understand and manipulate information on the web if and only if domain knowledge and ontologies for achieving those tasks are available.
In particular, OWL (“Web Ontology Language”, https://www.w3.org/OWL/), which is based on description logics 68, is the standard language for representing such ontologies on the web.
Actually, there are many interconnections between concept lattices in FCA and ontologies; e.g. the partial order underlying an ontology can be supported by a concept lattice. Moreover, a pair of implications within a concept lattice provides a natural support to a concept definition in an ontology 65, 66. In this way, we study how the web of data, considered as a set of knowledge sources, e.g. DBpedia, Wikidata, Yago, and Freebase, can be mined for guiding the design of ontological definitions, and, further, how knowledge discovery techniques can be applied for allowing a better usage of RDF triples in the web of data. In particular, in the Orpailleur team, we are interested in (i) providing, thanks to FCA and pattern structures, a support to concept definitions, (ii) using redescription mining as an additional support to concept definitions, and (iii) combining FCA and redescription mining for discovering link keys, which can be used for identifying individuals that share the same set of properties 6.
5 Highlights of the year
- The publication 31 was selected for the best paper award at the International Conference on Concept Lattices and Their Applications (CLA 2020, https://cs.ttu.ee/events/cla2020/).
- The new research direction of Orpailleur about explanations and fairness of knowledge discovery and AI algorithms was encouraged by the invitation of Miguel Couceiro to give an invited talk (online) on the subject “Making ML models fairer through explanations: the case of LimeOut” at the 9th International Conference on Analysis of Images, Social Networks and Texts, Russia 18.
- Adrien Coulet, with Anita Burgun and Sarah Zohar of UMR1138 Inserm and Université de Paris, proposed a project to the “2020 Call for Expression of Interest” for the creation of an Inserm-Inria joint team in digital health (https://inserm-inria.sciencescall.org/). This project was selected, one output being the creation of the joint team “HeKA” (Inria, Inserm, Université de Paris), to be officially created in Spring 2021 at the CRI of Paris. Accordingly, Adrien Coulet obtained a 5-year secondment (“détachement”) at Inria Paris.
6 New software and platforms
6.1 New software
6.1.1 ARPEnTAge
- Name: Analyse de Régularités dans les Paysages : Environnement, Territoires, Agronomie
- Keywords: Stochastic process, Hidden Markov Models
- Functional Description: ARPEnTAge is a software based on stochastic models (HMM2 and Markov fields) for analyzing spatio-temporal databases. ARPEnTAge is built on top of the CarottAge system to fully take into account the spatial dimension of input sequences. It takes as input an array of discrete data in which the columns contain the annual land uses and the rows are regularly spaced locations of the studied landscape. It performs a time-space clustering of a landscape based on its time-dynamic land uses. Displaying tools and the generation of time-dominant shape files have also been defined. A minimal sketch of the underlying HMM idea is given after this software description.
- URL: http://carottage.loria.fr/index_in_english.html
- Contact: Jean-François Mari
- Partner: INRA
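As announced in the functional description above, the following minimal sketch illustrates the general HMM idea behind ARPEnTAge and CarottAge: decoding a hidden “cropping system” state from an annual land-use sequence. It is a plain first-order HMM with made-up parameters, not the second-order models (HMM2) actually used in the software.

```python
# A minimal sketch of HMM-based land-use analysis: Viterbi decoding of a
# hidden state sequence from observed annual land uses. Toy parameters only;
# not the HMM2 / Markov field models of ARPEnTAge and CarottAge.
import numpy as np

states = ["cereal-based", "grassland-based"]        # hidden clusters (toy)
symbols = {"wheat": 0, "maize": 1, "meadow": 2}     # observed annual land use

pi = np.array([0.5, 0.5])                           # initial state probabilities
A = np.array([[0.8, 0.2],                           # state transition matrix
              [0.3, 0.7]])
B = np.array([[0.5, 0.4, 0.1],                      # emission probabilities
              [0.1, 0.1, 0.8]])

def viterbi(obs):
    """Return the most likely hidden state sequence for a list of symbols."""
    T, N = len(obs), len(states)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)  # N x N partial scores
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return [states[s] for s in reversed(path)]

sequence = ["wheat", "maize", "wheat", "meadow", "meadow"]
print(viterbi([symbols[s] for s in sequence]))
```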
6.1.2 CarottAge
- Keywords: Stochastic process, Hidden Markov Models
- Functional Description: The CarottAge system is based on hidden Markov models of second order and provides an unsupervised temporal clustering algorithm for data mining and a synthetic representation of temporal and spatial data. CarottAge is currently used by INRA researchers interested in mining the changes in territories related to the loss of biodiversity (projects ANR BiodivAgrim and ACI Ecoger) and/or water contamination. CarottAge is also used for mining hydromorphological data: a comparison was performed with three other algorithms classically used for the delineation of river continua, and CarottAge proved to give very interesting results for that purpose.
- URL: http://carottage.loria.fr/index_in_english.html
- Contacts: Florence Le Ber, Jean-François Mari
- Participants: Florence Le Ber, Jean-François Mari
- Partner: INRA
6.1.3 CORON
- Keywords: Data mining, Closed itemset, Frequent itemset, Generator, Association rule, Rare itemset
- Functional Description: The Coron platform is a KDD toolkit organized around three main components: (1) Coron-base, (2) AssRuleX, and (3) pre- and post-processing modules.
The Coron-base component includes a complete collection of data mining algorithms for extracting itemsets such as frequent itemsets, closed itemsets, generators, and rare itemsets. In this collection we can find Apriori, Close, Pascal, Eclat, Charm, as well as original algorithms such as ZART, Snow, Touch, and Talky-G. AssRuleX generates different sets of association rules (from itemsets), such as minimal non-redundant association rules, generic basis, and informative basis. In addition, the Coron system supports the whole life cycle of a data mining task and proposes modules for cleaning the input dataset and for reducing its size if necessary. A minimal itemset-mining sketch in this spirit is given after this software description.
- URL: http://coron.loria.fr/site/index.php
- Authors: Laszlo Szathmary, Florent Marcuola, Mehdi Kaytoue, Amedeo Napoli
- Contact: Amedeo Napoli
- Participants: Adrien Coulet, Aleksey Buzmakov, Amedeo Napoli, Florent Marcuola, Jérémie Bourseau, Laszlo Szathmary, Mehdi Kaytoue, Victor Codocedo, Yannick Toussaint
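As a complement to the functional description above, here is a minimal, self-contained sketch of frequent and closed itemset mining on toy transactions; it only illustrates the kind of patterns extracted by Coron-base and is unrelated to the Coron code itself.

```python
# A minimal sketch (toy data, not Coron code) of frequent and closed itemset
# mining, the kind of patterns extracted by the Coron-base component.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
min_support = 2
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Naive level-wise enumeration (Apriori-style, without pruning optimizations).
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        s = support(set(candidate))
        if s >= min_support:
            frequent[frozenset(candidate)] = s

# A frequent itemset is closed if no proper superset has the same support.
closed = {X: s for X, s in frequent.items()
          if not any(X < Y and sY == s for Y, sY in frequent.items())}

print("frequent:", {tuple(sorted(X)): s for X, s in frequent.items()})
print("closed:  ", {tuple(sorted(X)): s for X, s in closed.items()})
```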
6.2 New platforms
6.2.1 FixOut: an ensemble approach to build fairer models
- Name: FixOut: an ensemble approach to build fairer models
- Keywords: explanations in artificial intelligence, XAI, Lime explanations, fairness metrics, Shapley values, feature importance, feature dropout, ensemble classifier.
- Functional Description: The FixOut system is aimed at evaluating the fairness of processes and models in artificial intelligence. FixOut is a human-centered and model-agnostic framework, using any explanation method (based on feature importance) to assess the reliance of a classifier on sensitive features. Given a pre-trained classifier, FixOut first checks whether it relies on user-defined sensitive features. If it does, then FixOut employs feature dropout to produce a pool of simplified classifiers that are then aggregated into an ensemble classifier. Empirical results using different classifiers on several real-world datasets show a consistent improvement in terms of widely used fairness metrics and a decreased reliance on sensitive features, without compromising the accuracy of the classifiers.
- URL: https://fixout.loria.fr/
- Contact: Miguel Couceiro
6.2.2 LatViz: Visualization of Concept Lattices
- Name: LatViz: Visualization of Concept Lattices.
- Keywords: Formal Concept Analysis, pattern structures, concept lattice, implications, visualization.
- Functional Description: LatViz is a tool for constructing, displaying, and exploring concept lattices. LatViz proposes noticeable improvements over existing tools and introduces various functionalities focusing on interaction with experts, such as the visualization of pattern structures for dealing with complex non-binary data, the AOC-poset (composed of the core elements of the lattice), concept annotations, filtering based on various criteria, and the visualization of implications 67. In this way the user can effectively perform interactive exploratory knowledge discovery, as often needed in knowledge engineering.
The LatViz platform can be associated with the Coron platform (see http://coron.loria.fr) and extends its visualization capabilities. Recall that the Coron platform includes a complete collection of data mining algorithms for extracting itemsets and association rules.
- URL: http://latviz.loria.fr/
- Contact: Amedeo Napoli
6.2.3 OrphaMine: Data Mining Platform for Orphan Diseases
- Name: OrphaMine: A Data Mining Platform for Orphan Diseases
- Keywords: bioinformatics, data mining, biology, health, data visualization, drug development.
- Functional Description: The OrphaMine platform enables visualization, data integration, and in-depth analytics in the domain of “orphan diseases”, where data are extracted from the OrphaData ontology (https://www.orpha.net/consor/cgi-bin/index.php). At present, we aim at building a true collaborative portal that will serve different actors: (i) a general visualization of OrphaData for physicians working on, maintaining, and developing this knowledge base about orphan diseases, (ii) the integration of analytics (data mining) algorithms developed by the different academic actors, (iii) the use of these algorithms to improve our general knowledge of rare diseases.
- URL: http://orphamine.inria.fr/
- Contacts: Miguel Couceiro, Laureline Nevin, Amedeo Napoli
6.2.4 Siren: Interactive and Visual Redescription Mining
- Name: Siren: Interactive and Visual Redescription Mining
- Keywords: redescription mining, interactivity, visualization.
- Functional Description: Siren is a tool for interactive mining and visualization of redescriptions. Redescription mining aims to find distinct common characterizations of the same objects and, vice versa, to identify sets of objects that admit multiple shared descriptions. The goal is to provide domain experts with a tool allowing them to tackle their research questions using redescription mining. Merely being able to find redescriptions is not enough: the expert must also be able to understand the redescriptions found, adjust them to better match their domain knowledge, and test alternative hypotheses with them, for instance. Thus, Siren allows mining redescriptions in an anytime fashion through efficient, distributed mining, examining the results in various linked visualizations, interacting with the results either directly or via the visualizations, and guiding the mining algorithm toward specific redescriptions. New features, such as a visualization of the contribution of individual literals in the queries and the simplification of queries as a post-processing step, have been added to the tool.
- URL: http://siren.gforge.inria.fr/main/
- Contact: Esther Galbrun
7 New results
7.1 Fundamentals of KDD
Participants: Nacira Abbas, Guilherme Alves Da Silva, Alexandre Bazin, Alexandre Blansché, Lydia Boudjeloud-Assala, Brieuc Conan-Guez, Miguel Couceiro, Adrien Coulet, Sébastien Da Silva, Alain Gély, Laurine Huber, Nyoman Juniarta, Florence Le Ber, Tatiana Makhalova, Jean-François Mari, Pierre Monnin, Amedeo Napoli, Laureline Nevin, Abdelkader Ouali, François Pirot, Frédéric Pennerath, Justine Reynaud, Claire Theobald, Yannick Toussaint, Laura Alejandra Zanella Calzada, Georgios Zervakis.
7.1.1 FCA, Pattern Mining, MDL, and Hybrid Mining
Advances in data and knowledge engineering have emphasized the need for pattern mining tools working on complex data. FCA, which usually applies to binary data tables, can be adapted to work on more complex data. In this way, we have contributed to some main extensions of FCA, namely pattern structures, Relational Concept Analysis, and the application of the “Minimum Description Length” (MDL) principle within FCA. Pattern Structures (PS 71, 72) allow building a concept lattice from complex data, e.g. numbers, sequences, trees, and graphs. Relational Concept Analysis (RCA 77) is able to analyze objects described both by binary and relational attributes and can play an important role in text classification and text mining. Many developments were carried out in pattern mining and FCA for improving data mining algorithms and their applicability, and for solving some specific problems such as information retrieval, the discovery of functional dependencies, and biclustering. Reusing ideas from subgroup discovery, we have initiated a whole line of research on the covering of pattern spaces based on the MDL principle, and we are working on the adaptation of MDL within the FCA framework 33, 35. Moreover, we also investigated the discovery of functional dependencies 20, the mining of RDF data, and the production of explanations in an FCA framework 21.
In addition, we are working on the design of hybrid mining methods, based on mining methods able to deal with symbolic and numerical data in parallel. In the context of the GEENAGE project, we are interested in identifying, in biomedical data, biomarkers that are predictive of the development of diseases in the elderly population 21. The data come from a preceding study on metabolomic data for the detection of type 2 diabetes 12. The problem can be viewed as a classification problem where features that are predictive of a class should be identified. This led us to study the notions of prediction and discrimination in classification problems, as well as hybrid mining. Combining subsymbolic machine learning methods (random forests, neural networks, and SVMs), multicriteria decision-making methods (Pareto fronts), and pattern mining methods (including FCA), we developed a hybrid mining approach for selecting the features which are the most predictive and/or discriminant. The selected features are then organized within a concept lattice, to be presented to the analyst together with the reasons for their selection; the concept lattice facilitates the understanding of the feature selection. As such, this approach can also be seen as an explainable mining method, where the output includes the reasons for which features are selected in terms of prediction and discrimination.
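The multicriteria step of this hybrid approach can be illustrated by the following minimal sketch, which keeps the features lying on the Pareto front of two scores; the feature names and scores below are made up, whereas in the actual study they are produced by the classifiers and pattern mining methods mentioned above.

```python
# A minimal sketch of the multicriteria selection step: keeping the features
# on the Pareto front of a "prediction" score and a "discrimination" score.
# Scores are made up for illustration only.

features = {              # feature name -> (prediction score, discrimination score)
    "glucose":      (0.90, 0.80),
    "triglyceride": (0.70, 0.85),
    "age":          (0.60, 0.40),
    "cholesterol":  (0.65, 0.60),
    "weight":       (0.85, 0.30),
}

def dominates(a, b):
    """a dominates b if it is at least as good on both criteria and differs from b."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

pareto_front = [
    f for f, score in features.items()
    if not any(dominates(other, score) for other in features.values())
]
print(pareto_front)   # non-dominated features, kept for the concept lattice step
```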
Besides symbolic knowledge discovery approaches, we also experimented with “Bayesian deep learning” approaches in image recognition, to take into account the uncertainty attached to such images and to the process classifying them. We are dealing with astronomical images in which galaxies should be recognized. First experiments trying to understand the action of feature dropout were carried out in the context of the AstroDeep project and will be consolidated this year 42.
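As an illustration of the general idea (and not of the AstroDeep models themselves), the following sketch shows Monte Carlo dropout, a common way to approximate Bayesian uncertainty in deep models: dropout is kept active at prediction time and the spread of repeated stochastic forward passes is read as uncertainty.

```python
# A minimal sketch of Monte Carlo dropout for uncertainty estimation.
# Toy fully-connected network and random input; not the AstroDeep models.
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, n_features=64, n_classes=3, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Dropout(p_drop),                  # kept active for MC sampling
            nn.Linear(32, n_classes),
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier()
model.train()            # keeps dropout stochastic even when "predicting"

x = torch.randn(1, 64)   # stands in for a (flattened) galaxy image
with torch.no_grad():
    samples = torch.stack([model(x).softmax(dim=-1) for _ in range(100)])

mean_probs = samples.mean(dim=0)       # averaged class probabilities
uncertainty = samples.std(dim=0)       # spread across stochastic passes
print(mean_probs, uncertainty)
```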
Redescription mining is another pattern mining method developed in the team, which aims at finding distinct common characterizations of the same objects and, reciprocally, at identifying sets of objects having multiple shared descriptions 70. This is motivated by the idea that, in scientific investigations, data often have different natures and may originate from distinct sources or terminologies. Along these lines, we applied redescription mining for analyzing and mining RDF data in the web of data, with the objective of discovering definitions of concepts 74, 75. Redescription mining is well adapted to this task, as a definition is naturally based on the two sides of an equation, a left-hand side and a right-hand side. Moreover, we are studying how redescription mining can be used in the discovery of link keys over two RDF datasets.
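The core of redescription mining can be illustrated with a minimal sketch: two queries, one per data table, form a candidate redescription when their support sets are similar under the Jaccard index. The toy data below are made up and are not output of the Siren tool.

```python
# A minimal sketch of the core of redescription mining: two queries, one per
# data table, are a candidate redescription when their support sets (the
# objects satisfying them) are similar, e.g. under the Jaccard index.

climate = {   # object -> attributes from a first terminology
    "site1": {"t_max": 28, "rain": 300},
    "site2": {"t_max": 31, "rain": 250},
    "site3": {"t_max": 18, "rain": 900},
    "site4": {"t_max": 29, "rain": 280},
}
species = {   # same objects described in a second terminology
    "site1": {"lizard": 1, "frog": 0},
    "site2": {"lizard": 1, "frog": 0},
    "site3": {"lizard": 0, "frog": 1},
    "site4": {"lizard": 1, "frog": 1},
}

# Two hand-written queries, one per side of the candidate redescription.
support_left = {o for o, v in climate.items() if v["t_max"] >= 25 and v["rain"] <= 350}
support_right = {o for o, v in species.items() if v["lizard"] == 1}

jaccard = len(support_left & support_right) / len(support_left | support_right)
print(support_left, support_right, jaccard)   # 1.0 here: a perfect redescription
```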
7.2 KDD in Practice
Participants: Alexandre Bazin, Miguel Couceiro, Adrien Coulet, Sébastien Da Silva, Florence Le Ber, Jean-François Mari, Pierre Monnin, Amedeo Napoli, Yannick Toussaint.
KDD in Life Sciences.
In the ANR project PractiKPharma, three main research activities were developed, namely the validation of pharmacogenomic (PGx) knowledge, the mining of Electronic Health Records (EHRs), and biological data aggregation for knowledge discovery. The validation of PGx knowledge is based on the PGxO ontology, which is instantiated with data of various provenances, e.g. biomedical databases, the literature, and EHRs 51. Moreover, we defined and applied a set of so-called “reconciliation rules” that compare and align, whenever possible, knowledge units of different provenance 38.
We also studied how cross-corpus training may guide relation extraction from texts, and “transfer learning” techniques, i.e. how large annotated corpora developed for alternative tasks may improve the performance of text mining in biomedical texts, for which only a few annotated resources are available. Then, for improving the quality of the knowledge units in PGxLOD originating from the biomedical literature, we manually built an annotated corpus named PGxCorpus 13. Thanks to this corpus, the first one dedicated to PGx, we were able to train supervised ML approaches to discover PGx knowledge units from texts with high quality, in particular using the transformer architecture BERT 61.
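For illustration only, the following sketch shows the general fine-tuning pattern of a BERT encoder for sentence-level relation classification with the Hugging Face transformers library; the sentences and labels are made up and this is not the actual PGxCorpus training pipeline (which handles entity markers, nested entities, and evaluation).

```python
# A minimal sketch of BERT-based relation classification with the Hugging Face
# transformers library. Toy sentences and labels; only the general pattern.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["no_relation", "influences", "treats"]        # toy PGx-style labels
sentences = [
    "CYP2D6 variants influence the metabolism of codeine.",
    "Warfarin is used to treat thromboembolic disorders.",
]
y = torch.tensor([1, 2])

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=y)          # forward pass with loss computation
outputs.loss.backward()                     # an optimizer step would follow here

predictions = outputs.logits.argmax(dim=-1)
print([labels[i] for i in predictions])
```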
Furthermore, the “Fighting Heart Failure” project aims at identifying and describing relevant bio-profiles of patients suffering from heart failure. We worked on the clustering of patients, by defining multidimensional similarity measures based on graph aggregation, namely “Unsupervised Extremely Randomized Trees” 30, 10.
Text Mining.
The research work in text mining follows two main lines. The first is related to the study of discourse and argumentation structures in texts based on tree mining and redescription mining 32, while the second is related to the mining of PubMed abstracts about rare diseases. In the first line, we investigate the similarities existing between discourse and argumentation structures by aligning subtrees in an annotated corpus. Based on tree mining and redescription mining, we are able to show that the structures underlying discourse and argumentation can be (partially) aligned.
In the second line, the objective is to discover features related to rare diseases, e.g. symptoms, related diseases, treatments, and possible disease evolution or variations. The texts to be analyzed come from PubMed, a platform collecting millions of publications in the medical domain. This research project aims at developing new methods and tools for supporting knowledge discovery in textual data by combining methods from Natural Language Processing (NLP) and Knowledge Discovery in Databases (KDD).
Knowledge Systems and Web of Data.
The increase of data published in RDF format leads to challenging aspects regarding quality assessment and data exploration 66. For example, the discovery of implication rules is used to automatically detect possible completions of RDF data and to provide definitions. In the same way, we are investigating the characterization of functional dependencies and of similarity dependencies within FCA and pattern structures 20. This is fully related to the ANR Elker project, in which we study “link keys” 6. Link keys are more complex than functional dependencies in databases and raise new problems to solve; e.g. they may be used to identify the same objects in different RDF datasets.
For example, link keys may identify the same book or article in different bibliographical data sources. A link key is a statement of the form ({⟨auteur, creator⟩, ⟨titre, title⟩}, ⟨Livre, Book⟩), stating that whenever an instance of the class Livre has the same values for the properties auteur and titre as an instance of the class Book has for the properties creator and title, then the two instances denote the same entity.
Currently we are investigating the connections with FCA and pattern structures 16. Moreover, we are also interested in the possible connections with Relational Concept Analysis and redescription mining. We would like to study the formulation of the discovery of link keys by reusing and extending constructions developed in redescription mining. In practice, redescription mining aims at constructing pairs of descriptions, i.e. pairs of logical statements, one for each of two datasets, such that their support sets, i.e. the sets of objects that satisfy each statement of the pair, are most similar, as measured for example by their Jaccard index. This is in full correspondence with the discovery of link keys in two RDF datasets.
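The bibliographic example above can be made concrete with a minimal sketch in which plain Python dictionaries stand in for the two RDF datasets and the link key is applied by simple value comparison; this illustrates the principle only, not the Elker discovery algorithms.

```python
# A minimal sketch of applying a link key over two toy datasets, following the
# bibliographic example above: instances of Livre and Book are linked when
# their (auteur, creator) and (titre, title) values coincide.

dataset_fr = {   # instances of class Livre
    "fr:l1": {"auteur": "V. Hugo", "titre": "Les Miserables"},
    "fr:l2": {"auteur": "A. Camus", "titre": "La Peste"},
}
dataset_en = {   # instances of class Book
    "en:b1": {"creator": "V. Hugo", "title": "Les Miserables"},
    "en:b2": {"creator": "J. Verne", "title": "Around the World in Eighty Days"},
}

# The link key: pairs of properties that must agree across the two classes.
link_key = [("auteur", "creator"), ("titre", "title")]

links = [
    (a, b)
    for a, va in dataset_fr.items()
    for b, vb in dataset_en.items()
    if all(va.get(p) == vb.get(q) for p, q in link_key)
]
print(links)   # [('fr:l1', 'en:b1')]: the two instances denote the same book
```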
7.3 Explanations and Fairness in KDD
Participants: Guilherme Alves Da Silva, Alexandre Bazin, Miguel Couceiro, Adrien Coulet, Alain Gély, Tatiana Makhalova, Pierre Monnin, Amedeo Napoli, Georgios Zervakis.
Inspired by the FCA machinery, we are seeking a global approach to support the hybridization and intelligibility of KDD systems. Indeed, the description of KDD algorithms through dimensions such as interpretability, fairness, accuracy, and tractability leads to a classification into algorithmic concepts revealing synergies –and thus model-based interpretations and explanations– as well as suitability and compatibility w.r.t. context and fairness 69.
In 18, 22, we addressed fairness issues of KDD algorithms based on the idea of “feature dropout” followed by an “ensemble approach” to improve model fairness. All these ideas have been implemented in FixOut, a human-centered and model-agnostic framework 53. Given a classifier, a dataset, and a set of sensitive features, FixOut first assesses whether the classifier is fair by checking its reliance on sensitive features using explanations. If deemed unfair, FixOut then applies feature dropout to obtain a pool of classifiers, which are combined into an ensemble classifier that is less dependent on sensitive features without compromising the overall accuracy.
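The feature-dropout/ensemble idea can be sketched as follows with scikit-learn on synthetic data; this illustrates the principle only and is not the FixOut implementation or its API, and the explanation-based fairness assessment step is only indicated by a comment.

```python
# A minimal sketch of the feature-dropout / ensemble idea behind FixOut, using
# scikit-learn on synthetic data. Not the FixOut implementation or its API.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
sensitive = [0, 3]          # indices of (hypothetical) sensitive features

# Step 1 (not shown): explanations, e.g. LIME, would tell us whether the
# original classifier relies too much on the sensitive features.

# Step 2: feature dropout -- train one simplified classifier per dropped
# sensitive feature, plus one with all sensitive features removed.
drop_sets = [[i] for i in sensitive] + [sensitive]
pool = []
for dropped in drop_sets:
    keep = [j for j in range(X.shape[1]) if j not in dropped]
    clf = LogisticRegression(max_iter=1000).fit(X[:, keep], y)
    pool.append((keep, clf))

# Step 3: aggregate the pool into an ensemble by averaging class probabilities.
def ensemble_predict(X_new):
    probs = np.mean([clf.predict_proba(X_new[:, keep]) for keep, clf in pool], axis=0)
    return probs.argmax(axis=1)

print(ensemble_predict(X[:5]), y[:5])
```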
Future investigations concern the use of different explanation methods (e.g. Shapley values) and the application to different families of classifiers, ranging from random forests to neural networks. Moreover, we would also like to combine explanation and clustering methods to understand how the classification process works and to provide possible descriptions of the resulting clusters. Possible experiments will be carried out on complex chemical datasets.
8 Partnerships and cooperations
8.1 European initiatives
8.1.1 FP7 & H2020 Projects
H2020 Tailor (2020-2023)
Participants: Guilherme Alves Da Silva, Miguel Couceiro, Alain Gély, Amedeo Napoli, Georgios Zervakis.
Tailor stands for “Foundations of Trustworthy AI - Integrating Reasoning, Learning and Optimization”, an H2020 project in which the Orpailleur team participates, in connection with its work on explanations and fairness in KDD.
8.2 National initiatives
8.2.1 ANR
ANR ELKER (2017–2022)
Participants: Nacira Abbas, Alexandre Bazin, Miguel Couceiro, Amedeo Napoli.
The ELKER ANR research project is dedicated to the study and discovery of link keys for interlinking RDF datasets, in connection with FCA, pattern structures, and redescription mining (see Section 7.2).
ANR PractiKPharma (2016–2020)
Participants: Miguel Couceiro, Adrien Coulet, Pierre Monnin, Amedeo Napoli, Yannick Toussaint.
PractiKPharma, standing for “Practice-based evidences for actioning Knowledge in Pharmacogenomics”, is an ANR research project focusing on the validation of pharmacogenomic knowledge by combining knowledge extracted from the biomedical literature with knowledge extracted from Electronic Health Records (see Section 7.2).
ANR AstroDeep (2019–2023)
Participants: Brieuc Conan-Guez, Miguel Couceiro, Amedeo Napoli, Frédéric Pennerath, Claire Theobald.
Astronomical surveys planned for the coming years will produce data that present analysis challenges not only because of their scale (hundreds of petabytes), but also by the complexity of the measurement challenges on very deep images (for instance subpercent-level measurement of colors and shapes in blended objects). New machine learning techniques appear very promising: once trained, they are very efficient and excel at extracting features from complex images. In the AstroDeep project, we aim at developing such machine learning techniques that can be applied directly on complex images without going through the traditional steps of astronomical image processing, that lose information at each stage 42. The developed techniques will help to leverage the observation capabilities of future surveys (LSST, Euclid, and WFIRST), and will allow a joint analysis of data.
The AstroDeep ANR Project involves three labs, namely APC Paris (“Astroparticules et Cosmologie Paris”), the Orpailleur Team at Inria Nancy Grand Est/LORIA, and “Département d'Astrophysique CEA Saclay”.
8.2.2 Inria Project Labs, Exploratory Research Actions, and Technological Development Actions
Participants: Guilherme Alves Da Silva, Miguel Couceiro, Nyoman Juniarta, Amedeo Napoli, Laureline Nevin, Georgios Zervakis.
- HyAIAI (IPL 2019-2022)
Recent progress in Machine Learning (ML), and especially in deep learning, has made ML present and prominent in a wide range of applications. However, current and efficient ML approaches rely on complex numerical models. The decisions that are proposed may be accurate but cannot be easily explained to the layman, especially in cases where complex and human-oriented decisions should be made, e.g. whether to grant a loan or to admit a student to a chosen university program. The objective of the HyAIAI IPL is to study the problem of making ML methods more interpretable. For that, we design hybrid ML approaches that combine state-of-the-art subsymbolic models (e.g. neural networks) with explainable symbolic models (e.g. pattern mining). More precisely, one goal is to integrate high-level domain constraints into ML models, to provide model designers with information on ill-performing parts of the model, and to give the layman/practitioner understandable explanations on the results of the ML model.
The HyAIAI IPL project involves seven Inria Teams, namely Lacodam in Rennes (project leader), Magnet and SequeL in Lille, Multispeech and Orpailleur in Nancy, and TAU in Saclay.
- Ordem (ADT 2019-2020)
One of the outputs of the former Hybride ANR project was the Orphamine system, which aims at information retrieval and diagnosis aid in the domain of “rare diseases” (http://orphamine.loria.fr/). The Orphamine system is based on domain knowledge, and in particular on medical ontologies such as ORDO (“Orphanet Rare Diseases Ontology”) and HPO (“Human Phenotype Ontology”). The objective of the “Ordem” ADT is to update Orphamine, making the system more accessible and more open. This mainly requires software development for building the connections with domain knowledge, graph mining methods for retrieving relevant units in knowledge graphs, actual visualization tools, pattern mining, statistical decision tools for decision making (in particular log-linear models), as well as text mining tools for analyzing expert queries and the medical literature about rare diseases. A large part of this development was carried out during the last year, for making the Orphamine system more robust. The new version of the Orphamine system will be publicly accessible through a web interface in the near future.
- HyGraMi (PRE Inria 2018-2020)
The so-called “projet de recherche exploratoire” (PRE) HyGraMi, for “Hybrid Graph Mining for the Design of New Antibacterials”, is about the fight against the resistance of bacteria to antibiotics. The objective of HyGraMi is to design a hybrid data mining system for discovering new antibacterial agents. This system should rely on a combination of numeric and symbolic classifiers, guided by expert domain knowledge. The analysis and classification of the chemical structures is based on an interaction between symbolic methods, e.g. graph mining techniques, and numerical supervised classifiers based on exact and approximate matching 54.
9 Dissemination
9.1 Promoting scientific activities
9.1.1 Scientific Events: Organization
- Laurine Huber was co-chair in the scientific organization of the “6ième conférence conjointe Journées d'Études sur la Parole (JEP), Traitement Automatique des Langues Naturelles (TALN), et Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL)”, Nancy, June 2020 (https://jep-taln2020.loria.fr).
- Miguel Couceiro, Amedeo Napoli, and Pierre Monnin were the General Chairs and the scientific organizers of the online workshop ALGOS 2020, dedicated to Maurice Pouzet for his 75th birthday, August 2020 49 (https://algos2020.loria.fr/).
- Amedeo Napoli was the co-Program Chair and scientific co-organizer of the 8th workshop FCA4AI, co-located with ECAI 2020 in August 2020 50 (http://www.fca4ai.hse.ru/).
9.1.2 General chair, scientific chair
- Amedeo Napoli was co-Program Chair of the special track “General Topics of Data Analysis” at AIST in 2020 (http://aistconf.org/).
- See also items 2 and 3 in Section 9.1.1 above.
9.1.3 Member of Editorial Boards
- Miguel Couceiro is an editor of the Journal of Multiple-Valued Logic and Soft Computing and of the International Journal on Information Technologies and Security.
- Miguel Couceiro was elected as a member of the IEEE Technical Committee on Multiple-Valued Logic for the 2018-2020 period.
9.1.4 Reviewer - reviewing activities
- The members of the Orpailleur team are involved in the program and steering committees of major conferences (AAAI, ECAI, ECML-PKDD, ICDM, IJCAI, KDD, VLDB, ...), as members of editorial boards, and in the organization of journal special issues.
9.1.5 Invited talks
- Invited Talk (Online): “Knowledge Discovery and Knowledge Representation: A Formal Concept Analysis Perspective”, Amedeo Napoli, Center for Strategic Research, Saint Petersburg, Russia, November 2020.
- Invited Talk (Online): “Making ML models fairer through explanations: the case of LimeOut”, Miguel Couceiro, The 9th International Conference on Analysis of Images, Social Networks and Texts, Russia 18.
9.1.6 Teaching
- All the permanent members of the Orpailleur team are involved in teaching at all levels, mainly at Université de Lorraine. Actually, most of the members of the Orpailleur team hold Université de Lorraine positions.
- Responsibility of the 2nd year of the NLP Master's program in the IDMC, Université de Lorraine.
- Local coordination of the European Erasmus Mundus Master's program LCT (Language and Communication Technologies). The LCT program is designed to provide students with practice-oriented knowledge in computational and theoretical linguistics, natural language processing, and computer science, to meet the demands of industry and research in these rapidly growing areas. The LCT consortium includes 7 European universities, i.e. Saarland, Lorraine, Trento, Malta, Groningen, Charles University in Prague, and the University of the Basque Country, and several partners, e.g. DFKI, IBM (Czech Rep.), VICOMTECH, Sony (Europe), IBM (Ireland), and Inria (France).
- Responsibility for teaching courses about Artificial Intelligence and Knowledge-Based Systems at TELECOM Nancy, an engineering school in computer science at Université de Lorraine.
9.1.7 Tutorials
- Online Tutorial: “A Smooth Introduction to FCA” (Amedeo Napoli), third edition of the JRI-2020, UFR-SEA, Université Joseph Ki-Zerbo, Ouagadougou, Burkina Faso, September 2020.
- Online Tutorial: “FCA and Knowledge Discovery” (Amedeo Napoli), tutorial at the 25th International Conference on Conceptual Structures (ICCS), Bolzano, Italy, September 18, 2020 (https://iccs-conference.org/?page_id=538).
9.1.8 Supervision
- The members of the Orpailleur team are involved in student supervision at all university levels, from undergraduate to postgraduate, including engineers, PhD students, and postdocs.
- The permanent members of the Orpailleur team are also involved in HDR and thesis defenses, being thesis referees or thesis committee members.
- PhD in progress: Nacira Abbas, discovery of link keys in the web of data, started in 2018, Jérôme David (Inria Rhône-Alpes) and Amedeo Napoli.
- PhD in progress: Guilherme Alves, FixOut: a system for checking the fairness of knowledge discovery algorithms, started in 2018, Miguel Couceiro and Amedeo Napoli.
- PhD in progress: Laurine Huber, Analysis of Discourse and Text mining, started in 2018, Yannick Toussaint.
- PhD in progress: Tatiana Makhalova, a study of pattern mining from an FCA perspective (MDL, numerical pattern mining), started in 2017, Sergei O. Kuznetsov (HSE Moscow) and Amedeo Napoli.
- PhD in progress: Esteban Marquer, Reasoning over Data: analogy based and transfer learning to improve machine learning, started in 2020, Miguel Couceiro and Alain Gély.
- PhD defended: Pierre Monnin, Data mining and ontology design in pharmacogenomics, started in 2016, Adrien Coulet and Amedeo Napoli.
- PhD in progress: Claire Theobald, Discovering and understanding images of galaxies in astronomical data, started in 2020, Miguel Couceiro and Frédéric Pennerath.
- PhD in progress: Laura Zanella, Text mining and Deep Learning, started in 2019, Yannick Toussaint.
- PhD in progress: Georgios Zervakis, Hybrid data mining systems, started in 2019, Miguel Couceiro and Emmanuel Vincent (Multispeech).
10 Scientific production
10.1 Major publications
- 1. “Learning rule sets and Sugeno integrals for monotonic classification problems”. Fuzzy Sets and Systems, 401, December 2020, 4-37.
- 2. “Interpretable Dimensionally-Consistent Feature Extraction from Electrical Network Sensors”. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD'20), Ghent (virtual conference), Belgium, September 2020.
- 3. “A hybrid and exploratory approach to knowledge discovery in metabolomic data”. Discrete Applied Mathematics, 273 (SI), 2020, 103-116.
- 4. “PGxCorpus, a manually annotated corpus for pharmacogenomics”. Scientific Data, 7(3), January 2020.
- 5. “Discovering Approximate Functional Dependencies using Smoothed Mutual Information”. In KDD 2020 - 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, San Diego / virtual, United States, August 2020.
10.2 Publications of the year
International journals
International peer-reviewed conferences
National peer-reviewed Conferences
Conferences without proceedings
Scientific book chapters
Edition (books, proceedings, special issue of a journal)
Doctoral dissertations and habilitation theses
Reports & preprints
Other scientific publications
10.3 Cited publications
- 64. C. Aggarwal, C. Zhai. Mining Text Data. Springer, 2012.
- 65. “Mining Definitions from RDF Annotations Using Formal Concept Analysis”. In Proceedings of IJCAI, AAAI Press, 2015, 823-829.
- 66. “Exploratory Knowledge Discovery over Web of Data”. Discrete Applied Mathematics, 249, 2018, 2-17.
- 67. “LatViz: A New Practical Tool for Performing Interactive Exploration over Concept Lattices”. In Proceedings of the 13th International Conference on Concept Lattices and Their Applications (CLA), CEUR Workshop Proceedings 1624, 2016, 9-20.
- 68. F. Baader, D. Calvanese, D. McGuinness, D. Nardi, P. Patel-Schneider. The Description Logic Handbook. Cambridge University Press, Cambridge, UK, 2003.
- 69. “Elements About Exploratory, Knowledge-Based, Hybrid, and Explainable Knowledge Discovery”. In ICFCA 2019 - 15th International Conference on Formal Concept Analysis (Formal Concept Analysis, volume 11511), Frankfurt, Germany, Springer International Publishing, June 2019, 3-16.
- 70. “Mining redescriptions with Siren”. ACM Transactions on Knowledge Discovery from Data (TKDD), 12(1), 2018, 6:1-6:30.
- 71. “Pattern Structures and Their Projections”. In Proceedings of ICCS 2001, LNCS 2120, Springer, 2001, 129-142.
- 72. “Mining gene expression data with pattern structures in formal concept analysis”. Information Sciences, 181(10), 2011, 1989-2001.
- 73. “Discovering structural alerts for mutagenicity using stable emerging molecular patterns”. Journal of Chemical Information and Modeling, 55(5), 2015, 925-940.
- 74. “Redescription mining for learning definitions and disjointness axioms in Linked Open Data”. In ICCS 2019 - 24th International Conference on Conceptual Structures, Marburg, Germany, July 2019.
- 75. “Using Redescriptions and Formal Concept Analysis for Mining Definitions in Linked Data”. In ICFCA 2019 - 15th International Conference on Formal Concept Analysis (Formal Concept Analysis, volume 11511), Frankfurt, Germany, Springer International Publishing, June 2019, 241-256.
- 76. “"Why Should I Trust You?": Explaining the Predictions of Any Classifier”. In Proceedings of KDD, ACM, 2016, 1135-1144.
- 77. “Relational Concept Analysis: Mining Concept Lattices From Multi-Relational Data”. Annals of Mathematics and Artificial Intelligence, 67(1), January 2013, 81-108. URL: http://hal.inria.fr/lirmm-00816300
- 78. “A Unified Approach to Quantifying Algorithmic Unfairness: Measuring Individual Unfairness via Inequality Indices”. In Proceedings of KDD, ACM, 2018, 2239-2248.
- 79. “Fairness Constraints: A Flexible Approach for Fair Classification”. Journal of Machine Learning Research, 20, 2019, 75:1-75:42.