2025 Activity report
Team SEMAGRAMME
RNSR: 201120979K
- Research center: Inria Centre at Université de Lorraine
- In partnership with: CNRS, Université de Lorraine
- Team name: Semantic Analysis of Natural Language
- In collaboration with: Laboratoire lorrain de recherche en informatique et ses applications (LORIA)
Creation of the Team: 2013 July 01
Keywords
Computer Science and Digital Science
- A5.8. Natural language processing
- A7.2. Logic in Computer Science
- A9.4. Natural language processing
Other Research Topics and Application Domains
- B2. Digital health
- B9.6.8. Linguistics
- B9.9. Ethics
1 Team members, visitors, external collaborators
Research Scientists
- Philippe de Groote [Team leader, INRIA, Senior Researcher]
- Bruno Guillaume [INRIA, Researcher]
- Vincent Martin [INRIA, Researcher]
- Sylvain Pogodalla [INRIA, Researcher]
Faculty Members
- Maxime Amblard [UL, Professor, HDR]
- Karën Fort [UL, Professor, HDR]
- Jacques Jayez [ENS DE LYON, Emeritus]
- Michel Musiol [UL, Professor Delegation, until Aug 2025, HDR]
PhD Students
- Clémentine Bleuze [UL]
- Hee-Soo Choi [UL, ATER]
- Marie Cousin [UL]
- Amandine Decker [UL]
- Fanny Ducel [UNIV PARIS SACLAY]
- Maxime Guillaume [YSEOP, CIFRE]
- Amandine Lecomte [UL, until Mar 2025]
- Siyana Pavlova [UL, ATER]
- Valentin Richard [Univ Amsterdam]
- Vincent Tourneur [UL]
- Rémi de Vergnette de Lamotte [UL, from Nov 2025]
Technical Staff
- Khensa Amani Daoudi [INRIA, Engineer, until Jan 2025]
- Amandine Lecomte [UL, from Mar 2025]
- Iglika Zlatkova Nikolova-Stoupak [UL]
Interns and Apprentices
- Mohammad Al Takach [UL, from Mar 2025 until Aug 2025]
- Jeffrey Andre [UL, Intern, from Apr 2025 until May 2025]
- Apolline Bastien [UL, Intern, from Jun 2025 until Jun 2025]
- Ahana Chattopadhyay [UL, Intern, from Mar 2025 until Aug 2025]
- Luc Cheng [INRIA, Intern, from Mar 2025 until Jul 2025]
- Florian Cuny [UL, Intern, from Jun 2025 until Aug 2025]
- Lucie Digoin-Caparros [UL, Intern, from Jun 2025 until Aug 2025]
- Mae Dugoua Jacques [CNRS, Intern, from Apr 2025 until Jun 2025]
- Samba Fall [INRIA, Intern, from Jun 2025 until Aug 2025]
- Zsofia Flora Hauk [UL, Intern, from Jun 2025 until Aug 2025]
- Jules Holder [CNRS, Intern, from Apr 2025 until Jun 2025]
- Vidit Khazanchi [UL, Intern, from May 2025 until Jul 2025]
- Owen Le Ray [UL, Intern, from Nov 2025]
- Loic Leclere [UL, Intern, from Jun 2025 until Aug 2025]
- Tadzhat Marharian [UL, Intern, from Jun 2025 until Aug 2025]
- Ivaylo Mitov [UL, Intern, from Jun 2025 until Aug 2025]
- Wassila Oudinache [UL, Intern, from Apr 2025 until Jun 2025]
- Arthur Pedrini [UL, Intern, from Jun 2025 until Aug 2025]
- Shayan Ahmed Sharriff [UL, Intern, from Jun 2025 until Aug 2025]
- Austin Tangban [UL, from Jul 2025 until Aug 2025]
- Enola Thomas [UL, Intern, from Jun 2025 until Jun 2025]
- Celine Zyna Rahme [UL, from Jul 2025 until Aug 2025]
- Rémi de Vergnette de Lamotte [UL, Intern, from Mar 2025 until Aug 2025]
Administrative Assistants
- Véronique Constant [INRIA]
- Sophie Drouot [INRIA]
- Anne-Marie Messaoudi [LORIA, from Sep 2025]
- Anne-Marie Messaoudi [UL, until Aug 2025]
- Gallown Nizard [UL]
- Cecilia Olivier [INRIA]
External Collaborators
- Mathieu Constant [UL]
- Khensa Amani Daoudi [UNICAEN, from Feb 2025 until Aug 2025]
- Roberto Diaz Hernandez [Univ Jaén, from Apr 2025]
- Michel Musiol [UL, from Sep 2025, HDR]
2 Overall objectives
2.1 Scientific Context
Computational linguistics is a discipline at the intersection of computer science and linguistics. On the theoretical side, it aims to provide computational models of the human language faculty. On the applied side, it is concerned with natural language processing and its practical applications.
From a structural point of view, linguistics is traditionally organized into the following sub-fields:
- Phonology, the study of the abstract sound systems of languages.
- Morphology, the study of word structure.
- Syntax, the study of language structure, i.e., the way words combine into grammatical phrases and sentences.
- Semantics, the study of meaning at the levels of words, phrases, and sentences.
- Pragmatics, the study of the ways in which the meaning of an utterance is affected by its context.
Computational linguistics is concerned with all these fields. Consequently, various computational models, whose application domains range from phonology to pragmatics, have been developed. Among these, logic-based models play an important part, especially at the “highest” levels.
At the level of syntax, generative grammars may be seen as basic inference systems, while categorial grammars are based on substructural logics specified by Gentzen sequent calculi. Finally, model-theoretic grammars amount to sets of logical constraints to be satisfied.
At the level of semantics, the most common approaches derive from Montague grammars, which are based on the simply typed λ-calculus and Church's simple theory of types. In addition, various logics (modal, hybrid, intensional, higher-order...) are used to express logical semantic representations.
At the level of pragmatics, the situation is less clear. The word pragmatics was introduced by Morris to designate the branch of the philosophy of language that studies, besides linguistic signs, their relation to their users and the possible contexts of use. This definition was not very precise and, for a long time, several authors have considered (and some still consider) pragmatics as the wastebasket of syntax and semantics. Nevertheless, as far as discourse processing is concerned (which includes pragmatic problems such as pronominal anaphora resolution), logic-based approaches have also been successful. In particular, Kamp's Discourse Representation Theory gave rise to sophisticated “dynamic” logics. The situation, however, is less satisfactory than it is at the semantic level. On the one hand, we are facing a kind of logical “tower of Babel”. The various pragmatic logic-based models that have been developed, while sharing underlying mathematical concepts, differ in several respects and are too often based on ad hoc features. As a consequence, they are difficult to compare and appear more as competitors than as collaborative theories that could be integrated. On the other hand, several phenomena related to discourse dynamics (e.g., context updating, presupposition projection and accommodation, contextual reference resolution...) still lack deep logical explanations. We strongly believe, however, that this situation can be improved by applying to pragmatics the same approach Montague applied to semantics, using the standard tools of mathematical logic.
Accordingly:
The overall objective of the Sémagramme project is to design and develop new unifying logic-based models, methods, and tools for the semantic analysis of natural language utterances and discourses. This includes the logical modeling of pragmatic phenomena related to discourse dynamics. Typically, these models and methods will be based on standard logical concepts (stemming from formal language theory, mathematical logic, and type theory), which should make them easy to integrate.
The project is organized along three research directions (i.e., syntax-semantics interface, discourse dynamics, and common basic resources), which interact as explained below.
Moreover, a transversal and transdisciplinary theme has been developed in the team in the past years: ethics in NLP and more generally in AI.
2.2 Syntax-Semantics Interface
The Sémagramme project intends to focus on the semantics of natural languages (in a wider sense than usual, including some pragmatics). Nevertheless, the semantic construction process is syntactically guided, that is, the constructions of logical representations of meaning are based on the analysis of the syntactic structures. We do not want, however, to commit ourselves to any specific theory of syntax. Consequently, our approach should be based on an abstract generic model of the syntax-semantics interface.
Here, an important idea of Montague comes into play, namely, the “homomorphism requirement”: semantics must appear as a homomorphic image of syntax. While this idea is almost a truism in the context of mathematical logic, it remains challenged in the context of natural languages. Nevertheless, Montague's idea has been quite fruitful, especially in the field of categorial grammars, where van Benthem showed how syntax and semantics could be connected using the Curry-Howard isomorphism. This correspondence is the keystone of the syntax-semantics interface of modern type-logical grammars. It also motivated the definition of our own Abstract Categorial Grammars [77].
Technically, an Abstract Categorial Grammar simply consists of a (linear) homomorphism between two higher-order signatures. Extensive studies have shown that this simple model allows several grammatical formalisms to be expressed, providing them with a syntax-semantics interface for free [75, 8].
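The idea of a lexicon acting as a homomorphism can be illustrated with a minimal Python sketch. It is only an illustration under simplifying assumptions: abstract terms are encoded as nested pairs `(function, argument)`, object terms are plain Python values, and neither typing nor linearity is checked. The names `interpret`, `surface` and `logic` are hypothetical and are not part of ACGtk.

```python
# Illustrative sketch only: a lexicon maps each abstract constant to an
# object-level term, and is extended homomorphically to abstract terms.
# Names and encodings are assumptions, not the ACGtk implementation.

def interpret(term, lexicon):
    """Homomorphic extension of a lexicon to abstract terms.

    An abstract term is either a constant (a string) or an application,
    encoded as a pair (function, argument).
    """
    if isinstance(term, str):
        return lexicon[term]
    fun, arg = term
    # The image of an application is the application of the images.
    return interpret(fun, lexicon)(interpret(arg, lexicon))


# Two lexicons sharing the same abstract vocabulary:
# one realizes abstract terms as strings, the other as logical forms.
surface = {
    "JOHN": "John",
    "MARY": "Mary",
    "LOVES": lambda obj: lambda subj: f"{subj} loves {obj}",
}
logic = {
    "JOHN": "j",
    "MARY": "m",
    "LOVES": lambda obj: lambda subj: f"love({subj}, {obj})",
}

# Abstract term LOVES MARY JOHN (object first, then subject).
abstract = (("LOVES", "MARY"), "JOHN")

sentence = interpret(abstract, surface)   # "John loves Mary"
meaning = interpret(abstract, logic)      # "love(j, m)"
```

Because both lexicons interpret the same abstract term, the surface form and the logical form are kept in step by construction, which is the sense in which the syntax-semantics interface comes for free.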
We intend to carry on with the development of the Abstract Categorial Grammar framework. At the foundational level, we will define and study possible type theoretic extensions of the formalism, in order to increase its expressive power and its flexibility. At the implementation level, we will continue the development of an Abstract Categorial Grammar support system.
As said above, considering the syntax-semantics interface as the starting point of our investigations allows us not to be committed to some specific syntactic theory. The Montagovian syntax-semantics interface, however, cannot be considered to be universal. In particular, it does not seem to be well adapted to dependency and model-theoretic grammars. Consequently, in order to be as generic as possible, we intend to explore alternative models of the syntax-semantics interface. In particular, we will explore relational models where several distinct semantic representations can correspond to the same syntactic structure.
2.3 Discourse Dynamics
It is well known that the interpretation of a discourse is a dynamic process. Take a sentence occurring in a discourse. On the one hand, it must be interpreted according to its context. On the other hand, its interpretation affects this context, and must therefore result in an updating of the current context. For this reason, discourse interpretation is traditionally considered to belong to pragmatics. The cut between pragmatics and semantics, however, is not that clear.
As we mentioned above, we intend to apply to some aspects of pragmatics (mainly, discourse dynamics) the same methodological tools Montague applied to semantics. The challenge here is to obtain a completely compositional theory of discourse interpretation, by respecting Montague's homomorphism requirement. We think that this is possible by using techniques coming from programming language theory, in particular, continuation semantics, and the related theories of functional control operators.
We have indeed successfully applied such techniques in order to model the way quantifiers in natural languages may dynamically extend their scope [76]. We intend to tackle, in a similar way, other dynamic phenomena (typically, anaphora and referential expressions, presupposition, modal subordination...).
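As a rough illustration of the continuation-based view, the following Python sketch treats a sentence as a function of a context (a list of discourse referents) and a continuation (the rest of the discourse). It is a toy example under strong assumptions: formulas are built as strings, pronoun resolution naively picks the most recent referent, and the function names are hypothetical.

```python
# Toy sketch of continuation-style discourse interpretation.
# A sentence takes a context (list of referents) and a continuation
# (what the rest of the discourse does with the updated context).
# All names and the string encoding of formulas are assumptions.

def a_man_walks(ctx, k):
    x = f"x{len(ctx)}"  # fresh variable name
    # The continuation is placed inside the scope of the quantifier,
    # so later sentences can refer to x.
    return f"exists {x}. (man({x}) & walk({x}) & {k([x] + ctx)})"


def he_smiles(ctx, k):
    referent = ctx[0]  # naive selection: the most recent referent
    return f"(smile({referent}) & {k(ctx)})"


def seq(first, second):
    """Discourse composition: the second sentence becomes part of the
    continuation of the first one."""
    return lambda ctx, k: first(ctx, lambda ctx2: second(ctx2, k))


discourse = seq(a_man_walks, he_smiles)
# Interpret in the empty context with the trivial final continuation.
formula = discourse([], lambda ctx: "True")
```

In the resulting formula the existential quantifier introduced by the first sentence takes scope over the contribution of the second one, even though each sentence is interpreted compositionally.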
What characterizes these different dynamic phenomena is that their interpretations need information to be retrieved from a current context. This raises the question of the modeling of the context itself. At a foundational level, we have to answer questions such as the following. What is the nature of the information to be stored in the context? What are the processes that allow implicit information to be inferred from the context? What are the primitives that allow a context to be updated? How does the structure of the discourse and the discourse relations affect the structure of the context? These questions also raise implementation issues. What are the appropriate data types? How can we keep the complexity of the inference algorithms sufficiently low?
2.4 Common Basic Resources
Even if our research primarily focuses on semantics and pragmatics, we nevertheless need syntax. More precisely, we need syntactic trees to start with. We consequently need grammars, lexicons, and parsing algorithms to produce such trees. Over the last years, we have developed the notion of interaction grammar [78] and graph rewriting [3, 4] as models of natural language syntax. This includes the development of grammars for French [91], together with morphosyntactic lexicons. We intend to continue this line of research and development. In particular, we want to increase the coverage of our grammars for French, and provide our parsers with more robust algorithms.
Further primary resources are needed in order to put to work a computational semantic analysis of utterances and discourses. As we want our approach to be as compositional as possible, we must develop lexicons annotated with semantic information. This opens the quite wide research area of lexical semantics.
Finally, when dealing with logical representations of utterance interpretations, the need for inference facilities is ubiquitous. Inference is needed in the course of the interpretation process, but also to exploit the result of the interpretation. Indeed, an advantage of using formal logic for semantic representations is the possibility of using logical inference to derive new information. From a computational point of view, however, logical inference may be highly complex. Consequently, we need to investigate which logical fragments can be used efficiently for natural language oriented inference.
3 Research program
3.1 Overview
The research program of Sémagramme aims to develop models based on well-established mathematics. We seek two main advantages from this approach. On the one hand, by relying on mature theories, we have at our disposal sets of mathematical tools that we can use to study our models. On the other hand, developing various models on a common mathematical background will make them easier to integrate, and will ease the search for unifying principles.
The main mathematical domains on which we rely are formal language theory, symbolic logic, and type theory.
3.2 Formal Language Theory
Formal language theory studies the purely syntactic and combinatorial aspects of languages, seen as sets of strings (or possibly trees or graphs). Formal language theory has been especially fruitful for the development of parsing algorithms for context-free languages. We use it, in a similar way, to develop parsing algorithms for formalisms that go beyond context-freeness. Language theory also appears to be very useful in formally studying the expressive power and the complexity of the models we develop.
3.3 Symbolic Logic
Symbolic logic (and, more particularly, proof theory) is concerned with the study of the expressive and deductive power of formal systems. In a rule-based approach to computational linguistics, the use of symbolic logic is ubiquitous. As we previously said, at the level of syntax, several kinds of grammars (generative, categorial...) may be seen as basic deductive systems. At the level of semantics, the meaning of an utterance is captured by computing (intermediate) semantic representations that are expressed as logical forms. Finally, using symbolic logics allows one to formalize notions of inference and entailment that are needed at the level of pragmatics.
3.4 Type Theory and Typed Lambda-Calculus
Among the various possible logics that may be used, Church's simply typed λ-calculus and simple theory of types (also known as higher-order logic) play a central part. On the one hand, Montague semantics is based on the simply typed λ-calculus, and so is our syntax-semantics interface model. On the other hand, as shown by Gallin, the target logic used by Montague for expressing meanings (i.e., his intensional logic) is essentially a variant of higher-order logic featuring three atomic types (the third atomic type standing for the set of possible worlds).
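To make the role of simple types concrete, here is a small Python sketch of a type checker for simply typed λ-terms over three atomic types. The atom names `e`, `t` and `s` (entities, truth values, possible worlds) follow a common convention and are an assumption here, as are all class and function names.

```python
# Minimal sketch of simple type checking with three atomic types.
# Class names, atom names and the example constants are assumptions.
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Arrow:
    dom: "Type"
    cod: "Type"


Type = Union[Atom, Arrow]


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    var: str
    var_type: Type
    body: "Term"


@dataclass(frozen=True)
class App:
    fun: "Term"
    arg: "Term"


Term = Union[Var, Lam, App]


def type_of(term, env):
    """Return the simple type of a term in a typing environment."""
    if isinstance(term, Var):
        return env[term.name]
    if isinstance(term, Lam):
        body_type = type_of(term.body, {**env, term.var: term.var_type})
        return Arrow(term.var_type, body_type)
    if isinstance(term, App):
        fun_type = type_of(term.fun, env)
        arg_type = type_of(term.arg, env)
        if not isinstance(fun_type, Arrow) or fun_type.dom != arg_type:
            raise TypeError("ill-typed application")
        return fun_type.cod
    raise TypeError("unknown term")


E, T, S = Atom("e"), Atom("t"), Atom("s")
# "walk" as a world-dependent property of entities: s -> e -> t.
env = {"walk": Arrow(S, Arrow(E, T)), "john": E}
# The proposition "John walks" as a function from worlds to truth values.
term = Lam("w", S, App(App(Var("walk"), Var("w")), Var("john")))
prop_type = type_of(term, env)  # Arrow(S, T), i.e. s -> t
```

The third atomic type makes world-dependence explicit in the types: a proposition is simply a term of type s -> t.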
4 Application domains
4.1 Deep Semantic Analysis
Our applicative domains concern natural language processing applications that rely on a deep semantic analysis. For instance, one may cite the following ones:
- textual entailment and inference,
- dialogue systems,
- semantic-oriented query systems,
- content analysis of unstructured documents,
- (semi) automatic knowledge acquisition,
- discourse structure analysis (argumentative relations, discourse markers),
- lexical resources.
4.2 Text Transformation
Text transformation is an application domain featuring two important sub-fields of computational linguistics:
- parsing, from surface form to abstract representation,
- generation, from abstract representation to surface form.
Text simplification and automatic summarization belong to that domain.
To this end, we aim to use the framework of Abstract Categorial Grammars that we develop. It is indeed a reversible framework that allows both parsing and generation. Its underlying mathematical structure, the λ-calculus, makes it a good fit with our type-theoretic approach to discourse dynamics modeling.
4.3 Types for discourse markers
While there is a rich descriptive literature on Discourse Markers (DM), for instance words or expressions like “so” or “yet” in English, the question of their representation in type systems is understudied. In addition to basic types such as individuals or events, or simple functional types (properties, etc.), DM are known to operate on domains like states of affairs, beliefs or speech acts. The entities inhabiting these domains are themselves complex. For instance, speech acts involve discourse planning in the form of a network of intentions and actions. Moreover, DM can combine with one another, forming clusters whose meaning is not always apparent from the meanings of the component DM. Within the context of the ANR CODIM, we aim at developing a typing system for (i) taking into account the array of types denoted by DM and (ii) addressing the questions of the semantic nature of their combinations.
5 Social and environmental responsibility
5.1 Footprint of research activities
ANR InExtenso:
WP4 of the project is dedicated to the evaluation of the environmental impact of LLMs. More precisely, it aims at proposing a method for measuring the environmental impact of digital health and at using it in the project evaluations and beyond.
6 Latest software developments, platforms, open data
6.1 Latest software developments
6.1.1 ACGtk
- Name: Abstract Categorial Grammar Development Toolkit
- Keywords: Natural language processing, Functional programming, Logic programming, Lambda-calculus, OCaml
- Scientific Description: Abstract Categorial Grammars (ACG) are a grammatical formalism in which grammars are based on typed lambda-calculus. A grammar generates two languages: the abstract language (the language of parse structures), and the object language (the language of the surface forms, e.g., strings, or higher-order logical formulas), which is the realization of the abstract language. ACGtk provides two software tools to develop and to use ACGs: acgc, which is a grammar compiler, and acg, which is an interpreter of a command language that allows one, in particular, to parse and realize terms.
- Functional Description: ACGtk provides a piece of software for developing and using Abstract Categorial Grammars (ACG).
- Release Contributions: This new version of the software provides two important functionalities. On the one hand, it provides support for parsing with almost linear grammars. On the other hand, it generates a JavaScript program that can be loaded and run by web browsers, in order to help demonstrate the software (a demo version is available online from the public GitLab webpages of the project).
- URL:
- Publications:
- Contact: Sylvain Pogodalla
- Participants: Philippe De Groote, Pierre Ludmann, Jiri Marsik, Sylvain Pogodalla, Vincent Tourneur
6.1.2 Grew
- Name: Graph Rewriting
- Keywords: Semantics, Syntactic analysis, NLP, Graph rewriting
- Functional Description: Grew is a Graph Rewriting tool dedicated to applications in NLP. Grew takes into account confluent and non-confluent graph rewriting and it includes several mechanisms that help to use graph rewriting in the context of NLP applications (built-in notion of feature structures, parametrization of rules with lexical information).
- News of the Year: In 2025, three new versions (1.17, 1.18 and 1.19) were released (together with several bug fixes). The new features are: for version 1.17, handling of multi-treebank requests in Grew-match; for version 1.18, improved handling of metadata and global constraints; for version 1.19, introduction of tuples of clustering keys and improvements to the corpusbank manager.
- URL:
- Publications:
- Contact: Bruno Guillaume
- Participants: Bruno Guillaume, Guillaume Bonfante, an anonymous participant
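To give a flavour of what a graph rewriting rule does on a dependency structure, here is a generic Python sketch. It does not use Grew's rule language or API: the graph encoding (a dictionary of node features and a set of labelled edges), the function name `rewrite_passive` and the rule itself are assumptions made for illustration only.

```python
# Generic illustration of rule-based graph rewriting on a dependency graph.
# This is NOT Grew syntax or the Grew API; the encoding and the rule are
# hypothetical.

def rewrite_passive(nodes, edges):
    """Relabel the passive subject of a passive verb as its object,
    applying the rule until no match is left."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        for head, label, dep in sorted(edges):
            # Pattern: a passive-voice head with an nsubj:pass dependent.
            if label == "nsubj:pass" and nodes[head].get("Voice") == "Pass":
                # Command: replace the matched edge by an obj edge.
                edges.remove((head, label, dep))
                edges.add((head, "obj", dep))
                changed = True
                break  # restart matching on the rewritten graph
    return edges


# "The cake was eaten": node features and labelled edges (head, label, dep).
nodes = {
    1: {"form": "The"},
    2: {"form": "cake"},
    3: {"form": "was"},
    4: {"form": "eaten", "Voice": "Pass"},
}
edges = {(2, "det", 1), (4, "aux:pass", 3), (4, "nsubj:pass", 2)}

rewritten = rewrite_passive(nodes, edges)
```

The rule terminates because each application removes one `nsubj:pass` edge and never creates a new one.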
6.1.3 HostoMytho
- Keywords: Game with a purpose, Natural language processing
- Functional Description: HostoMytho is a GWAP, or "game with a purpose", developed within the framework of the CODEINE ANR project. The aim of the game is to allow users to annotate medical files generated automatically, in order to evaluate their plausibility (quality of the language and medical semantics) and to add different layers of information (negation, hypothesis, time, etc.). HostoMytho is multiplatform.
- URL:
- Publication:
- Contact: Karën Fort
- Partners: LISN, CEA-List
6.1.4 Arborator-Grew
- Name: Arborator's Collaborative Annotation
- Keywords: Annotation tool, Syntactic analysis
- Functional Description: The online interface allows managing collaborative annotation projects in dependency syntax. It is possible to use Grew queries and also to directly rewrite graphs in the annotation tool.
- News of the Year: During 2025, we continued to refactor the code base for both frontend and backend. In addition, we worked on improving existing functionalities and adding new ones based on user requests.
- URL:
- Publication:
- Contact: Bruno Guillaume
- Participant: 5 anonymous participants
- Partners: Université Paris Nanterre, LIMSI, LISN
6.2 Open data
7 New results
7.1 Syntax-Semantics Interface
Participants: Maxime Amblard, Marie Cousin, Philippe de Groote, Amandine Decker, Bruno Guillaume, Maxime Guillaume, Sylvain Pogodalla, Siyana Pavlova, Valentin Richard, Zhengjian Li.
7.1.1 Abstract Categorial Grammars
Feature Structure
ACG has proven to be a powerful framework with well-defined theoretical properties. It was, however, lacking a facility which is useful and widely used for grammar engineering: feature structures. The latter are often used to express in a concise way some combinatorial properties related to morphosyntactic properties of expressions, for instance subject-verb agreement.
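The way feature structures express agreement concisely can be sketched with a few lines of Python. The sketch assumes flat feature structures encoded as dictionaries; the function name `unify` and the feature names are hypothetical, and this is not the ACG extension described in this section.

```python
# Minimal sketch of unification of flat feature structures,
# used here to check subject-verb agreement.
# The encoding and names are assumptions made for illustration.

def unify(fs1, fs2):
    """Return the unification of two flat feature structures,
    or None if they carry conflicting values for some feature."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result and result[feature] != value:
            return None  # clash: the structures do not unify
        result[feature] = value
    return result


subject = {"num": "sg", "pers": 3}   # e.g. "she"
verb_sg = {"num": "sg"}              # e.g. "sleeps"
verb_pl = {"num": "pl"}              # e.g. "sleep"

agree = unify(subject, verb_sg)      # {"num": "sg", "pers": 3}
clash = unify(subject, verb_pl)      # None
```

A single rule combining a subject and a verb under unification replaces one rule per combination of number and person values, which is where the concision comes from.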
We worked on extending the ACG type system to provide a generic feature structure framework. This extension relies on a restricted addition of product types (records) and dependent types and still allows grammars to be reduced to Datalog programs (the reduction used to implement ACG parsing in ACGtk, see Sec. 6). In his thesis, Maxime Guillaume introduced Affix Abstract Categorial Grammars (AACGs), an extension of ACGs enriched by the integration of feature structures.
First, he defined an enriched λ-calculus that extends the simply typed linear λ-calculus with enumerations, records, and dependent products. On this basis, he defined AACGs and demonstrated their strong equivalence with classical ACGs through a series of formal transformations. The algorithmic implications of this equivalence for parsing were then studied. An adaptation of Kanazawa’s reduction was presented. This adaptation guarantees polynomial-time complexity while preserving the factorization benefits specific to AACGs. Finally, to validate the industrial applicability of this approach, a dedicated compiler for AACGs was designed, implemented, and integrated into a text generation engine. Experiments conducted on a large-coverage French grammar highlight a significant reduction in grammar size as well as a notable improvement in parsing and generation performance.
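The general idea of parsing by reduction to Datalog can be illustrated in the much simpler context-free case. The Python sketch below is only that special case, evaluated naively bottom-up; it is not the reduction used in ACGtk nor the adaptation presented in the thesis, and the function name and grammar encoding are assumptions.

```python
# Sketch: context-free recognition as naive bottom-up evaluation of
# Datalog-like rules such as  S(i, k) :- NP(i, j), VP(j, k).
# Illustrative special case only; names and encoding are assumptions.

def recognize(words, lexicon, rules, start="S"):
    """Facts are triples (category, i, j): category spans words[i:j]."""
    # Extensional database: one fact per lexical category of each word.
    facts = {
        (cat, i, i + 1)
        for i, word in enumerate(words)
        for cat in lexicon.get(word, ())
    }
    while True:
        # Apply every binary rule lhs -> b c to all pairs of adjacent facts.
        new = {
            (lhs, i, k)
            for lhs, (b, c) in rules
            for (cat1, i, j) in facts if cat1 == b
            for (cat2, j2, k) in facts if cat2 == c and j2 == j
        }
        if new <= facts:  # fixpoint reached: nothing new can be derived
            break
        facts |= new
    # The query: does the start category span the whole input?
    return (start, 0, len(words)) in facts


lexicon = {"John": ["NP"], "Mary": ["NP"], "loves": ["V"]}
rules = [("S", ("NP", "VP")), ("VP", ("V", "NP"))]

accepted = recognize("John loves Mary".split(), lexicon, rules)
rejected = recognize("loves John Mary".split(), lexicon, rules)
```

Since the number of possible facts is polynomial in the length of the input, the fixpoint is reached in polynomial time, which is the kind of guarantee a Datalog reduction is meant to provide.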
Encoding of Meaning-Text Theory Into ACGs
Meaning-Text Theory (MTT) is a linguistic theory geared towards generating natural language expressions from semantic representations [87]. It relies on seven representation levels (e.g., semantics, deep syntax, surface syntax, etc.). Representations at each level are related to representations at the adjacent levels by rewriting devices. Each representation is made of several structures, among which the predicative and the communicative ones. MTT uses the key concept of paraphrase, especially in these rewriting devices. ACGs come with several composition modes, one of which in particular corresponds to transduction of (tree or graph) structures.
We have therefore been studying the ability of ACGs to model MTT structure transformations between adjacent levels, focusing on the structures and levels of semantics, deep syntax, and surface syntax.
In previous work [68, 67], we proposed an encoding of MTT into ACGs where the predicative structure of the semantic level in MTT was used. However, MTT rewriting processes also make use of communicative structure information, decorating the predicative structures (at the semantic level) with theme and rheme information.
Indeed, the expressions "Charlie is Taylor's son" and "Charlie, the son of Taylor" share the same predicative structure but are not paraphrases of each other. While the second one is a nominal expression, the first one is a verbal expression about Charlie that states that he is Taylor's son. The difference between the two expressions, which share the same semantic predicative graph, is made by the communicative structure.
This shows the crucial role the communicative structure plays in MTT, since it determines, from a given semantic graph (i.e., predicative and communicative structures), which deep-syntactic graph is to be obtained. We have therefore proposed to also take into account this communicative structure, using suitable types and grammatical composition as offered by the ACG framework [50, 59].
We also proposed an alternative approach to representing deep and surface syntactic trees [42, 29]. This alternative approach, based on [74, 73], allows for a more flexible and generic representation of the syntactic structures, and for a better account of modifiers (adverbs, adjectives).
7.1.2 Formal semantics of adnominal modification
We have proposed a treatment of adnominal modification that parallels the treatment of adverbial modification in neo-Davidsonian event semantics [32]. To this end, we introduced a notion of perspective that allows nouns to be interpreted as sets of sets of perspectives. The resulting theory provides a unified compositional treatment of intersective, subsective, modal, and privative adjectives, and avoids the intensional paradoxes caused by an extensional treatment of subsective adjectives. Building on this work, we have advocated for unifying the concepts of events, states, and perspectives. We then defined possible worlds as sets of such event-like concepts. This approach allows different semantic treatments proposed in the literature to be reconstructed within a unified framework. In particular, it provides a formal treatment of the ambiguity between the intersective and the subsective interpretation that some adjectives present [52]. Finally, with the aim of giving an account of hyperintensional phenomena related to the interpretation of proper names that refer to the same individual but cannot be substituted one for the other, we came up with the radical idea of interpreting individuals as sets of perspectives [28].
7.1.3 Semantic treatment of plurals in textual mathematics
We reviewed issues related to the semantics of plurals in natural language and demonstrated how these issues arise in the case of mathematical texts. In particular, we focused on the distinction between collective and distributive predicates [26]. We also studied the conditions under which adjectives that denote binary relations can be used as collective predicates. This led us to propose a fine-grained semantic interpretation of grammatical number and to introduce distributivity operators that enable a compositional semantic treatment of plurals in natural mathematics [27].
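The contrast between distributive and collective predication can be illustrated with a short Python sketch. It is a simplified illustration, not the operators defined in the cited work: pluralities are modelled as finite sets of integers, and the helper names `distribute` and `collectivize` are hypothetical.

```python
# Toy illustration of distributive vs. collective predication over
# pluralities, modelled as finite sets of integers.
# Helper names and the set-based encoding are assumptions.
from itertools import combinations
from math import gcd


def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def coprime(a, b):
    return gcd(a, b) == 1


def distribute(pred):
    """Distributivity operator: the predicate holds of every member."""
    return lambda plurality: all(pred(x) for x in plurality)


def collectivize(rel):
    """Use a binary relation as a collective predicate: it must hold
    between every two distinct members of the plurality."""
    return lambda plurality: all(
        rel(a, b) for a, b in combinations(plurality, 2)
    )


are_prime = distribute(is_prime)        # "2 and 3 are prime"
are_coprime = collectivize(coprime)     # "4 and 9 are coprime"

four_and_nine = frozenset({4, 9})
distributive_reading = are_prime(four_and_nine)    # False: neither is prime
collective_reading = are_coprime(four_and_nine)    # True: gcd(4, 9) == 1
```

"4 and 9 are coprime" does not entail that 4 is coprime or that 9 is coprime, whereas "2 and 3 are prime" entails that each number is prime; the two operators keep these readings apart.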
7.1.4 Semantic Representation
Siyana Pavlova defended her PhD thesis in June 2025 [58], in which she presented YARN, a new semantic representation formalism that aims to combine the benefits of logic-based formalisms with direct interpretability, making it widely usable. YARN is rooted in the encoding of different semantic phenomena as separate layers. The thesis presents a formal definition of the mathematical structure that constitutes YARN and illustrates with concrete examples how this structure can be used in the context of semantic representation for encoding multiple phenomena (such as modality, negation and quantification) as layers built on top of a central predicate-argument structure. The benefit of YARN is that it allows for the independent annotation and analysis of different phenomena as they are easy to “switch off”. Furthermore, the ability of YARN to encode simple interactions between phenomena is explored. The thesis concludes with a discussion of some of the interesting observations made during the development of YARN so far and outlines extensive future plans for this formalism.
In [40], Rémi De Vergnette, Maxime Amblard and Bruno Guillaume present different modular evaluation metrics for Layered Meaning Representation, defined as YARN, a semantic formalism encoded using rich structures that generalize AMR graphs. While existing metrics like SMATCH evaluate graph-based semantic representations such as AMR, they cannot directly handle YARN's more complex structures. The modular nature of YARN is exploited to propose two families of metrics, depending on the linguistic features and type of semantic phenomenon targeted. The first one, SMATCHY, extends the AMR SMATCH metric. The second one, YARNBLEU, is based on the SEMBLEU metric for AMR. Both families are evaluated on a small dataset of human-annotated YARN structures to which random modifications simulating annotation mistakes are added. The evaluation shows that SMATCHY provides a more consistent and reliable approach with respect to the type of modifications considered.
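For readers unfamiliar with SMATCH-style scoring, the following Python sketch computes an F1 score over graph triples under the best mapping between variables. It is a simplified illustration for tiny AMR-like graphs only: it searches mappings exhaustively rather than heuristically, it does not handle YARN layers, and it is neither SMATCHY nor YARNBLEU. The function names and the triple encoding are assumptions.

```python
# Simplified SMATCH-style score: F1 over triples under the best variable
# mapping, found by exhaustive search (only feasible for tiny graphs).
# Encoding: ("instance", var, concept) or (relation, var, var_or_constant).
# Names and encoding are assumptions; this is not SMATCHY or YARNBLEU.
from itertools import permutations


def variables(triples):
    return sorted({t[1] for t in triples if t[0] == "instance"})


def rename(triples, mapping):
    renamed = set()
    for rel, src, tgt in triples:
        if rel != "instance":  # the target of "instance" is a concept
            tgt = mapping.get(tgt, tgt)
        renamed.add((rel, mapping.get(src, src), tgt))
    return renamed


def triple_f1(gold, pred):
    gold_vars, pred_vars = variables(gold), variables(pred)
    # Dummy targets allow some predicted variables to stay unmatched.
    targets = gold_vars + [f"_unmapped{i}" for i in range(len(pred_vars))]
    best = 0
    for image in permutations(targets, len(pred_vars)):
        mapping = dict(zip(pred_vars, image))
        best = max(best, len(rename(pred, mapping) & gold))
    if best == 0:
        return 0.0
    precision, recall = best / len(pred), best / len(gold)
    return 2 * precision * recall / (precision + recall)


# "The cat sleeps" (gold) vs. "The dog sleeps" (prediction).
gold = {("instance", "g0", "sleep-01"), ("instance", "g1", "cat"),
        ("ARG0", "g0", "g1")}
pred = {("instance", "p0", "sleep-01"), ("instance", "p1", "dog"),
        ("ARG0", "p0", "p1")}

score = triple_f1(gold, pred)  # 2 of 3 triples match: F1 = 2/3
```

A metric of this kind is insensitive to variable names, which is why the mapping search is needed before triples can be compared.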
Ivaylo Mitov and Tadzhat Marharian conducted both an M1 internship under the supervision of Bruno Guillaume and Maxime Amblard. Ivaylo Mitov worked on the developement on annotation for AMR and for YARN for other languages and on the production of YARN from AMR, leveraging Universal Dependencies annotations. Tadzhat Marharian started the developement of a new Graphical User Interface for managing YARN annotations.
In 43, Amandine Decker and Maxime Amblard discuss the limits of semantic representation formalisms, in particular when it comes to representing meaning in context and interaction. Detailed representations can be used as a basis for natural language understanding or generation. While these formalisms produce thorough analyses, they do not cover some crucial aspects of real language use. Most semantic representation formalisms like AMR, DRS or UMR operate out of context, which means they ignore a significant part of the content of the utterances they analyse. In this work, the authors discuss various aspects of language use left out by semantic representation formalisms and argue that future work in this field should include extending these formalisms so that they can cover the interactive aspect of language.
7.1.5 Syntax and semantics of questions
Natural language statements are composed not only of declarative sentences but also of interrogative ones. Moreover, sentences cannot be categorized into purely declarative or purely interrogative sentences. Typically, a declarative statement may contain a subordinated interrogative clause:
-
(a)
I don't know where Mary is.
We observe that noun phrases and declarative clauses can sometimes raise alternatives like hidden questions. For example, in a dependence statement like (b), several scenarios are considered (sunny, rainy,..., going to the beach, not going to the beach) and are related to each other implicitly. In 55, a compositional way to derive and link these alternatives is laid out.
-
(b)
Depending on the weather, we might go to the beach.
-
(c)
Ça dépend (de) quel temps il fait.
In French, similar sentences using the verb dépendre can embed an interrogative clause. However, it is unclear what is more standard between keeping the de preposition or removing it in cases like (c). The contribution 54 investigates this grammatical issue by establishing corpus statistics on the frequency of a preposition between a verb and its embedded interrogative clause.
Like indefinites, interrogative words can be referred to by other expressions. For example, she in (d) refers to the person who was sitting there. This kind of anaphora has not been fully considered in anaphora-annotated corpora. The study 53 tries to evaluate this by making an inventory of the (missing) annotations of anaphora with a wh-word in the French corpus ANCOR.
-
(d)
Who was sitting there? She forgot her bag.
7.1.6 Use of semantics
Before the invention of the printing press, texts could only be reproduced through manual copying, a process prone to errors, accidents, and intentional modifications. These changes altered each manuscript and were subsequently propagated by other scribes. For philologists reconstructing text history and genealogical relationships (stemma codicum), analyzing these variants is crucial. Stemmatology methods aim to objectively construct genealogical trees of textual transmission.
At the University of Lorraine, the Écritures laboratory and MSH have focused on uncovering the genealogical lineage of Hebrew manuscripts. A joint project with Maxime Amblard seeks to improve the manual work involved in critical editions of the Hebrew Bible by applying advanced methods from applied mathematics and natural language processing to reconstruct stemmas. With Iglika Zlatkova Nikolova-Stoupak, they design, train and test learning models to automatically tag scribal variants in manuscripts.
The current project 36 falls within the field of stemmatology, that is, the study and/or reconstruction of textual transmission based on the relationship between the available witnesses of given texts. In particular, the variants (differences) at the word level in manuscripts written in Biblical Hebrew are addressed. A dataset based on the Book of Ben Sira is manually annotated for the following variant categories: ‘plus/minus’, ‘inversion’, ‘morphological’, ‘lexical’ or ‘unclassifiable’. A strong classifier (F1 value of 0.80) is then trained to predict these categories in collated (aligned) pairs of witnesses. The classifier is non-neural and makes use of the two words themselves as well as part-of-speech (POS) tags, hand-crafted rules per category, and additional synthetically derived data. Other models experimented with include neural ones based on the state-of-the-art model for Modern Hebrew, DictaBERT. Other features whose relevance is tested are different types of morphological information pertaining to the word pairs and the Levenshtein distance between the words within a pair. The strongest classifier as well as the data used are made publicly available. In addition, the correlation between two sets of morphological labels is investigated: those professionally established as per the QumranDigital online library and those automatically derived with the sub-model DictaBERT-morph.
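As an illustration of one of the features mentioned above, the following Python sketch computes the Levenshtein distance between the two words of a collated pair and bundles it with a few other pair-level features. It is only a sketch under stated assumptions: the `pair_features` function, its feature names and the example words are hypothetical and are not taken from the project's classifier.

```python
from typing import Dict, Union


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def pair_features(word_a: str, word_b: str, pos_a: str, pos_b: str) -> Dict[str, Union[int, float, bool]]:
    """Hypothetical feature set for one aligned pair of witness readings."""
    distance = levenshtein(word_a, word_b)
    longest = max(len(word_a), len(word_b), 1)
    return {
        "edit_distance": distance,
        "normalized_distance": distance / longest,
        "same_pos": pos_a == pos_b,
        # An empty side of the alignment is a natural cue for a 'plus/minus' variant.
        "one_side_empty": (word_a == "") != (word_b == ""),
    }


features = pair_features("kitten", "sitting", "NOUN", "VERB")
```

A small normalized distance between two readings with the same part of speech would be one plausible cue for a morphological rather than a lexical variant, although how such cues are actually weighted is up to the trained classifier.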
Maxime Amblard continues a collaboration with the French company Namkin. With Georgios Zervakis, they developed BEE: A First Assessment of Language Models for Business Event Extraction. Event Extraction (EE) is the task of automatically extracting relevant information about events in text. Business events in particular, such as corporate investments or product launches, can provide enterprises with insight into how to better position themselves in the market with respect to the competition. They benchmark existing EE systems in the business domain. To this end, they introduce BEE (Business Event Extraction), a manually curated corpus for end-to-end business event extraction. Empirical results of four different system architectures demonstrate the challenging nature of BEE, with Large Language Models (LLMs) underperforming compared to smaller models. Finally, they employ complementary evaluation metrics to understand the types of errors and reveal significant performance gains.
While modern semantic representations may contain vast quantities of information, they do not always (or necessarily) contain the information that is useful for the concrete application. For instance, significant challenges still persist in dealing with temporal relations and fine-grained negation interpretation.
Recent research has looked into the benefits of exploiting semantic representations, and in particular Abstract Meaning Representation, for low-resource scenarios and document-level event argument extraction. However, it appears that AMR has to be adapted in order to optimally support event extraction related tasks 95. One major limitation of AMR for document-level event extraction is that AMR works at the sentence level, and thus requires the aggregation of sentence-level representations. AMR is also limited in its expressive power for negation and universal quantification.
7.2 Distributional Semantics and Lexical Structures
Participants: Sylvain Pogodalla.
Numerical and continuous representations of word semantics, in particular vector representations, and neural learning techniques have given rise to impressive results on a large number of natural language processing tasks. These representations, or embeddings, rely on the distributional hypothesis 79, 71: the meaning of a word is provided by the linguistic context in which it occurs, and semantically related words should be represented by related embeddings.
However, the very nature of semantic relatedness encoded in embeddings remains somewhat unspecified, and can express different relations as classified by linguists (e.g., synonymy, hyponymy, etc.) 90, 81, and may even depend on the chosen methods to compute the vector similarity 88, the size of the context or its type 94, 89.
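The dependence on the similarity measure can be seen on a toy case. The Python sketch below ranks the same two candidate vectors against a target using the dot product and then cosine similarity; the vectors and names are invented for illustration and do not come from any actual embedding model.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Unnormalized similarity: sensitive to vector length."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u: Vector, v: Vector) -> float:
    """Length-normalized similarity: depends only on the angle."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def rank_neighbours(
    target: Vector,
    candidates: Dict[str, Vector],
    measure: Callable[[Vector, Vector], float],
) -> List[str]:
    """Candidate names sorted from most to least similar to the target."""
    return sorted(candidates, key=lambda name: measure(target, candidates[name]), reverse=True)


# Invented two-dimensional vectors: one long but off-axis, one short but well aligned.
target = [1.0, 0.0]
candidates = {"long_vector": [3.0, 3.0], "aligned_vector": [1.0, 0.1]}

by_dot = rank_neighbours(target, candidates, dot)
by_cosine = rank_neighbours(target, candidates, cosine)
```

Here the dot product favours the long vector while cosine similarity favours the well-aligned one, so the "nearest neighbour" of a word, and hence the relation the embedding appears to encode, can change with the measure.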
We have been studying the vector representations provided by transformer and attention models 93 and comparing them with linguistic knowledge as expressed by linguists. We rely more precisely on the theory of combinatorial explanatory lexicology, the lexicological part of the Meaning-Text Theory (Mel'čuk and Polguère 2016, 2021), which hinges upon collocations to structure lexical knowledge as graphs. This theory provides a fine-grained description of lexical relations against which numerical models can be compared, as well as lexical resources (a lexicon for French 85, 65 and annotated examples 64). We focus on lexical structure, whereas previous work focused rather on morphosyntactic information and syntactic structures 82, 86. Data construction and statistical analysis are being performed and a publication is in preparation.
7.3 Discourse Dynamics
Participants: Maxime Amblard, Philippe de Groote, Amandine Decker, Jacques Jayez, Michel Musiol, Emeric Licorni, Ines Hernandez.
7.3.1 Dialogue Modeling
Dialogue encompasses a vast diversity of interactional forms which grows with technological and societal evolutions, such as the generalisation of video-mediated communication following the COVID-19 pandemic. As dialogue data becomes increasingly heterogeneous, modelling dialogue requires not only algorithmic advances but also a precise characterisation of the data on which these models are developed and evaluated. In order to better understand current practices in the field, Amandine Decker, Maxime Amblard and Ellen Breitholtz (Gothenburg University, Sweden) conducted a meta-review of papers on dialogue published in the ACL Anthology in 2024 30. This analysis provides an empirical overview of how dialogue data is actually described and used by the community. One of the main observations is that dialogue data is increasingly treated primarily as a resource for model training, rather than as an object of analysis in its own right. As a consequence, research overwhelmingly focuses on English-language datasets, with a strong preference for clean, high-quality dialogues, and often overlooks distinctions between task-oriented and open-domain interactions. These practices make it difficult to establish a principled framework for selecting appropriate dialogue resources for a given task and limit our ability to assess the scope and generalisability of reported results. This line of work aims to contribute to a more explicit reflection on dialogue data and its role in dialogue modelling research.
This work is complemented by ongoing research by Amandine Decker, Maxime Amblard and Ellen Breitholtz on topical structure analysis through the collection of a corpus of chat-based interactions in both English and French. The objective is to develop a resource specifically designed to support the study of topical organisation in dialogue, with a particular focus on how participants interpret and accommodate potentially incoherent contributions during interaction.
7.3.2 Discourse Markers
Jacques Jayez continues working with Mathilde Dargnat (ATILF), Paola Herreño (Ph.D. candidate ATILF-LLF) and Maeva Sillaire (Ph.D. candidate ATILF) on the semantic representation of D(iscourse) M(arkers). DMs are words/expressions like so or well in English which help structure discourse or communicate speakers' internal epistemic or affective states as well as interactional moves. The discourse structuring function is the hallmark of connective DMs, which correspond to a large variety of discourse relations (causal, explanatory, concessive, temporal, etc.). Other functions are realized by discourse particles, which can express for instance surprise, attention modification or various interactional moves (backchannels, calls to attention, etc.) 80.
Investigating the semantic profile of DMs is developed through three distinct but not quite independent subtasks. (1) Characterizing what DMs index (refer to, denote, etc). The domain-based approach initiated in the 90s consists in defining different types (aka domains) of semantic objects, like states of affairs, beliefs or speech acts. Domains are instrumental in teasing apart subclasses of connective DMs 84. Discourse particles index internal states of speakers or interactional operations 69. (2) The second subtask consists in determining what the semantic contribution of a DM is (propositional content, presupposition, conventional implicature). The semantic contribution aspect interacts with the indexing behaviour of DMs for connectives 84 and, moreover, in the case of particles, raises the question of the semantic analysis of `side effects' in terms of monads, as exemplified by Asudeh and Giorgolo 66 a.o. (3) The intuitions about the lexical meaning of DMs are notoriously difficult to substantiate, in particular for particles. We are currently studying how different types of intuitions can be coded in the declarative format of Dialogue Game Boards of 72. Points (2) and (3) converge toward the problem of defining an ontology which extends that of Ginzburg by including commitment, intentions and side-effects, in order to take into account the distinctions introduced in 92.
In the context of the CODIM ANR project, we have designed a workflow for annotating the DMs in our set of French spoken and written corpora and analysing the statistical properties of DM sequences. Given the overall poor performance of LLMs, we have kept the finite automata approach previously developed in CODIM, constructing a final cascade of 622 automata with the help of the Unitex-Gramlab software. The cascade extracts 900 DM types from the corpora for a total of 8195046 DM occurrences. The annotation results are normalized and passed to a set of 10 association measure functions, which estimate the strength of association between any two juxtaposed DMs in the corpora. The resulting vectors are scaled and compared by various distance estimators, in order to create a hierarchy of association for any two DMs sharing a common associate, for instance alors and bon with respect to mais in the pairs mais alors and mais bon.
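As a rough illustration of what an association measure between two juxtaposed DMs looks like, the following Python sketch computes pointwise mutual information (PMI) over adjacent pairs. PMI is chosen here only as a familiar example: the report does not list the 10 measures actually used, and the `adjacent_pair_pmi` function and the toy DM sequences are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Pair = Tuple[str, str]


def adjacent_pair_pmi(sequences: Iterable[Sequence[str]]) -> Dict[Pair, float]:
    """PMI (base 2) for each observed pair of adjacent DMs.

    Probabilities are estimated over adjacent pairs only:
    p(a, b) is the relative frequency of the pair, p(a) that of `a`
    in first position, and p(b) that of `b` in second position.
    """
    pair_counts: Counter = Counter()
    first_counts: Counter = Counter()
    second_counts: Counter = Counter()
    for sequence in sequences:
        for first, second in zip(sequence, sequence[1:]):
            pair_counts[(first, second)] += 1
            first_counts[first] += 1
            second_counts[second] += 1
    total = sum(pair_counts.values())
    return {
        (first, second): math.log2(
            (count / total) / ((first_counts[first] / total) * (second_counts[second] / total))
        )
        for (first, second), count in pair_counts.items()
    }


# Invented DM sequences, as they might come out of an annotation step.
toy_sequences = [
    ["mais", "alors"],
    ["mais", "bon"],
    ["mais", "alors"],
    ["bon", "alors"],
]
pmi = adjacent_pair_pmi(toy_sequences)
```

Collecting the values of several such measures for a pair like mais alors yields one vector per pair, which is the kind of object that can then be scaled and compared with distance estimators.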
Jacques Jayez has refined his work on the argumentative dimension of discourse, and the last version of his submission for a book on implicit manipulation has been accepted by de Gruyter 83.
7.3.3 Pathological Discourse Modeling
Also based on interviews between psychologists and schizophrenia patients, we began a study on the alignment between discourse descriptors and speech characteristics, in order to uncover potential links between what is said (discourse) and how it is said (speech characteristics). To do so, Vincent Martin supervised two M1 students (speech pathologists) who worked on pause characteristics across the different discourse structures; he then supervised two other interns (Zsofia Hauwk, M1, and Maé Dugoua-Jacques, L3) who worked on automating the diarization (speaker separation) and transcription of these interviews. The low audio quality has been a significant challenge, which we are still trying to resolve at the time of writing this report.
Vincent Martin also proposed a new framework for analysing speech acoustic quality using network analyses of acoustic descriptors 18. This framework obtained relevant results on the SpeechWellness challenge, which addresses suicidality in adolescents using only speech; the results were presented at Interspeech 2025 37.
In parallel with this work, Vincent Martin pursued his work about refining sleep 13, 11, 12, 14 and psychiatric semiology 15, in order to improve the accuracy of digital psychiatry devices by refining their targets.
Michel Musiol has conducted theoretical, formal and empirical research in semantics and conversation analysis in order to relate the linguistic, cognitive and psycholinguistic aspects of semantic representations as they appear in discourse. For instance, with Maxime Amblard, we build a formal, computational and dynamic model likely to reveal the properties of pathological discourse, based on the modeling of violations of coherence. Accordingly, empirical studies were based on clinical interviews between psychologists or psychiatrists and schizophrenic patients 19 or between psychologists or psychiatrists and bipolar patients 10. In the first paper, our dialogue analysis model supplements existing methods, which often suffer from being ad hoc, lacking compatibility with manual analysis, or failing to produce variables that align with computational or algebraic analysis. In the second paper, we show that cognitive and conversational properties measured with clinical assessment or discourse analysis have led to the formulation of a hypothesis suggesting that the two pathologies might be situated on a continuum. We examined the hypothesis of such a continuum in the context of the pragmatic discontinuities that occur in dialogue between a psychologist and either a schizophrenic or a bipolar patient. Furthermore, the aim is to delineate the cognitive and psycholinguistic impairments observed in the schizophrenic group in comparison to the bipolar group.
This program is intended to subsequently propose computerized tools for diagnosis assistance, screening of people at risk, as well as psychotherapeutic and therapeutic evaluation and follow-up 56. For instance, we have investigated the socio-behavioral dynamics of Shwachman-Diamond Syndrome, focusing on how children with the condition navigate cooperative interactions. Using computational pragmatics, we aimed to identify the underlying principles guiding their social behavior 20.
In line with last year's project, Michel Musiol and Maxime Amblard continue to work on the characterisation of pathological discourse. With Arthur Trognon, they published a book chapter.
For the PhD work of Vincent-Thomas Barrouillet, in 10 they compare two matched clinical interview corpora, conducted with bipolar patients and, under the same conditions, with schizophrenic patients. The interview is non-directive, which encourages the patient to speak freely. Both corpora contain the same number of words. They conduct an exhaustive search for "breaks" using an investigative model of discursive disorganization that is sensitive to the linguistic and illocutionary properties of speech acts. These "breaks" are then formally analyzed using hierarchical modeling, which reveals the defective relationships between speech acts in the dynamic structuring of conversational sequences. They conclude that hierarchical and dynamic discourse analysis methodology is a valuable tool for identifying certain bipolar disorders as well as for recognizing schizophrenic symptoms. It also makes it possible to clarify the psycholinguistic processes associated with the expression of bipolar and schizophrenic disorders in verbal interaction. Finally, it contributes to the hypothesis of a continuum between schizophrenia and bipolar disorder, supporting the high-level cognitive processes that underpin discursive competence.
7.4 Common Basic Resources
Participants: Maxime Amblard, Hee-Soo Choi, Philippe de Groote, Bruno Guillaume, Sylvain Pogodalla, Karën Fort.
7.4.1 Universal Dependencies and Surface Syntactic Universal Dependencies
The Universal Dependencies (UD) project aims to build a syntactic dependency scheme that enables similar analyses of several different languages. Bruno Guillaume is an active member of the UD community and contributes to the development and the improvement of the French data within this international initiative.
In 2025, he continued to work in collaboration with Sylvain Kahane, Kim Gerdes and their teams to promote the Surface Syntactic Universal Dependencies (SUD) framework. SUD is an annotation scheme for syntactic dependency treebanks that is almost isomorphic to UD (Universal Dependencies). Unlike UD, it is based on syntactic criteria (favouring functional heads) and the relations are defined on distributional and functional bases.
This work is mainly conducted in the ANR project Autogramm (Induction of descriptive grammar from annotated corpora), which started in January 2022. The project aims to automate, as far as possible, the extraction of descriptive grammars and grammatical descriptions from annotated corpora for linguistic and typological studies. The project also promotes the development of treebanks for low-resourced languages, in order to extract quantitative descriptive grammars for these languages.
In 38, the authors present a new format of the Rhapsodie Treebank, which contains both syntactic and prosodic annotations. This provides a comprehensive dataset for the study of spoken French. This integrated format enables complex, multilevel queries and paves the way for intonosyntactic studies.
In 34, the authors proposed a study of the different statuses of the morphosyntactic features used in UD treebanks. While most of these features correspond to values of inflectional morphemes, some describe lexical subclasses or are just conventional names of (polysemic) morphemes. Syncretism is also a challenge, because exact values can only be deduced from contextual information. An attempt at clarification and an implementation in written and spoken French treebanks are then proposed.
Bruno Guillaume, in collaboration with Santiago Herrera, Ioana-Madalina Silai, Caio Corro and Sylvain Kahane 33, has developed a data-driven contrastive framework to extract common and distinctive linguistic descriptions from syntactic treebanks. The extracted contrastive rules are defined by a statistically significant difference in frequency and precision, and classified as common or distinctive rules across the set of treebanks. The method is illustrated by working on object word order using Universal Dependencies (UD) treebanks in 6 Romance languages: Brazilian Portuguese, Catalan, French, Italian, Romanian and Spanish. The paper discusses the limitations faced due to inconsistent annotation and the feasibility of conducting contrastive studies using the UD collection.
During his M2 internship, Luc Cheng applied the methodology used for contrastive studies to corpus correction. This study was conducted using written and spoken French, as well as two English corpora.
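To give a feel for what "a statistically significant difference in frequency" between two treebanks can mean in practice, here is a minimal Python sketch using a chi-square test on a 2x2 contingency table. The choice of test, the `chi_square_2x2` function, the 0.05 threshold and the counts are all assumptions made for illustration; the paper's actual statistical procedure is not described in this report.

```python
import math
from typing import Tuple


def chi_square_2x2(hits_a: int, total_a: int, hits_b: int, total_b: int) -> Tuple[float, float]:
    """Chi-square statistic and p-value (1 degree of freedom, no continuity
    correction) for the difference between two proportions.

    `hits_x` counts the contexts of treebank x where the pattern applies,
    `total_x` counts all candidate contexts in that treebank.
    """
    a, b = hits_a, total_a - hits_a
    c, d = hits_b, total_b - hits_b
    n = total_a + total_b
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # With 1 degree of freedom, the survival function reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def is_distinctive(hits_a: int, total_a: int, hits_b: int, total_b: int, alpha: float = 0.05) -> bool:
    """True when the two frequencies differ significantly at level `alpha`."""
    return chi_square_2x2(hits_a, total_a, hits_b, total_b)[1] < alpha


# Invented counts: a word-order pattern found in 30 of 100 contexts in one
# treebank and in 10 of 100 contexts in another.
chi2, p_value = chi_square_2x2(30, 100, 10, 100)
```

With many candidate rules and several treebanks, a real study would also need to control for multiple comparisons, which this sketch does not attempt.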
In 2025, two new versions of Universal Dependencies were released. Bruno Guillaume collaborated with field linguists to produce or improve Surface Syntactic Universal Dependencies treebanks and to convert them to Universal Dependencies:
- Version 2.16 in May:
- Version 2.17 in November:
- enhanced UD treebank for Old Egyptian (with Roberto Antonio Díaz Hernández)
- new treebank for Western Hausa (with Bernard Caron)
In April and May 2025, Roberto Antonio Díaz Hernández undertook a three-week visit to the LORIA, funded by a Short-Term Scientific Mission of the UniDive COST action. He collaborated with Bruno Guillaume to build a Grew-match instance dedicated to the annotations of the Ancient Egyptian hieroglyphic text from the pyramids: GrewPT.
In May 2025, Bruno Guillaume made a two-week visit to the University of Bologna (funded by a Short-Term Scientific Mission of the UniDive COST action). He collaborated with Ludovica Pannitto on a survey of the annotation of spoken data in the Universal Dependencies project.
In 35, Nikolett Mus, in collaboration with Bruno Guillaume, Sylvain Kahane and Daniel Zeman, presents the development of the Tundra Nenets Universal Dependencies (UD) Treebank, the first syntactically annotated resource for the Samoyedic branch of the Uralic family. The treebank integrates spoken-language data and adopts the morphologically enhanced Surface-Syntactic UD (mSUD) framework to capture inflectional morphology and morphology-based syntactic relations. It further incorporates Information Structure annotation. The methodological workflow includes data selection, transcription conventions, sentence and lexeme segmentation, annotation of spoken-language features, lemmatization, treatment of morpheme status, part-of-speech and morphological tagging, and syntactic annotation based on the functional and distributional properties of syntactic elements. The paper also outlines the principles guiding multilevel annotation and justifies the theoretical choices underlying the integration of prosodic, morphological, and syntactic information.
The work on the Gbaya treebank was published in 39. The paper presents the first treebank for Gbaya, a language from the under-resourced Niger-Congo family. The language has a rich system of tonal morphemes and virtually no affixes. The dependency analysis is based on a morpheme-based tokenisation and the treebank is also distributed in a word-based Universal Dependencies version. Several constructions are discussed in the paper: genitive construction, clause coordination, sentence particles, adverbial and relative clauses, serial verb constructions, reported speech, topicalization, and focalization.
7.4.2 Citizen Science
Karën Fort worked with colleagues from Sorbonne on guidelines to develop citizen science projects. These guidelines were finally published in a journal article 21 and at a TALN workshop 48.
7.4.3 Synthetic clinical texts generation
In the context of the CODEINE ANR project and more specifically of Nicolas Hiebel's PhD thesis, Karën Fort worked with Aurélie Névéol (LISN-CNRS) and Olivier Ferret (CEA) on the generation of synthetic clinical texts.
The key idea of the project is to use confidential corpora to automatically generate anonymous synthetic texts capable of emulating real documents from the perspective of their linguistic characteristics. Nicolas Hiebel worked on a state of the art of clinical texts generation that has been published in a journal 16.
Another part of the project consists in using a Game With A Purpose to validate and then annotate the synthesized clinical texts. This game, developed by Bertrand Remy, is called HostoMytho (see Section 6.1.3), and includes various mini-games for different annotation layers, such as negation, error typing, or plausibility rating. The game is multi-platform, and therefore intended to be used on the web (see: online HostoMytho), on Android and on iOS.
7.5 Ethics and biases
Participants: Karën Fort, Maxime Amblard, Michel Musiol, Marc Anderson, Fanny Ducel, Clémentine Bleuze.
7.5.1 Ethics dissemination in scientific communities
Karën Fort and Fanny Ducel, together with other members of the ACL Ethics committee and student volunteers to the committee, participated in the creation, organization, and presentation of a tutorial on ethical challenges in NLP, which took place at the ACL conference in July 2025 61 and attracted around 40 attendees.
Fanny Ducel, under the supervision of Karën Fort and Aurélie Névéol, authored a long abstract on the role that applied linguistics could play to aim at ethical NLP research, calling for more interdisciplinarity. This work was presented in French at NÉALA, a national applied linguistics conference 51.
7.5.2 Evaluating stereotypes in autoregressive language models
Fanny Ducel, under the supervision of Karën Fort and Aurélie Névéol, and in collaboration with Nicolas Hiebel, measured gender stereotypical biases in LLM-generated clinical cases, in French. This work has been presented and published at NAACL in English 31, and its translated French version at TALN 45.
Jeffrey André, under the supervision of Fanny Ducel, Karën Fort and Aurélie Névéol, designed a web interface (Masculead) that allows users to contribute to an interactive leaderboard, which is based on the previously published framework for gender bias detection 70. This interface, as well as arguments on the notion and flaws of leaderboards for language models, were presented at the "Ethic and Alignment of (large) Language Models" workshop, at TALN 44.
7.5.3 Biases in the biomedical domain
Karën Fort is PI of a 4-year ANR project (2023-2027), InExtenso (Intrinsic and Extrinsic evaluation of biases in large language models), in collaboration with Rouen's hospital (CHU) and LISN-CNRS. The project aims at better identifying stereotyped biases in LLMs in French and, when possible, mitigating them. Within the framework of this project, Clémentine Bleuze supervised the internship of M2 student Hawawou Oumarou-Tchapchet, along with partners from Rouen's hospital. This internship aimed at evaluating the socio-demographic biases of a French LLM in a medical classification task.
Under the supervision of Karën Fort and Aurélie Névéol, and in collaboration with Vincent Martin, Clémentine Bleuze conducted a literature review on the subject of LLM-assisted mental health prediction tasks, which has been submitted to the Journal of Medical Internet Research (JMIR).
7.5.4 NLP for NLP and Ethics
Clémentine Bleuze continued the work initiated during her M2 internship in collaboration with Fanny Ducel and under the supervision of Karën Fort and Maxime Amblard. This work explored the notion of scientific overclaiming (when researchers inadequately interpret or present elements of their research) in NLP papers. It also led to the definition of a taxonomy of relevant research claims, the constitution of a corpus of NLP claims originating from ArXiv and ACL papers (a subpart of which has been human-annotated), and the training of BERT-based models to predict claim types. This research, along with new results about typical claim patterns used in research papers, was presented at TALN 2025 as a poster 41.
Karën Fort and Vincent Martin conducted two automatic lexical analyses of the words censored by the Trump administration in the scientific literature, related to mental health 47 and sleep health 17 respectively. The results of these studies, combining lexical networks and temporal analyses, demonstrate that it is impossible to produce scientific data (and consequently to produce global health policies based on these missing data) without the vocabulary censored by the Trump administration.
8 Bilateral contracts and grants with industry
8.1 Bilateral contracts with industry
Maxime Amblard continues a collaboration with the French company Namkin. The industry faces numerous challenges that necessitate the evolution of BtoB marketing tools, in order to develop a valuable offer and provide an enhanced customer experience. Namkin's BrainLab develops industrial marketing tools for digitalizing customer relations, evolving business models, and exploiting business and economic data for business development. One of the key challenges of marketing intelligence is to identify risks and opportunities so as to guide marketing strategies. Among the sources of information useful to detect risks and opportunities, Namkin has identified Business Events, that is, “textually reported real-world occurrences, actions, relations, and situations involving companies and firms”. A postdoctoral researcher at Namkin, Georgios Zervakis, and an engineer, Sullivan Benard, took part in the collaboration.
9 Partnerships and cooperations
9.1 International research visitors
9.1.1 Visits of international scientists
Casey Kennington
-
Status
Researcher
-
Institution of origin:
Boise State University
-
Country:
USA
-
Dates:
25-29 March 2025
-
Context of the visit:
invitation to give a seminar
-
Mobility program/type of mobility:
research stay
Aarne Ranta
-
Status
Professor
-
Institution of origin:
University of Gothenburg
-
Country:
Sweden
-
Dates:
22-25 July 2025
-
Context of the visit:
Collaboration in the context of the Malinca project
-
Mobility program/type of mobility:
Invitation
Roberto Antonio Díaz Hernández
-
Status
Researcher
-
Institution of origin:
Universidad de Jaén
-
Country:
Spain
-
Dates:
28 April - 16 May 2025
-
Context of the visit:
development of NLP tools for Old Egyptian
-
Mobility program/type of mobility:
Short Term Scientific Mission (STSM) funded by UniDive
9.2 European initiatives
9.2.1 Horizon Europe
MALINCA
Participants: Philippe de Groote.
MALINCA project on cordis.europa.eu
- Title: Mathematicae Lingua Franca: Bridging the Linguistic Gap Between the Mathematician and the Machine
- Duration: From March 1, 2025 to February 28, 2031
- Partners:
  - Institut National de Recherche en Informatique et Automatique (Inria), France
  - Universidad Pontificia Comillas (Comillas), Spain
  - Université Paris Cité (UPCité), France
  - Centre National de la Recherche Scientifique (CNRS), France
- Inria contact: Hugo Herbelin
- Summary: In recent years, proof assistants have shown their astounding ability to tackle the complete formalisation of large pieces of mathematics, with the celebrated certifications of the Feit-Thompson theorem, of the Kepler conjecture, and more recently, the resolution of Scholze's liquid tensor challenge. We believe that the time is ripe to demonstrate that they can tackle mathematics in the flexible and semi-formal way it is created and exchanged by mathematicians. To that purpose, we aim to develop proof assistant technologies of an entirely new nature, including a formal language and a foundational approach to mathematical meaning, with the versatility necessary to represent the dynamic linguistic structures found in the daily practice of mathematics. The result will be a linguistic front-end that will allow mathematicians, and scientists in general, to express their proofs and computations in proof assistants in the semi-formal way they think of them. Three research tracks stand out: the mathematical and linguistic foundations; formalisation of real-world vernacular mathematics into a high-level language of representation (Godement challenge); new techniques and software tools, based on natural language processing, to automate the formalisation process. The translation of semi-formal mathematics into the machine needs to go beyond the traditional view that reduces reasoning to logic, and requires understanding the dynamics of the discursive linguistic process which underlies mathematics. Building on advances in linguistics, mathematical logic, programming language semantics and machine learning, we will contribute significantly to the rise of a new generation of proof assistants, integrating at their heart a linguistic layer and automated guidance tools for mathematical proofs, theorems and definitions. The resulting high-level manipulation of concepts will lead to novel research outcomes supporting the daily activity of mathematical scientists.
9.2.2 Other European programs/initiatives
- Bruno Guillaume is a member of the core group of the COST action: CA21167 - Universality, diversity and idiosyncrasy in language technology (UniDive). He is the leader of the working group named "Corpus Annotation".
9.3 National initiatives
9.3.1 ANR Project: InExtenso
Participants: Karën Fort, Maxime Amblard, Michel Musiol, Fanny Ducel.
- Title: Intrinsic and Extrinsic evaluation of biases in large language models
- Duration: 10 2023–09 2027
- Coordinator: Karën Fort
- Partners: CHU Rouen, LISN, LORIA
- Participants: Maxime Amblard, Fanny Ducel, Karën Fort (coordinator), Michel Musiol, Miguel Couceiro
- Abstract: Large Language Models (LLMs) are the Swiss Army knife of today’s Natural Language Processing (NLP). They often outperform the state-of-the-art on benchmarks commonly used in the field for tasks such as part-of-speech tagging, text classification and named-entity recognition, thus paving the way to a myriad of end-user applications. However, it has been shown that LLMs exhibit major ethical issues including significant environmental impact, mirroring and amplification of stereotyped biases, which in turn have a disproportionate impact on historically disadvantaged social groups. It is urgent to address the social impact of NLP as the applications we develop, such as ChatGPT, are now directly made available to end users. The detection and mitigation of biases have therefore become an active area of research in the past few years, focusing mainly on Masked Language Models (MLMs) such as BERT in English and the North American social context. Several sources of bias were identified in the NLP pipeline. However, the interconnection between sources and the overall impact of each source on downstream applications remain unclear. In this project, we want to observe the entire pipeline, from the intrinsic point of view (within the model itself), to the pre-training task point of view (in the case of autoregressive LLMs, text generation), on to some real-world downstream applications. We chose to focus on two types of medical applications: mental illness diagnosis help and information extraction from clinical records for public health purposes such as patient enrollment into clinical trials. The project will provide corpora and methods for a global evaluation of bias in LLMs in French as well as studies to further the understanding of biases in clinical NLP pipelines and the environmental impact of the integration of these models in digital health.
9.3.2 ANR Project: CoDeinE
Participants: Karën Fort, Bruno Guillaume, Bertrand Remy.
- Title: artificial text COrpus DEsIgNed Ethically automatic synthesis of clinical documents
- Duration: 03 2021–02 2026
- Coordinator: Aurélie Névéol (Limsi)
- Partners: CRC, CEA List, LISN, LORIA
- Participants: Bruno Guillaume, Karën Fort (local coordinator), Bertrand Remy
- Abstract: Machine learning methods have become prevalent in language technologies. They rely on annotated corpora to train models and evaluate algorithms. The CoDeinE project proposes to address the lack of shareable corpora in sensitive domains such as health or banking. The key idea of the project is to use confidential corpora to automatically generate synthetic texts that mimic the linguistic properties of real documents while preserving confidentiality. We will use clinical documents in electronic patient records as a case study. Furthermore, the project will rely on Games With A Purpose and crowdsourcing to validate and annotate the synthesized texts.
9.3.3 ANR Project: Autogramm
Participants: Bruno Guillaume, Karën Fort, Khensa Amani Daoudi.
- Title: Induction of descriptive grammar from annotated corpora
- Duration: 01 2022–12 2025
- Coordinator: Sylvain Kahane (Université Paris Nanterre)
- Partners: MoDyCo, LACITO, LISN, Inria Nancy – Grand Est
- Participants: Bruno Guillaume (local coordinator), Karën Fort
- Abstract: The goal of this project is to automate, as far as possible, the extraction of descriptive grammars and grammatical descriptions from annotated corpora for linguistic and typological studies. The project also promotes the development of treebanks for under-endowed languages, in order to extract quantitative descriptive grammars for these languages. The project uses the annotation scheme SUD (Surface-syntactic Universal Dependencies), the query tool Grew-match and the annotation tool ArboratorGrew.
9.3.4 ANR Project: CODIM
Participants: Maxime Amblard, Jacques Jayez.
- Title: Compositionality and discourse markers
- Duration: 01 2023–12 2026
- Coordinator: Mathilde Dargnat (Université de Lorraine and ATILF)
- Partners: ATILF, LLF, LORIA
- Participants: Maxime Amblard, Jacques Jayez
- Abstract: The CODIM project focuses on the two main linguistic resources for organizing monologues or conversations in human languages: D(iscourse) M(arkers) (therefore/donc, well/ben, bon, etc. in English/French) and prosody (in particular, intonation). It will evaluate their status with respect to two major views on communication: compositionality (the possibility of combining meaningful expressions into more complex meaningful expressions) and pattern or construction-based approaches (the idea that language users exploit partly ‘frozen’ strings of words). We will compare the semantic and prosodic properties of simple and complex French DMs (e.g. ah + bon) found in corpora for written and spoken French, using a variety of technical tools for DM identification (category-driven text mining), clustering (statistics and Machine Learning) and research in prosody (duration and intensity measures, contour representation). The project fosters a number of collaborations between linguists and computer scientists.
9.3.5 PEPR Project Digital Health: Autonom Health
Participants: Maxime Amblard, Michel Musiol, Vincent Martin.
- Title: Autonom Health
- Duration: 06 2023–12 2030
- Coordinator: Pierre Philip (Université de Bordeaux)
- Partners: LABRI, Sanpsy, LORIA, ISIR, CES, LIRIS
- Participants: Maxime Amblard, Michel Musiol, Vincent Martin
- Abstract: Western populations face an increase in longevity, which mechanically increases the number of chronic disease patients to manage. Current healthcare strategies will not make it possible to maintain a high level of care at a controlled cost in the future, and e-health can optimize the management and costs of our health care systems. Healthy behaviors contribute to the prevention and optimized management of chronic diseases, but their implementation is still a major challenge. Digital technologies could help their implementation through digital behavioral medicine programs, to be developed as a complement to (and not a substitute for) existing care, in order to focus human interventions on the most severe cases demanding medical interventions.
10 Dissemination
10.1 Promoting scientific activities
10.1.1 Scientific events: organisation
- Vincent Martin moderated a session of the Société Médico-Psychologique entitled “La psychiatrie à ses frontières”, 09 2025, Bordeaux, France.
- Vincent Tourneur organized the Loria PhD seminar (8 presentations during the year).
General chair, scientific chair
- Sylvain Pogodalla: scientific co-chair of the 16th International Conference on Computational Semantics, 09 22–23, 2025, Düsseldorf, Germany, 57.
- Karën Fort: Ethics co-chair of the ACL 2025 conference.
10.1.2 Scientific events: selection
Chair of conference program committees
- Maxime Amblard: chair of the workshop 4AS (Atelier sur les Avancées en AMR et en Analyse Sémantiques), co-located with TALN 2025.
- Sylvain Pogodalla: co-chair of the program committee for the journées Impact de la science ouverte sur la recherche et les pratiques scientifiques, 01 27–29, 2026, Nancy, France.
Member of the conference program committees
- Vincent Martin: member of the program committee for the Journée d’étude sur les technologies linguistiques pour les langues peu dotées (AFIA/AFCP), 12 2025, Paris, France.
Reviewer
- Philippe de Groote: reviewer for SCiL 2025, MOL 2025, IWCS 2025.
- Iglika Zlatkova Nikolova-Stoupak: reviewer for the 2nd UniDive training school, 01 2026, Yerevan, Armenia.
10.1.3 Journal
Member of the editorial boards
- Maxime Amblard: editor in chief of the Revue Traitement Automatique des Langues.
- Sylvain Pogodalla: Member of the editorial board of the journal Traitement Automatique des Langues, in charge of the Résumés de thèses section.
- Philippe de Groote: Area editor of the FoLLI-LNCS series.
Reviewer - reviewing activities
- Maxime Amblard: reviewer for the conferences ACL, COLM, ECAI, IWCS, LREC and TAL; for the workshops ISA-21, LARP and Lexique; and for the journal Mathematical Structures in Computer Science.
- Philippe de Groote: reviewer for the journal Logical Methods in Computer Science.
- Karën Fort: reviewer for ACL 2025 and ACM FAccT 2025.
- Vincent Martin: reviewer for Interspeech, ICASSP and the Journal of Medical Internet Research.
- Sylvain Pogodalla: reviewer for the Journal of Language Modelling.
- Amandine Decker: reviewer for SemDial and sub-reviewer for ECAI.
10.1.4 Invited talks
Philippe de Groote gave an invited talk at the Conference on Mathematical and Computational Linguistics for Proofs 26.
Karën Fort was invited to give a keynote speech at the Italian NLP conference CLiC-it in Sept. 2025 on the subject of "Large Language Models: the challenge of evaluation" 22.
Karën Fort was invited to give a keynote speech at the Association française de linguistique appliquée (AFLA) conference: Naturel et Artificiel en Linguistique Appliquée : une époque de paradoxes – Neala25, in Nancy, in July 2025, on the subject of "Les grands modèles de langue : des outils situés".
Karën Fort was invited to give a speech at the Conseil Scientifique of the Institut CNRS in Computer Science, in Paris, in March 2025, on the subject of "Les grands modèles de langue : les défis de l'évaluation." 23.
Fanny Ducel was invited to give a presentation about her research on stereotypical biases in LLMs to the work group "Intelligence Artificielle Soutenable, Intelligible et Vérifiable" of Université Paris-Saclay.
Vincent Martin was invited to give a talk at the French National Sleep Medicine Congress (Congrès du Sommeil), "Enjeux des modélisations pour aborder la sémiologie du sommeil", 11 2025, Strasbourg.
10.1.5 Leadership within the scientific community
- Maxime Amblard is PI of INSIGHT project (Initiative d'Excellence Lorraine - PIA).
- Vincent Martin has been a member of the steering committee of the Collège Technologies du Langage Humain (TLH) of the Association française pour l’Intelligence Artificielle (AfIA) since 09 2024.
- Karën Fort is PI of the GDR LIFT 2.
10.1.6 Scientific expertise
- Vincent Martin: member of the “AI, Health and Biology” evaluation committee of the French National Research Agency (ANR, TSIA call for projects).
- Sylvain Pogodalla: evaluation for the Inria Quadrant Programme, evaluation for the ANR generic call for proposals 2025.
10.1.7 Research administration
- Maxime Amblard:
- Member of CNU 27 (Computer Science)
- Head of the master in Natural Language Processing
- Karën Fort:
- Elected member of the Conseil de Pôle AM2I
- Chair of the Ethics committee of the ENACT AI cluster
- Member of the Steering Committee of the INSIGHT project
- Sylvain Pogodalla:
- Elected member of the comité de centre Inria Nancy – Grand Est.
- In charge of the local commission IES (information et édition scientifique) of Inria Nancy – Grand Est and LORIA.
- Member of the national commission IES of Inria.
10.2 Teaching - Supervision - Juries - Educational and pedagogical outreach
10.2.1 Teaching
- Licence:
- Maxime Amblard, AI Introduction, 14h, L1, Université de Lorraine, France.
- Maxime Amblard, Ethical aspects of NLP, 10h, L3, Université de Lorraine, France.
- Maxime Amblard, Human in the loop, 10h, L3, Université de Lorraine, France.
- Karën Fort, De l'écrit à l'information, 20h, L1 MIASHS, IDMC, Université de Lorraine, France.
- Karën Fort, Outils pour l'analyse linguistique, 25h, L3 MIASHS, IDMC, Université de Lorraine, France.
- Hee-Soo Choi and Fanny Ducel, De l'écrit à l'information, 5h, L1 MIASHS, IDMC, Université de Lorraine, France.
- Hee-Soo Choi, Langages de Script, 20h, L1 MIASHS, IDMC, Université de Lorraine, France.
- Hee-Soo Choi, Initiation aux Bases de Données, 24h, L1 MIASHS, IDMC, Université de Lorraine, France.
- Hee-Soo Choi, Bases de Données Avancées, 28h, L2 MIASHS, IDMC, Université de Lorraine, France.
- Hee-Soo Choi, Suivi de stages, 4h, L3, IDMC, Université de Lorraine, France.
- Hee-Soo Choi, Algorithmique et Programmation Impérative, 30h, L1 Informatique, FST, Université de Lorraine, France.
- Hee-Soo Choi, Algorithmique et Programmation, 20h, L1 Informatique, FST, Université de Lorraine, France.
- Hee-Soo Choi, Programmation, 36.7h, L1 Mathématiques, FST, Université de Lorraine, France.
- Hee-Soo Choi, Algorithmique et Programmation, 36.4h, L1 SPI, FST, Université de Lorraine, France.
- Vincent Tourneur, Administration UNIX, 24h, L2, IUT Charlemagne, Université de Lorraine, France.
- Vincent Tourneur, Compilation, 40h, L3, IUT Charlemagne, Université de Lorraine, France.
- Marie Cousin, Recherche Opérationnelle, 4h, L3, École des Mines de Nancy, Université de Lorraine, France.
- Clémentine Bleuze, Ingénierie de la langue, 15h, L3 MIASHS option TAL, IDMC, Université de Lorraine, France.
- Maxime Amblard and Clémentine Bleuze, Découverte du traitement des données langagières, 30h, L3 MIASHS option TAL, IDMC, Université de Lorraine, France.
- Clémentine Bleuze, Découverte du traitement des données langagières, 15h, L2 MIASHS option TAL, IDMC, Université de Lorraine, France.
- Iglika Zlatkova Nikolova-Stoupak, Découverte du traitement des données langagières, 15h, L2 MIASHS option TAL, IDMC, Université de Lorraine, France.
- Master:
- Maxime Amblard and Amandine Decker, Methods for NLP, 20h, M1 NLP (IDMC), Université de Lorraine, France.
- Maxime Amblard, NLP project, 30h, M1 NLP (IDMC), Université de Lorraine, France.
- Maxime Amblard and Amandine Decker, Dialogue ChatBot and Question Answering, 28h, M2 NLP (IDMC), Université de Lorraine, France.
- Karën Fort, Written Corpora (English), 37.5h, M1 NLP (IDMC), Université de Lorraine, France.
- Clémentine Bleuze, Written corpora (English), 16h, Master M1 NLP (IDMC), Université de Lorraine, France.
- Karën Fort, Software Projects (English), 25h, M2 NLP (IDMC), Université de Lorraine, France.
- Karën Fort, Python Programming (English), 37.5h, M1 NLP (IDMC), Université de Lorraine, France.
- Philippe de Groote, Formal Logic, 22h, M1 NLP (IDMC), Université de Lorraine, France.
- Philippe de Groote, Formal languages, 22h, M1 NLP (IDMC), Université de Lorraine, France.
- Philippe de Groote, Semantics, 22h, M2 NLP (IDMC), Université de Lorraine, France.
- Karën Fort, Clémentine Bleuze, Ethics and NLP (English), 19h, M1 NLP (IDMC), Université de Lorraine, France.
- Karën Fort, Ethics (English), 25h, M2 NLP (IDMC), Université de Lorraine, France.
- Karën Fort, Génie logiciel, 56.25h, M1 MIAGE, IDMC, Université de Lorraine, France.
- Bruno Guillaume, Lexical Resources (English), 15h, M2 NLP (IDMC), Université de Lorraine, France.
- Vincent Martin, Speech processing (English), 14h, M2 NLP (IDMC), Université de Lorraine, France
- Vincent Martin, Signal processing (English), 12h, M2 NLP (IDMC), Université de Lorraine, France
- Vincent Martin, NLP projects (English), 3h, M1 NLP (IDMC), Université de Lorraine, France
- Vincent Martin, Critical analysis of artificial intelligence for health (English), 6h, Master 2 Health Engineering, Université Grenoble Alpes, France
- Vincent Martin, Back to the big wide world: how to integrate digital tools into clinical practice? (English), 6h, Master 2 Health Engineering, Université Grenoble Alpes, France
- Vincent Martin, Quelques éléments de STS, 2h, Licence-Master Science de la Santé, Université de Bordeaux, France
- Sylvain Pogodalla, Semantics, 10h, M1 NLP (IDMC), Université de Lorraine, France
- Sylvain Pogodalla and Amandine Decker, Syntactic Models, 20h, M2 NLP (IDMC), Université de Lorraine, France
- Fanny Ducel, Software Projects (English), 10h, M2 NLP (IDMC), Université de Lorraine, France.
- Fanny Ducel, Python Programming (English), 14h, M1 NLP (IDMC), Université de Lorraine, France
- Fanny Ducel, Project Management Tools (English), 8h, M1 NLP (IDMC), Université de Lorraine, France.
- Clémentine Bleuze, NLP for low-resource language (English), 8h, Master M2 NLP (IDMC), Université de Lorraine, France.
- Maxime Amblard and Amandine Decker, Introduction to NLP, M1 NLP (IDMC), Université de Lorraine, France.
- Maxime Amblard and Amandine Decker, Dialogue Engineering, 14h, M2 NLP (IDMC) LI, Université de Lorraine, France.
- Maxime Amblard and Amandine Decker, Discourse, 14h, M2 NLP (IDMC), Université de Lorraine, France.
- Amandine Decker and Maxime Amblard, Dialogue Engineering, 14h, M2 NLP (IDMC), Université de Lorraine, France.
- Marie Cousin, Foundation of Computing, 14h, M1, École des Mines de Nancy, Université de Lorraine, France.
- Doctorate:
- Maxime Amblard, Introduction to AI, 2 x 7h, Doctoral School SLTC, Université de Lorraine, France.
- Tutorials:
- Karën Fort, Fanny Ducel, Navigating Ethical Challenges in NLP: Hands-on strategies for students and researchers 61
10.2.2 Supervision
PhD defended in 2025
- Maxime Guillaume, Structures de traits pour les Grammaires Catégorielles Abstraites, since 07 2021. Supervision: Philippe de Groote and Raphaël Salmon (Yseop).
- Santiago Herrera, Extraction de grammaires descriptives à partir de corpus annotés en syntaxe, since 09 2022. Supervision: Sylvain Kahane (MoDyCo, Université Paris Nanterre) and Bruno Guillaume.
- Nicolas Hiebel, Création éthique de données textuelles artificielles : application au domaine biomédical, since 10 2021. Supervision: Aurélie Névéol (LISN-CNRS), Karën Fort and Olivier Ferret (CEA).
- Siyana Pavlova, Tools and Methods for Semantic Annotation, since 11 2020. Supervision: Maxime Amblard and Bruno Guillaume.
PhD in progress
- Vincent-Thomas Barrouillet, Le discours pathologique du sujet schizophrène, caractérisation psycholinguistique et computationnelle des déviations décisives à la logicité dialogique en étude de corpus, since 10 2019. Supervision: Michel Musiol and Maxime Amblard.
- Clémentine Bleuze, Perception et évaluation des biais dans les applications des LLM au domaine biomédical, since 10 2024. Supervision: Karën Fort and Aurélie Névéol (LISN-CNRS).
- Colleen Beaumard, Biomarqueurs vocaux collectés par des agents conversationnels pour l'aide au diagnostic et le suivi des troubles du sommeil et des troubles mentaux, since 10 2022. Supervision: Jean-Luc Rouas (Université de Bordeaux, LaBRI), Pierre Philip (Université de Bordeaux, SANPSY) and Vincent Martin.
- Elio Stasica, Diagnostic différentiel d'infarctus à partir de la parole, since 09 2025. Supervision: Emmanuel Vincent (Multispeech), Romain Serizel (Multispeech), and Vincent Martin.
- Hee-Soo Choi, Lier des ressources lexicales du français en vue d'une interopérabilité entre niveaux linguistiques, since 10 2021. Supervision: Karën Fort and Mathieu Constant.
- Marie Cousin, Modélisation de paraphrase dans les grammaires catégorielles abstraites, since 10 2022. Supervision: Philippe de Groote and Sylvain Pogodalla.
- Amandine Decker, Modelling Topic-level Interaction in Pathological Conversations, since 10 2022. Supervision: Maxime Amblard and Ellen Breitholtz (University of Gothenburg, Sweden).
- Fanny Ducel, Evaluating stereotyped biases in auto-regressive language models, since 10 2023. Supervision: Karën Fort and Aurélie Névéol (LISN-CNRS).
- Amandine Lecomte, Analyse longitudinale de prise en charge psychothérapeutique de patients psychiatriques et de patients atteints de maladies neurodégénératives : informatisation et modélisation dialogique des indices comportementaux associés à l’efficacité (vs échec) des stratégies de prise en charge tentées par les thérapeutes, since 10 2019. Supervision: Michel Musiol and Alexandra König.
- Valentin Richard, Aspects dynamiques et présuppositionnels des questions, since 09 2021. Supervision: Philippe de Groote, Floris Roelofsen and Reinhard Muskens (Universiteit van Amsterdam, ILLC).
- Vincent Tourneur, Algorithmes d’analyse syntaxique pour les grammaires catégorielles abstraites, since 10 2024. Supervision: Philippe de Groote.
10.2.3 Other supervisions
Karën Fort and Fanny Ducel supervised six M1 students during their 2-month internship at LORIA. Four of these students worked on the stereotypes present in benchmarks used for LLMs, while the two others developed a method to measure racist biases in reaction to the presence of code-switching in LLM prompts. Karën Fort and Fanny Ducel also supervised two L3 interns, one of whom worked on the code-switching project, and the second one developed an interface based on previous work on biases by Karën Fort and Fanny Ducel. This work was published in a TALN workshop 44.
10.2.4 Juries
- Karën Fort, Maxime Amblard, Bruno Guillaume: NLP Master 1 and 2 juries (IDMC)
- Maxime Amblard was reviewer, president and member of the PhD jury of Zacchary Sadeddine, Meaning Representation Frameworks and Reasoning in the Era of LLMs, under the supervision of Fabian Suchanek (Telecom Paris), Institut polytechnique de Paris, 10 October 2025.
- Maxime Amblard was reviewer for Jaromír Salamon, Influencing text generation by biological signal, Roman Mouček, University of West Bohemia, August 2025.
- Maxime Amblard was president and member of the PhD jury of Aman Sinha, Evaluation of Medical Language Models, under the supervision of Marianne Clausel and Mathieu Constant, Université de Lorraine, 12 December 2025.
- Maxime Amblard was president and member of the PhD jury of William Eduardo Soto Martinez, Multilingual Graph-to-Text Generation and Evaluation, under the supervision of Claire Gardent (DR CNRS) and Yannick Parmentier (Université de Lorraine), 7 October 2025.
10.2.5 Educational and pedagogical outreach
- Marie Cousin and Amandine Decker ran a MATh.en.JEANS workshop at the Edmond de Goncourt secondary school in Pulnoy.
- Karën Fort presented her work on the ethics of AI to CPGE students from Lycée Poincaré at LORIA, Nancy: "Ethics of AI from an NLP point of view: the good, the bad and the evaluation", January 2025.
10.3 Popularization
10.3.1 Productions (articles, videos, podcasts, serious games, ...)
- Karën Fort was interviewed for La Recherche Magazine (January–March 2026) on LLM agents.
- Karën Fort was interviewed for Chut!, Imaginons des systèmes plus petits et plus ciblés, January 2025
- Maxime Amblard was interviewed by cortex.com for the kick-off event of UNYS
- Maxime Amblard was interviewed by newstank.com for the INSIGHT project
10.3.2 Participation in Live events
- Maxime Amblard participated in the event Le Procès du Robot, at Lycée Loritz, 2025-02-28.
- Fanny Ducel gave a presentation about her projects on stereotypical biases in LLMs at the Université Champagne-Ardenne, in the context of its AI Week and of "Fête de la Science".
- Amandine Decker: 2025-02-07, participation in FIRST (Femmes Ingénieures, Réussir en Sciences et Technologies), presentation of research to female seconde (first year of high school) students (Lycée Fabert, Metz, France).
- Hee-Soo Choi and Fanny Ducel: 2025-02-27, participation in the Elles Bougent: Filles - Maths et Science day to promote scientific studies and careers to 150 female students from 40 middle schools.
- Marie Cousin: 2025-02-27, participation in the Grand-Est edition of "Sciences, un métier de femmes", presentation of what research in computer science is, interactions with female high school students (FST, Nancy).
- Marie Cousin: 2025-01-31, presentation to high school students in the context of the "Chiche !" initiative (Lycée des métiers du tertiaire Jean-Victor Poncelet, Saint-Avold, France).
- Marie Cousin: 2025-09-20, participation in “Journées européennes du Matrimoine” (Féru des Sciences, Nancy, France).
11 Scientific production
11.1 Major publications
- 1 inproceedings The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, Toronto, Canada, Association for Computational Linguistics, 2023, 13141-13160.
- 2 article Human Where? A New Scale Defining Human Involvement in Technology Communities from an Ethical Standpoint. International Review of Information Ethics, August 2022.
- 3 article Non-size increasing Graph Rewriting for Natural Language Processing. Mathematical Structures in Computer Science 28(8), 2018, 1451-1484.
- 4 book Application of Graph Rewriting to Natural Language Processing. 1, Logic, Linguistics and Computer Science Set, ISTE Wiley, 2018, 272.
- 5 article "You'll be a nurse, my son!" Automatically Assessing Gender Biases in Autoregressive Language Models in French and Italian. Language Resources and Evaluation, October 2024.
- 6 article A Note on Intensionalization. Journal of Logic, Language and Information 22(2), 2013, 173-194.
- 7 inproceedings French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English. ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, May 2022.
- 8 article A syntax-semantics interface for Tree-Adjoining Grammars through Abstract Categorial Grammars. Journal of Language Modelling 5(3), 2017, 527-605.
- 9 article Factives at hand: When presupposition mode affects motor response. Journal of Experimental Psychology, 2022.
11.2 Publications of the year
International journals
Invited conferences
International peer-reviewed conferences
National peer-reviewed Conferences
Conferences without proceedings
Scientific book chapters
Edition (books, proceedings, special issue of a journal)
Doctoral dissertations and habilitation theses
Reports & preprints
Other scientific publications
Scientific popularization
11.3 Cited publications
- 64 misc BEL-RL-fr. ORTOLANG (Open Resources and TOols for LANGuage), www.ortolang.fr, 2025. URL: https://hdl.handle.net/11403/examples-ls-fr/
- 65 misc Réseau Lexical du Français (RL-fr). ORTOLANG (Open Resources and TOols for LANGuage), www.ortolang.fr, 2025. URL: https://hdl.handle.net/11403/lexical-system-fr/
- 66 book Enriched Meanings. Natural Language Semantics with Category Theory. 1, Oxford Studies in Semantics and Pragmatics 13, Oxford, Oxford University Press, 2020.
- 67 inproceedings Meaning-Text Theory within Abstract Categorial Grammars: Towards Paraphrase and Lexical Function Modeling for Text Generation. Proceedings of the 15th International Conference on Computational Semantics (IWCS), Nancy, France, Association for Computational Linguistics, June 2023.
- 68 inproceedings Vers une implémentation de la théorie sens-texte avec les grammaires catégorielles abstraites. Actes de CORIA-TALN 2023. Actes des 16e Rencontres Jeunes Chercheurs en RI (RJCRI) et 25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL), Paris, France, ATALA, June 2023, 72-86.
- 69 misc Les particules énonciatives. September 2024.
- 70 article "You'll be a nurse, my son!" Automatically Assessing Gender Biases in Autoregressive Language Models in French and Italian. Language Resources and Evaluation, October 2024, 1495-1523.
- 71 inbook A Synopsis of Linguistic Theory, 1930-1955. Studies in Linguistic Analysis. Special volume of the Philological Society, Oxford, Blackwell, 1957, 1-32. Reprinted in: Palmer, F. R. (ed.) (1968). Selected Papers of J. R. Firth 1952-59, pages 168-205. Longmans, London.
- 72 book The Interactive Stance. Oxford, Oxford University Press, 2012.
- 73 inproceedings Deriving Formal Semantic Representations from Dependency Structures. Logic and Engineering of Natural Language Semantics: 19th International Conference, LENLS19, Tokyo, Japan, November 19-21, 2022, Revised Selected Papers, Lecture Notes in Computer Science 14213, Tokyo, Japan, Springer, November 2022, 157-172.
- 74 inproceedings On the semantics of dependencies: relative clauses and open clausal complements - extended abstract -. Logic and Engineering of Natural Language Semantics 20 (LENLS20), Osaka, Japan, November 2023.
- 75 article On the expressive power of Abstract Categorial Grammars: Representing context-free formalisms. 13(4), 2004, 421-438. URL: http://www.springerlink.com/content/1572-9583/
- 76 inproceedings Towards a Montagovian account of dynamics. Proceedings of the 16th Semantics and Linguistic Theory Conference (SALT 16), 2006.
- 77 inproceedings Towards abstract categorial grammars. Association for Computational Linguistics, 39th Annual Meeting and 10th Conference of the European Chapter, Toulouse, France, July 2001, 148-155.
- 78 article Interaction Grammars. 7(2-4), 2009, 171-208.
- 79 article Distributional Structure. Word 10(2-3), 1954, 146-162.
- 80 inproceedingsDiscourse Markers for Topic Change.TrentoLogue: SemDial workshopUniversità Di TrentoRoverto, ItalySEMDIALSeptember 2024, 1-3HALback to text
- 81 inproceedingsModelling Word Similarity: an Evaluation of Automatic Synonymy Extraction Algorithms..Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)Marrakech, MoroccoEuropean Language Resources Association (ELRA)May 2008, URL: http://www.lrec-conf.org/proceedings/lrec2008/pdf/818_paper.pdfback to text
- 82 inproceedings What does BERT learn about the structure of language? ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 2019.
- 83 unpublished (Innocent?) Bias in argumentation. The view from language. January 2026, working paper or preprint.
- 84 inproceedings Discourse markers are not special (but they can be complicated). Empirical Issues in Syntax and Semantics. Selected papers from CSSP 2023, Paris, France, 2025.
- 85 inproceedings Construction of a French Lexical Network: Methodological Issues. Proceedings of the First International Workshop on Lexical Resources, WoLeR 2011. An ESSLLI 2011 Workshop, Ljubljana, Slovenia, August 2011, 54-61. URL: https://hal.inria.fr/hal-00686467
- 86 article Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences 117(48), 2020, 30046-30054.
- 87 book Semantics: From Meaning to Text. Volume 1. Studies in Language Companion Series 129. Amsterdam/Philadelphia, John Benjamins Publishing Company, 2012.
- 88 article Dependency-Based Construction of Semantic Space Models. Computational Linguistics 33(2), 2007, 161-199. URL: https://www.aclweb.org/anthology/J07-2002
- 89 inproceedings Comparing Similarity Measures for Distributional Thesauri. Proceedings of LREC 2014, 2014. URL: https://www.aclweb.org/anthology/L14-1496/
- 90 inproceedings Finding semantically related words in Dutch: co-occurrences versus syntactic contexts. Proceedings of the 2007 Workshop on Contextual Information in Semantic Space Models: Beyond Words and Documents, 2007, 9-16. URL: https://bibliotek.dk/eng/moreinfo/netarchive/870970-basis:28214510
- 91 inproceedings A French Interaction Grammar. RANLP 2007 - International Conference on Recent Advances in Natural Language Processing, IPP & BAS & ACL-Bulgaria, Borovets, Bulgaria, INCOMA Ltd, Shoumen, Bulgaria, September 2007, 463-467.
- 92 article Great ape interaction: Ladyginian but not Gricean. 120(42), 2023.
- 93 inproceedings Attention is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Long Beach, California, USA, Curran Associates Inc., Red Hook, NY, USA, 2017, 6000-6010. URL: https://dl.acm.org/doi/pdf/10.5555/3295222.3295349
- 94 inproceedings Characterising Measures of Lexical Distributional Similarity. COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, COLING, 2004, 1015-1021. URL: https://www.aclweb.org/anthology/C04-1146
- 95 inproceedings An AMR-based Link Prediction Approach for Document-level Event Argument Extraction. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, Association for Computational Linguistics, July 2023, 12876-12889. URL: https://aclanthology.org/2023.acl-long.720/