
Incorporating Domain-specific Knowledge into Pre-trained Language Model based Controllable Text Generation


Natural Language Generation (NLG) enables computers to generate natural text as humans do. As an embodiment of advanced artificial intelligence technology, NLG plays a crucial role in a range of applications, such as dialogue systems, advertising, marketing and story generation. Making text generation controllable is a fundamental issue in NLG: an NLG system should be able to reliably generate texts that satisfy given controllable constraints. In general, these constraints are task-specific. For example, story generation typically needs to control the storyline and the ending, while dialogue response generation requires control over emotion, persona, politeness, etc. Moreover, for ethical AI applications, it is crucial to avoid generating offensive content such as gender bias, racial discrimination and toxic language.

In recent years, the development of deep learning (DL) techniques has given rise to a series of studies on DL-driven controllable text generation (CTG). However, DL-based methods rely heavily on large-scale labelled datasets for each specific task, which limits their generalization capability. Since 2018, large-scale pre-trained language models (PLMs) such as BERT, RoBERTa and GPT have become a new paradigm of NLP. PLMs are believed to have learned a great deal of semantic and syntactic knowledge from the large-scale corpora on which they are pre-trained, so that only fine-tuning is required for downstream tasks. In terms of CTG, PLMs have learned from vast corpora to model the distribution of natural language to a large extent, and are therefore able to generate texts of unprecedented quality. However, the knowledge captured by PLMs is rather superficial, and they lose generalization ability when the pre-training data does not contain relevant domain-specific knowledge. Therefore, relying purely on PLMs makes it difficult to keep the generated texts faithful to the rich knowledge specific to a target domain.
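To make the pre-training limitation concrete, here is a deliberately tiny, self-contained sketch in which a toy bigram model stands in for a PLM (all names and the corpus are illustrative, not from the project): the model continues fluently within its "pre-training" corpus but produces nothing for an out-of-domain prompt, which is exactly the failure mode described above.

```python
# Toy stand-in for a PLM: a bigram table "pre-trained" on a tiny corpus.
# Illustrative only -- a real PLM models full distributions over tokens.
TOY_CORPUS = (
    "knowledge graphs support controllable text generation . "
    "pretrained language models generate fluent sentences ."
)
tokens = TOY_CORPUS.split()

# Map each token to the tokens observed to follow it in the corpus.
bigrams = {}
for cur, nxt in zip(tokens, tokens[1:]):
    bigrams.setdefault(cur, []).append(nxt)

def generate(prompt, max_new_tokens=5):
    """Greedily continue a prompt; stop when the last token is unseen,
    mimicking how a PLM degrades outside its pre-training distribution."""
    out = prompt.split()
    for _ in range(max_new_tokens):
        followers = bigrams.get(out[-1])
        if not followers:  # out-of-domain token: no knowledge to draw on
            break
        out.append(followers[0])
    return " ".join(out)

print(generate("knowledge"))  # in-domain prompt: a fluent continuation
print(generate("quantum"))    # out-of-domain prompt: nothing is generated
```

In the same way, a real PLM can only be controlled reliably over content its pre-training data actually covers; knowledge missing from the corpus cannot be recovered at generation time.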

This project aims to investigate the incorporation of a domain-specific knowledge graph into PLM-based CTG. A knowledge graph is a natural carrier of explicit domain-specific knowledge and also provides effective reasoning mechanisms; it can therefore be a good complement to PLMs. We will focus on the attribute-controlled text generation task within a selected domain (e.g., computer science), which aims to generate natural language sentences that satisfy given attributes such as topic, emotion and keywords.
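As a concrete (and heavily simplified) illustration of attribute-controlled generation, the sketch below re-ranks candidate sentences, imagined as samples from a PLM, by how well they satisfy keyword and topic constraints. All function names, weights and candidates are hypothetical; real approaches steer decoding itself rather than re-ranking finished outputs.

```python
# Hypothetical sketch: attribute-controlled generation via candidate re-ranking.
# Assumed setup: CANDIDATES are sentences sampled from some PLM.

def score_attributes(sentence, keywords, topic_words):
    """Score how well a candidate satisfies keyword and topic constraints."""
    words = set(sentence.lower().split())
    kw_hits = sum(1 for k in keywords if k.lower() in words)
    topic_hits = sum(1 for t in topic_words if t.lower() in words)
    return kw_hits * 2 + topic_hits  # keywords weighted above topic overlap

def select_best(candidates, keywords, topic_words):
    """Pick the candidate that best meets the given attribute constraints."""
    return max(candidates, key=lambda s: score_attributes(s, keywords, topic_words))

CANDIDATES = [
    "The weather is nice today.",
    "Graph neural networks encode knowledge graphs for text generation.",
    "Knowledge graphs support reasoning in computer science applications.",
]
best = select_best(CANDIDATES, keywords=["knowledge", "graphs"],
                   topic_words=["reasoning"])
print(best)
```

A knowledge graph would enter this picture by supplying the attribute vocabulary and the reasoning needed to judge whether a candidate is actually faithful to the target domain, rather than the bag-of-words overlap used in this toy scorer.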


Skills Required:

Good undergraduate degree (2.1 or above) or Master's degree in Computing, ideally with experience in natural language processing and information retrieval.


Background Reading:

Muhammad Khalifa, Hady Elsahar, and Marc Dymetman. 2020. A Distributional Approach to Controlled Text Generation. arXiv preprint arXiv:2012.11635 (2020).

Xingwei He. 2021. Parallel Refinements for Lexically Constrained Text Generation with BART. CoRR abs/2109.12487 (2021).

Shrimai Prabhumoye, Alan W. Black, and Ruslan Salakhutdinov. 2020. Exploring Controllable Text Generation Techniques. CoRR abs/2005.01822 (2020).

Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. 2019. Topic-Guided Variational Autoencoders for Text Generation. arXiv preprint arXiv:1903.07137 (2019).



Prof. Dawei Song
