A Unified Framework for NLP Tasks by ReLabel Method
Abstract
In industrial deep learning applications, datasets contain a certain amount of noisy data. The initial datasets come from human labeling, LLM (large language model) generation, or user behavior logs. To address this problem and achieve a score above 90 on the dev dataset, we present a framework that finds noisy data and relabels it, using the model's predictions as references during relabeling. The relabeling can be performed manually or by an LLM. In this paper, we illustrate our idea for a broad set of deep learning tasks, including classification, sequence tagging, object detection, sequence generation, and click-through rate prediction. The dev dataset evaluation results and human evaluation results verify our idea.
Keywords
NLP, LLM
1. Introduction
In recent years, deep learning \cite{ref1} and LLMs \cite{ref2,ref3,ref4,ref5} have brought significant improvements to natural language processing (NLP), computer vision, and speech processing. However, model performance is limited by dataset quality, mainly because datasets contain a certain amount of noisy data. In this paper, we present a framework that finds and relabels noisy data, and we further illustrate our idea for sequence tagging, object detection, sequence generation, and click-through rate (CTR) prediction.
This paper's contribution is the demonstration that the quality of a training dataset can be improved by first generating it with an LLM and subsequently using an LLM for re-annotation, without the need for manual re-annotation.
2. Method

2.1 Initial Datasets
Our initial datasets can be sourced from the following three methods:
1) Manual Annotation: In a manually annotated dataset, data noise arises when annotators disagree. Using a classification task as an example: given three very similar items to label, two annotators may assign label-A while one assigns label-B.
2) LLM Generation: For datasets generated by an LLM, data noise in a classification task often stems from overlapping or repetitive label definitions within the prompts.
3) User Behavior Logs: Datasets based on user behavior logs are constructed from user actions. For example, in an e-commerce scenario, a dataset can be built based on whether a user clicks on an item or places an order.
2.2 Find Noisy Data
We first train a model on the initial dataset and then use it to generate predictions for the entire training set. Examples where the model's prediction differs from the original ground-truth label, or where the prediction error is large, are flagged as potential noise. This method typically surfaces approximately 2-10% of the dataset for re-annotation. The approach not only reduces manual annotation costs; its effectiveness in identifying noisy data has also been validated by our experimental results.
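The noise-detection step above can be sketched as follows. This is a minimal illustration, assuming a classification setting; the `low_conf` threshold and the confidence-based criterion are illustrative assumptions, not the exact setup described in the paper.

```python
def find_noisy_indices(labels, predictions, confidences, low_conf=0.5):
    """Return indices of candidate noisy examples.

    labels:      original ground-truth labels from the initial dataset
    predictions: the trained model's predicted labels for the same examples
    confidences: the model's confidence in each of its predictions
    """
    noisy = []
    for i, (y, y_hat, conf) in enumerate(zip(labels, predictions, confidences)):
        # Disagreement between model and original label, or an uncertain
        # prediction, marks the example as a candidate for re-annotation.
        if y != y_hat or conf < low_conf:
            noisy.append(i)
    return noisy
```

The flagged indices then form the to-relabel subset that is handed to human annotators or to the LLM in the next step.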
2.3 Relabel Step
We perform a manual re-annotation of the noisy data, providing the human annotators with both the original label and the model's prediction as input. In the era of LLMs, we replace this manual re-annotation with an automated process using an LLM, feeding it the same inputs: the original label and the model's prediction. Specifically, we ask the LLM within the prompt to correct labeling errors made in the last round of annotation. To ensure the quality of the LLM relabeling, we run inference on the to-relabel dataset 7 times with the same LLM and select the final label by majority vote.
3. Experimental Results

4. Discussion
We find noisy data by contrasting original labels with model predictions. To correct noisy labels, LLM can be employed to relabel data, thereby reducing the scope of manual annotation. In the LLM relabeling step, both the predicted labels and scores from our trained NLP model can be fed to the LLM as the information needed for noise correction.
The key advantage of prompt-based data annotation is its efficiency in batch processing. By including a few examples (few-shot learning) in the prompt, an LLM can generalize and apply the annotation logic to an entire batch of data. LLMs therefore bring the amount of manual labeling down to a quantity manageable for a single developer. For the relabeling step, the prompt-based LLM can be seen as a batch annotation tool: humans write few-shot examples into the prompts to correct noise in the training dataset.
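A batch relabeling prompt of this kind might be assembled as in the sketch below. The few-shot examples, field names, and prompt wording are all hypothetical illustrations of the pattern, not the paper's actual prompts; each item carries its original label and the trained model's prediction, as described above.

```python
# Human-written few-shot examples (hypothetical) that encode the
# annotation logic the LLM should generalize to the whole batch.
FEW_SHOT = [
    {"text": "great phone, fast shipping", "label": "positive"},
    {"text": "broke after two days", "label": "negative"},
]

def build_batch_prompt(items, few_shot=FEW_SHOT):
    """Build one prompt that asks the LLM to relabel a batch of noisy items.

    Each item is a dict with "text", "original_label", and
    "model_prediction" keys (illustrative schema).
    """
    lines = ["Correct the label of each item below. Examples:"]
    for ex in few_shot:
        lines.append(f'Text: {ex["text"]} -> Label: {ex["label"]}')
    lines.append("Items to relabel:")
    for i, it in enumerate(items, 1):
        lines.append(
            f'{i}. Text: {it["text"]} | original: {it["original_label"]}'
            f' | model: {it["model_prediction"]}'
        )
    return "\n".join(lines)
```

When a bad case is found during review, appending it to `FEW_SHOT` updates the prompt so that similar errors across the batch are corrected in the next pass, matching the iterative optimization described below.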
Furthermore, any incorrect labels (bad cases) generated by the LLM can be identified through manual review and then fed back into the prompt as new examples. This iterative optimization of the prompt allows for the batch correction of similar errors throughout the dataset.
The LLM-relabel method can also be applied to the post-training of LLMs. In this process, after identifying noisy data within the post-training dataset, we use the LLM to correct it.
5. Conclusion
In the era of LLM, our goal is to train models for NLP tasks. To correct the noise in our initial dataset, we propose a framework that supports both a human-in-the-loop (HITL) and an LLM-in-the-loop (LITL) approach. Experimental results have validated the effectiveness of our method.
References
\bibitem{ref1}
Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks[J]. Advances in neural information processing systems, 2012, 25: 1097-1105.
\bibitem{ref2}
Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report[J]. arXiv preprint arXiv:2303.08774, 2023.
\bibitem{ref3}
Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training[J]. 2018.
\bibitem{ref4}
Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback[J]. Advances in neural information processing systems, 2022, 35: 27730-27744.
\bibitem{ref5}
Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. Journal of machine learning research, 2020, 21(140): 1-67.