Tuition Fee
GBP 28,600
Per year
Start Date
2026-01-01
Medium of studying
On campus
Duration
3 years
Details
Program Details
Degree
PhD
Major
Artificial Intelligence | Data Science | Software Engineering
Area of study
Information and Communication Technologies
Education type
On campus
Timing
Full time
Course Language
English
Tuition Fee
Average International Tuition Fee
GBP 28,600
Intakes
Program start date | Application deadline
2025-10-01 | -
2026-01-01 | -
2026-04-01 | -
2026-07-01 | -
About Program

Program Overview


Introduction to the PhD Program

This PhD in Computer Science at Loughborough University is a research-focused degree. The project aims to develop novel, semantic-aware deduplication techniques that improve dataset quality while maintaining computational efficiency.


Program Details

Qualification(s) Available

  • PhD

Entry Requirements

Applicants should have, or expect to achieve, at least a 2:1 Honours degree (or equivalent) in computer science or a related subject. A relevant Master’s degree and/or experience in one or more of the following will be an advantage: artificial intelligence, information sciences, mathematics with experience in programming.


Fees for Entry

  • UK fee: £5,006 Full-time degree per annum
  • International fee: £28,600 Full-time degree per annum

Duration and Start Date

  • Duration: Full-time, 3 years
  • Start date: October 2025, January 2026, April 2026, July 2026

Application Deadline

  • Application deadline: 1 April 2026

Project Reference

  • Project reference: CO/GC-SF7/2025

Location

  • Location: Loughborough

Subject Area(s)

  • Subject area(s): Computer Science

Project Details

The exponential growth of training datasets in machine learning, particularly in natural language processing (NLP), has highlighted the critical challenge of data duplication. Duplicate or near-duplicate content, repetitive substrings, and redundant information in datasets can lead to biased models, inefficient training processes, and inflated evaluation metrics, ultimately undermining the reliability and generalisability of machine learning systems.


While data deduplication is essential for improving dataset quality, current methods are limited in their ability to capture semantic similarities and are often computationally expensive, making them impractical for large-scale applications.


This PhD project aims to address these limitations by developing novel, semantic-aware deduplication techniques that improve dataset quality while maintaining computational efficiency.


Research Objectives

  1. Develop and evaluate frameworks for semantic-aware deduplication that can identify both exact and near-duplicate content. A critical aspect will be preserving contextually important variations whilst removing truly redundant data. The effectiveness of these approaches will be evaluated against existing methods using standard benchmarks.
  2. Examine how different deduplication strategies affect model performance, memory usage and training efficiency. This will involve carefully quantifying the relationships between deduplication levels and various aspects of model output quality. Understanding these relationships is crucial for developing practical solutions that can be deployed at scale.
  3. Explore approaches, including active learning, for deduplication that can efficiently process large-scale datasets. A key focus will be minimising both computational resources and manual labelling requirements through intelligent sample selection and automated processing techniques.
  4. Conduct case studies on benchmark datasets to validate the proposed methods in real-world scenarios. This will involve applying the developed frameworks to diverse datasets, analysing their performance, and providing insights into their applicability across different domains and use cases.
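
To make the distinction between exact and near-duplicate content in the first objective concrete, the following is a minimal sketch of a purely lexical baseline. It is not the project's method: all function names (normalise, shingles, jaccard, deduplicate) and the 0.7 threshold are illustrative assumptions. Exact duplicates are caught by hashing normalised text, and near-duplicates by Jaccard similarity over word 3-grams.

```python
import hashlib
import re


def normalise(text):
    """Lowercase, drop punctuation and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def shingles(text, n=3):
    """Return the set of word n-grams of the normalised text."""
    words = normalise(text).split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.7):
    """Keep the first copy of each document, dropping exact duplicates
    (same hash after normalisation) and near-duplicates (Jaccard
    similarity to an already kept document >= threshold).

    The threshold is a hypothetical value chosen for illustration.
    This pairwise comparison is quadratic in the number of kept
    documents, so it only suits small collections.
    """
    seen_hashes = set()
    kept, kept_shingles = [], []
    for doc in docs:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalisation
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, s) >= threshold for s in kept_shingles):
            continue  # lexical near-duplicate of a kept document
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept
```

Given the four inputs "The cat sat on the mat.", "the  cat sat on the mat.", "The cat sat on the mat today." and "A feline rested on the rug.", this sketch keeps only the first and the last. The second is an exact duplicate after normalisation and the third is a near-duplicate (Jaccard similarity 0.8). The last sentence is a paraphrase of the first but shares no word 3-grams with it, so a lexical baseline cannot relate the two at all; this is the kind of semantic similarity the project description says current methods struggle to capture.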

This research has the potential to make significant contributions to the field of machine learning by addressing fundamental challenges in dataset quality and model training efficiency. The findings could have broad implications for improving the reliability and performance of language models across various applications.


The project will require expertise in machine learning and natural language processing, with opportunities to develop novel theoretical frameworks as well as practical implementations. The successful candidate will join a dynamic research environment with access to substantial computational resources and real-world datasets for evaluation.


Supervisors

  • Primary supervisor: Professor Georgina Cosma

English Language Requirements

Applicants must meet the minimum English language requirements. Further details are available on the International website.


How to Apply

All applications should be made online. Under programme name, select Computer Science. Please quote the advertised reference number, CO/GC-SF7/2025, in your application.


To avoid delays in processing your application, please ensure that you submit a CV and the minimum supporting documents.


The following selection criteria will be used by academic schools to help them make a decision on your application. Please note that these criteria are used for both funded and self-funded projects.


Please note that applications for this project are considered on an ongoing basis once submitted, and the project may be withdrawn before the application deadline if a suitable candidate is chosen.

