Development of data augmentation methods to improve performance of supervised machine learning models in natural language processing

Issifu, Abdul Majeed

dc.contributor.advisor	GANİZ, Murat Can
dc.contributor.author	Issifu, Abdul Majeed
dc.contributor.department	Marmara Üniversitesi
dc.contributor.department	Fen Bilimleri Enstitüsü
dc.contributor.department	Bilgisayar Mühendisliği Anabilim Dalı
dc.date.accessioned	2026-01-13T08:20:44Z
dc.date.issued	2022
dc.description.abstract	Initially, "Easy Data Augmentation" was developed for text classification tasks. Essentially, this approach covers four methods: Synonym Replacement, Random Insertion, Random Deletion, and Random Swap. These are aimed at increasing accuracy in deep neural network models. This study aims to extend these methods to Named Entity Recognition tasks in the medical domain. Although the nature of named entities (consisting of a word or word groups or families in sentences) brings some challenges for data augmentation, the study argues that named entity recognition performance can be improved. We use these methods to increase the size of biomedical benchmark datasets and to improve the performance of biomedical named entity recognition models. We conducted experiments to evaluate the work on transformer models such as BERT. Through transfer learning, we fine-tuned BioBERT, a biomedical language model, on the datasets. With the BioBERT and BERT models, we obtained a general improvement on all datasets and F1-score increases of 5.95% and 8.49%, respectively, on the BC5CDR-disease dataset.
dc.description.abstract	Easy Data Augmentation was originally developed for text classification tasks. It comprises four methods, Synonym Replacement, Random Insertion, Random Deletion, and Random Swap, designed to improve the accuracy of several deep neural network models. This study aims to extend these methods to a new domain by augmenting Named Entity Recognition datasets from the medical domain. Although the nature of named entities (which consist of a single word or a group or family of words in a sentence) poses some challenges for augmentation, we argue that an improvement in named entity recognition performance is achievable. We use these methods to increase the size of biomedical benchmark datasets and to improve the performance of biomedical named entity recognition models. We carried out experiments to evaluate the approach on transformer models such as BERT. Using transfer learning, we fine-tuned BioBERT, a biomedical language model, on the datasets. We achieved a general improvement on all datasets and F1-score increases of 5.95% and 8.49% on the BC5CDR-disease dataset with the BioBERT and BERT models, respectively.
dc.format.extent	X, 49 s.
dc.identifier.uri	https://katalog.marmara.edu.tr/veriler/yordambt/cokluortam/1E/6308702bcf43d.pdf
dc.identifier.uri	https://hdl.handle.net/11424/283574
dc.language.iso	eng
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	Bilgisayar mühendisliği
dc.subject	Computer engineering
dc.subject	data augmentation
dc.title	Development of data augmentation methods to improve performance of supervised machine learning models in natural language processing
dc.type	masterThesis
dspace.entity.type	Publication

Collections

Tezler
