Duplicate product record detection engine for e-commerce platforms

Albayrak, Osman Semih

dc.contributor.advisor	AYTEKİN, Tevfik
dc.contributor.author	Albayrak, Osman Semih
dc.contributor.department	Marmara Üniversitesi
dc.contributor.department	Fen Bilimleri Enstitüsü
dc.contributor.department	Bilgisayar Mühendisliği Anabilim Dalı
dc.date.accessioned	2026-01-13T06:24:56Z
dc.date.issued	2021
dc.description.abstract	Maintaining a clean product catalog that meets industry standards, and keeping it from falling below those standards, is one of the core concerns of e-commerce companies. New product information entered into the system by thousands of suppliers confronts companies with a difficult problem: duplicate product records. Because any product can be described with different words, images, and attributes, detecting duplicate product records is a hard task. In this study, a novel duplicate product record detection engine is proposed for Hepsiburada.com, an e-commerce company. The engine was developed on the basis of Hepsiburada.com's real data. To build a trainable data set from the raw data, various text similarity algorithms, rule-based text similarity metrics specific to e-commerce, and image similarity metrics were used. Traditional text similarity methods such as Jaccard similarity, TF-IDF cosine similarity, and edit distance were used for the text similarity calculations. A Siamese (Twin) Neural Network was trained for the image similarity calculations. Binary classification models were trained on the resulting data set to determine whether any two products are duplicates. Experimental results show that the proposed engine detects duplicate product records within Hepsiburada.com more successfully than traditional methods.
dc.description.abstract	Maintaining a clean product catalog and keeping it compliant with industry standards is one of the essential concerns of e-commerce companies. Integrating product data from multiple providers confronts these companies with a challenging issue: duplicate product records. Since a product can be described with a variety of words, images, and attributes, detecting duplicate product records is a difficult task. In this thesis, a novel duplicate record detection engine is proposed for an e-commerce company, Hepsiburada.com. The engine is developed on a real-world data set. A number of text similarity algorithms, domain-specific distance metrics, and image similarity metrics are used to form a training data set. Traditional text similarity algorithms such as Jaccard similarity, TF-IDF cosine similarity, and edit distance are used for the text similarity calculations. A Siamese (Twin) Neural Network is trained and used for the image similarity calculations. Two-class classification models are trained on the resulting data set to determine whether any two products are duplicates. The experimental results show that the engine can use product information for duplicate record detection and achieves higher accuracy than non-adaptive methods.
dc.format.extent	VIII, 48 p.
dc.identifier.uri	https://katalog.marmara.edu.tr/veriler/yordambt/cokluortam/5D/6045bd5494ac3.pdf
dc.identifier.uri	https://hdl.handle.net/11424/216862
dc.language.iso	eng
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	Classification
dc.subject	Computer engineering
dc.subject	Deduplication
dc.subject	Duplicate Detection
dc.subject	Text Similarity
dc.title	Duplicate product record detection engine for e-commerce platforms
dc.type	masterThesis
dspace.entity.type	Publication

Collections

Tezler