Oral Radiology
https://doi.org/10.1007/s11282-023-00689-4

ORIGINAL ARTICLE

Effect of data size on tooth numbering performance via artificial intelligence using panoramic radiographs

Semih Gülüm1 · Seçilay Kutal2 · Kader Cesur Aydin3 · Gazi Akgün4 · Aleyna Akdağ5

Received: 15 December 2022 / Accepted: 29 May 2023
© The Author(s) under exclusive licence to Japanese Society for Oral and Maxillofacial Radiology 2023

Abstract
Objective  This study investigates the effect of the amount of training data on model performance for the tooth numbering problem on dental panoramic radiographs, using image processing and deep learning algorithms.
Study Design  The dataset consists of 3000 anonymized dental panoramic X-rays of adult individuals. The radiographs were labeled into 32 classes in line with the FDI tooth numbering system. To examine the relationship between dataset size and model performance, four training sets of 1000, 1500, 2000 and 2500 panoramic X-rays were used. The models were trained with the YOLOv4 algorithm, tested on a fixed test set of 500 radiographs, and compared on F1 score, mAP, sensitivity, precision and recall.
Results  Model performance increased as the amount of data used during training increased; accordingly, the model trained with 2500 radiographs was the most successful of all trained models.
Conclusion  Dataset size is important for dental enumeration, and models trained on larger samples should be considered more reliable.
Keywords  Artificial intelligence · Image processing · Health technologies · Panoramic x-ray · Tooth numbering

Semih Gülüm and Seçilay Kutal have contributed equally to this work and share first authorship. Kader Cesur Aydin and Gazi Akgün have contributed equally to this work and share senior authorship.

* Kader Cesur Aydin
  kadercesur@yahoo.com
  Semih Gülüm
  semih.gulum@avl.com
  Seçilay Kutal
  secilay.kutal@areal.ai
  Gazi Akgün
  gazi.akgun@marmara.edu
  Aleyna Akdağ
  aleyna.akdag@std.medipol.edu.tr

1 AVL Research and Engineering, Abdurrahmangazi Mah. Ataturk Cad. No: 22 11/22 Kat: 6 Sultanbeyli, Istanbul 34920, Turkey
2 AREAL.ai, San Francisco, USA
3 School of Dentistry, Department of Dentomaxillofacial Radiology, Istanbul Medipol University, Ataturk Bulvarı No: 27 Unkapanı, Istanbul 34083, Turkey
4 Technology Faculty, RTE Campus, Marmara University, T1/203, Maltepe, Istanbul 34854, Turkey
5 School of Dentistry, Istanbul Medipol University, Goztepe Mah, Ataturk Cad, Beykoz, Istanbul 34810, Turkey

Introduction

Artificial intelligence (AI), a branch of computer science that can analyze complex medical data [1], is expanding its medical applications day by day with the latest developments in digitized data collection, machine learning and computing infrastructure. It has thus become involved in many medical fields that were previously considered the sole domain of human specialists [2]. Moreover, the growing number and complexity of data in the medical field mean that AI will be applied even more actively in the coming years [3]. Popular AI techniques include the classical support vector machine (SVM) approach and machine learning (ML) methods for structured data, such as neural networks. Oncology, neurology and cardiology are among the main areas in which AI is applied [4]. At this point, however, supervised learning is very costly in the medical field, because field-specific information and ground-truth labels are difficult to obtain [5]. Therefore, as the medical field to be studied (e.g., dental health, the relevant field in this study) becomes more specialized, the effort and cost increase.

The more specialized the medical problem to be solved with deep learning methods, the more difficult these processes become. This also applies to oral diseases, which are among the most common diseases in humans. Since only the crowns of erupted teeth can be visualized in the mouth and the roots cannot be detected by inspection, it is difficult for dentists to diagnose dental anomalies and diseases manually. Dental X-ray imaging is therefore the most popular and most preferred auxiliary examination method for diagnosing dental anomalies and diseases before treatment [6]. Evaluating and interpreting X-rays and making the correct diagnosis is one of the most important processes in the initial phase of treatment. Automatic tooth recognition applications have started to be developed using panoramic X-ray images, the most frequently used dental imaging method today.

Sur et al. conducted a survey among dentists in India about artificial intelligence and its possible contributions to radiological diagnosis [7]. Contrary to popular belief, their study shows that physicians support the use of artificial intelligence in the field of health.
When the results of that study are examined, dentists mostly gave positive answers about preferring and widely adopting artificial intelligence in radiological detection. The study also predicts that artificial intelligence may be useful for dentists, especially in examining complex X-rays. A recently published study by Günec et al. reported that 83.7% of laypeople think that "artificial intelligence applications can be effective in dental diagnosis" and 93.8% agree that "the dentist and artificial intelligence can work together" [8]. This survey suggests that AI-based dental health applications should be used in dentistry practice, where they can benefit society in both preventive medicine and dental health tourism.

This study examines tooth detection and numbering, one of the main tasks of dentistry, in relation to dataset size and success rate, one of the central questions in deep learning. Dataset size was considered important here because the number of trained classes was high (32 different classes, one for each tooth); unlike studies using few classes, we assume that trainings with a higher number of images will achieve higher success rates at test time. The problem was addressed with the You Only Look Once v4 (YOLOv4) algorithm, an object detection algorithm with high accuracy and speed, with the aim of measuring the role of dataset size in model success; specialist dentists assisted in this process. Tooth number detection may be used by dentists, dental students and even laypeople to identify the relevant tooth initially, so that related dental pathologies can be addressed more easily.
Methods and materials

Dataset

To create the dataset, only adult panoramic X-rays obtained through the databases of various dental clinics were included; both genders were represented, and care was taken to include X-rays taken at different time intervals. All X-rays were anonymized to protect patient confidentiality before the data were labeled and made meaningful for the model. To avoid performance loss due to data differences during training, all data were analyzed and screened according to image quality, color difference, etc. After this analysis, 3000 panoramic X-rays and 58,575 labeled teeth were usable in the study.

Taking oral physiology into consideration, the maximum number of adult tooth classes was set at 32 for the classification problem, although an individual mouth may contain fewer teeth. Dentomaxillofacial radiology specialists supported the labeling process, which was carried out in line with the preferred FDI (Fédération Dentaire Internationale) numbering system [9] for class names. Labeling was performed with the LabelImg graphical labeling tool [10], and the resulting label files were converted into a format suitable as input to the model.

During the labeling process, teeth with large variations that disrupt their morphological structure, such as fillings, crowns and implants, were not labeled, so that the model would learn tooth numbers from intact morphology. Tooth samples with these variations are shown in Fig. 1. The data distribution according to tooth number in the dataset, which has an average of 19.5 labeled teeth per X-ray, is shown in Fig. 2. From this distribution, it can be concluded that teeth numbered 18, 28, 38 and 48 are less common in an adult mouth than other teeth. In addition, the distribution of the other classes over the data is uneven.
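As an illustrative sketch (not code from the study), the per-class distribution of labeled teeth can be computed directly from YOLO-format label files such as those produced from LabelImg annotations; the directory layout and the mapping from class index to FDI number are assumptions:

```python
from collections import Counter
from pathlib import Path

# FDI two-digit tooth numbers for a full adult dentition (32 classes):
# quadrants 1-4, positions 1-8, e.g. 11, 12, ..., 48.
FDI_CLASSES = [q * 10 + p for q in (1, 2, 3, 4) for p in range(1, 9)]

def class_distribution(label_dir):
    """Count labeled teeth per FDI class across YOLO-format .txt files.

    Each line of a YOLO label file is: <class_id> <cx> <cy> <w> <h>,
    where class_id is assumed to index into FDI_CLASSES.
    """
    counts = Counter()
    for label_file in Path(label_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            if line.strip():
                class_id = int(line.split()[0])
                counts[FDI_CLASSES[class_id]] += 1
    return counts
```

A histogram of these counts corresponds to the distribution shown in Fig. 2, where the third molars (18, 28, 38, 48) stand out as under-represented.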
However, the fact that the dataset was collected from different time intervals, clinics, genders and patients indicates that it represents the real world well. Therefore, no augmentation was performed for the teeth in the dataset, and care was taken to preserve the observed distribution.

YOLOv4 method

The deep learning model chosen for training is YOLOv4 [11], a CNN-based architecture developed by Bochkovskiy et al.; it is a real-time object recognition algorithm that can detect multiple objects in a single frame. YOLOv4 has 162 layers in total and contains more than 64 million parameters. The architecture draws a bounding box around each detected object to indicate the predicted area. During the training phase, the model takes as input both the image and an image-specific label file containing the coordinate information and class names of the labeled objects.

By dividing the dataset of 3000 X-rays, four different model trainings were carried out with 1000, 1500, 2000 and 2500 training images. The remaining 500 X-rays were separated from the training data and treated as a fixed test set to measure the performance of each model. The four models were trained for 12,000 iterations, with weights recorded every 3000 iterations. The model parameters for the trainings, carried out with the GPU support provided on Google Colab notebooks, are given in Table 1.

Fig. 1  Unlabeled teeth with variations
Fig. 2  Data distribution according to the data sets used in the models

Table 1  Model parameters

  Image size (width × height): 608 × 608
  Batch size: 32
  Subdivision: 16
  Learning rate: 0.001
  Confidence threshold: 0.3
  Momentum: 0.949
  Decay: 0.0005
  Iteration count: 12,000

Results

Precision, recall, mean average precision (mAP), sensitivity and F1 score, all frequently preferred metrics, were used to test the deep learning models and compare their performance. The importance of each metric varies across studies. The F1 score was preferred because it can provide a class-specific performance evaluation in classification problems.

The performance graphs from the tests performed on the held-out test set at the end of each training epoch are given in Fig. 3. The highest performance values obtained on the test set by each trained model are given in Table 2.

In the four trainings, a directly proportional correlation was found between dataset size and performance metrics: model performance increased as the amount of training data increased. However, when the amount of data is insufficient, underfitting occurs, since the model cannot learn enough to explain the data. Increasing the dataset size overcame this tendency to underfit. Conversely, when a model was trained for too long, it tended to overfit by memorizing the data. Overfitting was observed after 9 epochs in the 2500-image training, which had the most training data, and, as expected, appeared at earlier epochs in the models with less data. Likewise, as expected, underfitting was observed in the 1000-image training, which had the least training data.

The outputs of the tests performed with the most successful epoch of each model are shown in Fig. 4. Examining this figure, the models trained with 1000 and 1500 images could not distinguish between teeth in neighbouring quadrants on the same horizontal axis.
While this problem was overcome in the 2000-image training, the remaining problems of predicting more than one class for a single tooth, and especially of identifying the wisdom teeth, were overcome only with the 2500-image training. The cut-off for the study was set at 3000 radiographs (2500 labeled for training and 500 for testing) because testing at this data size already revealed overfitting: the model learns the noise in the training data because its complexity is high. In overfitting problems arising from a high number of classes, further increasing the dataset size may yield limited or no gain in success.

Fig. 3  Test accuracy graphics of models

Table 2  Model-specific highest test performance metrics and related epoch information

              Precision  Recall  F1-score  Sensitivity  mAP   Best epoch
  1000 data   0.48       0.72    0.57      0.72         0.58  12
  1500 data   0.45       0.82    0.58      0.82         0.63  3
  2000 data   0.58       0.68    0.68      0.83         0.75  6
  2500 data   0.62       0.73    0.83      0.88         0.81  9

Among the trained models, the highest-performing model was the one trained for 9 epochs with 2500 images. To examine the effect of the intersection over union (IoU) parameter on the model results, this model was also tested with different IoU values. As the IoU threshold falls below 0.5, an increase is observed in mAP values, while a decrease is observed above 0.5.

Discussion

Tuzoff et al. used Faster Region-Based Convolutional Neural Networks (Faster R-CNN) to detect the teeth; the detected teeth were cropped and then passed to the Visual Geometry Group-16 (VGG-16) algorithm for numbering [12]. The X-ray was then post-processed to make it interpretable, and the cropped teeth were reassembled.

Mahdi et al. preferred Residual Neural Network-50 (ResNet-50) as the backbone of the Faster R-CNN architecture they used to solve the tooth numbering problem [13]. During data labeling in their study of 900 X-rays, teeth with large restorations, such as large-area fillings, were not labeled. The study reached a mAP of 0.942.

Muramatsu et al. addressed the problems of detecting teeth, determining tooth types, and classifying teeth according to their restorations [14]. Their study, carried out with 1000 X-ray images, aimed to minimize the performance problems of an insufficient dataset by using fourfold cross-validation. With the Convolutional Neural Network-based (CNN-based) GoogLeNet architecture used for tooth detection, a sensitivity of 96.4% was reached. With the ResNet architecture used for the classification problems, accuracies of 93.2% and 98% were achieved in classifying teeth by type and by restoration, respectively.

Kim et al., in addition to detecting teeth and implants, also worked on numbering the detected teeth [15]. They preferred an R-CNN model and used a total of 303 X-rays, 253 for training and 50 for testing, reaching an accuracy of 77.4% in tooth numbering. The biggest problem in numbering was that teeth with dental interventions such as fillings and crowns could not be identified, because the tooth shape was not preserved.

Muresan et al. used semantic segmentation to detect fourteen different dental diseases [16]. Labeling was done for fourteen diseases on grayscale X-rays, as well as for the background. Since the background class is related to pixel values, it can take values between 0 and 255.
They preferred the Efficient Residual Factorized Convolutional Network (ERFNet) model, consisting of a 16-layer encoder followed by a 7-layer decoder. Among the methods used, the highest F1 score was 0.93.

Cho et al. performed tests on a hospital Picture Archiving and Communication System (PACS) dataset with a Convolutional Neural Network (CNN) algorithm to observe the effect of dataset size on solving medical classification problems. Their tests showed a directly proportional correlation between dataset size and success rates [17].

Yüksel et al. performed enumeration tests on panoramic radiographs, first segmenting each image into 4 quadrants and then numbering the 8 teeth in each quadrant, yielding a final classification into the 32 FDI notations [18]. By decreasing the number of classes in each quadrant, the object detection problem was made easier. They evaluated success with average precision, reaching 89.4% with 600 labeled images overall. The average precision in our 2500-image test was 62%, with all 32 classes determined on the same images during testing. The difference in average precision between the two studies is considered to stem from the difference in the number of classes (8 vs. 32). On the other hand, different evaluation parameters can reveal higher results: we obtained higher mAP (0.81) and recall (0.73) than Yüksel et al.

Fig. 4  Predicted test samples

In this study, the effect of dataset size on model success in the tooth detection and numbering problem was investigated with the YOLOv4 architecture, one of the most state-of-the-art object detection algorithms. The most successful model was the one trained with the largest dataset, with an F1 score of 0.83.
With this trained model, high-performance results were obtained, numbering teeth at the level of a radiologist. Our mAP values are lower than those of Mahdi et al. [13]; on the other hand, this finding may be related to test size, and if labeled training images had also been used for testing, this could have produced a falsely positive outcome. The classification achievement of Muramatsu et al. was also higher [14], with a tooth detection sensitivity of 96.4%; however, detailed analysis of their data shows that the classification was based on incisor, canine, premolar and molar labels (4 classes) and did not include the 32 classes of the FDI notation. Given the large number of classes trained, our sensitivity of 88% is notable.

At the stage of measuring model performance, each X-ray was analyzed independently by the trained model and by a radiologist who is an expert in the field. In the analyzed X-rays, most of the predictions counted as incorrect were not actually wrong: teeth that were largely deformed in structure, and therefore left unlabeled, were correctly found by the model, which increased the false positive (FP) count.

This study is limited by the labeling exclusion of teeth with largely varied shapes, such as those with fillings, crowns and implants. In future studies, teeth with large restorations can be numbered and included in model training, increasing the success of the model. As a different solution, in addition to detecting teeth with object detection algorithms, location-based information can also be included in the prediction phase of the model. Thus, the tooth number prediction made from tooth shape can be strengthened by the latitudinal region of the tooth, and model performance can be increased.
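As a minimal sketch of how the comparison metrics discussed above are computed (the functions below are illustrative, not code from the study; the corner-coordinate box format and the counting convention are assumptions):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive
    and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A predicted box is typically counted as a true positive when its IoU with a ground-truth box of the same class exceeds a chosen threshold (0.5 in the IoU analysis above); this is why unlabeled deformed teeth that the model correctly detects inflate the FP count.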
Conclusion

Our study addresses the problem of detecting and numbering teeth in panoramic X-rays with artificial intelligence and image processing algorithms, and emphasizes the importance of dataset size in object detection algorithms for medical images. When the amount of data is low, various augmentation solutions can be used to increase the number of images artificially and raise the achievable level of performance.

As algorithms based on artificial intelligence and deep learning spread and enter daily life, an assistant can be developed that helps dentists, clinicians and dental students make faster and more reliable clinical decisions.

Acknowledgements  The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Author contributions  KCA: conceptualization, methodology, investigation, writing—original draft preparation, writing—reviewing and editing, visualization. SK: methodology, investigation, software, validation, data curation, writing—original draft preparation. SG: methodology, investigation, software, validation, data curation, writing—original draft preparation, resources. GA: methodology, project administration, supervision. AA: software, validation, resources.

Funding  Not applicable.

Data availability  The data used to support the findings of this study are available from the corresponding author upon request.

Declarations

Conflicts of interest  All authors declare that they have no conflict of interest.

Ethics approval  Not applicable. This article does not contain any studies with human or animal subjects performed by any of the authors.

Informed consent  Not applicable.

References

1. Ramesh AN, Kambhampati C, Monson JRT, Drew PJ. Artificial intelligence in medicine. Ann R Coll Surg Engl. 2004;86(5):334–8.
2. Yu K-H, Beam AL, Kohane IS. Artificial intelligence in healthcare. Nat Biomed Eng. 2018;2(10):719–31.
3. Davenport T, Kalakota R. The potential for artificial intelligence in healthcare. Future Healthc J. 2019;6(2):94–8.
4. Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S, Wang Y, Dong Q, Shen H, Wang Y. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol. 2017;2(4):230–43.
5. Holzinger A, Langs G, Denk H, Zatloukal K, Muller H. Causability and explainability of artificial intelligence in medicine. Wiley Interdiscip Rev Data Min Knowl Discov. 2019;9(4):e1312.
6. Whyte A, Matias MATJ. Imaging of orofacial pain. J Oral Pathol Med. 2020;49(6):490–8.
7. Sur J, Bose S, Khan F, Dewangan D, Sawriya E, Roul A. Knowledge, attitudes, and perceptions regarding the future of artificial intelligence in oral radiology in India: a survey. Imaging Sci Dent. 2020;50(3):193.
8. Günec HG, Gökyay SS, Kaya E, Cesur-Aydın K. Toplum yapay zeka ile dental tanı konmasına hazır mı? [Is society ready for dental diagnosis with artificial intelligence?]. Selcuk Dental Journal. 2022;9:200–7. https://doi.org/10.15311/selcukdentj.915522.
9. Keiser-Nielsen S. Fédération Dentaire Internationale two-digit system of designating teeth. Int Dent J. 1971;21:104–6.
10. Tzutalin. LabelImg. GitHub. https://github.com/tzutalin/labelImg (2015). Accessed 20 Oct 2021.
11. Bochkovskiy A, Wang C, Liao HM. YOLOv4: optimal speed and accuracy of object detection. arXiv:2004.10934 (2020). Accessed 23 Apr 2020.
12. Tuzoff DV, Tuzova LN, Bornstein MM, Krasnov AS, Kharchenko MA, Nikolenko SI, Sveshnikov MM, Bednenko GB. Tooth detection and numbering in panoramic radiographs using convolutional neural networks. Dentomaxillofacial Radiology. 2019;48(4):20180051.
13. Mahdi FP, Yagi N, Kobashi S. Automatic teeth recognition in dental X-ray images using transfer learning based Faster R-CNN. In: 2020 IEEE 50th International Symposium on Multiple-Valued Logic (ISMVL). IEEE; 2020. pp. 16–21.
14. Muramatsu C, Morishita T, Takahashi R, Hayashi T, Nishiyama W, Ariji Y, Zhou X, Hara T, Katsumata A, Ariji E, et al. Tooth detection and classification on panoramic radiographs for automatic dental chart filing: improved classification by multi-sized input data. Oral Radiol. 2021;37(1):13–9.
15. Kim C, Kim D, Jeong H, Yoon S-J, Youm S. Automatic tooth detection and numbering using a combination of a CNN and heuristic algorithm. Appl Sci. 2020;10(16):5624.
16. Muresan MP, Barbura AR, Nedevschi S. Teeth detection and dental problem classification in panoramic X-ray images using deep learning and image processing techniques. In: 2020 IEEE 16th International Conference on Intelligent Computer Communication and Processing (ICCP). IEEE; 2020. pp. 457–63.
17. Cho J, Lee K, Shin E, Choy G, Do S. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv preprint arXiv:1511.06348 (2015).
18. Yüksel AE, Gültekin S, Simsar E, Özdemir ŞD, Gündoğar M, Tokgöz SB, Hamamcı İE. Dental enumeration and multiple treatment detection on panoramic X-rays using deep learning. Sci Rep. 2021;11(1):1–10.

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.