A comprehensive simulation study on classification of RNA-Seq data

dc.contributor.authors	Zararsiz, Gokmen; Goksuluk, Dincer; Korkmaz, Selcuk; Eldem, Vahap; Zararsiz, Gozde Erturk; Duru, Izzet Parug; Ozturk, Ahmet
dc.date.accessioned	2022-03-14T08:29:20Z
dc.date.available	2022-03-14T08:29:20Z
dc.date.issued	2017-08-23
dc.description.abstract	RNA sequencing (RNA-Seq) is a powerful technique for the gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Developing gene-expression-based classification algorithms is an emerging powerful method for diagnosis, disease classification and monitoring at the molecular level, as well as for providing potential markers of diseases. Most of the statistical methods proposed for the classification of gene-expression data are either based on a continuous scale (e.g., microarray data) or require a normal distribution assumption. Hence, these methods cannot be directly applied to RNA-Seq data, since such data violate both the data-structure and distributional assumptions. However, it is possible to apply these algorithms to RNA-Seq data with appropriate modifications. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to bring the data closer to microarrays and apply microarray-based classifiers. In this study, we compared several classifiers including PLDA with and without power transformation, NBLDA, single SVM, bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters such as overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method on model performance. A comprehensive simulation study was conducted and the results were compared with the results of two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size and differential-expression rate, and decreasing the dispersion parameter and number of groups, lead to an increase in classification accuracy. Similar to differential-expression studies, the classification of RNA-Seq data requires careful attention when handling data overdispersion. We conclude that, as a count-based classifier, the power-transformed PLDA and, as microarray-based classifiers, vst- or rlog-transformed RF and SVM classifiers may be good choices for classification. An R/Bioconductor package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html.
dc.identifier.doi	10.1371/journal.pone.0182507
dc.identifier.issn	1932-6203
dc.identifier.pubmed	28832679
dc.identifier.uri	https://hdl.handle.net/11424/241868
dc.identifier.wos	WOS:000408355800026
dc.language.iso	eng
dc.publisher	PUBLIC LIBRARY SCIENCE
dc.relation.ispartof	PLOS ONE
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	BIOCONDUCTOR PACKAGE
dc.subject	MODELS
dc.title	A comprehensive simulation study on classification of RNA-Seq data
dc.type	article
dspace.entity.type	Publication
local.avesis.id	f908dbc6-759a-4112-85b3-58bf9d263e6d
local.import.package	SS16
local.indexed.at	WOS
local.indexed.at	SCOPUS
local.indexed.at	PUBMED
local.journal.articlenumber	e0182507
local.journal.numberofpages	19
local.journal.quartile	Q1
oaire.citation.issue	8
oaire.citation.title	PLOS ONE
oaire.citation.volume	12

Files

Original bundle

Name: Zararsız et al. - 2017 - A comprehensive simulation study on classification.pdf
Size: 5.99 MB
Format: Adobe Portable Document Format

Collections

Research Outputs
