Source: https://github.com/markziemann/GeneNameErrors2020
View the reports: http://ziemann-lab.net/public/gene_name_errors/
Gene name errors result when data are imported improperly into MS Excel and other spreadsheet programs (Zeeberg et al, 2004). Certain gene names like MARCH3, SEPT2 and DEC1 are converted into date format. These errors are surprisingly common in supplementary data files in the field of genomics (Ziemann et al, 2016). This could be considered a minor problem because it affects only a small number of genes; however, it is symptomatic of poor data processing methods. The purpose of this script is to identify gene name errors present in supplementary files of PubMed Central articles in the previous month.
library("XML")
library("jsonlite")
library("xml2")
library("reutils")
library("readxl")
library("RCurl")
Here I get the PubMed Central IDs for the previous month.
Start by working out which month to search in PubMed Central.
CURRENT_MONTH=format(Sys.time(), "%m")
CURRENT_YEAR=format(Sys.time(), "%Y")
if (CURRENT_MONTH == "01") {
  PREV_YEAR=as.character(as.numeric(format(Sys.time(), "%Y"))-1)
  PREV_MONTH="12"
} else {
  PREV_YEAR=CURRENT_YEAR
  PREV_MONTH=as.character(as.numeric(format(Sys.time(), "%m"))-1)
}
DATE=paste(PREV_YEAR,"/",PREV_MONTH,sep="")
DATE
## [1] "2022/11"
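The same month arithmetic can also be done on Date objects, which zero-pads the month and gives the true last day of the month instead of a fixed "/31". This is only a sketch: `prev_month_range` is a hypothetical helper that is not part of the original script, and the source does not say whether the search needs padded months or exact end dates.

```r
# Hypothetical helper (not in the original script): returns the first and
# last day of the month before `today` as "YYYY/MM/DD" strings.
prev_month_range <- function(today = Sys.Date()) {
  first_this <- as.Date(format(today, "%Y-%m-01"))      # first day of this month
  last_prev  <- first_this - 1                          # last day of previous month
  first_prev <- as.Date(format(last_prev, "%Y-%m-01"))  # first day of previous month
  c(mindate = format(first_prev, "%Y/%m/%d"),
    maxdate = format(last_prev,  "%Y/%m/%d"))
}

prev_month_range(as.Date("2022-12-15"))
prev_month_range(as.Date("2023-01-10"))
```

The two elements could be passed as `mindate` and `maxdate` in place of the pasted strings, if exact dates turn out to be preferable.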
Let’s see how many PMC IDs we have in the past month.
QUERY ='((genom*[Abstract]))'
ESEARCH_RES <- esearch(term=QUERY, db = "pmc", rettype = "uilist", retmode = "xml", retstart = 0,
  retmax = 5000000, usehistory = TRUE, webenv = NULL, querykey = NULL, sort = NULL, field = NULL,
  datetype = NULL, reldate = NULL,
  mindate = paste(DATE,"/1",sep="") , maxdate = paste(DATE,"/31",sep=""))
pmc <- efetch(ESEARCH_RES,retmode="text",rettype="uilist",outfile="pmcids.txt")
## Retrieving UIDs 1 to 500
## Retrieving UIDs 501 to 1000
## Retrieving UIDs 1001 to 1500
## Retrieving UIDs 1501 to 2000
## Retrieving UIDs 2001 to 2500
## Retrieving UIDs 2501 to 3000
## Retrieving UIDs 3001 to 3500
pmc <- read.table(pmc)
pmc <- paste("PMC",pmc$V1,sep="")
NUM_ARTICLES=length(pmc)
NUM_ARTICLES
## [1] 3209
writeLines(pmc,con="pmc.txt")
Now run the bash script. Note that false positives can occur (~1.5%) and these results have not been verified by a human.
Here are some definitions:
NUM_XLS = Number of supplementary Excel files in this set of PMC articles.
NUM_XLS_ARTICLES = Number of articles matching the PubMed Central search which have supplementary Excel files.
GENELISTS = The gene lists found in the Excel files. Each Excel file is counted once even if it has multiple gene lists.
NUM_GENELISTS = The number of Excel files with gene lists.
NUM_GENELIST_ARTICLES = The number of PMC articles with supplementary Excel gene lists.
ERROR_GENELISTS = Files suspected to contain gene name errors. The dates and five-digit numbers indicate transmogrified gene names.
NUM_ERROR_GENELISTS = Number of Excel gene lists with errors.
NUM_ERROR_GENELIST_ARTICLES = Number of articles with supplementary Excel gene name errors.
ERROR_PROPORTION = This is the proportion of articles with Excel gene lists that have errors.
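These definitions rest on the space-separated layout of each results line. The sketch below splits one line, copied from the output of this report, into its parts. The field names are hypothetical and the layout (article ID, file path, species, error count, then the offending values) is inferred from the printed output rather than from the bash script itself.

```r
# One line copied from the report's output; fields are space-separated.
line <- "PMC9687456 zip/Table_S5.xlsx Ggallus 2 44258 44256"
fields <- strsplit(line, " ")[[1]]

# Hypothetical field names, inferred from the printed output.
parsed <- list(
  pmcid    = fields[1],              # PubMed Central article ID
  file     = fields[2],              # path of the supplementary Excel file
  species  = fields[3],              # species assigned to the gene list
  n_errors = as.numeric(fields[4]),  # number of suspected errors
  errors   = fields[-(1:4)]          # the date-like values themselves
)

# The reported count should match the number of values that follow it.
parsed$n_errors == length(parsed$errors)
```

Under this layout, a line with more than two fields carries a gene list and a line with more than three fields carries suspected errors, which is what the length tests in the code rely on.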
system("./gene_names.sh pmc.txt")
results <- readLines("results.txt")
XLS <- results[grep("XLS",results,ignore.case=TRUE)]
NUM_XLS = length(XLS)
NUM_XLS
## [1] 4550
NUM_XLS_ARTICLES = length(unique(sapply(strsplit(XLS," "),"[[",1)))
NUM_XLS_ARTICLES
## [1] 661
GENELISTS <- XLS[lapply(strsplit(XLS," "),length)>2]
#GENELISTS
NUM_GENELISTS <- length(unique(sapply(strsplit(GENELISTS," "),"[[",2)))
NUM_GENELISTS
## [1] 438
NUM_GENELIST_ARTICLES <- length(unique(sapply(strsplit(GENELISTS," "),"[[",1)))
NUM_GENELIST_ARTICLES
## [1] 242
ERROR_GENELISTS <- XLS[lapply(strsplit(XLS," "),length)>3]
#ERROR_GENELISTS
NUM_ERROR_GENELISTS = length(ERROR_GENELISTS)
NUM_ERROR_GENELISTS
## [1] 177
GENELIST_ERROR_ARTICLES <- unique(sapply(strsplit(ERROR_GENELISTS," "),"[[",1))
GENELIST_ERROR_ARTICLES
## [1] "PMC9702351" "PMC9636422" "PMC9701233" "PMC9692985" "PMC9686236"
## [6] "PMC9681662" "PMC9687456" "PMC9684208" "PMC9684078" "PMC9675774"
## [11] "PMC9670201" "PMC9635827" "PMC9664452" "PMC9652368" "PMC9647041"
## [16] "PMC9642970" "PMC9641937" "PMC9640562" "PMC9638865" "PMC9651605"
## [21] "PMC9636578" "PMC9636173" "PMC9636038" "PMC9635165" "PMC9634822"
## [26] "PMC9631646" "PMC9632743" "PMC9632285" "PMC9630580" "PMC9630140"
## [31] "PMC9618559" "PMC9616240" "PMC9614465" "PMC9703440" "PMC9699749"
## [36] "PMC9632359" "PMC9700556" "PMC9692998" "PMC9691957" "PMC9687483"
## [41] "PMC9679647" "PMC9678939" "PMC9641807" "PMC9674023" "PMC9672303"
## [46] "PMC9671924" "PMC9671801" "PMC9668736" "PMC9668334" "PMC9664004"
## [51] "PMC9655846" "PMC9649434" "PMC9633874" "PMC9627672" "PMC9614582"
## [56] "PMC9649441" "PMC9639328" "PMC9638908" "PMC9638897" "PMC9635648"
## [61] "PMC9633298" "PMC9631240" "PMC9636699" "PMC9636136" "PMC9634817"
## [66] "PMC9634371" "PMC9634000" "PMC9633647" "PMC9633279" "PMC9633064"
## [71] "PMC9632086" "PMC9622801"
NUM_ERROR_GENELIST_ARTICLES <- length(GENELIST_ERROR_ARTICLES)
NUM_ERROR_GENELIST_ARTICLES
## [1] 72
ERROR_PROPORTION = NUM_ERROR_GENELIST_ARTICLES / NUM_GENELIST_ARTICLES
ERROR_PROPORTION
## [1] 0.2975207
Here you can have a look at all the gene lists detected in the past month, as well as those with errors. The dates are obvious errors; they are commonly dates in September, March, December and October. The five-digit numbers are dates in Excel's internal format, which stores a date as a serial number of days counted from the start of 1900. If you were to put these numbers into Excel and format the cells as dates, most of them would also map to dates in September, March, December and October.
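The conversion can be checked in R without opening Excel. A minimal sketch, assuming the files use Excel's 1900 date system, for which `as.Date()` with `origin = "1899-12-30"` is the usual conversion. The serial numbers are copied from the output of this report.

```r
# Serial numbers copied from the ERROR_GENELISTS output in this report.
serials <- c(44621, 44807, 44896, 36951)

# Origin "1899-12-30" is the usual choice for Excel's 1900 date system;
# it absorbs Excel's treatment of 1900 as a leap year.
dates <- as.Date(serials, origin = "1899-12-30")
format(dates, "%Y-%m-%d")
# "2022-03-01" "2022-09-03" "2022-12-01" "2001-03-01"

# Tabulate by month number to see which months dominate.
table(format(dates, "%m"))
```

The four values land on 1 March, 3 September, 1 December and 1 March, which is consistent with mangled gene names such as MARCH1, SEPT3 and DEC1.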
#GENELISTS
ERROR_GENELISTS
## [1] "PMC9702351 PMC_DL/PMC9702351/supplementaryfiles/Table_1.XLSX Rnorvegicus 25 44807 44622 44809 44810 44818 44630 44814 44623 44625 44621 44808 44816 44624 44812 44806 44805 44622 44813 44631 44621 44815 44626 44628 44627 44811"
## [2] "PMC9636422 PMC_DL/PMC9636422/supplementaryfiles/12276_2022_845_MOESM2_ESM.xls Hsapiens 22 43715 43710 43531 43717 43525 43530 43529 43711 43532 43714 43716 43526 43719 43712 43718 43527 43709 43526 43720 43534 43525 43722"
## [3] "PMC9636422 PMC_DL/PMC9636422/supplementaryfiles/12276_2022_845_MOESM2_ESM.xls Hsapiens 21 43718 43709 43526 43720 43534 43529 43527 43716 43531 43712 43710 43532 43714 43717 43719 43715 43711 43525 43530 43526 43525"
## [4] "PMC9636422 PMC_DL/PMC9636422/supplementaryfiles/12276_2022_845_MOESM2_ESM.xls Hsapiens 18 43709 43526 43716 43531 43532 43710 43714 43717 43715 43712 43529 43711 43525 43530 43719 43527 43718 43526"
## [5] "PMC9636422 PMC_DL/PMC9636422/supplementaryfiles/12276_2022_845_MOESM2_ESM.xls Hsapiens 17 43712 43709 43714 43527 43532 43717 43530 43525 43711 43715 43710 43719 43716 43531 43529 43526 43718"
## [6] "PMC9636422 PMC_DL/PMC9636422/supplementaryfiles/12276_2022_845_MOESM2_ESM.xls Hsapiens 17 43712 43714 43525 43530 43711 43715 43719 43532 43717 43710 43531 43716 43718 43527 43529 43526 43709"
## [7] "PMC9636422 PMC_DL/PMC9636422/supplementaryfiles/12276_2022_845_MOESM2_ESM.xls Hsapiens 17 43712 43527 43714 43530 43719 43525 43711 43715 43717 43532 43710 43529 43531 43716 43718 43709 43526"
## [8] "PMC9636422 PMC_DL/PMC9636422/supplementaryfiles/12276_2022_845_MOESM2_ESM.xls Hsapiens 17 43709 43526 43716 43531 43532 43710 43714 43717 43715 43712 43529 43711 43525 43530 43719 43527 43718"
## [9] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM4_ESM.xlsx Hsapiens 1 39326"
## [10] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM4_ESM.xlsx Hsapiens 1 39326"
## [11] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 1 40787"
## [12] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 3 38231 38961 37500"
## [13] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 2 39326 37500"
## [14] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 3 38961 37865 37500"
## [15] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 5 37865 40422 38961 37500 39326"
## [16] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 2 38961 39326"
## [17] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 2 40422 38961"
## [18] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 1 40787"
## [19] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 1 40787"
## [20] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 2 39326 38961"
## [21] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 1 39326"
## [22] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 1 40787"
## [23] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 1 38961"
## [24] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 2 38961 40787"
## [25] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 2 38961 40787"
## [26] "PMC9701233 PMC_DL/PMC9701233/supplementaryfiles/42003_2022_4264_MOESM3_ESM.xlsx Hsapiens 1 38961"
## [27] "PMC9692985 zip/microorganisms-1974933-supplementary2/microorganisms-1974933-supplementary/Supporting_information/File_S3.xlsx Drerio 2 44081 44081"
## [28] "PMC9692985 zip/microorganisms-1974933-supplementary2/microorganisms-1974933-supplementary/Supporting_information/File_S3.xlsx Drerio 1 43895"
## [29] "PMC9692985 zip/microorganisms-1974933-supplementary2/microorganisms-1974933-supplementary/Supporting_information/File_S3.xlsx Drerio 1 43895"
## [30] "PMC9692985 zip/microorganisms-1974933-supplementary2/microorganisms-1974933-supplementary/Supporting_information/File_S3.xlsx Drerio 1 43898"
## [31] "PMC9692985 zip/microorganisms-1974933-supplementary2/microorganisms-1974933-supplementary/Supporting_information/File_S5.xlsx Drerio 1 44625"
## [32] "PMC9692985 zip/microorganisms-1974933-supplementary2/microorganisms-1974933-supplementary/Supporting_information/File_S5.xlsx Drerio 1 44625"
## [33] "PMC9686236 PMC_DL/PMC9686236/supplementaryfiles/41598_2022_24611_MOESM1_ESM.xlsx Hsapiens 18 44806 44622 44621 44816 44630 44814 44807 44815 44623 44812 44628 44626 44622 44813 44896 44811 44629 44810"
## [34] "PMC9686236 PMC_DL/PMC9686236/supplementaryfiles/41598_2022_24611_MOESM1_ESM.xlsx Hsapiens 12 44621 44622 44627 44628 44629 44805 44814 44806 44808 44809 44811 44813"
## [35] "PMC9681662 PMC_DL/PMC9681662/supplementaryfiles/mmc3.xlsx Hsapiens 3 40057 39326 37500"
## [36] "PMC9687456 zip/Table_S3.xlsx Hsapiens 1 44256"
## [37] "PMC9687456 zip/Table_S4.xlsx Hsapiens 7 44258 44265 44264 44262 44265 44264 44256"
## [38] "PMC9687456 zip/Table_S5.xlsx Ggallus 2 44258 44256"
## [39] "PMC9684208 PMC_DL/PMC9684208/supplementaryfiles/Table_4.XLSX Hsapiens 29 44450 44450 44450 44450 44450 44263 44263 44446 44258 44449 44448 44257 44262 44262 44262 44262 44263 44263 44446 44446 44446 44447 44447 44261 44447 44443 44443 44445 44443"
## [40] "PMC9684078 PMC_DL/PMC9684078/supplementaryfiles/41586_2022_5311_MOESM4_ESM.xlsx Hsapiens 17 2002-03-01 2002-03-01 2007-09-01 2006-09-01 2007-03-01 2011-09-01 2009-03-01 2006-03-01 2008-09-01 2008-03-01 2002-09-01 2003-03-01 2009-09-01 2005-09-01 2001-03-01 2010-09-01 2005-03-01"
## [41] "PMC9684078 PMC_DL/PMC9684078/supplementaryfiles/41586_2022_5311_MOESM5_ESM.xlsx Hsapiens 60 2001-03-01 2001-03-01 2001-03-01 2007-03-01 2002-09-01 2002-09-01 2002-09-01 2002-09-01 2011-09-01 2011-09-01 2011-09-01 2011-09-01 2011-09-01 2011-09-01 2011-09-01 2011-09-01 2011-09-01 2011-09-01 2011-09-01 2006-03-01 2006-03-01 2006-03-01 2006-03-01 2007-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2009-09-01 2002-03-01"
## [42] "PMC9675774 PMC_DL/PMC9675774/supplementaryfiles/42003_2022_4237_MOESM9_ESM.xlsx Ggallus 3 37316 38961 40057"
## [43] "PMC9670201 PMC_DL/PMC9670201/supplementaryfiles/EMBJ-41-e108040-s005.xlsx Hsapiens 3 43891 44166 43892"
## [44] "PMC9635827 zip/sciadv.abq7598_data_s1.xlsx Hsapiens 8 44262 44441 44445 44446 44447 44448 44450 44454"
## [45] "PMC9635827 zip/sciadv.abq7598_data_s1.xlsx Hsapiens 8 44262 44441 44445 44446 44447 44448 44450 44454"
## [46] "PMC9635827 zip/sciadv.abq7598_data_s4.xlsx Hsapiens 3 44443 44258 44400"
## [47] "PMC9664452 PMC_DL/PMC9664452/supplementaryfiles/NIHMS1845179-supplement-2.xlsx Hsapiens 1 44446"
## [48] "PMC9652368 zip/Table_S4_HCMV_lncRNA_interactome.xlsx Hsapiens 1 44806"
## [49] "PMC9647041 PMC_DL/PMC9647041/supplementaryfiles/mmc7.xlsx Hsapiens 24 37681 37500 42248 40057 38961 40422 41883 38047 39873 38231 37316 39142 36951 40238 39508 38596 39326 38412 37865 40787 38777 37135 37226 40603"
## [50] "PMC9647041 PMC_DL/PMC9647041/supplementaryfiles/mmc2.xlsx Hsapiens 25 38412 36951 39326 40422 42248 38961 37681 41883 40787 37316 39142 38596 39873 40057 39508 38231 37500 38047 38777 40238 37865 37226 40603 37135 41153"
## [51] "PMC9642970 PMC_DL/PMC9642970/supplementaryfiles/Table_3.xlsx Rnorvegicus 11 44625 44624 44623 44627 44629 44630 44626 44628 44621 44622 44631"
## [52] "PMC9641937 PMC_DL/PMC9641937/supplementaryfiles/12915_2022_1449_MOESM1_ESM.xlsx Hsapiens 1 44257"
## [53] "PMC9641937 PMC_DL/PMC9641937/supplementaryfiles/12915_2022_1449_MOESM1_ESM.xlsx Hsapiens 1 44813"
## [54] "PMC9640562 PMC_DL/PMC9640562/supplementaryfiles/41523_2022_492_MOESM3_ESM.xlsx Hsapiens 26 44811 44815 44623 44808 44819 44630 44809 44810 44807 44896 44813 44628 44626 44818 44622 44627 44805 44816 44629 44624 44812 44806 44814 44621 44631 44625"
## [55] "PMC9640562 PMC_DL/PMC9640562/supplementaryfiles/41523_2022_492_MOESM3_ESM.xlsx Hsapiens 3 44809 44813 44810"
## [56] "PMC9638865 PMC_DL/PMC9638865/supplementaryfiles/EMBR-23-e54061-s001.xlsx Hsapiens 14 37865 38231 39326 38961 40787 41153 41883 39692 37500 37135 40057 38596 40422 37316"
## [57] "PMC9638865 PMC_DL/PMC9638865/supplementaryfiles/EMBR-23-e54061-s001.xlsx Hsapiens 14 37865 38231 39326 38961 40787 41153 41883 39692 37500 37135 40057 38596 40422 37316"
## [58] "PMC9638865 PMC_DL/PMC9638865/supplementaryfiles/EMBR-23-e54061-s001.xlsx Hsapiens 14 37865 38231 39326 38961 40787 41153 41883 39692 37500 37135 40057 38596 40422 37316"
## [59] "PMC9638865 PMC_DL/PMC9638865/supplementaryfiles/EMBR-23-e54061-s001.xlsx Hsapiens 14 37865 38231 39326 38961 40787 41153 41883 39692 37500 37135 40057 38596 40422 37316"
## [60] "PMC9638865 PMC_DL/PMC9638865/supplementaryfiles/EMBR-23-e54061-s001.xlsx Hsapiens 14 37865 38231 39326 38961 40787 41153 41883 39692 37500 37135 40057 38596 40422 37316"
## [61] "PMC9651605 PMC_DL/PMC9651605/supplementaryfiles/MGG3-10-e2037-s002.xlsx Hsapiens 1 14977"
## [62] "PMC9636578 PMC_DL/PMC9636578/supplementaryfiles/mmc5.xlsx Hsapiens 26 44806 44631 44805 44818 44819 44810 44628 44812 44814 44808 44896 44625 44621 44629 44813 44816 44809 44623 44815 44624 44630 44627 44626 44811 44622 44807"
## [63] "PMC9636173 PMC_DL/PMC9636173/supplementaryfiles/41467_2022_34427_MOESM4_ESM.xlsx Hsapiens 5 44626 44623 44625 44629 44627"
## [64] "PMC9636038 PMC_DL/PMC9636038/supplementaryfiles/mmc5.xlsx Hsapiens 28 40422 37865 38412 41153 37316 39142 37500 38961 39508 39326 38777 37226 39692 36951 39873 37316 38231 42248 38047 40787 37681 41883 37135 40057 40603 38596 36951 40238"
## [65] "PMC9636038 PMC_DL/PMC9636038/supplementaryfiles/mmc3.xlsx Hsapiens 14 38596 36951 38047 38777 37226 40238 40057 39142 37681 37226 40238 36951 40057 36951"
## [66] "PMC9635165 PMC_DL/PMC9635165/supplementaryfiles/13148_2022_1347_MOESM4_ESM.xlsx Hsapiens 5 37500 40057 40422 41153 37865"
## [67] "PMC9634822 PMC_DL/PMC9634822/supplementaryfiles/elife-79570-fig5-data2.xlsx Scerevisiae 1 1-Oct"
## [68] "PMC9631646 PMC_DL/PMC9631646/supplementaryfiles/advancesADV2022007250-suppl5.xlsx Hsapiens 3 44810 44813 44809"
## [69] "PMC9632743 zip/Table_3.xlsx Hsapiens 19 44257 44256 44443 44531 44256 44257 44445 44449 44446 44260 44262 44263 44450 44258 44261 44454 44441 44448 44447"
## [70] "PMC9632743 zip/Table_3.xlsx Hsapiens 26 44257 44256 44256 44443 44260 44449 44259 44450 44441 44531 44265 44454 44446 44261 44451 44262 44440 44442 44257 44263 44266 44448 44264 44445 44258 44447"
## [71] "PMC9632743 zip/Table_3.xlsx Hsapiens 26 44257 44443 44256 44256 44260 44451 44440 44531 44450 44449 44265 44441 44259 44262 44442 44446 44266 44261 44257 44263 44448 44264 44454 44445 44258 44447"
## [72] "PMC9632743 zip/Table_3.xlsx Hsapiens 26 44257 44443 44440 44256 44266 44256 44449 44450 44257 44531 44451 44263 44454 44446 44442 44445 44259 44260 44262 44448 44261 44441 44264 44447 44265 44258"
## [73] "PMC9632743 zip/Table_3.xlsx Hsapiens 26 44257 44443 44440 44256 44266 44256 44449 44450 44257 44531 44451 44263 44454 44446 44442 44445 44259 44260 44262 44448 44261 44441 44264 44447 44265 44258"
## [74] "PMC9632285 PMC_DL/PMC9632285/supplementaryfiles/Table6.XLS Hsapiens 1 2022/09/07"
## [75] "PMC9630580 zip/Supplementary_Files/Supplementary_Table_1.xlsx Hsapiens 26 44622 44621 44631 44816 44630 44621 44813 44808 44623 44629 44625 44810 44815 44626 44805 44819 44896 44811 44806 44624 44814 44627 44628 44812 44807 44622"
## [76] "PMC9630580 zip/Supplementary_Files/Supplementary_Table_1.xlsx Hsapiens 19 44813 44811 44896 44812 44622 44819 44626 44621 44814 44806 44815 44808 44623 44628 44621 44810 44625 44627 44622"
## [77] "PMC9630580 zip/Supplementary_Files/Supplementary_Table_1.xlsx Hsapiens 15 44626 44812 44629 44810 44815 44813 44808 44811 44814 44806 44819 44809 44622 44627 44805"
## [78] "PMC9630580 zip/Supplementary_Files/Supplementary_Table_1.xlsx Hsapiens 12 44631 44621 44630 44628 44622 44629 44623 44624 44627 44626 44896 44625"
## [79] "PMC9630140 PMC_DL/PMC9630140/supplementaryfiles/41559_2022_1884_MOESM4_ESM.xlsx Dmelanogaster 1 38231"
## [80] "PMC9630140 PMC_DL/PMC9630140/supplementaryfiles/41559_2022_1884_MOESM4_ESM.xlsx Dmelanogaster 3 38231 38231 38231"
## [81] "PMC9618559 PMC_DL/PMC9618559/supplementaryfiles/41419_2022_5358_MOESM4_ESM.xlsx Hsapiens 3 44256 44256 44446"
## [82] "PMC9618559 PMC_DL/PMC9618559/supplementaryfiles/41419_2022_5358_MOESM4_ESM.xlsx Hsapiens 3 44256 44256 44446"
## [83] "PMC9618559 PMC_DL/PMC9618559/supplementaryfiles/41419_2022_5358_MOESM4_ESM.xlsx Hsapiens 3 44256 44256 44446"
## [84] "PMC9618559 PMC_DL/PMC9618559/supplementaryfiles/41419_2022_5358_MOESM4_ESM.xlsx Hsapiens 3 44256 44256 44446"
## [85] "PMC9616240 PMC_DL/PMC9616240/supplementaryfiles/mmc2.xlsx Hsapiens 16 37865 37012 37135 38108 41153 37500 38231 40422 40787 42248 39692 39326 40057 38596 37226 41883"
## [86] "PMC9616240 PMC_DL/PMC9616240/supplementaryfiles/mmc2.xlsx Hsapiens 16 37865 37012 38231 39326 37226 38596 41153 40787 38108 40057 39692 41883 37135 42248 40422 37500"
## [87] "PMC9614465 zip/Supplemental_Table_S3.xlsx Athaliana 1 44288"
## [88] "PMC9614465 zip/Supplemental_Table_S3.xlsx Athaliana 1 44288"
## [89] "PMC9614465 zip/Supplemental_Table_S3.xlsx Athaliana 1 44288"
## [90] "PMC9703440 PMC_DL/PMC9703440/supplementaryfiles/41467_2022_34969_MOESM4_ESM.xlsx Mmusculus 1 44621"
## [91] "PMC9699749 PMC_DL/PMC9699749/supplementaryfiles/aging-14-204369-s005.xlsx Hsapiens 1 44625"
## [92] "PMC9632359 PMC_DL/PMC9632359/supplementaryfiles/supp_mcs.a006218_Supplemental_Tables.xlsx Hsapiens 31 43161 43346 43347 43161 43350 43349 43166 43354 43168 43355 43163 43160 43165 43357 43351 43167 43345 43435 43169 43162 43344 43358 43170 43352 43348 43160 43353 43164 43352 43167 43344"
## [93] "PMC9700556 PMC_DL/PMC9700556/supplementaryfiles/12672_2022_595_MOESM9_ESM.xlsx Hsapiens 1 44263"
## [94] "PMC9700556 PMC_DL/PMC9700556/supplementaryfiles/12672_2022_595_MOESM9_ESM.xlsx Hsapiens 1 44256"
## [95] "PMC9700556 PMC_DL/PMC9700556/supplementaryfiles/12672_2022_595_MOESM9_ESM.xlsx Hsapiens 1 44263"
## [96] "PMC9692998 zip/Table_S5.xlsx Rnorvegicus 7 44265 44260 44266 44258 44265 44258 44259"
## [97] "PMC9692998 zip/Table_S5.xlsx Mmusculus 4 44265 44258 44266 44259"
## [98] "PMC9691957 zip/Suppl_Table_7.xlsx Ggallus 1 44806"
## [99] "PMC9691957 zip/Suppl_Table_7.xlsx Ggallus 1 44627"
## [100] "PMC9687483 zip/Table_S6.xlsx Hsapiens 3 43891 44081 43895"
## [101] "PMC9679647 PMC_DL/PMC9679647/supplementaryfiles/Table_4.xlsx Hsapiens 10 44810 44810 44813 44813 44813 44810 44813 44810 44813 44810"
## [102] "PMC9678939 PMC_DL/PMC9678939/supplementaryfiles/Table2.XLSX Hsapiens 19 44896 44621 44622 44621 44622 44623 44625 44626 44627 44628 44819 44814 44815 44806 44808 44810 44811 44812 44813"
## [103] "PMC9678939 PMC_DL/PMC9678939/supplementaryfiles/Table2.XLSX Hsapiens 19 44896 44621 44622 44621 44622 44623 44625 44626 44627 44628 44819 44814 44815 44806 44808 44810 44811 44812 44813"
## [104] "PMC9641807 PMC_DL/PMC9641807/supplementaryfiles/Supplementary_Data3.xlsx Ggallus 26 44815 44628 44814 44627 44806 44621 44811 44626 44625 44812 44809 44629 44813 44810 44896 44808 44623 44819 44624 44630 44805 44816 44631 44807 44818 44622"
## [105] "PMC9674023 PMC_DL/PMC9674023/supplementaryfiles/NIHMS1847457-supplement-Table_S1.xlsx Hsapiens 28 44896 44621 44622 44621 44630 44631 44622 44623 44624 44625 44626 44627 44628 44629 44819 44805 44814 44815 44816 44818 44806 44807 44808 44809 44810 44811 44812 44813"
## [106] "PMC9672303 PMC_DL/PMC9672303/supplementaryfiles/41467_2022_33944_MOESM8_ESM.xlsx Hsapiens 24 44449 44256 44446 44445 44263 44447 44261 44454 44448 44260 44257 44450 44441 44262 44440 44257 44264 44258 44443 44444 44442 44256 44266 44453"
## [107] "PMC9672303 PMC_DL/PMC9672303/supplementaryfiles/41467_2022_33944_MOESM10_ESM.xlsx Hsapiens 8 44805 44814 44810 44819 44814 44819 44806 44805"
## [108] "PMC9671924 PMC_DL/PMC9671924/supplementaryfiles/41467_2022_34746_MOESM14_ESM.xlsx Hsapiens 3 1-Mar 2-Mar 6-Sep"
## [109] "PMC9671801 PMC_DL/PMC9671801/supplementaryfiles/41591_2022_2046_MOESM2_ESM.xlsx Hsapiens 2 36951 36951"
## [110] "PMC9671801 PMC_DL/PMC9671801/supplementaryfiles/41591_2022_2046_MOESM2_ESM.xlsx Hsapiens 3 36951 39508 39508"
## [111] "PMC9671801 PMC_DL/PMC9671801/supplementaryfiles/41591_2022_2046_MOESM2_ESM.xlsx Hsapiens 6 36951 39508 39508 39508 39508 39508"
## [112] "PMC9671801 PMC_DL/PMC9671801/supplementaryfiles/41591_2022_2046_MOESM2_ESM.xlsx Hsapiens 6 36951 39508 39508 39508 39508 39508"
## [113] "PMC9671801 PMC_DL/PMC9671801/supplementaryfiles/41591_2022_2046_MOESM2_ESM.xlsx Hsapiens 3 36951 39508 39508"
## [114] "PMC9668736 PMC_DL/PMC9668736/supplementaryfiles/mmc2.xls Hsapiens 8 44631 44629 44625 44623 44630 44627 44622 44626"
## [115] "PMC9668334 PMC_DL/PMC9668334/supplementaryfiles/elife-82459-supp4.xlsx Dmelanogaster 14 38231 38231 38231 38231 38231 38231 38231 38231 38231 38231 38231 38231 38231 38231"
## [116] "PMC9668334 PMC_DL/PMC9668334/supplementaryfiles/elife-82459-supp4.xlsx Dmelanogaster 14 36769 36769 36769 36769 36769 36769 36769 36769 36769 36769 36769 36769 36769 36769"
## [117] "PMC9668334 PMC_DL/PMC9668334/supplementaryfiles/elife-82459-supp3.xlsx Dmelanogaster 1 37500"
## [118] "PMC9668334 PMC_DL/PMC9668334/supplementaryfiles/elife-82459-fig4-data1.xlsx Dmelanogaster 1 38231"
## [119] "PMC9668334 PMC_DL/PMC9668334/supplementaryfiles/elife-82459-fig4-data1.xlsx Dmelanogaster 1 38231"
## [120] "PMC9664004 PMC_DL/PMC9664004/supplementaryfiles/Table_4.xlsx Mmusculus 1 44622"
## [121] "PMC9664004 PMC_DL/PMC9664004/supplementaryfiles/Table_4.xlsx Mmusculus 5 44815 44628 44805 44621 44626"
## [122] "PMC9664004 PMC_DL/PMC9664004/supplementaryfiles/Table_4.xlsx Mmusculus 3 44806 44627 44811"
## [123] "PMC9664004 PMC_DL/PMC9664004/supplementaryfiles/Table_4.xlsx Mmusculus 7 44814 44812 44808 44813 44622 44625 44810"
## [124] "PMC9664004 PMC_DL/PMC9664004/supplementaryfiles/Table_3.xlsx Mmusculus 20 44622 44626 44627 44625 44622 44621 44627 44621 44627 44621 44627 44621 44627 44627 44627 44627 44626 44625 44622 44626"
## [125] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44621"
## [126] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44622"
## [127] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44621"
## [128] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44622"
## [129] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44621"
## [130] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44622"
## [131] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44621"
## [132] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44622"
## [133] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44621"
## [134] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44896"
## [135] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44896"
## [136] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44896"
## [137] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44622"
## [138] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44621"
## [139] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44622"
## [140] "PMC9655846 zip/File_S1.xlsx Hsapiens 1 44896"
## [141] "PMC9649434 PMC_DL/PMC9649434/supplementaryfiles/41588_2022_1192_MOESM4_ESM.xlsx Ggallus 28 38412 38231 36951 37316 41153 37681 39508 37135 40057 37135 37865 38961 37500 39142 40603 38777 36951 42248 37316 38047 39326 40787 37226 39692 41883 39873 40238 40422"
## [142] "PMC9633874 PMC_DL/PMC9633874/supplementaryfiles/mmc2.xlsx Hsapiens 50 44626 44808 44622 44806 44811 44810 44819 44621 44811 44626 44622 44814 44815 44810 44808 44806 44806 44815 44626 44819 44814 44622 44808 44805 44810 44814 44811 44808 44813 44806 44622 44819 44811 44813 44814 44808 44811 44815 44806 44811 44806 44815 44811 44813 44626 44815 44808 44810 44622 44628"
## [143] "PMC9627672 PMC_DL/PMC9627672/supplementaryfiles/EMBJ-41-e110727-s003.xlsx Hsapiens 1 40057"
## [144] "PMC9627672 PMC_DL/PMC9627672/supplementaryfiles/EMBJ-41-e110727-s002.xlsx Hsapiens 24 44263 44261 44451 44443 44262 44257 44453 44445 44448 44442 44450 44258 44441 44449 44260 44446 44531 44259 44266 44264 44454 44447 44444 44265"
## [145] "PMC9614582 PMC_DL/PMC9614582/supplementaryfiles/jamanetwopen-e2238880-s002.xlsx Hsapiens 28 44625 44813 44622 44628 44896 44621 44807 44812 44621 44805 44627 44815 44631 44805 44811 44814 44622 44624 44816 44630 44808 44809 44818 44806 44819 44623 44626 44629"
## [146] "PMC9614582 PMC_DL/PMC9614582/supplementaryfiles/jamanetwopen-e2238880-s002.xlsx Hsapiens 28 44628 44631 44818 44630 44624 44622 44805 44813 44815 44812 44626 44896 44629 44808 44807 44805 44627 44621 44816 44623 44819 44625 44621 44811 44809 44814 44622 44806"
## [147] "PMC9649441 PMC_DL/PMC9649441/supplementaryfiles/41588_2022_1210_MOESM3_ESM.xlsx Mmusculus 1 38412"
## [148] "PMC9649441 PMC_DL/PMC9649441/supplementaryfiles/41588_2022_1210_MOESM3_ESM.xlsx Mmusculus 1 38412"
## [149] "PMC9649441 PMC_DL/PMC9649441/supplementaryfiles/41588_2022_1210_MOESM3_ESM.xlsx Mmusculus 1 38412"
## [150] "PMC9649441 PMC_DL/PMC9649441/supplementaryfiles/41588_2022_1210_MOESM3_ESM.xlsx Mmusculus 1 38412"
## [151] "PMC9649441 PMC_DL/PMC9649441/supplementaryfiles/41588_2022_1210_MOESM3_ESM.xlsx Mmusculus 1 38412"
## [152] "PMC9649441 PMC_DL/PMC9649441/supplementaryfiles/41588_2022_1210_MOESM3_ESM.xlsx Mmusculus 1 38412"
## [153] "PMC9639328 PMC_DL/PMC9639328/supplementaryfiles/40104_2022_779_MOESM11_ESM.xlsx Hsapiens 1 44266"
## [154] "PMC9639328 PMC_DL/PMC9639328/supplementaryfiles/40104_2022_779_MOESM15_ESM.xlsx Hsapiens 2 44256 44444"
## [155] "PMC9638908 zip/Supp_Tables_1-6.xlsx Mmusculus 8 44621 44622 44621 44622 44621 44622 44622 44622"
## [156] "PMC9638897 zip/Supplementary_Table_3-useme.xlsx Hsapiens 8 44442 44444 44256 44450 44448 44260 44256 44262"
## [157] "PMC9638897 zip/Supplementary_Table_1-useme.xlsx Hsapiens 2 44815 44625"
## [158] "PMC9635648 PMC_DL/PMC9635648/supplementaryfiles/jkac227_supplementary_table_s3.xlsx Hsapiens 6 44623 44628 44629 44621 44627 44625"
## [159] "PMC9633298 PMC_DL/PMC9633298/supplementaryfiles/CAS-113-3932-s001.xlsx Hsapiens 8 H3K27me3-marked genes in SCLC cells (Sato et al.)"
## [160] "PMC9631240 PMC_DL/PMC9631240/supplementaryfiles/mmc4.xlsx Ggallus 1 36951"
## [161] "PMC9636699 PMC_DL/PMC9636699/supplementaryfiles/13578_2022_917_MOESM10_ESM.xlsx Hsapiens 3 43892 43891 44166"
## [162] "PMC9636136 PMC_DL/PMC9636136/supplementaryfiles/41467_2022_34163_MOESM3_ESM.xlsx Hsapiens 1 44256"
## [163] "PMC9636136 PMC_DL/PMC9636136/supplementaryfiles/41467_2022_34163_MOESM3_ESM.xlsx Hsapiens 5 44256 44256 44256 44453 44453"
## [164] "PMC9634817 PMC_DL/PMC9634817/supplementaryfiles/Table3.XLSX Hsapiens 1 44442"
## [165] "PMC9634371 PMC_DL/PMC9634371/supplementaryfiles/mmc2.xlsx Hsapiens 1 44256"
## [166] "PMC9634371 PMC_DL/PMC9634371/supplementaryfiles/mmc2.xlsx Hsapiens 5 43894 44076 43891 44166 44166"
## [167] "PMC9634000 PMC_DL/PMC9634000/supplementaryfiles/Table_1.xlsx Hsapiens 1 44623"
## [168] "PMC9634000 PMC_DL/PMC9634000/supplementaryfiles/Table_1.xlsx Hsapiens 1 44623"
## [169] "PMC9634000 PMC_DL/PMC9634000/supplementaryfiles/Table_1.xlsx Hsapiens 1 44623"
## [170] "PMC9633647 PMC_DL/PMC9633647/supplementaryfiles/41598_2022_23149_MOESM5_ESM.xlsx Hsapiens 85 44809 44813 44624 44896 44623 44623 44805 44624 44818 44626 44896 44623 44621 44815 44623 44623 44624 44622 44809 44624 44896 44631 44815 44623 44815 44631 44628 44813 44624 44805 44805 44623 44623 44627 44621 44624 44812 44896 44630 44623 44627 44621 44621 44815 44809 44805 44627 44631 44626 44621 44621 44626 44809 44624 44805 44624 44815 44626 44815 44896 44621 44621 44631 44623 44627 44631 44623 44624 44896 44621 44631 44626 44896 44813 44628 44815 44809 44621 44626 44621 44809 44621 44627 44626 44817"
## [171] "PMC9633279 zip/Revised-Supplementary_tables_.xlsx Hsapiens 3 43897 44082 43901"
## [172] "PMC9633064 PMC_DL/PMC9633064/supplementaryfiles/elife-79525-supp1.xlsx Scerevisiae 4 44470 44470 44470 44470"
## [173] "PMC9633064 PMC_DL/PMC9633064/supplementaryfiles/elife-79525-fig3-data1.xlsx Scerevisiae 4 44470 44470 44470 44470"
## [174] "PMC9633064 PMC_DL/PMC9633064/supplementaryfiles/elife-79525-fig3-data1.xlsx Scerevisiae 1 44470"
## [175] "PMC9632086 zip/Tables-S5-S13.xlsx Hsapiens 14 43891 43893 43892 43898 43894 43892 44166 43901 43896 43897 43895 43900 43891 43899"
## [176] "PMC9632086 zip/Tables-S5-S13.xlsx Hsapiens 14 43901 43893 43891 43892 43894 43896 43898 43892 44166 43897 43895 43899 43900 43891"
## [177] "PMC9622801 PMC_DL/PMC9622801/supplementaryfiles/DataSheet_3.xls Hsapiens 11 2022/03/08 2022/03/06 2022/03/05 2022/03/09 2022/03/01 2022/03/04 2022/03/07 2022/03/03 2022/03/10 2022/03/11 2022/03/02"
Let’s investigate the errors in more detail.
# By species
SPECIES <- sapply(strsplit(ERROR_GENELISTS," "),"[[",3)
table(SPECIES)
## SPECIES
##     Athaliana Dmelanogaster        Drerio       Ggallus      Hsapiens
##             3             7             6             7           133
##     Mmusculus   Rnorvegicus   Scerevisiae
##            14             3             4
par(mar=c(5,12,4,2))
barplot(table(SPECIES),horiz=TRUE,las=1)
par(mar=c(5,5,4,2))
# Number of affected gene lists per paper (a file with several affected lists is counted once per list)
DIST <- table(sapply(strsplit(ERROR_GENELISTS," "),"[[",1))
DIST
##
## PMC9614465 PMC9614582 PMC9616240 PMC9618559 PMC9622801 PMC9627672 PMC9630140
##          3          2          2          4          1          2          2
## PMC9630580 PMC9631240 PMC9631646 PMC9632086 PMC9632285 PMC9632359 PMC9632743
##          4          1          1          2          1          1          5
## PMC9633064 PMC9633279 PMC9633298 PMC9633647 PMC9633874 PMC9634000 PMC9634371
##          3          1          1          1          1          3          2
## PMC9634817 PMC9634822 PMC9635165 PMC9635648 PMC9635827 PMC9636038 PMC9636136
##          1          1          1          1          3          2          2
## PMC9636173 PMC9636422 PMC9636578 PMC9636699 PMC9638865 PMC9638897 PMC9638908
##          1          7          1          1          5          2          1
## PMC9639328 PMC9640562 PMC9641807 PMC9641937 PMC9642970 PMC9647041 PMC9649434
##          2          2          1          2          1          2          1
## PMC9649441 PMC9651605 PMC9652368 PMC9655846 PMC9664004 PMC9664452 PMC9668334
##          6          1          1         16          5          1          5
## PMC9668736 PMC9670201 PMC9671801 PMC9671924 PMC9672303 PMC9674023 PMC9675774
##          1          1          5          1          2          1          1
## PMC9678939 PMC9679647 PMC9681662 PMC9684078 PMC9684208 PMC9686236 PMC9687456
##          2          1          1          2          1          2          3
## PMC9687483 PMC9691957 PMC9692985 PMC9692998 PMC9699749 PMC9700556 PMC9701233
##          1          2          6          2          1          3         18
## PMC9702351 PMC9703440
##          1          1
summary(as.numeric(DIST))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.458 2.250 18.000
hist(DIST,main="Number of affected Excel files per paper")
# PMC Articles with the most errors
DIST_DF <- as.data.frame(DIST)
DIST_DF <- DIST_DF[order(-DIST_DF$Freq),,drop=FALSE]
head(DIST_DF,20)
## Var1 Freq
## 70 PMC9701233 18
## 46 PMC9655846 16
## 30 PMC9636422 7
## 43 PMC9649441 6
## 66 PMC9692985 6
## 14 PMC9632743 5
## 33 PMC9638865 5
## 47 PMC9664004 5
## 49 PMC9668334 5
## 52 PMC9671801 5
## 4 PMC9618559 4
## 8 PMC9630580 4
## 1 PMC9614465 3
## 15 PMC9633064 3
## 20 PMC9634000 3
## 26 PMC9635827 3
## 63 PMC9687456 3
## 69 PMC9700556 3
## 2 PMC9614582 2
## 3 PMC9616240 2
MOST_ERR_FILES = as.character(DIST_DF[1,1])
MOST_ERR_FILES
## [1] "PMC9701233"
# Number of errors per paper
NERR <- as.numeric(sapply(strsplit(ERROR_GENELISTS," "),"[[",4))
names(NERR) <- sapply(strsplit(ERROR_GENELISTS," "),"[[",1)
NERR <-tapply(NERR, names(NERR), sum)
NERR
## PMC9614465 PMC9614582 PMC9616240 PMC9618559 PMC9622801 PMC9627672 PMC9630140
## 3 56 32 12 11 25 4
## PMC9630580 PMC9631240 PMC9631646 PMC9632086 PMC9632285 PMC9632359 PMC9632743
## 72 1 3 28 1 31 123
## PMC9633064 PMC9633279 PMC9633298 PMC9633647 PMC9633874 PMC9634000 PMC9634371
## 9 3 8 85 50 3 6
## PMC9634817 PMC9634822 PMC9635165 PMC9635648 PMC9635827 PMC9636038 PMC9636136
## 1 1 5 6 19 42 6
## PMC9636173 PMC9636422 PMC9636578 PMC9636699 PMC9638865 PMC9638897 PMC9638908
## 5 129 26 3 70 10 8
## PMC9639328 PMC9640562 PMC9641807 PMC9641937 PMC9642970 PMC9647041 PMC9649434
## 3 29 26 2 11 49 28
## PMC9649441 PMC9651605 PMC9652368 PMC9655846 PMC9664004 PMC9664452 PMC9668334
## 6 1 1 16 36 1 31
## PMC9668736 PMC9670201 PMC9671801 PMC9671924 PMC9672303 PMC9674023 PMC9675774
## 8 3 20 3 32 28 3
## PMC9678939 PMC9679647 PMC9681662 PMC9684078 PMC9684208 PMC9686236 PMC9687456
## 38 10 3 77 29 30 10
## PMC9687483 PMC9691957 PMC9692985 PMC9692998 PMC9699749 PMC9700556 PMC9701233
## 3 2 7 11 1 3 32
## PMC9702351 PMC9703440
## 25 1
hist(NERR,main="Number of errors per PMC article")
NERR_DF <- as.data.frame(NERR)
NERR_DF <- NERR_DF[order(-NERR_DF$NERR),,drop=FALSE]
head(NERR_DF,20)
## NERR
## PMC9636422 129
## PMC9632743 123
## PMC9633647 85
## PMC9684078 77
## PMC9630580 72
## PMC9638865 70
## PMC9614582 56
## PMC9633874 50
## PMC9647041 49
## PMC9636038 42
## PMC9678939 38
## PMC9664004 36
## PMC9616240 32
## PMC9672303 32
## PMC9701233 32
## PMC9632359 31
## PMC9668334 31
## PMC9686236 30
## PMC9640562 29
## PMC9684208 29
MOST_ERR = rownames(NERR_DF)[1]
MOST_ERR
## [1] "PMC9636422"
GENELIST_ERROR_ARTICLES <- gsub("PMC","",GENELIST_ERROR_ARTICLES)
# JSON parsing is more reliable than XML
ARTICLES <- esummary( GENELIST_ERROR_ARTICLES , db="pmc" , retmode = "json" )
ARTICLE_DATA <- reutils::content(ARTICLES,as= "parsed")
ARTICLE_DATA <- ARTICLE_DATA$result
ARTICLE_DATA <- ARTICLE_DATA[2:length(ARTICLE_DATA)]
JOURNALS <- unlist(lapply(ARTICLE_DATA,function(x) {x$fulljournalname} ))
JOURNALS_TABLE <- table(JOURNALS)
JOURNALS_TABLE <- JOURNALS_TABLE[order(-JOURNALS_TABLE)]
length(JOURNALS_TABLE)
## [1] 50
NUM_JOURNALS=length(JOURNALS_TABLE)
par(mar=c(5,25,4,2))
barplot(head(JOURNALS_TABLE,10), horiz=TRUE, las=1,
xlab="Articles with gene name errors in supp files",
main="Top journals this month")
Congrats to our Journal of the Month winner!
JOURNAL_WINNER <- names(head(JOURNALS_TABLE,1))
JOURNAL_WINNER
## [1] "Nature Communications"
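Note that `head(JOURNALS_TABLE,1)` returns only the first entry of the sorted table, so if two journals share the top count only one of them is reported. A small sketch on made-up journal labels (not real data) that lists every journal tied for first place:

```r
# Toy journal labels, invented for illustration only
journals <- c("Journal A", "Journal B", "Journal A", "Journal C", "Journal B")
journals_table <- sort(table(journals), decreasing = TRUE)

# All journals whose count equals the maximum, rather than just the first
top_count <- max(journals_table)
winners <- names(journals_table)[as.vector(journals_table) == top_count]
winners
```

Whether ties should share the title is a presentation choice; the sketch only makes them visible.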
There are two categories:
Paper with the most supplementary files affected by gene name errors (MOST_ERR_FILES)
Paper with the most gene names converted to dates (MOST_ERR)
Sometimes, one paper can win both categories. Congrats to our winners.
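The two categories are computed from the same records but need not agree: the first counts affected files per paper, the second sums the per-file error counts. A minimal sketch on invented records (the PMC IDs below are placeholders, not real articles):

```r
# Toy data: one row per affected file
toy <- data.frame(
  pmc   = c("PMC0000001", "PMC0000001", "PMC0000001", "PMC0000002"),
  n_err = c(1, 2, 1, 50),
  stringsAsFactors = FALSE
)

files_per_paper  <- table(toy$pmc)                   # category 1: affected files
errors_per_paper <- tapply(toy$n_err, toy$pmc, sum)  # category 2: total errors

most_err_files <- names(which.max(files_per_paper))
most_err       <- names(which.max(errors_per_paper))
c(most_err_files = most_err_files, most_err = most_err)
```

Here the first paper has more affected files while the second has more converted gene names, which mirrors this month's result where the two winners differ.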
MOST_ERR_FILES <- gsub("PMC","",MOST_ERR_FILES)
ARTICLES <- esummary( MOST_ERR_FILES , db="pmc" , retmode = "json" )
ARTICLE_DATA <- reutils::content(ARTICLES,as= "parsed")
ARTICLE_DATA <- ARTICLE_DATA[2]
ARTICLE_DATA
## $result
## $result$uids
## [1] "9701233"
##
## $result$`9701233`
## $result$`9701233`$uid
## [1] "9701233"
##
## $result$`9701233`$pubdate
## [1] "2022 Nov 27"
##
## $result$`9701233`$epubdate
## [1] "2022 Nov 27"
##
## $result$`9701233`$printpubdate
## [1] ""
##
## $result$`9701233`$source
## [1] "Commun Biol"
##
## $result$`9701233`$authors
## name authtype
## 1 Davis JL Author
## 2 Kennedy C Author
## 3 Clerkin S Author
## 4 Treacy NJ Author
## 5 Dodd T Author
## 6 Moss C Author
## 7 Murphy A Author
## 8 Brazil DP Author
## 9 Cagney G Author
## 10 Brougham DF Author
## 11 Murad R Author
## 12 Finlay D Author
## 13 Vuori K Author
## 14 Crean J Author
##
## $result$`9701233`$title
## [1] "Single-cell multiomics reveals the complexity of TGFβ signalling to chromatin in iPSC-derived kidney organoids"
##
## $result$`9701233`$volume
## [1] "5"
##
## $result$`9701233`$issue
## [1] ""
##
## $result$`9701233`$pages
## [1] "1301"
##
## $result$`9701233`$articleids
## idtype value
## 1 pmid 36435939
## 2 doi 10.1038/s42003-022-04264-1
## 3 pmcid PMC9701233
##
## $result$`9701233`$fulljournalname
## [1] "Communications Biology"
##
## $result$`9701233`$sortdate
## [1] "2022/11/27 00:00"
##
## $result$`9701233`$pmclivedate
## [1] "2022/11/28"
MOST_ERR <- gsub("PMC","",MOST_ERR)
ARTICLE_DATA <- esummary(MOST_ERR,db = "pmc" , retmode = "json" )
ARTICLE_DATA <- reutils::content(ARTICLE_DATA,as= "parsed")
ARTICLE_DATA
## $header
## $header$type
## [1] "esummary"
##
## $header$version
## [1] "0.3"
##
##
## $result
## $result$uids
## [1] "9636422"
##
## $result$`9636422`
## $result$`9636422`$uid
## [1] "9636422"
##
## $result$`9636422`$pubdate
## [1] "2022 Oct 6"
##
## $result$`9636422`$epubdate
## [1] "2022 Oct 6"
##
## $result$`9636422`$printpubdate
## [1] ""
##
## $result$`9636422`$source
## [1] "Exp Mol Med"
##
## $result$`9636422`$authors
## name authtype
## 1 Ghalali A Author
## 2 Wang L Author
## 3 Stopsack KH Author
## 4 Rice JM Author
## 5 Wu S Author
## 6 Wu CL Author
## 7 Zetter BR Author
## 8 Rogers MS Author
##
## $result$`9636422`$title
## [1] "AZIN1 RNA editing alters protein interactions, leading to nuclear translocation and worse outcomes in prostate cancer"
##
## $result$`9636422`$volume
## [1] "54"
##
## $result$`9636422`$issue
## [1] "10"
##
## $result$`9636422`$pages
## [1] "1713-1726"
##
## $result$`9636422`$articleids
## idtype value
## 1 pmid 36202978
## 2 doi 10.1038/s12276-022-00845-6
## 3 pmcid PMC9636422
##
## $result$`9636422`$fulljournalname
## [1] "Experimental & Molecular Medicine"
##
## $result$`9636422`$sortdate
## [1] "2022/10/06 00:00"
##
## $result$`9636422`$pmclivedate
## [1] "2022/11/28"
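The full esummary record is verbose. As a sketch, assuming the parsed structure printed above (a `result` list keyed by UID, with `articleids` as a data frame of `idtype` and `value`), the title, journal and DOI can be pulled out with a small helper. `summarise_article` is a hypothetical name, and the list below is a hand-built mock trimmed from the output above rather than a live query.

```r
# Mock of the parsed esummary structure, trimmed to a few fields
article_data <- list(
  result = list(
    uids = "9636422",
    `9636422` = list(
      title = "AZIN1 RNA editing alters protein interactions, leading to nuclear translocation and worse outcomes in prostate cancer",
      fulljournalname = "Experimental & Molecular Medicine",
      articleids = data.frame(
        idtype = c("pmid", "doi", "pmcid"),
        value  = c("36202978", "10.1038/s12276-022-00845-6", "PMC9636422"),
        stringsAsFactors = FALSE
      )
    )
  )
)

# Hypothetical helper: extract title, journal and DOI for one UID
summarise_article <- function(x, uid) {
  rec <- x$result[[uid]]
  ids <- rec$articleids
  c(title   = rec$title,
    journal = rec$fulljournalname,
    doi     = ids$value[ids$idtype == "doi"])
}

res <- summarise_article(article_data, "9636422")
res
```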
Next, plot the trend over the past 6 to 12 months, using the earlier monthly reports published online.
url <- "https://ziemann-lab.net/public/gene_name_errors/"
doc <- htmlParse(url)
links <- xpathSApply(doc, "//a/@href")
links <- links[grep("html",links)]
listing <- htmlParse( getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE) )
listing <- xpathSApply(listing, "//a/@href")
listing <- listing[grep("html",listing)]
unlink("online_files/",recursive=TRUE)
dir.create("online_files")
sapply(listing, function(mylink) {
download.file(paste(url,mylink,sep=""),destfile=paste("online_files/",mylink,sep=""))
} )
## href href href href href href href href href href href href href href href href
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## href href href href href
## 0 0 0 0 0
myfilelist <- list.files("online_files/",full.names=TRUE)
trends <- sapply(myfilelist, function(myfilename) {
x <- readLines(myfilename)
# Num XL gene list articles
NUM_GENELIST_ARTICLES <- x[grep("NUM_GENELIST_ARTICLES",x)[3]+1]
NUM_GENELIST_ARTICLES <- sapply(strsplit(NUM_GENELIST_ARTICLES," "),"[[",3)
NUM_GENELIST_ARTICLES <- sapply(strsplit(NUM_GENELIST_ARTICLES,"<"),"[[",1)
NUM_GENELIST_ARTICLES <- as.numeric(NUM_GENELIST_ARTICLES)
# number of affected articles
NUM_ERROR_GENELIST_ARTICLES <- x[grep("NUM_ERROR_GENELIST_ARTICLES",x)[3]+1]
NUM_ERROR_GENELIST_ARTICLES <- sapply(strsplit(NUM_ERROR_GENELIST_ARTICLES," "),"[[",3)
NUM_ERROR_GENELIST_ARTICLES <- sapply(strsplit(NUM_ERROR_GENELIST_ARTICLES,"<"),"[[",1)
NUM_ERROR_GENELIST_ARTICLES <- as.numeric(NUM_ERROR_GENELIST_ARTICLES)
# Error proportion
ERROR_PROPORTION <- x[grep("ERROR_PROPORTION",x)[3]+1]
ERROR_PROPORTION <- sapply(strsplit(ERROR_PROPORTION," "),"[[",3)
ERROR_PROPORTION <- sapply(strsplit(ERROR_PROPORTION,"<"),"[[",1)
ERROR_PROPORTION <- as.numeric(ERROR_PROPORTION)
# number of journals
NUM_JOURNALS <- x[grep('JOURNALS_TABLE',x)[3]+1]
NUM_JOURNALS <- sapply(strsplit(NUM_JOURNALS," "),"[[",3)
NUM_JOURNALS <- sapply(strsplit(NUM_JOURNALS,"<"),"[[",1)
NUM_JOURNALS <- as.numeric(NUM_JOURNALS)
NUM_JOURNALS
res <- c(NUM_GENELIST_ARTICLES,NUM_ERROR_GENELIST_ARTICLES,ERROR_PROPORTION,NUM_JOURNALS)
return(res)
})
colnames(trends) <- sapply(strsplit(colnames(trends),"_"),"[[",3)
colnames(trends) <- gsub(".html","",colnames(trends),fixed=TRUE)
trends <- as.data.frame(trends)
rownames(trends) <- c("NUM_GENELIST_ARTICLES","NUM_ERROR_GENELIST_ARTICLES","ERROR_PROPORTION","NUM_JOURNALS")
trends <- t(trends)
trends <- as.data.frame(trends)
CURRENT_RES <- c(NUM_GENELIST_ARTICLES,NUM_ERROR_GENELIST_ARTICLES,ERROR_PROPORTION,NUM_JOURNALS)
trends <- rbind(trends,CURRENT_RES)
paste(CURRENT_YEAR,CURRENT_MONTH,sep="-")
## [1] "2022-12"
rownames(trends)[nrow(trends)] <- paste(CURRENT_YEAR,CURRENT_MONTH,sep="-")
plot(trends$NUM_GENELIST_ARTICLES, xaxt = "n" , type="b" , main="Number of articles with Excel gene lists per month",
ylab="number of articles", xlab="month")
axis(1, at=1:nrow(trends), labels=rownames(trends))
plot(trends$NUM_ERROR_GENELIST_ARTICLES, xaxt = "n" , type="b" , main="Number of articles with gene name errors per month",
ylab="number of articles", xlab="month")
axis(1, at=1:nrow(trends), labels=rownames(trends))
plot(trends$ERROR_PROPORTION, xaxt = "n" , type="b" , main="Proportion of articles with Excel gene list affected by errors",
ylab="proportion", xlab="month")
axis(1, at=1:nrow(trends), labels=rownames(trends))
plot(trends$NUM_JOURNALS, xaxt = "n" , type="b" , main="Number of journals with affected articles",
ylab="number of journals", xlab="month")
axis(1, at=1:nrow(trends), labels=rownames(trends))
unlink("online_files/",recursive=TRUE)
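The trend step reads each earlier report as raw HTML lines, finds the third line that mentions a variable name, and takes the line after it, which is presumably the printed value in the form `## [1] 123` followed by closing markup. It then keeps the third space-separated token and trims everything from the first `<`. A sketch of that parsing on a made-up line; the exact markup of the real reports is an assumption, and `extract_printed_value` is a hypothetical helper.

```r
# Hypothetical helper mirroring the two strsplit steps used for the trends
extract_printed_value <- function(line) {
  token <- strsplit(line, " ", fixed = TRUE)[[1]][3]   # e.g. "0.25</code></pre>"
  token <- strsplit(token, "<", fixed = TRUE)[[1]][1]  # drop trailing markup
  as.numeric(token)
}

# Made-up example line, not copied from a real report
line <- "## [1] 0.25</code></pre>"
value <- extract_printed_value(line)
value
```

Because this depends on where the printed value sits in the rendered HTML, a change in report layout would most likely yield NA rather than an error, so checking for NA before plotting would be a reasonable safeguard.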
Zeeberg, B.R., Riss, J., Kane, D.W. et al. Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics 5, 80 (2004). https://doi.org/10.1186/1471-2105-5-80
Ziemann, M., Eren, Y. & El-Osta, A. Gene name errors are widespread in the scientific literature. Genome Biol 17, 177 (2016). https://doi.org/10.1186/s13059-016-1044-7
sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RCurl_1.98-1.9 readxl_1.4.1 reutils_0.2.3 xml2_1.3.3 jsonlite_1.8.3
## [6] XML_3.99-0.12
##
## loaded via a namespace (and not attached):
## [1] knitr_1.40 magrittr_2.0.3 R6_2.5.1 rlang_1.0.6
## [5] fastmap_1.1.0 stringr_1.4.1 highr_0.9 tools_4.2.2
## [9] xfun_0.34 cli_3.4.1 jquerylib_0.1.4 htmltools_0.5.3
## [13] yaml_2.3.6 digest_0.6.30 assertthat_0.2.1 sass_0.4.2
## [17] bitops_1.0-7 cachem_1.0.6 evaluate_0.17 rmarkdown_2.17
## [21] stringi_1.7.8 compiler_4.2.2 bslib_0.4.0 cellranger_1.1.0