Source: https://github.com/markziemann/GeneNameErrors2020
View the reports: http://ziemann-lab.net/public/gene_name_errors/
Gene name errors result when data are imported improperly into MS Excel and other spreadsheet programs (Zeeberg et al, 2004). Certain gene names like MARCH3, SEPT2 and DEC1 are converted into date format. These errors are surprisingly common in supplementary data files in the field of genomics (Ziemann et al, 2016). This could be considered a minor error because it affects only a small number of genes; however, it is symptomatic of poor data processing methods. The purpose of this script is to identify gene name errors present in supplementary files of PubMed Central articles from the previous month.
library("XML")
library("jsonlite")
library("xml2")
library("reutils")
library("readxl")
Here I will be getting PubMed Central IDs for the previous month.
Start with figuring out the date to search PubMed Central.
CURRENT_MONTH=format(Sys.time(), "%m")
CURRENT_YEAR=format(Sys.time(), "%Y")
if (CURRENT_MONTH == "01") {
PREV_YEAR=as.character(as.numeric(format(Sys.time(), "%Y"))-1)
PREV_MONTH="12"
} else {
PREV_YEAR=CURRENT_YEAR
PREV_MONTH=as.character(as.numeric(format(Sys.time(), "%m"))-1)
}
DATE=paste(PREV_YEAR,"/",PREV_MONTH,sep="")
DATE
## [1] "2021/9"
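The month arithmetic above can also be done with R's date class. The following is a minimal sketch, not the report's own method: `prev_month` is a hypothetical helper name, and it returns a zero-padded month ("2021/09" rather than "2021/9"). Whether the PubMed Central search treats both forms identically is an assumption that has not been checked here.

```r
# Hypothetical helper: return the previous calendar month as "YYYY/MM".
prev_month <- function(today = Sys.Date()) {
  # First day of the current month
  first_of_month <- as.Date(format(today, "%Y-%m-01"))
  # Subtracting one day lands on the last day of the previous month,
  # which also handles the January -> December rollover.
  format(first_of_month - 1, "%Y/%m")
}

prev_month(as.Date("2021-10-15"))
prev_month(as.Date("2022-01-03"))
```

Working from the first day of the month avoids the separate branch for January.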
Let’s see how many PMC IDs we have in the past month.
QUERY ='((genom*[Abstract]))'
ESEARCH_RES <- esearch(term=QUERY, db = "pmc", rettype = "uilist", retmode = "xml", retstart = 0,
retmax = 5000000, usehistory = TRUE, webenv = NULL, querykey = NULL, sort = NULL, field = NULL,
datetype = NULL, reldate = NULL, mindate = DATE, maxdate = DATE)
pmc <- efetch(ESEARCH_RES,retmode="text",rettype="uilist",outfile="pmcids.txt")
## Retrieving UIDs 1 to 500
## Retrieving UIDs 501 to 1000
## Retrieving UIDs 1001 to 1500
## Retrieving UIDs 1501 to 2000
## Retrieving UIDs 2001 to 2500
## Retrieving UIDs 2501 to 3000
## Retrieving UIDs 3001 to 3500
pmc <- read.table(pmc)
pmc <- paste("PMC",pmc$V1,sep="")
NUM_ARTICLES=length(pmc)
NUM_ARTICLES
## [1] 3287
writeLines(pmc,con="pmc.txt")
Now run the bash script. Note that false positives can occur (~1.5%) and these results have not been verified by a human.
Here are some definitions:
NUM_XLS = Number of supplementary Excel files in this set of PMC articles.
NUM_XLS_ARTICLES = Number of articles matching the PubMed Central search which have supplementary Excel files.
GENELISTS = The gene lists found in the Excel files. Each Excel file is counted once even if it has multiple gene lists.
NUM_GENELISTS = The number of Excel files with gene lists.
NUM_GENELIST_ARTICLES = The number of PMC articles with supplementary Excel gene lists.
ERROR_GENELISTS = Files suspected to contain gene name errors. The dates and five-digit numbers indicate transmogrified gene names.
NUM_ERROR_GENELISTS = Number of Excel gene lists with errors.
NUM_ERROR_GENELIST_ARTICLES = Number of articles with supplementary Excel gene name errors.
ERROR_PROPORTION = This is the proportion of articles with Excel gene lists that have errors.
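These definitions depend on how each line of results.txt is laid out. As a rough sketch, assuming the layout implied by the filtering code in this report (article ID, file path, species, then a count followed by the suspect values), a line can be split into fields and classified by its field count. The first example line is copied from the report output; the second is a hypothetical error-free gene list made up for illustration.

```r
# Two lines in the assumed results.txt layout.
result_lines <- c(
  "PMC8446408 /pmc/articles/PMC8446408/bin/CAM4-10-6428-s002.xlsx Hsapiens 2 44448 44261",
  "PMC0000000 /pmc/articles/PMC0000000/bin/example.xlsx Hsapiens"  # hypothetical
)

fields <- strsplit(result_lines, " ")
n_fields <- lengths(fields)

parsed <- data.frame(
  pmcid = sapply(fields, "[[", 1),
  file = basename(sapply(fields, "[[", 2)),
  is_genelist = n_fields > 2,  # a species was detected, so it is treated as a gene list
  has_errors = n_fields > 3    # a count and suspect values follow the species
)
parsed
```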
system("./gene_names.sh pmc.txt")
results <- readLines("results.txt")
XLS <- results[grep("XLS",results,ignore.case=TRUE)]
NUM_XLS = length(XLS)
NUM_XLS
## [1] 4483
NUM_XLS_ARTICLES = length(unique(sapply(strsplit(XLS," "),"[[",1)))
NUM_XLS_ARTICLES
## [1] 714
GENELISTS <- XLS[lapply(strsplit(XLS," "),length)>2]
#GENELISTS
NUM_GENELISTS <- length(unique(sapply(strsplit(GENELISTS," "),"[[",2)))
NUM_GENELISTS
## [1] 494
NUM_GENELIST_ARTICLES <- length(unique(sapply(strsplit(GENELISTS," "),"[[",1)))
NUM_GENELIST_ARTICLES
## [1] 237
ERROR_GENELISTS <- XLS[lapply(strsplit(XLS," "),length)>3]
#ERROR_GENELISTS
NUM_ERROR_GENELISTS = length(ERROR_GENELISTS)
NUM_ERROR_GENELISTS
## [1] 167
GENELIST_ERROR_ARTICLES <- unique(sapply(strsplit(ERROR_GENELISTS," "),"[[",1))
GENELIST_ERROR_ARTICLES
## [1] "PMC8477972" "PMC8457190" "PMC8473911" "PMC8473691" "PMC8464068"
## [6] "PMC8457092" "PMC8459994" "PMC8459715" "PMC8446408" "PMC8421397"
## [11] "PMC8421382" "PMC8421377" "PMC8413298" "PMC8410909" "PMC8390680"
## [16] "PMC8390644" "PMC8387478" "PMC8382756" "PMC8379159" "PMC8454932"
## [21] "PMC8450517" "PMC8449218" "PMC8446523" "PMC8408347" "PMC8405691"
## [26] "PMC7611670" "PMC8441059" "PMC8440929" "PMC8438076" "PMC8439149"
## [31] "PMC8439085" "PMC8439043" "PMC8438179" "PMC8432159" "PMC8428331"
## [36] "PMC8387440" "PMC8434745" "PMC8434739" "PMC8425920" "PMC8425914"
## [41] "PMC8419706" "PMC8432294" "PMC8431925" "PMC8408433" "PMC8427675"
## [46] "PMC8427129" "PMC8426625" "PMC8422739" "PMC8420019" "PMC8395571"
## [51] "PMC8423302" "PMC8421684" "PMC8417883" "PMC8417742" "PMC8417727"
## [56] "PMC8416908" "PMC8415302" "PMC8415032" "PMC8410083" "PMC8405809"
## [61] "PMC8407593" "PMC8406749" "PMC8406630" "PMC8387919"
NUM_ERROR_GENELIST_ARTICLES <- length(GENELIST_ERROR_ARTICLES)
NUM_ERROR_GENELIST_ARTICLES
## [1] 64
ERROR_PROPORTION = NUM_ERROR_GENELIST_ARTICLES / NUM_GENELIST_ARTICLES
ERROR_PROPORTION
## [1] 0.2700422
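As a worked check of the proportion, the sketch below recomputes it from the two article counts reported above and attaches an exact binomial confidence interval. The interval is an addition for illustration only and is not part of the original analysis. It also does not account for the false positive rate mentioned earlier.

```r
# Counts taken from the report output above
num_error_articles <- 64
num_genelist_articles <- 237

error_proportion <- num_error_articles / num_genelist_articles

# Exact (Clopper-Pearson) 95% interval from base R's binom.test
ci <- binom.test(num_error_articles, num_genelist_articles)$conf.int

round(c(proportion = error_proportion, lower = ci[1], upper = ci[2]), 3)
```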
Here you can have a look at all the gene lists detected in the past month, as well as those with errors. The dates are obvious errors; they are commonly dates in September, March, December and October. The five-digit numbers represent dates as encoded in Excel's internal format, which stores a date as a count of days starting from 1900. If you were to put these numbers into Excel and format the cells as dates, they would also mostly map to dates in September, March, December and October.
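The conversion from serial number to date can be reproduced in R without opening Excel. This is a small sketch using four serial numbers that appear in the output of this report. It assumes the files use Excel's default 1900 date system, for which the conventional R origin is "1899-12-30" (this offset absorbs Excel's treatment of 1900 as a leap year, so it holds for dates from March 1900 onward).

```r
# Serial numbers copied from the ERROR_GENELISTS output
serials <- c(44450, 44257, 44531, 43891)

# Convert day counts to calendar dates (Excel 1900 date system assumed)
dates <- as.Date(serials, origin = "1899-12-30")

data.frame(serial = serials, date = format(dates, "%Y-%m-%d"))
# 44450 -> 2021-09-11, 44257 -> 2021-03-02,
# 44531 -> 2021-12-01, 43891 -> 2020-03-01
```

These dates are consistent with gene symbols such as SEPT11, MARCH2, DEC1 and MARCH1, although the serial number alone cannot confirm which symbol was originally entered. The year reflects when the name was converted (Excel fills in the current year), not anything about the gene.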
#GENELISTS
ERROR_GENELISTS
## [1] "PMC8477972 /pmc/articles/PMC8477972/bin/Table_5.XLSX Hsapiens 1 44257"
## [2] "PMC8477972 /pmc/articles/PMC8477972/bin/Table_5.XLSX Ggallus 19 44262 44262 44261 44450 44450 44449 44450 44441 44450 44450 44450 44450 44447 44450 44449 44450 44450 44450 44450"
## [3] "PMC8477972 /pmc/articles/PMC8477972/bin/Table_5.XLSX Hsapiens 6 44450 44450 44447 44447 44450 44450"
## [4] "PMC8477972 /pmc/articles/PMC8477972/bin/Table_5.XLSX Hsapiens 7 44262 44262 44262 44450 44450 44449 44450"
## [5] "PMC8457190 /pmc/articles/PMC8457190/bin/MEC-30-4039-s001.xlsx Drerio 6 43891 43892 43897 44084 44085 44083"
## [6] "PMC8457190 /pmc/articles/PMC8457190/bin/MEC-30-4039-s001.xlsx Drerio 8 43891 43892 43897 43898 44084 44085 44079 44083"
## [7] "PMC8473911 /pmc/articles/PMC8473911/bin/Data_Sheet_2.xlsx Hsapiens 1 44444"
## [8] "PMC8473691 /pmc/articles/PMC8473691/bin/Table_4.XLS Hsapiens 17 2021/09/15 2021/03/05 2021/03/03 2021/03/04 2021/03/09 2021/03/02 2021/03/09 42986 2021/09/15 2021/03/08 2021/03/06 2021/03/09 2021/03/02 2021/03/01 2021/03/03 2021/03/10 2021/03/05"
## [9] "PMC8473691 /pmc/articles/PMC8473691/bin/Table_4.XLS Hsapiens 10 2021/03/04 2021/03/04 2021/03/04 2021/03/02 2021/03/09 2021/03/09 2021/03/06 2021/09/15 2021/03/02 2021/03/06"
## [10] "PMC8464068 zip/Salviato_etal_SupplementaryTableS4_Comparison_2021-04-21rev.xlsx Hsapiens 8 40322 37135 40057 39692 39692 37135 38961 38961"
## [11] "PMC8457092 /pmc/articles/PMC8457092/bin/GEPI-45-664-s002.xlsx Ggallus 1 43897"
## [12] "PMC8459994 /pmc/articles/PMC8459994/bin/pone.0257343.s001.xlsx Hsapiens 15 44531 44256 44257 44258 44260 44261 44262 44263 44449 44454 44441 44443 44445 44446 44448"
## [13] "PMC8459994 /pmc/articles/PMC8459994/bin/pone.0257343.s002.xlsx Hsapiens 4 44258 44261 44263 44446"
## [14] "PMC8459994 /pmc/articles/PMC8459994/bin/pone.0257343.s002.xlsx Hsapiens 3 44260 44441 44448"
## [15] "PMC8459994 /pmc/articles/PMC8459994/bin/pone.0257343.s002.xlsx Hsapiens 1 44445"
## [16] "PMC8459994 /pmc/articles/PMC8459994/bin/pone.0257343.s002.xlsx Hsapiens 3 44256 44262 44454"
## [17] "PMC8459715 /pmc/articles/PMC8459715/bin/Table_2.xlsx Hsapiens 10 44448 44441 44449 44446 44450 44447 44445 44260 44257 44454"
## [18] "PMC8459715 /pmc/articles/PMC8459715/bin/Table_2.xlsx Hsapiens 8 44448 44446 44441 44450 44445 44447 44449 44260"
## [19] "PMC8459715 /pmc/articles/PMC8459715/bin/Table_2.xlsx Hsapiens 9 44448 44450 44441 44446 44445 44447 44449 44260 44454"
## [20] "PMC8459715 /pmc/articles/PMC8459715/bin/Table_2.xlsx Hsapiens 8 44448 44446 44441 44450 44445 44447 44449 44260"
## [21] "PMC8446408 /pmc/articles/PMC8446408/bin/CAM4-10-6428-s002.xlsx Hsapiens 1 44262"
## [22] "PMC8446408 /pmc/articles/PMC8446408/bin/CAM4-10-6428-s002.xlsx Hsapiens 2 44448 44261"
## [23] "PMC8446408 /pmc/articles/PMC8446408/bin/CAM4-10-6428-s002.xlsx Hsapiens 6 44261 44531 44442 44262 44441 44263"
## [24] "PMC8446408 /pmc/articles/PMC8446408/bin/CAM4-10-6428-s002.xlsx Hsapiens 1 44448"
## [25] "PMC8446408 /pmc/articles/PMC8446408/bin/CAM4-10-6428-s002.xlsx Hsapiens 1 44445"
## [26] "PMC8421397 /pmc/articles/PMC8421397/bin/41467_2021_25614_MOESM3_ESM.xlsx Hsapiens 14 43894 43893 43900 43891 44078 43894 43891 43900 44079 43891 43892 43900 43894 44078"
## [27] "PMC8421382 /pmc/articles/PMC8421382/bin/41467_2021_25539_MOESM9_ESM.xlsx Hsapiens 1 44078"
## [28] "PMC8421377 /pmc/articles/PMC8421377/bin/41467_2021_25618_MOESM7_ESM.xlsx Hsapiens 27 44454 44257 44256 44449 44262 44259 44441 44450 44256 44261 44266 44258 44447 44446 44453 44531 44263 44260 44264 44451 44440 44443 44265 44448 44257 44444 44442"
## [29] "PMC8421377 /pmc/articles/PMC8421377/bin/41467_2021_25618_MOESM8_ESM.xlsx Hsapiens 6 44077 43893 43891 43900 43892 43891"
## [30] "PMC8421377 /pmc/articles/PMC8421377/bin/41467_2021_25618_MOESM9_ESM.xlsx Hsapiens 27 44266 44260 44442 44444 44447 44257 44450 44262 44440 44259 44258 44449 44256 44443 44265 44264 44263 44257 44441 44454 44261 44531 44445 44446 44448 44453 44256"
## [31] "PMC8421377 /pmc/articles/PMC8421377/bin/41467_2021_25618_MOESM9_ESM.xlsx Hsapiens 27 44265 44257 44442 44258 44256 44264 44448 44445 44443 44450 44260 44257 44262 44446 44444 44453 44261 44266 44449 44454 44256 44440 44263 44441 44447 44259 44531"
## [32] "PMC8421377 /pmc/articles/PMC8421377/bin/41467_2021_25618_MOESM9_ESM.xlsx Hsapiens 27 44258 44442 44443 44448 44447 44259 44445 44453 44256 44263 44256 44449 44450 44440 44260 44446 44531 44257 44265 44261 44441 44444 44454 44257 44264 44266 44262"
## [33] "PMC8421377 /pmc/articles/PMC8421377/bin/41467_2021_25618_MOESM9_ESM.xlsx Hsapiens 27 44444 44259 44258 44443 44263 44266 44449 44442 44441 44264 44261 44454 44447 44448 44445 44256 44453 44262 44531 44450 44260 44265 44256 44440 44446 44257 44257"
## [34] "PMC8421377 /pmc/articles/PMC8421377/bin/41467_2021_25618_MOESM9_ESM.xlsx Hsapiens 27 44442 44264 44259 44443 44448 44256 44444 44257 44449 44261 44447 44260 44441 44257 44445 44263 44450 44440 44454 44266 44446 44258 44531 44453 44256 44265 44262"
## [35] "PMC8413298 /pmc/articles/PMC8413298/bin/41467_2021_25467_MOESM9_ESM.xlsx Mmusculus 10 43532 43529 43533 43717 43526 43531 43710 43530 43716 43715"
## [36] "PMC8410909 /pmc/articles/PMC8410909/bin/42003_2021_2533_MOESM7_ESM.xlsx Hsapiens 27 43898 43892 44076 44079 44081 44082 43900 43901 44083 44086 43891 44078 43896 43897 44085 43892 44084 44077 44075 44080 44166 43895 43894 43891 43899 43893 44088"
## [37] "PMC8390680 /pmc/articles/PMC8390680/bin/41467_2021_25432_MOESM6_ESM.xlsx Hsapiens 27 44075 44083 44080 44085 43893 43898 44076 43891 44166 43896 43897 43892 43891 44082 44081 43900 44077 44079 43894 43892 43901 44088 44084 43899 44078 44086 43895"
## [38] "PMC8390680 /pmc/articles/PMC8390680/bin/41467_2021_25432_MOESM7_ESM.xlsx Hsapiens 26 44075 44080 43891 44076 44085 43895 43898 43892 43899 44084 43894 43892 43900 44082 43897 44081 44086 43891 44166 44077 44079 43896 44083 44078 43893 43901"
## [39] "PMC8390644 /pmc/articles/PMC8390644/bin/41467_2021_25107_MOESM5_ESM.xlsx Hsapiens 40 44257 44257 44264 44256 44261 44258 44257 44257 44259 44266 44257 44256 44260 44264 44264 44265 44257 44257 44259 44259 44259 44259 44256 44256 44266 44258 44531 44531 44260 44259 44256 44531 44265 44258 44256 44263 44260 44262 44261 44263"
## [40] "PMC8387478 /pmc/articles/PMC8387478/bin/41467_2021_25357_MOESM16_ESM.xlsx Hsapiens 10 44080 44076 44081 43891 44082 43892 44085 43895 44084 44083"
## [41] "PMC8382756 /pmc/articles/PMC8382756/bin/41467_2021_25377_MOESM3_ESM.xlsx Hsapiens 1 43526"
## [42] "PMC8379159 /pmc/articles/PMC8379159/bin/41467_2021_25392_MOESM9_ESM.xlsx Hsapiens 1 43892"
## [43] "PMC8454932 /pmc/articles/PMC8454932/bin/pgen.1009763.s009.xlsx Scerevisiae 2 44105 43975"
## [44] "PMC8454932 /pmc/articles/PMC8454932/bin/pgen.1009763.s009.xlsx Scerevisiae 2 44105 43975"
## [45] "PMC8454932 /pmc/articles/PMC8454932/bin/pgen.1009763.s009.xlsx Scerevisiae 2 44105 43975"
## [46] "PMC8454932 /pmc/articles/PMC8454932/bin/pgen.1009763.s009.xlsx Scerevisiae 2 44105 43975"
## [47] "PMC8450517 /pmc/articles/PMC8450517/bin/Table_1.xlsx Hsapiens 2 44259 44266"
## [48] "PMC8449218 /pmc/articles/PMC8449218/bin/12864_2021_7970_MOESM5_ESM.xlsx Rnorvegicus 2 43710 43719"
## [49] "PMC8446523 /pmc/articles/PMC8446523/bin/Table_14.XLSX Hsapiens 1 40787"
## [50] "PMC8408347 /pmc/articles/PMC8408347/bin/LSA-2021-01083_TableS4.xlsx Hsapiens 28 44166 43891 43892 43900 43901 43891 43892 43893 43894 43895 43896 43897 43898 43899 44089 44084 44085 44086 44088 44075 44076 44077 44078 44079 44080 44081 44082 44083"
## [51] "PMC8408347 /pmc/articles/PMC8408347/bin/LSA-2021-01083_TableS5.xlsx Hsapiens 28 44166 43891 43892 43900 43901 43891 43892 43893 43894 43895 43896 43897 43898 43899 44089 44084 44085 44086 44088 44075 44076 44077 44078 44079 44080 44081 44082 44083"
## [52] "PMC8405691 /pmc/articles/PMC8405691/bin/41419_2021_4108_MOESM6_ESM.xlsx Mmusculus 2 44444 44450"
## [53] "PMC7611670 /pmc/articles/PMC7611670/bin/EMS118268-supplement-Data_Set_1.xlsx Hsapiens 2 37500 41153"
## [54] "PMC7611670 /pmc/articles/PMC7611670/bin/EMS118268-supplement-Data_Set_1.xlsx Hsapiens 5 37500 37500 37500 41153 39326"
## [55] "PMC7611670 /pmc/articles/PMC7611670/bin/EMS118268-supplement-Data_Set_1.xlsx Hsapiens 5 41153 37500 37500 37500 39326"
## [56] "PMC7611670 /pmc/articles/PMC7611670/bin/EMS118268-supplement-Data_Set_1.xlsx Hsapiens 2 41153 37500"
## [57] "PMC7611670 /pmc/articles/PMC7611670/bin/EMS118268-supplement-Data_Set_1.xlsx Hsapiens 5 41153 39326 37500 37500 37500"
## [58] "PMC7611670 /pmc/articles/PMC7611670/bin/EMS118268-supplement-Data_Set_1.xlsx Hsapiens 2 37500 41153"
## [59] "PMC7611670 /pmc/articles/PMC7611670/bin/EMS118268-supplement-Data_Set_1.xlsx Hsapiens 5 37500 37500 41153 39326 37500"
## [60] "PMC7611670 /pmc/articles/PMC7611670/bin/EMS118268-supplement-Data_Set_1.xlsx Hsapiens 2 41153 37500"
## [61] "PMC7611670 /pmc/articles/PMC7611670/bin/EMS118268-supplement-Data_Set_1.xlsx Hsapiens 5 41153 37500 37500 37500 39326"
## [62] "PMC7611670 /pmc/articles/PMC7611670/bin/EMS118268-supplement-Data_Set_1.xlsx Hsapiens 2 41153 37500"
## [63] "PMC8441059 /pmc/articles/PMC8441059/bin/CAC2-41-867-s002.xls Mmusculus 6 43716 43718 43532 43525 43527 43528"
## [64] "PMC8441059 /pmc/articles/PMC8441059/bin/CAC2-41-867-s002.xls Mmusculus 5 43716 43525 43711 43525 43526"
## [65] "PMC8440929 /pmc/articles/PMC8440929/bin/Table_1.XLSX Hsapiens 7 44265 44443 44259 44256 44258 44444 44440"
## [66] "PMC8438076 /pmc/articles/PMC8438076/bin/41598_2021_97295_MOESM1_ESM.xlsx Hsapiens 5 10357 27158 69405 26151 25706"
## [67] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_SD_ED_Fig9.xlsx Hsapiens 468 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 
43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083"
## [68] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_SD_Fig4.xlsx Hsapiens 1 43896"
## [69] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_SD_Fig4.xlsx Hsapiens 17 44083 43895 43899 43896 44076 44080 44086 44084 44166 43892 43892 43894 43891 43897 43891 43901 44088"
## [70] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_SD_Fig4.xlsx Hsapiens 26 43897 43892 43895 43892 44166 44079 43896 44086 43898 44083 43891 44085 43899 44075 43893 44080 43891 44082 44078 43894 44084 44076 44077 43900 43901 44088"
## [71] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_SD_Fig4.xlsx Hsapiens 21 44080 44088 43892 44082 43898 44083 43899 43894 43895 44077 44089 44076 43891 43893 44079 43896 44081 43897 43892 44084 44085"
## [72] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_SD_Fig4.xlsx Hsapiens 26 44076 43892 43891 44080 43893 44166 43897 44079 44078 44086 43901 44085 44082 44077 43891 43892 43900 43899 43894 43898 43896 44075 44088 44084 43895 44083"
## [73] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_Sup_tab7.xlsx Hsapiens 21 43892 43891 43892 43893 43894 43895 43896 43897 43898 43899 44089 44084 44085 44088 44076 44077 44079 44080 44081 44082 44083"
## [74] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_Sup_tab7.xlsx Hsapiens 21 43892 43891 43892 43893 43894 43895 43896 43897 43898 43899 44089 44084 44085 44088 44076 44077 44079 44080 44081 44082 44083"
## [75] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_Sup_tab8.xlsx Hsapiens 26 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083"
## [76] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_Sup_tab8.xlsx Hsapiens 26 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083"
## [77] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_Sup_tab8.xlsx Hsapiens 26 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083"
## [78] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_Sup_tab8.xlsx Hsapiens 26 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083"
## [79] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_Sup_tab8.xlsx Hsapiens 26 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083"
## [80] "PMC8439149 /pmc/articles/PMC8439149/bin/NIHMS1624022-supplement-1624022_Sup_tab8.xlsx Hsapiens 26 43891 43892 43891 43900 43892 43893 43894 43895 43896 43897 43898 43899 44089 44075 44084 44085 44086 44088 44076 44077 44078 44079 44080 44081 44082 44083"
## [81] "PMC8439085 /pmc/articles/PMC8439085/bin/12882_2021_2445_MOESM1_ESM.xlsx Hsapiens 8 6-Sep 6-Sep 6-Sep 6-Sep 6-Sep 7-Sep 7-Sep 7-Sep"
## [82] "PMC8439085 /pmc/articles/PMC8439085/bin/12882_2021_2445_MOESM2_ESM.xlsx Hsapiens 8 6-Sep 6-Sep 6-Sep 6-Sep 6-Sep 7-Sep 7-Sep 7-Sep"
## [83] "PMC8439043 /pmc/articles/PMC8439043/bin/12864_2021_7930_MOESM5_ESM.xlsx Drerio 15 41153 39142 37500 38777 36951 37865 40422 37530 36951 38961 39508 38412 39873 37316 42248"
## [84] "PMC8438179 /pmc/articles/PMC8438179/bin/DataSheet1.xlsx Hsapiens 1 44448"
## [85] "PMC8438179 /pmc/articles/PMC8438179/bin/DataSheet1.xlsx Ggallus 1 44448"
## [86] "PMC8438179 /pmc/articles/PMC8438179/bin/DataSheet1.xlsx Hsapiens 1 44450"
## [87] "PMC8438179 /pmc/articles/PMC8438179/bin/DataSheet1.xlsx Hsapiens 1 44450"
## [88] "PMC8432159 /pmc/articles/PMC8432159/bin/PATH-252-151-s001.xlsx Hsapiens 164 38047 40057 37135 40057 40057 38961 40057 40057 39508 40057 38777 37226 37681 40057 40057 37865 40057 40057 40238 37681 38961 40057 40238 40057 40057 40238 40057 40603 40603 40787 38777 40057 40057 39692 40057 42248 40057 40057 40422 38047 40057 37865 40603 40057 40057 40057 40787 40057 36951 40057 39508 38961 41153 40057 40603 38047 40057 38231 40057 40057 38047 42248 40057 40057 39508 39692 38777 41153 41153 36951 41883 41153 40238 40238 38777 36951 40057 37226 40057 40057 40057 40057 40057 40057 40057 40603 40787 40057 41153 40603 40057 40057 40057 40603 37681 40057 40057 38412 41153 40057 36951 40057 36951 40057 36951 36951 40603 39692 38047 39692 40422 37316 40057 42248 36951 37226 38231 38596 36951 40057 39692 40057 39508 40422 40603 37681 41883 37865 40057 40603 40057 39142 39142 40238 38231 39692 40603 40057 42248 38047 38231 42248 40603 37316 40238 40603 40603 40057 40057 41883 40057 40057 37681 40057 40057 40603 40057 36951 40787 36951 36951 37226 40057 36951"
## [89] "PMC8432159 /pmc/articles/PMC8432159/bin/PATH-252-151-s001.xlsx Hsapiens 5 36951 40057 40057 40057 38777"
## [90] "PMC8432159 /pmc/articles/PMC8432159/bin/PATH-252-151-s001.xlsx Hsapiens 4 41153 40057 38596 38961"
## [91] "PMC8428331 /pmc/articles/PMC8428331/bin/mmc3.xlsx Mmusculus 1 39326"
## [92] "PMC8428331 /pmc/articles/PMC8428331/bin/mmc3.xlsx Mmusculus 13 38047 36951 40422 39692 38231 40238 40057 40603 37865 37681 40787 36951 38961"
## [93] "PMC8428331 /pmc/articles/PMC8428331/bin/mmc3.xlsx Rnorvegicus 10 38047 36951 37316 40422 39692 40057 40603 37865 38596 37681"
## [94] "PMC8387440 /pmc/articles/PMC8387440/bin/41523_2021_287_MOESM1_ESM.xlsx Hsapiens 2 43891 43891"
## [95] "PMC8434745 /pmc/articles/PMC8434745/bin/12711_2021_668_MOESM4_ESM.xlsx Hsapiens 16 43894 43894 44076 44076 43891 43891 43891 43891 43891 43891 44085 43893 43892 43900 43898 43898"
## [96] "PMC8434739 /pmc/articles/PMC8434739/bin/13073_2021_962_MOESM1_ESM.xlsx Hsapiens 2 44264 44260"
## [97] "PMC8434739 /pmc/articles/PMC8434739/bin/13073_2021_962_MOESM1_ESM.xlsx Hsapiens 1 44263"
## [98] "PMC8425920 /pmc/articles/PMC8425920/bin/ADVS-8-2100849-s004.xlsx Mmusculus 4 38047 38231 38231 38047"
## [99] "PMC8425920 /pmc/articles/PMC8425920/bin/ADVS-8-2100849-s007.xlsx Mmusculus 2 38231 38047"
## [100] "PMC8425920 /pmc/articles/PMC8425920/bin/ADVS-8-2100849-s015.xlsx Mmusculus 4 38047 38231 38047 38231"
## [101] "PMC8425914 /pmc/articles/PMC8425914/bin/ADVS-8-2101230-s002.xlsx Hsapiens 1 43891"
## [102] "PMC8419706 /pmc/articles/PMC8419706/bin/EMBR-22-e51328-s003.xls Hsapiens 3 40057 37500 38961"
## [103] "PMC8419706 /pmc/articles/PMC8419706/bin/EMBR-22-e51328-s006.xlsx Mmusculus 27 39508 38231 39326 38412 37865 40057 36951 40787 38596 37135 37500 37316 41153 38777 38961 40422 39142 37500 37316 36951 37681 39692 39873 40603 41883 38047 40238"
## [104] "PMC8419706 /pmc/articles/PMC8419706/bin/EMBR-22-e51328-s008.xlsx Hsapiens 23 42435 42616 42615 42437 42436 42618 42620 42432 42622 42619 42621 42623 42433 42438 42624 42628 42617 42434 42705 42439 42625 42440 42431"
## [105] "PMC8432294 /pmc/articles/PMC8432294/bin/Table1.XLSX Hsapiens 1 44088"
## [106] "PMC8432294 /pmc/articles/PMC8432294/bin/Table1.XLSX Hsapiens 1 44088"
## [107] "PMC8432294 /pmc/articles/PMC8432294/bin/Table1.XLSX Hsapiens 1 44088"
## [108] "PMC8431925 /pmc/articles/PMC8431925/bin/13068_2021_2032_MOESM2_ESM.xlsx Athaliana 1 44107"
## [109] "PMC8408433 /pmc/articles/PMC8408433/bin/mmc3.xlsx Hsapiens 8 43894 43894 44079 44083 43897 43900 44084 43894"
## [110] "PMC8427675 /pmc/articles/PMC8427675/bin/6672899.f1.xlsx Mmusculus 1 43891"
## [111] "PMC8427675 /pmc/articles/PMC8427675/bin/6672899.f1.xlsx Rnorvegicus 1 43898"
## [112] "PMC8427129 /pmc/articles/PMC8427129/bin/FEB2-9999-0-s006.xlsx Hsapiens 5 38200 37469 37104 37834 36892"
## [113] "PMC8427129 /pmc/articles/PMC8427129/bin/FEB2-9999-0-s006.xlsx Hsapiens 5 37469 37104 36892 38200 37834"
## [114] "PMC8427129 /pmc/articles/PMC8427129/bin/FEB2-9999-0-s006.xlsx Hsapiens 5 37104 37469 38200 37834 36892"
## [115] "PMC8426625 /pmc/articles/PMC8426625/bin/Table_3.xlsx Hsapiens 4 44083 44083 44083 44083"
## [116] "PMC8422739 /pmc/articles/PMC8422739/bin/40478_2021_1248_MOESM1_ESM.xlsx Hsapiens 5 42248 36951 37316 41883 39326"
## [117] "PMC8422739 /pmc/articles/PMC8422739/bin/40478_2021_1248_MOESM1_ESM.xlsx Hsapiens 7 40057 42248 39326 40238 38231 36951 37316"
## [118] "PMC8420019 /pmc/articles/PMC8420019/bin/13072_2021_414_MOESM1_ESM.xlsx Hsapiens 3 42621 42617 42435"
## [119] "PMC8395571 /pmc/articles/PMC8395571/bin/peerj-09-11995-s004.xls Hsapiens 3 2021/03/08 2021/12/01 2021/03/11"
## [120] "PMC8423302 /pmc/articles/PMC8423302/bin/pbio.3001352.s010.xlsx Mmusculus 1 40603"
## [121] "PMC8421684 /pmc/articles/PMC8421684/bin/DataSheet_1.xlsx Hsapiens 1 44448"
## [122] "PMC8421684 /pmc/articles/PMC8421684/bin/DataSheet_1.xlsx Hsapiens 26 44454 44257 44256 44263 44260 44264 44440 44443 44265 44448 44257 44449 44262 44259 44441 44444 44442 44450 44256 44261 44258 44447 44446 44453 44531 44445"
## [123] "PMC8417883 /pmc/articles/PMC8417883/bin/Table_2.xlsx Hsapiens 4 37500 40057 38200 40787"
## [124] "PMC8417742 /pmc/articles/PMC8417742/bin/Table_2.xlsx Hsapiens 5 38961 38777 40057 36951 37316"
## [125] "PMC8417727 /pmc/articles/PMC8417727/bin/Table_2.xlsx Hsapiens 7 43895 44076 43891 44085 44085 44085 44082"
## [126] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 1 44440"
## [127] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 1 44440"
## [128] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 1 44440"
## [129] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 1 44259"
## [130] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 1 44440"
## [131] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 1 44259"
## [132] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 3 44259 44440 44440"
## [133] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 1 44440"
## [134] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 1 44440"
## [135] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 1 44440"
## [136] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 1 44259"
## [137] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 3 44440 44440 44259"
## [138] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 1 44440"
## [139] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 1 44440"
## [140] "PMC8416908 /pmc/articles/PMC8416908/bin/Table_1.xlsx Hsapiens 1 44453"
## [141] "PMC8415302 /pmc/articles/PMC8415302/bin/Table_1.XLSX Hsapiens 1 44531"
## [142] "PMC8415302 /pmc/articles/PMC8415302/bin/Table_1.XLSX Hsapiens 1 44531"
## [143] "PMC8415302 /pmc/articles/PMC8415302/bin/Table_1.XLSX Hsapiens 2 44263 44441"
## [144] "PMC8415032 /pmc/articles/PMC8415032/bin/Table_1.xls Hsapiens 25 2021/03/02 2021/03/01 2021/09/03 2021/09/02 2021/09/06 2021/03/06 2021/12/01 2021/03/10 2021/03/08 2021/09/08 2021/09/12 2021/09/09 2021/09/01 2021/03/03 2021/09/11 2021/09/07 2021/03/11 2021/03/09 2021/09/05 2021/03/07 2021/09/04 2021/03/05 2021/09/14 2021/09/10 2021/03/04"
## [145] "PMC8410083 /pmc/articles/PMC8410083/bin/elife-65574-fig7-data1.xlsx Hsapiens 1 37316"
## [146] "PMC8410083 /pmc/articles/PMC8410083/bin/elife-65574-fig7-data1.xlsx Hsapiens 1 37316"
## [147] "PMC8405809 /pmc/articles/PMC8405809/bin/41598_2021_96837_MOESM1_ESM.xlsx Drerio 1 44263"
## [148] "PMC8405809 /pmc/articles/PMC8405809/bin/41598_2021_96837_MOESM1_ESM.xlsx Drerio 1 44263"
## [149] "PMC8407593 /pmc/articles/PMC8407593/bin/pgen.1009754.s044.xlsx Mmusculus 3 43891 43898 Intergenic_Gm25403_1-Mar"
## [150] "PMC8407593 /pmc/articles/PMC8407593/bin/pgen.1009754.s045.xlsx Mmusculus 3 43891 43898 Intergenic_Gm25403_1-Mar"
## [151] "PMC8407593 /pmc/articles/PMC8407593/bin/pgen.1009754.s046.xlsx Mmusculus 3 43891 43898 Intergenic_Gm25403_1-Mar"
## [152] "PMC8407593 /pmc/articles/PMC8407593/bin/pgen.1009754.s047.xlsx Mmusculus 3 43891 43898 Intergenic_Gm25403_1-Mar"
## [153] "PMC8407593 /pmc/articles/PMC8407593/bin/pgen.1009754.s048.xlsx Mmusculus 3 43891 43898 Intergenic_Gm25403_1-Mar"
## [154] "PMC8407593 /pmc/articles/PMC8407593/bin/pgen.1009754.s049.xlsx Mmusculus 3 43891 43898 Intergenic_Gm25403_1-Mar"
## [155] "PMC8407593 zip/Supp_Table_18-FHS_HDL-.xls Hsapiens 18 44089 43897 43894 43891 44085 44082 43893 43901 43896 44088 44081 44166 43898 43895 44083 43900 44078 44077"
## [156] "PMC8407593 zip/Supp_Table_19-FHS_LDL-.xls Hsapiens 18 44089 43897 43894 44085 43891 43896 43901 43893 44082 44081 44088 44166 43898 43895 44078 43900 44083 44077"
## [157] "PMC8407593 zip/Supp_Table_19-FHS_LDL-.xls Hsapiens 25 44454 44441 44262 44449 44259 44256 44450 44266 44258 44447 44261 44446 44453 44531 44263 44260 44264 44440 44451 44443 44448 44265 44257 44444 44442"
## [158] "PMC8407593 zip/Supp_Table_21-UKB_HDL-.xls Hsapiens 18 44089 43897 43894 44085 43891 43896 43901 43893 44082 44081 44088 44166 43898 43895 44078 43900 44083 44077"
## [159] "PMC8407593 zip/Supp_Table_22-UKB_LDL-.xls Hsapiens 18 44089 43897 43894 44085 43891 43896 43901 43893 44082 44081 44088 44166 43898 43895 44078 43900 44083 44077"
## [160] "PMC8406749 /pmc/articles/PMC8406749/bin/Table2.XLS Hsapiens 27 44531 44256 44265 44266 44257 44258 44259 44260 44261 44262 44263 44264 44454 44440 44449 44450 44451 44452 44453 44441 44442 44443 44444 44445 44446 44447 44448"
## [161] "PMC8406749 /pmc/articles/PMC8406749/bin/Table3.XLS Hsapiens 26 2021/12/01 2021/03/01 2021/03/02 2021/03/01 2021/03/10 2021/03/11 2021/03/02 2021/03/03 2021/03/04 2021/03/05 2021/03/06 2021/03/07 2021/03/08 2021/03/09 2021/09/15 2021/09/01 2021/09/10 2021/09/11 2021/09/12 2021/09/02 2021/09/03 2021/09/04 2021/09/06 2021/09/07 2021/09/08 2021/09/09"
## [162] "PMC8406749 /pmc/articles/PMC8406749/bin/Table4.XLS Hsapiens 11 2021/03/02 2021/03/03 2021/03/09 2021/03/10 2021/03/11 2021/09/01 2021/09/04 2021/09/06 2021/09/07 2021/09/10 2021/09/14"
## [163] "PMC8406749 /pmc/articles/PMC8406749/bin/Table5.XLS Hsapiens 26 2021/12/01 2021/03/01 2021/03/02 2021/03/10 2021/03/11 2021/03/03 2021/03/04 2021/03/05 2021/03/06 2021/03/07 2021/03/08 2021/03/09 2021/09/15 2021/09/01 2021/09/10 2021/09/11 2021/09/12 2021/09/14 2021/09/02 2021/09/03 2021/09/04 2021/09/05 2021/09/06 2021/09/07 2021/09/08 2021/09/09"
## [164] "PMC8406749 /pmc/articles/PMC8406749/bin/Table6.XLS Hsapiens 2 2001/08/03 2021/04/02"
## [165] "PMC8406630 /pmc/articles/PMC8406630/bin/DataSheet_2.xlsx Hsapiens 11 44266 44260 44531 44262 44265 44444 44443 44261 44450 44442 44453"
## [166] "PMC8387919 /pmc/articles/PMC8387919/bin/mmc2.xlsx Hsapiens 1 40057"
## [167] "PMC8387919 /pmc/articles/PMC8387919/bin/mmc2.xlsx Hsapiens 8 40238 39508 40603 38047 39873 36951 36951 37316"
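The five-digit values in the listing above look like Excel date serial numbers, which is how a gene name such as DEC1 ends up stored once it has been converted to a date. Here is a minimal sketch for turning them back into calendar dates. It assumes the spreadsheets use the Excel 1900 date system, for which R's `as.Date` documentation gives the origin "1899-12-30". The helper name `excel_serial_to_date` is made up for this illustration.

```r
# Hypothetical helper: convert Excel (1900 date system) serials to Dates.
# Assumption: the files were not saved with the 1904 date system.
excel_serial_to_date <- function(serial) {
  as.Date(as.numeric(serial), origin = "1899-12-30")
}

# Three serials copied from the listing above
serials <- c(43891, 44089, 44166)
dates <- excel_serial_to_date(serials)

data.frame(serial = serials, date = format(dates, "%Y-%m-%d"))
##   serial       date
## 1  43891 2020-03-01
## 2  44089 2020-09-15
## 3  44166 2020-12-01
```

A date of 1 December is consistent with a name like DEC1, but the original gene symbol cannot be recovered from the serial alone, so treat the result as a hint rather than a correction.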
Let’s investigate the errors in more detail.
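Each record above is a single space-separated string, and the code that follows picks out fields by position. As a rough sketch of that layout, here is one record from the listing split into named fields. The field names are my own labels for illustration, and the approach assumes that file paths contain no spaces.

```r
# One record copied from the listing above
rec <- "PMC8387919 /pmc/articles/PMC8387919/bin/mmc2.xlsx Hsapiens 8 40238 39508 40603 38047 39873 36951 36951 37316"

# Split on single spaces; assumes the file path itself has no spaces
fields <- strsplit(rec, " ")[[1]]

parsed <- list(
  pmcid    = fields[1],              # PubMed Central ID
  file     = fields[2],              # supplementary file path
  species  = fields[3],              # species assigned to the gene list
  n_errors = as.numeric(fields[4]),  # number of converted gene names
  values   = fields[-(1:4)]          # the converted values themselves
)

# The stated count should agree with the number of listed values
length(parsed$values) == parsed$n_errors
```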
# By species
SPECIES <- sapply(strsplit(ERROR_GENELISTS," "),"[[",3)
table(SPECIES)
## SPECIES
## Athaliana Drerio Ggallus Hsapiens Mmusculus Rnorvegicus
## 1 5 3 133 18 3
## Scerevisiae
## 4
par(mar=c(5,12,4,2))
barplot(table(SPECIES),horiz=TRUE,las=1)
par(mar=c(5,5,4,2))
# Number of affected Excel files per paper
DIST <- table(sapply(strsplit(ERROR_GENELISTS," "),"[[",1))
DIST
##
## PMC7611670 PMC8379159 PMC8382756 PMC8387440 PMC8387478 PMC8387919 PMC8390644
## 10 1 1 1 1 2 1
## PMC8390680 PMC8395571 PMC8405691 PMC8405809 PMC8406630 PMC8406749 PMC8407593
## 2 1 1 2 1 5 11
## PMC8408347 PMC8408433 PMC8410083 PMC8410909 PMC8413298 PMC8415032 PMC8415302
## 2 1 2 1 1 1 3
## PMC8416908 PMC8417727 PMC8417742 PMC8417883 PMC8419706 PMC8420019 PMC8421377
## 15 1 1 1 3 1 7
## PMC8421382 PMC8421397 PMC8421684 PMC8422739 PMC8423302 PMC8425914 PMC8425920
## 1 1 2 2 1 1 3
## PMC8426625 PMC8427129 PMC8427675 PMC8428331 PMC8431925 PMC8432159 PMC8432294
## 1 3 2 3 1 3 3
## PMC8434739 PMC8434745 PMC8438076 PMC8438179 PMC8439043 PMC8439085 PMC8439149
## 2 1 1 4 1 2 14
## PMC8440929 PMC8441059 PMC8446408 PMC8446523 PMC8449218 PMC8450517 PMC8454932
## 1 2 5 1 1 1 4
## PMC8457092 PMC8457190 PMC8459715 PMC8459994 PMC8464068 PMC8473691 PMC8473911
## 1 2 4 5 1 2 1
## PMC8477972
## 4
summary(as.numeric(DIST))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.500 2.609 3.000 15.000
hist(DIST,main="Number of affected Excel files per paper")
# PMC Articles with the most errors
DIST_DF <- as.data.frame(DIST)
DIST_DF <- DIST_DF[order(-DIST_DF$Freq),,drop=FALSE]
head(DIST_DF,20)
## Var1 Freq
## 22 PMC8416908 15
## 49 PMC8439149 14
## 14 PMC8407593 11
## 1 PMC7611670 10
## 28 PMC8421377 7
## 13 PMC8406749 5
## 52 PMC8446408 5
## 60 PMC8459994 5
## 46 PMC8438179 4
## 56 PMC8454932 4
## 59 PMC8459715 4
## 64 PMC8477972 4
## 21 PMC8415302 3
## 26 PMC8419706 3
## 35 PMC8425920 3
## 37 PMC8427129 3
## 39 PMC8428331 3
## 41 PMC8432159 3
## 42 PMC8432294 3
## 6 PMC8387919 2
MOST_ERR_FILES = as.character(DIST_DF[1,1])
MOST_ERR_FILES
## [1] "PMC8416908"
# Number of errors per paper
NERR <- as.numeric(sapply(strsplit(ERROR_GENELISTS," "),"[[",4))
names(NERR) <- sapply(strsplit(ERROR_GENELISTS," "),"[[",1)
NERR <-tapply(NERR, names(NERR), sum)
NERR
## PMC7611670 PMC8379159 PMC8382756 PMC8387440 PMC8387478 PMC8387919 PMC8390644
## 35 1 1 2 10 9 40
## PMC8390680 PMC8395571 PMC8405691 PMC8405809 PMC8406630 PMC8406749 PMC8407593
## 53 3 2 2 11 92 115
## PMC8408347 PMC8408433 PMC8410083 PMC8410909 PMC8413298 PMC8415032 PMC8415302
## 56 8 2 27 10 25 4
## PMC8416908 PMC8417727 PMC8417742 PMC8417883 PMC8419706 PMC8420019 PMC8421377
## 19 7 5 4 53 3 168
## PMC8421382 PMC8421397 PMC8421684 PMC8422739 PMC8423302 PMC8425914 PMC8425920
## 1 14 27 12 1 1 10
## PMC8426625 PMC8427129 PMC8427675 PMC8428331 PMC8431925 PMC8432159 PMC8432294
## 4 15 2 24 1 173 3
## PMC8434739 PMC8434745 PMC8438076 PMC8438179 PMC8439043 PMC8439085 PMC8439149
## 3 16 5 4 15 16 757
## PMC8440929 PMC8441059 PMC8446408 PMC8446523 PMC8449218 PMC8450517 PMC8454932
## 7 11 11 1 2 2 8
## PMC8457092 PMC8457190 PMC8459715 PMC8459994 PMC8464068 PMC8473691 PMC8473911
## 1 14 35 26 8 27 1
## PMC8477972
## 33
hist(NERR,main="Number of errors per PMC article")
NERR_DF <- as.data.frame(NERR)
NERR_DF <- NERR_DF[order(-NERR_DF$NERR),,drop=FALSE]
head(NERR_DF,20)
## NERR
## PMC8439149 757
## PMC8432159 173
## PMC8421377 168
## PMC8407593 115
## PMC8406749 92
## PMC8408347 56
## PMC8390680 53
## PMC8419706 53
## PMC8390644 40
## PMC7611670 35
## PMC8459715 35
## PMC8477972 33
## PMC8410909 27
## PMC8421684 27
## PMC8473691 27
## PMC8459994 26
## PMC8415032 25
## PMC8428331 24
## PMC8416908 19
## PMC8434745 16
MOST_ERR = rownames(NERR_DF)[1]
MOST_ERR
## [1] "PMC8439149"
GENELIST_ERROR_ARTICLES <- gsub("PMC","",GENELIST_ERROR_ARTICLES)
### JSON PARSING is more reliable than XML
ARTICLES <- esummary( GENELIST_ERROR_ARTICLES , db="pmc" , retmode = "json" )
ARTICLE_DATA <- reutils::content(ARTICLES,as= "parsed")
ARTICLE_DATA <- ARTICLE_DATA$result
ARTICLE_DATA <- ARTICLE_DATA[2:length(ARTICLE_DATA)]
JOURNALS <- unlist(lapply(ARTICLE_DATA,function(x) {x$fulljournalname} ))
JOURNALS_TABLE <- table(JOURNALS)
JOURNALS_TABLE <- JOURNALS_TABLE[order(-JOURNALS_TABLE)]
length(JOURNALS_TABLE)
## [1] 42
NUM_JOURNALS=length(JOURNALS_TABLE)
par(mar=c(5,25,4,2))
barplot(head(JOURNALS_TABLE,10), horiz=TRUE, las=1,
xlab="Articles with gene name errors in supp files",
main="Top journals this month")
Congrats to our Journal of the Month winner!
JOURNAL_WINNER <- names(head(JOURNALS_TABLE,1))
JOURNAL_WINNER
## [1] "Nature Communications"
There are two categories:
Paper with the most supplementary files affected by gene name errors (MOST_ERR_FILES)
Paper with the most gene names converted to dates (MOST_ERR)
Sometimes, one paper can win both categories. Congrats to our winners.
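To see why the two categories can have different winners, here is a small sketch using the counts for three papers from the tables above. The vector names are just for this illustration.

```r
# Counts copied from the tables above for three papers
files_per_paper  <- c(PMC8416908 = 15, PMC8439149 = 14,  PMC8407593 = 11)
errors_per_paper <- c(PMC8416908 = 19, PMC8439149 = 757, PMC8407593 = 115)

# Category 1: most affected supplementary files
winner_files <- names(which.max(files_per_paper))
# Category 2: most gene names converted to dates
winner_errors <- names(which.max(errors_per_paper))

winner_files
winner_errors
# TRUE only when a single paper wins both categories
winner_files == winner_errors
```

Note that `which.max` returns the first maximum, so in the case of a tie only one paper would be reported.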
MOST_ERR_FILES <- gsub("PMC","",MOST_ERR_FILES)
ARTICLES <- esummary( MOST_ERR_FILES , db="pmc" , retmode = "json" )
ARTICLE_DATA <- reutils::content(ARTICLES,as= "parsed")
ARTICLE_DATA <- ARTICLE_DATA[2]
ARTICLE_DATA
## $result
## $result$uids
## [1] "8416908"
##
## $result$`8416908`
## $result$`8416908`$uid
## [1] "8416908"
##
## $result$`8416908`$pubdate
## [1] "2021 Aug 19"
##
## $result$`8416908`$epubdate
## [1] "2021 Aug 19"
##
## $result$`8416908`$printpubdate
## [1] ""
##
## $result$`8416908`$source
## [1] "Front Med (Lausanne)"
##
## $result$`8416908`$authors
## name authtype
## 1 Shboul ZA Author
## 2 Diawara N Author
## 3 Vossough A Author
## 4 Chen JY Author
## 5 Iftekharuddin KM Author
##
## $result$`8416908`$title
## [1] "Joint Modeling of RNAseq and Radiomics Data for Glioma Molecular Characterization and Prediction"
##
## $result$`8416908`$volume
## [1] "8"
##
## $result$`8416908`$issue
## [1] ""
##
## $result$`8416908`$pages
## [1] "705071"
##
## $result$`8416908`$articleids
## idtype value
## 1 pmid 34490297
## 2 doi 10.3389/fmed.2021.705071
## 3 pmcid PMC8416908
##
## $result$`8416908`$fulljournalname
## [1] "Frontiers in Medicine"
##
## $result$`8416908`$sortdate
## [1] "2021/08/19 00:00"
##
## $result$`8416908`$pmclivedate
## [1] "2021/09/05"
MOST_ERR <- gsub("PMC","",MOST_ERR)
ARTICLE_DATA <- esummary(MOST_ERR,db = "pmc" , retmode = "json" )
ARTICLE_DATA <- reutils::content(ARTICLE_DATA,as= "parsed")
ARTICLE_DATA
## $header
## $header$type
## [1] "esummary"
##
## $header$version
## [1] "0.3"
##
##
## $result
## $result$uids
## [1] "8439149"
##
## $result$`8439149`
## $result$`8439149`$uid
## [1] "8439149"
##
## $result$`8439149`$pubdate
## [1] "2020 Dec 9"
##
## $result$`8439149`$epubdate
## [1] "2020 Dec 9"
##
## $result$`8439149`$printpubdate
## [1] "2020 Dec"
##
## $result$`8439149`$source
## [1] "Nature"
##
## $result$`8439149`$authors
## name authtype
## 1 Jin X Author
## 2 Demere Z Author
## 3 Nair K Author
## 4 Ali A Author
## 5 Ferraro GB Author
## 6 Natoli T Author
## 7 Deik A Author
## 8 Petronio L Author
## 9 Tang AA Author
## 10 Zhu C Author
## 11 Wang L Author
## 12 Rosenberg D Author
## 13 Mangena V Author
## 14 Roth J Author
## 15 Chung K Author
## 16 Jain RK Author
## 17 Clish CB Author
## 18 Vander Heiden MG Author
## 19 Golub TR Author
##
## $result$`8439149`$title
## [1] "A metastasis map of human cancer cell lines"
##
## $result$`8439149`$volume
## [1] "588"
##
## $result$`8439149`$issue
## [1] "7837"
##
## $result$`8439149`$pages
## [1] "331-336"
##
## $result$`8439149`$articleids
## idtype value
## 1 pmid 33299191
## 2 doi 10.1038/s41586-020-2969-2
## 3 pmcid PMC8439149
## 4 MID HHMIMS1624022
##
## $result$`8439149`$fulljournalname
## [1] "Nature"
##
## $result$`8439149`$sortdate
## [1] "2020/12/09 00:00"
##
## $result$`8439149`$pmclivedate
## [1] "2021/09/14"
Next, plot the trend over the past 6-12 months, using values scraped from the earlier reports published online.
url <- "http://ziemann-lab.net/public/gene_name_errors/"
doc <- htmlParse(url)
links <- xpathSApply(doc, "//a/@href")
links <- links[grep("html",links)]
links
## href href href
## "Report_2021-02.html" "Report_2021-03.html" "Report_2021-04.html"
## href href href
## "Report_2021-05.html" "Report_2021-06.html" "Report_2021-07.html"
## href href
## "Report_2021-08.html" "Report_2021-09.html"
unlink("online_files/",recursive=TRUE)
dir.create("online_files")
sapply(links, function(mylink) {
download.file(paste(url,mylink,sep=""),destfile=paste("online_files/",mylink,sep=""))
} )
## href href href href href href href href
## 0 0 0 0 0 0 0 0
myfilelist <- list.files("online_files/",full.names=TRUE)
trends <- sapply(myfilelist, function(myfilename) {
x <- readLines(myfilename)
# Num XL gene list articles
NUM_GENELIST_ARTICLES <- x[grep("NUM_GENELIST_ARTICLES",x)[3]+1]
NUM_GENELIST_ARTICLES <- sapply(strsplit(NUM_GENELIST_ARTICLES," "),"[[",3)
NUM_GENELIST_ARTICLES <- sapply(strsplit(NUM_GENELIST_ARTICLES,"<"),"[[",1)
NUM_GENELIST_ARTICLES <- as.numeric(NUM_GENELIST_ARTICLES)
# number of affected articles
NUM_ERROR_GENELIST_ARTICLES <- x[grep("NUM_ERROR_GENELIST_ARTICLES",x)[3]+1]
NUM_ERROR_GENELIST_ARTICLES <- sapply(strsplit(NUM_ERROR_GENELIST_ARTICLES," "),"[[",3)
NUM_ERROR_GENELIST_ARTICLES <- sapply(strsplit(NUM_ERROR_GENELIST_ARTICLES,"<"),"[[",1)
NUM_ERROR_GENELIST_ARTICLES <- as.numeric(NUM_ERROR_GENELIST_ARTICLES)
# Error proportion
ERROR_PROPORTION <- x[grep("ERROR_PROPORTION",x)[3]+1]
ERROR_PROPORTION <- sapply(strsplit(ERROR_PROPORTION," "),"[[",3)
ERROR_PROPORTION <- sapply(strsplit(ERROR_PROPORTION,"<"),"[[",1)
ERROR_PROPORTION <- as.numeric(ERROR_PROPORTION)
# number of journals
NUM_JOURNALS <- x[grep('JOURNALS_TABLE',x)[3]+1]
NUM_JOURNALS <- sapply(strsplit(NUM_JOURNALS," "),"[[",3)
NUM_JOURNALS <- sapply(strsplit(NUM_JOURNALS,"<"),"[[",1)
NUM_JOURNALS <- as.numeric(NUM_JOURNALS)
res <- c(NUM_GENELIST_ARTICLES,NUM_ERROR_GENELIST_ARTICLES,ERROR_PROPORTION,NUM_JOURNALS)
return(res)
})
colnames(trends) <- sapply(strsplit(colnames(trends),"_"),"[[",3)
colnames(trends) <- gsub(".html","",colnames(trends),fixed=TRUE)
trends <- as.data.frame(trends)
rownames(trends) <- c("NUM_GENELIST_ARTICLES","NUM_ERROR_GENELIST_ARTICLES","ERROR_PROPORTION","NUM_JOURNALS")
trends <- t(trends)
trends <- as.data.frame(trends)
CURRENT_RES <- c(NUM_GENELIST_ARTICLES,NUM_ERROR_GENELIST_ARTICLES,ERROR_PROPORTION,NUM_JOURNALS)
trends <- rbind(trends,CURRENT_RES)
paste(CURRENT_YEAR,CURRENT_MONTH,sep="-")
## [1] "2021-10"
rownames(trends)[nrow(trends)] <- paste(CURRENT_YEAR,CURRENT_MONTH,sep="-")
plot(trends$NUM_GENELIST_ARTICLES, xaxt = "n" , type="b" , main="Number of articles with Excel gene lists per month",
ylab="number of articles", xlab="month")
axis(1, at=1:nrow(trends), labels=rownames(trends))
plot(trends$NUM_ERROR_GENELIST_ARTICLES, xaxt = "n" , type="b" , main="Number of articles with gene name errors per month",
ylab="number of articles", xlab="month")
axis(1, at=1:nrow(trends), labels=rownames(trends))
plot(trends$ERROR_PROPORTION, xaxt = "n" , type="b" , main="Proportion of articles with Excel gene list affected by errors",
ylab="proportion", xlab="month")
axis(1, at=1:nrow(trends), labels=rownames(trends))
plot(trends$NUM_JOURNALS, xaxt = "n" , type="b" , main="Number of journals with affected articles",
ylab="number of journals", xlab="month")
axis(1, at=1:nrow(trends), labels=rownames(trends))
unlink("online_files/",recursive=TRUE)
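The scraping step above depends on the exact position of each value in the rendered HTML (the third match of a variable name, then the third space-separated token on the next line). Here is a slightly more defensive sketch of the same idea. The helper `extract_value` and the two toy HTML lines are hypothetical, and the real report markup may differ.

```r
# Hypothetical helper: find the n-th line containing `pattern`, then read the
# number printed after "[1]" on the following line. Returns NA if not found.
extract_value <- function(lines, pattern, occurrence = 1) {
  idx <- grep(pattern, lines, fixed = TRUE)[occurrence]
  if (is.na(idx) || idx >= length(lines)) return(NA_real_)
  # Drop everything up to and including the "[1]" index printed by R
  out <- sub("^.*\\[1\\][[:space:]]*", "", lines[idx + 1])
  # Keep only the leading number, ignoring trailing HTML tags
  num <- regmatches(out, regexpr("^[0-9]+(\\.[0-9]+)?", out))
  if (length(num) == 0) NA_real_ else as.numeric(num)
}

# Toy lines standing in for a rendered report (assumed layout)
toy <- c(
  "<pre class=\"r\"><code>length(JOURNALS_TABLE)</code></pre>",
  "<pre><code>## [1] 42</code></pre>"
)

extract_value(toy, "JOURNALS_TABLE")       # value on the line after the match
extract_value(toy, "NOT_IN_REPORT")        # NA instead of an error
```

Returning `NA` rather than failing means a report with an unexpected layout shows up as a gap in the trend plot instead of stopping the whole run.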
Zeeberg, B.R., Riss, J., Kane, D.W. et al. Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics 5, 80 (2004). https://doi.org/10.1186/1471-2105-5-80
Ziemann, M., Eren, Y. & El-Osta, A. Gene name errors are widespread in the scientific literature. Genome Biol 17, 177 (2016). https://doi.org/10.1186/s13059-016-1044-7
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] readxl_1.3.1 reutils_0.2.3 xml2_1.3.2 jsonlite_1.7.2 XML_3.99-0.7
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 knitr_1.34 magrittr_2.0.1 R6_2.5.1
## [5] rlang_0.4.11 fastmap_1.1.0 stringr_1.4.0 highr_0.9
## [9] tools_4.1.1 xfun_0.26 jquerylib_0.1.4 htmltools_0.5.2
## [13] yaml_2.2.1 digest_0.6.27 assertthat_0.2.1 sass_0.4.0
## [17] bitops_1.0-7 RCurl_1.98-1.4 evaluate_0.14 rmarkdown_2.11
## [21] stringi_1.7.4 compiler_4.1.1 bslib_0.3.0 cellranger_1.1.0