Source: https://github.com/markziemann/GeneNameErrors2020
View the reports: http://ziemann-lab.net/public/gene_name_errors/
Gene name errors result when data are imported improperly into MS Excel and other spreadsheet programs (Zeeberg et al, 2004). Certain gene names like MARCH3, SEPT2 and DEC1 are converted into date format. These errors are surprisingly common in supplementary data files in the field of genomics (Ziemann et al, 2016). This could be considered a minor error because it affects only a small number of genes; however, it is symptomatic of poor data processing methods. The purpose of this script is to identify gene name errors present in supplementary files of PubMed Central articles from the previous month.
library("XML")
library("jsonlite")
library("xml2")
library("reutils")
library("readxl")
Here I will be getting PubMed Central IDs for the previous month.
Start with figuring out the date to search PubMed Central.
CURRENT_MONTH=format(Sys.time(), "%m")
CURRENT_YEAR=format(Sys.time(), "%Y")
if (CURRENT_MONTH == "01") {
  # In January the previous month is December of the previous year
  PREV_YEAR=as.character(as.numeric(CURRENT_YEAR)-1)
  PREV_MONTH="12"
} else {
  PREV_YEAR=CURRENT_YEAR
  # Zero-pad the month so that e.g. September becomes "09" rather than "9"
  PREV_MONTH=sprintf("%02d",as.numeric(CURRENT_MONTH)-1)
}
DATE=paste(PREV_YEAR,"/",PREV_MONTH,sep="")
DATE
## [1] "2021/10"
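The same previous-month string can also be derived with base R date arithmetic, which avoids the January special case. This is only a sketch: `prev_month` is a hypothetical helper name that is not part of the original script, and it takes the reference date as an argument so it can be checked against fixed dates.

```r
# Hypothetical helper (not in the original script): return the month before
# `today` as a "YYYY/MM" string.
prev_month <- function(today = Sys.Date()) {
  # Move to the first day of the current month so that stepping back one
  # month never lands on a non-existent day (e.g. 31 February).
  first_of_month <- as.Date(format(today, "%Y-%m-01"))
  previous <- seq(first_of_month, by = "-1 month", length.out = 2)[2]
  format(previous, "%Y/%m")
}

prev_month(as.Date("2021-11-15"))
prev_month(as.Date("2022-01-03"))
```

The second call exercises the year rollover that the `if` branch handles explicitly.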
Let’s see how many PMC IDs we have in the past month.
QUERY ='((genom*[Abstract]))'
ESEARCH_RES <- esearch(term=QUERY, db = "pmc", rettype = "uilist", retmode = "xml", retstart = 0,
retmax = 5000000, usehistory = TRUE, webenv = NULL, querykey = NULL, sort = NULL, field = NULL,
datetype = NULL, reldate = NULL, mindate = DATE, maxdate = DATE)
pmc <- efetch(ESEARCH_RES,retmode="text",rettype="uilist",outfile="pmcids.txt")
## Retrieving UIDs 1 to 500
## Retrieving UIDs 501 to 1000
## Retrieving UIDs 1001 to 1500
## Retrieving UIDs 1501 to 2000
## Retrieving UIDs 2001 to 2500
## Retrieving UIDs 2501 to 3000
## Retrieving UIDs 3001 to 3500
pmc <- read.table(pmc)
pmc <- paste("PMC",pmc$V1,sep="")
NUM_ARTICLES=length(pmc)
NUM_ARTICLES
## [1] 3320
writeLines(pmc,con="pmc.txt")
Now run the bash script. Note that false positives can occur (~1.5%) and these results have not been verified by a human.
Here are some definitions:
NUM_XLS = Number of supplementary Excel files in this set of PMC articles.
NUM_XLS_ARTICLES = Number of articles matching the PubMed Central search which have supplementary Excel files.
GENELISTS = The gene lists found in the Excel files. Each Excel file is counted once even if it has multiple gene lists.
NUM_GENELISTS = The number of Excel files with gene lists.
NUM_GENELIST_ARTICLES = The number of PMC articles with supplementary Excel gene lists.
ERROR_GENELISTS = Files suspected to contain gene name errors. The dates and five-digit numbers indicate transmogrified gene names.
NUM_ERROR_GENELISTS = Number of Excel gene lists with errors.
NUM_ERROR_GENELIST_ARTICLES = Number of articles with supplementary Excel gene name errors.
ERROR_PROPORTION = This is the proportion of articles with Excel gene lists that have errors.
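The definition of ERROR_GENELISTS mentions two kinds of evidence: date-like strings and five-digit numbers. The sketch below illustrates how such tokens might be told apart with regular expressions. It is not the logic of `gene_names.sh`, which is not shown here; `classify_token` and its patterns are assumptions made for illustration only.

```r
# Hypothetical helper (not taken from gene_names.sh): label each token as a
# date-like string, a five-digit number (a possible Excel date serial), or
# neither.
classify_token <- function(x) {
  month <- "(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"
  # Strings such as "1-Mar" or "12-SEP"
  day_month <- grepl(paste0("^[0-9]{1,2}-", month, "$"), x, ignore.case = TRUE)
  # Strings such as "2002-03-01"
  iso_date <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  # Strings such as "44445"
  five_digit <- grepl("^[0-9]{5}$", x)
  ifelse(day_month | iso_date, "date",
         ifelse(five_digit, "five_digit", "other"))
}

classify_token(c("1-Mar", "12-SEP", "2002-03-01", "44445", "MARCH3", "TP53"))
```

A five-digit number is only suggestive of a converted gene name, which is one reason false positives can occur.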
system("./gene_names.sh pmc.txt")
results <- readLines("results.txt")
XLS <- results[grep("XLS",results,ignore.case=TRUE)]
NUM_XLS = length(XLS)
NUM_XLS
## [1] 4537
NUM_XLS_ARTICLES = length(unique(sapply(strsplit(XLS," "),"[[",1)))
NUM_XLS_ARTICLES
## [1] 803
GENELISTS <- XLS[lapply(strsplit(XLS," "),length)>2]
#GENELISTS
NUM_GENELISTS <- length(unique(sapply(strsplit(GENELISTS," "),"[[",2)))
NUM_GENELISTS
## [1] 579
NUM_GENELIST_ARTICLES <- length(unique(sapply(strsplit(GENELISTS," "),"[[",1)))
NUM_GENELIST_ARTICLES
## [1] 277
ERROR_GENELISTS <- XLS[lapply(strsplit(XLS," "),length)>3]
#ERROR_GENELISTS
NUM_ERROR_GENELISTS = length(ERROR_GENELISTS)
NUM_ERROR_GENELISTS
## [1] 174
GENELIST_ERROR_ARTICLES <- unique(sapply(strsplit(ERROR_GENELISTS," "),"[[",1))
GENELIST_ERROR_ARTICLES
## [1] "PMC8555832" "PMC8554095" "PMC8519947" "PMC8514428" "PMC8511044"
## [6] "PMC8544599" "PMC8532573" "PMC8497462" "PMC8479071" "PMC8530322"
## [11] "PMC8498718" "PMC8518179" "PMC8501628" "PMC8523829" "PMC8519440"
## [16] "PMC8490982" "PMC8513875" "PMC8509297" "PMC8507324" "PMC8498966"
## [21] "PMC8498917" "PMC8554124" "PMC8521585" "PMC8514339" "PMC8505442"
## [26] "PMC8496489" "PMC8494907" "PMC8492352" "PMC8486594" "PMC8486213"
## [31] "PMC8458293" "PMC8455625" "PMC8455578" "PMC8529555" "PMC8544335"
## [36] "PMC8539775" "PMC8481544" "PMC8463701" "PMC8518362" "PMC8517214"
## [41] "PMC8528285" "PMC8515723" "PMC8503750" "PMC8494642" "PMC8516354"
## [46] "PMC8505008" "PMC8503558" "PMC8502891" "PMC8494765" "PMC8476631"
## [51] "PMC8455334" "PMC8484507" "PMC8490889" "PMC8490743" "PMC8490713"
## [56] "PMC8490697" "PMC8452624" "PMC8489709" "PMC8488206" "PMC8488120"
## [61] "PMC8449357" "PMC8445938" "PMC8484522" "PMC8482641"
NUM_ERROR_GENELIST_ARTICLES <- length(GENELIST_ERROR_ARTICLES)
NUM_ERROR_GENELIST_ARTICLES
## [1] 64
ERROR_PROPORTION = NUM_ERROR_GENELIST_ARTICLES / NUM_GENELIST_ARTICLES
ERROR_PROPORTION
## [1] 0.2310469
Here you can have a look at all the gene lists detected in the past month, as well as those with errors. The dates are obvious errors; they are commonly dates in September, March, December and October. The five-digit numbers represent dates as they are encoded in Excel's internal format, which stores a date as the number of days elapsed since the start of 1900. If you put these numbers into Excel and format the cells as dates, most of them will also map to dates in September, March, December and October.
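The conversion from five-digit number to date can be reproduced in R without opening Excel. The sketch below assumes the files use Excel's default 1900 date system, for which the usual R origin is 1899-12-30; `excel_serial_to_date` is a hypothetical helper name, and the serial numbers are copied from the ERROR_GENELISTS output.

```r
# Hypothetical helper: convert Excel (1900 date system) serial numbers to
# R Date objects. The 1899-12-30 origin accounts for Excel counting from 1
# and treating 1900 as a leap year; it is valid for dates after February 1900.
excel_serial_to_date <- function(serial) {
  as.Date(as.numeric(serial), origin = "1899-12-30")
}

# Serial numbers taken from the ERROR_GENELISTS output
serials <- c(44445, 44256, 43891, 38596)
dates <- excel_serial_to_date(serials)
format(dates, "%Y-%m-%d")
# [1] "2021-09-06" "2021-03-01" "2020-03-01" "2005-09-01"
```

All four fall in September or March, consistent with names of the SEPT and MARCH families having been converted to dates.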
#GENELISTS
ERROR_GENELISTS
## [1] "PMC8555832 /pmc/articles/PMC8555832/bin/pgen.1009813.s023.xlsx Hsapiens 5 44445 44441 44446 44450 44448"
## [2] "PMC8554095 /pmc/articles/PMC8554095/bin/Table2.XLSX Hsapiens 10 43894 44085 43892 43893 44080 43900 43901 43896 44086 43898"
## [3] "PMC8554095 /pmc/articles/PMC8554095/bin/Table2.XLSX Hsapiens 13 43897 44076 44077 43891 44085 43892 43893 44080 44078 43901 43896 43895 43898"
## [4] "PMC8554095 /pmc/articles/PMC8554095/bin/Table2.XLSX Ggallus 10 44077 43891 44085 43893 44082 43892 43900 43896 44086 43898"
## [5] "PMC8554095 /pmc/articles/PMC8554095/bin/Table2.XLSX Hsapiens 1 43532"
## [6] "PMC8554095 /pmc/articles/PMC8554095/bin/Table2.XLSX Hsapiens 2 43716 43534"
## [7] "PMC8519947 /pmc/articles/PMC8519947/bin/41467_2021_26299_MOESM14_ESM.xlsx Hsapiens 17 26148 92973 55747 90834 26222 11217 23285 79686 80761 90784 55410 54094 94023 25787 10896 84849 83935"
## [8] "PMC8519947 /pmc/articles/PMC8519947/bin/41467_2021_26299_MOESM14_ESM.xlsx Hsapiens 15 92973 54094 26148 25787 79686 55747 90784 55410 90834 94023 80761 23285 84849 83935 10896"
## [9] "PMC8519947 /pmc/articles/PMC8519947/bin/41467_2021_26299_MOESM14_ESM.xlsx Hsapiens 7 79686 80761 55747 26148 23285 83935 90784"
## [10] "PMC8519947 /pmc/articles/PMC8519947/bin/41467_2021_26299_MOESM4_ESM.xlsx Hsapiens 17 26148 92973 55747 90834 26222 11217 23285 79686 80761 90784 55410 54094 94023 25787 10896 84849 83935"
## [11] "PMC8519947 /pmc/articles/PMC8519947/bin/41467_2021_26299_MOESM6_ESM.xlsx Hsapiens 15 92973 54094 26148 25787 79686 55747 90784 55410 90834 94023 80761 23285 84849 83935 10896"
## [12] "PMC8519947 /pmc/articles/PMC8519947/bin/41467_2021_26299_MOESM8_ESM.xlsx Hsapiens 7 79686 80761 55747 26148 23285 83935 90784"
## [13] "PMC8514428 zip/Supplementary_Software_1/Genomics_Analysis/interim_files/df_Out_preGeneRename.xlsx Hsapiens 1 12-SEP"
## [14] "PMC8511044 /pmc/articles/PMC8511044/bin/41467_2021_25860_MOESM9_ESM.xlsx Hsapiens 1 42619"
## [15] "PMC8511044 /pmc/articles/PMC8511044/bin/41467_2021_25860_MOESM9_ESM.xlsx Hsapiens 1 42429"
## [16] "PMC8544599 /pmc/articles/PMC8544599/bin/Table_1.XLSX Hsapiens 13 44531 44265 44266 44256 44257 44258 44259 44260 44261 44262 44263 44264 44454"
## [17] "PMC8532573 /pmc/articles/PMC8532573/bin/13059_2021_2513_MOESM5_ESM.xlsx Hsapiens 27 44444 44442 44256 44443 44266 44257 44449 44265 44446 44261 44257 44262 44264 44441 44447 44451 44531 44445 44259 44440 44258 44256 44260 44263 44450 44453 44448"
## [18] "PMC8532573 /pmc/articles/PMC8532573/bin/13059_2021_2513_MOESM7_ESM.xlsx Hsapiens 27 44442 44444 44256 44265 44447 44259 44531 44446 44256 44450 44262 44449 44443 44266 44445 44441 44448 44451 44440 44257 44257 44263 44453 44261 44264 44260 44258"
## [19] "PMC8532573 /pmc/articles/PMC8532573/bin/13059_2021_2513_MOESM9_ESM.xlsx Hsapiens 27 44259 44262 44444 44442 44531 44256 44257 44256 44265 44266 44257 44258 44260 44261 44263 44264 44440 44449 44450 44451 44453 44441 44443 44445 44446 44447 44448"
## [20] "PMC8497462 /pmc/articles/PMC8497462/bin/42003_2021_2673_MOESM4_ESM.xlsx Hsapiens 1 44259"
## [21] "PMC8479071 /pmc/articles/PMC8479071/bin/41467_2021_25935_MOESM13_ESM.xlsx Mmusculus 1 37681"
## [22] "PMC8479071 /pmc/articles/PMC8479071/bin/41467_2021_25935_MOESM4_ESM.xlsx Mmusculus 10 37316 40787 38596 39692 38961 38231 37681 40603 37135 39873"
## [23] "PMC8479071 /pmc/articles/PMC8479071/bin/41467_2021_25935_MOESM7_ESM.xlsx Hsapiens 63 39326 39326 39326 39326 39326 39326 39326 39326 39326 40603 40603 38777 38777 38777 38961 40787 40787 40787 40787 40787 40787 40787 40787 39508 39508 39508 39508 39508 39508 38412 38412 38412 38412 38412 38047 38047 39142 37500 39692 39692 39692 38596 38596 38596 40057 40057 40057 40057 40057 40057 40057 40057 40057 40057 40057 40057 40057 40057 40057 37316 37681 37681 37681"
## [24] "PMC8479071 /pmc/articles/PMC8479071/bin/41467_2021_25935_MOESM8_ESM.xlsx Mmusculus 8 37316 40787 38596 39692 38961 38231 37681 40603"
## [25] "PMC8530322 /pmc/articles/PMC8530322/bin/pone.0258316.s002.xlsx Hsapiens 1 40057"
## [26] "PMC8498718 /pmc/articles/PMC8498718/bin/mmc1.xlsx Hsapiens 28 43435 43160 43161 43160 43169 43170 43161 43162 43163 43164 43165 43166 43167 43168 43358 43344 43353 43354 43355 43357 43345 43346 43347 43348 43349 43350 43351 43352"
## [27] "PMC8518179 /pmc/articles/PMC8518179/bin/13059_2021_2498_MOESM2_ESM.xlsx Mmusculus 1 43535"
## [28] "PMC8518179 /pmc/articles/PMC8518179/bin/13059_2021_2498_MOESM2_ESM.xlsx Mmusculus 1 43717"
## [29] "PMC8501628 /pmc/articles/PMC8501628/bin/41021_2021_216_MOESM2_ESM.xlsx Mmusculus 1 42795"
## [30] "PMC8523829 /pmc/articles/PMC8523829/bin/DataSheet4.xlsx Hsapiens 1 44258"
## [31] "PMC8523829 /pmc/articles/PMC8523829/bin/DataSheet4.xlsx Hsapiens 2 44263 44265"
## [32] "PMC8523829 /pmc/articles/PMC8523829/bin/DataSheet4.xlsx Hsapiens 1 44257"
## [33] "PMC8523829 /pmc/articles/PMC8523829/bin/DataSheet4.xlsx Hsapiens 1 44258"
## [34] "PMC8519440 /pmc/articles/PMC8519440/bin/pntd.0009750.s017.xlsx Hsapiens 26 11951 11758 12281 15323 10721 11453 10021 16180 13800 28723 10899 22725 15324 14221 10378 23893 17177 11469 25485 11721 10295 14821 17631 11956 15645 16496"
## [35] "PMC8490982 /pmc/articles/PMC8490982/bin/EMBR-22-e52823-s004.xlsx Hsapiens 4 40057 39692 38961 40787"
## [36] "PMC8490982 /pmc/articles/PMC8490982/bin/EMBR-22-e52823-s004.xlsx Hsapiens 4 38231 39326 38961 41153"
## [37] "PMC8490982 /pmc/articles/PMC8490982/bin/EMBR-22-e52823-s005.xlsx Hsapiens 2 37135 38596"
## [38] "PMC8490982 /pmc/articles/PMC8490982/bin/EMBR-22-e52823-s005.xlsx Hsapiens 2 37135 38231"
## [39] "PMC8490982 /pmc/articles/PMC8490982/bin/EMBR-22-e52823-s005.xlsx Hsapiens 2 40057 37135"
## [40] "PMC8490982 /pmc/articles/PMC8490982/bin/EMBR-22-e52823-s008.xlsx Hsapiens 2 40057 37500"
## [41] "PMC8513875 /pmc/articles/PMC8513875/bin/pgen.1009834.s011.xlsx Dmelanogaster 1 38596"
## [42] "PMC8509297 /pmc/articles/PMC8509297/bin/Table_1.XLSX Dmelanogaster 4 44441 44440 44443 44444"
## [43] "PMC8509297 /pmc/articles/PMC8509297/bin/Table_1.XLSX Dmelanogaster 2 44440 44443"
## [44] "PMC8509297 /pmc/articles/PMC8509297/bin/Table_1.XLSX Dmelanogaster 3 44440 44441 44444"
## [45] "PMC8509297 /pmc/articles/PMC8509297/bin/Table_1.XLSX Dmelanogaster 2 44440 44443"
## [46] "PMC8509297 /pmc/articles/PMC8509297/bin/Table_1.XLSX Dmelanogaster 3 44440 44441 44444"
## [47] "PMC8509297 /pmc/articles/PMC8509297/bin/Table_1.XLSX Dmelanogaster 1 44440"
## [48] "PMC8509297 /pmc/articles/PMC8509297/bin/Table_1.XLSX Dmelanogaster 1 44440"
## [49] "PMC8509297 /pmc/articles/PMC8509297/bin/Table_2.XLSX Dmelanogaster 1 44440"
## [50] "PMC8509297 /pmc/articles/PMC8509297/bin/Table_2.XLSX Dmelanogaster 1 44440"
## [51] "PMC8509297 /pmc/articles/PMC8509297/bin/Table_2.XLSX Dmelanogaster 1 44440"
## [52] "PMC8509297 /pmc/articles/PMC8509297/bin/Table_2.XLSX Dmelanogaster 1 44440"
## [53] "PMC8507324 zip/data_sheet/Supplementary_Table_3.xlsx Hsapiens 1 43891"
## [54] "PMC8498966 /pmc/articles/PMC8498966/bin/mmc10.xlsx Hsapiens 41 44083 43894 43893 44083 44085 44166 43894 44083 44166 44076 43891 44083 44080 44075 44083 44166 44085 44083 44085 44166 43894 44083 44083 44085 44083 44083 44166 43894 44082 44075 44085 44083 44166 43891 43894 44080 44083 44166 44083 44166 44085"
## [55] "PMC8498917 /pmc/articles/PMC8498917/bin/ADVS-8-2101426-s003.xlsx Hsapiens 12 44447 44450 44446 44441 44257 44454 44448 44445 44449 44260 44256 44261"
## [56] "PMC8498917 /pmc/articles/PMC8498917/bin/ADVS-8-2101426-s003.xlsx Hsapiens 12 44261 44445 44447 44260 44449 44448 44446 44256 44454 44450 44257 44441"
## [57] "PMC8498917 /pmc/articles/PMC8498917/bin/ADVS-8-2101426-s003.xlsx Hsapiens 12 44448 44447 44441 44450 44261 44449 44446 44445 44260 44454 44256 44257"
## [58] "PMC8498917 /pmc/articles/PMC8498917/bin/ADVS-8-2101426-s003.xlsx Hsapiens 9 44445 44260 44441 44261 44447 44446 44450 44448 44454"
## [59] "PMC8554124 /pmc/articles/PMC8554124/bin/DataSheet10.xlsx Mmusculus 1 43165"
## [60] "PMC8554124 /pmc/articles/PMC8554124/bin/DataSheet10.xlsx Mmusculus 5 39695 43166 43170 43347 43894"
## [61] "PMC8554124 /pmc/articles/PMC8554124/bin/DataSheet10.xlsx Mmusculus 1 38596"
## [62] "PMC8554124 /pmc/articles/PMC8554124/bin/DataSheet10.xlsx Mmusculus 16 43348 44085 43891 43165 44081 43167 43347 43166 43352 43345 43346 43353 43354 43350 43170 43168"
## [63] "PMC8554124 /pmc/articles/PMC8554124/bin/DataSheet10.xlsx Mmusculus 28 44085 43894 43348 43891 43347 39695 43170 43166 44081 42618 43163 43167 43901 43893 43354 43169 40971 41156 41162 42620 43352 43353 39697 42621 43345 40969 43346 43350"
## [64] "PMC8554124 /pmc/articles/PMC8554124/bin/DataSheet10.xlsx Mmusculus 3 44085 43348 43891"
## [65] "PMC8554124 /pmc/articles/PMC8554124/bin/DataSheet2.XLSX Hsapiens 1 43527"
## [66] "PMC8554124 /pmc/articles/PMC8554124/bin/DataSheet3.XLSX Hsapiens 1 43527"
## [67] "PMC8521585 /pmc/articles/PMC8521585/bin/41467_2021_26174_MOESM5_ESM.xlsx Ggallus 1 40057"
## [68] "PMC8514339 /pmc/articles/PMC8514339/bin/41586_2021_3922_MOESM2_ESM.xlsx Hsapiens 24 42068 42256 42248 42254 42065 42253 42257 42262 42259 42249 42262 42253 42250 42251 42064 42258 42257 42070 42253 42071 42065 42072 42251 42249"
## [69] "PMC8505442 /pmc/articles/PMC8505442/bin/41467_2021_26272_MOESM9_ESM.xlsx Hsapiens 19 43896 44084 43893 44079 44085 43892 44082 44083 43901 43894 43892 43898 44078 43897 44081 44076 43895 44080 44077"
## [70] "PMC8505442 /pmc/articles/PMC8505442/bin/41467_2021_26272_MOESM9_ESM.xlsx Hsapiens 19 43896 44084 43893 44079 44085 43892 44082 44083 43901 43894 43892 43898 44078 43897 44081 44076 43895 44080 44077"
## [71] "PMC8496489 /pmc/articles/PMC8496489/bin/Table_4.XLSX Hsapiens 15 43898 43899 43899 44075 43900 44083 44083 44083 43892 44079 43896 44082 44082 44081 44080"
## [72] "PMC8496489 /pmc/articles/PMC8496489/bin/Table_4.XLSX Hsapiens 19 43892 43898 43899 43899 44075 44083 44083 43892 44076 44079 44085 44085 44085 44085 43896 44082 44082 44081 44080"
## [73] "PMC8496489 /pmc/articles/PMC8496489/bin/Table_4.XLSX Hsapiens 13 43892 43898 43899 43899 44083 44083 44083 43892 44079 44079 44085 44082 44081"
## [74] "PMC8496489 /pmc/articles/PMC8496489/bin/Table_4.XLSX Hsapiens 16 43891 43899 43899 44075 44083 44083 44083 44083 43892 43894 44079 44079 44079 44082 44082 44080"
## [75] "PMC8496489 /pmc/articles/PMC8496489/bin/Table_5.XLSX Hsapiens 1 43534"
## [76] "PMC8496489 /pmc/articles/PMC8496489/bin/Table_6.XLSX Hsapiens 1 44075"
## [77] "PMC8494907 /pmc/articles/PMC8494907/bin/41419_2021_4191_MOESM2_ESM.xlsx Hsapiens 28 38412 37865 39142 39873 36951 37226 38047 40057 40238 38231 37681 39508 37500 39692 38777 38961 38596 40787 36951 37316 37135 39326 42248 37316 41153 40422 41883 40603"
## [78] "PMC8494907 /pmc/articles/PMC8494907/bin/41419_2021_4191_MOESM2_ESM.xlsx Hsapiens 28 39142 37865 41883 38412 37500 39692 38047 37316 38961 37681 40603 38231 40422 36951 41153 38777 39873 38596 39508 36951 39326 40787 42248 37226 37316 37135 40057 40238"
## [79] "PMC8492352 /pmc/articles/PMC8492352/bin/jciinsight-6-144575-s123.xlsx Mmusculus 361 44089 43892 43895 43896 43897 44076 43898 43892 44082 44084 44081 43899 44077 44078 44085 43891 44080 44083 44075 44083 44076 44085 43898 43896 44082 43897 43892 44084 43891 44077 44081 44078 44089 44080 44075 43897 43891 44083 43891 44075 44080 43894 43893 44077 43892 43897 44084 43898 43896 44082 43895 44085 43892 44078 44081 44076 44089 44083 44085 44075 44078 44089 44076 44081 43892 44084 43896 44082 43895 43898 43899 43891 43897 44083 44085 44080 44075 44075 44080 43893 44077 43896 44084 43898 44082 43895 43892 43891 44083 43892 44081 44078 44076 44085 44089 44075 44080 44077 43896 43898 44084 43895 44082 43892 44083 43891 44078 44076 44085 44089 44075 44080 44077 43898 44084 43896 44082 43892 43892 44078 43891 44076 44089 44075 44085 44084 43898 44078 43896 43891 43897 44089 44076 44089 43892 43895 43898 43894 44081 44085 44084 44076 43896 43891 44079 43900 43899 43891 44080 44075 44089 43898 43899 44085 44083 44080 44081 44075 44080 44077 44083 44081 44076 43892 44089 44085 43891 43895 43892 43897 43896 44084 44079 44082 43898 44078 44075 44081 44075 44085 44083 44080 44076 43895 43892 43892 44089 44084 43897 44082 43896 43898 44078 44078 44089 44076 44084 43894 43893 44082 43896 44077 43892 44081 43895 43891 43892 44085 43897 44083 44080 44075 44078 44089 44076 44081 44082 44084 44085 43892 43898 44079 44077 43897 43891 44083 44080 44075 44075 44085 44080 43892 44083 44082 44089 44081 43893 44077 44084 43898 44078 43897 43891 44076 44075 44085 44089 44081 44082 44077 43898 44084 43895 44078 43891 43897 44076 44089 44081 44082 44076 43891 43892 43892 43898 44084 43901 44077 44078 44085 44083 44080 44075 44089 44078 43895 44083 44089 43892 43891 44081 44076 43892 43897 44077 43898 44078 44080 44075 43895 44083 43891 44078 44080 44075 44083 43891 43898 44077 43896 43897 44082 43895 43892 44089 44078 44085 44080 43892 44081 44075 43897 43899 44077 43898 44084 44082 43896 44078 43892 43895 44083 44076 44081 44080 44085 43892 44075 44089 44075 44085 44080 44079 43893 44077 44084 44078 44082 43897 43896 43892 44083 43891 43892 44076 44089 44075 44080 44081 44083 44077 44082 43897 43896 44084 43898 43892 43891 44078 44089 44076"
## [80] "PMC8492352 /pmc/articles/PMC8492352/bin/jciinsight-6-144575-s123.xlsx Mmusculus 35 44078 44078 43898 44082 44083 44085 43892 44081 44089 44089 44082 44083 44078 43892 44075 44083 44080 43891 44077 43892 44078 44075 43899 44089 44078 44085 44083 44078 44089 44081 44076 44085 43895 43897 44083"
## [81] "PMC8486594 /pmc/articles/PMC8486594/bin/MOL2-15-2766-s002.xlsx Hsapiens 13 43891 43893 43891 43896 43892 43898 43897 43899 43900 43901 43894 43892 43895"
## [82] "PMC8486594 /pmc/articles/PMC8486594/bin/MOL2-15-2766-s002.xlsx Hsapiens 2 43891 43892"
## [83] "PMC8486213 /pmc/articles/PMC8486213/bin/CAS-112-4377-s001.xlsx Hsapiens 1 43898"
## [84] "PMC8486213 /pmc/articles/PMC8486213/bin/CAS-112-4377-s001.xlsx Hsapiens 1 44079"
## [85] "PMC8486213 /pmc/articles/PMC8486213/bin/CAS-112-4377-s001.xlsx Hsapiens 1 43891"
## [86] "PMC8486213 /pmc/articles/PMC8486213/bin/CAS-112-4377-s001.xlsx Hsapiens 1 44083"
## [87] "PMC8486213 /pmc/articles/PMC8486213/bin/CAS-112-4377-s001.xlsx Ggallus 1 44078"
## [88] "PMC8486213 /pmc/articles/PMC8486213/bin/CAS-112-4377-s001.xlsx Ggallus 1 43892"
## [89] "PMC8486213 /pmc/articles/PMC8486213/bin/CAS-112-4377-s001.xlsx Ggallus 1 43896"
## [90] "PMC8486213 /pmc/articles/PMC8486213/bin/CAS-112-4377-s004.xlsx Hsapiens 28 44454 44257 44256 44449 44262 44259 44441 44450 44256 44261 44266 44258 44447 44446 44453 44531 44263 44260 44264 44451 44440 44443 44265 44448 44257 44444 44442 44445"
## [91] "PMC8486213 /pmc/articles/PMC8486213/bin/CAS-112-4377-s005.xlsx Hsapiens 27 1-Mar 2-Mar 1-Mar 10-Mar 11-Mar 2-Mar 3-Mar 4-Mar 5-Mar 6-Mar 7-Mar 8-Mar 9-Mar 15-Sep 1-Sep 10-Sep 11-Sep 12-Sep 14-Sep 2-Sep 3-Sep 4-Sep 5-Sep 6-Sep 7-Sep 8-Sep 9-Sep"
## [92] "PMC8458293 /pmc/articles/PMC8458293/bin/41467_2021_25288_MOESM4_ESM.xlsx Hsapiens 8 42248 39692 38961 40422 40787 37500 39326 40057"
## [93] "PMC8458293 /pmc/articles/PMC8458293/bin/41467_2021_25288_MOESM5_ESM.xlsx Hsapiens 7 39692 40787 40422 37500 39326 40057 40057"
## [94] "PMC8458293 /pmc/articles/PMC8458293/bin/41467_2021_25288_MOESM5_ESM.xlsx Hsapiens 8 42248 39692 40787 40422 37500 39326 40057 40057"
## [95] "PMC8458293 /pmc/articles/PMC8458293/bin/41467_2021_25288_MOESM5_ESM.xlsx Hsapiens 8 42248 39692 40787 40422 37500 39326 40057 40057"
## [96] "PMC8458293 /pmc/articles/PMC8458293/bin/41467_2021_25288_MOESM5_ESM.xlsx Hsapiens 7 39692 40787 40422 37500 39326 40057 40057"
## [97] "PMC8458293 /pmc/articles/PMC8458293/bin/41467_2021_25288_MOESM5_ESM.xlsx Hsapiens 2 42248 40422"
## [98] "PMC8458293 /pmc/articles/PMC8458293/bin/41467_2021_25288_MOESM5_ESM.xlsx Hsapiens 6 39692 40787 37500 39326 40057 40057"
## [99] "PMC8458293 /pmc/articles/PMC8458293/bin/41467_2021_25288_MOESM7_ESM.xlsx Hsapiens 21 43892 44078 43892 44081 44080 43897 44085 43899 43894 43891 43896 44082 43898 44076 43893 44075 44083 44079 43891 44084 43895"
## [100] "PMC8458293 /pmc/articles/PMC8458293/bin/41467_2021_25288_MOESM7_ESM.xlsx Hsapiens 21 44078 43898 43892 43891 43895 43893 43894 44076 43899 44081 44080 43892 44084 44079 43897 43896 44085 44082 44075 44083 43891"
## [101] "PMC8455625 /pmc/articles/PMC8455625/bin/41525_2021_239_MOESM1_ESM.xlsx Hsapiens 1 44445"
## [102] "PMC8455578 /pmc/articles/PMC8455578/bin/41467_2021_25769_MOESM5_ESM.xlsx Hsapiens 20 44446 44258 44453 44265 44265 44265 44441 44258 44446 44446 44446 44447 44266 44260 44441 44448 44256 44448 44446 44453"
## [103] "PMC8455578 /pmc/articles/PMC8455578/bin/41467_2021_25769_MOESM5_ESM.xlsx Hsapiens 207 44450 44261 44266 44258 44447 44444 44442 44450 44261 44266 44258 44447 44263 44260 44444 44442 44450 44443 44265 44448 44260 44450 44444 44442 44262 44259 44441 44443 44257 44256 44449 44262 44259 44441 44450 44261 44266 44258 44447 44531 44263 44260 44264 44451 44440 44265 44448 44442 44450 44449 44262 44259 44441 44531 44260 44444 44442 44453 44260 44264 44265 44450 44443 44265 44448 44444 44442 44449 44262 44259 44441 44258 44447 44453 44443 44265 44448 44444 44531 44263 44260 44451 44442 44449 44262 44259 44441 44450 44446 44453 44262 44446 44443 44265 44448 44444 44442 44263 44265 44442 44441 44261 44266 44258 44447 44531 44444 44442 44264 44450 44263 44260 44442 44262 44263 44260 44260 44444 44442 44261 44266 44258 44447 44449 44450 44443 44265 44448 44444 44442 44446 44453 44260 44264 44451 44440 44443 44265 44258 44446 44451 44443 44265 44448 44444 44442 44449 44444 44442 44450 44263 44264 44443 44265 44448 44262 44259 44441 44453 44531 44444 44442 44450 44442 44531 44259 44441 44258 44447 44263 44260 44442 44448 44442 44261 44266 44264 44265 44262 44259 44441 44450 44440 44448 44444 44442 44257 44256 44444 44442 44260 44264 44442 44449 44259 44441 44450 44443 44265 44448 44444 44442 44443 44265 44448 44444 44442"
## [104] "PMC8529555 /pmc/articles/PMC8529555/bin/mmc11.xlsx Mmusculus 1 2002-03-01"
## [105] "PMC8544335 /pmc/articles/PMC8544335/bin/aging-13-203619-s001.xlsx Hsapiens 1 43898"
## [106] "PMC8539775 /pmc/articles/PMC8539775/bin/12885_2021_8818_MOESM2_ESM.xlsx Hsapiens 2 43893 44085"
## [107] "PMC8481544 /pmc/articles/PMC8481544/bin/41467_2021_25951_MOESM10_ESM.xlsx Hsapiens 4 44449 44442 44443 44447"
## [108] "PMC8463701 /pmc/articles/PMC8463701/bin/41467_2021_25893_MOESM5_ESM.xlsx Athaliana 1 44474"
## [109] "PMC8518362 /pmc/articles/PMC8518362/bin/AJT-21-3133-s003.xlsx Ggallus 34 9. Trégouët DA, Heath S, Saut N, et al. Common susceptibility alleles are unlikely to contribute as strongly as the FV and ABO loci to VTE risk: Results from aGWAS approach. Blood. 2009. doi:10.1182/blood-2008-11-190389"
## [110] "PMC8517214 /pmc/articles/PMC8517214/bin/mmc3.xlsx Hsapiens 1 44166"
## [111] "PMC8517214 /pmc/articles/PMC8517214/bin/mmc5.xlsx Hsapiens 1 43892"
## [112] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Mmusculus 27 44256 44257 44256 44265 44266 44257 44258 44259 44260 44261 44262 44263 44264 44454 44440 44449 44450 44451 44453 44441 44442 44443 44444 44445 44446 44447 44448"
## [113] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Mmusculus 27 44256 44257 44256 44265 44266 44257 44258 44259 44260 44261 44262 44263 44264 44454 44440 44449 44450 44451 44453 44441 44442 44443 44444 44445 44446 44447 44448"
## [114] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Mmusculus 27 44256 44257 44256 44265 44266 44257 44258 44259 44260 44261 44262 44263 44264 44454 44440 44449 44450 44451 44453 44441 44442 44443 44444 44445 44446 44447 44448"
## [115] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 3 44263 44260 44447"
## [116] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 1 44256"
## [117] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Mmusculus 27 44256 44257 44256 44265 44266 44257 44258 44259 44260 44261 44262 44263 44264 44454 44440 44449 44450 44451 44453 44441 44442 44443 44444 44445 44446 44447 44448"
## [118] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 3 44263 44260 44447"
## [119] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 5 44263 44262 44261 44258 44265"
## [120] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 1 44256"
## [121] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 5 44263 44262 44261 44258 44265"
## [122] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 1 44256"
## [123] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 19 44257 44449 44450 44454 44264 44263 44262 44261 44260 44259 44258 44257 44440 44441 44444 44442 44448 44446 44447"
## [124] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 7 44449 44450 44454 44264 44260 44256 44448"
## [125] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 3 44263 44260 44447"
## [126] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 1 44256"
## [127] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 1 44256"
## [128] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 19 44257 44449 44450 44454 44264 44263 44262 44261 44260 44259 44258 44257 44440 44441 44444 44442 44448 44446 44447"
## [129] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 18 44257 44449 44450 44454 44264 44263 44262 44261 44260 44258 44257 44440 44441 44444 44448 44446 44447 44265"
## [130] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 7 44449 44450 44454 44264 44260 44256 44448"
## [131] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 11 44450 44453 44454 44264 44263 44256 44442 44448 44446 44257 44266"
## [132] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 2 44449 44260"
## [133] "PMC8528285 /pmc/articles/PMC8528285/bin/pbio.3001085.s010.xlsx Hsapiens 6 44453 44442 44446 44263 44257 44266"
## [134] "PMC8515723 /pmc/articles/PMC8515723/bin/13073_2021_965_MOESM1_ESM.xlsx Hsapiens 1 44261"
## [135] "PMC8515723 /pmc/articles/PMC8515723/bin/13073_2021_965_MOESM1_ESM.xlsx Hsapiens 1 44261"
## [136] "PMC8503750 /pmc/articles/PMC8503750/bin/Supplementary_Data2.xlsx Hsapiens 5 43715 43719 43717 43716 43712"
## [137] "PMC8494642 /pmc/articles/PMC8494642/bin/41586_2021_3209_MOESM4_ESM.xlsx Hsapiens 14 43891 44084 44085 43896 44082 44082 44082 44082 44082 43895 44078 44080 44080 44080"
## [138] "PMC8494642 /pmc/articles/PMC8494642/bin/41586_2021_3209_MOESM4_ESM.xlsx Hsapiens 12 44089 44089 43892 43891 43897 43896 43896 43895 44078 44078 44078 44080"
## [139] "PMC8516354 /pmc/articles/PMC8516354/bin/Table_1.xlsx Hsapiens 13 44445 44263 44444 44449 44441 44257 44446 44447 44443 44260 44450 44261 44442"
## [140] "PMC8516354 /pmc/articles/PMC8516354/bin/Table_1.xlsx Hsapiens 12 44263 44449 44260 44442 44261 44447 44445 44257 44443 44444 44441 44446"
## [141] "PMC8505008 /pmc/articles/PMC8505008/bin/pgen.1009828.s015.xlsx Scerevisiae 1 44105"
## [142] "PMC8505008 /pmc/articles/PMC8505008/bin/pgen.1009828.s016.xlsx Scerevisiae 2 44340 44470"
## [143] "PMC8505008 /pmc/articles/PMC8505008/bin/pgen.1009828.s017.xlsx Scerevisiae 2 44470 44340"
## [144] "PMC8505008 /pmc/articles/PMC8505008/bin/pgen.1009828.s017.xlsx Scerevisiae 2 44470 44340"
## [145] "PMC8503558 /pmc/articles/PMC8503558/bin/Table_2.xlsx Hsapiens 1 44258"
## [146] "PMC8503558 /pmc/articles/PMC8503558/bin/Table_2.xlsx Hsapiens 1 44258"
## [147] "PMC8502891 /pmc/articles/PMC8502891/bin/Table_2.xlsx Drerio 6 44451 44446 44445 44262 44451 44262"
## [148] "PMC8494765 /pmc/articles/PMC8494765/bin/41598_2021_99079_MOESM4_ESM.xlsx Mmusculus 104 44441 44441 44441 44441 44444 44444 44444 44444 44443 44443 44443 44443 44447 44447 44447 44447 44442 44442 44442 44442 44450 44450 44450 44450 44448 44448 44448 44448 44440 44440 44440 44440 44445 44445 44445 44445 44262 44262 44262 44262 44256 44256 44256 44256 44257 44257 44257 44257 44260 44260 44260 44260 44451 44451 44451 44451 44263 44263 44263 44263 44256 44256 44256 44256 44453 44453 44453 44453 44454 44454 44454 44454 44449 44449 44449 44449 44266 44266 44266 44266 44264 44264 44264 44264 44261 44261 44261 44261 44257 44257 44257 44257 44446 44446 44446 44446 44259 44259 44259 44259 44265 44265 44265 44265"
## [149] "PMC8476631 /pmc/articles/PMC8476631/bin/41419_2021_4176_MOESM4_ESM.xlsx Hsapiens 1 38047"
## [150] "PMC8455334 /pmc/articles/PMC8455334/bin/41397_2021_237_MOESM2_ESM.xlsx Hsapiens 1 38231"
## [151] "PMC8455334 /pmc/articles/PMC8455334/bin/41397_2021_237_MOESM2_ESM.xlsx Hsapiens 1 39873"
## [152] "PMC8455334 /pmc/articles/PMC8455334/bin/41397_2021_237_MOESM2_ESM.xlsx Hsapiens 3 37865 38231 41153"
## [153] "PMC8455334 /pmc/articles/PMC8455334/bin/41397_2021_237_MOESM2_ESM.xlsx Hsapiens 1 38961"
## [154] "PMC8455334 /pmc/articles/PMC8455334/bin/41397_2021_237_MOESM2_ESM.xlsx Hsapiens 1 38777"
## [155] "PMC8455334 /pmc/articles/PMC8455334/bin/41397_2021_237_MOESM2_ESM.xlsx Hsapiens 1 37135"
## [156] "PMC8484507 /pmc/articles/PMC8484507/bin/mmc4.xlsx Hsapiens 2 43348 43170"
## [157] "PMC8490889 /pmc/articles/PMC8490889/bin/Table_4.XLSX Hsapiens 26 44166 44076 43897 44080 44086 43898 44075 44079 44088 44084 44083 43900 44081 44078 44082 43901 43896 43891 43892 43893 43894 43895 43899 44089 44085 44077"
## [158] "PMC8490743 /pmc/articles/PMC8490743/bin/Table_1.xlsx Hsapiens 5 44443 44256 44445 44450 44440"
## [159] "PMC8490743 /pmc/articles/PMC8490743/bin/Table_6.xlsx Hsapiens 25 44257 44256 44449 44262 44259 44441 44450 44261 44266 44258 44447 44446 44453 44445 44531 44263 44260 44264 44451 44440 44443 44265 44448 44444 44442"
## [160] "PMC8490743 /pmc/articles/PMC8490743/bin/Table_7.xlsx Hsapiens 26 44445 44263 44258 44444 44440 44265 44449 44266 44441 44531 44451 44453 44256 44257 44446 44447 44443 44260 44450 44261 44454 44448 44259 44262 44264 44442"
## [161] "PMC8490713 /pmc/articles/PMC8490713/bin/Table_1.xlsx Mmusculus 22 44441 44257 44262 44454 44450 44263 44440 44256 44446 44449 44264 44447 44443 44448 44266 44261 44442 44444 44257 44258 44260 44445"
## [162] "PMC8490697 /pmc/articles/PMC8490697/bin/Table_1.XLSX Hsapiens 15 44444 44258 44256 44451 44445 44266 44259 44446 44442 44453 44261 44447 44263 44450 44448"
## [163] "PMC8452624 /pmc/articles/PMC8452624/bin/41467_2021_25709_MOESM12_ESM.xlsx Hsapiens 2 43712 43722"
## [164] "PMC8452624 /pmc/articles/PMC8452624/bin/41467_2021_25709_MOESM7_ESM.xlsx Ggallus 2 43533 43532"
## [165] "PMC8489709 /pmc/articles/PMC8489709/bin/pgen.1009801.s005.xlsx Dmelanogaster 2 43344 43347"
## [166] "PMC8488206 /pmc/articles/PMC8488206/bin/Table_1.XLSX Hsapiens 20 44265 44448 44447 44266 44453 44261 44444 44259 44445 44260 44442 44449 44451 44262 44440 44264 44256 44441 44263 44443"
## [167] "PMC8488206 /pmc/articles/PMC8488206/bin/Table_1.XLSX Hsapiens 4 44447 44450 44441 44445"
## [168] "PMC8488120 /pmc/articles/PMC8488120/bin/Table_1.XLSX Hsapiens 2 44450 44259"
## [169] "PMC8449357 /pmc/articles/PMC8449357/bin/pnas.2106080118.sd03.xlsx Drerio 2 44086 44082"
## [170] "PMC8449357 /pmc/articles/PMC8449357/bin/pnas.2106080118.sd04.xlsx Drerio 4 44085 43891 44081 44079"
## [171] "PMC8445938 /pmc/articles/PMC8445938/bin/41467_2021_25553_MOESM14_ESM.xlsx Hsapiens 16 43897 44081 43893 44085 43896 43532 43895 44082 44084 43899 44076 44083 43891 44080 44078 44077"
## [172] "PMC8445938 /pmc/articles/PMC8445938/bin/41467_2021_25553_MOESM5_ESM.xlsx Hsapiens 1 43532"
## [173] "PMC8484522 /pmc/articles/PMC8484522/bin/Table_3.xls Hsapiens 5 44259 44442 44444 44453 44266"
## [174] "PMC8482641 /pmc/articles/PMC8482641/bin/NIHMS1656410-supplement-1656410_Sup_tab_2.xlsx Mmusculus 50 43900 44078 44083 44083 43893 43897 44081 44085 43894 43894 43900 43900 44079 44083 44083 43891 43891 43894 43894 43896 44080 44081 44081 44085 44085 44085 43897 43900 43900 44078 44082 44083 44083 43892 43892 43894 43894 43900 44078 44078 44080 44083 44083 44083 44083 44083 44083 44083 44085 44085"
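The numbers at the end of each record are consistent with Excel date serial numbers, that is, days counted from 1899-12-30 in Excel's default 1900 date system. The sketch below is an illustration rather than part of this script: it assumes the values really are such serials and converts three of them (taken from the listing above) back to calendar dates, which hints at the gene symbols that were probably entered.

```r
# Convert Excel date serials (1900 date system) back to calendar dates.
# Assumption: the values reported above are Excel serials; for this system
# the matching origin in R's as.Date() is 1899-12-30.
serials <- c(44256, 44440, 44531)
dates <- as.Date(serials, origin = "1899-12-30")
format(dates, "%Y-%m-%d")
# [1] "2021-03-01" "2021-09-01" "2021-12-01"

# Excel typically displays such cells as day-month, e.g. "1-Mar", which is
# how a symbol like MARCH1 can end up stored as a date.
# month.abb is used instead of "%b" so the result does not depend on locale.
day <- as.integer(format(dates, "%d"))
month <- as.integer(format(dates, "%m"))
paste(day, month.abb[month], sep = "-")
# [1] "1-Mar" "1-Sep" "1-Dec"
```

The year component only reflects when the value was typed or pasted into the spreadsheet, so it carries no information about the gene itself.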
Let’s investigate the errors in more detail.
# By species
SPECIES <- sapply(strsplit(ERROR_GENELISTS," "),"[[",3)
table(SPECIES)
## SPECIES
## Athaliana Dmelanogaster Drerio Ggallus Hsapiens
## 1 13 3 7 124
## Mmusculus Scerevisiae
## 22 4
par(mar=c(5,12,4,2))
barplot(table(SPECIES),horiz=TRUE,las=1)
par(mar=c(5,5,4,2))
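Each element of ERROR_GENELISTS is a single space-separated string: PMC ID, file path, species, number of errors, and then the converted values. The code above splits the strings again for every field it needs. As a minimal sketch, the records could instead be parsed once into a data frame. Here `parse_record` is a hypothetical helper that is not part of the original script, and it assumes file paths contain no spaces, as in the records shown.

```r
# Two example records copied from the listing above.
records <- c(
  "PMC8484507 /pmc/articles/PMC8484507/bin/mmc4.xlsx Hsapiens 2 43348 43170",
  "PMC8452624 /pmc/articles/PMC8452624/bin/41467_2021_25709_MOESM7_ESM.xlsx Ggallus 2 43533 43532"
)

# parse_record() is a hypothetical helper, not part of the original script.
# It relies on the fixed field order: ID, path, species, count, values.
parse_record <- function(rec) {
  f <- strsplit(rec, " ", fixed = TRUE)[[1]]
  data.frame(
    pmc = f[1],
    file = f[2],
    species = f[3],
    n_errors = as.numeric(f[4]),
    values = paste(f[-(1:4)], collapse = ","),
    stringsAsFactors = FALSE
  )
}

parsed <- do.call(rbind, lapply(records, parse_record))
table(parsed$species)
sum(parsed$n_errors)
```

With the fields in named columns, the per-species and per-paper summaries become ordinary data frame operations.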
# Number of affected Excel files per paper
DIST <- table(sapply(strsplit(ERROR_GENELISTS," "),"[[",1))
DIST
##
## PMC8445938 PMC8449357 PMC8452624 PMC8455334 PMC8455578 PMC8455625 PMC8458293
## 2 2 2 6 2 1 9
## PMC8463701 PMC8476631 PMC8479071 PMC8481544 PMC8482641 PMC8484507 PMC8484522
## 1 1 4 1 1 1 1
## PMC8486213 PMC8486594 PMC8488120 PMC8488206 PMC8489709 PMC8490697 PMC8490713
## 9 2 1 2 1 1 1
## PMC8490743 PMC8490889 PMC8490982 PMC8492352 PMC8494642 PMC8494765 PMC8494907
## 3 1 6 2 2 1 2
## PMC8496489 PMC8497462 PMC8498718 PMC8498917 PMC8498966 PMC8501628 PMC8502891
## 6 1 1 4 1 1 1
## PMC8503558 PMC8503750 PMC8505008 PMC8505442 PMC8507324 PMC8509297 PMC8511044
## 2 1 4 2 1 11 2
## PMC8513875 PMC8514339 PMC8514428 PMC8515723 PMC8516354 PMC8517214 PMC8518179
## 1 1 1 2 2 2 2
## PMC8518362 PMC8519440 PMC8519947 PMC8521585 PMC8523829 PMC8528285 PMC8529555
## 1 1 6 1 4 22 1
## PMC8530322 PMC8532573 PMC8539775 PMC8544335 PMC8544599 PMC8554095 PMC8554124
## 1 3 1 1 1 5 8
## PMC8555832
## 1
summary(as.numeric(DIST))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.500 2.719 2.250 22.000
hist(DIST,main="Number of affected Excel files per paper")
# PMC Articles with the most errors
DIST_DF <- as.data.frame(DIST)
DIST_DF <- DIST_DF[order(-DIST_DF$Freq),,drop=FALSE]
head(DIST_DF,20)
## Var1 Freq
## 55 PMC8528285 22
## 41 PMC8509297 11
## 7 PMC8458293 9
## 15 PMC8486213 9
## 63 PMC8554124 8
## 4 PMC8455334 6
## 24 PMC8490982 6
## 29 PMC8496489 6
## 52 PMC8519947 6
## 62 PMC8554095 5
## 10 PMC8479071 4
## 32 PMC8498917 4
## 38 PMC8505008 4
## 54 PMC8523829 4
## 22 PMC8490743 3
## 58 PMC8532573 3
## 1 PMC8445938 2
## 2 PMC8449357 2
## 3 PMC8452624 2
## 5 PMC8455578 2
MOST_ERR_FILES = as.character(DIST_DF[1,1])
MOST_ERR_FILES
## [1] "PMC8528285"
# Number of errors per paper
NERR <- as.numeric(sapply(strsplit(ERROR_GENELISTS," "),"[[",4))
names(NERR) <- sapply(strsplit(ERROR_GENELISTS," "),"[[",1)
NERR <-tapply(NERR, names(NERR), sum)
NERR
## PMC8445938 PMC8449357 PMC8452624 PMC8455334 PMC8455578 PMC8455625 PMC8458293
## 17 6 4 8 227 1 88
## PMC8463701 PMC8476631 PMC8479071 PMC8481544 PMC8482641 PMC8484507 PMC8484522
## 1 1 82 4 50 2 5
## PMC8486213 PMC8486594 PMC8488120 PMC8488206 PMC8489709 PMC8490697 PMC8490713
## 62 15 2 24 2 15 22
## PMC8490743 PMC8490889 PMC8490982 PMC8492352 PMC8494642 PMC8494765 PMC8494907
## 56 26 16 396 26 104 56
## PMC8496489 PMC8497462 PMC8498718 PMC8498917 PMC8498966 PMC8501628 PMC8502891
## 65 1 28 45 41 1 6
## PMC8503558 PMC8503750 PMC8505008 PMC8505442 PMC8507324 PMC8509297 PMC8511044
## 2 5 7 38 1 20 2
## PMC8513875 PMC8514339 PMC8514428 PMC8515723 PMC8516354 PMC8517214 PMC8518179
## 1 24 1 2 25 2 2
## PMC8518362 PMC8519440 PMC8519947 PMC8521585 PMC8523829 PMC8528285 PMC8529555
## 34 26 78 1 5 221 1
## PMC8530322 PMC8532573 PMC8539775 PMC8544335 PMC8544599 PMC8554095 PMC8554124
## 1 81 2 1 13 36 56
## PMC8555832
## 5
hist(NERR,main="Number of errors per PMC article")
NERR_DF <- as.data.frame(NERR)
NERR_DF <- NERR_DF[order(-NERR_DF$NERR),,drop=FALSE]
head(NERR_DF,20)
## NERR
## PMC8492352 396
## PMC8455578 227
## PMC8528285 221
## PMC8494765 104
## PMC8458293 88
## PMC8479071 82
## PMC8532573 81
## PMC8519947 78
## PMC8496489 65
## PMC8486213 62
## PMC8490743 56
## PMC8494907 56
## PMC8554124 56
## PMC8482641 50
## PMC8498917 45
## PMC8498966 41
## PMC8505442 38
## PMC8554095 36
## PMC8518362 34
## PMC8498718 28
MOST_ERR = rownames(NERR_DF)[1]
MOST_ERR
## [1] "PMC8492352"
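The two summaries above answer different questions: DIST counts how many records each paper has (the same file can be listed more than once, presumably once per affected sheet), while NERR adds up the error counts in those records. The toy example below uses the counts listed above for PMC8455334 and PMC8484507 to show the difference between `table()` and `tapply(..., sum)`. The variable names are illustrative only.

```r
# Error counts per record for two papers, taken from the listing above:
# six records for PMC8455334 and one for PMC8484507.
paper <- c(rep("PMC8455334", 6), "PMC8484507")
n_err <- c(1, 1, 3, 1, 1, 1, 2)

# Number of records per paper (what DIST holds).
files_per_paper <- table(paper)

# Total number of converted gene names per paper (what NERR holds).
errors_per_paper <- tapply(n_err, paper, sum)

files_per_paper
errors_per_paper
```

For PMC8455334 this gives 6 records and 8 errors, in line with the DIST and NERR values printed above.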
# Strip the "PMC" prefix to obtain the numeric UIDs passed to esummary
GENELIST_ERROR_ARTICLES <- gsub("PMC","",GENELIST_ERROR_ARTICLES)
### JSON PARSING is more reliable than XML
ARTICLES <- esummary( GENELIST_ERROR_ARTICLES , db="pmc" , retmode = "json" )
ARTICLE_DATA <- reutils::content(ARTICLES,as= "parsed")
ARTICLE_DATA <- ARTICLE_DATA$result
# Drop the first element ($uids) so that one entry per article remains
ARTICLE_DATA <- ARTICLE_DATA[2:length(ARTICLE_DATA)]
JOURNALS <- unlist(lapply(ARTICLE_DATA,function(x) {x$fulljournalname} ))
# Count affected articles per journal and sort in decreasing order
JOURNALS_TABLE <- table(JOURNALS)
JOURNALS_TABLE <- JOURNALS_TABLE[order(-JOURNALS_TABLE)]
length(JOURNALS_TABLE)
## [1] 34
NUM_JOURNALS=length(JOURNALS_TABLE)
par(mar=c(5,25,4,2))
barplot(head(JOURNALS_TABLE,10), horiz=TRUE, las=1,
xlab="Articles with gene name errors in supp files",
main="Top journals this month")
Congrats to our Journal of the Month winner!
JOURNAL_WINNER <- names(head(JOURNALS_TABLE,1))
JOURNAL_WINNER
## [1] "Nature Communications"
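The journal ranking above comes from pulling one field, `fulljournalname`, out of a nested list with one element per article. The sketch below reproduces that step on a small mock list so it can be run without querying NCBI. The first two entries use journal names shown in this report, the third is made up for illustration, and the lowercase variable names are not from the original script.

```r
# Mock of the parsed esummary result: one list element per article UID,
# trimmed to the single field used here.
article_data <- list(
  "8528285" = list(fulljournalname = "PLoS Biology"),
  "8492352" = list(fulljournalname = "JCI Insight"),
  "0000001" = list(fulljournalname = "PLoS Biology")  # made-up entry for illustration
)

# vapply() insists on exactly one string per article, so a record with a
# missing journal name raises an error instead of being dropped silently.
journals <- vapply(article_data, function(x) x$fulljournalname, character(1))

# Count articles per journal and sort so the top journal comes first.
journals_table <- sort(table(journals), decreasing = TRUE)
names(journals_table)[1]
```

Using `vapply()` here is a design choice rather than a requirement: `unlist(lapply(...))` gives the same result on complete data, but it skips `NULL` values without warning.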
There are two categories:
Paper with the most supplementary files affected by gene name errors (MOST_ERR_FILES)
Paper with the most gene names converted to dates (MOST_ERR)
Sometimes, one paper can win both categories. Congrats to our winners.
MOST_ERR_FILES <- gsub("PMC","",MOST_ERR_FILES)
ARTICLES <- esummary( MOST_ERR_FILES , db="pmc" , retmode = "json" )
ARTICLE_DATA <- reutils::content(ARTICLES,as= "parsed")
ARTICLE_DATA <- ARTICLE_DATA[2]
ARTICLE_DATA
## $result
## $result$uids
## [1] "8528285"
##
## $result$`8528285`
## $result$`8528285`$uid
## [1] "8528285"
##
## $result$`8528285`$pubdate
## [1] "2021 Oct 20"
##
## $result$`8528285`$epubdate
## [1] "2021 Oct 20"
##
## $result$`8528285`$printpubdate
## [1] ""
##
## $result$`8528285`$source
## [1] "PLoS Biol"
##
## $result$`8528285`$authors
## name authtype
## 1 Carroll PA Author
## 2 Freie BW Author
## 3 Cheng PF Author
## 4 Kasinathan S Author
## 5 Gu H Author
## 6 Hedrich T Author
## 7 Dowdle JA Author
## 8 Venkataramani V Author
## 9 Ramani V Author
## 10 Wu X Author
## 11 Raftery D Author
## 12 Shendure J Author
## 13 Ayer DE Author
## 14 Muller CH Author
## 15 Eisenman RN Author
##
## $result$`8528285`$title
## [1] "The glucose-sensing transcription factor MLX balances metabolism and stress to suppress apoptosis and maintain spermatogenesis"
##
## $result$`8528285`$volume
## [1] "19"
##
## $result$`8528285`$issue
## [1] "10"
##
## $result$`8528285`$pages
## [1] "e3001085"
##
## $result$`8528285`$articleids
## idtype value
## 1 pmid 34669700
## 2 doi 10.1371/journal.pbio.3001085
## 3 pmcid PMC8528285
##
## $result$`8528285`$fulljournalname
## [1] "PLoS Biology"
##
## $result$`8528285`$sortdate
## [1] "2021/10/20 00:00"
##
## $result$`8528285`$pmclivedate
## [1] "2021/10/21"
MOST_ERR <- gsub("PMC","",MOST_ERR)
ARTICLE_DATA <- esummary(MOST_ERR,db = "pmc" , retmode = "json" )
ARTICLE_DATA <- reutils::content(ARTICLE_DATA,as= "parsed")
ARTICLE_DATA
## $header
## $header$type
## [1] "esummary"
##
## $header$version
## [1] "0.3"
##
##
## $result
## $result$uids
## [1] "8492352"
##
## $result$`8492352`
## $result$`8492352`$uid
## [1] "8492352"
##
## $result$`8492352`$pubdate
## [1] "2021 Sep 8"
##
## $result$`8492352`$epubdate
## [1] "2021 Sep 8"
##
## $result$`8492352`$printpubdate
## [1] ""
##
## $result$`8492352`$source
## [1] "JCI Insight"
##
## $result$`8492352`$authors
## name authtype
## 1 Yun JH Author
## 2 Lee C Author
## 3 Liu T Author
## 4 Liu S Author
## 5 Kim EY Author
## 6 Xu S Author
## 7 Curtis JL Author
## 8 Pinello L Author
## 9 Bowler RP Author
## 10 Silverman EK Author
## 11 Hersh CP Author
## 12 Zhou X Author
##
## $result$`8492352`$title
## [1] "Hedgehog interacting protein–expressing lung fibroblasts suppress lymphocytic inflammation in mice"
##
## $result$`8492352`$volume
## [1] "6"
##
## $result$`8492352`$issue
## [1] "17"
##
## $result$`8492352`$pages
## [1] "e144575"
##
## $result$`8492352`$articleids
## idtype value
## 1 pmid 34375314
## 2 doi 10.1172/jci.insight.144575
## 3 pmcid PMC8492352
##
## $result$`8492352`$fulljournalname
## [1] "JCI Insight"
##
## $result$`8492352`$sortdate
## [1] "2021/09/08 00:00"
##
## $result$`8492352`$pmclivedate
## [1] "2021/10/07"
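Printing the whole esummary structure is thorough but hard to scan. As a sketch of an alternative, the fields of interest can be collapsed into a one-line citation. The mock list below copies a few values from the PMC8528285 output shown earlier. `format_citation` is a hypothetical helper that is not part of the original script, and it assumes the `articleids` element is a data frame with `idtype` and `value` columns, as it appears in the printed output.

```r
# Trimmed mock of one parsed esummary record (values copied from the
# PMC8528285 output above, with fewer fields).
article <- list(
  title = "The glucose-sensing transcription factor MLX balances metabolism and stress to suppress apoptosis and maintain spermatogenesis",
  fulljournalname = "PLoS Biology",
  pubdate = "2021 Oct 20",
  articleids = data.frame(
    idtype = c("pmid", "doi", "pmcid"),
    value = c("34669700", "10.1371/journal.pbio.3001085", "PMC8528285"),
    stringsAsFactors = FALSE
  )
)

# format_citation() is a hypothetical helper, not part of the original script.
# It looks up the DOI by its idtype rather than by row position.
format_citation <- function(a) {
  doi <- a$articleids$value[a$articleids$idtype == "doi"]
  sprintf("%s. %s (%s). https://doi.org/%s",
          a$title, a$fulljournalname, a$pubdate, doi)
}

format_citation(article)
```

In the real workflow, the input would be the per-article element of the parsed result, for example the entry named after the UID.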
Let’s plot the trend over the past 6-12 months, using the monthly reports already published online.
url <- "http://ziemann-lab.net/public/gene_name_errors/"
doc <- htmlParse(url)
links <- xpathSApply(doc, "//a/@href")
links <- links[grep("html",links)]
links
## href href href
## "Report_2021-02.html" "Report_2021-03.html" "Report_2021-04.html"
## href href href
## "Report_2021-05.html" "Report_2021-06.html" "Report_2021-07.html"
## href href href
## "Report_2021-08.html" "Report_2021-09.html" "Report_2021-10.html"
unlink("online_files/",recursive=TRUE)
dir.create("online_files")
sapply(links, function(mylink) {
download.file(paste(url,mylink,sep=""),destfile=paste("online_files/",mylink,sep=""))
} )
## href href href href href href href href href
## 0 0 0 0 0 0 0 0 0
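Each `0` above is the status code returned by `download.file()`, where zero indicates success. The report file names also encode the month they belong to. The sketch below needs no network access: it builds the source and destination paths for two of the links listed above and derives the month labels directly from the file names. The variable names `src`, `dest` and `month_labels` are illustrative and not from the original script.

```r
url <- "http://ziemann-lab.net/public/gene_name_errors/"
links <- c("Report_2021-09.html", "Report_2021-10.html")

# Source URLs and local destinations for each report.
src <- paste0(url, links)
dest <- file.path("online_files", links)

# Derive the month label from the file name itself, so the result does not
# depend on how many underscores the directory name contains.
month_labels <- sub("^Report_(.*)\\.html$", "\\1", basename(dest))
month_labels

# After downloading, a non-zero status would flag a failed transfer, e.g.
# status <- mapply(download.file, src, dest)
# stopifnot(all(status == 0))
```

The commented lines show one way the status codes could be checked; they are left out of the runnable part because they require the remote site.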
myfilelist <- list.files("online_files/",full.names=TRUE)
trends <- sapply(myfilelist, function(myfilename) {
x <- readLines(myfilename)
# Num XL gene list articles
NUM_GENELIST_ARTICLES <- x[grep("NUM_GENELIST_ARTICLES",x)[3]+1]
NUM_GENELIST_ARTICLES <- sapply(strsplit(NUM_GENELIST_ARTICLES," "),"[[",3)
NUM_GENELIST_ARTICLES <- sapply(strsplit(NUM_GENELIST_ARTICLES,"<"),"[[",1)
NUM_GENELIST_ARTICLES <- as.numeric(NUM_GENELIST_ARTICLES)
# number of affected articles
NUM_ERROR_GENELIST_ARTICLES <- x[grep("NUM_ERROR_GENELIST_ARTICLES",x)[3]+1]
NUM_ERROR_GENELIST_ARTICLES <- sapply(strsplit(NUM_ERROR_GENELIST_ARTICLES," "),"[[",3)
NUM_ERROR_GENELIST_ARTICLES <- sapply(strsplit(NUM_ERROR_GENELIST_ARTICLES,"<"),"[[",1)
NUM_ERROR_GENELIST_ARTICLES <- as.numeric(NUM_ERROR_GENELIST_ARTICLES)
# Error proportion
ERROR_PROPORTION <- x[grep("ERROR_PROPORTION",x)[3]+1]
ERROR_PROPORTION <- sapply(strsplit(ERROR_PROPORTION," "),"[[",3)
ERROR_PROPORTION <- sapply(strsplit(ERROR_PROPORTION,"<"),"[[",1)
ERROR_PROPORTION <- as.numeric(ERROR_PROPORTION)
# number of journals
NUM_JOURNALS <- x[grep('JOURNALS_TABLE',x)[3]+1]
NUM_JOURNALS <- sapply(strsplit(NUM_JOURNALS," "),"[[",3)
NUM_JOURNALS <- sapply(strsplit(NUM_JOURNALS,"<"),"[[",1)
NUM_JOURNALS <- as.numeric(NUM_JOURNALS)
NUM_JOURNALS
res <- c(NUM_GENELIST_ARTICLES,NUM_ERROR_GENELIST_ARTICLES,ERROR_PROPORTION,NUM_JOURNALS)
return(res)
})
colnames(trends) <- sapply(strsplit(colnames(trends),"_"),"[[",3)
colnames(trends) <- gsub(".html","",colnames(trends),fixed=TRUE)
trends <- as.data.frame(trends)
rownames(trends) <- c("NUM_GENELIST_ARTICLES","NUM_ERROR_GENELIST_ARTICLES","ERROR_PROPORTION","NUM_JOURNALS")
trends <- t(trends)
trends <- as.data.frame(trends)
CURRENT_RES <- c(NUM_GENELIST_ARTICLES,NUM_ERROR_GENELIST_ARTICLES,ERROR_PROPORTION,NUM_JOURNALS)
trends <- rbind(trends,CURRENT_RES)
paste(CURRENT_YEAR,CURRENT_MONTH,sep="-")
## [1] "2021-11"
rownames(trends)[nrow(trends)] <- paste(CURRENT_YEAR,CURRENT_MONTH,sep="-")
plot(trends$NUM_GENELIST_ARTICLES, xaxt = "n" , type="b" , main="Number of articles with Excel gene lists per month",
ylab="number of articles", xlab="month")
axis(1, at=1:nrow(trends), labels=rownames(trends))
plot(trends$NUM_ERROR_GENELIST_ARTICLES, xaxt = "n" , type="b" , main="Number of articles with gene name errors per month",
ylab="number of articles", xlab="month")
axis(1, at=1:nrow(trends), labels=rownames(trends))
plot(trends$ERROR_PROPORTION, xaxt = "n" , type="b" , main="Proportion of articles with Excel gene list affected by errors",
ylab="proportion", xlab="month")
axis(1, at=1:nrow(trends), labels=rownames(trends))
plot(trends$NUM_JOURNALS, xaxt = "n" , type="b" , main="Number of journals with affected articles",
ylab="number of journals", xlab="month")
axis(1, at=1:nrow(trends), labels=rownames(trends))
unlink("online_files/",recursive=TRUE)
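The scraping step above repeats the same four lines for each metric: find a line that mentions the variable, take the line after it, and pull the number out of the printed `## [1] ...` output. As a sketch, that logic could be wrapped in one function. `extract_metric` is a hypothetical helper that is not part of the original script, and the mock HTML lines are an assumption about what the rendered reports look like, based on how the code above splits them.

```r
# Mock lines from a rendered report (assumed layout, not copied from a real file).
x <- c(
  "<pre class=\"r\"><code>NUM_JOURNALS=length(JOURNALS_TABLE)</code></pre>",
  "<pre class=\"r\"><code>length(JOURNALS_TABLE)</code></pre>",
  "<pre><code>## [1] 34</code></pre>"
)

# extract_metric() is a hypothetical helper, not part of the original script.
# It finds the n-th line containing `pattern` and parses the number printed
# on the following line, e.g. "## [1] 34</code></pre>" gives 34.
extract_metric <- function(lines, pattern, n = 1) {
  hit <- grep(pattern, lines, fixed = TRUE)[n]
  if (is.na(hit)) return(NA_real_)
  out <- lines[hit + 1]
  value <- sub("^.*\\[1\\] *([0-9.]+).*$", "\\1", out)
  suppressWarnings(as.numeric(value))
}

extract_metric(x, "JOURNALS_TABLE", n = 2)
```

The original code always uses the third match because the variable names appear several times in each report. In this mock the second match is the right one, which is why `n` is left as a parameter. Returning `NA` for a missing match keeps one malformed report from stopping the whole trend table.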
Zeeberg, B.R., Riss, J., Kane, D.W. et al. Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics 5, 80 (2004). https://doi.org/10.1186/1471-2105-5-80
Ziemann, M., Eren, Y. & El-Osta, A. Gene name errors are widespread in the scientific literature. Genome Biol 17, 177 (2016). https://doi.org/10.1186/s13059-016-1044-7
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] readxl_1.3.1 reutils_0.2.3 xml2_1.3.2 jsonlite_1.7.2 XML_3.99-0.8
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 knitr_1.36 magrittr_2.0.1 R6_2.5.1
## [5] rlang_0.4.12 fastmap_1.1.0 stringr_1.4.0 highr_0.9
## [9] tools_4.1.1 xfun_0.27 jquerylib_0.1.4 htmltools_0.5.2
## [13] yaml_2.2.1 digest_0.6.28 assertthat_0.2.1 sass_0.4.0
## [17] bitops_1.0-7 RCurl_1.98-1.5 evaluate_0.14 rmarkdown_2.11
## [21] stringi_1.7.5 compiler_4.1.1 bslib_0.3.1 cellranger_1.1.0