Source: https://github.com/markziemann/GeneNameErrors2020
View the reports: http://ziemann-lab.net/public/gene_name_errors/
Gene name errors result when data are imported improperly into MS Excel and other spreadsheet programs (Zeeberg et al, 2004). Certain gene names like MARCH3, SEPT2 and DEC1 are converted into date format. These errors are surprisingly common in supplementary data files in the field of genomics (Ziemann et al, 2016). This could be considered a minor error because it affects only a small number of genes; however, it is symptomatic of poor data processing methods. The purpose of this script is to identify gene name errors present in supplementary files of PubMed Central articles from the previous month.
library("XML")
library("jsonlite")
library("xml2")
library("reutils")
library("readxl")
library("RCurl")
Here I get the PubMed Central IDs for the previous month.
Start with figuring out the date to search PubMed Central.
CURRENT_MONTH=format(Sys.time(), "%m")
CURRENT_YEAR=format(Sys.time(), "%Y")
if (CURRENT_MONTH == "01") {
PREV_YEAR=as.character(as.numeric(format(Sys.time(), "%Y"))-1)
PREV_MONTH="12"
} else {
PREV_YEAR=CURRENT_YEAR
PREV_MONTH=as.character(as.numeric(format(Sys.time(), "%m"))-1)
}
DATE=paste(PREV_YEAR,"/",PREV_MONTH,sep="")
DATE
## [1] "2022/12"
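The search that follows uses day 31 as the upper bound for every month. As an alternative, here is a minimal base-R sketch that derives the actual first and last day of the previous month. `prev_month_range` is a hypothetical helper and not part of the original script; the `YYYY/MM/DD` output format is chosen to match the date strings passed to `mindate` and `maxdate`.

```r
# Hypothetical helper (not in the original script): return the first and
# last calendar day of the month preceding `today`, as "YYYY/MM/DD" strings.
prev_month_range <- function(today = Sys.Date()) {
  # First day of the current month
  first_of_this_month <- as.Date(format(today, "%Y-%m-01"))
  # Stepping back one day lands on the last day of the previous month
  last_of_prev <- first_of_this_month - 1
  first_of_prev <- as.Date(format(last_of_prev, "%Y-%m-01"))
  c(mindate = format(first_of_prev, "%Y/%m/%d"),
    maxdate = format(last_of_prev, "%Y/%m/%d"))
}

# A fixed date is used here so the result is reproducible
prev_month_range(as.Date("2023-01-15"))
##      mindate      maxdate
## "2022/12/01" "2022/12/31"
```

Because the last day is computed by date subtraction, the January rollover and months shorter than 31 days are handled without a special case.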
Let’s see how many PMC IDs we have in the past month.
QUERY ='((genom*[Abstract]))'
ESEARCH_RES <- esearch(term=QUERY, db = "pmc", rettype = "uilist", retmode = "xml", retstart = 0,
retmax = 5000000, usehistory = TRUE, webenv = NULL, querykey = NULL, sort = NULL, field = NULL,
datetype = NULL, reldate = NULL,
mindate = paste(DATE,"/1",sep="") , maxdate = paste(DATE,"/31",sep=""))
pmc <- efetch(ESEARCH_RES,retmode="text",rettype="uilist",outfile="pmcids.txt")
## Retrieving UIDs 1 to 500
## Retrieving UIDs 501 to 1000
## Retrieving UIDs 1001 to 1500
## Retrieving UIDs 1501 to 2000
## Retrieving UIDs 2001 to 2500
## Retrieving UIDs 2501 to 3000
## Retrieving UIDs 3001 to 3500
## Retrieving UIDs 3501 to 4000
pmc <- read.table(pmc)
pmc <- paste("PMC",pmc$V1,sep="")
NUM_ARTICLES=length(pmc)
NUM_ARTICLES
## [1] 3567
writeLines(pmc,con="pmc.txt")
Now run the bash script. Note that false positives can occur (~1.5%) and these results have not been verified by a human.
Here are some definitions:
NUM_XLS = Number of supplementary Excel files in this set of PMC articles.
NUM_XLS_ARTICLES = Number of articles matching the PubMed Central search which have supplementary Excel files.
GENELISTS = The gene lists found in the Excel files. Each Excel file is counted once even if it has multiple gene lists.
NUM_GENELISTS = The number of Excel files with gene lists.
NUM_GENELIST_ARTICLES = The number of PMC articles with supplementary Excel gene lists.
ERROR_GENELISTS = Files suspected to contain gene name errors. The dates and five-digit numbers indicate transmogrified gene names.
NUM_ERROR_GENELISTS = Number of Excel gene lists with errors.
NUM_ERROR_GENELIST_ARTICLES = Number of articles with supplementary Excel gene name errors.
ERROR_PROPORTION = This is the proportion of articles with Excel gene lists that have errors.
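Judging from the ERROR_GENELISTS output shown later in this report, each line of results.txt appears to be space-separated: PMC ID, file path, then (for gene lists) a species label, and (for suspected errors) a count followed by the converted values. The sketch below splits one such line under that assumption. `parse_result_line` is a hypothetical helper for illustration only, and the field layout is inferred rather than documented.

```r
# Hypothetical helper: split one results.txt line into named fields.
# Assumed layout (inferred from the report output, not documented):
#   PMCID  FILEPATH  [SPECIES  [NUM_ERRORS  ERROR_1 ... ERROR_N]]
# Note: like the strsplit() calls in the script, this assumes that file
# paths contain no spaces.
parse_result_line <- function(line) {
  fields <- strsplit(line, " ", fixed = TRUE)[[1]]
  n <- length(fields)
  list(
    pmcid      = fields[1],
    file       = fields[2],
    species    = if (n > 2) fields[3] else NA_character_,
    num_errors = if (n > 3) as.integer(fields[4]) else 0L,
    errors     = if (n > 4) fields[5:n] else character(0)
  )
}

# Example line copied from the report output
example <- "PMC9795187 zip/Table_2.xlsx Hsapiens 2 44808 44809"
parsed <- parse_result_line(example)
parsed$species
## [1] "Hsapiens"
parsed$errors
## [1] "44808" "44809"
```

Under this layout, a line with more than two fields corresponds to a gene list and a line with more than three fields to a suspected error, which matches the length thresholds used in the script.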
system("./gene_names.sh pmc.txt")
results <- readLines("results.txt")
XLS <- results[grep("XLS",results,ignore.case=TRUE)]
NUM_XLS = length(XLS)
NUM_XLS
## [1] 6654
NUM_XLS_ARTICLES = length(unique(sapply(strsplit(XLS," "),"[[",1)))
NUM_XLS_ARTICLES
## [1] 1171
GENELISTS <- XLS[lapply(strsplit(XLS," "),length)>2]
#GENELISTS
NUM_GENELISTS <- length(unique(sapply(strsplit(GENELISTS," "),"[[",2)))
NUM_GENELISTS
## [1] 634
NUM_GENELIST_ARTICLES <- length(unique(sapply(strsplit(GENELISTS," "),"[[",1)))
NUM_GENELIST_ARTICLES
## [1] 351
ERROR_GENELISTS <- XLS[lapply(strsplit(XLS," "),length)>3]
#ERROR_GENELISTS
NUM_ERROR_GENELISTS = length(ERROR_GENELISTS)
NUM_ERROR_GENELISTS
## [1] 247
GENELIST_ERROR_ARTICLES <- unique(sapply(strsplit(ERROR_GENELISTS," "),"[[",1))
GENELIST_ERROR_ARTICLES
## [1] "PMC9801655" "PMC9795187" "PMC9793550" "PMC9792465" "PMC9757741"
## [6] "PMC9789999" "PMC9785075" "PMC9784255" "PMC9775101" "PMC9772819"
## [11] "PMC9769457" "PMC9768194" "PMC9763387" "PMC9763382" "PMC9762592"
## [16] "PMC9762028" "PMC9759858" "PMC9758924" "PMC9755029" "PMC9750130"
## [21] "PMC9748153" "PMC9746999" "PMC9742853" "PMC9734637" "PMC9733279"
## [26] "PMC9729295" "PMC9729111" "PMC9728798" "PMC9727928" "PMC9724437"
## [31] "PMC9721428" "PMC9720402" "PMC9720150" "PMC9683724" "PMC9718428"
## [36] "PMC9715950" "PMC9715678" "PMC9713440" "PMC9714951" "PMC9713698"
## [41] "PMC9713327" "PMC9713173" "PMC9708086" "PMC9706722" "PMC9703941"
## [46] "PMC9800021" "PMC9795334" "PMC9792466" "PMC9791056" "PMC9776514"
## [51] "PMC9776006" "PMC9775906" "PMC9775105" "PMC9774719" "PMC9771818"
## [56] "PMC9768914" "PMC9764863" "PMC9763853" "PMC9763118" "PMC9763110"
## [61] "PMC9762029" "PMC9758519" "PMC9753033" "PMC9749333" "PMC9748020"
## [66] "PMC9748018" "PMC9746894" "PMC9744761" "PMC9743592" "PMC9743561"
## [71] "PMC9738480" "PMC9736101" "PMC9734139" "PMC9731154" "PMC9723631"
## [76] "PMC9724785" "PMC9722939" "PMC9718667" "PMC9716074" "PMC9715725"
## [81] "PMC9714834" "PMC9713371" "PMC9705836"
NUM_ERROR_GENELIST_ARTICLES <- length(GENELIST_ERROR_ARTICLES)
NUM_ERROR_GENELIST_ARTICLES
## [1] 83
ERROR_PROPORTION = NUM_ERROR_GENELIST_ARTICLES / NUM_GENELIST_ARTICLES
ERROR_PROPORTION
## [1] 0.2364672
Here you can have a look at all the gene lists detected in the past month, as well as those with errors. The dates are obvious errors; these are commonly dates in September, March, December and October. The five-digit numbers represent dates as they are encoded in Excel's internal format, a serial number that counts days from the start of 1900. If you were to put these numbers into Excel and format the cells as dates, most of them would also map to dates in September, March, December and October.
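The same conversion can be done in R without opening Excel. This is a minimal sketch that assumes the files use Excel's default 1900 date system; `excel_serial_to_date` is a hypothetical helper name, and the three serial numbers are taken from the PMC9801655 entries in the output that follows.

```r
# Hypothetical helper: convert Excel serial day numbers to R Dates.
# Assumes the 1900 date system (the Windows Excel default). The origin
# "1899-12-30" compensates for Excel treating 1900 as a leap year.
excel_serial_to_date <- function(serial) {
  as.Date(as.numeric(serial), origin = "1899-12-30")
}

serials <- c(44813, 44621, 44896)
dates <- excel_serial_to_date(serials)
format(dates, "%Y-%m-%d")
## [1] "2022-09-09" "2022-03-01" "2022-12-01"
```

These three values fall on 9 September, 1 March and 1 December, which is consistent with gene names such as SEPT9, MARCH1 and DEC1 having been converted to dates.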
#GENELISTS
ERROR_GENELISTS
## [1] "PMC9801655 PMC_DL/PMC9801655/supplementaryfiles/13578_2022_948_MOESM3_ESM.xlsx Ggallus 2 44813 44630"
## [2] "PMC9801655 PMC_DL/PMC9801655/supplementaryfiles/13578_2022_948_MOESM3_ESM.xlsx Hsapiens 1 44806"
## [3] "PMC9801655 PMC_DL/PMC9801655/supplementaryfiles/13578_2022_948_MOESM3_ESM.xlsx Hsapiens 1 44624"
## [4] "PMC9801655 PMC_DL/PMC9801655/supplementaryfiles/13578_2022_948_MOESM3_ESM.xlsx Ggallus 2 44622 44621"
## [5] "PMC9801655 PMC_DL/PMC9801655/supplementaryfiles/13578_2022_948_MOESM3_ESM.xlsx Hsapiens 2 44812 44623"
## [6] "PMC9801655 PMC_DL/PMC9801655/supplementaryfiles/13578_2022_948_MOESM3_ESM.xlsx Ggallus 1 44806"
## [7] "PMC9801655 PMC_DL/PMC9801655/supplementaryfiles/13578_2022_948_MOESM3_ESM.xlsx Hsapiens 1 44624"
## [8] "PMC9801655 PMC_DL/PMC9801655/supplementaryfiles/13578_2022_948_MOESM3_ESM.xlsx Hsapiens 2 44622 44621"
## [9] "PMC9801655 PMC_DL/PMC9801655/supplementaryfiles/13578_2022_948_MOESM3_ESM.xlsx Hsapiens 1 44622"
## [10] "PMC9801655 PMC_DL/PMC9801655/supplementaryfiles/13578_2022_948_MOESM3_ESM.xlsx Hsapiens 2 44625 44628"
## [11] "PMC9801655 PMC_DL/PMC9801655/supplementaryfiles/13578_2022_948_MOESM3_ESM.xlsx Hsapiens 2 44621 44815"
## [12] "PMC9801655 PMC_DL/PMC9801655/supplementaryfiles/13578_2022_948_MOESM3_ESM.xlsx Hsapiens 1 44896"
## [13] "PMC9795187 zip/Table_2.xlsx Hsapiens 2 44808 44809"
## [14] "PMC9793550 PMC_DL/PMC9793550/supplementaryfiles/12859_2022_5109_MOESM4_ESM.xlsx Hsapiens 5 44443 44257 44442 44265 44259"
## [15] "PMC9792465 PMC_DL/PMC9792465/supplementaryfiles/42003_2022_4351_MOESM4_ESM.xlsx Drerio 1 44454"
## [16] "PMC9792465 PMC_DL/PMC9792465/supplementaryfiles/42003_2022_4351_MOESM4_ESM.xlsx Drerio 5 44624 44816 44631 44816 44816"
## [17] "PMC9792465 PMC_DL/PMC9792465/supplementaryfiles/42003_2022_4351_MOESM4_ESM.xlsx Drerio 11 44626 44628 44625 44625 44808 44626 44626 44628 44625 44625 44808"
## [18] "PMC9792465 PMC_DL/PMC9792465/supplementaryfiles/42003_2022_4351_MOESM4_ESM.xlsx Drerio 2 44819 44819"
## [19] "PMC9792465 PMC_DL/PMC9792465/supplementaryfiles/42003_2022_4351_MOESM4_ESM.xlsx Drerio 1 44443"
## [20] "PMC9792465 PMC_DL/PMC9792465/supplementaryfiles/42003_2022_4351_MOESM4_ESM.xlsx Drerio 3 44451 44266 44259"
## [21] "PMC9792465 PMC_DL/PMC9792465/supplementaryfiles/42003_2022_4351_MOESM4_ESM.xlsx Drerio 3 44263 44261 44260"
## [22] "PMC9757741 zip/sciadv.abo4082_table_s1.xlsx Celegans 10 44621 44805 44652 44624 44626 44625 44836 44713 44835 44622"
## [23] "PMC9789999 PMC_DL/PMC9789999/supplementaryfiles/41467_2022_35604_MOESM6_ESM.xlsx Mmusculus 24 44631 44815 44628 44808 44818 44819 44813 44812 44627 44622 44630 44624 44806 44625 44805 44621 44814 44816 44807 44809 44626 44810 44811 44629"
## [24] "PMC9789999 PMC_DL/PMC9789999/supplementaryfiles/41467_2022_35604_MOESM5_ESM.xlsx Mmusculus 24 44805 44812 44625 44630 44621 44818 44622 44814 44628 44627 44816 44806 44811 44807 44626 44819 44808 44809 44810 44629 44813 44624 44631 44815"
## [25] "PMC9789999 PMC_DL/PMC9789999/supplementaryfiles/41467_2022_35604_MOESM4_ESM.xlsx Mmusculus 24 44631 44628 44818 44813 44630 44812 44807 44808 44811 44806 44809 44819 44814 44815 44816 44626 44810 44805 44625 44627 44622 44624 44629 44621"
## [26] "PMC9785075 zip/Supplementary_Table_S5.XLSX Hsapiens 3 38200 37135 38961"
## [27] "PMC9784255 PMC_DL/PMC9784255/supplementaryfiles/12863_2022_1099_MOESM7_ESM.xlsx Hsapiens 1 44256"
## [28] "PMC9775101 zip/Table_S1-S11.xlsx Hsapiens 2 44623 44630"
## [29] "PMC9772819 zip/Table_S6_630_PSGs_expression_in_each_cell_type.xlsx Hsapiens 1 43895"
## [30] "PMC9772819 zip/Table_S12_Gene_expression_average_and_proportion.xlsx Hsapiens 25 44256 44257 44256 44265 44266 44257 44258 44259 44260 44261 44262 44263 44264 44440 44449 44450 44451 44441 44442 44443 44444 44445 44446 44447 44448"
## [31] "PMC9772819 zip/Table_S12_Gene_expression_average_and_proportion.xlsx Hsapiens 25 44256 44257 44256 44265 44266 44257 44258 44259 44260 44261 44262 44263 44264 44440 44449 44450 44451 44441 44442 44443 44444 44445 44446 44447 44448"
## [32] "PMC9772819 zip/Table_S13_The_mean_expression_of_DEGs_in_different_cell_types.xlsx Hsapiens 4 44621 44628 44808 44811"
## [33] "PMC9769457 PMC_DL/PMC9769457/supplementaryfiles/Table_2.xlsx Hsapiens 14 44624 44630 44807 44816 44629 44627 44806 44811 44622 44628 44810 44622 44814 44805"
## [34] "PMC9768194 PMC_DL/PMC9768194/supplementaryfiles/Table1.XLSX Hsapiens 8 44621 44624 44626 44628 44805 44807 44810 44811"
## [35] "PMC9768194 PMC_DL/PMC9768194/supplementaryfiles/Table1.XLSX Hsapiens 2 44626 44815"
## [36] "PMC9768194 PMC_DL/PMC9768194/supplementaryfiles/Table1.XLSX Hsapiens 5 44621 44626 44628 44818 44806"
## [37] "PMC9768194 PMC_DL/PMC9768194/supplementaryfiles/Table1.XLSX Hsapiens 11 44896 44621 44622 44621 44627 44628 44815 44806 44807 44808 44812"
## [38] "PMC9768194 PMC_DL/PMC9768194/supplementaryfiles/Table1.XLSX Hsapiens 8 44621 44621 44630 44624 44625 44628 44806 44812"
## [39] "PMC9768194 PMC_DL/PMC9768194/supplementaryfiles/Table1.XLSX Hsapiens 18 44896 44621 44630 44624 44625 44626 44627 44628 44629 44814 44815 44816 44807 44808 44810 44811 44812 44813"
## [40] "PMC9768194 PMC_DL/PMC9768194/supplementaryfiles/Table1.XLSX Hsapiens 13 44896 44622 44623 44625 44629 44814 44815 44818 44806 44807 44811 44812 44813"
## [41] "PMC9768194 PMC_DL/PMC9768194/supplementaryfiles/Table1.XLSX Hsapiens 17 44621 44621 44623 44624 44626 44627 44628 44819 44814 44815 44806 44807 44808 44810 44811 44812 44813"
## [42] "PMC9768194 PMC_DL/PMC9768194/supplementaryfiles/Table1.XLSX Hsapiens 16 44621 44621 44624 44625 44627 44628 44629 44815 44816 44818 44807 44809 44810 44811 44812 44813"
## [43] "PMC9768194 PMC_DL/PMC9768194/supplementaryfiles/Table1.XLSX Hsapiens 17 44621 44623 44624 44625 44626 44627 44628 44819 44815 44818 44806 44807 44809 44810 44811 44812 44813"
## [44] "PMC9768194 PMC_DL/PMC9768194/supplementaryfiles/Table1.XLSX Hsapiens 15 44896 44621 44621 44622 44623 44624 44626 44627 44628 44819 44815 44818 44810 44811 44812"
## [45] "PMC9768194 PMC_DL/PMC9768194/supplementaryfiles/Table1.XLSX Hsapiens 19 44621 44630 44623 44624 44625 44626 44627 44628 44629 44819 44814 44815 44818 44806 44807 44810 44811 44812 44813"
## [46] "PMC9763387 PMC_DL/PMC9763387/supplementaryfiles/41387_2022_228_MOESM1_ESM.xlsx Hsapiens 5 44625 44624 44813 44621 44813"
## [47] "PMC9763387 PMC_DL/PMC9763387/supplementaryfiles/41387_2022_228_MOESM1_ESM.xlsx Hsapiens 13 44896 44621 44626 44624 44814 44625 44806 44810 44627 44630 44814 44629 44626"
## [48] "PMC9763387 PMC_DL/PMC9763387/supplementaryfiles/41387_2022_228_MOESM1_ESM.xlsx Hsapiens 1 44806"
## [49] "PMC9763382 PMC_DL/PMC9763382/supplementaryfiles/mmc3.xlsx Hsapiens 52 44896 44896 44896 44896 44630 44630 44630 44630 44621 44631 44631 44631 44631 44621 44621 44621 44622 44622 44622 44622 44623 44623 44623 44623 44624 44624 44624 44624 44625 44625 44625 44625 44626 44626 44626 44626 44627 44627 44627 44627 44628 44628 44628 44628 44629 44629 44629 44629 44819 44819 44819 44819"
## [50] "PMC9763382 PMC_DL/PMC9763382/supplementaryfiles/mmc3.xlsx Hsapiens 13 44622 44625 44629 44626 44819 44631 44628 44630 44627 44623 44624 44621 44896"
## [51] "PMC9763382 PMC_DL/PMC9763382/supplementaryfiles/mmc3.xlsx Hsapiens 50 44896 44896 44896 44896 44630 44630 44630 44630 44621 44631 44631 44631 44631 44621 44621 44621 44622 44622 44622 44622 44623 44623 44623 44623 44624 44624 44624 44624 44625 44625 44625 44625 44626 44626 44626 44627 44627 44627 44628 44628 44628 44628 44629 44629 44629 44629 44819 44819 44819 44819"
## [52] "PMC9763382 PMC_DL/PMC9763382/supplementaryfiles/mmc3.xlsx Hsapiens 13 44629 44625 44622 44630 44626 44621 44819 44627 44623 44624 44631 44628 44896"
## [53] "PMC9762592 PMC_DL/PMC9762592/supplementaryfiles/pgen.1010080.s011.xls Dmelanogaster 5 44441 44531 44440 44443 44444"
## [54] "PMC9762592 PMC_DL/PMC9762592/supplementaryfiles/pgen.1010080.s013.xls Dmelanogaster 1 44166"
## [55] "PMC9762028 PMC_DL/PMC9762028/supplementaryfiles/40779_2022_432_MOESM1_ESM.xlsx Hsapiens 77 44622 44812 44805 44626 44625 44622 44628 44628 44805 44814 44621 44809 44807 44628 44814 44810 44806 44628 44815 44621 44626 44808 44815 44621 44812 44806 44622 44809 44811 44630 44623 44813 44628 44813 44814 44814 44813 44814 44627 44811 44811 44813 44811 44815 44625 44624 44813 44815 44622 44806 44627 44811 44626 44808 44806 44813 44813 44806 44806 44815 44813 44812 44806 44810 44806 44811 44806 44812 44813 44813 44810 44813 44629 44815 44806 44629 44623"
## [56] "PMC9762028 PMC_DL/PMC9762028/supplementaryfiles/40779_2022_432_MOESM1_ESM.xlsx Hsapiens 70 44626 44628 44808 44626 44623 44807 44813 44813 44808 44809 44815 44622 44622 44629 44812 44806 44805 44622 44815 44813 44813 44626 44808 44806 44813 44806 44813 44625 44813 44629 44622 44622 44628 44809 44813 44813 44814 44813 44810 44806 44806 44809 44806 44815 44623 44626 44814 44621 44806 44627 44813 44627 44807 44621 44809 44815 44806 44621 44811 44806 44806 44815 44806 44811 44815 44627 44811 44806 44621 44811"
## [57] "PMC9762028 PMC_DL/PMC9762028/supplementaryfiles/40779_2022_432_MOESM1_ESM.xlsx Hsapiens 43 44622 44811 44621 44810 44622 44626 44815 44813 44813 44623 44814 44623 44623 44806 44621 44805 44623 44814 44626 44812 44624 44806 44811 44627 44811 44815 44809 44813 44806 44813 44810 44808 44809 44815 44814 44627 44814 44806 44808 44813 44808 44812 44808"
## [58] "PMC9759858 PMC_DL/PMC9759858/supplementaryfiles/13148_2022_1401_MOESM2_ESM.xlsx Hsapiens 2 40603 39142"