Source: https://github.com/markziemann/GeneNameErrors2020
View the reports: http://ziemann-lab.net/public/gene_name_errors/
Gene name errors result when data are imported improperly into MS Excel and other spreadsheet programs (Zeeberg et al, 2004). Certain gene names like MARCH3, SEPT2 and DEC1 are converted into date format. These errors are surprisingly common in supplementary data files in the field of genomics (Ziemann et al, 2016). This could be considered a minor error because it affects only a small number of genes; however, it is symptomatic of poor data processing methods. The purpose of this script is to identify gene name errors present in supplementary files of PubMed Central articles published in the previous month.
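To make the problem concrete, here is a minimal sketch of how converted gene names can be spotted in a column of gene symbols. It is not the detection logic used by this report's bash script; the function name `looks_like_excel_date` and the toy `genes` vector are made up for illustration. It assumes a converted name shows up either as a day-month string such as "1-Mar" or as a five-digit serial number.

```r
# Hypothetical helper: flag values that look like Excel-converted gene names.
looks_like_excel_date <- function(x) {
  month <- "(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"
  # Day-month strings such as "1-Mar" or "2-Sep"
  day_month <- paste0("^[0-9]{1,2}-", month, "$")
  # Five-digit numbers, which is how Excel stores dates internally
  serial <- "^[0-9]{5}$"
  grepl(day_month, x, ignore.case = TRUE) | grepl(serial, x)
}

# Toy gene column (made-up data): three entries have been converted.
genes <- c("TP53", "1-Mar", "2-Sep", "44621", "BRCA1", "DEC1")
suspect <- genes[looks_like_excel_date(genes)]
suspect
```

An intact symbol such as DEC1 is not flagged; only the values that have already been converted are.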
library("XML")
library("jsonlite")
library("xml2")
library("reutils")
library("readxl")
library("RCurl")
Here I will be getting PubMed Central IDs for the previous month.
Start with figuring out the date to search PubMed Central.
CURRENT_MONTH=format(Sys.time(), "%m")
CURRENT_YEAR=format(Sys.time(), "%Y")
if (CURRENT_MONTH == "01") {
  PREV_YEAR=as.character(as.numeric(format(Sys.time(), "%Y"))-1)
  PREV_MONTH="12"
} else {
  PREV_YEAR=CURRENT_YEAR
  PREV_MONTH=as.character(as.numeric(format(Sys.time(), "%m"))-1)
}
DATE=paste(PREV_YEAR,"/",PREV_MONTH,sep="")
DATE
## [1] "2023/4"
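As an aside, the same previous-month label can be obtained with date arithmetic, which handles the January case without a separate branch. This is only an alternative sketch; the helper name `prev_month_label` is hypothetical and is not used elsewhere in this report.

```r
# Hypothetical helper: label of the month before `today`, e.g. "2023/4".
prev_month_label <- function(today = Sys.Date()) {
  # Snap to the first day of the current month
  first_of_month <- as.Date(format(today, "%Y-%m-01"))
  # Step back one calendar month
  prev <- seq(first_of_month, by = "-1 month", length.out = 2)[2]
  # Month is written without a leading zero, as in DATE above
  paste(format(prev, "%Y"), as.integer(format(prev, "%m")), sep = "/")
}

prev_month_label(as.Date("2023-05-15"))  # "2023/4"
prev_month_label(as.Date("2023-01-10"))  # "2022/12"
```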
Let’s see how many PMC IDs we have in the past month.
QUERY ='((genom*[Abstract]))'
ESEARCH_RES <- esearch(term=QUERY, db = "pmc", rettype = "uilist", retmode = "xml", retstart = 0,
  retmax = 5000000, usehistory = TRUE, webenv = NULL, querykey = NULL, sort = NULL, field = NULL,
  datetype = NULL, reldate = NULL,
  mindate = paste(DATE,"/1",sep="") , maxdate = paste(DATE,"/31",sep=""))
pmc <- efetch(ESEARCH_RES,retmode="text",rettype="uilist",outfile="pmcids.txt")
## Retrieving UIDs 1 to 500
## Retrieving UIDs 501 to 1000
## Retrieving UIDs 1001 to 1500
## Retrieving UIDs 1501 to 2000
## Retrieving UIDs 2001 to 2500
## Retrieving UIDs 2501 to 3000
## Retrieving UIDs 3001 to 3500
pmc <- read.table(pmc)
pmc <- paste("PMC",pmc$V1,sep="")
NUM_ARTICLES=length(pmc)
NUM_ARTICLES
## [1] 3359
writeLines(pmc,con="pmc.txt")
Now run the bash script. Note that false positives can occur (~1.5%) and these results have not been verified by a human.
Here are some definitions:
NUM_XLS = Number of supplementary Excel files in this set of PMC articles.
NUM_XLS_ARTICLES = Number of articles matching the PubMed Central search which have supplementary Excel files.
GENELISTS = The gene lists found in the Excel files. Each Excel file is counted once even if it has multiple gene lists.
NUM_GENELISTS = The number of Excel files with gene lists.
NUM_GENELIST_ARTICLES = The number of PMC articles with supplementary Excel gene lists.
ERROR_GENELISTS = Files suspected to contain gene name errors. The dates and five-digit numbers indicate transmogrified gene names.
NUM_ERROR_GENELISTS = Number of Excel gene lists with errors.
NUM_ERROR_GENELIST_ARTICLES = Number of articles with supplementary Excel gene name errors.
ERROR_PROPORTION = This is the proportion of articles with Excel gene lists that have errors.
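The definitions above can be illustrated on a toy example. The three lines below are made up; they assume that each line of results.txt is space-separated, with the PMC ID first and the file path second, followed by a species field when a gene list is found, and then a count plus the suspect values when errors are found. That layout is inferred from the field-count thresholds used in the code that follows, so treat it as an assumption.

```r
# Made-up results lines (hypothetical PMC IDs and file names)
toy_results <- c(
  "PMC0000001 PMC_DL/PMC0000001/supplementaryfiles/a.xlsx",
  "PMC0000001 PMC_DL/PMC0000001/supplementaryfiles/b.xlsx Hsapiens",
  "PMC0000002 PMC_DL/PMC0000002/supplementaryfiles/c.xlsx Hsapiens 2 44621 44805"
)

fields <- strsplit(toy_results, " ")
n_fields <- lengths(fields)

# More than 2 fields: a gene list was detected in the file
toy_genelist_articles <- unique(sapply(fields[n_fields > 2], "[[", 1))
# More than 3 fields: suspected gene name errors were reported
toy_error_articles <- unique(sapply(fields[n_fields > 3], "[[", 1))

# Proportion of gene-list articles that have errors
toy_error_proportion <- length(toy_error_articles) / length(toy_genelist_articles)
toy_error_proportion  # 0.5
```

Here two articles have gene lists and one of them has suspected errors, giving a proportion of 0.5.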
system("./gene_names.sh pmc.txt")
results <- readLines("results.txt")
XLS <- results[grep("XLS",results,ignore.case=TRUE)]
NUM_XLS = length(XLS)
NUM_XLS
## [1] 4944
NUM_XLS_ARTICLES = length(unique(sapply(strsplit(XLS," "),"[[",1)))
NUM_XLS_ARTICLES
## [1] 1157
GENELISTS <- XLS[lapply(strsplit(XLS," "),length)>2]
#GENELISTS
NUM_GENELISTS <- length(unique(sapply(strsplit(GENELISTS," "),"[[",2)))
NUM_GENELISTS
## [1] 567
NUM_GENELIST_ARTICLES <- length(unique(sapply(strsplit(GENELISTS," "),"[[",1)))
NUM_GENELIST_ARTICLES
## [1] 296
ERROR_GENELISTS <- XLS[lapply(strsplit(XLS," "),length)>3]
#ERROR_GENELISTS
NUM_ERROR_GENELISTS = length(ERROR_GENELISTS)
NUM_ERROR_GENELISTS
## [1] 204
GENELIST_ERROR_ARTICLES <- unique(sapply(strsplit(ERROR_GENELISTS," "),"[[",1))
GENELIST_ERROR_ARTICLES
## [1] "PMC10148045" "PMC10147683" "PMC10146888" "PMC10140479" "PMC10140356"
## [6] "PMC10140160" "PMC10138339" "PMC10134541" "PMC10134367" "PMC10126422"
## [11] "PMC10124151" "PMC10121420" "PMC10120908" "PMC10119186" "PMC10119024"
## [16] "PMC10104777" "PMC10103417" "PMC10101686" "PMC10101332" "PMC10099834"
## [21] "PMC10098193" "PMC10095438" "PMC10095316" "PMC10092771" "PMC10087516"
## [26] "PMC10086525" "PMC10084920" "PMC10082087" "PMC10080023" "PMC10079878"
## [31] "PMC10078807" "PMC10076939" "PMC10076316" "PMC10073929" "PMC10072823"
## [36] "PMC10070455" "PMC10069891" "PMC10069868" "PMC10067406" "PMC10066595"
## [41] "PMC10066286" "PMC10063665" "PMC10148492" "PMC10146992" "PMC10142227"
## [46] "PMC10141963" "PMC10140399" "PMC10139981" "PMC10132553" "PMC10123121"
## [51] "PMC10120025" "PMC10119396" "PMC10118386" "PMC10115815" "PMC10115646"
## [56] "PMC10108528" "PMC10106861" "PMC10104827" "PMC10103646" "PMC10099689"
## [61] "PMC10097911" "PMC10092010" "PMC10091377" "PMC10087204" "PMC10085592"
## [66] "PMC10082194" "PMC10080956" "PMC10077681" "PMC10071012" "PMC10070495"
## [71] "PMC10067935" "PMC10064647" "PMC10064413"
NUM_ERROR_GENELIST_ARTICLES <- length(GENELIST_ERROR_ARTICLES)
NUM_ERROR_GENELIST_ARTICLES
## [1] 73
ERROR_PROPORTION = NUM_ERROR_GENELIST_ARTICLES / NUM_GENELIST_ARTICLES
ERROR_PROPORTION
## [1] 0.2466216
Here you can have a look at all the gene lists detected in the past month, as well as those with errors. The dates are obvious errors; they are most commonly dates in September, March, December and October. The five-digit numbers represent dates as they are encoded in Excel's internal format, which is the number of days since the start of 1900. If you were to paste these numbers into Excel and format the cells as dates, most of them would also map to dates in September, March, December and October.
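The five-digit numbers can also be decoded in R rather than Excel. The sketch below converts three of the values reported further down, using the origin "1899-12-30", which is the usual choice for Excel serial dates on Windows. The variable names are made up for illustration.

```r
# Three serial numbers taken from the error listing below
serials <- c(44621, 44805, 36951)

# Excel serial dates count days from an origin of 1899-12-30
decoded <- as.Date(serials, origin = "1899-12-30")
decoded
# [1] "2022-03-01" "2022-09-01" "2001-03-01"

# Month number of each decoded date
as.integer(format(decoded, "%m"))
# [1] 3 9 3
```

All three values fall in March or September, which is consistent with converted MARCH and SEPT gene names.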
#GENELISTS
ERROR_GENELISTS
## [1] "PMC10148045 PMC_DL/PMC10148045/supplementaryfiles/mmc6.xlsx Hsapiens 14 44626 44819 44810 44806 44811 44621 44812 44622 44627 44815 44625 44814 44807 44813"
## [2] "PMC10148045 PMC_DL/PMC10148045/supplementaryfiles/mmc4.xlsx Hsapiens 200 45173 45184 45175 45171 45176 44986 45170 45177 44987 45180 44990 45179 45178 45173 45175 45176 44986 45170 45177 44987 45174 45180 44990 45179 45178 45173 44991 45175 45171 45176 44986 45170 45177 44987 45174 45180 45179 45172 45178 44991 45175 45171 45176 44986 45170 45177 44987 45174 45180 44990 45179 45172 45178 45173 45175 45171 45176 44986 45170 45177 44987 45180 44990 45179 45178 45184 45175 45171 45176 44986 45177 44987 45180 44990 45178 45173 45175 45171 45176 44986 45177 45174 45180 45179 45178 45173 45175 45171 45176 45177 45180 45178 45184 45175 45171 45176 44986 45170 45177 44987 45180 44990 45179 45178 45173 45184 45175 45171 45176 44986 45170 45177 45174 45180 45179 45178 45175 45171 45176 44986 45170 45177 44987 45180 45179 45172 45178 45173 45175 45171 45176 45170 45177 44987 45174 45180 45179 45178 45175 45171 45176 44986 45177 44987 45180 44990 45179 45178 45175 45171 45176 44986 45170 45177 44987 45180 45178 45184 45175 45171 45176 44986 45170 45177 44987 45180 44990 45179 45178 45175 45171 45176 45170 45177 45174 45180 44990 45178 45184 45175 45171 45176 44986 45177 44987 45174 45180 45179 45178 44991 45175 45171 45176 45170 45177 45174 45180 44990 45179 45178"
## [3] "PMC10148045 PMC_DL/PMC10148045/supplementaryfiles/mmc4.xlsx Hsapiens 1184 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45170 45170 45170 45178 45175 45178 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45170 45170 45170 45178 45175 45178 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45170 45170 45170 45178 45175 45178 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45170 45170 45170 45178 45175 45178 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45170 45170 45170 45178 45175 45178 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45170 45170 45170 45178 45175 45178 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45170 45170 45170 45178 45175 45178 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45170 45170 45170 45178 45175 45178 45176 45175 45170 45178 45175 45176 45175 45170 45178 45175 45176 45175 45170 45178 45175 45176 45175 45170 45178 45175 45176 45175 45170 45178 45175 45176 45175 45170 45178 45175 45176 45175 45170 45178 45175 45176 45175 45170 45178 45175 45171 45178 45178 45175 45175 45171 45178 45178 45175 45175 45171 45178 45178 45175 45175 45171 45178 45178 45175 45175 45171 45178 45178 45175 45175 45171 45178 45178 45175 45175 45171 45178 45178 45175 45175 45171 45178 45178 45175 45175 45176 45170 45176 45175 45175 45176 45170 45176 45175 45175 45176 45170 45176 45175 45175 45176 45170 45176 45175 45175 45176 45170 45176 45175 45175 45176 45170 45176 45175 45175 45176 45170 45176 45175 45175 45176 45170 45176 45175 45175 45170 45176 45178 45175 45175 45170 45176 45178 45175 45175 45170 45176 45178 45175 45175 45170 45176 45178 45175 45175 45170 45176 45178 45175 45175 45170 45176 45178 45175 45175 45170 45176 45178 45175 45175 45170 45176 45178 45175 45175 45176 45171 45170 45178 45176 45178 45175 45175 45178 45178 45176 45171 45170 45178 45176 45178 45175 45175 45178 45178 45176 45171 45170 45178 45176 45178 45175 45175 45178 45178 45176 
45171 45170 45178 45176 45178 45175 45175 45178 45178 45176 45171 45170 45178 45176 45178 45175 45175 45178 45178 45176 45171 45170 45178 45176 45178 45175 45175 45178 45178 45176 45171 45170 45178 45176 45178 45175 45175 45178 45178 45176 45171 45170 45178 45176 45178 45175 45175 45178 45178 45171 45171 45175 45171 45176 45175 45175 45175 45171 45171 45175 45171 45176 45175 45175 45175 45171 45171 45175 45171 45176 45175 45175 45175 45171 45171 45175 45171 45176 45175 45175 45175 45171 45171 45175 45171 45176 45175 45175 45175 45171 45171 45175 45171 45176 45175 45175 45175 45171 45171 45175 45171 45176 45175 45175 45175 45171 45171 45175 45171 45176 45175 45175 45175 45171 45176 45175 45175 45178 45171 45176 45175 45175 45178 45171 45176 45175 45175 45178 45171 45176 45175 45175 45178 45171 45176 45175 45175 45178 45171 45176 45175 45175 45178 45171 45176 45175 45175 45178 45171 45176 45175 45175 45178 45171 45171 45175 45170 45170 45176 45178 45175 45178 45171 45171 45175 45170 45170 45176 45178 45175 45178 45171 45171 45175 45170 45170 45176 45178 45175 45178 45171 45171 45175 45170 45170 45176 45178 45175 45178 45171 45171 45175 45170 45170 45176 45178 45175 45178 45171 45171 45175 45170 45170 45176 45178 45175 45178 45171 45171 45175 45170 45170 45176 45178 45175 45178 45171 45171 45175 45170 45170 45176 45178 45175 45178 45171 45175 45176 45175 45178 45171 45175 45176 45175 45178 45171 45175 45176 45175 45178 45171 45175 45176 45175 45178 45171 45175 45176 45175 45178 45171 45175 45176 45175 45178 45171 45175 45176 45175 45178 45171 45175 45176 45175 45178 45175 45176 45178 45175 45175 45175 45176 45178 45175 45175 45175 45176 45178 45175 45175 45175 45176 45178 45175 45175 45175 45176 45178 45175 45175 45175 45176 45178 45175 45175 45175 45176 45178 45175 45175 45175 45176 45178 45175 45175 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45171 45178 45170 45170 45170 45178 45178 45178 45178 45175 45178 45171 45171 45171 45171 
45171 45171 45171 45171 45171 45171 45171 45171 45175 45171 45178 45170 45170 45170 45178 45178 45178 45178 45175 45178 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45171 45178 45170 45170 45170 45178 45178 45178 45178 45175 45178 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45171 45178 45170 45170 45170 45178 45178 45178 45178 45175 45178 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45171 45178 45170 45170 45170 45178 45178 45178 45178 45175 45178 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45171 45178 45170 45170 45170 45178 45178 45178 45178 45175 45178 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45171 45178 45170 45170 45170 45178 45178 45178 45178 45175 45178 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45171 45175 45171 45178 45170 45170 45170 45178 45178 45178 45178 45175 45178 45175 45170 45170 45175 45170 45170 45175 45170 45170 45175 45170 45170 45175 45170 45170 45175 45170 45170 45175 45170 45170 45175 45170 45170 45171 45175 45170 45170 45170 45175 45171 45175 45170 45170 45170 45175 45171 45175 45170 45170 45170 45175 45171 45175 45170 45170 45170 45175 45171 45175 45170 45170 45170 45175 45171 45175 45170 45170 45170 45175 45171 45175 45170 45170 45170 45175 45171 45175 45170 45170 45170 45175 45175 45170 45170 45170 45176 45175 45175 45175 45170 45170 45170 45176 45175 45175 45175 45170 45170 45170 45176 45175 45175 45175 45170 45170 45170 45176 45175 45175 45175 45170 45170 45170 45176 45175 45175 45175 45170 45170 45170 45176 45175 45175 45175 45170 45170 45170 45176 45175 45175 45175 45170 45170 45170 45176 45175 45175 45171 45170 45178 45178 45178 45176 45175 45175 45178 45171 45170 45178 45178 45178 45176 45175 45175 45178 45171 45170 45178 45178 45178 45176 45175 45175 45178 45171 45170 45178 45178 45178 45176 45175 45175 45178 45171 45170 45178 45178 45178 
45176 45175 45175 45178 45171 45170 45178 45178 45178 45176 45175 45175 45178 45171 45170 45178 45178 45178 45176 45175 45175 45178 45171 45170 45178 45178 45178 45176 45175 45175 45178 45171 45171 45171 45171 45175 45177 45170 45178 45175 45175 45178 45171 45171 45171 45171 45175 45177 45170 45178 45175 45175 45178 45171 45171 45171 45171 45175 45177 45170 45178 45175 45175 45178 45171 45171 45171 45171 45175 45177 45170 45178 45175 45175 45178 45171 45171 45171 45171 45175 45177 45170 45178 45175 45175 45178 45171 45171 45171 45171 45175 45177 45170 45178 45175 45175 45178 45171 45171 45171 45171 45175 45177 45170 45178 45175 45175 45178 45171 45171 45171 45171 45175 45177 45170 45178 45175 45175 45178 44995 45170 45170 45170 45178 45178 45176 45178 45175 45175 44995 45170 45170 45170 45178 45178 45176 45178 45175 45175 44995 45170 45170 45170 45178 45178 45176 45178 45175 45175 44995 45170 45170 45170 45178 45178 45176 45178 45175 45175 44995 45170 45170 45170 45178 45178 45176 45178 45175 45175 44995 45170 45170 45170 45178 45178 45176 45178 45175 45175 44995 45170 45170 45170 45178 45178 45176 45178 45175 45175 44995 45170 45170 45170 45178 45178 45176 45178 45175 45175"
## [4] "PMC10148045 PMC_DL/PMC10148045/supplementaryfiles/mmc4.xlsx Hsapiens 15 44808 44626 44819 44810 44806 44811 44621 44812 44622 44627 44815 44625 44814 44807 44813"
## [5] "PMC10148045 PMC_DL/PMC10148045/supplementaryfiles/mmc8.xlsx Hsapiens 14 44621 44622 44625 44805 44814 44815 44806 44807 44808 44809 44810 44811 44812 44813"
## [6] "PMC10147683 PMC_DL/PMC10147683/supplementaryfiles/41467_2023_38132_MOESM5_ESM.xlsx Hsapiens 2 39692 40057"
## [7] "PMC10147683 PMC_DL/PMC10147683/supplementaryfiles/41467_2023_38132_MOESM4_ESM.xlsx Hsapiens 26 36951 37316 36951 40238 37316 37681 38047 38412 38777 39142 39508 39873 42248 37135 40422 40787 41153 41883 37500 37865 38231 38596 38961 39326 39692 40057"