Data Mining Algorithms In R/Sequence Mining/DEGSeq
Working with DEGseq
Introduction
[edit | edit source]First we have to choose the way to organize your data. There are two ways to do that: using the MA-plot-based method with random sampling model or MA-plot-based method with technical replicates.
MA-plot-based method with random sampling model
The MA-plot is a statistical analysis tool that having been used to detect and visualize intensity-dependent ratio of microarray data. Let C1 and C2 denote the counts of reads mapped to a specific gene obtained from two samples, with Ci ~ binomial(ni, pi), I = 1,2, where ni denotes the total number of mapped reads and pi denotes the probability of a read coming from that gene. We define M = log2C1 - log2C2, and A = (log2C1 + log2C2)/2. It can be proven that under the random sampling assumption the conditional distribution of M given that A = a (a is an observation of A), follows an approximate normal distribution. For each gene on the MA-plot, we do the hypothesis test of H0: p1 = p2 versus H1: p1 ≠ p2. Then a P-value could be assigned based on the conditional normal distribution.
MA-plot-based method with technical replicates
Though it has been reported that sequencing platform has low background noise, technical replicates would still be informative for quality control and to estimate the variation due to different machines or platforms. MA-plot-based is an another method which estimates the noise level by comparing technical replicates in the data (if available). In this method, a sliding-window is first applied on the MA-plot of the two technical replicates along the A-axis to estimate the random variation corresponding to different expression levels. A smoothed estimate of the intensity-dependent noise level is done by loess regression, and converted to local standard deviations of M conditioned on A, under the assumption of normal distribution. The local standard deviations are then used to identify the difference of the gene expression between the two samples.
Multiple testing correction
For the above methods, the P-values calculated for each gene are adjusted to Q-values for multiple testing corrections. Users can set either a P-value or a false discovery rate (FDR) threshold to identify differentially expressed genes.
Once you choose your data, you can apply the R package DEGseq. There are five different methods to run the program. They are DEGexp, DEGseq, readGeneExp, samWrapper and getGeneExp.
Functions of Package
[edit | edit source]DEGexp
[edit | edit source]Description
This function is used to identify differentially expressed genes when users already have the gene expression values (such as the number of reads mapped to each gene).
Usage
DEGexp( geneExpFile1, geneCol1=1, expCol1=2, depth1=rep(0, length(expCol1) ), groupLabel1=" geneExpFile2, geneCol2=1, expCol2=2, depth2=rep(0, length(expCol2) ), groupLabel2=" header=TRUE, sep="", method=c("LRT", "CTR", "FET", "MARS", "MATR", "FC"), pValue=1e-3, zScore=4, qValue=1e-3, foldChange=4, thresholdKind=1, outputDir="none", normalMethod=c("none", "loess", "median"), replicate1="none", geneColR1=1, expColR1=2, depthR1=rep(0, length(expColR1) ), replicate2="none", geneColR2=1, expColR2=2, depthR2=rep(0, length(expColR2) ), rawcount=TRUE )
Arguments
Argument | Description |
---|---|
geneExpFile1 | file containing gene expression values for replicates of sample1 (or replicate1 when method="CTR"). |
geneCol1 | gene id column in geneExpFile1. |
expCol1 | expression value columns in geneExpFile1 for replicates of sample1 (numeric vector). Note: Each column corresponds to a replicate of sample1. |
depth1 | the total number of reads uniquely mapped to genome for each replicate of sample1 (numeric vector), default: take the total number of reads mapped to all annotated genes as the depth for each replicate. |
groupLabel1 | label of group1 on the plots. |
GeneExpFile2 | file containing gene expression values for replicates of sample2 (or replicate2 when method="CTR"). |
geneCol2 | gene id column in geneExpFile2. |
expCol2 | expression value columns in geneExpFile2 for replicates of sample2 (numeric vector). Note: Each column corresponds to a replicate of sample2. |
depth2 | the total number of reads uniquely mapped to genome for each replicate of sample2 (numeric vector), default: take the total number of reads mapped to all annotated genes as the depth for each replicate. |
groupLabel2 | label of group2 on the plots. |
header | a logical value indicating whether geneExpFile1 and geneExpFile2 contain the names of the variables as its first line. |
sep | the field separator character. If sep = "" (the default for read.table) the separator is white space, that is one or more spaces, tabs, newlines or carriage returns. |
method | method to identify differentially expressed genes. Possible methods are: • "LRT": Likelihood Ratio Test (Marioni et al. 2008),• "CTR": Check whether the variation between Technical Replicates can be explained by the random sampling model (Wang et al. 2009), • "FET": Fisher’s Exact Test (Joshua et al. 2009), • "MARS": MA-plot-based method with Random Sampling model (Wang et al. 2009), • "MATR": MA-plot-based method with Technical Replicates (Wang et al.2009), • "FC" : Fold-Change threshold on MA-plot. |
pValue | pValue threshold (for the methods: LRT, FET, MARS, MATR). only used when thresholdKind=1. |
zScore | zScore threshold (for the methods: MARS, MATR). only used when thresholdKind=2. |
qValue | qValue threshold (for the methods: LRT, FET, MARS, MATR). only used when thresholdKind=3 or thresholdKind=4. |
thresholdKind | the kind of threshold. Possible kinds are: • 1: pValue threshold, • 2: zScore threshold, • 3: qValue threshold (Benjamini et al. 1995), • 4: qValue threshold (Storey et al. 2003). |
foldChange | fold change threshold on MA-plot (for the method: FC). |
outputDir | the output directory. |
normalMethod | the normalization method: "none", "loess", "median". recommend: "none". |
replicate1 | file containing gene expression values for replicate batch1 (only used when method="MATR"). Note: replicate1 and replicate2 are two (groups of) technical replicates of a sample. |
GeneColR1 | gene id column in the expression file for replicate batch1 (only used when method="MATR"). |
expColR1 | expression value columns in the expression file for replicate batch1 (numeric vector) (only used when method="MATR"). |
depthR1 | the total number of reads uniquely mapped to genome for each replicate in replicate batch1 (numeric vector), default: take the total number of reads mapped to all annotated genes as the depth for each replicate (only used when method="MATR"). |
ReplicateLabel1 | label of replicate batch1 on the plots (only used when method="MATR"). |
replicate2 | file containing gene expression values for replicate batch2 (only used when method="MATR"). Note: replicate1 and replicate2 are two (groups of) technical replicates of a sample. |
geneColR2 | gene id column in the expression file for replicate batch2 (only used when method="MATR"). |
expColR2 | expression value columns in the expression file for replicate batch2 (numeric vector) (only used when method="MATR"). |
depthR2 | the total number of reads uniquely mapped to genome for each replicate in replicate batch2 (numeric vector), default: take the total number of reads mapped to all annotated genes as the depth for each replicate (only used when method="MATR"). |
ReplicateLabel2 | label of replicate batch2 on the plots (only used when method="MATR"). |
rawCount | a logical value indicating the gene expression values are based on raw read counts or normalized values. |
Example:
> library(DEGseq) > geneExpFile <- system.file("data", "GeneExpExample5000.txt", + package = "DEGseq") > if (geneExpFile == "") { + zipFile <- system.file("data", "Rdata.zip", package = "DEGseq") + if (zipFile != "") { + unzip(zipFile, "GeneExpExample5000.txt", exdir = tempdir()) + geneExpFile <- file.path(tempdir(), "GeneExpExample5000.txt") + } + } > layout(matrix(c(1, 2, 3, 4, 5, 6), 3, 2, byrow = TRUE)) > par(mar = c(2, 2, 2, 2)) > DEGexp(geneExpFile1 = geneExpFile, expCol1 = c(7, 9, 12, 15, 18), groupLabel1 = "kidney", geneExpFile2 = geneExpFile, expCol2 = c(8, 10, 11, 13, 16), groupLabel2 = "liver", method = "MARS") Please wait... geneExpFile1: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/GeneExpExample5000.txt gene id column in geneExpFile1: 1 expression value column(s) in geneExpFile1: 7 9 12 15 18 total number of reads uniquely mapped to genome obtained from sample1: 345504 354981 334557 geneExpFile2: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/GeneExpExample5000.txt gene id column in geneExpFile2: 1 expression value column(s) in geneExpFile2: 8 10 11 13 16 total number of reads uniquely mapped to genome obtained from sample2: 274430 274486 264999 method to identify differentially expressed genes: MARS pValue threshold: 0.001 output directory: none The outputDir is not specified! Please wait … Identifying differentially expressed genes ... Please wait patiently ... output … The results can be observed in directory: none
DEGseq
[edit | edit source]Description
This function is used to identify differentially expressed genes from RNA-seq data. It takes uniquely mapped reads from RNA-seq data for the two samples with a gene annotation as input. So users should map the reads (obtained from sequencing libraries of the samples) to the corresponding genome in advance.
Usage
DEGseq ( mapResultBatch1, mapResultBatch2, fileFormat="bed", readLength=32, strandInfo=FALSE, refFlat, groupLabel1="group1", groupLabel2="group2", method=c("LRT", "CTR", "FET", "MARS", "MATR", "FC"), pValue=1e-3, zScore=4, qValue=1e-3, foldChange=4, thresholdKind=1, outputDir="none", normalMethod=c("none", "loess", "median"), depthKind=1, replicate1="none", replicate2="none", replicateLabel1="replicate1", replicateLabel2="replicate2" )
Arguments
Argument | Description |
---|---|
mapResultBatch1 | uniquely mapping result files for technical replicates of sample1 (or replicate1 when method="CTR"). |
MapResultBatch2 | uniquely mapping result files for technical replicates of sample2 (or replicate2 when method="CTR"). |
fileFormat | file format: "bed" or "eland". example of "bed" format: chr12 7 38 readID 2 example of "eland" format: readID chr12.fa 7 U2 F Note: The field separator character is TAB. And the files must follow the format as one of the examples. |
ReadLength | the length of the reads (only used if fileFormat="eland"). strandInfo whether the strand information was retained during the cloning of the cDNAs. "TRUE" : retained, "FALSE": not retained. |
RefFlat | gene annotation file in UCSC refFlat format. |
GroupLabel1 | label of group1 on the plots. |
GroupLabel2 | label of group2 on the plots. |
Method | method to identify differentially expressed genes. Possible methods are: "LRT": Likelihood Ratio Test, "CTR": Check whether the variation between two Technical Replicates can be explained by the random sampling model, "FET": Fisher’s Exact Test, "MARS": MA-plot-based method with Random Sampling model, "MATR": MA-plot-based method with Technical Replicates, "FC" : Fold-Change threshold on MA-plot. |
pValue | pValue threshold (for the methods: LRT, FET, MARS, MATR). only used when thresholdKind=1. |
zScore | zScore threshold (for the methods: MARS, MATR). only used when thresholdKind=2. |
qValue | qValue threshold (for the methods: LRT, FET, MARS, MATR). only used when thresholdKind=3 or thresholdKind=4. |
ThresholdKind | the kind of threshold. Possible kinds are: • 1: pValue threshold, • 2: zScore threshold, • 3: qValue threshold, • 4: qValue threshold . |
foldChange | fold change threshold on MA-plot (for the method: FC). |
outputDir | the output directory. |
normalMethod | the normalization method: "none", "loess", "median". recommend: "none". |
DepthKind | 1- take the total number of reads uniquely mapped to genome as the depth for each replicate, 0: take the total number of reads uniquely mapped to all annotated genes as the depth for each replicate. We recommend taking depthKind=1, especially when the genes in annotation file are part of all genes. |
replicate1 | files containing uniquely mapped reads obtained from replicate batch1 (only used when method="MATR"). |
replicate2 | files containing uniquely mapped reads obtained from replicate batch2 (only used when method="MATR"). |
ReplicateLabel1 | label of replicate batch1 on the plots (only used when method="MATR"). |
ReplicateLabel2 | label of replicate batch2 on the plots (only used when method="MATR"). |
Example:
> kidneyR1L1 <- system.file("data", "kidneyChr21.bed.txt", package = "DEGseq") > liverR1L2 <- system.file("data", "liverChr21.bed.txt", package = "DEGseq") > refFlat <- system.file("data", "refFlatChr21.txt", package = "DEGseq") > mapResultBatch1 <- c(kidneyR1L1) > mapResultBatch2 <- c(liverR1L2) > outputDir <- file.path(tempdir(), "DEGseqExample") > DEGseq(mapResultBatch1, mapResultBatch2, fileFormat = "bed",refFlat = refFlat, outputDir = outputDir, method = "LRT")
Please wait...
mapResultBatch1: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/kidneyChr21.bed.txt mapResultBatch2: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/liverChr21.bed.txt file format: bed refFlat: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/refFlatChr21.txt Ignore the strand information when count the reads mapped to genes! Count the number of reads mapped to each gene ... This will take several minutes, please wait patiently!
Please wait...
SampleFiles: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/kidneyChr21.bed.txt Count the number of reads mapped to each gene. This will take several minutes.
Please wait …
total 259 unique genes processed 0 reads (kidneyChr21.bed.txt) processed 10000 reads (kidneyChr21.bed.txt) processed 20000 reads (kidneyChr21.bed.txt) processed 30000 reads (kidneyChr21.bed.txt) processed 34304 reads (kidneyChr21.bed.txt) total used 0.328000 seconds!
Please wait...
SampleFiles: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/liverChr21.bed.txt Count the number of reads mapped to each gene. This will take several minutes.
Please wait …
total 259 unique genes processed 0 reads (liverChr21.bed.txt) processed 10000 reads (liverChr21.bed.txt) processed 20000 reads (liverChr21.bed.txt) processed 30000 reads (liverChr21.bed.txt) processed 30729 reads (liverChr21.bed.txt) total used 0.297000 seconds!
Please wait...
geneExpFile1: C:\DOCUME~1\wanglk\LOCALS~1\Temp\RtmpUvf7Fw/DEGseqExample/group1.exp gene id column in geneExpFile1: 1 expression value column(s) in geneExpFile1: 2 total number of reads uniquely mapped to genome obtained from sample1: 34304 geneExpFile2: C:\DOCUME~1\wanglk\LOCALS~1\Temp\RtmpUvf7Fw/DEGseqExample/group2.exp gene id column in geneExpFile2: 1 expression value column(s) in geneExpFile2: 2 total number of reads uniquely mapped to genome obtained from sample2: 30729 method to identify differentially expressed genes: LRT pValue threshold: 0.001 output directory: C:\DOCUME~1\wanglk\LOCALS~1\Temp\RtmpUvf7Fw/DEGseqExample
Please wait …
Identifying differentially expressed genes ... Please wait patiently ... output ...
Done ... The results can be observed in directory: C:\DOCUME~1\wanglk\LOCALS~1\Temp\RtmpUvf7Fw/DEGs
getGeneExp
[edit | edit source]Description
This function is used to count the number of reads and calculate the RPKM for each gene. It takes uniquely mapped reads from RNA-seq data for a sample with a gene annotation file as input. So users should map the reads (obtained from sequencing library of the sample) to the corresponding genome in advance.
Usage
getGeneExp( mapResultBatch, fileFormat="bed", readLength=32, strandInfo=FALSE, refFlat, output=paste(mapResultBatch[1],".exp",sep=""), min.overlapPercent= 1 )
Arguments
Argument | Description |
---|---|
mapResultBatch | a vector containing uniquely mapping result files for a sample. Note: The sample can have multiple technical replicates. |
fileFormat | file format: "bed" or "eland". example of "bed" format: chr12 7 38 readID 2 example of "eland" format: readID chr12.fa 7 U2 F Note: The field separator character is TAB. And the files must follow the format as one of the examples. |
readLength | the length of the reads (only used if fileFormat="eland"). strandInfo whether the strand information was retained during the cloning of the cDNAs. • "TRUE" : retained, • "FALSE": not retained. |
refFlat | gene annotation file in UCSC refFlat format. |
output | the output file. |
min.overlapPercent | the minimum percentage of the overlapping length for a read and an exon over the length of the read itself, for counting this read from the exon. should be <=1. 0: at least 1 bp overlap between a read and an exon. |
Example:
> kidneyR1L1 <- system.file("data", "kidneyChr21.bed.txt", package = "DEGseq") > refFlat <- system.file("data", "refFlatChr21.txt", package = "DEGseq") > mapResultBatch <- c(kidneyR1L1) > output <- file.path(tempdir(), "kidneyChr21.bed.exp") > getGeneExp(mapResultBatch, refFlat = refFlat, output = output) Please wait... SampleFiles: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/kidneyChr21.bed.txt Count the number of reads mapped to each gene. This will take several minutes. Please wait ... total 259 unique genes processed 0 reads (kidneyChr21.bed.txt) processed 10000 reads (kidneyChr21.bed.txt) processed 20000 reads (kidneyChr21.bed.txt) processed 30000 reads (kidneyChr21.bed.txt) processed 34304 reads (kidneyChr21.bed.txt) total used 0.328000 seconds! > exp <- readGeneExp(file = output, geneCol = 1, valCol = c(2, + 3), label = c("raw count", "RPKM")) > exp[30:32, ] raw count RPKM C21orf131 0 0.000 C21orf15 0 0.000 C21orf2 51 665.789
readGeneExp
[edit | edit source]Description
This method is used to read gene expression values from a file to a matrix in R workspace. So that the matrix can be used as input of other packages, such as edgeR. The input of the method is a file that contains gene expression values.
Usage
readGeneExp(file, geneCol=1, valCol=2, label = NULL, header=TRUE, sep="")
Arguments
Argument Description file file containing gene expression values. GeneCol gene id column in file. valCol expression value columns to be read in the file. label label for the columns. Header a logical value indicating whether the file contains the names of the variables as its first line. sep the field separator character. If sep = "" (the default for read.table) the separator is white space, that is one or more spaces, tabs, newlines or carriage returns.
Example:
> geneExpFile <- system.file("data", "GeneExpExample1000.txt", + package = "DEGseq") > exp <- readGeneExp(file = geneExpFile, geneCol = 1, valCol = c(7, + 9, 12, 15, 18, 8, 10, 11, 13, 16)) > exp[30:32, ]
R1L1Kidney
R1L3Kidney
R1L7Kidney
R2L2Kidney
R2L6Kidney ENSG00000188976 73 77 68 70 82 ENSG00000187961 15 15 13 12 15 ENSG00000187583 1 1 3 0 3
R1L2Liver
R1L4Liver
R1L6Liver
R1L8Liver
R2L3Liver ENSG00000188976 34 56 45 55 42 ENSG00000187961 8 13 11 12 20 ENSG00000187583 0 1 0 0 2
samWrapper
[edit | edit source]Description
This function is a wrapper of the functions in samr. It is used to identify differentially expressed genes for two sets of samples with multiple replicates or two groups of samples from different individuals (e.g. disease samples vs. control samples).
Usage
samWrapper( geneExpFile1, geneCol1=1, expCol1=2, measure1=rep(1, length(expCol1) ), geneExpFile2, geneCol2=1, expCol2=2, measure2=rep(2, length(expCol2) ), header=TRUE, sep="", paired=FALSE, s0=NULL, s0.perc=NULL, nperms=100, testStatistic= c("standard","wilcoxon"), max.qValue=1e-3, in.foldchange=logged2=FALSE, output )
Arguments
Argument Description geneExpFile1 file containing gene expression values for group1. geneCol1 gene id column in geneExpFile1. expCol1 expression value columns in geneExpFile1. See the example. measure1
numeric vector of outcome measurements for group1. like c(1,1,1...) when paired=FALSE, or like c(-1,-2,-3,...) when paired=TRUE.
geneExpFile2 file containing gene expression values for group2. geneCol2 gene id column in geneExpFile2. ExpCol2 expression value columns in geneExpFile2. See the example. Measure2 numeric vector of outcome measurements for group2. like c(2,2,2...) when paired=FALSE, or like c(1,2,3,...) when paired=TRUE. header a logical value indicating whether geneExpFile1 and geneExpFile2 contain the names of the variables as its first line. sep the field separator character. If sep = "" (the default for read.table) the separator is white space, that is one or more spaces, tabs, newlines or carriage returns. paired a logical value indicating whether the samples are paired. s0 exchangeability factor for denominator of test statistic; Default is automatic choice. s0.perc percentile of standard deviation values to use for s0; default is automatic choice; -1 means s0=0 (different from s0.perc=0, meaning s0=zeroeth percentile of standard deviation values= min of sd values. Nperms number of permutations used to estimate false discovery rates. TestStatistic test statistic to use in two class unpaired case. Either "standard" (t-statistic) or "wilcoxon" (Two-sample wilcoxon or Mann-Whitney test). recommend "standard". max.qValue the max qValue desired; shoube be <1. min.foldchange the minimum fold change desired; should be >1. default is zero, meaning no fold change criterion is applied. logged2 a logical value indicating whether the expression values are logged2. output the output file.
Example
> geneExpFile <- system.file("data", "GeneExpExample1000.txt", + package = "DEGseq") > set.seed(100) > geneExpFile1 <- geneExpFile > geneExpFile2 <- geneExpFile > output <- file.path(tempdir(), "samWrapperOut.txt") > expCol1 = c(7, 9, 12, 15, 18) > expCol2 = c(8, 10, 11, 13, 16) > measure1 = c(-1, -2, -3, -4, -5) > measure2 = c(1, 2, 3, 4, 5) > samWrapper(geneExpFile1 = geneExpFile1, geneCol1 = 1, expCol1 = expCol1, measure1 = measure1, geneExpFile2 = geneExpFile2, geneCol2 = 1, expCol2 = expCol2, measure2 = measure2, nperms = 100, min.foldchange = 2, max.qValue = 1e-04, output = output, paired = TRUE)
References
[edit | edit source]Jiang,H. and Wong,W.H. (2009) Statistical inferences for isoform expression in RNASeq. Bioinformatics, 25, 1026–1032.
Likun Wang, Zhixing Feng, Xi Wang, Xiaowo Wang, and Xuegong Zhang. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data Bioinformatics Advance Access published on October 24, 2009, DOI 10.1093/bioinformatics/btp612.
Wang, Likun; Feng, Zhixing; Wang, Xi; Wang, Xiaowo; Zhang, Xuegong. DEGseq Manual. Tsinghua University. Beijing, China. 2009(B). Available at <http://www.bioconductor.org/packages/devel/bioc/manuals/DEGseq/man/DEGseq.pdf> Access on: 18 November 2009
Wang, Likun; Wang, Xi. How to use the Degseq Package. Laboratory of Bioinformatics and Bioinformatics Division, TNLIST / Department of Automation, Tsinghua University. Beijing, China. 2009(A). Available at <http://bioinfo.au.tsinghua.edu.cn/software/degseq/DEGseq.pdf> Access on: 18 November 2009