The DYNATE
package accompanies the paper
“Localizing Rare-Variant Association Regions via Multiple Testing
Embedded in an Aggregation Tree”. The package is designed to pinpoint
the disease-associated rare variant sets with a controlled false
discovery rate in case-control study.
Here, we show how to apply DYNATE
.
We require the input data is a data frame with a long format: each
row is a rare variant (SNP) per sample. Specifically, the input data
should contains 6 variables with name Sample.Name
,
Sample.Type
, snpID
, domainID
,
X1
, X2
. Where variables
Sample.Name
, snpID
and domainID
indicate the Sample ID, SNP ID, and domain ID, respectively; Variable
Sample.Type
indicates the case/control status of each
sample; Variables X1
and X2
are covariates
that could be considered in the analysis. The snp_dat
below
is a toy simulated data with 6 variables and 210,454 rows. The data
contains 2,000 samples (1,000 cases and 1,000 controls). In total 16,281
SNPs reside in 2,000 domains are considered in snp_dat
.
str(snp_dat)
#> 'data.frame': 54282 obs. of 6 variables:
#> $ snpID : num 1 1 1 1 1 1 2 2 2 2 ...
#> $ Sample.Type: chr "case" "case" "ctrl" "ctrl" ...
#> $ Sample.Name: int 58 695 1373 1504 1898 1938 110 199 292 326 ...
#> $ domainID : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ X1 : num 0.38242 -0.46531 -0.23738 -0.0889 0.00982 ...
#> $ X2 : int 1 1 1 0 1 1 1 0 1 0 ...
First, we set the tuning parameters as follows. Please refer to the paper for detailed tuning parameters selection procedure.
Second, we use Test_Leaf
function to construct leaves
and generate leaf P-values for the case-control study.
# Model consider covariates effect:
t1 <- Sys.time()
p_leaf <- Test_Leaf(snp_dat=snp_dat,thresh_val=M)
#> 'as(<dgTMatrix>, "dgCMatrix")' is deprecated.
#> Use 'as(., "CsparseMatrix")' instead.
#> See help("Deprecated") and help("Matrix-deprecated").
t2 <- Sys.time()
t2-t1
#> Time difference of 1.515299 secs
In the output data frames p_leaf
, each row links to a
rare variant (SNPs), and the number of rows equals the number of rare
variants (SNPs) we considered (SNPs that link to a leaf with p-value=1
are excluded for maintaining the algorithm stability). The data frame
includes 5 variables. In the data frame, variable L1
is
leaf ID; variable pvals
is the leaf level p values;
variable Test
indicates the name of the statistical test to
generate the leaf level p values (FET or score).
str(p_leaf)
#> Classes 'data.table' and 'data.frame': 4214 obs. of 5 variables:
#> $ L1 : int 1 2 3 3 4 5 5 6 7 8 ...
#> $ pvals : num 0.453 0.887 0.548 0.548 0.895 ...
#> $ snpID : num 1 2 3 4 5 6 7 8 9 10 ...
#> $ domainID: int 1 1 1 1 1 1 1 1 1 1 ...
#> $ Test : chr "FET" "FET" "FET" "FET" ...
#> - attr(*, ".internal.selfref")=<externalptr>
Finally, we use the function DYNATE
to conduct dynamic
and hierarchical testing based on the leaf level p values.
In the output data frames out
, each row links to a
unique SNP that is detected by DYNATE. The variables snpID
,
L1
, and domainID
link to the detected SNP ID,
leaf ID, and domain ID, respectively; Variable Test
links
to the name of the statistical test we applied (FET or score); Variable
pvals1
links to the leaf level p-values; Variable
Layer
indicates in which layer the SNP is detected.
str(out)
#> tibble [9 × 6] (S3: tbl_df/tbl/data.frame)
#> $ L1 : int [1:9] 205 1357 1357 1462 2085 2479 3178 546 547
#> $ snpID : num [1:9] 277 1811 1812 1957 2785 ...
#> $ domainID: int [1:9] 20 168 168 181 312 376 493 66 66
#> $ Test : chr [1:9] "FET" "FET" "FET" "FET" ...
#> $ pvals1 : num [1:9] 1.25e-08 4.76e-34 4.76e-34 6.65e-05 2.09e-13 ...
#> $ Layer : int [1:9] 1 1 1 1 1 1 1 2 2