%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
<BEAM3>
Yu Zhang

This package implements the methods described in two papers:
[1] Zhang Y. A novel Bayesian graphical model for genome-wide multi-SNP association mapping.
Genet Epi. (2011) Nov 29. doi: 10.1002/gepi.20661. [Epub ahead of print]
(This paper tests common variants)

[2] Zhang Y, Ghosh S, Hakonarson H. (2014) Dynamic Bayesian testing of sets of variants in complex diseases. Genetics, doi: 10.1534/genetics.114.167403.
(This paper further tests rare variants)


The program uses graphs to account for SNP linkage disequilibrium (LD), such that the identified SNPs 
and SNP-SNP interactions are directly associated with the disease, rather than being artifacts of LD.

The program also constructs a disease graph. Each node in the disease graph contains one or more
SNPs. The SNPs in a node affect the disease together, but independently of SNPs outside the node. 
Each edge in the disease graph indicates an "interaction" between two nodes. That is, the two sets of SNPs
in the two connected nodes jointly affect the disease. Here, "interaction" means that the two sets of SNPs
are jointly associated with the disease, i.e., their joint contribution to disease risk is stronger than their 
individual marginal contributions. This should not be confused with interaction versus main effects in a regression
model, where interaction means the additional effect on top of the main effects. In other words, even if an 
edge is present between two nodes, the SNPs within each node could still have main effects.

This version of the software only supports case-control studies. A version supporting QTL studies will be
released later. In the meantime, if you have a quantitative trait, you can partition the trait values into two bins 
and then treat the data as cases and controls.
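
A minimal sketch of the two-bin workaround described above. It assumes a median split, which is only one possible cut point (the README does not prescribe one), and the function name and trait values are hypothetical.

```python
from statistics import median

def binarize_trait(values):
    """Split a quantitative trait at its median.

    Values above the median are labeled 1 ("case"), the rest 0 ("control").
    The median cut is an assumption; any other threshold could be used.
    """
    cut = median(values)
    return [1 if v > cut else 0 for v in values]

trait = [4.2, 5.9, 3.1, 7.4, 6.0, 2.8]   # hypothetical trait values
status = binarize_trait(trait)

# The labels can then be used as the first line of the input file.
print("ID Chr Position " + " ".join(str(s) for s in status))
# prints: ID Chr Position 0 1 0 1 1 0
```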

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This package contains the following files:
	BEAM3
	README.txt
	toydata.txt

[1. Installation]-----------------------------------------

Unzip all files into one folder. BEAM3 uses the GNU Scientific Library (GSL),
which is already built into the executable.

Run the program from the command line by typing:

./BEAM3 [inputfile] -o [outname] [options]
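
If you launch BEAM3 from a script, the command line above can be assembled as an argument list. The following is only a sketch: the helper name, the output name "toy", and the iteration counts are hypothetical, and the option flags are the ones described in section 3.

```python
def beam3_command(inputfile, outname, options=()):
    """Assemble './BEAM3 [inputfile] -o [outname] [options]' as an argument list."""
    return ["./BEAM3", inputfile, "-o", outname, *(str(o) for o in options)]

# Hypothetical run on the bundled toy data with 200 burn-in and 200 sampling iterations.
cmd = beam3_command("toydata.txt", "toy", ["-model", 1, "-sample", 200, 200])
print(" ".join(cmd))
# prints: ./BEAM3 toydata.txt -o toy -model 1 -sample 200 200
```

The list can be passed to subprocess.run(cmd, check=True) from the folder that contains the BEAM3 executable.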


[2. Input Data Format]-----------------------------------------

The input file contains the case-control genotype data.
The genotypes should be coded as allele dosage, i.e., 0, 1, or 2, denoting the number of alternative alleles, with 3 denoting missing data. 
Missing genotypes are imputed by a simple rule of thumb at each SNP independently. You may use other imputation software to impute missing genotypes before running BEAM3.

The first line of the input file should contain the disease status of each individual.
You may use 1 to denote patients and 0 to denote controls. 
Start the first line with "ID Chr Position ". 

An example of the 1st line: to denote 3 cases and 3 controls, use the following as the 1st line:
ID Chr Position 1 1 1 0 0 0

Starting from the 2nd line of the input file, each line contains the genotype data for one SNP, separated by spaces. As with the first line, each line starts with three fields: SNP ID, chromosome, and position. For example: 

rs1021 chr5 110123456 1 0 3 0 2 2

This line specifies a SNP called "rs1021", at chr5:110123456, which has genotypes 1, 0, a missing value, and 0, 2, 2, in 3 cases and 3 controls, respectively.

Please use numerical values for chromosome numbers, such as chr23 for chrX.

Each column in the input file denotes one individual. The disease 
status of each individual specified in the first line must match the corresponding 
column of genotypes in the remaining lines. 

SNPs should be sorted by their physical locations.

Please see the included "toydata.txt" for an example of the input file.
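
The format rules above can be checked while writing an input file. The sketch below is not part of BEAM3: the function names are hypothetical, and interpreting "sorted by physical location" as sorted by (chromosome number, position) is an assumption.

```python
def chrom_number(chrom):
    """Turn 'chr5' or '5' into the integer 5 (chrX is expected as chr23)."""
    text = str(chrom)
    return int(text[3:] if text.startswith("chr") else text)

def format_beam3_input(status, snps):
    """Return the input file content as one string.

    status: 0/1 disease labels, one per individual (1 = case, 0 = control).
    snps:   (snp_id, chrom, position, genotypes) tuples, one per SNP.
    """
    lines = ["ID Chr Position " + " ".join(str(s) for s in status)]
    previous = None
    for snp_id, chrom, position, genotypes in snps:
        # Every SNP line needs exactly one genotype per individual.
        if len(genotypes) != len(status):
            raise ValueError(f"{snp_id}: {len(genotypes)} genotypes for {len(status)} individuals")
        # 0/1/2 are allele dosages and 3 is missing; "-model 2" may allow other codes.
        if any(g not in (0, 1, 2, 3) for g in genotypes):
            raise ValueError(f"{snp_id}: genotype codes must be 0, 1, 2, or 3")
        # Assumed ordering: by chromosome number, then by position.
        key = (chrom_number(chrom), position)
        if previous is not None and key < previous:
            raise ValueError(f"{snp_id}: SNPs are not sorted by physical location")
        previous = key
        lines.append(f"{snp_id} {chrom} {position} " + " ".join(str(g) for g in genotypes))
    return "\n".join(lines) + "\n"

text = format_beam3_input(
    [1, 1, 1, 0, 0, 0],
    [("rs1021", "chr5", 110123456, [1, 0, 3, 0, 2, 2])],
)
print(text, end="")
```

This reproduces the two example lines shown earlier in this section (3 cases, 3 controls, one SNP).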

[3. Options]-----------------------------------------

There are 3 modes of the program, corresponding to different probability functions used for evaluating associations.
"-model 0" 	(default) This is the original BEAM3 (2011) method for common SNPs.
"-model 1"	This is the BEAM3 (2014) method for testing rare variants (and common variants). This mode uses a Gaussian density function
		that models the multivariate distribution of multiple variants.
"-model 2"	This is similar to -model 0, but allows more than 3 categorical genotype values. 

There are a few common options applicable to all 3 modes: 

"-filter k": Tells the program to filter out SNPs with too many missing genotypes (3%),
	     SNPs with unbalanced missingness between cases and controls, and SNPs violating
	     Hardy-Weinberg equilibrium (HWE). If this option is used, the user must specify the
	     value k: k=0 if heterozygotes are coded as 2, and k=1 if heterozygotes are coded as 1.

"-sample burnin mcmc": This option specifies the numbers of burn-in and sampling iterations.
	     By default, burnin=mcmc=100. In each iteration, the program updates all variables once,
	     so these numbers do not need to scale with the number of SNPs. A few
	     hundred iterations will be enough in most cases.

"-prior p": This option specifies the prior probability that each SNP is associated with the disease.
	    By default, p=5/L, i.e., 5 associated SNPs are expected (out of L SNPs).

"-T t":   This option tells the program to start running MCMC at a high temperature t, which
	  then drops gradually to 1 over the iterations. This option helps the program escape
	  local modes in the first few iterations.

When running the program with "-model 1" for testing rare variants, the following options apply:
"-group a b"	Tells the program to group SNPs with MAF<=b together as "one variable" for joint testing. 
	   	Also tells the program to test SNPs with MAF>=a individually without grouping.
		Typically a<b, so that SNPs with a<=MAF<=b are tested in two forms: grouped with others for joint testing, and individually for their own effects on the disease. This is a soft partition of SNPs into "rare" and "common".

"-szlimit x"	The maximum number of SNPs to be grouped as "one variable". The suggested value is 30. Going beyond 100 is likely to lose power and increase computing time.

"-split"	If used, this option tells the program to split a set of variants into smaller sets, keeping both the original set and the split sets for testing. For example, if "-group a b" and "-szlimit 30" produce a set of 30 variants, then with "-split" the program generates 3 subsets of 10 variants each, so 4 sets of variants are tested: one of size 30, and three of size 10. Nearby variants have priority to be grouped together.

"-randomonly"	Only test random effects. If not specified, the program will test both fixed and random effects. For rare variants, we suggest using this option.

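The soft partition defined by "-group a b" can be illustrated with a short sketch. The function names and the thresholds a=0.01, b=0.05 are hypothetical, and computing MAF from the non-missing allele dosages is an assumption about how the frequency is obtained, not a description of BEAM3's internals.

```python
def minor_allele_frequency(genotypes):
    """MAF from allele dosages 0/1/2; genotypes coded 3 (missing) are skipped."""
    observed = [g for g in genotypes if g != 3]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1 - alt_freq)

def group_roles(maf, a, b):
    """Return how a SNP is treated under '-group a b' as described above."""
    roles = []
    if maf <= b:
        roles.append("grouped")      # tested jointly as part of "one variable"
    if maf >= a:
        roles.append("individual")   # tested on its own
    return roles

a, b = 0.01, 0.05                     # hypothetical thresholds with a < b
for maf in (0.005, 0.03, 0.20):
    print(maf, group_roles(maf, a, b))
```

With a<b, a SNP whose MAF lies between a and b receives both roles, which is the overlap the option description refers to.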

[4. Output]-----------------------------------------

BEAM3 outputs 3 files: posterior.[outname], g.[outname].dot, and chi.txt.

posterior.[outname] contains the posterior probabilities of marginal and interaction associations per SNP.
The content of the file is mostly self-explanatory. For example, 0.01 + 0.10 = 0.11 means that the marginal
association probability at this SNP is 0.01, the interaction association probability is 0.10, and
the total disease association probability is 0.11. You can also sum the probabilities of multiple SNPs
in a region to estimate the number of disease-associated SNPs in the region. For example, if there 
are 100 SNPs spanning a 1 Mb region, summing the posterior probabilities (marginal, interaction,
or total) of the 100 SNPs gives an estimated number of disease SNPs in this region.
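
The regional sum described above can be sketched as follows. The records are hypothetical in-memory tuples, not a parser for posterior.[outname], whose exact column layout is not reproduced here.

```python
def expected_disease_snps(records, chrom, start, end):
    """Sum the total (marginal + interaction) posterior probabilities of all
    SNPs on `chrom` whose position lies in [start, end]."""
    return sum(
        marginal + interaction
        for c, position, marginal, interaction in records
        if c == chrom and start <= position <= end
    )

# Hypothetical (chromosome, position, marginal, interaction) values.
records = [
    ("chr5", 110100000, 0.01, 0.10),
    ("chr5", 110500000, 0.30, 0.05),
    ("chr5", 112000000, 0.02, 0.00),   # outside the region queried below
]
estimate = expected_disease_snps(records, "chr5", 110000000, 111000000)
print(round(estimate, 2))  # prints 0.46
```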

The last column is the same as the 2nd last column unless "-model 1" is used. With the "-model 1" option, the last column is
the posterior probability of association allocated to each SNP. For example, if SNPs 1~10 are tested as "one
variable" and receive a posterior probability of 0.9, then each of the 10 SNPs receives 0.9/10=0.09 probability of association.
That is, the last column is a weighted association probability, the sum of which can be used as a statistic to evaluate 
the overall evidence of association over a region.
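
The allocation in the example above amounts to the following arithmetic. This sketch assumes equal shares per SNP, as in the 0.9/10 example; the function name and SNP labels are hypothetical.

```python
def allocate_group_posterior(group_posterior, snp_ids):
    """Divide a group's posterior probability equally among its SNPs."""
    share = group_posterior / len(snp_ids)
    return {snp_id: share for snp_id in snp_ids}

snp_ids = [f"snp{i}" for i in range(1, 11)]        # SNPs 1~10 tested as one variable
allocated = allocate_group_posterior(0.9, snp_ids)
print(round(allocated["snp1"], 2))                 # prints 0.09
print(round(sum(allocated.values()), 2))           # prints 0.9
```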

g.[outname].dot contains the disease graph. You can use GraphViz, a free graph visualization program,
to visualize the disease graph. The first few lines in the file define graph nodes, and the numbers in
parentheses denote the posterior probability of association of each node (summed over a 100 kb region).
The last few lines in the file define edges.

chi.txt contains single-SNP chi-square tests, with test statistics and allele counts listed. It is provided for 
comparison purposes and is not a necessary result of BEAM3.
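
For readers who want an independent single-SNP baseline, the sketch below computes a standard allelic chi-square test (2x2 table of allele counts, 1 degree of freedom). Whether chi.txt uses exactly this statistic is an assumption, and the function name is hypothetical.

```python
import math

def allelic_chi_square(genotypes, status):
    """Pearson chi-square on reference/alternative allele counts in cases vs controls.

    genotypes: allele dosages 0/1/2, with 3 = missing (skipped).
    status:    1 for cases, 0 for controls, aligned with genotypes.
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    counts = {1: [0, 0], 0: [0, 0]}          # status -> [ref alleles, alt alleles]
    for g, s in zip(genotypes, status):
        if g == 3:
            continue
        counts[s][0] += 2 - g
        counts[s][1] += g
    a, b = counts[1]                         # cases: ref, alt
    c, d = counts[0]                         # controls: ref, alt
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:                           # an empty row or column: no test possible
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2)) # chi-square survival function for 1 df
    return chi2, p_value

# The example SNP from section 2: 3 cases followed by 3 controls.
chi2, p_value = allelic_chi_square([1, 0, 3, 0, 2, 2], [1, 1, 1, 0, 0, 0])
print(round(chi2, 3), round(p_value, 3))
```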

[5. References]-----------------------------------------

Zhang Y and Liu JS (2007). Bayesian Inference of Epistatic Interactions in
Case-Control Studies. Nature Genetics, 39:1167-1173.

Zhang Y, Zhang J, Liu JS (2010). Block-based Bayesian Epistasis Association
Mapping with Application to WTCCC Type 1 Diabetes Data. Ann Appl Stat, 5:2052-2077.

Zhang Y (2011). A novel Bayesian graphical model for genome-wide multi-SNP association mapping.
Genet Epi. Nov 29. doi: 10.1002/gepi.20661. [Epub ahead of print]

Zhang Y, Ghosh S, Hakonarson H (2014). Dynamic Bayesian testing of sets of variants in complex diseases. Genetics, doi: 10.1534/genetics.114.167403.

[6. Support]-----------------------------------------

Should you have questions or comments, please contact 
Yu Zhang
Department of Statistics, 
The Pennsylvania State University
325 Thomas Building
University Park, PA 16802.
Email: yuzhang at stat dot psu dot edu
