Instructions for Running the Hclust.R Program


Hclust.R is an R program for finding tag SNPs based on correlation. To use it, load the program in R with source() and then call SNPclust():

source("Hclust.R")

SNPclust(datafile[,option1=value1, option2=value2,...])



datafile is the matrix of SNPs to be analysed. Each column represents a SNP, coded by MINOR allele counts 0, 1, 2. Missing values are coded as NA or 3. The first row of the matrix is the names of the SNPs. An error occurs if the number of names does not equal the number of SNPs. (See the sample file or commented input example.) This variable must be a matrix or data frame, or else the name of a file containing a matrix. If datafile is a file name, it must be enclosed in quotes. This is the only non-optional variable.

SNPclust() also has many options that can be set on the command line. If a value is not specified, the default is used. A complete list of options and defaults follows:

Usage:

SNPclust(datafile, hcbound=.5, stbound=1, ntags=NULL, clustmethod="complete", bestN=NULL, minfreq=.1, subsetsize=500, qualityfile=NULL, weight=.1, infofile=NULL, psnpfile=NULL, matrixfile=NULL, outfile=NULL, dplot=TRUE, main="Cluster Dendrogram", cex.label=.8, plotfile=NULL, logfile="log.txt", stat="mean")

Options:

datafile

The data to be analysed. See above.

hcbound

The cut-off value for finding clusters in the Hclust method. SNPs that are linked below 1-hcbound are in the same cluster. Default: .5.
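The role of hcbound can be illustrated with base R alone. The sketch below is not the code in Hclust.R; it assumes the dissimilarity between two SNPs is 1 minus their squared correlation, and the genotype values are made up for illustration.

```r
# Hypothetical genotypes: 9 individuals (rows) by 5 SNPs (columns).
a <- c(0, 1, 2, 0, 1, 2, 0, 1, 2)
b <- c(0, 0, 0, 1, 1, 1, 2, 2, 2)
d <- c(0, 1, 2, 1, 2, 0, 2, 0, 1)
geno <- cbind(SNP1 = a, SNP2 = a, SNP3 = b, SNP4 = b, SNP5 = d)

hcbound <- 0.5

# Assumption: dissimilarity = 1 - squared pairwise correlation.
r2 <- cor(geno, use = "pairwise.complete.obs")^2
hc <- hclust(as.dist(1 - r2), method = "complete")

# SNPs that are joined below height 1 - hcbound fall in the same cluster.
clusters <- cutree(hc, h = 1 - hcbound)
print(clusters)
```

Raising hcbound lowers the cut height, so clusters become smaller and more tag SNPs are needed.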

stbound

1-stbound is the boundary used for the backward-step algorithm. If stbound = 1, then the standard H-clust method is used without stepwise regression. (See stat option.) Default: 1

ntags

The maximum number of tags to select. If this value is NULL, there is no maximum. The program starts by dropping tag SNPs until stbound is satisfied. Then, if necessary, the stepwise boundary is decreased incrementally until there are at most ntags tag SNPs. Thus it is possible to put an upper bound on both stbound and the number of tags simultaneously. Default: NULL

clustmethod

The linkage method used in the hierarchical clustering phase. (Use the R command help(hclust) for more information.) Default: "complete"

bestN

How many tagSNPs to find using the bestN algorithm. This algorithm starts after the normal tag selection concludes. The tagSNPs are ranked by the size of their clusters. Ties are broken by score and then CorMean. Any remaining ties are broken by random selection (see below.) If bestN is NULL, then this algorithm is not run. Default: NULL
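The ranking rule can be sketched in a few lines of R. This is only an illustration of the ordering described above, not the code in Hclust.R: the table of tag SNPs and its column names are hypothetical, and it is assumed that larger clusters, higher scores and higher CorMean values rank first.

```r
# Hypothetical summary of five tag SNPs (all values made up).
tags <- data.frame(name        = c("SNP3", "SNP7", "SNP9", "SNP12", "SNP20"),
                   clustersize = c(4, 6, 4, 1, 6),
                   score       = c(0.80, 0.70, 0.80, 1.00, 0.75),
                   CorMean     = c(0.78, 0.70, 0.82, 1.00, 0.75),
                   stringsAsFactors = FALSE)
bestN <- 3

# A random key breaks any ties that remain after the first three criteria.
set.seed(42)
tiebreak <- runif(nrow(tags))

# Rank by cluster size, then score, then CorMean (largest first).
ranked <- tags[order(-tags$clustersize, -tags$score, -tags$CorMean, tiebreak), ]
best <- head(ranked$name, bestN)
print(best)
```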

minfreq

The minimum minor allele frequency allowed. If the minor allele frequency of a SNP is below minfreq, that SNP is removed from the data. We do not recommend setting minfreq < 0.05. Default: 0.1
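A filter of this kind might look like the following sketch. It assumes the frequency is the number of minor alleles divided by twice the number of non-missing genotypes; the exact calculation in Hclust.R may differ, and the data are invented.

```r
# Hypothetical genotype matrix: 10 individuals by 3 SNPs, NA = missing.
geno <- cbind(SNP1 = c(0, 1, 2, 0, 1, 0, 0, 1, 0, NA),
              SNP2 = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0),
              SNP3 = c(1, 1, 0, 2, 0, 1, 0, 0, 1, 0))
minfreq <- 0.1

# Assumed definition: minor alleles observed / (2 * non-missing genotypes).
maf <- colSums(geno, na.rm = TRUE) / (2 * colSums(!is.na(geno)))

# Keep only SNPs whose minor allele frequency is at least minfreq.
kept <- geno[, maf >= minfreq, drop = FALSE]
print(round(maf, 3))
print(colnames(kept))
```

Here SNP2 carries a single minor allele among 20 chromosomes (frequency 0.05), so it is dropped at the default minfreq of 0.1.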

subsetsize

The maximum number of SNPs to be clustered at one time. Preliminary tag SNPs are chosen from each subset, then the method is repeated to choose tag SNPs from the set of preliminaries. We do not recommend having subsetsize > 1000. Default: 500

qualityfile

An optional parameter providing numerical SNP quality. This can be a file or a vector. If a file, each quality rating should be separated by white space. Default: NULL

weight

The relative importance of SNP quality in ranking tag SNPs. The score for each SNP is

score = weight * quality + (1 - weight) * CorMean

where CorMean is the mean correlation with other SNPs in the cluster. The SNP with the highest score is chosen to be the tagSNP. This parameter is ignored when qualityfile is NULL. Default: 0.1
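As a worked example of the score, consider one cluster of three SNPs. The quality and CorMean values below are made up for illustration.

```r
weight <- 0.1

# Hypothetical quality ratings and CorMean values for one cluster.
quality <- c(SNP1 = 0.90, SNP2 = 0.50, SNP3 = 0.70)
CorMean <- c(SNP1 = 0.60, SNP2 = 0.80, SNP3 = 0.75)

# score = weight * quality + (1 - weight) * CorMean
score <- weight * quality + (1 - weight) * CorMean

# The SNP with the highest score becomes the tag SNP for the cluster.
tag <- names(which.max(score))
print(score)
print(tag)
```

With weight = 0.1 the scores are 0.63, 0.77 and 0.745, so SNP2 is chosen even though it has the lowest quality rating. A larger weight shifts the choice towards high-quality SNPs.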

infofile

An optional file containing additional information about each SNP. This information is displayed in the output table. Blank lines can be used if no information is needed for a particular SNP. If the file has fewer lines than there are SNPs, the remaining SNPs at the end are assumed to have no additional information. If infofile is too long, an error occurs. Please note: entries that contain spaces may cause problems with certain spreadsheet programs (e.g. Excel), which split such an entry into multiple cells. Default: NULL

psnpfile

An optional file designating pre-specified tag SNPs (pSNPs). Each pSNP is automatically chosen as a tag SNP. This file can contain integers, which refer to the pSNP's position in the data matrix, or names, matching the entries in namefile. Default: NULL

matrixfile

An optional file for writing the SNP data matrix from the analysis. For each SNP, the matrix includes information such as cluster number, CorMean, score, and a column to indicate which SNPs are tag SNPs. The SNPs are sorted by cluster, then ranked according to score. Since pre-specified SNPs are always chosen as tag SNPs, they will be ranked at the top of their cluster. If any SNPs were dropped (e.g. monomorphisms), they are appended to the bottom of the table along with the reason for being dropped. Default: NULL

outfile

An optional file for listing the tag SNPs. A table is written to outfile containing the index number of each tag SNP, as well as their names (if namefile is specified.) Default: NULL

dplot

A logical variable specifying whether the dendrogram should be plotted. Dendrograms are never plotted for large datasets (i.e. when the number of SNPs is greater than subsetsize). Default: TRUE

main

A string specifying the main title for the dendrogram plot. Default: "Cluster Dendrogram"

cex.label

A graphical parameter for printing the dendrogram that is passed to the plot() function. (Use the R command help(par) for more information.) Default: 0.8

logfile

A file that will contain information from the run. A table is written to logfile containing each SNP sorted by cluster and tag suitability. Default: "log.txt"

stat

In the stepwise regression method, the coefficient of determination is calculated for all censored SNPs. This is the statistic that should be compared to stbound. By default, subsets are compared using the mean coefficient. However, stat can be set to any percentile in (0,1]. In this case subsets are compared by the appropriate percentile. Default: "mean"
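The statistic can be illustrated with ordinary linear regression in base R. This sketch assumes each censored (non-tag) SNP is regressed on the current set of tag SNPs. The simulated data and all object names are hypothetical, and the code in Hclust.R may compute the statistic differently.

```r
set.seed(1)
n <- 100

# Two simulated tag SNPs.
tag1 <- rbinom(n, 2, 0.4)
tag2 <- rbinom(n, 2, 0.3)
tags <- cbind(tag1, tag2)

# Three simulated censored SNPs: two noisy copies of a tag, one unrelated.
censored <- cbind(snpA = ifelse(runif(n) < 0.9, tag1, rbinom(n, 2, 0.4)),
                  snpB = ifelse(runif(n) < 0.8, tag2, rbinom(n, 2, 0.3)),
                  snpC = rbinom(n, 2, 0.5))

# Coefficient of determination of each censored SNP regressed on the tags.
r2 <- apply(censored, 2, function(y) summary(lm(y ~ tags))$r.squared)

# stat = "mean" summarises the coefficients by their mean;
# a numeric stat such as 0.25 would use that percentile instead.
stat_mean <- mean(r2)
stat_q25  <- quantile(r2, probs = 0.25)
print(r2)
```

A low percentile is a stricter summary than the mean, because it reflects how well the worst-predicted censored SNPs are captured by the tags.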



The function will return several R objects. The results of the analysis can be assigned to a variable. For example, to save the results in the variable out, use a command like:

out = SNPclust(data)

To see the results, use the "$" operator (VariableName$ObjectName). For example, after using the above command, the tag SNPs can be viewed by typing out$tagSNP. SNPclust() returns the following objects:

ObjectName

modhc

The hierarchical clustering model. Use the R command help(hclust) for more information.

clusters

A vector specifying which cluster contains each SNP.

CorMean

A vector specifying the mean squared correlation with all SNPs in the same cluster, including itself. For singleton SNPs, the value is 1.
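For illustration, CorMean as defined here could be computed from a genotype matrix and a cluster assignment as follows. The data and the cluster vector are made up, and this is a sketch rather than the code used in Hclust.R.

```r
# Hypothetical genotypes: 9 individuals by 3 SNPs.
geno <- cbind(SNP1 = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
              SNP2 = c(0, 1, 2, 0, 1, 2, 0, 1, 1),
              SNP3 = c(0, 0, 0, 1, 1, 1, 2, 2, 2))

# Hypothetical cluster assignment: SNP1 and SNP2 together, SNP3 alone.
clusters <- c(SNP1 = 1, SNP2 = 1, SNP3 = 2)

# Squared pairwise correlations between SNPs.
r2 <- cor(geno)^2

# Mean squared correlation of each SNP with all SNPs in its own cluster,
# including itself (so a singleton gets the value 1).
CorMean <- sapply(seq_along(clusters), function(i) {
  mean(r2[i, clusters == clusters[i]])
})
names(CorMean) <- names(clusters)
print(CorMean)
```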

score

A vector specifying the score for each SNP, where score = weight * quality + (1 - weight) * CorMean.

tagSNP

A vector specifying the index numbers for the selected tagSNPs. The indices refer to the cleaned data matrix and may not apply to the original matrix in datafile. (However, all tag SNP numbers in the printed tables refer to the original data matrix.)

data

The cleaned data matrix. Any cases (rows) or SNPs (column pairs) that consist entirely of missing values are removed. Any SNPs for which the frequency of minor alleles is less than minfreq are also removed. The returned data matrix is a genotype matrix in which each entry is the number of minor alleles for a particular SNP.

stb

The maximum boundary for the stepwise regression that satisfies the ntags constraint. SNPclust() only tests values of stb that are below stbound. Thus it is possible to put an upper bound on both the stepwise boundary (stbound) and the number of tag SNPs (ntags) simultaneously. If ntags is not specified (NULL), then stb=stbound.

·         The hcbound value is the user-selected cutoff for finding clusters in the H-clust method. The line in the dendrogram is drawn at 1-hcbound. A single tag SNP is selected from each cluster below this line, unless a cluster has multiple pre-specified SNPs.

·         To use only the H-clust method, set stbound to 1. This is the default setting. To refine the tag SNPs using the stepwise method, select a value for stbound smaller than 1. For the stepwise method, the line on the dendrogram is at 1-stbound.

·         If the maximum number of tags desired is n, then set ntags=n. This bound can be set in both the H-clust method and the stepwise method.

·         Sample Commands

Commented Input Example

In this example we have 3 people and 4 SNPs per person.

Person  SNP1  SNP2  SNP3  SNP4
1       A/A   C/T   T/T   C/C
2       A/G   C/C   A/A   T/T
3       A/A   _/_   A/T   _/_

Person 3 has two missing genotypes, which are coded as 3. The input file for this example would look like this:

SNP1 SNP2 SNP3 SNP4
0 1 2 0
1 0 0 2
0 3 1 3
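To check that a file in this format reads as intended, it can be loaded with base R before it is passed to SNPclust(). The sketch below writes the example to a temporary file, which stands in for a real data file. The object names are hypothetical.

```r
# Write the example input to a temporary file (stand-in for a real data file).
infile <- tempfile(fileext = ".txt")
writeLines(c("SNP1 SNP2 SNP3 SNP4",
             "0 1 2 0",
             "1 0 0 2",
             "0 3 1 3"), infile)

# Read it back; the first row supplies the SNP names.
geno <- as.matrix(read.table(infile, header = TRUE))

# Optional: convert the missing-value code 3 to NA for inspection.
geno[geno == 3] <- NA
print(geno)

# With Hclust.R sourced, either the quoted file name or the matrix
# could then be supplied as datafile, e.g. SNPclust(infile).
```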