Introduction to "Hclust.R": Software for Choosing Tag SNPs

This method is described in:

"Characterization of Multilocus Linkage Disequilibrium" by Rinaldo, Bacanu, Devlin, Sonpar, Wasserman and Roeder.

See also:

"Analysis of Single-Locus Tests to Detect Gene/Disease Associations" by Roeder, Bacanu, Sonpar, Zhang, and Devlin


H-clust is a simple clustering method that can be used to rapidly identify a set of tag SNPs from genotype data. The method does not require haplotype estimation. H-clust consists of two stages. The first stage uses hierarchical clustering to determine the clusters. In the second stage, a tag SNP is chosen for each cluster by finding the SNP most correlated with all the other SNPs in the cluster. Optionally, the quality of each SNP can be included in the analysis, in which case both quality and correlation affect the determination of tag SNPs.

The input for H-clust is a genotype matrix using 0, 1, 2 to denote the number of copies of a particular allele. H-clust computes a similarity matrix based on Pearson's correlation between allele counts. The distance between two SNPs is one minus the squared correlation. By default, H-clust uses the "complete linkage" method. Hierarchical clustering can be represented as a dendrogram in which any two SNPs diverge at a height equal to their distance. The clusters are obtained by declaring SNPs to be in the same cluster when they converge below a certain cutoff value. In the H-clust program, this cutoff is 1 - hcbound, where hcbound is set by the user. (This is slightly different in the stepwise version; see below.)

The second stage of H-clust finds a tag SNP to represent each cluster. This is done by scoring each SNP based on squared correlation and quality. If multiple SNPs are scored equally, then the one in the middle is chosen as the tag SNP.
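The two stages described above can be sketched in a few lines of base R. This is only an illustration on simulated genotypes, not the code in Hclust.R: the names geno, noisy and pick_tag are hypothetical, the score used here is the mean squared correlation with the other SNPs in the cluster, and SNP quality is ignored.

```r
# Illustrative sketch only; object and function names are hypothetical.
set.seed(1)
n <- 200
base1 <- rbinom(n, 2, 0.3)   # allele counts (0, 1, 2) for one group of SNPs
base2 <- rbinom(n, 2, 0.5)   # allele counts for a second, independent group

# Replace a fraction of genotypes at random to create correlated SNPs.
noisy <- function(g, rate) {
  flip <- runif(length(g)) < rate
  g[flip] <- rbinom(sum(flip), 2, 0.5)
  g
}
geno <- cbind(snp1 = base1, snp2 = noisy(base1, 0.05), snp3 = noisy(base1, 0.05),
              snp4 = base2, snp5 = noisy(base2, 0.05))

# Stage 1: distance = 1 - squared Pearson correlation, complete linkage,
# then cut the dendrogram at height 1 - hcbound.
r2 <- cor(geno)^2
hc <- hclust(as.dist(1 - r2), method = "complete")
hcbound <- 0.8
cluster <- cutree(hc, h = 1 - hcbound)

# Stage 2: within each cluster, score every SNP by its mean squared
# correlation with the other members; among tied SNPs take the middle one.
pick_tag <- function(members) {
  if (length(members) == 1) return(members)
  sub <- r2[members, members, drop = FALSE]
  score <- (rowSums(sub) - 1) / (length(members) - 1)  # drop self-correlation
  best <- which(score == max(score))
  members[best[ceiling(length(best) / 2)]]
}
tags <- sapply(split(colnames(geno), cluster), pick_tag)
print(cluster)
print(tags)
```

Calling plot(hc) draws the dendrogram, and a horizontal line at 1 - hcbound (for example with abline(h = 1 - hcbound)) shows where the clusters are cut.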

H-clust includes functionality for an optional third-stage analysis. A stepwise regression can be used to find a more parsimonious set of tag SNPs (e.g. Chapman et al. 2003, Human Heredity 56:18-31). After selecting a set of tag SNPs using H-clust, subsets of potential tag SNPs are assessed based upon their ability to jointly predict the allele counts of censored SNPs. For each censored SNP, the coefficient of determination, or Rsq, for predicting the allele count in the region is measured. With this sequential procedure, SNPs are dropped from the tag SNP set until the mean Rsq drops below the user-specified level (stbound). By default, the mean Rsq is measured, but there is an option to specify a percentile instead. For example, the stepwise regression can proceed until the tenth percentile of Rsq falls below stbound.
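As a rough illustration of the stepwise idea, the sketch below regresses each censored SNP on the current tag SNPs with lm(), summarises the resulting Rsq values, and removes tags one at a time while the summary stays at or above stbound. The functions rsq_for and drop_tags are hypothetical and the exact dropping rule in Hclust.R may differ; here, at each step, the tag whose removal hurts the criterion least is dropped, and the procedure stops just before the criterion would fall below stbound.

```r
# Illustrative sketch only; rsq_for() and drop_tags() are hypothetical names.
set.seed(2)
n <- 300
base1 <- rbinom(n, 2, 0.4)
base2 <- rbinom(n, 2, 0.5)
noisy <- function(g, rate) {
  flip <- runif(length(g)) < rate
  g[flip] <- rbinom(sum(flip), 2, 0.5)
  g
}
geno <- cbind(snp1 = base1, snp2 = noisy(base1, 0.05), snp3 = noisy(base1, 0.05),
              snp4 = base2, snp5 = noisy(base2, 0.05))

# Rsq for predicting each censored (non-tag) SNP jointly from the tag SNPs.
rsq_for <- function(geno, tags) {
  rest <- setdiff(colnames(geno), tags)
  sapply(rest, function(s) {
    fit <- lm(geno[, s] ~ geno[, tags, drop = FALSE])
    summary(fit)$r.squared
  })
}

# Backward elimination: try removing each tag in turn, keep the removal that
# leaves the highest criterion, and stop when every removal would push the
# criterion below stbound.
drop_tags <- function(geno, tags, stbound, summarise = mean) {
  while (length(tags) > 1) {
    crit <- sapply(tags, function(tg) summarise(rsq_for(geno, setdiff(tags, tg))))
    if (max(crit) < stbound) break
    tags <- setdiff(tags, names(which.max(crit)))
  }
  tags
}

start <- c("snp1", "snp2", "snp4", "snp5")
stbound <- 0.5
kept <- drop_tags(geno, start, stbound)                       # mean Rsq
kept_p10 <- drop_tags(geno, start, stbound,
                      summarise = function(x) quantile(x, 0.10))  # 10th percentile
print(kept)
print(kept_p10)
```

Passing a different summarise function is one way to express the percentile option: the mean is replaced by, for example, the tenth percentile of the Rsq values.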

The main program is implemented in the R programming language, a free implementation of the S language (similar to S-PLUS) that is available for Windows and Linux (ver. 1.7.0). Thus, the first step in running the program is to download and install a copy of R. To download R, go to the R project website and follow the necessary links to download R for Windows/Linux. Once R is installed, simply type R at the command prompt (Linux), or select R from the Start Menu (Windows). To quit R, type q() at the R command prompt. To cancel an R command, press Ctrl-C (Linux) or ESC (Windows).

Download Program


Example File

Sample input file

Sample SNP info file

Commands for selecting tag SNPs

Dendrogram (Using H-clust method)

Dendrogram (Using H-clust & Stepwise method)

Sample log file

Sample Output - analysed SNP data matrix

Sample Output - tag SNP selection

Correlation plot (Uses Entropy Blocker program)