ZAPLO

(Zero-recombinant Haplotyping)

Overview:

The haplotyping algorithms and programs presented in the section on general haplotyping assume that the loci are in linkage equilibrium. Linkage equilibrium means that the frequency of a haplotype is the product of the frequencies of its alleles. When loci span a very small genetic region, however, the assumption of linkage equilibrium may not be valid. For example, alleles of intragenic single nucleotide polymorphisms (SNPs), which may be only a few kilobases apart, are often correlated. This correlation is called linkage (gametic) disequilibrium, and ignoring it can lead to incorrect haplotyping results.
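As a small illustration of the definition above, the following Python sketch computes the haplotype frequencies expected under linkage equilibrium by multiplying allele frequencies across loci. The function name and the example allele frequencies are assumptions made for illustration; they are not part of ZAPLO.

```python
from itertools import product
from math import prod


def equilibrium_haplotype_frequencies(allele_freqs):
    """Return {haplotype: frequency} under linkage equilibrium.

    allele_freqs is a list with one dict per locus mapping allele -> frequency.
    Each haplotype is a tuple holding one allele per locus.
    """
    haplotypes = {}
    for alleles in product(*(locus.keys() for locus in allele_freqs)):
        # Under equilibrium the haplotype frequency is the product
        # of the frequencies of the alleles it carries.
        haplotypes[alleles] = prod(
            locus[allele] for locus, allele in zip(allele_freqs, alleles)
        )
    return haplotypes


if __name__ == "__main__":
    # Hypothetical allele frequencies for two SNPs with alleles 1 and 2.
    snps = [{1: 0.7, 2: 0.3}, {1: 0.4, 2: 0.6}]
    for hap, freq in equilibrium_haplotype_frequencies(snps).items():
        print(hap, round(freq, 4))
```

Any departure of observed haplotype frequencies from these products indicates linkage disequilibrium.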

We now illustrate the effect of disequilibrium on haplotyping with a simple example. Suppose we have a nuclear family trio (two parents and a single child), each typed 1/2 at two SNPs that are 2 kb apart, so that the probability of recombination is almost 0. Under the assumption of no recombination, the phased multilocus genotype of the child completely determines the phase of each parental genotype. Since each haplotype of the child is transmitted without recombination, the transmitting parent must also carry that haplotype, and specifying one haplotype in a parent determines the other. Since there are four possible phases for the child (counting the parental origin of each haplotype), there are four different haplotype configurations, as displayed in Figure 1.

Figure 1: The 4 possible haplotype configurations with non-zero likelihood under the assumption of linkage equilibrium.

Under the assumption of linkage equilibrium, each of the four configurations has the same likelihood. Now suppose that the loci are in complete disequilibrium, which means that each allele at the first locus is paired with a unique allele at the second locus, so that there are only two haplotypes instead of four. Suppose that these two haplotypes are 1 1 and 2 2. Then there is only one phase for each parent, and thus only two haplotype configurations with non-zero likelihood, as displayed in Figure 2. Thus, if we haplotype under the assumption of linkage equilibrium and choose among the four equally likely configurations, we have a 50% chance of choosing an incorrect one.

Figure 2: The 2 possible haplotype configurations with non-zero likelihood under the assumption of complete linkage disequilibrium with haplotypes 1 1 and 2 2.
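The trio example can be checked by brute-force enumeration. The Python sketch below lists the child's phases, derives each parent's haplotypes under zero recombination, and scores each configuration by the product of the four founder haplotype frequencies. This is a simplification: factors that are the same for every configuration, such as transmission probabilities, are omitted. The function names and the haplotype frequency values are assumptions chosen for illustration and do not come from ZAPLO.

```python
from itertools import product

ALLELES = (1, 2)


def complement(haplotype):
    """Other haplotype of an individual typed 1/2 at every SNP."""
    return tuple(3 - allele for allele in haplotype)


def configurations():
    """Yield (father, mother, child) haplotype pairs for the trio."""
    for paternal in product(ALLELES, repeat=2):
        maternal = complement(paternal)  # child is 1/2 at both SNPs
        child = (paternal, maternal)
        # With no recombination each parent carries the transmitted
        # haplotype, and being 1/2 at both SNPs fixes the other one.
        father = (paternal, complement(paternal))
        mother = (maternal, complement(maternal))
        yield father, mother, child


def relative_likelihood(father, mother, hap_freq):
    """Product of founder haplotype frequencies (constants omitted)."""
    likelihood = 1.0
    for haplotype in father + mother:
        likelihood *= hap_freq[haplotype]
    return likelihood


def count_nonzero(hap_freq):
    return sum(
        1 for father, mother, _ in configurations()
        if relative_likelihood(father, mother, hap_freq) > 0
    )


# Assumed frequencies: equal under equilibrium, two haplotypes under complete LD.
equilibrium = {hap: 0.25 for hap in product(ALLELES, repeat=2)}
complete_ld = {(1, 1): 0.5, (2, 2): 0.5, (1, 2): 0.0, (2, 1): 0.0}

if __name__ == "__main__":
    print("non-zero configurations, equilibrium:", count_nonzero(equilibrium))
    print("non-zero configurations, complete LD:", count_nonzero(complete_ld))
```

Under equilibrium all four configurations receive the same score whatever the allele frequencies, because each one multiplies the same four allele frequencies at each SNP.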

Running ZAPLO

ZAPLO uses LINKAGE-format pedigree and data files as input. The following command line options control the execution of the program.

-d <data file>            Specifies the name of the marker data file. ZAPLO ignores the recombination fractions in this file, since values of 0.0 are assumed, and does not use the input allele frequencies. Default value is datafile.dat.
-p <pedigree file>        Specifies the name of the pedigree file. The pedigree must be in post-Makeped format. Default value is pedfile.dat.
-S <pruning threshold>    The pruning threshold is a real number from 0.0 to 1.0 that is used to eliminate multilocus genotypes at each step of the calculation in order to reduce the computational complexity. In particular, ZAPLO considers the rank-ordered probability distribution of each individual, then retains the set of multilocus genotypes whose cumulative probability is less than or equal to the pruning threshold. If 1.0 is given, then NO multilocus genotypes are eliminated. Default value is 0.99.
-w <# markers to start>   Specifies the number of markers used in the initial group to estimate haplotype frequencies. Default value is 1.
-s <# markers to add>     Specifies the number of markers added at each step to extend the number of markers in the haplotype. Default value is 1.
-Z pg, -Z pgo             Generates a linkage-format pedigree file and data file that treat the haplotypes as super alleles. In particular, the data file, called zaplo_em_final.dat, will contain a single marker whose allele numbers correspond to the recoded haplotypes in the zaplo_results file and in the pedigree file zaplo_unordered.ped. Since an individual may now have more than one possible genotype due to phase ambiguity and/or missing information, the pedigree file uses {} to group distinct genotypes within an individual. Adding the o to the option creates the pedigree file zaplo_ordered.ped. This file contains ordered genotypes represented as A | B, with A the paternal allele and B the maternal allele. Default is disabled.
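The pruning rule described for the -S option can be sketched in Python as follows. This is only an illustration of the rule as worded above, not ZAPLO's actual implementation. The function name, the genotype labels, and the choice to always keep the most probable genotype (so that the retained set is never empty) are assumptions.

```python
def prune_genotypes(probabilities, threshold=0.99):
    """Return the genotypes retained by cumulative-probability pruning.

    probabilities maps a multilocus genotype label to its probability for
    one individual. Genotypes are ranked from most to least probable and
    kept while the running total stays at or below the threshold.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for genotype, probability in ranked:
        # Small tolerance so that a threshold of 1.0 never prunes
        # because of floating-point rounding in the running total.
        exceeds = cumulative + probability > threshold + 1e-12
        if kept and exceeds:
            break
        # Assumption: the top-ranked genotype is kept even if it alone
        # exceeds the threshold.
        kept.append(genotype)
        cumulative += probability
    return kept


if __name__ == "__main__":
    # Hypothetical distribution over four phased genotypes.
    example = {"A": 0.6, "B": 0.3, "C": 0.08, "D": 0.02}
    for threshold in (0.9, 0.99, 1.0):
        print(threshold, prune_genotypes(example, threshold))
```

Lower thresholds discard more of the low-probability tail, which reduces computation at the risk of dropping the true genotype.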

Output files

zaplo_results          Estimates of the haplotype frequencies, and estimates of pairwise linkage disequilibrium (|D'| and r^2) derived from the full haplotype frequencies.
zaplo_profile          The rank-ordered multilocus genotype probability distribution of each individual using the estimated haplotype frequencies.
zaplo_em_final.dat     Linkage-format data file with a single marker that represents the haplotypes recoded as super alleles. Created when using the -Z pg command line option.
zaplo_unordered.ped    Linkage-format pedigree file with a single marker that represents unordered genotypes as pairs of haplotypes recoded as super alleles. Sets of genotypes are represented using the delimiter {}. Created when using the -Z pg command line option.
zaplo_ordered.ped      Linkage-format pedigree file with a single marker that represents ordered genotypes as pairs of haplotypes recoded as super alleles. Sets of genotypes are represented using the delimiter {}. Created when using the -Z pgo command line option.
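For readers who want to see how pairwise disequilibrium measures can be derived from full haplotype frequencies, here is a minimal Python sketch using the standard definitions of D' and r^2 for biallelic markers. It is not taken from ZAPLO's source. The function name, the coding of alleles as 1 and 2, and the example frequencies are assumptions.

```python
def pairwise_ld(hap_freq, i, j):
    """Return (|D'|, r^2) for markers i and j, or None if either is monomorphic.

    hap_freq maps a full haplotype (tuple of alleles coded 1 or 2) to its
    frequency. The pair is first marginalized out of the full haplotypes.
    """
    p_ab = sum(f for hap, f in hap_freq.items() if hap[i] == 1 and hap[j] == 1)
    p_a = sum(f for hap, f in hap_freq.items() if hap[i] == 1)
    p_b = sum(f for hap, f in hap_freq.items() if hap[j] == 1)

    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        return None  # LD is undefined when a marker has a single allele

    d = p_ab - p_a * p_b  # departure from the equilibrium product
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max, d * d / denominator


if __name__ == "__main__":
    # Hypothetical three-SNP haplotype frequencies.
    freqs = {(1, 1, 1): 0.5, (1, 2, 1): 0.25, (2, 2, 2): 0.25}
    for i, j in [(0, 1), (0, 2), (1, 2)]:
        print(i, j, pairwise_ld(freqs, i, j))
```

Note that |D'| can equal 1 while r^2 is below 1, which happens whenever one of the four two-marker haplotypes is absent but the allele frequencies at the two markers differ.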

Example:

The following example has 5 SNPs and 50 simulated multiplex pedigrees with 15 individuals each. The data file 5.txt contains 5 SNPs with equally frequent alleles and recombination fractions of 0.0. The pedigree file pedin_typed.txt has no missing data, and pedin_untyped.txt is derived from pedin_typed.txt by blanking four founders in each pedigree. The pedigree file pedin_untyped.txt is provided to measure the effect of missing genotype information on the computational complexity and on the haplotype frequency estimates. Run zaplo with varying values of the three options -w, -s and -S and compare the results. For example:

zaplo -d 5.txt -p pedin_typed.txt -S 1.0 -w 1 -s 1
zaplo -d 5.txt -p pedin_typed.txt -S 1.0 -w 5
zaplo -d 5.txt -p pedin_untyped.txt -w 1 -s 1
zaplo -d 5.txt -p pedin_untyped.txt -w 5
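To compare several settings systematically, the command lines can be generated with a short script. The Python sketch below only builds and prints command lines using the options documented above; it does not run zaplo. The function name, the particular grid of values, and the choice to omit -s when the starting group already contains every marker (mirroring the examples above) are assumptions.

```python
from itertools import product


def build_zaplo_commands(data_file, pedigree_file, n_markers,
                         thresholds=(0.9, 0.99, 1.0), start_sizes=(1, 5)):
    """Return one zaplo argument list per (-S, -w) combination."""
    commands = []
    for threshold, start in product(thresholds, start_sizes):
        command = ["zaplo", "-d", data_file, "-p", pedigree_file,
                   "-S", str(threshold), "-w", str(start)]
        if start < n_markers:
            # Markers remain to be added, so state the step size explicitly.
            command += ["-s", "1"]
        commands.append(command)
    return commands


if __name__ == "__main__":
    for pedigree in ("pedin_typed.txt", "pedin_untyped.txt"):
        for command in build_zaplo_commands("5.txt", pedigree, n_markers=5):
            print(" ".join(command))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to subprocess.run once zaplo is installed and on the search path.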

Download: ZAPLO

Contact: zaplo@molecular-haplotype.org

References:

O'Connell JR, Weeks DE (1995) The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance, Nature Genetics 11:402-408

O'Connell JR (2000) Zero-recombinant haplotyping: applications to fine mapping using SNPs, Genet Epidemiology 19:S64-70

O'Connell JR (2001) Rapid Multipoint Linkage Analysis via Inheritance Vectors in the Elston-Stewart Algorithm, Hum Hered 51(4):226-40