PROFILER  

PROFILER combines the likelihood speed of VITESSE and efficient genotype elimination algorithms to create a flexible tool to generate the probability distribution of joint multilocus genotypes defined by sets of individuals within the pedigree and sets of markers within the framework map. PROFILER uses the following terminology:

Pedigree: a set with N individuals comprising the family structure, denoted by P.

Group: a subset of n individuals from P, denoted by G. Thus, G is a subset of P.

Framework map: a set of markers (on a chromosome) used for analysis on P, denoted by F.

Window: a subset of markers of F, denoted by W.

Slice: a subset of markers of W, denoted by S. Thus, S is a subset of W, which in turn is a subset of F.

First, PROFILER uses an efficient recursive genotype-elimination algorithm to generate the set of all possible joint multilocus genotype vectors, each consisting of a multilocus genotype over the loci in the slice S for every individual in the group G. For ease of presentation, a joint multilocus genotype vector is referred to below simply as a genotype vector. PROFILER then computes the likelihood of each genotype vector defined by G and S given the data in P and W, normalizes it by dividing by the likelihood of the data in P and W, and ranks the resulting probabilities from highest to lowest. The result is the probability distribution of the set of genotype vectors conditional on the pedigree structure and window. The motivation for the slice-window paradigm is that the window creates an overlap that allows disjoint slices to be “glued” together. Thus, although the slices may be disjoint, their probabilities are correlated because their windows are not disjoint.
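The normalize-and-rank step described above can be sketched in a few lines of Python. This is only an illustration, not PROFILER's code: the function name rank_genotype_vectors, the genotype labels, and the likelihood values are all hypothetical, and the sketch assumes the supplied genotype vectors are exhaustive when no data likelihood is given.

```python
def rank_genotype_vectors(likelihoods, data_likelihood=None):
    """Turn genotype-vector likelihoods into probabilities, ranked high to low.

    likelihoods: dict mapping a genotype-vector label to its likelihood.
    data_likelihood: likelihood of the data in P and W. If omitted, it is
    taken as the sum over all supplied genotype vectors (an assumption that
    only holds when the dict is exhaustive).
    """
    if data_likelihood is None:
        data_likelihood = sum(likelihoods.values())
    if data_likelihood <= 0:
        raise ValueError("data likelihood must be positive")
    probs = {g: lik / data_likelihood for g, lik in likelihoods.items()}
    # Rank order from highest to lowest probability.
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical likelihoods for three genotype vectors (not PROFILER output).
example = {"1 | 2": 3.0e-12, "2 | 1": 3.0e-12, "1 | 1": 2.0e-12}
ranked = rank_genotype_vectors(example)
for genotype, prob in ranked:
    print(f"{genotype}\t{prob:.4f}")
```

Dividing by the likelihood of the data is what makes the ranked values sum to one over the full set of genotype vectors.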

In essence, PROFILER generates the full joint haplotype distribution of the group over the slice given the pedigree data. Thus, if the group is the whole pedigree and the slice and window are both the framework map, PROFILER will generate the full haplotype distribution of the pedigree.

Running PROFILER

PROFILER uses LINKAGE-format pedigree and data files as input. The following command line options control the execution of the program.

-1 f                  REQUIRED. Tells PROFILER which likelihood algorithm to use internally.
-d <data file>        Specifies the name of the marker data file. Default value is datafile.dat.
-p <pedigree file>    Specifies the name of the pedigree file. The pedigree must be in post-makeped format. Default value is pedfile.dat.
-w <window size>      Specifies the number of markers in the window. Default value is 1.
-s <slice size>       Specifies the number of markers in the slice. Default value is 1.
-J H                  Use haplotype format for the output: paternal haplotype | maternal haplotype. The default is genotype output: paternal allele | maternal allele for each marker.
-D i                  Generate the probability distribution for each individual in the pedigree file.
-D U                  Generate the probability distribution for the user-defined group, marked by a -1 in the proband column.
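For scripted runs, the options above can be assembled programmatically. The following Python helper is a hypothetical convenience wrapper, not part of PROFILER: the function name, its parameters, and the executable name profiler are assumptions, and it only emits the flags exactly as they are written in the list above. It also assumes that a -D option is always passed and that the slice size may not exceed the window size, since a slice is a subset of its window.

```python
import shlex


def build_profiler_command(data_file="datafile.dat", pedigree_file="pedfile.dat",
                           window=1, slice_size=1, distribution="i",
                           haplotype_output=False, executable="profiler"):
    """Return the argument list for a PROFILER run (hypothetical helper)."""
    if slice_size < 1 or window < 1:
        raise ValueError("window and slice sizes must be at least 1")
    if slice_size > window:
        # The slice S is a subset of the window W.
        raise ValueError("slice size cannot exceed window size")
    if distribution not in ("i", "U"):
        raise ValueError("distribution must be 'i' or 'U'")
    # '-1 f' is copied verbatim from the option list, where it is marked REQUIRED.
    args = [executable, "-1", "f",
            "-d", data_file, "-p", pedigree_file,
            "-w", str(window), "-s", str(slice_size),
            "-D", distribution]
    if haplotype_output:
        args += ["-J", "H"]
    return args


cmd = build_profiler_command("datain.txt", "pedin5.txt", window=4, slice_size=4,
                             distribution="U", haplotype_output=True)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run without going through a shell, which avoids quoting problems with unusual file names.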

Output files

profiler_results         The file containing the probability distributions for each slice determined by the command line. 

An Interesting Example

We now demonstrate the power of PROFILER to elucidate the composition of the likelihood space for the real data pedigree below, which consists of 41 individuals; individuals with a slash have not been genotyped. The genotype data consist of 7 polymorphic markers (the numbers of alleles at the markers are 12, 8, 8, 12, 12, 12, 9 and 6, respectively) spanning 37 cM. These markers were genotyped as part of a fine mapping effort.

Over a genetic distance of 37 cM the probability that a child receives a parental chromosome with no recombination is 0.7 under the standard Poisson model of crossover; thus we expect significant phase constraint between loci in genotyped individuals. The 8-point likelihood calculations to generate the LOD score curve by moving the trait locus across the framework map required 44 hours to complete when analyzed in 2001. The complexity of this pedigree for VITESSE is concentrated in individual 1, who is missing genotype data: if we had genotype information on individual 1, the LOD score calculations would take only 25 seconds (see below). Given the genotype data in the pedigree, individual 1 has 559,872 possible ordered 7-locus genotypes, where each 7-locus ordered genotype consists of a paternal and a maternal haplotype. This number equals 9*8*9*9*6*4*4, the product of the numbers of ordered genotypes individual 1 has at each locus. Note that without set recoding the number of possible ordered 7-locus genotypes is 28,560,000. We use the term ordered (multilocus) genotype to indicate that the parental source of each allele (haplotype) is specified. Thus, the unordered genotype 1 / 2 would a priori correspond to the two ordered genotypes 1 | 2 and 2 | 1.
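The figures in this paragraph are easy to check. The short Python sketch below assumes that the "standard Poisson model" means the number of crossovers over a map distance of d Morgans is Poisson with mean d, so the probability of no recombination is exp(-d); the helper ordered_genotypes is a hypothetical name introduced only to illustrate the ordered/unordered distinction.

```python
import math

# Probability of no crossover over 37 cM = 0.37 Morgans, assuming the number
# of crossovers is Poisson distributed with mean equal to the map distance.
distance_morgans = 0.37
p_no_recombination = math.exp(-distance_morgans)
print(f"P(no recombination) = {p_no_recombination:.2f}")

# Number of ordered genotypes for individual 1 at each of the 7 loci,
# as quoted in the text, and their product.
ordered_per_locus = [9, 8, 9, 9, 6, 4, 4]
total_ordered = math.prod(ordered_per_locus)
print(f"ordered 7-locus genotypes = {total_ordered:,}")


def ordered_genotypes(allele_a, allele_b):
    """List the ordered (paternal | maternal) genotypes of an unordered one."""
    if allele_a == allele_b:
        return [(allele_a, allele_b)]
    return [(allele_a, allele_b), (allele_b, allele_a)]


print(ordered_genotypes(1, 2))
```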

Given that the five offspring and the brother of individual 1 have genotype data and that the genetic distance spanned by the 7 markers is small, we might expect the number of biologically plausible 7-locus genotypes among these 559,872 combinatorially feasible genotypes to be much smaller. To determine the exact probability distribution of these 559,872 7-locus genotypes we ran PROFILER with both slice and window size equal to 7 markers on a computer running Linux on an Intel chip. The total run required 204 days, averaging about 30 seconds per 7-locus likelihood.

An abbreviated version of the final PROFILER result is available here and the full output is available here. The output reveals that the two best 7-locus genotypes, (4332543 | 2332151) and (2332151 | 4332543), account for 99.986% of the likelihood space, rendering the remaining 559,870 combinations irrelevant to the calculation. Moreover, the two most likely multilocus genotypes consist of the same two haplotypes in opposite phase and thus represent probabilistically inferred molecular haplotypes (the probability distribution is symmetrical with regard to maternal and paternal haplotypes because the parents are untyped). What happens if we assume these probabilistically inferred 7-locus genotypes were actually observed data?

Table 1: PROFILER results for each window size: the worst of the best genotype probabilities (%) and the run time.
Window size	Worst best (%)	Time
1	75.49	< 1 s
2	96.11	< 1 s
3	96.76	2 s
4	98.23	1 m
5	99.51	52 m
6	99.52	42 h
7	99.99	204 d

If we assign individual 1 the 7 single-locus unordered genotypes taken from the best 7-locus genotype (by editing the pedigree file), then the computational time for the LOD score analysis is reduced from 44 hours to 25 seconds, a speedup by a factor of 6336. This correlates with the reduction of the number of unordered 7-locus genotype combinations from 7500 to 1 (7500 = 5*5*5*5*3*2*2, which can be seen by running PROFILER with window and slice equal to 1). Comparing the exact and approximate LOD scores at 30 positions gave a mean absolute error of 0.000010 with a standard deviation of 0.000015, and a maximum absolute error of 0.00004. The very small standard deviation implies that the approximation was uniformly good across the framework map. Moreover, the errors in the LOD score, which includes both trait and genotype data, correlated well with the error incurred by excluding 0.02% of the likelihood space. Note that because we entered genotypes in the pedigree file and not haplotypes, we actually summed over 16 7-locus genotypes (four heterozygous and three homozygous genotypes have 16 possible a priori phases), not just the two most likely, so we were slightly wasteful. Conclusion: assigning individual 1 these genotypes eliminated 559,856 7-locus genotype combinations that bogged down the calculation for 44 hours while adding essentially zero information. In fact, considering that LOD scores are generally reported with a precision of one or two decimal places, this approximation can be considered “exact”.
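The counts quoted in this paragraph can be reproduced from the two best haplotypes reported earlier. The Python sketch below is an illustration only: the function name ordered_phases is hypothetical, and it assumes each character of the haplotype strings 4332543 and 2332151 is one single-digit allele label.

```python
from itertools import product

# The two haplotypes of the best 7-locus genotype reported for individual 1.
hap_a = "4332543"
hap_b = "2332151"

# Unordered single-locus genotypes, as they would be entered in the pedigree file.
unordered = [tuple(sorted(pair)) for pair in zip(hap_a, hap_b)]
n_heterozygous = sum(1 for x, y in unordered if x != y)


def ordered_phases(unordered_genotypes):
    """Enumerate every ordered (paternal, maternal) haplotype pair that is
    consistent with a list of unordered single-locus genotypes."""
    per_locus = [[(x, y)] if x == y else [(x, y), (y, x)]
                 for x, y in unordered_genotypes]
    phases = []
    for combo in product(*per_locus):
        paternal = "".join(p for p, _ in combo)
        maternal = "".join(m for _, m in combo)
        phases.append((paternal, maternal))
    return phases


phases = ordered_phases(unordered)
print(n_heterozygous, "heterozygous loci ->", len(phases), "ordered phases")
print("eliminated combinations:", 559872 - len(phases))
print("speedup:", 44 * 3600 / 25)
print("unordered combinations:", 5 * 5 * 5 * 5 * 3 * 2 * 2)
```

Each heterozygous locus doubles the number of phases while a homozygous locus contributes only one, which is why four heterozygous loci give 16 ordered genotypes.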

Of course, the most pertinent question is whether we can deduce that individual 1 has these inferred molecular haplotypes without computing the full probability distribution, since spending 204 days to reduce a 44 hour computation to 25 seconds has no value. We now work backwards and start with the amount of information in the single-locus distributions, which ignore intermarker correlation. The PROFILER probability distribution of the ordered genotypes for each of the 7 loci, M1 through M7, for individual 1, using each marker as both the window and the slice, is here. Notice that the best unordered genotype at 5 of the 7 loci has probability greater than 0.9 and that the smallest probability is 0.75, at marker M5. Thus, even without the information from flanking loci, these distributions are concentrated on the same single unordered genotypes that comprise the two 7-locus haplotypes. Table 1 shows the results of increasing the window size from 1 to 6, giving the computational time and the smallest of the best unordered n-locus genotype probabilities from each sliding window. The most likely haplotypes from different windows agreed on their overlap with the results from the full distribution. In fact, we can extract even more information by considering individual 1 and his brother, individual 25, jointly to discover that they actually share (with probability 0.9998) the same two haplotypes identical by descent.
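The "worst of the best" summary in Table 1 can be sketched as follows. This is a guess at the bookkeeping, not PROFILER's own code: it assumes the window slides one marker at a time across the 7-marker framework map, and the per-window probabilities below are placeholders chosen only to match the description in the text (five values above 0.9 and a minimum of 0.7549 at M5), not actual PROFILER output.

```python
def sliding_windows(n_markers, window_size):
    """Return the marker indices (1-based) of each window, assuming the
    window advances one marker at a time."""
    if not 1 <= window_size <= n_markers:
        raise ValueError("window size must be between 1 and n_markers")
    return [tuple(range(start, start + window_size))
            for start in range(1, n_markers - window_size + 2)]


def worst_of_best(best_probabilities):
    """Smallest of the per-window best genotype probabilities."""
    return min(best_probabilities)


for w in range(1, 8):
    print(w, len(sliding_windows(7, w)), "windows")

# Placeholder best probabilities for the 7 single-marker windows M1..M7.
# Only the 0.7549 at M5 is taken from Table 1; the rest are illustrative.
best_single_locus = [0.95, 0.97, 0.99, 0.93, 0.7549, 0.88, 0.96]
print("worst of the best:", worst_of_best(best_single_locus))
```

Reporting the minimum gives a conservative summary: if even the weakest window concentrates most of its probability on one genotype, every window does.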

Exercises

Run PROFILER using the data file datain.txt and pedigree file pedin.txt from the above example. The pedigree file has a -1 in the proband column for individual 1.

profiler -1 f -d datain.txt -p pedin.txt -w 1 -s 1 -D i

This generates the probability profile for each individual in the pedigree. Increase the slice and window sizes and examine how the output changes. Add the option -J H to view the output in haplotype form.

Edit the pedigree file so that the only -1 in the proband column is for individual 5, who has no offspring but has a genotyped father and four genotyped siblings; the resulting pedigree file is pedin5.txt. Now run profiler -1 f -d datain.txt -p pedin5.txt -w 4 -s 4 -D U -J H. What can you conclude from the probability distribution? What inference would you make about the 7-point results? If you have a fast machine, try increasing the slice and window one marker at a time; 5 markers takes about 15 minutes.

Additional Exercises

1. Does individual 4, the spouse of individual 1, also have a probabilistically inferred molecular haplotype? Try window and slice sizes of 3.

2. Generate the joint probability distribution for individuals 1 and 25 by setting their proband values to -1 as mentioned above.

Executables

Currently only executables for Solaris and Linux are available. These executables are compressed.

PROFILER Linux

PROFILER Solaris

 

Further Analysis

This pedigree will also be used as an example in our computational repository to compare the performance of current likelihood algorithms.