PROFILER
PROFILER combines the likelihood speed of VITESSE and efficient
genotype elimination algorithms to create a flexible tool to generate the
probability distribution of joint multilocus genotypes defined by sets of
individuals within the pedigree and sets of markers within the framework map.
PROFILER uses the following terminology:
Pedigree: a set of N individuals comprising the family structure, denoted by P.
Group: a subset of n individuals from P, denoted by G. Thus, G is a subset of P.
Framework map: a set of markers (on a chromosome) used for analysis on P, denoted by F.
Window: a subset of markers of F, denoted by W.
Slice: a subset of markers of W, denoted by S. Thus, S is a subset of W, which is a subset of F.
First, PROFILER uses an efficient recursive
genotype-elimination algorithm to generate the set of all possible joint
multilocus genotype vectors consisting of a multilocus genotype over the loci in
the slice S for each individual in the group G. For ease of presentation, a joint
multilocus genotype vector is referred to below simply as a genotype
vector. PROFILER then computes the likelihood of each genotype vector
defined using G and S given the data in P and
W, normalizes the likelihood by dividing by the likelihood of the data in P
and W, and then rank orders these probabilities from highest to lowest. The
result is the probability distribution of the set of genotype vectors
conditional on the pedigree structure and window. The motivation for the
slice-window paradigm is that the window creates an overlap that allows disjoint
slices to be “glued” together. Thus, although the slices may be disjoint,
their probabilities are correlated because their windows are not disjoint.
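The normalize-and-rank step described above can be sketched as follows. This is only an illustration, not PROFILER's code: the function name `rank_genotype_vectors` and the raw likelihood values are hypothetical, and the sketch assumes the enumerated genotype vectors are exhaustive and mutually exclusive, so that the likelihood of the data equals the sum of the per-vector likelihoods.

```python
def rank_genotype_vectors(likelihoods):
    """Turn raw genotype-vector likelihoods into probabilities, highest first.

    Assumption: the vectors are exhaustive and mutually exclusive, so the
    likelihood of the data is the sum of the per-vector likelihoods.
    """
    total = sum(likelihoods.values())
    if total <= 0:
        raise ValueError("total likelihood must be positive")
    probabilities = {g: lk / total for g, lk in likelihoods.items()}
    # Rank-order from most to least probable.
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw likelihoods for three ordered single-locus genotypes.
raw = {"1 | 2": 3.0e-12, "2 | 1": 3.0e-12, "1 | 1": 2.0e-12}
ranked = rank_genotype_vectors(raw)
for genotype, probability in ranked:
    print(f"{genotype}\t{probability:.4f}")
```

After normalization the probabilities sum to 1, which is what allows statements such as "the best two genotypes account for a given share of the likelihood space".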
In essence, PROFILER generates the full joint haplotype distribution of the group over the slice, given the pedigree data. Thus, if the group is the pedigree and the slice and window are both the framework map, PROFILER generates the full haplotype distribution of the pedigree.
PROFILER uses LINKAGE-format pedigree and data files as input. The following command-line options control the execution of the program.
| -1 f | REQUIRED. Tells PROFILER which likelihood algorithm to use internally. |
| -d <data file> | Specifies the name of the marker data file. Default is datafile.dat. |
| -p <pedigree file> | Specifies the name of the pedigree file. The pedigree must be in post-makeped format. Default value is pedfile.dat. |
| -w <window size> | Specifies the number of markers in the window. Default value is 1. |
| -s <slice size> | Specifies the number of markers in the slice. Default value is 1. |
| -J H | Use haplotype format for the output: paternal haplotype \| maternal haplotype. The default is genotype output, paternal allele \| maternal allele, for each marker. |
| -D i | Generate the probability distribution for each individual in the pedigree file. |
| -D U | Generate the probability distribution for the user-defined group given by a -1 in the proband column. |
| profiler_results | The file containing the probability distributions for each slice determined by the command line. |
An Interesting Example
We now demonstrate the power of PROFILER to elucidate the composition of the likelihood space for the real-data pedigree below, which consists of 41 individuals; individuals with a slash have not been genotyped. The genotype data consist of 7 polymorphic markers (the numbers of alleles at the markers are 12, 8, 8, 12, 12, 12, 9 and 6, respectively) spanning 37 cM. These markers were genotyped as part of a fine-mapping effort.

Over a genetic distance of 37 cM, the probability that a child receives a parental chromosome with no recombination is approximately 0.7 under the standard Poisson model of crossover; thus we expect significant phase constraint between loci in genotyped individuals. The 8-point likelihood calculations to generate the LOD score curve by moving the trait locus across the framework map required 44 hours to complete when analyzed in 2001. The complexity of this pedigree for VITESSE is concentrated in individual 1, who is missing genotype data: if we had genotype information on individual 1, the LOD score calculations would be reduced to 25 seconds (see below). Given the genotype data in the pedigree, individual 1 has 559,872 possible ordered 7-locus genotypes, where each ordered 7-locus genotype consists of a paternal and a maternal haplotype. This number equals 9*8*9*9*6*4*4, the product of the numbers of ordered genotypes individual 1 has at each locus. Note that without set recoding the number of possible ordered 7-locus genotypes is 28,560,000. We use the term ordered (multilocus) genotype to indicate that the parental source of each allele (haplotype) is specified. Thus, the unordered genotype 1 / 2 would a priori correspond to the two ordered genotypes 1 | 2 and 2 | 1.
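The two numbers quoted above can be checked with a few lines of Python. This is a minimal sketch under the stated Poisson model, in which the number of crossovers on an interval of d Morgans is Poisson with mean d, so the probability of no crossover is exp(-d). The variable names are illustrative only.

```python
import math

# 37 cM = 0.37 Morgans; under a Poisson crossover model the probability
# of zero crossovers on the interval is exp(-d).
map_length_morgans = 0.37
p_no_recombination = math.exp(-map_length_morgans)

# Number of ordered genotypes individual 1 has at each of the 7 loci,
# as listed in the text.
per_locus_ordered = [9, 8, 9, 9, 6, 4, 4]
n_ordered = math.prod(per_locus_ordered)

print(round(p_no_recombination, 3))  # 0.691
print(n_ordered)                     # 559872
```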
Given that the five offspring and the brother of individual 1 have genotype data and that the genetic distance spanned by the 7 markers is small, we might expect the number of biologically plausible 7-locus genotypes among these 559,872 combinatorially feasible genotypes to be much smaller. To determine the exact probability distribution of these 559,872 7-locus genotypes, we ran PROFILER with both slice and window size equal to 7 markers on a computer running Linux on an Intel chip. The total run required 204 days, averaging about 30 seconds per 7-locus likelihood.
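As a rough consistency check on the run time, the arithmetic can be sketched as below. The figures come from the text; the per-likelihood average is evidently rounded, so the two directions of the calculation agree only approximately.

```python
n_genotypes = 559_872
seconds_per_day = 86_400

# Total time if every 7-locus likelihood took exactly 30 seconds.
days_at_30s = n_genotypes * 30 / seconds_per_day

# Average time per likelihood implied by a 204-day run.
implied_seconds = 204 * seconds_per_day / n_genotypes

print(round(days_at_30s, 1))      # 194.4
print(round(implied_seconds, 1))  # 31.5
```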
An abbreviated version of the final PROFILER result is available here, and the full output is available here. The output reveals that the best two 7-locus genotypes
(4332543 | 2332151) and (2332151 | 4332543) account for 99.986% of the
likelihood space, rendering the remaining 559,870 combinations irrelevant to the
calculation. Moreover, the two most likely multilocus genotypes have identical haplotypes in opposite phase and
thus represent probabilistically inferred molecular haplotypes (the probability
distribution is symmetrical with regard to maternal and paternal haplotypes
given that the parents are untyped). What happens if we assume these
probabilistically inferred 7-locus multilocus genotypes were actually observed
data?
If we assign individual 1 the 7 single-locus unordered
genotypes from the 7-locus multilocus genotype (by editing the pedigree file),
then the computational time for the LOD score analysis is reduced from 44 hours
to 25 seconds, a speedup by a factor of 6336,
which correlates with the reduction in the number of unordered 7-locus
genotype combinations from 7500 to 1 (7500 = 5*5*5*5*3*2*2, which can be seen by
running PROFILER with window and slice equal to 1).
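The speedup factor and the count of unordered combinations follow directly from the figures in the text; a short check, with illustrative variable names:

```python
import math

exact_seconds = 44 * 3600   # 44 hours for the exact LOD score analysis
approx_seconds = 25         # after fixing individual 1's genotypes
speedup = exact_seconds / approx_seconds

# Number of unordered genotypes for individual 1 at each locus, per the text.
per_locus_unordered = [5, 5, 5, 5, 3, 2, 2]
n_unordered = math.prod(per_locus_unordered)

print(speedup)      # 6336.0
print(n_unordered)  # 7500
```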
Comparing the exact and approximate LOD scores at 30 positions gave a mean
absolute error of 0.000010 with a standard deviation of 0.000015, and a maximum
absolute error of 0.00004. The very
small standard deviation implies that the approximation was uniformly good
across the framework map. Moreover, the errors in the LOD score which includes
both trait and genotype data correlated well with the error incurred excluding
0.02% of the likelihood space. Note that because we entered genotypes in the
pedigree file and not haplotypes, we actually summed over 16 7-locus genotypes (with four heterozygous
and three homozygous loci there are 2^4 = 16 possible a priori phases), not just
the two most likely, so that we were just a tad bit wasteful.
Conclusion: assigning individual 1 these genotypes eliminated 559,856 7-locus
genotype combinations that bogged down the calculation for 44 hours while adding
essentially zero information. In fact, considering that the LOD scores are
generally reported with a precision of one or two decimal places, this
approximation can be considered “exact”.
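The count of 16 phases can be reproduced by enumerating every ordered 7-locus genotype compatible with the unordered genotypes implied by the two haplotypes. This is a small sketch, assuming single-digit allele labels as printed in the haplotype strings; the helper `ordered_phases` is hypothetical and not part of PROFILER.

```python
from itertools import product

paternal = "4332543"
maternal = "2332151"

# Loci at which the two haplotypes carry different alleles.
het_loci = [i for i, (a, b) in enumerate(zip(paternal, maternal)) if a != b]


def ordered_phases(hap_a, hap_b):
    """Enumerate all ordered multilocus genotypes that are consistent with
    the unordered single-locus genotypes implied by the two haplotypes."""
    # A heterozygous locus can be phased two ways; a homozygous locus one way.
    per_locus = [((a, b), (b, a)) if a != b else ((a, b),)
                 for a, b in zip(hap_a, hap_b)]
    phases = set()
    for choice in product(*per_locus):
        pat = "".join(p for p, _ in choice)
        mat = "".join(m for _, m in choice)
        phases.add((pat, mat))
    return phases


phases = ordered_phases(paternal, maternal)
print(len(het_loci), len(phases))  # 4 16
print(559_872 - len(phases))       # 559856
```

Both (4332543 | 2332151) and (2332151 | 4332543) appear among the 16 enumerated phases, and subtracting 16 from 559,872 gives the 559,856 eliminated combinations.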
Of course, the most pertinent question is whether we can deduce that
individual 1 has these inferred molecular haplotypes without computing the full
probability distribution since spending 204 days to reduce a 44 hour computation
to 25 seconds has no value. We now work backwards and start with the amount of
information in the single-locus distributions, which ignores intermarker
correlation. The PROFILER probability distribution of the ordered genotypes for
each of the 7 loci, M1 through M7, for individual 1 using the marker as both the
window and the slice is here. Notice that the best unordered genotype at 5 of
the 7 loci has probability greater than 0.9 and the smallest probability is 0.75
at marker M5. Thus, even omitting the information from flanking loci these
distributions are concentrated on the same single unordered genotypes that
comprise the two 7-locus haplotypes. Table 1 represents the results of
increasing the window size from 1 to 6 with regard to computational time and the
smallest of the best unordered n-locus genotype probabilities from
each sliding window. The most likely haplotypes from different windows agreed on
their overlap with the results from the full distribution. In fact, we can
extract even more information by considering individual 1 and his brother
individual 25 jointly to discover that they actually share (to 0.9998
probability) the same two haplotypes identical by descent.
Exercises
Run PROFILER using the data file datain.txt and pedigree file pedin.txt from the above example. The pedigree file has a -1 in the proband column for individual 1.
The command profiler -1 f -d datain.txt -p pedin.txt -w 1 -s 1 -D i generates the probability profile for each individual in the pedigree. Increase the slice and window sizes and examine how the output changes. Add the option -J H to view the output in haplotype form.
Edit the pedigree file and add a -1 only in the proband column of individual 5, who has no offspring but has a genotyped father and four genotyped siblings; the resulting pedigree file is pedin5.txt. Now run profiler -1 f -d datain.txt -p pedin5.txt -w 4 -s 4 -D U -J H. What can you conclude from the probability distribution? What inference would you make about the 7-point results? If you have a fast machine, try increasing the slice and window one marker at a time. A run with 5 markers takes about 15 minutes.
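When stepping through several window and slice sizes, it can help to build the command lines programmatically. The sketch below only assembles the argument list from the options documented above; the helper `profiler_command` is hypothetical, and the check that the slice is no larger than the window simply reflects the definition of a slice as a subset of the window.

```python
import shlex


def profiler_command(data_file, pedigree_file, window, slice_size,
                     distribution, haplotype_output=False):
    """Assemble a profiler command line from the documented options.

    Hypothetical helper: it only builds the argument list and does not
    run the program.
    """
    if slice_size > window:
        raise ValueError("the slice must not be larger than the window")
    if distribution not in ("i", "U"):
        raise ValueError("distribution must be 'i' or 'U'")
    cmd = ["profiler", "-1", "f",
           "-d", data_file,
           "-p", pedigree_file,
           "-w", str(window),
           "-s", str(slice_size),
           "-D", distribution]
    if haplotype_output:
        cmd += ["-J", "H"]
    return cmd


# Command lines for the second exercise with window and slice sizes 4 and 5.
commands = [
    shlex.join(profiler_command("datain.txt", "pedin5.txt", k, k, "U", True))
    for k in (4, 5)
]
for line in commands:
    print(line)
```

Each list returned by `profiler_command` could be passed to `subprocess.run` once the profiler executable is on the search path.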
Additional Exercises
1. Does individual 4, the spouse of individual 1, also have a probabilistically inferred molecular haplotype? Try window and slice sizes of 3.
2. Generate the joint probability distribution for individuals 1 and 25 by setting their proband values to -1 as mentioned above.
Executables
Currently only executables for Solaris and Linux are available. These executables are compressed.
Further Analysis
This pedigree will also be used as an example in our computational repository to compare the performance of current likelihood algorithms.