Coding inapplicables

 Derek Sikes

14 June 1999


See also the response to Maddison, W. 1993. Missing data versus missing characters in phylogenetic analysis. Syst. Biol. 42: 576-581.

This PAUP log demonstrates the consequences of three ways of coding taxa for which a suite of characters is inapplicable (e.g. taxa lacking a complex structure that has, in this example, 9 characters associated with it).

Circumstances: 22 taxa in 3 genera (indicated by the numbers 1, 2, 3 on the trees below) and a character complex of 10 characters (#26-36): the first character is simply presence or absence of the complex itself, and the other nine are details of the complex. The complex can be either apomorphic (as in 1. below) or plesiomorphic (as in 2. below).

results in brief:

1. If the complex is apomorphic (states red & black below) for the entire set of taxa studied (so that its absence is plesiomorphic), then coding the inapplicables as missing (?) versus an extra state has no effect on topology. This is because, algorithmically, plesiomorphic states and missing states behave alike: neither can alter the topology of the tree.

2. However, if the complex is plesiomorphic for the set of taxa (blue, black and red states) and some of the taxa lack the complex (green on the tree below; imagine, as in this case, 3 independent losses of the complex, so that the nine characters associated with it are inapplicable for those taxa), then there is a significant difference between the two ways of handling inapplicables. As can be seen in the examples below, the missing (?) coding produces no topological differences from the first two cases (apomorphic complex with missings and with extra states) and correctly reconstructs the three independent loss events. Coding the inapplicables with an extra state greatly alters the topology, in this case reducing two monophyletic taxa to paraphyletic taxa and yielding 4 shortest trees instead of only 1. Although the trees produced did correctly depict 3 independent events, the consequences for the rest of the tree were destabilizing and profound.

Recommendation: use missing codings (?), which are less likely to introduce homoplasy. When the complex is plesiomorphic the choice is this: if there are independent losses of the complex, missings will be more likely than extra states to reconstruct them; if there was only a single loss, missings might weaken the support for the clade relative to extra-state codings, but missings cannot alter the topology whereas extra states CAN. In other words, regardless of the circumstances (complex plesiomorphic vs apomorphic, one vs many independent losses), it is safer to use missings (?) than extra states.
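The contrast between the two codings can be illustrated with a minimal sketch of Fitch parsimony for a single unordered character. This is not PAUP's implementation: the six-taxon trees, the taxon labels and the states are hypothetical, chosen only so that the two taxa lacking the complex (c1 and j2) are nested inside different genera. A '?' is treated as compatible with every observed state.

```python
def fitch(tree, states, alphabet):
    """Return (candidate ancestral states, steps) for one unordered
    character on a rooted binary tree written as nested 2-tuples."""
    if isinstance(tree, str):                       # a leaf (taxon name)
        s = states[tree]
        return (set(alphabet) if s == "?" else {s}), 0
    left, right = tree
    ls, lc = fitch(left, states, alphabet)
    rs, rc = fitch(right, states, alphabet)
    shared = ls & rs
    if shared:                                      # no change needed here
        return shared, lc + rc
    return ls | rs, lc + rc + 1                     # one extra step


def steps(tree, states):
    alphabet = {s for s in states.values() if s != "?"}
    return fitch(tree, states, alphabet)[1]


# Hypothetical trees: the first keeps each genus together (two separate
# losses); the second pulls the two complex-less taxa into one clade.
true_tree = (("a1", ("b1", "c1")), ("h2", ("i2", "j2")))
loss_tree = (("a1", "b1"), (("h2", "i2"), ("c1", "j2")))

missing = {"a1": "1", "b1": "1", "c1": "?", "h2": "3", "i2": "3", "j2": "?"}
extra = {"a1": "1", "b1": "1", "c1": "5", "h2": "3", "i2": "3", "j2": "5"}

print(steps(true_tree, missing), steps(loss_tree, missing))   # 1 1
print(steps(true_tree, extra), steps(loss_tree, extra))       # 3 2
```

Under the missing coding both trees cost one step, so the character says nothing about where c1 and j2 belong. Under the extra-state coding the tree that unites them is one step shorter, and that preference is repeated for every character of the complex.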

Some might argue that, if there was a single loss, having many characters with the extra state (complex absent) will greatly strengthen the support for that clade. However, this is simply a case of character bloating: a single character (complex absent) is turned into nine identical characters, artificially strengthening the support for the clade.
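A rough way to see the bloating is to count the columns in which the taxa lacking the complex share a state found in no other taxon. The sketch below uses a hypothetical four-taxon matrix (not the data analysed here): column 1 scores the complex as present (1) or absent (2), and columns 2-4 are details of the complex. The function name is illustrative.

```python
def columns_supporting(matrix, group, missing="?"):
    """Return the 1-based columns in which every taxon in `group` has the
    same non-missing state and no taxon outside the group has it."""
    n_chars = len(next(iter(matrix.values())))
    hits = []
    for i in range(n_chars):
        inside = {matrix[t][i] for t in group}
        outside = {matrix[t][i] for t in matrix if t not in group}
        if len(inside) == 1 and missing not in inside and not inside & outside:
            hits.append(i + 1)
    return hits


# Hypothetical data: taxa y and z lack the complex.
extra_state = {"w": "1123", "x": "1321", "y": "2555", "z": "2555"}
missing_coded = {"w": "1123", "x": "1321", "y": "2???", "z": "2???"}

print(columns_supporting(extra_state, ["y", "z"]))     # [1, 2, 3, 4]
print(columns_supporting(missing_coded, ["y", "z"]))   # [1]
```

With the extra state, the one observation "complex absent" is counted once per detail column as well as in the presence/absence column; with missings it is counted once.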

A third method of coding inapplicables, the assignment of unique states (autapomorphies) to each taxon lacking the character complex, has two apparent advantages: search algorithms will not treat the 9 states that result from the lack of the complex as 9 synapomorphies, and character mappings will not reconstruct impossible ancestral states (i.e. the computer will not 'fill in' states, say for tail color, for taxa that lack tails). The last test presented below shows that this method works for this dataset. A possible disadvantage is the aberrant branch lengths, which will reflect the numerous autapomorphic changes for all the inapplicable taxa (making these into long branches, which cannot attract each other because no two of them share any autapomorphies).
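The recoding itself is mechanical. The following is a minimal sketch, assuming the inapplicable cells start out as '?' and that one spare symbol per taxon is enough (the data file used below gives each taxon two symbols, so this is a simplification). The function name, the toy matrix and the pool of spare symbols are hypothetical.

```python
def recode_unique(matrix, inapplicable="?", spare_symbols="89ABCDEFGH"):
    """Replace each taxon's inapplicable cells with a symbol that no other
    taxon uses, so the absences cannot be read as shared states."""
    used = {s for row in matrix.values() for s in row}
    spare = [s for s in spare_symbols if s not in used]
    recoded = {}
    for taxon, row in matrix.items():
        if inapplicable in row:
            symbol = spare.pop(0)      # IndexError if the pool runs out
            row = row.replace(inapplicable, symbol)
        recoded[taxon] = row
    return recoded


# Hypothetical data: taxa y and z lack the complex (columns 2-4).
toy = {"w": "1123", "x": "1321", "y": "2???", "z": "2???"}
print(recode_unique(toy))
# {'w': '1123', 'x': '1321', 'y': '2888', 'z': '2999'}
```

Any symbols introduced this way also have to be declared in the NEXUS FORMAT SYMBOLS list, as the last data file below does for A and B.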

 

 

 

The following is the edited PAUP log of the five file types:

1. complex apomorphic with inapplicables coded as missings

2. complex apomorphic with inapplicables coded as extra states

3. complex plesiomorphic with inapplicables coded as missings

4. complex plesiomorphic with inapplicables coded as extra states

5. complex plesiomorphic with taxa given separate, autapomorphic, states for each inapplicable character.

 

P A U P *

Version 4.0b2 for Macintosh
Sunday, 13 June 1999 2:15 PM

 

This copy registered to: Chris Simon

University of Connecticut 

Processing of file "complex apo ?" begins...

This file codes the character complex as apomorphic, and all species that are inapplicable for the complex are coded as missing (?) for the states of the complex.

MATRIX

a1 11221111111111111111111111??????????
b1 14225111111115511232211111??????????
c1 14325111114115511232211111??????????
d1 14325111114115511232211111??????????
e1 11332114111115511322111111??????????
f1 11332334111115511322111111??????????
g1 11332334111115511322111111??????????
h2 225462633362422224414222322222222222
i2 225462633362422224414222322222222222
j2 225466233352422425545333223223223222
k2 225466233352422425545333223223223222
ja2 225766233352422425545333222322322322
jb2 225766233352422425545333222322322322
l2 224134424332222211111333122232232233
m2 224134424332222231111333122232232233
n2 224134424332222231111333122232232233
o3 33164551211333331111311111??????????
p3 33164551211333331111311111??????????
q3 33154115321333431113311111??????????
r3 33154115321333431113311111??????????
s3 33154115312334331113311111??????????
t3 33154115312334331113311111??????????

;

END;

Data matrix has 22 taxa, 36 characters
Valid character-state symbols: 01234567
Missing data identified by '?'
Gaps identified by '-'
Processing of file "complex apo ?" completed.

Branch-and-bound search settings:
Optimality criterion = maximum parsimony
Character-status summary:
Of 36 total characters:
All characters are of type 'unord'
All characters have equal weight
All characters are parsimony-informative
Gaps are treated as "missing"
Initial upper bound: unknown (compute via stepwise)
Addition sequence: furthest
Initial 'MaxTrees' setting = 100 (will be auto-increased by 100)
Branches collapsed (creating polytomies) if maximum branch length is zero
'MulTrees' option in effect
Topological constraints not enforced
Trees are unrooted

Branch-and-bound search completed:
Score of best tree found = 97
Number of trees retained = 1
Time used = 0.05 sec

Tree description:

Unrooted tree(s) rooted using outgroup method
Optimality criterion = maximum parsimony
Character-status summary:
Of 36 total characters:
All characters are of type 'unord'
All characters have equal weight
All characters are parsimony-informative
Gaps are treated as "missing"
Character-state optimization: Accelerated transformation (ACCTRAN)

Tree number 1 (rooted using default outgroup)

Tree length = 97
Consistency index (CI) = 0.9897
Homoplasy index (HI) = 0.0103
Retention index (RI) = 0.9961
Rescaled consistency index (RC) = 0.9858
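The indices in this log are tied together by standard definitions: CI is the minimum possible number of steps divided by the tree length, HI = 1 - CI, and RC = CI x RI. A small check of the values reported above, allowing for their four-decimal rounding (the variable names are illustrative):

```python
# Values copied from the tree description above.
length, ci, hi, ri, rc = 97, 0.9897, 0.0103, 0.9961, 0.9858

# HI is the complement of CI.
assert abs((1 - ci) - hi) < 1e-4

# RC is the product of CI and RI (the inputs are rounded, so allow slack).
assert abs(ci * ri - rc) < 2e-4

# CI = min_steps / length, so the minimum number of steps is implied.
min_steps = round(ci * length)
extra_steps = length - min_steps
print(min_steps, extra_steps)   # 96 1
```

On this reading, the single shortest tree carries one step more than the minimum the matrix allows.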

 

---------------------------------------------------------------------------

Processing of file "complex apo +" begins...

 

This file codes the character complex as apomorphic, and all species that are inapplicable for the complex are coded with an additional state for the states of the complex.

MATRIX

a1 112211111111111111111111111111111111
b1 142251111111155112322111111111111111
c1 143251111141155112322111111111111111
d1 143251111141155112322111111111111111
e1 113321141111155113221111111111111111
f1 113323341111155113221111111111111111
g1 113323341111155113221111111111111111
h2 225462633362422224414222322222222222
i2 225462633362422224414222322222222222
j2 225466233352422425545333223223223222
k2 225466233352422425545333223223223222
ja2 225766233352422425545333222322322322
jb2 225766233352422425545333222322322322
l2 224134424332222211111333122232232233
m2 224134424332222231111333122232232233
n2 224134424332222231111333122232232233
o3 331645512113333311113111111111111111
p3 331645512113333311113111111111111111
q3 331541153213334311133111111111111111
r3 331541153213334311133111111111111111
s3 331541153123343311133111111111111111
t3 331541153123343311133111111111111111

;

END;

 

Data matrix has 22 taxa, 36 characters
Valid character-state symbols: 01234567
Missing data identified by '?'
Gaps identified by '-'

Processing of file "complex apo +" completed.

Branch-and-bound search settings:
Optimality criterion = maximum parsimony
Character-status summary:
Of 36 total characters:
All characters are of type 'unord'
All characters have equal weight
All characters are parsimony-informative
Gaps are treated as "missing"
Initial upper bound: unknown (compute via stepwise)
Addition sequence: furthest
Initial 'MaxTrees' setting = 100 (will be auto-increased by 100)
Branches collapsed (creating polytomies) if maximum branch length is zero
'MulTrees' option in effect
Topological constraints not enforced
Trees are unrooted

Branch-and-bound search completed:
Score of best tree found = 107
Number of trees retained = 1
Time used = 0.05 sec

Tree description:

Unrooted tree(s) rooted using outgroup method
Optimality criterion = maximum parsimony
Character-status summary:
Of 36 total characters:
All characters are of type 'unord'
All characters have equal weight
All characters are parsimony-informative
Gaps are treated as "missing"
Character-state optimization: Accelerated transformation (ACCTRAN)

Tree number 1 (rooted using default outgroup)
Tree length = 107
Consistency index (CI) = 0.9907
Homoplasy index (HI) = 0.0093
Retention index (RI) = 0.9968
Rescaled consistency index (RC) = 0.9875

Note that the only differences between these results and those from the (?) coding (prior file) are the tree length (107 versus 97) and a slightly higher CI; the topology is identical. This is because the lack of the complex is plesiomorphic, and plesiomorphies do not influence tree topology, so they do not differ from missing codings, which also do not influence tree topology.

 

 

---------------------------------------------------------------------------

 

Processing of file "complex plesio ?" begins...

 

This file codes the character complex as plesiomorphic, and all species that are inapplicable for the complex are coded as missing (?) for the states of the complex.

MATRIX

a1 112211111111111111111111111111111111
b1 142251111111155112322111111111111111
c1 14325111114115511232211112??????????
d1 14325111114115511232211112??????????
e1 113321141111155113221111112211222122
f1 113323341111155113221111112211222122
g1 113323341111155113221111112211222122
h2 225462633362422224414222313322333233
i2 225462633362422224414222313322333233
j2 22546623335242242554533322??????????
k2 22546623335242242554533322??????????
ja2 225766233352422425545333213333333333
jb2 225766233352422425545333213333333333
l2 224134424332222211111333114444111444
m2 224134424332222231111333114444111444
n2 224134424332222231111333114444111444
o3 331645512113333311113111111166666611
p3 331645512113333311113111111166666611
q3 33154115321333431113311112??????????
r3 33154115321333431113311112??????????
s3 331541153123343311133111111166666611
t3 331541153123343311133111111166666611

;

END;

 

Data matrix has 22 taxa, 36 characters
Valid character-state symbols: 01234567
Missing data identified by '?'
Gaps identified by '-'

Processing of file "complex plesio ?" completed.

 

Branch-and-bound search settings:
Optimality criterion = maximum parsimony
Character-status summary:
Of 36 total characters:
All characters are of type 'unord'
All characters have equal weight
All characters are parsimony-informative
Gaps are treated as "missing"
Initial upper bound: unknown (compute via stepwise)
Addition sequence: furthest
Initial 'MaxTrees' setting = 100 (will be auto-increased by 100)
Branches collapsed (creating polytomies) if maximum branch length is zero
'MulTrees' option in effect
Topological constraints not enforced
Trees are unrooted

 

Branch-and-bound search completed:
Score of best tree found = 122
Number of trees retained = 1
Time used = 0.03 sec

Tree description:

Unrooted tree(s) rooted using outgroup method
Optimality criterion = maximum parsimony
Character-status summary:
Of 36 total characters:
All characters are of type 'unord'
All characters have equal weight
All characters are parsimony-informative
Gaps are treated as "missing"
Character-state optimization: Accelerated transformation (ACCTRAN)

Tree number 1 (rooted using default outgroup)
Tree length = 122
Consistency index (CI) = 0.9754
Homoplasy index (HI) = 0.0246
Retention index (RI) = 0.9903
Rescaled consistency index (RC) = 0.9660

Note that the topology is identical to that of the previous two datafiles; as with those files, only 1 MPT is found and the groups 1, 2, 3 are each monophyletic.
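The distribution of the inapplicable taxa can be read straight off the matrix for this file. The sketch below copies the rows of "complex plesio ?" and groups the taxa that carry '?' by genus, taking the trailing digit of each taxon name as the genus number (the function name is illustrative).

```python
from collections import defaultdict

MATRIX = """
a1 112211111111111111111111111111111111
b1 142251111111155112322111111111111111
c1 14325111114115511232211112??????????
d1 14325111114115511232211112??????????
e1 113321141111155113221111112211222122
f1 113323341111155113221111112211222122
g1 113323341111155113221111112211222122
h2 225462633362422224414222313322333233
i2 225462633362422224414222313322333233
j2 22546623335242242554533322??????????
k2 22546623335242242554533322??????????
ja2 225766233352422425545333213333333333
jb2 225766233352422425545333213333333333
l2 224134424332222211111333114444111444
m2 224134424332222231111333114444111444
n2 224134424332222231111333114444111444
o3 331645512113333311113111111166666611
p3 331645512113333311113111111166666611
q3 33154115321333431113311112??????????
r3 33154115321333431113311112??????????
s3 331541153123343311133111111166666611
t3 331541153123343311133111111166666611
"""


def inapplicable_by_genus(text, missing="?"):
    """Map genus number -> taxa whose row contains the missing symbol."""
    groups = defaultdict(list)
    for line in text.split("\n"):
        if not line.strip():
            continue
        taxon, states = line.split()
        if missing in states:
            groups[taxon[-1]].append(taxon)   # trailing digit = genus
    return dict(groups)


print(inapplicable_by_genus(MATRIX))
# {'1': ['c1', 'd1'], '2': ['j2', 'k2'], '3': ['q3', 'r3']}
```

One pair of taxa in each genus lacks the complex, which is the pattern of three separate losses that the tree for this file recovers.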

 

---------------------------------------------------------------------------

Processing of file "complex plesio +" begins...

This file codes the character complex as plesiomorphic, and all species that are inapplicable for the complex are coded with an additional state for the states (nine characters) of the complex.

MATRIX

a1 112211111111111111111111111111111111
b1 142251111111155112322111111111111111
c1 143251111141155112322111125555555555
d1 143251111141155112322111125555555555
e1 113321141111155113221111112211222122
f1 113323341111155113221111112211222122
g1 113323341111155113221111112211222122
h2 225462633362422224414222313322333233
i2 225462633362422224414222313322333233
j2 225466233352422425545333225555555555
k2 225466233352422425545333225555555555
ja2 225766233352422425545333213333333333
jb2 225766233352422425545333213333333333
l2 224134424332222211111333114444111444
m2 224134424332222231111333114444111444
n2 224134424332222231111333114444111444
o3 331645512113333311113111111166666611
p3 331645512113333311113111111166666611
q3 331541153213334311133111125555555555
r3 331541153213334311133111125555555555
s3 331541153123343311133111111166666611
t3 331541153123343311133111111166666611

;

END;

Data matrix has 22 taxa, 36 characters
Valid character-state symbols: 01234567
Missing data identified by '?'
Gaps identified by '-'

Processing of file "complex plesio +" completed.

 

Branch-and-bound search settings:
Optimality criterion = maximum parsimony
Character-status summary:
Of 36 total characters:
All characters are of type 'unord'
All characters have equal weight
All characters are parsimony-informative
Gaps are treated as "missing"
Initial upper bound: unknown (compute via stepwise)
Addition sequence: furthest
Initial 'MaxTrees' setting = 100 (will be auto-increased by 100)
Branches collapsed (creating polytomies) if maximum branch length is zero
'MulTrees' option in effect
Topological constraints not enforced
Trees are unrooted

 

Branch-and-bound search completed:
Score of best tree found = 147
Number of trees retained = 4
Time used = 0.45 sec

Tree description:

Unrooted tree(s) rooted using outgroup method
Optimality criterion = maximum parsimony
Character-status summary:
Of 36 total characters:
All characters are of type 'unord'
All characters have equal weight
All characters are parsimony-informative
Gaps are treated as "missing"
Character-state optimization: Accelerated transformation (ACCTRAN)

 

Tree length = 147
Consistency index (CI) = 0.8776
Homoplasy index (HI) = 0.1224
Retention index (RI) = 0.9492
Rescaled consistency index (RC) = 0.8329

50% majority rule consensus. Note that instead of 1 MPT there are now 4, and two once-monophyletic groups have broken into paraphyletic groups. This is due to the algorithm treating the absence of the complex as an apomorphy: although in this case there were 3 independent events (losses of the complex), the computer wants to put these 3 lineages together because they share apomorphies for all nine characters.
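For readers unfamiliar with the consensus method named here, the following is a minimal sketch of a 50% majority-rule consensus. The four shortest trees themselves are not reproduced in this log, so the trees below are hypothetical; each is written simply as a set of clades, and the helper names are illustrative.

```python
from collections import Counter


def majority_rule(trees, threshold=0.5):
    """Return the clades present in more than `threshold` of the trees.
    Each tree is a set of clades; each clade is a frozenset of taxa."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {c for c, n in counts.items() if n / len(trees) > threshold}


def clade(taxa):
    return frozenset(taxa)


# Four hypothetical trees over taxa A-E (not the MPTs from this analysis).
trees = [
    {clade("AB"), clade("ABC")},
    {clade("AB"), clade("ABC")},
    {clade("AB"), clade("ABD")},
    {clade("AC"), clade("ABC")},
]

consensus = majority_rule(trees)
print(sorted("".join(sorted(c)) for c in consensus))   # ['AB', 'ABC']
```

Clades AB and ABC each occur in three of the four trees and are kept; ABD and AC occur once and are dropped.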

---------------------------------------------------------------------------

Processing of file "complex plesio +(autapos)" begins...

This file codes the character complex as plesiomorphic, and all species that are inapplicable for the complex are coded with an additional state for the states (nine characters) of the complex. However, each taxon is given an autapomorphic state to prevent the unwanted consideration of the absences as synapomorphies.

MATRIX

BEGIN CHARACTERS;

DIMENSIONS NCHAR=36;
FORMAT SYMBOLS= " 0 1 2 3 4 5 6 7 8 9 A B" MISSING=? GAP=- ;
CHARSTATELABELS

1 c1, 2 c2, 3 c3, 4 c4, 5 c5, 6 c6, 7 c7, 8 c8, 9 c9, 10 c10, 11 c11,;

MATRIX

[ 10 20 30 ]
[ . . . ]

a1 112211111111111111111111111111111111
b1 142251111111155112322111111111111111
c1 143251111141155112322111125555555555
d1 143251111141155112322111126677777766
e1 113321141111155113221111112211222122
f1 113323341111155113221111112211222122
g1 113323341111155113221111112211222122
h2 225462633362422224414222313322333233
i2 225462633362422224414222313322333233
j2 225466233352422425545333227788888877
k2 225466233352422425545333228899999988
ja2 225766233352422425545333213333333333
jb2 225766233352422425545333213333333333
l2 224134424332222211111333114444111444
m2 224134424332222231111333114444111444
n2 224134424332222231111333114444111444
o3 331645512113333311113111111166666611
p3 331645512113333311113111111166666611
q3 3315411532133343111331111299AAAAAA99
r3 33154115321333431113311112AABBBBBBAA
s3 331541153123343311133111111166666611
t3 331541153123343311133111111166666611

;

END;

 

 

Processing of file "complexplesio+(autapos)" completed.
Branch-and-bound search settings:
Optimality criterion = maximum parsimony
Character-status summary:
Of 36 total characters:
All characters are of type 'unord'
All characters have equal weight
All characters are parsimony-informative
Gaps are treated as "missing"
Initial upper bound: unknown (compute via stepwise)
Addition sequence: furthest
Initial 'MaxTrees' setting = 200 (will be auto-increased by 100)
Branches collapsed (creating polytomies) if maximum branch length is zero
'MulTrees' option not in effect; only 1 tree will be saved
Topological constraints not enforced
Trees are unrooted

Branch-and-bound search completed:
Score of best tree found = 182
Number of trees retained = 1
Time used = 0.08 sec

Tree description:
Unrooted tree(s) rooted using outgroup method

Optimality criterion = maximum parsimony
Character-status summary:

Of 36 total characters:
All characters are of type 'unord'
All characters have equal weight
All characters are parsimony-informative
Gaps are treated as "missing"
Character-state optimization: Accelerated transformation (ACCTRAN)

Tree number 1 (rooted using default outgroup)

Tree length = 182
Consistency index (CI) = 0.9835
Homoplasy index (HI) = 0.0165
Retention index (RI) = 0.9903
Rescaled consistency index (RC) = 0.9740

Note that the topology is identical to that of the first three datafiles; as with those files, only 1 MPT is found and the groups 1, 2, 3 are each monophyletic. However, the branch lengths are now quite different, reflecting the numerous steps added by the autapomorphies.
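The five runs can be compared on one scale by converting each reported CI back into a count of extra steps. This sketch assumes the standard definition CI = minimum steps / tree length; the lengths and CI values are copied from the log above, and the function name is illustrative.

```python
# (file name, tree length, CI) as reported in the log
runs = [
    ("complex apo ?", 97, 0.9897),
    ("complex apo +", 107, 0.9907),
    ("complex plesio ?", 122, 0.9754),
    ("complex plesio +", 147, 0.8776),
    ("complex plesio +(autapos)", 182, 0.9835),
]


def extra_steps(length, ci):
    """Steps beyond the minimum, using CI = min_steps / length."""
    return length - round(ci * length)


summary = {name: extra_steps(length, ci) for name, length, ci in runs}
for name, extra in summary.items():
    print(f"{name:28s}{extra:3d}")
```

On this reckoning the extra-state coding of the plesiomorphic complex carries 18 extra steps, against 1 to 3 for the other four codings.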