This document and was originally written as an aid for an MSc student project. It describes data structures and algorithmic structure for ID3. A companion document Machine learning with variables describes how ID3 can be modified to deal with comparisons between attributes as well as single attribute conditions.

- These documents are being released in connection with:
- An introduction to Artificial Intelligence
- Janet Finlay and Alan Dix, UCL Press, 1996.

The ID3 algorithm is dealt with in chapter 4, Machine Learning, pages 90-95 This chapter also describes Query-by-Browsing an experimental intelligent database interface, which uses ID3 as its learning mechanism.

There are two principal data structures, one for the examples and one for the decision tree. The examples are easy, being a list or array of values. These values are of types given in a separate 'schema' of attribute types.

example_type == ARRAY of attribute_value example_set == ARRAY of example_type schema == RECORD nos_of_attributes : integer attribute_types : ARRAY of attribute_type attribute_type == [ discrete | continuous | binary | string ... ]

Where the examples are of known classes, the class may be stored as an extra attribute value. The attribute values may either be a variant type (string or number ...) or held in some single form. For example, the values could be strings where numbers are converted when required.

The decision tree is substantially more complex. Each node may either be a branch (with and associated decision criterion) or a leaf. The leaf has stored with it the class it decides for and a certainty factor. This certainty factor will be one if only examples of the specified class are found at that node, but less than one if the examples were mixed.

node == EITHER leaf : RECORD class : class_type certainty : real OR branch: RECORD criterion : criteria_type yes_branch : node_ptr no_branch : node_ptr

These variants may be stored as two different data types with some form of initial identifier (or two derived classes from a base class in C++). Alternatively, both variants may be stored using a single record type (useful for PASCAL file storage) and a special 'LEAF' value of the `criterion` used to indicate LEAF nodes.

In standard ID3, the criteria are of the form 'attribute is equal to (or is less than or greater than) some value'. As we wish to extend this, critera can be of two forms:

criteria_type == EITHER inter : RECORD attr : attribute_no relation : { =, >, <, etc. } val : attribute_value OR intra : RECORD attr1 : attribute_no relation : { = or > } attr2 : attribute_no

Again, this can be represented either with variant records or
by having a tag (inter or intra) and having both a '`val`' and '`attr2`'
field of which only one is used.

To be a purist, I would say that the use of true variant recoreds is better, but programs often use the accumulated form of storage to make dynamic allocation and input/output to files easier.

Assuming we have a decision tree and an example, we can classify it by simply running down the tree:

classify( example: example_type, root: node_ptr ) returns class_type current_pos = root WHILE current_pos is not a leaf IF example satisfies current_pos^.criteria THEN current_pos := current_pos^.yes_branch ELSE current_pos := current_pos^.no_branch { ends with current_pos on a leaf } return current_pos^.class

It would be reasonable to get the classification procedure to return the certainty factor as well. This is a simple modification to the above.

Actually learning the decision tree is more complicated. Basically at each stage, one takes a set of examples, chooses a decision criterion, splits the examples into those that do or do not satisfy the criterion and then builds new trees for those new subsets of examples. The process stops when no further progress can be made.

buildtree ( examples: example_set ) returns node_ptr IF examples all of the same class THEN return a pointer to a LEAF where the class is the one with the most examples and the certainty factor is the number in the class divided by the total number of examples. ELSE choose criteria C IF choosing process fails THEN return pointer to LEAF as above ELSE split_examples( C, examples, yes_examples {VAR} , no_examples {VAR} ) LET yes_b = buildtree( yes_examples ) AND no_b = buildtree( no_examples ) return a pointer to a BRANCH with criteria = C yes_branch = yes_b no_branch = no_b END {if choosing fails} END {if examples of same class}

The only two sub-processes are the splitting of the example set and the choosing of the best decision criterion. The former is simply a matter of looking at each example in turn, seeing whether it satisfies the criterion and then adding it to the appropriate subset.

This leaves us with choosing an appropriate criterion. This may either be a single attribute criterion or a paired attribute one:

choose_criteria ( examples: example_set, crit: VAR criteria_type, found: VAR bool) found = true choose_single ( examples, s_crit, , s_ent, s_found ) choose_pair ( examples, p_crit, , p_ent, p_found ) IF s_found AND p_found THEN IF p_ent < s_ent THEN crit = p_crit ELSE crit = s_crit END { if p_ent < s_ent } ELSE IF s_found { only single criteria found } crit = s_crit ELSE IF p_found { only pair criteria found } crit = p_crit ELSE { no criteria found } found = false END { if s_found p_found etc. }

The procedures '`choose_single`' and '`choose_pair`' search for suitable
criteria and if found return the entropy so that they can be compared.

choose_single ( examples: example_set, crit: VAR criteria_type, ent: VAR real, found: VAR bool) ent = BIG_NUMBER best_crit = none found = false FOR EACH attribute A depending of attribute type: CASE discrete or bool FOR EACH attribute value AV LET try_crit = INTER with attr = A relation = equals val = AV LET try_ent = entropy(try_crit,examples) IF try_ent < best_ent AND useful split THEN found = true best_ent = try_ent best_crit = try_crit END { if try_ent } CASE string { no sensible criteria } CASE continuous { should look for < or > some value, but don't bother for now }

The '`useful split`' condition should check that the split on the attribute
actually makes progress, that is that all the examples
don't go the same way.
This can be coded as a separate test or the entropy calculating function
can return a special value.

The paired search is similar just looks at each attribute pair.
These pairs should be different and as the < and > test yield the same
information it is silly to consider the attributes in both orders.
This can be achieved for numerically coded attribute numbers,
by having a nested loop where the inner loop starts from `A1+1`.

choose_single ( examples: example_set, crit: VAR criteria_type, ent: VAR real, found: VAR bool) ent = BIG_NUMBER best_crit = none found = false FOR EACH attribute pair A1 A2 with the same type depending of attribute type: CASE discrete, bool or string LET try_crit = INTER with attr1 = A1 relation = equals attr2 = A2 LET try_ent = entropy(try_crit,examples) IF try_ent < best_ent AND useful split THEN found = true best_ent = try_ent best_crit = try_crit END { if try_ent } CASE continuous FOR relation R in '=' and '<' LET try_crit = INTER with attr1 = A1 relation = R attr2 = A2 LET try_ent = entropy(try_crit,examples) IF try_ent < best_ent AND useful split THEN found = true best_ent = try_ent best_crit = try_crit END { if try_ent }

Finally, we need to calculate the entropy. We assume we have the procedure which splits an example set into those which do or do not satisfy a given criterion. We also assume a procedure which given an example set returns an array with the count of the number of examples in each class (array length two for a simple binary choice).

entropy ( crit : criteria_type, examples: example_set yes_cts, no_cts: ARRAY [1..nos_classes] of real yes_examples, no_examples : example_set split_examples( crit, examples, yes_examples , no_examples ) count_classes ( yes_examples, yes_cts ) count_classes ( no_examples, no_cts ) total = number of examples ent = 0 non_zero = 0 FOR c = 1 .. nos_classes py = yes_cts[c] / total pn = no_cts[c] / total ent := ent - py * log(py) - pn * log (pn) IF ( py > 0 ) OR ( px > 0 ) THEN non_zero = non_zero + 1 return entropy = ent { at end of loop if 'non_zero' is < 2, this is a useless split } { the useless test could be identical to this, or entropy could } { have var parameters, one of which was a useless flag, or one } { have a line like: IF non_zero < 2 THEN ent = BIG_NUMBER }

The above is PASCAL-like pseudo-code, but the algorithms could also be programmed in prolog. This has the advantage that the 'dynamic typing' of the attribute values is far easier. Also variants can be easily dealt with:

decide(leaf(Class,T),Ex,Class). decide(branch(Crit,Y,N),Ex,Class) :- satisfies(Crit,Ex), decide(Y,Ex,Class). decide(branch(Crit,Y,N),Ex,Class) :- not satisfies(Crit,Ex), decide(N,Ex,Class). satisfies(eq(L,R),Ex) :- getval(L,Ex,Val), getval(R,Ex,Val). satisfies(lt(L,R),Ex) :- getval(L,Ex,cts(Lv)), getval(R,Ex,cts(Rv)), Lv < Rv. getval(attr(A),Ex,V) :- !, getattr(A,Ex,V). getval(V,Ex,V). getattr(A,[],error). getattr(A,Ex,error) :- A < 0. getattr(1,[V|Rest],V). getattr(A,[V|Rest],V) :- A1 is A - 1, getattr(A1,Rest,V).

This is of course the easy bit: deciding from a given tree. However, the tree building should not be too bad, the worst bit being finding the best out of a set of possibilities. Prolog is good at finding one successful thing out of a set, but not the best. Below are two ways of doing it.

In each of the following, 'findbest' takes five arguments:

findbest(Clause,Ret,Val,BestRet,BestVal)

It repeatedly tries to satisfy '`Clause ', and looks for the
solution which maximises 'Val'.
The best value of 'Val' is returned in 'BestVal' and
'BestRet' is the value of 'Ret' with the same bindings
in place as when the optimal 'Clause' was found.
For example, the following finds the pair of numbers
adding up to ten with the biggest product:
`

smallnumber(0). smallnumber(1). smallnumber(2). smallnumber(3). smallnumber(4). smallnumber(5). smallnumber(6). smallnumber(7). smallnumber(8). smallnumber(9). smallnumber(10). sum10(X,Y) :- smallnumber(X), smallnumber(Y), X+Y =:= 10. ?- findbest(sum10(A,B),(A,B),A*B,R,V).returns- findbest(sum10(_1,_2),(_1,_2),_1*_2,(5,5),25).

The first bit of code uses cuts, assert and retract liberally.

retract1(C) :- retract(C),!. best1(R,V) :- bestsofaris(R,V),!. findbest(_,_,_,_,_) :- asserta(bestsofaris(badvalue,-10000)), fail. findbest(C,R,V,_,_) :- call(C), best1(CR,CV), CV < V, retract1(bestsofaris(CR,CV)), asserta(bestsofaris(R,V)), fail. findbest(_,_,_,_,_) :- best1(badvalue,-10000), !, retract1(bestsofaris(badvalue,-10000)), fail. findbest(_,_,_,BR,BV) :- best1(BR,BV), !, retract1(bestsofaris(BR,BV)).

The second version uses the meta-logical predicate '`setof`'.
This is not available on all Prologs.
Also if the search space is very large setof may run out
of memory.

findbest(Clause,Ret,Val,BestRet,BestVal) :- setof(Clause,(Ret,Val),Set), bestofset(Set,badvalue,-10000,BestRet,BestVal), BestVal =\\= -10000. bestofset([],Ret,Val,Ret,Val). bestofset([(CR,CV)|RestOfSet],Ret,Val,Ret,Val) :- CV <= Val. bestofset([(CR,CV)|RestOfSet],Ret,Val,CR,CV) :- CV > Val.

N.B. none of the above prolog is tested!!!