Saturday, December 3, 2011

Customised evaluators in svm_toolkit

A new version of svm_toolkit today, and I've bumped this to version 1.0.0, as it includes the main functionality I need for now. The biggest change is support for customised methods for evaluating a model against a dataset.

Why are different evaluations methods needed? For many real-world datasets, simply measuring overall accuracy will not produce the most useful models. This is usually the case when the dataset has an imbalanced distribution of classes. For example, consider a test dataset with 900 positive instances and 100 negative instances. If the model puts every instance into the positive class, it gets 90% correct. However, the model is useless, as it gives no information about the negative class.

The geometric mean is one way to give each class equal prominence, irrespective of the frequency of instances of that class. The geometric mean first computes the accuracy of each class separately, multiplies them together and takes the nth root (where n is the number of classes). In the example above, the geometric mean would come out as zero, the worst score possible, because none of the negative instances are correctly classified by the model. Precision and recall are evaluation measures which highlight the performance of one class, usually the minority class.

The cross-validation process must select a model using an evaluation technique appropriate to the domain if the model is going to be useful in later tests. For example, in information retrieval the best choice may be to find a model with the best recall of the minority class, as the model is more likely to pick out all members of the class of interest, even if these are mixed with many instances from the other class.

Four evaluation types are currently provided with the svm_toolkit:
  1. Evaluator::OverallAccuracy, to compute the proportion of correctly classified instances.
  2. Evaluator::GeometricMean, to compute the geometric mean of the accuracy of each separate class.
  3. Evaluator::ClassPrecision(class), to compute the precision of the model for the given class.
  4. Evaluator::ClassRecall(class), to compute the recall of the model for the given class.
The first two are class names, and the second are methods which generate class definitions for the given class label. They are used by passing them to the new :evaluator parameter in evaluate_dataset and cross_validation_search, as follows:
# evaluate based on Overall Accuracy
puts model.evaluate_dataset(Dataset, :evaluator => Evaluator::OverallAccuracy)
# evaluate based on geometric mean of accuracies for separate classes
puts model.evaluate_dataset(Dataset, :evaluator => Evaluator::GeometricMean)
# evaluate based on precision of class labelled 1
puts model.evaluate_dataset(Dataset, :evaluator => Evaluator::ClassPrecision(1))
# evaluate based on recall of class labelled 1
puts model.evaluate_dataset(Dataset, :evaluator => Evaluator::ClassRecall(1))

User-Defined Evaluators

It is easy to define a new evaluator class. The class must support the following methods:
  1. add_result(actual, prediction), which is called with the actual and predicted classification for every instance in the dataset.
  2. value, to retrieve a number giving the performance of the model.
  3. better_than?(eval), to compare this evaluation with a given one (which may be nil), returning true if the model yielding this evaluation should be kept over the given one.
For example, the class for OverallAccuracy is implemented as:
# Measures accuracy as the percentage of instances 
# correctly classified out of all the available instances.
class OverallAccuracy
attr_reader :num_correct

def initialize
@num_correct = 0
@total = 0

# As each instance result is added, store the total number
# of instances and the number of correct predictions.
def add_result(actual, prediction)
@total += 1
@num_correct += 1 if prediction == actual

# This object is better than given object, if the
# given object is an instance of nil, or the accuracy
# is better
def better_than? other
other.nil? or self.value > other.value

# Return the accuracy as a percentage.
def value
100.0 * @num_correct / @total

def to_s
"Overall accuracy: #{value}%"
Finding the Best Model

The process of finding a good model with a radial-basis function (RBF) kernel is fairly straightforward. Once the training and cross-validation sets have been created, and lists of cost and gamma values created to search over, the best model can be found using:
best_model = Svm.cross_validation_search(
TrainingData, CrossSet,
Costs, Gammas,
:show_plot => true,
:evaluator => Evaluator::GeometricMean

The :show_plot parameter is given if you want a contour plot of the grid search results. The :evaluator parameter is used to give the class name of the evaluation technique to use in evaluating the models on the cross-validation set.

Tuesday, November 22, 2011

Demo program for svm_toolkit

Support vector machines are a complex learning algorithm, and simple visualisations can help in understanding the effects of the different parameters. The libsvm website contains an applet, which lets you see how changing parameters can change the models created for a simple two-dimensional classification task.

The svm_toolkit now includes a program svm-demo, which recreates the libsvm applet as a simple application, coded in jruby. The main display, as with libsvm's applet, allows the user to click on a window to enter instances of different classes. Some controls on the right allow the user to set up a kernel and parameters: there are separate fields for entering the cost and gamma values, and a spinner box for selecting a value for degree. Once set up, a click on train updates the display, colouring the background to show which regions are considered as falling in which class.

Below are two screen shots. Both contain the same set of instances: a cluster of blue dots surrounded by some green ones. And in both we use the RBF kernel.

The first has a value for cost of 1, and a value for gamma of 1:

Notice how the decision boundary is curved, but is still not very different from a straight line. The boundary makes three errors: one blue dot is in the green region, and two green dots are in the blue region. We can force the boundary to take more account of errors, by increasing cost, and also to take on a more complicated shape, by increasing gamma. The second example has a value for cost of 10, and a value for gamma of 10:

Notice how the area marked for blue is now almost circular: as the value for gamma is increased, the RBF kernel can generate increasingly complex decision boundaries. Now, the model does not make any errors on the training data.

The gamma parameter is only used by the RBF and sigmoid kernels, and the degree parameter is only used by the polynomial kernel. Below is an example of using the polynomial kernel:

When you install svm_toolkit, svm-demo is made available on the same path as jruby, and so it should be easy to run from the command line.