There are three stages to using libsvm:
- create or load a problem definition (an instance of class Problem)
- train a model with given parameters
- evaluate the model
Loading a problem definition from file
A fairly standard data format for training SVM models is the svmlight format, which stores one instance per line. Each line is split into tokens by spaces. The first token is the class of the instance. Each remaining token is an index:value pair, with the feature index and its value separated by a colon. Any index that is not listed takes the default value 0, so the format is compact when many features have the value 0. For example:
-1 1:1 3:-0.535714 5:-0.692308
The above line defines an instance with:
- class value of -1
- index 1 has value 1
- index 2 has value 0 (default value)
- index 3 has value -0.535714
- index 4 has value 0 (default value)
- index 5 has value -0.692308
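To make the default-value rule concrete, here is a minimal sketch in plain Ruby that expands one line of this format into a label and a dense feature array. The helper name parse_svmlight_line and its num_features argument are hypothetical and not part of the toolkit; the toolkit does its own parsing when loading a file.

```ruby
# Hypothetical helper (not part of the toolkit): expand one svmlight line
# into its label and a dense feature array with num_features entries.
def parse_svmlight_line(line, num_features)
  tokens = line.split
  label = Float(tokens.first)
  features = Array.new(num_features, 0.0)   # unlisted indices default to 0
  tokens.drop(1).each do |token|
    index, value = token.split(":")
    # indices in the file start at 1, Ruby arrays start at 0
    features[Integer(index, 10) - 1] = Float(value)
  end
  [label, features]
end

label, features = parse_svmlight_line("-1 1:1 3:-0.535714 5:-0.692308", 5)
puts label             # prints -1.0
puts features.inspect  # prints [1.0, 0.0, -0.535714, 0.0, -0.692308]
```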
To load a problem definition from a file in this format, use:
Problem.from_file("australian_scale.txt")
Creating a problem definition in code
Sometimes your data needs to be constructed in code, or preprocessed from a different file format. The toolkit supports generating a problem definition from two arrays: an array of instance definitions and an array of instance labels.
# Sample dataset: the 'Play Tennis' dataset
# from T. Mitchell, Machine Learning (1997)
# --------------------------------------------
# Labels for each instance in the training set
# 1 = Play, 0 = Not
Labels = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
# Recoding the attribute values into range [0, 1]
Instances = [
[0.0,1.0,1.0,0.0],
[0.0,1.0,1.0,1.0],
[0.5,1.0,1.0,0.0],
[1.0,0.5,1.0,0.0],
[1.0,0.0,0.0,0.0],
[1.0,0.0,0.0,1.0],
[0.5,0.0,0.0,1.0],
[0.0,0.5,1.0,0.0],
[0.0,0.0,0.0,0.0],
[1.0,0.5,0.0,0.0],
[0.0,0.5,0.0,1.0],
[0.5,0.5,1.0,1.0],
[0.5,1.0,0.0,0.0],
[1.0,0.5,1.0,1.0]
]
The above definition creates 14 instances. Each row in the array Instances represents one instance, and simply lists the value for each feature. Each entry in the array Labels represents the label for the corresponding row in Instances.
For example, the third instance has feature values [0.5,1.0,1.0,0.0] and label 1.
Create a problem definition from these two arrays using the command:
Problem.from_array(Instances, Labels)
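The same two arrays can also be written out in the sparse file format described earlier, which is handy for saving preprocessed data. The following is a minimal sketch in plain Ruby; the helper to_svmlight_line is hypothetical and not part of the toolkit, and only two of the instances are shown.

```ruby
# Hypothetical helper (not part of the toolkit): format one instance as an
# svmlight line, leaving out features whose value is 0.
def to_svmlight_line(label, features)
  pairs = features.each_with_index
                  .reject { |value, _index| value.zero? }
                  .map { |value, index| "#{index + 1}:#{value}" }  # 1-based indices
  ([label] + pairs).join(" ")
end

instances = [[0.5, 1.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
labels = [1, 0]

instances.zip(labels).each do |features, label|
  puts to_svmlight_line(label, features)
end
# prints:
# 1 1:0.5 2:1.0 3:1.0
# 0 1:1.0 4:1.0
```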
Train a Model
The most complex part of training a model is setting the parameters to use. The parameters depend on the problem to be solved and on the chosen kernel; more complete descriptions are available on the libsvm website. The two most important parameters are svm_type and kernel_type.
svm_type determines the type of problem to solve. A typical classification problem has type Parameter::C_SVC; libsvm also supports NU_SVC, ONE_CLASS, EPSILON_SVR, and NU_SVR.
kernel_type determines the kernel to use for building the model. There is a choice of Parameter::RBF, Parameter::LINEAR, Parameter::SIGMOID, and Parameter::POLY.
Depending on the kernel type, you will also want to set one or more of:
- cost (for all kernel types)
- degree (for the polynomial type)
- gamma (for the radial-basis function and sigmoid types)
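To show where degree and gamma enter, the four kernels can be sketched in plain Ruby, following the formulas given in the libsvm documentation. This is only an illustration: libsvm computes the kernels internally, the function names here are hypothetical, and coef0 is an additional libsvm parameter not discussed above. The cost parameter does not appear because it belongs to the training objective, not to the kernel.

```ruby
# Plain-Ruby sketch of the libsvm kernel formulas (illustration only;
# the function names are hypothetical, not toolkit methods).
def dot(u, v)
  u.zip(v).sum { |a, b| a * b }
end

# LINEAR: u'v
def linear_kernel(u, v)
  dot(u, v)
end

# POLY: (gamma * u'v + coef0) ^ degree
def poly_kernel(u, v, gamma:, degree:, coef0: 0.0)
  (gamma * dot(u, v) + coef0)**degree
end

# RBF: exp(-gamma * |u - v|^2)
def rbf_kernel(u, v, gamma:)
  squared_distance = u.zip(v).sum { |a, b| (a - b)**2 }
  Math.exp(-gamma * squared_distance)
end

# SIGMOID: tanh(gamma * u'v + coef0)
def sigmoid_kernel(u, v, gamma:, coef0: 0.0)
  Math.tanh(gamma * dot(u, v) + coef0)
end

u = [0.0, 1.0, 1.0, 0.0]
v = [0.5, 1.0, 1.0, 0.0]
puts linear_kernel(u, v)         # u'v = 2.0
puts rbf_kernel(u, v, gamma: 4)  # exp(-4 * 0.25) = exp(-1)
```

A larger gamma makes the RBF kernel fall off faster with distance, so each training instance influences a smaller neighbourhood.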
Create a Parameter instance by passing values for the above as a hash. For example, to set up a simple classification problem with an RBF kernel:
params = Parameter.new(
:svm_type => Parameter::C_SVC,
:kernel_type => Parameter::RBF,
:cost => 10,
:gamma => 4
)
After creating the parameters, training the model on a given problem set is as simple as:
model = Svm.svm_train(TrainingSet, params)
Evaluate a Model
There is a convenience method to evaluate a model on a given dataset, returning a simple count of the number of errors made:
model.evaluate_dataset(TrainingSet, true)
The second argument, a boolean, is optional. Passing true makes the method print the expected and actual output of the model for each instance.
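The idea behind such an error count can be sketched in plain Ruby. This is not the toolkit's implementation: count_errors is a hypothetical helper, and the lambda below is a toy stand-in for a trained model, used only so that the example runs on its own.

```ruby
# Hypothetical helper (not the toolkit's code): count the instances for which
# the classifier's output differs from the expected label.
def count_errors(instances, labels, classifier, verbose = false)
  instances.zip(labels).count do |features, expected|
    actual = classifier.call(features)
    puts "expected #{expected}, actual #{actual}" if verbose
    actual != expected
  end
end

# Toy stand-in for a trained model: predict 1 when the third feature is 0.
toy_classifier = ->(features) { features[2].zero? ? 1 : 0 }

instances = [
  [0.0, 1.0, 1.0, 0.0],
  [0.5, 1.0, 1.0, 0.0],
  [1.0, 0.0, 0.0, 0.0]
]
labels = [0, 1, 1]

puts count_errors(instances, labels, toy_classifier)  # prints 1
```

Evaluating on the training set, as above, only shows how well the model fits data it has already seen; a separate test set gives a fairer estimate.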