SVMs are supervised learning models whose associated learning algorithms analyze data for classification and regression tasks (mostly classification). In this family of machine learning techniques, data items are plotted as points in n-dimensional space, where n is the number of features and the value of each feature is the value of a particular coordinate.
SVMs are used for both linear and non-linear classification. For non-linear classification, SVMs apply non-linear kernels that map the inputs into higher-dimensional feature spaces, where the data items become easier to separate. In effect, SVMs construct a hyperplane (or a set of hyperplanes, depending on the number of classes) in a high-dimensional space to separate the data points into two (or several) classes. These hyperplanes can then be used for tasks such as classification or regression, depending on the type of input data [30].
The current task is a multi-class classification, which assumes that there are more than two classes and that each item must be assigned to one and only one class. Here, the number of classes is twenty, each class representing an emoji assigned to a tweet. SVMs are binary classifiers; therefore, to perform multi-class classification, one of the strategies Crammer-Singer, one-vs-the-rest, or one-vs-one must be applied; these techniques are described in the following paragraphs.
In the one-vs-one approach, a binary classifier is constructed for each pair of classes. To predict the label of a data item, the selected class is the one the item was assigned to most often across these pairwise classifications. If two classes receive the same number of assignments for an input, the class with the higher aggregate classification confidence is selected (the aggregate classification confidence of a class is the sum of the binary classification confidences for that class). The one-vs-the-rest strategy, on the other hand, which is our approach of choice in this task, trains one classifier per class (i.e., as many classifiers as there are classes), each separating one class from all the rest. The advantage of the latter technique over the former is its complexity of order n, compared to the n² complexity of the one-vs-one method. Finally, in the Crammer-Singer approach, a joint objective over all classes is optimized. Among the three methods described, one-vs-the-rest is the least computationally expensive technique [9].
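To make the n versus n² comparison concrete, here is a minimal sketch using scikit-learn's multiclass meta-estimators on a synthetic dataset (the dataset and parameter values are illustrative, not the ones used in this task):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy data with 4 classes: one-vs-the-rest fits 4 binary classifiers,
# while one-vs-one fits 4*(4-1)/2 = 6 pairwise classifiers.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=6, n_classes=4,
                           random_state=0)

ovr = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)

print(len(ovr.estimators_))  # 4  (one classifier per class)
print(len(ovo.estimators_))  # 6  (one classifier per pair of classes)
```

With twenty classes, as in this task, the gap widens to 20 classifiers for one-vs-the-rest against 190 pairwise classifiers for one-vs-one.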
How one-vs-the-rest assigns a class to an input:
One-versus-rest classifiers divide an m-class problem into m binary problems. In this setting, the binary classifier that produces a positive output for a sample indicates the output class. When several classifiers produce positive results, the most common approach is to select the class of the positively assigning classifier with the maximum confidence [2].
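This decision rule can be sketched without any library: each of the m binary classifiers reports a signed confidence, and the class whose classifier is most confident wins (the class names and scores below are made-up numbers for illustration):

```python
# Hypothetical signed confidences from m binary one-vs-rest classifiers
# (positive = that classifier claims the sample for its class).
scores = {"joy": 1.3, "sadness": -0.4, "anger": 0.7}

# Even though several classifiers are positive ("joy" and "anger"),
# the class with the maximum confidence is selected.
predicted = max(scores, key=scores.get)
print(predicted)  # joy
```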
SVM hyper-parameters:
There are two sets of parameters: model parameters and hyper-parameters. Hyper-parameters are the parameters that must be set manually before running the estimator. In the scikit-learn library [24], which is used in this project, hyper-parameters are passed as arguments to the constructor of the estimator classes. Typical examples of hyper-parameters for SVMs include C, kernel, and gamma. In the case of a linear SVM, the only hyper-parameter that needs to be tuned is the SVM soft-margin constant, C > 0 [3].
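In scikit-learn, these hyper-parameters are indeed constructor arguments; a minimal sketch (the specific values of C, kernel, and gamma here are arbitrary, not tuned):

```python
from sklearn.svm import SVC, LinearSVC

# Kernelized SVM: C, kernel and gamma are all set in the constructor.
clf = SVC(C=1.0, kernel="rbf", gamma="scale")

# Linear SVM: the soft-margin constant C is the main knob to tune.
lin = LinearSVC(C=0.5)

print(clf.get_params()["kernel"])  # rbf
print(lin.get_params()["C"])       # 0.5
```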
Tuning the hyper-parameters:
Hyper-parameter tuning is done by “running multiple trials in a single training job” [6]. During the tuning process, the performance of the model must be analyzed; however, using the test data for this purpose is risky and inadvisable, because the model can become biased towards the test data. Therefore, the set used for tuning, the “trial set or development set”, must be kept separate from the test set. In each trial, a candidate set of hyper-parameters is used for training and the (aggregate) accuracy is observed, so that the best set of hyper-parameters can be selected at the end [6].
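A minimal sketch of such a trial loop, holding out a development set separate from the test set (the dataset and the candidate values of C are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Split off a development (trial) set; the final test set stays untouched.
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.25, random_state=0)

best_C, best_acc = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):          # candidate hyper-parameter values
    model = LinearSVC(C=C, max_iter=10000).fit(X_train, y_train)
    acc = model.score(X_dev, y_dev)       # accuracy on the dev set only
    if acc > best_acc:
        best_C, best_acc = C, acc

print(best_C)  # the C with the highest dev-set accuracy
```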
To avoid this manual hyper-parameter tuning process, automated techniques have been introduced. In this task, grid search was applied to tune the parameters of the linear SVM.
Grid search:
To implement grid search, a set of parameters and arguments must be set manually: the estimator (a linear SVM, for instance), a set of hyper-parameters that depends on the chosen estimator (for a linear SVM, a set of values for C must be specified), and a performance metric. If no metric is specified, the estimator's default performance metric is applied.
Grid search performs both training and evaluation on each possible combination of the hyper-parameters; in this task, because there is just one hyper-parameter to tune, it runs multiple trials over the specified values of C. The method evaluates the performance of each training trial on the validation set held out from the k folds into which the data was partitioned (in the k-fold cross-validation scenario). Finally, the grid-search algorithm outputs the hyper-parameters of the setting that achieves the highest score as the best hyper-parameters [8].
In this task, “GridSearchCV” from scikit-learn is used for hyper-parameter tuning, since we would like to apply the cross-validation technique.
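A minimal sketch of this setup, assuming a grid over C only and 5-fold cross-validation (the dataset and candidate values are illustrative, not those of the emoji task):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Estimator, hyper-parameter grid, scoring metric, and k for k-fold CV
# are all passed to GridSearchCV up front.
search = GridSearchCV(
    estimator=LinearSVC(max_iter=10000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)  # e.g. {'C': 0.1}, whichever scores highest
```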