SVMs are supervised learning models. They are applied to analyze data by means of their associated learning algorithms and are used in classification and regression tasks (mostly in classification).
In this group of machine learning techniques, data items are plotted as points in n-dimensional space, with n as the number of features, where the value of each feature is the value of a particular coordinate. SVMs are used for both linear and non-linear classification. For non-linear classification, SVMs apply non-linear kernels which map the inputs into higher-dimensional feature spaces; this mapping is used because it makes the task of separating the data items easier. In fact, SVMs construct a hyperplane (or a set of hyperplanes, depending on the number of classes) in a high-dimensional space to classify data points into two (or several) classes.
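The separating-hyperplane idea above can be made concrete with a minimal sketch in plain Python: a trained linear SVM classifies a point by the sign of its decision function w·x + b. The weight vector and bias below are invented values, not the output of any actual training run.

```python
# Sketch of how a trained linear SVM separates points with a hyperplane.
# The weight vector w and bias b would normally come from a fitted model;
# the values here are made up for illustration.

def linear_decision(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w = [2.0, -1.0]   # hypothetical learned weights
b = -0.5          # hypothetical learned bias

print(linear_decision(w, b, [1.0, 0.0]))  # 1  (positive side)
print(linear_decision(w, b, [0.0, 2.0]))  # -1 (negative side)
```

The hyperplane here is the set of points where the score is exactly zero; a non-linear kernel would replace the inner product with a kernel function, implicitly moving the points into a higher-dimensional space where such a linear boundary suffices.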
These hyperplanes can be used for tasks such as classification or regression, depending on the type of input data [30]. The current task is a multi-class classification, which assumes that there are more than two classes and each item must be assigned to one and only one class. In this task, the number of classes is twenty; each class represents an emoji which is assigned to a tweet. SVMs are binary classifiers; therefore, to perform multi-class classification, one of the strategies such as Crammer-Singer, one-vs-the-rest or one-vs-one must be applied; these techniques are described in the following paragraphs. In the one-vs-one approach, a binary classifier is constructed for each pair of classes, so the model is run several times. To predict the label of a data item, the selected class is the one with the highest vote count, i.e. the class that the data item was assigned to most often across the runs of the classification process.
In the case of two classes receiving the same number of votes for an input, the class with the higher aggregate classification confidence is selected (the aggregate classification confidence for a class is the sum of the confidences of each binary classifier for that class). On the other hand, the one-vs-the-rest strategy, which is the approach of our choice in this task, involves running several classifiers (equal to the number of classes), and in each run one class is trained against all the rest. The advantage of the latter technique over the former is that it has complexity of degree n, compared to the n² complexity of the one-vs-one method. Finally, in the Crammer-Singer approach, a joint objective over all classes is optimized. Among the three methods described, one-vs-the-rest is the least computationally expensive technique [9].
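The one-vs-one voting rule with the aggregate-confidence tie-break described above can be sketched as follows. The class labels and confidence values are invented for illustration; in practice they would come from the individual pairwise binary classifiers.

```python
from collections import defaultdict

def one_vs_one_predict(pairwise_results):
    """pairwise_results: list of (winning_class, confidence) tuples,
    one per binary classifier in the one-vs-one scheme."""
    votes = defaultdict(int)
    confidence = defaultdict(float)
    for cls, conf in pairwise_results:
        votes[cls] += 1
        confidence[cls] += conf  # aggregate classification confidence
    best = max(votes.values())
    tied = [c for c, v in votes.items() if v == best]
    # tie-break: highest aggregate confidence among the tied classes
    return max(tied, key=lambda c: confidence[c])

# Hypothetical outcomes of three pairwise classifiers for classes A, B, C:
print(one_vs_one_predict([("A", 0.6), ("B", 0.9), ("A", 0.7)]))  # A (2 votes)
# A tie on votes falls back to aggregate confidence:
print(one_vs_one_predict([("A", 0.6), ("B", 0.9)]))  # B (higher confidence)
```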
How one-vs-the-rest assigns a class to an input:
One-versus-rest classifiers divide an m-class problem into m binary problems. In this setting, the binary classifier that produces a positive output for a sample indicates the output class. In the event of several classes producing positive results, the most common approach is to select the positively assigning classifier with the maximum confidence [2].
SVMs hyper-parameters:
There are two sets of parameters, model parameters and hyper-parameters. Hyper-parameters are the parameters which must be set manually before running the estimator. In the scikit-learn library [24], which is used for this project, hyper-parameters are passed as arguments to the constructor of the estimator classes.
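The one-vs-the-rest assignment rule described above (pick the positive classifier, breaking multi-positive cases by maximum confidence) can be sketched in plain Python. The class labels and decision values below are invented for illustration.

```python
def one_vs_rest_predict(scores):
    """scores: dict mapping class label -> decision value of that
    class's binary (class-vs-rest) classifier."""
    positives = {c: s for c, s in scores.items() if s > 0}
    # If several binary classifiers fire, keep the most confident one;
    # if none fire, fall back to the highest score overall.
    pool = positives if positives else scores
    return max(pool, key=pool.get)

# Hypothetical decision values of three one-vs-rest classifiers:
print(one_vs_rest_predict({"joy": 0.8, "sad": -0.3, "fire": 0.2}))  # joy
```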
Typical examples of hyper-parameters for SVMs include C, kernel and gamma. In the case of a linear SVM, the only hyper-parameter that needs to be tuned is the SVM soft-margin constant, C > 0 [3].
Tuning the hyper-parameters:
Hyper-parameter tuning is done by “running multiple trials in a single training job” [6]. During the hyper-parameter tuning process, the performance of the model needs to be analyzed; however, using the test data for this purpose is risky and not advisable, because the model can become biased towards the test data. Therefore, the set used for the tuning process, the “trial set” or “development set”, must be kept separate from the test set. In each trial, a candidate set of hyper-parameters is first chosen for training, and the (aggregate) accuracy is observed during the process, so that the best set of hyper-parameters can be selected at the end [6]. To avoid the manual hyper-parameter tuning process, automated techniques have been introduced. In this task, grid search was applied for tuning the parameters of the LinearSVM.
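The trial loop described above, choosing C by accuracy on a held-out development set, can be sketched as follows. The `evaluate_on_dev` function is a toy stand-in for training a linear SVM with the given C and measuring its accuracy on the development set; its formula is invented so the sketch is self-contained.

```python
# Toy stand-in for "train with this C, then score on the development set".
# In the real pipeline this would fit a LinearSVM and compute accuracy;
# here we pretend accuracy simply peaks near C = 1.
def evaluate_on_dev(C):
    return 1.0 - abs(1.0 - C) * 0.1

candidate_Cs = [0.01, 0.1, 1.0, 10.0]
best_C = max(candidate_Cs, key=evaluate_on_dev)
print(best_C)  # 1.0
```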
Grid search:
To implement the grid search, a set of parameters and arguments must be set manually: an estimator (LinearSVM in our case), a set of hyper-parameters which depends on the chosen estimator (for LinearSVM, a set of values for C must be specified) and a performance metric; if no metric is specified, the estimator's default performance metric is applied. Grid search performs both training and testing on each possible combination of the hyper-parameters; in this task, because there is just one hyper-parameter to be tuned, it runs multiple trials on the different specified values of C. This method evaluates the performance of each training trial on the validation set which is held out of the k folds that the data was partitioned into (in a k-fold cross-validation scenario).
Finally, the grid search algorithm outputs the hyper-parameters of the setting that achieves the highest score as the best hyper-parameters [8]. In this task, “GridSearchCV” from scikit-learn is used for hyper-parameter tuning, since we would like to apply the cross-validation technique.
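The grid-search-with-cross-validation procedure can be sketched in plain Python, without scikit-learn, to show what GridSearchCV does internally: for each candidate C, average the validation score over k folds and keep the best. The `toy_scorer` below stands in for fitting a LinearSVM on the training folds and scoring on the held-out fold; its formula is invented so the sketch runs on its own.

```python
def kfold_indices(n, k):
    """Split range(n) into k consecutive folds of (roughly) equal size."""
    folds, size = [], n // k
    for i in range(k):
        start = i * size
        end = start + size if i < k - 1 else n
        folds.append(list(range(start, end)))
    return folds

def grid_search(data, labels, candidate_Cs, train_and_score, k=3):
    """Return the C with the best mean validation score over k folds.
    train_and_score(C, train_idx, val_idx) stands in for fitting a
    model with this C on train_idx and scoring it on val_idx."""
    n = len(data)
    best_C, best_score = None, float("-inf")
    for C in candidate_Cs:
        scores = []
        for fold in kfold_indices(n, k):
            held_out = set(fold)
            train = [i for i in range(n) if i not in held_out]
            scores.append(train_and_score(C, train, fold))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_C, best_score = C, mean
    return best_C

# Toy scorer: pretend cross-validated accuracy peaks at C = 1.
toy_scorer = lambda C, train_idx, val_idx: 1.0 / (1.0 + abs(C - 1.0))
print(grid_search(list(range(9)), None, [0.01, 0.1, 1.0, 10.0], toy_scorer))  # 1.0
```

GridSearchCV follows the same shape but also refits the estimator on the full training data with the winning hyper-parameters and exposes the per-trial scores.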