Working with models in LightSide

Most of the exercises for this week were concerned with building models in LightSide and comparing their performance.

The first exercise dealt with using different feature spaces within the model and seeing how this affected performance. The initial model, built on unigrams alone, achieved an accuracy of 75.9% and a kappa value of 0.518. That's OK, but would adding bigrams and trigrams as features improve these results? They might, by providing more context for each word and thus reducing the number of incorrect predictions. Including these extra features gave a slight improvement: an accuracy of 76.5% and a kappa value of 0.530. However, increasing the number of features risks creating a model that overfits the training data and doesn't generalise to other data sets. To counter this, LightSide provides a Feature Selection tool, which restricts the model to the most predictive features (3,500 in this case). Using this select group of features produced a statistically significant improvement in the quality of the model.
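
LightSide does all of this through its point-and-click interface, but the same comparison can be sketched in a few lines of scikit-learn. This is only a rough analogue, not what LightSide runs internally: the classifier, the train/test split, and the placeholder texts and labels variables are assumptions for illustration.

```python
# Rough scikit-learn analogue of the LightSide experiment described above.
# The classifier choice, split, and placeholder data are assumptions only;
# LightSide's own feature extraction and learner settings will differ.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

def evaluate(texts, labels, ngram_range, k_best=None):
    """Train a simple text classifier and report accuracy and kappa."""
    steps = [("vectorise", CountVectorizer(ngram_range=ngram_range, binary=True))]
    if k_best is not None:
        # Keep only the k most predictive features, similar in spirit to
        # LightSide's Feature Selection tool.
        steps.append(("select", SelectKBest(chi2, k=k_best)))
    steps.append(("classify", LogisticRegression(max_iter=1000)))
    model = Pipeline(steps)

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return (accuracy_score(y_test, predictions),
            cohen_kappa_score(y_test, predictions))

# Example comparison (texts and labels are placeholders; k_best must not
# exceed the number of extracted features):
# evaluate(texts, labels, ngram_range=(1, 1))               # unigrams only
# evaluate(texts, labels, ngram_range=(1, 3))               # + bigrams/trigrams
# evaluate(texts, labels, ngram_range=(1, 3), k_best=3500)  # + feature selection
```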
