Getting my head around text mining

In response to a tweet from one of the instructors of the Data, Analytics and Learning MOOC, I wanted to try to unpack how I think I can use text mining for my Twitter research project.

As a complete novice at this whole text mining caper, I’m still coming to terms with all the concepts behind it. To answer Carolyn’s question, I guess I see classification models working like this:

1. Take a subset of the data and classify each item in it (e.g. each individual tweet) by hand. The classification scheme I’m thinking of would have categories such as “administrative”, “presentation summary” and “marketing”.

2. Build a model which takes this hand-classified subset and learns the characteristics of the tweets in each category.

3. Apply the model to the remaining unclassified data so that each tweet is assigned the category it most likely belongs to.
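To make those three steps a bit more concrete, here is a minimal sketch in Python. It is not what LightSide does internally. It is just a tiny hand-rolled Naive Bayes word-count classifier, which is one common way of doing this kind of thing. The example tweets are invented for illustration, and the function names (tokenize, train, classify) are my own.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a tweet and split it into simple word tokens."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def train(labelled):
    """Step 2: count how often each word appears in each category."""
    word_counts = defaultdict(Counter)
    category_counts = Counter()
    for text, category in labelled:
        category_counts[category] += 1
        word_counts[category].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, category_counts, vocabulary


def classify(text, model):
    """Step 3: pick the category with the highest Naive Bayes score."""
    word_counts, category_counts, vocabulary = model
    total_tweets = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_tweets in category_counts.items():
        # Start from how common the category is in the hand-labelled subset.
        score = math.log(n_tweets / total_tweets)
        # Add-one smoothing so unseen words do not zero out a category.
        denominator = sum(word_counts[category].values()) + len(vocabulary)
        for word in tokenize(text):
            score += math.log((word_counts[category][word] + 1) / denominator)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Step 1: a hand-classified subset (invented example tweets).
labelled = [
    ("Registration desk opens at 8am in the foyer", "administrative"),
    ("Lunch is served in the main hall at noon", "administrative"),
    ("Speaker argues that open data improves learning outcomes", "presentation summary"),
    ("Keynote slides show how analytics predicts student dropout", "presentation summary"),
    ("Visit our booth for a free trial of our software", "marketing"),
    ("Get 20% off our new book with this discount code", "marketing"),
]

model = train(labelled)

# The remaining, unclassified tweets (also invented).
unlabelled = [
    "Coffee is served in the foyer at 10am",
    "Visit our booth for a free trial",
]

for tweet in unlabelled:
    print(classify(tweet, model), "<-", tweet)
```

With only six labelled tweets this is a toy, of course. A real hand-classified subset would need to be much larger before the model had anything useful to learn from.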

My take on predictive models is similar, I suppose, but I see them as more theoretical than practical. I guess models are theoretical by nature, but how they are applied is still something I’m not sure about. The basic premise of training the model and then applying it to the data is the same, but I haven’t yet seen what happens at the end of the process, i.e. the predictive side of things. This may be covered in next week’s content, so hopefully I’ll have a better understanding of the process then.

From the perspective of the Twitter analysis project that I’m working on, I don’t think the text mining tools will do what I need them to do. My aim is to categorise all the tweets that I’ve collected based on their content. The categorisation needs to be 100% accurate so that I get a true picture of what was tweeted about, and a model trained on a sample is unlikely to get every single tweet right. I might play around with LightSide as part of the data analysis, but I won’t be relying on it to categorise all the tweets.
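If I do play around with a tool, one simple way to see how far off 100% it is would be to compare its labels against my hand classifications for the same tweets. The sketch below is only an illustration of that idea: the tweets and both sets of labels are invented, and the function names (agreement, disagreements) are my own rather than anything from LightSide.

```python
def agreement(hand_labels, model_labels):
    """Return the fraction of tweets where the model matches the hand label."""
    if not hand_labels or len(hand_labels) != len(model_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == m for h, m in zip(hand_labels, model_labels))
    return matches / len(hand_labels)


def disagreements(tweets, hand_labels, model_labels):
    """List the tweets the model got wrong so they can be checked by hand."""
    return [
        (tweet, hand, model)
        for tweet, hand, model in zip(tweets, hand_labels, model_labels)
        if hand != model
    ]


# Invented example data: four tweets, my labels, and a tool's labels.
tweets = [
    "Registration desk opens at 8am",
    "Keynote shows analytics predicting dropout",
    "Visit our booth for a free trial",
    "Slides from the morning session are now online",
]
hand = ["administrative", "presentation summary", "marketing", "administrative"]
tool = ["administrative", "presentation summary", "marketing", "presentation summary"]

print("Agreement:", agreement(hand, tool))
for tweet, mine, theirs in disagreements(tweets, hand, tool):
    print("Mine:", mine, "| Tool:", theirs, "|", tweet)
```

Anything short of full agreement on a hand-checked sample would tell me that the remaining tweets still need a human look.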

This whole course has been a great introduction to data analysis and mining, and to the tools which are available. I’ll be looking for future projects where I can use them.