Text mining is the next type of data analysis that we’re looking at in the Data, Analytics and Learning MOOC. I’m looking forward to the next couple of weeks, as I think that some of these tools and techniques might be useful for my research project, which is based on analysing tweets. Text mining is all about trying to find patterns in large collections of text, and using these patterns as a basis for identifying data that is worth investigating further. It’s this pattern-finding in textual data that interests me, as that’s the vision that I’ve got for my Twitter research project.
One of the subareas of text mining is analysing the collaborative learning process that occurs in online courses via the discussion forums. This analysis involves modelling conversational interactions between students, and using those models to find out what it is about conversations that makes them valuable for online learning. Based on this understanding it’s then possible to design interventions to support learning in online settings. Analysing conversations in online courses draws on knowledge from a number of fields, such as education, psychology, and sociolinguistics. This knowledge is used to determine the cognitive processes associated with collaborative learning, investigate what conversational interactions look like, and build models of how psychological signals are revealed through language. All this ultimately allows the development of models showing where these processes are happening during interactions.
An example of how these models can be used in learning analytics research is assessing some of the reasons for attrition along the way in MOOCs. The models are based on the analysis of the posts in discussion forums, both from the point of view of individual students and from the overall tone of individual threads. The negativity and positivity of the posts and threads are calculated, and then survival modeling is carried out to determine the probability that a student will have dropped out of the course by the following week.
This sort of detailed modeling is out of scope for my research project, but some of the aspects of conversation analysis could be useful, as many of the interactions between Twitter users could be characterised as conversations. At this stage I think I’ll be learning some useful stuff over the next couple of weeks.
Most of the exercises for this week were concerned with building models in LightSide and comparing their performance.
The first exercise dealt with using different feature spaces within the model and seeing how this affected its performance. The initial model, using unigrams, resulted in an accuracy of 75.9% and a kappa value of 0.518. This is OK, but would including bigrams and trigrams as features improve these results? They might, by providing further context for each word, thus reducing the number of incorrect predictions. By including these extra features, there was a slight improvement in the model – an accuracy of 76.5% and a kappa value of 0.530. However, by increasing the number of features there is a risk of creating a model which overfits the data, and can’t be applied to other data sets. To overcome this there is a Feature Selection tool, which only uses the 3,500 (in this case) most predictive features in the model. The result of using this select group of features was a statistically significant improvement in the quality of the model.
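To make the terms above a bit more concrete, here is a rough Python sketch of what unigram, bigram and trigram features look like, and how accuracy and kappa relate to each other. This is not how LightSide does it internally. The function names, the whitespace tokenisation and the toy labels are all my own assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text, max_n=3):
    """Unigrams, then bigrams, then trigrams (naive whitespace tokeniser)."""
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features


def accuracy(actual, predicted):
    """Proportion of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def cohen_kappa(actual, predicted):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(actual)
    p_o = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement: probability both pick the same label independently.
    p_e = sum(actual_counts[label] * predicted_counts[label]
              for label in actual_counts) / (n * n)
    if p_e == 1:
        # Only one label is ever used, so kappa is undefined; treat as 0.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


# Toy example (made-up labels, not the course data set).
features = extract_features("not very good")
actual = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg"]
print(features)
print(accuracy(actual, predicted), cohen_kappa(actual, predicted))
```

The bigram "not very" and the trigram "not very good" show the extra context that single words can’t capture. Kappa comes out lower than accuracy because it discounts the agreement you would expect by chance alone.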
As I was watching one of the text mining lecture videos this morning, I experienced a “lightbulb moment” with regard to using LightSide. Up until now I didn’t think that LightSide would be useful for my Twitter research project, as I wasn’t interested in building models; I just wanted to analyse the content of the tweets. However, I now realise that I don’t need to use the model-building features of LightSide for my Twitter data. I can just use it to extract features and get a count of the number of times each word (or group of words) appears in all the tweets. This is the type of analysis that I’m interested in. I was really pleased to have found a tool to help me with this part of the data analysis.
I couldn’t wait to get home and try using LightSide on some of the tweets that I’ve already collected. I had to do a bit of a clean-up of the Excel file to make it ready to import into LightSide, but once that was done everything worked fine. The image below shows the LightSide workspace once I’d extracted the features.
Once I had the Feature Table prepared, I exported it as a .csv file, and was able to use the Sum feature in Excel to quickly tally the occurrence of each term. I’m going to play around with LightSide a bit more to explore the other features that can be extracted, but I’m pretty sure that it can do exactly what I need it to do. Time to crunch some data!
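For anyone who would rather script the tallying step than do it in Excel, here is a minimal Python sketch. It assumes the exported .csv has one row per tweet and one column per term, with the cells holding counts (or 0/1 flags). The layout, the column names, the `label` column and the `tally_features` function are my own assumptions for illustration, so check them against your actual export.

```python
import csv
import io
from collections import Counter


def tally_features(csv_file, skip_columns=("label",)):
    """Sum each feature column across all rows of a feature table."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        for name, value in row.items():
            if name in skip_columns:
                continue
            # Treat empty cells as zero; accept "1" as well as "1.0".
            totals[name] += int(float(value or 0))
    return totals


# Made-up stand-in for an exported feature table (not real tweet data).
sample = io.StringIO(
    "label,data,mooc,text mining\n"
    "a,1,0,1\n"
    "b,1,1,0\n"
    "a,0,1,0\n"
)

totals = tally_features(sample)
for term, count in totals.most_common():
    print(term, count)
```

With a real file you would replace the `io.StringIO` sample with `open("features.csv", newline="")`, where the file name is whatever you chose when exporting.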
The next topic in the Data, Analytics and Learning MOOC is text mining, which I’ll explain further in my next post. We were introduced to the last software tool that we’ll be using in the course – LightSide (the Star Wars fan in me is wondering if there’s a competing program called DarkSide which does the opposite of LightSide). It seems fairly simple to use, and I managed to get the correct answer for the exercise we were given:
I’m still not 100% sure if text mining is going to be useful for the Twitter analysis project I’m working on. I think I need a tool which will categorise the data, rather than try and build predictive models of it. Anyway, it’s always good to learn about a new tool – you never know when it might come in handy.