A comparable corpus of reflective writing in multiple languages

Wouldn’t it be great to have a large data set of annotated reflective writings written in various languages? We might get there but let’s start with a bit of brainstorming.

At first sight, creating a reflective writing corpus in multiple languages looks relatively straightforward. For example, we could take a set of English reflective writings and have them translated into several languages. The problem here is that we would not learn anything about the essence of reflection, for example, whether there is anything language- or culture-specific to reflection or not. Another idea that is quick to execute is that everyone interested contacts friends (or friends of friends) from Uni and asks them whether, in exchange for a small payment, they would provide for research purposes a reflective writing that they wrote during their studies. The problem here is that we would not know whether the outcomes of the analysis of these reflective writings tell us more about our sample choice or about reflection.

These two examples already show that the task is tricky. There is much to consider and many decisions to be made.

We can all go out and just collect data, but with some planning and collaboration we might get better data. I therefore believe that it is important to bring together many potential stakeholders and to brainstorm what we need to consider to create a corpus of comparable reflective writings in multiple languages. What I mean by multiple languages is that the aim is to create a corpus of reflective writings in two or more languages. Each language set of reflective writings is built according to the same standards. Overall, the aim is to have data sets in different languages that are similar or comparable.

This brings me to the brainstorming task. Imagine that we all received a good amount of time and money to study reflective thinking across the world in order to better understand the variety of reflective thinking expressed in writings.

“What do we need to consider to create and annotate a publicly available data set of comparable HE students’ reflective writings of multiple languages?”

Think about this question. Then jot down a list of your considerations (keep them short and add an explanation to each). Once you have created your list, sort the items according to your priority. But don’t stop there. Send them to me. I am more than happy to discuss your points with you. And remember, brainstorming is about creativity, so there are no right or wrong answers.

Such a data set would be of great value for various purposes. We may discover that we all reflect in the same way. If instead we discover that there are differences, this should help us to learn from each other. Personally, I would be very interested to know whether feelings play more of a role in some languages than in others, and how important expressing a personal perspective is in each.

I set out this idea of a data set of reflective writings in many languages in my thesis a while ago. It was one of the limitations of that work, which focused only on English student writings. My hope was and is ‘that this research can inspire further research to evaluate the potential of automated detection of reflection across languages’. In my 2019 paper, I proposed that the method used there can guide the setting of standards for annotating the data set (method and theory). The paper presented several ideas, among them its use of a ‘standardized evaluation method, its proposal of reflective writing categories that are common to many models, its focus on model validity, and its reliability’.

A while ago, I summarised some of the problems that I had when creating my large data set of reflective writings in this Google group post: https://groups.google.com/d/msg/wred-general/9Fpbp2sQ0K8/m8y-GbioBgAJ. This was in response to the blog post by Ming Liu and Simon Buckingham Shum of the UTS team calling for contributions to a reflective writing data set for machine learning. See here: http://wa.utscic.edu.au/2018/09/14/building-a-reflective-writing-corpus-for-analytics-research/

I have updated my list of problems or considerations when creating a reflective writing data set, taking especially my work in the 2019 paper into account. They are:

  • What is the use case of the data set? My research was about the analysis of texts. Manual content analysis is the number one method to analyse reflective writing, and I developed the coding scheme for my data set from a review of manual content analyses of reflective writings. Although analysis and assessment are close, they do differ, and thus a data set for the assessment of reflective writings may be different from an analysis data set.
  • This also raises questions regarding what the central constituents of a reflective writing are. This relates very much to the theory of reflection.
  • What unit of analysis is useful?
  • What is the right size?
  • What standards should be followed?
  • What languages would be included?
  • Are there important subject differences that need to be considered?
  • Are there any demographic variables to consider?
  • What research would need to be carried out first in order to create a sound data set?
  • Once there is a data set, how can others use it? 

This list gives a good summary of points to consider when creating a large data set suitable for machine learning. Now, in addition to this, what do we need to consider when creating a comparable data set of reflective writings in multiple languages?
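
To make some of the considerations above more concrete, here is a minimal sketch of how one record of such a corpus could be described, together with a very simple check of whether the language sets are of similar size. Everything in it is an assumption made for illustration: the field names, the example label, and the 20% tolerance are hypothetical and not taken from my thesis or the 2019 paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReflectiveWriting:
    """One annotated unit of the corpus. All field names are hypothetical."""
    text_id: str
    language: str          # e.g. an ISO 639-1 code such as "en" or "de"
    subject: str           # subject area in which the text was written
    unit_of_analysis: str  # e.g. "sentence" or "whole text"
    text: str
    labels: tuple = ()     # annotation categories assigned to this unit


def texts_per_language(corpus):
    """Count how many records each language contributes."""
    return Counter(item.language for item in corpus)


def unbalanced_languages(corpus, tolerance=0.2):
    """Return languages whose record count deviates from the mean count
    by more than `tolerance` (a fraction of the mean; 0.2 is arbitrary)."""
    counts = texts_per_language(corpus)
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(
        lang for lang, n in counts.items() if abs(n - mean) > tolerance * mean
    )
```

Equal counts per language are of course only one narrow reading of "comparable". Subject, demographic variables, and annotation standards would need similar checks, and which of them matter most is exactly what the brainstorming is meant to find out.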


Ullmann, T. D. (2019). Automated Analysis of Reflection in Writing: Validating Machine Learning Approaches. International Journal of Artificial Intelligence in Education, 29(2), 217–257. https://doi.org/10.1007/s40593-019-00174-2
Ullmann, T. D. (2015). Automated detection of reflection in texts. A machine learning based approach (PhD Thesis). The Open University. Retrieved from http://oro.open.ac.uk/45402/