Week 19

Thursday 11th March 2021


Emotion Annotations: Understanding Annotators' Disagreements

Enrica Troiano

Institute for NLP, University of Stuttgart

  • Abstract

Analysing emotions in text consists in automatically understanding its emotional content. This includes a number of phenomena, from basic, discrete emotions to more fine-grained affective information, like intensity. Similar to most ML-based tasks, emotion analysis relies on manually annotated data, thus facing the problem of annotation subjectivity: it is particularly challenging to achieve substantial agreement on emotions.

In this presentation, I will address two annotation tasks, each of which faces distinct issues that lead to disagreement. In one setting, human judges infer emotion intensities; in the other, they annotate specific emotion components (cognitive appraisal). I will show that annotations of intensity correlate both with the confidence of annotators and with their agreement. For cognitive appraisal annotations, I will argue that reconstructing emotion components from descriptions of events is particularly challenging if annotators are not provided with additional emotional information.

This is joint work with Jan Hofmann, Roman Klinger, and Sebastian Padó.

Week 18

Thursday 4th March 2021


Corpus-based Contrastive Analysis and Reader engagement in academic writing: methodological and analytical perspectives

Niall Curry

Coventry University

  • Abstract

In this talk, I discuss an in-depth analysis of questions as reader engagement devices in economics research articles in English, French, and Spanish. Merging contrastive and corpus linguistic approaches, the study interrogates issues of comparability and establishes a base from which to draw meaningful comparisons between discourses within the global, multilingual academy. The corpus-based contrastive analysis approach is applied to the study of questions in the English and French economics subcorpora of KIAP (Fløttum et al. 2006), as well as a comparable Spanish subcorpus created for this study. Direct questions are identified through the use of a "?" and illocutionary force indicating devices are identified to extract indirect questions. In the analysis, each direct and indirect question that serves to allow the writer to interact with the reader is analysed in terms of the following equivalences: frequency, function, type and form, location, passivity, tense/aspect and verbal modality, and question sentence type. A second analysis is presented in terms of these same equivalences; however, the second analysis focuses on shared question function across languages. The findings of this study indicate key similarities and differences across languages and allow for engagement with wider conversations on academic language in the multilingual academy. In concluding this talk, the findings are considered in terms of their applicability to the teaching and learning of academic language as well as future directions in corpus-based contrastive linguistics.
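As a rough illustration of the extraction step described above, the sketch below flags direct questions by a sentence-final "?" and indirect questions by a small list of cue phrases. The sentence splitter, the cue list `INDIRECT_CUES`, and the sample text are all assumptions made for this example; the study's actual inventory of illocutionary force indicating devices is not given in the abstract.

```python
import re

# Naive sentence splitter (assumption): break after ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

# Hypothetical, non-exhaustive cue phrases standing in for
# illocutionary force indicating devices.
INDIRECT_CUES = ("ask whether", "wonder whether", "the question is")


def split_sentences(text):
    """Return the non-empty sentences of a text."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def find_questions(text):
    """Separate direct questions (ending in '?') from cue-based indirect ones."""
    direct, indirect = [], []
    for sentence in split_sentences(text):
        if sentence.endswith("?"):
            direct.append(sentence)
        elif any(cue in sentence.lower() for cue in INDIRECT_CUES):
            indirect.append(sentence)
    return direct, indirect


sample = ("Why do prices rise? One may ask whether wages follow. "
          "Inflation fell in 2019.")
direct, indirect = find_questions(sample)
print(direct)
print(indirect)
```

A cue list of this kind would have to be built separately for each language, and every hit would still need manual checking to confirm that it functions as reader engagement.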


Fløttum, K., Dahl, T., & Kinn, T. (2006). Academic voices: Across languages and disciplines (Vol. 148). John Benjamins Publishing.

Week 17

Thursday 25th February 2021


Dismantling Online Dating Fraud

Matthew Edwards

University of Bristol

  • Abstract

Online romance scams are a prevalent form of mass-marketing fraud in the West. In this type of scam, fraudsters craft fake profiles and manually interact with their victims. Due to the characteristics of this type of fraud, and the peculiarities of how dating sites operate, traditional detection methods (e.g., those used in spam filtering) are ineffective.

This talk will report on our investigation into the archetype of online dating profiles used in this form of fraud, including their use of demographics, profile descriptions, and images, shedding light on both the strategies deployed by scammers to appeal to victims and the implicit traits of victims themselves. Our work is presented in the context of building and evaluating a machine-learning classifier for detecting spam profiles, and elaborates on our findings from investigating areas of under-performance.

Week 16

Thursday 18th February 2021


"WE ARE SO ANGRY! #ibscandal": Covid-19 and the International Baccalaureate

Saira Fitzgerald

Visiting researcher at CASS

  • Abstract

In this talk, I will present work in progress and discuss some preliminary results of a study examining discourses in global press reports and Twitter relating to the International Baccalaureate (IB) final examination results. These reports and tweets appeared over a two-month period, July-September, 2020.

As a result of Covid-19 and worldwide school closures in 2020, the IB organization cancelled its high stakes diploma examination for the first time in its 52-year history and, in its place, devised an alternate form of assessment based on an algorithm. The results for 174,355 students in 146 countries were published on July 6, and showed large discrepancies between students' predicted and final grades, which placed the postsecondary aspirations for many in jeopardy. Students, parents, teachers, academics and journalists demanded to know how grades were calculated and what statistical model was used. An online petition calling for "Justice for May 2020 IB Graduates" with the hashtag #ibscandal collected 15,000 signatures within the first four days.

This study is part of a larger research project on discourses surrounding the IB, which up to now have shown an overwhelmingly positive prosody constructed through repetition and incremental effects. The present study aims to uncover values and attitudes associated with the IB in this new context that previously may have been hidden or taken for granted. Preliminary findings point to shifts in discourses that can be linked to events taking place in the wider world, providing a rare and important window into the impact of "the global education industry" on students.

Week 14

Thursday 4th February 2021


The Application of Natural Language Processing in a Study of LGBTQ+ Cancer Experiences

Daisy Harvey

Spectrum Centre for Mental Health Research, Lancaster University

  • Abstract

Current research suggests that there are knowledge gaps in institutional practices towards lesbian, gay, bisexual, transgender, and queer/questioning (LGBTQ) cancer patients, and that the LGBTQ+ community represent a 'growing and medically underserved population' (Quinn et al. 2015). In the context of cancer care, quantitative evidence shows that there are disparities in cancer outcomes between LGBTQ+ cancer patients and their 'heterosexual and cisgender counterparts' (Kamen et al. 2019), but there is a lack of qualitative research to address where services are lacking from the perspective of an LGBTQ+ service user.

This research demonstrates how the age of the internet can be utilised to provide an insight into underserved populations, and to gain empirical evidence and honest accounts from service users who might experience a fear of stigma or mistreatment in offline settings. The research explores the practical application of NLP and presents a methodology that encompasses web scraping, corpus creation, data annotation and anonymisation, a hybrid system for emotion detection utilising the NRC emotion intensity lexicon (Mohammad 2017) with machine learning methods, and topic modelling using Latent Dirichlet Allocation (LDA). The results of the research demonstrate an emotion detection classifier with a micro F1 score of 65%, and 8 clusters of topics that emerge from the topic modelling task. These topics provide insights that provoke further discussion, particularly within the theme of Diagnosis, Treatment and Sexuality, where excerpts describe that 'LGBT people with cancer can face discrimination and disqualification', and 'healthcare resources are all based on heteronormative assumptions'.
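For readers unfamiliar with the reported metric, the following sketch shows how a micro-averaged F1 score can be computed by pooling true positives, false positives, and false negatives over all items and labels. The multi-label setup, the label names, and the toy gold/predicted data are assumptions for illustration only; they are not taken from the study.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over items, each annotated with a set of labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels correctly predicted
        fp += len(p - g)   # labels predicted but not in gold
        fn += len(g - p)   # gold labels that were missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Invented toy annotations (not data from the study).
gold = [{"fear"}, {"joy"}, {"sadness", "fear"}, {"anger"}]
predicted = [{"fear"}, {"sadness"}, {"sadness"}, {"anger", "fear"}]

score = micro_f1(gold, predicted)
print(round(score, 2))
```

Because counts are pooled before averaging, frequent emotion labels weigh more heavily in the score than rare ones.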


Kamen, C.S., Alpert, A., Margolies, L., Griggs, J.J., Darbes, L., Smith-Stoner, M., Lytle, M., Poteat, T., Scout, N.F.N. and Norton, S.A. (2019). "Treat us with dignity": a qualitative study of the experiences and recommendations of lesbian, gay, bisexual, transgender, and queer (LGBTQ) patients with cancer. Supportive Care in Cancer, 27(7), 2525-2532.

Mohammad, S. M. (2017). Word affect intensities. arXiv preprint arXiv:1704.08798.

Quinn, G.P., Sanchez, J.A., Sutton, S.K., Vadaparampil, S.T., Nguyen, G.T., Green, B.L., Kanetsky, P.A. and Schabath, M.B. (2015). Cancer and lesbian, gay, bisexual, transgender/transsexual, and queer/questioning (LGBTQ) populations. CA: a cancer journal for clinicians, 65(5), 384-400.

Week 9

Thursday 3rd December 2020


Triangulating corpus linguistics and clinical psychology in a study of narratives of voice-hearers

Luke Collins¹ & Elena Semino²

¹CASS, Lancaster University  ²LAEL, Lancaster University

  • Abstract

We present collaborative work between the 'Hearing the Voice' project (Durham University) and the ESRC Centre for Corpus Approaches to Social Science (CASS), with colleagues Dr Zsófia Demjén and Dr Vaclav Brezina, investigating the reports of individuals who hear voices that others cannot hear. Focusing on the description of such voices as 'person-like', we demonstrate how methods from corpus linguistics can be triangulated with approaches in clinical psychology. We find that an approach to investigating personhood based on the selection of specific linguistic aspects of the reports is convergent with the characterisation of participant experiences as 'minimal' or 'complex', based on a manual coding scheme developed by our colleagues in psychology. Furthermore, our corpus-based approach provides further insights into degrees of complexity, provisionally outlining a 'complexity scale' and contributing to increased understanding of experiences of voice-hearing in terms of personification of voices. The implementation of corpus methods in this work also highlighted important methodological considerations for the wider application of corpus linguistics.

Week 8

Thursday 26th November 2020


Usage-based perspective on the meaning-preserving hypothesis in voice alternation: Corpus linguistic and experimental studies in Indonesian

Gede Primahadi Wijaya RAJEG¹, I Made RAJEG¹ & I Wayan Arka²

¹Universitas Udayana  ²Australian National University & Universitas Udayana

  • Abstract

Voice alternation between active (AV) and passive (PASS) clauses is viewed as a "meaning-preserving alternation" (Kroeger, 2005, p. 271). This means that AV and PASS clauses based on the same verb should convey the same kind of event/meaning (cf. (1) & (2)).

(1) Indonesian (ind_mixed_2012_1M-sentences.txt:755227)

murid Go bie-pay yang meng-(k)ena-kan baju warna hitam.

pupil NAME REL AV-hit-CAUS shirt colour black

'Go bie-pay's student who wears/puts on a black shirt'

(2) Indonesian (ind_mixed_2012_1M-sentences.txt:802596)

Gaun yang di-kena-kan ber-warna hitam

dress REL PASS-hit-CAUS have.colour black

'The dress that is worn/put on is black'

Examples (1) and (2) convey the same event of wearing clothing. The difference lies in the alignment of semantic roles and grammatical relations, especially as it affects the identity of the syntactic SUBJ(ect): the Theme (i.e. the clothing) is the PASS SUBJ in (2) but the direct OBJ(ect) in (1). The argument for the meaning-preserving status of the AV-PASS alternation is typically illustrated using a pair of (often constructed) examples, as in (1) and (2). Following up on our earlier work with the root kena 'hit' (Rajeg et al., 2020), we offer a usage-based, quantitative perspective on testing the meaning-preserving hypothesis in voice alternation by bringing together evidence from (i) corpus analysis and (ii) a sentence-production experiment (cf. Dąbrowska, 2009; Newman & Sorenson Duncan, 2019, for a similar approach). We analysed the distribution of (non-)metaphoric senses of a set of Indonesian CAUSED FORWARD/BACKWARD motion verbs in the AV-PASS alternation. Our study demonstrates that voice alternation can be sensitive to the senses of a verb, given that a verb can be polysemous (cf. Bernolet & Colleman, 2016, for the dative alternation in Dutch). Quantitative findings indicate that voice alternation exhibits frequency effects (Diessel, 2016), such that certain senses strongly (dis)prefer one voice type over the other. These findings offer initial evidence for McDonnell's (2016) hypothesis on the role of semantic properties (e.g. senses) of a verb in accounting for the strong preference of that verb to occur in a given voice (cf. Gries & Stefanowitsch, 2004). Converging results between corpus and experimental data also suggest that speakers may store detailed semantic preferences of a verb in a given voice type, contributing to the idea of item-specific knowledge in usage-based Construction Grammar (Goldberg, 2006, pp. 49, 56; cf. Dąbrowska, 2009; Diessel, 2016).
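The kind of distributional comparison described above can be sketched as a tabulation of voice type by verb sense. The sense labels and all counts below are invented for illustration and do not reproduce the study's data; `voice_share` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def voice_share(hits):
    """Return, for each sense, the proportion of hits in each voice type."""
    counts = defaultdict(Counter)
    for sense, voice in hits:
        counts[sense][voice] += 1
    shares = {}
    for sense, by_voice in counts.items():
        total = sum(by_voice.values())
        shares[sense] = {voice: n / total for voice, n in by_voice.items()}
    return shares


# Invented (sense, voice) annotations for one verb; not corpus results.
hits = ([("literal", "AV")] * 30 + [("literal", "PASS")] * 10
        + [("metaphoric", "AV")] * 5 + [("metaphoric", "PASS")] * 15)

shares = voice_share(hits)
for sense, by_voice in shares.items():
    print(sense, by_voice)
```

If voice alternation were fully meaning-preserving, one would expect the AV and PASS shares to be broadly similar across senses; a skew such as the one in this toy table is the sort of pattern a sense-sensitive alternation would produce.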


Bernolet, S., & Colleman, T. (2016). Sense-based and lexeme-based alternation biases in the Dutch dative alternation. In J. Yoon & S. Th. Gries (Eds.), Corpus-based approaches to Construction Grammar (pp. 165-198). John Benjamins Publishing Company.

Dąbrowska, E. (2009). Words as constructions. In V. Evans & S. Pourcel (Eds.), New directions in cognitive linguistics (pp. 214-237). John Benjamins Pub. Co.

Diessel, H. (2016). Frequency and lexical specificity in grammar: A critical review. In H. Behrens & S. Pfänder (Eds.), Experience Counts: Frequency Effects in Language. De Gruyter.

Goldberg, A. E. (2006). Constructions at work: The nature of generalization in language. Oxford University Press.

Gries, S. Th., & Stefanowitsch, A. (2004). Extending collostructional analysis: A corpus-based perspective on "alternations." International Journal of Corpus Linguistics, 9(1), 97-129.

Kroeger, P. R. (2005). Analyzing Grammar: An Introduction. Cambridge University Press.

McDonnell, B. (2016). Symmetrical voice constructions in Besemah: A usage-based approach [PhD dissertation, University of California, Santa Barbara].

Newman, J., & Sorenson Duncan, T. (2019). The subject of ROAR in the mind and in the corpus: What divergent results can teach us. Linguistica Atlantica, 37(1), 1-27.

Rajeg, G. P. W., Rajeg, I. M., & Arka, I. W. (2020). Corpus-based approach meets LFG: Puzzling voice alternation in Indonesian. Paper Presented at the 25th International Lexical-Functional Grammar. Figshare.

Week 7

Thursday 19th November 2020


Affix substitution in Indonesian and its impact for discriminative learning

Karlina Denistia & R. Harald Baayen

Universität Tübingen

  • Abstract

This study applies computational modelling to two Indonesian nominal prefixes, PE- and PEN-, that realize a function similar to that of the English suffix -er (e.g., perenang 'swimmer' and penari 'dancer'). These prefixes are described as being very similar in form and meaning (Sneddon et al., 2010). Interestingly, PE- and PEN- often stand in a paradigmatic relation to verbal base words with the prefixes BER- and MEN-, respectively (Dardjowidjojo, 1983). Thus, one can form verb-noun pairs such as berenang 'to swim' - perenang 'swimmer' and menari 'to dance' - penari 'dancer'. The central question addressed in the present study is whether the form similarities between PEN- (and its allomorphs) and MEN- (and its allomorphs) make this prefix easier to learn than PE-. To address this question, we made use of a computational model of lexical processing in the mental lexicon, the 'discriminative lexicon' (DL) model introduced by Baayen et al. (2019). Compiling written Indonesian data from the Leipzig Corpora Collection (Goldhahn et al., 2012), we trained the model on 2517 word forms that were inflected or derived variants of 99 different base words. Of these 2517 word forms, 109 were nouns with PE- and 221 were nouns with PEN-. Our results show that PE- is learnt somewhat better than PEN-, for several reasons. As PE- is found to have a longer mean character length, it allows the model to discriminate better than PEN- does. In the same vein, there is competition between the semantic cues of PEN- and MEN-, which reduces the model's precision in predicting PEN-. Thus, the systematic paradigmatic similarities between PEN- and MEN- render these words more difficult for implicit lexical learning.
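To give a feel for learning a linear mapping from form cues to meanings, the sketch below trains a toy cue-to-meaning network with the delta rule (Widrow-Hoff). This is a stand-in chosen for illustration, not the authors' implementation of the discriminative lexicon model: the letter-trigram cues, the three-dimensional semantic vectors, the learning rate, and the number of epochs are all assumptions. Only the four example words come from the abstract.

```python
def trigrams(word):
    """Letter trigrams of a word, with '#' marking the word boundaries."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


# Toy semantic targets (assumption): dimensions are [SWIM, DANCE, AGENT].
targets = {
    "berenang": [1.0, 0.0, 0.0],
    "perenang": [1.0, 0.0, 1.0],
    "menari": [0.0, 1.0, 0.0],
    "penari": [0.0, 1.0, 1.0],
}


def predict(weights, word, dims):
    """Sum the weight rows of all cues that are active in the word."""
    out = [0.0] * dims
    for cue in trigrams(word):
        for d, w in enumerate(weights.get(cue, [0.0] * dims)):
            out[d] += w
    return out


def train(targets, epochs=500, rate=0.05):
    """Delta-rule updates: move each active cue's weights toward the target."""
    dims = len(next(iter(targets.values())))
    weights = {}
    for _ in range(epochs):
        for word, target in targets.items():
            pred = predict(weights, word, dims)
            for cue in trigrams(word):
                row = weights.setdefault(cue, [0.0] * dims)
                for d in range(dims):
                    row[d] += rate * (target[d] - pred[d])
    return weights


weights = train(targets)
for word in targets:
    print(word, [round(x, 2) for x in predict(weights, word, 3)])
```

In this toy setup, trigrams shared between penari and menari (such as "ena" and "nar") receive conflicting updates on the AGENT dimension, which gives an informal picture of how form overlap between paired words can create cue competition.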


Baayen, R. H., Chuang, Y.-Y., Shafaei-Bajestan, E., and Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, pages 1-39.

Dardjowidjojo, S. (1983). Some Aspects of Indonesian Linguistics. Djambatan, Jakarta.

Goldhahn, D., Eckart, T., and Quasthoff, U. (2012). Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, pages 1799-1802.

Sneddon, J. N., Adelaar, A., Djenar, D. N., and Ewing, M. C. (2010). Indonesian: A Comprehensive Grammar. Routledge, New York, second edition.

Week 6

Thursday 12th November 2020


Measuring lexical complexity in L2 spoken production: Evidence from the Trinity Lancaster Corpus

Raffaella Bottini

CASS, Lancaster University

  • Abstract

The study validates lexical complexity measures for L2 spoken language using the 4.2-million-word Trinity Lancaster Corpus of L2 spoken English. Studies on learner language have shown that vocabulary knowledge is one of the best predictors of language use and overall proficiency (e.g. Milton, 2013). Different measures of vocabulary knowledge have been proposed in the field and lexical complexity plays a key role among them (e.g. Kim et al., 2018; Kyle & Crossley, 2015; Lu, 2012). However, little is known about different aspects of lexical complexity in L2 speech; also, there is no general agreement about which of the many existing complexity indices to use. This corpus-based study examines the reliability and validity of existing lexical measures - including indices which have not been validated before - and their relationship with learner characteristics (L1 and proficiency) and task-related features (topic familiarity). It introduces Lex Complexity Tool, a new tool which computes all the measures analysed and which includes a spoken wordlist from the Spoken BNC2014. The findings inform the choice of lexical indices tailored to research in second language acquisition and language testing, especially when L2 English speech is considered.
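The abstract does not list the indices examined. As a minimal illustration of what a lexical index looks like, the sketch below computes two common diversity measures (type-token ratio and Guiraud's index) and a simple sophistication proxy (the share of tokens outside a reference wordlist). The tokenizer, the tiny `reference` wordlist, and the sample utterance are assumptions; this is not the Lex Complexity Tool.

```python
import math
import re


def tokenize(text):
    """Lowercase and keep alphabetic word forms (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


def guiraud_index(tokens):
    """Types divided by the square root of tokens, which dampens length effects."""
    return len(set(tokens)) / math.sqrt(len(tokens))


def off_list_share(tokens, reference):
    """Share of tokens that do not appear in a reference wordlist."""
    return sum(1 for t in tokens if t not in reference) / len(tokens)


# Hypothetical stand-in for a high-frequency spoken wordlist.
reference = {"i", "like", "the", "and", "sea"}

tokens = tokenize("I like the sea and I like the mountains")
print(type_token_ratio(tokens), guiraud_index(tokens), off_list_share(tokens, reference))
```

Type-token ratio is known to fall as texts get longer, which is one reason why the choice among competing indices matters for spoken samples of varying length.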

Kyle, K., & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly, 49(4), 757-786.

Lu, X. (2012). The relationship of lexical richness to the quality of ESL learners' oral narratives. The Modern Language Journal, 96(2), 190-208.

Milton, J. (2013). Measuring the contribution of vocabulary knowledge to proficiency in the four skills. In C. Bardel, C. Lindqvist, & B. Laufer (Eds.), L2 vocabulary acquisition, knowledge and use. New perspectives on assessment and corpus analysis (pp. 57-78). Eurosla.

Week 5

Thursday 5th November 2020


CLEC: Colombian Learner English Corpus

Maria Victoria Pardo, Antonio Tamayo, Manuel Alejandro Gómez & Nicolás Alberto Henao

Universidad del Norte

  • Abstract

The objective of this presentation is to introduce the CLEC (Colombian Learner English Corpus) to the research community. This corpus was created following the guidelines of computational corpus linguistics (McEnery & Hardie, 2011) and according to the compilation parameters of learner corpora, defined as "electronic collections of natural or almost natural data produced by foreign or second language learners (L2) and gathered according to explicit design criteria" (Granger, 2002, p. 7; Gilquin, 2015, p. 1). The TNT (Translation and New Technologies) research group of the University of Antioquia created the CLEC. It is an application that compiles 515 written compositions by university-level students of English as a foreign language. The application supports searching the tagged data, filters error labels systematically by category or type, and reveals trends in learner errors. The resulting product is a responsive web application that performs searches and analyses on the error-tagged corpus.

Week 4

Thursday 29th October 2020


Trainee EFL teachers' DDL lesson planning: Improving corpus-focused TPACK in Indonesia

Peter Crosthwaite

University of Queensland, Australia

  • Abstract

The use of corpora for language teaching/learning, via teacher-prepared corpus-assisted materials development or learners' direct use of corpus query software (commonly known as "data-driven learning", DDL), is gaining in popularity in pre-tertiary EFL contexts. However, improving trainee English teachers' technological and pedagogical content knowledge (TPACK) regarding integration of corpus tools/DDL pedagogy into classroom practice has received little attention from a language teacher education perspective.

This qualitative study therefore reports on a DDL lesson planning intervention for pre-service secondary school EFL teachers in Indonesia. I explore how trainee language teachers integrate DDL into their lesson planning following DDL training, and whether the trainees' lesson plans demonstrate the TPACK required for successful future implementation. Nine pre-service EFL teacher trainees were enrolled in a teacher education program in Jakarta, Indonesia. The DDL training regimen included partial completion of a Short Private Online Course on DDL (Improving Writing Through Corpora, Crosthwaite, 2020) covering basic corpus techniques required for DDL (e.g. generating corpus queries, reading/manipulating concordances, understanding frequency information) using SkELL (Baisa & Suchomel, 2014) and Sketch Engine (Kilgarriff et al., 2014). Trainees then submitted a sample lesson plan, which was scrutinized for components where corpus data could enhance the proposed lesson's materials or where learners could engage in direct corpus consultation/DDL. Three three-hour workshops on DDL were then conducted online via Zoom. Following these, trainees discussed their DDL training and lesson planning via Google Classroom chat, before working alone to create a new lesson plan including at least one direct DDL activity. Data include the researcher's initial feedback regarding integration of DDL into trainees' original (non-DDL) lesson plans, trainees' Google Classroom chat logs, and trainees' completed lesson plans involving DDL resources/activities. Harris et al.'s (2010) Technology Integration Observation Instrument was used to evaluate trainees' completed lesson plans for TPACK regarding curriculum goals and technologies; instructional strategies and technologies; technology selection(s); and 'fit'.

The data suggest trainees each integrated corpora/DDL into their lesson planning despite none reporting using a corpus prior to training. Submitted lesson plans featured DDL for language-related concerns (e.g. 'grammar focus'), and to support task-based genre-focused pedagogies as required by the Indonesian national curriculum. While submitted plans demonstrated high levels of 'fit' regarding curriculum goals and technology selection, some plans lacked DDL-relevant instructional strategies. However, TPACK scores for submitted lesson plans were generally high following only a short (but intensive) period of DDL training, underscoring the significant potential for integrating DDL into pre-tertiary classroom practice.


Baisa, V., & Suchomel, V. (2014). SkELL: Web interface for English language learning. In A. Horák & P. Rychlý (Eds.), Proceedings of Recent Advances in Slavonic Natural Language Processing. Karlova Studánka, Czech Republic, 5-7 December, 63-70.

Crosthwaite, P. (2020). Taking DDL online: Designing, implementing and evaluating a SPOC on data-driven learning for tertiary L2 writing. Australian Review of Applied Linguistics, 43(2), 169-195.

Harris, J., Grandgenett, N., & Hofer, M. (2010). Testing a TPACK-based technology integration assessment rubric. In C. D. Maddux, D. Gibson, & B. Dodge (Eds.), Research highlights in technology and teacher education (pp. 323-331). Chesapeake, VA: Society for Information Technology & Teacher Education (SITE).

Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., & Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 1(1), 7-36.