What does representativeness mean in corpus linguistics? According to Leech (1991: 27), a corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety. Biber (1993: 243) defines representativeness from the viewpoint of how this quality is achieved: ‘Representativeness refers to the extent to which a sample includes the full range of variability in a population.’ A corpus is essentially a sample of a language or language variety (i.e. population). Sampling is entailed in the compilation of virtually any corpus of a living language. In this respect, the representativeness of most corpora is to a great extent determined by two factors: the range of genres included in a corpus (i.e. balance) and how the text chunks for each genre are selected (i.e. sampling).
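The two factors named above, balance and sampling, can be made concrete with a minimal sketch. It assumes a hypothetical design in which each genre is assigned a fixed number of text chunks and each chunk has a fixed length; the function name `sample_corpus`, the genre labels and the chunk size are illustrative assumptions, not part of any actual corpus design.

```python
import random


def sample_corpus(texts_by_genre, chunks_per_genre, chunk_size=2000, seed=0):
    """Draw fixed-size chunks from each genre (illustrative sketch only).

    texts_by_genre:   dict mapping genre -> list of texts (each a list of tokens)
    chunks_per_genre: dict mapping genre -> number of chunks to draw (balance)
    chunk_size:       number of tokens per chunk (sampling)
    """
    rng = random.Random(seed)
    corpus = {}
    for genre, n_chunks in chunks_per_genre.items():
        # Only texts long enough to yield a full chunk are candidates.
        candidates = [t for t in texts_by_genre[genre] if len(t) >= chunk_size]
        if len(candidates) < n_chunks:
            raise ValueError(f"not enough texts of sufficient length for {genre!r}")
        chunks = []
        # Pick distinct texts at random, then a random window inside each text.
        for tokens in rng.sample(candidates, n_chunks):
            start = rng.randint(0, len(tokens) - chunk_size)
            chunks.append(tokens[start:start + chunk_size])
        corpus[genre] = chunks
    return corpus


# Toy data: ten short "texts" per genre (hypothetical).
texts = {
    "press": [[f"p{i}"] * 50 for i in range(10)],
    "fiction": [[f"f{i}"] * 80 for i in range(10)],
}
design = {"press": 3, "fiction": 2}
sample = sample_corpus(texts, design, chunk_size=20)
```

Here the `design` dictionary encodes the balance decision and `chunk_size` together with the random window encodes the sampling decision; both are defined without reference to the linguistic content of the texts.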
We noted in Unit A1.2 that the criteria used to select texts for a corpus are principally external. The external vs. internal criteria correspond to Biber's (1993: 243) situational vs. linguistic perspectives. External criteria are defined situationally irrespective of the distribution of linguistic features whereas internal criteria are defined linguistically, taking into account the distribution of such features. Biber refers to situationally defined text categories as genres or registers, and linguistically defined text categories as text types, though these terms are typically used interchangeably in the literature (e.g. Aston and Burnard 1998), and in this book.
Internal criteria have sometimes been proposed as a measure of corpus representativeness. Otlogetswe (2004), for example, argues that:
The study of corpus word distributions would reveal whether words in a corpus are skewed towards certain varieties and whether in such instances it is accurate to say they are representative of the entire corpus. It would also reflect the stability of the design - whether overall representativeness is very sensitive to particular genres.
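The kind of check described in this quotation might be sketched as follows. This is not Otlogetswe's procedure; it is a minimal illustration, on invented toy data, of how one could inspect whether a word is skewed towards a genre and how much its overall relative frequency shifts when one genre is left out. The function names and genre labels are assumptions.

```python
def genre_profile(word, tokens_by_genre):
    """Relative frequency of `word` per 1,000 tokens in each genre."""
    return {
        genre: 1000 * tokens.count(word) / len(tokens)
        for genre, tokens in tokens_by_genre.items()
    }


def leave_one_out(word, tokens_by_genre):
    """Overall relative frequency of `word` when each genre is dropped in turn."""
    result = {}
    for left_out in tokens_by_genre:
        # Pool the tokens of every genre except the one being left out.
        rest = [
            token
            for genre, tokens in tokens_by_genre.items()
            if genre != left_out
            for token in tokens
        ]
        result[left_out] = 1000 * rest.count(word) / len(rest)
    return result


# Toy corpus with three genres (hypothetical data).
tokens_by_genre = {
    "press": "the court ruled the case".split(),
    "fiction": "she walked to the door".split(),
    "legal": "the court heard the court order".split(),
}
profile = genre_profile("court", tokens_by_genre)
sensitivity = leave_one_out("court", tokens_by_genre)
```

A large gap between the values in `profile` suggests the word is skewed towards particular genres, and a large swing in `sensitivity` suggests that the overall figure depends heavily on a single genre.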
In addition to text selection criteria, Hunston (2002: 30) suggests that another aspect of representativeness is change over time. She claims that ‘[a]ny corpus that is not regularly updated rapidly becomes unrepresentative’ (ibid.). The relevance of permanence in corpus design actually depends on how we view a corpus, i.e. whether a corpus should be viewed as a static or dynamic language model.
Similar views on the use of internal criteria can be found elsewhere. For example, in a discussion of representativeness on the Corpora Mailing List, most discussants appeared to assume that a corpus should sufficiently represent particular words: ‘A representative corpus should include the majority of the types in the language as recorded in a comprehensive dictionary’ (Berber-Sardinha 1998). Such a decision would in turn entail a discussion of what should be counted as a word, e.g. whether one should count different forms of the same word as instances of the same type.
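The dependence of such a criterion on what counts as a word can be illustrated with a small sketch. It assumes a hypothetical dictionary given as a set of headwords and a hypothetical form-to-lemma mapping; neither corresponds to a real resource, and the function `type_coverage` is an illustration rather than an established measure.

```python
def type_coverage(corpus_tokens, dictionary_headwords, lemma_of=None):
    """Share of dictionary headwords attested in the corpus.

    If `lemma_of` (a dict mapping word form -> lemma) is given, different
    forms of the same word are counted as instances of one type.
    """
    headwords = {h.lower() for h in dictionary_headwords}
    if not headwords:
        raise ValueError("dictionary_headwords must not be empty")
    if lemma_of is None:
        # Every distinct word form is its own type.
        types = {t.lower() for t in corpus_tokens}
    else:
        # Collapse forms onto lemmas; unknown forms stay as they are.
        types = {lemma_of.get(t.lower(), t.lower()) for t in corpus_tokens}
    return len(types & headwords) / len(headwords)


# Toy data (hypothetical).
tokens = "She goes and he went".split()
dictionary = {"go", "walk", "she", "he"}
lemma_of = {"goes": "go", "went": "go"}

by_form = type_coverage(tokens, dictionary)
by_lemma = type_coverage(tokens, dictionary, lemma_of)
```

On this toy data the same corpus covers half of the dictionary when word forms are counted as separate types and three quarters of it when forms are grouped under their lemmas, so the apparent representativeness changes with the definition of a type.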
In our view, it is problematic, indeed it is circular, to use internal criteria like the distribution of words or grammatical features as the primary parameters for the selection of corpus data. A corpus is typically designed to study linguistic distributions. If the distribution of linguistic features is predetermined when the corpus is designed, there is no point in analysing such a corpus to discover naturally occurring linguistic feature distributions. The corpus has been skewed by design. As such, we generally agree with Sinclair (1995) when he says that the texts or parts of texts to be included in a corpus should be selected according to external criteria.
With respect to change over time, the static view of a corpus as a language model typically applies to a sample corpus, whereas the dynamic view applies to a monitor corpus. While monitor corpora following the dynamic language model are useful in tracking rapid language change such as the development and lifecycle of neologisms, they normally cover a relatively short span of time. Very long-term change can, of course, be studied using diachronic corpora such as the Helsinki Diachronic Corpus, in which each component represents a specific time period. Static sample corpora, if resampled, may also allow the study of language change over time. Static sample corpora which apply the same sampling frame are particularly useful in this respect. Typical examples of this type of corpus are the Lancaster-Oslo-Bergen corpus (i.e. LOB) and the Freiburg-LOB corpus (i.e. FLOB), which represent British English in the early 1960s and the early 1990s respectively. Another corpus following the same sampling frame is under construction in a project funded by the Leverhulme Trust and undertaken by Lancaster University. This corpus is designed as a match for LOB, representing the early 1930s. These three corpora are specifically constructed with the study of language change in mind.
Diachronic corpora like the Helsinki corpus and the corpora of the LOB family are sample corpora of the static language model sort, yet they are all well suited for the study of language change.
McEnery et al. Corpus Representativeness