Data Driven Vocabulary Learning

TESOL 2003 Presentation

by Bill Walker
AEI, University of Oregon

Links to concordancers and on-line dictionaries


I like what the Cambridge Dictionary of American English says in its preface: "The ancient Romans said that in matters of language, usage is more powerful than Caesar." The preface goes on to say that there is no central authority that decides what is and is not correct. It is the users of the language that determine what words can and cannot do; users decide which grammar certain words prefer, which other words they associate with and how they change meaning over time and in different contexts.

When we think of usage, we usually think of grammar. However, lately, there has been more and more attention paid to the way vocabulary is used. In fact, in the late 1990s, Michael McCarthy and Felicity O'Dell came out with a vocabulary text with the title Vocabulary In Use. They point out that it is one thing to know what a word means, but quite something else to know how to use it.

We all know that students sometimes make poor vocabulary choices in their writing. Years ago, I started to make a chart showing the kinds of mistakes my advanced level students were making in their writing assignments. I wanted to home in on the grammar mistakes so that I could target some of the classroom activities to help them write more accurately. My focus was on grammar, because I knew that grammar follows rules. I assumed that students could learn the rules and correct the mistakes that I pointed out to them. Here is a brief example of how I marked students' papers using a standardized set of marking symbols.

OHP: do action fast

I listed my marking symbols in this chart in a column on the left. Then, after marking a set of papers, I went back and put a check mark against each symbol. Most of the symbols referred to grammar, but some indicated punctuation mistakes and some indicated vocabulary or word order.

OHP: correction symbols chart

Most of the check marks went right where I expected them: verb tense mistakes, articles, sentence boundary mistakes, etc. However, there soon appeared a rather large spike for the symbol ww: wrong word and other symbols relating to vocabulary usage. Over and over again, for each subsequent set of papers, I got similar results. Students were not quite choosing the appropriate words. On subsequent revisions, most students did fairly well in correcting the grammar mistakes, but still, some of them persisted in choosing the wrong words. Sometimes, the original mistaken word was better than the new attempt to replace it.

You've probably done what I did next. I asked the students how they went about correcting their vocabulary errors. The answer, not surprisingly, was that they used the dictionary. Obviously, there was something wrong with this approach. Either they were using inadequate dictionaries, or they didn't know how to use dictionaries correctly. In some cases, during conferences, I found myself saying, "that's not how we say that in English" or "this is how we say it . . ." Of course the students would respond with a look that said, "So, how am I supposed to know that?" My students knew that they could, if they tried, clean up their grammar by finding grammar rules and figuring out the correct forms. But vocabulary "rules" don't seem to exist. Dictionaries give students basic meanings of words, but there is a lot of information that cannot be had from dictionaries alone.

OHP: deliberate (Cambridge Dictionary)

Put yourself in the Student's Place

Two years ago, I did an experiment. I was curious as to what it is like to be a student in a foreign language class. Specifically, how do students handle new vocabulary? I enrolled in a French class at the University of Oregon. The professor was a native speaker of French and never spoke English in class. It was a reading-based course. We had to read, then discuss in class, a wide variety of authentic, unabridged articles from magazines, newspapers and the Internet. I went out and bought a great dictionary of the sort that I advise my students to buy: French-French, with clear definitions and lots of example sentences. I set out to improve my French vocabulary.

OHP: La France Pas Tranquille

Here's an example of a passage from one of my French texts. Below it, you see my translation. The words that are in italics are those that I guessed at, without looking in my dictionary. My guesses were sufficient for me to get the gist of the passage. However, if you read closely, you see that it doesn't quite flow in English. Certainly, some of my guesses were not precisely accurate. Later, I looked these words up in a dictionary. I was close enough on most of my guesses, but way off on others. The good guesses were based on some rudimentary, previous knowledge of the words, plus some context clues. The poor guesses were a result of little or no previous knowledge, or, in the case of "inventer un project," an idiomatic expression. Unfortunately, my dictionary did not list idioms like this.

I suppose that, over time, if I were to read lots and lots of books in French, I might come across "inventer un project" a few more times. Then, if I were lucky, I might be able to figure it out. Unfortunately, I really don't have the time to read many books in French. Even if I did find out what it means, it would still be quite a while longer before I would be able to use it correctly, in the appropriate contexts.

Let's take a word in English. Let's take the word discriminate. Native speakers know that sometimes we say discriminate against, and sometimes we say discriminate among or between. So, what's the difference? A dictionary might give this information:

OHP: discriminate

What Does it Mean to Know a Word?

That's a good start. The word discriminate has two senses: to treat unfairly; to notice a difference. When it means to treat unfairly, it is used with the word against. When it means to notice a difference, it is used with the word between. However, we know that there's a lot more to knowing a word than just knowing its meaning or meanings. As you can see, gaining a full, encyclopedic knowledge of English vocabulary can be a long and tedious process for students. Researchers such as Norbert Schmitt and Paul Nation point out that there is much more to knowing a word than just learning its definition.

OHP: Knowing a Word

This is what we call "encyclopedic" knowledge of a word. But before we go any further, let's ask ourselves just what the term vocabulary means. As the basis of any discussion of what it means to teach vocabulary, a few terms need to be defined and clarified. The term vocabulary means "words" to lay people, but teaching professionals need a more precise definition. We could say that the "words" of one's vocabulary are actually "lexemes" or lexical units which may consist of a single word or a string of related words which encapsulate a single meaning unit. For example, the idea of "die" can be construed with the following lexemes: die, expire, pass away, bite the dust, give up the ghost, . . . is no longer with us, etc. Additionally, lexemes live in word families in the form of grammatical inflections (die, dies, died) or derivatives (prime, primary, primarily, primitive, primitively, primitiveness).

I doubt if students can get all of this information from dictionaries. Only through extensive exposure to massive amounts of comprehensible English could they start to build up their encyclopedic knowledge of vocabulary items. Certainly, by reading many books and listening to a great deal of speech, students can gradually pick up greater and greater vocabulary. However, there is an easier way, and that is the topic of this demonstration.

Using corpora and concordance programs is one way to accelerate and make more efficient the acquisition of vocabulary. Of course, this approach is not foolproof. There are some pitfalls. However, I'll be pointing these out to you as I go along. Some careful preparations have to be made before you launch your students into investigating corpora.

What is a corpus?

Basically, it is any body of text. However, to be of any real use in vocabulary learning, a corpus should be quite large. By large, I mean over a million words. Of course, smaller corpora can be used for special purposes, some of which I will demonstrate shortly, but in general, corpora of over a million words work best. How many words is a million? Imagine a book that has 500 words on a page. If the book is 500 pages long, it contains a quarter of a million words. It would take four such books to contain 1,000,000 words. When I develop activities for my students, I usually use corpora of 10 to 12 million words. If I use fewer than that, I sometimes don't get good results. The CoBuild Dictionary is based on the Bank of English corpus. According to their web site, as of January, 2002, this corpus contained over 450 million words.

If you examine a corpus of, say, one million words, how often would you expect a word, such as discriminate, to appear? More specifically, how many times would you expect the word discriminate to occur in four novels (a million words)? By the way, discriminate is one of the words on the Academic Word List, which consists of words that occur at least 100 times in the Academic Corpus of 3,500,000 words. I searched Lord of the Rings, books 1 and 2. How many times did the word discriminate occur? Zero. Not once in two books. Not once in half a million words. Obviously, a million words is not very many. Another thing, though, is that I was looking for an academic word in a non-academic corpus. If I had tried to find the word elf in an academic corpus, I'm sure I would have struck out. The word discriminate would occur more than a dozen times in an academic corpus, whereas the word elf would occur hundreds of times in a corpus taken from Tolkien's works.
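To make the arithmetic concrete, here is a small Python sketch. The function names per_million and expected_hits are made up for this illustration; the only figures taken from the talk are the 100 occurrences and the 3,500,000-word Academic Corpus. Keep in mind that the resulting rate is a minimum for academic text only, which is exactly why a novel can yield zero hits.

```python
def per_million(occurrences, corpus_words):
    """Normalise a raw count to occurrences per million words."""
    return occurrences / corpus_words * 1_000_000


def expected_hits(rate_per_million, corpus_words):
    """Estimate how many hits a corpus of a given size would yield at that rate."""
    return rate_per_million * corpus_words / 1_000_000


# Minimum rate implied by the Academic Word List criterion:
# at least 100 occurrences in 3,500,000 words of academic text.
awl_minimum_rate = per_million(100, 3_500_000)
print(round(awl_minimum_rate, 1))  # 28.6 per million words

# The same rate applied to a 10-million-word academic corpus.
print(round(expected_hits(awl_minimum_rate, 10_000_000)))  # 286
```

The estimate only holds when the corpus resembles the one the rate was measured on, so it says nothing about how often the word turns up in fiction or conversation.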

Researchers gather great amounts of interesting information about language by examining these massive, multi-million word corpora using concordance software. A concordance program is a kind of search engine that hunts for all occurrences of individual words or strings of words. Here is an example of the results I got by looking for the word discriminate in a corpus of newspaper articles:

OHP: could still discriminate against women. "if

Right away you can see that by listing these occurrences of the word, certain information appears. This is information that dictionaries don't give. You notice that most of the occurrences of discriminate co-occur with the word against. In this short example, there is only one co-occurrence with the word between. Instantly, a student would know that the first meaning sense of discriminate (to treat unfairly) is much more common than the second meaning (to notice a difference).
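For those curious about what a concordance program does under the hood, the following is a minimal sketch in Python, not the code of any actual concordancer. The function name kwic, the width parameter, and the three sample sentences are all invented for the illustration. It finds each occurrence of the key word, keeps a little context on either side, and then counts which word comes right after the key word.

```python
import re
from collections import Counter


def kwic(text, keyword, width=30):
    """Return (left context, key word, right context) for every whole-word hit."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()].strip()
        right = text[m.end():m.end() + width].strip()
        hits.append((left, m.group(), right))
    return hits


# Invented sample text standing in for a real corpus.
sample = (
    "Employers could still discriminate against women. "
    "It is illegal to discriminate against older workers. "
    "Young children cannot always discriminate between similar sounds."
)

hits = kwic(sample, "discriminate")
for left, key, right in hits:
    # Right-align the left context so the key words line up in a column.
    print(f"{left:>30}  {key}  {right}")

# Which word follows the key word, and how often?
next_words = Counter(
    right.split()[0].strip(".,;:").lower() for _, _, right in hits if right
)
print(next_words.most_common())
```

With these three invented sentences, against follows the key word twice and between once, which is the same kind of pattern that shows up in the real newspaper data.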

In addition, as you can see, the key word is presented in a bit of context. This is called KWIC in concordance lingo: Key Word In Context. There are concordance programs that let you choose the amount of context presented with each "hit." If, for example, I click on the second line of this concordance list, I get the following:

OHP: The Bill, introduced after a long campaign

Some programs let you choose the amount of context you see. For example, there's a program that lets you search the entire Internet for occurrences of your key word and lets you choose from 1 to 30 words of context to the right and to the left of the key word.

OHP: WebCorp output for search term "discriminate"

Most concordance programs also allow you to click to see the full paragraph, or more, that the word occurs in. Suppose you needed 14 encounters with the word discriminate in order to really get to know it. Because this word occurs relatively infrequently in general texts, you would have to read over three and a half million words to see it 14 times. Here, using a concordance program, in just a few minutes I have found 21 sentences containing the word. Right away, you can see one enormous advantage of using a concordance program to examine a corpus: the sheer number of times a word can be seen in context.

Corpus advantages and disadvantages

While dictionaries do provide concentrated data about individual words, it really is not very easy for most students to see the patterns in which these words typically appear. Corpora, on the other hand, can make the patterns jump right out.

Let's do an experiment right now. I want you to look at a read-out from a concordance of the word impact. Let's suppose you are a student who has learned a dictionary definition about this word. What other data can you find? Look for patterns.

HO: Impact

Now, this was a "left sort". It means that the words immediately to the left of the key word are arranged in alphabetical order. There are some concordance programs that will do a double left sort. In that case, the verbs would be easier to see.

If I had done a right sort, what patterns might I have noticed?

Which sort, right or left, seems to yield the most results in the case of the word impact?
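The difference between the two sorts can be sketched in a few lines of Python. The concordance lines below are invented for the illustration, and the function names sort_left and sort_right are my own labels rather than those of any particular program. Each hit is stored as left context, key word, and right context, and the sort key is simply the word nearest the key word on the chosen side.

```python
# Each hit is (left context, key word, right context); these lines are invented.
hits = [
    ("the new law will have a significant", "impact", "on small businesses"),
    ("critics say the film made little", "impact", "at the box office"),
    ("the report measures the environmental", "impact", "of the dam"),
    ("scientists studied the", "impact", "on local wildlife"),
]


def sort_left(hits):
    """Order hits alphabetically by the word immediately left of the key word."""
    return sorted(hits, key=lambda h: (h[0].split() or [""])[-1].lower())


def sort_right(hits):
    """Order hits alphabetically by the word immediately right of the key word."""
    return sorted(hits, key=lambda h: (h[2].split() or [""])[0].lower())


def show(hits, width=40):
    """Print hits with the key words lined up in a column."""
    for left, key, right in hits:
        print(f"{left:>{width}}  {key}  {right}")


show(sort_left(hits))
print()
show(sort_right(hits))
```

In this toy data the left sort groups the modifiers (environmental, little, significant), while the right sort brings the prepositions together (at, of, on, on).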

In his book "Learning with Corpora," Guy Aston points out that "learning a language is a matter both of learning about how the language is used, and of learning how to use it, thereby developing what Widdowson (1984) terms competence and capacity." He goes on to say that "pedagogy should aim to develop learner's learning skills, leading them to greater autonomy." I believe, and my own classroom experiences have shown, that students can become autonomous investigators of language, and that their experience in using concordance programs to examine corpora instills in them a heightened awareness of language forms, both grammatical and lexical, which carries over to their examination of texts that they read without the aid of concordancers.

Turning the controls over to the students

By putting corpora into the hands of students, the focus shifts to the learner and the learning process. The students make discoveries about how the language is actually used, not how it is described by linguists. This brings us to a notion that started to develop in the early 1990s: the idea of colligation. This is the idea that every word may have its own grammatical preferences. This suggests that lexical items don't occur in lonely isolation, but rather are integral parts of larger chunks of language, idiomatic chunks that are governed by collocational factors, colligational factors, semantic, pragmatic and generic factors. Simple dictionaries usually don't indicate these factors. Instead, they isolate words, provide very little context and rarely highlight factors that influence the usage of the words. Corpora, on the other hand, don't hold anything back.

The trick is to get students to notice. If you decide to incorporate the use of corpora in your courses, you need to prepare students. If you just throw them into the ocean of words and tell them to start swimming, you are going to see a lot of resistance. In fact, most students will probably drown. Instead, you need to lay a very careful groundwork. First of all, you must make sure that students are computer literate. Nowadays, that's hardly a problem in most places where computer technology has been around for several decades. If students can surf the web, you can be sure that they can use a concordance program.

Choose a simple to use concordancer. Many concordancers are available that are either inexpensive or free. The early versions of MicroConcord are quite suitable for classroom purposes; highly sophisticated concordancers are more unwieldy, and so probably less useful. A quick search of the internet should turn up a dozen or so programs that you can either download or order on line; prices range from free to about $150 or so.

Nowadays, concordance programs are intuitive, and therefore rather easy to use. Basically, you select a corpus to examine, then type in a search term (one word, a partial word, or a string of words), then choose the type of sort (sort right, sort left), and start the search. Let's take a look at the home page of an on-line concordance program: the Virtual Language Centre:

OHP: VLC Web Concordancer

First, type in a "search string" which can be a whole word, part of a word, a word that starts with or ends with a combination of letters. For example, you can type the word prime, or the beginning of the word (prim) in order to see many forms of it (prime, primary, primarily, etc.); you can type in suffixes to see which words in the corpus share it (-er or -or); or you can type in roots, such as mort to get such words as immortal, mortuary, etc.
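The three kinds of search string can be imitated with ordinary string matching. The sketch below is only an approximation of what a concordancer does, and the word list and the function names starts_with, ends_with, and contains are invented for the illustration.

```python
# A tiny invented word list standing in for the distinct words of a corpus.
words = [
    "prime", "primary", "primarily", "primitive",
    "teacher", "actor", "water",
    "immortal", "mortuary", "mortgage", "global",
]


def starts_with(prefix, words):
    """Words beginning with the given letters, e.g. 'prim'."""
    return [w for w in words if w.startswith(prefix)]


def ends_with(suffix, words):
    """Words ending with the given letters, e.g. 'er' or 'or'."""
    return [w for w in words if w.endswith(suffix)]


def contains(root, words):
    """Words containing the given letters anywhere, e.g. 'mort'."""
    return [w for w in words if root in w]


print(starts_with("prim", words))  # prime, primary, primarily, primitive
print(ends_with("er", words))      # teacher, water
print(ends_with("or", words))      # actor
print(contains("mort", words))     # immortal, mortuary, mortgage
```

Notice that the search for -er also returns water, where the letters are not a suffix at all. Plain matching on letters cannot tell the difference, so students need to be warned to weed out such hits themselves.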

Next, choose the format: normal or gapped. In normal, the key word is shown in context. In gapped, the key word is missing. This provides a quick way for you to make a quiz for your students.

OHP: gapped concordance
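If your concordancer has no gapped format, the same kind of quiz can be produced with a few lines of Python. This is a rough sketch under my own assumptions: the function name gapped, the blank marker, and the sample sentences are all invented and do not come from the VLC program.

```python
import re


def gapped(sentences, keyword, gap="_____"):
    """Keep only sentences containing the key word and blank the word out."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [pattern.sub(gap, s) for s in sentences if pattern.search(s)]


# Invented sentences standing in for corpus lines.
sentences = [
    "The new tax had a major impact on small farms.",
    "Prices rose sharply last year.",
    "Nobody could predict the impact of the strike.",
]

quiz = gapped(sentences, "impact")
for number, line in enumerate(quiz, start=1):
    print(f"{number}. {line}")
```

The sentence without the key word is dropped, and the other two come back with a blank where impact used to be.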

Next, decide whether or not you want to print the collocates, and whether you want the collocates to be arranged in alphabetical order or by frequency of occurrence. Then select a corpus. Right now, this concordancer lists about 25 corpora, ranging from 56,000 words to over 3 million words.

Next, choose sort left or sort right. If you are searching for instances of a noun, a left sort will usually yield the associated adjectives and verbs: great impact, significant impact; to make an impact, to have an impact. Similarly, if you are searching for instances of a verb, you might want to sort right to see what prepositions, if any, are associated. Of course, you can sort left, get the results, then sort right to see what different results you might come up with.

In this program, as in most programs, you can search for collocates. Not all collocates are right next to the key word; there may be intervening words (adjectives, articles), so choose 2 or 3 words to the right or to the left.
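A collocate search with a window of two or three words can be sketched as follows. This is a simplified illustration in Python and not the method of any particular program: the function name collocates, the window parameter, and the sample text are invented, and the sketch ignores sentence boundaries, which a real tool might respect.

```python
import re
from collections import Counter


def collocates(text, keyword, window=2):
    """Count the words found within `window` words to either side of the key word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


# Invented sample text standing in for a corpus.
text = (
    "The law had a great impact on farmers. "
    "The strike made a significant impact on prices. "
    "Critics doubt the impact of the film."
)

counts = collocates(text, "impact", window=2)
by_frequency = counts.most_common()   # most frequent collocates first
alphabetical = sorted(counts.items()) # same data in alphabetical order

print(by_frequency)
print(alphabetical)
```

Because the window is two words wide, the adjectives great and significant are caught along with the article a that sits between the verb and the noun.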

Choose the widest line width; in this case, 60 columns is the widest. (The default is 40). As for the number of concordances to stop after, you should usually leave it at the default, which will usually give you all the hits that are available, unless you are concordancing words that are extremely frequent, in which case you will want to limit the number of hits.

Finally, hit search for concordances. In a few moments, you get your results.

OHP: global

In this particular program, after you examine your search results, you can use the same parameters that you have already set and immediately search for a new term. In addition, you can look up dictionary entries for the searched term.

OHP: entries for global

In this on-line dictionary, you can also click to hear the word pronounced. This is a synthesized sound, however, so it is not quite authentic.

Finally, you can click on sentence concordances for the key word.

OHP: sentences concordances for global

If you had chosen to see the collocations of your key word, you would have gotten a result like this:

OHP: collocates of impact

Reducing the difficulty of students' tasks

There are a few guidelines that you should follow, especially when beginning to use corpora in the classroom, to make students' tasks easier. First of all, you might want to pre-edit the material that students look at. You can do the searches yourself, take out lines that are irrelevant, and show the students the end product. You'll save a lot of paper if you can put the results on an OHP. Of course, this pre-filtering precludes the kind of discovery learning that can be very valuable, but as an introduction to using corpora, this method can be quite useful.

Next, choose tasks that don't demand all of the data to be classified. As an expert in the language, you will certainly notice many small things about the key words under discussion. However, if you focus your students' attention on just one or two aspects of the key words, they will be a lot more comfortable. Too much data can be overwhelming. Allow the students to experience success in little incremental steps. Start with simple collocational patterns. When students get used to finding these, introduce them to other types of recurrent semantic or pragmatic patterns.

Certainly, encourage, or even require, students to support each other. Set up pair and group work environments where they can compare their results, or look for different sets of information and report their results to each other.

Of course, be on the lookout for superstars. Once in a while, you'll get some students who have greater linguistic aptitude and ability. Challenge these students to search for more subtle information about the target words. Likewise, if you have students who are more adept at using computers, make sure they share their expertise with the other students.

Finally, allow students to be in charge of their learning. Try not to pre-determine all of the tasks. Once students develop a friendly attitude towards using corpora, try to get out of the way. Guide them, but let them suggest ways to do the exploring. The more the process is negotiated and student centered, the more motivated they will be.

Inductive and Deductive Approaches

Students can use concordancers to search corpora either inductively or deductively. In the inductive approach, students are discovering facts about the meanings and associations of words. They are noticing the facts, looking at possible patterns, and adding up the details to form generalizations. Where dictionaries give students prototypical examples of sentences using the key word, corpora give them samples of real sentences. The distinction between examples and samples needs to be made to the students. Examples are those that are pre-selected to somehow illustrate a principle in general. Examples are neat and tidy, but they reflect a certain filtering process on the part of the person who selected them. Samples, on the other hand, are just that. Chunks of language with no special filtering of any sort. When students are allowed to sample the language, they can actually come up with much more practical, and often more accurate, information about the words.

In the deductive approach, students start with a generalization, then examine the data to see how it conforms. They test the hypotheses that they already have. In actual practice, the two processes complement each other, so that students should shift between them. In this way, their knowledge of words goes through a series of progressive approximations.

This is where the selection of the corpora is vital. We know that some words are used primarily in speech, whereas others are used in writing. Or, some words are associated with a certain genre, but rarely occur in other genres. Students sometimes ask me what the difference is between British English and American English. There are corpora which are derived from oral sources (taped conversations, radio, television, songs), so you could compare spoken and written usage, just as there are corpora which derive from British publications and corpora derived from American publications.


Corpora can be examined to find out more about what words actually denote. For example, students know the common word chair denotes a physical object. But does it ever denote a non-physical object? Look at the following line taken from a concordance of the word chair:

returned to Copenhagen and applied for the university's chair in physics but was rejected

In cases where students already realize that certain words have several denotations, you can use corpora to have them discover which of the denotations is more common. Take for example the word concrete. It can refer to the stone-like man-made substance, or it can mean clear, certain, or able to be touched or felt. Of course, the relative frequency of occurrence depends on the corpus chosen.

OHP: concrete


There are three types of linguistic connotation. First, there are connotations related to social context, such as social class, regional origin, gender, age, and the relative relationship between interlocutors. For example, the expression absolutely awful might be used in upper-middle class British English, whereas that's so gross would be used by a lower class of American speakers. Connotations may also be influenced by culture. Finally, speakers may use variations of a core meaning to convey emotion.

OHP: connotation


I think the greatest contributions that corpora can make to the reading process come in two forms: getting students to recognize collocations, and giving them more practice in guessing from context. The typical academic reading course capitalizes on the interaction between top-down and bottom-up reading skills. In the past two or three decades, however, there has been a greater emphasis on developing top-down skills. Lately, it has become apparent to researchers that there needs to be more attention paid to bottom-up processing. ESL students quickly learn to tap into their pre-existing knowledge of topics to be read about. Moreover, their knowledge of the world typically exceeds their command of English. They do need more practice in their identification skills, and that, to a large extent, means having a much deeper knowledge of lexical items. By learning more and more collocational patterns, students can come to their texts with greater decoding automaticity. That is, when they encounter a known collocational chunk, they can automatically process it much faster than if they had to work out each word in the collocation individually. This, in fact, is how native speakers process incoming text. One word triggers the others in the group.

Preparation for Writing

Finally, corpora can be used to prepare students for the writing process. In addition to providing students with lists of key words in context, concordancers can also provide them with frequency lists. That is, all of the words in a particular corpus can be listed according to how frequently they occur. At first, this might not seem to be very useful. However, if you limit your corpora to those that pertain to a topic of discussion, a topic about which students will have to write, it is possible to discover certain words that associate with the topic. For instance, if the topic is capital punishment, you might have students examine a collection of articles whose theme is crime and punishment. When you get a readout of the frequently used words, you will discover, after the first 50 or 75 words, a few that seem germane to the topic. Students who heretofore have had a rather narrow range of vocabulary from which to draw may be inspired to expand their repertoire to include these other terms. For instance, instead of constantly referring to capital punishment, they would sometimes write about the death penalty. Moreover, when they investigate the use of these two terms, they can see the grammatical preferences of each. For example, which of these words uses the article the? Which verbs are used with death penalty, and which are used in association with capital punishment?

OHP: death penalty
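A frequency list, and the question about the article the, can both be approximated in a short Python sketch. Everything here is an assumption made for illustration: the function names frequency_list and preceded_by are invented, and the three sentences stand in for a real collection of articles on crime and punishment.

```python
import re
from collections import Counter


def frequency_list(text):
    """List every word in the text with its count, most frequent first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()


def preceded_by(text, phrase, word="the"):
    """Return (total hits for phrase, hits where `word` comes right before it)."""
    lowered = text.lower()
    target = re.escape(phrase)
    total = len(re.findall(r"\b" + target + r"\b", lowered))
    with_word = len(
        re.findall(r"\b" + re.escape(word) + r"\s+" + target + r"\b", lowered)
    )
    return total, with_word


# Invented sample standing in for a topic-specific corpus.
articles = (
    "The court imposed the death penalty. Many states have abolished "
    "the death penalty. Opponents of capital punishment say that "
    "capital punishment does not deter crime."
)

print(frequency_list(articles)[:5])
print(preceded_by(articles, "death penalty"))       # (2, 2)
print(preceded_by(articles, "capital punishment"))  # (2, 0)
```

In a real corpus the top of the list is taken up by function words such as the and of, which is why the topic words only start to appear further down. In this toy sample, death penalty takes the article both times and capital punishment never does, but students should check that pattern against real data rather than trust three made-up sentences.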


Vocabulary games and quizzes

Count and non-count nouns

Odd man out

Match collocations