Corpora: An Invaluable Addition to Any Translator’s Toolbelt – September 2022 General Meeting

by William Giller

It’s a scene we know all too well: “But is that actually said?” plays over and over in your mind. You pore over your carefully made glossaries. Nothing there. You consult every dictionary at your disposal. Zilch. You ask your partner, your friends. Nada. You consult a number of search engines, but the daunting number of hits leaves you more confused than before. You may have found what you were looking for, but your native speaker intuition is still telling you something is not quite right. Where, oh where, can a stuck translator turn in their time of need? And then it hits you: a corpus, of course!

Ana Lis Salotti, who holds an MA in Translation and has over 16 years of experience as a Spanish translator, offered a nifty solution to this very problem. Her presentation, entitled “The Devil Is in the Details,” explored corpora (or corpuses, if you prefer) and how they can be a translator’s saving grace.

In her presentation, Ana explained that a corpus is a vast collection of authentic written and spoken texts, amassed in a single database, that illustrates how a word or phrase is actually used in practice. The data that makes up a corpus can stem from any number of reliable sources, some obvious and others surprising: newspapers, magazines, novels, plays, recipes, TV shows, radio broadcasts, movies, and even social media posts. A corpus can be mono- or multilingual, and corpora of pluricentric languages like Spanish can often be filtered by region. Additionally, many corpora are free for anyone to access, although in some cases you may need to create a free account.

Corpora are multifunctional tools. They can show how frequently words are used and how they are used in real-world contexts. If you have a hunch that a word is old-fashioned, unnatural in a certain context, or doesn’t go well (or, to use the technical term, doesn’t collocate) with another, a corpus can confirm your hunch or prove you wrong. With a corpus, you can see what hundreds of different speakers and writers have actually said or written before.
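To make the idea of frequency concrete, here is a minimal sketch in Python of what a corpus does behind the scenes when it reports how often a word is used. The four sample sentences and the function names are invented for illustration; a real corpus holds millions of words and does this work for you.

```python
import re
from collections import Counter

# Hypothetical toy "corpus"; a real one contains millions of words.
TEXTS = [
    "While we waited, the rain stopped.",
    "She read while he cooked.",
    "Whilst I agree, I have doubts.",
    "Call me while you are in town.",
]

def word_frequencies(texts):
    """Count how many times each lowercased word appears across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def per_thousand(word, counts):
    """Occurrences of `word` per 1,000 words, so differently sized corpora can be compared."""
    total = sum(counts.values())
    return 1000 * counts[word] / total if total else 0.0

counts = word_frequencies(TEXTS)
print("while:", counts["while"], "| whilst:", counts["whilst"])
```

In this toy sample, “while” turns up more often than “whilst,” which is the kind of evidence that can back up (or overturn) a hunch that one option sounds dated or out of place.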

Ana explained that searches are best run using the “lemma” (the dictionary form) of a word, as opposed to an inflected form. For example, in a corpus that supports lemma searches, you will generally get more results if you search for walk (a lemma) rather than walks, walked, or walking (inflected “forms” of the lemma), because a lemma search matches all of those forms at once. When you use a lemma as your search term, the corpus will produce a myriad of different contexts in which your lemma has been used.
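The difference between searching for one form and searching for a lemma can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the sample sentences are invented, and the small hand-made `LEMMAS` table stands in for the automatic lemmatization that real corpora perform.

```python
import re

# Hypothetical toy "corpus".
SENTENCES = [
    "She walks to work every day.",
    "We walked along the river.",
    "They are walking home.",
    "I walk the dog at noon.",
    "He runs a small shop.",
]

# Hand-made form-to-lemma table (an assumption; real corpora lemmatize automatically).
LEMMAS = {"walk": "walk", "walks": "walk", "walked": "walk", "walking": "walk"}

def tokenize(text):
    """Split a sentence into lowercased words."""
    return re.findall(r"[a-z']+", text.lower())

def search_form(form, sentences):
    """Return only the sentences containing this exact word form."""
    return [s for s in sentences if form in tokenize(s)]

def search_lemma(lemma, sentences):
    """Return the sentences containing any form listed under this lemma."""
    return [
        s for s in sentences
        if any(LEMMAS.get(token, token) == lemma for token in tokenize(s))
    ]

print(len(search_form("walk", SENTENCES)), "hit(s) for the exact form")
print(len(search_lemma("walk", SENTENCES)), "hit(s) for the lemma")
```

The exact-form search finds only the sentence that literally contains “walk,” while the lemma search also picks up “walks,” “walked,” and “walking.”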

If you need to know which words collocate with a given word in your target language, you can also narrow your query to identify collocations. This is especially useful when you’re trying to find synonyms or when you’re unsure how a word is used in a particular context. To illustrate her point, Ana gave the collocates “court” and “hear” as an example. In English, we say that a “court hears a case.” In Spanish, however, the literal translation of “hear” isn’t used in this context: it doesn’t collocate with the Spanish word for “court.” So which word or words do collocate with the Spanish counterpart of “court”? Ana demonstrated that a quick corpus search can come to the rescue in these cases.
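A collocation query boils down to counting which words show up next to your search term. The Python sketch below illustrates this with the English side of Ana’s example. The sentences and the function name `right_collocates` are invented for illustration, and this is not how any particular corpus website is implemented.

```python
import re
from collections import Counter

# Hypothetical toy "corpus" of legal-flavored sentences.
SENTENCES = [
    "The court heard the case on Monday.",
    "The court heard arguments from both sides.",
    "The court ruled against the company.",
    "A court heard the appeal last year.",
    "The judge listened to the witness.",
]

def tokenize(text):
    """Split a sentence into lowercased words."""
    return re.findall(r"[a-z']+", text.lower())

def right_collocates(node, sentences, window=1):
    """Count the words appearing within `window` positions after `node`."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token == node:
                counts.update(tokens[i + 1 : i + 1 + window])
    return counts

# List the words that follow "court", most frequent first.
for word, n in right_collocates("court", SENTENCES).most_common():
    print(word, n)
```

Running the same kind of count on a target-language corpus, with the target-language word for “court” as the search term, would surface the verbs that native speakers actually pair with it.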

Eager to start test-driving a corpus? Ana shared links to help you get started, such as the Corpus of Contemporary American English (COCA) or the Open Parallel Corpus (OPUS), a website with links to a multitude of multilingual corpora. You can also check out Ana’s blog post to see further examples and explanations.

In summary, corpora provide examples of how language is actually being used. Dictionaries, glossaries, or search engines can sometimes prove unwieldy or even unhelpful, and our own intuition is sometimes not enough to resolve tricky translation issues. Corpora help inform our decisions by giving us examples of how the community at large is using a word or phrase in real-life situations, freeing us from having to rely on our gut instinct alone. Corpora are yet another tool in our translator toolbelt to help us achieve smooth, natural-sounding translations.

William Giller, Certified Healthcare Interpreter™ (Spanish), currently works at Stanford Medicine Children’s Health in Palo Alto, California, as a lead interpreter and translator for English, Spanish, and Portuguese. He holds a Master of Arts in Translation from the Middlebury Institute of International Studies at Monterey. Find him on LinkedIn!