A Frequency Dictionary of Spanish



A Frequency Dictionary of Spanish is an invaluable tool for all learners of Spanish, providing a list of the 5,000 most frequently used words in the language. Based on a 20-million word corpus which is evenly divided between spoken, fiction and non-fiction texts from both Spain and Latin America, the dictionary provides the user with a detailed frequency-based list plus alphabetical and part of speech indexes. All entries in the rank frequency list feature the English equivalent, a sample sentence plus an indication of major register variation. The dictionary also contains 30 thematically organized lists of frequently used words on a variety of topics. A Frequency Dictionary of Spanish aims to enable students of all levels to maximize their study of Spanish vocabulary in an efficient and engaging way.

Mark Davies is Associate Professor at the Department of Linguistics, Brigham Young University at Provo in Utah.

Routledge Frequency Dictionaries

General Editors: Anthony McEnery, Paul Rayson

Consultant Editors: Michael Barlow, Asmah Haji Omar, Geoffrey Leech, Barbara Lewandowska-Tomaszczyk, Josef Schmied, Andrew Wilson

Other books in the series: A Frequency Dictionary of German: Core vocabulary for learners

Coming soon: A Frequency Dictionary of Polish

A Frequency Dictionary of Spanish: Core vocabulary for learners

Mark Davies

First published 2006 by Routledge, 270 Madison Ave, New York, NY

Simultaneously published in the UK by Routledge, 2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN

Routledge is an imprint of the Taylor & Francis Group

© 2006 Mark Davies

This edition published in the Taylor & Francis e-library.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

British Library Cataloguing-in-Publication Data: A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data: Davies, Mark, 1963 Apr. 22. A frequency dictionary of modern Spanish / Mark Davies. p. cm. (Routledge frequency dictionaries) Includes bibliographical references and index. 1. Spanish language, Word frequency, Dictionaries. I. Title. II. Series. PC4691.D dc22

Contents

Thematic vocabulary lists
Series preface
Acknowledgments
List of abbreviations
Introduction
References
Frequency index
Alphabetical index
Part of speech index

Thematic vocabulary lists

1 Animals
2 Body
3 Food
4 Clothing
5 Transportation
6 Family
7 Materials
8 Time
9 Sports
10 Natural features and plants
11 Weather
12 Professions
13 Creating nouns
14 Diminutives and superlatives
15 Nouns: differences across registers
16 Colors
17 Opposites: frequent pairs
18 Nationalities and place adjectives
19 Adjectives with ser/estar
20 Adjectives of emotion
21 Adjectives: differences across registers
22 Verbs of movement
23 Verbs of communication
24 Use of the reflexive marker se
25 Preterit/imperfect
26 Subjunctive triggers
27 Verbs: differences across registers
28 Adverbs: differences across registers
29 New words since the 1800s
30 Word length (Zipf's Law)

Series preface

There is a growing consensus that frequency information has a role to play in language learning. Data derived from corpora allows the frequency of individual words and phrases in a language to be determined. That information may then be incorporated into language learning. In this series, the frequency of words in large corpora is presented to learners to allow them to use frequency as a guide in their learning. In providing such a resource, we are both bringing students closer to real language (as opposed to textbook language, which often distorts the frequencies of features in a language, see Ljung 1990) and providing the possibility for students to use frequency as a guide for vocabulary learning. In addition we are providing information on differences between frequencies in spoken and written language as well as, from time to time, frequencies specific to certain genres.

Why should one do this? Nation (1990) has shown that the 4,000 to 5,000 most frequent words account for up to 95 percent of a written text and the 1,000 most frequent words account for 85 percent of speech. While Nation's results were for English, they do at least present the possibility that, by allowing frequency to be a general guide to vocabulary learning, one task facing learners (to acquire a lexicon which will serve them well on most occasions most of the time) could be achieved quite easily. While frequency alone may never act as the sole guide for a learner, it is nonetheless a very good guide, and one which may produce rapid results. In short, it seems rational to prioritize learning the words one is likely to hear and use most often. That is the philosophy behind this series of dictionaries.

The information in these dictionaries is presented in a number of formats to allow users to access the data in different ways.
So, for example, if you would prefer not to simply drill down through the word frequency list, but would rather focus on verbs, the part of speech index will allow you to focus on just the most frequent verbs. Given that verbs typically account for 20 percent of all words in a language, this may be a good strategy. Also, a focus on function words may be equally rewarding: 60 percent of speech in English is composed of a mere 50 function words.

We also hope that the series provides information of use to the language teacher. The idea that frequency information may have a role to play in syllabus design is not new (see, for example, Sinclair and Renouf 1988). However, to date it has been difficult for those teaching languages other than English to use frequency information in syllabus design because of a lack of data. While English has long been well provided with such data, there has been a relative paucity of such material for other languages. This series aims to provide such information so that the benefits of the use of frequency information in syllabus design can be explored for languages other than English.

We are not claiming, of course, that frequency information should be used slavishly. It would be a pity if teachers and students failed to notice important generalizations across the lexis presented in these dictionaries. So, for example, where one pronoun is more frequent than another, it would be problematic if a student felt they had learned all pronouns when they had learned only the most frequent pronoun. Our response to such issues in this series is to provide indexes to the data from a number of perspectives. So, for example, a student working down the frequency list who encounters a pronoun can switch to the part of speech list to see what other pronouns there are in the dictionary and what their frequencies are. In short, by using the lists in combination a student or teacher should be able to focus on specific words and groups of words. Such a use of the data presented here is to be encouraged.

Tony McEnery and Paul Rayson
Lancaster, 2005

References

Ljung, M. A. (1990) A Study of TEFL Vocabulary, Stockholm: Almqvist & Wiksell International.
Nation, I. S. P. (1990) Teaching and learning vocabulary, Boston: Heinle and Heinle.
Sinclair, J. M. and Renouf, A. (1988) A lexical syllabus for language learning, in R. Carter and M. McCarthy (eds.) Vocabulary and Language Teaching, London: Longman.

Acknowledgments

I am indebted to Douglas Biber, James Jones, and Nicole Tracy from Northern Arizona University, who helped with the part of speech tagging and lemmatization of the 20 million word corpus. I am also grateful to a number of graduate students from both Illinois State University and Brigham Young University who helped with this project. From ISU I would particularly like to thank Alysse Rasmussen, Bradley Alexander, Amanda Pflum, Erin Miller, and Ardythe Woerley. From BYU I would like to thank Rossana Quiroz, Hermán Jara, Cecilia Tocaimaza, Gabriela Poletti, Stephen Mouritsen, Ben Stull, Curtis Snyder, David Staley, and Rebecca Cottrell. Finally, I am very grateful to Kathy, Spencer, Joseph, and Adam, who were so supportive as this book was being written, and to whom this book is dedicated.

Abbreviations

Abbreviation  Meaning                               Example
art           article                               1 el, la art the
adj           adjective                             888 oscuro adj dark, obscure
adv           adverb                                587 apenas adv hardly, barely
conj          conjunction                           117 aunque conj although, even though
f             feminine                              33 la pron [3rd person] (obj-f)
+fam          familiar                              136 te pron you (obj/+fam)
-fam          formal                                269 usted pron you (subj/-fam)
interj        interjection                          2337 ay interj oh no!, oh my!
m             masculine                             21 lo pron [3rd person] (obj-m)
n             neuter                                110 esto pron this (n)
nc            noun common                           1019 estudiante nc student
nf            noun feminine                         116 casa nf house
nf (el)       noun feminine (with el)               194 agua nf (el) water
nm            noun masculine                        253 libro nm book
nmf           noun masc/fem (different meanings)    4857 cometa nmf comet (m), kite (f)
nm/f          noun masc/fem (masc form given)       538 autor nm/f author
num           number                                823 doce num twelve
obj           object                                136 te pron you (obj/+fam)
dir obj       direct object                         33 la pron [3rd person] (dir obj-f)
indir obj     indirect object                       19 le pron [3rd person] (indir obj)
pl            plural                                4193 lente nmf lens [pl] glasses
prep          preposition                           48 sobre prep on top of, over, about
pron          pronoun                               191 nosotros pron we (subj)
sg            singular                              814 tú pron you (subj-sg/+fam)
subj          subject                               52 yo pron I (subj)
v             verb                                  710 sentar v to sit (down), seat
//            separates speakers in sample phrase   2172 ¿fue algo premeditado? // No; fue espontáneo.

Introduction

The value of a frequency dictionary of Spanish

What is the value of a frequency dictionary for language teachers and learners? Why not simply rely on the vocabulary lists in a course textbook? The short answer is that although a typical textbook provides some thematically-related vocabulary in each chapter (foods, illnesses, transportation, clothing, etc.), there is almost never any indication of which of these words the student is most likely to encounter in actual conversation or texts. In fact, sometimes the words are so infrequent in actual texts that the student may never encounter them again in the real world, outside of the test for that particular chapter.

While the situation for the classroom learner is sometimes bleak with regard to vocabulary acquisition, it can be equally frustrating for independent learners. These individuals may pick up a work of fiction or a newspaper and begin to work through the text word for word, looking up unfamiliar words in a dictionary. Yet such learners often have the uncomfortable suspicion that their time would be better spent if they could simply begin with the most common words in Spanish, and work progressively through the list.

Finally, frequency dictionaries can be a valuable tool for language teachers. It is often the case that students enter an intermediate language course with deficiencies in their vocabulary. In these cases, the teacher often feels frustrated, because there does not seem to be any systematic way to bring less advanced students up to speed. With a frequency dictionary, however, the teacher could assign remedial students to work through the list and fill in gaps in their vocabulary, knowing that the students are using their time in the most effective way possible.

What is in this dictionary?

This frequency dictionary is designed to meet the needs of a wide range of language students and teachers, as well as those who are interested in the computational processing of Spanish. The main index contains the 5,000 most common words in Spanish, starting with such basic words as el and de, and quickly progressing through to more intermediate and advanced words. Because the dictionary is based on the actual frequency of words in a large 20 million word corpus (collection of texts) of many different types of Spanish texts (fiction, non-fiction, and actual conversations), the user can feel comfortable that these are words that one is very likely to encounter subsequently in the real world.

In addition to providing a listing of the most frequent 5,000 words, the entries provide other information that should be of great use to the language learner. Each entry also shows the part of speech (noun, verb, etc.), a simple definition of the word in English, and an actual example of the word in context, taken from the 100 million word Corpus del Español. Finally, the entries show whether the word is more common in spoken, fiction, or non-fiction texts, so that the learner acquires greater precision in knowing exactly when and where to use the word.

Aside from the main frequency listing, there are also indexes that sort the entries by alphabetical order and part of speech. The alphabetical index can be of great value to students who, for example, want to look up a word from a short story or newspaper article and see how common the word is in general. The part of speech indexes could be of benefit to students who want to focus selectively on verbs, nouns, or some other part of speech. Finally, there are a number of thematically-related lists and lists related to common grammatical problems for beginning and intermediate students, all of which should enhance the learning experience.

The expectation, then, is that this frequency dictionary will make the efforts of a wide range of students and teachers who are involved in the acquisition of Spanish vocabulary significantly more productive.

Previous frequency dictionaries of Spanish

There have been a number of other frequency dictionaries and lists for Spanish (Buchanan 1927, Eaton 1940, Rodríguez Bou 1952, García Hoz 1953, Juilland and Chang-Rodríguez 1964, Alameda and Cuetos 1995, Sebastián, Carreiras, and Cuetos 2000), but all of these suffer from significant limitations. First, all of these frequency dictionaries are based exclusively on written Spanish, and contain no data from the spoken register. Second, five of the dictionaries (Buchanan 1927, Eaton 1940, Rodríguez Bou 1952, García Hoz 1953, Juilland and Chang-Rodríguez 1964) are based on texts from the 1950s or earlier, and are now quite outdated. Third, the two dictionaries that have been produced in the last ten years both suffer from other important limitations. Alameda and Cuetos (1995) only lists exact forms (e.g. digo, dices, dijeran) rather than lemmas (e.g. decir), and very few of the written texts that it uses are from outside of Spain. The other recent dictionary, Sebastián, Carreiras, and Cuetos (2000), exists only in electronic form and is extremely hard to acquire, especially outside of Spain.

Among the dictionaries just mentioned, most researchers recognize Chang-Rodríguez (1964) as the most complete frequency dictionary of Spanish to date. Yet because of its methodological limitations, its list of words is somewhat problematic. As mentioned, all of the texts are from nearly fifty years ago (or before), they are nearly all from Spain, and they are all written texts. In addition, due to the limitations of data collection more than forty years ago, the corpus is quite small (less than a million words), and is limited to written texts; spoken Spanish is not represented at all in the wordlist.

Because of the limitations just mentioned, the vocabulary in Chang-Rodríguez is highly skewed. For example, the word poeta is word number 309 in the frequency list, with other cases like lector (453), gloria (566), héroe (601), marqués (653), dama (696), and príncipe (737). This skewing is not limited just to nouns, but also includes what would in a normal corpus be much lower frequency verbs, like acudir (498), figurar (503), podar (1932) and malograr (2842), and the adjectives bello (612), fecundo (2376), and galán (2557). On the other hand, there are a number of what we would expect to be highly frequent words that are not in their list. For example, its list of the top 5,000 words of Spanish does not include the following words (the numbers show their placement in our list):

nouns: oportunidad 626, equipo 737, película 827, control 889, televisión 1079, rama 1161, acceso 1316, marca 1371, tratamiento 1419, experto 1453, paciente 1512, parque 1763
verbs: enfrentar 914, recuperar 967, identificar 988, controlar 1071, transmitir 1203, grabar 1449, distribuir 1504, fallar 1666, investigar 1752, quebrar 2376, apretar 2405, fumar 2472
adjectives: capaz 412, extraño 552, temprano 1201, listo 1457, ocupado 1612, probable 1842, latino 1864, sucio 1995, japonés 2171, básico 2296, moreno 2304, feo 2382, cruel 2453

Thus, while Chang-Rodríguez (1964) was quite an achievement for its time, it seems clear that forty years later it is time for a new frequency dictionary of Spanish, one based on the more advanced data collection techniques that are now available.

The corpus

In order to have an accurate listing of the top 5,000 words in Spanish, the first step is to create a robust and representative corpus of Spanish. In terms of robustness, our 20,000,000 word corpus is more than twenty times larger than the corpus used in Chang-Rodríguez (1964).
The texts were taken in large part from the 1900s portion of the Corpus del Español, which contains 100 million words of text from the 1200s to the 1900s, and which I had previously created with a grant from the US National Endowment for the Humanities.

In terms of being representative, the corpus contains a much wider collection of registers and text types than that of any previous frequency dictionary of Spanish. As we see in Table 1, two-thirds of the corpus comes from the written register, while a full one-third (6,750,000 words) comes from spoken Spanish. Approximately one half of the spoken corpus comes from transcriptions of natural speech, including 2,300,000 words in the Habla Culta corpus of conversations with speakers from eleven different countries, and 1,000,000 words from the Corpus Oral de Referencia, which contains transcripts of conversations, lectures, sermons, sports broadcasts, and many other types of spoken Spanish. The written corpus is divided in half between literature and non-literary texts, including newspaper articles, essays, encyclopedias, letters, and humanistic texts.

In addition to having a good selection of different genres, this corpus is the first to have a good balance of texts from both Latin America and Spain: approximately 43 percent of the texts come from Spain, while 57 percent come from Latin America. In terms of the time period represented, the clear majority of the texts are from the 1990s.

Table 1 Composition of 20 million word Modern Spanish corpus (no. of words in millions)

Spoken
  Spain
    1.00  España Oral [1]
    0.35  Habla Culta (Madrid, Sevilla)
    1.00  Transcripts/Interviews (congresses, press conferences, other)
    0.27  Interviews in the newspaper ABC
    0.40  Plays
  Latin America
          Habla Culta (ten countries)
    1.00  Transcripts/Interviews (congresses, press conferences, other)
    0.73  Plays

Literature
  Spain
    0.06  Novels (BV [2])
    0.00  Short stories (BV [2])
    0.19  Three novels (BYU [3])
    2.17  Mostly novels, from LEXESP [4]
  Latin America
    1.60  Novels (BV [2])
    0.87  Short stories (BV [2])
    1.11  Twelve novels (BYU [3])
          Four novels from Argentina
          Three novels from Chile

Texts
  Spain
    1.05  Newspaper ABC
    0.15  Essays in LEXESP
          Encarta encyclopedia
  Latin America
    3.00  Newspapers from six different countries
          Cartas ('letters') from Argentina
    0.30  Humanistic texts (e.g. philosophy, history) from Argentina [5]
    0.30  Humanistic texts (e.g. philosophy, history) from Chile [6]

Sources:
[1] Corpus oral de referencia de la lengua española contemporánea
[2] The Biblioteca Virtual
[3] Fifteen recent novels, acquired in electronic form from the Humanities Research Center, Brigham Young University
[4] Léxico informatizado del español
[5] From the Corpus lingüístico de referencia de la lengua española en Argentina
[6] From the Corpus lingüístico de referencia de la lengua española en Chile
Annotating the data from the corpus

In order to create a useful and accurate listing of the top 5,000 words in Spanish, the entire 20 million words of text first need to be tagged and lemmatized.

Tagging means that we assign a part of speech to each word in the corpus. In order to do this, we created a lexicon of Spanish, which contained more than 400,000 separate word forms, with their part of speech and lemma (where lemma refers to the base word or dictionary headword to which each individual form belongs). For example, the following are five word forms from the 400,000 word lexicon:

Word form / lemma / part of speech (pos)
lápices / lápiz / noun_masc_pl
tengo / tener / verb_present_1pers_sg
francesa / francés / adjective_fem_sg
pronto / pronto / adverb
doscientas / doscientos / number_fem_pl

In cases where there is just one lexicon entry for a given word form, that form is easy to annotate (e.g. tengo = tener / verb_present). Many other word forms, however, have to have more than one entry in the lexicon. For example, trabajo '(the) work, I work' can either be [lemma = trabajo, pos = noun_masc_sg] or [lemma = trabajar, pos = verb_present_1pers_sg]. Another example would be limpia 'clean, 3sg cleans', which can be either [lemma = limpio, pos = adjective_fem_sg] or [lemma = limpiar, pos = verb_present_3pers_sg]. Such is the case for thousands of different word forms. In these cases, we used rules to tag the text. For example, in the case of trabajo, the tagger uses the preceding definite article [el] to tag [el trabajo] as [lemma = trabajo, pos = noun_masc_sg], whereas it would use the preceding subject pronoun [yo] to tag [yo trabajo] as [lemma = trabajar, pos = verb_present_1pers_sg].

In many other cases, simple rules are not enough to disambiguate the different lemmas and parts of speech of a given word form, and in these cases we have used probabilistic information. For example, one of the most difficult classes of words to tag is past participles (e.g. dicho, controlado, apagado). The rule-based component of the tagger looks for a preceding form of haber 'to have' and identifies the word as the form of a verb; for example he [escrito] 'I have written' is [lemma = escribir, pos = verb_pp_masc_sg]. In a case like [periódico escrito], however, escrito can either be a past participle of the verb escribir (leí el periódico escrito ayer 'I read the newspaper (that was) written yesterday') or it can have a more adjectival-like sense ('the written newspaper, as opposed to the electronic newspaper'). In cases such as these, we looked at the total number of cases where the past participle was preceded in the corpus by ser (which suggests a passive / verbal reading) or by estar (which suggests a resultative / adjectival reading). If the cases with ser were more common with this particular past participle, then ambiguous cases like [N + Past Part] (periódico escrito) would be marked as passive/verb. The fact that all of the data was stored in a relational database made this type of probabilistic tagging and lemmatization much easier to carry out than may have been possible with linear, word-by-word annotation.

In terms of the actual process used to annotate the corpus, the following are the steps that we followed. First, I created the 400,000 word lexicon, as discussed above. Second, the entire corpus was tagged using rule-based procedures. This was carried out at Northern Arizona University under the direction of Professor Douglas Biber and with the substantial involvement of James Jones and Nicole Tracy, and was part of a separate grant that we had received from the US National Science Foundation to analyze syntactic variation in Spanish. Finally, I input this preliminary tagged and lemmatized information into a MS SQL Server database, where I cleaned up the rule-based annotation and carried out many probabilistically-based re-annotations of the data, as described above.
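The two-stage approach described above (context rules first, corpus-wide ser/estar counts as a fallback for past participles) can be illustrated with a minimal sketch. This is not the tagger that was actually used, which ran over a relational database; the toy lexicon, the short lists of verb forms, and the function names participle_counts and tag are all assumptions made for illustration.

```python
from collections import Counter

# Toy lexicon (illustrative only): word form -> candidate (lemma, pos) pairs.
LEXICON = {
    "tengo": [("tener", "verb_present_1pers_sg")],
    "trabajo": [("trabajo", "noun_masc_sg"),
                ("trabajar", "verb_present_1pers_sg")],
    "escrito": [("escribir", "verb_pp_masc_sg"),
                ("escrito", "adjective_masc_sg")],
}

# Deliberately incomplete lists of auxiliary forms, for illustration.
HABER_FORMS = {"he", "has", "ha", "hemos", "han"}
SER_FORMS = {"es", "son", "fue", "fueron", "era"}
ESTAR_FORMS = {"está", "están", "estaba", "estuvo"}


def participle_counts(tokens):
    """Count how often each word directly follows a form of ser or estar."""
    counts = Counter()
    for prev, word in zip(tokens, tokens[1:]):
        if prev in SER_FORMS:
            counts[(word, "ser")] += 1
        elif prev in ESTAR_FORMS:
            counts[(word, "estar")] += 1
    return counts


def tag(prev, word, counts):
    """Return (lemma, pos) for `word`, given the preceding token `prev`."""
    candidates = LEXICON[word]
    if len(candidates) == 1:
        return candidates[0]  # unambiguous form
    by_class = {pos.split("_")[0]: (lemma, pos) for lemma, pos in candidates}
    if "adjective" in by_class:
        # Past participle in this toy lexicon: haber + participle is verbal.
        if prev in HABER_FORMS:
            return by_class["verb"]
        # Otherwise use the corpus-wide ser/estar counts for this participle.
        if counts[(word, "ser")] > counts[(word, "estar")]:
            return by_class["verb"]
        return by_class["adjective"]
    # Noun/verb ambiguity such as "trabajo": use the preceding word.
    if prev == "yo":
        return by_class["verb"]
    if prev == "el":
        return by_class["noun"]
    return candidates[0]  # arbitrary fallback for this sketch


corpus = "el periódico fue escrito ayer . el libro fue escrito . todo está escrito".split()
counts = participle_counts(corpus)
print(tag("el", "trabajo", counts))
print(tag("yo", "trabajo", counts))
print(tag("periódico", "escrito", counts))
```

In this toy corpus escrito follows ser twice and estar once, so the ambiguous periódico escrito is resolved to the verbal lemma escribir, mirroring the decision rule in the text.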
This entire process took more than two years. We have not carried out formal tests to determine the accuracy of the part of speech tagging and lemmatization, but we have examined the annotation in detail at many different stages of the project. After the preliminary tagging, we determined which word forms belonged to two or more lemmas that were within the 20,000 most frequent lemmas in the corpus (e.g. limpia or trabajo, as mentioned above). For each one of these forms, we examined the collocations (words to the left and right) to make sure that we had annotated these forms correctly, and made any necessary adjustments. Later we went through each of the 6,000 most frequent lemmas, looked for any form of these lemmas that also appeared as a member of another lemma, and again checked the collocations and made the appropriate adjustments. Finally, we continually compared our list to that in Chang-Rodríguez, and carefully examined all of the forms of any word that was in our list but not in Chang-Rodríguez, or any word that was in their list but not in our top 5,000 words. While the tagging is not perfect, we feel confident that it is quite accurate.

Organizing and categorizing the data

Even after annotating the corpus for part of speech and lemma, as described in the previous section, there remained a number of difficult decisions regarding how the lemmas should be grouped together. In most cases, we have followed the parts of speech from Chang-Rodríguez (1964). In some cases, however, we have conflated categories that Chang-Rodríguez kept distinct. The three primary areas of difference are the following:

Noun/adjective

In many cases there is only a minor syntactic and semantic difference between nouns and adjectives in Spanish, as in the case of ella es católica 'she is (a) Catholic'. This holds true not only for religions and nationalities (él es ruso / italiano 'he is (a) Russian / (an) Italian'), but also for cases like los ricos no ayudan a los pobres 'the rich don't help the poor' or los últimos recibieron más que los primeros 'the ones who came last got more than those who came early'. In most cases, these were assigned a final part of speech of [adjective], and learners can easily apply this information to those cases where there is a more nominal sense.

Past participle

It is often very hard to disambiguate between the [passive / verbal] and [adjectival / resultative] senses of the past participle, as shown above with the example of periódico escrito.
One solution would be to simply include all past participles as part of the verbal lemma, so that organizado is listed with organizar, descrito is listed with describir, etc. Yet there are other cases where the past participle has a clearly adjectival sense, as in los niños cansados 'the tired children', un libro pesado 'a heavy book', or unos casos complicados 'some complicated cases'. Our approach has been to manually check each of the adjective entries in the dictionary that have the form of a past participle. When the majority of the occurrences of this initially-tagged form have a strongly agentive reading, that past participle is re-assigned to the verbal lemma.

Determiner/pronoun/adjective/adverb

Many frequency lists and dictionaries create fine-grained distinctions between these categories, which may be of minimal use to language learners. For example, some frequency lists and dictionaries distinguish between determiner and adjective. Yet it is probably impossible to say where the category [determiner] ends and [adjective] starts, as in cases like varios, algunos, cuyos 'several, some, whose'. As a result, we follow the lead of Chang-Rodríguez, and assign all determiners (except the articles el and la) to the category [adjective]. Yet we also depart from Chang-Rodríguez on a number of points, primarily with regard to the categorization of pronouns, adjectives, and adverbs. For example, they distinguish between the adjectival use of temprano = 'early' (fue un verano temprano 'it was an early summer') and the adverbial use (el verano llegó temprano 'summer arrived early'). While they list the word twice in the dictionary, we assume that a learner can easily apply the meaning to both cases, and simply list it once under [adjective].

Similarly, Chang-Rodríguez distinguishes between the adjectival use of todo = 'all/every' (están todos los hombres 'all the men are here') and the putative pronominal use (están todos 'everyone is here'), whereas we list todo just once, again as an [adjective]. In fact, with an atomistic division of part of speech categories, the same word can theoretically span three different parts of speech (noun, adjective, and adverb), and the question is whether to list them all separately in the dictionary. For example, Chang-Rodríguez lists menos 'less/least' three times in the dictionary: as noun (había menos de lo que queríamos 'there was less than we wanted'), adjective (había menos dinero del que queríamos 'there was less money than we wanted'), and adverb (cobraron menos que nosotros 'they charged less than us'). In our dictionary, we assume that the learner can easily apply the one meaning to the three contexts, and we accordingly conflate the three uses into the [adjective] category. Finally, we group together the masculine and feminine forms of the definite article (el/la), as both we and Chang-Rodríguez have done for all other determiners (ese/esa, otro/otra, etc.).

We should also note that there is one category of words in which we separate more lemmas than is typically done in other frequency dictionaries. Other dictionaries will often include all of the forms of a pronoun under the masculine / singular / subjective case form of the pronoun. For example, Chang-Rodríguez groups together under the one entry yo 'I' the following pronouns: me 'me', nos 'us', nosotros 'we', le/les 'to 3sg/3pl', and even se (the reflexive marker in Spanish). Because these forms are morphologically distinct and would not be readily recognized as related to yo, we include them (and similar pronouns) as their own entries.

Range, frequency, and weighting

At this point each of the 20 million words of text had been assigned to a lemma and part of speech, and for some lemmas these categories were conflated, as discussed in the previous section. The final step was to determine exactly which of these words would be included in the final list of 5,000 words. One approach would be to simply use frequency counts. For example, all lemmas that occur 240 times or more in the corpus would be included in the dictionary. Imagine, however, a case where a particular scientific term was used repeatedly in eight encyclopedia entries and six newspaper articles (for a total of fourteen segments in the non-fiction corpus), but did not appear in any works of fiction or in any of the spoken texts. Alternatively, suppose that a given word is spread throughout an entire register (spoken, fiction, non-fiction), but that it is still limited almost exclusively to that register. Should the word still be included in the frequency dictionary?
The argument could be made that we should look at more than just raw frequency counts in cases like this, and that we ought to include some measure of how well the word is spread across all of the registers in the entire corpus. As a clear example of the contrast between frequency and range, consider the following table. The words on the left have a range of at least 80, meaning that each word appears at least once in 80 or more of the 100 blocks in the corpus (each block has 200,000 words, which is 1/100th of the 20 million words in the corpus). The words on the right, on the other hand, have more limited range, and occur in fewer than 30 of the 100 blocks in the corpus. Most would easily agree that the words on the left would be more useful in a frequency dictionary, because they represent a wide range of texts and text types in the corpus.

Wide range                               Narrow range
freq  Spanish       POS  English         Spanish           POS  English          freq
236   demostración  nf   demonstration   radicalismo       n    radicalism
      desconfianza  nf   mistrust        sodio             n    sodium
      mención       nf   mention         autonómico        adj  self-governed
      innecesario   adj  unnecessary     graso             adj  fatty
      aceptable     adj  acceptable      serbio            adj  Serbian
      recepción     nf   reception       electromagnético  adj  electromagnetic
      décimo        adj  tenth           champiñón         n    mushroom
      molesto       adj  bothered        aminoácido        n    amino acid
      complicación  nf   complication    neutrón           n    neutron
      cuidadoso     adj  careful         dirigencia        n    leadership       204

A second issue deals with the relative weights assigned to the three main registers: spoken, fiction, and non-fiction. Is one register more important in terms of how well it represents what we perceive to be the most useful variety of Spanish? Consider first the following table. The words on the left occur in at least 95 percent of all of the blocks of text from the spoken part of the corpus but in less than 60 percent of the blocks from the non-fiction portion, while those on the right have wide range in non-fiction texts (at least 96 percent) but relatively poor range in spoken texts (less than 45 percent). It seemed fairly uncontroversial that the spoken list on the left represents more basic vocabulary, and so we would argue that a higher weight should be given to words that occur more in the spoken register than in the non-fiction register.

+ Range in spoken                   + Range in non-fiction
Spanish       POS   English         Spanish        POS  English
trescientos   num   three hundred   adopción       n    adoption
ti            pron  (prep+) you     incremento     nm   increment
mañana        adv   tomorrow        incorporación  nf   incorporation
poquito       adj   a little bit    asentar        v    to establish
lunes         nm    Monday          prolongado     adj  prolonged
últimamente   adv   lately          reemplazar     v    to replace
montón        nm    (a) lot         magnitud       nf   magnitude
sábado        nm    Saturday        expansión      nf   expansion
contento      adj   happy           incrementar    v    to increment
señora        nf    Ms              modalidad      nf   modality

How does the spoken register compare to the fiction register? Again, the words on the left have good range in spoken texts but more limited range in fiction, whereas the opposite is true for the words on the right. It is interesting that there are more words referring to concrete concepts in the fiction list, which probably relates to the fact that fiction includes more description than conversation does, since everything has to be spelled out explicitly. Assuming that we favor these concrete/descriptive words more, we might then give a slightly higher weighting to the fiction sub-corpus.
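The range figures behind these comparisons can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the corpus is available as a list of (register, set of lemmas) pairs, one pair per equally sized block; the function names and the toy data are hypothetical and are not taken from the dictionary's own tooling.

```python
from collections import defaultdict

def overall_range(blocks, lemma):
    """Number of blocks in which the lemma occurs at least once."""
    return sum(1 for _, lemmas in blocks if lemma in lemmas)

def register_range(blocks, lemma):
    """Share of each register's blocks that contain the lemma (0.0 to 1.0)."""
    seen = defaultdict(int)   # blocks per register that contain the lemma
    total = defaultdict(int)  # all blocks per register
    for register, lemmas in blocks:
        total[register] += 1
        if lemma in lemmas:
            seen[register] += 1
    return {register: seen[register] / total[register] for register in total}

# Toy corpus (invented): five blocks instead of the 100 described in the text.
blocks = [
    ("spoken", {"mañana", "lunes", "contento"}),
    ("spoken", {"mañana", "montón"}),
    ("fiction", {"mañana", "pañuelo"}),
    ("non-fiction", {"magnitud", "expansión"}),
    ("non-fiction", {"magnitud", "mañana"}),
]

print(overall_range(blocks, "mañana"))
print(register_range(blocks, "mañana"))
print(register_range(blocks, "magnitud"))
```

In this toy data, mañana is found in every spoken block but in only half of the non-fiction blocks, while magnitud shows the opposite pattern, which is the kind of contrast the tables are built on.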
+ Range in spoken                   + Range in fiction
Spanish        POS  English         Spanish     POS  English
trescientos    num  three hundred   arrugado    adj  wrinkled
evolución      nf   evolution       asombro     nm   amazement
cifra          nf   figure          maldito     adj  damn
eliminar       v    to eliminate    chupar      v    to suck
determinado    adj  certain         inesperado  adj  unexpected
horario        nm   schedule        puño        nm   fist
prácticamente  adv  practically     pañuelo     nm   handkerchief
plantear       v    to propound     uña         nf   fingernail
absolutamente  adv  absolutely      rubio       adj  blonde
setenta        num  seventy         hervir      v    to boil

The final calculation

After looking at the issue of range, frequency, and the weights for different registers, we created the following formula:

x = 2*(RaO2/10) + 13*(RaO1/25) + 20*(RaF/39) + 15*(RaNF/38)
  + 2*(FrO2/12,600) + 13*(FrO1/32,400) + 20*(FrF/56,600) + 15*(FrNF/41,300)

where:
RaO2, FrO2 = range, raw frequency in non-core spoken texts
RaO1, FrO1 = range, raw frequency in core spoken texts
RaF, FrF = range, raw frequency in fiction texts
RaNF, FrNF = range, raw frequency in non-fiction texts

(Note: We have divided the spoken texts into core and non-core texts. The non-core spoken texts are those texts that may have been subsequently modified and may reflect written characteristics to some degree, such as those from press conferences, political speeches, and governmental transcripts. We view these as being less valuable (and thus they have a much lower weighting) than the core texts, which represent all other spoken texts. Note also that due to rounding up at a previous stage, the range values add up to 102; in the final calculation, we subtract 2 from the total.)

As a concrete example, let's take the word alojar 'to host or accommodate'. This word occurs in the following number of blocks of text: 0 in O2 (non-core spoken), 9 in O1 (core spoken), 30 in F (fiction) and 24 in NF (non-fiction). In each case, the actual range is divided by the total number of text blocks for that register: 10 for O2, 25 for O1, 39 for F, and 38 for NF. Thus, if the word appears in every block of a given register, it will have a value of [1.00]; otherwise, the value is the fraction of that register's blocks in which the word appears. The values [2, 13, 20, 15] refer to the weighting given to each register. The values for the two spoken sub-corpora combined have a weighting of 15 percent, while it is 20 percent for fiction and 15 percent for non-fiction.

We perform similar calculations for the raw frequency in each register. The weighting between the different registers is the same, but this time we divide by a given number [12,600, 32,400, 56,600, 41,300], which represents the raw frequency of the tenth most common word in the corpus in each of those registers. (In fact, to account for the large numbers, we actually use log values for all of these raw frequency numbers.)
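The formula can be sketched in Python as follows. This is only an illustration under stated assumptions: the divisors are taken as they appear in the formula itself, a raw frequency of zero is assumed to contribute nothing (the logarithm of zero is undefined and the text does not say how that case is handled), and the subtraction of 2 is applied to the final total as the note describes. The function and constant names are hypothetical. The input figures are the range counts for alojar quoted above together with its raw frequencies [0, 20, 94, 65] from the worked example.

```python
import math

REGISTERS = ("O2", "O1", "F", "NF")
WEIGHTS = {"O2": 2, "O1": 13, "F": 20, "NF": 15}   # register weights from the formula
BLOCKS = {"O2": 10, "O1": 25, "F": 39, "NF": 38}   # blocks per register
# Raw frequency of the tenth most common word in each register (divisors in the formula).
FREQ_CEILING = {"O2": 12_600, "O1": 32_400, "F": 56_600, "NF": 41_300}
ROUNDING_CORRECTION = 2  # "we subtract 2 from the total"

def log_ratio(freq, ceiling):
    """Ratio of log values; assumption: a zero frequency contributes 0."""
    if freq <= 0:
        return 0.0
    return math.log(freq) / math.log(ceiling)

def score(ranges, freqs):
    """Weighted sum of range shares and log-frequency ratios, minus the correction."""
    range_part = sum(WEIGHTS[r] * ranges[r] / BLOCKS[r] for r in REGISTERS)
    freq_part = sum(WEIGHTS[r] * log_ratio(freqs[r], FREQ_CEILING[r]) for r in REGISTERS)
    return range_part + freq_part - ROUNDING_CORRECTION

alojar = score(
    ranges={"O2": 0, "O1": 9, "F": 30, "NF": 24},
    freqs={"O2": 0, "O1": 20, "F": 94, "NF": 65},
)
print(round(alojar, 2))  # prints 45.48
```

Under these assumptions the sketch gives about 45.48 for alojar, very close to the published score of 45.47; the small gap presumably reflects rounding at an intermediate stage. The base of the logarithm does not matter here, because each term is a ratio of two logs.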
In the case of alojar, the raw frequency values in the four registers are [0, 20, 94, 65]. Therefore, after inserting the actual data for alojar into the formula, we obtain the following (remember that we take log values for each number that is followed by [L] in the bottom line):

x = 2*(0/10) + 13*(9/25) + 20*(30/39) + 15*(24/38)
  + 2*(0[L]/12,600[L]) + 13*(20[L]/32,400[L]) + 20*(94[L]/56,600[L]) + 15*(65[L]/41,300[L])

It is this figure of [45.47] for alojar that represents its score, and determines whether the word is included in the dictionary. We simply take the top 5,000 scores, and these words are those that are included here.

While the actual formula may seem complicated, hopefully the general criteria for the inclusion of a word in the dictionary are somewhat easier to understand. First, weighting is given to all three registers (spoken, fiction, and non-fiction), and it is unlikely that a word will be included if it is common in only one of these three registers. Second, equal weighting (50 percent / 50 percent) is given to both range and raw frequency. In other words, a word must not only occur many times in the corpus; it must also be spread out well throughout the entire corpus. Third, there is a slight weighting advantage given to the fiction register, although the final weighting is still relatively equal: 30 percent spoken, 40 percent fiction, 30 percent non-fiction.

The main frequency index

Chapter 2 contains the main index in this dictionary, a rank-ordered listing of the top five thousand words (lemmas) in Spanish, starting with the most frequent word (the definite article el) and progressing through to cueva 'cave', which is number 5,000.
The following information is given for each entry:

rank frequency (1, 2, 3, ...), headword, part of speech, English equivalent, sample sentence, range count, raw frequency total, indication of major register variation

As a concrete example, let us look at the entry for bruja 'witch':

4305 bruja nf witch, hag
había una leyenda de una bruja que se montaba en una escoba
61 | 251 +f -nf

This entry shows that word number 4305 in our rank order list is [bruja], which is a feminine noun [nf] that can be translated as [witch, hag] in English. We then see an actual sentence or phrase that shows the word in context. The two following numbers show that the word occurs in sixty-one of the 100 equally-sized blocks from the corpus (i.e. the range count), and that this lemma occurs 251 times in the corpus. Finally, the [+f -nf] indicates that the word is much more common in the fiction register than would otherwise be expected, while it is less common than would otherwise be expected in the non-fiction register. Let us briefly add some additional notes to the explanation just given.

The part of speech

Remember that some categories have been conflated, such as noun/adjective with religions and nationalities (católico, americano), or adjective/pronoun (todos). With nouns, there are several different markings for gender. Most nouns are either nm (masculine; año, libro) or nf (feminine; tierra, situación). Nouns that are feminine but are preceded by the articles el and un are marked nf (el): agua, alma, while nouns that have the same form for masculine and feminine are marked nc (joven, artista). In most cases, professions are marked nm/f (autor, director), which means that only the masculine form appears in the dictionary, but the frequency statistics have been grouped together with a possible feminine form (autora, directora). Finally, a few nouns have both masculine and feminine forms (nmf), but these have different meanings (cometa = comet (m), kite (f); radio = radio set (m), radio means of communication (f)).

English equivalent

Only the most basic translations for the word are given. This is not a bilingual dictionary, which lists all possible meanings of a given word, and intermediate to advanced users will certainly want to consult such a dictionary for additional meanings. Also note that high frequency phrases in which a given word occurs are not given, except when the vast majority of all occurrences of that word occur within such a phrase. There are a handful of such words in the dictionary, and they are marked as such (e.g. 180 sin [embargo], 333 a [medida] que, 347 a [través], 1679 no [obstante], 1944 a [menudo], etc.). Finally, in most cases we have not given the special senses that the word acquires when used pronominally (i.e.
with se), although this is noted in a handful of cases where a very high percentage of the occurrences are with se, as in rendirse 'to give in' or colarse 'to slip in'.

Phrase in context

All of these phrases and sentences come from the Corpus del Español. The goal has been to choose phrases whose meaning reflects well the basic meaning of the word with the minimal number of words, and this has been more possible in some cases than in others. With invented sentences it would certainly have been possible to have concise sentences that express the core meaning very clearly, but this would have come at the expense of authenticity. Finally, in some cases the original sentence has been shortened by taking out some words whose absence does not affect the basic meaning of the phrase as a whole.

Register variation

The symbols [+o -o +f -f +nf -nf] show that the word in question has a high (+) or low (-) score (a combination of frequency and range) in the indicated register (oral, fiction, non-fiction). These symbols appear only when the word is in the top 8-9 percent or the bottom 8-9 percent of the words in that register, in terms of its frequency relative to the other two registers. Remember that there are some words that are marked [+o] that may not be as common in regular conversations as they are in our oral corpus. This is due to the fact that we have many press conferences, political speeches, and interviews with politicians in our oral corpus, although we have tried to compensate for this by giving these corpora a lower weighting (see the section on the final calculation).

Thematic vocabulary ('call-out boxes')

Placed throughout the main frequency-based index are approximately thirty call-out boxes, which serve to display in one list a number of thematically-related words. These include lists of words related to the body, food, family, weather, professions, nationalities, colors, emotions, verbs of movement and communication, and several other semantic domains.
In addition, however, we have focused on several topics in Spanish grammar that are often difficult for beginning and intermediate students. For example, there are lists that show the most common diminutives, superlatives, and derivational suffixes to form nouns, the most common verbs and adjectives that take the subjunctive, which verbs most often take the reflexive marker se, which verbs most often occur almost exclusively in the imperfect and preterit, and which adjectives occur almost exclusively with the two copular verbs ser and estar. Finally, there are even more advanced lists that compare the use of

