LIFCACH
Lista de Frecuencias de Palabras del Castellano de Chile
Word Frequency List of Chilean Spanish

Copyright (c) 2006 Scott Sadowsky & Ricardo Martínez Gamboa
Todos los derechos reservados. All Rights Reserved.
Inscripción Nº 154.198 (Chile).

La LIFCACH puede utilizarse libre y gratuitamente para fines académicos que no tengan fines
de lucro, siempre que se cite la fuente. Se prohíbe expresamente todo uso o aplicación
comercial de la LIFCACH que no cuente con el consentimiento escrito previo de los autores.

The LIFCACH may be freely used for non-profit academic purposes if properly cited. All
commercial use or application of the LIFCACH is expressly prohibited without express written
consent from the authors.


============================
=====CONTACTO / CONTACT=====
============================
s s a d o w s k y @ u d e c . c l
s s a d o w s k y @ g m a i l . c o m
r i c a r d o m a r t i n e z g @ g m a i l . c o m


===============================================================
=====CONTENIDOS DEL ARCHIVO ZIP / CONTENTS OF THE ZIP FILE=====
===============================================================

1. INFORMACIÓN SOBRE LA LIFCACH
   INFORMATION ABOUT THE LIFCACH

   README.rtf
   README.txt

   El presente archivo.
   This file.

2. LISTA DE FRECUENCIAS, POR FUENTE, EN FORMATO CSV
   FREQUENCY LIST, BY SOURCE, IN CSV FORMAT

   Sadowsky_&_Martinez_-_LIFCACH--04_No_Hapax_Logomena.csv.txt

   Este archivo contiene la lista no ponderada de las frecuencias totales (la columna Total
   Occurrences), además de las listas de frecuencias correspondientes a cada una de las 102
   fuentes individuales utilizadas.

   This file contains a non-weighted list of total frequencies (the Total Occurrences column)
   plus individual frequency lists for each of the 102 sources used.


===============================
=====ADVERTENCIA - WARNING=====
===============================

===¡La lista de frecuencias NO DEBE ABRIRSE en Microsoft Excel!===

La LIFCACH contiene 477.293 filas, pero la última versión de Excel que hemos probado (Excel
2002) sólo puede procesar las primeras 65.000 filas (aproximadamente). Sugerimos utilizar
Microsoft Access, Quattro Pro, o un software estadístico adecuado.

===DO NOT open the frequency list in Microsoft Excel!===

The LIFCACH contains 477,293 rows, while the latest tested version of Excel (Excel 2002) can
only open the first 65,000 or so rows. We suggest using Microsoft Access, Quattro Pro, or a
suitable statistics package.
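
As an alternative to the packages above, the list can be read one row at a time with a
short script, so no row limit applies. The following is only a minimal sketch: it assumes
the file is comma-delimited and has a header row, and the column names "Lemma" and "POS"
are hypothetical (only "Total Occurrences" is named in this README). The sample data is
invented for illustration.

```python
import csv
import heapq
import io


def iter_rows(handle, delimiter=","):
    """Yield one dict per data row, keyed by the header row (assumed to exist)."""
    yield from csv.DictReader(handle, delimiter=delimiter)


def top_lemmas(handle, n, lemma_col="Lemma", total_col="Total Occurrences"):
    """Return the n (lemma, total) pairs with the highest total frequency."""
    # "Lemma" is a hypothetical column name; adjust it to the real header.
    pairs = ((row[lemma_col], int(row[total_col])) for row in iter_rows(handle))
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


# Invented sample standing in for the real file.
SAMPLE = (
    "Lemma,POS,Total Occurrences,DIAR_SAN_Cuarta,LIBR_Ficcion\n"
    "casa,N,12,7,5\n"
    "cantar,V,9,4,5\n"
)

print(top_lemmas(io.StringIO(SAMPLE), 1))
```

With the real file, the handle would come from open(path, newline="", encoding=...), where
the correct encoding has to be determined from the file itself.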


=======================
=====NOTAS / NOTES=====
=======================

===1. Descripción===

La Lista de Frecuencias de Palabras del Castellano de Chile (LIFCACH) es un conjunto de 102
listas de frecuencias léxicas derivadas de los distintos subcorpora del Corpus Dinámico del
Castellano de Chile (CODICACH), un corpus del español escrito [1] contemporáneo de Chile
desarrollado por Sadowsky entre 1997 y 2002; este corpus contenía aproximadamente 450
millones de palabras a la hora de elaborar la LIFCACH (actualmente contiene alrededor de 830
millones de palabras). La LIFCACH también contempla una lista no ponderada de frecuencias
totales (la columna titulada Total Occurrences), la cual es simplemente la suma de las
frecuencias de las 102 listas individuales (en otras palabras, es la lista de las frecuencias del
CODICACH en su totalidad).

Aunque podría existir la tentación de interpretar la lista Total Occurrences como una lista
representativa del castellano de Chile en general, recomendamos encarecidamente no hacerlo.
El CODICACH es un corpus oportunista que privilegia, entre otras cosas, los medios de prensa
escritos; tal como está estructurado, no pretende ser una muestra representativa de la variante
lingüística nacional, al estilo del BNC. Sin embargo, la naturaleza modular del CODICACH y de
las 102 listas individuales de la LIFCACH permite a los investigadores utilizar una o más de
estas listas de manera independiente; combinarlas según sus propias necesidades; o ponderar
las listas individuales de la LIFCACH para así crear una nueva lista de frecuencias que sea
representativa según los criterios del investigador.

La LIFCACH contiene 477.293 lemas, derivados de aproximadamente 4,5 millones de types
extraídos de los 450 millones de palabras de texto corrido que contemplaba el CODICACH al
momento de elaborar la LIFCACH.

[1] Although the CODICACH contains two sub-corpora of oral texts, ORAL_Entrevistas_Lgtcas and ORAL_TV,
these are so small as to be of negligible impact on the overall corpus.

===Description===

The Word Frequency List of Chilean Spanish (LIFCACH) is a set of 102 frequency lists derived
from the sub-corpora of the Corpus Dinámico del Castellano de Chile (Dynamic Corpus of
Chilean Spanish, CODICACH), a corpus of contemporary written [1] Chilean Spanish developed
by Sadowsky between 1997 and 2002; this corpus contained approximately 450 million words
when the LIFCACH was created (it currently contains some 830 million words). The LIFCACH
also contains a non-weighted list of total frequencies (the Total Occurrences column), which is
simply the sum of the frequencies of the 102 individual lists (in other words, the frequency
list for the CODICACH as a whole).

While it may be tempting to take the Total Occurrences list as being representative of Chilean
Spanish as a whole, we strongly advise against this. The CODICACH is an opportunistic corpus
with a bias toward press-based sources; it does not seek to be a BNC-style representative
sampling of the language in general. The modular nature of the CODICACH and of the 102
individual LIFCACH lists, however, allows researchers to use one or more of these lists alone,
to combine them as needed, or to create their own frequency lists for Chilean Spanish by
weighting each of the LIFCACH's individual lists as they see fit.
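
The weighting described above can be sketched as follows. This is an illustration only,
not a method prescribed by the authors: the column names "Lemma" and "POS" are
hypothetical, the weights are arbitrary, and the sample rows are invented. Sources that are
not given a weight are simply left out of the result.

```python
def weighted_frequencies(rows, weights, lemma_col="Lemma", pos_col="POS"):
    """Combine per-source counts into one weighted score per (lemma, POS) pair.

    rows    -- iterable of dicts, one per CSV row (e.g. from csv.DictReader)
    weights -- dict mapping a source code to the weight chosen by the researcher
    """
    combined = {}
    for row in rows:
        key = (row[lemma_col], row[pos_col])
        # Missing or empty cells are treated as a count of zero.
        score = sum(w * int(row.get(src) or 0) for src, w in weights.items())
        combined[key] = combined.get(key, 0.0) + score
    return combined


# Invented rows; real rows would carry all 102 source columns.
rows = [
    {"Lemma": "casa", "POS": "N", "DIAR_SAN_Mercurio": "10", "LIBR_Ficcion": "3"},
    {"Lemma": "cantar", "POS": "V", "DIAR_SAN_Mercurio": "2", "LIBR_Ficcion": "4"},
]
# Arbitrary example weights: halve the newspaper, double the fiction.
weights = {"DIAR_SAN_Mercurio": 0.5, "LIBR_Ficcion": 2.0}

print(weighted_frequencies(rows, weights))
```

Because the sub-corpora differ greatly in size, a researcher may also want to normalize
each list (for example, to occurrences per million words) before weighting; that step is
not shown here.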

The LIFCACH contains 477,293 lemmas derived from the approximately 4.5 million types found
in the 450 million running words contained in the CODICACH at the time the lists were created.

[1] Although the CODICACH contains two sub-corpora of oral texts, ORAL_Entrevistas_Lgtcas and ORAL_TV,
these are so small as to be of negligible impact on the overall corpus.

===2. Elaboración de la LIFCACH===

A continuación se presentan los pasos de la creación de la LIFCACH:

i.   Se generaron listas de frecuencias de types en base a las palabras de texto corrido de
     cada uno de los 102 subcorpora del CODICACH.
ii.  Se lematizó y etiquetó con categorías gramaticales (POS) cada una de las listas de
     frecuencias de types con el programa MS-Tools v2.0 de la Universitat Politècnica de
     Catalunya (para más información sobre MS-Tools, comuníquese con Lluís Padró
     <padro@lsi.upc.es>).
iii. Se eliminaron los aproximadamente 300.000 lemas con una frecuencia de 1 (hápax
     legómenos). La eliminación de estos lemas representa un intento de establecer un
     equilibrio entre la completitud de las listas y el tamaño y procesabilidad de los archivos.
iv.  Las listas de frecuencias de lemas resultantes se incorporaron en un archivo CSV, y
     luego se calcularon las frecuencias totales.

Es preciso hacer una advertencia respecto de esta metodología. La utilización de listas de
frecuencias de types en vez de palabras de texto corrido en el proceso de lematización y
etiquetado POS surgió de una necesidad práctica relacionada con la velocidad del software y
los recursos computacionales disponibles en el momento de la elaboración de la LIFCACH. En
consecuencia, el software debió analizar palabras como canto sin disponer de la información
necesaria para determinar si una instancia dada de esta palabra correspondía al verbo cantar o
al sustantivo canto. La eliminación del contexto redujo la precisión del etiquetado y
lematización, aunque mucho menos de lo que sucedería en el caso del inglés, gracias a la
compleja morfología del castellano.

También debe notarse que el software de etiquetado POS y lematización que se utilizó está
basado en el castellano de España, un dialecto nacional que es un tanto alejado del castellano
de Chile.

Los autores están preparando un nuevo conjunto de listas de frecuencia, LIFCACH II, para
subsanar estas deficiencias.


===Creation of the LIFCACH===

The steps in creating the LIFCACH were as follows:

i.   Type frequency lists were generated from the running words of each of the 102
     sub-corpora of the CODICACH.
ii.  Each type frequency list was lemmatized and POS-tagged using the Universitat
     Politècnica de Catalunya's MS-Tools v2.0 (for more information on MS-Tools, contact
     Lluís Padró <padro@lsi.upc.es>).
iii. The approximately 300,000 lemmas with a frequency of 1 (hapax legomena) were removed.
     Eliminating these was considered an acceptable trade-off in exchange for a far more
     manageable file size.
iv.  The resulting lemma frequency lists were assembled in the attached CSV file and total
     occurrences were calculated.
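
Steps iii and iv can be illustrated with the sketch below. It is not the authors' actual
procedure: in particular, it assumes that "a frequency of 1" refers to a lemma's total
across all sources, which this README does not state explicitly. The sample counts are
invented.

```python
from collections import Counter


def total_without_hapaxes(per_source_counts, min_total=2):
    """Sum lemma counts over all sources and drop lemmas below min_total.

    per_source_counts -- dict mapping a source code to a dict of lemma -> count
    """
    totals = Counter()
    for counts in per_source_counts.values():
        totals.update(counts)  # adds the counts lemma by lemma
    # Assumption: hapax legomena are judged on the summed total, not per source.
    return {lemma: n for lemma, n in totals.items() if n >= min_total}


# Invented per-source lemma counts.
per_source = {
    "LIBR_Ficcion": {"casa": 3, "zarzamora": 1},
    "ORAL_TV": {"casa": 2},
}

print(total_without_hapaxes(per_source))
```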

An important caveat regarding this methodology must be mentioned. The use of type frequency
lists instead of running words in the POS tagging and lemmatizing process was a practical
necessity, due to the speed of the software used and the computing resources available at the
time the LIFCACH was created. As a result, the software had to analyze words such as "canto"
without the information required to decide whether a given instance was a form of the verb
"cantar" or the noun "canto". This elimination of context reduced the accuracy of the tagging
and lemmatization process, though far less than it would have with English, thanks to
Spanish's rich morphology.
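
The effect of this caveat can be seen in a toy example. The sentences and the lookup table
below are invented, and the lookup merely stands in for a context-free analysis; it does
not reproduce the behavior of MS-Tools.

```python
from collections import Counter

# Two invented sentences: "canto" is a verb form in the first, a noun in the second.
sentences = ["yo canto una canción", "el canto del gallo"]

# Building a type frequency list discards the context of each occurrence.
type_counts = Counter(word for s in sentences for word in s.split())

# Hypothetical context-free analysis: each type gets exactly one (lemma, POS).
CONTEXT_FREE_ANALYSIS = {"canto": ("canto", "N")}

lemma_counts = Counter()
for word, n in type_counts.items():
    if word in CONTEXT_FREE_ANALYSIS:
        lemma_counts[CONTEXT_FREE_ANALYSIS[word]] += n

# Both occurrences are credited to the noun; the verb "cantar" receives none.
print(type_counts["canto"], lemma_counts[("canto", "N")], lemma_counts[("cantar", "V")])
```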

It should also be noted that the lemmatizing and tagging software that was used is based on
European Spanish, a national dialect that is somewhat removed from Chilean Spanish.

The authors plan to create a new set of frequency lists, LIFCACH II, which will address these
issues.


===3. Lista de categorías gramaticales / Part of Speech List===

A continuación se presentan los códigos de categoría gramatical que se utilizan en las listas de
frecuencias.

The following are the POS codes used in the frequency lists.

CÓDIGO/CODE                   CATEGORÍA GRAMATICAL          PART OF SPEECH

AJ                            Adjetivo                      Adjective
AV                            Adverbio                      Adverb
C                             Conjunción                    Conjunction
D                             Determinante                  Determiner
I                             Interjección                  Interjection
N                             Sustantivo                    Common noun
NG                            Nombre geográfico             Toponym
NP                            Nombre propio                 Proper noun
PN                            Pronombre                     Pronoun
PP                            Preposición                   Preposition
SG                            Sigla                         Abbreviation
V                             Verbo                         Verb
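
The codes above lend themselves to a small lookup table, for instance to keep only the
rows of certain categories. In this sketch the code-to-name mapping is taken from the table
above, while the column names "Lemma" and "POS" and the sample rows are hypothetical.

```python
# POS codes and English names as listed in this README.
POS_CODES = {
    "AJ": "Adjective", "AV": "Adverb", "C": "Conjunction", "D": "Determiner",
    "I": "Interjection", "N": "Common noun", "NG": "Toponym", "NP": "Proper noun",
    "PN": "Pronoun", "PP": "Preposition", "SG": "Abbreviation", "V": "Verb",
}


def rows_with_pos(rows, wanted, pos_col="POS"):
    """Return the rows whose POS code is in `wanted`; reject unknown codes."""
    unknown = set(wanted) - set(POS_CODES)
    if unknown:
        raise ValueError(f"Unknown POS codes: {sorted(unknown)}")
    return [row for row in rows if row[pos_col] in wanted]


# Invented rows for illustration.
rows = [
    {"Lemma": "casa", "POS": "N"},
    {"Lemma": "cantar", "POS": "V"},
    {"Lemma": "Valdivia", "POS": "NG"},
]

print([row["Lemma"] for row in rows_with_pos(rows, {"N", "NG"})])
```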


===4. Listado de fuentes / List of Sources===

Cada una de las listas de frecuencias de la LIFCACH se elaboró en base a un subcorpus
distinto del CODICACH. A continuación se presentan los códigos que se utilizan para señalar
estas listas y subcorpora.

Each frequency list in the LIFCACH is derived from a different sub-corpus of the CODICACH.
The codes used to indicate these lists and sub-corpora are as follows.

CÓDIGO/CODE                          DESCRIPCIÓN/DESCRIPTION

ACAD_CCAA                            Academic Texts - Applied Sciences
ACAD_CCNN                            Academic Texts - Natural Sciences
ACAD_CCSS                            Academic Texts - Social Sciences
ACAD_Hum                             Academic Texts - Humanities
DIAR_CEN_Estrella_Valpo              Newspaper - Central Chile - Estrella de Valparaíso
DIAR_CEN_Gran_Valpo                  Newspaper - Central Chile - Gran Valparaíso
DIAR_CEN_Lider_San_Antonio           Newspaper - Central Chile - El Líder, San Antonio
DIAR_CEN_Mercurio_Valpo              Newspaper - Central Chile - El Mercurio, Valparaíso
DIAR_NOR_Estrella_Arica              Newspaper - North Chile - La Estrella, Arica
DIAR_NOR_Estrella_Iquique            Newspaper - North Chile - La Estrella, Iquique
DIAR_NOR_Estrella_Loa                Newspaper - North Chile - La Estrella, Loa
DIAR_NOR_Estrella_Norte_Antofagasta  Newspaper - North Chile - La Estrella, Antofagasta
DIAR_NOR_Mercurio_Antofagasta        Newspaper - North Chile - El Mercurio, Antofagasta
DIAR_NOR_Mercurio_Calama             Newspaper - North Chile - El Mercurio, Calama
DIAR_NOR_Nortino_Iquique             Newspaper - North Chile - El Nortino, Iquique
DIAR_SAN_Cuarta                      Newspaper - Santiago - La Cuarta
DIAR_SAN_Estrategia                  Newspaper - Santiago - Estrategia
DIAR_SAN_Firme                       Newspaper - Santiago - La Firme
DIAR_SAN_Mercurio                    Newspaper - Santiago - El Mercurio
DIAR_SAN_Metropolitano               Newspaper - Santiago - El Metropolitano
DIAR_SAN_Mostrador                   Newspaper - Santiago - El Mostrador
DIAR_SAN_Primera_Linea               Newspaper - Santiago - Primera Línea
DIAR_SAN_Primera_Pagina-El_Area      Newspaper - Santiago - Primera Página / El Área
DIAR_SAN_Segunda                     Newspaper - Santiago - La Segunda
DIAR_SAN_Tercera                     Newspaper - Santiago - La Tercera
DIAR_SAN_Ultimas_Noticias            Newspaper - Santiago - Las Últimas Noticias
DIAR_SUR_Austral_Osorno              Newspaper - South Chile - Austral, Osorno
DIAR_SUR_Austral_Temuco              Newspaper - South Chile - Austral, Temuco
DIAR_SUR_Austral_Valdivia            Newspaper - South Chile - Austral, Valdivia
DIAR_SUR_Cronica                     Newspaper - South Chile - Crónica
DIAR_SUR_El_Sur                      Newspaper - South Chile - El Sur
DIAR_SUR_Enc_BioBio                  Newspaper - South Chile - Enciclop. Bío-Bío
DIAR_SUR_Llanquihue_Pto_Montt        Newspaper - South Chile - El Llanquihue, Pto. Montt
ESPER_CartasDirector                 Personal Writings - Letters to Editor
ESPER_ForosInet                      Personal Writings - Internet Site Forums
ESPER_Clasificados                   Personal Writings - Classified Ads
ESPER_ForosMedios                    Personal Writings - Media Forums
ESPER_Usenet                         Personal Writings - Usenet
LEX_Jurisprudencia                   Legal - Jurisprudence
LEX_Leyes                            Legal - Laws
LEX_Libros                           Legal - Law Books
LEX_Misc                             Legal - Miscellaneous
LIBR_Ficcion                         Books - Fiction
LIBR_NoFiccion                       Books - Non-Fiction
OBRC_CandiaCares_DicoCoa             Reference Works - Dictionary of Coa
OBRC_GonzalezParra_ManualProvrb      Reference Works - Book of Chilean Proverbs
ORAL_Entrevistas_Lgtcas              Oral - Linguistic Interviews
ORAL_TV                              Oral - Television
PUB_Misc                             Advertising - General 1
PUB_Publicidad                       Advertising - General 2
REV_CMP_ChileTech                    Magazine - Computers - ChileTech
REV_CMP_CompuChile                   Magazine - Computers - CompuChile
REV_CMP_ComputerWorld                Magazine - Computers - ComputerWorld
REV_CMP_Informatica                  Magazine - Computers - Informática
REV_CMP_Infoweek                     Magazine - Computers - Infoweek
REV_CMP_Internet21                   Magazine - Computers - Internet21
REV_CMP_Mouse                        Magazine - Computers - Mouse
REV_DEP_All                          Magazine - Sports
REV_ESP_Capital                      Magazine - Specialty - Capital
REV_ESP_CiudadArquitectura           Magazine - Specialty - CiudadArquitectura
REV_ESP_Conicyt                      Magazine - Specialty - Conicyt Scientific
REV_ESP_CopropInmob                  Magazine - Specialty - Copropiedad Inmobiliaria
REV_ESP_DiarioSocCivil               Magazine - Specialty - Diario de la Sociedad Civil
REV_ESP_Educar                       Magazine - Specialty - Educar
REV_ESP_LemuChile                    Magazine - Specialty - LemuChile
REV_ESP_Lignum                       Magazine - Specialty - Lignum
REV_ESP_Mensaje                      Magazine - Specialty - Mensaje
REV_ESP_Notas_CESAF                  Magazine - Specialty - Notas CESAF
REV_ESP_Publimark                    Magazine - Specialty - Publimark
REV_ESP_Rev_Inf_Musical              Magazine - Specialty - Revista Musical
REV_ESP_Rev_Scielo                   Magazine - Specialty - Scielo Scientific
REV_ESP_Rev_Social                   Magazine - Specialty - Revista Social
REV_ESP_Rev_Trabajo_Social           Magazine - Specialty - Revista de Trabajo Social
REV_ESP_RevChil_Cirujia              Magazine - Specialty - Revista Chilena de Cirugía
REV_ESP_Revistas_Industriales        Magazine - Specialty - Industrial Magazines
REV_ESP_Sidhartha                    Magazine - Specialty - Siddhartha
REV_GEN_Asuntos_Publicos             Magazine - General - Asuntos Públicos
REV_GEN_Cosas                        Magazine - General - Cosas
REV_GEN_Cultura_Urbana               Magazine - General - Cultura Urbana
REV_GEN_El_Siglo                     Magazine - General - El Siglo
REV_GEN_Ercilla                      Magazine - General - Ercilla
REV_GEN_Hacer_Familia                Magazine - General - Hacer Familia
REV_GEN_Man                          Magazine - General - Man
REV_GEN_Mujer_a_mujer                Magazine - General - Mujer a mujer
REV_GEN_Nos                          Magazine - General - Nos
REV_GEN_Puerto_Paralelo              Magazine - General - Puerto Paralelo
REV_GEN_Punto_Final                  Magazine - General - Punto Final
REV_GEN_Que_Pasa                     Magazine - General - Qué Pasa
REV_GEN_Revista_ED                   Magazine - General - Revista ED
REV_GEN_Rocinante                    Magazine - General - Rocinante
REV_INF_Dirigible                    Magazine - Children's - Dirigible
REV_INF_Icarito                      Magazine - Children's - Icarito
REV_INF_Papas_Fritas                 Magazine - Children's - Papas Fritas
REV_INF_Volare                       Magazine - Children's - Volare
REV_JUV_All                          Magazine - Youth
REV_LOC_All                          Magazine - Local
RVDI_ECN_Diario_PyME                 Financial Mags & Newspapers - Diario PyME
RVDI_ECN_El_Diario                   Financial Mags & Newspapers - El Diario
RVDI_ECN_Emprendedores               Financial Mags & Newspapers - Emprendedores
RVDI_ECN_Negocios_Ambientales        Financial Mags & Newspapers - Negoc. Ambientales
SIT_INS_All                          Government Sites 1
SIT_INS_Old                          Government Sites 2
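
Because every source code begins with a category prefix (ACAD, DIAR, ESPER, LEX, and so
on), the per-source counts of a row can be collapsed into per-category totals. This is a
sketch under the assumption that each source code appears as a column header in the CSV
file; the sample row is invented.

```python
from collections import defaultdict


def source_group(code):
    """Return the category prefix of a source code, e.g. 'DIAR' for 'DIAR_SAN_Cuarta'."""
    return code.split("_", 1)[0]


def totals_by_group(row, source_codes):
    """Sum one row's per-source counts by category prefix."""
    totals = defaultdict(int)
    for code in source_codes:
        # Missing or empty cells are treated as a count of zero.
        totals[source_group(code)] += int(row.get(code) or 0)
    return dict(totals)


# Invented row with three of the 102 source columns.
row = {"DIAR_SAN_Cuarta": "4", "DIAR_SUR_El_Sur": "6", "LIBR_Ficcion": "2"}
codes = ["DIAR_SAN_Cuarta", "DIAR_SUR_El_Sur", "LIBR_Ficcion"]

print(totals_by_group(row, codes))
```

Splitting on further underscores would give finer groupings (for example DIAR_SAN for the
Santiago newspapers), although not every code has a second-level prefix.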



=====SANTIAGO, 13 MAY 2008=====