Corpora & Frequency Dictionaries
I believe strongly that the best way to understand language is by studying how it is actually used. Introspection is invaluable when it comes to forming hypotheses, but there's nothing as reliable as high-quality data when it comes to confirming or refuting them.
When I started researching Chilean Spanish, the variety that has been my focus so far, I quickly found that there were no corpora or frequency dictionaries available. So instead of grovelling, I created the Dynamic Corpus of Chilean Spanish (Codicach), which, as far as I know, is to this day the largest corpus of Spanish in the world.
That led some years later to the creation of the Word Frequency List of Chilean Spanish (Lifcach), which I also believe is the largest of its kind.
And over the last two years, I've been focusing on developing the Sociolinguistic Corpus of Spoken Chilean Spanish (Coscach), as part of the research for my Ph.D. dissertation.
Dynamic Corpus of Chilean Spanish (Codicach)
The Dynamic Corpus of Chilean Spanish (Codicach) is an electronic corpus of written Chilean Spanish. It contains about 800 million running words in 1.3 million files and 102 sub-corpora. It has been chunked, lemmatized, and tagged with POS and syntactic relationship information using Connexor's Machinese Syntax program. Search and retrieval are powered by the IMS Open Corpus Workbench.
Sociolinguistic Corpus of Spoken Chilean Spanish (Coscach)
The Sociolinguistic Corpus of Spoken Chilean Spanish, known by its Spanish acronym Coscach, is a massive electronic database of the speech of Chilean children and young adults, built with cutting-edge technology and sociolinguistic methods. At present, it contains 24-bit audio and high-quality video recordings of 131 speakers.
Word Frequency List of Chilean Spanish (Lifcach)
The Word Frequency List of Chilean Spanish (Lifcach) is a set of 102 frequency lists derived from the sub-corpora of the Dynamic Corpus of Chilean Spanish (Codicach), a corpus of contemporary written Chilean Spanish developed by Sadowsky between 1997 and 2002. The corpus contained approximately 450 million words when the Lifcach was created, and currently contains some 800 million.
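For readers unfamiliar with how a frequency list is derived from a corpus, here is a minimal sketch in Python. It is not the procedure used to build the Lifcach: the function name `frequency_list`, the sub-corpus names, and the toy sentences are all hypothetical, and the sketch assumes the text has already been split into tokens and that case differences can be ignored.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (word, count, per_million) tuples, most frequent first.

    Hypothetical helper for illustration only; it assumes `tokens` is an
    iterable of already-tokenized word forms.
    """
    # Count word forms, ignoring case (an assumption of this sketch).
    counts = Counter(token.lower() for token in tokens)
    total = sum(counts.values())
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Normalize to occurrences per million words so that lists from
    # sub-corpora of different sizes can be compared.
    return [(word, n, n * 1_000_000 / total) for word, n in ranked]


# Toy stand-ins for sub-corpora (invented data, not from Codicach).
subcorpora = {
    "press": "la casa de la playa".split(),
    "fiction": "la luna y la casa".split(),
}

# One frequency list per sub-corpus.
lists = {name: frequency_list(tokens) for name, tokens in subcorpora.items()}
```

Normalizing counts to a per-million rate is a common choice when sub-corpora differ in size, since raw counts alone would favor the larger ones.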