Frequency List Wizard
Version 1.0.0
Copyright (c) 2010 Scott Sadowsky - Licensed under the GNU GPL v3.
ssadowsky at g mail period com
http://ssadowsky.hostei.com/
Frequency List Wizard is a command-line program that does various useful things with... frequency lists. It's free software, written in Perl and licensed under the GPL v3.
Quick usage
To process a frequency list using FLW's default options, unzip the downloaded file and do the following:
Windows executable
- Copy the program to the folder where your frequency list is (or copy it to a folder that's on your path, such as C:\Windows or C:\Windows\System32, to avoid this hassle).
- Open a command prompt by hitting WINDOWS+R and typing cmd.exe (you can also type this in the Start Menu search box on Vista or later).
- In the command prompt, navigate to the directory with the frequency list using the cd command.
- Type the following: frequency-list-wizard.exe -i your-list.txt
Perl script
- Make the .pl file executable.
- Copy it to the directory where your frequency list is (or to a directory that's on your path).
- Open a terminal window and navigate to the directory with your frequency list.
- Execute the following command: ./frequency-list-wizard.pl -i your-list.txt
Run the program with the -h switch to see help and usage information.
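FLW expects its input in ISO-8859-1 (Latin-1), so a list saved as UTF-8 needs converting first. The following is a minimal Python sketch of one way to do that; it is not part of FLW, and the function name and file paths are hypothetical. Characters that have no Latin-1 equivalent are replaced with "?".

```python
def utf8_to_latin1(src_path, dst_path):
    """Re-encode a UTF-8 text file as ISO-8859-1 (Latin-1).

    Hypothetical helper, not part of FLW. Characters that cannot be
    represented in Latin-1 are written as '?'.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="latin-1", errors="replace", newline="") as dst:
        for line in src:
            dst.write(line)
```

The converted file can then be passed to FLW with the -i switch.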
Description
Frequency List Wizard's default processing mode takes a 2-column frequency list in ISO-8859-1 (Latin-1) encoding and merges all entries that differ only in capitalization (e.g. 'house', 'House' and 'HOUSE'), summing their frequencies to give the total frequency for each set of capitalization variants. This is almost certainly what is desired when working with lexical items, lemmas, etc. It then performs a reverse natural numeric sort on the results (highest frequency first) and outputs them to a text file.
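The default mode can be illustrated with a short Python sketch. This is not FLW's actual Perl code: the function name is hypothetical, the FREQ-then-WORD column order follows the meta-configuration descriptions, and using the lowercase form as the merged entry and breaking ties alphabetically are both assumptions.

```python
from collections import defaultdict

def merge_case_variants(lines, sep="\t"):
    """Sum the frequencies of entries that differ only in capitalization."""
    totals = defaultdict(int)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        freq, item = line.split(sep, 1)  # assumed column order: FREQ, WORD
        totals[item.lower()] += int(freq)  # lowercase key is an assumption
    # Highest frequency first; ties broken alphabetically (an assumption).
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

sample = ["10\thouse", "3\tHouse", "1\tHOUSE", "7\ttree"]
for item, freq in merge_case_variants(sample):
    print(f"{freq}\t{item}")
```

Here the three spellings of 'house' collapse into a single entry with a frequency of 14, which sorts ahead of 'tree' (7).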
Three-column lists (e.g. frequency + lemma + POS) can be processed using the -3c switch. This option allows identical lemmas with different POSes to be processed (and counted) separately (e.g. 'jump' (NOUN) and 'jump' (VERB)).
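A rough Python sketch of the 3-column idea follows: columns 2 and 3 are fused into a single key while counting, then split apart again. The function name is hypothetical and the details (lowercasing the lemma, tie-breaking order) are assumptions rather than a description of FLW's internals.

```python
from collections import Counter

def merge_three_column(rows):
    """rows: iterable of (freq, lemma, pos) tuples."""
    totals = Counter()
    for freq, lemma, pos in rows:
        # Temporarily treat lemma + POS as one item, so that identical
        # lemmas with different POSes are counted separately.
        totals[(lemma.lower(), pos)] += freq
    # Undo the merge: back to three columns, highest frequency first.
    merged = [(freq, lemma, pos) for (lemma, pos), freq in totals.items()]
    return sorted(merged, key=lambda r: (-r[0], r[1], r[2]))

rows = [(5, "jump", "NOUN"), (8, "Jump", "VERB"), (2, "JUMP", "NOUN")]
for freq, lemma, pos in merge_three_column(rows):
    print(freq, lemma, pos, sep="\t")
```

The two NOUN entries are summed (7) while the VERB entry (8) stays separate.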
FLW can also calculate the total number of types and tokens in the frequency list, as well as its type-token ratio. This is done by default, and the statistics are printed at the end of the processed frequency list.
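As an illustration, these statistics can be computed from a processed list as sketched below. The function name is hypothetical, and the ratio is assumed to be the conventional types divided by tokens; FLW may scale or format it differently.

```python
def type_token_stats(entries):
    """entries: list of (item, frequency) pairs from a processed list."""
    types = len(entries)                        # number of distinct items
    tokens = sum(freq for _, freq in entries)   # sum of all frequencies
    ttr = types / tokens if tokens else 0.0     # assumed definition of TTR
    return types, tokens, ttr

types, tokens, ttr = type_token_stats([("house", 14), ("tree", 7), ("bird", 4)])
print(f"TYPES={types} TOKENS={tokens} TTR={ttr:.4f}")
```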
Optionally, FLW can eliminate entries containing numerals (-nn) and/or punctuation marks (-np) from frequency lists. It can also merge certain Spanish allomorphs (y + e, o + u) into a single item (-ma) (this is FLW's only language-specific feature). All three options are activated by default, and can be deactivated with the -nonn, -nonp and -noma switches. The difference between the number of items in the source frequency list and the number actually processed after eliminating numbers and/or punctuation marks is reflected in the type and token counts shown with the --print-stats option ('INPUT_TYPES' versus 'PROCESSED_TYPES', etc.).
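The three filters can be sketched in Python roughly as follows. All names here are hypothetical. The sketch assumes that "punctuation" means ASCII punctuation only (Latin-1 marks such as '¿' and '¡' are not covered, and hyphenated items are dropped), and that 'y' and 'o' are the forms kept after merging allomorphs; FLW's own rules may differ on all of these points.

```python
import string

# variant -> form assumed to be kept after merging
ALLOMORPHS = {"e": "y", "u": "o"}

def has_digit(item):
    return any(ch.isdigit() for ch in item)

def has_punct(item):
    # ASCII punctuation only; this is an assumption.
    return any(ch in string.punctuation for ch in item)

def clean(entries, no_nums=True, no_punct=True, merge_allo=True):
    """entries: iterable of (item, frequency) pairs."""
    totals = {}
    for item, freq in entries:
        if no_nums and has_digit(item):
            continue  # e.g. "Bill7"
        if no_punct and has_punct(item):
            continue  # e.g. "a@b.com"
        if merge_allo:
            item = ALLOMORPHS.get(item, item)
        totals[item] = totals.get(item, 0) + freq
    return totals

entries = [("y", 100), ("e", 5), ("Bill7", 2), ("a@b.com", 1),
           ("o", 40), ("u", 3), ("casa", 9)]
print(clean(entries))
```

With all three filters on, seven input types become three processed types, which is the kind of gap the INPUT_TYPES and PROCESSED_TYPES counts report.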
When using the 3-column option, POS information in the third column can be pruned if it is in a Connexor-style format (e.g. '@NH N MSC SG'). The -kh (--killhead) switch will eliminate the head of the field ('@NH '), while -kt (--killtail) will eliminate the tail (' MSC SG').
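The effect of the two pruning switches on a Connexor-style field can be sketched as below. The function names are hypothetical, and the sketch assumes that the head is a single leading token starting with '@' and that the general category is the first token after it.

```python
import re

def kill_head(pos):
    """Drop a leading '@XX ' head tag, if present."""
    return re.sub(r"^@\S+\s*", "", pos)

def kill_tail(pos):
    """Keep the head tag (if any) plus the general category only."""
    tokens = pos.split()
    keep = 2 if tokens and tokens[0].startswith("@") else 1
    return " ".join(tokens[:keep])

field = "@NH N MSC SG"
print(kill_head(field))             # head removed
print(kill_tail(field))             # tail removed
print(kill_tail(kill_head(field)))  # both removed
```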
The "meta-frequency" (AKA "legomena") processing mode, activated with the -mf or -leg switches, calculates the frequency of each frequency in the list. Its output is a frequency list of frequencies -- how many items occur 1 time, 2 times, and so on.
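In Python terms, a meta-frequency list amounts to counting the values in the frequency column. The sketch below uses a hypothetical function name and sorts by ascending frequency, which is an assumption about the output order.

```python
from collections import Counter

def meta_frequencies(freqs):
    """Return (frequency, number of items with that frequency) pairs."""
    return sorted(Counter(freqs).items())

# Three hapax legomena, two dis legomena, one item occurring 5 times.
for freq, n_items in meta_frequencies([1, 1, 1, 2, 2, 5]):
    print(f"{n_items} item(s) occur {freq} time(s)")
```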
Options
-i, --input | Name of input file. MANDATORY! Must be ISO-8859-1 (Latin-1).
-o, --output | Name of output file. If not provided, a name will be automatically generated using the input file base name.
-ps, --print-stats | Calculate and print type, token and TTR statistics (DEFAULT: ON).
-mf, --meta-freq | Calculate the frequencies of each frequency in the list. In other words, generates a meta-frequency list, or list of n-legomena.
-leg, --legomena | Same as -mf or --meta-freq.
-nn, --nonums | Eliminate list entries that contain numbers (e.g. "Bill7").
-np, --nopunct | Eliminate list entries that contain punctuation (e.g. "a@b.com").
-ma, --mergeallo | Merge Spanish allomorphs (e.g. "y" and "e", "o" and "u").
-3c, --3-col | Process 3-column lists. Temporarily merges columns 2 (typically "word") and 3 ("POS", "lemma", etc.). This allows processing of identical items that have different POSes/lemmas assigned to them (e.g. "canto" (NOUN SG MSC) and "canto" (V 1SG PRES IND)). After processing, the merge is undone, giving the original number of columns.
-kh, --killhead | In lists that provide head info in the format "@NH " (e.g. Connexor output), eliminate this information, leaving only POS info in the column. Assumes that this info is in the THIRD column.
-kt, --killtail | In lists with POS info, eliminate all of this info EXCEPT the general grammatical category (e.g. "DET MSC SG" becomes "DET"). Forces --killhead.
-so, --spliton | Define the character that input file lines will be split on. The default value is \t (tab).
-d, --delimiter | Allows an alternative delimiter character to be used. This is the character that is inserted between columns in the output file. Entering "t" will produce \t. The default value is \t (tab).
-st, --spaces-split | Treat 2 or more spaces as the split character. Typically for messy lists. Care must be taken with this option, as any extraneous space can (and will) have undesirable consequences.
-db, --debug | Print debug info.
-h, --help | Show this help information.
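The splitting and delimiter options can be pictured with the following Python sketch. It is only an approximation of the behaviour described above; the function and parameter names are hypothetical and do not correspond to FLW's internals.

```python
import re

def split_line(line, split_on="\t", spaces_split=False):
    """Split one input line into columns."""
    line = line.strip("\r\n")
    if spaces_split:
        # Any run of 2 or more spaces counts as a column break.
        return re.split(r" {2,}", line.strip())
    return line.split(split_on)

def join_columns(cols, delimiter="\t"):
    """Join columns for output; a bare "t" stands for a tab."""
    if delimiter == "t":
        delimiter = "\t"
    return delimiter.join(cols)

print(split_line("12  casa grande  N", spaces_split=True))
print(split_line("12  casa  grande  N", spaces_split=True))
print(join_columns(["12", "casa"], "t"))
```

The second call shows why care is needed with space splitting: one stray double space inside 'casa  grande' yields four columns instead of three.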
Meta-configurations
-w, --words | Process frequency list as words (2 columns: FREQ, WORD).
-l, --lemmas | Process frequency list as lemmas (2 columns: FREQ, LEMMA).
-pm, --posmin | Process frequency list as minimal POS (2 columns: FREQ, POS. Kills POS head and tail).
-p, --pos | Process frequency list as partial POS (2 columns: FREQ, POS. Kills POS head, leaves tail intact).
-pf, --posfull | Process frequency list as full POS (2 columns: FREQ, POS. Leaves entire POS intact).
-sr, --synrel | Process frequency list as syntactic relationships (2 columns, deactivates potentially destructive options).
-wp, --wordpos | Process frequency list as words + POS (3 columns: FREQ, WORD, POS. Kills POS head and tail, and eliminates numbers and punctuation).
-lp, --lemmapos | Process frequency list as lemmas + POS (3 columns: FREQ, LEMMA, POS. Kills POS head and tail, and eliminates numbers and punctuation).
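One way to think of a meta-configuration is as a named bundle of the individual options. The Python sketch below records only the settings stated in the table above; the dictionary keys and the function name are hypothetical, and -w, -l and -sr are left out because the table does not spell out exactly which options they change.

```python
# Settings taken from the descriptions above; key names are hypothetical.
PRESETS = {
    "posmin":   {"columns": 2, "kill_head": True,  "kill_tail": True},
    "pos":      {"columns": 2, "kill_head": True,  "kill_tail": False},
    "posfull":  {"columns": 2, "kill_head": False, "kill_tail": False},
    "wordpos":  {"columns": 3, "kill_head": True,  "kill_tail": True,
                 "no_nums": True, "no_punct": True},
    "lemmapos": {"columns": 3, "kill_head": True,  "kill_tail": True,
                 "no_nums": True, "no_punct": True},
}

def settings_for(preset):
    """Return a copy of the option bundle for a named meta-configuration."""
    return dict(PRESETS[preset])

print(settings_for("wordpos"))
```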