Frequency List Wizard

Version 1.0.0

Copyright (c) 2010 Scott Sadowsky - Licensed under the GNU GPL v3.

ssadowsky at g mail period com
http://ssadowsky.hostei.com/

 

Frequency List Wizard is a command-line program that does various useful things with... frequency lists. It's free software, written in Perl and licensed under the GPL v3.

 

Quick usage

To process a frequency list using FLW's default options, unzip the downloaded file and do the following:

Windows executable

Perl script

Run the program with the -h switch to see help and usage information.

Description

Frequency List Wizard's default processing mode takes a 2-column frequency list in ISO-8859-1 (Latin-1) encoding and merges all entries that differ only in capitalization (e.g. 'house', 'House' and 'HOUSE'). It sums the frequencies of the merged items to give the total frequency of each set of capitalization variants, which is almost certainly what is desired when working with lexical items, lemmas, etc. It then performs a reverse natural numeric sort on the results and writes them to a text file.
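The default mode can be pictured with the following minimal Python sketch. It is not FLW's Perl code: the function name, the choice of the lowercase form as the merged entry, and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import defaultdict

def merge_case_variants(lines, split_on="\t"):
    """Sum the frequencies of entries that differ only in capitalization."""
    totals = defaultdict(int)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        freq, word = line.split(split_on, 1)
        # Assumption: the merged entry is stored under its lowercase form.
        totals[word.lower()] += int(freq)
    # Highest frequency first; ties broken alphabetically (an assumption).
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

sample = ["10\thouse", "3\tHouse", "1\tHOUSE", "7\ttree"]
merged = merge_case_variants(sample)
for word, freq in merged:
    print(f"{freq}\t{word}")
```

To try this on a real list, the lines could be read with open(path, encoding="latin-1"), which matches the ISO-8859-1 input that FLW expects.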

Three-column lists (e.g. frequency + lemma + POS) can be processed using the -3c switch. This option allows identical lemmas with different POSes to be processed (and counted) separately (e.g. 'jump' (NOUN) and 'jump' (VERB)).
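A rough Python sketch of the 3-column idea follows. It keys each entry on the pair (lemma, POS) instead of physically merging and re-splitting columns as FLW is described as doing; the function name and the decision to leave the POS column's capitalization untouched are assumptions.

```python
from collections import defaultdict

def merge_three_column(rows):
    """rows: iterable of (freq, lemma, pos) tuples."""
    totals = defaultdict(int)
    for freq, lemma, pos in rows:
        # Keying on (lemma, POS) keeps 'jump' NOUN apart from 'jump' VERB.
        totals[(lemma.lower(), pos)] += freq
    merged = [(freq, lemma, pos) for (lemma, pos), freq in totals.items()]
    # Highest frequency first, then alphabetical (tie-break is an assumption).
    return sorted(merged, key=lambda r: (-r[0], r[1], r[2]))

rows = [(5, "jump", "NOUN"), (2, "Jump", "NOUN"), (9, "jump", "VERB")]
result = merge_three_column(rows)
for freq, lemma, pos in result:
    print(f"{freq}\t{lemma}\t{pos}")
```

The two NOUN rows are summed to 7, while the VERB row keeps its own count of 9.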

FLW can also calculate the total number of types and tokens in the frequency list, as well as its type-token ratio. This is done by default, and the results are printed at the end of the processed frequency list.
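As a small worked example, the statistics can be computed as below. This uses the conventional definition of the type-token ratio (types divided by tokens); the function name is hypothetical, and how FLW rounds or formats these values is not shown here.

```python
def type_token_stats(entries):
    """entries: list of (word, freq) pairs from a processed frequency list."""
    types = len(entries)                        # number of distinct items
    tokens = sum(freq for _, freq in entries)   # total number of occurrences
    ttr = types / tokens if tokens else 0.0     # avoid dividing by zero
    return types, tokens, ttr

stats = type_token_stats([("house", 14), ("tree", 7), ("bird", 4)])
print("types=%d tokens=%d ttr=%.2f" % stats)
```

With 3 types and 25 tokens, the ratio is 3 / 25 = 0.12.
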
Optionally, FLW can eliminate entries containing numerals (-nn) and/or punctuation marks (-np) from frequency lists. It can also merge certain Spanish allomorphs (y + e, o + u) into a single item (-ma) (this is FLW's only language-specific feature). All three options are activated by default, and can be deactivated with the -nonn, -nonp and -noma switches. The difference between the number of items in the source frequency list and the number actually processed after eliminating numbers and/or punctuation marks is reflected in the type and token counts shown with the --print-stats option ('INPUT_TYPES' versus 'PROCESSED_TYPES', etc.).
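The filtering and allomorph-merging steps can be sketched in Python as follows. Several details are assumptions: that punctuation means ASCII punctuation only, that each allomorph is folded into the first member of its pair ('e' into 'y', 'u' into 'o'), and all function and parameter names.

```python
import string
from collections import defaultdict

# Assumption: each allomorph is folded into the first member of its pair.
ALLOMORPHS = {"e": "y", "u": "o"}

def has_digit(word):
    return any(ch.isdigit() for ch in word)

def has_punct(word):
    # Assumption: ASCII punctuation only; FLW's exact character set may differ.
    return any(ch in string.punctuation for ch in word)

def clean(entries, drop_numbers=True, drop_punct=True, merge_allo=True):
    """entries: list of (word, freq) pairs. Returns a dict of word -> freq."""
    totals = defaultdict(int)
    for word, freq in entries:
        if drop_numbers and has_digit(word):
            continue
        if drop_punct and has_punct(word):
            continue
        if merge_allo:
            word = ALLOMORPHS.get(word, word)
        totals[word] += freq
    return dict(totals)

entries = [("y", 100), ("e", 8), ("o", 50), ("u", 3),
           ("Bill7", 2), ("a@b.com", 1), ("casa", 20)]
cleaned = clean(entries)
print(len(entries), "input types ->", len(cleaned), "processed types")
```

In this sample, seven input types become three processed types, which is the kind of gap the input and processed counts in the statistics reflect.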

When using the 3-column option, POS information in the third column can be pruned if it is in a Connexor-style format (e.g. '@NH N MSC SG'). The -kh (--killhead) switch will eliminate the head of the field ('@NH '), while -kt (--killtail) will eliminate the tail (' MSC SG').
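The effect of the two pruning switches on a Connexor-style field can be illustrated with this Python sketch. It assumes the head is a single leading token starting with '@' and that fields are separated by single spaces; the function names are hypothetical.

```python
def kill_head(pos):
    """Remove a leading '@...' head token and the space that follows it."""
    if pos.startswith("@"):
        _head, _sep, rest = pos.partition(" ")
        return rest
    return pos

def kill_tail(pos):
    """Keep any head token plus the first category token; drop the rest."""
    kept = []
    for token in pos.split(" "):
        kept.append(token)
        if not token.startswith("@"):
            break  # the first non-head token is the general category
    return " ".join(kept)

field = "@NH N MSC SG"
print(kill_head(field))             # head removed
print(kill_tail(field))             # tail removed
print(kill_tail(kill_head(field)))  # both removed
```

Applied to '@NH N MSC SG', removing the head leaves 'N MSC SG', removing the tail leaves '@NH N', and removing both leaves 'N'.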

The "meta-frequency" (AKA "legomena") processing mode, activated with the -mf or -leg switches, calculates the frequency of each frequency in the list. Its output is a frequency list of frequencies: how many items occur 1 time, 2 times, and so on.
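A meta-frequency list is easy to picture with a short Python sketch. The function name and the ascending sort order of the output are assumptions; FLW's own output order may differ.

```python
from collections import Counter

def meta_frequencies(freqs):
    """Count how many items occur once, twice, and so on."""
    counts = Counter(freqs)
    # Assumption: output ordered by ascending frequency value.
    return sorted(counts.items())

freqs = [1, 1, 1, 2, 2, 5]   # the frequency column of a small list
meta = meta_frequencies(freqs)
for freq, n_items in meta:
    print(f"{n_items} item(s) occur {freq} time(s)")
```

Here three items are hapax legomena (frequency 1), two occur twice, and one occurs five times.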

Options

   
   
-i, --input   Name of input file. MANDATORY! Must be ISO-8859-1 (Latin-1).
-o, --output   Name of output file. If not provided, a name will be automatically generated using the input file base name.
-ps, --print-stats   Calculate and print type, token and TTR statistics (DEFAULT: ON).
-mf, --meta-freq   Calculate the frequencies of each frequency in the list. In other words, generates a meta-frequency list, or list of n-legomena.
-leg, --legomena   Same as -mf or --meta-freq.
-nn, --nonums   Eliminate list entries that contain numbers (e.g. "Bill7").
-np, --nopunct   Eliminate list entries that contain punctuation (e.g. "a@b.com").
-ma, --mergeallo   Merge Spanish allomorphs (e.g. "y" and "e", "o" and "u").
-3c, --3-col   Process 3-column lists. Temporarily merges columns 2 (typically "word") and 3 ("POS", "lemma", etc.). This allows processing of identical items that have different POSes/lemmas assigned to them (e.g. "canto" (NOUN SG MSC) and "canto" (V 1SG PRES IND)). After processing, the merge is undone, giving the original number of columns.
-kh, --killhead   In lists that provide head info in the format "@NH " (e.g. Connexor output), eliminate this information, leaving only POS info in the column. Assumes that this info is in the THIRD column.
-kt, --killtail   In lists with POS info, eliminate all of this info EXCEPT the general grammatical category (e.g. "DET MSC SG" becomes "DET"). Forces -killtail.
-so, --spliton   Define the character that input file lines will be split on. The default value is \t (tab).
-d, --delimiter   Allows an alternative delimiter character to be used. This is the character that is inserted between columns in the output file. Entering "t" will produce \t. The default value is \t (tab).
-st, --spaces-split   Treat 2 or more spaces as the split character. Typically for messy lists. Care must be taken with this option, as any extraneous space can (and will) have undesirable consequences.
-db, --debug   Print debug info.
-h, --help   Show this help information.
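The splitting and delimiter options above can be sketched in Python as follows. This is only an illustration of the described behaviour, with hypothetical function and parameter names; it is not FLW's implementation.

```python
import re

def split_line(line, split_on="\t", spaces_split=False):
    """Split one input line into columns."""
    line = line.rstrip("\r\n")
    if spaces_split:
        # Treat any run of 2 or more spaces as a single column separator.
        return re.split(r" {2,}", line)
    return line.split(split_on)

def join_columns(columns, delimiter="\t"):
    """Join output columns; a bare 't' is taken to mean a tab."""
    if delimiter == "t":
        delimiter = "\t"
    return delimiter.join(columns)

clean_line = split_line("12   de la  N", spaces_split=True)
messy_line = split_line("  12  casa", spaces_split=True)
print(clean_line)
print(messy_line)
print(join_columns(["12", "casa"], "t"))
```

The second call shows why stray spaces are risky: the leading spaces produce an empty first column, which shifts every other column one place to the right.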
   

Meta-configurations

   
   
-w, --words   Process frequency list as words (2 columns: FREQ, WORD).
-l, --lemmas   Process frequency list as lemmas (2 columns: FREQ, LEMMA).
-pm, --posmin   Process frequency list as minimal POS (2 columns: FREQ, POS. Kills POS head and tail).
-p, --pos   Process frequency list as partial POS (2 columns: FREQ, POS. Kills POS head, leaves tail intact).
-pf, --posfull   Process frequency list as full POS (2 columns: FREQ, POS. Leaves entire POS intact).
-sr, --synrel   Process frequency list as syntactic relationships (2 columns, deactivates potentially destructive options).
-wp, --wordpos   Process frequency list as words + POS (3 columns: FREQ, WORD, POS. Kills POS head and tail, and eliminates numbers and punctuation).
-lp, --lemmapos   Process frequency list as lemmas + POS (3 columns: FREQ, LEMMA, POS. Kills POS head and tail, and eliminates numbers and punctuation).