Frequency List Wizard

Version 1.2.0

Frequency List Wizard is a command-line program that does various useful things with... frequency lists. It's free software, written in Perl and licensed under the GPL v3.

Download from GitHub

Older versions

Quick usage info

To process a frequency list using FLW's default options, unzip the downloaded file and do the following:

Windows executable

Copy the program to the folder where your frequency list is (or copy it to a folder that's on your path, such as C:\Windows or C:\Windows\System32, to avoid this hassle).
Open a command prompt by hitting WINDOWS+R and typing cmd.exe (you can also type this in the Start Menu search box on Vista or later).
In the command prompt, navigate to the directory with the frequency list using the cd command.
Type the following: frequency-list-wizard.exe -i your-list.txt

Perl script

Make the .pl file executable.
Copy it to the directory where your frequency list is (or to a directory that's on your path).
Open a terminal window and navigate to the directory with your frequency list.
Execute the following command:
- GNU/Linux: ./frequency-list-wizard.pl -i your-list.txt
- Windows: perl frequency-list-wizard.pl -i your-list.txt

Run the program with the -h switch to see help and usage information.

Description

Frequency List Wizard's default processing mode takes a 2-column frequency list in ISO-8859-1 (Latin-1) encoding, merges all entries that vary only by their capitalization (e.g. house, House and HOUSE), and sums the frequencies of each of these items to give you the total frequency per set of variant capitalizations (which is almost certainly what is desired when working with lexical items, lemmas, etc.). It performs a reverse natural numeric sort on the results (1000, 200, 30, 1 instead of 30, 200, 1000, 1) and outputs them to a text file.
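The default mode described above can be sketched roughly as follows. This is a Python illustration of the described behaviour, not FLW's actual Perl code; the function name `merge_case_variants` and the choice of lower case as the merged form are assumptions.

```python
from collections import defaultdict

def merge_case_variants(lines, sep="\t"):
    """Sum the frequencies of entries that differ only in capitalization."""
    totals = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        freq, word = line.split(sep, 1)
        # Assumption: the merged entry is keyed (and printed) in lower case.
        totals[word.lower()] += int(freq)
    # Highest frequency first; ties are broken alphabetically.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

sample = ["30\thouse", "200\tHouse", "1000\tthe", "1\tHOUSE"]
for word, freq in merge_case_variants(sample):
    print(f"{freq}\t{word}")
```

With this sample, house, House and HOUSE collapse into a single entry with a frequency of 231, listed after the entry with frequency 1000.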

Three-column lists (e.g. frequency + lemma + POS) can be processed using the -3c switch. This option allows identical lemmas with different POSes to be processed (and counted) separately (e.g. jump (NOUN) and jump (VERB)).
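One way to picture the three-column behaviour is to count on the pair (item, POS) instead of the item alone. The sketch below is a hypothetical Python illustration; `merge_three_column` and the lower-casing of the item are assumptions, not FLW's code.

```python
from collections import Counter

def merge_three_column(rows):
    """Sum frequencies per (item, POS) pair, ignoring capitalization of the item."""
    counts = Counter()
    for freq, item, pos in rows:
        # Assumption: only the item is case-folded; the POS tag is kept as-is.
        counts[(item.lower(), pos)] += freq
    return counts

rows = [(12, "jump", "NOUN"), (40, "Jump", "VERB"), (3, "JUMP", "NOUN")]
for (item, pos), freq in merge_three_column(rows).most_common():
    print(freq, item, pos, sep="\t")
```

Here the noun and verb readings of jump stay separate, while the capitalization variants of each reading are still merged.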

FLW can also calculate the total number of types and tokens in the frequency list, as well as its type-token ratio. This is done by default, and the statistics are printed at the end of the processed frequency list.
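For reference, these statistics can be computed from a list of (item, frequency) pairs as in the Python sketch below. It assumes the usual definition of the type-token ratio (types divided by tokens); whether FLW prints it as a fraction or a percentage is not stated here, and the function name is hypothetical.

```python
def type_token_stats(entries):
    """Return (types, tokens, ttr) for a list of (item, frequency) pairs."""
    types = len(entries)                        # number of distinct items
    tokens = sum(freq for _, freq in entries)   # sum of all frequencies
    # Assumption: TTR is types / tokens, with 0.0 for an empty list.
    ttr = types / tokens if tokens else 0.0
    return types, tokens, ttr

types, tokens, ttr = type_token_stats([("the", 1000), ("house", 231)])
print(types, tokens, round(ttr, 6))
```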

Optionally, FLW can eliminate entries containing numerals (using -nn) and/or punctuation (using -np) from frequency lists. It can also merge certain Spanish allomorphs (y + e, o + u) into a single item (-ma) (this is FLW's only language-specific feature). All three options are activated by default, and can be deactivated with the -nonn, -nonp and -noma switches. The difference between the number of items in the source frequency list and the number actually processed after eliminating numbers and/or punctuation marks is reflected in the type and token counts shown with the --print-stats option (INPUT_TYPES versus PROCESSED_TYPES, etc.).
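The numeral and punctuation filters can be pictured with the following Python sketch. It is an illustration only: `keep_entry` is a hypothetical name, and treating ASCII punctuation (Python's `string.punctuation`) as the set of punctuation marks is an assumption, since the exact character set FLW uses is not specified here.

```python
import string

def keep_entry(word, no_nums=True, no_punct=True):
    """Return False for entries that the -nn / -np style filters would drop."""
    if no_nums and any(ch.isdigit() for ch in word):
        return False
    # Assumption: "punctuation" means ASCII punctuation, including hyphens.
    if no_punct and any(ch in string.punctuation for ch in word):
        return False
    return True

for word in ["house", "Bill7", "a@b.com"]:
    print(word, keep_entry(word))
```

Entries rejected this way are the ones that would account for the difference between the input and processed type and token counts.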

When using the 3-column option, POS information in the third column can be pruned if it is in a Connexor-style format (e.g. @NH N MSC SG). The -kh (--killhead) switch eliminates the head of the field (the first block of characters plus the first space; in the example tag, this eliminates @NH), while -kt (--killtail) eliminates the tail (everything between the second space and the end of the POS field; in the example tag, MSC SG). To process differently formatted POS information, use only the --posfull option.
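The head and tail pruning described above can be sketched in Python as follows. The function `prune_pos` is hypothetical, and leaving tags without any space untouched is an assumption; the sketch follows the options list in letting tail removal force head removal.

```python
def prune_pos(tag, kill_head=False, kill_tail=False):
    """Prune a Connexor-style tag such as '@NH N MSC SG'."""
    if kill_tail:
        # Drop everything from the second space onwards.
        tag = " ".join(tag.split(" ", 2)[:2])
        kill_head = True  # killing the tail forces killing the head
    if kill_head:
        head, sep, rest = tag.partition(" ")
        # Assumption: a tag with no space has no head to remove.
        tag = rest if sep else tag
    return tag

print(prune_pos("@NH N MSC SG", kill_head=True))
print(prune_pos("@NH N MSC SG", kill_tail=True))
```

For the example tag, removing the head leaves N MSC SG, and removing the tail (with the head) leaves N.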

The "meta-frequency" (AKA "legomena") processing mode, activated with the -mf or -leg switches, calculates the frequency of each frequency in the list. Its output is a frequency list of frequencies: how many items occur 1 time, 2 times, and so on.
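In other words, the frequencies themselves are counted. A minimal Python sketch of the idea (the function name and the ascending output order are assumptions, not FLW's implementation):

```python
from collections import Counter

def meta_frequencies(freqs):
    """Count how many items occur with each frequency (n-legomena)."""
    # Assumption: results are listed from the lowest frequency upwards.
    return sorted(Counter(freqs).items())

freqs = [1, 1, 1, 2, 2, 5]  # frequencies taken from a frequency list
for freq, n_items in meta_frequencies(freqs):
    print(f"{n_items} item(s) occur {freq} time(s)")
```

In this sample, three items occur once (hapax legomena), two items occur twice, and one item occurs five times.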

Options

-i, --input		Name of the input frequency list file. MANDATORY! Must be ISO-8859-1 (Latin-1).
-o, --output		Name of the output file. If not provided, a name will be automatically generated using the input file base name.
-ps, --print-stats		Calculate and print type, token and TTR statistics to output file (DEFAULT: ON).
-mf, --meta-freq		Calculate the frequencies of each frequency in the list. In other words, generates a meta-frequency list, or list of n-legomena.
-leg, --legomena		Same as -mf or --meta-freq.
-nn, --nonums		Eliminate frequency list entries that contain numbers (e.g. Bill7).
-np, --nopunct		Eliminate frequency list entries that contain punctuation (e.g. a@b.com).
-ma, --mergeallo		Merge Spanish allomorphs (e.g. "y" and "e", "o" and "u").
-3c, --3-col		Process 3-column frequency lists. This allows intelligent handling of identical items that have different POSes/lemmas assigned to them (e.g. canto (NOUN SG MSC) and canto (V 1SG PRES IND)).
-kh, --killhead		In frequency lists that provide syntactic info in the format @NH, eliminate this information, leaving only POS info (e.g. Connexor). Assumes that this info is in the THIRD column.
-kt, --killtail		In frequency lists with POS info, eliminate all of this info EXCEPT the general grammatical category (e.g. DET MSC SG becomes DET). Forces --killhead.
-so, --spliton		Define the character that input frequency list columns will be split on. The default value is \t (tab).
-d, --delimiter		Allows an alternative column delimiter character to be specified in the output file. This is the character that is inserted between columns in the output file. Entering t will produce \t. The default value is \t (tab).
-st, --spaces-split		Treat 2 or more spaces in the input frequency list as the split character. Typically for messy lists. Care must be taken with this option, as any extraneous space can (and will) have undesirable consequences.
-db, --debug		Show debug info.
-h, --help		Show FLW's help information.
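To illustrate the difference between splitting on a single character (--spliton) and splitting on runs of spaces (--spaces-split), here is a small Python sketch. The function `split_columns` and its parameters are hypothetical and only mimic the behaviour described in the options above.

```python
import re

def split_columns(line, split_on="\t", spaces_split=False):
    """Split one input line into columns."""
    line = line.rstrip("\n")
    if spaces_split:
        # Any run of 2 or more spaces counts as one column boundary.
        return re.split(r" {2,}", line.strip())
    return line.split(split_on)

print(split_columns("120\thouse"))
print(split_columns("120   New York", spaces_split=True))
print(split_columns("120  New  York", spaces_split=True))
```

The last call shows why care is needed: one extra space inside an entry turns a 2-column line into three columns.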

Meta-configurations

-w, --words		Process frequency list as words (2 columns: FREQ, WORD).
-l, --lemmas		Process frequency list as lemmas (2 columns: FREQ, LEMMA).
-pm, --posmin		Process frequency list as minimal POS (2 columns: FREQ, POS. Kills POS head and tail).
-p, --pos		Process frequency list as partial POS (2 columns: FREQ, POS. Kills POS head, leaves tail intact).
-pf, --posfull		Process frequency list as full POS (2 columns: FREQ, POS. Leaves entire POS intact).
-sr, --synrel		Process frequency list as syntactic relationships (2 columns, deactivates potentially destructive options).
-wp, --wordpos		Process frequency list as words + POS (3 columns: FREQ, WORD, POS. Kills POS head and tail, and eliminates numbers and punctuation).
-lp, --lemmapos		Process frequency list as lemmas + POS (3 columns: FREQ, LEMMA, POS. Kills POS head and tail, and eliminates numbers and punctuation).

Older versions

Windows executable: Local - Mirror
Perl script / source code: Local - Mirror
