This section describes an awk program for presenting a text file
in the form of a KWIC index. When applied to exatext1
the result would contain the following:
                               abstract. In the top right han
considers appropriate for this abstract. When the indexer
and-alone product by technical abstracters on their home
agencies who publish technical abstracts from a wide collecti
s to allow researchers to find abstracts of publications that
                               abstracts to allow researchers
             short, to try and achieve full text interpretati
lters to collect messages from across the world, scan
                               actually appear in the documen
             Indexers can also add new keywords which they th
                               against descriptor assignments
thout human intervention. News agencies
                               agencies who publish technical
 The `Terms' menu contains all allowable terms for a part
ks on that particular keyword, all occurrences of
by a robust noun-group parser. All the documents in a large
                            In all these applications, the
             as unit occurs at all. This gives an estimate of
                  abstracts to allow researchers to find abst
written by clinicians, both to allow the compilation of perfo
The programming task of making a KWIC index is described in great detail in chapter 5 of Aho, Kernighan and Weinberger (1988). It turns out that the UNIX pipeline approach is amazingly well-suited to this task once the correct decomposition of the problem has been found. The key insight is that the following three-step decomposition of the problem is good: first, rotate each input line so that every word in turn comes to the front, with the text that originally preceded it kept after a tab; second, sort the rotated lines; third, unrotate the sorted lines for display, re-aligning each keyword. The sorting is done by the standard UNIX sort program, so the two remaining steps can be written in awk in two small programs which fit on half a page.
Admittedly the programs
pack a lot of awk lore into
a small space, but because they are so short they can be explained
in detail.
The rotate program looks as follows (also available as rotate.awk):
1 { print $0
2   for(i = length($0); i > 0; i--)
3     if(substr($0,i,1) == " ")
4       print substr($0,i+1) "\t" substr($0,1,i-1)
5 }

Again, the line numbers are only there to allow discussion. The first line prints out each input line in its original order (because $0 refers to the whole input line). Line 2 is a for loop which counts the variable i down from the length (in characters) of the input to 1, inclusive. Line 3 happens for every value of i. It tests whether the substring of length 1 at position i is a single blank. If it is, line 4 is executed, which prints out the string formed by concatenating the following things: the part of the line after the blank (substr($0,i+1)), a single tab character ("\t"), and the part of the line before the blank (substr($0,1,i-1)).
If the input is exatext1, then the output of the first stage is as follows:
The HCRC Language Technology Group (LTG) is a technology transfer
transfer	The HCRC Language Technology Group (LTG) is a technology
technology transfer	The HCRC Language Technology Group (LTG) is a
a technology transfer	The HCRC Language Technology Group (LTG) is
is a technology transfer	The HCRC Language Technology Group (LTG)
(LTG) is a technology transfer	The HCRC Language Technology Group
Group (LTG) is a technology transfer	The HCRC Language Technology
Technology Group (LTG) is a technology transfer	The HCRC Language
...

The second stage of the pipeline is just sort -f. The -f option sorts together upper and lower case words (so ``This'' and ``this'' are not separated). The output becomes:
abstract the human indexer is currently working on. The bottom left
abstract. In the top right hand corner, the system displays the
abstract. When the indexer keywords it considers appropriate for this
abstracters on their home used as a stand-alone product by technical
abstracts from a wide collection of agencies who publish technical
abstracts of publications that abstracts to allow researchers to find
abstracts to allow researchers to find abstracts of publications that
achieve full text interpretation. short, to try and
across the world, scan use message filters to collect messages from
...
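The first two stages can be reproduced in miniature. Here the body of rotate.awk is quoted inline on the command line rather than read from the file, and the input is a made-up three-word line:

```shell
# Each word comes to the front once; the text that preceded it
# follows after a tab, ready for sorting and later re-alignment.
echo "one two three" | awk '{
  print $0
  for (i = length($0); i > 0; i--)
    if (substr($0, i, 1) == " ")
      print substr($0, i+1) "\t" substr($0, 1, i-1)
}' | sort -f
```

The result is the original line plus one rotated line per remaining word: three comes to the front with one two as its left context, and two with one.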
The unrotate program is even shorter than the rotate program
(available as unrotate.awk):
1 BEGIN {FS= "\t"; WID = 30}
2 { printf ("%" WID "s %s\n",
3     substr($2,length($2)-WID+1),substr($1,1,WID))
4 }
Note the use of the
built-in variable FS (for Field Separator). FS is given a
single tab character as its value. This means that awk will
automatically split input lines at tabs, rather than (as is the
default) any white-space character. This in turn means that when the
main body (lines 2-4) processes lines, they are split in two at the
tab characters which were inserted by the rotate
program.
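The difference is easy to see on a made-up line containing both a space and a tab:

```shell
# Default FS: any run of blanks or tabs separates fields.
printf 'one two\tthree\n' | awk '{print $2}'
# FS set to a tab: only the tab separates fields.
printf 'one two\tthree\n' | awk 'BEGIN {FS = "\t"} {print $1}'
```

The first command prints two; the second prints one two, because with FS set to a tab the space no longer splits the line.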
The program uses printf
as explained on page .
Note how it constructs the format string that printf uses on the fly from the expression "%" WID "s %s\n": string concatenation splices the value of WID between the literal pieces, giving "%30s %s\n".
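The concatenation happens before printf ever sees the string. This can be checked on one made-up rotated line, with the body of unrotate.awk quoted inline:

```shell
# Right-align the prefix (field 2, after the tab) in a 30-character
# column, then print the keyword and its right context (field 1).
printf 'three\tone two\n' | awk 'BEGIN {FS = "\t"; WID = 30}
  { printf ("%" WID "s %s\n",
            substr($2, length($2)-WID+1), substr($1, 1, WID)) }'
```

This prints one two right-aligned in a 30-character field, followed by three. When the prefix is shorter than WID, the start position passed to substr falls below 1, and POSIX awks then return the string from its first character, so short prefixes are padded rather than mangled.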
Putting the three components together, we have:
rotate.awk example | sort -f | unrotate.awk | more

Try this. The result will contain lines like the following:
and-alone product by technical abstracters on their home PCs
agencies who publish technical abstracts from a wide collecti
s to allow researchers to find abstracts of publications that
                               abstracts to allow researchers
             short, to try and achieve full text interpretati
lters to collect messages from across the world, scan
                               actually appear in the documen
             Indexers can also add new keywords which they th
                               against descriptor assignments
...
and-alone product by technical abstracters on their home PCs.
agencies who publish technical abstracts from a wide collecti
s to allow researchers to find abstracts of publications that
st assigning keywords to these abstracts to allow researchers
ge to is too short, to try and achieve full text interpretati
lters to collect messages from across the world, scan them, a
tching --- does a fixed string actually appear in the documen
             Indexers can also add new keywords which they th
ocument representational units against descriptor assignments

Can you see what may have caused the difference?
The difference arises because the KWIC software reads exatext1 line by line. Because the word ``actually'' appears at the start of a line in the exatext1 file, it shows up in the KWIC index with no context to its left. Strictly speaking, that's wrong.
You can avoid that problem by removing linebreaks in the input
file. For example, you could take out all the linebreaks, still
leaving in titles and paragraph breaks. If you then run the
KWIC software, only the start of a paragraph will show
up in the KWIC index with no context to its left; and the
last word of a paragraph will show up with no context to its right.

A one-line awk program, stop.awk, can be used to throw away the rotations which bring one of the common words to the front:

$1 !~ /^(a|A|an|An|and|And|for|For|if|If|in|In|of|Of|the|The|to|To)$/

The !~ means ``does not match'', so the effect of the program is to reject all lines which begin with the common words.
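The effect can be checked on two made-up rotated lines, quoting the pattern inline rather than reading it from stop.awk. Note that this filter deliberately leaves FS at its default, so $1 is the first blank- or tab-delimited word of the rotated text, not everything before the tab:

```shell
# The first line brings "the" to the front and is rejected;
# the second brings "cat" to the front and survives.
printf 'the cat sat\tonce upon a\ncat sat\tonce upon a the\n' |
awk '$1 !~ /^(a|A|an|An|and|And|for|For|if|If|in|In|of|Of|the|The|to|To)$/'
```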
This can be inserted into the pipeline after rotate, as
follows:
rotate.awk example |stop.awk | sort -f | unrotate.awk
                               categories is not an arbitrary
incoming mail and sort it into categories like
classify them into a number of categories so that relevant
                       solving categorisation and routing pro
          In our experience of categorisation and routing pro
                          text categorisation and routing sol
omatic and semi-automatic text categorisation is also an impo
                               categorisation system, SISTA.
                               categorisation tools are used
                               categorization and routing are
              Our work on text categorization and routing has
                          Text categorization and text routin
lications, the assignment of a category or set of
 the likelihood that a certain category should be

This shows, for example, that the file uses both spellings, ``categorisation'' and ``categorization''.
Modify the code to produce nicer looking
output using ASCII art.
We format the words as a table, using features of printf which
are not worth discussing here. One thing which is worth discussing is
the use of index to try to find a trailing space after the first
word. If there is a space, we
use the knowledge of the position to split the string into three
parts. If we didn't find a space, we must be at the end of the line,
so we set the variable pos as if there really were a space after
the end of the line. This lets us use the same code for both cases:
BEGIN {FS= "\t"; WID = 30; WORDLEN=12; line()}
{ pos = index($1," ")
if(!pos) {
pos = length($1)+1;
}
printf ("|%" WID "s | %-" WORDLEN "s| %-" WID "s|\n",
substr($2,length($2)-WID+1),
substr($1,1,pos-1),
substr($1,pos+1,WID))
}
END { line() }
function line () {print "---...---"}
We also take the opportunity to define a tiny function for drawing
lines of the right length at the beginning and end of the
table. Some of the dashes are omitted in the listing.
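The end-of-line case is worth checking by hand. On a made-up rotation whose keyword is the last word of its line, index finds no space, pos is set just past the end of the string, and the suffix column comes out empty (the table program is quoted inline here, with the dashed line shortened):

```shell
printf 'three\tone two\n' | awk '
BEGIN {FS = "\t"; WID = 30; WORDLEN = 12; line()}
{ pos = index($1, " ")            # position of the space after the first word
  if (!pos) pos = length($1) + 1  # no space: pretend one follows the last char
  printf ("|%" WID "s | %-" WORDLEN "s| %-" WID "s|\n",
          substr($2, length($2)-WID+1),
          substr($1, 1, pos-1),
          substr($1, pos+1, WID)) }
END { line() }
function line() { print "----" }'
```

Between the two dashed lines this prints one two in the prefix column, three in the word column, and an empty suffix column.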
Alternative solution:
Another way to write the same program uses a second function to
do the output to the screen, as follows:
BEGIN {FS= "\t"; WID = 30; WORDLEN=12; line()}
{ pos = index($1," ")
if(!pos) { pos = length($1)+1}
cells($2,$1,pos ) }
END { line() }
function cells(prefix,suffix,pos) {
printf ("|%" WID "s | %-" WORDLEN "s| %-" WID "s|\n",
substr(prefix,length(prefix)-WID+1),
substr(suffix,1,pos-1),
substr(suffix,pos+1,WID))
}
function line () {print "---...---"}
This has the merit that it clarifies the purpose of the main program, and isolates the clunky manoeuvres with printf in a single location. The second version would be slightly easier to modify if you needed a KWIC index program which outputs directly to LaTeX.
This program is available as unrotate3.awk. Output from unrotate3 is shown in table 4.1.
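The merit of isolating the output routine in cells, noted above, can be illustrated with a made-up variant which is not one of the chapter's programs: to emit, say, comma-separated values, only cells needs to change, and the main program is untouched:

```shell
printf 'three\tone two\n' | awk '
BEGIN {FS = "\t"; WID = 30}
{ pos = index($1, " ")
  if (!pos) pos = length($1) + 1
  cells($2, $1, pos) }
# Same driver as before; only the output routine differs.
function cells(prefix, suffix, pos) {
  printf ("%s,%s,%s\n",
          substr(prefix, length(prefix)-WID+1),
          substr(suffix, 1, pos-1),
          substr(suffix, pos+1, WID))
}'
```

For the sample line this prints one two,three, with a trailing comma marking the empty suffix.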
Write the program which generates the LaTeX KWIC index.
(Obviously you need to know what input LaTeX expects, so this
question is for LaTeXies only.)
Only small changes are needed:
BEGIN {FS= "\t"; WID = 30; start_table()}
{ pos = index($1," ")
if(!pos) { pos = length($1)+1}
cells($2,$1,pos ) }
END { end_table() }
function cells(prefix,suffix,pos) {
printf ("%s & %s & %s \\\\ \\hline \n",
substr(prefix,length(prefix)-WID+1),
substr(suffix,1,pos-1),
substr(suffix,pos+1,WID))
}
function start_table () {
print "{\\small"
print "\\begin{tabular}{|r|l|l|} \\hline"
print "\\multicolumn{1}{|l|}{Prefix} &"
print "Word &"
print "\\multicolumn{1}{l|}{Suffix} \\\\ \\hline \\hline"
}
function end_table () {
print "\\end{tabular}"
print "}"
}
The only point of general interest is that you need to write \\
inside strings in order to get \
in the output. The result can
be seen in table 4.2.
|                        Prefix | Word        | Suffix                        |
|                           ... | ...         | ...                           |
|considers appropriate for this | abstract.   | When the indexer clicks on a k|
|and-alone product by technical | abstracters | on their home PCs. Above, we g|
|agencies who publish technical | abstracts   | from a wide collection of jour|
|s to allow researchers to find | abstracts   | of publications that may be of|
|st assigning keywords to these | abstracts   | to allow researchers to find a|
|ge to is too short, to try and | achieve     | full text interpretation.     |
|lters to collect messages from | across      | the world, scan them, and clas|
| tching -- does a fixed string | actually    | appear in the document? At the|
|                           ... | ...         | ...                           |