Frequency Lists 
===============

Users of WiTTFind can find frequency lists grouped by semantic
categories on the website. Most of these lists were built from the first
5,000 pages of Wittgenstein's *Nachlass* that were open to the public.
Now that all 20,000 pages of the manuscripts and typescripts are
accessible, the semantic frequency lists need to be updated.

Ludwig Wittgenstein wrote about a wide range of topics. One of them
is music. Colors also played a big role in the work of this philosopher:
in his typescript Ts-213 he wrote, under the chapter
\"Phänomenologie\" (phenomenology), a subchapter on color and color
mixing. It is not surprising that the second most common adjective in
this document is \"rot\" (red). Many semantic categories could be
analyzed in Wittgenstein's *Nachlass*, but because of the time constraints
of the thesis, only the categories of color and music are explored.

In a previous thesis, frequency lists for these and other semantic
categories were created. These frequency lists were static and covered
only the first 5,000 pages of the *Nachlass* open at that time. So that
users can analyze Wittgenstein's work further, the frequency
lists need to cover the whole *Nachlass*. This chapter explains the
process of creating new semantic frequency lists for the FinderApp
WiTTFind. The process is implemented in such a way that it can be added
to the \[CW\]AST Toolchain. The semantic frequency lists are therefore
dynamic and can be generated with a makefile target, like all the other
tools in the chain.

Lexicon 
-------

The lexicon used by the FinderApp WiTTFind is called witt\_WAB\_DELA. It
is intended to include all the words that appear in Ludwig Wittgenstein's
*Nachlass* and is sorted alphabetically. The lexicon records
grammatical and semantic characteristics for each token, which can be used
to extract the semantic frequency lists and other important information
for the FinderApp. It is an electronic lexicon in the German DELAF
format. This format is especially suitable for working with local
grammars and for processing text corpora with the help of Unitex. For a
more detailed explanation of the lexicon, see the thesis by Angela Krey
[@krey_thesis].

Since the lexicon has been improved over time, there are different
versions of it. As of fall 2018, the lexicon in use is
witt\_WAB\_dela\_XIX.txt.

Each line in the lexicon represents a word and follows the schema:

`fullform,lemma.grammatic_categories+semantic_categories...`

As mentioned above, the frequency lists created in the scope of this
thesis comprise the semantic categories for music and color. The
semantic category `MUSIK` (music) marks in the lexicon all words that
fall into this category.

``` 
Akkord,.N+MUSIK
Antoni,.EN+persName+MUSIK
Bachanten,Bacchant.N+MUSIK
Bach Johann Sebastian,Bach.EN+persName+MUSIK+KOMPONIST
Bach,.N+persName+MUSIK+KOMPONIST:aeM:deM:neM
Blasinstrumente,Blasinstrument.N+MUSIK:amN:deN:gmN:nmN
...
```

As mentioned at the beginning of this chapter, Wittgenstein worked
intensively with color theory. There are many semantic categories for
color, such as Zwischenfarbe (intermediate color), Transparenz
(transparency), Glanz (shine) and so on, but they are all subsets of
the category `COL`, which stands for color. The listing below shows some
example words that fall into the color semantic category.

``` 
dunkel,.ADJ+COL+Zwischenfarbe:up
dunkelblau,.ADJ+COL+Zwischenfarbe
Dunkelrot,.N+COL
durchsichtige,durchsichtig.ADJ+COL+Transparenz
einfarbige,einfarbig.ADJ+NUM+COL+Farbigkeit
farbloses,farblos.ADJ+REL+COL+Farbigkeit
...
```

Understanding the structure of the entries in the lexicon is important
because it will make the extraction of words for the semantic frequency
lists easier.
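
To make the entry structure concrete, the following minimal sketch splits one lexicon line into its parts. The function name `parse_delaf_entry` and the layout of the returned dictionary are hypothetical and not part of the WiTTFind tools. The sketch assumes that inflection codes follow the first colon, that the first category after the period is the grammatical one, and that full forms contain no escaped commas.

```python
def parse_delaf_entry(line):
    """Split 'fullform,lemma.CAT+SEM1+SEM2:infl1:infl2' into its parts."""
    line = line.strip()
    full_form, _, rest = line.partition(",")
    lemma, _, codes = rest.partition(".")
    # everything after the first colon is treated as inflection codes
    categories, *inflections = codes.split(":")
    grammatical, *semantic = categories.split("+")
    return {
        "full_form": full_form,
        # an empty lemma means the full form is its own lemma
        "lemma": lemma or full_form,
        "grammatical": grammatical,
        "semantic": semantic,
        "inflections": inflections,
    }


entry = parse_delaf_entry("Blasinstrumente,Blasinstrument.N+MUSIK:amN:deN:gmN:nmN")
print(entry["lemma"], entry["semantic"])
```

A set lookup such as `"MUSIK" in entry["semantic"]` is then enough to decide whether an entry belongs to a semantic category.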

Make deploy chain for the semantic frequency lists
--------------------------------------------------

The tools needed to create the semantic frequency lists are controlled
by the makefile `semantic_freqlist.make` and the logic follows the
structure of the \[CW\]AST deploy chain. The makefile can be found under
the `witt-data/deployment/makefile` folder, where all other makefiles
needed to deploy the FinderApp are stored. All the makefile targets in
`semantic_freqlist.make` have to be called from inside the
`witt-data/deployment` folder.

In `semantic_freqlist.make` the path to the different programs needed to
produce the frequency lists as well as the path destinations for the
output are saved into variables. These variables are later used in the
command part of the make rules.

To generate the semantic frequency lists for music and color, a
frequency list over all words of the *Nachlass* is needed first. This
frequency list is created with the script `all_frequencies.py`, which can
be found in the `witt-data/tools/frequency` folder and can be called
with the makefile target `make all-freqlist`, see line 5 of the listing
below. This rule depends on the OA\_NORM-tagged.xml files, which means
that if they change, the rule can be called again to rebuild the
frequency list.

``` 
TAGGED_UNEXPANDED_NORM_FILES   = $(shell find -L $(NACHLASS_DIR)/*/norm -type f -name \*.xml | grep '\-tagged' | grep -v '\expanded')

semantic_freqlist: all-freqlist music-freqlist color-freqlist

all-freqlist: $(TAGGED_UNEXPANDED_NORM_FILES)
	$(SILENT) $(PYTHON3_RUNNER) $(FREQ_ALL_DIR)/$(FREQ_ALL_CMD) $(ALL_FREQLIST) $(ALL_FREQ_PICKLE) $^
	$(SILENT) echo "written $(ALL_FREQLIST)"
	$(SILENT) echo "written $(ALL_FREQ_PICKLE)"

music-freqlist:
	$(SILENT) $(PYTHON3_RUNNER) $(FREQ_MUSIC_DIR)/$(FREQ_MUSIC_CMD) $(DICT_WITT) $(ALL_FREQ_PICKLE) $(MUSIC_FREQLIST)
	$(SILENT) echo "written $(MUSIC_FREQLIST)"

color-freqlist:
	$(SILENT) $(PYTHON3_RUNNER) $(FREQ_COLOR_DIR)/$(FREQ_COLOR_CMD) $(DICT_WITT) $(ALL_FREQ_PICKLE) $(COLOR_FREQLIST)
	$(SILENT) echo "written $(COLOR_FREQLIST)"
```

The program that creates the lemmatized frequency list for music,
`music_freqlist.py`, can be found in the `witt-data/tools/frequency/music`
folder and can be called with the makefile target
`make music-freqlist`.

For the colors, the program is called `color_freqlist.py` and it's
located in the directory `witt-data/tools/frequency/color`. The makefile
target `make color-freqlist` calls this script.

The three makefile targets can be called at once with the target
`make semantic_freqlist`, the first rule in the listing above.

Frequency of words over all files 
---------------------------------

The program `all_frequencies.py` can be called with the following
command:

``` 
python all_freqlist.py arg1 arg2 arg3...
```

The first argument expected is the output file for the frequency list in
txt format. The second is the output file for the frequency list in
pickle format, and arg3 to argn are the OA\_NORM-tagged.xml
files. The program stores the tagged files in a list and later iterates
through them.
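
The argument handling described above can be sketched as follows. This is only an illustration: the helper name `split_arguments` and the file names in the example call are placeholders, not names taken from the actual script.

```python
def split_arguments(argv):
    """Return (txt_output, pickle_output, tagged_files) from a sys.argv-style list."""
    if len(argv) < 4:
        raise SystemExit("usage: <script> out.txt out.pickle tagged.xml [tagged.xml ...]")
    # argv[0] is the script name; the remaining arguments follow the order above
    txt_output, pickle_output, *tagged_files = argv[1:]
    return txt_output, pickle_output, tagged_files


txt_out, pickle_out, files = split_arguments(
    ["script.py", "all.txt", "all.pickle", "a-tagged.xml", "b-tagged.xml"]
)
print(txt_out, pickle_out, files)
```

In a real script, `sys.argv` would be passed instead of the hand-written list.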

This script works in a similar way to `language_finder.py`. It initializes a dictionary
`all_words_freqs` whose keys are the tokens and whose values are the
number of times each word appears throughout the *Nachlass*. It then
reads the tagged files one by one and parses each document with the help
of `iterparse` from `lxml.etree`, which yields tuples of the
form (event, element). Only the elements with the tag \"w\" (words) are
relevant for the frequency lists. Again, the program ignores the
mathematical formulas found in the *Nachlass*, since they are not
considered tokens.
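
The counting loop can be sketched roughly as shown here. To keep the example self-contained, it uses `iterparse` from the standard library module `xml.etree.ElementTree` as a stand-in for `lxml.etree.iterparse`. The function name `count_words` and the sample XML are assumptions made for illustration, and the handling of mathematical formulas and the regex cleaning step are left out.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def count_words(xml_sources):
    """Count the text of every <w> element across the given files or file objects."""
    freqs = defaultdict(int)
    for source in xml_sources:
        for event, element in ET.iterparse(source, events=("end",)):
            # drop a namespace prefix such as '{...}w' if present
            tag = element.tag.rsplit("}", 1)[-1]
            word = (element.text or "").strip()
            if tag == "w" and word:
                freqs[word] += 1
            element.clear()  # free memory while streaming large documents
    return freqs


# illustrative sample, not taken from the Nachlass files
sample = io.BytesIO(
    '<text><w t="ART">die</w><w t="ADJA">weiße</w><w t="NN">Farbe</w>'
    '<w t="ART">die</w></text>'.encode("utf-8")
)
freqs = count_words([sample])
print(dict(freqs))
```

Streaming with `iterparse` avoids holding every document tree in memory at once, which matters when all tagged files are processed in one run.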

The word is then cleaned of possible XML elements caused by different
types of input and preprocessing errors. This is done exactly as in
(#unknown_words_section), with the help of regular
expressions (regex). After these steps, there are
still strings representing a token that start with a punctuation symbol.
These should not be inserted into the frequency list, and therefore an
additional step is required.

To do so, a string `all_punctuation` is built from the Python constant
`string.punctuation`, which contains the ASCII characters that are
considered punctuation. Other punctuation characters found in the
*Nachlass* are appended to this string. The processed word is only
added to the dictionary if it doesn't start with a punctuation symbol.

``` 
import string

all_punctuation = string.punctuation + "”“–’‘„…"

if word[0] not in all_punctuation:
  all_words_freqs[word] += 1
```

After iterating through all the files, the program sorts the
words in descending order of frequency and saves
the sorted dictionary in a pickle file. It also saves a txt version in
which each line consists of a word, a space and its
frequency. This file can be found on the attached SD card.
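
A minimal sketch of this final step, under the assumption that the frequencies are held in a plain dictionary, could look as follows. The function name `save_frequencies`, the file names and the sample counts are illustrative only.

```python
import os
import pickle
import tempfile


def save_frequencies(freqs, txt_path, pickle_path):
    """Sort by descending frequency, then write a pickle file and a txt file."""
    ordered = dict(sorted(freqs.items(), key=lambda item: item[1], reverse=True))
    with open(pickle_path, "wb") as pickle_file:
        pickle.dump(ordered, pickle_file)
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        for word, freq in ordered.items():
            # one line per word: the word, a space, its frequency
            txt_file.write(f"{word} {freq}\n")
    return ordered


out_dir = tempfile.mkdtemp()
txt_path = os.path.join(out_dir, "all_freqs.txt")
pickle_path = os.path.join(out_dir, "all_freqs.pickle")
ordered = save_frequencies({"blau": 3, "rot": 7, "gelb": 5}, txt_path, pickle_path)
print(list(ordered))
```

Since Python dictionaries preserve insertion order, the pickled dictionary keeps the sorted order when it is loaded again by the semantic frequency scripts.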

Semantic frequency lists 
------------------------

Essentially the same program is used to create the semantic frequency
lists for music and for color; the two scripts only search for a different
semantic category in the witt\_WAB\_DELA lexicon. While `color_freqlist.py`
searches for the entries in the lexicon that have the `COL` semantic category,
`music_freqlist.py` searches for the entries with the semantic category
`MUSIK`.

What follows is a short explanation of how the program creates a
lemmatized frequency list for the semantic category color. The list is
lemmatized because the frequencies are grouped by lemma, so that an
entry in the list looks like this:

`lemma,sum_of_all_freqs; first_fullform, freq; second_fullform, freq; ...`


``` 
dunkelblau,6; dunkelblau,3; dunkelblauen,3;
Dunkelrot,4; Dunkelrot,4;
einfarbig,6; einfarbig,2; einfarbige,2; einfarbigen,2;
farblos,33; farblos,14; farblose,7; farbloser,5; farbloses,5; farblosen,2;
...
```

The script `color_freqlist.py` can be called as:

``` 
python color_freqlist.py arg1 arg2 arg3
```

The first argument expected is the lexicon `witt_WAB_dela_XIX.txt`, the
second argument is the frequency list over all words of the *Nachlass* in
pickle format, and the third argument is the file to which the output is
written.

The dictionary of dictionaries `color_freqs` is initialized. Its keys
are the lemmas of the different full forms, and its values are dictionaries
that map each full form to its frequency.

The program reads the lines of the lexicon one by one, see the listing
below, and uses `re.match` to check each line for the following pattern,
anchored at the beginning of the string: anything (the full form) followed by a
comma, then anything (the lemma) followed by a period, followed by anything and
then `+COL`. As mentioned above, `COL` is the semantic category
for colors. The pattern follows the DELAF format explained
at the beginning of this chapter.

``` 
color_freqs = defaultdict(lambda: defaultdict(int))

with open(lexicon, 'r') as witt_lex:
    for entry in witt_lex:
        col = re.match(r"(.*),(.*)\..*\+COL", entry)
        if col:
            # lstrip is used because some words in the dictionary have leading spaces
            full_form = col.group(1).lstrip()
            lemma = col.group(2).lstrip()
            if not lemma:
                lemma = full_form
            if full_form in all_frequencies:
                color_freqs[lemma][full_form] = all_frequencies[full_form]
```

If the entry in the lexicon matches the pattern, the full form of the
word is set to the first group of the match. Sometimes the full form of a
word is also its lemma. In this case, the lemma is left empty in the
lexicon entry, as in the entry for \"Dunkelrot\" (dark red).
The program handles these
entries, see lines 10 to 11, by checking whether the second group captured
anything. If it didn't, it sets the lemma of the word to
its full form.

With the variables `full_form` and `lemma` filled, the program checks if
`full_form` is a key in the dictionary created by `all_freqlist.py`. If the full form of the entry
in the lexicon is in the frequency list over all words in the
*Nachlass*, then it is added to the lemmatized dictionary together with
its frequency, see line 13.
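
For experimentation, the loop from the listing can be wrapped into a self-contained function, as in the sketch below. The wrapper `collect_color_freqs` and the small in-memory inputs are assumptions made for illustration; the lexicon lines are taken from the examples in this chapter, while the frequency values are sample numbers only.

```python
import re
from collections import defaultdict

# same pattern as in the listing, written as a raw string
COLOR_ENTRY = re.compile(r"(.*),(.*)\..*\+COL")


def collect_color_freqs(lexicon_lines, all_frequencies):
    """Group the frequencies of all COL full forms under their lemma."""
    color_freqs = defaultdict(lambda: defaultdict(int))
    for entry in lexicon_lines:
        match = COLOR_ENTRY.match(entry)
        if not match:
            continue
        full_form = match.group(1).lstrip()
        # an empty lemma means the full form is its own lemma
        lemma = match.group(2).lstrip() or full_form
        if full_form in all_frequencies:
            color_freqs[lemma][full_form] = all_frequencies[full_form]
    return color_freqs


lexicon_lines = [
    "Akkord,.N+MUSIK",
    "Dunkelrot,.N+COL",
    "farbloses,farblos.ADJ+REL+COL+Farbigkeit",
    "einfarbige,einfarbig.ADJ+NUM+COL+Farbigkeit",
]
all_frequencies = {"Dunkelrot": 4, "farbloses": 5, "Akkord": 9}
color_freqs = collect_color_freqs(lexicon_lines, all_frequencies)
print({lemma: dict(forms) for lemma, forms in color_freqs.items()})
```

In this sample, "einfarbige" is skipped because it has no entry in the frequency dictionary, and "Akkord" is skipped because it does not carry the `COL` category.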

When `color_freqlist.py` finishes iterating through the lexicon, it
still needs to write the resulting frequency list to a txt file. To do
so, it iterates through the items of the dictionary `color_freqs` and
sums the values of the different full forms of each lemma with
the help of the function `sum`. The lemma followed by this sum, and
then the full forms with their frequencies, are written to the output
file in the format shown earlier.

``` 
    for lemma, full_forms in color_freqs.items():
        sum_of_freqs = sum(full_forms.values())
```
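
How one output line might be assembled from a lemma and its full forms is sketched below. The helper `format_lemma_line` is hypothetical, and ordering the full forms by descending frequency is an assumption inferred from the example entries shown earlier, not something confirmed by the script itself.

```python
def format_lemma_line(lemma, full_forms):
    """Build 'lemma,sum; form,freq; form,freq; ...' for one lemma."""
    total = sum(full_forms.values())
    # assumption: full forms are listed from most to least frequent
    ordered = sorted(full_forms.items(), key=lambda item: item[1], reverse=True)
    parts = [f"{lemma},{total};"] + [f"{form},{freq};" for form, freq in ordered]
    return " ".join(parts)


line = format_lemma_line(
    "farblos",
    {"farblos": 14, "farblose": 7, "farbloser": 5, "farbloses": 5, "farblosen": 2},
)
print(line)
```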

The program that extracts the semantic frequency list for music uses a
different regex for the matching, see below. This is the only line that
differs between the two scripts `color_freqlist.py` and
`music_freqlist.py`.

``` 
music = re.match(r"(.*),(.*)\..*\+MUSIK", entry)
```

By replacing the regex on line 5 of the color listing above, one can
search for other semantic categories and create frequency lists for them
if needed.
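
One way to avoid editing the regex by hand is to build it from the category name, as in the following sketch. The helper `category_pattern` is hypothetical. Unlike the patterns in the two scripts, it adds a negative lookahead so that a category name is not matched as the prefix of a longer name; this is a design choice of the sketch, not behavior of the existing tools.

```python
import re


def category_pattern(category):
    """Compile a DELAF pattern for entries carrying the given semantic category."""
    # re.escape guards against regex metacharacters in the category name;
    # (?!\w) rejects longer names that merely start with the category
    return re.compile(r"(.*),(.*)\..*\+" + re.escape(category) + r"(?!\w)")


composer = category_pattern("KOMPONIST")
match = composer.match("Bach Johann Sebastian,Bach.EN+persName+MUSIK+KOMPONIST")
print(match.group(1), "|", match.group(2))
```

With such a helper, the category could be passed to a single script as an additional argument instead of maintaining one script per category.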

### Old frequencies

As mentioned at the beginning of the chapter, frequency lists for
different semantic categories were created as part of a previous
bachelor thesis. The resulting frequency lists for the first 5,000 openly
accessible pages can be found in the FinderApp WiTTFind.

To compare the old frequencies with the ones created in this work, the
10 most common words of the old frequencies for color adjectives were
extracted, see table (#tab:old_color).

The composer frequency list, shown in table (#tab:old_composers) and
later in table (#tab:new_composers), contains the 10 most frequently
mentioned composers by lemma (not by full form, as for the color adjectives).

| Word | Frequency |
|------|-----------|
| rot | 904 |
| klar | 267 |
| blau | 209 |
| gelb | 152 |
| roten | 141 |
| rote | 104 |
| gelbe | 92 |
| schwarz | 75 |
| blue | 73 |
| rotes | 68 |

  Old frequencies for color adjectives retrieved from
  http://wittfind.cis.uni-muenchen.de/?semantics\#

 
The pages of the typescript Ts-213 are part of the first 5,000 pages
of Wittgenstein's *Nachlass* that were open to the public. As previously
mentioned, the second most common adjective in this typescript is
\"rot\" (red). This typescript is one of the largest documents in the
*Nachlass*, so it is not surprising to find this color adjective in
first place in the color list, with a frequency of 904. Inflected
forms of this adjective, which depend on the number and gender of the
noun it modifies, also make it onto
the list of the most common color adjectives used by Wittgenstein.

| Word | Frequency |
|------|----------|
|  Beethoven |  41 |
|  Schubert |  31 |
|  Brahms |  26 |
|  Mozart |  23 |
|  Mendelssohn |  16 |
|  Bruckner |  15 |
|  Labor |  10 |
|  Wagner |  9 |
|  Schumann |  8 |
|  Haydn |    7 |

  Old frequencies for composers retrieved from
  http://wittfind.cis.uni-muenchen.de/?semantics\#

 
Music was another topic the philosopher wrote about, and therefore it is
important to examine this semantic category in his work. To compare
the old frequencies with the new ones, the semantic subcategory
KOMPONIST (composer) was chosen. The
composer who appears most often throughout the first 5,000 open pages of
the *Nachlass* is Beethoven, followed by Schubert.

### Frequencies of the additional 15,000 pages

The frequencies for the previously restricted pages can be found in table
(#tab:additional_colors) for color adjectives and in table
(#tab:additional_composers) for composers. A few
interesting things can be observed.

| Word | Frequency |
|------|----------|
|  rot |  1256 |
|  klar |  1415 |
|  blau |  499 |
|  gelb |  290 |
|  roten |  282 |
|  rote |  190 |
|  gelbe |  85 |
|  schwarz |  166 |
|  blue |  27 |
|  rotes |  53 |

  Additional frequencies for color adj


The color adjective \"rot\" was found 1256 times in the remaining 15,000
pages. The adjective \"klar\" was found even more often than \"rot\",
occurring 1415 times. The word \"klar\" can mean different things
depending on the context, for example clear or transparent.

The use of color adjectives remains strong in the remaining three
quarters of Wittgenstein's *Nachlass*.

Mentions of composers in the additional 15,000
pages are very scarce. In these pages, Wittgenstein
mentions Bruckner 3 times. No other composer is mentioned. This shows
that the philosopher talks mostly about music in one or more of the
documents belonging to the first 5,000 open pages of the *Nachlass*.


| Word | Frequency |
|------|-----------|
| Bruckner | 3 |

  Additional frequencies for composers


Difference between old frequencies and new frequencies
------------------------------------------------------

The frequencies shown below are the new total frequencies for the 10
most frequent words appearing in the old frequencies table.


| Word | Frequency |
|------|----------|
|  rot |  2160 |
|  klar |  1682 |
|  blau |  499 |
|  gelb |  247 |
|  roten |  423 |
|  rote |  294  |
|  gelbe |  177 |
|  schwarz |  241 |
|  blue |  100 |
|  rotes |  121 |

  New frequencies for color adj

 
Several composer frequencies decreased from the old frequencies, see
table (#tab:old_composers), to the frequencies over
all 20,000 pages. This would be really odd if it weren't
for the fact that the format of the XML documents received from Bergen
changes from time to time. The changes are made to fix
errors, but they can also introduce new transcription or edition problems.
To understand why the frequency decreased by 1 for Beethoven, Schubert and
Brahms, by 2 for Haydn and by 3 for Mendelssohn, deeper research
would be needed, which exceeds the scope of this work.


| Word | Frequency |
|------|----------|
|  Beethoven |  40 |
|  Schubert |  30 |
|  Brahms |  25 |
|  Mozart |  23 |
|  Mendelssohn |  13 |
|  Bruckner |  18 |
|  Labor |  Not marked as MUSIC |
|  Wagner |  9 |
|  Schumann |  8 |
|  Haydn  |  5 |

  New frequencies for composers

 
No entry for Labor was found either in the search bar of WiTTFind or in
the newly created frequency list. Josef Labor was a composer, and an
entry for his name can be found in the witt\_WAB\_delaXIX.txt lexicon:

`Labor Josef,Labor.EN+MUSIK+KOMPONIST`

The reason why this composer appears neither in the new semantic
frequencies nor in the search bar of WiTTFind is that both programs
check that the full form of an entry in the lexicon is a key in the
dictionary of frequencies over all words. The full form given by the lexicon
is \"Labor Josef\".

The entries found for Labor in the dictionary `all_words_freqs` created
by the script `all_freqlist.py` are:

    all_words_freqs ={ 
      ...
      Labor: 8,
      Labors: 2,
      ...
    }

\"Labor Josef\" is not a key of any item in the dictionary, since
Wittgenstein never writes the composer's complete name. The error lies in the
incomplete lexicon witt\_WAB\_delaXIX.txt, but it can easily be fixed by
adding the following two entries:

`Labor,.EN+MUSIK+KOMPONIST`

`Labors,Labor.EN+MUSIK+KOMPONIST`
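
The effect of the two additional entries can be checked with a small sketch that applies the same matching logic to the composer category. The function `composer_freqs` is a hypothetical helper written for this illustration; the lexicon lines and the counts for "Labor" and "Labors" are the ones quoted above.

```python
import re
from collections import defaultdict

COMPOSER_ENTRY = re.compile(r"(.*),(.*)\..*\+KOMPONIST")


def composer_freqs(lexicon_lines, all_words_freqs):
    """Group the frequencies of KOMPONIST full forms under their lemma."""
    freqs = defaultdict(dict)
    for entry in lexicon_lines:
        match = COMPOSER_ENTRY.match(entry)
        if not match:
            continue
        full_form = match.group(1).lstrip()
        lemma = match.group(2).lstrip() or full_form
        # multi-word full forms are never keys of the word frequency dictionary
        if full_form in all_words_freqs:
            freqs[lemma][full_form] = all_words_freqs[full_form]
    return dict(freqs)


all_words_freqs = {"Labor": 8, "Labors": 2}
before = ["Labor Josef,Labor.EN+MUSIK+KOMPONIST"]
after = before + ["Labor,.EN+MUSIK+KOMPONIST", "Labors,Labor.EN+MUSIK+KOMPONIST"]

print(composer_freqs(before, all_words_freqs))
print(composer_freqs(after, all_words_freqs))
```

With the original entry alone the result is empty, while the extended lexicon yields both full forms under the lemma "Labor", whose frequencies sum to 10.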

This type of error is explained in more detail in chapter
(#evaluation).

New frequencies
---------------

So far, the old frequencies were compared with the new ones. In this
last part of the chapter, the new frequencies over all documents are
shown. The 20 most common words for the semantic category color and for
the semantic category music are shown in table
(#tab:all_new_colors) and table
(#tab:all_new_music) respectively.


| Word | Frequency |
|------|----------|
|  rot |  2160 |
|  klar |  1682 |
|  Rot |  865 |
|  blau |  499 |
|  roten |  423 |
|  grün |  411 |
|  Weiß |  375 |
|  Grün |  298 |
|  rote |  294 |
|  gelb |  247 |
|  Blau |  243 |
|  schwarz |  241 |
|  Gelb |  236 |
|  rein |  221 |
|  roter |  187 |
|  gelbe |  177 |
|  heller |  177 |
|  schwarzen |  170 |
|  Schwarz   |      169 |


  New frequencies over all color semantic category


The word that ranks first in the new semantic frequency list for color
is \"weiß\", which means white. This word is also the present-tense form of
the verb wissen (to know) for the first and third person singular.

The figures show that the full form \"weiß\" has the
same frequency for its different possible lemmas. This problem
arises when the semantic categories of a word are assigned with the
help of a lexicon alone. To disambiguate the meaning of a word, its POS tag
should be taken into account when creating the frequency list over
all words.


The following examples show two different uses of the word
\"weiß\". In the first example, \"D.h. also, er weiß immer mehr, als er
zeigen kann.\" (That means, he always knows more than he can show.),
found in Ts-213,12r\[3\]\_3, the word is used as a verb. Its tagging
is shown in the listing below.

``` 
<w ana="pagenr:29 linenr:11 tokennr:6" l="wissen" t="VVFIN">weiß</w>
```

In the sentence \"im Schachspiel wird die weiße Farbe von Figuren zur
Unterscheidung von der schwarzen Farbe andrer Figuren gebraucht.\" (In
chess, the white color of figures is used to distinguish them from the
black color of other figures.) from Ts-213,441r\[6\]\_2, the word
\"weiße\" is an adjective.

``` 
<w ana="pagenr:613 linenr:1 tokennr:28" l="weiß" t="ADJA">weiße</w>
```

The meaning of such words has to be disambiguated with the help of
their tag. A lexical frequency list does not suffice to disambiguate the
meaning of some full forms. This approach to creating frequency lists
could be a research topic for a future thesis, since this kind of
problem is not specific to the word shown in the example.
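
A possible starting point for such tag-aware counting is sketched below. It is not part of the tools described in this chapter. It assumes that the lemma and the POS tag are stored in the `l` and `t` attributes of each `w` element, as in the two listings above, and it uses the standard library `xml.etree.ElementTree` as a stand-in for `lxml`. The function name `count_by_lemma_and_tag` and the third sample token are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_by_lemma_and_tag(xml_source):
    """Count <w> elements per (full form, lemma, POS tag) instead of per full form."""
    counts = Counter()
    for _, element in ET.iterparse(xml_source, events=("end",)):
        tag = element.tag.rsplit("}", 1)[-1]
        word = (element.text or "").strip()
        if tag == "w" and word:
            counts[(word, element.get("l"), element.get("t"))] += 1
    return counts


# illustrative sample: the first two tokens mirror the listings above
sample = io.BytesIO(
    '<s><w l="wissen" t="VVFIN">weiß</w><w l="weiß" t="ADJA">weiße</w>'
    '<w l="weiß" t="ADJD">weiß</w></s>'.encode("utf-8")
)
counts = count_by_lemma_and_tag(sample)
for key, freq in counts.items():
    print(key, freq)
```

With keys of this form, the verb reading of \"weiß\" (lemma "wissen") and the adjective reading (lemma "weiß") are counted separately, so only the latter would contribute to the color list.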

Red is one of the color adjectives Wittgenstein uses most. It ranks
first in the old frequency list and second in the new one. Other
inflected forms of this adjective again make it onto the list.

The 20 most common words in the semantic category of music are topped by
the word \"Form\". In music, a form refers to the structure of a
performance or composition. It is clear, though, that this word can also
be used in many non-musical contexts, and it is therefore not
surprising that its frequency is far higher than that of all other
words in this semantic category. All other words
in the list are less ambiguous.

The complete lemmatized frequency lists for music and color can be found
on the attached SD card.

| Word | Frequency |
|------|----------|
|  spielen |  839 |
|  Ton |  446 |
|  hören |  326 |
|  Musik |  200 |
|  Melodie |  184 |
|  Thema |  155 |
|  Klang |  142 |
|  Töne |  132 |
|  play |  92 |
|  singen |  75 |
|  Noten |  73 |
|  Rhythmus |  67 |
|  Klavier |  52 |
|  playing |  46 |
|  klingen |  45 |
|  tone |  44 |
|  Phrase |  43 |
|  hear |  39 |
|  Note |  36 |
|  Musikstück |  35 |

  New frequencies over all music semantic category