Frequency Lists
===============

Users of WiTTFind can find frequency lists grouped by semantic category on the website. Most of these lists were built from the first 5,000 pages of Wittgenstein's *Nachlass* that were open to the public. Now that all 20,000 pages of the manuscripts and typescripts are accessible, the semantic frequency lists need to be updated.

Ludwig Wittgenstein wrote about a wide range of topics. One of them is music. Colors also played a big role in his work: in his typescript Ts-213, the chapter \"Phänomenologie\" (phenomenology) contains a subchapter on color and color mixing. It is not surprising that the second most common adjective in this document is \"rot\" (red). Many semantic categories could be analyzed in Wittgenstein's *Nachlass*, but because of the time constraints of this thesis, only the categories of color and music are explored.

In a previous thesis, frequency lists for these and other semantic categories were created. These frequency lists were static and covered only the first 5,000 pages of the *Nachlass* that were open at that time. For users to analyze Wittgenstein's work further, the frequency lists need to cover the whole *Nachlass*.

This chapter explains the process of creating new semantic frequency lists for the FinderApp WiTTFind. The process is implemented in such a way that it can be added to the \[CW\]AST toolchain. The semantic frequency lists are therefore dynamic and can be generated with a makefile target, like all the other tools in the chain.

Lexicon
-------

The lexicon used by the FinderApp WiTTFind is called witt\_WAB\_DELA. It is intended to include all the words that appear in Ludwig Wittgenstein's *Nachlass*, and it is sorted alphabetically. The lexicon records grammatical and semantic features for the tokens, which can be used to extract the semantic frequency lists and other important information for the FinderApp.
It is an electronic lexicon in the German DELAF format. This format is especially suitable for working with local grammars and for processing text corpora with the help of Unitex. For a more detailed explanation of the lexicon, see the thesis by Angela Krey [@krey_thesis].

Since the lexicon has been improved over time, there are different versions of it. As of fall 2018, the version in use is witt\_WAB\_dela\_XIX.txt. Each line in the lexicon represents a word and follows the schema:

`fullform,lemma.grammatic_categories+semantic_categories...`

As mentioned above, the frequency lists created in the scope of this thesis cover the semantic categories music and color. The semantic category `MUSIK` (music) marks all words in the lexicon that fall into this category.

```
Akkord,.N+MUSIK
Antoni,.EN+persName+MUSIK
Bachanten,Bacchant.N+MUSIK
Bach Johann Sebastian,Bach.EN+persName+MUSIK+KOMPONIST
Bach,.N+persName+MUSIK+KOMPONIST:aeM:deM:neM
Blasinstrumente,Blasinstrument.N+MUSIK:amN:deN:gmN:nmN
...
```

As mentioned at the beginning of this chapter, Wittgenstein worked intensively on color theory. There are many semantic categories for color, such as Zwischenfarbe (intermediate color), Transparenz (transparency), Glanz (shine) and so on, but they are all subsets of the category `COL`, which stands for color. The following listing shows some example words that fall into the color semantic category.

```
dunkel,.ADJ+COL+Zwischenfarbe:up
dunkelblau,.ADJ+COL+Zwischenfarbe
Dunkelrot,.N+COL
durchsichtige,durchsichtig.ADJ+COL+Transparenz
einfarbige,einfarbig.ADJ+NUM+COL+Farbigkeit
farbloses,farblos.ADJ+REL+COL+Farbigkeit
...
```

Understanding the structure of the lexicon entries is important because it makes extracting words for the semantic frequency lists easier.
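To make the entry schema concrete, the following is a minimal sketch of how a single lexicon line could be split into its parts. It is not taken from the WiTTFind code base: the names `DelafEntry` and `parse_delaf_line` are hypothetical, the split into one leading category (`pos`) and the remaining `+`-separated features is an assumption based only on the example entries above, and escaped commas or periods inside a full form are not handled.

```python
from typing import NamedTuple, Tuple


class DelafEntry(NamedTuple):
    """Hypothetical container for one lexicon line (illustration only)."""
    full_form: str
    lemma: str
    pos: str                     # first code after the period, e.g. N, EN, ADJ
    features: Tuple[str, ...]    # remaining +codes, e.g. MUSIK, COL, KOMPONIST
    inflections: Tuple[str, ...] # codes after each colon, e.g. aeM, deM


def parse_delaf_line(line: str) -> DelafEntry:
    """Split 'fullform,lemma.POS+FEATURE...:inflection...' into its parts."""
    line = line.strip()
    full_form, _, rest = line.partition(",")
    lemma, _, codes = rest.partition(".")
    codes, *inflections = codes.split(":")
    pos, *features = codes.split("+")
    full_form = full_form.strip()
    # An empty lemma field means that the full form is its own lemma.
    lemma = lemma.strip() or full_form
    return DelafEntry(full_form, lemma, pos, tuple(features), tuple(inflections))


if __name__ == "__main__":
    entry = parse_delaf_line("Blasinstrumente,Blasinstrument.N+MUSIK:amN:deN:gmN:nmN")
    print(entry.lemma, entry.pos, entry.features, entry.inflections)
    # An entry without an explicit lemma falls back to the full form.
    print(parse_delaf_line("Dunkelrot,.N+COL").lemma)
```

With such a structure, checking whether an entry belongs to a semantic category reduces to a membership test such as `"COL" in entry.features`.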
Make deploy chain for the semantic frequency lists
--------------------------------------------------

The tools needed to create the semantic frequency lists are controlled by the makefile `semantic_freqlist.make`, whose logic follows the structure of the \[CW\]AST deploy chain. The makefile is located in the `witt-data/deployment/makefile` folder, where all other makefiles needed to deploy the FinderApp are stored. All makefile targets in `semantic_freqlist.make` have to be called from inside the `witt-data/deployment` folder.

In `semantic_freqlist.make`, the paths to the programs needed to produce the frequency lists, as well as the destination paths for the output, are stored in variables. These variables are later used in the command part of the make rules.

To generate the semantic frequency lists for music and color, a frequency list over all words of the *Nachlass* is needed first. This frequency list is created by the script `all_frequencies.py`, which is located in the `witt-data/tools/frequency` folder and can be run with the makefile target `make all-freqlist` (see the `all-freqlist` rule in the listing below). This rule depends on the OA\_NORM-tagged.xml files, which means that if they change, the rule can be called again to rebuild the frequency list.
```
TAGGED_UNEXPANDED_NORM_FILES = $(shell find -L $(NACHLASS_DIR)/*/norm -type f -name \*.xml | grep '\-tagged' | grep -v '\expanded')

semantic_freqlist: all-freqlist music-freqlist color-freqlist

all-freqlist: $(TAGGED_UNEXPANDED_NORM_FILES)
	$(SILENT) $(PYTHON3_RUNNER) $(FREQ_ALL_DIR)/$(FREQ_ALL_CMD) $(ALL_FREQLIST) $(ALL_FREQ_PICKLE) $^
	$(SILENT) echo "written $(ALL_FREQLIST)"
	$(SILENT) echo "written $(ALL_FREQ_PICKLE)"

music-freqlist:
	$(SILENT) $(PYTHON3_RUNNER) $(FREQ_MUSIC_DIR)/$(FREQ_MUSIC_CMD) $(DICT_WITT) $(ALL_FREQ_PICKLE) $(MUSIC_FREQLIST)
	$(SILENT) echo "written $(MUSIC_FREQLIST)"

color-freqlist:
	$(SILENT) $(PYTHON3_RUNNER) $(FREQ_COLOR_DIR)/$(FREQ_COLOR_CMD) $(DICT_WITT) $(ALL_FREQ_PICKLE) $(COLOR_FREQLIST)
	$(SILENT) echo "written $(COLOR_FREQLIST)"
```

The program that creates the lemmatized frequency list for music, `music_freqlist.py`, is located in the `witt-data/tools/frequency/music` folder and can be run with the makefile target `make music-freqlist`. For the colors, the program is called `color_freqlist.py` and is located in the directory `witt-data/tools/frequency/color`. The makefile target `make color-freqlist` calls this script. The three makefile targets can be run at once with the target `make semantic_freqlist` (see the `semantic_freqlist` rule in the listing above).

Frequency of words over all files
---------------------------------

The program `all_frequencies.py` can be called with the following command:

```
python all_freqlist.py arg1 arg2 arg3...
```

The first argument is the output file for the frequency list in txt format. The second is the output file for the frequency list in pickle format, and arg3 through argN are the OA\_NORM-tagged.xml files. The program stores the tagged files in an array so that it can iterate over them later. This script works in a similar way to `language_finder.py`.
It initializes a dictionary `all_words_freqs` whose keys are the tokens and whose values are the number of times each word appears throughout the *Nachlass*. It then reads the tagged files one by one and parses each document with `iterparse` from `lxml.etree`, which yields tuples of the form (event, element). Only the elements with the tag \"w\" (words) are relevant for the frequency lists. Again, the program ignores the mathematical formulas found in the *Nachlass*, since they are not considered tokens.

Each word is then cleaned of possible XML remnants caused by different types of input and by preprocessing errors. This is done exactly as in (#unknown_words_section), with the help of regular expressions (regex). After these steps, there are still strings representing a token that start with a punctuation symbol. These should not be inserted into the frequency list, so an additional step is required. To do so, a string `all_punctuation` is built from the Python constant `string.punctuation`, which contains the ASCII characters considered punctuation. Other punctuation characters found in the *Nachlass* need to be added to this string. The processed word is only added to the dictionary if it does not start with a punctuation symbol.

```
import string

# ASCII punctuation plus further punctuation characters found in the Nachlass
all_punctuation = string.punctuation + "”“–’‘„…"
# count the cleaned token only if it does not start with punctuation
if word[0] not in all_punctuation:
    all_words_freqs[word] += 1
```

After iterating through all the files, the program sorts the words in descending order of frequency and saves the sorted dictionary in a pickle file. It also saves a txt version in which each line consists of a word, a space and the word's frequency. This file can be found on the attached SD card.

Semantic frequency lists
------------------------

Essentially the same program is used to create the semantic frequency lists for music and for color. The two scripts differ only in the semantic category they search for in the witt\_WAB\_DELA lexicon.
While `color_freqlist.py` searches for the entries in the lexicon that have the `COL` semantic category, `music_freqlist.py` searches for the entries with the semantic category `MUSIK`. What follows is a short explanation of how the program creates a lemmatized frequency list for the semantic category color. It is a lemmatized frequency list because the frequencies are grouped by lemma, so that an entry in the list looks like this:

`lemma,sum_of_all_freqs; first_fullform, freq; second_fullform, freq; ...`

```
dunkelblau,6; dunkelblau,3; dunkelblauen,3;
Dunkelrot,4; Dunkelrot,4;
einfarbig,6; einfarbig,2; einfarbige,2; einfarbigen,2;
farblos,33; farblos,14; farblose,7; farbloser,5; farbloses,5; farblosen,2;
...
```

The script `color_freqlist.py` can be called as:

```
python color_freqlist.py arg1 arg2 arg3
```

The first argument is the lexicon `witt_WAB_dela_XIX.txt`, the second argument is the frequency list over all words of the *Nachlass* in pickle format, and the third argument is the file to which the output is written.

The dictionary of dictionaries `color_freqs` is initialized. Its keys are the lemmas, and its values are dictionaries that map each full form of the lemma to its frequency. The program reads the lines of the lexicon one by one (see the listing below) and uses `re.match` to check, from the beginning of the string, for anything (the full form) followed by a comma, then anything (the lemma) followed by a period, followed by anything and then `+COL`. As mentioned above, `COL` is the semantic category for colors. The pattern follows the DELAF format explained at the beginning of this chapter.
```
color_freqs = defaultdict(lambda: defaultdict(int))

with open(lexicon, 'r') as witt_lex:
    for entry in witt_lex:
        col = re.match(r"(.*),(.*)\..*\+COL", entry)
        if col:
            # lstrip is used because some words in the dictionary have leading spaces
            full_form = col.group(1).lstrip()
            lemma = col.group(2).lstrip()
            if not lemma:
                lemma = full_form
            if full_form in all_frequencies:
                color_freqs[lemma][full_form] = all_frequencies[full_form]
```

If the entry in the lexicon matches the pattern, the full form of the word is set to the first group of the match. Sometimes the full form of a word is also its lemma. In this case, the lemma is left empty in the lexicon entry, as in the entry for \"Dunkelrot\" (dark red). The program handles these entries (lines 10 to 11, the `if not lemma` check) by testing whether the second group captured anything. If it did not, the lemma of the word is set to its full form.

With the variables `full_form` and `lemma` filled, the program checks whether `full_form` is a key in the dictionary created by `all_freqlist.py`. If the full form of the lexicon entry is in the frequency list over all words in the *Nachlass*, it is added to the lemmatized dictionary together with its frequency (line 13, the assignment to `color_freqs`).

When `color_freqlist.py` finishes iterating through the lexicon, it still needs to write the resulting frequency list to a txt file. To do so, it iterates through the items of the dictionary `color_freqs` and sums the values of the different full forms of each lemma with the built-in function `sum`. The lemma followed by this sum, and then the full forms with their frequencies, are written to the output file in the format shown earlier.

```
for lemma, full_forms in color_freqs.items():
    sum_of_freqs = sum(full_forms.values())
```

The program that extracts the semantic frequency list for music uses a different regex for the matching, see below.
This is the only line that differs between the two scripts `color_freqlist.py` and `music_freqlist.py`.

```
music = re.match(r"(.*),(.*)\..*\+MUSIK", entry)
```

By replacing the category in the regex passed to `re.match`, one can select a different semantic category and build a frequency list for it if needed.

### Old frequencies

As mentioned at the beginning of the chapter, frequency lists for different semantic categories were created as part of a previous bachelor thesis. The resulting frequency lists for the first 5,000 publicly accessible pages can be found in the FinderApp WiTTFind.

To compare the old frequencies with the ones created in this work, the 10 most common words of the old frequency list for color adjectives were extracted, see table (#tab:old_color). The composer frequency list shown in table (#tab:old_composers), and later in (#tab:new_composers), contains the 10 most frequently mentioned composers by lemma (not by full form, as for the color adjectives).

| Word | Frequency |
|------|-----------|
| rot | 904 |
| klar | 267 |
| blau | 209 |
| gelb | 152 |
| roten | 141 |
| rote | 104 |
| gelbe | 92 |
| schwarz | 75 |
| blue | 73 |
| rotes | 68 |

Old frequencies for color adjectives, retrieved from http://wittfind.cis.uni-muenchen.de/?semantics\#

The pages of the typescript Ts-213 are part of the first 5,000 pages of Wittgenstein's *Nachlass* that were open to the public. As previously mentioned, the second most common adjective in this typescript is \"rot\" (red). This typescript is one of the largest documents in the *Nachlass*, so it is not surprising to find this color adjective in first place in the color list, with a frequency of 904. Inflected forms of this adjective, which depend on the number and gender of the noun it modifies, also make it into the list of the most common color adjectives used by Wittgenstein.
| Word | Frequency |
|------|-----------|
| Beethoven | 41 |
| Schubert | 31 |
| Brahms | 26 |
| Mozart | 23 |
| Mendelssohn | 16 |
| Bruckner | 15 |
| Labor | 10 |
| Wagner | 9 |
| Schumann | 8 |
| Haydn | 7 |

Old frequencies for composers, retrieved from http://wittfind.cis.uni-muenchen.de/?semantics\#

Music was another topic the philosopher wrote about, and it is therefore important to examine this semantic category in his work. To compare the old frequencies with the new ones, the semantic subcategory `KOMPONIST` (composer) was chosen. The composer who appears most often throughout the first 5,000 open pages of the *Nachlass* is Beethoven, followed by Schubert.

### Frequencies of additional 15,000 pages

The frequencies for the 15,000 pages that were not yet publicly accessible at that time can be found in table (#tab:additional_colors) for color adjectives and in table (#tab:additional_composers) for composers. A few interesting things can be observed.

| Word | Frequency |
|------|-----------|
| rot | 1256 |
| klar | 1415 |
| blau | 499 |
| gelb | 290 |
| roten | 282 |
| rote | 190 |
| gelbe | 85 |
| schwarz | 166 |
| blue | 27 |
| rotes | 53 |

Additional frequencies for color adjectives

The color adjective \"rot\" was found 1256 times in the remaining 15,000 pages. The adjective \"klar\" was found even more often than \"rot\", occurring 1415 times. The word \"klar\" can mean different things depending on the context, for example clear or transparent. Color adjectives continue to be used heavily in the remaining three quarters of Wittgenstein's *Nachlass*.

Words referring to composers are very scarce in the additional 15,000 pages. In these pages, Wittgenstein mentions Bruckner 3 times. No other composer is mentioned. This shows that the philosopher writes about composers mostly in one or more of the documents belonging to the first 5,000 open pages of the *Nachlass*.
| Word | Frequency |
|------|-----------|
| Bruckner | 3 |

Additional frequencies for composers

Difference between old frequencies and new frequencies
------------------------------------------------------

The frequencies shown below are the new total frequencies for the 10 most frequent words of the old frequency table.

| Word | Frequency |
|------|-----------|
| rot | 2160 |
| klar | 1682 |
| blau | 499 |
| gelb | 247 |
| roten | 423 |
| rote | 294 |
| gelbe | 177 |
| schwarz | 241 |
| blue | 100 |
| rotes | 121 |

New frequencies for color adjectives

For several composers, the frequencies over all 20,000 pages are lower than the old frequencies, see table (#tab:old_composers). This would be very odd if it were not for the fact that the XML documents received from Bergen change in format from time to time. The changes fix some errors, but they can also introduce new transcription or edition problems. Understanding why the frequency decreased by 1 for Beethoven, Schubert and Brahms, by 2 for Haydn and by 3 for Mendelssohn would require an in-depth investigation, which exceeds the scope of this work.

| Word | Frequency |
|------|-----------|
| Beethoven | 40 |
| Schubert | 30 |
| Brahms | 25 |
| Mozart | 23 |
| Mendelssohn | 13 |
| Bruckner | 18 |
| Labor | Not marked as MUSIC |
| Wagner | 9 |
| Schumann | 8 |
| Haydn | 5 |

New frequencies for composers

No entry for Labor was found, either in the search bar of WiTTFind or in the newly created frequency list. Josef Labor was a composer, and an entry for his name can be found in the witt\_WAB\_dela\_XIX.txt lexicon:

`Labor Josef,Labor.EN+MUSIK+KOMPONIST`

The reason why this composer appears neither in the new semantic frequencies nor in the search bar of WiTTFind is that both programs check whether the full form of a lexicon entry is a key in the dictionary of frequencies over all words. The full form given by the lexicon is \"Labor Josef\".
The entries found for Labor in the dictionary `all_words_freqs` created by the script `all_freqlist.py` are:

`all_words_freqs = { ..., Labor: 8, Labors: 2, ... }`

\"Labor Josef\" is not a key of any item in the dictionary, since Wittgenstein never writes the composer's complete name. The error lies in the incomplete lexicon witt\_WAB\_dela\_XIX.txt, but it can easily be fixed by adding the following two entries:

`Labor,.EN+MUSIK+KOMPONIST`

`Labors,Labor.EN+MUSIK+KOMPONIST`

This type of error is explained in more detail in chapter (#evaluation).

New frequencies
---------------

So far, the old frequencies have been compared with the new ones. In this last part of the chapter, the new frequencies over all documents are shown. The 20 most common words for the semantic category color and for the semantic category music are shown in table (#tab:all_new_colors) and table (#tab:all_new_music) respectively.

| Word | Frequency |
|------|-----------|
| rot | 2160 |
| klar | 1682 |
| Rot | 865 |
| blau | 499 |
| roten | 423 |
| grün | 411 |
| Weiß | 375 |
| Grün | 298 |
| rote | 294 |
| gelb | 247 |
| Blau | 243 |
| schwarz | 241 |
| Gelb | 236 |
| rein | 221 |
| roter | 187 |
| gelbe | 177 |
| heller | 177 |
| schwarzen | 170 |
| Schwarz | 169 |

New frequencies over all words in the color semantic category

The word that ranks first in the new semantic frequency list for color is \"weiß\", which means white. This word is also the present-tense form of the verb wissen (to know) for the first and third person singular. The figures show that the full form \"weiß\" has the same frequency for each of its possible lemmas. This problem arises when the semantic categories of a word are assigned with the help of a lexicon alone. To disambiguate the meaning of a word, its POS tag should be taken into account when creating the frequency list over all words.

The following examples show two different uses of the word \"weiß\". In the first example, \"D.h.
also, er weiß immer mehr, als er zeigen kann.\" (That means he always knows more than he can show.), found in Ts-213,12r\[3\]\_3, the word is used as a verb. The tagging for it is shown in the following listing.

```
weiß
```

In the sentence \"im Schachspiel wird die weiße Farbe von Figuren zur Unterscheidung von der schwarzen Farbe andrer Figuren gebraucht.\" (In chess, the white color of pieces is used to distinguish them from the black color of other pieces.) from Ts-213,441r\[6\]\_2, the word \"weiße\" is an adjective.

```
weiße
```

The meaning of such words has to be disambiguated with the help of their tags. A purely lexical frequency list does not suffice to disambiguate the meaning of some full forms. This approach to creating frequency lists could be a research topic for a future thesis, since this kind of problem is not specific to the word shown in the example.

Red is one of the color adjectives most commonly used by Wittgenstein. It ranks first in the old frequency list and second in the new one. Other inflected forms of this adjective also make it into the list.

The 20 most common words in the semantic category music are topped by the word \"Form\". In music, form refers to the structure of a performance or composition. It is clear, though, that this word can also be used in many non-musical contexts, and it is therefore not surprising that its frequency is by far higher than that of all other words in this semantic category. All other words found in the list are less ambiguous. The complete lemmatized frequency lists for music and color can be found on the attached SD card.
| Word | Frequency |
|------|-----------|
| spielen | 839 |
| Ton | 446 |
| hören | 326 |
| Musik | 200 |
| Melodie | 184 |
| Thema | 155 |
| Klang | 142 |
| Töne | 132 |
| play | 92 |
| singen | 75 |
| Noten | 73 |
| Rhythmus | 67 |
| Klavier | 52 |
| playing | 46 |
| klingen | 45 |
| tone | 44 |
| Phrase | 43 |
| hear | 39 |
| Note | 36 |
| Musikstück | 35 |

New frequencies over all words in the music semantic category
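The chapter notes that changing the category in the regex is enough to produce a frequency list for another semantic category. As a rough illustration, the sketch below turns the category into a parameter and applies it to the two lexicon entries proposed above for Labor. It is not the code of `color_freqlist.py` or `music_freqlist.py`: the function names `semantic_freqlist` and `format_entry` are hypothetical, the lexicon is passed as a list of strings instead of a file, and the lookahead after the category name is an added assumption that prevents a category from matching as a prefix of a longer one.

```python
import re
from collections import defaultdict


def semantic_freqlist(lexicon_lines, all_frequencies, category):
    """Group full-form frequencies by lemma for one semantic category."""
    # Same shape as the regexes in this chapter, with the category as a parameter.
    # The lookahead (an assumption) requires the category name to end at
    # '+', ':', whitespace or the end of the line.
    pattern = re.compile(r"(.*),(.*)\..*\+" + re.escape(category) + r"(?=[+:\s]|$)")
    freqs = defaultdict(dict)
    for entry in lexicon_lines:
        match = pattern.match(entry)
        if not match:
            continue
        full_form = match.group(1).strip()
        # An empty lemma field means that the full form is its own lemma.
        lemma = match.group(2).strip() or full_form
        if full_form in all_frequencies:
            freqs[lemma][full_form] = all_frequencies[full_form]
    return dict(freqs)


def format_entry(lemma, full_forms):
    """Render one line as 'lemma,sum; fullform,freq; fullform,freq;'."""
    parts = [f"{lemma},{sum(full_forms.values())};"]
    parts += [f"{form},{freq};" for form, freq in full_forms.items()]
    return " ".join(parts)


if __name__ == "__main__":
    lexicon = [
        "Labor Josef,Labor.EN+MUSIK+KOMPONIST",  # existing entry, never matched
        "Labor,.EN+MUSIK+KOMPONIST",             # proposed additional entry
        "Labors,Labor.EN+MUSIK+KOMPONIST",       # proposed additional entry
    ]
    frequencies = {"Labor": 8, "Labors": 2}      # counts quoted in this chapter
    for lemma, forms in semantic_freqlist(lexicon, frequencies, "KOMPONIST").items():
        print(format_entry(lemma, forms))        # prints: Labor,10; Labor,8; Labors,2;
```

With the two additional entries in place, the lemma Labor would again receive a total of 10, which matches the value in the old composer table.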