This report documents text mining on L. Frank Baum’s The Wonderful Wizard of Oz. The textual data used in this report was gathered from Project Gutenberg via the R package gutenbergr.

The Wizard of Oz dataset comes in the form of a CSV file (comma-separated values), a format for storing tabular data. The dataset is structured as follows:

Each row has the following columns:

gutenberg_id: the unique ID given to the book by Project Gutenberg

text: the text of The Wonderful Wizard of Oz, spread across lines

title: the title of the book, “The Wonderful Wizard of Oz”

linenumber: the line number of the text

chapter: the chapter that the line belongs to
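
The finished CSV file is supplied with the workshop, but the following is a minimal sketch of how such a file could be produced with gutenbergr, assuming internet access; the chapter-detection regex is an assumption and may need adjusting to the chapter headings in the actual e-text:

library(gutenbergr)
library(tidyverse)

# Download the book (Project Gutenberg ID 55) together with its title,
# then add a line number and a running chapter counter before saving.
# NB: the "^chapter " pattern is a guess at the chapter headings.
gutenberg_download(55, meta_fields = "title") %>% 
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>% 
  write_csv("data/WizardOfOz.csv")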

The dataset is processed in R, a software environment offering various methods for statistical analysis and graphical representation of results. In R, one works with packages, each adding numerous functionalities to R’s core functions. In this example, the relevant packages are:

library(tidyverse)
library(tidytext)
library(ggplot2)
library(ggwordcloud)

Documentation for each package:
https://www.tidyverse.org/packages/
https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
https://ggplot2.tidyverse.org/
https://cran.r-project.org/web/packages/ggwordcloud/vignettes/ggwordcloud.html

Additional information about R: https://www.r-project.org/

# Data import

The dataset is loaded into R via the read_csv function, which is given the path to the CSV file located in the folder “data”:

wizard <- read_csv("data/WizardOfOz.csv")

The data processing will be based on the Tidy Data Principle as it is implemented in the tidytext package. The idea is to take text and break it into individual words, so that there is just one word per row in the dataset. However, this poses a problem for proper nouns such as “tin man”. Tokenising the text will split “tin” and “man” into separate rows, which results in a loss of meaning: “tin” and “man” on their own do not mean the same as the semantic unit “tin man”, which we destroy when converting to the tidy text format. We therefore want to avoid these splits, which is done via the regular expression and replacement:

"(tin) (man)", "\\1_\\2"

This expression prompts R to look for all the instances where “tin” is followed by a space, followed by “man”. The parentheses capture the two words, the backreferences \\1 and \\2 put them back in the replacement, and the space between them is replaced by “_”, so that:

“tin man” is changed to “tin_man”
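
To see the effect on a single made-up string:

str_replace_all("the tin man walked", "(tin) (man)", "\\1_\\2")
## [1] "the tin_man walked"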

Reading The Wonderful Wizard of Oz, you realise that the tin man is mostly referred to as the Tin Woodman. We do the exact same thing for this instance and for other proper nouns spanning more than one word:

# Glue multi-word proper nouns together with "_" before tokenising
wizard %>% 
  mutate(text = str_replace_all(text, pattern = "(tin) (man)", "\\1_\\2")) %>%
  mutate(text = str_replace_all(text, pattern = "(Tin) (Woodman)", "\\1_\\2")) %>% 
  mutate(text = str_replace_all(text, pattern = "(Wicked) (Witch)", "\\1_\\2")) -> wizard

This cleaning will prove to be inadequate, since there will certainly be more proper nouns spanning more than one word, and there may be other semantic units besides proper nouns. You can always return to this step and add more replacements. The code above is in reality nothing more than a complex “search and replace” as we know it from Word; the only difference is that we use regular expressions. These can be used to extract data from text or to alter it, as we have done here. Regular expressions are very powerful, as they can be used to search for complex patterns of text. They are also somewhat complex themselves, and a thorough survey is outside the scope of this workshop. If you wish to learn regular expressions, I can recommend this interactive tutorial: https://regexone.com.
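
As a small made-up example of extraction (not part of our cleaning pipeline), stringr can also pull matching patterns out of a string:

# Find all pairs of capitalised words, e.g. two-word proper nouns
str_extract_all("Dorothy met the Tin Woodman and the Wicked Witch", "[A-Z][a-z]+ [A-Z][a-z]+")
## [[1]]
## [1] "Tin Woodman" "Wicked Witch"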

Next, we transform the data into the aforementioned tidy text format, which places each word in a row of its own. This is achieved with the unnest_tokens function:

wizard %>% 
  unnest_tokens(word, text)
## # A tibble: 40,418 x 5
##    gutenberg_id title                      linenumber chapter word   
##           <dbl> <chr>                           <dbl>   <dbl> <chr>  
##  1           55 The Wonderful Wizard of Oz          1       1 1      
##  2           55 The Wonderful Wizard of Oz          1       1 the    
##  3           55 The Wonderful Wizard of Oz          1       1 cyclone
##  4           55 The Wonderful Wizard of Oz          2       1 <NA>   
##  5           55 The Wonderful Wizard of Oz          3       1 <NA>   
##  6           55 The Wonderful Wizard of Oz          4       1 dorothy
##  7           55 The Wonderful Wizard of Oz          4       1 lived  
##  8           55 The Wonderful Wizard of Oz          4       1 in     
##  9           55 The Wonderful Wizard of Oz          4       1 the    
## 10           55 The Wonderful Wizard of Oz          4       1 midst  
## # … with 40,408 more rows


# Analysis

Now we will find the words that appear most commonly per chapter in The Wonderful Wizard of Oz.

wizard %>% 
  unnest_tokens(word, text) %>% 
  count(chapter, word, sort = TRUE)
## # A tibble: 11,489 x 3
##    chapter word      n
##      <dbl> <chr> <int>
##  1      12 the     286
##  2      11 the     249
##  3       8 the     188
##  4      15 the     180
##  5      12 and     178
##  6      11 and     162
##  7       2 the     161
##  8      14 the     154
##  9       7 the     150
## 10      10 the     146
## # … with 11,479 more rows

Not surprisingly, function words such as “the” and “and” are the most common words we find. This is not particularly interesting for this enquiry, as we want to see which words are specific to the individual chapters, and these common words appear in all chapters. The first step is finding a measure that allows us to compare the frequency of words across the chapters. We can do this by calculating the word’s, or the term’s, frequency (tf):

$$\mathrm{tf}(\mathrm{term}) = \frac{n_{\mathrm{term}}}{N_{\mathrm{chapter}}}$$

where $n_{\mathrm{term}}$ is the number of times the term occurs in a chapter and $N_{\mathrm{chapter}}$ is the total number of words in that chapter.

Before we can take this step, we need R to count how many words there are in each chapter. This is done by using the function group_by followed by summarise:

wizard %>% 
  unnest_tokens(word, text) %>% 
  count(chapter, word, sort = TRUE) %>% 
  group_by(chapter) %>% 
  summarise(total = sum(n)) -> total_words
## `summarise()` ungrouping output (override with `.groups` argument)
total_words
## # A tibble: 24 x 2
##    chapter total
##      <dbl> <int>
##  1       1  1170
##  2       2  2068
##  3       3  2022
##  4       4  1482
##  5       5  2107
##  6       6  1547
##  7       7  1841
##  8       8  1983
##  9       9  1433
## 10      10  2017
## # … with 14 more rows

Then we add the total number of words to our dataframe, which we do with left_join:

wizard %>%
  unnest_tokens(word, text) %>% 
  count(chapter, word, sort = TRUE) %>% 
  left_join(total_words, by = "chapter") -> wizard_counts
wizard_counts
## # A tibble: 11,489 x 4
##    chapter word      n total
##      <dbl> <chr> <int> <int>
##  1      12 the     286  3744
##  2      11 the     249  3704
##  3       8 the     188  1983
##  4      15 the     180  2852
##  5      12 and     178  3744
##  6      11 and     162  3704
##  7       2 the     161  2068
##  8      14 the     154  1940
##  9       7 the     150  1841
## 10      10 the     146  2017
## # … with 11,479 more rows

Now we have the numbers we need to calculate the frequency of the words. Below we calculate the frequency of the word “the” in chapter 8:

$$\mathrm{tf}(\text{"the"}) = \frac{188}{1983} = 0.09480585$$
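
We can check this number directly in R with the wizard_counts dataframe from above:

wizard_counts %>% 
  filter(word == "the", chapter == 8) %>% 
  mutate(tf = n / total)
## # A tibble: 1 x 5
##   chapter word      n total     tf
##     <dbl> <chr> <int> <int>  <dbl>
## 1       8 the     188  1983 0.0948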

By calculating the frequency of the terms, we can compare them across chapters. However, it is not terribly interesting to compare the word “the” between chapters. Therefore, we need a way to “punish” words that occur frequently in all chapters. To achieve this, we use the inverse document frequency (idf):

$$\mathrm{idf}(\mathrm{term}) = \ln\left(\frac{N}{n_{\mathrm{term}}}\right)$$

where $N$ is the total number of documents (chapters, in our example) and $n_{\mathrm{term}}$ is the number of chapters in which the term occurs.

$$\mathrm{idf}(\text{"the"}) = \ln\left(\frac{24}{24}\right) = 0$$
Thus we punish words that occur with great frequency in all or many chapters. Words that occur in every chapter cannot really tell us much about a particular chapter; they will have an idf of 0, resulting in a tf_idf value that is also 0, since tf_idf is defined as tf multiplied by idf.
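
To make the definitions concrete, here is a manual sketch of the same calculation with dplyr (an illustration only, not the workshop’s pipeline):

wizard_counts %>% 
  group_by(word) %>% 
  mutate(idf = log(24 / n_distinct(chapter))) %>% # 24 chapters in total; log() is the natural log
  ungroup() %>% 
  mutate(tf = n / total,
         tf_idf = tf * idf)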

Luckily, R can easily calculate tf and tf_idf for all the words by using the bind_tf_idf function:

wizard_tfidf <- wizard_counts %>% 
  bind_tf_idf(word, chapter, n)

wizard_tfidf
## # A tibble: 11,489 x 7
##    chapter word      n total     tf   idf tf_idf
##      <dbl> <chr> <int> <int>  <dbl> <dbl>  <dbl>
##  1      12 the     286  3744 0.0764     0      0
##  2      11 the     249  3704 0.0672     0      0
##  3       8 the     188  1983 0.0948     0      0
##  4      15 the     180  2852 0.0631     0      0
##  5      12 and     178  3744 0.0475     0      0
##  6      11 and     162  3704 0.0437     0      0
##  7       2 the     161  2068 0.0779     0      0
##  8      14 the     154  1940 0.0794     0      0
##  9       7 the     150  1841 0.0815     0      0
## 10      10 the     146  2017 0.0724     0      0
## # … with 11,479 more rows

Nonetheless, we still do not see any interesting words. This is because the table is still sorted by the raw counts, so the most frequent words dominate the top. Instead, we will ask R to list the words in descending order of tf_idf, from highest to lowest:

wizard_tfidf %>% 
  select(-total) %>% 
  arrange(desc(tf_idf))
## # A tibble: 11,489 x 6
##    chapter word         n     tf   idf tf_idf
##      <dbl> <chr>    <int>  <dbl> <dbl>  <dbl>
##  1      24 24           1 0.0123  3.18 0.0392
##  2      24 cabbages     1 0.0123  3.18 0.0392
##  3      24 darling      1 0.0123  3.18 0.0392
##  4      24 folding      1 0.0123  3.18 0.0392
##  5      24 kisses       1 0.0123  3.18 0.0392
##  6       9 mice        24 0.0167  2.08 0.0348
##  7      24 covering     1 0.0123  2.48 0.0307
##  8       9 queen       17 0.0119  2.48 0.0295
##  9      22 hill        11 0.0113  2.48 0.0280
## 10      17 balloon     18 0.0151  1.79 0.0270
## # … with 11,479 more rows

It turns out that words from chapter 24 are heavily represented at the top of this list. This is due to chapter 24 being a lot shorter than the other chapters: it contains only 81 words, where the longest chapter has 3744 words! This gives the words in chapter 24 an unusually high tf_idf.
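
We can confirm this by sorting the chapter totals in the total_words dataframe from earlier; chapter 24, with its 81 words, will appear at the top:

total_words %>% 
  arrange(total)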

This is in fact the entire chapter 24:

Aunt Em had just come out of the house to water the cabbages when she looked up and saw Dorothy running toward her. “My darling child!” she cried, folding the little girl in her arms and covering her face with kisses. “Where in the world did you come from?” “From the Land of Oz,” said Dorothy gravely. “And here is Toto, too. And oh, Aunt Em! I’m so glad to be at home again!”

So chapter 24 is polluting our results, and since we are trying to compare 24 chapters to each other, it is hard to say anything based on a list of words alone. The next step is to visualise our results!

# Visualisation

Many people who have dipped their toes in the text mining/data mining sea will have seen word clouds showing the most used words in a text. In this visualisation we are going to create a word cloud for each chapter, showing the words with the highest tf_idf from that chapter. By doing so, we get a nice overview of which words are specific and important to each chapter. Remember that words which appear a lot across the chapters will not show up here.

wizard_tfidf %>%
  arrange(desc(tf_idf)) %>%
  filter(chapter != 24) %>% # drop the very short chapter 24
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(chapter) %>% 
  top_n(8) %>% # keep the 8 words with the highest tf_idf per chapter
  ungroup() %>%
  ggplot(aes(label = word, size = tf_idf, color = tf_idf)) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 15) +
  theme_minimal() +
  facet_wrap(~chapter, ncol = 6, scales = "free") +
  scale_color_gradient(low = "darkgoldenrod2", high = "darkgoldenrod4") +
  labs(
      title = "The Wonderful Wizard of Oz: most important words per chapter",
      subtitle = "Importance determined by term frequency (tf) - inverse document frequency (idf)",
      caption = "Data from Project Gutenberg - ID: 55")
## Selecting by tf_idf
## Warning in wordcloud_boxes(data_points = points_valid_first, boxes = boxes, :
## Some words could not fit on page. They have been placed at their original
## positions.


Because the space for visualisation is constrained in this .Rmd format, we have to save the result as a PDF, where we can define a larger canvas. Run the last code chunk, chunk 12, and look in the Files pane to the right. In the folder “doc” you should get a file called “wizard-wordcloud.pdf”. This version is readable.

ggsave("doc/wizard-wordcloud.pdf", width = 65, height = 35, units = "cm")

Congratulations! You have completed your very first text mining task and created an output! You are now ready to venture further into the world of tidy text mining. This short introduction was based on the book Text Mining with R: A Tidy Approach. Now that you know how to use an R Markdown document, you can use the book to explore its methods!