The focal point of this little example is to proces an extract of the collection data from the National Gallery of Denmark. This collection is comprised of the museum’s currently (September 2020) digitally registered art works and collections which at this time contains around 80 thousand art works - a number that is constantly rising.

Our extract is delivered through SMK’s Application Programming Interface(API). An API is in brief a way to let software talk to eachother - in this case the API let’s R on my computer get acess to SMK’s digitally registered art works. The API also lets us specify very precisely, what data we are interested in.

Read more about the SMK-API here: https://www.smk.dk/article/smk-api/

Read the more technical aspects af the SMK-API here: https://api.smk.dk/api/v1/docs#/

The the following example will be processed in the software programme R, offering various methods for statistical analysis and graphic representation of the results. In R, one works with packages each adding numerous functionalities to the core of R-functions. The first step of the project is to load the relavant packages into R.

Loading the data from SMK’s API into R

First we need to load the relevant libraries:

library(jsonlite)
library(tidyverse)
library(leaflet)

This example shows how to extract all the digitally registered works within SMK with the painter C. W. Eckersberg noted as creator. The data is delivered in JSON-format, which is a way of storing and sending data as text. JSON-data can be parsed into R using the fromJSON function. The data is parsed as a large list-structure, where we right away extracts the exact list that contains the art works with ‘Eckersberg’ as the creator:

creator_eckersberg <- fromJSON("https://api.smk.dk/api/v1/art/search/?keys=*&filters=%5Bcreator%3AEckersberg%2C%20C.W.%5D&offset=0&rows=1000")

creator_eckersberg <- creator_eckersberg$items

For further explanation of the JSON format see - https://www.w3schools.com/js/js_json_intro.asp

Lets take a look at the first six data entries after we extracted the list from the initial response:

head(creator_eckersberg)

How many rows are there in total?

nrow(creator_eckersberg)

## [1] 744

The initial inspection shows nothing alarming: we have 744 rows - each corresponding to an art piece who have ‘Eckersberg’ as a creator - and we have 56 columns containing various information on the art pieces.

BUT

By shuffling through the columns (use the little black arrow at the top of the data frame) we see that some of the rows contains data frames and lists within themselves - this dataframe is a complex one.
This is not the way I like to work - I like to have my data tidy and use the functions put at disposal in the R-library tidyverse . The tidy format is defined as:

In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.

R for Data Science

Tidying the data

The solution to our problem with dataframes within rows is to pick and extract the relevant data from the nested dataframes. In other words we need to select the information we want from the dataframes. Since we need to pick out this information for each of the 732 works that our dataset consist of, we are going to do alot of iteration in this process. This is where the package purr from the tidyverse comes in handy as it manages iterations for us at the expense of readability of the code.
For the sake of limiting the extent of this example we wont be diving into this process of finding and extracting the relevant information from the nested dataframes. Instead we will just note that the key in this process is understanding vectors and lists and how R handles them. Jenny Bryant has created a great tutorial teaching this and how to employ the purr-package. It can be found here: https://jennybc.github.io/purrr-tutorial/index.html

For now we will just accept some magic from purr-functionality and use it to create 10 columns containing data from the 732 Eckersberg works, that we want to do the the following data processing on:

creator_eckersberg %>%
  mutate(title = map(titles, "title")) %>%
  mutate(title = map_chr(title, 1)) %>% 
  mutate(title_notes = map(titles, "notes")) %>%
  mutate(title_notes = map_chr(title_notes, 1, .null = NA)) %>%
  mutate(creator_1 = map(production, "creator")) %>%
  mutate(creator_1 = map_chr(creator_1, 1)) %>% 
  mutate(creator_2 = map(production, "creator")) %>%
  mutate(creator_2 = map_chr(creator_2, 2, .null = NA)) %>%
  mutate(production_start = map(production_date, "start")) %>%
  mutate(production_start = map_chr(production_start, 1, .null = NA)) %>%
  mutate(acquisition_date_precision = acquisition_date_precision) %>% 
  mutate(production_end = map(production_date, "end")) %>%
  mutate(production_end = map_chr(production_end, 1, .null = NA)) %>%  
  mutate(image_url = image_native) %>% 
  mutate(has_image = has_image) %>%
  mutate(image_thumbnail = image_thumbnail) %>% 
  select(title, creator_1, creator_2, title_notes, production_start, production_end,image_native, image_thumbnail, has_image, exhibitions, public_domain, rights) -> eckersberg_dfr

Lets examine our new dataframe(Use the black arrow in the top right corner of the dataframe to leaf through the columns):

eckersberg_dfr

By leafing through the columns we see that we now have one art piece pr. row. For each art piece we get the title, the creator, production_end and other information.

We however see that Englændernes sidste dåd appears two times, which is due to the fact that there are two versions. One which is black and white and the other which is coloured.
There is one problem though and this is the last column, exhibitions. This column is still dataframes within rows even though alot of the rows is <NULL> meaning that there is no data here.

Let us examine what is hidden inside one of these dataframes. Here we dive into the exhibitions-dataframe from the art piece with the title Et parti af Piazza del Campidoglio set mod balustraden med Dioskurerne og kirken S. Maria in Aracoeli:

eckersberg_dfr[[10]][6][[1]]

It shows that the above-mentioned art piece has been displayed on exhibitions three times and for each of the exhibitions we get the name of the exhibition and the name of venue. This information is going to be at the center of the further data processing. Because venue refers to a place in the real world we can plot them on a map and visualise where in the world Eckersberg has been on display.

But since an art piece is quite likely to have been on display more than once as was the case with Et parti af Piazza del Campidoglio set mod balustraden med Dioskurerne og kirken S. Maria in Aracoeli we are confronted with new issues since we prefer the one observation pr. row idea. Before we set about this - let’s appreciate that the data has a visual and aesthetic side, which the API gives us in form of a link to picture of the actual art pieces. We’ve mentioned Et parti af Piazza del Campidoglio set mod balustraden med Dioskurerne og kirken S. Maria in Aracoeli a couple of timex. Let’s see it:

Et parti af Piazza del Campidoglio set mod balustraden med Dioskurerne og kirken S. Maria in Aracoeli, C.W. Eckersberg, 1813

Since the exhibition data is going to be central from here on, we need to change our understanding of the data a little bit. Right now our data consist of one art piece pr. row. We are going to change that now based on the data in exhibitions. Instead we want each row to be an instance of an art piece on display. In this new way of understanding the data an art piece can occur several times if it has been on display more than once. Take our example from before. Et parti af Piazza[shortened] will appear three times in the data since it has been on display in both Sølvgade, Hamburg, and Fundation Custodia.

Transforming the data to this format is fairly easy in terms of R-code. We use the unnest function on the exhibitions-column, which creates a row for each exhibition while retaining all information from the other columns. Also, we specify that we only want the data where Eckersberg is noted as creator1. Since we are changing the fundamental structure of the data we save it to a new dataframe:

eckersberg_dfr %>% 
  filter(creator_1 == "Eckersberg, C.W.") %>%
  unnest(exhibitions) %>% 
  select(exhibition, everything()) -> eckersberg_exh

eckersberg_exh

We observe that this new dataframe only have 459 rows, which is due to the fact that not all of the works of Eckersberg have been on display. SMK’s digital registered Eckersberg works contains both painting and sketches, where the latter presumably not have been the focus of as much attention in terms of exhibitions as the paintings.

Since the data now has a format where each instance of an art work on display has its own row, we can now start asking questions about Eckersberg on display.

Analysis - Eckersberg on display

Which Eckersberg work has been displayed the most?

Because the art pieces appears in the data for each of the times it has been on display it is possible to count the titles to find out which of the artpieces has been displayed the most:

eckersberg_exh %>% 
  count(title, sort = TRUE)

Let’s visualise this result in order to gain more clarity:

eckersberg_exh %>% 
  count(title, sort = TRUE) %>% 
  top_n(3) %>%
  mutate(title = reorder(title, n)) %>% 
  ggplot(aes(x = title, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
      labs(x = "Title of art piece",
      y = "Count",
      title = "Count of times an Eckergsberg art\npieces has been on exhibition") + 
  theme(axis.text.y = element_text(angle = 14, hjust = 1))

With the chunk of code above it is suprisingly easy to ask questions to the Eckersberg-data. We would however be doing violence to the data’s nature if we didn’t take a look at these three most displayed art pieces by Eckersberg:

The three most displayed Eckersberg-works owned by SMK

Where have the most Eckersberg-works been displayed?

Based on the code above we will now change title to venue. This way we will see which venue have had the most Eckersberg-works on display. By changing the number in top_n(3) we can control how many venues that are shown:

eckersberg_exh %>% 
  count(venue, sort = TRUE) %>% 
  top_n(10) %>%
  mutate(venue = reorder(venue, n)) %>% 
  ggplot(aes(x = venue, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
      labs(x = "Venue",
      y = "Count",
      title = "Count of Eckerberg works dispersed on venues") + 
  theme(axis.text.y = element_text(angle = 14, hjust = 1))

Not surprisingly, “Sølvgade” is the venue where most of Eckersberg’s works are exhibited. This is of course due to the fact that the data that forms the background for these questions are extracted from SMK’s API and consists of art pieces in SMK’s possession with Eckersberg as creator1. Since SMK’s address is Sølvgade 48-50, 1307 København K, we are not surprised to see “Sølvgade” at the top of the venues. In other words, the works by Eckersberg that are in SMK’s possession are most often exhibited at SMK’s museum. However, one thing has crept in just above Sølvgade: “NA”. “NA” appears when data is missing. Let’s try to investigate it further:

NA-values in venue

By using filter(is.na(venue)) we will only get the instances of an displayed art piece where venue is NA

eckersberg_exh %>% 
  filter(is.na(venue)) %>%
  select(title, venue, exhibition, everything())

We see that there are eleven works where no venue is listed for their exhibitions. By flipping through columns (use the small black arrow in the upper right corner of the data frame) we can see that it is primarily the exhibition “Det skiønne ær - fra København til dansk vestindien” that lacks information about the venue. For now we wont do further, but we could find out where this exhibition was and then add it to our dataset. However, this would be beyond the scope for this little exaple. Instead, we remove the NAs from our last count of which venues have had the most Eckersberg works on display.

eckersberg_exh %>% 
  drop_na(venue) %>% 
  count(venue, sort = TRUE) %>% 
  top_n(10) %>%
  mutate(venue = reorder(venue, n)) %>% 
  ggplot(aes(x = venue, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
      labs(x = "Venue",
      y = "Count",
      title = "Count of Eckerberg works dispersed on venues") + 
  theme(axis.text.y = element_text(angle = 14, hjust = 1))

The above visualisation may make one wonder - how far do SMK’s Eckersberg works on display actually go?

SMK’s Eckersberg-works in the world

In the following, we will use the leaflet package to visualise where in the world SMK’s Eckersberg works have been on display. We will do this based on the venues we looked at just before. Specifically, we will try to place all the works on a map based on their venues. Since the venues appear to be the names of art museums mostly, we need to translate them into coordinates that the leaflet package needs to place them.

Open Street Map Nominatim - converting names/adresses to coordinates

For the proces of converting venues to coordinates we will use the tool from Open Street Map(OSM) called Nominatim. This tool takes a name or an adress and converts it into a coordinate based on the OSM-data. The Nominatim has an API where we can send our venues to and quickly get coordinates back. Based on this API we will now use a function created for this API:

The following function is borrowed from: OSM Nominatim with R: getting Location’s Geo-coordinates by its Address

## geocoding function using OSM Nominatim API
## details: http://wiki.openstreetmap.org/wiki/Nominatim
## made by: D.Kisler 
nominatim_osm <- function(address = NULL)
{
  if(suppressWarnings(is.null(address)))
    return(data.frame())
  tryCatch(
    d <- jsonlite::fromJSON( 
      gsub('\\@addr\\@', gsub('\\s+', '\\%20', address), 
           'http://nominatim.openstreetmap.org/search/@addr@?format=json&addressdetails=0&limit=1')
    ), error = function(c) return(data.frame())
  )
  if(length(d) == 0) return(data.frame())
  return(data.frame(lon = as.numeric(d$lon), lat = as.numeric(d$lat)))
}

Let’s us see if it works:

nominatim_osm("Hamburger Kunsthalle")

Cool! Initial test indicates that it is working. The coordinates is pointing to Hamburger Kunsthalle: https://www.openstreetmap.org/search?query=10.003%0953.55529%09#map=17/53.55529/10.00300

Let’s try one more:

nominatim_osm("Fundation Custodia")

OSM Nominatim is a marvelous tool, but i doesn’t just take anything and interprets it into an address. Let’s first find out which venues it actually can convert into coordinates and then deal with the rest afterwards.

Since Nominatim OSM only interprets one name/address to coordinate per second, we must have made a summary of venues in our data set (as seen in the latest visualisation there will be above 50 occurences of “Hamburger Kunsthalle”, so there is no reason to ask OSM several times where “Hamburger Kunsthalle” is).

eckersberg_exh %>%
  drop_na(venue) %>% 
  select(venue) %>% 
  group_by(venue) %>% 
  summarise_all(max) -> venues

venues

So this is the list over places where SMK-owned Eckersberg works have been on display - now it is time to see how many of these venues OSM Nominatim will return coordinates to. First we pull the venues out as as value:

venues %>% 
  pull(venue) ->venues

Next step is to see how many of the venues OSM Nominatim is returning as coordinates. (This step is also borrowed from OSM Nominatim with R: getting Location’s Geo-coordinates by its Address)

venues_geo_loc <- suppressWarnings(lapply(venues, function(venues) {
  #set the elapsed time counter to 0
  t <- Sys.time()
  #calling the nominatim OSM API
  api_output <- nominatim_osm(venues)
  
  if (nrow(api_output) == 0) {
    api_output <- data.frame(lat = NA, lon = NA)
  }
  #get the elapsed time
  t <- difftime(Sys.time(), t, 'secs')
  #return data.frame with the input address, output of the nominatim_osm function and elapsed time
  return(data.frame(venues = venues, api_output, elapsed_time = t))
  }) %>%
#stack the list output into data.frame
bind_rows() %>% data.frame())
#output the data.frame content into console
venues_geo_loc

Surprisingly, OSM has found a coordinator for many of the venues, but what do we do with the remaining ones?

Venues with missing coordinates

The first step is to see how many venues lack coordinates and isolate them in their own dataframe:

#Dataframe for venues with missing coordinates
venues_geo_loc %>% 
  filter(is.na(lon)) -> venues_geo_loc_missing

#Dataframe for venues wirth coordinates. We save them for later
venues_geo_loc %>% 
  filter(!is.na(lon))-> venues_geo_loc

#Lets examine the venues with missing coordinates
venues_geo_loc_missing

What to do with these? Since there are “only” thirteen we are going to search for each venue in google and manually put in their adress in a new adress-column:

#Initiate new empty column called adress
venues_geo_loc_missing %>% 
  mutate(adress = NA) %>% 
  select(venues, adress)-> venues_geo_loc_missing

#Manually putting in adresses in the rows

venues_geo_loc_missing$adress[1] <- "Aros Allé 2, 8000 Aarhus"
venues_geo_loc_missing$adress[2] <- "Ny Vestergade 10, 1471 København K"
venues_geo_loc_missing$adress[3] <- "Via di San Pietro in Carcere, 00186, Roma"
venues_geo_loc_missing$adress[4] <- "Søren Kierkegaards Plads 1, 1221 København K"
venues_geo_loc_missing$adress[5] <- "121 Rue de Lille, 75007 Paris, Frankrig"
venues_geo_loc_missing$adress[6] <- "Jernbanegade 13, 5000 Odense C"
venues_geo_loc_missing$adress[7] <- "Valge 1, 10127 Tallinn, Estland"
venues_geo_loc_missing$adress[8] <- "Düsternbrooker Weg 1, 24105 Kiel, Tyskland"
venues_geo_loc_missing$adress[9] <- "Sjøgata 1, 9008 Tromsø, Norge"
venues_geo_loc_missing$adress[10] <- "50-52 Rue Cardinal Fesch, 20000 Ajaccio, Frankrig"
venues_geo_loc_missing$adress[11] <- "Schaumainkai 63, 60596 Frankfurt am Main, Tyskland"
venues_geo_loc_missing$adress[12] <- "Ulitsa Volkhonka, 12, Moscow, Rusland"
venues_geo_loc_missing$adress[13] <- "Æblehaven 23, 6000 Kolding"

venues_geo_loc_missing %>% 
  pull(adress) -> venues_missing

Now we run the exact same process as before but this time it will be run on the venue’s adresses and not just their names:

venues_geo_loc_adress <- suppressWarnings(lapply(venues_missing, function(venues_missing) {
  #set the elapsed time counter to 0
  t <- Sys.time()
  #calling the nominatim OSM API
  api_output <- nominatim_osm(venues_missing)
  
  if (nrow(api_output) == 0) {
    api_output <- data.frame(lat = NA, lon = NA)
  }
  #get the elapsed time
  t <- difftime(Sys.time(), t, 'secs')
  #return data.frame with the input address, output of the nominatim_osm function and elapsed time
  return(data.frame(venues_missing = venues_missing, api_output, elapsed_time = t))
  }) %>%
#stack the list output into data.frame
bind_rows() %>% data.frame())
#output the data.frame content into console
venues_geo_loc_adress

We put them back together with the venues names:

venues_geo_loc_adress %>% 
  mutate(venues = venues_geo_loc_missing$venues) %>% 
  select(venues, lon, lat, elapsed_time) -> venues_geo_loc_adress

venues_geo_loc_adress

And then we put it back together with the venues who got coordinates in our first run at the OSM API:

bind_rows(venues_geo_loc, venues_geo_loc_adress) %>% 
  rename(venue = venues) -> venues_coord

venues_coord

The next step is to put these information into our dataframe eckerberg_exh, where each row is an instance of a displayed work

eckersberg_exh %>% 
  left_join(venues_coord, by = "venue") -> eckersberg_exh

eckersberg_exh

Now each instance of a displayed art piece is enriched with the coordinates to the place, where it was displayed. Now we are ready to put the instances of displayed art works on a map with leaflet.

Creating a world map showing Eckersberg’s displayed works

First we load the library

library(leaflet)

For a thorough demonstration of the possibilities of leaflet see:

https://rstudio.github.io/leaflet/

In the following each step will be very concise.

Creating html-content for the popup-markers. This is what creates the data and pictures shown in the popup-markers on the map

eckersberg_exh %>% 
  mutate(content = paste0('<b>Title:</b> ', title, '<br>', '<b>Exhibition: </b>', exhibition, '<br>', '<b>Venue:</b> ', venue, '<br>', '<a href="', image_native, '">', '<img src="', image_thumbnail,'" width="150">', '</a> <br>', '<a href="', image_native, '">Download image in full size</a>')) -> eckersberg_exh

title_html <- '<b>Map over where SMK-owned Eckersberg-works have been on display</b><p>For more info on the data processing <a href="https://library.au.dk/fileadmin/www.nobel.bibliotek.au.dk/LibLab/files/r-markdown/eckersberg.html">click here</a></p>'

The last step is to create the map. The map is interactive and each marker contain information on the displayed art piece:

eckersberg_exh %>% 
  leaflet() %>%
  addTiles() %>%
  addMarkers(~lon, ~lat, label = ~title, popup = ~content, clusterOptions = markerClusterOptions()) %>%
  addControl(title_html, position = "bottomright")

Click here to enlarge the map
For questions please contact maop@kb.dk.

Importing data from SMK(National Gallery of Denmark)-API

Max Odsbjerg Pedersen, Aarhus University Library, Royal Danish Library

2020-10-13