Exploring Data Visualizations

A day after the Liberal Party’s majority government win, the advanced cartography class went to work on infographics along the theme of Liberal Party promises in the style of a Globe and Mail visualization. 

In light of the ambitious pledge to expand government supported Syrian refugee numbers to 25,000 in 6 months, I produced an infographic that examines the capacity of countries to process asylum applications in the face of a mass exodus out of SyriaAs a class, we (the students + our two incredibly patient and supportive instructors Sally & Joey) meet at the Geography Building once a week for three hours to learn how to build data complex, user-oriented maps. It amuses me how excited I get when I enter our working space. Evidence on the whiteboard reveals that the same room we discuss and code away in was only moments ago a seminar space for faculty and students in the department.

When we are there, the space becomes a different kind of meeting space – it becomes a place of intense focus and problem-solving, where collaboration and communication is occurring among virtually every desk, and students, like nodes, work in a network of constant information exchange. The “lab” assignments are no longer as familiar, straight-forward and “intuitive”, but require students to ask basic questions and think about what we are encoding when we write arguments.

Our latest activity proved to be more time consuming then the last [if you have no idea what I’m referring to, read the previous two posts about my short time with Processing]. In the past two weeks we have transitioned from Processing to **RStudio **to build interactive maps that visualize sets of data as a way to show the spatial dimensions of objects (e.g. 3-1-1 complaint calls, low- to high-income residents, bakeries or pizzerias) and assist thinking about relationships and differences through comparison.

Working with Vancouver’s crime data we each created a crime map using data obtained from the City’s open data catalogue. I worked with crime data from 2007 to produce a basic interactive map that has:

  1. toggle-able point layers for the different crime categories with pop-up on click

  2. a toggle-able polygon layer for the crime density for your year using the hex bin with a popup on click 

  3. suitable basemap tiles

  4. appropriate colors and point sizing


Visualization Observations 
Here’s what it looks like: Unfortunately the length of my 23,000+ geocoded points are overwhemling for tumblr, so I won’t be able to embed the interactive map right now. If anyone knows how I can get around this issue, let me know!

A snapshot of the map will have to sit pretty as a placeholder for now…

Some immediate observations when examining this map are the areas with the highest crime density in the city. In 2007 Downtown and the Downtown Eastside were areas with a higher total counts of crime events. Specifically at the West Pender Street and Richards Street intersection crime event count is highest. When we look at where these counts are coming from we see that in the crime categories theft from auto, commerical breaking and enetering, and mischief there is a higher concentration of points falling within the polygons in these neighborhoods. Theft of auto is slightly higher along and one block north and south of Broadway Street, along Howe Street and Davie street than in other areas of the city. Across the residential zones there is a uniform occurence of theft of auto.

A few issues I found with the data set: 
Typos, and the crime types not being normalized. The crime types the City used felt redundant and not very detailed. 

Synthesizing cartography & code

When building a map, a number of steps are taken to successfully communicate information to the map user. This is the primary objective of any informative data visualization (e.g. diagram, infographic, map, documentary). Often cartographers call up the Cartographic Communication System as a theoretical approach and guideline to map making or “geospatial visualization”. In this system, GIS analysts and cartographers draw from their perception of the real world and its multidimensional, complex reality to encode information about their conception of the real world for the map user to decode. So a process like the cartoony one below occurs every time a map is born:

Diagram of the “Cartographic Communication System” from Sally Hermansen. Acquired on September 19, 2015.

In the gap in between the cartographer’s conception of the real world and the creation of a map is a process of data visualization that we will unpack using Ben Fry’s data visualization workflow or “pipeline”.

Diagram of the 7 steps in the pipeline from Ben Fry’s Visualizing Data (2008), p15.

The pipeline is made up of a number of stages that are key to bringing together data and design for the purpose of communicating data and assist in reasoning about reality. These stages include the following:

  1. Acquire - Obtain data from a source, with consideration to the quality of the data source including sponsors, interests and agendas. This is important for the credibility of your visualization.

  2. Parse - Provide a structure for the data, and order it into categories.

  3. Filter - Isolate from the raw data the data of interest.

  4. Mine - Apply methods from statistics as a way to draw comparisons and discern patterns.

  5. Represent - Choose a visual model for evidence presentation.

  6. Refine - Improve the representation to make it clearer and more visually engaging.

  7. Interact - Add methods for manipulating the data or controlling visibility of features for data exploration.

With the assistance of the pipeline model I was able to break up my workflow into stages when building my first interactive map, which helped when keeping track of my progress.

A walk through the script

After starting a new project and script on RStudio, I installed and loaded up some packages I needed for the map: 

######################################################

Vancouver 3-1-1: Data Processing Script

Date: 2015-10-13

By: Audrey S

Desc: Interactive Web Map of Crime in 2007

######################################################

-------------------------------------------------

--------------- Install Libararies ---------------

-------------------------------------------------

#install.packages("GISTools")

#install.packages('rgdal')

#install.packages('curl')

#install.packages("devtools")

#devtools::install_github("corynissen/geocodeHERE")

-------------------------------------------------

---------------- Load Libararies -----------------

-------------------------------------------------

library(GISTools) library(rgdal) library(geocodeHERE) library(curl) First step is acquire the data. Once I downloaded the City’s 2007 crime data, I stored the csv data file’s filepath under the value “fname” for later use, such as reading in the file for data inspection.

-------------------------------------------------

------------------- Acquire ---------------------

-------------------------------------------------

access data from saved location using filepath"

fname=('C:/Users/Audrey/Documents/2015-2016/GEOG472/crime_2007/crime_2007.csv')

Read data as csv

data = read.csv(fname, header=T)

inspect your data

print(head(data)) Second step is parsing the data, which means standardizing and structuring the data. To protect the privacy of individuals and businesses, the city used Xs for the last two digits of addresses. Because geocoding each crime point address to latitude and longitude coordinates requires recognizable numbers and values, I substituted all Xs for 0s. I also included the city, province, and country for geocoding as well. Intersections are also represented by dashes, these need to be substituted with & or for there to be no issues with geocoding the events. Once these were fixed, I was able to start geocoding with Nokia’s Here API. Once geocoding was done, I added the lat and lon coordinates to the dataframe and wrote out the file as a csv.

-------------------------------------------------

-------------- Parse: Geocoder -------------------

-------------------------------------------------

change intersection to 00's

data$h_block = gsub("X", "0", data$HUNDRED_BLOCK) print(head(data$h_block))

Create new column for full address by adding "Vancouver, BC":

data$full_address = paste(data$h_block, paste("Vancouver, BC, Canada", sep=", "), sep=", ")

change dashes (representing intersections) to &

data$full_address = gsub("/", "&", data$full_address) print(head(data$full_address))

Geocode the events - we use Nokia's Here API

Create an empty vector for lat and lon coordinates

lat = c(); lon = c()

loop through the addresses

for(i in 1: length(data$full_address)){

store the address at index "i" as a character

address = as.character(data$full_address[i])

append the latitude of the geocoded address to the lat vector

lat = c(lat,geocodeHERE_simple(address)$Latitude)

append the longitude of the geocoded address to the lon vector

lon = c(lon,geocodeHERE_simple(address)$Longitude)

at each iteration through the loop, print the coordinates - takes about 40 min.

print(paste("#",i,", ", lat[i], lon[i], sep=",")) }

add the lat lon coordinates to the dataframe

data$lat = lat data$lon = lon

write file out for geocoded data

ofile ='/Users/Audrey/Documents/2015-2016/GEOG472/crime_2007/vancrime2007_audrey_schalhoub.csv' write.csv(data, ofile) Next step of the process is Mining. First I examined the different types of crime and assigned a classification ID number for creating a point layer that will be used when I set up the map. Here I determine each class. Although I could keep all the crime types listed by the City, I want to combine theft of auto above $5000 and theft of auto under $5000 as one category “theft of auto”. This will also be done for “theft from auto” and “mischief”. If the user wants to explore the specifics of the crime event i.e. whether the cost was above or under $5000, they can with the function of a pop-up by clicking on an individual marker. During this stage I noticed there was a typo in the City’s dataframe and corrected it. What I wanted to know from this data set was the frequencies for different crime types and found that the top three types of crime in 2007 by order of frequencies were 1) Theft from auto under $5000 (12093 occurences), 2) Mischief under $5000 (4890 occurences), and 3) Commercial breaking and entering (2501 occurences).

-------------------------------------------------

--------------------- Mine ----------------------

-------------------------------------------------

--- Examine the unique cases ---

examine how the cases are grouped

unique(data$TYPE)

Print each unique case on a new line for easier inspection

for (i in 1:length(unique(data$TYPE))){ print(unique(data$TYPE)[i], max.levels=0) }

fix spelling errors - change "thef" to "theft"

data$TYPE = gsub("Thef", "Theft", data$TYPE) print(head(data$TYPE)) data$TYPE = gsub("Theftt", "Theft", data$TYPE) print(head(data$TYPE))

Determine classes to group case types:

theft from auto

theft_from_auto= c('Theft From Auto Under $5000','Theft From Auto Over $5000')

#theft of auto theft_of_auto= c('Theft Of Auto Under $5000','Theft Of Auto Over $5000')

commercial breaking and entering

commercial_be = c('Commercial BE')

mischief

mischief = c('Mischief Over $5000','Mischief Under $5000')

give class id numbers:

data$cid = 9999 for(i in 1:length(data$TYPE)){ if(data$TYPE[i] %in% theft_from_auto){ data$cid[i] = 1
}else if(data$TYPE[i] %in% theft_of_auto){ data$cid[i] = 2 }else if(data$TYPE[i] %in% commercial_be){ data$cid[i] = 3 }else if(data$TYPE[i]%in% mischief){ data$cid[i] = 4 }else{ data$cid[i] = 0 } }

plot the data to view the distribution of crime data

plot(data$lon, data$lat)

--- handle overlapping points ---

Set offset for points in same location:

data$lat_offset = data$lat data$lon_offset = data$lon

Run loop - if value overlaps, offset it by a random number

for(i in 1:length(data$lat)){ if ( (data$lat_offset[i] %in% data$lat_offset) && (data$lon_offset[i] %in% data$lon_offset)){ data$lat_offset[i] = data$lat_offset[i] + runif(1, 0.0001, 0.0005) data$lon_offset[i] = data$lon_offset[i] + runif(1, 0.0001, 0.0005) } }

#plot the data to view distribution of offsetted points plot(x = data$lon_offset,y = data$lat_offset)

--- what are the top crimes in 2007? ---

get a frequency distribution of Metro Vancouver's crimes in 2007:

top_crimes = data.frame(table(data$TYPE)) top_crimes = top_crimes[order(top_crimes$Freq),] print(top_crimes)

The fourth step is filtering, which I will do to remove all points that reside outside the bounds of the City of Vancouver. During this step I will also write out the filtered data and convert the data I have to a shapefile with the WGS84 projection for later use on GIS software.

-------------------------------------------------

-------------------- Filter ---------------------

-------------------------------------------------

Subset only the data if the coordinates are within our bounds (City of Vancouver)

#or if it is not a NA data_filter = subset(data, (lat <= data-preserve-html-node="true" 49.313162) & (lat >= 49.199554) & (lon <= data-preserve-html-node="true" -123.019028) & (lon >= -123.271371) & is.na(lon) == FALSE )

plot the data

plot(data_filter$lon, data_filter$lat)

write out the filtered data

ofile_filtered = 'C:/Users/Audrey/Documents/2015-2016/GEOG472/crime_2007/filtered_crime2007.csv' write.csv(data_filter, ofile_filtered)

--- Convert Data to Shapefile ---

store coordinates in dataframe

coords_crime2007 = data.frame(data_filter$lon_offset, data_filter$lat_offset)

create spatialPointsDataFrame

data_shp = SpatialPointsDataFrame(coords = coords_crime2007, data = data_filter)

set the projection to wgs84

projection_wgs84 = CRS("+proj=longlat +datum=WGS84") proj4string(data_shp) = projection_wgs84

set an output folder for our geojson files:

gfolder= 'C://Users/Audrey/Documents/2015-2016/GEOG472/crime_2007/new_crime_2007_map/'

Join the folderpath to each file name:

opointsshp = paste(gfolder,"1018", "crime","2007",".shp", sep="") print(opoints_shp) # print this out to see where your file will go & what it will be called

opointsgeojson = paste(gfolder,"1018", "crime_","2007",".geojson", sep="") print(opoints_geojson) # print this out to see where your file will go & what it will be called

write the file to a shp

writeOGR(check_exists=FALSE, data_shp, opoints_shp, layer="data_shp", driver="ESRI Shapefile")

write file to geojson

writeOGR(data_shp, opoints_geojson, layer="odata_shp", driver="GeoJSON", check_exists = FALSE) To see the crime density across the city, I will create a hexagonal grid that I am borrowing from Joey. I will read in the hex grid, transform the projection to wgs84, and use the poly.counts function to count the number of crime events that fall into each grid cell. Once this function is done, I will create a data frame of the counts and set it to the hexgrid data. All empty grid cells will be removed. A shapefile of this dataframe will also be created.

--- aggregate to a grid ---

ref: http://www.inside-r.org/packages/cran/GISTools/docs/poly.counts

set the file name - combine the shpfolder with the name of the grid

gridfolder ='C://Users/Audrey/Documents/2015-2016/GEOG472/crime_2007/new_crime_map_2007/'

#grid_shp = paste(gridfolder,'hgrid_100m.shp',sep="")

#grid_fn ref: https://raw.githubusercontent.com/joeyklee/aloha-r/master/data/calls_2014/geo/hgrid_250m.geojson' grid_fn = 'C://Users/Audrey/Documents/2015-2016/GEOG472/crime_2007/new_crime_map_2007/hgrid_250m.geojson'

define the utm 10n projection

#projection_utm10n = CRS('+proj=utm +zone=10 +ellps+GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no+defs')

#read in the hex grid setting the projection to utm10n hexgrid = readOGR(grid_fn,'OGRGeoJSON')

transform the projection to wgs84 to match the point file and store it to a new variable (see variable: projection_wgs84)

hexgrid_wgs84 = spTransform(hexgrid, projection_wgs84)

Use the poly.counts() function to count the number of occurences of crime per grid cell

grid_cnt = poly.counts(data_shp, hexgrid_wgs84)

create a data frame of the counts

grid_total_counts = data.frame(grid_cnt)

set grid_total_counts dataframe to the hexgrid data

hexgrid_wgs84$data = grid_total_counts$grid_cnt

remove all the grids without any crime points

hexgrid_wgs84 = subset(hexgrid_wgs84, grid_cnt >0)

define the output names:

ohex_shp = paste(gridfolder,"hexgrid250m","2007","_cnts",".shp") print(ohex_shp) ohex_geojson = paste(gridfolder,"hexgrid250m","2007","_cnts",".geojson") print(ohex_geojson)

write the file to a shp

writeOGR(hexgrid_wgs84, ohex_shp, layer="hexgrid_wgs84", driver="ESRI Shapefile", check_exists = FALSE)

write file to geojson

writeOGR(hexgrid_wgs84, ohex_geojson, layer="hexgrid_wgs84", driver="GeoJSON", check_exists = FALSE) Now for building the map! This will be a basic map that will undergo cycles of refinement. First I need to install and load the leaflet library. Then I need to initiate the leaflet and add openstreetmap tiles. An important step here is to subset my data to refer to when I set a marker color for each class. Classes are defined by classification ID number, which I defined earlier in the Mining stage. Points will be added to the map during this stage, as well as a legend and additional maptiles to compare and select a suitable maptile style for the final map.

#---Building Map V1.---#

install the leaflet library

devtools::install_github("rstudio/leaflet")

add the leaflet library to your script

library(leaflet)

subset out your data_filter:

df_theft_of_auto = subset(data_filter, cid == 1) df_theft_from_auto = subset(data_filter, cid == 2) df_commercial_be = subset(data_filter, cid == 3) df_mischief = subset(data_filter , cid == 4)

initiate leaflet

m = leaflet()

add openstreetmap tiles (default)

m = addTiles(m)

set colors for each class in order, e.g. red if cid = 1

colorFactors = colorFactor(c('red', 'orange', 'purple', 'blue'), domain=data_filter$cid)

#add points to the map m = addCircleMarkers( m, lng=df_theft_of_auto$lon_offset, # we feed the longitude coordinates lat=df_theft_of_auto$lat_offset, popup=df_theft_of_auto$TYPE, radius=2, stroke = FALSE, fillOpacity = 0.75, color = colorFactors(df_theft_of_auto$cid), group = "1 - theft of auto" )

m = addCircleMarkers(m, lng=df_theft_from_auto$lon_offset, lat=df_theft_from_auto$lat_offset, popup=df_theft_from_auto$TYPE, radius=2, stroke = FALSE, fillOpacity = 0.75, color = colorFactors(df_theft_from_auto$cid), group = "2 - theft from auto") m = addCircleMarkers(m, lng=df_commercial_be$lon_offset, lat=df_commercial_be$lat_offset, popup=df_commercial_be$TYPE, radius=2, stroke = FALSE, fillOpacity = 0.75, color = colorFactors(df_commercial_be$cid), group = "3 - commercial breaking & entering") m = addCircleMarkers(m, lng=df_mischief$lon_offset, lat=df_mischief$lat_offset, popup=df_mischief$TYPE, radius=2, stroke = FALSE, fillOpacity = 0.75, color = colorFactors(df_mischief$cid), group = "4 - mischief")

add some additional map tiles, toner and toner lite

m = addTiles(m, group = "OSM (default)") m = addProviderTiles(m,"Stamen.Toner", group = "Toner") m = addProviderTiles(m, "Stamen.TonerLite", group = "Toner Lite")

add layers control

m = addLayersControl(m, baseGroups = c("Toner Lite","Toner"), overlayGroups = c('1 - theft of auto','2 - theft from auto','3 - commercial breaking and entering','4 - mischief') )

make the map

m Based on my review of the previous map, I have decided to go with the Toner Lite style for my final map. The Toner Lite maptiles is in monochrome, so bodies of wate are shown as gray and land is shown as white. This scheme is ideal for the basemap because I want to draw the user’s eyes to the points and polygons of interest, which will remain in bold color.

I also would like to tweak the legend so the categories of crimes are not layered upon each other and the user may flip between each category one at a time.

#---Optomize Map V1.---#

Now we're going to put those variables into a list so we can loop through them:

data_filterlist = list(df_theft_of_auto = subset(data_filter, cid == 1), df_theft_from_auto = subset(data_filter, cid == 2), df_commercial_be = subset(data_filter, cid == 3), df_mischief = subset(data_filter, cid == 4) )

Remember we also had these groups associated with each variable? Let's put them in a list too:

layerlist = c("1 - theft of auto", "2 - theft from auto", "3 - commercial breaking & entering","4 - mischief")

We can keep that same color variable:

colorFactors = colorFactor(c('red', 'orange', 'purple', 'blue'), domain=data_filter$cid)

Now we have our loop - each time through the loop, it is adding our markers to the map object:

for (i in 1:length(data_filterlist)){ m = addCircleMarkers(m, lng=data_filterlist[[i]]$lon_offset, lat=data_filterlist[[i]]$lat_offset, popup=data_filterlist[[i]]$TYPE, radius=2, stroke = FALSE, fillOpacity = 0.75, color = colorFactors(data_filterlist[[i]]$cid), group = layerlist[i] ) }

flip “overlayGroups” and “baseGroups” so only 1 category of crimes are shown

m = addTiles(m, "Stamen.TonerLite", group = "Toner Lite") m = addLayersControl(m, overlayGroups = c("Toner Lite"), baseGroups = layerlist ) m Now that I’ve seen what the markers look like on the map, it’s time to see what the polygons look like on the map. I need to read in the geojson file with the hexagonal grid set with crime counts. Because I am working with continuous data of various values across space (delineated as hexagonal cells), a legend indicating the range of values and their associated colors needs to be included on the map. A continuous palette is suitable for this type of data representation.

#---Polygon map--#

#Define the filepath hex_2007_fn='C://Users/Audrey/Documents/2015-2016/GEOG472/crime_2007/new_crime_map_2007/hexgrid250m 2007 _cnts .geojson'

read in the geojson

hex_2007 = readOGR('2015-2016/GEOG472/crime_2007/new_crime_map_2007/ hexgrid250m 2007 _cnts .geojson', "OGRGeoJSON")

start up a new map object with Toner Lite style

m = leaflet() m = addProviderTiles(m, "Stamen.TonerLite", group = "Toner Lite")

Create a continuous palette function

pal = colorNumeric( palette = "Greens", domain = hex_2007$data )

add the polygons to the map

m = addPolygons(m, data = hex_2007, stroke = FALSE, smoothFactor = 0.2, fillOpacity = 1, color = ~pal(hex_2007$data), popup= paste("Number of reported crimes: ",hex_2007$data, sep="") )

add a legend showing crime density

m = addLegend(m, "bottomright", pal = pal, values = hex_2007$data, title = "2007 crime density", labFormat = labelFormat(prefix = " "), opacity = 0.75 )

m Both the point and polygon maps can be combined to create a layered interactive geospatial data visualization. I merely copy and paste the code for the point and polygon maps in this section. However, a few things are changed to include the polygon map as a toggle-able layer for the user’s manipulation when viewing the map. A group is created called “crime density” when creating the polygon layer, which is later slotted in the layers control function along with the Toner Lite maptile layer.

#---Synthesis of Polygon and Point Maps---#

initiate leaflet map layer

m = leaflet() m = addProviderTiles(m, "Stamen.TonerLite", group = "Toner Lite")

Add the polygon layer:

read in & store the hexgrid data

hex_2007 = readOGR('2015-2016/GEOG472/crime_2007/new_crime_map_2007/ hexgrid250m 2007 _cnts .geojson', "OGRGeoJSON")

Create a continuous palette function

pal = colorNumeric( palette = "Reds", domain = hex_2007$data )

First add polygon data to the map so this layer is behind the points layer

m = addPolygons(m, data = hex_2007, stroke = FALSE, smoothFactor = 0.2, fillOpacity = 1, color = ~pal(hex_2007$data), popup= paste("Number of reported crimes: ",hex_2007$data, sep=""), group= "crime density" )

Remember to add a legend!

m = addLegend(m, "bottomright", pal = pal, values = hex_2007$data, title = "2007 crime density", labFormat = labelFormat(prefix = " "), opacity = 0.75 )

Now add the points of interest

Remember storing the data filter list for looping between each category:

data_filterlist = list(df_theft_of_auto = subset(data_filter, cid == 1), df_theft_from_auto = subset(data_filter, cid == 2), df_commercial_be = subset(data_filter, cid == 3), df_mischief = subset(data_filter, cid == 4) )

Make a list with crime categories for layers:

layerlist = c("1 - theft of auto", "2 - theft from auto", "3 - commercial breaking & entering","4 - mischief")

We can keep that same color variable:

colorFactors = colorFactor(c('Purple', 'orange', 'blue', 'green'), domain=data_filter$cid)

Add our loop function - each time through the loop, it is adding our markers to the map object

for (i in 1:length(data_filterlist)){ m = addCircleMarkers(m, lng=data_filterlist[[i]]$lon_offset, lat=data_filterlist[[i]]$lat_offset, popup=data_filterlist[[i]]$TYPE, radius=2, stroke = FALSE, fillOpacity = 0.75, color = colorFactors(data_filterlist[[i]]$cid), group = layerlist[i] ) }

flip “overlayGroups” and “baseGroups” so only 1 category of crimes are shown

m = addTiles(m, "Stamen.TonerLite", group = "Toner Lite") m = addLayersControl(m, overlayGroups = c("crime density","Toner Lite"), baseGroups = layerlist ) m There we go! If you have any questions or comments, send them my way. It’s been great to get feedback on my past projects, and it only fuels my excitement to continue on this journey in code.

Audrey Schalhoub