Many governments, businesses and research organisations are making data available for use by citizen data scientists. Examples include Kaggle Open Datasets (www.kaggle.com/datasets), data.gov in the US, and data.gov.uk in the UK.
This creates an opportunity for anyone with the will and curiosity to draw useful insights from real-world data while developing their data science skills.
The Northern Ireland Government has committed to making anonymised public sector data freely available through its policy of ‘open by default’. Datasets range from the health of trees to public transport and healthcare, and include geospatial data (www.opendatani.gov.uk).
In this post we will step through the available data on disease prevalence in Northern Ireland, with the aim of investigating differences across health trusts.
For background, the Northern Ireland healthcare system (HSCNI) is made up of five trusts, each of which is responsible for delivering healthcare within a particular geography. Unlike the NHS in England, trusts are responsible for delivering hospital services (emergency care, inpatients and outpatients), community care (e.g. district nursing, allied health professions) and social care for both children and adults.
GP practices are typically the first point of contact for non-emergency care within Northern Ireland, treating patients with a wide range of conditions. GPs collect data on the number of patients with specific conditions such as asthma, diabetes and dementia, which is available for analysis through opendatani.gov.uk.
In this post we will use R to combine the disease prevalence data with geospatial data to create a map of disease prevalence by trust. In the next post I’ll make this interactive using Shiny.
Install and Load Packages
packages <- c("dplyr", "psych", "stringr", "tmap")
lapply(packages, require, character.only = TRUE)
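The call above loads the packages but does not install any that are missing. A minimal sketch of the install step, using only base R, might look like the following. The helper name missing_packages is made up for illustration and is not part of any package.

```r
# Packages used in this post
packages <- c("dplyr", "psych", "stringr", "tmap")

# Hypothetical helper: names in pkgs that are not in the installed library
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Install only what is missing (left commented out so nothing is
# downloaded unless you choose to run it)
# install.packages(missing_packages(packages))
```

Checking first avoids reinstalling packages that are already present each time the script runs.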
dplyr is a helpful package that makes manipulating data much more straightforward and intuitive than base R. It is also often faster than the equivalent base R commands.
To display the geographic data we will use the tmap package, which allows us to work with a range of geographic data in R rather than in a separate geographic information system such as the open source QGIS (www.qgis.org) or the commercial ArcGIS (www.arcgis.com).
Get the data
Two datasets are needed. Download them from the links below and keep them in a folder of your choice:
- Disease Prevalence: https://www.opendatani.gov.uk/dataset/disease-prevalence
- Trust Boundary Shapefile: https://www.opendatani.gov.uk/dataset/department-of-health-trust-boundaries
The open data isn’t in a format that is easily digestible by R, so for simplicity we will copy the relevant data and save it as a csv file for use in the visualisation. Open the disease prevalence spreadsheet and select the Registers_&_Prevalence_2016 sheet from the tabs at the bottom of the workbook. Copy the data on prevalence per 1000 patients (using full list) into a new workbook and save it as a csv file (DiseasePrevelance2.csv in the example below).
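To make the copied figures concrete, here is a small sketch of how a prevalence per 1000 patients value is derived. It assumes prevalence is the number of patients on a disease register divided by the practice list size, scaled to 1000. The function name and the numbers are made up for illustration.

```r
# Prevalence per 1000 patients = 1000 * register count / list size
prevalence_per_1000 <- function(register, list_size) {
  stopifnot(all(list_size > 0))
  1000 * register / list_size
}

# Made-up example: 312 patients on a register, practice list of 6500
prevalence_per_1000(312, 6500)  # 48
```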
Set your working directory and load the data:
soashape <- read_shape("trustBoundary/DHSSPS_TrustBoundaries.shp")
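The line above loads only the shapefile, while the later steps also use a data frame called summary. A minimal sketch of loading it follows, assuming the csv sits in the working directory and has one row per trust, with a Trust column and one column per condition (such as Diabetes). The helper name load_prevalence is hypothetical.

```r
# Hypothetical helper: read the prevalence csv with Trust stored as a factor
load_prevalence <- function(path) {
  prevalence <- read.csv(path, stringsAsFactors = FALSE)
  stopifnot("Trust" %in% names(prevalence))
  prevalence$Trust <- factor(prevalence$Trust)
  prevalence
}

# Usage, with the file name chosen when saving the csv:
# summary <- load_prevalence("DiseasePrevelance2.csv")
```

Converting Trust to a factor explicitly matters because recent versions of R read text columns as character by default, and the renaming step relies on factor levels.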
Rename the factor levels in the prevalence data and shapefile:
levels(summary$Trust) <- c("B.T.", "Northern Trust", "S.E.T", "Southern Trust", "Western Trust")
levels(soashape@data$LGDNAME) <- c("B.T.", "Northern Trust", "S.E.T", "Southern Trust", "Western Trust")
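Assigning to levels() replaces labels by position, so the new names have to be listed in the same order as the existing levels. A more defensive sketch renames through a lookup table instead. The original labels used here are assumed for illustration; check levels(summary$Trust) to see the real ones.

```r
# Toy factor with assumed original trust labels
trust <- factor(c("Western", "Belfast", "Southern", "Northern", "South Eastern"))

# Old label -> new label
lookup <- c(
  "Belfast"       = "B.T.",
  "Northern"      = "Northern Trust",
  "South Eastern" = "S.E.T",
  "Southern"      = "Southern Trust",
  "Western"       = "Western Trust"
)

# Fail early if any existing level has no replacement
stopifnot(all(levels(trust) %in% names(lookup)))

# Rename each level by name rather than by position
levels(trust) <- unname(lookup[levels(trust)])
```

Because each old label is matched by name, the result does not depend on how R happens to order the levels.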
Append the prevalence data to the shapefile, using the renamed trust labels as the key so that the shorter names display cleanly on the map:
completeSummary <- append_data(soashape, summary, key.shp = "LGDNAME", key.data = "Trust")
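Conceptually, this step is a keyed join: each shape picks up the prevalence row whose Trust value equals its LGDNAME. The sketch below illustrates the idea with plain data frames in base R. It is not how tmap implements the join, and the values are made up.

```r
# Toy stand-ins for the shapefile attribute table and the prevalence data
shape_attrs <- data.frame(LGDNAME = c("B.T.", "Northern Trust", "S.E.T"))
prevalence <- data.frame(
  Trust = c("S.E.T", "B.T.", "Northern Trust"),
  Diabetes = c(45, 40, 43)  # made-up values
)

# Find the prevalence row for each shape, in shape order
idx <- match(shape_attrs$LGDNAME, prevalence$Trust)

# Every shape must have a match, otherwise the labels disagree
stopifnot(!anyNA(idx))

joined <- shape_attrs
joined$Diabetes <- prevalence$Diabetes[idx]
```

A failed match here usually means the labels differ between the two sources, which is why the factor levels are renamed to identical values beforehand.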
Plot a Map
qtm(completeSummary, fill = "Diabetes", text = "Trust", remove.overlap = TRUE, title = "Diabetes Prevalence")
tmap allows for two plotting methods: qtm, which stands for quick thematic map, and an extended version where a map is built up from layers such as tm_shape and tm_polygons, combined with +.
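As a rough illustration of the extended form, the sketch below builds a tiny two-polygon dataset and maps it layer by layer. The sf package is used only to create the toy geometry and is an assumption on my part, as it is not among the packages loaded earlier. The shapes and values are made up.

```r
library(sf)
library(tmap)

# Build a unit square whose lower-left corner is at (x0, 0)
square <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# Toy spatial data: two adjacent squares with made-up prevalence values
toy <- st_sf(
  Trust = c("B.T.", "S.E.T"),
  Diabetes = c(40, 45),
  geometry = st_sfc(square(0), square(1))
)

# Layered map: the shape, then filled polygons, then text labels
m <- tm_shape(toy) +
  tm_polygons("Diabetes") +
  tm_text("Trust")
```

Printing m draws the map. With the data from this post, the same layers could be applied to completeSummary in place of toy.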