Preventable diseases report

RStudio data Project

NOTE: The document below includes both the written narrative and visualisations, as well as the code used to generate them. As a result, the formatting may appear disjointed at times. The primary goal is to demonstrate my ability to use RStudio to clean, summarise, and visualise data.

Abstract

This project provides a summary analysis of global vaccine-preventable disease data from the World Health Organization (WHO). The workflow encompassed data cleaning, transformation, summarisation, and visualisation techniques. The raw dataset from the WHO was generally well-organised, with the primary challenges being missing values and redundant columns. These issues were resolved within RStudio.

For data visualisation, the data was aggregated and summarised across various variables, enabling the creation of line graphs, bar charts, and world map diagrams. Key findings from the analysis revealed that measles had the highest overall case numbers but was also the most impacted by vaccination efforts. In contrast, mumps and rubella had fewer overall cases; however, no vaccination data for mumps was available from the WHO.

This project highlights proficiency in R for data cleaning, summarisation, and visualisation. Future work on this project will involve applying statistical analyses within a medical and biological context.

Initial Data assessment

Source: World Health Organization

Outline:

The World Health Organization (WHO) provides multiple datasets categorised into disease-reported cases and vaccination coverage, all in CSV format. The measles, mumps, and rubella disease case datasets include key columns such as location, period, and value, enabling straightforward analysis of cases over time. Similarly, the vaccination coverage datasets for measles (MCV1/MCV2) and rubella (RCV1) include essential columns such as year, country code, doses, and coverage, facilitating effective analysis of vaccination trends over time.

Notably, no vaccination data for mumps was publicly available from the WHO. Despite this limitation, these datasets are highly valuable for identifying trends in disease spread and vaccination coverage over time and across different locations.

Completeness:

Apart from the absence of mumps vaccination data, the datasets are generally complete. As anticipated, the collection of disease and vaccination data varies across countries, resulting in gaps in the datasets based on location and year. These data gaps can be addressed by examining adjacent years, calculating averages, or analysing data from neighbouring or similar countries to estimate the missing values. Despite these gaps, the datasets still provide a substantial volume of information for effective analysis of disease distribution and vaccination coverage across countries and over time.
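As a rough illustration of the adjacent-years approach, the sketch below linearly interpolates missing yearly values within each country. The toy table, its values, and the choice of `zoo::na.approx()` are assumptions made for demonstration; the project code removes rows with missing values rather than estimating them.

```r
library(dplyr)
library(zoo)

# Hypothetical toy data: yearly reported cases with gaps
cases <- tibble(
  Location = rep(c("Country A", "Country B"), each = 5),
  Period   = rep(2000:2004, times = 2),
  Value    = c(10, NA, 30, NA, 50,
               5, 5, NA, 11, 14)
)

# Fill interior gaps by linear interpolation between adjacent years,
# separately for each country; na.rm = FALSE keeps the vector length
# unchanged (leading or trailing NAs would stay NA)
cases_filled <- cases %>%
  group_by(Location) %>%
  arrange(Period, .by_group = TRUE) %>%
  mutate(Value_filled = na.approx(Value, x = Period, na.rm = FALSE)) %>%
  ungroup()

print(cases_filled)
```

Interpolated values are estimates, so it may be worth keeping the original `Value` column alongside `Value_filled`, as done here, so that reported and estimated figures stay distinguishable.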

Accuracy:

The WHO is generally regarded as a reliable source of global medical and health data. However, it is important to consider potential variations or inaccuracies in how each country measures vaccination coverage and disease cases.

Key variables:

  • Disease reported cases
  • Country
  • Vaccination coverage
  • 3-letter country codes

Potential relationships:

  • Country
  • 3-letter country codes
  • Year
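These shared keys are what would allow case counts and vaccination coverage to be combined into one table. A minimal sketch, using made-up toy data and assuming the WHO column names `SpatialDimValueCode` (3-letter country code) and `Period` (year) that appear in the project code; the `Cases` and `Coverage` column names are hypothetical:

```r
library(dplyr)

# Hypothetical toy extracts; the values are invented for illustration
cases <- tibble(
  SpatialDimValueCode = c("AAA", "AAA", "BBB"),
  Period              = c(2022, 2023, 2023),
  Cases               = c(120, 80, 15)
)
coverage <- tibble(
  SpatialDimValueCode = c("AAA", "AAA", "BBB"),
  Period              = c(2022, 2023, 2022),
  Coverage            = c(85, 88, 93)
)

# Keep every case row and attach coverage where country code and year match;
# case rows without a matching coverage record get NA
cases_with_coverage <- cases %>%
  left_join(coverage, by = c("SpatialDimValueCode", "Period"))

print(cases_with_coverage)
```

Joining on the country code rather than the country name avoids mismatches caused by spelling or naming differences between datasets.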

Issues and Limitations: See Completeness and Accuracy

Data workflow

  1. Microsoft Excel: Used to perform an initial review of the data, identifying data types, column structures, and potential cleaning and analysis requirements for further processing in RStudio.
  2. RStudio: The workflow began with installing and loading the required packages, such as tidyverse, ggplot2, rnaturalearthdata, and tinytex, which were needed for analysis and presentation. The case and vaccination data were imported into RStudio, and the unused columns were dropped, with the trimmed data stored in new variables. Data types were checked and converted to the appropriate formats. Rows with null values were removed, with care taken not to discard usable data. Long indicator labels were replaced with shorter, easier-to-identify ones (for example, "Measles - number of reported cases" became "Measles"). Data summarisation followed, including calculations for total cases per year, the top and bottom 10 values, vaccinations per year, and averages. These summaries were visualised using line graphs over time, bar charts, and world maps with gradient shading to represent cases and vaccination coverage. Finally, all the code and visualisations were output into an R Markdown document, accompanied by a written narrative.
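The cleaning and summarising steps described above can be condensed into a small reusable function. The sketch below is an illustration on made-up toy data, not the project's own code: `clean_who_cases()` is an invented helper, and the column names are assumed to follow the WHO export layout used in the script further down.

```r
library(dplyr)

# Hypothetical helper: keep the analysis columns, fix types, drop unusable rows
clean_who_cases <- function(raw, label) {
  raw %>%
    select(any_of(c("SpatialDimValueCode", "Location", "Period", "Value"))) %>%
    mutate(
      Period    = as.integer(Period),
      Value     = as.numeric(Value),
      Indicator = label              # short, easy-to-read disease label
    ) %>%
    dplyr::filter(!is.na(Value))
}

# Made-up raw table standing in for one imported CSV
raw_measles <- tibble(
  SpatialDimValueCode = c("AAA", "BBB", "AAA", "BBB"),
  Location            = c("Country A", "Country B", "Country A", "Country B"),
  Period              = c(2022, 2022, 2023, 2023),
  Value               = c("100", NA, "40", "7"),
  Language            = "EN"         # example of a redundant column
)

measles_clean <- clean_who_cases(raw_measles, "Measles")

# Total cases per year
cases_per_year <- measles_clean %>%
  group_by(Period) %>%
  summarise(total_cases = sum(Value), .groups = "drop")

# Countries with the most cases in a given year
top_2023 <- measles_clean %>%
  dplyr::filter(Period == 2023) %>%
  slice_max(Value, n = 10)

print(cases_per_year)
print(top_2023)
```

Selecting the columns to keep, rather than listing every column to drop, keeps the call short and means the same helper could be applied to each of the disease datasets.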

Analysis conclusions

Please see the PDF report document


Files / Code

Extra code:

#Installing required packages
#install.packages("conflicted")
#install.packages("tidyverse")
#install.packages("readr")
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("ggthemes")
#install.packages("forecast")
#install.packages("zoo")
#tinytex::install_tinytex()
#install.packages("sf") #needed for ne_countries(returnclass = "sf") and geom_sf()
#install.packages("rnaturalearth")
#install.packages("rnaturalearthdata")

#Loading required packages
library(conflicted)
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(lubridate)
library(forecast)
library(zoo)
library(rnaturalearth)
library(rnaturalearthdata)

#START Data loading

#Load MCV1 Vaccination data (CSV File)
MCV1_data <- read_csv("MCV1 Immunisation coverage amoung 1-year-olds _.csv")
#Quick view of the imported dataset
View(MCV1_data)
#Str gives a data overview of each column's data type
str(MCV1_data)
#Basic statistics for each variable (e.g. length, min, max, mean, etc)
summary(MCV1_data)
#Shows the first few rows of data
head(MCV1_data)
#Shows the last few rows of data
tail(MCV1_data)

#Load MCV2 Vaccination data (CSV File)
MCV2_data <- read_csv("MCV2 Immunisation coverage amoung 1-year-olds _.csv")
#Quick view of the imported dataset
View(MCV2_data)
#Str gives a data overview of each column's data type
str(MCV2_data)
#Basic statistics for each variable (e.g. length, min, max, mean, etc)
summary(MCV2_data)
#Shows the first few rows of data
head(MCV2_data)
#Shows the last few rows of data
tail(MCV2_data)

#Load Measles reported cases (CSV File)
Measles_cases <- read_csv("Measles number reported cases.csv")
#Quick view of the imported dataset
View(Measles_cases)
#Str gives a data overview of each column's data type
str(Measles_cases)
#Basic statistics for each variable (e.g. length, min, max, mean, etc)
summary(Measles_cases)
#Shows the first few rows of data
head(Measles_cases)
#Shows the last few rows of data
tail(Measles_cases)

#Load Mumps reported cases (CSV File)
Mumps_cases <- read_csv("Mumps number reported cases.csv")
#Quick view of the imported dataset
View(Mumps_cases)
#Str gives a data overview of each column's data type
str(Mumps_cases)
#Basic statistics for each variable (e.g. length, min, max, mean, etc)
summary(Mumps_cases)
#Shows the first few rows of data
head(Mumps_cases)
#Shows the last few rows of data
tail(Mumps_cases)

#Load Rubella reported cases (CSV File)
Rubella_cases <- read_csv("Rubella number reported cases.csv")
#Quick view of the imported dataset
View(Rubella_cases)
#Str gives a data overview of each column's data type
str(Rubella_cases)
#Basic statistics for each variable (e.g. length, min, max, mean, etc)
summary(Rubella_cases)
#Shows the first few rows of data
head(Rubella_cases)
#Shows the last few rows of data
tail(Rubella_cases)

#END Data loading

###################################################################

#START Data cleaning: Remove unused columns

#MCV1: Remove empty or unused columns
MCV1_data_trimmed <- MCV1_data %>% select(-IndicatorCode, -ValueType, 
                                          -`Location type`, -`Period type`, -IsLatestYear, 
                                          -`Dim1 type`, -Dim1, -Dim1ValueCode, -`Dim2 type`, 
                                          -Dim2, -Dim2ValueCode, -`Dim3 type`, -Dim3, -Dim3ValueCode, 
                                          -DataSourceDimValueCode, -DataSource, -FactValueNumericPrefix,
                                          -FactValueUoM, -FactValueNumericLowPrefix,
                                          -FactValueNumericLow, -FactValueNumericHighPrefix,
                                          -FactValueNumericHigh, -FactValueTranslationID, -FactComments, -Language, 
                                          -DateModified, -FactValueNumeric)
#MCV1: View the new trimmed data
#view(MCV1_data_trimmed)


#MCV2: Remove empty or unused columns
MCV2_data_trimmed <- MCV2_data %>% select(-IndicatorCode, -ValueType, 
                                          -`Location type`, -`Period type`, -IsLatestYear, 
                                          -`Dim1 type`, -Dim1, -Dim1ValueCode, -`Dim2 type`, 
                                          -Dim2, -Dim2ValueCode, -`Dim3 type`, -Dim3, -Dim3ValueCode, 
                                          -DataSourceDimValueCode, -DataSource, -FactValueNumericPrefix,
                                          -FactValueUoM, -FactValueNumericLowPrefix,
                                          -FactValueNumericLow, -FactValueNumericHighPrefix,
                                          -FactValueNumericHigh, -FactValueTranslationID, -FactComments, -Language, 
                                          -DateModified, -FactValueNumeric)
#MCV2: View the new trimmed data
#view(MCV2_data_trimmed)


#Measles: Remove empty or unused columns
Measles_cases_trimmed <- Measles_cases %>% select(-IndicatorCode, -ValueType, 
                                          -`Location type`, -`Period type`, -IsLatestYear, 
                                          -`Dim1 type`, -Dim1, -Dim1ValueCode, -`Dim2 type`, 
                                          -Dim2, -Dim2ValueCode, -`Dim3 type`, -Dim3, -Dim3ValueCode, 
                                          -DataSourceDimValueCode, -DataSource, -FactValueNumericPrefix,
                                          -FactValueUoM, -FactValueNumericLowPrefix,
                                          -FactValueNumericLow, -FactValueNumericHighPrefix,
                                          -FactValueNumericHigh, -FactValueTranslationID, -FactComments, -Language, 
                                          -DateModified, -FactValueNumeric)
#Measles: Remove rows that contain missing values
Measles_cases_trimmed <- na.omit(Measles_cases_trimmed)
#Measles: View the new trimmed data
#view(Measles_cases_trimmed)


#Mumps: Remove empty or unused columns
Mumps_cases_trimmed <- Mumps_cases %>% select(-IndicatorCode, -ValueType, 
                                                  -`Location type`, -`Period type`, -IsLatestYear, 
                                                  -`Dim1 type`, -Dim1, -Dim1ValueCode, -`Dim2 type`, 
                                                  -Dim2, -Dim2ValueCode, -`Dim3 type`, -Dim3, -Dim3ValueCode, 
                                                  -DataSourceDimValueCode, -DataSource, -FactValueNumericPrefix,
                                                  -FactValueUoM, -FactValueNumericLowPrefix,
                                                  -FactValueNumericLow, -FactValueNumericHighPrefix,
                                                  -FactValueNumericHigh, -FactValueTranslationID, -FactComments, -Language, 
                                                  -DateModified, -FactValueNumeric)
#Mumps: View the new trimmed data
#view(Mumps_cases_trimmed)


#Rubella: Remove empty or unused columns
Rubella_cases_trimmed <- Rubella_cases %>% select(-IndicatorCode, -ValueType, 
                                              -`Location type`, -`Period type`, -IsLatestYear, 
                                              -`Dim1 type`, -Dim1, -Dim1ValueCode, -`Dim2 type`, 
                                              -Dim2, -Dim2ValueCode, -`Dim3 type`, -Dim3, -Dim3ValueCode, 
                                              -DataSourceDimValueCode, -DataSource, -FactValueNumericPrefix,
                                              -FactValueUoM, -FactValueNumericLowPrefix,
                                              -FactValueNumericLow, -FactValueNumericHighPrefix,
                                              -FactValueNumericHigh, -FactValueTranslationID, -FactComments, -Language, 
                                              -DateModified, -FactValueNumeric)
#Rubella: View the new trimmed data
#view(Rubella_cases_trimmed)



#END Data cleaning: Remove unused columns


#############################################################################




#START Data cleaning: Missing/duplicate/fix data types/reformatting

#MCV1 Data
view(MCV1_data_trimmed)
str(MCV1_data_trimmed)
summary(MCV1_data_trimmed)

#Check if there are any missing values in the dataset (per column)
colSums(is.na(MCV1_data_trimmed))
#Shows the unique values in a specific column
unique(MCV1_data_trimmed$Location)

#MCV2 Data
view(MCV2_data_trimmed)
str(MCV2_data_trimmed)
summary(MCV2_data_trimmed)

#Check if there are any missing values in the dataset (per column)
colSums(is.na(MCV2_data_trimmed))
#Shows the unique values in a specific column
unique(MCV2_data_trimmed$Location)

#Measles Data
view(Measles_cases_trimmed)
str(Measles_cases_trimmed)
summary(Measles_cases_trimmed)

Measles_cases_trimmed$Value <- as.numeric(Measles_cases_trimmed$Value)

#Check if there are any missing values in the dataset (per column)
colSums(is.na(Measles_cases_trimmed))
#Shows the unique values in a specific column
unique(Measles_cases_trimmed$Location)

#Mumps Data
view(Mumps_cases_trimmed)
str(Mumps_cases_trimmed)
summary(Mumps_cases_trimmed)

Mumps_cases_trimmed$Value <- as.numeric(Mumps_cases_trimmed$Value)

#Check if there are any missing values in the dataset (per column)
colSums(is.na(Mumps_cases_trimmed))
#Shows the unique values in a specific column
unique(Mumps_cases_trimmed$Location)

#Rubella Data 
view(Rubella_cases_trimmed)
str(Rubella_cases_trimmed)
summary(Rubella_cases_trimmed)

Rubella_cases_trimmed$Value <- as.numeric(Rubella_cases_trimmed$Value)

#Check if there are any missing values in the dataset (per column)
colSums(is.na(Rubella_cases_trimmed))
#Shows the unique values in a specific column
unique(Rubella_cases_trimmed$Location)


Measles_cases_trimmed$Indicator <- gsub("Measles - number of reported cases", "Measles", Measles_cases_trimmed$Indicator)
Mumps_cases_trimmed$Indicator <- gsub("Mumps - number of reported cases", "Mumps", Mumps_cases_trimmed$Indicator)
Rubella_cases_trimmed$Indicator <- gsub("Rubella - number of reported cases", "Rubella", Rubella_cases_trimmed$Indicator)


#END Data cleaning: Missing/duplicate/fix data types/reformatting

###########################################################################################


#START Data visualisation 

##### Disease Cases

#Summarise number of Measles cases per year
#(uses the *_trimmed objects created during cleaning; no *_cleaned objects exist)
Measles_cases_summary <- Measles_cases_trimmed %>%
  group_by(Period) %>%
  summarise(total_cases = sum(Value, na.rm = TRUE))
Measles_cases_summary <- as.data.frame(Measles_cases_summary)

#Summarise number of Mumps cases per year
Mumps_cases_summary <- Mumps_cases_trimmed %>%
  group_by(Period) %>%
  summarise(total_cases = sum(Value, na.rm = TRUE))

#Summarise number of Rubella cases per year
Rubella_cases_summary <- Rubella_cases_trimmed %>%
  group_by(Period) %>%
  summarise(total_cases = sum(Value, na.rm = TRUE))

#Measles/Mumps/Rubella cases per year
ggplot() +
  #linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_line(data = Measles_cases_summary, aes(x=Period, y=total_cases, color = "Measles"), linewidth = 1.5) +
  geom_point(data = Measles_cases_summary, aes(x=Period, y=total_cases, color = "Measles"), size = 3) +
  geom_line(data = Mumps_cases_summary, aes(x=Period, y=total_cases, color = "Mumps"), linewidth = 1.5) +
  geom_point(data = Mumps_cases_summary, aes(x=Period, y=total_cases, color = "Mumps"), size = 3) +
  geom_line(data = Rubella_cases_summary, aes(x=Period, y=total_cases, color = "Rubella"), linewidth = 1.5) +
  geom_point(data = Rubella_cases_summary, aes(x=Period, y=total_cases, color = "Rubella"), size = 3) +
  labs(title = "Total Cases",
       x= "Year",
       y= "Cases",
       color = "Disease") +
  theme_stata()



#Cases in 2023
Measles_cases_2023 <- Measles_cases_trimmed %>%
  dplyr::filter(Period == 2023)

Mumps_cases_2023 <- Mumps_cases_trimmed %>%
  dplyr::filter(Period == 2023)

Rubella_cases_2023 <- Rubella_cases_trimmed %>%
  dplyr::filter(Period == 2023)

top_10_measles_cases_2023 <- Measles_cases_2023 %>%
  arrange(desc(Value)) %>%
  slice_head(n = 10)

top_10_mumps_cases_2023 <- Mumps_cases_2023 %>%
  arrange(desc(Value)) %>%
  slice_head(n = 10)

top_10_rubella_cases_2023 <- Rubella_cases_2023 %>%
  arrange(desc(Value)) %>%
  slice_head(n = 10)

# Reorder the locations within each dataset based on the Value
top_10_measles_cases_2023$Location <- with(top_10_measles_cases_2023, reorder(Location, Value))
top_10_mumps_cases_2023$Location <- with(top_10_mumps_cases_2023, reorder(Location, Value))
top_10_rubella_cases_2023$Location <- with(top_10_rubella_cases_2023, reorder(Location, Value))

# MMR highest number of cases in 2023
ggplot() + 
  # Measles
  geom_bar(
    data = top_10_measles_cases_2023,
    aes(x = Value, y = Location, fill = Indicator),
    stat = "identity",
    position = position_dodge(width = 0.8),
    width = 0.6
  ) + 
  # Mumps
  geom_bar(
    data = top_10_mumps_cases_2023,
    aes(x = Value, y = Location, fill = Indicator),
    stat = "identity",
    position = position_dodge(width = 0.8),
    width = 0.6
  ) + 
  # Rubella
  geom_bar(
    data = top_10_rubella_cases_2023,
    aes(x = Value, y = Location, fill = Indicator),
    stat = "identity",
    position = position_dodge(width = 0.8),
    width = 0.6
  ) + 
  facet_wrap(~ Indicator, scales = "free_y") +  # One panel per disease, each with its own country axis
  labs(
    title = "Countries with highest cases of MMR diseases (2023)",
    x = "Cases",
    y = "Country",
    fill = "Indicator"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#World data map 2023
world <- ne_countries(scale = "medium", returnclass = "sf")

Measles_cases_2023_classify <- Measles_cases_2023 %>%
  mutate(Has_Cases = ifelse(Value > 0, "Measles cases", "No cases"))

world_map <- world %>%
  left_join(Measles_cases_2023_classify, by = c("adm0_a3" = "SpatialDimValueCode"))

ggplot(data = world_map) +
  geom_sf(aes(fill = Has_Cases), color = "White") +
  scale_fill_manual(values = c("Measles cases" = "red", "No cases" = "green"), na.value = "gray80") +
  labs(
    title = "Global Measles cases (2023)",
    fill = "Measles Cases"
  ) +
  theme_minimal()


#Cases in 2000

Measles_cases_2000 <- Measles_cases_trimmed %>%
  dplyr::filter(Period == 2000)

Mumps_cases_2000 <- Mumps_cases_trimmed %>%
  dplyr::filter(Period == 2000)

Rubella_cases_2000 <- Rubella_cases_trimmed %>%
  dplyr::filter(Period == 2000)

top_10_measles_cases_2000 <- Measles_cases_2000 %>%
  arrange(desc(Value)) %>%
  slice_head(n = 10)

top_10_mumps_cases_2000 <- Mumps_cases_2000 %>%
  arrange(desc(Value)) %>%
  slice_head(n = 10)

top_10_rubella_cases_2000 <- Rubella_cases_2000 %>%
  arrange(desc(Value)) %>%
  slice_head(n = 10)

# Reorder the locations within each dataset based on the Value
top_10_measles_cases_2000$Location <- with(top_10_measles_cases_2000, reorder(Location, Value))
top_10_mumps_cases_2000$Location <- with(top_10_mumps_cases_2000, reorder(Location, Value))
top_10_rubella_cases_2000$Location <- with(top_10_rubella_cases_2000, reorder(Location, Value))

# MMR highest number of cases in 2000
ggplot() + 
  # Measles
  geom_bar(
    data = top_10_measles_cases_2000,
    aes(x = Value, y = Location, fill = Indicator),
    stat = "identity",
    position = position_dodge(width = 0.8),
    width = 0.6
  ) + 
  # Mumps
  geom_bar(
    data = top_10_mumps_cases_2000,
    aes(x = Value, y = Location, fill = Indicator),
    stat = "identity",
    position = position_dodge(width = 0.8),
    width = 0.6
  ) + 
  # Rubella
  geom_bar(
    data = top_10_rubella_cases_2000,
    aes(x = Value, y = Location, fill = Indicator),
    stat = "identity",
    position = position_dodge(width = 0.8),
    width = 0.6
  ) + 
  facet_wrap(~ Indicator, scales = "free_y") +  # One panel per disease, each with its own country axis
  labs(
    title = "Countries with highest cases of MMR diseases (2000)",
    x = "Cases",
    y = "Country",
    fill = "Indicator"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#World data map 2000
world <- ne_countries(scale = "medium", returnclass = "sf")

Measles_cases_2000_classify <- Measles_cases_2000 %>%
  mutate(Has_Cases = ifelse(Value > 0, "Measles cases", "No cases"))

world_map <- world %>%
  left_join(Measles_cases_2000_classify, by = c("adm0_a3" = "SpatialDimValueCode"))

ggplot(data = world_map) +
  geom_sf(aes(fill = Has_Cases), color = "White") +
  scale_fill_manual(values = c("Measles cases" = "red", "No cases" = "green"), na.value = "gray80") +
  labs(
    title = "Global Measles cases (2000)",
    fill = "Measles Cases"
  ) +
  theme_minimal()





##### Vaccines

#Summarise mean MCV1 coverage (%) across countries per year
#(the column is named total_vaccinations but holds an average percentage)
MCV1_summary <- MCV1_data_trimmed %>%
  group_by(Period) %>%
  summarise(total_vaccinations = mean(Value, na.rm = TRUE))
view(MCV1_summary)

#Summarise mean MCV2 coverage (%) across countries per year
MCV2_summary <- MCV2_data_trimmed %>%
  group_by(Period) %>%
  summarise(total_vaccinations = mean(Value, na.rm = TRUE))
view(MCV2_summary)

#MCV1 + MCV2 mean vaccination coverage over time
ggplot() +
  geom_line(data = MCV1_summary, aes(x=Period, y=total_vaccinations, color = "MCV1"), linewidth = 1.5) +
  geom_point(data = MCV1_summary, aes(x=Period, y=total_vaccinations, color = "MCV1"), size = 3) +
  geom_line(data = MCV2_summary, aes(x=Period, y=total_vaccinations, color = "MCV2"), linewidth = 1.5) +
  geom_point(data = MCV2_summary, aes(x=Period, y=total_vaccinations, color = "MCV2"), size = 3) +
  labs(title = "MCV1 + MCV2 Vaccination over time",
       x= "Year",
       y= "Mean vaccination coverage (%)",
       color = "Vaccine") +
  #Values outside these limits are dropped from the plot
  scale_y_continuous(limits = c(75, 90)) +
  scale_x_continuous(limits = c(2000, 2023)) + 
  scale_color_manual(values = c("MCV1" = "red", "MCV2" = "orange")) +
  theme_stata()

#END Data visualisation 

##############################################################################################


#Time series analysis (not suited for annual data, need monthly/quarterly)
measles_ts <- ts(Measles_cases_summary$total_cases, start = min(Measles_cases_summary$Period), frequency = 1)
#stl() needs a seasonal series (frequency > 1) and errors on yearly totals, so it is left commented out
#measles_decomp <- stl(measles_ts, s.window = "periodic")
#plot(measles_decomp)

#Statistical analysis (Linear Regression)
measles_model <- lm(total_cases ~ Period, data = Measles_cases_summary)
summary(measles_model)

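#Statistical analysis (ANOVA)
#Note: with one total per year, factor(Period) leaves no residual degrees of freedom, so no F-test is reported
anova_model <- aov(total_cases ~ factor(Period), data = Measles_cases_summary)
summary(anova_model)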

# Loess Smoothing (measles data, matching the plot title)
loess_model <- loess(total_cases ~ Period, data = Measles_cases_summary, span = 0.5)
Measles_cases_summary$trend <- predict(loess_model)

plot(Measles_cases_summary$Period, Measles_cases_summary$total_cases, type = "b", main = "Measles Cases with Loess Trend", ylab = "Cases", xlab = "Year")
lines(Measles_cases_summary$Period, Measles_cases_summary$trend, col = "green", lwd = 2)

#Predictive analysis (ARIMA Model)
Measles_fit <- auto.arima(measles_ts)
forecasted_values <- forecast(Measles_fit, h = 2)  # Forecast next 2 periods (years)
plot(forecasted_values)

# Moving Average for Trend (if data is yearly)
# fill = NA pads the ends of the series; it replaces the deprecated na.pad argument
moving_avg <- rollmean(measles_ts, k = 3, align = "center", fill = NA)
plot(measles_ts, type = "l", main = "Measles Cases with Moving Average Trend", ylab = "Cases")
lines(moving_avg, col = "blue", lwd = 2)