G5Project_ISSS602: Exploratory Data Analysis

Literature review was conducted to analyze how Exploratory Data Analysis (EDA) has been performed before, and whether those techniques are applicable to the project on the explanatory and predictive modeling of whether an individual is likely to undergo H1N1 vaccination. In view of the scope of the project that covers explanatory and predictive modeling, the objective of the EDA is to visualize the relationships between input variables and the target variable so as to identify any trends which may suggest the usefulness of an input variable in predicting the target variable.

A key consideration in deciding the type of visualization is the data type of the input variables. For this dataset, most of the input variables are categorical. Therefore, visualizations that are suitable for categorical variables will be explored, as opposed to visualization that are suitable for continuous variables such as correlation plots and histograms.

1.2 Reviewing Past Visualizations on H1N1 vaccination or related topics

Many descriptive visualizations have been made to visualize survey responses on attitudes towards H1N1 vaccination before, or related public health topics on other pandemics. Some examples have been highlighted below:

An example of a visualization is responses of survey respondents on a scale of Strongly Disagree to Strongly Agree with respect to various statements, and how these proportions have changed over time. The visualization is available at this link and a screenshot is provided below:

Another example of a visualization is a global map visualizing how the proportions of survey respondents who Strongly Agree or Agree with the statement that ‘Vaccines are imporant for children to have’. The visualization is available at this link and a screenshot is provided below:

However, these visualizations are not the most relevant for this project. While this dataset also involves survey responses, the objective of the visualization is not to gain insights on the survey responses alone but to gain insights on how the survey responses influence whether an individual takes the H1N1 vaccination. Therefore, I will not be following what these visualization have done, but look at other methods that can be used to explore the relationship between an input and target variable.

1.3 Mosaic Plot

The mosaic plot data visualization technique is suitable for visualizing multivariate categorical data. A mosaic plot is a graphical display of cell frequencies in a contingency table, which is useful in this project where we are looking at how each input variable affects the target variable.

While there are many packages that can be used for mosaic plots including ggmosaic and dlookr which contains plot.relate(), I will be using the vcd package. vcd stands for visualizing categorical data. ggmosaic, dlookr and vcd are all able to plot the basic requirements of a mosaic chart - which is to plot 2 categorical variables against each other and modify the aesthetics of the graph (e.g. color scheme) as required. However, the vcd package stands out for its shade function, where it is able to shade the mosaic plot based on chi-square test results. This color scale would be a useful indicator for this project in helping the user observe which input variables have a relationship with the target variable, and provides early insights into which input variables might be included or excluded in the explanatory and predictive models.

However, a disadvantage of the mosaic plot is that rectangles can vary considerably in aspect ratio. As it is difficult to compare rectangles of different length and width, a user may not be able to compare the frequencies of the rectangles. To overcome this difficulty, a bar plot can be used in conjunction with the mosaic plot.

1.4 Stacked Bar Chart

Bar plots are also a suitable data visualization technique for visualizing multivariate categorical data by helping us to identify the relationship between the input and target variables. For this project, we will be using a stacked bar chart which will clearly illustrate the different bars for those who have taken the vaccine and those who have not. With the bar charts, labels will be added to show the count and proportions of those who have and have not taken vaccine within each group (see below for more details). In addition, it can also help us to observe the distribution of the input variables with respect to the target variable, and whether there are any plots with complete or quasi-complete separation. Variables with complete or quasi-complete separation need to be removed from the logistic regression.

While there are also many packages that can be used for bar charts, the package that I will be using is ggplot. The ggplot package is easy to manipulate and there are many different elements that we can add to the graph according to what is required. Aside from ggplot, other ways to plot barplots in R include R’s barplot() and barplot3d package. Barplot3d package is intended for 3d barplot, which I would not be exploring in this visualization. R’s barplot() offers similar functions to ggplot and can be used to do up simple barplots, but as a whole ggplot is a package with many more functions. Thus, ggplot is chosen.

1.5 Other techniques for multivariate categorical data

There are other exploratory visualizations that we can explore for multivariate categorical data namely balloon plots, which can be plotted using ‘ggballoonplot()’ under ggpubr and correspondance analysis, which can be plotted using ’ca()’under the package FactoMineR. However, both of these plots will not be explored in this visualization. Most of the categorical variables in this visualization are binary and do not require balloon plots nor correspondance analysis which may be more suitable for categorical variables that have many categories.

Apart from visualizations, the contingency table is another commonly used technique for multivariate categorical data. There are many packages that are available for displaying tables including KableExtra, flextable, huxtable and reactable among many others. However, I will not need a contingency table as I will be including information on count in the bar chart.

2 Storyboard

Combining both elements of a mosaic plot and bar chart together, a storyboard has been sketched out as follows:

As the objective of this visualization is to visualize the relationship between input variables and the target variable, the top section of the storyboard aims to achieve this objective through the mosaic plot and bar chart. By selecting a variable in the left panel, the variable will be plotted against the target variable in the main panel. In addition, it may be interested to investigate if there are relationships between input variables, which the bottom section aims to achieve. Upon selecting both variables in the left panel, both variables would appear in the main panel.

3 Data Extraction and Preparation

3.1 Installing packages

The first step would be to install the required packages. The packages that I will be using are as follows:

Package	Purpose
tidyverse	Data manipulation and plotting bar chart
vcd and vcdExtra	Plotting mosaic plot
scales	Displaying the proportions in the bar chart
stringr	Recoding the values in the column ‘State’ into letters where the first letter is capitalized

packages = c('tidyverse', 'vcd', 'vcdExtra', 'scales', 'stringr')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3.2 Importing dataset

H1N1 <- read_csv("data/H1N1_Final_dan_edit.csv")

3.3 Data preparation

The first step is to rename the column for ‘Concern Level’. This is a problem that is specific to our dataset. As Concern Level has a whitespace character, there may be difficulties in calling this variable in subsequent codes. Thus, we rename it as ‘Concern_Level’.

names(H1N1)[names(H1N1) == 'CONCERN LEVEL'] <- "CONCERN_LEVEL"

The second step is to handle NA values. We handle NA values differently for the target variable and input variables. For the target variable, we would remove NA values as these rows are not useful to our analysis. The code is as follows:

#exclude the NA column in the target variable
H1N1 <- H1N1 %>%
  filter(!is.na(VACC_H1N1_F))

However, for input variables, it is acceptable to have NA values as long as these values are not above a certain threshold, which we have set as 50%. To select the input variables that have less than 50% of NA values, there are a few steps involved. The first step is to recode the NA values into NA as recognized by R. This is because the NA values in the dataset are coded as #N/A which is not recognized as an NA value to R. The following code can be run for this recoding process:

H1N1[H1N1 == '#N/A'] <- NA

Next, we can view a summary of the proportion of NA values in every column, the following code can be run:

colMeans(is.na(H1N1))

Finally, we can select out the variables that have a missing value threshold of less than 50%. The following code can be run:

h1n1data <-H1N1 %>%
  select(VACC_H1N1_F, VACC_SEAS_F, B_H1N1_ANTIV, B_H1N1_AVOID, B_H1N1_FMASK,
         B_H1N1_HANDS, B_H1N1_LARGE, B_H1N1_RCONT, B_H1N1_TOUCH, CONCERN_LEVEL,
         INT_H1N1, KNOW_H1N1, INT_SEAS, DOCREC, CHRONIC_MED_F, CLOSE_UNDER6MO_F,
         HEALTH_WORKER_F, PATIENT_CONTACT_F, AGEGRP, EDUCATION_COMP, HH_CHILD_R,
         INC_CAT1, MARITAL, RACEETH4_I, N_ADULT_R, SEX_I, STATE)

h1n1data <- transform(h1n1data,
                      VACC_H1N1_F= as.factor(VACC_H1N1_F),
                      VACC_SEAS_F= as.factor(VACC_SEAS_F),
                      B_H1N1_ANTIV= as.factor(B_H1N1_ANTIV),
                      B_H1N1_AVOID= as.factor(B_H1N1_AVOID),
                      B_H1N1_FMASK= as.factor(B_H1N1_FMASK),
                      B_H1N1_HANDS= as.factor(B_H1N1_HANDS),
                      B_H1N1_LARGE= as.factor(B_H1N1_LARGE),
                      B_H1N1_RCONT= as.factor(B_H1N1_RCONT),
                      B_H1N1_TOUCH= as.factor(B_H1N1_TOUCH),
                      CONCERN_LEVEL= as.factor(CONCERN_LEVEL),
                      INT_H1N1= as.factor(INT_H1N1),
                      KNOW_H1N1= as.factor(KNOW_H1N1),
                      DOCREC= as.factor(DOCREC),
                      CHRONIC_MED_F= as.factor(CHRONIC_MED_F),
                      CLOSE_UNDER6MO_F= as.factor(CLOSE_UNDER6MO_F),
                      HEALTH_WORKER_F= as.factor(HEALTH_WORKER_F),
                      AGEGRP= as.factor(AGEGRP),
                      EDUCATION_COMP= as.factor(EDUCATION_COMP),
                      HH_CHILD_R= as.factor(HH_CHILD_R),
                      INC_CAT1= as.factor(INC_CAT1),
                      MARITAL= as.factor(MARITAL),
                      RACEETH4_I= as.factor(RACEETH4_I),
                      N_ADULT_R= as.factor(N_ADULT_R),
                      SEX_I= as.factor(SEX_I),
                      STATE= as.factor(STATE),
                      PATIENT_CONTACT_F = as.factor(PATIENT_CONTACT_F),
                      INT_SEAS = as.factor(INT_SEAS)
                   )

The fourth step is to create a new column for region based on the column in the dataset for state. To do so, we have another dataset that has columns for both region and state. We will start by importing this dataset with the following code:

region <- read_csv("data/state_region.csv")

Then, we will have to recode the values in the state column. This is because the values are currently all in uppercase letters. We will have to recode the letters into lowercase letters apart from the first letter. Finally, we can merge both tables using a left join.

h1n1data$state_recoded <- str_to_title(h1n1data$STATE)
h1n1data <- left_join(h1n1data, region,
                      by=c("state_recoded" = "State"))

3.4 Overview of Variables

Category	Variable Name	Allowable Values
Target Variable	Taken H1N1 flu vaccine	Yes, No
Demographic	Gender	Male, Female
Demographic	Age Group	6 months-9 years, 10-17 years, 18-34 years, 35-44 years, 45-54 years, 55-64 years, 65+ years
Demographic	Education Level	<12 years, 12 years, some college, college graduate
Demographic	Income Level	<=$10000, $10001-$15000, $15001-$25000, $25001-$35000, $35001-$50000, $50001-$75000, $75001-$100000
Demographic	Marital Status	Married, Not Married
Demographic	Race	Hispanic, Non-Hispanic Black Only, Non-Hispanic White Only, Non-Hispanic Other or Multiple Races
Demographic	Number of children in household	0, 1, 2, 3
Demographic	Number of adults in household	1, 2, 3, 4
Demographic	Region	Northeast Region, Midwest Region, South Region, West Region
Behavioural	Taken seasonal flu vaccine	Yes, No
Behavioural	Taken antiviral medications	Yes, No
Behavioural	Avoid close contact with others with flu-like symptoms	Yes, No
Behavioural	Bought a face mask	Yes, No
Behavioural	Washing hands	Yes, No
Behavioural	Reduced time at large gatherings	Yes, No
Behavioural	Reduced contact outside home	Yes, No
Behavioural	Avoid touching eyes, nose or mouth	Yes, No
Attitude	Concern level about H1N1 flu	Not at all concerned, not very concerned, somewhat concerned, very concerned
Attitude	Intent to get H1N1 flu vaccine	Definitely no intent, improbably intent, probable intent, definite intent
Attitude	Knowledge about H1N1 flu	No knowledge, a little knowledge, a lot of knowledge
Attitude	Intent to get seasonal flue vaccine	Definitely no intent, improbably intent, probable intent, definite intent
Attitude	Doctor recommendation on vaccines	Never recommend vaccines, recommend seasonal flu vaccine only, recommend H1N1 vaccine only, recommend both seasonal flu and H1N1 flu vaccines
Risk Factor	Chronic medical condition flag	Yes, No
Risk Factor	Close contact with child under 6 months flag	Yes, No
Risk Factor	Works in healthcare field flag	Yes, No
Risk Factor	Direct patient contact flag	Yes, No

4 Testing and Prototyping

As the storyboard contains mosaic plots and bar charts, this section will to test the codes for the plots and which kind of plots would work better for this project. While there are two sections in the storyboard, this prototype will focus on plotting a randomly chosen input variable against the target variable. The randomly chosen input variable is behavioural variable of whether the individual has bought a face mask. These codes can easily be replicated for other variables, and for the other section in the final Shiny application.

4.1 Mosaic Plot

4.1.1 Using Chi-Square Test for Shading

Starting off on the mosaic plot, the first option is to use shading for the chi-square test plot. The color legend indicates which cells contribute to the significance of the chi-squared test results, and means the following:

The colors imply that there is a relationship between the input variable and target variable. In this case, survey respondents who bought a face mask are more likely to have taken the H1N1 vaccine. The code and resultant plot are as follows:

Color	Statistical Significance
Red	There are fewer observations than would have been expected under the null model
Blue	There are more observations that would have been expected under the null mode

mosaic(~VACC_H1N1_F + B_H1N1_FMASK, data = h1n1data, shade = TRUE,
       labeling_args = list(set_varnames = 
                              c(B_H1N1_ANTIV = "Bought a Face Mask", 
                                VACC_H1N1_F = "Taken H1N1 vaccine")))

As a contrast, the plot for gender is below, and is greyscale. This implies that gender is not a significant indicator of whether an individual takes the vaccine.

4.1.2 Drawing attention to the desired outcome

The alternative option for shading is to draw attention to the desired outcome. However, in comparison with the mosaic plot in 4.1.1, this mosaic plot provides less useful information to the user as it does not display the chi-square test results. Therefore, we have decided to opt for the first mosaic plot. Nonetheless, the code and the resultant plot are as follows:

mosaic(~VACC_H1N1_F + B_H1N1_FMASK, data = h1n1data, 
       labeling_args = list(set_varnames = 
                              c(VACC_H1N1_F = "Taken H1N1 vaccine", 
                                B_H1N1_FMASK = "Bought a face mask")),
       shade = TRUE,
       gp = gpar(fill=c('light grey', 'pink')))

4.2 Bar Chart

The next plot that I will require is a stacked bar chart, which would require the code geom_bar() from the package ggplot2. There are several steps that we ned to take to produce the graph.

4.2.1 Data Preparation

The first step is to drop the NA values in the columns using drop_NA(). Without dropping the NA values, these values would show up in the bar chart but do not offer much meaningful information to the user. Thus, I have decided to remove the NA values.

4.2.2 Plotting the Graph

The basic function ggplot and geom_bar() can be used to plot the bar chart. As I am looking at creating a stacked bar chart for respondents who have taken the vaccine vs those who have not, the target variable can be used as the ‘fill’. While there are many themes that can be used, I have chosen to use the theme theme_classic().

4.2.2.1 Plotting Data Labels of Frequency and Proportion

To provide comprehensive information to the user, I have decided to provide labels for both frequency and proportion.

For effective comparison on how the input variable affects the target variable, the proportion needs to be the count out of the sub-category within the target variable. To illustrate with the example of the input variable ‘Bought a Face Mask’, it will be more effective for users to compare that 32% of those who bought a face mask took the H1N1 vaccine vs a lower proportion of 21% of those who did not buy a face mask took the vaccine. This would highlight that there is a difference in proportion. Comparatively, it would be less insightful if the labels were a proportion of the total population i.e. 19% of the population were those who did not buy a face mask and took the vaccine vs 2% of the population were those who bought a face mask and took the vaccine. As differences in the proportions of the height of the stacked bar chart may not be obvious to the eye, the user would not be able to tell without further computations whether buying a face mask had an impact on whether an individual takes the vaccine.

The R code for the latter is much simpler and requires applying the geom_text() function to ..count.. / sum(..count..). However, the former requires a more complex formula to first compute the percentages using the package scales, and then plotting these values into geom_text(). Frequency and count were called in separate geom_text() functions.

4.2.2.2 Colors

To draw attention to the respondents who have taken the vaccine as this is the desired outcome, I have colored those who have taken the vaccine in bright blue vs those who have not taken the vaccine which is in gray. To draw further attention to the desired outcome, I have included a black border.

As the values in the column for whether an individual took the H1N1 vaccine are coded as ‘Yes’ and ‘No’, I have also re-labelled the legened as ‘Took H1N1 vaccine’ and ‘Did not take H1N1 vaccine’ respectively. I have further titled the legend as ‘H1N1 vaccination’.

4.2.2.3 Axis labels

I have labelled the axes accordingly using xlab() and ylab(). To prevent the x-axis labels from overlapping each other when there are several categories in a variable, I have placed the labels diagonally as opposed to horizontally.

4.2.3 Code and Final Plot

VACC_H1N1_F.labs <- c("Did not take H1N1 vaccine", "Took H1N1 vaccine")
names(VACC_H1N1_F.labs) <- c("No", "Yes")

percentData <- h1n1data %>% 
  drop_na(B_H1N1_FMASK) %>%
  group_by(B_H1N1_FMASK) %>% 
  count(VACC_H1N1_F) %>%
  mutate(ratio=scales::percent(n/sum(n)))

h1n1data %>%
  drop_na(B_H1N1_FMASK) %>%
  ggplot(aes(x = B_H1N1_FMASK, fill = VACC_H1N1_F)) +
  geom_bar(aes(color = VACC_H1N1_F == "Yes")) + 
  geom_text(data = percentData,
            aes(y=n, label=paste("(",ratio,")")),
            size = 3,
            position = position_stack(vjust = 0.5), hjust=-0.1) +
  geom_text(stat = "count",
            aes(label = ..count..),
            size = 3,
            position = position_stack(vjust = 0.5), hjust = 0.95) +
  scale_fill_manual(values = c("#ECECEC", "#2a9df4"),
                    breaks = c("No", "Yes"),
                    labels = VACC_H1N1_F.labs,
                    name = "H1N1 Vaccination",) + 
  scale_x_discrete(guide = guide_axis(angle = 45)) + 
  scale_color_manual(values = c(NA, 'black'), guide=F) +
  xlab("Bought a face mask") + 
  ylab("Frequency") +
  theme_classic()

Exploratory Data Analysis

Author

Affiliation

Published

DOI

1 Literature Review

1.1 Approach to Literature Review