Existing datasets can be put to many uses, from building and testing hypotheses about past events to forecasting what the future may hold by analyzing trends in current data. When creating a prediction model from an existing dataset, a number of factors determine how effective the resulting model will be. One such factor is how well the form of the model matches the behaviour of the data: would a linear model be very accurate when fit to data that follows a non-linear trend?
Another factor is how complex the prediction model is made. It is natural to assume that a more complex model must be a better one; after all, adding more predictors lets the model account for more of the variation, including outliers. However, as more terms are added, it also becomes harder to discern the nature of the model and how its components relate to one another.
With regard to the current Covid-19 pandemic, we are re-examining two models previously built on existing data: one based on linear regression and a second created using linear discriminant analysis, also known as LDA. Revisiting these models, a comparison of their standard errors can now be used to answer the question: is a model necessarily better as its complexity increases?
Data
Model 1
The following figures and models were built using the most up-to-date datasets on the current pandemic. The data chosen for the linear regression model was previously limited to just the numbers reported by Canada, out of personal interest as a Canadian citizen. Figure 1 below is a scatter plot of the number of new Covid-19 cases reported in Canada each day since data collection began.
Figure 1: Cases Reported per day since Data Collection Started
From a cursory eye test this plot does not look like it will fit a linear regression model well. The first 80 or so data points, representing the days since data collection began, form a line with a slope near zero, if not zero altogether, as the number of cases reported in Canada during that period was close to zero from day to day. However, once more cases began arriving internationally, the number of reported cases increased almost exponentially, then tapered off into a more quadratic-looking curve, before settling into a roughly linear, decreasing trend as it approaches the current day.
To measure both the accuracy of the model and its effectiveness relative to its complexity, a series of similar models was created from the same base equation, each one more complex than the last. For the linear regression, the base equation predicts the number of cases from the current day number, with the day raised to increasingly higher powers as the model becomes more complex; for example, the third model predicts cases from the day, the day squared and the day cubed.
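As a rough illustration of this setup (not the original code), the sketch below fits polynomial regression models of increasing degree with scikit-learn and estimates each model's error by k-fold cross-validation. The file name, column names and the choice of k = 10 are assumptions made for the example.

```python
# Minimal sketch: day-vs-cases polynomial regressions of increasing degree,
# scored by k-fold cross-validation. File and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("canada_daily_cases.csv")      # hypothetical file
X = df[["day"]].to_numpy()                      # day index since collection began
y = df["new_cases"].to_numpy()                  # new cases reported that day

results = {}
for degree in range(1, 6):                      # five models of increasing complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # scikit-learn returns negative MSE, so flip the sign
    errors = -cross_val_score(model, X, y, cv=10,
                              scoring="neg_mean_squared_error")
    results[degree] = (errors.mean(), errors.std() / np.sqrt(len(errors)))

for degree, (mean_err, se) in results.items():
    print(f"degree {degree}: CV error {mean_err:.1f} +/- {se:.1f}")
```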
Figure 2: Standard Error and Error Bars for Linear Regression Model
The plot above shows the standard error of each model, encompassed by its respective error bars. The lower a point sits on the y-axis, the lower its average error, so from position alone one would assume that model number 3 is the best on this plot. However, this is where model complexity also comes into play. The one-standard-error rule begins by identifying the model with the lowest average error value, here the third model. One then uses the upper and lower error bars of model 3 to check whether any less complex model has a mean error value that falls within that range. In this instance both model two and model one fall within the range, and as such model one is also a viable choice. The reason models one through three are so closely tied on this dataset is likely that a linear regression model is simply not a good fit for the shape of the existing data.
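For reference, a small sketch of the one-standard-error rule as described above, applied to the `results` dictionary from the earlier sketch (stating the rule as "the simplest model whose mean error falls within one standard error of the minimum"):

```python
# Minimal sketch of the one-standard-error rule, assuming `results` maps
# model complexity to (mean CV error, standard error) as in the sketch above.
def one_standard_error_choice(results):
    # model with the lowest mean cross-validated error
    best = min(results, key=lambda k: results[k][0])
    best_mean, best_se = results[best]
    threshold = best_mean + best_se
    # simplest model whose mean error falls within one SE of the minimum
    candidates = [k for k, (mean_err, _) in results.items() if mean_err <= threshold]
    return min(candidates)

chosen = one_standard_error_choice(results)
print("model chosen by the one-standard-error rule:", chosen)
```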
Model 2
The second model chosen for revisiting was created through linear discriminant analysis and was limited to countries within the Americas that still had a relatively high rate of new cases on a day-to-day basis; the countries chosen were Brazil, Canada, Chile, Mexico, Peru and the United States. Figure 3 below is a scatter plot of the new cases reported per day in each of these countries.
Figure 3: Plot of New Cases per Day in Each Country in May
From the plot of the data we can see that each country has its own linear fit line, with varying goodness of fit: Brazil and the United States have a higher error rate on their linear fits, while Canada, Chile, Mexico and Peru have closer fits. The model built from this dataset attempts to correctly predict the classification, in this case the country, based on the number of new cases reported in a day. Using linear discriminant analysis, the model can be made more complex by adding predictors and raising their powers. Figure 4 below plots the error rate of each model. The models were evaluated using k-fold cross-validation, which splits the dataset into a number of equal parts determined by a chosen value assigned to the variable k.
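A hedged sketch of what that evaluation might look like with scikit-learn is shown below; the file name, column names, the use of polynomial powers of the predictors and the choice of k = 10 folds are all assumptions for illustration, not the original code.

```python
# Minimal sketch: classifying the country from daily new-case counts with LDA,
# scored by k-fold cross-validation. File and column names are hypothetical.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

df = pd.read_csv("americas_daily_cases.csv")    # hypothetical file
X = df[["day", "new_cases"]].to_numpy()         # predictors (assumed columns)
y = df["country"].to_numpy()                    # class label to predict

for degree in range(1, 6):                      # raise predictor powers to add complexity
    lda = make_pipeline(PolynomialFeatures(degree), LinearDiscriminantAnalysis())
    scores = cross_val_score(lda, X, y, cv=10)  # k = 10 folds, accuracy by default
    print(f"model {degree}: error rate {1 - scores.mean():.2f}")
```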
From the plot in figure 4 we can see that model number four has the lowest error rate at 0.2, which corresponds to a success rate of approximately 0.8, or 80 percent, give or take the standard error bar range. The plot also shows that, under the one-standard-error rule, model number four has the lowest error rate and the mean error values of the less complex models fall outside its error bar range. Therefore, model number four is the best choice here. The large difference between the error rates of the models in figure 4, compared to figure 2, likely stems from the LDA dataset fitting its model much more closely than the Canadian case data fit the linear regression model.
What was Learned
By using k-fold cross-validation we were able to compare as many models as we wished and determine the best model at each level of complexity, without sacrificing much accuracy compared to simply splitting the data into a single training set and testing set, as was done in the previous LDA model generation.
Using the one-standard-error rule, we were also able to determine that greater model complexity does not automatically make a model the best. This was shown by model one being the best choice in the linear regression error plot and model four being the best choice for the LDA models, both out of a total of five models.