Existing datasets can be put to many uses, from building and testing hypotheses about past events to forecasting what the future may hold by analyzing trends in current data. When creating a prediction model from an existing dataset, a number of factors determine how effective the resulting model will be. One such factor is how well the form of the model matches the behaviour of the data: would a linear model be very accurate when fit to data that follows a non-linear trend?
Another factor is how complex the prediction model is made. It is natural to assume that a more complex model must be a better one; after all, adding more predictors lets the model account for more of the variation, including outliers. However, as more terms are added, it also becomes harder to discern the nature of the model and how its components relate to one another.
With regard to the current Covid-19 pandemic, we are re-examining two models previously built on existing data: one based on linear regression and a second created using linear discriminant analysis, also known as LDA. Revisiting these models, a comparison of their standard errors can now be used to answer the question: is a model necessarily better as its complexity increases?
Data
Model 1
The following figures and models were built using the most up-to-date datasets on the current pandemic. The data chosen for the linear regression model was previously limited to just the numbers reported by Canada, out of personal interest as a Canadian citizen. Figure 1 below is a scatter plot of the number of new Covid-19 cases reported in Canada each day since data collection began.
Figure 1: Cases Reported per day since Data Collection Started
From a cursory eye test this plot does not look like it will fit a linear regression model well. The first 80 or so data points, representing the days since data collection began, form a line with a slope near zero, if not zero altogether, as the number of cases reported in Canada during that period was close to zero from day to day. However, once more cases began arriving internationally, the number of reported cases increased almost exponentially, then tapered off into a more quadratic-looking curve, before settling into a roughly linear, decreasing trend as it approaches the current day.
To measure both the accuracy of the model and its effectiveness relative to its complexity, a series of similar models was created from the same base equation, each one more complex than the last. For the linear regression, the base equation predicts the number of cases from the current day number, with the day raised to increasingly higher powers as the model becomes more complex; for example, the third model predicts cases from the day, the day squared and the day cubed.
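As a rough illustration of this setup (not the original code), the sketch below fits polynomial regression models of increasing degree with scikit-learn and estimates each model's error by k-fold cross-validation. The file name, column names and the choice of k = 10 are assumptions made for the example.

```python
# Minimal sketch: day-vs-cases polynomial regressions of increasing degree,
# scored by k-fold cross-validation. File and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("canada_daily_cases.csv")      # hypothetical file
X = df[["day"]].to_numpy()                      # day index since collection began
y = df["new_cases"].to_numpy()                  # new cases reported that day

results = {}
for degree in range(1, 6):                      # five models of increasing complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # scikit-learn returns negative MSE, so flip the sign
    errors = -cross_val_score(model, X, y, cv=10,
                              scoring="neg_mean_squared_error")
    results[degree] = (errors.mean(), errors.std() / np.sqrt(len(errors)))

for degree, (mean_err, se) in results.items():
    print(f"degree {degree}: CV error {mean_err:.1f} +/- {se:.1f}")
```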
Figure 2: Standard Error and Error Bars for Linear Regression Model
The plot above shows the standard error of each model, encompassed by its respective error bars. The lower a point sits on the y-axis, the lower its average error, so from position alone one would assume that model number 3 is the best on this plot. However, this is where model complexity also comes into play. The one-standard-error rule begins by identifying the model with the lowest average error value, here the third model. One then uses the upper and lower error bars of model 3 to check whether any less complex model has a mean error value that falls within that range. In this instance both model two and model one fall within the range, and as such model one is also a viable choice. The reason models one through three are so closely tied on this dataset is likely that a linear regression model is simply not a good fit for the shape of the existing data.
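For reference, a small sketch of the one-standard-error rule as described above, applied to the `results` dictionary from the earlier sketch (stating the rule as "the simplest model whose mean error falls within one standard error of the minimum"):

```python
# Minimal sketch of the one-standard-error rule, assuming `results` maps
# model complexity to (mean CV error, standard error) as in the sketch above.
def one_standard_error_choice(results):
    # model with the lowest mean cross-validated error
    best = min(results, key=lambda k: results[k][0])
    best_mean, best_se = results[best]
    threshold = best_mean + best_se
    # simplest model whose mean error falls within one SE of the minimum
    candidates = [k for k, (mean_err, _) in results.items() if mean_err <= threshold]
    return min(candidates)

chosen = one_standard_error_choice(results)
print("model chosen by the one-standard-error rule:", chosen)
```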
Model 2
The second model chosen for revisiting was created through linear discriminant analysis and was limited to countries within the Americas that still had a relatively high rate of new cases on a day-to-day basis; the countries chosen were Brazil, Canada, Chile, Mexico, Peru and the United States. Figure 3 below is a scatter plot of the new cases reported per day in each of these countries.
Figure 3: Plot of New Cases per Day in Each Country in May
From the plot of the data we can see that each country has its own linear fit line, with varying goodness of fit: Brazil and the United States have a higher error rate on their linear fits, while Canada, Chile, Mexico and Peru have closer fits. The model built from this dataset attempts to correctly predict the classification, in this case the country, based on the number of new cases reported in a day. Using linear discriminant analysis, the model can be made more complex by adding predictors and raising their powers. Figure 4 below plots the error rate of each model. The models were evaluated using k-fold cross-validation, which splits the dataset into a number of equal parts determined by a chosen value assigned to the variable k.
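A hedged sketch of what that evaluation might look like with scikit-learn is shown below; the file name, column names, the use of polynomial powers of the predictors and the choice of k = 10 folds are all assumptions for illustration, not the original code.

```python
# Minimal sketch: classifying the country from daily new-case counts with LDA,
# scored by k-fold cross-validation. File and column names are hypothetical.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

df = pd.read_csv("americas_daily_cases.csv")    # hypothetical file
X = df[["day", "new_cases"]].to_numpy()         # predictors (assumed columns)
y = df["country"].to_numpy()                    # class label to predict

for degree in range(1, 6):                      # raise predictor powers to add complexity
    lda = make_pipeline(PolynomialFeatures(degree), LinearDiscriminantAnalysis())
    scores = cross_val_score(lda, X, y, cv=10)  # k = 10 folds, accuracy by default
    print(f"model {degree}: error rate {1 - scores.mean():.2f}")
```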
From the plot in figure 4 we can see that model number four has the lowest error rate at 0.2, which corresponds to a success rate of approximately 0.8, or 80 percent, give or take the standard error bar range. The plot also shows that, under the one-standard-error rule, model number four has the lowest error rate and the mean error values of the less complex models fall outside its error bar range. Therefore, model number four is the best choice here. The large difference between the error rates of the models in figure 4, compared to figure 2, likely stems from the LDA dataset fitting its model much more closely than the Canadian case data fit the linear regression model.
What was Learned
By using k-fold cross-validation we were able to compare as many models as we wished and determine the best model at each level of complexity, without sacrificing much accuracy compared to simply splitting the data into a single training set and testing set, as was done in the previous LDA model generation.
Using the one-standard-error rule, we were also able to determine that greater model complexity does not automatically make a model the best. This was shown by model one being the best choice in the linear regression error plot and model four being the best choice for the LDA models, both out of a total of five models.