Data Visualization and Interpretation: Can We Predict Where New Cases of Covid-19 Will Be?

When it comes to international incidents, such as the ongoing Covid-19 global pandemic, various questions arise whose answers can’t always be provided with certainty. One of these questions may be “where are new cases going to pop up?”. An answer to this question would allow aid to be distributed in advance to the regions predicted to be affected, at every level: municipal bodies providing more aid to harder-hit neighbourhoods, provincial or state governments allocating aid to cities in greater need, and international aid and relief groups allocating resources to nations in greater need of response and relief.

In order to allocate resources properly, a method of prediction and classification should be implemented based on existing data. Once known data is obtained and analyzed, an algorithm, or equation, can be created to classify new data that arrives as raw numbers without a country label. However, several questions arise from such a situation. How accurately could an algorithm built from previous data predict the right country from its number of new cases? Would predicting based on another category, such as new deaths, be more accurate? How would it fare compared to prediction by inspection?

Data

The data comes from the day-to-day datasets on new Covid-19 cases provided by the European Centre for Disease Prevention and Control. For my own hypothesis, I limited the countries to the Americas because of their direct relation to Canada, and from there to the subset of countries that were still regularly seeing more than 1000 new cases per day. This limited the countries included in the data to Brazil, Canada, Chile, Mexico, Peru and the United States. The range of dates was limited to the past calendar week so as to reflect the constantly evolving situation with the virus.
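The selection described above can be sketched in a few lines of Python. This is only an illustration: the post does not say what software was used, and the field names (`country`, `date`, `new_cases`), the end date and the sample rows are all assumptions made up for the example, not the ECDC file layout or real ECDC figures.

```python
from datetime import date, timedelta

# Countries kept in the analysis, as listed in the text.
COUNTRIES = {"Brazil", "Canada", "Chile", "Mexico", "Peru", "United States"}

def select_last_week(records, end_date, countries=COUNTRIES, days=7):
    """Keep rows for the chosen countries within the `days` days ending on end_date."""
    start_date = end_date - timedelta(days=days - 1)
    return [
        row for row in records
        if row["country"] in countries and start_date <= row["date"] <= end_date
    ]

# Placeholder rows for illustration only; these are NOT real ECDC figures.
sample = [
    {"country": "Canada", "date": date(2020, 5, 25), "new_cases": 1000},
    {"country": "Canada", "date": date(2020, 5, 10), "new_cases": 1200},  # older than one week
    {"country": "France", "date": date(2020, 5, 25), "new_cases": 300},   # outside the chosen countries
    {"country": "Peru", "date": date(2020, 5, 19), "new_cases": 4000},
]

selected = select_last_week(sample, end_date=date(2020, 5, 25))
print([(row["country"], row["date"].isoformat()) for row in selected])
```

Only the recent Canada row and the Peru row survive the filter; the older Canada row and the row for a country outside the list are dropped.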

For an initial inspection of the new cases currently occurring in each country, and of what one might glean from the data about the future, a histogram was created in figure 1 below.

Figure 1: Histogram of New Cases per Day

From the data above, Canada is separated from the next highest countries by approximately 1000 new cases reported each day, which should make it fairly easy to predict which values are Canada’s. However, the bars representing the number of new cases in Chile, Mexico and Peru are all bunched together; this overlap would likely make it difficult to predict accurately which of these countries a given number of new cases corresponds to. There is also an overlap between Brazil and the United States, but it is much smaller than the one between Chile, Mexico and Peru, so the rate of incorrect predictions for these two countries should be lower than for the other three. This can be analyzed further using the box plot in figure 2 below.

Figure 2: Box Plot of Cases Per Day

Figure 2 details the range of new Covid-19 cases per day for each country, with the boxes representing the middle 50 percent of the data (the second and third quarters) and the lines attached to the boxes representing the top and bottom 25 percent of values. Once again, this plot shows that picking out Canada’s new cases per day by inspection should be almost 100 percent reliable, as both the box and the outliers are below the range of the rest of the countries in this dataset. By inspection we can also see that the top 25 percent of Brazil’s new case numbers overlaps partially with the United States’ box, while the bottom 25 percent of the United States’ values overlaps with Brazil’s top 25 percent; this should result in more values being incorrectly predicted as United States that are actually Brazil than the other way around. Mexico and Peru appear to share the same kind of overlap as Brazil and the USA, meaning they should feature a similar rate of incorrect predictions. Chile and Peru appear to have the most heavily overlapping plots, which could result in nearly all of Chile’s numbers being incorrectly predicted to belong to Peru.
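The overlap reasoning used when reading a box plot can be made explicit with a short Python sketch. It computes the quartiles behind each box and checks whether two countries' boxes or full ranges overlap. The helper names and the three series (`low`, `mid_a`, `mid_b`) are hypothetical placeholders chosen to mimic one isolated country and two overlapping ones; they are not values from the dataset.

```python
from statistics import quantiles

def five_number_summary(values):
    """Minimum, quartiles and maximum: the numbers a box plot is drawn from."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}

def ranges_overlap(a, b):
    """True if the full ranges (whisker to whisker) share any values."""
    return a["min"] <= b["max"] and b["min"] <= a["max"]

def boxes_overlap(a, b):
    """True if the boxes (first to third quartile) share any values."""
    return a["q1"] <= b["q3"] and b["q1"] <= a["q3"]

# Placeholder daily counts for illustration only; NOT real figures.
low = five_number_summary([900, 1000, 1100, 1150, 1200, 1250, 1300])
mid_a = five_number_summary([3000, 3500, 3800, 4000, 4200, 4500, 5000])
mid_b = five_number_summary([3600, 3900, 4100, 4300, 4600, 4800, 5200])

print(ranges_overlap(low, mid_a))   # False: easy to tell apart by inspection
print(boxes_overlap(mid_a, mid_b))  # True: likely to be confused
```

A country whose whole range is separate from the others, as Canada's is in figure 2, should be easy to identify, while countries whose boxes overlap are the ones a classifier is most likely to mix up.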

To test this, an algorithm was created using the existing data from the previous week: a linear discriminant analysis (LDA) was used to create a predictor of the country based on the number of new cases per day. To test the algorithm, a fraction of the data from the dataset was used to determine its success rate. Table 1 below is a prediction table with the results of feeding that data through the algorithm.
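For readers unfamiliar with linear discriminant analysis, here is a minimal sketch of what it does with a single predictor. It is not the code used for this post (the post does not say which software was used); the function names, the class labels "A", "B", "C" and the training values are all hypothetical. With one feature, LDA estimates a mean for each class and one variance pooled across all classes, then assigns a new value to the class with the highest discriminant score.

```python
import math
from collections import defaultdict

def fit_lda_1d(samples):
    """samples: list of (value, label). Returns class means, priors and pooled variance."""
    by_class = defaultdict(list)
    for x, label in samples:
        by_class[label].append(x)
    n, k = len(samples), len(by_class)
    means = {c: sum(xs) / len(xs) for c, xs in by_class.items()}
    priors = {c: len(xs) / n for c, xs in by_class.items()}
    # LDA assumes all classes share one variance, estimated by pooling the deviations.
    pooled_var = sum((x - means[c]) ** 2 for c, xs in by_class.items() for x in xs) / (n - k)
    return means, priors, pooled_var

def predict_lda_1d(model, x):
    """Return the class with the largest linear discriminant score for value x."""
    means, priors, var = model
    def score(c):
        return x * means[c] / var - means[c] ** 2 / (2 * var) + math.log(priors[c])
    return max(means, key=score)

# Placeholder training values for three made-up classes; NOT real case counts.
train = [
    (1000, "A"), (1100, "A"), (1200, "A"),
    (3000, "B"), (3200, "B"), (3400, "B"),
    (3300, "C"), (3500, "C"), (3700, "C"),
]
model = fit_lda_1d(train)
print([predict_lda_1d(model, x) for x in (1150, 3250, 3600)])
```

With equal class sizes and a shared variance, the rule reduces to picking the class whose mean is closest, which is why classes with overlapping ranges (here "B" and "C") are the ones that get confused.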

Table 1: Prediction Table Results

	Brazil	Canada	Chile	Mexico	Peru	USA
Brazil	3	0	0	0	0	1
Canada	0	7	0	2	0	0
Chile	0	0	0	0	0	0
Mexico	0	0	3	3	2	0
Peru	0	0	0	0	0	0
USA	1	0	0	0	0	3

In the above table, the column of country names on the left represents the country predicted by the algorithm for a given value of new cases. The row of names at the top represents the actual countries. For example, the number three in the ‘Brazil-Brazil’ box means that three times the algorithm predicted a value as belonging to Brazil and it actually was from Brazil, while the ‘Brazil-USA’ box represents one occurrence in which the algorithm predicted a value as belonging to Brazil when it was in reality from the USA. From the rest of the results we can see that the predictions made by inspecting the earlier box plot are borne out by both the Brazil and United States overlap and the Chile/Mexico/Peru one, while Canada had an overlap with Mexico that was not apparent by inspection, based on incorrect predictions from the algorithm. Overall, out of a sample size of n = 24, the algorithm predicted the proper category 16 times, resulting in a success rate of 67 percent and an error rate of 33 percent. This is a decently high rate of success compared with guessing blindly (one in six, or about 17 percent, for a uniform random guess among six countries); however, perhaps using the number of new deaths per day would generate an algorithm with a higher rate of success.

Figure 3: Box Plot of Deaths per Day

In the box plots in figure 3, the box for Brazil is completely within the United States’ range of values, which by inspection could lead to almost all of Brazil’s values being incorrectly classified as belonging to the US, while the United States has more than half of its values outside Brazil’s range. Meanwhile, the bottom 25 percent of Mexico’s values overlaps with Peru’s top quartile; Canada and Peru have a large overlap, which should result in nearly 50 percent of their values being incorrectly predicted; and Chile is largely in its own range, which should result in most of its values being correctly predicted by inspection. An LDA algorithm is once again generated, this time based on the number of new deaths reported per day in each country. Table 2 below displays the results of the predictions.

Table 2: Prediction Table Results based on Deaths/Day Algorithm

	Brazil	Canada	Chile	Mexico	Peru	USA
Brazil	3	0	0	0	0	0
Canada	0	0	0	0	0	0
Chile	0	0	0	0	0	0
Mexico	0	0	0	4	0	0
Peru	0	4	3	1	5	0
USA	1	0	0	0	0	3

The results in this table reinforce the overlap between Brazil and the USA seen in the box plots, which results in incorrect predictions; however, the proportion of errors from the algorithm is lower than might be estimated from inspection of the box plots. Here, unlike with the algorithm based on new cases per day, Canada’s values were all incorrectly predicted, being classified as Peru, while every value predicted as Mexico was correctly classified. Overall, this algorithm made 15 correct predictions out of a sample size of n = 24, which results in an error rate of 37.5 percent and a success rate of 62.5 percent.
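The success rate quoted for Table 2 can be reproduced directly from the table: the correct predictions sit on the diagonal, and the sample size is the sum of all entries. The short Python sketch below does that arithmetic; the function names are illustrative, and the matrix is copied from Table 2 as printed (rows are predicted countries, columns are actual countries).

```python
COUNTRIES = ["Brazil", "Canada", "Chile", "Mexico", "Peru", "USA"]

# Table 2 as printed: rows are predicted countries, columns are actual countries.
TABLE_2 = [
    [3, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 4, 0, 0],
    [0, 4, 3, 1, 5, 0],
    [1, 0, 0, 0, 0, 3],
]

def accuracy(table):
    """Return (correct, total, success rate) for a square prediction table."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct, total, correct / total

def share_correct_by_country(table, names):
    """Share of each actual country's values (a column) that were predicted correctly."""
    result = {}
    for j, name in enumerate(names):
        actual_total = sum(row[j] for row in table)
        result[name] = table[j][j] / actual_total if actual_total else None
    return result

correct, total, rate = accuracy(TABLE_2)
print(correct, total, rate)  # 15 24 0.625
print(share_correct_by_country(TABLE_2, COUNTRIES))
```

The per-country breakdown shows where the errors concentrate: none of Canada's or Chile's values were classified correctly, while all of Peru's and the USA's were.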
What Was Learned

The error and success rates of the two algorithms, based on the number of new cases per day and new deaths per day respectively, were roughly the same (67 percent versus 62.5 percent correct), and on a different random sample of data the two would likely be about equally effective. However, both algorithms classified the data correctly at a higher rate than would likely be achieved by inspecting the box plots or histogram generated from the same dataset.

Tuesday, May 26, 2020
