Datasets Used

GBD Death Burden (State of Global Air)

GBD PM2.5 Pollution (State of Global Air)

Data cleaning and consolidation

The data retrieved from the State of Global Air website contained many unnecessary columns, and many of the remaining columns needed to be renamed. To obtain a complete data set with both pollution and death burden information, the two data sets were consolidated into one. Before training the ML model, a number of further columns were dropped and the descriptive (categorical) data was one-hot encoded so that the model could train on numerical inputs.
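The cleaning and consolidation steps described above could look roughly like the following pandas sketch. All column names, the join keys (country and year), and the toy values are assumptions made for illustration; the actual downloads use their own headers and contain many more rows.

```python
import pandas as pd

# Hypothetical stand-ins for the two raw downloads; values are made up.
deaths_raw = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country A", "Country B"],
    "Region": ["Region 1", "Region 2", "Region 1", "Region 2"],
    "Year": [2018, 2018, 2019, 2019],
    "Burden Mean": [10.0, 20.0, 11.0, 19.0],
    "Units": ["deaths per 100,000"] * 4,   # example of an unneeded column
})
pm25_raw = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country A", "Country B"],
    "Year": [2018, 2018, 2019, 2019],
    "Exposure Mean": [35.0, 12.0, 33.0, 13.0],
    "Pollutant": ["pm25"] * 4,             # example of an unneeded column
})


def clean(df, keep):
    """Keep only the columns named in `keep` and rename them to its values."""
    return df[list(keep)].rename(columns=keep)


deaths = clean(deaths_raw, {
    "Country": "country",
    "Region": "gbd_region",
    "Year": "year",
    "Burden Mean": "death_burden",
})
pm25 = clean(pm25_raw, {
    "Country": "country",
    "Year": "year",
    "Exposure Mean": "pm25_exposure",
})

# Consolidate: one row per (country, year) with both burden and exposure.
merged = deaths.merge(pm25, on=["country", "year"], how="inner")

# One-hot encode the descriptive columns so every model input is numeric.
encoded = pd.get_dummies(merged, columns=["country", "gbd_region"], dtype=float)
```

An inner join is used here so that only country-year pairs present in both data sets survive; whether that matches the original consolidation is an assumption.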

ML model

The ML model uses the country, the PM2.5 exposure, the year, and the GBD region to estimate the death burden caused by air pollution. This model was also compared against a couple of other models that omitted some of these inputs, to see how much each piece of information contributed to overall model accuracy.
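A comparison of this kind can be sketched as follows. The model type is not stated above, so ordinary least squares (via NumPy) is used purely as a stand-in; the data is synthetic, and the column names, feature subsets, and hold-out split are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the consolidated data set (not real GBD values).
countries = rng.choice(["A", "B", "C", "D"], size=n)
region_of = {"A": "R1", "B": "R1", "C": "R2", "D": "R2"}
data = pd.DataFrame({
    "country": countries,
    "gbd_region": [region_of[c] for c in countries],
    "year": rng.integers(2000, 2020, size=n),
    "pm25_exposure": rng.uniform(5, 80, size=n),
})
country_effect = {"A": 0.0, "B": 5.0, "C": 20.0, "D": 30.0}
data["death_burden"] = (
    0.8 * data["pm25_exposure"]
    + data["country"].map(country_effect)
    - 0.3 * (data["year"] - 2000)
    + rng.normal(0, 2, size=n)
)

CATEGORICAL = {"country", "gbd_region"}


def design_matrix(df, features):
    """One-hot encode categorical features and prepend an intercept column."""
    X = df[features].copy()
    cats = [f for f in features if f in CATEGORICAL]
    if cats:
        X = pd.get_dummies(X, columns=cats, dtype=float)
    X.insert(0, "intercept", 1.0)
    return X.to_numpy(dtype=float)


def holdout_rmse(df, features, target="death_burden", n_train=150):
    """Fit least squares on the first n_train rows, score on the rest."""
    X = design_matrix(df, features)
    y = df[target].to_numpy(dtype=float)
    # lstsq returns the minimum-norm solution, so redundant dummies are fine.
    coef, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    residuals = X[n_train:] @ coef - y[n_train:]
    return float(np.sqrt(np.mean(residuals ** 2)))


feature_sets = {
    "all": ["country", "gbd_region", "year", "pm25_exposure"],
    "no_country": ["gbd_region", "year", "pm25_exposure"],
    "pm25_only": ["pm25_exposure"],
}
scores = {name: holdout_rmse(data, feats) for name, feats in feature_sets.items()}
```

Comparing the hold-out error in `scores` across feature sets gives a rough sense of how much each input contributes; a larger error after removing an input suggests that input carried useful information.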