Introduction and Motivation
Suicide is a serious problem that affects families, communities, and countries around the world. Because many factors play a role in suicide rates, classifying them can help identify the leading risks. Once these factors are identified, preventative measures can be put in place to decrease individual and national suicide rates. However, due to the sheer number of data points and variables, patterns within the data may be difficult or impossible for humans to derive by hand. Thus, in our project, we build a machine learning model to identify crucial combinations of factors contributing to suicide.
Approach
We use the Kaggle dataset Suicide Rates Overview 1985 to 2016, which pairs socio-economic information with suicide rates by year and country. We will build a regression model to predict suicide rates from the given factors. First, we will run a random forest to determine the importance of each factor, and see whether we can reduce the number of factors while still obtaining a reasonable model. We will then train several model types and compare their mean absolute error to determine which best fits the data and best predicts the suicide rate, or risk, for a given individual. We chose MAE (mean absolute error) as the comparison metric because it is in the same units as the suicide rate, making it easy to interpret and compare.
Data Preprocessing
The dataset contains a large quantity of information but is missing quite a few data points, particularly in the Human Development Index data. To fill these gaps, we gathered data from the United Nations Development Programme Human Development Reports. We also mapped the textual values to integers: sex to 1 and 2 (male = 1, female = 2) and age ranges to 1 through 6, starting the encodings at 1 to avoid division-by-zero errors. We tried one-hot encoding the countries, but it produced no good results, so we removed the country column; we believe a country can be identified by its population, HDI, and GDP per capita in the dataset. We also replaced zero values with epsilon, again to avoid division by zero, and removed data noted to be inconsistent, such as the generation data.
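The encoding and zero-replacement steps can be sketched in pandas. This is a minimal illustration on a hypothetical miniature of the dataset; the column names and string values are assumptions, not the project's actual code.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; column names are assumptions.
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "age": ["5-14 years", "15-24 years", "35-54 years", "75+ years"],
    "population": [50000, 0, 120000, 80000],
})

# Map textual categories to integers (male = 1, female = 2; ages 1-6),
# starting at 1 so later ratio computations cannot divide by zero.
sex_map = {"male": 1, "female": 2}
age_map = {"5-14 years": 1, "15-24 years": 2, "25-34 years": 3,
           "35-54 years": 4, "55-74 years": 5, "75+ years": 6}
df["sex"] = df["sex"].map(sex_map)
df["age"] = df["age"].map(age_map)

# Replace exact zeros with machine epsilon, again to avoid division by zero.
eps = np.finfo(float).eps
df["population"] = df["population"].replace(0, eps)
```

The same `map`/`replace` pattern extends to any other textual or zero-valued column in the dataset.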
Next, we plotted each of our factors against the others to visually verify our data.
Looking at the scatterplot and the data values, we discovered that when the population is low (below roughly 20,000), suicide rates grow exponentially as population sizes drop toward zero.
To remove these outliers, we decided to use only data where the population size is over 20,000 people. The filtered population data is shown below.
The population values look good! An overview of our final data used is below.
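The population cut-off described above is a one-line filter in pandas. A minimal sketch with hypothetical rows and assumed column names:

```python
import pandas as pd

# Hypothetical rows; only the population column matters here.
df = pd.DataFrame({"population": [5000, 15000, 25000, 300000],
                   "suicide_rate": [80.0, 30.0, 12.0, 11.5]})

# Keep only rows where the population exceeds 20,000 people.
df_clean = df[df["population"] > 20000].reset_index(drop=True)
```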
To check the correlation of our values, we created a heat map of all the attributes plotted against each other.
This shows that GDP per capita and HDI are very highly correlated with each other, so both may not be necessary. The other factors have generally low correlations with one another; interestingly, though, suicide rate is most strongly correlated with age.
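A correlation heat map of this kind comes straight from `DataFrame.corr()`. Below is a sketch on synthetic stand-in data, where HDI and GDP per capita are deliberately generated to be strongly correlated, mirroring what the real heat map showed; all names and values are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in: GDP per capita is built from HDI plus noise,
# so the two come out highly correlated, as in the real data.
hdi = rng.uniform(0.4, 0.95, n)
gdp = hdi * 50000 + rng.normal(0, 2000, n)
age = rng.integers(1, 7, n)
rate = 2.0 * age + rng.normal(0, 2, n)

df = pd.DataFrame({"HDI": hdi, "gdp_per_capita": gdp,
                   "age": age, "suicide_rate": rate})
corr = df.corr()  # Pearson correlation matrix of all attribute pairs
# seaborn.heatmap(corr, annot=True) would render the heat map itself.
```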
Models
Random Forest
For our first model, we ran a random forest to determine feature importance and to establish a baseline.
According to the permutation importance results, age is the most important factor in determining suicide rate. This makes sense, as suicide rate increases with age.
*For the age encodings: 1 = 5-14 years, 2 = 15-24 years, 3 = 25-34 years, 4 = 35-54 years, 5 = 55-74 years, 6 = 75+ years*
After age, the next most important factors are population, then sex and HDI. All of these show strong relationships in the suicide rate vs. individual factor graphs below.
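Permutation importance of this kind is available in scikit-learn as `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data in which age dominates the target, mirroring the reported ranking; the feature construction is an assumption for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
n = 1000
# Synthetic features mimicking the encoded dataset (an assumption):
# age 1-6, sex 1-2, plus one pure-noise feature.
age = rng.integers(1, 7, n).astype(float)
sex = rng.integers(1, 3, n).astype(float)
noise_feat = rng.normal(0, 1, n)
X = np.column_stack([age, sex, noise_feat])
# Target driven mostly by age, matching the report's finding.
y = 3.0 * age + 1.0 * sex + rng.normal(0, 1, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```

Here `ranking[0]` comes out as column 0 (age), the feature whose shuffling degrades predictions the most.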
For a closer look, we included the first four layers of the tree. From this, you can see that sex and age strongly influence the splits. Additionally, MSE is quite low at low ages but increases greatly as age increases.
Random forest ended up being a rather successful model, with an MAE of 3.01. We then ran random forest with the four factors of top importance: age, sex, HDI, and population. The MAE increased, so we also tested the top five factors, adding GDP. The MAE was still relatively low, but since GDP is highly correlated with HDI according to the heat map, we tested removing GDP. The MAE increased again, though by a relatively small amount. We also compared the MAE between the full dataset and the dataset without populations under 20,000, which confirmed that MAE is lower for the cleaned data.
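The train/evaluate/compare loop over feature subsets can be sketched as below. The synthetic data, coefficients, and subset choice are assumptions; the point is only the pattern of refitting on column subsets and comparing held-out MAE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
# Synthetic stand-ins for the encoded factors (an assumption).
age = rng.integers(1, 7, n).astype(float)
sex = rng.integers(1, 3, n).astype(float)
hdi = rng.uniform(0.4, 0.95, n)
pop = rng.uniform(2e4, 1e8, n)
X_all = np.column_stack([age, sex, hdi, pop])
y = 3.0 * age - 2.0 * sex + 15.0 * hdi + rng.normal(0, 1, n)

def rf_mae(X, y):
    """Fit a random forest on a train split, return held-out MAE."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    return mean_absolute_error(y_te, model.predict(X_te))

mae_all = rf_mae(X_all, y)          # all four factors
mae_top2 = rf_mae(X_all[:, :2], y)  # age and sex only
```

Dropping an informative factor (HDI here) raises the held-out MAE, which is the same effect the report observed when trimming the factor list.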
Multiple Linear Regression
Random forest converged with a reasonably low MAE, but we wanted to see whether another model would fit better. We moved to regression, beginning with multiple linear regression. For an accurate comparison, we ran the regression for each of the factor combinations we had used for random forest.
For linear regression, the MAE is very high in every case. Since the mean suicide rate is 12, an average error almost as large as the mean implies a very poor fit. Looking at the comparison of each factor with the linear regression model (below), most of the variables are not linearly correlated with suicide rate, making this model unsuitable for our purposes.
Beyond visually inspecting the graphs, we can examine the linear correlation between each factor and the suicide rate (below).
The linear correlation of population with suicide rate is very low, although population ranked comparatively high in importance in the random forest results. Age, sex, and HDI have high correlation magnitudes, although sex is inversely correlated. The variation in correlations, with both high positive and high negative values, helps explain why linear regression is a poor fit for the overall model.
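Per-factor correlations with the target are one column of the same `corr()` matrix. A sketch with synthetic data, where sex is coded male = 1, female = 2 and male rates are higher, which produces the inverse (negative) correlation noted above; all values are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 800
age = rng.integers(1, 7, n).astype(float)
sex = rng.integers(1, 3, n).astype(float)
# Male (sex = 1) rates set higher than female (sex = 2), so the
# correlation of sex with rate comes out negative.
rate = 2.5 * age - 4.0 * (sex - 1) + rng.normal(0, 2, n)

df = pd.DataFrame({"age": age, "sex": sex, "suicide_rate": rate})
corr_with_rate = df.corr()["suicide_rate"].drop("suicide_rate")
```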
Ridge Regression
Since linear regression did poorly, we ran ridge regression, which handles multicollinearity and large numbers of factors well. Running ridge regression with all of the possible factors converged successfully.
With all factors, we calculated the mean absolute error to be 5.22, which is not bad, but still less accurate than random forest.
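Ridge regression in scikit-learn is `sklearn.linear_model.Ridge`; its L2 penalty is what tames multicollinearity. The sketch below builds GDP from HDI plus noise so the two are deliberately collinear, as in the real data; everything else is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 1500
hdi = rng.uniform(0.4, 0.95, n)
gdp = hdi * 50000 + rng.normal(0, 2000, n)  # deliberately collinear with HDI
age = rng.integers(1, 7, n).astype(float)
X = np.column_stack([age, hdi, gdp])
y = 2.0 * age + 8.0 * hdi + rng.normal(0, 1, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)  # alpha controls the L2 penalty
mae = mean_absolute_error(y_te, ridge.predict(X_te))
```

The L2 penalty keeps the collinear HDI/GDP coefficients from blowing up against each other, which plain least squares is prone to.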
Finally, we tested the model's predictions on the test set. A graph of predictions against true values is below. The model follows the general trend of the data reasonably well, although there are many outliers, especially at higher suicide rates.
We also graphed the prediction error for each data point. The errors are approximately Gaussian around zero, indicating that the model is relatively successful.
Once again, we ran ridge regression on each of the factor variations used previously for an accurate comparison.
Although the change in MAE was small in each case, all errors were significantly higher than those of random forest, signifying that ridge regression is not a great model for this dataset.
Conclusion
In conclusion, random forest had by far the lowest mean absolute error. This makes sense, as random forest handles the inherent clusters of the dataset better. Multiple linear regression, on the other hand, does not handle high dimensionality well. Although ridge regression does handle dimensionality well, the relationships in the data are complex enough that random forest fits better.
Looking at the factors used, all models performed best when using all factors and excluding the low-population outliers. Interestingly, random forest performed almost as well without the year value, while both regressions performed better without the GDP factor. This may be because the high correlation between GDP and HDI caused problems for the regression functions but not for random forest, which ranked GDP with low importance. Additionally, without both year and GDP, the models perform the worst in all cases, but the error is still low enough that the predictions would be somewhat accurate, at least for random forest.
What's New About Our Approach?
Although there are existing analyses of suicide data with different regression algorithms, our analysis is unique in that we identified the important factors in the data and tested each model with different combinations of them. This allows us to determine a model's accuracy without all factors and to estimate suicide risk for an individual from their demographics even when not all factors are provided.
Works Cited
Rusty. (2018, December 1). Suicide Rates Overview 1985 to 2016. Retrieved from https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016.
Human Development Reports. (n.d.). Retrieved from http://hdr.undp.org/en/data.
Distribution of Work
Proposal: Sook Ji Do, Jiajie Lin, Elizabeth Prucka, Yvonne Yeh
Data Collection: Elizabeth Prucka, Yvonne Yeh
Data Cleaning: Elizabeth Prucka, Yvonne Yeh
Random Forest: Yvonne Yeh
Ridge Regression: Elizabeth Prucka
Multiple Linear Regression: Sook Ji Do, Jiajie Lin
Analysis: Elizabeth Prucka, Yvonne Yeh
Github Pages: Elizabeth Prucka, Yvonne Yeh