Statistical Analysis on factors influencing Life Expectancy in R

Ritesh Uppal
11 min read · Jul 13, 2021

Here we analyze immunization, mortality, economic, social, and other health-related factors for 193 countries over 2000–2015. Because each observation belongs to a specific country, the results can help a country identify which factors are associated with its lower life expectancy.

Dataset

The original data has 22 columns and 2938 rows. Missing values were handled before the statistical analysis by manual imputation based on the distribution of each column. The ‘Year’ column has been dropped from the dataset, which leaves the 21 columns described below.

Code + Dataset (Imputed): https://github.com/riteshuppal1402/Life-Expectancy

Dataset (Original): https://www.kaggle.com/kumarajarshi/life-expectancy-who

Description of columns:

1. Life_expectancy — Life expectancy in years

2. Status — Developing/Developed

3. Country — Country name

4. Adult_Mortality — Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)

5. infant_deaths — Number of Infant Deaths per 1000 population

6. Alcohol — Alcohol, recorded per capita (15+) consumption (in liters of pure alcohol)

7. percentage_expenditure — Expenditure on health as a percentage of Gross Domestic Product per capita(%)

8. Hepatitis_B — Hepatitis B (HepB) immunization coverage among 1-year-olds (%)

9. Measles — Number of reported cases per 1000 population

10. BMI — Average Body Mass Index of the entire population

11. under_five_deaths — Number of under-five deaths per 1000 population

12. Polio — Polio (Pol3) immunization coverage among 1-year-olds (%)

13. Total_expenditure — General government expenditure on health as a percentage of total government expenditure (%)

14. Diphtheria — Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)

15. HIV_AIDS — Deaths per 1 000 live births HIV/AIDS (0–4 years)

16. GDP — Gross Domestic Product per capita (in USD)

17. Population — Population of the country

18. thinness_1_19_years — Prevalence of thinness among children and adolescents for Age 10 to 19 (%)

19. thinness_5_9_years — Prevalence of thinness among children for Age 5 to 9 (%)

20. Income_composition_of_resources — Human Development Index in terms of income composition of resources (index ranging from 0 to 1)

21. Schooling — Number of years of schooling (years)

Let’s get started!

In this section, we will try to answer some interesting questions and check whether the data gives enough evidence to support the claims. To do this, we first frame a null and an alternative hypothesis and then use a suitable test to decide whether to reject the null hypothesis or fail to reject it.

Q1 Check if the sample gives enough evidence to say that Developed countries have a higher average life expectancy than Developing countries?

Q2 Using the sample, test whether schooling years (average) has a significant impact on life expectancy?

Q3 Using the sample, check if countries that spend a higher proportion of their resources on human development have a higher life expectancy?

Q4 (Hypothetical) The Indian Government has claimed that it spent an average of around 5.2% of its total expenditure on health over the years 2000–2015. Can you test this claim?

Q5 Compare the proportions of the number of infant deaths and the number of under-five deaths. What is your observation?

Q6 Does life expectancy have a positive or negative correlation with habits like drinking alcohol? Does the result point out any strong conclusion?

Q7 What is the most frequent range of life expectancy?

Q8 Fit a Linear Regression model to the dataset.

Q9 Looking at the data, can you show that immunization against Polio and Diphtheria has a significant effect on life expectancy?

Note: All the conclusions below have been made based on the dataset for the years 2000–2015 and not the present time.

Q1 Check if the sample gives enough evidence to say that Developed countries have a higher average life expectancy than Developing countries?

Which test to use?

Since the population variances are unknown, we will use a two-sample t-test rather than a two-sample z-test to compare the two means. Before running the t-test we need to check whether the variances of the two populations are equal. For this, we will use an F-test.

First, we filter the data and group it by country to obtain the mean life expectancy of each country over the 16 years.
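The steps above can be sketched roughly as follows in base R. This is only an illustration: the `life` data frame below is synthetic toy data standing in for the real dataset, the object names are made up, and the choice to fall back to Welch's t-test when the F-test rejects equal variances is an assumption about how the two tests are chained.

```r
# Synthetic stand-in for the real data (values are made up for illustration)
set.seed(1)
life <- data.frame(
  Country = rep(paste0("C", 1:40), each = 4),
  Status  = rep(c("Developed", "Developing"), times = c(40, 120)),
  Life_expectancy = c(rnorm(40, mean = 79, sd = 2),
                      rnorm(120, mean = 67, sd = 8))
)

# Mean life expectancy per country (Status is constant within a country)
by_country <- aggregate(Life_expectancy ~ Country + Status,
                        data = life, FUN = mean)
developed  <- by_country$Life_expectancy[by_country$Status == "Developed"]
developing <- by_country$Life_expectancy[by_country$Status == "Developing"]

# F-test: H0 says the two population variances are equal
f_res <- var.test(developed, developing)

# One-sided two-sample t-test: H1 says the developed mean is greater.
# var.equal = FALSE gives Welch's t-test (used when the F-test rejects H0).
t_res <- t.test(developed, developing,
                alternative = "greater",
                var.equal = f_res$p.value >= 0.05)

print(f_res$p.value)
print(t_res$p.value)
```

With the real data, `life` would be replaced by the imputed dataset from the repository linked above.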


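The observed p-value of the F-test is less than alpha (0.05 by default). Hence we reject the null hypothesis of equal variances and conclude that the variances of the two populations are not equal, so the t-test should not assume equal variances (Welch’s t-test).

For the t-test, the null hypothesis is that the two means are equal and the alternative is that the mean for developed countries is greater. Since the p-value < 0.05, we reject the null hypothesis and conclude, at the 5% significance level, that life expectancy in developed countries is higher than in developing countries.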

Q2 Education creates awareness about healthy living. For example, vaccine hesitancy during the Covid-19 period, especially among the rural population, has highlighted the importance of education. Using the sample, test whether schooling years (average) has a significant impact on life expectancy?

Which test to use?

We will use a one-way ANOVA test to check whether education is significantly related to life expectancy. We place each country into one of three categories, ‘Low’ (≤8), ‘Medium’ (>8 and ≤12), or ‘High’ (>12), depending on the country’s average years of schooling.

First, we group the data by country and find the average life expectancy and average schooling for each country over the 16 years.

Next, we build a dataframe with average life expectancy and level of education (low, medium, or high) as columns, with one row for each of the 193 countries in the dataset.

Finally, let us now apply the ANOVA test!
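A minimal sketch of the categorization and the ANOVA call is shown here. The `avg` data frame is synthetic and only mimics the shape of the per-country averages; the thresholds 8 and 12 come from the text, while the variable names are illustrative.

```r
# Synthetic per-country averages (made-up values, one row per country)
set.seed(2)
avg <- data.frame(
  Country   = paste0("C", 1:60),
  Schooling = runif(60, min = 4, max = 18)
)
avg$Life_expectancy <- 50 + 1.8 * avg$Schooling + rnorm(60, sd = 3)

# cut() uses right-closed intervals by default:
# (-Inf, 8] -> Low, (8, 12] -> Medium, (12, Inf) -> High
avg$Education <- cut(avg$Schooling,
                     breaks = c(-Inf, 8, 12, Inf),
                     labels = c("Low", "Medium", "High"))

# One-way ANOVA: H0 says mean life expectancy is equal across the levels
fit <- aov(Life_expectancy ~ Education, data = avg)
print(summary(fit))

# p-value of the Education term
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
print(p_value)
```

A significant result only says that at least one group mean differs; a follow-up such as `TukeyHSD(fit)` can show which pairs of levels differ.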

As the p-value < 0.05, we reject the null hypothesis that mean life expectancy is the same across the three education levels, so we can say that education is significantly associated with life expectancy.

Q3 The Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable, and having a decent standard of living. Using the sample, check if countries that spend a higher proportion of their resources on human development have a higher life expectancy?

What to use?

We will use a scatter plot and Pearson correlation coefficient to determine the relationship between life expectancy and income composition of resources.
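As a rough sketch, the scatter plot, the fitted line, and the Pearson coefficient can be produced as below. The two vectors are synthetic stand-ins for Income_composition_of_resources and Life_expectancy, and the variable names are illustrative. For a simple linear regression with one predictor, the R-squared of the fitted line equals the square of the Pearson correlation, which is how a correlation and a "variance explained" figure relate.

```r
# Synthetic stand-ins for the two columns (made-up values)
set.seed(3)
icr      <- runif(100, min = 0.3, max = 0.95)   # income composition index
life_exp <- 45 + 38 * icr + rnorm(100, sd = 3)  # life expectancy in years

# Scatter plot with the least-squares regression line
plot(icr, life_exp,
     xlab = "Income composition of resources",
     ylab = "Life expectancy (years)")
fit <- lm(life_exp ~ icr)
abline(fit)

# Pearson correlation and the share of variance explained by the line
r  <- cor(icr, life_exp, method = "pearson")
r2 <- summary(fit)$r.squared
print(c(r = r, r_squared = r2))
```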

As we can see, countries with a higher income composition of resources have a higher life expectancy. The regression line also explains 82% of the variance in the data. This suggests that investing more in human development goes hand in hand with higher life expectancy, although a correlation by itself does not establish cause and effect.

Q4 (Hypothetical) The Indian Government has claimed that it spent an average of around 5.2% of its total expenditure on health over the years 2000–2015. Can you test this claim?

Which test to use?

We will use a one-sample t-test rather than a one-sample z-test to test the claim, since we have no information about the population variance.

First, we filter the data for ‘India’ and select the Total_expenditure column, as it denotes government expenditure on health as a percentage of total government expenditure.
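A small sketch of the test follows. The `india_exp` vector holds synthetic values, not India's actual figures, and its name is illustrative; with the real data it would be the 16 Total_expenditure values for India. The null hypothesis is that the true mean equals the claimed 5.2.

```r
# Synthetic yearly health-expenditure percentages (made-up values)
set.seed(4)
india_exp <- rnorm(16, mean = 4.5, sd = 0.4)

# Two-sided one-sample t-test of H0: mean = 5.2
res <- t.test(india_exp, mu = 5.2, conf.level = 0.95)
print(res$conf.int)
print(res$p.value)

# The claim is rejected at the 5% level exactly when 5.2 falls
# outside the 95% confidence interval
claim_inside <- res$conf.int[1] <= 5.2 && 5.2 <= res$conf.int[2]
print(claim_inside)
```

Checking the confidence interval and checking the p-value against 0.05 are two views of the same two-sided test, so they always agree.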

Since 5.2 doesn’t lie in the 95% confidence interval [4.232587, 4.689913], we reject the null hypothesis that the mean is 5.2 at the 5% significance level. In other words, the sample does not support the claim made by the Indian Government.

Q5 Compare the proportions of the number of infant deaths and the number of under-five deaths. What is your observation?

Which test to use?

We will conduct a two-proportions z-test to compare the two independent proportions.

Firstly, we will group the data by country and then find the average life expectancy, infant deaths, and under-five deaths for each country.

The infant_deaths column represents the number of infant deaths per 1000 population (not 100), and similarly under_five_deaths represents the number of under-five deaths per 1000 population. We take the average value of infant (or under-five) deaths across all the countries and round it up to the next integer. For example, if the value is 26.2 then we take it as 27, so the proportion becomes 27/1000 = 0.027.

Note: Grouping in the first step is not strictly needed, since we end up averaging all the values anyway. I have grouped here just to follow what we have been doing so far :)

Note: We have not used a continuity correction because n × (estimate of p) > 5 and n × (1 − estimate of p) > 5 for both proportions.

Since the p-value is greater than 0.05, we see no significant difference between the two independent proportions.
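The calculation can be sketched with `prop.test()`, which reports a chi-squared statistic that is the square of the two-proportion z statistic. The value 26.2 is the example from the text; the under-five value 36.4 is purely hypothetical and is used only to make the sketch runnable.

```r
avg_infant <- 26.2   # example value from the text
avg_under5 <- 36.4   # hypothetical value, for illustration only

# Round up to whole "deaths per 1000" counts
x <- ceiling(c(avg_infant, avg_under5))
n <- c(1000, 1000)

# Check the rule of thumb for skipping the continuity correction
p_hat <- x / n
rule_ok <- all(n * p_hat > 5) && all(n * (1 - p_hat) > 5)

# Two-proportion test without continuity correction
res <- prop.test(x = x, n = n, correct = FALSE)
print(p_hat)
print(res$p.value)
```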

Q6 Does life expectancy have a positive or negative correlation with habits like drinking alcohol? Does the result point out any strong conclusion?

Which test to use?

We will use the Pearson correlation test.
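A minimal sketch of the test is given below, using synthetic vectors in place of the Alcohol and Life_expectancy columns (the names and values are illustrative). `cor.test()` returns both the coefficient and a p-value, which are worth reading separately: the p-value says whether the correlation differs from zero, while the coefficient says how strong it is.

```r
# Synthetic stand-ins for the two columns (made-up values)
set.seed(6)
alcohol  <- runif(150, min = 0, max = 14)
life_exp <- 66 + 0.6 * alcohol + rnorm(150, sd = 8)

# Pearson correlation test: H0 says the true correlation is zero
res <- cor.test(alcohol, life_exp, method = "pearson")
r <- unname(res$estimate)
print(r)
print(res$p.value)
```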

Wait..wait..wait…Alcohol consumption has a positive correlation with life expectancy? Before we jump to conclusions, let us check one more thing. Why not plot adult mortality rate and alcohol consumption? Fair enough!

Before you pour yourself some or turn your computer upside down, let me surprise you with something. The correlation coefficient, although statistically significant here, is too small to support any strong conclusion.

Since there is no strong correlation in either case, we cannot reach any remarkable conclusion.

Q7 What is the most frequent range of life expectancy?

a) Few observations fall at either extreme, i.e. below 45 years or above 85 years of life expectancy.

b) The majority of observations fall in the range of 65–82 years of life expectancy.
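One way to read off such ranges is a histogram. The sketch below assumes 5-year bins and uses synthetic values in place of the Life_expectancy column; both the bin width and the variable names are illustrative choices, not taken from the article.

```r
# Synthetic life-expectancy values (made-up, for illustration)
set.seed(7)
life_exp <- c(rnorm(400, mean = 73, sd = 5), rnorm(150, mean = 58, sd = 7))

# 5-year bins that are guaranteed to cover the observed range
lo <- floor(min(life_exp) / 5) * 5
hi <- ceiling(max(life_exp) / 5) * 5
h <- hist(life_exp, breaks = seq(lo, hi, by = 5),
          xlab = "Life expectancy (years)",
          main = "Distribution of life expectancy")

# Bin with the highest count
i <- which.max(h$counts)
modal_bin <- c(h$breaks[i], h$breaks[i + 1])
print(modal_bin)
```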

Q8 Fit a Linear Regression model to the dataset.

The model explains 83% of the variance in life expectancy. In the model summary, the coefficients marked with *** are the most significant, followed by those marked with ** and *.
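For reference, a fit of this kind can be sketched with `lm()`. The data frame below is synthetic and has only three predictors, chosen for illustration; the article's model presumably uses more of the dataset's columns. The significance stars appear in the printed `summary()` output, and the underlying p-values sit in the `Pr(>|t|)` column of the coefficient table.

```r
# Synthetic data with a few predictors (made-up values)
set.seed(8)
n <- 200
d <- data.frame(
  Schooling       = runif(n, min = 4, max = 18),
  Adult_Mortality = runif(n, min = 50, max = 500),
  Alcohol         = runif(n, min = 0, max = 14)
)
d$Life_expectancy <- 60 + 1.2 * d$Schooling -
  0.03 * d$Adult_Mortality + rnorm(n, sd = 3)

# Regress life expectancy on all other columns
fit <- lm(Life_expectancy ~ ., data = d)
s <- summary(fit)
print(s)                 # coefficient table with significance stars

r_squared <- s$r.squared
p_values  <- s$coefficients[, "Pr(>|t|)"]
print(r_squared)
print(p_values)
```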

Q9 In Covid-19 times we all have seen the importance of immunization against the virus to increase life expectancy. Looking at the data, can you show that immunization against Polio and Diphtheria has a significant effect on life expectancy?

We will use a two-way ANOVA test. We divide the countries into two categories for both Polio and Diphtheria: countries whose immunization coverage (%) among one-year-olds is above the median (taken as 85 in the steps below) get the category ‘High’, and the rest get ‘Low’.

Step 1: Countries with mean Polio coverage among one-year-olds ≤85 get the label ‘Low’, else ‘High’.

Step 2: Countries with mean Diphtheria coverage among one-year-olds ≤85 get the label ‘Low’, else ‘High’.

Step 3: Merge the two dataframes by country name.

Step 4: Apply the two-way ANOVA test.
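The four steps can be sketched as follows. Both data frames are synthetic per-country averages with illustrative names, and the model shown is additive (two main effects, no interaction term), which is an assumption since the text does not say whether an interaction was included.

```r
# Synthetic per-country mean coverage and life expectancy (made-up values)
set.seed(9)
countries <- paste0("C", 1:80)
polio <- data.frame(Country = countries, Polio = runif(80, 50, 99))
dipht <- data.frame(Country = countries, Diphtheria = runif(80, 50, 99))
polio$Life_expectancy <- 55 + 0.1 * polio$Polio +
  0.1 * dipht$Diphtheria + rnorm(80, sd = 3)

# Steps 1 and 2: label coverage <= 85 as Low, otherwise High
polio$Polio_level      <- factor(ifelse(polio$Polio <= 85, "Low", "High"))
dipht$Diphtheria_level <- factor(ifelse(dipht$Diphtheria <= 85, "Low", "High"))

# Step 3: merge the two data frames by country name
merged <- merge(polio, dipht, by = "Country")

# Step 4: two-way ANOVA with both factors as main effects
fit <- aov(Life_expectancy ~ Polio_level + Diphtheria_level, data = merged)
print(summary(fit))
```

Replacing `+` with `*` in the formula would add the interaction between the two factors.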

The p-values for both Polio and Diphtheria immunization coverage among one-year-olds are less than 0.05, hence we can say that immunization coverage is significantly associated with life expectancy.

Conclusion

We reached the following results:

1. Life expectancy in developed countries is higher than that of developing countries.

2. Education has a significant impact on life expectancy.

3. Countries with higher income composition of resources for human development have a better life expectancy.

4. There is no significant difference in proportions of the number of infant deaths and the number of under-five deaths.

5. There is no strong correlation between alcohol consumption and life expectancy.

6. The most frequent range of life expectancy is 65–82 years, and the least frequent values are below 45 years and above 85 years.

7. Immunization coverage has a significant impact on life expectancy.

Thanks a lot for reading! Hope to see you in the next article:)


Ritesh Uppal

Got hit in head by waves of data! Research Intern @Samsung | Ex-Business Analyst @UC Berkeley