A Data Analysis Using Python

Mar 28, 2023

By: Emmanuel K. Musyoki

Introduction

This notebook cleans, analyzes, and visualizes a World Bank CSV dataset on GDP growth across countries and regions of the world from 2010 to 2021.

Section 1: Exploratory Data Analysis

Removing unwanted columns
Checking and Handling Missing Data (Null Values)
Checking any anomalies and plotting them
Visualizing the data
Conclusion

In [86]:

#importing modules
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [87]:

# Load the dataset
Gdp_data = pd.read_csv(r"C:\Users\USER\Desktop\PYCHARM\csv _excel_ files\economic_indicator2.csv")

In [88]:

Gdp_data.head()

Out[88]:

**A Screenshot of the Dataset head after running the code above**

In [89]:

Gdp_data.describe()

Out[89]:

8 rows × 61 columns

We want data from 2010 to 2021 only,
so we delete the columns for all other years along with the indicator code column.

In [90]:

# Drop the indicator code column and the pre-2010 year columns (positions 3 to 52)
Gdp_data.drop(columns=Gdp_data.columns[3:53], inplace=True)
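Dropping by position works only while the column order of the file stays the same. As a hedged alternative, the sketch below selects the wanted columns by label. The toy frame, its values, and the names `toy`, `id_cols`, `year_cols`, and `trimmed` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy frame that mimics the World Bank layout (all values are invented).
toy = pd.DataFrame({
    "Country Name": ["A", "B"],
    "Country Code": ["AAA", "BBB"],
    "Indicator Name": ["GDP growth (annual %)"] * 2,
    "Indicator Code": ["X", "X"],
    "2008": [1.0, 2.0],
    "2009": [0.5, -1.0],
    "2010": [3.0, 4.0],
    "2011": [2.5, 3.5],
})

# Build the list of columns to keep by label instead of by position.
id_cols = ["Country Name", "Country Code", "Indicator Name"]
# For the real dataset this would be range(2010, 2022).
year_cols = [str(year) for year in range(2010, 2012)]

trimmed = toy[id_cols + year_cols]
print(list(trimmed.columns))
```

Selecting by label fails loudly with a KeyError if an expected column is missing, which is usually easier to debug than a silent off-by-one in a positional slice.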

Only the data I want to work with remains. Let's look at the summary after deleting the unwanted columns.

In [91]:

Gdp_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country Name    266 non-null    object 
 1   Country Code    266 non-null    object 
 2   Indicator Name  266 non-null    object 
 3   2010            254 non-null    float64
 4   2011            254 non-null    float64
 5   2012            254 non-null    float64
 6   2013            254 non-null    float64
 7   2014            256 non-null    float64
 8   2015            255 non-null    float64
 9   2016            254 non-null    float64
 10  2017            254 non-null    float64
 11  2018            254 non-null    float64
 12  2019            252 non-null    float64
 13  2020            249 non-null    float64
 14  2021            244 non-null    float64
dtypes: float64(12), object(3)
memory usage: 31.3+ KB

HANDLING MISSING DATA

1. Checking Null Values

Total number of null values in each column

In [92]:

Gdp_data.isnull().sum()

Out[92]:

Country Name       0
Country Code       0
Indicator Name     0
2010              12
2011              12
2012              12
2013              12
2014              10
2015              11
2016              12
2017              12
2018              12
2019              14
2020              17
2021              22
dtype: int64
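Raw counts are easier to judge as a share of all rows. A minimal sketch on invented data follows; the frame `toy` and the name `missing_pct` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Invented values; NaN marks a missing observation.
toy = pd.DataFrame({
    "2020": [1.0, np.nan, -3.0, np.nan],
    "2021": [2.0, 4.0, np.nan, 5.0],
})

# isnull() yields booleans; their column mean is the fraction of missing entries.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```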

Filling null values with the mean of each individual column
The new dataset without null values is named Gdp_data1

In [93]:

# Fill each column's nulls with that column's own mean
Gdp_data1 = Gdp_data.fillna(Gdp_data.mean(numeric_only=True))

In [94]:

Gdp_data1.isnull().sum()

Out[94]:

Country Name      0
Country Code      0
Indicator Name    0
2010              0
2011              0
2012              0
2013              0
2014              0
2015              0
2016              0
2017              0
2018              0
2019              0
2020              0
2021              0
dtype: int64

There are no null values now, as seen above: each one has been replaced with the mean of its column.
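How a per-column mean fill behaves can be checked on a small example. This is a sketch on invented data, not the notebook's dataset; `toy`, `col_means`, and `filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "2010": [2.0, 4.0, np.nan],
    "2011": [np.nan, 10.0, 20.0],
})

# mean(numeric_only=True) returns one mean per numeric column.
col_means = toy.mean(numeric_only=True)

# fillna aligns the Series of means with the columns by label,
# so each gap receives the mean of its own column.
filled = toy.fillna(col_means)
print(filled)
```

Passing a single scalar to `fillna` instead would write the same number into every column, which differs from a per-column fill whenever the yearly means differ.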

In [95]:

Gdp_data1.head()

Rounding Off the values in every column to one decimal place

In [96]:

Gdp_data1.round(1)

Out[96]:

266 rows × 15 columns

Cleaned Data

START VISUALIZATION

You can visualize GDP growth across countries for a single year, or for each country across several years.
You can also compare GDP growth between two years.

In [97]:

import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [145]:

plt.figure(figsize=(10,6))
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(Gdp_data1['2010'], kde=False)
plt.title('GDP percentage growth', size=16)
plt.ylabel('Country count');

From the graph above we can see that:

the majority of countries' GDP growth lies between 2% and 7%
there are a few outliers
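A reading taken from a histogram can be backed by a number. The sketch below computes the share of values inside a range; the values in `growth` are invented, so the resulting share says nothing about the real dataset.

```python
import pandas as pd

# Invented growth rates in percent.
growth = pd.Series([1.5, 2.0, 3.2, 4.8, 6.9, 7.0, 9.4, -2.1])

# between() includes both endpoints by default.
in_range = growth.between(2, 7)

# The mean of a boolean Series is the fraction of True values.
share = in_range.mean()
print(f"{share:.1%} of values lie between 2% and 7%")
```

On the notebook's data the same two lines could be applied to `Gdp_data1['2010']`.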

Identifying Outliers

Selecting Columns to plot

In [99]:

To_plot = Gdp_data1.drop(columns=['Country Code','Indicator Name' ]).select_dtypes(include=np.number)

Subplots

In [100]:

To_plot.plot(subplots=True, layout=(4,4), kind='box', figsize=(12,14), patch_artist=True)
plt.subplots_adjust(wspace=0.5)

Out[100]:

Most countries' GDP growth is between 0% and 6%
In 2020, GDP growth was negative for most countries
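Box plots in matplotlib draw whiskers at 1.5 times the interquartile range by default, and points beyond them appear as outliers. The same rule can be applied numerically to list which entries are flagged. This is a sketch on invented values; `growth`, `lower`, `upper`, and `outliers` are hypothetical names.

```python
import pandas as pd

# Invented growth rates in percent, labelled with made-up country codes.
growth = pd.Series(
    [2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 25.0, -15.0],
    index=list("ABCDEFGHI"),
)

# Interquartile range and the usual 1.5 * IQR fences.
q1 = growth.quantile(0.25)
q3 = growth.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the entries that fall outside the fences.
outliers = growth[(growth < lower) | (growth > upper)]
print(outliers)
```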

Statistical Analysis

Step 1: Filter the data down to 10 countries only
I will call the result Gdp_data2

Dropping columns to keep only 2020 and 2021, and rows to keep only 10 countries

In [102]:

# Keep the three identifier columns plus 2020 and 2021
Gdp_data2 = Gdp_data1.drop(columns=Gdp_data1.columns[3:13])
Gdp_data2.info()

# Keep only the first 10 rows (index 0 to 9) so that 10 countries remain
Gdp_data2.drop(Gdp_data2.index[10:], inplace=True)

Plot the GDP growth rate in 2020

In [142]:

plt.figure(figsize=(5,4))
Gdp_data2.groupby('Country Name')['2020'].sum().sort_values(ascending=False).plot(kind='bar')
plt.title('2020 GDP growth rate', size=16)
plt.ylabel('percentage rate');

Plot the GDP growth rate in 2021

In [143]:

plt.figure(figsize=(5,4))
Gdp_data2.groupby('Country Name')['2021'].sum().sort_values(ascending=False).plot(kind='bar')
plt.title('2021 GDP Growth Rate', size=16)
plt.ylabel('percentage rate');

Observations for the two years

In 2020, every economy in the sample recorded negative GDP growth.
The COVID-19 pandemic caused this.
Aruba had the steepest GDP decline in 2020, followed by Andorra, while in 2021 Aruba had the highest GDP growth, followed by Argentina.
Africa Western and Central was the least affected by the pandemic in 2020, with GDP growth falling to only -0.1%.
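The comparison between the two years can also be expressed as a single number per country: the change in growth rate from 2020 to 2021, in percentage points. A minimal sketch on invented values follows; the frame `toy`, the column `change`, and the name `ranked` are assumptions for illustration.

```python
import pandas as pd

# Invented growth rates in percent for three hypothetical countries.
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "2020": [-10.0, -0.5, -4.0],
    "2021": [12.0, 3.0, 1.0],
}).set_index("Country Name")

# Change in growth rate between the two years, in percentage points.
toy["change"] = toy["2021"] - toy["2020"]

# Largest swing first.
ranked = toy.sort_values("change", ascending=False)
print(ranked)
```

A large positive change after a deep decline partly reflects a low starting base, so it is worth reading alongside the 2020 figure rather than on its own.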