A Data Analysis Using Python

Emmanuel Kamaku
4 min read · Mar 28, 2023

By: Emmanuel K. Musyoki

Introduction

This notebook cleans, analyzes, and visualizes a World Bank CSV dataset on GDP growth across countries and regions of the world from 2010 to 2021.

Section 1: Exploratory Data Analysis

  • Removing unwanted columns
  • Checking and Handling Missing Data (Null Values)
  • Checking any anomalies and plotting them
  • Visualizing the data
  • Conclusion

In [86]:

#importing modules
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [87]:

# Load the dataset
Gdp_data = pd.read_csv(r"C:\Users\USER\Desktop\PYCHARM\csv _excel_ files\economic_indicator2.csv")

In [88]:

Gdp_data.head()

Out[88]:

A Screenshot of the Dataset head after running the code above

In [89]:

Gdp_data.describe()

Out[89]:

8 rows × 61 columns

  • We want data from 2010 to 2021 only
  • Delete the columns for the other years and the Indicator Code column

In [90]:

# Drop Indicator Code and the year columns before 2010 (positions 3 to 52)
Gdp_data.drop(columns=Gdp_data.columns[3:53], inplace=True)
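
Dropping by position works only while the column order stays the same. As a hedged alternative, the sketch below keeps columns by name instead. The frame `df` and its values are made up for illustration, and `id_cols`, `year_cols`, and `trimmed` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the World Bank layout; the values are invented.
df = pd.DataFrame({
    "Country Name": ["Country A", "Country B"],
    "Country Code": ["AAA", "BBB"],
    "Indicator Name": ["GDP growth (annual %)"] * 2,
    "Indicator Code": ["X", "X"],
    "2008": [1.0, 2.0],
    "2009": [0.5, -1.0],
    "2010": [3.0, 4.0],
    "2011": [2.5, 3.5],
})

id_cols = ["Country Name", "Country Code", "Indicator Name"]
# Year columns are string labels in this layout, so build strings.
year_cols = [str(year) for year in range(2010, 2012)]

# Keep only the identifier columns and the wanted years, by name.
trimmed = df[id_cols + year_cols]
print(trimmed.columns.tolist())
```

For the real dataset, `range(2010, 2022)` would cover 2010 to 2021, assuming the year columns are labelled with plain strings as the `info()` output suggests.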

This leaves only the data I want to work with. Let's look at the DataFrame's information after deleting the unwanted columns.

In [91]:

Gdp_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Name 266 non-null object
1 Country Code 266 non-null object
2 Indicator Name 266 non-null object
3 2010 254 non-null float64
4 2011 254 non-null float64
5 2012 254 non-null float64
6 2013 254 non-null float64
7 2014 256 non-null float64
8 2015 255 non-null float64
9 2016 254 non-null float64
10 2017 254 non-null float64
11 2018 254 non-null float64
12 2019 252 non-null float64
13 2020 249 non-null float64
14 2021 244 non-null float64
dtypes: float64(12), object(3)
memory usage: 31.3+ KB

HANDLING MISSING DATA

1. Checking Null Values

Total number of null values in every column

In [92]:

Gdp_data.isnull().sum()

Out[92]:

Country Name       0
Country Code 0
Indicator Name 0
2010 12
2011 12
2012 12
2013 12
2014 10
2015 11
2016 12
2017 12
2018 12
2019 14
2020 17
2021 22
dtype: int64

  • Filling null values with the mean of each column
  • Naming the new dataset without null values Gdp_data1

In [93]:

# Fill each column's gaps with that column's own mean
Gdp_data1 = Gdp_data.fillna(Gdp_data.mean(numeric_only=True))
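
The difference between filling with a single number and filling each column with its own mean is easy to miss. Here is a minimal sketch on a made-up frame; `toy`, `scalar_filled`, `column_means`, and `per_column_filled` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Invented values, chosen so the two strategies give visibly different results.
toy = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "2010": [2.0, np.nan, 4.0],
    "2011": [10.0, 20.0, np.nan],
})

# Scalar fill: every gap receives the 2010 mean, whatever column it sits in.
scalar_filled = toy.fillna(toy["2010"].mean())

# Per-column fill: a Series of means is matched to columns by label,
# so each gap receives the mean of its own column.
column_means = toy.mean(numeric_only=True)
per_column_filled = toy.fillna(column_means)

print(scalar_filled)
print(per_column_filled)
```

With the scalar fill, the gap in 2011 is filled with the 2010 mean. With the per-column fill, it is filled with the 2011 mean.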

In [94]:

Gdp_data1.isnull().sum()

Out[94]:

Country Name      0
Country Code 0
Indicator Name 0
2010 0
2011 0
2012 0
2013 0
2014 0
2015 0
2016 0
2017 0
2018 0
2019 0
2020 0
2021 0
dtype: int64

There are no null values now, as seen above; each one has been replaced with the mean of its column.

In [95]:

Gdp_data1.head()

  • Rounding off the values in every column to one decimal place

In [96]:

Gdp_data1 = Gdp_data1.round(1)  # round() returns a new DataFrame, so assign it back
Gdp_data1

Out[96]:

Cleaned Data

266 rows × 15 columns

START VISUALIZATION

  • You can visualize the GDP growth for every year or for each country in different years
  • You can also compare the GDP growth for two years.
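
Plotting one country across several years is often easier once the year columns are reshaped into rows. The sketch below shows that reshaping with `DataFrame.melt` on made-up values; `wide` and `tidy` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Invented wide-format data: one row per country, one column per year.
wide = pd.DataFrame({
    "Country Name": ["Country A", "Country B"],
    "2010": [3.0, 5.0],
    "2011": [2.0, 6.0],
})

# Reshape to long format: one row per (country, year) pair.
tidy = wide.melt(
    id_vars="Country Name",
    var_name="Year",
    value_name="GDP growth (%)",
)
# Year labels arrive as strings; convert them so they sort and plot as numbers.
tidy["Year"] = tidy["Year"].astype(int)
print(tidy)
```

A frame in this shape can be passed to `sns.lineplot(data=tidy, x="Year", y="GDP growth (%)", hue="Country Name")` to draw one line per country.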

In [97]:

import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [145]:

plt.figure(figsize=(10,6))
sns.histplot(Gdp_data1['2010'], kde=False)  # histplot replaces the deprecated distplot
plt.title('GDP percentage growth', size=16)
plt.ylabel('Country count');

A histogram of GDP growth in 2010

From the graph above, we can see that:

  • The majority of countries' GDP growth in 2010 lies between 2% and 7%.
  • There are a few outliers.

Identifying Outliers

  • Selecting Columns to plot

In [99]:

To_plot = Gdp_data1.drop(columns=['Country Code','Indicator Name' ]).select_dtypes(include=np.number)

Subplots

In [100]:

To_plot.plot(subplots=True, layout=(4,4), kind='box', figsize=(12,14), patch_artist=True)
plt.subplots_adjust(wspace=0.5)

Out[100]:

Box plots

  • In most years, the majority of countries' GDP growth falls between 0% and 6%
  • In 2020, GDP growth was negative for most countries
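
Box plots in matplotlib draw, by default, any point more than 1.5 times the interquartile range beyond the quartiles as an outlier. The same rule can be applied in code to list the outliers rather than only seeing them. This is a minimal sketch on invented values; `growth`, `lower`, `upper`, and `outliers` are illustrative names.

```python
import pandas as pd

# Invented growth rates for one year, with one obviously extreme value.
growth = pd.Series([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 25.0], name="2010")

q1 = growth.quantile(0.25)
q3 = growth.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles, matching the default box plot whiskers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = growth[(growth < lower) | (growth > upper)]
print(outliers)
```

On the real data, the same mask could be applied column by column to see which rows fall outside the fences in each year.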

Statistical Analysis

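  • Step 1: Filter the data down to the first 11 rows (index 0 to 10), which mix countries with regional aggregates
  • I will call the result Gdp_data2

In [102]: Dropping columns to keep 2020 and 2021 only, then keeping the first 11 rows

# Drop the 2010-2019 columns (positions 3 to 12), leaving 2020 and 2021
Gdp_data2 = Gdp_data1.drop(columns=Gdp_data1.columns[3:13])
Gdp_data2.info()

# Keep rows 0 to 10; equivalent to dropping index 11 onwards
Gdp_data2 = Gdp_data2.iloc[:11]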
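
Selecting rows by position takes whatever happens to sit at the top of the file. As a hedged alternative, rows can be picked by name with `isin`. The sketch below uses placeholder names and invented values; `toy`, `wanted`, and `subset` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Placeholder rows with invented values.
toy = pd.DataFrame({
    "Country Name": ["Country A", "Region X", "Country B", "Country C"],
    "2020": [-5.0, -1.0, -3.0, -2.0],
    "2021": [4.0, 2.0, 6.0, 1.0],
})

# Name the rows to keep explicitly, so aggregates such as "Region X" can be left out.
wanted = ["Country A", "Country B"]
subset = toy[toy["Country Name"].isin(wanted)].reset_index(drop=True)
print(subset)
```

This keeps the selection stable even if the row order in the source file changes.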

Plot the GDP growth rate in 2020

In [142]:

plt.figure(figsize=(5,4))
Gdp_data2.groupby('Country Name')['2020'].sum().sort_values(ascending=False).plot(kind='bar')
plt.title('2020 GDP growth rate', size=16)
plt.ylabel('percentage rate');

A screenshot of the output of cell 142

Plot the GDP growth rate in 2021

In [143]:

plt.figure(figsize=(5,4))
Gdp_data2.groupby('Country Name')['2021'].sum().sort_values(ascending=False).plot(kind='bar')
plt.title('2021 GDP Growth Rate', size=16)
plt.ylabel('percentage rate');

Observation for the two Years

  • In 2020, every economy in this sample recorded negative GDP growth.
  • The COVID-19 pandemic caused this.
  • Aruba had the steepest GDP decline in 2020, followed by Andorra, while in 2021 Aruba had the highest GDP growth, followed by Argentina.
  • Africa Western and Central was the least affected by the pandemic in 2020, with GDP growth falling only to -0.1%.
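
A comparison like the one above can also be made numerically by subtracting one year's growth rate from the other's. This is a minimal sketch on invented values; `toy`, `rebound`, and `ranked` are illustrative names, and the numbers are not taken from the dataset.

```python
import pandas as pd

# Invented growth rates (annual %) for two years.
toy = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "2020": [-10.0, -0.5, -4.0],
    "2021": [8.0, 3.0, 5.0],
})

# Change in the growth rate between the two years, in percentage points.
toy["rebound"] = toy["2021"] - toy["2020"]

# Largest swing first.
ranked = toy.sort_values("rebound", ascending=False)
print(ranked[["Country Name", "rebound"]])
```

Note that this measures the swing in the growth rate, not the change in GDP itself.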
