A Data Analysis Using Python
By: Emmanuel K. Musyoki
Introduction
This notebook cleans, analyzes, and visualizes World Bank CSV data on GDP growth across countries and regions of the world from 2010 to 2021.
Section 1: Exploratory Data Analysis
- Removing unwanted columns
- Checking and Handling Missing Data (Null Values)
- Checking for anomalies and plotting them
- Visualizing the data
- Conclusion
In [86]:
#importing modules
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
In [87]:
# Load the dataset
Gdp_data = pd.read_csv(r"C:\Users\USER\Desktop\PYCHARM\csv _excel_ files\economic_indicator2.csv")
In [88]:
Gdp_data.head()
Out[88]:
In [89]:
Gdp_data.describe()
Out[89]:
8 rows × 61 columns
- We want data from 2010 to 2021 only,
- so we delete the columns for all other years and the Indicator Code column
In [90]:
# Drop the columns at positions 3 to 52 (Indicator Code and the years before 2010)
Gdp_data.drop(columns=Gdp_data.columns[3:53], inplace=True)
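Dropping by position depends on the exact column order of the file. As a minimal sketch, the same selection could be done by label, assuming the year columns are stored as strings such as '2010' (as the info output suggests). The `demo` frame and all of its values are made up for illustration:

```python
import pandas as pd

# Hypothetical miniature of the raw file: identifier columns plus a few year columns
demo = pd.DataFrame({
    "Country Name": ["A", "B"],
    "Country Code": ["AAA", "BBB"],
    "Indicator Name": ["GDP growth", "GDP growth"],
    "Indicator Code": ["X", "X"],
    "2008": [1.0, 2.0],
    "2009": [0.5, -1.0],
    "2010": [3.0, 4.0],
    "2021": [5.0, 6.0],
})

id_cols = ["Country Name", "Country Code", "Indicator Name"]
# Year labels for 2010..2021, kept only if they are present in the frame
year_cols = [str(y) for y in range(2010, 2022) if str(y) in demo.columns]

# Select the wanted columns instead of dropping the unwanted ones
selected = demo[id_cols + year_cols]
print(selected.columns.tolist())
```

Selecting by label keeps working if the file gains or loses columns, whereas a positional slice such as 3:53 would silently pick different years.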
Only the columns I want to work with remain. Let's look at the DataFrame information after deleting the unwanted columns.
In [91]:
Gdp_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Country Name    266 non-null    object
 1   Country Code    266 non-null    object
 2   Indicator Name  266 non-null    object
 3   2010            254 non-null    float64
 4   2011            254 non-null    float64
 5   2012            254 non-null    float64
 6   2013            254 non-null    float64
 7   2014            256 non-null    float64
 8   2015            255 non-null    float64
 9   2016            254 non-null    float64
 10  2017            254 non-null    float64
 11  2018            254 non-null    float64
 12  2019            252 non-null    float64
 13  2020            249 non-null    float64
 14  2021            244 non-null    float64
dtypes: float64(12), object(3)
memory usage: 31.3+ KB
HANDLING MISSING DATA
1. Checking Null Values
Total number of null values in each column
In [92]:
Gdp_data.isnull().sum()
Out[92]:
Country Name 0
Country Code 0
Indicator Name 0
2010 12
2011 12
2012 12
2013 12
2014 10
2015 11
2016 12
2017 12
2018 12
2019 14
2020 17
2021 22
dtype: int64
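Raw counts are easier to judge relative to the number of rows. A minimal sketch of the percentage of missing values per column, using a made-up `demo` frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values are illustrative only)
demo = pd.DataFrame({
    "2019": [1.0, np.nan, 3.0, 4.0],
    "2020": [np.nan, np.nan, -2.0, -1.0],
})

# isnull() gives booleans; their mean is the fraction of missing rows per column
missing_pct = demo.isnull().mean() * 100
print(missing_pct.round(1))
```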
- Filling null values with the mean of each individual column
- Naming the new dataset without null values Gdp_data1
In [93]:
# Passing the Series of column means fills each gap with the mean of its own column
Gdp_data1 = Gdp_data.fillna(Gdp_data.mean(numeric_only=True))
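The difference between filling with one scalar and filling with per-column means is easy to miss, so here is a small sketch on a made-up `demo` frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "2010": [2.0, 4.0, np.nan],
    "2020": [-6.0, np.nan, -2.0],
})

# One scalar for every gap: the 2010 mean (3.0) also lands in the 2020 column
scalar_fill = demo.fillna(demo["2010"].mean())

# A Series of column means: each gap gets the mean of its own column
column_fill = demo.fillna(demo.mean(numeric_only=True))

print(scalar_fill)
print(column_fill)
```

With the scalar version, a year with mostly negative growth can receive a positive fill value from another year, which would distort later plots.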
In [94]:
Gdp_data1.isnull().sum()
Out[94]:
Country Name 0
Country Code 0
Indicator Name 0
2010 0
2011 0
2012 0
2013 0
2014 0
2015 0
2016 0
2017 0
2018 0
2019 0
2020 0
2021 0
dtype: int64
There are no null values now, as seen above; they have been replaced with the mean of the column.
In [95]:
Gdp_data1.head()
- Rounding Off the values in every column to one decimal place
In [96]:
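# round() returns a new DataFrame, so assign it back to keep the rounded values
Gdp_data1 = Gdp_data1.round(1)
Gdp_data1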
Out[96]:
266 rows × 15 columns
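Note that `DataFrame.round` returns a new object rather than changing the frame in place. A short sketch on a made-up `demo` frame shows the behavior:

```python
import pandas as pd

demo = pd.DataFrame({"2010": [3.14159, 2.71828]})

# round() builds a new DataFrame with the rounded values
rounded = demo.round(1)

print(demo["2010"].tolist())     # the original frame is unchanged
print(rounded["2010"].tolist())  # the rounded copy
```

The rounded values only persist in later cells if the result is assigned to a name.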
Cleaned Data
START VISUALIZATION
- You can visualize the GDP growth for every year or for each country in different years
- You can also compare the GDP growth for two years.
In [97]:
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
In [145]:
plt.figure(figsize=(10,6))
# histplot replaces the deprecated distplot for plain histograms
sns.histplot(Gdp_data1['2010'], kde=False)
plt.title('GDP percentage growth', size=16)
plt.ylabel('Country count');
From the graph above we can see that:
- the majority of countries' GDP growth in 2010 lies between 2% and 7%
- there are a few outliers
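A reading taken from a histogram can be checked numerically. The sketch below counts the share of values inside the 2% to 7% band; the `growth_2010` series holds made-up numbers standing in for the real 2010 column:

```python
import pandas as pd

# Illustrative growth rates, not the real 2010 data
growth_2010 = pd.Series([1.5, 2.5, 3.0, 4.2, 6.8, 7.5, 12.0, -3.0, 5.0, 2.1])

# between() is inclusive on both ends by default
in_range = growth_2010.between(2, 7)

# The mean of a boolean Series is the fraction of True values
share = in_range.mean()
print(f"{share:.0%} of the values lie between 2% and 7%")
```

On the actual data, the same check would be applied to the 2010 column of the cleaned DataFrame.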
Identifying Outliers
- Selecting Columns to plot
In [99]:
To_plot = Gdp_data1.drop(columns=['Country Code','Indicator Name' ]).select_dtypes(include=np.number)
Subplots
In [100]:
To_plot.plot(subplots=True, layout=(4,4), kind='box', figsize=(12,14), patch_artist=True)
plt.subplots_adjust(wspace=0.5)
Out[100]:
- the majority of countries' GDP growth is between 0% and 6%
- in 2020, most countries recorded negative GDP growth
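The points drawn beyond the whiskers of a box plot follow the 1.5 × IQR rule by default in matplotlib. A minimal sketch of listing those rows explicitly, using a made-up `demo` frame with hypothetical country labels:

```python
import pandas as pd

# Illustrative values only; "F" is deliberately extreme
demo = pd.DataFrame({
    "Country Name": ["A", "B", "C", "D", "E", "F"],
    "2020": [-3.0, -2.5, -4.0, -3.5, -2.0, -20.0],
})

# Quartiles and interquartile range of the column
q1 = demo["2020"].quantile(0.25)
q3 = demo["2020"].quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Rows outside the fences are the candidates a box plot would mark as outliers
outliers = demo[(demo["2020"] < lower) | (demo["2020"] > upper)]
print(outliers)
```

Listing the rows by name makes it possible to see which countries or regions sit behind each outlier point.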
Statistical Analysis
- Step 1: Filtering the data to keep only the first 11 rows (countries and regional aggregates)
- I will call it Gdp_data2
In [102]:
# Drop the 2010-2019 columns so that only 2020 and 2021 remain
Gdp_data2 = Gdp_data1.drop(columns=Gdp_data1.columns[3:13])
Gdp_data2.info()
# Keep only the first rows (index 0 to 10) and drop the rest
Gdp_data2 = Gdp_data2.head(11)
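Taking the first rows by position picks whatever happens to come first in the file. As an alternative sketch, a subset can be chosen by name with `isin`. The `wanted` list is a hypothetical selection and the numbers in `demo` are placeholders, not real data:

```python
import pandas as pd

demo = pd.DataFrame({
    "Country Name": ["Aruba", "Angola", "Albania", "Argentina"],
    "2020": [-1.0, -2.0, -3.0, -4.0],  # placeholder values
    "2021": [1.0, 2.0, 3.0, 4.0],      # placeholder values
})

wanted = ["Aruba", "Argentina"]  # hypothetical selection

# Keep only the rows whose name is in the list, then renumber the index
subset = demo[demo["Country Name"].isin(wanted)].reset_index(drop=True)
print(subset)
```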
Plot the GDP growth rate in 2020
In [142]:
plt.figure(figsize=(5,4))
Gdp_data2.groupby('Country Name')['2020'].sum().sort_values(ascending=False).plot(kind='bar')
plt.title('2020 GDP growth rate', size=16)
plt.ylabel('percentage rate');
Plot the GDP growth rate in 2021
In [143]:
plt.figure(figsize=(5,4))
Gdp_data2.groupby('Country Name')['2021'].sum().sort_values(ascending=False).plot(kind='bar')
plt.title('2021 GDP Growth Rate', size=16)
plt.ylabel('percentage rate');
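The two years can also be compared in a single grouped bar chart. This is a sketch under assumptions: the `demo` frame uses hypothetical country labels and made-up values, and the Agg backend line is only there so the code runs as a script without a display (it can be omitted in a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values only
demo = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "2020": [-10.0, -0.5, -4.0],
    "2021": [12.0, 1.0, 3.0],
})

# One group of bars per country, one bar per year
ax = demo.set_index("Country Name")[["2020", "2021"]].plot(kind="bar", figsize=(6, 4))
ax.set_title("GDP growth rate, 2020 vs 2021")
ax.set_ylabel("percentage rate")
plt.tight_layout()
```

Placing both years side by side on one axis makes the rebound from 2020 to 2021 visible per country without switching between figures.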
Observations for the two years
- In 2020, every country and region in the sample recorded negative GDP growth.
- This was caused by the COVID-19 pandemic.
- Aruba had the largest GDP decline in 2020, followed by Andorra, while in 2021 Aruba had the highest GDP growth, followed by Argentina.
- Africa Western and Central was the least affected by the pandemic in 2020, with GDP growth falling only to -0.1%.
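Observations like these can be read off the data directly instead of from the charts. A minimal sketch using `idxmin` and `idxmax` on a made-up `demo` frame (hypothetical labels and values, not the real figures):

```python
import pandas as pd

# Illustrative values only, indexed by a hypothetical country label
demo = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "2020": [-10.0, -0.5, -4.0],
    "2021": [12.0, 1.0, 3.0],
}).set_index("Country Name")

largest_decline = demo["2020"].idxmin()   # most negative growth in 2020
smallest_decline = demo["2020"].idxmax()  # least affected in 2020
strongest_2021 = demo["2021"].idxmax()    # highest growth in 2021

# Change between the two years, in percentage points
demo["change"] = demo["2021"] - demo["2020"]

print(largest_decline, smallest_decline, strongest_2021)
print(demo.sort_values("change", ascending=False))
```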