Exploring Your Data: A Guide to Basic Exploratory Data Analysis with Python
Exploratory Data Analysis (EDA) is a critical step in the data science process that involves analyzing and summarizing data in order to gain insights and understand its characteristics. Python provides several libraries for EDA, such as pandas, matplotlib, and seaborn, which make it easy to perform complex data manipulations and visualization tasks.
Steps for EDA in Python:
- Import the data
- Clean the data
- Summarize the data
- Visualize the data
- Feature engineering
- Repeat the process
Import the data:
To import data into Python using pandas, you can use the `pd.read_csv()` function. Here is an example of how to import a CSV file into a pandas DataFrame:

```python
import pandas as pd

# import the data
df = pd.read_csv('data.csv')
```

In this example, the data from the file "data.csv" is imported into a pandas DataFrame named `df`.
The `pd.read_csv()` function is a convenient way to read data from a CSV file, but pandas also provides other functions for importing data from different sources, such as `pd.read_excel()` for Excel files, `pd.read_sql()` for SQL databases, and `pd.read_json()` for JSON files.
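As a quick sketch, here is how those functions might be used. The file names, sheet name, table name, and SQLite connection below are placeholders for illustration, not files referenced elsewhere in this article:

```python
import sqlite3

import pandas as pd

# Excel file (reading .xlsx files requires an engine such as openpyxl)
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# JSON file
df_json = pd.read_json('data.json')

# SQL database: here a local SQLite file via a standard DB-API connection
conn = sqlite3.connect('data.db')
df_sql = pd.read_sql('SELECT * FROM my_table', conn)
conn.close()
```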
The `pd.read_csv()` function has several parameters that you can use to customize the way the data is imported into pandas. Some of the most commonly used parameters are:

- `sep`: Specifies the separator used in the CSV file. The default separator is a comma (`,`).
- `header`: Specifies the row number to use as the header. By default, the first row is used as the header. If the CSV file doesn't have a header, you can set `header=None`.
- `names`: Specifies the names of the columns to use if the CSV file doesn't have a header.
- `index_col`: Specifies the column(s) to use as the index of the DataFrame.
- `skiprows`: Specifies the number of rows to skip before starting to read the data.
- `nrows`: Specifies the number of rows to read from the file.
- `usecols`: Specifies the columns to read from the file.
- `dtype`: Specifies the data type of each column.
- `na_values`: Specifies a list of values to be considered as missing or NaN.
Here is an example of how to use some of these parameters:
```python
import pandas as pd

# import the data, specifying the separator and the header
df = pd.read_csv('data.csv', sep='\t', header=0)

# import the data, specifying the names of the columns
df = pd.read_csv('data.csv', header=None, names=['col1', 'col2', 'col3'])

# import the data, using the second column as the index
df = pd.read_csv('data.csv', index_col=1)

# import the data, skipping the first 5 rows
df = pd.read_csv('data.csv', skiprows=5)
```
These are just a few examples of the many options you have for importing data into pandas using the `pd.read_csv()` function. By using the appropriate parameters, you can easily customize the way the data is imported and ensure that it is in the format you need for your analysis.
Clean the data:
Cleaning data is an important step in the Exploratory Data Analysis (EDA) process that helps to ensure that the data is in a format that can be used for analysis. In pandas, there are several techniques for cleaning data, including:
- Removing missing values: Use the `df.dropna()` or `df.fillna()` functions to remove or fill missing values, respectively.
- Removing duplicates: Use the `df.drop_duplicates()` function to remove duplicate rows from the DataFrame.
- Removing outliers: Use boolean indexing and/or the `df.query()` function to remove outliers from the DataFrame.
- Handling incorrect data types: Use the `df.astype()` function to convert columns to the correct data type.
- Handling text data: Use the `.str.replace()` or `.str.strip()` string methods on a text column (for example, `df['col'].str.strip()`) to replace or strip unwanted characters from text data.
Here is a simple example of how to remove missing values and handle incorrect data types:
```python
import pandas as pd

# load the data
df = pd.read_csv('data.csv')

# remove missing values
df = df.dropna()

# convert a column to a specific data type
df['col1'] = df['col1'].astype(int)
```

In this example, missing values are removed using the `df.dropna()` function and a column is converted to an integer data type using the `df.astype()` function.
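The remaining cleaning techniques from the list above follow the same pattern. A short sketch, using hypothetical 'price' and 'name' columns purely for illustration:

```python
import pandas as pd

df = pd.read_csv('data.csv')

# remove duplicate rows
df = df.drop_duplicates()

# remove outliers with boolean indexing: keep 'price' within 3 standard deviations of the mean
mean, std = df['price'].mean(), df['price'].std()
df = df[(df['price'] - mean).abs() <= 3 * std]

# a similar filter written with df.query(), using precomputed bounds
low, high = df['price'].quantile(0.01), df['price'].quantile(0.99)
df = df.query('price >= @low and price <= @high')

# strip whitespace and unwanted characters from a text column
df['name'] = df['name'].str.strip().str.replace('$', '', regex=False)
```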
Summarize the Data:
Summarizing a dataset is an important step in Exploratory Data Analysis (EDA) that provides an overview of the main characteristics of the data. In pandas, there are several functions and methods that you can use to summarize a dataset, including:
- `df.head()`: Returns the first `n` rows of the DataFrame.
- `df.tail()`: Returns the last `n` rows of the DataFrame.
- `df.shape`: Returns the number of rows and columns in the DataFrame.
- `df.describe()`: Generates descriptive statistics of the numerical columns in the DataFrame.
- `df.info()`: Provides information about the DataFrame, including the number of rows, columns, non-missing values, and data types of the columns.
- `df.columns`: Returns the column names of the DataFrame.
- `df['col'].value_counts()`: Counts how many times each unique value appears in a column.
- `df.corr()`: Calculates the pairwise correlation between the numerical columns in the DataFrame.
- `df.groupby()`: Groups the data by one or more columns and aggregates the remaining columns.
- `df.pivot_table()`: Creates a pivot table from the DataFrame, allowing you to aggregate and summarize the data in multiple ways.
- `df.mean()`, `df.median()`, `df.min()`, `df.max()`: Calculate the mean, median, minimum, and maximum of the numerical columns, respectively.
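A minimal sketch of these summaries in action; the 'category' and 'price' column names are placeholders, not columns from a real dataset:

```python
import pandas as pd

df = pd.read_csv('data.csv')

print(df.shape)       # (number of rows, number of columns)
print(df.head())      # first 5 rows
df.info()             # column names, dtypes, and non-null counts
print(df.describe())  # count, mean, std, min, quartiles, max of numeric columns

# how often each value occurs in a single column
print(df['category'].value_counts())

# mean of a numeric column within each category
print(df.groupby('category')['price'].mean())

# pairwise correlations between the numeric columns
print(df.select_dtypes('number').corr())
```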
Visualize the data:
Scatter plot: Scatter plots are used when you have two continuous variables and want to see if there is any relationship between them. They can be used to visualize the relationship between two variables, and can also be used to identify any outliers in the data.
Line plot: Line plots are used when you have a continuous variable measured over an ordered variable, such as time, and want to see how it changes. They are commonly used to visualize trends in time series data.
Bar plot: Bar plots are used when you have one categorical variable and one continuous variable and want to compare the mean or count of the continuous variable across the categories.
Histogram: Histograms are used when you have one continuous variable and want to visualize the distribution of the variable. They can help you to see the distribution of the data and identify any skewness in the data.
Density plot: Density plots are used when you have one continuous variable and want to visualize the distribution of the variable. They are similar to histograms, but are smoother and can be used to see the overall shape of the distribution.
Box plot: Box plots are used when you have one continuous variable and want to visualize its distribution and identify any outliers. They summarize the data by its quartiles, which also makes any skewness easy to see.
Violin plot: Violin plots are used when you have one continuous variable and one categorical variable and want to visualize the distribution of the continuous variable across the categories. They are similar to box plots, but also show the full shape of the distribution.
Swarm plot: Swarm plots are used when you have one categorical variable and one continuous variable and want to visualize the distribution of the continuous variable across the categories, drawing every observation as a non-overlapping point.
Pair plot: Pair plots are used when you have multiple continuous variables and want to visualize the relationship between all of them. They can be used to quickly explore the relationships between variables in a dataset.
Joint plot: Joint plots are used when you have two continuous variables and want to visualize the relationship between them. They are similar to scatter plots, but also display the distributions of each variable along the sides of the plot.
Here is how to create each of these plots using seaborn, along with example code for each plot (assuming `import seaborn as sns` and a DataFrame `df`):

- Scatter plot: `sns.scatterplot(x='x_column_name', y='y_column_name', data=df)`
- Line plot: `sns.lineplot(x='x_column_name', y='y_column_name', data=df)`
- Bar plot: `sns.barplot(x='x_column_name', y='y_column_name', data=df)`
- Histogram: `sns.histplot(x='column_name', data=df)`
- Density plot: `sns.kdeplot(x='column_name', data=df)`
- Box plot: `sns.boxplot(x='column_name', data=df)`
- Violin plot: `sns.violinplot(x='column_name', data=df)`
- Swarm plot: `sns.swarmplot(x='x_column_name', y='y_column_name', data=df)`
- Pair plot: `sns.pairplot(df)`
- Joint plot: `sns.jointplot(x='x_column_name', y='y_column_name', data=df)`
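To render any of these plots in a script, you also need the imports and a call to display the figure. A minimal sketch, assuming `df` has been loaded and the placeholder column names are replaced with real ones:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv('data.csv')

# a histogram of one column
sns.histplot(x='column_name', data=df)
plt.show()

# a scatter plot of two columns
sns.scatterplot(x='x_column_name', y='y_column_name', data=df)
plt.show()
```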
Feature Engineering:
Encoding categorical variables: Converting categorical variables into a numerical representation so that they can be used by machine learning algorithms. Techniques such as one-hot encoding, label encoding, and target encoding can be used to encode categorical variables.
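For example, one-hot and label encoding can be done directly in pandas. A small sketch with a hypothetical 'color' column (target encoding is omitted because it also needs the target values):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# one-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df['color'], prefix='color')

# label encoding: map each category to an integer code
df['color_code'] = df['color'].astype('category').cat.codes

df = pd.concat([df, one_hot], axis=1)
print(df)
```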
Scaling numerical variables: Scaling numerical variables so that they are in the same range and can be used by algorithms that are sensitive to the scale of the input data. Techniques such as min-max scaling and standardization can be used to scale numerical variables.
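Both techniques are simple column arithmetic in pandas; 'col1' below is a placeholder numeric column:

```python
import pandas as pd

df = pd.read_csv('data.csv')
col = df['col1']

# min-max scaling: rescale 'col1' to the [0, 1] range
df['col1_minmax'] = (col - col.min()) / (col.max() - col.min())

# standardization: zero mean and unit standard deviation
df['col1_standardized'] = (col - col.mean()) / col.std()
```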
Feature extraction: Extracting new features from the raw data, such as the sum, mean, standard deviation, and so on, to provide additional information to the model.
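A sketch of extracting aggregate features, assuming hypothetical 'customer_id' and 'amount' columns:

```python
import pandas as pd

df = pd.read_csv('data.csv')

# per-group aggregates broadcast back onto each row
df['amount_mean'] = df.groupby('customer_id')['amount'].transform('mean')
df['amount_std'] = df.groupby('customer_id')['amount'].transform('std')

# a row-wise aggregate over several numeric columns
df['row_total'] = df[['col1', 'col2', 'col3']].sum(axis=1)
```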
Feature selection: Selecting the most important features from the raw data to reduce noise and overfitting, and improve the performance of the model.
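As a simple sketch of correlation-based selection, assuming a hypothetical numeric 'target' column and an arbitrary threshold:

```python
import pandas as pd

df = pd.read_csv('data.csv')

# absolute correlation of each numeric feature with the target
corr = df.select_dtypes('number').corr()['target'].abs().drop('target')

# keep only features whose correlation with the target exceeds the threshold
selected = corr[corr > 0.1].index.tolist()
X = df[selected]
print(selected)
```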
Feature creation: Creating new features by combining existing features, transforming features, and so on, to provide more information to the model.
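Creating new features is again just column arithmetic; the 'price', 'quantity', and 'order_date' columns below are placeholders:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')

# ratio of two existing columns
df['price_per_unit'] = df['price'] / df['quantity']

# log transform of a skewed feature
df['log_price'] = np.log1p(df['price'])

# date parts extracted from a datetime column
df['order_date'] = pd.to_datetime(df['order_date'])
df['order_month'] = df['order_date'].dt.month
```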
Dimensionality reduction: Reducing the number of features by transforming them into a lower-dimensional space while retaining the most important information.
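A minimal PCA sketch using scikit-learn (a library not otherwise used in this article); it assumes the numeric features have already been scaled:

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv('data.csv')
X = df.select_dtypes('number')

# project the features onto the 2 directions of highest variance
pca = PCA(n_components=2)
components = pca.fit_transform(X)

reduced = pd.DataFrame(components, columns=['pc1', 'pc2'])
print(pca.explained_variance_ratio_)
```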
Feature engineering is an iterative process and can involve a combination of these techniques to create the best possible features for a given problem. By carefully selecting and transforming the features used as input to the model, you can greatly improve its performance and get better results.