Exploring Your Data: A Guide to Basic Exploratory Data Analysis with Python
Exploratory Data Analysis (EDA) is a critical step in the data science process that involves analyzing and summarizing data in order to gain insights and understand its characteristics. Python provides several libraries for EDA, such as pandas, matplotlib, and seaborn, which make it easy to perform complex data manipulations and visualization tasks.
Steps for EDA in Python:
- Import the data
- Clean the data
- Summarize the data
- Visualize the data
- Feature engineering
- Repeat the process
Import the data:
To import data into Python using pandas, you can use the `pd.read_csv()` function. Here is an example of how to import a CSV file into a pandas DataFrame:

```python
import pandas as pd

# import the data
df = pd.read_csv('data.csv')
```

In this example, the data from the file "data.csv" is imported into a pandas DataFrame named `df`.
The `pd.read_csv()` function is a convenient way to read data from a CSV file, but pandas also provides other functions for importing data from different sources, such as `pd.read_excel()` for Excel files, `pd.read_sql()` for SQL databases, and `pd.read_json()` for JSON files.
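As a quick sketch, here is how those functions might be used. The file names, sheet name, table name, and SQLite connection below are placeholders for illustration, not files referenced elsewhere in this article:

```python
import sqlite3

import pandas as pd

# Excel file (reading .xlsx files requires an engine such as openpyxl)
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# JSON file
df_json = pd.read_json('data.json')

# SQL database: here a local SQLite file via a standard DB-API connection
conn = sqlite3.connect('data.db')
df_sql = pd.read_sql('SELECT * FROM my_table', conn)
conn.close()
```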
The `pd.read_csv()` function has several parameters that you can use to customize the way the data is imported into pandas. Some of the most commonly used parameters are:

- `sep`: Specifies the separator used in the CSV file. The default separator is a comma (`,`).
- `header`: Specifies the row number to use as the header. By default, the first row is used as the header. If the CSV file doesn't have a header, you can set `header=None`.
- `names`: Specifies the names of the columns to use if the CSV file doesn't have a header.
- `index_col`: Specifies the column(s) to use as the index of the DataFrame.
- `skiprows`: Specifies the number of rows to skip before starting to read the data.
- `nrows`: Specifies the number of rows to read from the file.
- `usecols`: Specifies the columns to read from the file.
- `dtype`: Specifies the data type of each column.
- `na_values`: Specifies a list of values to be considered as missing or NaN.
Here is an example of how to use some of these parameters:
```python
import pandas as pd

# import the data, specifying the separator and the header
df = pd.read_csv('data.csv', sep='\t', header=0)

# import the data, specifying the names of the columns
df = pd.read_csv('data.csv', header=None, names=['col1', 'col2', 'col3'])

# import the data, using the second column as the index
df = pd.read_csv('data.csv', index_col=1)

# import the data, skipping the first 5 rows
df = pd.read_csv('data.csv', skiprows=5)
```
These are just a few examples of the many options you have for importing data into pandas using the `pd.read_csv()` function. By using the appropriate parameters, you can easily customize the way the data is imported and ensure that it is in the format you need for your analysis.
Clean the data:
Cleaning data is an important step in the Exploratory Data Analysis (EDA) process that helps to ensure that the data is in a format that can be used for analysis. In pandas, there are several techniques for cleaning data, including:
- Removing missing values: Use the `df.dropna()` or `df.fillna()` functions to remove or fill missing values, respectively.
- Removing duplicates: Use the `df.drop_duplicates()` function to remove duplicate rows from the DataFrame.
- Removing outliers: Use boolean indexing and/or the `df.query()` function to remove outliers from the DataFrame.
- Handling incorrect data types: Use the `df.astype()` function to convert columns to the correct data type.
- Handling text data: Use the `.str.replace()` or `.str.strip()` string methods on a text column (for example, `df['col'].str.strip()`) to replace or strip unwanted characters from text data.
Here is a simple example of how to remove missing values and handle incorrect data types:
```python
import pandas as pd

# load the data
df = pd.read_csv('data.csv')

# remove missing values
df = df.dropna()

# convert a column to a specific data type
df['col1'] = df['col1'].astype(int)
```

In this example, missing values are removed using the `df.dropna()` function and a column is converted to an integer data type using the `df.astype()` function.
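The remaining cleaning techniques from the list above follow the same pattern. A short sketch, using hypothetical 'price' and 'name' columns purely for illustration:

```python
import pandas as pd

df = pd.read_csv('data.csv')

# remove duplicate rows
df = df.drop_duplicates()

# remove outliers with boolean indexing: keep 'price' within 3 standard deviations of the mean
mean, std = df['price'].mean(), df['price'].std()
df = df[(df['price'] - mean).abs() <= 3 * std]

# a similar filter written with df.query(), using precomputed bounds
low, high = df['price'].quantile(0.01), df['price'].quantile(0.99)
df = df.query('price >= @low and price <= @high')

# strip whitespace and unwanted characters from a text column
df['name'] = df['name'].str.strip().str.replace('$', '', regex=False)
```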
Summarize the Data:
Summarizing a dataset is an important step in Exploratory Data Analysis (EDA) that provides an overview of the main characteristics of the data. In pandas, there are several functions and methods that you can use to summarize a dataset, including:
- `df.head()`: Returns the first `n` rows of the DataFrame.
- `df.tail()`: Returns the last `n` rows of the DataFrame.
- `df.shape`: Returns the number of rows and columns in the DataFrame.
- `df.describe()`: Generates descriptive statistics of the numerical columns in the DataFrame.
- `df.info()`: Provides information about the DataFrame, including the number of rows, columns, non-missing values, and data types of the columns.
- `df.columns`: Returns the column names of the DataFrame.
- `df['col'].value_counts()`: Counts how many times each unique value appears in a column.
- `df.corr()`: Calculates the pairwise correlation between the numerical columns in the DataFrame.
- `df.groupby()`: Groups the data by one or more columns and aggregates the remaining columns.
- `df.pivot_table()`: Creates a pivot table from the DataFrame, allowing you to aggregate and summarize the data in multiple ways.
- `df.mean()`, `df.median()`, `df.min()`, `df.max()`: Calculate the mean, median, minimum, and maximum of the numerical columns, respectively.
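A minimal sketch of these summaries in action; the 'category' and 'price' column names are placeholders, not columns from a real dataset:

```python
import pandas as pd

df = pd.read_csv('data.csv')

print(df.shape)       # (number of rows, number of columns)
print(df.head())      # first 5 rows
df.info()             # column names, dtypes, and non-null counts
print(df.describe())  # count, mean, std, min, quartiles, max of numeric columns

# how often each value occurs in a single column
print(df['category'].value_counts())

# mean of a numeric column within each category
print(df.groupby('category')['price'].mean())

# pairwise correlations between the numeric columns
print(df.select_dtypes('number').corr())
```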
Visualize the data:
Scatter plot: Scatter plots are used when you have two continuous variables and want to see if there is any relationship between them. They can be used to visualize the relationship between two variables, and can also be used to identify any outliers in the data.
Line plot: Line plots are used when you have a continuous variable measured over an ordered variable, such as time, and want to see how it changes. They are commonly used to visualize trends in time series data.
Bar plot: Bar plots are used when you have one categorical variable and one continuous variable and want to compare the mean or count of the continuous variable across the categories.
Histogram: Histograms are used when you have one continuous variable and want to visualize the distribution of the variable. They can help you to see the distribution of the data and identify any skewness in the data.
Density plot: Density plots are used when you have one continuous variable and want to visualize the distribution of the variable. They are similar to histograms, but are smoother and can be used to see the overall shape of the distribution.
Box plot: Box plots are used when you have one continuous variable and want to visualize its distribution and identify any outliers. They summarize the data by its quartiles, which also makes any skewness easy to see.
Violin plot: Violin plots are used when you have one continuous variable and one categorical variable and want to visualize the distribution of the continuous variable across the categories. They are similar to box plots, but also show the full shape of the distribution.
Swarm plot: Swarm plots are used when you have one categorical variable and one continuous variable and want to visualize the distribution of the continuous variable across the categories, drawing every observation as a non-overlapping point.
Pair plot: Pair plots are used when you have multiple continuous variables and want to visualize the relationship between all of them. They can be used to quickly explore the relationships between variables in a dataset.
Joint plot: Joint plots are used when you have two continuous variables and want to visualize the relationship between them. They are similar to scatter plots, but also display the distributions of each variable along the sides of the plot.
Here is how to create each of these plots using seaborn, along with example code for each plot (assuming `import seaborn as sns` and a DataFrame `df`):

- Scatter plot: `sns.scatterplot(x='x_column_name', y='y_column_name', data=df)`
- Line plot: `sns.lineplot(x='x_column_name', y='y_column_name', data=df)`
- Bar plot: `sns.barplot(x='x_column_name', y='y_column_name', data=df)`
- Histogram: `sns.histplot(x='column_name', data=df)`
- Density plot: `sns.kdeplot(x='column_name', data=df)`
- Box plot: `sns.boxplot(x='column_name', data=df)`
- Violin plot: `sns.violinplot(x='column_name', data=df)`
- Swarm plot: `sns.swarmplot(x='x_column_name', y='y_column_name', data=df)`
- Pair plot: `sns.pairplot(df)`
- Joint plot: `sns.jointplot(x='x_column_name', y='y_column_name', data=df)`
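To render any of these plots in a script, you also need the imports and a call to display the figure. A minimal sketch, assuming `df` has been loaded and the placeholder column names are replaced with real ones:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv('data.csv')

# a histogram of one column
sns.histplot(x='column_name', data=df)
plt.show()

# a scatter plot of two columns
sns.scatterplot(x='x_column_name', y='y_column_name', data=df)
plt.show()
```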
Feature Engineering:
Encoding categorical variables: Converting categorical variables into a numerical representation so that they can be used by machine learning algorithms. Techniques such as one-hot encoding, label encoding, and target encoding can be used to encode categorical variables.
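For example, one-hot and label encoding can be done directly in pandas. A small sketch with a hypothetical 'color' column (target encoding is omitted because it also needs the target values):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# one-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df['color'], prefix='color')

# label encoding: map each category to an integer code
df['color_code'] = df['color'].astype('category').cat.codes

df = pd.concat([df, one_hot], axis=1)
print(df)
```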
Scaling numerical variables: Scaling numerical variables so that they are in the same range and can be used by algorithms that are sensitive to the scale of the input data. Techniques such as min-max scaling and standardization can be used to scale numerical variables.
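Both techniques are simple column arithmetic in pandas; 'col1' below is a placeholder numeric column:

```python
import pandas as pd

df = pd.read_csv('data.csv')
col = df['col1']

# min-max scaling: rescale 'col1' to the [0, 1] range
df['col1_minmax'] = (col - col.min()) / (col.max() - col.min())

# standardization: zero mean and unit standard deviation
df['col1_standardized'] = (col - col.mean()) / col.std()
```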
Feature extraction: Extracting new features from the raw data, such as the sum, mean, standard deviation, and so on, to provide additional information to the model.
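A sketch of extracting aggregate features, assuming hypothetical 'customer_id' and 'amount' columns:

```python
import pandas as pd

df = pd.read_csv('data.csv')

# per-group aggregates broadcast back onto each row
df['amount_mean'] = df.groupby('customer_id')['amount'].transform('mean')
df['amount_std'] = df.groupby('customer_id')['amount'].transform('std')

# a row-wise aggregate over several numeric columns
df['row_total'] = df[['col1', 'col2', 'col3']].sum(axis=1)
```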
Feature selection: Selecting the most important features from the raw data to reduce noise and overfitting, and improve the performance of the model.
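As a simple sketch of correlation-based selection, assuming a hypothetical numeric 'target' column and an arbitrary threshold:

```python
import pandas as pd

df = pd.read_csv('data.csv')

# absolute correlation of each numeric feature with the target
corr = df.select_dtypes('number').corr()['target'].abs().drop('target')

# keep only features whose correlation with the target exceeds the threshold
selected = corr[corr > 0.1].index.tolist()
X = df[selected]
print(selected)
```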
Feature creation: Creating new features by combining existing features, transforming features, and so on, to provide more information to the model.
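Creating new features is again just column arithmetic; the 'price', 'quantity', and 'order_date' columns below are placeholders:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')

# ratio of two existing columns
df['price_per_unit'] = df['price'] / df['quantity']

# log transform of a skewed feature
df['log_price'] = np.log1p(df['price'])

# date parts extracted from a datetime column
df['order_date'] = pd.to_datetime(df['order_date'])
df['order_month'] = df['order_date'].dt.month
```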
Dimensionality reduction: Reducing the number of features by transforming them into a lower-dimensional space while retaining the most important information.
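A minimal PCA sketch using scikit-learn (a library not otherwise used in this article); it assumes the numeric features have already been scaled:

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv('data.csv')
X = df.select_dtypes('number')

# project the features onto the 2 directions of highest variance
pca = PCA(n_components=2)
components = pca.fit_transform(X)

reduced = pd.DataFrame(components, columns=['pc1', 'pc2'])
print(pca.explained_variance_ratio_)
```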
Feature engineering is an iterative process and can involve a combination of these techniques to create the best possible features for a given problem. By carefully selecting and transforming the features used as input to the model, you can greatly improve its performance and get better results.