Learn Pandas for Data Analysis (for complete beginners)

Pandas (not to be confused with the white fluffy animal) is a Python package that helps us manipulate datasets so that we can analyze them. It provides easy-to-use tools for working with structured data, such as spreadsheets, SQL tables, and CSV files. Pandas is an essential library for data scientists, analysts, and anyone who needs to work with data in Python.

Pandas is open-source, which means:

  1. It's freely accessible to anyone without cost, fostering widespread use.

  2. Its source code is openly available, allowing transparency and collaboration.

  3. Users can modify and customize it to suit their specific needs.

  4. The community of contributors continually enhances and maintains it.

  5. Its open-source nature promotes security, reliability, and innovation in Pandas.

Pandas has a variety of functions and methods that help you manipulate and analyze data efficiently. In this blog post, we will learn how to use Pandas for data analysis, starting from the very basics.

You will understand how to sort, summarise, group, and manipulate data with a couple of Pandas methods and functions.

Please ensure you have at least a foundational knowledge of Python before you continue.

Data Structures

Before anything else, you need to understand how data structures work. Pandas mainly deals with two essential data structures:

  • DataFrames: Think of a DataFrame as a table, similar to what you might see in an Excel spreadsheet or a SQL database table. It consists of rows and columns. Each column can contain different types of data (numbers, text, dates), and each row represents a single observation or record.

  • Series: A Series is a one-dimensional array-like object. It can be seen as a single column in a DataFrame. It's a labeled array where each element has a label or index.
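To make the relationship between the two structures concrete, here is a minimal sketch. It assumes Pandas is already installed and imported (both steps are covered in the sections that follow), and the table contents are made up for illustration.

```python
import pandas as pd

# A small hypothetical table: each key becomes a column, each list position a row
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
})

# Selecting one column returns a Series that shares the DataFrame's row index
ages = people["age"]

print(type(people).__name__)  # DataFrame
print(type(ages).__name__)    # Series
print(list(ages.index))       # [0, 1, 2]
```

Because no index was specified, Pandas labels the rows 0, 1, 2, and the Series keeps those same labels.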

Install Pandas

There are two easy ways to install Pandas:

Pip install

pip install pandas

Conda install

conda install -c anaconda pandas

Import Pandas as pd

Before you can load your data into your dataframe, you need to import Pandas using the command:

import pandas as pd

The statement "import pandas as pd" in Python is used to import the Pandas library and give it an alias or nickname, which is "pd" in this case. This alias simplifies the process of referencing Pandas functions and objects in your code.

So, instead of typing "pandas" each time you want to use a Pandas function, you can use the shorter "pd." For example, you can access Pandas functions like "pd.DataFrame()" to create a DataFrame or "pd.read_csv()" to read data from a CSV file. This aliasing convention is commonly used in the Python data science community to make code more concise and readable.

Loading data

# Reading data from a CSV file
df = pd.read_csv('data.csv')

The line of code df = pd.read_csv('data.csv') in Python does the following:

  1. pd: It uses the alias "pd" that was previously assigned to the Pandas library when importing it. This alias simplifies the process of calling Pandas functions and objects.

  2. read_csv(): This is a Pandas function used for reading data from CSV (Comma-Separated Values) files. CSV files are a common format for storing tabular data.

  3. 'data.csv': This is the argument passed to the read_csv() function and specifies the file path or URL to the CSV file you want to read. In this case, it assumes there is a file named "data.csv" in the current working directory.

  4. df: This is a variable name (you can choose any name you prefer) that is used to store the result of reading the CSV file. By convention, "df" is often used as an abbreviation for "DataFrame," which is the primary data structure in Pandas used to store and manipulate structured data.

To view your data, simply type the variable name df in a notebook cell or interactive session. In a script, use print(df) instead.
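If you don't have a data.csv file at hand, you can still try read_csv() out. The sketch below feeds it a small block of CSV text through io.StringIO, which stands in for a file on disk. The column names and values are invented for the example.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the column names and values are made up
csv_text = """name,age,city
Alice,25,New York
Bob,30,San Francisco
Charlie,35,Los Angeles
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

print(df)                   # in a script, print() is needed to display the data
print(df.columns.tolist())  # ['name', 'age', 'city']
```

The first line of the CSV text is used as the column names, and every following line becomes one row.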

Series

A Pandas Series represents a one-dimensional data array capable of holding various value types, including numbers, strings, or booleans. You can generate a Pandas Series using the Series() function by providing it with a list or dictionary as input.

Here's an example of what a Series looks like:

Creating a series from a list

import pandas as pd

# Create a Series from a list of numbers
numbers = [1, 2, 3, 4, 5]
series_from_list = pd.Series(numbers)

# Create a Series from a list of strings
fruits = ['apple', 'banana', 'cherry', 'date']
series_from_strings = pd.Series(fruits)

Creating a series from a dictionary

import pandas as pd

# Create a Series from a dictionary with custom index labels
data_dict = {'A': 10, 'B': 20, 'C': 30, 'D': 40}
series_from_dict = pd.Series(data_dict)

# Create a Series from a dictionary where keys become index labels
fruit_counts = {'apple': 5, 'banana': 8, 'cherry': 12, 'date': 3}
series_from_dict_keys = pd.Series(fruit_counts)

In the first set of examples, we create Pandas Series from lists, one containing numeric values and another containing strings. Since a list has no labels, Pandas assigns the default index 0, 1, 2, and so on. In the second set of examples, we create Pandas Series from dictionaries. In both cases the dictionary keys become the index labels of the Series: 'A' through 'D' in the first example and the fruit names in the second.
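Index labels matter because you can use them to look values up. Here is a small sketch of that idea. The index argument used for the list-based Series is a standard option of pd.Series(), and the labels "a", "b", "c" are arbitrary choices for the example.

```python
import pandas as pd

fruit_counts = pd.Series({"apple": 5, "banana": 8, "cherry": 12, "date": 3})

# Dictionary keys become the index, so values can be looked up by label
print(fruit_counts["banana"])       # 8
print(fruit_counts.index.tolist())  # ['apple', 'banana', 'cherry', 'date']

# A list has no keys, so the index argument supplies custom labels
numbers = pd.Series([1, 2, 3], index=["a", "b", "c"])
print(numbers["c"])                 # 3
```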

Data Exploration

You can quickly explore your data using Pandas to get a sense of its structure and content. Some common exploration methods include:

  • df.head(): View the first few rows of the DataFrame. It returns the first 5 rows by default, but you can ask for any number of rows by passing it as an argument. For example, df.head(8) returns the first 8 rows.

  • df.tail(): View the last rows of the DataFrame. It works exactly like df.head(), only from the bottom. For example, df.tail(8) returns the last 8 rows.

  • df.info(): Get information about the data types and missing values. When you call df.info(), it prints a concise summary of the DataFrame: the column names, the number of non-null values in each column, the data types, and the memory usage. Here's an example of calling df.info():

      import pandas as pd
    
      # Create a simple DataFrame
      data = {'Name': ['Alice', 'Bob', 'Charlie'],
              'Age': [25, 30, 35],
              'City': ['New York', 'San Francisco', 'Los Angeles']}
      df = pd.DataFrame(data)
    
      # Get information about the DataFrame
      df.info()
    

  • df.describe(): Generate summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns in the DataFrame, which can help you quickly understand the distribution and characteristics of the data.

  • df.shape: Get the number of rows and columns. It returns a tuple containing two values:

    1. The number of rows in the DataFrame.

    2. The number of columns in the DataFrame.

So, when you use df.shape, you get a tuple in the format (rows, columns) that provides information about the structure of your DataFrame. This can be useful for various purposes, such as checking the size of your data, validating data integrity, or dynamically adjusting your data processing code based on the DataFrame's dimensions.
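The sketch below runs the exploration methods from this section on one small DataFrame so you can see what each returns. The data is made up, and a fourth row is added so that head(2) and tail(2) return different rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "San Francisco", "Los Angeles", "Chicago"],
})

first_two = df.head(2)    # rows for Alice and Bob
last_two = df.tail(2)     # rows for Charlie and Dana
rows, columns = df.shape  # shape is an attribute, so no parentheses
summary = df.describe()   # covers only the numeric 'Age' column here

print(first_two)
print(last_two)
print(rows, columns)               # 4 3
print(summary.loc["mean", "Age"])  # 32.5
```

Note that df.shape is written without parentheses because it is an attribute, while head(), tail(), and describe() are methods.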

Arithmetic functions

Here are some commonly used mathematical functions in Pandas:

  1. .sum(): Calculates the sum of values along a specified axis, either for a single column or across rows and columns of a DataFrame.

     # Sum of the 'Sales' column
     total_sales = df['Sales'].sum()
    
  2. .mean(): Computes the mean (average) of values in a DataFrame or Series.

     # Mean of ages in the 'Age' column
     average_age = df['Age'].mean()
    
  3. .median(): Calculates the median (middle value) of a dataset, which is a measure of central tendency.

     # Median income in the 'Income' column
     median_income = df['Income'].median()
    
  4. .std() and .var(): These functions compute the standard deviation and variance, respectively, providing insights into the spread and variability of data.

     # Standard deviation of 'Scores'
     std_deviation = df['Scores'].std()

     # Variance of 'Scores'
     variance = df['Scores'].var()
    
  5. .max() and .min(): Find the maximum and minimum values in a DataFrame or Series.

     # Maximum temperature
     highest_temperature = df['Temperature'].max()  
    
     # Minimum stock price
     lowest_stock_price = df['StockPrice'].min()
    
  6. .abs(): Computes the absolute values of all elements in a Series.

     # Convert debt values to positive
     df['Debt'] = df['Debt'].abs()
    
  7. .apply(): This versatile function allows you to apply custom mathematical functions or operations to Series or DataFrame elements.

     def custom_function(x):
         return x * 2
    
     doubled_values = df['Values'].apply(custom_function)
    
  8. .corr(): Computes the correlation between numerical columns, revealing relationships between variables.

     # Correlation matrix for all numeric columns
     # (numeric_only=True, available in pandas 1.5+, skips text columns)
     correlation_matrix = df.corr(numeric_only=True)
    

These Pandas mathematical functions are essential tools for data analysis, as they help you summarize data, calculate statistics, and gain insights into your datasets. Whether you're working with financial data, scientific measurements, or any other type of structured data, Pandas provides a powerful and efficient way to perform mathematical operations.
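The snippets above each assume a different column, so here is a single self-contained sketch that you can run as is. The 'Product', 'Sales', and 'Profit' columns and their values are invented for illustration, and the numeric_only argument assumes pandas 1.5 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["A", "B", "C", "D"],
    "Sales": [100, 200, 300, 400],
    "Profit": [10, -20, 30, 40],
})

total_sales = df["Sales"].sum()       # 1000
average_sales = df["Sales"].mean()    # 250.0
median_sales = df["Sales"].median()   # 250.0
lowest_profit = df["Profit"].min()    # -20
absolute_profit = df["Profit"].abs()  # 10, 20, 30, 40

# apply() calls the function once per element
doubled_sales = df["Sales"].apply(lambda x: x * 2)  # 200, 400, 600, 800

# numeric_only=True leaves the text column 'Product' out of the calculation
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)
```

The correlation matrix here has two rows and two columns, one for each numeric column.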

Accessing Rows and Columns

In Pandas, you can access rows and columns in a DataFrame using various methods and indexing techniques. Here's how to do it:

Accessing Columns:

  1. Using Column Labels (Column Names): You can access a specific column by using its label (name) enclosed in square brackets or as an attribute of the DataFrame.

     # Using square brackets
     df['ColumnName']  # Returns a Series
    
     # Using attribute access
     df.ColumnName  # Returns a Series (if the column name has no spaces or special characters)
    
  2. Using .loc[] and .iloc[]: You can also access columns using the .loc[] and .iloc[] indexers. .loc[] uses labels, and .iloc[] uses integer-based positions.

     # Using .loc[] (label-based)
     df.loc[:, 'ColumnName']
    
     # Using .iloc[] (integer-based)
     df.iloc[:, column_index]
    

Accessing Rows:

  1. Using Integer-Based Indexing: You can access rows by their integer-based index using .iloc[].

     # Access the first row
     df.iloc[0]
    
     # Access a range of rows (positions 2 through 4; the end position 5 is excluded)
     df.iloc[2:5]
    
  2. Using .loc[] with Labels: You can access rows by specifying their labels (index or row name) using .loc[].

     # Access a specific row by its label (e.g., 'index_label')
     df.loc['index_label']
    
     # Access rows based on a condition (e.g., where 'Column' equals a certain value)
     df.loc[df['Column'] == value]
    
  3. Using Boolean Indexing: You can filter rows based on a condition using boolean indexing.

     # Filter rows where a condition is True
     df[df['Column'] > value]
    

Accessing Specific Cells:

You can access specific cells in a DataFrame by specifying both the row and column using .loc[] or .iloc[].

# Access a specific cell by row and column labels
df.loc['row_label', 'Column']

# Access a specific cell by row and column positions
df.iloc[row_index, column_index]

Remember to replace 'ColumnName', 'index_label', 'row_label', column_index, row_index, 'Column', and value with the actual column names, row labels, positions, and values you want to use. These techniques provide you with fine-grained control over the data you retrieve from a Pandas DataFrame, making it easy to extract the specific information you need for analysis or manipulation.
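Here is the same set of techniques applied to a concrete DataFrame, as a minimal sketch. The row labels ('alice', 'bob', 'charlie') and the column values are made up, and the index argument is used so that label-based and position-based access are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [25, 30, 22], "City": ["Lagos", "Accra", "Nairobi"]},
    index=["alice", "bob", "charlie"],
)

ages = df["Age"]                  # column as a Series
first_row = df.iloc[0]            # row at position 0 (label 'alice')
bob_row = df.loc["bob"]           # row with the label 'bob'
over_24 = df[df["Age"] > 24]      # boolean filter keeps alice and bob
bob_city = df.loc["bob", "City"]  # single cell by labels
same_cell = df.iloc[1, 1]         # the same cell by positions

print(bob_city)   # Accra
print(same_cell)  # Accra
```

.loc[] always works with labels and .iloc[] always works with positions, which is why both of the last two lines reach the same cell.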

Sorting Data in Pandas

Sorting data in Pandas is a fundamental operation for organizing and analyzing datasets. Pandas provides a variety of methods for sorting data within DataFrames and Series, making it easy to customize and control the sorting process. Here's how to sort data in Pandas:

Sorting DataFrames:

To sort a DataFrame, you can use the sort_values() method. This method allows you to specify the column(s) by which to sort the data and the sorting order (ascending or descending).

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22]}

df = pd.DataFrame(data)

# Sort by the 'Age' column in ascending order
df_sorted = df.sort_values(by='Age')

# Sort by the 'Age' column in descending order
df_sorted_desc = df.sort_values(by='Age', ascending=False)

Sorting Series:

You can also sort a Series using the sort_values() method.

ages = df['Age']

# Sort the 'Age' Series in ascending order
ages_sorted = ages.sort_values()

# Sort the 'Age' Series in descending order
ages_sorted_desc = ages.sort_values(ascending=False)

Sorting by Multiple Columns:

You can sort a DataFrame by multiple columns by passing a list of column names to the by parameter of sort_values(). The DataFrame will be sorted by the first column in the list, and in case of ties, it will then be sorted by the second column, and so on.

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22],
        'Score': [90, 85, 92]}

df = pd.DataFrame(data)

# Sort by 'Age' in ascending order and then by 'Score' in descending order
df_sorted = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
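In the data above every age is different, so the second column never comes into play. The sketch below uses made-up data with repeated ages to show the tie-breaking at work.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 30, 25, 30],
    "Score": [90, 85, 92, 88],
})

# Age ascending first; within the same age, higher scores come first
df_sorted = df.sort_values(by=["Age", "Score"], ascending=[True, False])

print(df_sorted["Name"].tolist())  # ['Charlie', 'Alice', 'Dana', 'Bob']
```

Charlie and Alice are both 25, so Charlie's higher score puts him first. The same happens with Dana and Bob at age 30.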

Sorting by Index:

To sort a DataFrame by its index, you can use the sort_index() method.

# Sort the DataFrame by index in ascending order
df_sorted_index = df.sort_index()

# Sort the DataFrame by index in descending order
df_sorted_index_desc = df.sort_index(ascending=False)
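One common use for sort_index() is putting rows back in their original order after sorting by values. The following sketch shows this with the small Name/Age table used earlier.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 22]})

# Sorting by value reorders the rows, and each row keeps its index label
by_age = df.sort_values(by="Age")
print(by_age.index.tolist())    # [2, 0, 1]

# Sorting by index restores the original row order
restored = by_age.sort_index()
print(restored.index.tolist())  # [0, 1, 2]
```

Both sort_values() and sort_index() return a new DataFrame by default, so df itself is left unchanged.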

Sorting data in Pandas allows you to explore and analyze your data more effectively. Whether you need to arrange data for visualization, identify outliers, or perform other data manipulation tasks, Pandas provides a versatile set of tools to help you sort your data in the desired order.

Now that you know how to manipulate and sort data and perform some arithmetic operations with Pandas, go ahead and practice with some datasets from Kaggle.