Polars – Efficient Data Manipulation in Python

Nov 13, 2024

data manipulation, data wrangling, data-analysis, data-science, machine-learning, polars, python, tutorial

Polars is a Python package that provides the most common functions used for data manipulation. It is an alternative to Pandas and has several advantages. This tutorial introduces the most commonly used functions with examples.

Advantages:
1. Fast: Polars is written in Rust and is generally much faster than Pandas on common data manipulation tasks.
2. Intuitive data manipulation: The functions for data manipulation are intuitive and easy to use. The most common functions covered in this tutorial are select, filter, group_by, and agg.
3. Expressions: Polars lets you build expressions, such as pl.col("Age") <= 18, that describe a computation on a column and can be passed to functions such as select, filter, and agg.

Dataset:
Below is an exploration of the Titanic dataset using the Polars package.

Download the data from GitHub here

You can also download the code for this tutorial here

Read the dataset from a CSV file and save it as a Polars dataframe:

import polars as pl


titanic_df = pl.read_csv("titanic.csv")

Example One:
Objective: Find out how many male and female children (aged 18 or under) survived the sinking of the Titanic.

To accomplish this, we need to:
a) Select the required columns (Age, Sex, Survived) using the select function
b) Filter for Age less than or equal to 18 using the filter function
c) Group the dataset by Sex (male and female) using the group_by function
d) Aggregate the number of children who survived using the agg function

code:

# Data manipulation using Polars
# Objective: find out how many male and female children (aged 18 or under) survived
children_survived = (
    titanic_df.select("Survived", "Age", "Sex")  # keep only the columns we need
    .filter(pl.col("Age") <= 18)                 # keep passengers aged 18 or under
    .group_by("Sex")                             # one group per sex
    .agg(pl.col("Survived").sum().alias("Survival Count"))  # survivors per group
)

Example Two:
Objective: Find the average fare for each passenger class, broken down by those who survived the sinking of the Titanic and those who did not.

To accomplish this, we need to do the following:
a) Select the Pclass, Survived, and Fare columns using the select function
b) Group by the Survived and Pclass columns using the group_by function
c) Aggregate the Fare column using the agg function

code:

# Objective: find the average fare per passenger class, split by survival
avg_fare = (
    titanic_df.select("Pclass", "Survived", "Fare")
    .group_by("Survived", "Pclass")
    .agg(Avg_Fare=pl.col("Fare").mean())  # keyword name becomes the column name
)

If we run the code, we see that first-class passengers who survived paid an average fare of about $96, while those who did not paid about $65. Similarly, second-class passengers who survived paid an average fare of about $22, while those who did not survive paid about $14.

I hope these examples illustrate how easy and intuitive these functions are.
With these common data manipulation functions, we can begin to analyze most tabular datasets.
