Switching to Polars from Pandas
What is Polars?
Polars is a Rust dataframe library that ships with a Python wrapper package. It is similar to Pandas but has some neat features that set it apart, and it is reportedly more performant than Pandas thanks to parallel and lazy operations. Although I have been aware of Polars for a couple of years, I had never found myself using it. However, since it is growing in popularity and I only ever read very good things about it, I decided to jump on the hype wagon and write up a few of the things I have picked up and noticed since adopting it in a personal project.
Polars and Pandas
The good thing is that Pandas DataFrames can be converted to Polars ones and vice versa, which is handy if you want to migrate slowly from one to the other. If you read articles like the RealPython one on Polars, you will see that some obvious things already differ between the two, for example Polars dataframes do not really have indexes. But since there are plenty of those examples around already, I want to focus on the practical differences I have noticed these past days when actually doing operations.
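For example, a minimal round trip (assuming both libraries are installed, plus pyarrow, which the conversion relies on) looks like this:
import pandas as pd
import polars as pl

pd_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

pl_df = pl.from_pandas(pd_df)   # Pandas -> Polars
back_to_pd = pl_df.to_pandas()  # Polars -> Pandas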
So, although operations can be performed similarly to Pandas, the syntax is not exactly the same, and not all of Pandas' handy built-in methods are available in Polars, so you will find yourself implementing by hand some things that in Pandas already have a dedicated method.
df._get_numeric_data().columns # In Pandas we can just use this hidden method (hey, it's Python)
df.select([pl.col(pl.NUMERIC_DTYPES)]).columns # In Polars we have to filter it ourselves
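As an aside, recent Polars versions also ship a selectors module that makes this kind of column selection a bit less manual; if your version has it, something like this should work:
import polars.selectors as cs

numeric_cols = df.select(cs.numeric()).columns  # same idea, via selectors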
# Adding rows in Pandas: just assign to a new index label
described_df.loc["shapiro-p"] = shapiro_scores
described_df.loc["shapiro-statistic"] = shapiro_statistic
described_df.loc["skew"] = skewness
described_df.loc["kurtosis"] = kurtosis

# Adding them in Polars: build one dict per new row
# (here populated column by column), then vstack them onto the frame
shapiro_scores, shapiro_statistic, kurtosis, skewness = {}, {}, {}, {}
for col in df.columns:
    shapiro_scores[col] = scipy.stats.shapiro(df[col])[1]
    shapiro_statistic[col] = scipy.stats.shapiro(df[col])[0]
    kurtosis[col] = scipy.stats.kurtosis(df[col])
    skewness[col] = scipy.stats.skew(df[col])
described = described.vstack(
    pl.DataFrame([shapiro_scores,
                  shapiro_statistic,
                  kurtosis,
                  skewness]))
Operations
I roughly categorize operations in Polars into three groups: horizontal, vertical and group-by's. Horizontal operations are those where you select the columns you need, vertical operations are the ones where you filter the rows you need, and group-by's are just aggregations based on categorical or time variables.
Horizontal
Let's see an example where I want to include numeric columns, but not all of them. Say I have an encoded categorical variable that shows up as numerical (categories: 1, 2, 3, 4), so I want all the other numerical columns except this one, because it is not really numerical:
def get_categoricals(df: pl.DataFrame) -> tuple[pl.DataFrame, pl.DataFrame]:
    # Treat low-cardinality columns as (encoded) categoricals
    selected_columns = [col for col in df.columns if df[col].n_unique() < 10]
    categorical_filtered_df = df.select(selected_columns)
    # Keep all numeric columns except the ones flagged as categorical above
    numeric_filtered_df = df.select(
        [pl.col(pl.NUMERIC_DTYPES).exclude(categorical_filtered_df.columns)])
    return categorical_filtered_df, numeric_filtered_df
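As a quick sanity check, here is how I would call it on the iris data used in the examples further down (loading it through scikit-learn is just my assumption for this sketch, any Polars dataframe with an encoded category column works the same way):
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = pl.from_pandas(iris.frame)  # the frame includes the encoded "target" column

categorical_df, numeric_df = get_categoricals(df)
print(categorical_df.columns)  # the low-cardinality column(s), e.g. "target"
print(numeric_df.columns)      # the remaining numeric feature columns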
Vertical
We can use vertical operations to include only the rows that have a certain value, so instead of the loc method of a Pandas dataframe, we can just use filter and provide the columns and conditions we want to apply to filter our rows:
# Using the iris dataframe
df.filter(pl.col("target") == 1)  # Only get rows where the target column value is 1

# Only get target 1 rows where the sepal length
# is more than 6 and the sepal width is less than 3.
df.filter([pl.col("target") == 1,
           pl.col("sepal length (cm)") > 6,
           pl.col("sepal width (cm)") < 3])
Group By
Group by in itself works similarly to Pandas, but how we set up the aggregations is different. In this example I first set up the aggregations I want for each column, then unpack them and apply them through the group-by's agg method.
# In this function, for each category we do aggregations
# of each numerical column with different functions
def get_grouped_stats(cat: list[str], num: list[str], df: pl.DataFrame) -> list[pl.DataFrame]:
    grouped_dfs = []
    for cat_col in cat:
        # Generate aggregation operations for each numeric column
        agg_operations = []
        for num_col in num:
            stats = [
                ("min", pl.col(num_col).min().alias(f"{num_col}-min")),
                ("max", pl.col(num_col).max().alias(f"{num_col}-max")),
                ("sum", pl.col(num_col).sum().alias(f"{num_col}-sum")),
                ("mean", pl.col(num_col).mean().alias(f"{num_col}-mean")),
                ("median", pl.col(num_col).median().alias(f"{num_col}-median")),
                ("std", pl.col(num_col).std().alias(f"{num_col}-std")),
            ]
            agg_operations.extend([op for _, op in stats])
        # Group by the categorical column and apply all aggregations at once
        grouped_dfs.append(df.group_by(pl.col(cat_col)).agg(*agg_operations))
    return grouped_dfs
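Tying it together with the earlier helper, a hypothetical call (the variable names here are just for illustration) would look like this:
categorical_df, numeric_df = get_categoricals(df)
grouped = get_grouped_stats(categorical_df.columns, numeric_df.columns, df)
print(grouped[0])  # per-category min/max/sum/mean/median/std of every numeric column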
So why use it?
Well, the performance factor is the first reason that comes to mind. The fact that it is still at an early stage and the API could still change a lot makes it less attractive for stable codebases, but new features are being added quickly (for example, good support for reading and writing files to s3 buckets since version 0.19.4), which surely looks promising. In my case I will also switch because Polars integrates nicely with any other parts of the codebase that are, or may end up being, written in Rust. Having operations that can run in parallel without worrying about the GIL, and being able to lazily process dataframes larger than memory, makes it a great alternative if you plan on working with large datasets. Plus it is what the cool kids are writing about, so might as well join the hype train.
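To give an idea of the lazy side, here is a minimal sketch (the file name and columns are made up, and collect(streaming=True) is what I use on my Polars version to ask for streaming execution):
lazy_df = (
    pl.scan_csv("measurements.csv")  # nothing is read yet, only a query plan is built
    .filter(pl.col("value") > 0)
    .group_by("sensor_id")
    .agg(pl.col("value").mean().alias("value-mean"))
)
result = lazy_df.collect(streaming=True)  # execute the plan, streaming batches through memory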