Switching to Polars from Pandas
What is Polars?
Polars is a Rust dataframe library that ships with a Python wrapper package. It is similar to Pandas but has some neat features that set it apart, and it is reportedly more performant than Pandas thanks to parallel and lazy operations. Although I have been aware of Polars for a couple of years, I had never found myself using it. However, since it is growing in popularity and I only ever read very good things about it, I decided to jump on the hype wagon and write up a few of the things I have picked up and noticed since adopting it in a personal project.
Polars and Pandas
The good thing is that Pandas DataFrames can be converted to Polars ones and vice versa, which is handy if you want to migrate slowly from one to the other. If you read articles like the RealPython one on Polars, you will see that some obvious things already differ between the two, for example Polars dataframes do not really have indexes. But since there are plenty of those examples around already, I want to focus on the practical differences I have noticed these past days when actually doing operations.
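For example, a minimal round trip (assuming both libraries are installed, plus pyarrow, which the conversion relies on) looks like this:
import pandas as pd
import polars as pl

pd_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

pl_df = pl.from_pandas(pd_df)   # Pandas -> Polars
back_to_pd = pl_df.to_pandas()  # Polars -> Pandas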
So, although operations can be performed similarly to Pandas, the syntax is not exactly the same, and not all of Pandas' handy built-in methods are available in Polars, so you will find yourself implementing by hand some things that in Pandas already have a dedicated method.
df._get_numeric_data().columns # In Pandas we can just use this hidden method (hey, it's Python)
df.select([pl.col(pl.NUMERIC_DTYPES)]).columns # In Polars we have to filter it ourselves
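As an aside, recent Polars versions also ship a selectors module that makes this kind of column selection a bit less manual; if your version has it, something like this should work:
import polars.selectors as cs

numeric_cols = df.select(cs.numeric()).columns  # same idea, via selectors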
# Adding rows in Pandas: just assign to a new index label
described_df.loc["shapiro-p"] = shapiro_scores
described_df.loc["shapiro-statistic"] = shapiro_statistic
described_df.loc["skew"] = skewness
described_df.loc["kurtosis"] = kurtosis

# Adding them in Polars: build one dict per new row
# (here populated column by column), then vstack them onto the frame
shapiro_scores, shapiro_statistic, kurtosis, skewness = {}, {}, {}, {}
for col in df.columns:
    shapiro_scores[col] = scipy.stats.shapiro(df[col])[1]
    shapiro_statistic[col] = scipy.stats.shapiro(df[col])[0]
    kurtosis[col] = scipy.stats.kurtosis(df[col])
    skewness[col] = scipy.stats.skew(df[col])
described = described.vstack(
    pl.DataFrame([shapiro_scores,
                  shapiro_statistic,
                  kurtosis,
                  skewness]))
Operations
I roughly categorize operations in Polars into three groups: horizontal, vertical and group-by's. Horizontal operations are those where you select the columns you need, vertical operations are the ones where you filter the rows you need, and group-by's are just aggregations based on categorical or time variables.
Horizontal
Let's see an example where I want to include numeric columns, but not all of them. Say I have an encoded categorical variable that shows up as numerical (categories: 1, 2, 3, 4), so I want all the other numerical columns except this one, because it is not really numerical:
def get_categoricals(df: pl.DataFrame) -> tuple[pl.DataFrame, pl.DataFrame]:
    # Treat low-cardinality columns as (encoded) categoricals
    selected_columns = [col for col in df.columns if df[col].n_unique() < 10]
    categorical_filtered_df = df.select(selected_columns)
    # Keep all numeric columns except the ones flagged as categorical above
    numeric_filtered_df = df.select(
        [pl.col(pl.NUMERIC_DTYPES).exclude(categorical_filtered_df.columns)])
    return categorical_filtered_df, numeric_filtered_df
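As a quick sanity check, here is how I would call it on the iris data used in the examples further down (loading it through scikit-learn is just my assumption for this sketch, any Polars dataframe with an encoded category column works the same way):
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = pl.from_pandas(iris.frame)  # the frame includes the encoded "target" column

categorical_df, numeric_df = get_categoricals(df)
print(categorical_df.columns)  # the low-cardinality column(s), e.g. "target"
print(numeric_df.columns)      # the remaining numeric feature columns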
Vertical
We can use vertical operations to include only the rows that have a certain value, so instead of the loc method of a Pandas dataframe, we can just use filter and provide the columns and conditions we want to apply to filter our rows:
# Using the iris dataframe
df.filter(pl.col("target") == 1)  # Only get rows where the target column value is 1

# Only get target 1 rows where the sepal length
# is more than 6 and the sepal width is less than 3.
df.filter([pl.col("target") == 1,
           pl.col("sepal length (cm)") > 6,
           pl.col("sepal width (cm)") < 3])
Group By
Group by in itself works similarly to Pandas, but how we set up the aggregations is different. In this example I first set up the aggregations I want for each column, then unpack them and apply them through the group-by's agg method.
# In this function, for each category we do aggregations
# of each numerical column with different functions
def get_grouped_stats(cat: list[str], num: list[str], df: pl.DataFrame) -> list[pl.DataFrame]:
    grouped_dfs = []
    for cat_col in cat:
        # Generate aggregation operations for each numeric column
        agg_operations = []
        for num_col in num:
            stats = [
                ("min", pl.col(num_col).min().alias(f"{num_col}-min")),
                ("max", pl.col(num_col).max().alias(f"{num_col}-max")),
                ("sum", pl.col(num_col).sum().alias(f"{num_col}-sum")),
                ("mean", pl.col(num_col).mean().alias(f"{num_col}-mean")),
                ("median", pl.col(num_col).median().alias(f"{num_col}-median")),
                ("std", pl.col(num_col).std().alias(f"{num_col}-std")),
            ]
            agg_operations.extend([op for _, op in stats])
        # Group by the categorical column and apply all aggregations at once
        grouped_dfs.append(df.group_by(pl.col(cat_col)).agg(*agg_operations))
    return grouped_dfs
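Tying it together with the earlier helper, a hypothetical call (the variable names here are just for illustration) would look like this:
categorical_df, numeric_df = get_categoricals(df)
grouped = get_grouped_stats(categorical_df.columns, numeric_df.columns, df)
print(grouped[0])  # per-category min/max/sum/mean/median/std of every numeric column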
So why use it?
Well, the performance factor is the first reason that comes to mind. The fact that it is still at an early stage and the API could still change a lot makes it less attractive for stable codebases, but new features are being added quickly (for example, good support for reading and writing files to s3 buckets since version 0.19.4), which surely looks promising. In my case I will also switch because Polars integrates nicely with any other parts of the codebase that are, or may end up being, written in Rust. Having operations that can run in parallel without worrying about the GIL, and being able to lazily process dataframes larger than memory, makes it a great alternative if you plan on working with large datasets. Plus it is what the cool kids are writing about, so might as well join the hype train.
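To give an idea of the lazy side, here is a minimal sketch (the file name and columns are made up, and collect(streaming=True) is what I use on my Polars version to ask for streaming execution):
lazy_df = (
    pl.scan_csv("measurements.csv")  # nothing is read yet, only a query plan is built
    .filter(pl.col("value") > 0)
    .group_by("sensor_id")
    .agg(pl.col("value").mean().alias("value-mean"))
)
result = lazy_df.collect(streaming=True)  # execute the plan, streaming batches through memory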