Examples

pandas

DataFrame example:

>>> import pandas as pd
>>> from pathlib import Path
>>> import raffalib
>>> import raffalib.pandas
>>> from raffalib.export_docx import DocxOptions
>>> logger = raffalib.create_logger(rich=False, fmt="{message}")
>>>
>>> df = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/refs/heads/main/inst/extdata/penguins.csv")
>>> df.head()
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007
>>> _ = df.raffa.startlog().dropna(subset=["bill_depth_mm"]).raffa.endlog()
Removed 2/344 (0.58%) rows.
>>> _ = df.raffa.startlog().query("species=='Adelie'").raffa.endlog()
Removed 192/344 (55.81%) rows.
>>> _ = df.raffa.startlog().drop(["bill_length_mm", "bill_depth_mm"], axis=1).raffa.endlog()
Removed 2/8 (25.00%) columns.
>>> _ = df.raffa.startlog().fillna(0).raffa.endlog()
Shape is the same. No value-level comparison done because clone=False was used in startlog().
>>> _ = df.raffa.startlog(clone=True).fillna(0).raffa.endlog()
Changed 19/2,752 (0.69%) values.
>>> outfp = Path("test.docx")
>>> outfp.unlink(missing_ok=True)
>>> docx_options = DocxOptions(heading_text = "Table 1")
>>> df.head(5).raffa.to_docx(outfp, docx_options=docx_options)
>>> assert outfp.is_file()
>>> outfp.unlink()

Series examples:

>>> s = df["bill_length_mm"]
>>> assert type(s)==pd.Series, type(s)
>>> _ = s.raffa.startlog().dropna().raffa.endlog()
Removed 2/344 (0.58%) values.
>>> _ = s.raffa.startlog().fillna(0).raffa.endlog()
Shape is the same. No value-level comparison done because clone=False was used in startlog().
>>> _ = s.raffa.startlog(clone=True).fillna(0).raffa.endlog()
Changed 2/344 (0.58%) values.

polars

Logging

Let’ import the libraries and create a logger:

>>> import polars as pl
>>> import polars.selectors as cs
>>> import raffalib
>>> import raffalib.polars
>>> logger = raffalib.create_logger(rich=False, fmt="{message}")

Let’s load a dataset

>>> df = pl.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/refs/heads/main/inst/extdata/penguins.csv")
>>> df = df.raffa.replace_string_with_null("NA")
>>> df = df.with_columns(pl.col(["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]).cast(pl.Float32).name.keep())
>>> df.head()
shape: (5, 8)
┌─────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬────────┬───────┐
│ species ┆ island    ┆ bill_length_mm ┆ bill_depth_mm ┆ flipper_length_mm ┆ body_mass_g ┆ sex    ┆ year  │
│ ---     ┆ ---       ┆ ---            ┆ ---           ┆ ---               ┆ ---         ┆ ---    ┆ ---   │
│ str     ┆ str       ┆ f32            ┆ f32           ┆ f32               ┆ f32         ┆ str    ┆ i64   │
╞═════════╪═══════════╪════════════════╪═══════════════╪═══════════════════╪═════════════╪════════╪═══════╡
│ Adelie  ┆ Torgersen ┆ 39.099998      ┆ 18.700001     ┆ 181.0             ┆ 3,750.0     ┆ male   ┆ 2,007 │
│ Adelie  ┆ Torgersen ┆ 39.5           ┆ 17.4          ┆ 186.0             ┆ 3,800.0     ┆ female ┆ 2,007 │
│ Adelie  ┆ Torgersen ┆ 40.299999      ┆ 18.0          ┆ 195.0             ┆ 3,250.0     ┆ female ┆ 2,007 │
│ Adelie  ┆ Torgersen ┆ null           ┆ null          ┆ null              ┆ null        ┆ null   ┆ 2,007 │
│ Adelie  ┆ Torgersen ┆ 36.700001      ┆ 19.299999     ┆ 193.0             ┆ 3,450.0     ┆ female ┆ 2,007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Let’s see some data wrangling on this data: remove rows with null values, filtering certain values, or selecting certain columns. The number of rows / columns changed is logged.

>>> _ = df.raffa.startlog().drop_nulls(subset=["bill_depth_mm"]).raffa.endlog(timeit=False)
Removed 2/344 (0.58%) rows.
>>> _ = df.raffa.startlog().filter(pl.col("species")=="Adelie").raffa.endlog(timeit=False)
Removed 192/344 (55.81%) rows.
>>> _ = df.raffa.startlog().select(pl.exclude(["bill_length_mm", "bill_depth_mm"])).raffa.endlog(timeit=False)
Removed 2/8 (25.00%) columns.

Operations that do not change the shape of the DataFrame but only its values requires clone=True in the startlog call to clone the entire initial dataframe for further comparisons:

>>> _ = df.raffa.startlog().fill_null(0).raffa.endlog(timeit=False)
Shape is the same. No value-level comparison done because clone=False was used in startlog().
>>> _ = df.raffa.startlog(clone=True).fill_null("0").raffa.endlog(timeit=False)
Changed 11/2,752 (0.40%) values.

Joins

Let’s see an example with joins.

>>> df1 = pl.DataFrame({"A": ["a1", "a2", "a3", "a4"], "B": ["b1", "b2", "b3", "b4"]})
>>> df2 = pl.DataFrame({"A": ["a1", "a2", "a3", "a5", "a6"], "C": ["c1", "c2", "c3", "c5", "c6"]})

Mutating joins

Outer join:

>>> _ = df1.raffa.join(df2, on="A", how="outer")
Total rows in output table: 6
From left only: 1/6 (16.67%)
From right only: 2/6 (33.33%)
From both: 3/6 (50.00%) (left dups 0, right dups 0)
Dropped rows from left: 0/4 (0.00%)
Dropped rows from right: 0/5 (0.00%)

Inner join:

>>> _ = df1.raffa.join(df2, on="A", how="inner")
Total rows in output table: 3
From left only: 0/3 (0.00%)
From right only: 0/3 (0.00%)
From both: 3/3 (100.00%) (left dups 0, right dups 0)
Dropped rows from left: 1/4 (25.00%)
Dropped rows from right: 2/5 (40.00%)

Left join:

>>> _ = df1.raffa.join(df2, on="A", how="left")
Total rows in output table: 4
From left only: 1/4 (25.00%)
From right only: 0/4 (0.00%)
From both: 3/4 (75.00%) (left dups 0, right dups 0)
Dropped rows from left: 0/4 (0.00%)
Dropped rows from right: 2/5 (40.00%)

Right join:

>>> _ = df1.raffa.join(df2, on="A", how="right")
Total rows in output table: 5
From left only: 0/5 (0.00%)
From right only: 2/5 (40.00%)
From both: 3/5 (60.00%) (left dups 0, right dups 0)
Dropped rows from left: 1/4 (25.00%)
Dropped rows from right: 0/5 (0.00%)

Filtering joins

Filtering joins are automatically detected:

>>> _ = df1.raffa.join(df2, left_on="left_a", right_on="right_a", how="semi", keep_row_index=False)
Detected filtering join. Rows variation -2/5 (-40.00%), total rows after join: 3/5 (60.00%)
>>> _ = df1.raffa.join(df2, left_on="left_a", right_on="right_a", how="anti", keep_row_index=False)
Detected filtering join. Rows variation -3/5 (-60.00%), total rows after join: 2/5 (40.00%)

Word export

Let’s export a DataFrame in a Word .docx file:

>>> df = pl.DataFrame({"a": [1,2,3], "b":["AAA", "BBB", "CCC"]})
>>> df.raffa.to_docx("main.docx")