Sometimes you're dealing with a long polars pipeline that has a bug. It's somewhere in the chain, but you don't know where yet.
Maybe it looks like this:
(
    pl.scan_parquet("wow-full.parquet")
    .filter(pl.col("race").is_in(races))
    .with_columns(date=pl.col("datetime").dt.truncate("4w"))
    .group_by("race", "date")
    .agg(hours=pl.len() / 6, unique_players=pl.n_unique("player_id"))
    .group_by("race")
    .agg(
        hours=pl.sum("hours").round().cast(pl.Int32),
        over_time=pl.col("hours")
    )
)
To help debug in moments like this, you can monkeypatch a show() method onto polars DataFrames and LazyFrames.
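import polars as pl

def show(self, n=5, name=None):
    # Optional label so several peeks in one chain can be told apart.
    if name:
        print(name)
    if isinstance(self, pl.DataFrame):
        print(self.head(n))
    else:
        # LazyFrame: materialize only a preview of the first n rows.
        print(self.head(n).collect())
    # Return the frame unchanged so the method chain can continue.
    return self

# Attach show() to both the eager and the lazy frame classes.
pl.DataFrame.show = show
pl.LazyFrame.show = show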
Because show() returns the frame it was called on, you can keep chaining while peeking at intermediate steps to check whether the columns and types are what you expect.
You can now change your pipeline to get useful print statements.
(
    pl.scan_parquet("wow-full.parquet")
    .filter(pl.col("race").is_in(races))
    .show()
    .with_columns(date=pl.col("datetime").dt.truncate("4w"))
    .group_by("race", "date")
    .agg(hours=pl.len() / 6, unique_players=pl.n_unique("player_id"))
    .show()
    .group_by("race")
    .agg(
        hours=pl.sum("hours").round().cast(pl.Int32),
        over_time=pl.col("hours")
    )
    .show()
)
I also made a 1-minute recording of this setup, if folks prefer a live demo.