Plotting with Python is a nightmare. At least that’s what I thought after almost a decade of plotting experience with R. In R, ggplot2 is the undefeated champion of plotting.
I learned ggplot2 syntax in 2014 and it is beautiful. Like building a tower, in ggplot2 one builds layer on layer until the plot is done. All in one expression. And ggplot2 has sane defaults: You add a color to the data, and you get a legend. If you use a shape, you get a legend.
Matplotlib and seaborn might be the Python plotting libraries of your choice, but they can't hold a candle to ggplot2 in my opinion. Their syntax is complicated, everything is cumbersome, and they simply do not behave as I want them to. I really could not get used to them.
So I was very happy to learn that I am very much not alone: Plotnine is a project bringing the ggplot2 syntax to Python. What a blessing.
So today I thought I would use plotnine to style a plot and show how it is used.
Plotnine: The Basics
After installing plotnine you have a choice: really commit to the bit and import the features you need directly into your namespace, or import plotnine as a module:
from plotnine import ggplot
Or
import plotnine
I prefer importing plotnine under an alias, which I think keeps the code clean and avoids name collisions:
import plotnine as p9
And that's what I will use for this blog post. I just wanted to make you aware that we do have a choice.
Plotnine integrates tightly with pandas and expects the data to be a pandas DataFrame or something exposing a .to_pandas() method, like a Polars DataFrame.
A simple plot
To dive into plotnine I chose to recreate the style of panel C from Figure 1 of the AlphaFold3 paper (Abramson et al. (2024), www.nature.com/articles/s41586-024-07487-w). As the goal is not to recreate the analysis, I will invent data that roughly matches the figure, simply for the purpose of this blog post.
I created a new conda environment and installed plotnine and polars:
mamba create -n post_plotnine python==3.12 uv
mamba activate post_plotnine
uv pip install polars==1.27.1 plotnine==0.14.5 pyarrow==19.0.1
Then to simulate some data:
import polars as pl
import plotnine as p9
import numpy as np
def simulate_results(median: int, variance: int, n: int) -> np.ndarray:
    # np.random.normal expects a standard deviation, hence the square root
    return np.random.normal(loc=median, scale=np.sqrt(variance), size=n)
ligands_df_summary = pl.DataFrame({
"model": ["AF3", "AutoDock", "RoseTTAFold"],
"median": [78, 52, 43],
"variance": [10, 20, 30],
"n": [428, 428, 427]
})
ligands_simulated: list[pl.DataFrame] = []
for row in ligands_df_summary.iter_rows(named=True):
    median = row["median"]
    variance = row["variance"]
    n = row["n"]
    simulated_data = simulate_results(median, variance, n)
    ligands_simulated.append(pl.DataFrame({
        "model": row["model"],
        "simulated_data": simulated_data
    }))
ligands_simulated_df = pl.concat(ligands_simulated)
ligands_simulated_df.head()
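Because np.random.normal draws new values on every run, the plots will look slightly different each time the code is executed. Here is a minimal sketch of a reproducible variant, assuming NumPy's Generator API. The name simulate_results_seeded and the seed value are my own illustration and not part of the code above:

```python
import numpy as np


def simulate_results_seeded(
    median: float, variance: float, n: int, seed: int = 42
) -> np.ndarray:
    """Draw n normally distributed values centred on `median`.

    Hypothetical variant of simulate_results: a seeded generator makes
    the draws repeatable. `scale` is a standard deviation, hence the sqrt.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(loc=median, scale=np.sqrt(variance), size=n)


first = simulate_results_seeded(78, 10, 428)
second = simulate_results_seeded(78, 10, 428)

# Same seed, same draws
print(np.array_equal(first, second))
```

Passing a different seed per model would keep the three groups independent while still being repeatable.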
Now that the data has been simulated, I can get started on the plotting. Let's start with a plot using nothing but the default settings of plotnine. I chose a boxplot, as it comes closest to representing the data they want to show:
(
p9.ggplot(ligands_simulated_df, p9.aes(x="model",y = "simulated_data", fill="model"))
+ p9.geom_boxplot()
)
That was easy. I hope you can appreciate how many things plotnine did for us. We have axis labels, we have a legend. We have a decent color scheme. But besides that, the plot is missing a few key things. To name the obvious:
- The colors are all wrong
- We don't have the n values
- We are missing the title
To match the style correctly, we will need to dive into theming. I recommend having a look at the documentation of ggplot2 or plotnine for the theme options. For example here: plotnine.org/reference/theme.html
In the following code section I have done a lot of things. I will not dive into the details, as I think reading the code and playing with it will be the best teacher.
I discuss a few points as a conclusion below the code:
model_colors = {
    "AF3": "#9cdcff",
    "AutoDock": "#c3c3c3",
    "RoseTTAFold": "#c3c3c3",
}
# x_center is the midpoint between x and xend, used to place the label
significance_data = pl.DataFrame({
    "y": [90, 70],
    "x": [1, 2],
    "xend": [2, 3],
    "label": ["***", "**"],
}).with_columns((pl.col("xend")+(pl.col("x")-pl.col("xend"))/2).alias("x_center"))
(
    p9.ggplot(ligands_simulated_df.group_by("model").agg(pl.col("simulated_data").mean().alias("mean")),
              p9.aes(x="model", y="mean", fill="model"))
    + p9.geom_col(width=0.7)
    + p9.scale_fill_manual(values=model_colors)
    + p9.labs(
        title="Ligands PoseBusters set",
        x="",
        y="Success (%)",
    )
    + p9.geom_errorbar(
        mapping=p9.aes(x="model", y="mean", ymin="ymin", ymax="ymax"),
        data=(ligands_simulated_df
              .group_by("model")
              .agg(pl.len().alias("n"),
                   pl.col("simulated_data").mean().alias("mean"),
                   pl.col("simulated_data").std().alias("sd"),
              )
              .with_columns([
                  (pl.col("mean") - 1.96 * pl.col("sd")).alias("ymin"),
                  (pl.col("mean") + 1.96 * pl.col("sd")).alias("ymax")]
              )
        ),
        width=0.001,
        size=0.6,
        position=p9.position_dodge(0.9),
    )
    + p9.scale_y_continuous(
        breaks=[0, 20, 40, 60, 80, 100],
        limits=(0, 100),
        expand=(0, 0),
    )
    + p9.scale_x_discrete(
        breaks=["AF3", "AutoDock", "RoseTTAFold"],
        labels=[
            "AF3\n2019 cut-off\n" + r"$\it{n}$ = " + ligands_df_summary.filter(pl.col("model") == "AF3").get_column("n").to_numpy()[0].astype(str),
            "AutoDock\nVina\n" + r"$\it{n}$ = " + ligands_df_summary.filter(pl.col("model") == "AutoDock").get_column("n").to_numpy()[0].astype(str),
            "RoseTTAFold\nAll-Atom\n" + r"$\it{n}$ = " + ligands_df_summary.filter(pl.col("model") == "RoseTTAFold").get_column("n").to_numpy()[0].astype(str),
        ],
        expand=(0.04, 0.04))
    + p9.geom_segment(
        data=significance_data,
        mapping=p9.aes(x="x", y="y", xend="xend", yend="y"),
        inherit_aes=False,
        color="black",
        size=0.5,
    )
    + p9.geom_text(
        data=significance_data,
        mapping=p9.aes(x="x_center", y="y", label="label"),
        inherit_aes=False,
        nudge_y=1.5,
        size=11,
        color="black",
    )
    + p9.theme_bw()
    + p9.theme(
        panel_grid=p9.element_blank(),  # Remove any grid lines
        panel_border=p9.element_blank(),  # Remove the border around the plot
        axis_line_x=p9.element_line(color="black"),
        axis_line_y=p9.element_line(color="black"),
        axis_text=p9.element_text(linespacing=1.2,
                                  color="black",
                                  size=11),
        axis_ticks_pad_major_x=5,
        axis_ticks_pad_major_y=1,
        axis_ticks_length=7,
        legend_key_size=8,
        legend_title=p9.element_blank(),
        legend_position="inside",
        legend_position_inside=(1, 1),  # Set to (0,1) for top left
        legend_text=p9.element_text(size=11),
        figure_size=(3.9, 4),
        dpi=100,
        plot_title=p9.element_text(size=11, ha="center"),
    )
)
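The tick labels in the code above are assembled with three nearly identical expressions. A small helper could build them from lookup tables instead. This is only a sketch: the names subtitles, n_per_model and make_tick_label are hypothetical, and the values are copied from this post rather than read from ligands_df_summary:

```python
# Hypothetical lookup tables; the values mirror those used in the post
subtitles = {
    "AF3": "2019 cut-off",
    "AutoDock": "Vina",
    "RoseTTAFold": "All-Atom",
}
n_per_model = {"AF3": 428, "AutoDock": 428, "RoseTTAFold": 427}


def make_tick_label(model: str) -> str:
    """Build a three-line tick label: model name, subtitle, italic n."""
    # The raw string keeps the backslash in \it intact for mathtext
    return f"{model}\n{subtitles[model]}\n" + r"$\it{n}$ = " + str(n_per_model[model])


# Dicts preserve insertion order, so labels follow the order of subtitles
tick_labels = [make_tick_label(model) for model in subtitles]
print(tick_labels[0])
```

The resulting list could then be passed as labels= to p9.scale_x_discrete in place of the three long expressions.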
The recreated plot comes very close to the style shown in the paper. A few details are off. The x-axis labels in the original plot are wider than the x-axis. This is not the case for my version. Maybe with a bit more work, this would be possible.
My error bars also show the wrong quantity (the mean ± 1.96 standard deviations of the simulated data) and are more of a placeholder. But they are built in such a way that other data sources could be plugged in if they are available.
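As one illustration of plugging in a different quantity: mean ± 1.96 standard deviations describes the spread of the simulated values, not the uncertainty of the mean. Below is a minimal sketch of a normal-approximation 95% interval for the mean, which divides the standard deviation by the square root of n. The function name mean_ci95 and the seeded sample are assumptions for illustration, and whether this interval is appropriate depends on the real data:

```python
import numpy as np


def mean_ci95(values: np.ndarray) -> tuple[float, float, float]:
    """Return (mean, ymin, ymax) using a normal approximation."""
    mean = float(np.mean(values))
    # ddof=1 gives the sample standard deviation (also the polars default)
    sem = float(np.std(values, ddof=1)) / np.sqrt(len(values))
    return mean, mean - 1.96 * sem, mean + 1.96 * sem


# Illustrative sample only, mimicking the simulated AF3 group
rng = np.random.default_rng(0)
sample = rng.normal(loc=78, scale=np.sqrt(10), size=428)

mean, ymin, ymax = mean_ci95(sample)
print(f"{mean:.2f} [{ymin:.2f}, {ymax:.2f}]")
```

With 428 values per group this interval is much narrower than the placeholder bars, so the error bar width and size might need adjusting to stay visible.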
Overall I am very happy with how the final plot looks and how I got there. I hope this post showed that with a bit of theming and a vision, plotnine can be a powerful plotting tool that can produce publication-grade plots.
In a further blog post I hope to see whether we can create publication-grade figures as well: can we combine panels, add panel labels, and how flexible is it?