About Vibe Coding

Impact of vibe coding on product design

Software development is changing rapidly. Indeed:

  1. OpenAI announced Codex on May 16, 2025, for its $200/month Pro users: https://openai.com/index/introducing-codex/
  2. Microsoft's GitHub Copilot released its new coding agent on May 19, 2025: https://bsky.app/profile/github.com/post/3lpjxvgje7s2k
  3. Google announced a tool called Jules (jules.google.com) on May 20, 2025, making it available for free.
  4. Mistral released Devstral, an open-source model for coding agents, on May 21, 2025: https://mistral.ai/news/devstral

These new coding agents—along with Cursor, Lovable, Windsurf, V0, Bolt.new, and others—are all tools that support some form of “vibe coding” (a term coined by Karpathy indicating AI-assisted coding).

This gives rise to a lot of FUD (fear, uncertainty, and doubt) from the corporate gatekeepers. The short-term opportunity is this: in a design thinking approach, a “research prototype” that checks the basic hypotheses (who is this product for, what problem does the product solve) can be developed much faster using vibe coding.

Even with the expected “valley of disappointment” that may follow (because users tend to overreact to the initial prototype, which will likely need to be rewritten from scratch), in the end, the chance of building a product that resonates with users is much higher and it will be ready sooner—if the same good old software process is followed, from prototype to Minimum Viable Product (MVP) to Version 1 accepted by users.

The future AI ecosystem will be open

The future AI ecosystem will be open

I was just reading the article The walled garden cracks: Nadella bets Microsoft’s Copilots—and Azure’s next act—on A2A/MCP interoperability and this is how I see what's happening in the AI landscape:

  • Anthropic: best user experience and defined MCP (Model Context Protocol)
  • Google: best model on all leaderboards with Gemini and defined A2A (Agent-to-Agent)
  • Microsoft: let's build an open AI ecosystem with MCP and A2A; releases support for A2A and MCP in VS Code
  • DeepSeek: after they pulled off DeepSeek V3, DeepSeek released the open model DeepSeek-Prover-V2, which tackles advanced theorem proving, achieving an 88.9% pass rate on the MiniF2F-test benchmark (Olympiad/AIME-level theorems) and solving 49 out of 658 problems on the new PutnamBench. This means that DeepSeek is cracking the reasoning part of LLMs.

And OpenAI:

Seeing all those signals, we conclude: the AI future will be open!

"Bar chart made in Altair with Financial Times style"

"#30DayChartChallenge #Day24 Themeday: Financial times"

  • image: images/barchart_FT_style_altair.png
import pandas as pd
import altair as alt

The #30DayChartChallenge Day 24 calls for Financial Times themed charts. The bar chart that I will try to reproduce in Altair was published in the article: "Financial warfare: will there be a backlash against the dollar?"

This is the graph (without the FT background) that we want to reproduce:

I digitized the heights of the bars with WebPlotDigitizer:

data = """Bar0, 3.23
Bar1, 1.27
Bar2, 1.02
Bar3, 0.570
Bar4, 0.553
Bar5, 0.497
Bar6, 0.467
Bar7, 0.440
Bar8, 0.420
Bar9, 0.413
Bar10, 0.317
Bar11, 0.0433"""

data_values = [float(x.split()[1]) for x in data.splitlines()]

I put the values into a Pandas dataframe:

source = pd.DataFrame({
    'label': ['China', 'Japan', 'Switzerland', 'India', 'Taiwan', 'Hong Kong', 'Russia', 'South Korea', 'Saudi Arabia', 'Singapore', 'Eurozone', 'US'],
    'val': data_values
})

Now we build the graph and alter its style to resemble the Financial Times style:

square = alt.Chart().mark_rect(width=80, height=5, color='black', xOffset=-112, yOffset=10)

bars = alt.Chart(source).mark_bar(color='#174C7F', size=30).encode(
    x=alt.X('val:Q', title='', axis=alt.Axis(tickCount=6, domain=False, labelColor='darkgray'), scale=alt.Scale(domain=[0, 3.0])),
    y=alt.Y('label:N', title='', sort=alt.EncodingSortField(
            field="val:Q",  # The field to use for the sort
            op="sum",  # The operation to run on the field prior to sorting
            order="ascending"  # The order to sort in
        ), axis=alt.Axis(domainColor='lightgray',
                         labelFontSize=18, labelColor='darkgray', labelPadding=5,
                         labelFontStyle='Bold',
                         tickSize=18, tickColor='lightgray'))
).properties(title={
      "text": ["The biggest holders of FX reserves", ], 
      "subtitle": ["Official foreign exchange reserve (Jan 2022, $tn)"],
      "align": 'left',
      "anchor": 'start'
    },
    width=700,
    height=512
)

source_text = alt.Chart(
    {"values": [{"text": "Source: IMF, © FT"}]}
).mark_text(size=12, align='left', dx=-140, color='darkgrey').encode(
    text="text:N"
)

# from https://stackoverflow.com/questions/57244390/has-anyone-figured-out-a-workaround-to-add-a-subtitle-to-an-altair-generated-cha
chart = alt.vconcat(
    square,
    bars,
    source_text
).configure_concat(
    spacing=0
).configure(
    background='#fff1e5',
).configure_view(
    stroke=None, # Remove box around graph
).configure_title(
    # font='metricweb',
    fontSize=22,
    fontWeight=400,
    subtitleFontSize=18,
    subtitleColor='darkgray',
    subtitleFontWeight=400,
    subtitlePadding=15,
    offset=80,
    dy=40
)

chart

Trying to use the official Financial Times fonts

The chart looks quite similar to the original. The biggest difference is the typography: the Financial Times uses its own Metric Web and Financier Display Web fonts, while Altair can only use fonts available in the browser.

The fonts could be made available via CSS:

@font-face {
    font-family: 'metricweb';
    src: url('https://www.ft.com/__origami/service/build/v2/files/o-fonts-assets@1.5.0/MetricWeb-Regular.woff2') format('woff2');
}
from IPython.display import HTML
from google.colab.output import _publish as publish
publish.css("""@font-face {
    font-family: 'metricweb';
    src: url('https://www.ft.com/__origami/service/build/v2/files/o-fonts-assets@1.5.0/MetricWeb-Regular.woff2') format('woff2');
}""")
square = alt.Chart().mark_rect(width=80, height=5, color='black', xOffset=-112, yOffset=10)

bars = alt.Chart(source).mark_bar(color='#174C7F', size=30).encode(
    x=alt.X('val:Q', title='', axis=alt.Axis(tickCount=6, domain=False), scale=alt.Scale(domain=[0, 3.0])),
    y=alt.Y('label:N', title='', sort=alt.EncodingSortField(
            field="val:Q",  # The field to use for the sort
            op="sum",  # The operation to run on the field prior to sorting
            order="ascending"  # The order to sort in
        ), axis=alt.Axis(domainColor='lightgray',
                         labelFontSize=18, labelColor='darkgray', labelPadding=5,
                         labelFontStyle='Bold',
                         tickSize=18, tickColor='lightgray'))
).properties(title={
      "text": ["The biggest holders of FX reserves", ], 
      "subtitle": ["Official foreign exchange reserve (Jan 2022, $tn)"],
      "align": 'left',
      "anchor": 'start'
    },
    width=700,
    height=512
)

source_text = alt.Chart(
    {"values": [{"text": "Source: IMF, © FT"}]}
).mark_text(size=12, align='left', dx=-140, color='darkgrey').encode(
    text="text:N"
)

# from https://stackoverflow.com/questions/57244390/has-anyone-figured-out-a-workaround-to-add-a-subtitle-to-an-altair-generated-cha
chart = alt.vconcat(
    square,
    bars,
    source_text
).configure_concat(
    spacing=0
).configure(
    background='#fff1e5',
).configure_view(
    stroke=None, # Remove box around graph
).configure_title(
    font='metricweb',
    fontSize=22,
    fontWeight=400,
    subtitleFont='metricweb',
    subtitleFontSize=18,
    subtitleColor='darkgray',
    subtitleFontWeight=400,
    subtitlePadding=15,
    offset=80,
    dy=40
)

chart

For the moment, the rendered font does not look like Metric Web at all :-(

A second minor difference is the alignment of the 0.0 and 3.0 labels on the x-axis. In the original, those labels are centered; Altair aligns 0.0 to the left and 3.0 to the right.
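If I revisit this, the likely culprit is Vega-Lite's default flush alignment of the first and last labels on a continuous axis. A possible fix, untested here, would be to disable it on the x encoding:

# Untested sketch: turn off flush alignment so the 0.0 and 3.0 labels
# stay centered under their ticks, as in the FT original.
x = alt.X('val:Q', title='',
          axis=alt.Axis(tickCount=6, domain=False, labelColor='darkgray',
                        labelFlush=False),
          scale=alt.Scale(domain=[0, 3.0]))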


"Reconstructing Economist graph with Altair"

"#30DayChartChallenge #altair #day12"

  • image: images/Economist_stye%3B_30dayschartchallenge_day12.png

In an Economist article "The metamorphosis: How Jeremy Corbyn took control of Labour", the following graph appeared:

Later, Sarah Leo, data visualiser at The Economist, improved the graph to:

The rationale behind this improvement is discussed in her article: 'Mistakes, we made a few'.

In this article, I show how the visualisation library Altair can be used to reconstruct the improved graph.

import numpy as np
import pandas as pd
import altair as alt

Read the data for the graph into a Pandas dataframe:

df = pd.read_csv('http://infographics.economist.com/databank/Economist_corbyn.csv').dropna()

This is how the data looks:

df
Page Average number of likes per Facebook post 2016
0 Jeremy Corbyn 5210.0
1 Labour Party 845.0
2 Momentum 229.0
3 Owen Smith 127.0
4 Andy Burnham 105.0
5 Saving Labour 56.0

A standard bar graph in Altair gives this:

alt.Chart(df).mark_bar().encode(
    x='Average number of likes per Facebook post 2016:Q',
    y='Page:O'
)

The message of the graph is that Jeremy Corbyn has by far the most likes per Facebook post in 2016. There are a number of improvements possible:

The numbers on the x-axis are multiples of thousands. In the spirit of removing as much ink as possible, let's rescale the x-axis by a factor of 1000. The label 'Page' on the y-axis is superfluous; let's remove it.

df['page1k'] = df['Average number of likes per Facebook post 2016']/1000.0

After scaling, the graph looks like this:

alt.Chart(df).mark_bar().encode(
    x=alt.X('page1k', title='Average number of likes per Facebook post 2016'),
    y=alt.Y('Page:O', title='')
)

A third improvement is to sort the bars from high to low. This supports the message that Jeremy Corbyn has by far the most likes.

alt.Chart(df).mark_bar().encode(
    x=alt.X('page1k:Q', title='Average number of likes per Facebook post 2016'),
    y=alt.Y('Page:O', title='', sort=alt.EncodingSortField(
            field="Average number of likes per Facebook post 2016:Q",  # The field to use for the sort
            op="sum",  # The operation to run on the field prior to sorting
            order="ascending"  # The order to sort in
        ))
)

Now we see that we have too many ticks on the x-axis. We can add a scale and format the axis labels as integers to cope with that. While adding markup for the x-axis, we add orient='top', which moves the axis title and labels to the top of the graph.

alt.Chart(df).mark_bar().encode(
    x=alt.X('page1k:Q', title='Average number of likes per Facebook post 2016',
            axis=alt.Axis(title='Average number of likes per Facebook post 2016', orient="top", format='d', values=[1,2,3,4,5,6]),
            scale=alt.Scale(round=True, domain=[0,6])),
    y=alt.Y('Page:O', title='', sort=alt.EncodingSortField(
            field="Average number of likes per Facebook post 2016:Q",  # The field to use for the sort
            op="sum",  # The operation to run on the field prior to sorting
            order="ascending"  # The order to sort in
        ))
)

Next, we want to remove the box around the graph as it adds nothing extra. We do that by setting stroke to None in configure_view. We also adjust the x-axis title to make clear that the numbers are multiples of thousands.

alt.Chart(df).mark_bar().encode(
    x=alt.X('page1k:Q', title="Average number of likes per Facebook post 2016  ('000)",
            axis=alt.Axis(title='Average number of likes per Facebook post 2016', orient="top", format='d', values=[1,2,3,4,5,6]),
            scale=alt.Scale(round=True, domain=[0,6])),
    y=alt.Y('Page:O', title='', sort=alt.EncodingSortField(
            field="Average number of likes per Facebook post 2016:Q",  # The field to use for the sort
            op="sum",  # The operation to run on the field prior to sorting
            order="ascending"  # The order to sort in
        ))
).configure_view(
    stroke=None, # Remove box around graph
)

Next we try to left align the y-axis labels:

alt.Chart(df).mark_bar().encode(
    x=alt.X('page1k:Q',
            axis=alt.Axis(title="Average number of likes per Facebook post 2016  ('000)", orient="top", format='d', values=[1,2,3,4,5,6]),
            scale=alt.Scale(round=True, domain=[0,6])),
    y=alt.Y('Page:O', title='', sort=alt.EncodingSortField(
            field="Average number of likes per Facebook post 2016:Q",  # The field to use for the sort
            op="sum",  # The operation to run on the field prior to sorting
            order="ascending"  # The order to sort in
        ))
).configure_view(
    stroke=None, # Remove box around graph
).configure_axisY(
    labelPadding=70, 
    labelAlign='left'
)

Now, we apply the Economist style:

square = alt.Chart().mark_rect(width=50, height=18, color='#EB111A', xOffset=-105, yOffset=10)

bars = alt.Chart(df).mark_bar().encode(
    x=alt.X('page1k:Q',
            axis=alt.Axis(title="", orient="top", format='d', values=[1,2,3,4,5,6], labelFontSize=14),
            scale=alt.Scale(round=True, domain=[0,6])),
    y=alt.Y('Page:O', title='', sort=alt.EncodingSortField(
            field="Average number of likes per Facebook post 2016:Q",  # The field to use for the sort
            op="sum",  # The operation to run on the field prior to sorting
            order="ascending"  # The order to sort in
        ),
        # Based on https://stackoverflow.com/questions/66684882/color-some-x-labels-in-altair-plot
        axis=alt.Axis(labelFontSize=14, labelFontStyle=alt.condition('datum.value == "Jeremy Corbyn"', alt.value('bold'), alt.value('italic'))))
).properties(title={
      "text": ["Left Click", ], 
      "subtitle": ["Average number of likes per Facebook post\n", "2016, '000"],
      "align": 'left',
      "anchor": 'start'
    }
)

source = alt.Chart(
    {"values": [{"text": "Source: Facebook"}]}
).mark_text(size=12, align='left', dx=-120, color='darkgrey').encode(
    text="text:N"
)

# from https://stackoverflow.com/questions/57244390/has-anyone-figured-out-a-workaround-to-add-a-subtitle-to-an-altair-generated-cha
chart = alt.vconcat(
    square,
    bars,
    source
).configure_concat(
    spacing=0
).configure(
    background='#D9E9F0'
).configure_view(
    stroke=None, # Remove box around graph
).configure_axisY(
    labelPadding=110,
    labelAlign='left',
    ticks=False,
    grid=False
).configure_title(
    fontSize=22,
    subtitleFontSize=18,
    offset=30,
    dy=30
)

chart

The only thing I could not reproduce with Altair is the light band around the first label and bar. For those final touches, I think it's better to export the graph and add them with a tool such as Inkscape or Illustrator.
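A minimal sketch of that export step (the file name is just an example; saving to SVG or PNG requires a saver backend such as vl-convert or altair_saver to be installed):

# Export the chart so the finishing touches (the light band behind the first
# label and bar) can be added in Inkscape or Illustrator.
chart.save('economist_corbyn.svg')  # vector output, easiest to edit by hand
chart.save('economist_corbyn.png')  # raster output, if that is enough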

Comparing Rt numbers for Belgium on 09-12-2020

| Model | Based on | URL | Rt | Date |
|---|---|---|---|---|
| by Niel Hens | Cases | https://gjbex.github.io/DSI_UHasselt_covid_dashboard/ | 0.96 | 20-12-05 |
| Cori et al. (2013) | Hospitalisations | https://covid-19.sciensano.be/sites/default/files/Covid19/COVID-19_Weekly_report_NL.pdf | 0.798 | 20-11-27 till 20-12-03 |
| RKI | Hospitalisations | https://datastudio.google.com/embed/u/0/reporting/c14a5cfc-cab7-4812-848c-0369173148ab/page/ZwmOB | 0.97 | 20-12-09 |
| rtlive | Cases | https://rtlive.de/global.html | 0.80 | 20-12-09 |
| epiforecast | Cases and Deaths | https://epiforecasts.io/covid/posts/national/belgium/ | 0.5 | 20-12-07 |
| Huisman et al. (2020) | Cases | https://ibz-shiny.ethz.ch/covid-19-re-international/ | 1.01 | 20-11-24 |
| Huisman et al. (2020) | Hospitalisations | https://ibz-shiny.ethz.ch/covid-19-re-international/ | 0.84 | 20-11-24 |
| RKI | Cases | https://twitter.com/BartMesuere/status/1336565641764089856 | 0.99 | 20-12-08 |
| Deforche (2020) | Hospitalisations and Deaths | https://twitter.com/houterkabouter/status/1336582281994055680 | 0.85 | 20-12-09 |
| SEIR | Hospitalisations and Deaths | https://twitter.com/vdwnico/status/1336557572254552065 | 1.5 | 20-12-09 |

Estimating the effective reproduction number in Belgium with the RKI method

Using the Robert Koch Institute method with serial interval of 4.

Every day, Bart Mesuere tweets a nice dashboard with the current numbers about COVID-19 in Belgium. This was the tweet on Wednesday 2020/11/04:

twitter: https://twitter.com/BartMesuere/status/1323881489864548352

It's nice to see that the effective reproduction number ($Re(t)$) is again below one. That means the epidemic is losing strength and the number of infections will start to decline. This occurred first on Tuesday 2020/11/03:

twitter: https://twitter.com/BartMesuere/status/1323519613855059968

I estimated $Re(t)$ earlier with the rt.live model in this notebook. There, $Re(t)$ was still estimated to be above one. Michael Osthege replied with simulation results from a further improved model:

twitter: https://twitter.com/theCake/status/1323211910481874944

In that estimation, $Re(t)$ was also not yet heading below one at the end of October.

In this notebook, we will implement a calculation based on the method of the Robert Koch Institute. The method is described and programmed in R in this blog post.

In that blog post there's a link to the LSHTM website with estimations for most places in the world. The estimation for Belgium is here.

According to that calculation, $Re(t)$ has already been below one for some days.

Load libraries and data

import numpy as np
import pandas as pd
df_tests = pd.read_csv('https://epistat.sciensano.be/Data/COVID19BE_tests.csv', parse_dates=['DATE'])
df_cases = pd.read_csv('https://epistat.sciensano.be/Data/COVID19BE_CASES_AGESEX.csv', parse_dates=['DATE'])
df_cases
DATE PROVINCE REGION AGEGROUP SEX CASES
0 2020-03-01 Antwerpen Flanders 40-49 M 1
1 2020-03-01 Brussels Brussels 10-19 F 1
2 2020-03-01 Brussels Brussels 10-19 M 1
3 2020-03-01 Brussels Brussels 20-29 M 1
4 2020-03-01 Brussels Brussels 30-39 F 1
... ... ... ... ... ... ...
36279 NaT VlaamsBrabant Flanders 40-49 M 3
36280 NaT VlaamsBrabant Flanders 50-59 M 1
36281 NaT WestVlaanderen Flanders 20-29 F 1
36282 NaT WestVlaanderen Flanders 50-59 M 3
36283 NaT NaN NaN NaN NaN 1

36284 rows × 6 columns

Reformat data into Rtlive format

df_cases_per_day = (df_cases
   .dropna(subset=['DATE'])
   .assign(region='Belgium')
   .groupby(['region', 'DATE'], as_index=False)
   .agg(cases=('CASES', 'sum'))
   .rename(columns={'DATE':'date'})
   .set_index(["region", "date"])
)

What's in our basetable:

df_cases_per_day
cases
region date
Belgium 2020-03-01 19
2020-03-02 19
2020-03-03 34
2020-03-04 53
2020-03-05 81
... ...
2020-11-01 2660
2020-11-02 13345
2020-11-03 11167
2020-11-04 4019
2020-11-05 5

250 rows × 1 columns

Let's plot the number of cases as a function of time.

ax = df_cases_per_day.loc['Belgium'].plot(figsize=(18,6))
ax.set(ylabel='Number of cases', title='Number of cases for covid-19 and number of positives in Belgium');

png

We see that the last days are not yet complete. Let's cut off the last two days of reporting.

import datetime
from dateutil.relativedelta import relativedelta

Calculate the date two days ago:

# today_minus_two = datetime.date.today() + relativedelta(days=-2)
today_minus_two = datetime.date(2020, 11, 3) # Fix the day
today_minus_two.strftime("%Y-%m-%d")
'2020-11-03'

Replot the cases:

ax = df_cases_per_day.loc['Belgium'][:today_minus_two].plot(figsize=(18,6))
ax.set(ylabel='Number of cases', title='Number of cases for covid-19 and number of positives in Belgium');

png

Select the Belgium region:

region = 'Belgium'
df = df_cases_per_day.loc[region][:today_minus_two]
df
cases
date
2020-03-01 19
2020-03-02 19
2020-03-03 34
2020-03-04 53
2020-03-05 81
... ...
2020-10-30 15185
2020-10-31 6243
2020-11-01 2660
2020-11-02 13345
2020-11-03 11167

248 rows × 1 columns

Check the types of the columns:

df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 248 entries, 2020-03-01 to 2020-11-03
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   cases   248 non-null    int64
dtypes: int64(1)
memory usage: 3.9 KB

Robert Koch Institute method

A basic method to calculate the effective reproduction number is described (among others) in this blogpost. I included the relevant paragraph:

In a recent report (an der Heiden and Hamouda 2020) the RKI described their method for computing R as part of the COVID-19 outbreak as follows (p. 13): For a constant generation time of 4 days, one obtains R as the ratio of new infections in two consecutive time periods each consisting of 4 days. Mathematically, this estimation could be formulated as part of a statistical model:

$$y_{s+4} | y_{s} \sim Po(R \cdot y_{s}), s= 1,2,3,4$$

where $y_{1}, \ldots, y_{4}$ are considered as fixed. From this we obtain

$$\hat{R}_{RKI} = \sum_{s=1}^{4} y_{s+4} / \sum_{s=1}^{4} y_{s}$$

Somewhat arbitrary, we denote by $Re(t)$ the above estimate for R when $s=1$ corresponds to time $t-8$, i.e. we assign the obtained value to the last of the 8 values used in the computation.

In Python, we define a lambda function that we apply on a rolling window. Since indexes start from zero, we calculate:

$$\hat{R}_{RKI} = \sum_{s=0}^{3} y_{s+4} / \sum_{s=0}^{3} y_{s}$$

rt = lambda y: np.sum(y[4:])/np.sum(y[:4])
df.rolling(8).apply(rt)
cases
date
2020-03-01 NaN
2020-03-02 NaN
2020-03-03 NaN
2020-03-04 NaN
2020-03-05 NaN
... ...
2020-10-30 1.273703
2020-10-31 0.929291
2020-11-01 0.601838
2020-11-02 0.499806
2020-11-03 0.475685

248 rows × 1 columns

The first values are NaN because the window extends before the start of the data. If we plot the result, it looks like this:

ax = df.rolling(8).apply(rt).plot(figsize=(16,4), label='Re(t)')
ax.set(ylabel='Re(t)', title='Effective reproduction number estimated with RKI method')
ax.legend(['Re(t)']);

png

To avoid the spikes due to weekend reporting issues, I first apply a rolling mean with a window of 7 days:

ax = df.rolling(7).mean().rolling(8).apply(rt).plot(figsize=(16,4), label='Re(t)')
ax.set(ylabel='Re(t)', title='Effective reproduction number estimated with RKI method after rolling mean on window of 7 days')
ax.legend(['Re(t)']);

png

Interactive visualisation in Altair

import altair as alt

alt.Chart(df.rolling(7).mean().rolling(8).apply(rt).fillna(0).reset_index()).mark_line().encode(
    x=alt.X('date:T'),
    y=alt.Y('cases', title='Re(t)'),
    tooltip=['date:T', alt.Tooltip('cases', format='.2f')]
).transform_filter(
    alt.datum.date > alt.expr.toDate('2020-03-13')
).properties(
    width=600,
    title='Effective reproduction number in Belgium based on Robert-Koch Institute method'
)

Making the final visualisation in Altair

In the interactive Altair figure below, we show the $Re(t)$ for the last 14 days, again based on the 7-day rolling mean.

#collapse

df_plot = df.rolling(7).mean().rolling(8).apply(rt).fillna(0).reset_index()
last_value = str(df_plot.iloc[-1]['cases'].round(2)) + ' ↓'
first_value = str(df_plot[df_plot['date'] == '2020-10-21'].iloc[0]['cases'].round(2)) # + ' ↑'
today_minus_15 = datetime.datetime.today() + relativedelta(days=-15)
today_minus_15_str = today_minus_15.strftime("%Y-%m-%d")

line = alt.Chart(df_plot).mark_line(point=True).encode(
    x=alt.X('date:T', axis=alt.Axis(title='Datum', grid=False)),
    y=alt.Y('cases', axis=alt.Axis(title='Re(t)', grid=False, labels=False, titlePadding=40)),
    tooltip=['date:T', alt.Tooltip('cases', title='Re(t)', format='.2f')]
).transform_filter(
    alt.datum.date > alt.expr.toDate(today_minus_15_str)
).properties(
    width=600,
    height=100
)

hline = alt.Chart(pd.DataFrame({'cases': [1]})).mark_rule().encode(y='cases')


label_right = alt.Chart(df_plot).mark_text(
    align='left', dx=5, dy=-10 , size=15
).encode(
    x=alt.X('max(date):T', title=None),
    text=alt.value(last_value),
)

label_left = alt.Chart(df_plot).mark_text(
    align='right', dx=-5, dy=-40, size=15
).encode(
    x=alt.X('min(date):T', title=None),
    text=alt.value(first_value),
).transform_filter(
    alt.datum.date > alt.expr.toDate(today_minus_15_str)
)

source = alt.Chart(
    {"values": [{"text": "Data source: Sciensano"}]}
).mark_text(size=12, align='left', dx=-57).encode(
    text="text:N"
)

alt.vconcat(line + label_left + label_right + hline, source).configure(
    background='#D9E9F0'
).configure_view(
    stroke=None, # Remove box around graph
).configure_axisY(
    ticks=False,
    grid=False,
    domain=False
).configure_axisX(
    grid=False,
    domain=False
).properties(title={
      "text": ['Effective reproduction number for the last 14 days in Belgium'], 
      "subtitle": [f'Estimation based on the number of cases until {today_minus_two.strftime("%Y-%m-%d")} after example of Robert Koch Institute with serial interval of 4'],
}
)
# .configure_axisY(
#     labelPadding=50,
# )

To check the calculation, here are the four values from eight to five days before the end, after applying the 7-day rolling mean:

df.rolling(7).mean().iloc[-8:-4]
cases
date
2020-10-27 16067.571429
2020-10-28 16135.857143
2020-10-29 15744.571429
2020-10-30 15218.000000

Those must be added together:

df.rolling(7).mean().iloc[-8:-4].sum()
cases    63166.0
dtype: float64

And here are the last four values:

df.rolling(7).mean().iloc[-4:]
cases
date
2020-10-31 14459.428571
2020-11-01 14140.428571
2020-11-02 13213.428571
2020-11-03 11641.428571

These are added together:

df.rolling(7).mean().iloc[-4:].sum()
cases    53454.714286
dtype: float64

And now we divide those two sums to get the $Re(t)$ of 2020-11-03:

df.rolling(7).mean().iloc[-4:].sum()/df.rolling(7).mean().iloc[-8:-4].sum()
cases    0.846258
dtype: float64

This matches (as expected) the value in the graph. Let's compare with three other sources:

  1. Alas, it does not match the calculation reported by Bart Mesuere on 2020-11-03, based on the RKI model, which reports 0.96:

twitter: https://twitter.com/BartMesuere/status/1323519613855059968

  2. Also, the more elaborate model from rtlive global is not yet that optimistic. Mind that the rtlive model estimates $Re(t)$ from the number of tests instead of the number of cases; other reporting delays might be involved.

  3. epiforecasts.io has already been below 1 since the beginning of November.

Another possibility is that I made a mistake somewhere. If you spot it, please let me know.
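One quick check, not part of the original analysis, is to compare the estimate with and without the 7-day smoothing; a large gap would suggest that the smoothing window, rather than the RKI formula itself, explains part of the difference with the other sources:

# Hypothetical sanity check: RKI estimate on the raw daily cases versus on the
# 7-day smoothed series (the value plotted above).
raw_rt = df.rolling(8).apply(rt).iloc[-1]['cases']
smoothed_rt = df.rolling(7).mean().rolling(8).apply(rt).iloc[-1]['cases']
print(f'Raw: {raw_rt:.2f}, smoothed: {smoothed_rt:.2f}')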


Estimating the effective reproduction number in Belgium

Applying the Bayesian model from Rt.live on Belgium test data.

In this post, we estimate the effective reproduction number of COVID-19 in the northern and southern parts of Belgium. We apply the Bayesian model of rt.live to Belgian COVID-19 test data provided by the government.

Install needed packages and software

import numpy as np
import pandas as pd

The version of pymc3 installed by default in Colab is 3.7. The requirements for the Bayesian model of rt.live stipulate a more recent version, so we first uninstall version 3.7 and then install version 3.9.2 or later.

!pip  uninstall -y pymc3
Uninstalling pymc3-3.7:
  Successfully uninstalled pymc3-3.7
!pip install "pymc3>=3.9.2"
!pip install "arviz>=0.9.0"
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

import pymc3 as pm
import arviz as az
import numpy as np
import pandas as pd
from scipy import stats as sps

import theano
import theano.tensor as tt
from theano.tensor.signal.conv import conv2d

import seaborn as sns
sns.set_context('talk')
from scipy import stats
from matplotlib import pyplot as plt
print('Running on PyMC3 v{}'.format(pm.__version__))
Running on PyMC3 v3.9.3

Now that we are running a recent version of pymc3, we can install the model:

!pip install git+https://github.com/rtcovidlive/covid-model.git

Unfortunately, this does not work. I pasted the code in this notebook as a workaround.

#hide

# from covid.patients import get_delay_distribution
# p_delay = pd.read_csv('https://raw.githubusercontent.com/rtcovidlive/covid-model/master/data/p_delay.csv')
# From https://github.com/rtcovidlive/covid-model/blob/master/covid/models/generative.py
#collapse

import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

import pymc3 as pm
import arviz as az
import numpy as np
import pandas as pd
from scipy import stats as sps

import theano
import theano.tensor as tt
from theano.tensor.signal.conv import conv2d

# from covid.patients import get_delay_distribution


class GenerativeModel:
    version = "1.0.0"

    def __init__(self, region: str, observed: pd.DataFrame, buffer_days=10):
        """ Takes a region (ie State) name and observed new positive and
            total test counts per day. buffer_days is the default number of
            blank days we pad on the leading edge of the time series because
            infections occur long before reports and we need to infer values
            on those days """

        first_index = observed.positive.ne(0).argmax()
        observed = observed.iloc[first_index:]
        new_index = pd.date_range(
            start=observed.index[0] - pd.Timedelta(days=buffer_days),
            end=observed.index[-1],
            freq="D",
        )
        observed = observed.reindex(new_index, fill_value=0)

        self._trace = None
        self._inference_data = None
        self.model = None
        self.observed = observed
        self.region = region

    @property
    def n_divergences(self):
        """ Returns the number of divergences from the current trace """
        assert self.trace != None, "Must run sample() first!"
        return self.trace["diverging"].nonzero()[0].size

    @property
    def inference_data(self):
        """ Returns an Arviz InferenceData object """
        assert self.trace, "Must run sample() first!"

        with self.model:
            posterior_predictive = pm.sample_posterior_predictive(self.trace)

        _inference_data = az.from_pymc3(
            trace=self.trace,
            posterior_predictive=posterior_predictive,
        )
        _inference_data.posterior.attrs["model_version"] = self.version

        return _inference_data

    @property
    def trace(self):
        """ Returns the trace from a sample() call. """
        assert self._trace, "Must run sample() first!"
        return self._trace

    def _scale_to_positives(self, data):
        """ Scales a time series to have the same mean as the observed positives
            time series. This is useful because many of the series we infer are
            relative to their true values so we make them comparable by putting
            them on the same scale. """
        scale_factor = self.observed.positive.mean() / np.mean(data)
        return scale_factor * data

    def _get_generation_time_interval(self):
        """ Create a discrete P(Generation Interval)
            Source: https://www.ijidonline.com/article/S1201-9712(20)30119-3/pdf """
        mean_si = 4.7
        std_si = 2.9
        mu_si = np.log(mean_si ** 2 / np.sqrt(std_si ** 2 + mean_si ** 2))
        sigma_si = np.sqrt(np.log(std_si ** 2 / mean_si ** 2 + 1))
        dist = sps.lognorm(scale=np.exp(mu_si), s=sigma_si)

        # Discretize the Generation Interval up to 20 days max
        g_range = np.arange(0, 20)
        gt = pd.Series(dist.cdf(g_range), index=g_range)
        gt = gt.diff().fillna(0)
        gt /= gt.sum()
        gt = gt.values
        return gt

    def _get_convolution_ready_gt(self, len_observed):
        """ Speeds up theano.scan by pre-computing the generation time interval
            vector. Thank you to Junpeng Lao for this optimization.
            Please see the outbreak simulation math here:
            https://staff.math.su.se/hoehle/blog/2020/04/15/effectiveR0.html """
        gt = self._get_generation_time_interval()
        convolution_ready_gt = np.zeros((len_observed - 1, len_observed))
        for t in range(1, len_observed):
            begin = np.maximum(0, t - len(gt) + 1)
            slice_update = gt[1 : t - begin + 1][::-1]
            convolution_ready_gt[
                t - 1, begin : begin + len(slice_update)
            ] = slice_update
        convolution_ready_gt = theano.shared(convolution_ready_gt)
        return convolution_ready_gt

    def build(self):
        """ Builds and returns the Generative model. Also sets self.model """

        # p_delay = get_delay_distribution()
        p_delay = pd.read_csv('https://raw.githubusercontent.com/rtcovidlive/covid-model/master/data/p_delay.csv')
        nonzero_days = self.observed.total.gt(0)
        len_observed = len(self.observed)
        convolution_ready_gt = self._get_convolution_ready_gt(len_observed)
        x = np.arange(len_observed)[:, None]

        coords = {
            "date": self.observed.index.values,
            "nonzero_date": self.observed.index.values[self.observed.total.gt(0)],
        }
        with pm.Model(coords=coords) as self.model:

            # Let log_r_t walk randomly with a fixed prior of ~0.035. Think
            # of this number as how quickly r_t can react.
            log_r_t = pm.GaussianRandomWalk(
                "log_r_t",
                sigma=0.035,
                dims=["date"]
            )
            r_t = pm.Deterministic("r_t", pm.math.exp(log_r_t), dims=["date"])

            # For a given seed population and R_t curve, we calculate the
            # implied infection curve by simulating an outbreak. While this may
            # look daunting, it's simply a way to recreate the outbreak
            # simulation math inside the model:
            # https://staff.math.su.se/hoehle/blog/2020/04/15/effectiveR0.html
            seed = pm.Exponential("seed", 1 / 0.02)
            y0 = tt.zeros(len_observed)
            y0 = tt.set_subtensor(y0[0], seed)
            outputs, _ = theano.scan(
                fn=lambda t, gt, y, r_t: tt.set_subtensor(y[t], tt.sum(r_t * y * gt)),
                sequences=[tt.arange(1, len_observed), convolution_ready_gt],
                outputs_info=y0,
                non_sequences=r_t,
                n_steps=len_observed - 1,
            )
            infections = pm.Deterministic("infections", outputs[-1], dims=["date"])

            # Convolve infections to confirmed positive reports based on a known
            # p_delay distribution. See patients.py for details on how we calculate
            # this distribution.
            test_adjusted_positive = pm.Deterministic(
                "test_adjusted_positive",
                conv2d(
                    tt.reshape(infections, (1, len_observed)),
                    tt.reshape(p_delay, (1, len(p_delay))),
                    border_mode="full",
                )[0, :len_observed],
                dims=["date"]
            )

            # Picking an exposure with a prior that exposure never goes below
            # 0.1 * max_tests. The 0.1 only affects early values of Rt when
            # testing was minimal or when data errors cause underreporting
            # of tests.
            tests = pm.Data("tests", self.observed.total.values, dims=["date"])
            exposure = pm.Deterministic(
                "exposure",
                pm.math.clip(tests, self.observed.total.max() * 0.1, 1e9),
                dims=["date"]
            )

            # Test-volume adjust reported cases based on an assumed exposure
            # Note: this is similar to the exposure parameter in a Poisson
            # regression.
            positive = pm.Deterministic(
                "positive", exposure * test_adjusted_positive,
                dims=["date"]
            )

            # Save data as part of trace so we can access in inference_data
            observed_positive = pm.Data("observed_positive", self.observed.positive.values, dims=["date"])
            nonzero_observed_positive = pm.Data("nonzero_observed_positive", self.observed.positive[nonzero_days.values].values, dims=["nonzero_date"])

            positive_nonzero = pm.NegativeBinomial(
                "nonzero_positive",
                mu=positive[nonzero_days.values],
                alpha=pm.Gamma("alpha", mu=6, sigma=1),
                observed=nonzero_observed_positive,
                dims=["nonzero_date"]
            )

        return self.model

    def sample(
        self,
        cores=4,
        chains=4,
        tune=700,
        draws=200,
        target_accept=0.95,
        init="jitter+adapt_diag",
    ):
        """ Runs the PyMC3 model and stores the trace result in self.trace """

        if self.model is None:
            self.build()

        with self.model:
            self._trace = pm.sample(
                draws=draws,
                cores=cores,
                chains=chains,
                target_accept=target_accept,
                tune=tune,
                init=init,
            )

        return self

Get the data

Read the data from sciensano:

df_tests = pd.read_csv('https://epistat.sciensano.be/Data/COVID19BE_tests.csv', parse_dates=['DATE'])

What is in this dataframe?

df_tests
DATE PROVINCE REGION TESTS_ALL TESTS_ALL_POS
0 2020-03-01 Antwerpen Flanders 18 0
1 2020-03-01 BrabantWallon Wallonia 8 0
2 2020-03-01 Brussels Brussels 4 0
3 2020-03-01 Hainaut Wallonia 5 0
4 2020-03-01 Liège Wallonia 8 0
... ... ... ... ... ...
2935 2020-10-31 NaN NaN 58 13
2936 2020-10-31 Namur Wallonia 864 387
2937 2020-10-31 OostVlaanderen Flanders 927 106
2938 2020-10-31 VlaamsBrabant Flanders 1600 259
2939 2020-10-31 WestVlaanderen Flanders 512 82

2940 rows × 5 columns

We see that we have the number of tests (TESTS_ALL) and the number of positive tests (TESTS_ALL_POS) per date, province and region. In this post, we will analyse the three regions: Brussels, Flanders and Wallonia.

Preprocessing

Are there any NaN values?

df_tests.isnull().mean()
DATE             0.000000
PROVINCE         0.083333
REGION           0.083333
TESTS_ALL        0.000000
TESTS_ALL_POS    0.000000
dtype: float64

About eight percent of the rows have neither a PROVINCE nor a REGION assigned. What should we do with those? Ignore them? Let's look at how many there are:

ax = df_tests[df_tests['REGION'].isnull()].groupby(['DATE',], dropna=False).sum().plot(figsize=(18, 4))
ax.set(title='Number of covid-19 tests per day not attributed to a region in Belgium', ylabel='Number of tests');

png

#hide 
# (df_tests
#     .fillna('Nan')
#     .groupby(['DATE','REGION'], as_index=False)['TESTS_ALL']
#     .sum().set_index('REGION')['TESTS_ALL']
# #    .assign(total=lambda d:d[['Brussels',    'Flanders', 'Wallonia']].sum(axis=1))
# )

Here we create a function that distributes the unattributed tests according to the number of tests in each region. For example, suppose on a given day there are 20, 10 and 150 tests in Brussels, Flanders and Wallonia respectively, and 10 tests are unattributed. Then we add 10 * (10/(20+10+150)) = 10 * (10/180) = 100/180 ≈ 0.55 tests to Flanders. The total number of tests for Flanders becomes 10.55, which we round to the nearest integer: 11. We do the same for the other regions, Brussels and Wallonia. So we distribute the 10 unattributed tests weighted according to the number of tests in each region.

def redistribute(g, col):
  gdata = g.groupby('REGION')[col].sum()
  gdata.loc['Brussels'] += gdata.loc['Nan'] * (gdata.loc['Brussels']/(gdata.loc['Brussels'] + gdata.loc['Flanders'] + gdata.loc['Wallonia']))
  gdata.loc['Flanders'] += gdata.loc['Nan'] * (gdata.loc['Flanders']/(gdata.loc['Brussels'] + gdata.loc['Flanders'] + gdata.loc['Wallonia']))
  gdata.loc['Wallonia'] += gdata.loc['Nan'] * (gdata.loc['Wallonia']/(gdata.loc['Brussels'] + gdata.loc['Flanders'] + gdata.loc['Wallonia']))
  gdata.drop(index='Nan', inplace=True)
  gdata = np.round(gdata.fillna(0)).astype(int)
  return gdata
# Redistribute the nan for the column TESTS_ALL
df_tests_all = (df_tests
    .fillna('Nan')
    .groupby(['DATE'])
    .apply(redistribute, 'TESTS_ALL')
    .stack()
    .reset_index()
    .rename(columns={'DATE':'date', 'REGION':'region', 0:'total'})
)
# Redistribute the nan for the column TESTS_ALL_POS
df_tests_positive = (df_tests
    .fillna('Nan')
    .groupby(['DATE'])
    .apply(redistribute, 'TESTS_ALL_POS')
    .stack()
    .reset_index()
    .rename(columns={'DATE':'date', 'REGION':'region', 0:'positive'})
)
#hide

# (df_tests
#     .fillna('Nan')
#     .groupby(['DATE','REGION'])['TESTS_ALL']
#     .sum().to_frame().unstack()
#  #   .rename(pandas.index.get_level_values(0), axis='columns')
#     .assign(total=lambda d:d.drop(columns=('TESTS_ALL', 'Nan')).sum(axis=1))
# )
# Combine the total number of tests and the number of positive tests into a basetable
df_tests_per_region_day = pd.concat([df_tests_all, df_tests_positive['positive']], axis=1).set_index(['region', 'date'])
# Check if the basetable is ok
assert df_tests_per_region_day.isnull().sum().sum() == 0, 'There are nan in the basetable'
#hide

# Reformat data into Rtlive format
# df_tests_per_region_day = (df_tests
  #  .groupby(['REGION', 'DATE'], as_index=False)
  #  .agg(positive=('TESTS_ALL_POS', 'sum'), total=('TESTS_ALL', 'sum'))
  #  .rename(columns={'REGION':'region', 'DATE':'date'})
  #  .set_index(["region", "date"])
  #  )
df_tests_per_region_day
total positive
region date
Brussels 2020-03-01 4 0
Flanders 2020-03-01 57 0
Wallonia 2020-03-01 21 0
Brussels 2020-03-02 17 3
Flanders 2020-03-02 259 7
... ... ...
2020-10-30 38154 6531
Wallonia 2020-10-30 21414 9011
Brussels 2020-10-31 1246 392
Flanders 2020-10-31 3542 531
Wallonia 2020-10-31 2827 951

735 rows × 2 columns

# What regions do we have in the table ?
df_tests_per_region_day.index.get_level_values(0).unique().to_list()
['Brussels', 'Flanders', 'Wallonia']

Re(t) for Flanders

region = 'Flanders'
ax = df_tests_per_region_day.loc[region].plot(figsize=(18,6))
ax.set(title=f'Number of tests for covid-19 and number of positives in {region}');

png

import datetime
from dateutil.relativedelta import relativedelta
# Remove last two days because tests are not yet fully reported
today_minus_two = datetime.datetime.today() + relativedelta(days=-2)
today_minus_two.strftime("%Y-%m-%d")
'2020-10-30'
ax = df_tests_per_region_day.loc[region][:today_minus_two].plot(figsize=(18,6))
ax.set(title=f'Number of tests for covid-19 and number of positives in {region}');

png

# Fit the model on the data
df = df_tests_per_region_day.loc[region][:today_minus_two]
gm = GenerativeModel(region, df)
gm.sample()
Only 200 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, seed, log_r_t]
100.00% [3600/3600 1:10:58<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 700 tune and 200 draw iterations (2_800 + 800 draws total) took 4260 seconds.





<__main__.GenerativeModel at 0x7f382109add8>
#collapse

def summarize_inference_data(inference_data: az.InferenceData):
    """ Summarizes an inference_data object into the form that we publish on rt.live """
    posterior = inference_data.posterior
    hdi_mass = 80
    hpdi = az.hdi(posterior.r_t, hdi_prob=hdi_mass / 100).r_t

    observed_positive = inference_data.constant_data.observed_positive.to_series()
    scale_to_positives = lambda data: observed_positive.mean() / np.mean(data) * data
    tests = inference_data.constant_data.tests.to_series()
    normalized_positive = observed_positive / tests.clip(0.1 * tests.max())

    summary = pd.DataFrame(
        data={
            "mean": posterior.r_t.mean(["draw", "chain"]),
            "median": posterior.r_t.median(["chain", "draw"]),
            f"lower_{hdi_mass}": hpdi[:, 0],
            f"upper_{hdi_mass}": hpdi[:, 1],
            "infections": scale_to_positives(
                posterior.infections.mean(["draw", "chain"])
            ),
            "test_adjusted_positive": scale_to_positives(
                posterior.test_adjusted_positive.mean(["draw", "chain"])
            ),
            "test_adjusted_positive_raw": scale_to_positives(normalized_positive),
            "positive": observed_positive,
            "tests": tests,
        },
        index=pd.Index(posterior.date.values, name="date"),
    )
    return summary
result = summarize_inference_data(gm.inference_data)
100.00% [800/800 00:12<00:00]
fig, ax = plt.subplots(figsize=(12, 8))
result.infections.plot(c="C2", label="Expected primary infections")
result.test_adjusted_positive.plot(c="C0", label="Expected positive tests if tests were constant")
result.test_adjusted_positive_raw.plot(c="C1", alpha=.5, label="Expected positive tests", style="--")
gm.observed.positive.plot(c="C7", alpha=.7, label="Reported positive tests")
fig.set_facecolor("w")
ax.legend();
ax.set(title=f"rt.live model inference for {region}", ylabel="number of cases")
sns.despine();

png

fig, ax = plt.subplots(figsize=(12, 8))

ax.set(title=f"Effective reproduction number for {region}", ylabel="$R_e(t)$")
samples = gm.trace["r_t"]
x = result.index
cmap = plt.get_cmap("Reds")
percs = np.linspace(51, 99, 40)
colors = (percs - np.min(percs)) / (np.max(percs) - np.min(percs))
samples = samples.T

result["median"].plot(c="k", ls='-')

for i, p in enumerate(percs[::-1]):
    upper = np.percentile(samples, p, axis=1)
    lower = np.percentile(samples, 100-p, axis=1)
    color_val = colors[i]
    ax.fill_between(x, upper, lower, color=cmap(color_val), alpha=.8)

ax.axhline(1.0, c="k", lw=1, linestyle="--")
sns.despine();

png

Re(t) for Wallonia

region = 'Wallonia'
ax = df_tests_per_region_day.loc[region].plot(figsize=(18,6))
ax.set(title=f'Number of tests for covid-19 and number of positives in {region}');

png

# Remove last two days because tests are not yet fully reported
today_minus_two = datetime.datetime.today() + relativedelta(days=-2)
today_minus_two.strftime("%Y-%m-%d")
'2020-10-30'
ax = df_tests_per_region_day.loc[region][:today_minus_two].plot(figsize=(18,6))
ax.set(title=f'Number of tests for covid-19 and number of positives in {region}');

png

df = df_tests_per_region_day.loc[region][:today_minus_two]
gm = GenerativeModel(region, df)
gm.sample()
Only 200 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, seed, log_r_t]
100.00% [3600/3600 1:12:49<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 700 tune and 200 draw iterations (2_800 + 800 draws total) took 4371 seconds.





<__main__.GenerativeModel at 0x7f380ac35c88>
result = summarize_inference_data(gm.inference_data)
100.00% [800/800 00:12<00:00]
fig, ax = plt.subplots(figsize=(12, 8))
result.infections.plot(c="C2", label="Expected primary infections")
result.test_adjusted_positive.plot(c="C0", label="Expected positive tests if tests were constant")
result.test_adjusted_positive_raw.plot(c="C1", alpha=.5, label="Expected positive tests", style="--")
gm.observed.positive.plot(c="C7", alpha=.7, label="Reported positive tests")
fig.set_facecolor("w")
ax.legend();
ax.set(title=f"rt.live model inference for {region}", ylabel="number of cases")
sns.despine();

png

fig, ax = plt.subplots(figsize=(12, 8))

ax.set(title=f"Effective reproduction number for {region}", ylabel="$R_e(t)$")
samples = gm.trace["r_t"]
x = result.index
cmap = plt.get_cmap("Reds")
percs = np.linspace(51, 99, 40)
colors = (percs - np.min(percs)) / (np.max(percs) - np.min(percs))
samples = samples.T

result["median"].plot(c="k", ls='-')

for i, p in enumerate(percs[::-1]):
    upper = np.percentile(samples, p, axis=1)
    lower = np.percentile(samples, 100-p, axis=1)
    color_val = colors[i]
    ax.fill_between(x, upper, lower, color=cmap(color_val), alpha=.8)

ax.axhline(1.0, c="k", lw=1, linestyle="--")
sns.despine();

png


My talk at Data Science Leuven

Links to the video and slides of the talk, and thanks to the people involved.

Talk material

On 23 April 2020, I was invited to give a talk at Data Science Leuven. I talked about how you can explore and explain the results of a clustering exercise. The target audience is data scientists who have notions of how to cluster data and want to improve their skills.

The video recording is on YouTube:

youtube: https://youtu.be/hk0arqhcX9U?t=3570

You can see the slides here: slides

The talk itself is based on this notebook, which I published on this blog yesterday and used as a demo during the talk.

The host of the conference was Istvan Hajnal. He tweeted the following:

twitter: https://twitter.com/dsleuven/status/1253391470444371968

He also dropped the R from my family name NachteRgaele. Troubles with R, it's becoming the story of my life... 😂 Behind the scenes, Kris Peeters calmly handled the live streaming. 👍 Almost PyData quality! Big thanks to the whole Data Science Leuven team, who do all this on a voluntary basis.

Standing on the shoulders of giants

This talk would not have been possible without the awesome Altair visualisation library made by Jake VanderPlas. Secondly, it builds upon the open-source SHAP library made by Scott Lundberg. Those two libraries have had a major impact on my daily work as a data scientist at Colruyt Group. They inspired me to try to give back to the open source community with this talk. 🤘

If you want to learn how to use Altair, I recommend the tutorial made by Vincent Warmerdam on his calmcode site: https://calmcode.io/altair/introduction.htm

I would also like to thank my colleagues at work who endured the dry run of this talk and suggested using a classifier to explain the clustering result. Top team!

Awesome fastpages

Finally, this blog is built with the awesome fastpages. I can now share a rendered Jupyter notebook, with working interactive demos, that can be opened in MyBinder or Google Colab with one click of a button. This means that readers can directly tinker with the code and methods discussed in the talk. All you need is a browser and an internet connection. So thank you Jeremy Howard, Hamel Husain, and the fastdotai team for pulling this off. Thank you Hamel Husain for your GitHub Actions. I can't stress enough how awesome this all is.