Evolution of burglary in Leuven. Is the trend downwards ?
Evolution of burglary in Leuven. Is the trend downwards ?
The local police shared a graph with the number of break-ins in Leuven per year. The article shows a graph with a downwards trendline. Can we conclude that the number of breakins is showing a downward trend based on those numbers? Let's construct a dataframe with the data from the graph.
import numpy as np
import pandas as pd
import altair as alt
df = pd.DataFrame({'year_int':[y for y in range(2006, 2020)],
'breakins':[1133,834,953,891,1006,1218,992,1079,1266,1112,713,669,730,644]})
df['year'] = pd.to_datetime(df['year_int'], format='%Y')
points = alt.Chart(df).mark_line(point=True).encode(
x='year', y='breakins', tooltip='breakins'
)
points + points.transform_regression('year', 'breakins').mark_line(
color='green'
).properties(
title='Regression trend on the number breakins per year in Leuven'
)
The article claims that the number of breakins stabilizes the last years. Let's perform a local regression to check that.
# https://opendatascience.com/local-regression-in-python
# Loess: https://gist.github.com/AllenDowney/818f6153ef316aee80467c51faee80f8
points + points.transform_loess('year', 'breakins').mark_line(
color='green'
).properties(
title='Local regression trend on the number breakins per year in Leuven'
)
But what about the trend line? Are we sure the trend is negative ? Bring in the code based on the blogpost The hacker's guide to uncertainty estimates to estimate the uncertainty.:
# Code from: https://erikbern.com/2018/10/08/the-hackers-guide-to-uncertainty-estimates.html
import scipy.optimize
import random
def model(xs, k, m):
return k * xs + m
def neg_log_likelihood(tup, xs, ys):
# Since sigma > 0, we use use log(sigma) as the parameter instead.
# That way we have an unconstrained problem.
k, m, log_sigma = tup
sigma = np.exp(log_sigma)
delta = model(xs, k, m) - ys
return len(xs)/2*np.log(2*np.pi*sigma**2) + \
np.dot(delta, delta) / (2*sigma**2)
def confidence_bands(xs, ys, nr_bootstrap):
curves = []
xys = list(zip(xs, ys))
for i in range(nr_bootstrap):
# sample with replacement
bootstrap = [random.choice(xys) for _ in xys]
xs_bootstrap = np.array([x for x, y in bootstrap])
ys_bootstrap = np.array([y for x, y in bootstrap])
k_hat, m_hat, log_sigma_hat = scipy.optimize.minimize(
neg_log_likelihood, (0, 0, 0), args=(xs_bootstrap, ys_bootstrap)
).x
curves.append(
model(xs, k_hat, m_hat) +
# Note what's going on here: we're _adding_ the random term
# to the predictions!
np.exp(log_sigma_hat) * np.random.normal(size=xs.shape)
)
lo, hi = np.percentile(curves, (2.5, 97.5), axis=0)
return lo, hi
# Make a plot with a confidence band
df['lo'], df['hi'] = confidence_bands(df.index, df['breakins'], 100)
ci = alt.Chart(df).mark_area().encode(
x=alt.X('year:T', title=''),
y=alt.Y('lo:Q'),
y2=alt.Y2('hi:Q', title=''),
color=alt.value('lightblue'),
opacity=alt.value(0.6)
)
chart = alt.Chart(df).mark_line(point=True).encode(
x='year', y='breakins', tooltip='breakins'
)
ci + chart + chart.transform_regression('year', 'breakins').mark_line(
color='red'
).properties(
title='95% Confidence band of the number of breakins per year in Leuven'
)
On the above chart, we see that a possitive trend might be possible as well.
import statsmodels.formula.api as smf
results = smf.ols('breakins ~ index', data=df.reset_index()).fit()
results.params
The most likely slope of the trend line is 23.17 breakins per year. But how sure are we that the trend is heading down ?
results.summary()
The analysis reveals that the slope of the best fitting regression line is 23 breakins less per year. However, the confidence interval of the trend is between -50.344 and 4.006. Also the p)value of the regression coefficient is 0.088. Meaning we have eight percent chance that the negative trend is by accident. Hence, based on the current data we are not 95% percent sure the trend is downwards. Hence we can not conclude, based on this data, that there is a negative trend. This corresponds with the width of the 95% certainty band drawn that allows for an upward trend line:
# Here are the confidence intervals of the regression
results.conf_int()
y_low = results.params['Intercept'] # ?ost likely value of the intercept
y_high = results.params['Intercept'] + results.conf_int()[1]['index'] * df.shape[0] # Value of upward trend for the last year
df_upward_trend = pd.DataFrame({'year':[df['year'].min(), df['year'].max()],
'breakins':[y_low, y_high]})
possible_upwards_trend = alt.Chart(df_upward_trend).mark_line(
color='green',
strokeDash=[10,10]
).encode(
x='year:T',
y=alt.Y('breakins:Q',
title='Number of breakins per year')
)
points = alt.Chart(df).mark_line(point=True).encode(x='year', y='breakins', tooltip='breakins')
(ci + points + points.transform_regression('year', 'breakins').mark_line(color='red')
+ possible_upwards_trend).properties(
title='Trend analysis on the number of breakins per year in Leuven, Belgium'
)
In the above graph, we see that a slight positive trend (green dashed line) is in the 95% confidence band on the regression coefficient. We are not sure that the trend on the number of breakins is downwards.