2019-06-17T22:40:32.169Z ~ 4 min read

Calculating the Probability of some Event (E)


Probability can be used to describe the relationship between given values and the chance/probability of some event occuring. This post will outline how we describe probability of some event given a fitted distribution. We will fit a distribution to then calculate the survival function using the z-score of our limit given a fitted distribution.

from scipy import stats
import pandas as pd
import numpy as np
import altair as alt
alt.data_transformers.enable('json')
DataTransformerRegistry.enable('json')
df = pd.read_csv('example_parameter.csv')

maximum likelihood estimate

In essence we either try distributions until it fits the data the best, or we know a analytical solution for the distribution at hand and get the best fitted distribution of the data right away.

mu, std = stats.norm.fit(df)
mu, std
(5.669365294764912, 1.7405441476684247)
# x between -3 and above with .001 steps.
x_axis = np.arange(
    -3, 
    mu + 5 * std, 
    0.001
)

# Mean = mu, SD = std.
y_axis = stats.norm.pdf(
    x_axis,
    mu,
    std
)

# plot
df = pd.DataFrame()
df['x'] = x_axis
df['y'] = y_axis
alt.Chart(data=df).mark_line().encode(
    x='x:Q',
    y='y:Q'
).properties(
    title='Probability'
)

png

since we have now fitted a curve around a mu (i.e. not around 0) which is the standard way of showing a distribution. We will here demonstrate how to calculate the likelihood of being above a certain value based on the data that we have seen using the survival function.

2. probability of the survival function

is the where cdf is the cumulative distribution function

# plot
x_axis = x_axis

y_axis = 1 - stats.norm.cdf(x_axis, mu, std)

df = pd.DataFrame()
df['parameter'] = x_axis
df['1 - cdf'] = y_axis
line = alt.Chart(df).mark_line().encode(
    x='parameter:Q',
    y='1 - cdf:Q'
).properties(
    title='Probability of being above limit'
)

vertical_y = np.arange(
    0, 
    2, 
    0.001
)
df = pd.DataFrame()
df['vertical_y'] = vertical_y
df['limit'] = limit

threshold = pd.DataFrame([{"threshold": limit}])
rule = alt.Chart(threshold).mark_rule(color='firebrick').encode(
    alt.X('threshold:Q', title='parameter')
)

line + rule

png

3. take the z-score of the fitted distribution

Since each distribution is centered around we need to center our limit value against a normal distribution. This is where we calculate the z-score from the distribution.

which says how many standard deviations we are from its .

recall again that we want to see the probability of somethign accuring above the limit value. i.e. what is the probability that we exceed this value

# set the limit of the measure
limit = 9
z = (limit - mu) / std
z
1.9135594519085861

since we now have center our limit value around the we need to apply the calculation for the probability given a distrubution that we assume with this calculation.

meaning a and a

# x between -3 and above with .001 steps.
x_axis = np.arange(
    -3, 
    mu + 5 * std, 
    0.001
)

# Mean = mu, SD = std.
y_axis = stats.norm.pdf(
    x_axis,
    0,
    1
)

# plot
df = pd.DataFrame()
df['x'] = x_axis
df['y'] = y_axis
alt.Chart(data=df).mark_line().encode(
    x='x:Q',
    y='y:Q'
).properties(
    title='Probability'
)

png

# With z score we can now use a normal distribution for our calculations

# plot
x_axis = x_axis

y_axis = 1 - stats.norm.cdf(x_axis, 0, 1)

df = pd.DataFrame()
df['parameter'] = x_axis
df['1 - cdf'] = y_axis
line = alt.Chart(df).mark_line().encode(
    x='parameter:Q',
    y='1 - cdf:Q'
).properties(
    title='Probability of being above limit'
)

vertical_y = np.arange(
    0, 
    2, 
    0.001
)


threshold = pd.DataFrame([
    {"prev_threshold": limit},
    {"z_threshold": z}
])

rule = alt.Chart(threshold).mark_rule(color='firebrick').encode(
    x='prev_threshold:Q',
    opacity=alt.OpacityValue(0.6)
)

rule2 = alt.Chart(threshold).mark_rule(color='firebrick').encode(
    alt.X('z_threshold:Q'),
    size=alt.SizeValue(2)
)

line + (rule + rule2)

png

With the calculated z-score, we now have a way of also using the normal distribution without the need to explicitly tell for the distribution.

Back to the real world!!

Instead of writing the survival function which we will explicitly calculate the probability like so:

# score of the survival function
p = stats.norm.sf(limit, mu, std)
p
0.02783823512029548

We can now instead use the z-score for each call to a distribution.

# score of the survival function
p = stats.norm.sf(z)
p
0.02783823512029548

And we can see that it gives the same probability, just using different scales for the ,


Headshot of Moi

Hi, I'm Eric. I'm a software engineer and data scientist based in Lisbon. You can follow me on Twitter, see some of my work on GitHub,