## Wednesday, June 13, 2012

### Is it weird that...

A friend recently posted "Is it weird that at least 3 of my friends have their 3rd anniversary today?". I was curious.  So, I constructed a Monte Carlo simulation to determine if it was, in fact, "weird".  Short answer: no (p ≈ 0.21).

If we assume that that 14% of weddings happen in June, 75% of those on a Saturday, and that relationship lengths are exponentially distributed with a mean of 8 years, then a frequency distribution for the number of independent couples out of a population of 649 having 2009-06-13 as their anniversary is:

1: 35.5%
2: 23.2%
3: 14.5%
4: 4.7%
5: 1.9%
6: 0.3%

The R code I used for this was:

# Probabilities of weddings in each month
# Data for weddings in the Netherlands in 1995 taken from
# http://g2.cvs.sourceforge.net/viewvc/g2/g2/g2/demo/bargraph/graphdata.py
pr_month = c(
2947, 3121, 4514,
5708, 9794, 12427,
6825, 8630, 12108,
6411, 4045, 4939 )
pr_month = pr_month/sum(pr_month)

# Unfortunately, I'm not able to find data on which week days are most common.
# All I've found is that Saturday is 'by far the most common', followed by
# Sunday, with weekday weddings being less common.
pr_day = list()
pr_day[["Monday"   ]] = 0.01
pr_day[["Tuesday"  ]] = 0.01
pr_day[["Wednesday"]] = 0.01
pr_day[["Thursday" ]] = 0.01
pr_day[["Friday"   ]] = 0.01
pr_day[["Saturday" ]] = 0.75
pr_day[["Sunday"   ]] = 0.20

# First marriages which end in divorce last an average of 8 years
# (see http://www.census.gov/prod/2005pubs/p70-97.pdf )
# I'll assume marriage lengths follow an exponential distribution with mean 8

# For performance, cache things we will re-use
month_days = list()
day_probs = list()

# Define a function which can generate a specified number of wedding dates,
# following the probabilities given above:
random_dates = function(n){
dates = rep(as.Date(NA), n)
ages = sort(floor(rexp(n, 1/8)),decreasing=T) # Sort to get good cache performance
month = sample(1:12, n, prob=pr_month, replace=T)

for (i in 1:length(ages)){
m_ix = sprintf('%i-%i-1',2012-ages[i],month[i])

if (is.null(month_days[[m_ix]])){
# Get a list of the days in the month
start = as.Date(m_ix)
stop = as.Date( ifelse(month[i]==12,
sprintf('%i-1-1',  2012-ages[i]+1),
sprintf('%i-%i-1', 2012-ages[i], month[i]+1)) )
}

if (is.null(day_probs[[m_ix]])){
# Figure probabilities for each day
day_prob = rep(NA, length(month_days[[m_ix]]))
for (j in 1:length(day_prob)){
day_prob[j] = pr_day[[ weekdays(month_days[[m_ix]][j]) ]]
}
day_probs[[m_ix]] <<- day_prob/sum(day_prob)
}

# Pick a day out of this month
dates[i] = sample(month_days[[m_ix]], 1, prob=day_probs[[m_ix]] )
}
dates
}

# Generate 1000 sets of 649 independent dates
#  Figure out how many times out of those 1000 simulations
#  there are more than three which fall on 13 June 2009.
count=rep(0,1000)
for (i in 1:1000){
# Progress meter:
if (i%%10==0) cat (i)
else if (i%%2==0) cat ('.')
if (i%%50==0) cat('\n')

count[i] = sum( random_dates(649) == as.Date('2009-06-13'))
}
cat('\n')

# Generate a frequency distribution
for( i in 1:max(count)){
cat(sprintf('%i: %.1f\n', i, 100*sum(count==i)/length(count)))
}