# My Life Stats: I Tracked My Habits for a Year, and This Is What I Learned | by Pau Blasco i Roca | Nov, 2023

I first looked at the individual time series for four variables: Sleep, Studying, Socializing and Mood. I used Microsoft Excel to quickly draw some plots. They show the daily number of hours spent (blue) and the five-day moving average¹, MA(5) (red), which I considered a good smoothing window for my situation. The Mood variable was rated on a scale from 0 (awful!) to 10 (the greatest!).
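To illustrate the smoothing used in the plots, a five-day moving average can be computed with pandas' `rolling`. The numbers below are made up for the example, not my actual data:

```python
import pandas as pd

# hypothetical daily sleep hours for one week (illustrative values)
sleep = pd.Series([7.5, 6.0, 8.0, 7.0, 6.5, 9.0, 7.5])

# MA(5): the mean of the current day and the four days before it;
# the first four entries are NaN because the window is incomplete
ma5 = sleep.rolling(window=5).mean()
print(ma5.tolist())
```

The red MA(5) curve in each plot is just this rolling mean drawn over the daily values.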

Regarding the data contained in the footnote of each plot: the total is the sum of the values of the series, the mean is the arithmetic mean of the series, the STD is the standard deviation, and the relative deviation is the STD divided by the mean. Total: 2361h. Mean: 7.1h. STD: 1.1h. Relative deviation: 15.5% (image by author).
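These footnote statistics are easy to reproduce. A minimal sketch with NumPy, again using made-up hours rather than my real data:

```python
import numpy as np

# hypothetical hours for one week (illustrative values)
hours = np.array([7.0, 6.5, 8.0, 7.5, 6.0, 9.0, 7.0])

total = hours.sum()          # sum of the values of the series
mean = hours.mean()          # arithmetic mean
std = hours.std(ddof=1)      # sample standard deviation
rel_dev = std / mean         # relative deviation (STD divided by the mean)

print(f"Total: {total:.1f}h, Mean: {mean:.1f}h, "
      f"STD: {std:.1f}h, Relative deviation: {rel_dev:.1%}")
```

The relative deviation (also known as the coefficient of variation) is what lets us compare variability across series with very different means.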

All things accounted for, I did well enough with sleep. I had rough days, like everyone else, but I think the trend is pretty stable. In fact, it is one of the least-varying variables in my study. Total: 589.1h. Mean: 1.8h. STD: 2.2h. Relative deviation: 122% (image by author).

These are the hours I dedicated to my academic career. The series fluctuates a lot — finding balance between work and studying often means having to cram projects on the weekends — but still, I consider myself satisfied with it. Total: 1440.9h. Mean: 4.3h. STD: 4.7h. Relative deviation: 107% (image by author).

Regarding this table, all I can say is that I’m surprised. The grand total is greater than I expected, given that I’m an introvert. Of course, hours with my colleagues at college also count. In terms of variability, the STD is really high, which makes sense given the difficulty of establishing a routine around socializing.

This is the least variable series — its relative deviation is the lowest among my studied variables. A priori, I’m satisfied with the observed trend. I think it’s positive to keep a fairly stable mood — and even better if it’s a good one.

After looking at the trends for the main variables, I decided to dive deeper and study the potential correlations² between them. Since my goal was to mathematically model and predict (or at least explain) “Mood”, correlations were an important metric to consider. From them, I could extract relationships like the following: “the days that I study the most are the ones that I sleep the least”, “I usually study languages and music together”, etc.

Before we do anything else, let’s open up a Python file and import some key libraries for series analysis. I normally use aliases for them, as is common practice and makes things less verbose in the actual code.

```python
import pandas as pd               # 1.4.4
import numpy as np                # 1.22.4
import seaborn as sns             # 0.12.0
import matplotlib.pyplot as plt   # 3.5.2
from pmdarima import arima        # 2.0.4
```

We will carry out two different correlation studies. We will look into the Pearson Correlation Coefficient³ (for linear relationships between variables) and the Spearman Correlation Coefficient⁴ (which studies monotonic relationships between variables). We will be using their implementation⁵ in pandas.

## Pearson Correlation matrix

The Pearson Correlation Coefficient between two variables X and Y is computed as follows:

ρ(X, Y) = cov(X, Y) / (σ_X · σ_Y)

where cov is the covariance, σ_X is std(X) and σ_Y is std(Y).
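As a sanity check, the formula can be computed by hand and compared against pandas’ built-in implementation. The two series below are illustrative, not my tracked data:

```python
import numpy as np
import pandas as pd

# two illustrative series
x = pd.Series([7.0, 6.5, 8.0, 5.5, 7.5])
y = pd.Series([6.0, 5.0, 8.5, 4.0, 7.0])

# by hand: cov(X, Y) / (std(X) * std(Y))
manual = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

# pandas' built-in implementation
builtin = x.corr(y, method="pearson")

print(manual, builtin)
```

Note that the choice of `ddof` cancels out in the ratio, so sample and population conventions give the same coefficient.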

We can quickly calculate a correlation matrix, where every possible pairwise correlation is computed.

```python
# read and select the data
raw = pd.read_csv("final_stats.csv", sep=";")
numerics = raw.select_dtypes('number')

# compute the correlation matrix
corr = numerics.corr(method='pearson')

# generate the heatmap
sns.heatmap(corr, annot=True)

# draw the plot
plt.show()
```

This is the raw Pearson Correlation matrix obtained from my data.

And these are the significant values⁶ — the ones that are, with 95% confidence, different from zero. We perform a t-test⁷ with the following criterion: for each correlation value ρ, we discard it if

|ρ| < 2 / √n

where n is the sample size. We can recycle the code from before and add in this filter.
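For my n = 332 daily samples, this cutoff works out to roughly 0.11 — a quick sketch:

```python
import numpy as np

n = 332  # sample size (days tracked)

# rule-of-thumb 95% significance cutoff for a correlation coefficient
threshold = 2 / np.sqrt(n)

# correlations smaller than this (in absolute value) are treated as noise
print(round(threshold, 3))
```

In other words, any correlation weaker than about ±0.11 gets masked out of the heatmap.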

```python
# constants
N = 332  # number of samples
STEST = 2 / np.sqrt(N)

def significance_pearson(val):
    # True means "not significant" -> masked out of the heatmap
    if np.abs(val) < STEST:
        return True
    return False

# read data
raw = pd.read_csv("final_stats.csv", sep=";")
numerics = raw.select_dtypes('number')

# calculate correlation
corr = numerics.corr(method='pearson')

# prepare masks
mask = corr.copy().applymap(significance_pearson)
mask2 = np.triu(np.ones_like(corr, dtype=bool))  # remove upper triangle
mask_comb = np.logical_or(mask, mask2)

# plot the results
c = sns.heatmap(corr, annot=True, mask=mask_comb)
c.set_xticklabels(c.get_xticklabels(), rotation=-45)
plt.show()
```

Those that have been discarded could just be noise, wrongly suggesting trends or relationships. In any case, it’s better to dismiss a true relationship as meaningless than to consider meaningful one that isn’t — that is, we favor type II errors over type I errors. This is especially true in a study with rather subjective measurements. Filtered Pearson Correlation matrix. Non-significant values (and the upper triangle) have been filtered out. (image by author)

## Spearman’s rank correlation coefficient

The Spearman correlation coefficient can be calculated as follows:

ρ_s = cov(R(X), R(Y)) / (σ_R(X) · σ_R(Y))

where R indicates the rank variable⁸ — the rest of the variables are the same as described for the Pearson coefficient.
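Since Spearman’s coefficient is simply Pearson’s coefficient applied to the ranks, we can verify the definition directly. The series below are illustrative: y = x² is monotonic but non-linear, so Spearman gives exactly 1 where Pearson would not:

```python
import pandas as pd

# illustrative series with a monotonic but non-linear relationship
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])  # y = x**2

# Spearman = Pearson applied to the rank variables
manual = x.rank().corr(y.rank(), method="pearson")

# pandas' built-in implementation
builtin = x.corr(y, method="spearman")

print(manual, builtin)
```

This is exactly why Spearman captures monotonic relationships that Pearson’s linear measure can understate.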

As we did before, we can quickly compute the correlation matrix:

```python
# read and select the data
raw = pd.read_csv("final_stats.csv", sep=";")
numerics = raw.select_dtypes('number')

# compute the correlation matrix
corr = numerics.corr(method='spearman')  # pay attention to this change!

# generate the heatmap
sns.heatmap(corr, annot=True)

# draw the plot
plt.show()
```

This is the raw Spearman’s Rank Correlation matrix obtained from my data:

Let’s see which values are actually significant. The formula to check for significance is the following:

t = r · √((n − 2) / (1 − r²))

where r is Spearman’s coefficient. Here, t follows a Student’s t-distribution with n − 2 degrees of freedom.
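As a quick sketch, take an illustrative coefficient of r = 0.2 (not one of my actual values) with n = 332 — its t-statistic comfortably clears the 1.96 cutoff:

```python
import numpy as np

n = 332  # number of samples
r = 0.2  # an illustrative Spearman coefficient

# t-statistic: t = r * sqrt((n - 2) / (1 - r**2))
t = r * np.sqrt((n - 2) / (1 - r * r))

# significant at the 95% level if |t| > 1.96
print(abs(t) > 1.96)
```

With this many samples, even fairly weak correlations turn out to be statistically significant.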

Here, we will filter out all values whose t-statistic is lower (in absolute value) than 1.96. Again, the reason they are discarded is that we cannot be sure whether they are noise — random chance — or an actual trend. Let’s code it up:

```python
# constants
N = 332  # number of samples
TTEST = 1.96

def significance_spearman(val):
    # True means "not significant" -> masked out of the heatmap
    if val == 1:
        return True
    t = val * np.sqrt((N - 2) / (1 - val * val))
    if np.abs(t) < TTEST:
        return True
    return False

# read data
raw = pd.read_csv("final_stats.csv", sep=";")
numerics = raw.select_dtypes('number')

# calculate correlation
corr = numerics.corr(method='spearman')

# prepare masks
mask = corr.copy().applymap(significance_spearman)
mask2 = np.triu(np.ones_like(corr, dtype=bool))  # remove upper triangle
mask_comb = np.logical_or(mask, mask2)

# plot the results
c = sns.heatmap(corr, annot=True, mask=mask_comb)
c.set_xticklabels(c.get_xticklabels(), rotation=-45)
plt.show()
```

These are the significant values.