My personal COVID-19 experience: a data project

2022-08-12 1584 words 12 minutes

Covid knocked me out quite good. I contracted the virus right at the start of summer and spent more than a week in bed – with aching limbs, a sore throat and a body temperature of almost 40°C. Immediately after the first positive antigen test, I decided to monitor my body more closely and generate some data to play with later.

This post summarizes what I learned from turning my Covid infection into a data project. It covers accessing Garmin fitness data via GarminDB, manipulating timeseries with Pandas, and a simple correlation analysis (heart rate vs. body temperature) that rediscovers a physiological phenomenon that was news to me.

Data recording (body temperature + heart rate)

Even before developing severe symptoms, I decided to constantly wear my Garmin fitness watch and to take my temperature at least once per hour. My Garmin vívoactive 3 does an okay job at measuring heart rates. Although the signal is somewhat noisy, it works well for long-term monitoring. Each time, I measured my body temperature twice (under the tongue) and noted both values in a Google Sheet. The data collection part of this project emphasized an important caveat:

Manual data entry is challenging.

It is time-consuming, and you cannot really trust the resulting data, even if you are entering it yourself. When reviewing the data for my exploratory analysis, I had to clean up several issues, such as wrong dates and misplaced decimal points.
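Errors like these can be flagged automatically before any analysis. The following is a minimal sketch on made-up entries (not my real measurements). The plausible temperature range of 34 to 43 °C and the rule "timestamps must not go backwards" are assumptions chosen for illustration, not the exact checks I used.

```python
import pandas as pd

# Made-up manual entries: row 1 has a misplaced decimal point,
# row 2 has a wrong month, so row 3 appears to go back in time.
raw = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2022-06-20 08:00",
                "2022-06-20 09:00",
                "2022-07-20 10:00",
                "2022-06-20 11:00",
            ]
        ),
        "temp1": [38.4, 3.89, 38.9, 39.1],
        "temp2": [38.5, 38.8, 39.0, 39.2],
    }
)

# Assumed plausible range for an oral temperature in °C.
TEMP_MIN, TEMP_MAX = 34.0, 43.0

temp_cols = ["temp1", "temp2"]
# True where at least one of the two readings is outside the plausible range.
bad_temp = ((raw[temp_cols] < TEMP_MIN) | (raw[temp_cols] > TEMP_MAX)).any(axis=1)
# True where a timestamp is earlier than its predecessor; either this row
# or the one before it probably has a wrong date.
bad_order = raw["timestamp"].diff() < pd.Timedelta(0)

suspicious = raw[bad_temp | bad_order]
print(suspicious)
```

The flagged rows still need a manual look: the check only tells you where something is off, not what the correct value was.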

Exploration of temperature data

We can take a quick look at the raw temperature data. First, reading the data from an (already cleaned) CSV file:

import matplotlib.pyplot as plt
import pandas as pd

temps = pd.read_csv("data/temperatures.csv")
temps.timestamp = pd.to_datetime(temps.timestamp)
temps.set_index("timestamp", inplace=True)

Now, we can plot the two manual measurements over time.

fig, ax = plt.subplots(1, 1, figsize=(6.4, 2.4))
temps.plot(y=["temp1", "temp2"], ax=ax, x_compat=True)
plt.xlabel("Time")
plt.ylabel("Body temp. (°C)")

Although it did not feel like it at the time, there is a solid agreement between the two columns (R² = 0.91). For later analyses, both values are averaged into a single "temp" column:

print("R² =", temps.corr().loc["temp1", "temp2"]**2)
#> R² = 0.9103542406909357

temps.eval("temp = (temp1 + temp2) / 2", inplace=True)
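The R² reported here is the squared Pearson correlation. For two variables, this is the same number as the R² of a simple linear regression of one column on the other. A quick check on made-up readings (the values below are illustrative, not my data):

```python
import pandas as pd
import scipy.stats

# Made-up pairs of readings in °C, for illustration only.
demo = pd.DataFrame(
    {
        "temp1": [37.2, 38.1, 38.9, 39.4, 38.0],
        "temp2": [37.3, 38.0, 39.1, 39.3, 38.2],
    }
)

# Squared Pearson correlation from the correlation matrix.
r2_corr = demo.corr().loc["temp1", "temp2"] ** 2
# R² from an ordinary least-squares fit of temp2 on temp1.
r2_lr = scipy.stats.linregress(demo["temp1"], demo["temp2"]).rvalue ** 2

print(r2_corr, r2_lr)
```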

Accessing fitness data from Garmin Connect

Garmin stores all the fitness data you generate (activities, heart rates, sleep, etc.) in the cloud. It can be accessed via the app or the web interface with the user’s Garmin Connect account. The apps are great for visualizing the data, but they are not made for more elaborate analyses. Luckily, Garmin also provides an API to retrieve all the raw data – a dream scenario for a data scientist.

GarminDB

The open-source GarminDB package provides a couple of nice tools to retrieve and preprocess data from Garmin Connect. With its straightforward command line interface, it takes little effort to get the data into SQLite database files on your local machine.

Installing GarminDB works seamlessly through pip.

pip install garmindb

Next, for some user-specific settings, copy the example GarminConnectConfig.json from the repository to ~/GarminDb. Importantly, user credentials can be added here, as well as the start date from which data should be loaded.

{
    "credentials": {
        "user"                          : "email",
        "secure_password"               : false,
        "password"                      : "password"
    },
    "data": {
        "weight_start_date"             : "01/01/2022",
        "sleep_start_date"              : "01/01/2022",
        "rhr_start_date"                : "01/01/2022",
        "monitoring_start_date"         : "01/01/2022",
    },
    // ...
}

With the configuration done, data can be exported from Garmin Connect in one command. It does a bit more than we actually need for this project, but I like to have the complete database locally.

garmindb_cli.py --all --download --import --analyze

The command takes quite a long time. Once it is done, the downloaded data is available in various SQLite databases and JSON files under ~/HealthData/.

Troubleshooting

Sometimes when running the above command, I got an error saying “Failed to login!”. The error disappeared for me after a couple of repeated calls.

Also, on Windows, the command above did not work due to a file association problem. It opened the Python file in the editor rather than executing the script. Not wanting to set file associations to the python.exe inside my virtual environment, I used a small workaround to get it running. Inside the virtualenv where GarminDB was installed, you can call python %VIRTUAL_ENV%/Scripts/garmindb_cli.py.

The monitoring database

While there are a couple of other databases, the main focus for me was the monitoring database. Among other things, it provides the heart rate recorded every minute. We can explore the database using the SQLite command line program (install on Windows for example with Chocolatey). The .tables command lists all tables. .schema <tablename> provides the list of columns and corresponding datatypes.
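The same exploration also works from Python without the SQLite command line program. The sketch below uses an in-memory stand-in database so that it runs anywhere. The table name monitoring_hr and its two columns match what is queried later in this post, but the column types are my assumption. For the real data, you would connect to the garmin_monitoring.db file instead.

```python
import sqlite3

# In-memory stand-in for the monitoring database (assumed schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE monitoring_hr (timestamp DATETIME, heart_rate INTEGER)")

# Equivalent of the .tables command: list all table names.
tables = [
    row[0]
    for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)

# Rough equivalent of .schema <tablename>: column names and declared types.
# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
columns = [
    (row[1], row[2]) for row in con.execute("PRAGMA table_info(monitoring_hr)")
]
print(columns)

con.close()
```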

We can load the table into Pandas using pd.read_sql and sqlite3. sqlite3 is part of the Python standard library, so no additional pip install is needed.

# import pandas as pd  # from before.
from pathlib import Path
import sqlite3

monitoring_db = Path("~/HealthData/DBs/garmin_monitoring.db").expanduser()
con = sqlite3.connect(monitoring_db)
hr_df = (
    pd.read_sql("SELECT * FROM monitoring_hr", con, parse_dates=["timestamp"])
    .set_index("timestamp")
)
con.close()

Data wrangling

While recording both body temperature and heart rate, I noticed that my heart was beating faster when the thermometer read higher values. Therefore, I want to analyze the correlation between my pulse rate and body temperature. To do this, both variables need to be aligned in time. However, temperatures are recorded at non-uniform intervals of roughly 50-60 minutes, whereas the heart rate is given every minute. Some data wrangling is required.

First, we drop heart rate data outside the time range of recorded temperatures. For shorter code later, we also rename the heart_rate column to hr.

start, end = temps.index.min(), temps.index.max()
hrs = hr_df.query("@start < index < @end").sort_index()
hrs.rename(columns={"heart_rate": "hr"}, inplace=True)

As the heart rate signal is quite noisy, I am also creating a smoothed version for visualizations. There are several options for smoothing. Here, I am using a moving average with Pandas' rolling method. Pandas works great with datetime indices. We can easily specify a centered moving window of 30 minutes, without worrying about sampling rate or non-uniform intervals. For a valid computation, we require at least 15 samples captured by the window (this is an arbitrary choice; it does not really matter for the correlation analysis).

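# Select the "hr" column explicitly so the assignment also works
# if the table contains more than one column.
hrs["hr_smooth"] = hrs["hr"].rolling("30min", min_periods=15, center=True).mean()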
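To see what min_periods does with a time-based window, here is a small sketch on made-up heart rate samples (random values, not my recordings). It assumes pandas 1.3 or newer, where center=True is supported for offset windows such as "30min". An isolated burst of five samples after a gap does not reach 15 samples per window, so its smoothed values stay empty.

```python
import numpy as np
import pandas as pd

# Made-up data: 30 one-minute samples, then an isolated burst of
# 5 samples half an hour later (as if the watch was briefly put back on).
idx = pd.date_range("2022-06-20 10:00", periods=30, freq="min").append(
    pd.date_range("2022-06-20 11:00", periods=5, freq="min")
)
rng = np.random.default_rng(0)
hr = pd.Series(75 + rng.normal(0, 5, size=len(idx)), index=idx, name="hr")

# Centered 30-minute window; needs at least 15 samples to produce a value.
smooth = hr.rolling("30min", min_periods=15, center=True).mean()

# In the middle of the long run, the window holds enough samples.
print(smooth.loc[pd.Timestamp("2022-06-20 10:15")])
# In the isolated burst, every window holds only 5 samples, so all are NaN.
print(smooth.loc[pd.Timestamp("2022-06-20 11:00"):].isna().all())
```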

Now, the two timeseries can be plotted to get a first feel of the correlation. There are some gaps in the heart rate data, where I had to take off the watch to recharge the battery.

fig, ax = plt.subplots(1, 1)
twinx = ax.twinx()
hrs.plot(y="hr_smooth", ax=ax, legend=False)
temps.plot(y="temp", ax=twinx, color="C1", legend=False)
plt.figlegend()
ax.set_xlabel("Time")
ax.set_ylabel("Heart rate (BPM)")
twinx.set_ylabel("Temperature (°C)")
plt.show()

It appears that the two signals are not independent of each other. But in order to quantify that relationship, further preprocessing is needed.

Again, Pandas makes it very easy to combine heart rates and temperatures for analysis. We need to resample both signals to common time intervals. For each interval, we compute the mean and standard deviation of both signals. The standard deviation is a good indicator of the noise level in the heart rate data. As this resampling produces some empty rows, we remove them immediately.

aux_df = pd.merge(temps, hrs, left_index=True, right_index=True, how="outer")
res_df = aux_df.resample("30min").agg(["mean", "std"]).dropna(how="all")
res_df.columns = [f"{col}_{feature}" for col, feature in res_df.columns]

Flattening multi-index columns

Personally, I do not like having multi-index columns in a Pandas data frame. A multi-index was created because we calculated two features (mean and std) for each column. The last line above flattens the index and creates single-level column names like temp_mean and hr_std.
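The effect of the flattening is easiest to see on a tiny example. The sketch below uses made-up values (not my data) with a sparse temp column and a dense hr column, mimicking the merged frame.

```python
import pandas as pd

# Made-up merged data: temperature is sparse, heart rate is dense.
idx = pd.date_range("2022-06-20 10:00", periods=6, freq="10min")
demo = pd.DataFrame(
    {
        "temp": [38.0, None, None, 38.6, None, None],
        "hr": [80, 82, 84, 90, 92, 94],
    },
    index=idx,
)

# Two aggregations per column produce two-level column names.
res = demo.resample("30min").agg(["mean", "std"]).dropna(how="all")
print(res.columns.tolist())  # tuples such as ("temp", "mean")

# Join the two levels into single-level names such as "temp_mean".
res.columns = [f"{col}_{feature}" for col, feature in res.columns]
print(res.columns.tolist())
print(res)
```

Note that temp_std is empty here because each 30-minute interval contains only one temperature value. That is also why only hr_std is used as a noise indicator.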

Correlation analysis: temperature vs. heart rate

At last, we can quantify the correlation between my body temperature and heart rate throughout the Covid infection. First, let’s plot the two parameters against each other. Coloring by hr_std, we can additionally see how signal noise affects the correlation.

import seaborn as sns

plt.figure()
sns.scatterplot(
    data=res_df, x="temp_mean", y="hr_mean", hue="hr_std", palette="mako_r"
)
plt.tight_layout()

Especially towards the edges, we see data points with a higher standard deviation in the heart rate signal. It therefore makes sense to clean up a bit and remove the most unreliable points. I drop the highest standard deviations (top 15%, >9.34 bpm) and calculate the regression metrics for both the original and the cleaned dataset.

import scipy.stats

std_thresh = res_df.hr_std.quantile(0.85)
print("hr_std threshold:", std_thresh)
#> hr_std threshold: 9.348100687600498

for df in [res_df, res_df.query("hr_std < @std_thresh")]:
    res = scipy.stats.linregress(df[["temp_mean", "hr_mean"]].dropna())
    print(f"R² = {res.rvalue**2:.3f} | p = {res.pvalue:.4g}"
          f" | hr = {res.slope:.1f} temp + {res.intercept:.1f}")
#> R² = 0.710 | p = 9.726e-21 | hr = 10.9 temp + -335.7
#> R² = 0.763 | p = 3.8e-19 | hr = 11.0 temp + -340.8

We see that the correlation between heart rate and body temperature is highly significant, both when including and when excluding the heart rate outliers. For every degree of increased body temperature, my heart rate increases by 11 beats per minute. While looking at the plot above, I of course also googled this phenomenon. It is well documented that a higher body temperature causes an increased heart rate, although the reported relationships suggest a smaller increase (7-10 bpm/°C) ¹²³.

We can also compute a confidence interval for my personal regression slope using statsmodels. For the cleaned data, the 95% confidence interval for the slope is [9.356, 12.638], still rather high compared to the literature.

import statsmodels.formula.api as smf
model = smf.ols(
    "hr_mean ~ temp_mean", data=res_df.query("hr_std < @std_thresh")
)
res = model.fit()
print(res.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                hr_mean   R-squared:                       0.763
Model:                            OLS   Adj. R-squared:                  0.759
Method:                 Least Squares   F-statistic:                     180.2
Date:                Sun, 07 Aug 2022   Prob (F-statistic):           3.80e-19
Time:                        17:34:59   Log-Likelihood:                -162.28
No. Observations:                  58   AIC:                             328.6
Df Residuals:                      56   BIC:                             332.7
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   -340.7683     30.917    -11.022      0.000    -402.703    -278.834
temp_mean     10.9972      0.819     13.424      0.000       9.356      12.638
==============================================================================
Omnibus:                        2.450   Durbin-Watson:                   1.653
Prob(Omnibus):                  0.294   Jarque-Bera (JB):                1.594
Skew:                          -0.276   Prob(JB):                        0.451
Kurtosis:                       3.596   Cond. No.                     2.20e+03
==============================================================================
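The confidence interval in the summary can be reproduced by hand from the slope and its standard error: slope ± t · stderr, where t is the 97.5% quantile of the t distribution with the residual degrees of freedom. The sketch below plugs in the rounded values from the table above, so the result agrees with the table only up to rounding.

```python
import scipy.stats

# Values copied from the OLS summary above (rounded as printed there).
slope = 10.9972   # coef of temp_mean
stderr = 0.819    # std err of temp_mean
dof = 56          # Df Residuals

# Two-sided 95% interval: use the 97.5% quantile of the t distribution.
t_crit = scipy.stats.t.ppf(0.975, dof)
ci_low = slope - t_crit * stderr
ci_high = slope + t_crit * stderr

print(f"t = {t_crit:.4f}")
print(f"95% CI for the slope: [{ci_low:.3f}, {ci_high:.3f}]")
```

The lower bound stays above 9 bpm/°C, which is what makes my slope look high next to the 7-10 bpm/°C range from the literature.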

Finally, I want to create a nice plot that I would include in a report.

std_thresh = res_df.hr_std.quantile(0.85)
df = res_df.query("hr_std < @std_thresh")
lr = scipy.stats.linregress(df[["temp_mean", "hr_mean"]].dropna())

plt.figure(figsize=(6.4, 3.2))
sns.regplot(
    data=df, x="temp_mean", y="hr_mean", line_kws={"color": "C0"}, color=".6"
)
plt.xlabel("Body temperature (°C)")
plt.ylabel("Heart rate (bpm)")
plt.text(0, 1.02, f"HR = {lr.slope:.1f} \u00B7 Temp {lr.intercept:=+7.1f}",
         va="bottom", clip_on=False, transform=plt.gca().transAxes, color=".5")
plt.text(1, 1.02, f"R² = {lr.rvalue**2:.3f} | p = {lr.pvalue:.4g}",
         va="bottom", ha="right", clip_on=False, transform=plt.gca().transAxes,
         color=".5")
plt.tight_layout()

Conclusion

Going through Covid wasn’t much fun. But exploring my health data from that experience was very rewarding. I saw firsthand how difficult it can be to get trustworthy and accurate data. And I learned how to access and play with my Garmin health data.

Data wrangling with timeseries of different formats and frequencies can be quite challenging. However, the Python ecosystem (most notably Pandas) provides many great tools for dealing with such data and makes it at times ridiculously easy to perform certain tasks.

The data analysis revealed a strong correlation between body temperature and heart rate. Although this is already a well-documented phenomenon, it was still news to me. And through some simple statistics and modeling (R², linear regression), we found that with increasing body temperature, my heart rate grew faster than described in the literature.

Finally, I want to note that I feel lucky that I did not experience any major complications with my Covid infection. Many have to go through worse, and my heart is with anyone experiencing severe medical conditions.

Stay healthy and stay safe!

Below is the complete Python code from this post. You can download the code with data from here.

import pathlib
import sqlite3

import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats
import seaborn as sns

# READING TEMPERATURE DATA
temps = pd.read_csv("data/temperatures.csv")
temps.timestamp = pd.to_datetime(temps.timestamp)
temps.set_index("timestamp", inplace=True)

fig, ax = plt.subplots(1, 1, figsize=(6.4, 2.4))
temps.plot(y=["temp1", "temp2"], ax=ax, x_compat=True)
plt.xlabel("Time")
plt.ylabel("Body temp. (°C)")
plt.tight_layout()

# import scipy.stats
lr = scipy.stats.linregress(temps.temp1, temps.temp2)
print(f"R² = {lr.rvalue**2}")
print("R**2 =", temps.corr().loc["temp1", "temp2"]**2)
temps.eval("temp = (temp1 + temp2) / 2", inplace=True)

# READING HEART RATE DATA
# import sqlite3, pathlib
monitoring_db = pathlib.Path("~/HealthData/DBs/garmin_monitoring.db")
con = sqlite3.connect(monitoring_db.expanduser())
hr_df = (
    pd.read_sql("SELECT * FROM monitoring_hr", con, parse_dates=["timestamp"])
    .set_index("timestamp")
)
con.close()
print(hr_df.head())

# DATA WRANGLING
start, end = temps.index.min(), temps.index.max()
hrs = hr_df.query("@start < index < @end").sort_index()
hrs.rename(columns={"heart_rate": "hr"}, inplace=True)
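# Select the "hr" column explicitly so the assignment also works
# if the table contains more than one column.
hrs["hr_smooth"] = hrs["hr"].rolling("30min", min_periods=15, center=True).mean()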

fig, ax = plt.subplots(1, 1)
twinx = ax.twinx()
hrs.plot(y="hr_smooth", ax=ax, legend=False)
temps.plot(y="temp", ax=twinx, color="C1", legend=False)
plt.figlegend()
ax.set_xlabel("Time")
ax.set_ylabel("Heart rate (BPM)")
twinx.set_ylabel("Temperature (°C)")

aux_df = pd.merge(temps, hrs, left_index=True, right_index=True, how="outer")
res_df = aux_df.resample("30min").agg(["mean", "std"]).dropna(how="all")
res_df.columns = [f"{col}_{feature}" for col, feature in res_df.columns]

# CORRELATION ANALYSIS
plt.figure()
sns.scatterplot(
    data=res_df, x="temp_mean", y="hr_mean", hue="hr_std", palette="mako_r"
)
# sns.regplot(data=res_df, x="temp_mean", y="hr_mean", scatter=False)
plt.tight_layout()

std_thresh = res_df.hr_std.quantile(0.85)
print("hr_std threshold:", std_thresh)
for df in [res_df, res_df.query("hr_std < @std_thresh")]:
    lr = scipy.stats.linregress(df[["temp_mean", "hr_mean"]].dropna(how="any"))
    print(f"R² = {lr.rvalue**2:.3f} | p = {lr.pvalue:.4g}"
          f" | hr = {lr.slope:.1f} temp + {lr.intercept:.1f}")

import statsmodels.formula.api as smf
model = smf.ols(
    "hr_mean ~ temp_mean", data=res_df.query("hr_std < @std_thresh")
)
res = model.fit()
print(res.summary())

# FINAL PLOT
std_thresh = res_df.hr_std.quantile(0.85)
df = res_df.query("hr_std < @std_thresh")
lr = scipy.stats.linregress(df[["temp_mean", "hr_mean"]].dropna())

plt.figure(figsize=(6.4, 3.2))
sns.regplot(
    data=df, x="temp_mean", y="hr_mean", line_kws={"color": "C0"}, color=".6"
)
plt.xlabel("Body temperature (°C)")
plt.ylabel("Heart rate (bpm)")
plt.text(0, 1.02, f"HR = {lr.slope:.1f} \u00B7 Temp {lr.intercept:=+7.1f}",
         va="bottom", clip_on=False, transform=plt.gca().transAxes, color=".5")
plt.text(1, 1.02, f"R² = {lr.rvalue**2:.3f} | p = {lr.pvalue:.4g}",
         va="bottom", ha="right", clip_on=False, transform=plt.gca().transAxes,
         color=".5")
plt.tight_layout()

plt.show() # show all figures at once.

(References)

1. J. Karjalainen and M. Viitasalo, “Fever and cardiac rhythm,” Archives of internal medicine, vol. 146, no. 6, pp. 1169–1171, 1986, doi:10.1001/archinte.1986.00360180179026.
2. G.W. Kirschen, D.D. Singer, H.C. Thode and A.J. Singer, “Relationship between body temperature and heart rate in adults and children: A local and national study,” The American journal of emergency medicine, vol. 38, no. 5, pp. 929–933, 2020, doi:10.1016/J.AJEM.2019.158355.
3. M.E. Broman, J.L. Vincent, C. Ronco, F. Hansson and M. Bell, “The Relationship Between Heart Rate and Body Temperature in Critically Ill Patients,” Critical care medicine, vol. 49, no. 3, pp. E327–E331, 2021, doi:10.1097/CCM.0000000000004807.