I’ve spent the last two years in a Biostatistics MS program at UCLA. I’m preparing to move on in life now, taking my last quarter of classes virtually, holed up in a loft in Berkeley. I am currently splitting my quarantine between genetic research and building bikes.
For a recent final project in one of my classes, I analyzed my cycling data from the last two years of training in the Santa Monica area. I’ve been very consistent in my training since moving to LA, keeping a regimented schedule of rides, both on my own and in groups.
My best memories from this time in my life will unquestionably come from my wonderful Tuesday and Thursday mornings with Las Flores Breakfast Club. This is a remarkably consistent group ride that I have participated in over 60 times since December 2018. For class, I downloaded and analyzed my Strava data from these rides and wrote the following statistical analysis of my performance.
Pictures in this post are a mix of my own and Leo Rusaitis’s. Mostly Leo’s, because they’re always so great. All of my data analysis is done in R, but I’m still cleaning the code so I won’t share it here yet. In the meantime, the R packages ‘darksky’, ‘refund’, ‘fda’, ‘gtools’, ‘geosphere’, and a few functions from the tidyverse were very useful. I plot in base R because I like the flexibility that it provides. It’s just cleaner.
The Las Flores Breakfast Club is incredibly consistent, leaving Rapha Santa Monica at 6:05am on the dot. We ride at 25mph up the PCH to either Lower Topanga/Fernwood Pacific Dr. or Las Flores Dr. Both routes end up at the top of Saddle Peak, and we have a strict departure from the top at 7:22am. We descend the twisty Tuna Canyon Road to the PCH, then make a mad dash through morning traffic along the PCH to a final sprint at Temescal. This puts the ride back in Santa Monica by 8am, and me, for the most part, in class by 10am.
Despite it being sunny Southern California, we do face some variability in temperature during the ride, and across days. During the winter months, the temperature can dip into the low 40’s at the start, rising through the morning. During the summer, we may start in the 70’s. I decided to model my Strava positional data as functional data, and determine how the weather conditions affect our speed at certain points in the ride. Do we go slower when it’s colder? What parts of the ride are impacted the most by poor conditions?
Functional Data Analysis (FDA) is the study of curves. In statistical inference, we’re often concerned with conditional relationships and predictions. For example, we can fit a linear relationship to two variables and determine how they are related: does the value of one variable change conditional on the value of another? Functional data adds a new dimension. Instead of predicting a single value from one or more other variables, functional data analysis predicts entire curves from predictor curves. Consider the model:
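In its usual textbook notation (the symbols below are the standard ones and are an assumption, since the notation is not spelled out in the text), for ride or observation i:

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
```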
This is a linear regression model, where the outcomes for variable Y are described as a function of a predictor X with some error. Now consider the functional data model:
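A common way to write a function-on-function model, again in assumed standard notation rather than the exact specification fitted here, is:

```latex
Y_i(t) = \beta_0(t) + \int \beta(s, t)\, X_i(s)\, ds + \varepsilon_i(t)
```

Here \beta_0(t) is an intercept curve and \beta(s, t) is a coefficient surface that weights the predictor's value at time s when predicting the response at time t.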
The functional data model here considers a mean response conditioned on all values of the predictors contained in the paths in X. Y(t) is not a vector of single values; it is a matrix whose dimensions are the number of paths and the length of those paths. Instead of regressing a scalar on a scalar, as in linear regression, the FDA model regresses a function on a function.
Downloading and Aggregating the Data
I downloaded my bulk .gpx data from Strava, and filtered through it manually to select 58 of my paths from the Las Flores Breakfast Club. Of these 58 paths, about half of them correspond to the Tuesday route up Fernwood Pacific, and the other half correspond to the Thursday route up Las Flores.
I paid about $5 in API calls to DarkSky, through its R interface, to match temperature to coordinates and time. DarkSky provides a spatial temperature estimate for each coordinate and time, whether or not there is an active measurement device at that location.
My speed is calculated as 15-second average speeds, each the result of averaging five 3-second intervals of movement data. Strava records GPS position, and I was careful to calculate my speed in each 3-second interval using a distance formula that reflects the curvature of the Earth. Gotta be accurate!
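The speed calculation can be sketched in a few lines. This is an illustrative Python version, not the R code behind the analysis. It assumes the haversine formula as the curvature-aware distance, evenly spaced 3-second GPS fixes, and non-overlapping 15-second blocks. The function names and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed constant)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def interval_speeds_kmh(points, dt_s=3.0):
    """Speed in km/h between each consecutive pair of fixes, dt_s seconds apart."""
    return [
        haversine_m(*p, *q) / dt_s * 3.6  # m/s converted to km/h
        for p, q in zip(points, points[1:])
    ]

def block_average(values, block=5):
    """Average non-overlapping blocks of `block` values; a trailing partial block is dropped."""
    n_blocks = len(values) // block
    return [sum(values[i * block:(i + 1) * block]) / block for i in range(n_blocks)]

# Hypothetical fixes: six points stepping east along the equator, 3 seconds apart.
fixes = [(0.0, i * 0.0001) for i in range(6)]
speeds_3s = interval_speeds_kmh(fixes)   # five 3-second speeds
speeds_15s = block_average(speeds_3s)    # one 15-second average speed
```

Averaging five short intervals damps some of the GPS jitter in any single 3-second step, at the cost of time resolution.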
Figure 1: Speed Calculation Diagram
The initial paths are shown below in grey, all 58 curves at once. The first thing to notice is that a clear pattern emerges: you can see the different parts of the ride, including the main climb. The second thing to notice is that there is a lot of noise here. There are spikes, erroneous points, bad measurements.
Figure 2: Raw Speed Data
Below I fit a few smoothing splines to the data to extract a little more detail. Note that these splines are modeled independently for now, so this is just for illustrative purposes. In a little bit I’ll model these paths jointly. Here you can see a little more pattern to the data. You can start to see the two different rides, the steep parts, and a few days where we linger at the top of the climb for someone’s birthday or farewell ride.
Figure 3: Independent Splines, Annotated
From the last chart, I see that there is no real need to realign rides in time, as the start time is so consistent. Realigning would correct for a “phase shift,” and I don’t feel it’s necessary here. So I jump straight into running a joint FDA model.
The model I chose was a polynomial B-spline model, with one function-on-function term for temperature and one binary function-on-scalar term for distinguishing between Tuesday and Thursday curves. These models come in so many variations, with so much literature behind them on things like knot selection, polynomial degree, smoothing basis creation, and about a dozen other characteristics. For the purposes of this post, I won’t go too far into the mathematics behind it all because the more I do, the more shaky my methods will appear. Pay no attention to the methods behind the curtain.
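For readers who have not met B-splines, here is a minimal Python sketch of how a B-spline basis is evaluated with the Cox-de Boor recursion. It is only an illustration: the cubic degree, the knot positions (in minutes), and the function names are assumptions, not the basis actually used in the model.

```python
def bspline_basis(i, k, t, knots):
    """Value at t of the i-th B-spline basis function of degree k (Cox-de Boor recursion)."""
    if k == 0:
        # Degree 0: indicator of the half-open knot interval [knots[i], knots[i+1]).
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + k] - knots[i]
    if d1 > 0:  # skip zero-width spans created by repeated knots
        left = (t - knots[i]) / d1 * bspline_basis(i, k - 1, t, knots)
    d2 = knots[i + k + 1] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + k + 1] - t) / d2 * bspline_basis(i + 1, k - 1, t, knots)
    return left + right

def basis_row(t, knots, degree=3):
    """All basis function values at time t."""
    n_basis = len(knots) - degree - 1
    return [bspline_basis(i, degree, t, knots) for i in range(n_basis)]

def spline_value(t, coefs, knots, degree=3):
    """A spline is a weighted sum of basis functions; fitting means choosing the weights."""
    return sum(c * b for c, b in zip(coefs, basis_row(t, knots, degree)))

# Hypothetical clamped cubic knots over a 90-minute ride, with interior knots at 30 and 60.
knots = [0, 0, 0, 0, 30, 60, 90, 90, 90, 90]
row = basis_row(45.0, knots)                       # six basis values at minute 45
flat = spline_value(45.0, [20.0] * len(row), knots)  # equal weights give a flat curve
```

Because the intervals are half-open, this sketch is valid for times from 0 up to, but not including, the last knot. Inside that range the basis values are non-negative and sum to one, which is why equal weights reproduce a constant speed.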
So I ran the model, and it’s pretty difficult to interpret simply. In linear regression, you have a specific beta value and a straightforward interpretation: for each one-unit increase in the predictor X, the predicted value of Y changes by beta. In FDA, instead of feeding in one value for the predictor X, you feed in the predictor’s entire path over time and end up with a predicted path for the outcome.
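To make "path in, path out" concrete, here is a small Python sketch of a discretised function-on-function prediction. It assumes a linear model with an intercept curve and a coefficient surface, with the integral over the predictor path approximated by a sum on an even grid. Every number below is made up for illustration; none of them are fitted values from this analysis, and the names are hypothetical.

```python
def predict_path(x_path, beta0, beta_surface, ds):
    """Discretised function-on-function prediction.

    y_hat[t] = beta0[t] + sum over s of beta_surface[s][t] * x_path[s] * ds
    """
    n_s, n_t = len(x_path), len(beta0)
    return [
        beta0[t] + sum(beta_surface[s][t] * x_path[s] for s in range(n_s)) * ds
        for t in range(n_t)
    ]

# Made-up inputs on a 4-point grid, with ride time rescaled to [0, 1].
ds = 0.25
beta0 = [30.0, 12.0, 45.0, 35.0]                  # hypothetical baseline speed curve (km/h)
beta_surface = [[0.05] * 4 for _ in range(4)]     # hypothetical flat coefficient surface
cold_path = [40.0] * 4                            # constant 40-degree temperature path
warm_path = [70.0] * 4                            # constant 70-degree temperature path

cold_pred = predict_path(cold_path, beta0, beta_surface, ds)
warm_pred = predict_path(warm_path, beta0, beta_surface, ds)
```

With a flat coefficient surface the warm path simply lifts the whole predicted curve. A fitted surface would vary over s and t, which is what lets temperature matter more on some parts of the ride than others.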
The figure below illustrates this concept. Each ride path (lower paths) is a predicted path, derived conditionally from the corresponding temperature path above. I have highlighted two days, one particularly hot and one particularly cold, as a reference. The red temperature path (hot) corresponds to the red predicted path, and the blue temperature path (cold) to the blue predicted path. Both of these days were Thursdays.
Figure 4: Temperature Paths and Predicted Ride Paths
Let’s dive into a specific day, the cold one from above. It was a very cold day in February 2019, so cold that I remember there was ice on the descent. Ice! Outside Malibu! The figure below shows the raw data for that ride, the basis function (used in the smoothing) for that ride, and a predicted path from the model. Note that the model’s paths are shorter than the overall path. I have imposed a cutoff of just over an hour and a half of riding, in order to model just the most consistent parts of the ride.
Figure 5: Raw Path, Basis Function, and Predicted Path for One Ride
The difference between the raw speed curve and the predicted curve is just the prediction error. It’s analogous to drawing a simple linear regression line on a 2D scatter of the two involved variables, but with an added dimension. Now that I have the ability to feed in different paths, let’s play around and see what results pop out.
The following chart shows the effect of temperature path on the predictions. The black and grey curves show the cold-temperature ride from earlier, the dark line a function of the data contained in the grey. The blue and red dashes correspond to paths predicted using constant 30- and 80-degree paths. It is immediately evident that the path predicted by the 80-degree day is considerably faster than the colder one.
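Under a linear function-on-function model, this comparison has a simple closed form. As a sketch, assume the fitted model has the form \hat{Y}(t) = \hat{\beta}_0(t) + \int \hat{\beta}(s, t) X(s) ds, with the Tuesday/Thursday term held fixed (the notation is an assumption, not taken from the fitted model). Plugging in the two constant temperature paths and subtracting gives:

```latex
\hat{Y}_{80}(t) - \hat{Y}_{30}(t)
  = \int \hat{\beta}(s, t)\,(80 - 30)\, ds
  = 50 \int \hat{\beta}(s, t)\, ds
```

So the gap between the two dashed curves at time t is 50 times the integrated temperature coefficient at t, and it is widest wherever that integral is largest.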
Figure 6: Predicted Paths of Constant Hot and Cold Rides
I continued to mess with the temperature paths, feeding in different paths of values and changing Tuesday/Thursday. A pretty consistent pattern emerged: we go considerably faster when it’s warmer out. In some places, this speed benefit can reach 5-8 km/h. Good conditions on the flat PCH result in good times. A warm Tuna Canyon Road is a white-knuckle experience.
Figure 7: Predicted Paths of Non-Constant Temperature Rides
The results here are relatively straightforward. We go faster when it’s warmer out. Big deal. But the larger point I want to make with this post is how much I’ve enjoyed my time riding with the cycling community in Santa Monica over the last two years. This ride has been such a great way to start the day during my time here. A huge thanks to everyone I had the chance to ride with. I appreciate your energy and enthusiasm for cycling. So many wonderful rides with great company. My time cycling here has meant as much to me as any of my other bike activities, so it deserved special note and an extended shoutout in the blog.
Thanks for reading!
One thought on “Functional Data and the Las Flores Breakfast Club”
Love the write-up, Bryan! I guess when it’s cold enough for my balls to shrivel, we just don’t have the big cojones for the fast descent anymore. 😬. Enjoyed riding with you and wish you all the best in the future.