The Extensive and Often Surprising Data that Companies Have about You, Ready and Waiting for You to Analyze
Data privacy laws are appearing in countries all over the world and are creating a unique opportunity for you to learn how others view you while also gaining insights into yourself. Most laws are similar to the European Union’s General Data Protection Regulation, commonly known as “GDPR”. It includes provisions requiring organizations to tell you the type of personal data they store about you, why they are storing it, how they are using it, and the length of time they store it.
But the laws also include an often overlooked requirement commonly known as data portability. Data portability requires organizations to give you a machine-readable copy of the data they are currently storing about you upon request. In the GDPR, the right to obtain a copy of your data is set out in Article 15, “Right of access by the data subject”, and the right to receive it in a machine-readable format is set out in Article 20, “Right to data portability”. The data that organizations have often includes a rich and varied set of features and is clean, making it ripe for a range of data analysis, modelling, and visualization tasks.
In this article, I share my journey of requesting my data from a few of the companies with whom I routinely interact. I include tips for requesting your data as well as ideas for using your data in data science and for personal insights.
Think you have a solid grasp on your taste in music? I thought I had broad and varied musical tastes. According to Apple, though, I am more of a die-hard rocker.
Want to refine your geographic data mapping skills? These data sources provide a spectacular amount of geocoded data to work with.
Care to try your time series modelling skills? Multiple data sets come with fine-grained time series observations.
The best news of all? This is your data. No license or permissions needed.
Fasten your seat belt — the variety of data you will receive is broad. The types of analyses and modelling you can do are non-trivial. And the insights you gain about yourself and how others view you are intriguing.
To keep the focus on insights from the data and in the interest of brevity, I do not include my own analysis code in this article. Everybody likes code, though, so here is a link to a repo with several of the notebooks I used to analyze my data.
Getting the Data
If you make a list of organizations that have data about you, you will quickly realize the list is large. Social media companies, online retailers, cellular phone carriers, internet service providers, home automation and security services, and streaming entertainment providers are just a few categories of organizations storing data about you. Requesting your data from all of these groups can be quite time-consuming.
To make my analysis manageable, I limited my data requests to Facebook, Google, Microsoft, Apple, Amazon and my cellular carrier, Verizon. Here is a table summarizing my experience with the data request and response process:
And here are the links I used to request my data along with information on any data documentation provided by the vendors:
I use an Apple Watch to track health and fitness data. That data is accessed separately from all other Apple data that you request from the general Apple website. Because of this, I show two separate Apple entries in the above tables and discuss the Apple data in two topics below.
The amount and type of data you receive will depend on how extensively you engage with a particular company. For example, I use social media infrequently. The rather modest amount of data I received from Facebook is therefore not surprising. In contrast, I use Apple products and services a lot. I got a broad range and large volume of data from Apple.
Keep in mind that if you have multiple identities with a company, you will have to request the data for each identity. For example, if Google knows you by one e-mail address for your Google Play account and a different e-mail address for your gmail account, you will have to do a data request for each address in order to get a full picture of the data Google stores about you.
In the table above I show links that I used to request data from my target companies. The links are current as of the publishing of this article but may change over time. In general, you can find instructions for requesting your data at the “Privacy”, “Privacy Rights” or similar sounding links on a company’s home page. Those links frequently appear at the very bottom of the home page.
You usually have to read through documentation describing your privacy rights and search for the “Accessing Your Data”, “Exporting Your Data”, “Data Portability” or similar topic to get a link to the actual page for requesting your data.
Finally, the process for requesting your data, the timeliness of the response and the quality of the documentation explaining the data vary greatly from one company to the next. Be patient and persevere. You will be rewarded with a wealth of data and knowledge.
My Data Insights
Here is a review of the data files that I received from each company along with a few observations after analyzing the more interesting files. I also point out some opportunities to do more in-depth data analysis and modelling with the data from these companies.
Facebook
My download from Facebook included 51 .json files, excluding the numerous .json files containing individual message threads from my Facebook Messenger account. Facebook provides some high-level documentation for its files on the download website.
Data on my Facebook login activity, the devices that I used to log in, the estimated geographic location of my logins, and similar administrative-type data about my account activities appear across several files. Nothing in these files is particularly interesting, though I will say that the location data seemed surprisingly accurate, given that it was often inferred from my IP address at the time of the recorded activity.
The truly interesting data started to appear in a file that tracked my off-Facebook app and web activity. I can see how the data in that file, coupled with the data that Facebook already has from my Facebook profile, paints a demographic picture that results in my being selected as a target by particular Facebook advertisers. The off-Facebook file starts to give you a sense of how the profiling and advertising process works at Facebook.
Let’s take a look at the file. It is named:
“/apps_and_websites_off_of_facebook/your_off-facebook_activity.json”
It contains 1,860 records of actions I took on 441 different non-Facebook websites over the past two years. Here is an edited sample of the websites and action types it records:
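If you want to run a similar tally on your own export, here is a minimal Python sketch. It assumes the file holds a top-level list (the key name `off_facebook_activity` is an assumption and may differ between export versions) in which each site has a `name` and a list of `events`, each with a `type`. The sample data is invented for illustration.

```python
import json
from collections import Counter


def load_json(path):
    """Read one JSON file from the export (path is whatever your download uses)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_off_facebook(data, list_key="off_facebook_activity"):
    """Count recorded events per site and per action type.

    list_key, "name", "events" and "type" are assumed field names.
    """
    site_counts = Counter()
    type_counts = Counter()
    for site in data.get(list_key, []):
        events = site.get("events", [])
        site_counts[site.get("name", "unknown")] += len(events)
        for event in events:
            type_counts[event.get("type", "unknown")] += 1
    return site_counts, type_counts


# Tiny in-memory stand-in for the real file (made-up sites and actions).
sample = {
    "off_facebook_activity": [
        {"name": "example-travel.com", "events": [
            {"type": "PAGE_VIEW", "timestamp": 1600000000},
            {"type": "SEARCH", "timestamp": 1600000300},
        ]},
        {"name": "example-tech.com", "events": [
            {"type": "PAGE_VIEW", "timestamp": 1600100000},
        ]},
    ]
}

sites, actions = summarize_off_facebook(sample)
print(sites.most_common(5))
print(actions.most_common(5))
```

With a real export you would pass the result of `load_json(...)` instead of `sample`, adjusting the key names to whatever your file actually contains.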
Several technology and travel related sites rise to the top of my off-Facebook activity list. Now let’s look at my demographic profile.
The file named:
“ads_information/other_categories_used_to_reach_you.json”
contains a list of demographic categories that Facebook has assigned to me based, I assume, on my Facebook profile data, my Facebook friends, my activity on Facebook, and my off-Facebook app and web activity. Here is an edited sample of the demographic categories:
Most of the categories above are based on my profile, my device usage pattern, and my friends. The “Frequent Travelers” and “Frequent International Travelers” categories come, I assume, from my off-Facebook web activity. So far, this all checks out.
Finally, there is a file named:
“ads_information/advertisers_using_your_activity_or_information.json”
The “advertisers_using_your_activity_or_information” in the file title leads me to believe that Facebook makes my data available to its advertisers who in turn use it to target me with ads through Facebook. This file, then, lists those advertisers who displayed an ad to me, or who at least considered doing so based on my data.
The file contained 1,366 different advertisers. Here is a small sample of those advertisers:
Travel sites, retailers, tech companies, fitness centers, car repair companies, healthcare insurers, media companies (who represent advertisers), and other firms appear in the list. It is a wide variety of organizations, but in many instances, I can see how they relate to me, my preferences and my habits.
Other files in the Facebook download include Facebook search history, search timestamps, and browser cookie data.
Google
Google’s export facility is cleverly named “Takeout”. The Takeout web page lists all the various Google services for which you can request your data (gmail, YouTube, search, Nest, etc.). It also shows the files available for each service and the export format for each file (json, HTML, or csv). Most of the time, Google does not give you a choice of export format for individual files.
Google does a decent job of providing a high-level overview of the purpose of each file. There is, however, no documentation for individual fields.
I received 94 files in my extract. As with Facebook, there were the normal administrative files related to device information, account attributes, preferences, and login/access data history.
One interesting file is the one titled ‘…/Ads/MyActivity.json’. It contains a history of ads presented to me as a result of searches.
Some entries in the Ads/MyActivity file have URLs containing a clickserve domain, for example:
Per Google’s 360 ads website, these are ads from an ad campaign being done by one of Google’s advertisers, served to me as a result of some click activity I did. The file does not give any information on which action I took that caused the ad to be served.
The ‘title’ column in the file distinguishes between sites “Visited” and topics “Searched”. The “Visited” records all have “From Google Ads” in the ‘details’ column (see example above), leading me to believe that Google served an ad to me in response to me having visited a particular site.
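A split like this is easy to check for yourself. The sketch below assumes each record is a dictionary with a `title` string that starts with the action word and a `details` list of `{"name": ...}` entries; that layout is an assumption, so adjust it to what your file contains. The sample records are invented.

```python
from collections import Counter


def split_by_title_prefix(records):
    """Count records by the first word of 'title' and count ad-sourced records."""
    prefix_counts = Counter()
    from_ads = 0
    for record in records:
        title = record.get("title", "")
        prefix = title.split(" ", 1)[0] if title else "unknown"
        prefix_counts[prefix] += 1
        # Assumed shape: "details" is a list of {"name": "..."} dictionaries.
        details = record.get("details", [])
        if any(item.get("name") == "From Google Ads" for item in details):
            from_ads += 1
    return prefix_counts, from_ads


# Made-up records standing in for entries in the activity file.
sample_records = [
    {"title": "Visited https://example.com/offer",
     "details": [{"name": "From Google Ads"}]},
    {"title": "Visited https://example.org/deal",
     "details": [{"name": "From Google Ads"}]},
    {"title": "Searched for running shoes",
     "details": [{"name": "example-store.com"}]},
]

prefixes, ad_count = split_by_title_prefix(sample_records)
print(prefixes)
print(ad_count)
```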
The “Searched” records show sites I visited directly (macys.com, yelp.com, etc.). The ‘details’ column shows those sites while the ‘title’ column apparently shows what I searched for on those separate sites. For example,
One other file I found interesting is called ‘…/My Activity/Discover/MyActivity.json’. It is a history of the topic suggestions that Google presented to me through its “Discover” feature on the Google app (formerly the Google Feed feature; more on Discover here). Discover topics are selected based on your web and app activity, assuming you give Google permission to use your activity to guide Discover topics.
Even though I do not allow Discover to use my web and app activity, Discover still presented some topic suggestions relevant to me. Here is an edited sample of the topics presented most frequently over several days:
We see here the recurring themes of technology and travel, along with a new theme we will also see in the Apple files — music!
Google includes in its download several files tracking activity history across Google’s products and services. For example, I received history for my visits to the developers.google.com and cloud.google.com sites for training and documentation resources. No compelling insights came from this data, but it did remind me of topics I wanted to revisit and study further.
Other historical data in the extract included searches and actions performed within my gmail account; search requests for images; places searched, directions requested, and maps viewed through the Google Maps app; searches performed for videos on the web (outside of YouTube); searches done on and watch history for YouTube; and contacts I store with Google, presumably in gmail.
Unlike Facebook, Google does not provide any information on a demographic profile that Google has built for me.
Note that you can view your Google activity data across its products and apps by visiting myactivity.google.com:
While you cannot export the data from this site, you can browse the data, allowing you to get a sense for the type of data you may want to export through the Google Takeout site.
Microsoft
Microsoft lets you export some of your data through the Microsoft Privacy Dashboard. For individual Microsoft services not available on the Dashboard (for example, MSDN, OneDrive, Microsoft 365, or Skype data) you can use links in the “How to access and control your personal data” section of Microsoft’s privacy statement page. The same page directs you to a web form you can submit if you are looking for data that is not available by any of the above methods.
I chose to export all data available through the Privacy Dashboard. This included browsing history, search history, location activity, music, TV and movies history, and apps and service usage data. I also asked for an export of my Skype data. My export included four csv files, six json files, and six jpeg files.
No file documentation was included in the export and none was found on the Microsoft site. The field names in the files are, however, fairly intuitive.
A few interesting observations from the Microsoft files:
The file ‘…\Microsoft\SearchRequestsAndQuery.csv’ contains data from searches I performed over the last 18 months including search terms and, apparently, the site that I clicked on, if any, from the search results. It looks like the data was only for searches that I did through Bing or Windows Search.
Based on the data, it appears I clicked on a link in the search results only 40% of the time (347 out of 870 searches performed). From this, I assume that the searches for which I did not click on a link were either poorly crafted and returned off-topic results, or gave me the answer I wanted just from the link previews in the search results. I do not recall having to redo search terms frequently, and I know I often see the answer I need right in a link preview, since many of my searches are for reminders on coding syntax. Either way, I was a bit surprised at the 40% click-through rate. I would have expected it to be much higher.
Not much of interest was in the Skype data. It contained the history of in-app message threads between me and other Skype meeting participants. Also included were .jpeg files with images of participants from a few of my calls.
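The click-through calculation itself is simple arithmetic: searches with a recorded click divided by all searches. Here is a minimal sketch using only the standard library. The column names (`SearchTerms`, `ClickedUrl`) are hypothetical, since the export comes without field documentation, and the sample rows are invented.

```python
import csv
import io


def click_through_rate(rows, clicked_field="ClickedUrl"):
    """Return (clicked, total, rate) where a non-empty clicked_field counts as a click."""
    total = 0
    clicked = 0
    for row in rows:
        total += 1
        if (row.get(clicked_field) or "").strip():
            clicked += 1
    rate = clicked / total if total else 0.0
    return clicked, total, rate


# Made-up rows; the real file would be opened with open(path, newline="").
sample_csv = """SearchTerms,ClickedUrl
pandas groupby syntax,https://example.com/a
python strftime codes,
haversine formula,https://example.com/b
regex lookahead,
list comprehension,
"""

clicked, total, rate = click_through_rate(csv.DictReader(io.StringIO(sample_csv)))
print(f"{clicked} of {total} searches had a click ({rate:.1%})")

# The figures quoted in the text: 347 clicks out of 870 searches.
article_rate = 347 / 870
print(f"{article_rate:.1%}")
```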
Apple Fitness
I had to access my Apple health and fitness data separately from the other data that I exported from Apple. The health and fitness data are accessed from the Health app on the iPhone. You simply click on your icon in the upper right-hand corner of the Health app screen. It takes you to a profile screen, and you then click on the Export All Health Data link at the bottom of the screen:
My health export included just under 500 .gpx files totaling 102 meg. They contain route information from my recorded workouts over the last several years. Another 48 files contained 5.3 meg of electrocardiogram data from self-tests that I performed on my Apple Watch.
The file named ‘…/Apple/apple_health_export/export.xml’ contains the really interesting data. For me, it is 770 meg with 1,956,838 records covering many different health and exercise measurements for approximately seven years. Some of the activity types measured are as follows:
Note that the frequency at which Apple records data varies by activity type. For example, Active Energy Burned is recorded hourly while Stair Ascent Speed is recorded only when going up stairs, leading to the large difference in observation counts between these two activity types.
The data recorded for each observation include the date/time on which the observation was recorded, the start and end dates/times of the activity being measured, and the device that recorded the activity (iPhone or Apple Watch).
In his excellent Medium article “Analyse Your Health with Python and Apple Health”, Alejandro Rodríguez provides the code that I used to parse the xml in the export.xml file and create a Pandas data frame. (Thank you Alejandro!) After selecting a one year subset of the data and grouping and aggregating it at day and activity type levels, I discovered some interesting things.
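For readers who would rather avoid loading a file of this size in one go, here is a standard-library sketch of the same idea (it is not Alejandro's code and does not use Pandas). It streams the `Record` elements and sums their `value` by day and `type`. Summing suits cumulative measures such as steps or flights climbed; rate-style measures would need an average instead. The sample XML is invented.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def daily_totals(source):
    """Sum numeric Record values by (day, record type), streaming the XML.

    source is a file path or a binary file-like object.
    """
    totals = defaultdict(float)
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Record":
            try:
                value = float(elem.get("value", ""))
            except ValueError:
                value = None  # skip records whose value is not a number
            if value is not None:
                day = elem.get("startDate", "")[:10]  # keep the YYYY-MM-DD part
                totals[(day, elem.get("type"))] += value
            elem.clear()  # release the element's attributes once processed
    return dict(totals)


# Made-up records standing in for export.xml.
sample_xml = b"""<HealthData>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count"
          startDate="2021-03-01 08:00:00 -0600" endDate="2021-03-01 08:10:00 -0600" value="120"/>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count"
          startDate="2021-03-01 09:00:00 -0600" endDate="2021-03-01 09:05:00 -0600" value="80"/>
  <Record type="HKQuantityTypeIdentifierFlightsClimbed" unit="count"
          startDate="2021-03-02 10:00:00 -0600" endDate="2021-03-02 10:01:00 -0600" value="3"/>
</HealthData>"""

totals = daily_totals(io.BytesIO(sample_xml))
for key, value in sorted(totals.items()):
    print(key, value)
```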
As I suspected, my average activity levels were different for days when I was travelling compared to days when I was in one of the cities I call home (Austin or Chicago). To see this, I had to use the latitude and longitude data from the .gpx exercise route files mentioned earlier. That allowed me to determine which of the routes were in a home city and which occurred while I was travelling. I then merged that location data with my activity summary data. This was then further summarized by activity type and location (home city or travelling). Here is the pattern that emerged:
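One way to label a route as home or travelling is to measure the great-circle distance from a point on the route to each home city. The sketch below uses the haversine formula. The city-centre coordinates are approximate and the 80 km radius is an arbitrary assumption, not a value taken from my analysis.

```python
import math

# Approximate city-centre coordinates (latitude, longitude).
HOME_CITIES = {
    "Austin": (30.2672, -97.7431),
    "Chicago": (41.8781, -87.6298),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def classify_location(lat, lon, radius_km=80.0):
    """Return the home city within radius_km of the point, else 'Travelling'."""
    for city, (city_lat, city_lon) in HOME_CITIES.items():
        if haversine_km(lat, lon, city_lat, city_lon) <= radius_km:
            return city
    return "Travelling"


print(classify_location(30.30, -97.70))
print(classify_location(41.90, -87.65))
print(classify_location(40.7128, -74.0060))
```

Classifying each workout by its first track point is usually enough, since a single workout rarely spans two cities.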
While in Chicago, I am in an apartment building with an elevator, so the big decline in average flights climbed was not a surprise. What was surprising was the increase in activity levels for Chicago versus Austin. My exercise routine is very similar in both locations, yet I do more work in Chicago. I think I can attribute this to the fact that I walk to more locations in Chicago, rather than driving most of the time. Clearly, I need to up the amount that I exercise in Austin.
Spotting trends like the one above, which you cannot see in the standard charts of the Apple Health app, is a great use for the health data.
The data is also great for modeling, given it is very complete and generally clean. Here, for example, is a time series forecast of my exercise minutes based on a one year period using Facebook’s Prophet model:
Here is the same forecast, but with annual seasonality enabled and weekly seasonality added manually based on my location (Austin, Chicago or travelling):
The default weekly seasonality model above (first plot) does a worse job of fitting the training data than the model with custom seasonality terms added (second plot). However, the default seasonality model is far better (though still not great) at predicting future values of exercise minutes. Needless to say, hyperparameter tuning would help improve these results.
This is just a sample of the type of modeling you can experiment with using your health data. Do you want to try using very granular time-series data? Look at the workout routes files. They have observations for each second of your recorded workouts with latitude, longitude, elevation and velocity fields.
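To get started with the route files, here is a minimal standard-library sketch that reads track points from a GPX file. It assumes each point is a `trkpt` element with `lat` and `lon` attributes and optional `ele`, `time` and `speed` children; the exact nesting of `speed` is an assumption, so the code simply searches all descendants. The sample route is invented.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}trkpt' down to 'trkpt'."""
    return tag.rsplit("}", 1)[-1]


def read_trackpoints(source):
    """Return one dict per track point with lat, lon and any ele/speed/time found."""
    points = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "trkpt":
            point = {"lat": float(elem.get("lat")), "lon": float(elem.get("lon"))}
            for child in elem.iter():
                name = local_name(child.tag)
                if name in ("ele", "speed") and child.text:
                    point[name] = float(child.text)
                elif name == "time" and child.text:
                    point["time"] = child.text.strip()
            points.append(point)
            elem.clear()  # free the point's children once they are copied out
    return points


# Made-up two-point route standing in for a workout file.
sample_gpx = b"""<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
  <trk><trkseg>
    <trkpt lon="-97.7431" lat="30.2672">
      <ele>150.5</ele><time>2021-03-01T14:00:00Z</time>
      <extensions><speed>2.9</speed></extensions>
    </trkpt>
    <trkpt lon="-97.7432" lat="30.2673">
      <ele>150.9</ele><time>2021-03-01T14:00:01Z</time>
      <extensions><speed>3.1</speed></extensions>
    </trkpt>
  </trkseg></trk>
</gpx>"""

points = read_trackpoints(io.BytesIO(sample_gpx))
print(len(points), points[0])
```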
Apple — Non-Fitness/Health
You request a download of all your non-fitness/health data from Apple’s main website. For me, that amounted to 84 files, mostly .csv and .json files along with a few .xml files. I also received hundreds of .vcf files, one for each of the contacts I have on my Apple devices. In total, I downloaded 68 meg of data, excluding the .vcf files.
Apple stands out in that it provides comprehensive documentation for each of the data files. It includes explanations of each field, though some definitions are more helpful than others. The documentation helped me interpret a few data files that looked intriguing.
As with most other exports, Apple’s files included the normal administrative data, including things such as my preferences for various apps, login information and device information. I did not find anything remarkable in those files.
There are several files related to Apple Music, one of the services to which I subscribe. Files with titles like:
- “…/Media_Services/Apple Music — Play History Daily Tracks.csv”;
- “…/Media_Services/Apple Music — Recently Played Tracks.csv”; and,
- “…/Media_Services/Apple Music Play Activity.csv”
contain information such as:
- date and time a song was played;
- play duration in milliseconds;
- how each play was ended (for example, it reached the end of the track, or I skipped past the song);
- the number of times the song has been played;
- the number of times the song was skipped;
- the song title;
- the album title, if any;
- the song’s genre; and,
- where the song was played from — my library, a playlist, or one of Apple’s radio channels.
My files contained between 13,900 and 20,700 records depending on the purpose of the file. The data covered nearly seven years of song plays.
Apple captures a variety of data on how song plays are ended, probably for purposes of recommending other songs to me. Song play termination reasons include:
For purposes of the analyses I show below, I focused on the ‘NATURAL_END_OF_TRACK’, ‘TRACK_SKIPPED_FORWARDS’, and ‘MANUALLY_SELECTED_PLAYBACK_OF_A_DIFF_ITEM’ end reasons.
Sometimes I will repeat a song that I like. One question I had was “Do I play favorite songs obsessively, over and over again?”. I answered that question using the Apple data:
The table above summarizes the number of times I’ve played some favorite songs (‘Play Count’) and the number of days over which I played the songs (‘Played on Number of Days’). It looks like I generally play a song only once per day. Also, given that the play count is less than the day count for some songs, I must skip some favorites if I have heard them too many times recently or if the song does not fit my mood at the time. So, no obsessive playing here!
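A comparison like this can be approximated from the raw play records. The sketch below counts rows per song and the number of distinct calendar days on which each song appears. It is only an illustration: Apple's own play count fields may be derived differently, and the input shape (song title plus an ISO-style timestamp) is an assumption. The sample plays are invented.

```python
from collections import defaultdict


def plays_and_days(plays):
    """Map each song title to (number of play records, number of distinct days).

    plays is an iterable of (song_title, timestamp) pairs where the timestamp
    starts with YYYY-MM-DD.
    """
    counts = defaultdict(int)
    days = defaultdict(set)
    for title, timestamp in plays:
        counts[title] += 1
        days[title].add(timestamp[:10])
    return {title: (counts[title], len(days[title])) for title in counts}


# Made-up play records.
sample_plays = [
    ("Song A", "2021-03-01T08:00:00"),
    ("Song A", "2021-03-01T21:00:00"),
    ("Song A", "2021-03-05T08:00:00"),
    ("Song B", "2021-03-02T10:00:00"),
]

summary = plays_and_days(sample_plays)
for title, (play_count, day_count) in summary.items():
    print(title, play_count, day_count, round(play_count / day_count, 2))
```

A ratio close to 1 means a song is typically played once on the days it is played at all.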
I also wondered if I favor certain types of songs on different days of the week, different times of the day, or even different months of the year. My intuition says that I do. With the Apple data, it was easy to visualize the genres I played at different times. Here, for example, are the genres I played most frequently during each month of the year:
I clearly favor rock songs, with alternative and pop music added for some occasional variety. July and August seem to be the months when I prefer the variety.
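A monthly genre ranking of this kind takes only a few lines. This sketch assumes each play has been reduced to a timestamp that `datetime.fromisoformat` can parse and a genre label; both the input shape and the sample values are assumptions for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


def top_genres_by_month(plays, n=3):
    """Return {month number: [(genre, plays), ...]} with the n most played genres."""
    by_month = defaultdict(Counter)
    for timestamp, genre in plays:
        month = datetime.fromisoformat(timestamp).month
        by_month[month][genre or "Unknown"] += 1  # blank genres are grouped together
    return {month: counter.most_common(n) for month, counter in sorted(by_month.items())}


# Made-up play records.
sample_genre_plays = [
    ("2021-07-04T18:30:00", "Rock"),
    ("2021-07-10T09:00:00", "Rock"),
    ("2021-07-11T09:00:00", "Pop"),
    ("2021-12-24T20:00:00", "Alternative"),
    ("2021-12-25T10:00:00", ""),
]

monthly = top_genres_by_month(sample_genre_plays)
for month, ranking in monthly.items():
    print(month, ranking)
```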
That said, I was surprised at just how much rock I seem to play. Admittedly I love it. But I also believe I have pretty broad taste in music.
So, I questioned the accuracy of the genre assigned to the songs in Apple’s data. For one thing, 10,083 of the 22,313 song plays in my file had no genre assigned to them. Also, there appears to be a lot of overlap in the genres assigned. For example, “R&B/Soul”, “Soul and R&B”, “Soul”, and “R&B / Soul” are all genres assigned to different songs in my data. The totals in the chart above would certainly be different if I recast the genres of all songs to use a consistent genre naming scheme.
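Recasting the genres need not be a big job. The sketch below reduces each label to its set of words, ignoring case, punctuation and the word “and”, and then looks that set up in a small alias table. The table is an assumption: in particular, folding plain “Soul” into “R&B/Soul” is a judgment call you may not agree with.

```python
import re

# Alias table keyed by the set of words in a label (an illustrative choice).
GENRE_ALIASES = {
    frozenset({"r", "b", "soul"}): "R&B/Soul",
    frozenset({"soul"}): "R&B/Soul",  # judgment call: treat plain Soul as the same genre
}


def normalize_genre(label):
    """Return a canonical genre name, or 'Unknown' for missing labels."""
    if not label or not label.strip():
        return "Unknown"
    words = frozenset(re.findall(r"[a-z0-9]+", label.lower())) - {"and"}
    return GENRE_ALIASES.get(words, label.strip())


for raw in ["R&B/Soul", "Soul and R&B", "Soul", "R&B / Soul", "Rock", ""]:
    print(repr(raw), "->", normalize_genre(raw))
```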
Rather than invest the time to update the genres, I decided on another test to determine if the trends in the chart truly represent my playing patterns. Since Apple includes song play ending reasons in the data, I looked to see if I tend to skip past rock songs more frequently than other genres, indicating that I try to play other genres when too many rock songs are being played.
As it turns out, I do not skip past rock songs significantly more than I skip past other genres that I listen to frequently. I’ll have to face it — I am a die-hard rock fan.
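A skip-rate comparison along these lines can be sketched as follows. It uses the end reason values named earlier and, as a simplifying assumption, ignores plays that ended for any other reason, so the rate is skips divided by skips plus natural endings. The sample plays are invented.

```python
from collections import defaultdict

SKIPPED = "TRACK_SKIPPED_FORWARDS"
FINISHED = "NATURAL_END_OF_TRACK"


def skip_rate_by_genre(plays):
    """Return {genre: share of plays skipped} over skipped and finished plays only."""
    skipped = defaultdict(int)
    total = defaultdict(int)
    for genre, end_reason in plays:
        if end_reason not in (SKIPPED, FINISHED):
            continue  # other end reasons are left out of this comparison
        total[genre] += 1
        if end_reason == SKIPPED:
            skipped[genre] += 1
    return {genre: skipped[genre] / total[genre] for genre in total}


# Made-up play records.
sample_end_reasons = [
    ("Rock", SKIPPED),
    ("Rock", FINISHED),
    ("Rock", FINISHED),
    ("Rock", FINISHED),
    ("Pop", SKIPPED),
    ("Pop", FINISHED),
    ("Pop", "MANUALLY_SELECTED_PLAYBACK_OF_A_DIFF_ITEM"),
]

rates = skip_rate_by_genre(sample_end_reasons)
for genre, skip_rate in sorted(rates.items()):
    print(genre, f"{skip_rate:.0%}")
```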
Another interesting file is called “…/Media_Services/Stores Activity/Other Activity/App Store Click Activity.csv”. While I do not analyze it here, I recommend it to anyone who wants to get a sense for the type of data a retailer may want to track for activity on their website. For me, it included 4,900+ records with detailed history of my activity while in the App Store and, apparently, in Apple Music. Types of actions I took, dates/times, A/B test flag, search terms, and data presented to me (“impressed” is the term used) are among the items included in the file.
One last potentially interesting file for analysis is called “…/Media_Services/Stores Activity/Other Activity/Apple Music Click Activity V3.csv”. It includes the city and longitude/latitude of the IP address where, I assume, I was using Apple Music. For me, the file had 10,000 records.
Verizon
After a long 80+ day wait, Verizon notified me I could download my data. It included 17 csv files for a total of 1.4 meg of data. Most of the files covered account administrative information (cell line descriptions, device information, billing history, order history, etc.), the history of notifications that Verizon sent to me, and my recent texting history (but without text contents). Though Call History and Data Usage files were provided, they were empty except for a notation that the data was “Masked for security”.
Verizon provided two documentation files. One contained the names and general descriptions of 34 possible files that could be included in a download. The files included depend on the Verizon services you use. The second documentation file contained a description of 3,091 data fields that could appear in the files. While the data field descriptions are helpful, they lack some detail. For example, a lot of fields are described as containing codes for various purposes; however, the codes themselves and their meanings are not described.
One file that was extremely interesting is called “…/Verizon/General Inferences.csv”. It contains a spectacular amount of demographic information about me and about other people in my household. Here is how Verizon’s documentation describes the file:
“The General Inferences file provides information general assumptions and inferences to deliver more relatable and relevant content across our platforms. This may include information like Attributes, Preferences, or Opinions.”
Based on the nature of the demographic features, I assume most of it was acquired by Verizon from outside data aggregators and not gathered by Verizon directly from me. The number and scope of demographic features far exceed any information that I ever provided directly to Verizon.
In fact, the Verizon documentation speaks about another file called the “General” information file (not included in my download). The documentation says the “General” file includes data that came from external information sources. My guess is the information in the “General Inferences” file also comes from those external sources. Some of the financial data in the “General Inferences” file could have come from the credit report that Verizon requires its customers to provide.
A total of 332 demographic features were included in my General Inferences data. Here is an abridged list including some of the more surprising features:
All of the General Inferences features are apparently used by Verizon to market to me and retain me as a customer. As you can see in the above list, features about my spouse and our children are also included. You can see the complete list of 332 features here.
A few of the features that I found to be truly unusual include:
One has to wonder if those types of data elements are really needed by Verizon to help it provide service to me and, if so, how Verizon uses them.
Amazon
Amazon provided 214 files containing 4.93 meg of data. Several of the files covered:
- Account preferences;
- Order history;
- Fulfillment and returns history;
- Viewing and listening history (Amazon Prime Video and Amazon Music);
- Kindle purchases and reading activity; and,
- search history including search terms.
If I were an Alexa customer or a Ring customer, I assume I would have received data for my activity on those services as well.
Six .txt files contained high-level descriptions of a few of the downloaded data files. Several .pdf files contain documentation for fields in the downloaded files (the “Digital.PrimeVideo.Viewinghistory.Description.pdf” file, for example).
The most interesting files from Amazon pertain to the marketing audiences associated with me by Amazon, its advertisers, or “third parties”. I presume the third parties are data vendors from whom Amazon purchases data.
The “…/Amazon/Advertising.1/Advertising.AmazonAudiences.csv” file contains the audiences that Amazon itself assigned me to. Here is a sample of the 21 audiences:
Amazon’s own audience assignments are largely accurate when I consider products that I purchased or searched for, either for myself or on behalf of others.
The “…/Amazon/Advertising.1/Advertising.AdvertiserAudiences.csv” file apparently contains a list of Amazon advertisers who brought their own audiences to Amazon and whose audience lists included me. The file contains 50 advertisers. Here is a sample:
I do business with or own products from some of the advertisers in the list (for example, Delta, Intuit, Zipcar) so I understand how I ended up on their audience lists. I have no connection with others on the list (for example, AT&T, Red Bull, Royal Bank of Canada) so I am not sure how I got in their audience lists.
According to Amazon, the file
“…/Amazon/Advertising.1/Advertising.3PAudiences.csv”
contains a list of
“Audiences in which you are included by 3rd parties”.
Its accuracy is poor. A total of 33 audiences are listed, 28 of which focus on automobile ownership. The remaining audiences cover gender, education level, marital status and dependents. A sample of the automobile-related audiences:
While the gender, education level and marital status assignments in the file are accurate, only a few of the automobile-related assignments in it are correct. Most are not. And I am simply not interested enough in automobiles to warrant 28 of 33 profile assignments. Mercifully, Amazon seems to ignore this data when it presents product or video recommendations to me.
Parting Thoughts
In this article, I hoped to show you the wide variety of data you can get from companies with whom you do business. The data allows you to learn what those companies think about you while also learning some surprising things about yourself!
We’ve seen that some companies correctly identify my interests in technology and travelling, while one company incorrectly sees me as an avid automobile enthusiast. In an eye-opening and somewhat unnerving moment, I realized another company has extensive demographic information about my family.
I learned I need to step up my workouts in one of the two places I call home, even though I thought my workouts were equivalent in both places. I found out that some companies (Facebook, Google) do not have a strong view of my profile. Yet the demographic picture that Verizon has of me is shockingly accurate.
The data the various companies give you supplies a rich source of raw material for experimentation. It is data that is well suited to deep analysis, modelling and visualization. For example, geographic coordinates and timestamps are available for many observations, allowing you to visualize or model your movements.
I hope you find your own set of interesting insights by downloading your personal data. Please let me know if you have noteworthy experiences in working with companies other than those I cover here.
It’s your data — Now go for it!
This post originally appeared on TechToday.