My daughter has pointed out to me that I spend too much time on Twitter, and to emphasize the point she bought me a Twitter addict mug. Rather than argue that she is unfairly characterizing the way I keep up with developments in the world of science and medicine, I decided that the best defense would be to turn to analyzing tweets and make my Twitter habit a programming project. As my timeline was filled with tweets about COVID, I decided to focus on tweets about the disease and the virus.

I first had to find the tweets to analyze. I used two data sources: one covering 2020, and an “ongoing” data set that has grown incrementally since mid-December through a script scheduled to run weekly. For a first pass, I decided to explore how often the topics I recalled from my own Twitter “explorations” occurred within the tweets about COVID-19: education (e.g. school closings, remote learning), vaccination, medical therapy (e.g. medications thought to work for COVID-19 treatment), and children (from neonates to teenagers). What about masks? Also, how much did politics feature in the COVID-19 tweets? For each of those topics I crafted a simple filter for the content of each tweet (see the filters here). I also identified two groups of tweeters: those working for news organizations and those identifying as doctors or healthcare organizations (see Tweeter category definitions). Both the topics and the user groups were defined by iterating through different filters and then inspecting the tweets selected.

This process has none of the rigor and validation of other efforts (including mine in applying natural language processing to electronic health records). That is, this project is really for my own entertainment, and in the COVID-19-shrunken social sphere I’ve inhabited over the past year, the programming is a nerdy balm. I welcome suggestions, especially corrections (including typos), and comments. Also, this is just a first pass to familiarize myself with these data sets.

So let’s start with the big picture. What was the frequency of the topics mentioned within the corpus of tweets pertaining to COVID-19 sampled across 2020?

The figure above shows the frequency of tweets across all tweeters in 2020 (corresponding figures for news organizations and medical organizations/doctors follow below). The legend in the figure identifies the 6 topics corresponding to the defined textual patterns (see Topic Filters). The Y axis is logarithmic, so tweet topics appearing towards the top of the graph are at least 100 times more frequent than those towards the bottom. It’s clear that these 6 topics still account for a minority of COVID-19 tweets, as the large majority of the daily topic frequencies are well below 10% of all tweets that day. Tweets about the President and elections (recall that all of the tweets analyzed here mention COVID-19 or the SARS-CoV-2 virus) were consistently a higher fraction of tweets on each day than the other 5 topics.

Vaccines were a hot topic in February and March but dropped down by a factor of ten only to rise dramatically starting in October. Tweets about education, including remote schooling, and school closings were consistently higher than those about vaccines with peaks in March and August/September. Tweets about masks started lower but rose to compete with Pres./election by October. Tweets about pediatrics or children were about as frequent as those about vaccines.

Medical therapy tweets, covering a broad range of therapies (from HCQ to remdesivir to dexamethasone), were one to two orders of magnitude lower in frequency than other tweets for most of the year, although there were some large spikes in March, May and June.

In medical Twitter (as defined here), education vied with masks in dominating the other topics, except for immunization/vaccines at the beginning and end of the year. Tweets about the president (Trump or Biden) and elections (see Topic filters) within COVID-19 tweets are as frequent as the more popular tweet topics but not dominant as they are for the entire group of tweeters. Unlike for the news tweeters (see below) and the overall group, medical therapies are not consistently the lowest-frequency topic. Interestingly, these 6 topics were an increasing fraction of daily tweets at the end of the year, as evidenced by the overall rising trend of plotted points from the left to the right side of the plot.

For news-organization-associated tweeters, COVID-19 tweets were dominated by the president/elections topic (see filters) followed by education and then masks. Vaccines and medical therapy were less frequent than for the medical tweeters group except towards the end of the year.

Whereas the plots above capture some of the dynamics of tweet frequencies across the year, the diagram above summarizes these across the entire interval. Each bar represents the fraction of tweets with that topic from a particular group of users over the entire interval. Group comparisons are shown by adjacent bars. For example, there are on average fewer tweets about children (red bars) by news-organization-associated tweeters than by non-news-organization tweeters, and more about children by medical tweeters than by non-medical tweeters. Medical therapies were tweeted about far less often, even by the medical tweeters, although they did tweet more about medical therapy than the other groups. The error bars on each colored column represent a standard error.

Tech note: Because the sampling strategy of the Twitter API accessed via rtweet is opaque and the sampling varies from day to day, I’ve had to normalize per day by the total number of tweets returned that day. That makes the standard error not a particularly robust measure for statistical comparison.
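The per-day normalization described in the tech note can be illustrated with a small sketch. It is written in Python rather than the R used for this notebook, the tweets are made up, and it assumes that each bar is the mean of the daily fractions and that the standard error is taken across days; the notebook may compute these differently. The function names are hypothetical.

```python
import math
import re
import statistics
from collections import Counter

# Mask pattern copied from the topic filter table in this post.
MASK_PATTERN = re.compile(r"(M|m)ask|PPE|N95")

# Made-up (day, tweet text) pairs standing in for the sampled tweets.
SAMPLE_TWEETS = [
    ("2020-04-01", "Wear a mask to slow COVID-19"),
    ("2020-04-01", "COVID-19 cases rising again"),
    ("2020-04-02", "N95 shortage at our hospital #covid19"),
    ("2020-04-02", "COVID-19 testing site opens"),
    ("2020-04-02", "Stay home, COVID-19 is spreading"),
    ("2020-04-02", "New COVID-19 numbers released"),
]


def daily_fractions(tweets, pattern):
    """Fraction of each day's sampled tweets whose text matches the pattern."""
    totals = Counter()
    hits = Counter()
    for day, text in tweets:
        totals[day] += 1
        if pattern.search(text):
            hits[day] += 1
    return {day: hits[day] / totals[day] for day in totals}


def mean_and_standard_error(values):
    """Mean of the daily fractions and the standard error of that mean."""
    values = list(values)
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, float("nan")  # standard error is undefined for one day
    return mean, statistics.stdev(values) / math.sqrt(len(values))


fractions = daily_fractions(SAMPLE_TWEETS, MASK_PATTERN)
mean, std_err = mean_and_standard_error(fractions.values())
print(fractions)
print(round(mean, 3), round(std_err, 3))
```

Dividing by each day's own total keeps a day with a large sample from outweighing a day with a small one, which is the point of the normalization. The price is that every day counts equally regardless of how many tweets back its fraction.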

So much for the big picture. Let’s zoom in to the end of 2020 and the beginning of 2021 using the “ongoing” data set.

Unlike most of 2020, in these last few months of COVID-19 tweets, mentions of vaccines were more frequent than President/elections mentions. Also, although it might be too short an interval, there is little evidence of the steady rise of these 6 topics that characterized the bulk of 2020.

For this last interval, starting 2020-12-13 14:10:00 and ending 2021-01-22 05:00:01, there are 1502667 tweets in the “ongoing” database. For convenience (mine, so I don’t have to wait as long), we are only sampling 1/10 of the total.
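How the 1/10 sample is drawn is not stated here. One plausible reading is a simple random sample without replacement, sketched below in Python (not the R used for the notebook). The function name, the seed, and the use of consecutive integers as stand-in tweet IDs are all assumptions for illustration.

```python
import random


def sample_tenth(tweet_ids, seed=2021):
    """Draw a reproducible simple random sample of one tenth of the IDs."""
    rng = random.Random(seed)  # fixed seed so reruns pick the same tweets
    k = len(tweet_ids) // 10
    return rng.sample(tweet_ids, k)


# Stand-in IDs: one integer per tweet in the "ongoing" database.
all_ids = range(1502667)
subset = sample_tenth(all_ids)
print(len(subset))
```

Fixing the seed means that plots regenerated later are built from the same subset, so differences between runs come from new data and not from a new draw.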

Among medical tweeters, vaccine mentions in COVID-19 tweets are much more frequent than the other 5 topics. Medical therapy appears to be a less frequent topic even among medical tweeters.

News organizations are much more like the vast majority of tweeters than like medical tweeters with regard to the frequency of tweets about children and masks.

Unlike for the overall year (i.e. just 2020), the news organizations (and those tweeting individually from news organizations) appear very similar to the medical professional tweeters perhaps with comparatively less focus on children.

What is immediately striking is the difference between this bar chart and the one from the prior year as illustrated below (where the two bar charts are shown side by side). Vaccines now dominate COVID-19 tweets whereas the President/Election topic is a distant second. Education features less prominently and medical therapy has decreased to well under 1%.

Data Sources and limitations

I used two data sources. For the full 2020 year view, I reached out to my friend and colleague John Brownstein (a pioneer in digital epidemiology) and to Jared Hawkins and asked them for all tweets that mention COVID-19 or SARS-CoV-2. The file they sent me was not all such tweets but a regular sampling of such tweets, totaling [count missing] tweets. Also, around the beginning of December, I started running a cron job on my laptop every three days to get the last week of tweets (to ensure good overlap between each session) that included mention of covid19. I refer to the first data set as the “overall” data set and the second as the “ongoing” data set.
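Pulling a week of tweets every three days means consecutive pulls overlap, so the same tweet can arrive more than once and has to be dropped when the batches are combined. The sketch below shows one way to do that in Python (not the R used for the notebook). The `status_id` field name and the `merge_batches` helper are assumptions for illustration, not a description of the actual script.

```python
def merge_batches(stored, new_batch, key="status_id"):
    """Append tweets from new_batch whose ID is not already stored."""
    seen = {tweet[key] for tweet in stored}
    merged = list(stored)
    for tweet in new_batch:
        if tweet[key] not in seen:
            seen.add(tweet[key])  # also guards against repeats within a batch
            merged.append(tweet)
    return merged


# Two made-up pulls that overlap on tweet "102".
first_pull = [
    {"status_id": "101", "text": "covid19 update"},
    {"status_id": "102", "text": "covid19 vaccine news"},
]
second_pull = [
    {"status_id": "102", "text": "covid19 vaccine news"},
    {"status_id": "103", "text": "covid19 and schools"},
]

ongoing = merge_batches(first_pull, second_pull)
print([tweet["status_id"] for tweet in ongoing])
```

Because the function only appends unseen IDs, rerunning it on a batch that was already merged leaves the stored data unchanged.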

System Configuration

Configuration used to run this notebook: R version 4.0.2 (2020-06-22). Included packages: kableExtra, gridExtra, broom, scales, forcats, stringr, dplyr, purrr, readr, tidyr, tibble, ggplot2, tidyverse, rtweet

Search strings

I did not use any NLP techniques but relied on simple regular expression patterns to identify strings related to each topic. I experimented with different regular expressions until the tweets matched seemed to correspond to the intended topic. I document those patterns below, aligned with the Topic labels I used in the above graphs. By the way, if you want to read some of our publications that use more capable techniques for identifying and retrieving instances relevant to a topic, see here.

Filters for Topics and Tweeter Groups

Regular expression strings used for each topic
Topic Regular Expression
Vaccine (V|v)accin|Pfizer-BioNTech|(M|m)oderna|Novavax|Sinovax|(I|i)mmuniz
Education (E|e)ducat|(.*(S|s)chool|(C|c)ollege|(U|u)niversity.*clos|remot|Zoom.*)|(.*clos|remot|Zoom.*(S|s)chool|(C|c)ollege|(U|u)niversity.*)|(.*student|learning|class.*virtual|remote.*)|(.*virtual|remote.*student|learning|class.*)
Medical Therapy (T|t)herap(y|eu)|(R|r)emdesivir|VEKLURY|(A|a)zithromycin|Z-pac|(Z|z)ithromax|Zmax|(C|c)onvalescent (plasma|serum)|hyperimmune (plasma|serum)|(D|d)examethasone|(D|d)ecadron|(D|d)exasone|(S|s)olurex|(B|b)aycadron|(I|i)ntensol
Pres./election (T|t)rump|(P|p)resident|(B|b)iden|election|polls|ballot|vote
Children (C|c)hild|(P|p)ediat|neonat|kids|toddle
Mask (M|m)ask|PPE|N95
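
As an illustration of how patterns like these can tag a tweet, here is a minimal sketch in Python (the notebook itself uses R). Four of the six patterns are copied verbatim from the table above. The `topics_for` helper and the example tweets are hypothetical, and a tweet is allowed to match more than one topic.

```python
import re

# A subset of the topic patterns, copied verbatim from the table above.
TOPIC_PATTERNS = {
    "Vaccine": r"(V|v)accin|Pfizer-BioNTech|(M|m)oderna|Novavax|Sinovax|(I|i)mmuniz",
    "Pres./election": r"(T|t)rump|(P|p)resident|(B|b)iden|election|polls|ballot|vote",
    "Children": r"(C|c)hild|(P|p)ediat|neonat|kids|toddle",
    "Mask": r"(M|m)ask|PPE|N95",
}

# Compile once so that each tweet is only scanned, not re-parsed.
COMPILED = {topic: re.compile(pattern) for topic, pattern in TOPIC_PATTERNS.items()}


def topics_for(text):
    """Return every topic label whose pattern occurs anywhere in the text."""
    return [topic for topic, regex in COMPILED.items() if regex.search(text)]


print(topics_for("COVID-19 vaccine trial will enroll kids"))
print(topics_for("The President wore an N95 mask at the COVID-19 briefing"))
print(topics_for("Devoted nurses fight COVID-19"))
```

The patterns match substrings, not whole words, so the last example is tagged Pres./election because "Devoted" contains "vote". That kind of false positive is part of what inspecting the selected tweets is meant to catch.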

Tweeter Categories

Here are the regular expressions used to define the groups of Tweeters that we wish to study more closely.

Regular expression strings used for each group (missing non-MD medical personnel)
Group Regular Expression
Medical personnel MD|M\.D\.|(D|d)octor|internist|surgeon|physician|Dr\.|Dr\ |Dra(\ |\.) , (P|p)ediatric|(I|i)nternist|(S|s)urgeon|(P|p)hysician|(P|p)sychia|(D|d)octor|(M|m)edical.*(S|s)pecialist
News reporters/organizations (N|n)(ews|EWS)|\W*K[:upper:]+.*|The|\W*W[:upper:]+.*|.*CBS.*|.*F(OX|ox).*|.*CNN.*|.*ABC.*|.*NBC.*|Herald|Tribune|TV|(N|n)oticias|(N|n)oticiero[s]*|Post|Times|CBC|.+S(ol|OL).*|.+Sun.*|Magazine|.+(M|m)ag.*|24h|.*/d/d\./d.*|Telemundo|World|P(olitico|OLITICO)|Wire|(P|p)rensa|Globe|MSN|P(olitica|OLITICA)|Hill|Canal|Media|BNN|Bloomberg|Journal|Intelligencer|Ledger|Clarion|Register|Press|Chronicle|Telegram|Pulse|Pulso|Informativo|Enquirer|Six|Page|Financiero|Financial|Jornada|.*\.com|visión|Daily|Diario|Político|Yahoo|Google|Bing|Microsoft|.*Review.*|.*Examiner.*|Star|CSPAN|InStyle

Feedback

Comments and particularly suggested improvements are welcome.

