
Predicting social engagement for the world’s news with TensorFlow and Cloud Dataflow: Part 1

What happens if you take a huge cross-section of the world’s news (The GDELT Project), mix it with the biggest online discussion website, and try to predict what drives the online conversation about the news on Reddit? Is it the author of the news article, the publication time, the title, the contents of the post? Could it be the submitter on the discussion board or the sub-forum where the news was posted?

One natural hypothesis to make is that the contents of a news article will influence the subreddit that this news post will end up being posted in. That’s a logical thing to assume, but is there proof, and if so, can we get a list of topics and correlate them with subreddits? In addition to the subreddit, can we deduce the factors that will drive the highest number of participants on the discussion thread, and the highest number of comments?

What’s the use of these insights? Imagine a news publisher or a marketing person who wants to see more engagement with the content that they are creating. Knowing what engages your audience can drive up your page views and ultimately revenue. And if you are a researcher who is tracking how news is being disseminated across various audiences, having this understanding will help you too.

A prediction task such as this one is a prime candidate for Machine Learning, and TensorFlow is the authoritative framework for training ML models and running predictions. In a sense, GDELT’s catalog of news coverage and Reddit discussions are a match made in heaven, because GDELT has the features, the inputs, for training models and Reddit has the labels, or the outputs of predictions. GDELT has articles indexed by the article URL and attributes such as Title, Author, Publication Date, the full content of the article etc. Reddit also has the article URL, but in addition to that brings the submitter of the article to the board, the subreddit (topic area), and all of the comments that the users of Reddit have created based on the submission.

As we went about building the TensorFlow model for this prediction task, we had to first assemble the dataset from which TensorFlow would be able to learn. In this multi-part blog series we will first explain how we built this input dataset, bringing GDELT and Reddit data together, and then we will explore training and predicting of outcomes.

Our first task was to place the GDELT and Reddit data into a database that was easy to query. BigQuery was the natural choice here. There are GDELT and Reddit BigQuery datasets already, but we wanted to do a deeper sentiment analysis on the raw content of news articles and Reddit posts and comments, and for that we used the Dataflow Opinion Analysis project (github repo) we wrote about earlier.

Back in June 2017 Reza Rokni and John LaBarge — two GCP solutions architects — shared their best practices for developing production-quality Dataflow pipelines in a two-part blog series (blog 1, and blog 2). As we were solidifying the design for our data preparation pipeline, we used several of these best practices to future-proof our opinion analysis infrastructure. The rest of the article will explain in more detail how we applied these Dataflow design practices to the data problems we encountered in building our training dataset.

The top 10 design patterns for Dataflow can be roughly divided according to the regular lifecycle of a Dataflow pipeline:

  • Orchestrating the execution of a pipeline
  • Onboarding external data
  • Joining data
  • Analyzing data and
  • Writing data

We used patterns in almost every one of the above categories. For onboarding external data, we used the external service access pattern, which allowed us to call Cloud NLP to enrich our dataset with additional subject information. For joining the Reddit posts and comments, we used the CoGroupByKey pattern. When performing data analysis, we relied on the GroupByKey pattern for key classes with multiple properties. And, lastly, when dealing with invalid or malformed input data, we implemented a Bigtable sink that collected invalid input records according to the Dead Letter queue pattern.

Let’s dive deeper into how these patterns helped us assemble our training dataset.

As we mentioned before, we sourced half of our training set (the features) from GDELT using files in gzipped JSON format. The other half of our training set (the labels) consists of Reddit posts and their comments, available in a BigQuery dataset in two tables: posts and comments. We wanted to run both the GDELT news articles and the Reddit posts/comments through a very similar processing pipeline, extracting subjects and sentiments from both of them, and storing the results in BigQuery. We therefore wrote a Dataflow pipeline that can change its input source depending on a parameter we pass to the job, but does all the downstream processing in the same way. Ultimately, we ended up with two BigQuery datasets that had the same schema, and that we could join with each other via the URL of the news article post using BigQuery’s SQL.

Here is the high-level design of the Dataflow pipeline that we used to bring both the GDELT and Reddit data over to BigQuery.

  • Read from inputs
  • Filter already processed URLs
  • Filter exact duplicates
  • Index (extract opinions and tags)
  • Filter “approximate” duplicates
  • Enrich with Cloud NLP
  • Write opinions to BigQuery and invalid inputs to Bigtable

As we were implementing the input reader for Reddit data, we had to join the posts and comments table together. This join implementation follows the recommendations of our CoGroupByKey design pattern. For each dataset in the join we created a KV key-value pair and then applied the CoGroupByKey operation to match the keys between the two datasets. After the join we iterated over the results to create a single entity that had both post and comment information.
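To make the mechanics concrete, here is a minimal plain-Java sketch of what the CoGroupByKey join does conceptually: both collections are keyed by the article URL, and the co-grouped results are iterated to build one combined entity per post. This is an illustrative simulation of the semantics, not the Beam API itself; the class and method names are ours.

```java
import java.util.*;

class CoGroupJoinSketch {
    // Simulates CoGroupByKey semantics: for each URL key, pair the post
    // with all of the comments that share the same key.
    static Map<String, List<String>> joinPostsAndComments(
            Map<String, String> postsByUrl,            // URL -> post title
            Map<String, List<String>> commentsByUrl) { // URL -> comment bodies
        Map<String, List<String>> joined = new LinkedHashMap<>();
        for (Map.Entry<String, String> post : postsByUrl.entrySet()) {
            List<String> entity = new ArrayList<>();
            entity.add(post.getValue()); // the post itself comes first
            entity.addAll(commentsByUrl.getOrDefault(post.getKey(),
                    Collections.emptyList()));
            joined.put(post.getKey(), entity);
        }
        return joined;
    }
}
```

In the real pipeline each side of the join is a PCollection of KV pairs and the co-grouping is distributed across workers, but the per-key iteration after the join looks very much like the loop above.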

Pro Tips:

  • Side Inputs could be a faster alternative to CoGroupByKey when joining datasets if one of the datasets can fit into the memory of a Dataflow worker VM.
  • Use the service-based Dataflow Shuffle for better performance of your CoGroupByKey. Add the following parameter to your pipeline: --experiments=shuffle_mode=service

After reading our inputs — either news articles from GDELT, or posts from Reddit — we proceeded to eliminate duplicates from our data. Duplicate elimination is typically done either by looking up a record ID in a list of already processed IDs or, if the record hasn’t been processed yet, by grouping elements by some key that represents uniqueness.

There are three levels of filtering in our pipeline that use both of these techniques. In the first filter we simply check if the URL of the record (in effect, the ID column in our dataset) was already processed and stored in BigQuery. For this check we use a side input of hashes of the URLs and populate it from the BigQuery dataset we are writing to.

In the second and third filters, which we apply only to news articles from GDELT, we look at the contents of the news article. We don’t need to apply this filtering to Reddit records because our data source guarantees the uniqueness of posts. However, the GDELT data source is highly redundant, reflecting the fact that the world of news is highly repetitive as well.

Original news stories are still written by staff at major news publications or by bloggers, but a large percentage of news websites, especially regional publications, repost news from news wires such as Associated Press, Reuters, Agence France-Presse, Deutsche Welle, etc. Our pipeline makes an effort to identify such reposts and group them together.

A note on terminology we will use in the remainder of this blog:

  • Original post: A web page where the original news story was published for the first time
  • Repost: A web page which re-posted/republished an original story
  • Publisher: the domain where a story (original or reposted) was published

When grouping raw news articles, we first try to match articles that are identical in content, without any variation in text. In our pipeline this is represented by Filter #2, and this filtering is implemented by grouping all incoming text articles by the hash we calculate over the text of the article concatenated with the Publication Date.
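In plain Java, Filter #2 amounts to building a key from the article text plus publication date and keeping the first article seen per key. The sketch below illustrates the idea; it uses String.hashCode() for brevity, whereas a production pipeline would use a stronger hash.

```java
import java.util.*;

class ExactDedupeSketch {
    // Grouping key for Filter #2: a hash over the article text
    // concatenated with the publication date.
    static int exactKey(String text, String pubDate) {
        return (text + "|" + pubDate).hashCode();
    }

    // Keep the first article per key; everything else is an exact duplicate.
    static List<String[]> dedupe(List<String[]> articles) { // {text, pubDate}
        Map<Integer, String[]> unique = new LinkedHashMap<>();
        for (String[] a : articles) {
            unique.putIfAbsent(exactKey(a[0], a[1]), a);
        }
        return new ArrayList<>(unique.values());
    }
}
```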

After this initial, simplistic grouping, we then try to group articles that have small variations in text, for example because they prefix the text with the source of the publication (e.g. “CHICAGO (AP)”), or because they add a copyright statement at the end. This is a more advanced grouping than just taking a hash over the text. Here, we need to group on combinations of attributes of the article.

The grouping pattern recommends defining a class that has all of the attributes by which you want to group, and then annotating this class with AvroCoder as its @DefaultCoder. By doing that we get the serialization/deserialization of objects of this class for free, and we can then use GroupByKey to group by it.

/**
 * @param indexes Document indexes
 * @return a POJO containing 2 PCollections: Unique docs, and Duplicates
 */
private static ContentDuplicateOrNot filterSoftDuplicates(
    PCollection<ContentIndexSummary> indexes) {

  // Output tag names below are illustrative; see the github repo for the full listing.
  PCollectionTuple dedupeOrNot = indexes
    .apply("Extract Text grouping key",
      ParDo.of(new GetContentIndexSummaryKeyFn()))
    .apply("Group by Text grouping key",
      GroupByKey.<ContentSoftDeduplicationKey, ContentIndexSummary>create())
    .apply("Eliminate Text dupes",
      ParDo.of(new EliminateTextDupes())
        .withOutputTags(uniqueDocumentTag, TupleTagList.of(duplicateDocumentTag)));

  PCollection<TableRow> dedupedWebresources = dedupeOrNot
    .get(duplicateDocumentTag)
    .apply(ParDo.of(new CreateWebresourceTableRowFromDupeIndexSummaryFn()));

  ContentDuplicateOrNot contentDuplicateOrNot = new ContentDuplicateOrNot(
    dedupeOrNot.get(uniqueDocumentTag), dedupedWebresources);

  return contentDuplicateOrNot;
}

The document attributes that have proven effective in finding similar content include the title of the document, the document length rounded to the nearest thousand characters, and the topics discussed in the article. The title and document length are attributes we get without any additional processing because they are part of the source dataset. To get the topics of a document we need to run NLP and sentiment analysis, and this is why this third and last filtering step is slotted behind the “Indexing” step. The content-based grouping is highly accurate and allows us to reduce the number of stories we insert into our analysis dataset by about 45%. For example, in June 2017 there were 6.1M web pages from which the GDELT project extracted English-language news, but only 3.7M unique stories. The additional benefit of this deduplication is that we are able to aggregate the social impact of all the reposts of an original story.
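These attributes can be combined into a grouping-key class, as the grouping pattern suggests. The following is a minimal plain-Java sketch of such a key (field names are ours); in the actual Beam pipeline the class would additionally be annotated for AvroCoder, but the essential part is that two near-duplicate articles produce equal keys.

```java
import java.util.*;

class SoftDedupeKeySketch {
    // Illustrative grouping key: title, length bucket, and topics.
    final String title;
    final int lengthBucket; // document length rounded to the nearest thousand
    final Set<String> topics;

    SoftDedupeKeySketch(String title, int docLength, Set<String> topics) {
        this.title = title;
        this.lengthBucket = Math.round(docLength / 1000f) * 1000;
        this.topics = new TreeSet<>(topics); // sorted, so equal sets compare equal
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof SoftDedupeKeySketch)) return false;
        SoftDedupeKeySketch k = (SoftDedupeKeySketch) o;
        return title.equals(k.title)
            && lengthBucket == k.lengthBucket
            && topics.equals(k.topics);
    }

    @Override public int hashCode() {
        return Objects.hash(title, lengthBucket, topics);
    }
}
```

With this key, a 5,120-character original and a 5,480-character repost (same title and topics) land in the same group, while a substantially longer article does not.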

As we processed text documents in the Indexing step, we sometimes encountered cases where the inputs were malformed and could not be processed. For example, in GDELT we sometimes saw text full of CSS formatting, the result of edge cases in the GDELT web crawler. Or, and this is typical for social content on Reddit, we saw posts that consisted entirely of emoticons or Unicode symbol characters that could not be processed by NLP libraries. This is another opportunity to reach into our bag of patterns.

Normally, a developer would place a potentially “dangerous” block of code into a try/catch block and handle exceptions by logging them. The Dead Letter processing design pattern recommends, in addition to logging, redirecting such bad records into a side output and then storing them in a database that supports fast writes, for example Bigtable, Datastore or BigQuery. It then recommends periodically reviewing the contents of the Dead Letter tables and debugging these bad records to determine the cause of the exceptions. Usually, addressing bad records will require a change in business logic, coverage of an additional edge case, or improved data collection, but at the very least we are not ignoring these errors, and we preserve data that would otherwise have been lost. In our pipeline we implemented a side output in the Indexing step and a Bigtable sink for this side output.
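The control flow of the pattern is simple enough to sketch outside of Beam: records that throw during indexing are routed to a second output instead of being dropped. In the sketch below the two lists stand in for the main output and the dead-letter side output; the "no indexable text" check is a stand-in for the real failure modes described above.

```java
import java.util.*;

class DeadLetterSketch {
    final List<String> indexed = new ArrayList<>();     // main output
    final List<String> deadLetters = new ArrayList<>(); // side output (Bigtable-bound)

    // Simulates the Indexing step: bad records go to the dead-letter
    // side output, together with the reason, instead of being lost.
    void process(List<String> records) {
        for (String r : records) {
            try {
                if (r == null || r.codePoints().noneMatch(Character::isLetter)) {
                    throw new IllegalArgumentException("no indexable text");
                }
                // Stand-in for the real opinion extraction work.
                indexed.add(r.toLowerCase(Locale.ROOT));
            } catch (Exception e) {
                deadLetters.add(r + " | " + e.getMessage());
            }
        }
    }
}
```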

We’ve talked about how the Indexing step of the pipeline finds topics of documents by using the Sirocco opinion extraction library. Sirocco is not the only way to get the topics of documents. Cloud NLP provides an API for classifying a text document into ~700 categories. Additionally, it provides mechanisms for extracting entities from text, which is very similar to what Sirocco does. We integrated entity extraction by Cloud NLP with entity extraction by Sirocco to benefit from a richer classification of text in our pipeline, and to do that we applied the external service access pattern mentioned earlier.
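After the Cloud NLP call returns, the two entity lists need to be merged into one tag set. A plausible plain-Java sketch of that merge step (the method name and the case-insensitive dedup policy are our assumptions, not the project's exact implementation):

```java
import java.util.*;

class EntityMergeSketch {
    // Merges entity tags from two extractors, deduplicating
    // case-insensitively while keeping the first-seen casing.
    static List<String> mergeEntities(List<String> siroccoTags,
                                      List<String> nlpTags) {
        Map<String, String> byLower = new LinkedHashMap<>();
        for (String t : siroccoTags) byLower.putIfAbsent(t.toLowerCase(Locale.ROOT), t);
        for (String t : nlpTags)     byLower.putIfAbsent(t.toLowerCase(Locale.ROOT), t);
        return new ArrayList<>(byLower.values());
    }
}
```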

After the opinion extraction in the Indexing step, the Cloud NLP enrichment, and the final filtering steps, the data is ready to be inserted into BigQuery. Note that the Reddit and GDELT data end up in distinct datasets; however, the schemas of these datasets are the same, and we can join the news articles in GDELT with the Reddit posts via the URL field of the news post.

This concludes the first part of our blog series, focused on preparing the data for model training. We have extracted features from our GDELT article dataset, such as the entities found in the article, the author of the news post, the publication date and time, and the sentiment. We also have our training labels extracted from the Reddit posts and comments, including the submitter of the Reddit post, the subreddit, and the sentiment of the comments. We have cleansed our training data of duplicates, and in the process gained useful information about reposts, which will help us provide a more accurate estimate of social impact.

Next steps

In the next set of blogs we will introduce our methodology for developing our TensorFlow model. We hope you will be able to apply the Dataflow best practices in your own data processing pipelines. Here are a few useful links if you want to learn more.

Guide to common Cloud Dataflow use-case patterns, Part 1

Guide to common Cloud Dataflow use-case patterns, Part 2

Dataflow Opinion Analysis github repo
