I published the architecture of a serverless ETL solution for Sirocco on the GCP Big Data blog. This solution scales from a few news articles in a cloud bucket to millions of news posts in a database, taking advantage of Cloud Dataflow’s autoscaling features. With this post, you should now have all the components for building a news monitoring or opinion tracking solution. I know this because I am using exactly the same setup for an actual news monitoring solution; more about it in a future post.
Here is what I suggest you do:
- Read about Plutchik’s framework for emotion analysis to understand the theory behind this solution
- Read about the ETL solution
- Go to the GitHub repo and follow the instructions in the README. Set up your own processing pipeline and run a test that crunches a few news articles I uploaded to the test folder.