Data and Coffee – Database meets coffee mug

The Scandal at Kaggle

by Serhii Sokolenko
July 15, 2024

As I write these words, I marvel at my silver medal from the 2024 Automated Essay Scoring competition on Kaggle. This competition will go down…

Compound AI Systems: Building a GitHub bot with Llama 3 and dltHub

by Serhii Sokolenko
April 23, 2024

TL;DR: We built a bot that suggests a meaningful response to an ongoing conversation thread on GitHub. This bot can serve as a coding and…

A DuckDB moment for application servers?

by Serhii Sokolenko
June 6, 2023

Apple announced their new high-end Mac Pro desktop with 24 CPU cores, up to 76 GPU cores, 192 GB memory and 800GB/s of system memory…

Time Value of Data: The Summit of Now and the Peak of Soon After

by Serhii Sokolenko
April 2, 2023 (updated February 21, 2024)

Last week I had the chance to visit a major global fashion retailer and give an industry talk on Real-time AI. This company was hosting…

Graphext, data insights for non-data scientists

by Serhii Sokolenko
November 28, 2022

I am playing with Graphext – it’s like Trifacta, but with more powerful data science functionality. If you are a product manager or finance person,…

Showing constituent parse trees in the browser

by Serhii Sokolenko
May 21, 2018 (updated November 28, 2022)

Update: Added the Stanford NLP link for constituent parse trees in text form. I needed to visualize a sentence parse tree of the “constituent” variety…

Predicting user engagement with news on Reddit using Kaggle or Colab

by Serhii Sokolenko
May 8, 2018 (updated September 18, 2021)

About a month ago I wrote a 3-part blog series (parts 1, 2, and 3) on predicting user engagement with news in Reddit communities (subreddits).…

How to programmatically monitor your Cloud Dataflow jobs

by Serhii Sokolenko
March 23, 2018 (updated November 28, 2022)

Ever wanted to define alerts and monitor the status of your Cloud Dataflow jobs programmatically instead of checking some UI every 30 minutes? You can…

Calculating per-job Cloud Dataflow costs - now possible with job labels

by Serhii Sokolenko
March 1, 2018 (updated November 28, 2022)

Ever wanted to track your resource usage and costs by specific Cloud Dataflow jobs? Cloud Dataflow recently started labeling billing records with Job Ids. Here…

Building dictionaries for Word Encodings using BigQuery SQL

by Serhii Sokolenko
February 7, 2018 (updated September 18, 2021)

I am working on the Reddit Community Engagement analysis, and one of my data sources is the GDELT BigQuery dataset. I love the richness of…

Predicting social engagement for the world’s news with TensorFlow and Cloud Dataflow: Part 1

by Serhii Sokolenko
December 15, 2017 (updated November 28, 2022)

What happens if you take a huge cross-section of the world’s news (The GDELT Project), mix it with the biggest online discussion website, and try…

Million Dollar Idea: Order Tracking at Starbucks

by Serhii Sokolenko
September 2, 2017 (updated September 18, 2021)

Uber lessened the anxiety of cab callers by providing time estimates and the position of cars on the map. Standing in an early-morning Starbucks line waiting…

How to speed up your BigQuery query 31x by replacing a self-join with two UNNEST() operations

by Serhii Sokolenko
July 9, 2017 (updated September 18, 2021)

One of my trend calculation queries in the Opinion Analysis project started causing trouble recently. It would run for 270 seconds and break with an…

Serverless ETL for Sirocco on Google Cloud

by Serhii Sokolenko
May 11, 2017 (updated September 18, 2021)

I published an architecture of a serverless ETL solution for Sirocco on the GCP Big Data blog. This solution scales from a few news articles…

Opinion Analysis of Text using Plutchik

by Serhii Sokolenko
May 4, 2017 (updated September 18, 2021)

I got inspired to write this blog by a post I saw today on the French presidential election. Plutchik is really the strongest framework I…

Selecting a Java WordNet API for lemma lookups

by Serhii Sokolenko
April 23, 2017 (updated September 18, 2021)

This is part two of the Sirocco “modernization” series. In the old, SharpNLP version of Sirocco, we used WordNet version 2.7 to…

Modernizing Sirocco from C# and SharpNLP to Java and Apache OpenNLP

by Serhii Sokolenko
April 20, 2017 (updated September 18, 2021)

TL;DR: Automatic conversion from C# to Java is possible! When we set out to develop Cuesense in the late 2000s, my partners and I standardized on the…

Sirocco released under Apache 2.0 license to GitHub

by Serhii Sokolenko
April 16, 2017 (updated September 18, 2021)

It’s been a long time in the making, but we are finally ready to release Sirocco, the opinion extraction library based on Robert Plutchik’s emotion framework, under the…