The Scandal at Kaggle
As I write these words, I marvel at my silver medal from the 2024 Automated Essay Scoring competition on Kaggle. This competition will go down… Read More »The Scandal at Kaggle
TL;DR: We built a bot that suggests a meaningful response to an ongoing conversation thread on GitHub. This bot can serve as a coding and… Read More »Compound AI Systems: Building a GitHub bot with Llama 3 and dltHub
Apple announced their new high-end Mac Pro desktop with 24 CPU cores, up to 76 GPU cores, 192 GB of memory and 800 GB/s of system memory… Read More »A DuckDB moment for application servers?
Last week I had the chance to visit a major global fashion retailer and give an industry talk on Real-time AI. This company was hosting… Read More »Time Value of Data: The Summit of Now and the Peak of Soon After
I am playing with Graphext – it’s like Trifacta, but with more powerful data science functionality. If you are a product manager or finance person,… Read More »Graphext, data insights for non-data scientists
Update: Added the Stanford NLP link for constituent parse trees in text form. I needed to visualize a sentence parse tree of the “constituent” variety… Read More »Showing constituent parse trees in the browser
About a month ago I wrote a 3-part blog series (parts 1, 2, and 3) on predicting user engagement with news in Reddit communities (subreddits).… Read More »Predicting user engagement with news on Reddit using Kaggle or Colab
Ever wanted to define alerts and monitor the status of your Cloud Dataflow jobs programmatically instead of checking some UI every 30 minutes? You can… Read More »How to programmatically monitor your Cloud Dataflow jobs
Ever wanted to track your resource usage and costs by specific Cloud Dataflow jobs? Cloud Dataflow recently started labeling billing records with Job Ids. Here… Read More »Calculating per-job Cloud Dataflow costs - now possible with job labels
I am working on the Reddit Community Engagement analysis, and one of my data sources is the GDELT BigQuery dataset. I love the richness of… Read More »Building dictionaries for Word Encodings using BigQuery SQL
What happens if you take a huge cross-section of the world’s news (The GDELT Project), mix it with the biggest online discussion website, and try… Read More »Predicting social engagement for the world’s news with TensorFlow and Cloud Dataflow: Part 1
Uber lessened the anxiety of cab callers by providing time estimates and position of cars on the map. Standing in an early-morning Starbucks line waiting… Read More »Million Dollar Idea: Order Tracking at Starbucks
One of my trend calculation queries in the Opinion Analysis project started causing trouble recently. It would run for 270 seconds and break with an… Read More »How to speed up your BigQuery query 31x by replacing a self-join with two UNNEST() operations
I published an architecture of a serverless ETL solution for Sirocco on the GCP Big Data blog. This solution scales from a few news articles… Read More »Serverless ETL for Sirocco on Google Cloud
I got inspired to write this blog by a post I saw today on the French presidential election. Plutchik is really the strongest framework I… Read More »Opinion Analysis of Text using Plutchik
This is part two of the Sirocco “modernization” series. In the old, SharpNLP version of Sirocco, we used WordNet version 2.7 to… Read More »Selecting a Java WordNet API for lemma lookups
TL;DR: Automatic conversion from C# to Java is possible! When we set out to develop Cuesense in the late 2000s, my partners and I standardized on the… Read More »Modernizing Sirocco from C# and SharpNLP to Java and Apache OpenNLP
It’s been a long time in the making, but we are finally ready to release Sirocco, the opinion extraction library based on Robert Plutchik’s emotion framework, under the… Read More »Sirocco released under Apache 2.0 license to github