ApacheCon@Home - Federated Data Track

Apache Federated Data Track

Thursday 14:10 UTC
Bridging voter data and social data using Apache Streams
Brian Hodge

Apache Streams unifies a diverse world of digital profiles and online activities into common formats and vocabularies, and makes these datasets accessible across a variety of databases, devices, and platforms for streaming, browsing, search, sharing, and analytics use-cases.
Datasets related to voters (or potential voters) in the US vary widely in quality, coverage, and recency. The state of the art is often a patchwork of state voterfiles, vendor voterfiles, and third-party enrichments, and it is expensive to obtain. Moreover, the vast majority of those services don’t offer social media profile enrichment for voter records.
Social data is more important than ever for driving voter engagement, registration, and turnout. While mailing, phone, and text campaigns are still very popular, those modes of contact are increasingly irrelevant for the rising American electorate. There is a need for a system that bridges the gap between traditional voter records and social media profiles.
Our talk would focus on the usefulness of Apache Streams in building that system.
Our approach to matching voter records to social profiles (and vice versa) is to go through an intermediate network of data enrichment providers. Those providers sometimes know certain facts about a social profile that can be used to join that profile to an existing voter record. Similarly, they can sometimes return a social profile from information already attached to a voterfile record.
In either case, it’s important to have up-to-date, performant, and standards-compliant interfaces into the APIs of those enrichment services. This is what Apache Streams provides. Our talk would use the voter-to-social matching problem as a case study for how Apache Streams can abstract away the boilerplate-heavy, often time-consuming problem of writing software integrations, and instead let developers focus their efforts on solving interesting problems in novel ways.

Brian Hodge is a Data Engineer with a background in relational databases, distributed data pipelines, and machine learning. He is a member of the Apache Streams PMC. Currently, he works at Civitech, where he builds data products and infrastructure to support progressive causes.

Thursday 15:00 UTC
The Math of Reliability
Avishai Ish-Shalom

We often hear talks about “scale” and “reliability”, mostly based on personal experience and lessons learned. What can mathematics tell us about reliability and scale? Can math help us scale our systems and companies?
It turns out that failure models, probability, statistics, and other domains can help our analysis and provide insights.
Mathematics forces us to rigorously construct and analyze our models, often exposing subtle issues and misconceptions. Moreover, it allows us to expand our understanding and explore the consequences of scale and stress without actually building a system.
This talk will present simple failure models, explain the math behind common practices, show common misconceptions, introduce emergent system properties, and showcase mathematical examples of why things behave differently at scale and how things that work well in small systems can be horrible at scale.
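As a small illustration of why behavior changes with scale (an example chosen here, not taken from the talk itself), consider a simple failure model: n components that each fail independently with probability p during some period. Independence and the value p = 0.001 are assumptions made purely for this sketch.

```latex
P(\text{at least one failure}) = 1 - (1 - p)^{n}

% With p = 0.001:
n = 10:   \quad 1 - 0.999^{10}   \approx 0.01
n = 1000: \quad 1 - 0.999^{1000} \approx 0.63
```

Under these assumptions, a failure that is rare in a ten-component system becomes more likely than not once the system grows to a thousand components.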

“In a world where anything has an API, everything is a software problem.” This insight has guided Avishai Ish-Shalom throughout his diverse career working on improving the complex socio-technical systems that create and operate modern software, and promoting the use of mathematics in system design and operations. Over 15 years in various software fields and capacities, Avishai has served as Engineer in Residence at Aleph VC and engineering manager at Wix.com, co-founded Fewbytes, and consulted for many other companies on software operations, reliability, design, and culture. Currently Avishai is a Developer Advocate for ScyllaDB (The boring database ;-))

Thursday 17:10 UTC
Open Source Social Applications with Apache Streams
Steve Blackmon

Apache Streams unifies a diverse world of digital profiles and online activities into common formats and vocabularies, and makes these datasets accessible across a variety of databases, devices, and platforms for streaming, browsing, search, sharing, and analytics use-cases.
Apache Streams contains JRE-based modules that developers can use to easily integrate with online data sources and build polyglot indexes of activities, entities, and relationships - all based on public standards such as Activity Streams, or other published organizational standards.
Apache Streams makes it simple to load your data from social networks and similar sources, using API connections or full-archive data portability downloads, into a local database you can build applications on top of.
In this talk, I will give a primer on Apache Streams architecture and capabilities, as well as a demo of an application I built using my social data: Probot.
Profile Bot (aka Probot) is an open-source software package you can run to manage your social media accounts programmatically, as well as deep-dive into your social network data.
Once your data is loaded, you can browse, filter, search, and sort lists of your friends, followers, posts, shares, direct messages, etc… from the built-in browser, and perform mass actions. For example, delete all tweets where you mentioned a particular hashtag, for reasons.
Probot can also enrich your social data by running it through Apache Streams Processor modules, which can append attributes from open-source libraries and third-party APIs.
Probot runs on a small collection of Docker containers (all OSS). All of the data can be queried via SQL or the embedded database HTTP API for external integrations.
Probot is currently a single-user application, though with more engineering it could be turned into a hosted SaaS application whose users would not need to create their own Twitter application or interact with anything aside from a web browser.

Steve has worked on semantic web and big data problems professionally since 2005, presently as VP Data Science at Pix.Wine. Steve is based in Austin, TX, where he co-organizes the Austin Data Meetup and has founded and consulted with numerous early-stage companies on data architecture and strategy. Steve serves as VP, Apache Streams at the Apache Software Foundation. Steve has a Master's in Computer Science from USC, and an MBA and BS in Computer Engineering from UT Austin.

Thursday 18:00 UTC
Versatility and Functionality of Apache Drill
Misha Isran, Monty Rahman

One of the largest contributing factors to Apache Drill’s success is its open-source SQL query engine. Because Apache Drill is open source, its community can continually improve and extend the software to meet constantly evolving data science needs via UDFs (User-Defined Functions). The two clearest examples come from our own experience here at Datadistillr, where phone number and Geographic Information System (GIS) functions proved to be of high value. In the case of GIS, we used the ST_Point and ST_Distance functions to take the longitude and latitude coordinates of the International Space Station, convert them into an ST_Point, and return animal-focused charities within a defined radius.
In another instance, when asked to analyze a dataset that had to be fused with external sources, we had to create new UDFs to standardize the format of messy phone number data.
If given the opportunity to speak at ApacheCon, we will share how we moved toward our goal of creating a machine learning model, an effort that began with Apache Drill. We will discuss the problems that Apache Drill may solve for its users, the actions that users may take to achieve similar results, and the concepts and science behind the software.
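To give a rough idea of the kind of phone number standardization the abstract describes, here is a minimal sketch of the normalization logic alone. It is not the speakers' UDF: Drill UDFs are written in Java, whereas this is plain Python, and the function name `normalize_us_phone`, the US-only assumption, and the `+1XXXXXXXXXX` output format are all hypothetical choices made for illustration.

```python
import re
from typing import Optional


def normalize_us_phone(raw: str) -> Optional[str]:
    """Return a US phone number as +1XXXXXXXXXX, or None if it cannot be parsed.

    Hypothetical helper for illustration; assumes US numbers only.
    """
    # Drop everything that is not a digit (spaces, dots, dashes, parentheses).
    digits = re.sub(r"\D", "", raw)

    # Strip a leading country code of 1 when present.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]

    # Anything other than 10 remaining digits is treated as unparseable.
    if len(digits) != 10:
        return None

    return "+1" + digits
```

Differently formatted inputs such as `(410) 555-0123` and `1-410-555-0123` map to the same string, which is what allows records from separate sources to be joined on the phone number column.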

Misha Isran:
Misha is currently working at Datadistillr as an Associate Data Scientist. She has a background in business strategy and data analytics with a focus on Python, SQL, Relational Database Management Systems, modeling, and machine learning. She is currently pursuing a Master's in Data Science from UMBC and has an MBA from Johns Hopkins University.
Monty Rahman:
Monty is currently working at Datadistillr as an Associate Data Scientist. He has completed projects focusing on Chicago crime statistics in relation to geographical coordinates within the city, with a focus on Python, SQL, Relational Database Management Systems, modeling, and machine learning. He is pursuing a Master's in Data Science from the University of Maryland, Baltimore County.
