ApacheCon @Home - Streaming Track

Tuesday 16:15 UTC
No More Silos: Integrating Databases and Apache Kafka
Robin Moffatt

Companies new and old are all recognising the importance of a low-latency, scalable, fault-tolerant data backbone, in the form of the Apache Kafka streaming platform. With Kafka, developers can integrate multiple sources and systems, which enables low latency analytics, event-driven architectures and the population of multiple downstream systems. In this talk, we’ll look at one of the most common integration requirements - connecting databases to Kafka. We’ll consider the concept that all data is a stream of events, including that residing within a database. We’ll look at why we’d want to stream data from a database, including driving applications in Kafka from events upstream. We’ll discuss the different methods for connecting databases to Kafka, and the pros and cons of each. Techniques including Change-Data-Capture (CDC) and Kafka Connect will be covered.
Attendees of this talk will learn:
* That all data is event streams; databases are just a materialised view of a stream of events.
* The best ways to integrate databases with Kafka.
* Anti-patterns of which to be aware.
* The power of ksqlDB for transforming streams of data in Kafka.
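The idea that a database table is a materialised view of an event stream can be illustrated with a minimal Python sketch. It is not taken from the talk: the event shape (a key plus a value, with None marking a delete), the function name materialise and the sample rows are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical event shape: (primary key, new row value).
# A value of None marks a deletion of that key.
Event = Tuple[str, Optional[dict]]


def materialise(events: Iterable[Event]) -> Dict[str, dict]:
    """Replay change events in order into a table keyed by primary key."""
    table: Dict[str, dict] = {}
    for key, value in events:
        if value is None:
            table.pop(key, None)  # delete: drop the row if it exists
        else:
            table[key] = value    # insert or update: latest value wins
    return table


# Made-up sample stream: two inserts, one update, one delete.
events = [
    ("42", {"name": "Alice", "city": "Leeds"}),
    ("43", {"name": "Bob", "city": "York"}),
    ("42", {"name": "Alice", "city": "London"}),
    ("43", None),
]

current_state = materialise(events)
print(current_state)
```

Replaying the same events always rebuilds the same table, which is why the stream, rather than the table, can be treated as the source of truth.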

Robin is a Senior Developer Advocate at Confluent, the company founded by the original creators of Apache Kafka, as well as an Oracle ACE Director (Alumnus). He has been speaking at conferences since 2009 including QCon, Devoxx, Strata, Kafka Summit, and Øredev. You can find many of his talks online at http://rmoff.net/talks/, and his blog articles at http://cnfl.io/rmoff and http://rmoff.net/. Outside of work he enjoys drinking good beer and eating fried breakfasts, although generally not at the same time.

Tuesday 16:55 UTC
Achieve the event-driven Nirvana with Apache Druid
Abdelkrim Hadjidj

After two decades of transforming into data-driven organizations, companies are moving to the next level: building event-driven organizations. An event-driven organization achieves faster insights, a better customer experience and more agility. However, this transformation requires advanced skills to make sense of all the events in real time, which leaves business people on the sidelines. In this presentation, we will review the data architectures used by the most advanced event-driven organizations today. We will discuss the challenges they face in delivering the promised business value and why stream processing technologies like Apache Kafka and Apache Flink are not enough to achieve the streaming nirvana. Finally, we will explain how Apache Druid enables self-service BI on event data and allows business users to ask their own questions, leading to real-time insights.

Abdelkrim is a data expert with 12 years of experience in distributed systems (big data, IoT, peer-to-peer and cloud). He helps customers in EMEA use open source streaming technologies such as Apache Kafka, NiFi, Flink and Druid to pivot into event-driven organizations. Abdelkrim is currently working as a Senior Solution Engineer at Imply. Previously, he held several positions including Senior Streaming Specialist at Cloudera, Solution Engineer at Hortonworks, Big Data Lead at Atos and CTO at Artheamis. He has published several scientific papers in well-known IEEE and ACM journals. You can find him speaking at meetups and tech conferences worldwide, such as Dataworks Summit, Strata and Flink Forward. He founded and runs the Future Of Data Meetup in Paris, a group of 2300+ data and tech enthusiasts.

Tuesday 17:35 UTC
Incrementally Streaming RDBMS Data to Your DataLake Automagically
Timothy Spann

There is often data locked in transactional relational systems that you would like to ingest, transform, parse, aggregate, and store forever in Hadoop as wide tables. With the new features in Apache NiFi, Cloudera Schema Registry, HBase 2, Phoenix, Hive 3, Kudu, Spark 2, Kafka, Ranger, Atlas, Zeppelin and Hue, this becomes something you can do at scale without the heavy hand-processing of yore. Now, with the hybrid cloud, you may want to securely ingest to multiple clusters with new tools including Streams Replication Manager. They told me it's not exactly ETL or ELT; it is so much more. You now have full control over global data assets, with full management and smart dashboards, to allow a true enterprise open source solution for all your data. With materialized views and the ability to federate queries to JDBC and other data sources, your fully ACID Hive 3 tables allow you to escape the small-scale EDW and be reborn in unlimited-scale data worlds.
References:
https://community.cloudera.com/t5/Community-Articles/ETL-With-Lookups-with-Apache-HBase-and-Apache-NiFi/ta-p/248243
https://community.cloudera.com/t5/Community-Articles/Ingesting-RDBMS-Data-As-New-Tables-Arrive-Automagically-into/ta-p/246214
https://community.cloudera.com/t5/Community-Articles/Incrementally-Streaming-RDBMS-Data-to-Your-Hadoop-DataLake/ta-p/247927
https://community.cloudera.com/t5/Community-Articles/Ingesting-Golden-Gate-Records-From-Apache-Kafka-and/ta-p/247557
https://www.datainmotion.dev/2020/05/cloudera-flow-management-101-lets-build.html

Tim Spann is a Principal DataFlow Field Engineer at Cloudera, the Big Data Zone leader and blogger at DZone, and an experienced data engineer with 15 years of experience. He runs the Future of Data Princeton meetup as well as other events. He has spoken at Philly Open Source, ApacheCon in Montreal, Strata NYC, Oracle Code NYC, IoT Fusion in Philly, meetups in Princeton, NYC, Philly, Berlin and Prague, and DataWorks Summits in San Jose, Berlin and Sydney.

Tuesday 18:15 UTC
Introduction to Event Streams Development with Kafka Streams
Bill Bejeck

Developers today work with a lot of data. Much of this data is available in near real time, and it presents the opportunity for businesses and organizations to improve service and deliver more value to users of today's applications. But the question is, how to manage this incoming stream of records? Viewing the incoming data as event streams is one way to think about working with data. In recent years, Apache Kafka has become a de facto standard for ingesting record streams. To work with the incoming data, Apache Kafka provides a Producer and Consumer interface as the basic building blocks for sending records to and reading records from Kafka. When building a Kafka-based microservice, using the Producer and Consumer clients means handling all the details of communicating yourself. To enable building event-driven applications, Apache Kafka provides Kafka Streams. Kafka Streams is the native stream processing library for Apache Kafka. In this talk, we'll review Kafka and how it can function as a central nervous system for incoming data. From there, we'll cover how Kafka Producers and Consumers work and how developers can build a microservice using these building blocks. Finally, we'll transition our application to a Kafka Streams application and demonstrate how using Kafka Streams can simplify building a Kafka-based microservice. Attendees of this presentation will gain the knowledge needed to understand how Kafka Streams works and how they can get started using it to simplify the development of applications involving Apache Kafka. Additionally, developers in attendance who aren't familiar with Apache Kafka itself will gain an understanding of how it can help their business or organization make effective use of available incoming event streams.

Bill Bejeck works at Confluent as an integration architect on the Developer Relations team. Before that, Bill was a software engineer on the Kafka Streams team for three years. He has been a software engineer for over 17 years and has regularly contributed to Kafka Streams. Before Confluent, he worked on various ingest applications as a U.S. Government contractor using distributed software such as Apache Kafka, Apache Spark™, and Apache™ Hadoop®. Bill is a committer to Apache Kafka and has also written a book about Kafka Streams titled Kafka Streams in Action.

Tuesday 18:55 UTC
Change Data Capture with Flink SQL and Debezium
Marta Paes

Change Data Capture (CDC) has become the standard to capture and propagate committed changes from a database to downstream consumers, for example to keep multiple datastores in sync and avoid common pitfalls such as dual writes (remember? "Friends don't let friends do dual writes"). Consuming these changelogs with Apache Flink used to be a pain, but the latest release (Flink 1.11) introduced not only support for CDC, but support for CDC from the comfort of your SQL couch. In this talk, we'll demo how to use Flink SQL to easily process database changelog data generated with Debezium.
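To make the changelog idea concrete, here is a minimal sketch in plain Python (not Flink SQL, and not the demo from the talk) of how a downstream consumer might apply Debezium-style change events to keep a replica in sync. It assumes a simplified event payload with only the "op", "before" and "after" fields of the Debezium envelope; the "id" key column, the apply_change helper and the sample rows are hypothetical.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def apply_change(store: Dict[Any, Row], event: Dict[str, Any], key_field: str = "id") -> None:
    """Apply one simplified Debezium-style change event to an in-memory replica.

    Assumes the primary key lives in `key_field` (a hypothetical column name).
    """
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, or update: the new row state is in "after"
        row = event["after"]
        store[row[key_field]] = row
    elif op == "d":
        # delete: the last known row state is in "before"
        row = event["before"]
        store.pop(row[key_field], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Made-up changelog for an orders table.
changelog = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "NEW"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "NEW"}},
    {"op": "u", "before": {"id": 1, "status": "NEW"}, "after": {"id": 1, "status": "PAID"}},
    {"op": "d", "before": {"id": 2, "status": "NEW"}, "after": None},
]

replica: Dict[Any, Row] = {}
for event in changelog:
    apply_change(replica, event)
print(replica)
```

Because the replica is derived only from committed changes read from the source database's log, the application never has to write to two systems itself, which is how CDC sidesteps the dual-write problem.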

Marta is a Developer Advocate at Ververica (formerly data Artisans) and a contributor to Apache Flink. After finding her mojo in open source, she is committed to making sense of Data Engineering through the eyes of those using its by-products. Marta holds a Master’s in Biomedical Engineering, where she developed a particular taste for multi-dimensional data visualization, and previously worked as a Data Warehouse Engineer at Zalando and Accenture.

Tuesday 19:35 UTC
Real-Time Stock Processing With Apache NiFi, Apache Flink and Apache Kafka
Pierre Villard, Timothy Spann

We will ingest a variety of real-time feeds, including stocks, with NiFi, then filter, process and segment them into Kafka topics. Kafka data will be in Apache Avro format with schemas specified in Cloudera Schema Registry. Apache Flink, Kafka Connect and NiFi will do additional event processing along with machine learning and deep learning. We will store real-time feed data in Apache Kudu for real-time analytics and summaries. Apache OpenNLP, Apache MXNet, CoreNLP, NLTK and SpaCy will be used to analyse stock trend data in streams as well as stock prices and futures. As part of the stream processing we will also be classifying images and stock data with Apache MXNet and DJL. We will also produce cleaned and aggregated data to subscribers via Apache Kafka, Apache Flink SQL and Apache NiFi. We will push to applications, message listeners, web clients, Slack channels and email. To be useful in our enterprise, we will have full authorization, authentication, auditing, data encryption and data lineage via Apache Ranger, Apache Atlas and Apache NiFi. References: https://community.cloudera.com/t5/Community-Articles/Real-Time-Stock-Processing-With-Apache-NiFi-and-Apache-Kafka/ta-p/249221

Pierre Villard is currently a Senior Product Manager at Cloudera in charge of all the products around Apache NiFi and its subprojects, such as the NiFi Registry and MiNiFi agents. He has been active in the Apache NiFi project for the last 4.5 years and is a committer and PMC member of the project. Before joining Cloudera, Pierre worked at Google and Hortonworks, where he helped customers develop solutions on-premises and in the cloud using many technologies including Apache NiFi.
Tim Spann is a Principal DataFlow Field Engineer at Cloudera, the Big Data Zone leader and blogger at DZone, and an experienced data engineer with 15 years of experience. He runs the Future of Data Princeton meetup as well as other events. He has spoken at Philly Open Source, ApacheCon in Montreal, Strata NYC, Oracle Code NYC, IoT Fusion in Philly, meetups in Princeton, NYC, Philly, Berlin and Prague, and DataWorks Summits in San Jose, Berlin and Sydney.

Wednesday 16:15 UTC
Interactive Streaming Data Analytics via Flink on Zeppelin
Jeff Zhang

Flink is a powerful distributed streaming engine, but it requires a lot of programming skill. Even though Flink supports SQL, it is not easy for an analyst to use Flink directly for streaming data analytics. Fortunately, another Apache project, Zeppelin, integrates Flink and makes streaming data analytics much easier for data analysts without a programming skill set. In this talk, I will show how to use Flink on Zeppelin to do interactive streaming data analytics, and how to build real-time dashboards without writing any HTML/JS code.

Jeff has 11 years of experience in the big data industry. He is an open source veteran who has been using Hadoop since 2009, and is a PMC member of the Apache projects Tez, Livy and Zeppelin and a committer of Apache Pig. His experience covers not only big data infrastructure, but also how to leverage these big data tools to gain insight. He has spoken several times at big data conferences such as Hadoop Summit and Strata + Hadoop World. He now works at Alibaba Group as a staff engineer. Prior to that, he worked at Hortonworks, where he developed these popular big data tools.

Wednesday 16:55 UTC
Fast Samza SQL: Stream Processing Made Easy
Weiqing Yang, Aditya Toomula

Apache Samza is a distributed stream processing framework that allows users to process and analyze data in real-time. Fast Samza SQL (FSS) is a managed stream processing service, powering hundreds of Samza pipelines in production across LinkedIn. Use cases like stream repartitioning, change capture views, materialized views, data migration, and data caching are the popular ones hosted by FSS. Such stream processing pipelines are expressed declaratively, with Samza SQL being the predominant DSL that FSS offers. Due to its SQL-like syntax, rich authoring and testing environment, users can create and deploy their stream processing jobs in a self-serve fashion within a few minutes. FSS also enables creation of stream processing pipelines programmatically. Users just need to focus on their business logic while FSS takes care of the rest, such as dependency management, resource provisioning, auto-scaling, job monitoring, failure recovery, etc. In this talk, we will introduce the overall FSS architecture, highlight the unique value propositions that FSS brings to stream processing at LinkedIn and share the experiences and lessons we have learned.

Weiqing Yang
Weiqing has been working on big data computation frameworks since 2015 and is an Apache Spark/HBase/Hadoop/Samza contributor. She is currently a software engineer on the streaming infrastructure team at LinkedIn, working on Samza, Kafka, etc. Before that, she worked on the Spark team at Hortonworks. Weiqing obtained a Master's degree in Computational Data Science from Carnegie Mellon University. Weiqing enjoys speaking at conferences. She presented at Spark Summit 2017, HBaseCon 2017, and KubeCon + CloudNativeCon North America 2019.
Aditya Toomula
Aditya has been working at LinkedIn on the streams infrastructure team since 2016. He has contributed to Apache Samza and Brooklin, with his latest contributions going to Samza SQL and fully managed Samza. He is an Apache Samza committer and has over 15 years of software engineering experience. Earlier in his career, he worked in the storage domain at NetApp, building various kinds of replication products and file systems.

Wednesday 17:35 UTC
Fresh updates about the new Beam Spark Structured Streaming runner
Etienne Chauchot

Apache Beam provides a unified programming model to execute batch and streaming pipelines on all the popular big data engines. The translation layer from Beam to the chosen big data engine is called a runner. A little more than one year ago, a new Spark runner based on the Spark Structured Streaming framework was started, and it has since been merged to Beam master. This talk will give updates about this new runner, showing some added features, some performance improvements and also things that are yet to come.

Etienne has been working in software engineering and architecture for more than 16 years. He is focused on Big Data subjects. He is an Open Source fan and contributes to Apache projects such as Apache Beam, Apache Flink or Apache Spark. He is a Beam committer and PMC member.

Wednesday 18:15 UTC
Pravega: Storage for data streams
Flavio Junqueira

There is no shortage of use cases with elements that continuously generate data: end users posting updates and shopping online; sensors that periodically emit samples; drones that continuously produce aerial video streams; connected cars that generate a combination of videos, images, and telemetry; and server fleets that generate an abundance of telemetry data. One common aspect shared by several of these cases is that the sources of data are machines, and at scale, machines can generate data at extremely high volumes. This creates an important challenge for analytics systems: ingesting, storing and processing such high volumes of machine-generated data in an efficient and effective manner. Pravega is a software system developed from the ground up to enable applications to ingest and store high volumes of continuously generated data. Pravega exposes the stream as a core storage primitive, which enables applications continuously generating data to ingest and store such data permanently. Applications that consume stream data from Pravega are able to access the data through the same API, independent of whether they are tailing the stream, reprocessing the stream, or processing historical data. Pravega has some unique features, such as the ability to store an unbounded amount of data per stream while appending transactionally and scaling according to workload variations. It uses an underlying segment abstraction not only to implement such features, but also to implement advanced ones that support stream applications, such as state synchronization and key-value tables. In this presentation, we give an overview of Pravega, including its main features and architecture. We show how to use Pravega when building streaming data pipelines along with stream processors such as Apache Flink. We have implemented Pravega connectors for Flink that enable end-to-end exactly-once semantics for data pipelines using Pravega checkpoints and transactions.
Pravega is an open-source project, licensed under the Apache License Version 2.0, and hosted on GitHub (https://github.com/pravega/pravega).

Flavio Junqueira is a Senior Distinguished Engineer at Dell. He holds a PhD in computer science from the University of California, San Diego, and he is interested in various aspects of distributed systems, including distributed algorithms, concurrency, and scalability. His recent work at Dell focuses on stream analytics, and specifically, on the development of a novel storage system for streams called Pravega. Before Dell, Flavio held an engineering position with Confluent and research positions with Yahoo! Research and Microsoft Research. Flavio has co-authored a number of scientific publications (over 4,000 citations according to Google Scholar) and an O’Reilly book on Apache ZooKeeper. Flavio is an Apache Member and has contributed to projects hosted by the ASF, including Apache ZooKeeper (as PMC member and committer), Apache BookKeeper (as PMC member and committer), and Apache Kafka.

Wednesday 18:55 UTC
Story of moving our 4Trillion Event Log Pipeline from Batch to Streaming
Lohit VijayaRenu, Zhenzhao Wang, Praveen Killamsetti

Twitter's LogPipeline handles more than 4 trillion events per day. This complex pipeline has evolved over the years to support Twitter's scale of data. It is designed to be resilient, support high throughput and use resources efficiently. Because of its legacy architecture, it was still a batch pipeline at scale. For some time, our team has been redesigning it to support streaming use cases and has made significant architectural changes to this pipeline. In this talk, we take a deep dive into our old architecture, highlight its pros and cons, and describe how we are changing it to be more streaming friendly. We talk about various open source projects such as Apache Hadoop, Apache Flume, Apache Tez and Apache Beam, and the cloud technologies which tie together to form our large-scale event LogPipeline.

Lohit VijayaRenu
Lohit is part of the DataPlatform team at Twitter. He concentrates on projects around storage, compute and the log pipeline at Twitter scale, both on premises and in the cloud. He worked at several startups before joining Twitter. He has a Master's degree in Computer Science from Stony Brook University.
Zhenzhao Wang
Zhenzhao works at Twitter as part of the Hadoop and Log Management team. He is currently concentrating on the Twitter Log Ingestion Pipeline, which scales to handle trillions of events per day. Previously he was a member of the DFS (Pangu) team at Alibaba Cloud, where he focused on random file access features in Pangu, which is used as storage for virtual machines. He has a Bachelor's degree from Nankai University and a Master's degree from Tsinghua University.
Praveen Killamsetti
Praveen works at Twitter as part of the DataPlatform organization. In his current role, he is working on scaling the log ingestion pipeline to trillions of events in the streaming model and building a data set lifecycle management system for analytical data sets. He has a Master's degree in Computer Science from IIT Madras. Before joining Twitter, Praveen worked on building distributed storage systems at Nimble Storage and NetApp, and built various products including Synchronous Replication across multiple data centers with automatic failover, Write Optimized KV stores, a Dedupe and Compression stack, Efficient Cloning features, and efficient archiving of Storage Snapshots to S3.

Thursday 16:15 UTC
Google Cloud Pub/Sub vs Apache Kafka for streaming solution at scale
Prateek Srivastava

This talk evaluates various technologies to support a high-speed, highly scalable REST API that ingests a high volume of analytics payloads from user browsers distributed across the globe. Furthermore, we will discuss tech stack choices, performance benchmarks, costing and best practices for implementing such big data streaming solutions in Google Cloud.

Prateek Srivastava is a Technical Architect at Sigmoid, which helps organisations realize the power of open source to manage big data and leverage AI/ML tech to derive actionable insights. He has more than 13 years of experience in big data, cloud and service-oriented architecture and has helped build and sustain several end-to-end data infrastructures for customers around the world.

Thursday 16:55 UTC
Building your First Connector for Kafka Connect
Ricardo Ferreira

Apache Kafka is rapidly becoming the de facto standard for distributed streaming architectures, and as its adoption grows, the need to leverage existing data also grows. When developers need to handle technologies that happen not to have a connector available, they have no choice other than to write their own. But that can be quite challenging, even for experienced developers. This talk will explain in detail what it takes to develop a connector, how the Kafka Connect framework works, and what the common pitfalls are that you should avoid. The code of an existing connector will be used to explain what the implementation should look like, so you can develop more confidence when building your own.

Ricardo is a Developer Advocate at Confluent, the company founded by the original co-creators of Apache Kafka. He has over 20 years of experience where he specializes in streaming data architectures, big data, cloud, and serverless. Prior to Confluent, he worked for other vendors, such as Oracle, Red Hat, and IONA Technologies, as well as several consulting firms. When not working, he enjoys grilling steaks in his backyard with his family and friends, where he gets the chance to talk about anything that is not IT related. Currently, he lives in Raleigh, North Carolina, with his wife and son. Follow Ricardo on Twitter: @riferrei

Thursday 17:35 UTC
Understanding Data Streaming and Analytics with Apache Kafka
Ricardo Ferreira

The use of distributed streaming platforms is becoming increasingly popular among developers, but have you ever wondered what exactly this is? Part Pub/Sub messaging system, part distributed storage, part event processing engine, this type of technology brings a whole new perspective on how developers capture, store, and process events. This talk will explain what distributed streaming platforms are and how they can be a game changer for modern data architectures. It will discuss the road in IT that led to the need for this type of platform, the current state of Apache Kafka, as well as scenarios where this technology can be implemented.

Thursday 18:55 UTC
Event Streaming and the Data Communication Layer
Adam Bellemare

Streaming technologies unlock decoupled, near real-time services at scale. The most important part of any streaming platform is the event broker (e.g. Apache Kafka or Pulsar), as it plays the role of the Data Communication Layer (DCL). Many organizations fail to grasp the importance of the DCL and often relegate it to the role of a simple asynchronous message queue, leaving their key business domain events locked away in monolithic data stores. This is one of the many pitfalls that will be covered in this presentation, along with strategies and tips for avoiding them. A well-constructed DCL decouples both the ownership and production of data from the downstream services that require access to it. Access to clean, reliable, structured, and sorted data streams enables extremely powerful event-driven patterns. Data becomes much easier to access and no longer relies upon the producer's implementation to serve disparate business requirements. Teams and products can organize much more clearly along business bounded contexts, and modular, disposable, and compositional services become extremely easy to build and test. This presentation covers the best practices, responsibilities of the various actors, recommendations about specific technological implementations, and both the organizational changes required and those that will occur as a result of a reliable DCL implementation.

Adam Bellemare is the author of Building Event-Driven Microservices (O'Reilly, 2020). He has been working on event-driven architectures since 2010. His major accomplishments in this time include building an event-driven processing platform at BlackBerry, driving the migration to event-driven microservices at Flipp, and most recently, starting a new role to improve event-driven architectures at Shopify. He has contributed to both Apache Avro and Apache Kafka and is a keen supporter of the open-source community.
