To see a more detailed schedule, register for the event and then visit the event website
October 3-4, Rhythms I
|Session Title||Session Description||Presenter(s)|
|Scaling LinkedIn's Hadoop YARN cluster beyond 10,000 nodes||At LinkedIn, we use Hadoop as our backbone for big data analytics and machine learning. With an exponentially growing data volume, and the company heavily investing in machine learning and data science, we have been doubling our cluster size year over year to match the compute workload growth. Our largest cluster now has 10,000+ nodes, one of the largest (if not the largest) Hadoop clusters on the planet. Scaling Hadoop YARN has emerged as one of the most challenging tasks for our infrastructure over the years. In this session, we will first discuss the YARN cluster slowdowns we observed as we approached 10,000 nodes and the fixes we developed for these slowdowns. Then, we will share the ways we proactively monitored for future performance degradations, including a now open-sourced tool we wrote called DynoYARN, which reliably forecasts performance for YARN clusters of arbitrary size. Finally, we will describe Robin, an in-house service which enables us to horizontally scale our clusters beyond 10,000 nodes.||Keqiu Hu, Jonathan Hung, and Sriram Rao|
|Big Data Security in Apache Projects||Apache analytic frameworks often process sensitive information stored in their datasets. To safeguard the privacy and integrity of personal or business-confidential details, this information must be protected. Apache Parquet, a popular storage format leveraged in most Apache big data frameworks, recently added built-in support for encryption and integrity verification of the stored data. With the goal of enabling standards-based and interoperable data security support across the Apache big data ecosystem, Parquet's data protection mechanism is already available for use in Apache Spark, Arrow and Flink. Work is under way to also integrate this mechanism in the Apache Iceberg and Trino projects.|
This talk will describe the data protection tools in Apache Parquet, and demonstrate how users can leverage them to protect their data in the Apache projects that already support data security. It will also outline the ongoing work on new features, and integration of the security tools in Apache Iceberg and Trino. The talk will conclude with a discussion on how other Apache projects can use these new capabilities to protect sensitive data of their users.
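To make the mechanism more concrete: in engines such as Spark, Parquet's modular encryption is typically driven by a handful of Hadoop configuration properties passed down to the Parquet writer. The sketch below lists those properties as a plain Python dict; the property names follow parquet-mr's PropertiesDrivenCryptoFactory, while the KMS client class, key IDs, and column names are hypothetical examples and exact names may vary by version.

```python
# Parquet modular encryption is configured via Hadoop properties that the
# engine (e.g. Spark) forwards to the Parquet writer. Property names follow
# parquet-mr's PropertiesDrivenCryptoFactory; the KMS client class, key IDs,
# and column names below are illustrative, not real values.
encryption_options = {
    "parquet.crypto.factory.class":
        "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory",
    # A KmsClient implementation that talks to your key management service.
    "parquet.encryption.kms.client.class": "com.example.MyKmsClient",
    # Master-key IDs, each followed by the columns that key protects.
    "parquet.encryption.column.keys": "keyA:ssn,credit_card;keyB:address",
    # Key protecting the file footer (and thus the file metadata).
    "parquet.encryption.footer.key": "keyF",
}
```

Readers without a footer key cannot open the file at all; readers lacking only a column key can still decrypt the columns they are authorized for.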
|Reduce Your Storage Footprint with Apache Ozone Erasure Coding||Apache Ozone is a highly scalable distributed object storage system. Distributed storage systems typically use replication to provide high reliability, and Ozone supports this replication model as well. However, replication is expensive in terms of storage space and other resources (e.g., network bandwidth). Erasure Coding (EC) is a proven technique for reducing storage space and throughput requirements.|
Apache Ozone has implemented EC support. With EC in place, Apache Ozone can reduce storage cost by ~50% compared to traditional 3-way replication while providing the same level of reliability. In Apache Ozone, the replication unit is a Container, which is simply a logical batch of data blocks. EC uses the same Container abstraction, but uses d data Containers and p parity Containers (d>p) and places them on distinct storage nodes. The actual data block chunks are stored into the d data Container blocks in order, and the encoded parity chunks are stored into the p parity Container blocks. In this talk we deep dive into the detailed EC architecture, covering the data layout and decoding as well. We will also discuss some of the design challenges we faced in achieving consistency with the Container abstraction and how we solved them.
|Uma Maheswara Rao Gangumalla & Stephen O'Donnell|
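As a back-of-the-envelope illustration of the ~50% saving mentioned in the abstract (the d=6, p=3 layout here is just an example parameter choice, not necessarily the layout the talk uses):

```python
def overhead_replication(replicas: int) -> float:
    """Raw bytes stored per byte of user data under n-way replication."""
    return float(replicas)

def overhead_ec(d: int, p: int) -> float:
    """Raw bytes stored per byte of user data with d data and p parity chunks."""
    return (d + p) / d

rep = overhead_replication(3)   # 3-way replication stores 3.0 bytes per user byte
ec = overhead_ec(6, 3)          # an example EC layout stores only 1.5
savings = 1 - ec / rep          # fraction of raw storage saved by switching
print(f"replication {rep}x, EC {ec}x, savings {savings:.0%}")
```

Both schemes here tolerate the loss of any 3 nodes, which is why EC can match replication's reliability at half the raw footprint.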
|Apache Arrow Flight SQL: high performance, simplicity, and interoperability for data transfers||Network protocols for transferring data generally have one of two problems: they’re slow for large data transfers but have simple APIs (e.g. JDBC) or they’re fast for large data transfers but have complex APIs specific to the system. Apache Arrow Flight addresses the former by providing high performance data transfers and half of the latter by having a standard API independent of systems. However, while the Arrow Flight API is performant and an open standard, it can be more complex when using it for SQL than APIs like JDBC.|
Arrow Flight SQL rounds out the solution, providing both great performance and a simple universal API for SQL.
In this talk, we’ll show the performance benefits of Arrow Flight, the client difference between interacting with Arrow Flight and Arrow Flight SQL, and an overview of a universal JDBC driver built on Arrow Flight SQL, enabling clients to take advantage of this increased performance with zero application changes.
|Lessons Learned Running Apache YuniKorn at Scale||Apache YuniKorn is one of the most advanced resource schedulers on Kubernetes that specializes in batch workloads. Companies that have large-scale batch data processing use cases (e.g. Apache Spark) are increasingly looking to migrate their workloads to Kubernetes in the cloud in order to leverage improved cost effectiveness, scalability and elasticity. Apache YuniKorn fills an important gap in the cloud by providing comprehensive resource scheduling capabilities. But it is no easy feat to run any component including YuniKorn in production. Lots of challenges need to be overcome to ensure it runs smoothly at all times. In this talk, we will first provide a high-level overview of the indispensable role that YuniKorn plays in a batch-oriented data platform. Then we will share lessons learned deploying Apache YuniKorn at a massive scale across many Kubernetes clusters in the cloud. Specifically, we will cover topics such as creating monitors and alerts to minimize production issues, performing upgrades and maintenance in a live production environment, and common performance issues and solutions.||Bowen Li and Chaoran Yu|
|Large scale migration to Parquet in Uber||Parquet is the core file format in Uber's big data stack. It is a prerequisite for many key initiatives, such as column-level encryption and column pruning, in Uber's data org.|
However, there are nearly 20,000 existing Hive tables still using other formats. It is inefficient and error-prone to migrate them using the traditional SELECT-INSERT method.
We tackled the challenge with a 3-pillared solution:
- High throughput rewriter
- Mixed format partition support in Hive and Spark query engines
- Reliable ETL pipeline conversion
The solution can serve as a reference for anyone who needs to migrate massive amounts of Hive tables and data files to the Parquet format.
|Huicheng Song & Xinli Shang|
|From Column-Level to Cell-Level: Toward Finer-Grained Access Control with Apache Parquet||New challenges are emerging in protecting the privacy of sensitive information in the big data industry. As a company's business grows, its data may come from different countries with diverse privacy requirements. Analytic tables containing that data need to apply different privacy policies not only at the column level but also to specific values within a row (called 'cells'). While different columns use different privacy policies, the data inside a single column can also have distinct privacy requirements; for instance, values may come from various countries with separate access control and retention policies that apply to the same column.|
As traditional table-level or even column-level access control cannot meet those requirements, we introduced cell-level access control with encryption in Parquet. Parquet is the industry-leading standard for the formatting, storage, and efficient processing of big data. We worked with the Apache Parquet community to implement this feature in Parquet as a solution for finer-grained access control. With this solution, users can not only encrypt private data for a given column but also apply filters based on row values to selectively encrypt individual cells. Once the cells are encrypted, we can use the key, through a KMS (Key Management System), for access control and deletion (crypto-shredding). We will talk about the details of our solution with use cases, challenges, solutions, performance, and learnings.
|Xinli Shang and Mohammad Islam|
|Apache Ozone - State of the Union||Apache Ozone is a highly scalable distributed storage system that supports 10+ billion objects, and thousands of dense storage nodes. Ozone supports the Hadoop Compatible FileSystem interface as well as the S3 protocol natively. Apache Ozone's growing community of contributors and users has allowed us to add many highly anticipated features in the last year. This talk will cover Ozone’s recent developments, including support for erasure coding, data balancing, atomic directory operations, multi-tenancy under Ozone's S3 interface, and an improved data layout for scalability and performance. We will also cover the roadmap for the future of Ozone development.||Ethan Rose and Siyao Meng|
|Elastic Managed Spark at Apple||At Apple, data scientists and engineers are running enormous Spark workloads to deliver amazing cloud services. Apple Cloud Services supports an increasing scale of Spark workloads and resource requirements with a great user experience. From code to deployment management, there is one interface for all compute backends.|
In this talk, we walk through the lessons learned and pitfalls encountered in supporting the service at Apple scale. The talk covers how to effectively orchestrate Spark applications, and best practices for tuning Kubernetes, the Spark ControlPlane, and Spark applications for scaling up. We will also discuss seamless switchover among different resource managers, resource utilization, monitoring, and more.
|Aaruna Godthi and Zhou JIANG|
|Fast, Consistent, and Pre-aggregated Data Ingestion on Apache Pinot||Apache Pinot is a distributed columnar storage engine that can ingest data in real-time and serve analytical queries at low latency. At LinkedIn, we leverage Pinot as the de-facto solution for high-speed analytical queries on both offline and real-time data. While operating the large production Pinot cluster, we have encountered a lot of challenges regarding data ingestion. In this talk, we will deep-dive into how we greatly improved the data ingestion performance from 10+ hours to sub-hour for our production Pinot installation. Moreover, we will also discuss the following features contributed by LinkedIn to improve the data ingestion experience:|
- Consistent Data Push to allow transaction-like offline data ingestion.
- Time-based Aggregation and Rollups to improve query performance and storage cost.
|Seunghyun Lee and Jiapeng Tao|
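To make the rollup idea concrete, here is a minimal language-neutral sketch (illustrative data and bucket size, not Pinot's actual implementation): pre-aggregating raw events into coarser time buckets collapses many rows into one per (bucket, dimension) pair, which reduces both storage and query-time work.

```python
from collections import defaultdict

def rollup(events, bucket_ms):
    """Pre-aggregate (timestamp_ms, dimension, value) events into coarser
    time buckets, summing the metric per (bucket, dimension) pair."""
    agg = defaultdict(float)
    for ts, dim, value in events:
        bucket = ts - ts % bucket_ms  # truncate timestamp to the bucket start
        agg[(bucket, dim)] += value
    return dict(agg)

events = [
    (1_000, "US", 1.0),
    (1_500, "US", 2.0),   # lands in the same 60s bucket as the first event
    (61_000, "US", 4.0),  # next bucket
    (1_200, "DE", 8.0),
]
rolled = rollup(events, bucket_ms=60_000)
# Four raw events collapse into three stored rows.
```

Queries that group by the coarser time granularity read the rolled-up rows directly instead of re-aggregating raw events.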
|Hadoop Vectored IO: your data just got faster!||Since 2006 the world of big data has moved from terabytes to hundreds of petabytes, from local clusters to remote cloud storage, yet the original Apache Hadoop POSIX-based file APIs have barely changed.|
It is wonderful that these APIs have worked so well, but we can do a lot better with remote object stores, by providing new operations which suit them better, targeted at columnar data libraries such as ORC and Spark. Only a few libraries need to migrate to these APIs for significant speedups of all big data applications.
This talk introduces a new Hadoop filesystem API called "vectored read", coming in Hadoop 3.4. An extension of the classic FSDataInputStream, it is automatically offered by all filesystem clients.
The S3A connector is the first object store connector to provide a custom implementation, reading different blocks of data in parallel. In Apache Hive benchmarks with a modified ORC library, we saw a 2x speedup compared to using the classic S3A connector through the POSIX APIs.
We will introduce the API spec, the S3A implementation and the benchmarks, and show how to use it in your own applications. We will also cover our ongoing work on providing similar speedups with other object stores, and use of the API in other applications.
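One optimization a vectored-read implementation can apply, once it receives all requested ranges up front, is merging nearby ranges so they can be fetched with fewer remote requests. The following is a language-neutral sketch of that idea with illustrative offsets and gap threshold, not the actual Hadoop or S3A code:

```python
def coalesce_ranges(ranges, max_gap):
    """Merge nearby (offset, length) read ranges so they can be fetched
    with fewer remote GET requests. Ranges whose gap to the previous
    merged range is at most max_gap bytes are combined."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= max_gap:
            prev_off, prev_len = merged[-1]
            end = max(prev_off + prev_len, offset + length)
            merged[-1] = (prev_off, end - prev_off)  # extend the previous range
        else:
            merged.append((offset, length))
    return merged

# Three column-chunk reads; the first two are close enough to combine.
reads = [(0, 100), (150, 100), (10_000, 100)]
print(coalesce_ranges(reads, max_gap=1_000))
```

Reading a small amount of extra data in the gap is usually cheaper than paying the latency of an additional request, which is why this pays off against high-latency object stores.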
|Data Governance with Apache Atlas - An alternative User Interface||Apache Atlas provides data governance functionality and is part of the Hadoop ecosystem; however, it is not limited to it. The underlying data model is very generic and can be extended. This makes Apache Atlas very flexible, but it also has consequences for the usability of the user interface.|
In this talk, I present an extension of the data model and present an alternative open source user interface, which can be used more intuitively by non-technical users, especially business users.
In more detail, this presentation motivates and explains the underlying extension of the data model. Further, it motivates which derived information business users require for better usability. Next, the open source backend functionality is explained and the underlying technologies are motivated. Finally, a short tutorial shows how to set up your own system with a related open source Helm chart and how to get started.
The motivation for this talk is to promote this open source project and find support and interest in the community.
Some related numbers:
- we are adding 6 additional base types
- we have a further extension of these 6 types for Elastic, Kafka, and Kubernetes
- the frontend is in Angular
- the backend uses Apache Atlas, HBase, Apache Kafka, Apache Flink, Keycloak, Apache Httpd, Elasticsearch, and Elastic Enterprise Search
October 5-6, Rhythms I
|Session Title||Session Description||Presenter(s)|
|Developing Cassandra Applications with Accord||Are you in this camp? “I can’t use Cassandra because it doesn’t have transactions.” Have you heard about CEP-15, featuring a new consensus protocol called Accord? This proposal is to add fully ACID-compliant global transactions to Apache Cassandra. Well, that’s going to change everything! Let's talk about how it will work and how it will change the way you use Cassandra.|
- A user view of how Accord enables global transactions
- Changes in CQL Syntax
- Examples of how it could be used in your application
|Cassandra at Bloomberg: A Coming-of-Age Story||Bloomberg provides a wide array of analytics to help our customers understand the global financial markets. Since 2017, Apache Cassandra has grown in its role of helping many different financial analytics (which we call 'functions') running on the Bloomberg Terminal to access large archives of data with low latency billions of times per day. In this presentation, we will talk about some of the scaling obstacles we've overcome along the way.||David Paulk, Krishna Vadali, and Lindsey Zurovchak|
|Guerrilla Tactics for Scalable E-commerce services with Apache Cassandra and Apache Pulsar||Many of the world’s top e-tailers rely on distributed tech like Apache Cassandra and Apache Pulsar. Properly leveraging those technologies to drive business value is often the difference between sales and abandoned carts. In this talk, we'll discuss implementation techniques, strategies, and other considerations on building/improving an E-commerce backend with Java, Spring Boot, Cassandra, and Pulsar.||Aaron Ploetz|
|Improving performance with byte order and tries||Byte order, the property of being comparable using lexicographic comparison on the bytes of a key, and tries, structures implementing ordered maps using byte-labelled transitions, have been in use in DataStax Enterprise since version 6 and are currently being contributed to mainline Cassandra in CEP-19, CEP-7 and others.|
This talk describes some key ideas we used to translate typed keys into byte ordered sequences, to use them to form compact tries and efficiently process them, and to apply these to Cassandra's memtables and sstable indexes. We will also show some of the performance difference this can make.
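As one classic example of the kind of translation the talk describes (a general technique, not necessarily Cassandra's exact encoding): a signed 64-bit integer can be made byte-comparable by flipping its sign bit and writing it big-endian, so that lexicographic comparison of the bytes agrees with numeric comparison.

```python
import struct

def int64_to_byte_ordered(v: int) -> bytes:
    """Encode a signed 64-bit int so that lexicographic comparison of the
    resulting bytes matches numeric comparison: add 2^63 (equivalent to
    flipping the sign bit) and emit the value big-endian."""
    return struct.pack(">Q", (v + (1 << 63)) % (1 << 64))

vals = [-5, -1, 0, 1, 42]
encoded = [int64_to_byte_ordered(v) for v in vals]
assert encoded == sorted(encoded)  # byte order agrees with numeric order
```

Once every key type has such an encoding, keys of any type can be stored and compared in a trie using plain byte-labelled transitions.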
|Improving Bad Partition Handling in Apache Cassandra||Reading and compacting bad partitions has long been known to impact Cassandra performance, and bad partitions have been the root cause of various production issues at Netflix. While there are several potential solutions for addressing them at an implementation level, we must also deal with them today when they arise.|
There are several forms of bad partitions: a) a partition that grows to several GB or more in size; b) a partition with many (millions or more) small rows, potentially spread across many sstables; c) a partition with many small rows, many of which have been deleted or expired; d) a partition with rows that are themselves very large (e.g., blobs of binary or text).
In this talk, we present the approaches we use at Netflix to handle bad partitions when they arise. Specifically, we present how we identify, block, and mitigate them during production incidents. We will also share our on-going efforts on improving some of the existing tools as well as new tools for the Cassandra community. Additionally, we will present examples from real production incidents.
|Cheng Wang and Jordan West|
|Toward a more Modular Cassandra||Cassandra has grown tremendously in the 14 years since its first release, both in features and in codebase complexity. Recent work on a pluggable memtable implementation has inspired the author to step back and consider how we can begin to separate other concerns within the codebase, with the goal of making it easier and faster to test, experiment, and onboard new developers. This talk will discuss the various responsibilities of a Cassandra node, how they’re currently implemented, and introduce proposals for interfaces that can help us reduce entanglement between those implementations.||Derek Chen-Becker|
|Apache Cassandra at Apple||Apache Cassandra is one of the most popular open source distributed databases. The stability of Cassandra 4.0 has enabled rapid adoption within the user community – including Apple – where it powers some of the company's most critical services.|
This presentation will cover two primary topics:
– Apple's approach to designing services that leverage Cassandra's transactional, strongly-consistent, and eventually-consistent semantics to meet different goals.
– Efforts underway in CEP-15 toward enabling Apache Cassandra to become the only open source database in our industry capable of serving petabytes of data, millions of queries per second, and executing leaderless transactions across multiple regions.
|Reducing client latency and timeouts when running Cassandra on Public Cloud||Running Cassandra in a public cloud has been popular and provides good performance. However, unlike in an on-premises data center, cloud providers often perform maintenance in the background, be it mitigating (hardware) problems, upgrading components or host operating systems, or rebalancing their fleet. During those times a virtual machine might be paused, the network or disk unavailable, or something else altogether. Though Cassandra will eventually remove the node from the ring, during this time clients will experience increased latency when connected to this node or when trying to reach data from the node's replicas - often leading to database transactions failing and needing to be retried.|
This talk will explore metrics that help detect incidents faster and how to turn them into actionable automation. Using the example of Microsoft Azure's Scheduled Events, it will show how these maintenance announcements can be used to mitigate latency events. It will also discuss best practices on how to configure clients to take advantage of this and reduce latency even further.
|Path towards Secure-by-default Cassandra||In a traditional authentication system, the user is verified using a username and password. Managing passwords is not a very secure form of authentication. Also, during Cassandra authentication setup it is recommended that the node not be connected to any client; to initiate this process we need to either remove a node from the client config or add a new temporary node to the cluster. Similarly, for authorization there is no way to turn on role-based authorization transparently.|
At Netflix, we want to align all applications to be using a centralized paved path Authentication and Authorization (authn/authz) mechanism, and be able to transition the existing clusters to secure mode in a transparent and efficient way. This is not easy to manage using Cassandra's traditional way of authn/authz, which motivated us to develop a new authn/authz system for Cassandra. Given the authentication and authorization are pluggable in Cassandra, it became easy for us to associate the new system with the existing and new Cassandra clusters at Netflix.
In this talk we present how we built the new password-independent authn/authz system, which allows us to enable these two features on any cluster without any downtime, and demo how we use a centralized portal for creating and dynamically updating access policies in an easy and transparent way. Using remote cqlsh, we will also show how policy changes impact access.
|The Problem with Gossip: Importance of Transactional Metadata||Apache Cassandra has always been known for resiliency, and gossip was touted as one of the features that facilitates this resiliency: a metadata delivery protocol that can continue functioning despite network partitions and processes that do not overlap in time. Unfortunately, the lack of delivery guarantees and divergence within the cluster turned out to be a problem even for an eventually consistent database.|
This talk will discuss the details of the Cassandra Enhancement Proposal to make cluster metadata transactional: how it may help resolve many scalability issues, such as slow schema propagation and concurrent cluster expansions; how it opens horizons to new features, such as flexible range ownerships and planned expansions and shrinks; and, finally, how it opens up opportunities to safely execute transactions during cluster changes.
This talk is going to cover why’s and how’s: explain how we got here, why we needed to make this change, reveal the implementation details, and hint at some of the things that are going to become possible once transactional metadata is adopted.
|Storage considerations when running Apache Cassandra on Kubernetes with K8ssandra||There are a broad range of considerations when you are looking to deploy any database into Kubernetes.|
With Apache Cassandra, storage choices have always been one of the key decisions to make to ensure your Cassandra deployment is a success. When running Cassandra on Kubernetes, storage is managed completely differently than on traditional VMs and bare metal. Furthermore, managed Kubernetes solutions can often fall short of the minimum storage requirements and underperform significantly.
In this talk, we will briefly introduce you to K8ssandra, compare and contrast available storage solutions and the pros and cons of each, and discuss how to better tune storage to get the most out of your Cassandra cluster.
|Hayato Shimizu and Johnny Miller|
|Building APIs on Cassandra||Have you ever created an abstraction layer over Cassandra or another database? In my time as a Cassandra advocate, I observed many of these layers and the need for a common solution that could help save teams the effort of building and maintaining their own abstractions. This search led me to the Stargate project, an open source data gateway for building developer friendly APIs on top of Cassandra.|
In this talk, we’ll look at the reasons why teams create API abstractions and several common patterns and pitfalls. We’ll look at how Stargate supports a pluggable architecture for supporting multiple versions of Cassandra and different authentication APIs, and how the Stargate v2 architecture enables you to extend Cassandra with your own APIs.
October 3-5, Rex
|Session Title||Session Description||Presenter(s)|
|Community @ Apache Cassandra||Take a dive into what goes on behind the scenes to keep an OSS community healthy, ways to stay connected to users and their changing needs, and enabling a rich surrounding ecosystem.|
Folks like to bang on about the Apache Way, too often waxing lyrical. The real world and human behaviour can be a bit complicated and painful for these ideals and generalisations.
Some of the questions this session will throw up are: volunteering and paying for social media ops, marketing, conferences, how best to use the private ML, how to re-energize a ten-year-old tech, what happens when most of your contributors are forks, and how multiple companies with different motivations avoid leaking tensions to and out via the contributors they employ.
Expect the answer to many of these not to be …it happens on the ML.
|Mick Semb Wever|
|Hey maintainer (and user), exercise your empathy!||This talk is a walk through a number of ways maintainers of open-source projects (for example Airflow) can improve the communication with their users by exercising empathy.|
This subject is often overlooked in the curriculum of the average developer and contributor, but it is one that can make or break the product you develop, simply because good communication makes it more approachable for users. Maintainers often forget or simply do not realize how many assumptions they carry in their heads.
There are a number of techniques maintainers can use to improve this. This talk will walk through a number of examples (from Airflow and other projects), along with the reasoning and ways communication between maintainers and users can be improved - in the code, documentation, and communication, but also by involving and engaging the users they are communicating with; more often than not, the users can be of great help when it comes to communicating with them - if only asked.
This talk is for both maintainers and users, as I consider communication between users and maintainers a two-way street.
|Collaborate in Google Summer of Code and beyond, the Apache Way||In this talk, we will discuss the participation of the Apache Software Foundation in the Google Summer of Code (GSoC) program and other open-source awareness/internship programs, with a focus on welcoming new contributors and students around the globe. These programs provide an open platform for contributing, creating new contacts, and enabling mentors from top professionals to guide open source newcomers along the Apache Way to blossom into lifelong contributors to open source.|
We will discuss the GSoC program and its dynamics, define the roles and responsibilities of participants, mentors, and organization admins, and cover the complete roadmap from contributor to committer.
We will also talk about how the ASF's participation in the GSoC program increases each year, introducing more projects/PMCs to attract more contributors.
In 2022, 38 applications were selected for participation in GSoC across several projects, such as ShenYu, Dubbo, APISIX, SkyWalking, ShardingSphere, Airavata, CloudStack, Nemo, IoTDB, EventMesh, DolphinScheduler, Beam, and Fineract, compared to 2021, when 28 applications were selected and completed.
In the end, I will share my personal experience of implementing the Apache Way at the Fineract project, from being a student contributor (2017) to GSoC mentor (2019, 2020) to GSoC org admin (2021, 2022) for the ASF.
I will also share my journey on being selected as a committer for project Fineract.
|Non-code Contributions to Open Source||Contributing to open source is more than contributing code.|
Often, these non-code contributions are even more valuable than code contributions.
In this talk, Navendu walks through his experience and discusses different, impactful ways to contribute to open source other than code.
He talks about contributions in the areas of writing, designing, testing, mentoring, and community managing.
|Tips to prevent and survive burnout: I went through burnout, so you don’t have to||The work we do in technology and open source communities can put us into various high-stress situations. But you don’t want to put yourself in a situation where you overextend yourself and approach burnout. In this session, you’ll hear personal stories and experiences with burnout and learn ways to prevent and manage stress.|
This session will help you:
* Understand the signs of burnout
* Learn techniques on how to prevent burnout
* Recognize signs of burnout for remote workers and distributed teams
* Discover ways to effectively manage stress in open source communities
|Understand how visitors use your documentation with Matomo web analytics||Documentation is one of the most important factors that determines the success of your software. Sometimes documentation is considered an afterthought; others spend a lot of time on designing and writing documentation and websites. Documentation should be written to achieve certain goals, but are you actually measuring them?|
While working on the Apache Flink project and documentation, I wanted to answer questions like what parts of the documentation are important to visitors, which features are most frequently read up on, and where visitors get lost in the docs. By answering those questions, we can better understand how users use the software, the website, the documentation and where we should focus improvements next.
In this talk:
- I will explain how I’ve worked with the Apache Privacy committee and Infrastructure to set up a privacy-first web analytics tool, Matomo.
- I will show you how you can analyze your visitor behavior
- I will demonstrate how you can use these insights to improve your documentation.
|Apache Local Community (ALC): Present & Beyond||Apache Local Community (ALC) is an initiative by the Apache Community Development project.|
ALC comprises local groups of Apache (Open Source) enthusiasts, each called an 'ALC Chapter'. For details, please refer to https://s.apache.org/alc
The session will focus mainly on two topics:
1.) How the Apache Software Foundation provides the opportunity for your idea to flourish.
I shared the initial ALC idea with the community around mid-2019, and from there, with great input from the community and mentors, we gave shape to the idea. It is a great example of how the community can help you transform and enhance your idea to match global standards.
2.) Introduction, Present State, and next plans of ALC
- Introduction: about ALC, ALC roles and responsibilities, and the benefits of ALC
- How to apply to set up an ALC Chapter, the code of conduct, and ALC resources
- Present state: current ALC Chapters (Indore, Beijing, Warsaw, Budapest, Lagos, Shenzhen, and others)
- Next steps on ALC: establishing new ALCs and the future roadmap
- How to participate in this initiative
More details on ALC can be found at https://s.apache.org/alc
The following will be the takeaways from the session:
- How community engagement can help in improving and implementing your idea.
- What is ALC and how to participate in this initiative.
|Swapnil M Mane|
|Fundraising at Apache||The ASF runs on more than just code donations. To operate the way it does today requires significant amounts of cash and services donated by our generous sponsors and individual supporters. In this session, we'll examine the different sponsorship programs available at the ASF and how they work, including options that target specific projects. We'll discuss how the ASF operates compared to similar foundations like the Eclipse Foundation or the Linux Foundation and some of the advantages and limitations we have based on our 501(c)(3) not-for-profit structure. Finally, we'll have a discussion on the future of open source fundraising. Bring your thoughts and ideas.||Bob Paulin|
|Running an Apache Project: 10 Traps and How to Avoid Them||When you are starting on your open source adventure, there are a lot of things to learn that have very little to do with coding and instead relate to interacting with people. Apache is, at its best, a group of people who are trying to share their experience and teach new projects and contributors how to successfully manage open source projects. However, like the blind people each describing a part of an elephant, each mentor brings their personal experience to the table, and thus can give good, yet conflicting, advice to new projects. Still, that aggregate advice has helped many projects become successful. Based on the author's experience, this talk will take you through 10 common traps in running Apache projects, why they happen, and how to avoid or mitigate them.||Owen O'Malley|
|From an idea to an Apache TLP (Post pandemic edition)||A talk about my journey from having a crazy idea to seeing the open-source project I started become a top-level project at the Apache Software Foundation.|
How do you communicate your idea? What did I do to grow a community around mine? What's important and what's not?
Even though we graduated before the pandemic, a TLP never stops evolving, and the last two years added quite a bit more insight into community dynamics that I would love to share.
|Issue Management and Bug Triage for Apache Committers||Regular bug review and prioritization prevents production emergencies. It also helps build communities since new contributors often start by filing bugs. Quick bug response that listens to users and makes them feel valued and respected helps bring new people into the community. By contrast when bugs are ignored, potential contributors leave and move on to other projects.|
Too often the bug tracker is where bugs go to be ignored while they fester and breed until they metamorphose into P0 production emergencies. Regular bug review can prevent neglected issues from getting worse. Tracker pruning prevents backlogs from growing out of control and gives you early warning when the team is falling behind so you can reprioritize. It also enables you to fix more bugs faster, and makes sure the most important issues are addressed promptly.
Certain patterns and practices apply whether you track bugs in Jira, GitHub, Bugzilla, FogBugz or an Excel spreadsheet. Topics covered include triaging new bugs, priority levels, SLAs for addressing bugs, scheduling time to work on bug management, and what to do with bugs your team doesn’t have the resources or time to fix.
|Elliotte Rusty Harold|
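As a small illustration of the SLA-driven triage the session describes, the check below flags bugs whose age exceeds their priority's SLA window (the priority labels and day counts here are hypothetical examples, not a recommendation):

```python
from datetime import datetime, timedelta

# Illustrative SLA windows per priority level (hypothetical values).
SLA_DAYS = {"P0": 1, "P1": 7, "P2": 30, "P3": 90}

def overdue_bugs(bugs, now):
    """Return the ids of bugs whose age exceeds the SLA for their priority."""
    result = []
    for bug in bugs:
        limit = timedelta(days=SLA_DAYS[bug["priority"]])
        if now - bug["filed"] > limit:
            result.append(bug["id"])
    return result

bugs = [
    {"id": 101, "priority": "P1", "filed": datetime(2022, 9, 1)},
    {"id": 102, "priority": "P3", "filed": datetime(2022, 9, 20)},
]
print(overdue_bugs(bugs, datetime(2022, 10, 1)))  # → [101]
```

Running a report like this on a schedule is one way to get the early warning the abstract mentions before a neglected issue turns into a P0.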
|The ASF's Secret Weapons||Today, you can't throw a rock without hitting an Open Source foundation. What is it that makes the Apache Software Foundation unique in this vast collection of non-profits? Three things: a strict policy of vendor neutrality; a focus on the volunteer contributor; and a restriction against the foundation funding any development on any Apache project. Hear and, more importantly, understand why these three pillars were not only vital to the ASF's success over the last two decades, but also why they are even more important to the continued success of open source in the future.||Jim Jagielski|
|Distributed events and how to make one||During the pandemic, we experienced a period of not being able to travel and meet other people. Those were pretty sad times - we all want to forget about them quickly. Getting back to in-person events like ApacheCon in New Orleans is something we've all been eager to do, and it's super cool we can do a physical event again.|
But should we really forget about those times, or rather learn from them? Do we really want to go back to the pre-COVID days, or can we adapt and apply what we've learned during the pandemic?
In Apache Airflow, we were forced to run Airflow Summit online. That made the event bigger than we could ever imagine (10,000 attendees!) and much more accessible. We had attendees from 90+ countries. Most of those people would not have been able to be part of an in-person event. Yes, the in-person part was missing - but can we try to join the best of both worlds?
We will describe the first event of a new kind - a Distributed Event. Not a "Hybrid" event (I hate that name) - but a truly distributed event that spanned a whole week across 13 cities all over the world - from Lagos in Africa to San Francisco in the US.
You will get some learnings from that experience - where 10,000 people still attended the event, but we managed to bring the local in-person experience to those who badly wanted to meet again - with far less exclusion than a regular on-site event has.
Are we about to write a new Blueprint for Distributed events? Let's see.
|Jarek Potiuk & Pedro Galvan|
Cloud Runtime/Cloud Native
October 5-6, Mid-City
|Session Title||Session Description||Presenter(s)|
|Saving Lives with Apache Camel K: Fast-tracking a Cloud-native Text-to-911 Emergency Service||For emergency services, miscommunications can have devastating consequences. If a call comes in and the emergency responder can’t adequately respond to a crisis or causes an error in communication, the caller will be left behind without critical information. In a country where over 350 languages are spoken, the startup ConveyTEK was poised to create a digital, cloud-native solution based on Apache Camel K and Kubernetes that guaranteed high-speed translations for text-to-911 services.|
The founding team knew they needed to move fast while architecting a rock-solid, resilient technology platform, ensuring no emergency message would ever be missed. They settled on Apache Camel’s latest cloud-native innovation, Camel K, to provide ultra-fast turnaround times during development and rollouts. Instead of focusing on complex environment setups and deployment strategies, the team was ready to go from day one. In just under three months, the startup team was thrilled to see a modern API- and container-based solution rolled out in AWS and GovCloud, running on Kubernetes.
During this presentation, we will share insights, dos and don’ts, and lessons learned while using Apache Camel K in a real-world startup environment.
|Andre Sluczka and Jeff Bruns|
|Polyglot Cloud Native Debugger - Beyond APM and Logs||All the unit tests and the largest QA team still can’t stop bugs from slithering into production. With a distributed microservice architecture, debugging becomes much harder, especially across language and machine boundaries. APMs and logs provide the first steps, but in these crucial moments we need something more.|
Production bugs are the WORST bugs. They got through unit tests, integration, QA and staging… We cross our fingers, put on the Sherlock Holmes hat & hope the bug made it into the log…
If not, our only remedy is more logging. That bogs down performance for everyone and makes logs damn near unreadable. We have no choice other than crossing our fingers & going through CI/CD again... again... again...
With developer observability we can follow a specific process through several different microservices and “step into” it as if we were using a local debugger, without interrupting the server flow. In this session I will demonstrate such an approach and how everything integrates through Tomcat and more.
I'll also demonstrate the process of debugging complex distributed systems such as Airflow/Spark or Kafka using this approach.
|Lessons learned from running thousands of Kafka clusters on AWS||Apache Kafka is well known as a low-latency, high-throughput and highly configurable streaming platform. At Amazon MSK, we run thousands of Kafka clusters on AWS, each cluster with different hardware and software configurations. Managing such a large and diverse Kafka fleet has taught us several operational lessons. We would like to share some of these lessons with you.|
We’ll talk about several topics including (a) monitoring Kafka health, (b) optimizing Kafka to address compute, storage and networking bottlenecks, (c) automating detection and mitigation of infrastructure failures related to compute, storage and networking and (d) continuous software patching.
|Mehari Beyene & Tom Schutte|
|Open source serverless spark data pipelines||Utilize open-sourced, cloud-native, customizable, serverless Spark data pipelines to turbocharge data engineering use cases. In this session we will walk through an overview of serverless Spark pipelines, their advantages, and a brief demo.||Shashank Agarwal & Ajay Kumar|
|Event-driven autoscaling through Apache Kafka Source, KEDA, and Knative Integration||Kubernetes allows lots of enterprises to run a variety of business applications from web services to mobile applications, IoT edge streaming, and AI/ML. The biggest benefit of Kubernetes is to autoscale your apps on-demand, as this reduces the amount of process time required to handle incidents. It also helps make your cloud platform more reliable and stable to serve business services seamlessly.|
One caveat: Kubernetes autoscaling is fundamentally based on hardware resource utilization (CPU, memory) through Horizontal Pod Autoscaling. This creates a new challenge for building an event-driven architecture on Kubernetes. In an event-driven architecture, you probably have multiple event sources, such as Apache Kafka, for consuming message streams. Event metrics like consumer lag are more relevant than a pod's CPU usage for deciding when applications need to be scaled out and in.
Kubernetes Event-Driven Autoscaling (KEDA) is designed to solve this challenge by autoscaling existing deployed applications based on event metrics. Knative can also scale serverless applications on Kubernetes using its own Knative autoscaler. But what if you need to manage the autoscaling capability across everything from normal applications to serverless functions based on event sources?
This session will teach you how to redesign an event-driven autoscaling architecture: deploy apps as Knative services, then let KEDA autoscale Knative Eventing components (KafkaSource) based on event consumption rather than standard resources (CPU, memory).
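The scaling arithmetic behind event-driven autoscaling can be sketched roughly as follows (a minimal illustration of the idea, not KEDA's actual implementation; the lag threshold and replica bounds stand in for hypothetical scaler configuration):

```python
import math

def desired_replicas(total_lag, lag_threshold, min_replicas=0, max_replicas=10):
    """Size a deployment by pending events (e.g. Kafka consumer lag)
    rather than CPU, clamped to the configured replica bounds."""
    if total_lag == 0:
        return min_replicas  # scale to zero when no events are pending
    wanted = math.ceil(total_lag / lag_threshold)
    return max(min_replicas, min(wanted, max_replicas))

print(desired_replicas(0, 100))     # → 0 (scale to zero)
print(desired_replicas(450, 100))   # → 5
print(desired_replicas(5000, 100))  # → 10 (capped at max_replicas)
```

This is exactly the kind of decision a CPU-based Horizontal Pod Autoscaler cannot make: with 450 messages waiting, pods may be idle on CPU yet still need five replicas to drain the queue promptly.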
|Empower your Service Mesh with APISIX Ingress Controller||Service meshes are the talk of the town these days. A service mesh handles the east-west traffic in your clusters enabling service-to-service communication. But, in real-world applications, these services need to be exposed to external traffic.|
Default ingress options provided by service meshes work well for primary use cases. But as your need for security, scalability, and controllability increases, a need for a high-performance ingress gateway becomes evident.
This talk introduces Apache APISIX and APISIX Ingress Controller as a high-performance Kubernetes ingress controller that is declarative, dynamic, extensible, and easy to use. Navendu walks through how service mesh users can leverage APISIX to expose their services to the consumers in a secure, scalable, and controllable manner.
This talk is for service mesh adopters looking to have the same degree of control over the north-south traffic as they have for their east-west traffic.
Attendees will learn about:
• The difference between a service mesh and a gateway.
• Why they would need an ingress controller while using a service mesh.
• Using APISIX or other ingress controllers with their service mesh deployment.
|M3s: Apache Mesos and K3s - Resource-conscious Platform for ML and Data-Processing||How an out-of-the-ordinary proposal to use K3s and Apache Mesos to run an ML and data platform turned into a highly scalable, low-operating-overhead, resource-conscious production environment.|
Our platform at ISS, Inc. enables teams across the organization to connect, explore, and interact with our vast amount of Environmental, Social, Governance (ESG) related data. It supports the fast-growing needs around large-scale data-collection, data-storage, and data-filtering for our client-facing applications and services running in a hybrid environment.
This talk will provide all the technical details and won't hide any of the challenges faced along the way.
|Apache YuniKorn: enhanced scheduling in the cloud||Kubernetes has historically focused on service-type workloads. Stateful workloads have become better supported in recent releases. Support for batch and HPC workloads continues to lag in the Kubernetes scheduler.|
Alternative schedulers, like Apache YuniKorn, have been created to fill the gaps in functionality. Apache YuniKorn adds advanced options like workload queueing and quota sharing to Kubernetes scheduling without affecting the traditional workloads. The existence of the alternative schedulers has driven enhancements to Kubernetes scheduling. The Apache YuniKorn community is participating in shaping some of these Kubernetes enhancements within the K8s community.
In this talk we will look into our involvement in some of these enhancements, focusing on how they cover our enhanced scheduling use cases.
The enhancements could be used by Apache YuniKorn to support new features, reduce maintenance and improve the integration with K8s. We will look at what we have done inside Apache YuniKorn over the last releases and what is planned in preparation for the Kubernetes changes.
We will also give you an overview of the roadmap and features planned for Apache YuniKorn.
|Wilfred Spiegelenburg and Manikandan Ramaraj|
|Challenges and Learnings in building hbase k8s operator||The HBase k8s operator is built on top of the Kubernetes operator pattern to teach Kubernetes how to bring up a production-grade HBase cluster and maintain it, with a heavily simplified k8s manifest while remaining highly configurable.|
In this talk, we will discuss some of the challenges faced while building the HBase k8s operator, and how they were solved:
1. Bootstrapping a fully automated HBase + HDFS cluster
2. Running a production-grade cluster with auto-recovery in case of maintenance
3. Abstracting out the complexity of health checks and startup/shutdown scripts
4. Making it highly customisable to bring up clusters with varied requirements
5. Spanning a cluster across multiple k8s namespaces to run a multi-tenant mode of deployment
6. Baked-in HBase rack awareness with ZooKeeper as a state store
Link to project: https://github.com/flipkart-incubator/hbase-k8s-operator
|Modernize APIs to run serverless using Apache CXF||Years ago the Service-oriented architecture (SOA) architectural style came along with implementations of web services based on standards like the Web Service Description Language (WSDL) and SOAP. Many of those interfaces are still in place today, as a change requires both the provider and all consumers to agree on a new definition and change the implementation (often without any business value). The underlying infrastructure, sometimes based on Enterprise Service Buses (ESBs), is however often end-of-life and hard to maintain.
In this session you will learn how to modernize API infrastructure without changing the interface definition in the first place. Apache CXF allows you to provide APIs using SOAP or RESTful HTTP in a contract-first manner. Combining it with cloud-based serverless function services like AWS Lambda enables you to reduce management effort and lower costs by paying only for what you use.
|Open Source On-premises vs Cloud Infrastructure at Twitter: Shift in open source development adopting cloud||Large-scale data analytics and processing solutions have traditionally been built and supported using open source projects. Many of these are Apache projects, such as Hadoop, Hive, Presto, and many more. Thousands of developers, including Twitter engineers, have contributed to these open source projects, which has helped in building scalable and reliable data platform solutions for data-intensive applications.
In recent years, there has been a shift towards adopting solutions offered by cloud vendors. Data platform components that once depended on maintaining open source projects have now become commodities in the cloud. In such an environment, what does a modern data platform for large data companies look like, and where have the development and open source contributions from engineers shifted?
In this talk we walk through Twitter's journey of how the platform solutions were built using open source, and where those open source contributions have shifted with the adoption of cloud. We discuss the architecture of the Data Platform before and after cloud adoption. We walk through examples of the shift in platform solutions and what open source means to companies adopting cloud. We also discuss the challenges of scale and the complexity of distributed systems, and showcase how to build an exabyte-scale platform in the cloud.
|Lohit VijayaRenu and Daniel Templeton|
|Overview of Apache Airavata||The talk introduces Apache Airavata which enables scientists, engineers, and data scientists to securely access and manage remote data, execute remote scientific software, and execute distributed data processing pipelines. Airavata can be used to create Web-based interfaces to|
Execute scientific computing workflows and automated data analysis pipelines on multiple, remote supercomputers, cluster computers, or cloud computing resources;
Manage data transfers and data access across local and remote resources;
Keep track of their activities in the user environment;
Clone, modify, and reuse earlier sessions with the user environment; and
Share sessions (such as workflow and jupyter notebook executions) with their collaborators.
Apache Airavata consists of the following core components:
Workflow Execution Services: for running software on remote computing clusters and for tracking metadata associated with executions;
Custos Security Services: to support federated authentication, user management, group and authorization management, permission management, and resource credential management;
Managed File Transfer and Data Lake Services: for managing data on distributed resources and executing automated, event-driven analysis pipelines;
The Airavata Django Gateway: a turnkey Web-based frontend for all Apache Airavata middleware; the Django Gateway can also be customized to provide tailored user experiences for different target users (engineers, managers, etc).
October 6, Rhythms II
|Session Title||Session Description||Presenter(s)|
|What's new in Apache CloudStack 4.17||4.17.0 is the latest Apache CloudStack major release. In this talk we will go through the new features introduced in this version from an administrator/user perspective, explaining their benefits and the problems those features resolve. We will also run a live demo to see these new features in action.||Nicolas Vazquez|
|Kubernetes CAPI for Apache CloudStack - A story of cross-community collaboration||The open-source ecosystem consists of several communities, and when these communities work together, true innovation happens and customer-centric products develop. When members of the Apache CloudStack community worked together with the Kubernetes SIG community, CAPC - Cluster API Provider for Apache CloudStack was born, as well as the advancement of the Kubernetes Image Builder project, both aimed to provide a simplified, unified way to deploy and manage Kubernetes Clusters.|
Join us to hear our story. From building an initial prototype over a weekend to being invited to be a part of Kubernetes SIG Community, and finally integrating CAPC within both communities! We share our learnings, and key takeaways from our journey with the Kubernetes SIG community, showing that cross-community collaboration is always a win-win!
|Improvements in volume snapshot process in KVM plugin of Apache CloudStack||Apache CloudStack (ACS) and KVM is a combination that many organizations decided to adopt. KVM is a widely used hypervisor, with a vibrant community, and support in different operating system distributions. In ACS, the number of contributors to the KVM plugin has increased considerably, which indicates the growing number of companies using this stack. While developing the KVM plugin functionalities, one normally tries to make use of the full potential of the hypervisor; however, recently we faced a problem with the volume snapshots implementation, which was affecting running VMs. This talk will address the problems we encountered, how they were solved with the recently merged implementations, which considerably optimized the snapshot process of volumes with KVM by introducing disk-only snapshots, and limitations we may face (e.g. lack of operating system support). Further, we will highlight the next steps we are taking to improve the KVM plugin by introducing differential snapshots to the volume snapshot process.||Daniel Augusto Veronezi Salvador and Rafael Weingärtner|
|Making Apache CloudStack market ready with a native rating solution||Apache CloudStack (ACS) is a solid option among known cloud orchestration systems, being on the same level as OpenStack, Azure Stack, and others. All of them address the basic needs to create and run a private cloud system; however, ACS's users have to adopt external solutions for rating/billing the resources consumption, which is native in other orchestration tools (e.g. OpenStack). This presentation will address the design and efforts of the ACS community to implement a native rating feature that will allow more flexibility and reduce the need for external systems.||Daniel Augusto Veronezi Salvador and Rafael Weingärtner|
|Lessons Learnt in using Github Actions with CI/CD in an Apache project||Github Actions provide an alternative CI/CD infra for Apache projects such as CloudStack. In this talk, we go through some recent integrations for Apache CloudStack project using Github Actions and bots for PR/issue triaging and PR analysis.||Daan Hoogland, David Jumani, and Pearl Dsilva|
|Building Clouds with Apache Cloudstack||Apache CloudStack is open source software designed to deploy and manage large networks of virtual machines, as a highly available, highly scalable Infrastructure as a Service (IaaS) cloud computing platform. This talk will give an introduction to the technology, its history and its architecture. It will look at common use-cases (and some real production deployments) that are seen across both public and private cloud infrastructures, and where CloudStack can be complemented by other open source technologies. The talk will also compare and contrast Apache CloudStack with other IaaS platforms and explain why the technology, combined with the Apache governance model, will see CloudStack become the de-facto open source cloud platform. A live demo will be run to show the software, and an introduction to the ways people can get involved in the Apache CloudStack project will be given.||Daan Hoogland|
October 3-4, Mid-City
|Session Title||Session Description||Presenter(s)|
|Git for Data Lakes - How lakeFS Scales data versioning to billions of objects||Modern data lake architectures rely on object storage as the single source of truth. We use them to store an increasing amount of data, which is increasingly complex and interconnected. While scalable, these object stores provide few safety guarantees: they lack the semantics for atomicity, rollbacks, and reproducibility needed for data quality and resiliency.|
lakeFS, an open source data version control system designed for data lakes, solves these problems by introducing concepts borrowed from Git: branching, committing, merging, and rolling back changes to data.
In this talk you'll learn about the challenges with using object storage for data lakes and how lakeFS enables you to solve them.
By the end of the session you’ll understand how lakeFS scales its Git-like data model to petabytes of data, across billions of objects - without affecting throughput or performance. We will also demo branching, writing data using Spark and merging it on a billion-object repository.
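As a toy illustration of the Git-like model described above (deliberately not lakeFS's actual implementation), a branch can be a cheap pointer to an immutable commit of path-to-object mappings, so branching a billion-object repository copies metadata only:

```python
# Toy model of Git-like versioning over an object store:
# a commit is an immutable snapshot of path -> object-id mappings,
# and a branch is just a named pointer to a commit.
commits = {}   # commit id -> {"parent": ..., "tree": {path: object_id}}
branches = {}  # branch name -> commit id

def commit(branch, changes):
    parent = branches.get(branch)
    tree = dict(commits[parent]["tree"]) if parent else {}
    tree.update(changes)               # copy-on-write: metadata only
    cid = f"c{len(commits)}"
    commits[cid] = {"parent": parent, "tree": tree}
    branches[branch] = cid
    return cid

def create_branch(name, source):
    branches[name] = branches[source]  # O(1): no objects are copied

commit("main", {"events/part-0.parquet": "obj-a"})
create_branch("experiment", "main")
commit("experiment", {"events/part-0.parquet": "obj-b"})

print(commits[branches["main"]]["tree"])        # main still sees obj-a
print(commits[branches["experiment"]]["tree"])  # experiment sees obj-b
```

Because writes on the experiment branch only update its own commit tree, main remains untouched until a merge, which is the isolation property the talk's billion-object demo exercises at scale.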
|Logging modernization and cybersecurity at scale with (Mi)NiFi, Kafka and Flink||Efficiently collecting logs across a wide set of heterogeneous devices (network equipment, laptops, servers, etc) over a globally distributed network at scale is hard. This is however a requirement for cybersecurity-oriented use cases.|
In this talk, we'll discuss an architecture involving Apache MiNiFi agents, Apache NiFi, Apache Kafka and Apache Flink to efficiently collect logs at scale, process and normalize very different log patterns, and perform streaming analytics over the streams of data to efficiently monitor the overall network and generate alerts when appropriate.
We'll be demoing how this architecture can be applied to a globally distributed network including on-premises and cloud-based deployments.
|Pierre Villard & Sunile Manjee|
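A minimal sketch of the normalization step such a pipeline performs (the log formats and field names below are made-up examples; in a real deployment this logic would live in NiFi/MiNiFi processors or Flink jobs, typically with many more patterns):

```python
import re

# Hypothetical patterns for two device families.
PATTERNS = [
    ("firewall", re.compile(r"^FW\|(?P<ts>\S+)\|(?P<severity>\w+)\|(?P<msg>.*)$")),
    ("syslog",   re.compile(r"^<\d+>(?P<ts>\S+) (?P<host>\S+) (?P<msg>.*)$")),
]

def normalize(line):
    """Map a raw log line from any known source into one common schema."""
    for source, pattern in PATTERNS:
        m = pattern.match(line)
        if m:
            return {"source": source, **m.groupdict()}
    return {"source": "unknown", "msg": line}  # keep unmatched lines for review

print(normalize("FW|2022-10-03T10:00:00Z|HIGH|blocked port scan"))
```

Once every device's logs share one schema, downstream Kafka topics and Flink analytics can treat the stream uniformly, which is what makes alerting across heterogeneous equipment tractable.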
|Apache Arrow and Go: A match made in Data||With Apache Arrow fast becoming a standard for working with data, most people are primarily familiar with the Python, C++ and Java libraries. This talk instead focuses on the Golang implementations of Apache Arrow and Parquet. We'll cover getting started using the Go Arrow and Parquet libraries, creating an Arrow Flight server and client over gRPC, and integrating with other runtimes using the C Data API. The concurrency primitives in Go make it ideal for constructing efficient pipelines for parallel processing of large amounts of data. We'll also cover some of the internals of the implementation to demonstrate how the Go Arrow and Parquet libraries achieve their performance including benefiting from SIMD.||Matthew Topol|
|Batch and Stream analysis with Typescript? Yes, with Beam : )||Beam's mission has been to meet developers where they are: Language of choice, and runner of choice.|
The Beam Typescript SDK started as a hackathon project to test out how much functionality we could add to a Beam SDK in a week. Since then it's become a really nice prototype of an SDK, and we're ready to strengthen it and make it production-ready!
We'll discuss API choices, usability, use cases, and demo the functionality.
At the end, we'll call for your help with feedback and contributions, of course : )
|OpenLineage: An Open Standard for Data Lineage||Data pipelines may start off simple, organized, and easy to understand…but they never stay that way. When your pipeline consists of hundreds of jobs, spread across different teams, it becomes easy to lose track of them all.|
If a job run fails, how can you learn about downstream datasets that have become out-of-date? Can you be confident that they are consuming fresh, high-quality data from their upstream tasks? How might you predict the impact of a planned change on distant corners of the pipeline? These questions become easier once you have a complete understanding of data lineage, the complex set of relationships between all of your jobs and datasets.
OpenLineage is an open framework for collection of lineage metadata. It integrates with pipeline tools like Apache Airflow and Apache Spark to observe and record data transformations. Using this metadata, you can assemble a lineage graph - a picture of your pipeline worth a thousand words.
In this session, you will learn the purpose of data lineage, and its many uses in the modern data stack. You will hear about the basics of OpenLineage, its data model, and how it collects metadata. Finally, you will be introduced to Marquez, an OpenLineage metadata server.
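As a rough sketch, a lineage event is JSON metadata tying a job run to its input and output datasets. The dict below follows the general shape of an OpenLineage run event, but the job names, namespaces, and producer URI are made up; in practice the Airflow and Spark integrations emit these events for you:

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical lineage event for one run of a daily rollup job.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "analytics", "name": "daily_orders_rollup"},
    "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
    "outputs": [{"namespace": "warehouse", "name": "agg.daily_orders"}],
    "producer": "https://example.com/my-pipeline",  # hypothetical producer URI
}

print(json.dumps(event, indent=2))
```

Collecting events like this for every run is what lets a server such as Marquez assemble the lineage graph: each event contributes edges from input datasets, through the job, to output datasets.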
|Big Data Workflow Scheduling - Introducing Apache DolphinScheduler||Apache DolphinScheduler is a distributed, scalable, and visual cloud-native workflow task scheduling platform that supports massive task scheduling.|
With a decentralized architecture, it allows us to easily and quickly scale horizontally to ensure it runs in any size cluster. It is also designed with a microkernel plug-in architecture, so you can easily extend it with plug-ins.
In addition, Apache DolphinScheduler provides richer isolation of permissions than other scheduling systems, makes it easier to configure inter-workflow dependencies, and allows workflow runtime to be adjusted.
In version 3.0, Apache DolphinScheduler supports AWS and Kubernetes and adds a Python API to implement workflow-as-code. If you want to learn more about the concepts, examples, and the latest developments in the community, this is the topic you should not miss.
From the talk, the audience will learn about:
1. the basic features of Apache DolphinScheduler
2. how to use Apache DolphinScheduler to schedule tasks
3. the basic concepts of the Python API and how to create a workflow with it
4. the new features of Apache DolphinScheduler 3.X
5. roadmap of Apache DolphinScheduler
|Dive into Avro: Everything a data engineer needs to know||Apache Avro is the de-facto standard for serializing structured data: Big data engines, streaming platforms, and data lakes use it to optimise storage or transmission of data.|
Avro is stable and mature, and usually "just works", but there's a lot going on behind the scenes. This talk covers topics, gotchas and best practices for using Avro in your systems: types and schemas, logical types, binary and JSON serialization, code generation, and its strong ecosystem of tools.
Finally we'll dive a bit deeper into schema evolution, an important topic especially in event-driven architectures.
|Ryan Skraba and Ismaël Mejía|
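As a toy illustration of the schema-evolution behavior the talk covers: when data written with an old schema is read with a newer one, fields missing from the writer's data take the reader's default, and writer fields the reader doesn't know are dropped. This deliberately reimplements a small sliver of reader-schema resolution for records; the real avro/fastavro libraries handle this (plus type promotion) for you:

```python
# Reader's (newer) schema: "email" was added with a default.
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

def resolve(writer_record, reader_schema):
    """Toy reader-schema resolution for a record: fill defaults for
    missing fields, drop fields the reader doesn't declare."""
    out = {}
    for field in reader_schema["fields"]:
        if field["name"] in writer_record:
            out[field["name"]] = writer_record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return out

# Written with an older schema that had no "email" but an extra "age".
print(resolve({"id": 42, "age": 7}, reader_schema))  # → {'id': 42, 'email': None}
```

The rule that every newly added field must carry a default is precisely what makes this kind of forward/backward compatibility work in event-driven architectures.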
|Elastic Heterogeneous Cluster and Heterogeneity-Aware Job Configuration||Nowadays, different cloud providers like AWS, Azure, and GCP provide a wide range of instance types in different categories: General purpose, Compute-optimized, Memory-optimized, Storage-optimized, Accelerated Computing like GPU, FPGA, etc. Each instance category fits different use cases. For example, a CPU-optimized instance is designed for CPU-intensive applications but not for memory-intensive applications. Similarly, not all applications can utilize accelerated computing on GPU or FPGA; running such jobs on these instances will end up wasting the advanced resources. The result is non-optimal performance and higher cost. To improve application performance as well as save cost, customers currently must configure different types of clusters and carefully assign applications to the best-fitting cluster. But this is not a scalable solution for the customers.|
This talk will provide insight into the following areas:
- A heterogeneous cluster environment, with mixed instance types for worker nodes, such as CPU node and GPU node.
- Auto-detecting or identifying the best-suited instance type for a Spark application to achieve the best performance and cost.
- Auto-healing the auto-detection mechanism based on eventual runs of a Spark application.
- Effective scheduling of the application to achieve better resource utilization within a cluster.
- Auto-upscaling the cluster with the appropriate instance type based on the resource requests of the applications, and downscaling nodes of a certain instance type when they are no longer beneficial for overall performance and cost.
- If the application can be split into computation stages, utilizing stage-level scheduling to run different stages of the application on the most effective instance type, based on each stage’s resource characteristics.
|Yongqin Xiao and Atam Prakash Agrawal|
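A minimal sketch of the best-fit selection idea described above (the instance catalog, normalized resource shapes, and scoring rule are all hypothetical, standing in for whatever profiling-driven mechanism the talk presents):

```python
# Hypothetical catalog: relative resource shapes per instance category.
INSTANCE_TYPES = {
    "general": {"cpu": 1.0, "memory": 1.0, "gpu": 0.0},
    "cpu-opt": {"cpu": 2.0, "memory": 0.5, "gpu": 0.0},
    "mem-opt": {"cpu": 0.5, "memory": 2.0, "gpu": 0.0},
    "gpu":     {"cpu": 1.0, "memory": 1.0, "gpu": 1.0},
}

def best_instance(profile):
    """Score each instance type by how much of the application's
    relative cpu/memory/gpu demand its shape can cover."""
    def score(shape):
        return sum(min(shape[k], profile.get(k, 0.0)) for k in shape)
    return max(INSTANCE_TYPES, key=lambda name: score(INSTANCE_TYPES[name]))

print(best_instance({"cpu": 2.0, "memory": 0.3}))              # → cpu-opt
print(best_instance({"cpu": 0.4, "memory": 1.8}))              # → mem-opt
print(best_instance({"cpu": 1.0, "memory": 1.0, "gpu": 1.0}))  # → gpu
```

Note how a GPU-less profile never selects the GPU category, matching the point that assigning non-accelerated jobs to accelerated instances wastes the advanced resources.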
|Daffodil: How Functional Programming leads to tight C-code at Runtime||This talk will introduce the internal organization of Apache Daffodil. Daffodil contains a DFDL schema compiler which is written in Scala using powerful functional programming techniques like Object-Oriented Lazy Attribute Grammars (OOLAG).|
Daffodil has evolved over the last few years and now supports multiple different runtime environments that are implemented in widely different ways, from Scala for the JVM, to a C-code generator that emits tight C-language source code for standard C compilers.
We will also demonstrate new aspects of Daffodil like the new Daffodil VSCode-based data debugger.
|Morel, a data-parallel programming language||What would the perfect data-parallel programming language look like? It would be as expressive as a general-purpose functional programming language, as powerful and concise as SQL, and run programs just as efficiently on a laptop or a thousand-node cluster.|
We present Morel, a functional programming language with relational extensions, working towards that goal. Morel is implemented in the Apache Calcite community on top of Calcite’s relational algebra framework. In this talk, we describe Morel’s evolution, including how we are pushing Calcite’s capabilities with graph and recursive queries.
|Iceberg’s Best Secret: Exploring Metadata Tables||Iceberg's secret sauce is its rich metadata, powering core features like time travel, query optimizations, and optimistic concurrency handling. But did you know that everyone can easily access this secret sauce via system tables? In this talk, we go over real-life queries on metadata tables that get more insights out of Iceberg. What is the last partition updated, and when? Why are there too many small files? Why are certain data files filtered out or not? We explore even more advanced use cases like data auditing and data quality. How many null values are being added per hour? What is the latency of data ingest over time? We will also cover metadata table performance tips and tricks, and ongoing improvements in the community. Whether you are already using Iceberg or interested in getting started, attend this talk to learn how to use this under-utilized feature to get even more out of Iceberg.||Szehon Ho|
|Integrated Audits: Streamlined Data Observability with Apache Iceberg||We all want to make sure our data is correct before making it available to downstream consumers. Likewise, we all have our techniques for accomplishing this! This talk describes the Integrated Audits pattern that has worked tremendously well for data at massive scale, enabled through Apache Iceberg, the open table format.||Sam Redai|
October 3-4, Rhythms II
|Session Title||Session Description||Presenter(s)|
|Living and Breathing the Apache Way. Contributing upstream with competitors, customers, and investors breathing down your neck||We all know that Fineract, as a mission-critical core banking system, presents a difficult balance to strike when rapidly building a differentiated product on top of an open source project. While it is challenging to balance the need for secrecy and time to market for one’s solution against transparency and contribution to the open source project, when upstream contribution is done right it unlocks enormous economic value not just for you, the innovator, but for the entire ecosystem as a whole.|
This session will explore at a practical level the culture, processes, and standards that need to be in place at both an org level and a community level to ensure that individuals can effectively contribute to an upstream open source codebase while efficiently maintaining their downstream solution. We know that putting the Apache Way into practice can be difficult, and we will share our firsthand experiences so others in the community can do the same and reap the economic benefits both internally and externally across the entire community.
|Apache Kafka as Data Hub for Crypto, NFT, Metaverse – Beyond the Buzz||Decentralized finance with crypto and NFTs is a huge topic these days. Combined with the coming metaverse platforms across industries, it becomes even more powerful. This session explores the relationship between crypto technologies and modern enterprise architecture.|
I discuss how data streaming and Apache Kafka help build innovation and scalable real-time applications of a future metaverse. Let’s skip the buzz (and NFT bubble) and instead review existing real-world deployments in the crypto and blockchain world powered by Kafka and its ecosystem.
|Building a trustworthy digital financial platform is a journey, not an end!||Design in trust (security, privacy, and compliance); don't bolt it on afterwards. But how to begin? There are many security and compliance frameworks and best practices for building a successful security, governance, and compliance program, and choosing an appropriate one for digital financial service providers is vital. In this talk, I will share my experiences and expertise on the key security and compliance issues below, which should be considered in every digital financial system:|
•Enhance customer trust and confidence in digital financial services.
•Clarify the role and responsibilities of each of the stakeholders in the ecosystem.
•Identify security vulnerabilities and related threats within the ecosystem.
•Establish security controls to provide end-to-end security.
•Adopt open source technologies securely.
•Strengthen management practices concerning security risk management, including all stakeholders: customers, regulators, and partners.
•Ensure customer data is managed, processed, and stored in accordance with relevant data protection laws and regulations, with specific requirements formally established in customer contracts.
•Integrate KYC/AML systems securely into the digital financial system ecosystem.
|Central Bank Digital Currencies, Stablecoins, and the role of open source and fineract||Central Bank Digital Currencies (CBDCs) continue to evolve as a concept, with the US Federal Reserve, Sweden's Riksbank, and the Kenyan Central Bank all requesting that policy makers and vendors provide additional information about ways to improve financial inclusion and digital currencies with new technologies and standards.
While advocates of so-called "stablecoins" claim proper currency controls, the Central Banks and regulators are not as confident. CBDCs are a way for Central Banks to provide a government-backed digital currency, and a counter to the private currency providers.
Fineract, as a centralized ledger system, could be configured and used in tandem with one or more protocol providers (e.g. XRP) to establish a gateway between the formal sector banking (legacy systems) and new stablecoins or CBDCs.
This discussion will focus on how the project should navigate the waters between CBDCs, Stablecoins, and various global efforts.
|James Dailey and Orang Dialemah|
|Scaling and Modularizing Fineract 1.x - Keeping Pace with the Evolution of Fintech and Embedded Finance.||Composability and scalability in the cloud are all the rage these days when looking at fintech & core banking infrastructure. The Fineract community has explored and traveled down several paths of evolution and is now deeply focused on refining Fineract 1.x to meet the scalability and modularity requirements of the world’s leading fintechs and financial institutions. Our panelists will explore and walk through a number of the ongoing changes that are being implemented in the upstream Fineract codebase that will allow it to continue to stay at the forefront of core banking, fintech infrastructure, and embedded finance.|
You’ll discover the major enhancements at the database, API, event handling, and batch processing layers that are being introduced to evolve Fineract to scale to support tens of millions of accounts and the transaction processing required by digitally native organizations. We’ll walk through the ongoing roadmap to enable greater modularity and composability to ease downstream contribution and equip institutions to innovate in a rapid, agile, low-code manner. At an infrastructure level, we’ve also vastly improved release management, testing coverage, automated code quality checks, and more to make contribution and downstream release management seamless. Lastly, we invite you to participate in this roadmap for scalability by giving your architectural input, contributing upstream, and picking up tasks in the backlog.
|Istvan Molnar and Aleksandar Vidakovic|
|An Open Core: Open Source Primitives for Managing Accounts||This session explores how core banking systems can be broken down into open source building blocks - i.e. primitives for accounts - customers, deposit accounts, credit accounts. At an architectural level we will explore the foundational building blocks for managing customer and financial accounts. Attendees will benefit from learning about the architecture, getting a hands-on view of the APIs, and understanding the lower total cost of ownership and faster time to market from building with open source primitives, as well as the economic value of an upstream-first approach to development, enabling a virtuous cycle to collectively maintain an open source core banking platform.|
Angela Strange, in her keynote at the inaugural Fintech DevCon, powerfully articulated open source financial primitives as transformative building blocks to unlock new financial services innovation. To disrupt the fintech sector, core banking software must be commoditized, and that can only be done by breaking the core into a set of primitives for managing accounts.
We will explore at an architectural and practical level what these foundational components are for managing customer accounts, deposit/savings accounts, and credit/loan accounts. We will define the architecture and APIs underlying these primitives and through illustrative case studies of Fineract users show how fintechs from around the world layer additional innovation on our open source APIs and building blocks to quickly get to market with solutions at a much lower cost.
|Functional Roadmap for Fineract - Honing in on a Limitless Horizon||Across the world, Fineract is the clear market leader in open source for financial services as the breadth of functionality in the core banking platform equips hundreds of institutions of all sizes, regulatory forms, and market focuses with the underlying core functionality. Despite this broad footprint of use cases supported, the horizon for new functionality is endless.|
This session led by one of the foremost functional Fineract experts will explore a number of topics:
- What Next? Within loan management, there is overwhelming demand in many additional verticals - line of credit, factoring, supply chain financing, leasing, etc. Across other verticals, the topics are boundless: better support for payments, wallets, savings-led financial inclusion. We can also deepen retail core banking features - treasury management, fixed asset management. On complementary systems, we can extend further to decisioning, origination, KYC/identity, FARM, compliance, & more.
- How do we Focus? A widespread global community presents a challenge & opportunity to gather, consolidate & prioritize feedback & requirements.
- Who publishes a roadmap? Mifos aims to convene companies around a collective vision/roadmap that gets implemented by individuals contributing via the Apache Way.
- How do we Execute? This roadmap then must get translated into upstream contribution.
- Who do we benchmark against? Modern composable platforms like Mambu or Thought Machine? Legacy players like Temenos or Flexcube?
|Modernizing the Legacy Core: Fineract Driving Disruption and Displacement across the Retail Banking Sector||Across the Fineract ecosystem it’s readily evident that legacy core banking systems at the heart of the retail financial services sector are too costly, too closed, and too disconnected to enable the widespread reach needed to serve the 3B underbanked. With its roots in microfinance, in the past institutions would graduate from Fineract onto a more traditional legacy core system; we’re now witnessing the opposite as Fineract is displacing legacy core banking systems across multiple markets because of its lower total cost of ownership, modern architecture, greater agility & higher degree of flexibility & connectivity into digital payments & channels.|
Leading system integrators from across multiple regions will present their firsthand experiences migrating fintechs & financial institutions onto Fineract from legacy systems. The journey is ongoing as Fineract still has additional functional capabilities to be added to be at greater parity with these incumbents. The session will also practically explore the range of features to be contributed and collaborated on including multi-tiered KYC, treasury management, IFRS accounting standards, limit management, seamless omnichannel integration, multi-currency, and ATM/POS/Card integration.
Participants will learn what is driving this shift, the growing opportunity it presents, the challenges in displacing legacy systems, and how to advance the Fineract roadmap to accelerate the shift even further with more adoption by retail banks.
|Ademola Babalola and Victor Romero|
|AI for All: Fineract as the Foundation for Democratizing Data Science for Good||Of all the sectors that data science, artificial intelligence and machine learning are dramatically transforming, financial services is one of the most groundbreaking. Nonetheless, the upstream open source Fineract project has struggled to keep pace with the innovation on this front providing a massive opportunity for Fineract to take an even stronger foothold in the sector.|
In the last couple of years, however, the AI for All working group across Mifos and Fineract has built foundational AI/ML components around the Poverty Probability Index, credit scoring, and chatbots, but it really needs more active participation.
The working group is making progress on federated learning and synthetic data to handle the data privacy and scarcity challenges. Further, the importance of explainability and testing is factored into the adoption of models.
While the advantages of these initiatives are still being fully realized, the recent focus has been on productionalizing ML functionality. MLOps becomes paramount for full-scale adoption of AI/ML in the Apache Fineract/Mifos community.
This session will showcase the implementations of PPI and credit scoring, case studies from the ecosystem, and the plan for the next 2-3 years to productionalize ML adoption.
|Pivoting Open Source Fintech Solutions For Big Players: Enterprise Ready CBS based on Apache Fineract||Until now, the big banks had never paid attention to Open Source solutions as alternatives to solve their problems.|
Today, more and more banks are investing into Finlabs to keep themselves connected to the Fintech revolution.
Digital fintech solutions such as Apache Fineract, built entirely on Open Source technology, are gaining traction with such big institutions.
I want to share some techniques, customisations, and enhancements we have done over the years to position such Open Source Core Banking (Apache Fineract) as an Enterprise Ready Core Banking application with capabilities to handle tens of millions of transactions and seamless integrations with internally built and third party solutions.
The rise of neo banks and digital technologies such as blockchain has done an incredible job of shedding light on Open Source solutions that these big institutions can no longer ignore.
This talk will take us through this revolution and paint a picture of the new reality that the big financial institutions cannot avoid.
I will be sharing some learnings and knowledge gathered across the globe while enhancing, customising and developing on top of Core Banking Solution (Apache Fineract) for some of our diverse clients.
|Realtime Payments in Mexico using Apache Fineract||We will show how easy it is to implement Apache Fineract with CoDi using a popular Messaging Platform for merchants or B2C use cases.|
Real-time payments everywhere allow for easier adoption of electronic banking and faster financial inclusion. In Mexico, the Central Bank has created a mobile electronic banking platform, CoDi, that implements digital payment flows using QR codes and push messages.
We want to highlight that using an open source financial technology framework like Apache Fineract allows banking and fintech institutions to create a bigger ecosystem that facilitates the delivery of financial services in a secure, fast, and efficient way. The economy sees an initial positive impact on merchants because transactions are carried out in a matter of seconds (real time), without schedule restrictions (24 hours a day, 7 days a week) and without any fee on the transaction amount of the sale.
Together, Apache Fineract and CoDi boost the economy because they make resources available immediately, just like cash transactions but safer and faster.
Using Apache Fineract for the implementation of this functionality improves the security of the population by avoiding the risks of transferring valuables and managing cash. It also helps reduce poverty, because social program beneficiaries can use their resources through common, well-known platforms, reducing the learning curve and the need to visit physical branches.
|Open Banking & Finance: Changing the Game as We Know It||Open banking continues to dramatically shape the financial services landscape as it opens up broader innovation, enables greater choice for consumers, and levels the playing field amongst banks, fintechs, unregulated institutions, and platfins alike. Massive change is afoot as the ways consumers interact with their data and their money is fundamentally changing.|
Standards for data sharing and third party initiation of payments are beginning to emerge in different forms - in some cases regulatory-driven, like PSD2 and Open Banking in Europe and the UK, along with similar emergent standards across Africa from leading countries like Kenya and Nigeria; commercially-driven, like Plaid in the US; or led by open API innovation, like UPI and Sahamiti in India and the 3PPI API from Google. We’ve assembled a diverse panel of fintechs, practitioners, and industry stakeholders who will update us on the latest trends in Open Banking, the standards that are emerging, showcases and case studies of Fineract-powered open banking innovation, the open banking opportunities and trends on the horizon, and recommendations on how the Fineract community can best fuel innovation via these emerging standards.
|Ali Hussein Kassim|
October 5, Rex
|Session Title||Session Description||Presenter(s)|
|The Open Source Money||The confluence between money and information technologies has been growing exponentially over the last 10 years.|
Technology is disrupting the way we interact with each other.
And money is an ancient technology that has been evolving over the last 5000 years. During the last 50 years, it has been digitized. Money is a collaboration tool, money is the way we exchange our talent-time value with each other. Money has been centralized by the rulers (kings, queens, generals, and now central banks). But now, there are new forms of money, licensed as open source.
What does the open-source movement have to say about money? How will this impact the financial industry? How can open source software have an impact on a technology as ancient as money? We will discuss all of this in the talk.
|Injecting the Apache Way to Help Sustain and Scale the Digital Public Goods Movement||Digital Public Goods (DPGs) are open source projects that are important for serving the needs of countries and communities globally, especially aimed at such things as health service delivery, identity systems, and financial services. These include Fineract, Mifos, OpenG2P, Mojaloop, and MOSIP. Digital Public Infrastructure (DPI) refers to those same DPGs applied in the context of countries and regional efforts with the necessary implementation focus.|
This past year momentum in DPG/DPIs and building blocks has built up immensely as governments and multi-laterals recognize the need for long-term sustainability and investment into the maintenance of these technologies and the health of their open source communities.
This talk will discuss the important evolution of DPGs and DPIs and how the Apache Way, with its focus on community and upstream development ethos, can play a critical role in transforming the procurement, governance, and contribution to DPGs/DPIs to advance their long term sustainability.
|Jake Watson and James Dailey|
|OpenG2P - Unifying Multiple Digital Public Goods to create an end to end platform for digitizing social protection programs||Throughout the two previous ApacheCon gatherings we’ve highlighted the growth and evolution of OpenG2P as an end-to-end system to digitize bulk cash transfers and social protection programs, from beneficiary management to payment disbursement to grievance and redress mechanisms. The OpenG2P architecture continues to evolve as a showcase of how multiple Digital Public Goods can be unified under a common framework to deliver an end-to-end solution for a complex use case challenging governments.|
This session will focus on the current and ongoing integration of two DPGs into OpenG2P - Apache Fineract and Mifos Payment Hub for payment orchestration and MOSIP for digital identity and biometric authentication and authorization. Through this exploration of the solution architecture and the ongoing reference implementations in two separate regions, we’ll highlight the complementary nature of the current Digital Public Goods being intertwined as well as explore the roadmap and potential of incorporating other DPGs like Mojaloop, X-Road, and OpenCRVS.
From this session, we will present the governance and collaboration structure for other open source contributors to get involved, as well as an in-depth look at the architecture, use cases, and roadmap for social protection that showcases the complementary nature of these various systems, demonstrates the boundaries of their APIs, pinpoints the logical points of integration, and directly invites other contributors to become involved.
|Ed Cable and Puneet Joshi|
Geospatial and Remote Sensing
October 4, Endymion
|Session Title||Session Description||Presenter(s)|
|Apache Science Data Analytics Platform (SDAP)||The Apache Science Data Analytics Platform (SDAP) is an open-source Analytics Collaborative Framework (ACF) that enables the confluence of resources for scientific investigation. SDAP was built to support Earth science use cases and is optimized to leverage the elastic cloud or on-premise computing clusters. SDAP provides a wide range of capabilities such as data analysis, anomaly detection, geospatial data matchup, search and discovery, and data subsetting. The SDAP technology stack includes modern technologies such as Apache Spark, Apache Cassandra, Apache Solr, and Kubernetes. SDAP is utilized by a number of projects such as the Cloud-based Data Matchup Service (CDMS), the Air Quality Analytic Centered Framework (AQACF), the Integrated Digital Earth Analysis System (IDEAS), and the IPCC AR6 Sea Level Projection Tool. This talk describes the latest projects using SDAP and previews some upcoming capabilities.||Stepheny Perez|
|Geospatial Search with Apache Lucene||Come have a look under the covers at the data structures that enable geospatial and multi-dimensional indexing and search at scale in Apache Lucene. This talk will cover not only the indexing structures considered and ultimately implemented in the Apache Lucene Open Source Project but the exceptional performance improvements and centimeter spatial accuracy obtained in the latest release. As a bonus, this talk will cover new and upcoming Spatial Analysis Aggregations and Processing available in the OpenSearch Open Source project enabled through the Lucene API.|
From tessellation to multidimensional encoding and block KD trees, this talk will cover the algorithms and data structures written and committed to the Lucene code base, as seen online at these locations:
Apache Lucene (specifically the release of BKD based geo indexing https://issues.apache.org/jira/browse/LUCENE-8396)
Performance benchmarks for Lucene Spatial Indexing: https://home.apache.org/~mikemccand/geobench.html
Finally, we will discuss the future of the project including existing and evolving support for custom coordinate reference systems, projections, and spatial regression modeling and statistics.
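The block KD (BKD) trees mentioned above rest on a simple recursive space-splitting idea. Lucene's on-disk BKD structure is far more elaborate; the following in-memory sketch (our own illustration, not Lucene code) shows the core mechanism of alternating-axis splits with subtree pruning during a bounding-box search:

```python
def build_kdtree(points, depth=0):
    """Recursively split points on alternating axes (x, then y),
    the core idea behind KD/BKD trees."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def range_search(node, lo, hi, found):
    """Collect points inside the box [lo, hi], pruning any subtree
    that lies entirely outside the box on the split axis."""
    if node is None:
        return
    x, y = node["point"]
    if lo[0] <= x <= hi[0] and lo[1] <= y <= hi[1]:
        found.append(node["point"])
    axis = node["axis"]
    if lo[axis] <= node["point"][axis]:
        range_search(node["left"], lo, hi, found)
    if node["point"][axis] <= hi[axis]:
        range_search(node["right"], lo, hi, found)

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
hits = []
range_search(tree, (4, 1), (9, 4), hits)
print(sorted(hits))
```

The same pruning argument is what lets a BKD-style index skip whole blocks of points whose bounding box falls outside the query region.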
|Understanding and Streaming Geospatial Vector Data using Apache Kafka and GeoMesa||Apache Kafka is an industry standard technology for streaming data; GeoMesa uses Kafka for transporting spatial data and uses open source, spatial libraries to build analytics and real-time visualizations.|
Presently, sharing spatial data is hard since there are few common practices around how spatial data should be advertised. This talk will dive into improvements in GeoMesa’s ability to use data from a schema registry to understand how topics with spatial data should be interpreted.
With a better understanding of the spatial data in an enterprise, one is ready to perform analytics. The second half of this talk will discuss details of using Kafka Streams to build spatially-aware data analytics pipelines. This will include topics such as spatial joins, how to partition spatial data, and other optimizations.
|James Hughes and Austin Heyne|
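One classic way to partition spatial data, as hinted at in the description above, is a space-filling curve: encode each point's grid cell as a Z-order (Morton) key so that nearby points tend to share key prefixes. This sketch is our own illustration, not GeoMesa's actual partitioning scheme:

```python
def interleave_bits(x, y, bits=16):
    """Z-order (Morton) encoding: interleave the bits of two grid
    coordinates so nearby cells get numerically nearby keys."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

def partition_for(lon, lat, num_partitions, bits=16):
    """Map a lon/lat point to a Kafka-style partition by bucketing
    its Z-order key; spatially close points tend to co-locate
    (the modulo step sacrifices some locality, see note below)."""
    # Scale degrees onto the integer grid [0, 2^bits)
    x = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
    y = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    return interleave_bits(x, y, bits) % num_partitions

print(interleave_bits(0b11, 0b01))  # -> 0b0111 = 7
```

Real systems typically bucket by key range rather than key modulo, since ranges of a space-filling curve preserve the locality the curve provides.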
|Real-time analytics over Geospatial data at Uber scale with Apache Pinot||By its nature, Uber’s business is highly real-time and contingent upon geospatial data. PBs of data are continuously being collected from our drivers, riders, restaurants, and eaters. Real-time analytics over this geospatial data could provide powerful insights.|
To derive insights from timely and accurate geospatial data, Uber has contributed geospatial support to Apache Pinot. In this talk, we'll introduce the geospatial features including the data model, geospatial functions conforming to the SQL/MM 3 standard, as well as geospatial indexing that greatly accelerates geospatial query evaluation. In particular, the geospatial indexing in Pinot is based on Uber’s H3, a hexagon-based hierarchical geospatial indexing library.
We will also highlight some use cases from the Uber Eats app as examples of how Uber generates real-time insights from our geospatial data.
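The idea behind an H3-style index can be seen with a much cruder analogue: assign each point a hierarchical cell id at some resolution, then aggregate per cell instead of per point. H3 uses hexagons and its own id scheme; this square-grid version is purely illustrative:

```python
from collections import Counter

def cell_id(lon, lat, resolution):
    """Assign a point to a square grid cell; higher resolution
    means smaller cells (H3 uses hexagons; this is an analogue)."""
    size = 1 << resolution  # number of cells along each axis
    col = int((lon + 180.0) / 360.0 * size) % size
    row = int((lat + 90.0) / 180.0 * size) % size
    return (resolution, col, row)

def parent(cell):
    """Coarsen a cell by one resolution level, as hierarchical
    indexes do when a query covers a large area."""
    res, col, row = cell
    return (res - 1, col // 2, row // 2)

# Two nearby points in New York and one in Paris.
points = [(-73.98, 40.75), (-73.97, 40.76), (2.35, 48.86)]
counts = Counter(cell_id(lon, lat, 6) for lon, lat in points)
print(counts)
```

An aggregation query then only touches the per-cell counters at the chosen resolution, which is what makes evaluation fast at scale.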
|Enabling scalable modern GIS with GeoParquet||For years, traditional GIS workflows and databases have been completely siloed away from other data infrastructure, specifically innovations taking place in the Spark ecosystem and other cloud data warehouses and data lakes. Spatial SQL, specifically with PostGIS, enabled more interoperability between databases, but this still requires data transformation and loading, all the while more geospatial data is being produced and growing in scale and velocity. GeoParquet, a new community proposal supported by the Open Geospatial Consortium, has the goal of standardizing the storage of geospatial vector data in Parquet. This talk will focus on why GeoParquet provides a true foundation for modern GIS workflows, both in and outside of the cloud, and can be a foundation to help create more interoperability between traditionally siloed geospatial users and other teams, helping to solve some of the most critical problems today.||Matthew Forrest|
|Geospatial Track Capstone Discussion with Community||This session will culminate the Geospatial and Remote Sensing track. The objective will be to synthesize common elements from the earlier sessions of the track and to move toward recommendations on coordinated activities that advance Apache Projects. The session will begin with panelists offering their own position statements and serving as discussants of the earlier sessions. After the panelists' statements, there will be a discussion open to all in the room. It is anticipated that the synthesis and discussion will highlight specific actions that will reduce development effort through the reuse and increased quality of geospatial and remote sensing information handling within and across Apache Projects.||George Percivall|
October 5-6, Bacchus
|Session Title||Session Description||Presenter(s)|
|Groovy 4 Update||Version 4 of Apache Groovy introduces records, switch expressions, sealed types, custom type checkers, built-in macro methods and incubating features such as JavaShell, Groovy contracts, and language integrated query support. This talk looks at these new features and why they would be of interest to today's JVM developers.|
Groovy is a widely used (passing 1 billion downloads last year) alternative language for the JVM offering both dynamic and static natures, great extensibility, and special support for writing domain-specific languages and succinct code.
|Groovy meets Genetic Algorithms||Groovy is a powerful multi-paradigm programming language for the JVM that offers a wealth of features that make it ideal for many data science and big data scenarios. Part 5 looks at genetic algorithms. Genetic algorithms are a nonlinear optimization technique. They encode a set of solutions, called a population, that are evaluated using a fitness function. Successful solutions are chosen to form new solutions, called offspring, through a process called crossover. In each generation, random mutations are introduced as well, to maintain a good level of genetic diversity. In addition to the basic concepts, examples will be shown in Groovy using the Jenetics library. This talk will show several problems that are ideally suited to genetic algorithms, including the Traveling Salesman Problem, Ant Colony Optimization, the Knapsack Problem, and more. We also briefly look at the Genetic Algorithm support in Apache Commons Math.||Ken Kousen|
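The concepts in the description (population, fitness function, crossover, and mutation) fit in a few lines of code. This toy example, our own sketch in Python rather than Jenetics code, maximizes the number of 1-bits in a bit string (the "OneMax" problem):

```python
import random

random.seed(42)

def fitness(genome):
    """Toy fitness function: count the 1-bits ('OneMax')."""
    return sum(genome)

def crossover(a, b):
    """Single-point crossover combining two parent genomes."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.02):
    """Flip each bit with a small probability to keep diversity."""
    return [bit ^ 1 if random.random() < rate else bit
            for bit in genome]

def evolve(pop, generations=40):
    """Select the fitter half, breed offspring, and carry the best
    individual over unchanged (elitism) each generation."""
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:len(pop) // 2]
        offspring = [mutate(crossover(random.choice(survivors),
                                      random.choice(survivors)))
                     for _ in range(len(pop) - 1)]
        pop = [pop[0]] + offspring
    return max(pop, key=fitness)

population = [[random.randint(0, 1) for _ in range(20)]
              for _ in range(30)]
init_best = max(fitness(g) for g in population)
best = evolve(population)
print(init_best, fitness(best))
```

Libraries like Jenetics wrap essentially this loop behind configurable selection, crossover, and mutation operators.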
|Property-based testing with Spock and jqwik||Property-based testing is an approach to testing that involves checking that a system meets certain expected properties often in the presence of randomized generated test data. The approach is frequently promoted as a desired technique when adopting a functional style of programming but should be of interest to anyone wanting to apply best-practice testing techniques.|
The examples in this talk will be for the JVM platform and use the Spock testing framework and the jqwik property-based testing library, but the concepts are applicable beyond that context.
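The core of property-based testing is small enough to sketch without any framework. The harness below is our own illustration in Python, not jqwik or Spock: generate random inputs, check that a property holds for every one, and report the first counterexample:

```python
import random
from collections import Counter

random.seed(0)

def check_property(prop, gen, trials=200):
    """Run a property against many randomly generated inputs;
    return the first counterexample, or None if all pass."""
    for _ in range(trials):
        value = gen()
        if not prop(value):
            return value
    return None

def random_int_list():
    """Generator for lists of random length and contents."""
    return [random.randint(-50, 50)
            for _ in range(random.randint(0, 20))]

# Properties a correct sort must satisfy, whatever the input:
def sort_is_ordered(xs):
    ys = sorted(xs)
    return all(a <= b for a, b in zip(ys, ys[1:]))

def sort_keeps_elements(xs):
    return Counter(sorted(xs)) == Counter(xs)

print(check_property(sort_is_ordered, random_int_list))
print(check_property(sort_keeps_elements, random_int_list))
```

Real property-based libraries add shrinking: once a counterexample is found, they search for a smaller failing input, which makes debugging far easier.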
|Transpile Groovy code!||Have you ever taken a YAML overdose, in this cloud native era where every product you use is littered with YAML configuration files? Well, I wanted to code in Groovy. And see if I could use my favorite language, in lieu of the required YAML configuration file for the tool I was using. Should I just go with the shiny new YAML builder introduced in Groovy 3? Or do I want something a little more special? I decided instead to transpile my Groovy source code into my programmatic YAML file. In this session, we’ll learn about some of the Domain-Specific Language capabilities of Groovy, and have a closer look at how you can analyze the structure of your Groovy programs to transpile them into a YAML dialect.||Guillaume Laforge|
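The transpilation idea, generating YAML programmatically instead of hand-writing it, can be sketched in a few lines. This toy emitter is our own illustration (in Python rather than Groovy, and unrelated to the talk's actual tool): build the configuration as plain data structures, then serialize:

```python
def to_yaml(value, indent=0):
    """Serialize nested dicts/lists/scalars to a simple YAML
    subset (no quoting or anchors; illustration only)."""
    pad = "  " * indent
    if isinstance(value, dict):
        lines = []
        for key, val in value.items():
            if isinstance(val, (dict, list)):
                lines.append(f"{pad}{key}:")
                lines.append(to_yaml(val, indent + 1))
            else:
                lines.append(f"{pad}{key}: {val}")
        return "\n".join(lines)
    if isinstance(value, list):
        return "\n".join(f"{pad}- {item}" for item in value)
    return f"{pad}{value}"

config = {"service": {"name": "demo", "replicas": 3,
                      "ports": [8080, 8443]}}
print(to_yaml(config))
```

The point of the approach described above is the step before this: letting a real language, with loops, functions, and validation, produce the data structure, so YAML becomes an output format rather than something you maintain by hand.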
|Functional Groovy||There are many advantages to writing programs using a functional style. Groovy is a multi-faceted language which supports both functional and imperative styles of programming. This talk looks at how to use Groovy while adhering to the most popular functional programming idioms.|
Topics covered include using closures, currying and partial evaluation, closure composition, Groovy meta-programming and type checking tricks for the functional programmer, recursion, trampolining, using Java functional libraries, immutable data structures, lazy and infinite lists, and leveraging Java 8 lambdas.
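Several of the idioms listed above (composition, currying via partial application, trampolining) have direct analogues in other languages. A minimal sketch in Python, as our own illustration of the concepts rather than Groovy code:

```python
from functools import partial, reduce

# Function composition: compose(f, g)(x) == f(g(x))
def compose(*fns):
    return reduce(lambda f, g: lambda x: f(g(x)), fns)

# Currying / partial evaluation: fix some arguments now
def add(a, b, c):
    return a + b + c
add_five = partial(add, 2, 3)

# Trampolining: return thunks (zero-argument callables) instead of
# recursing directly, then run them in a constant-stack loop.
def trampoline(fn, *args):
    result = fn(*args)
    while callable(result):
        result = result()
    return result

def countdown(n, acc=0):
    if n == 0:
        return acc
    return lambda: countdown(n - 1, acc + n)

inc_then_double = compose(lambda x: x * 2, lambda x: x + 1)
print(inc_then_double(3))            # -> 8
print(add_five(10))                  # -> 15
print(trampoline(countdown, 100000))
```

Trampolining matters because deep recursion overflows the call stack; returning a thunk per step turns the recursion into a loop, the same trick Groovy's `Closure.trampoline()` provides.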
|Functional Programming in Java, Groovy, and Kotlin||See how features of functional programming are implemented in three different JVM-based languages. Examples include how lambda expressions, method references, and streams are handled differently, as well as higher-order functions, closure composition, trampolining, currying, tail recursion, and more.|
Kotlin, Groovy, and Java are all object-oriented languages with functional features. It's interesting to see which capabilities they implemented in similar ways and which are unique to each language.
|How Grails® framework leverages Apache Groovy||Apache Groovy is a very powerful programming language that is not just used for scripting. The Grails® framework would not be possible without powerful Groovy features such as DSLs, closures, meta-programming, AST transformations, etc.|
In this talk, we will see the features of Apache Groovy that Grails leverages to create supreme developer productivity, and perhaps you can apply those lessons to your own use cases.
|Groogle, when Google meets Groovy||After a while working with the Google API Java library, I realized I was repeating a lot of code in every project, so I wrote a small DSL (domain-specific language) to offer the power of these APIs to users without specific programming knowledge.|
Groogle is a DSL in Groovy that allows the user to work with Drive, Sheets, Gmail, and more Google services without the complexity of Java projects. Simply edit the script with your editor and run it from a terminal.
In this talk I'll show you how easy it is to create a new DSL using Groovy, giving users the freedom to write programs they understand, essentially because they are using their own language.
|Skyrocketing Micronaut microservices into Google Cloud||Instead of spending too much time on infrastructure, take advantage of readily available serverless solutions. Focus on your Micronaut code, and deploy it rapidly as a function, an application, or within a container, on Google Cloud Platform, with Cloud Functions, App Engine, or Cloud Run.|
In this presentation, you’ll discover the options you have to deploy your Micronaut applications and services on Google Cloud. With Micronaut Launch, it’s easy to get started with a template project, and with a few tweaks, you can then push your code to production.
Thanks to its performance, its low memory consumption, and its lightning-fast startup time, Micronaut is particularly well-suited for services that run on serverless solutions.
|Writing Convention around Micronaut framework with Apache Groovy||Apache Groovy is an extremely powerful programming language offering many features such as closures, dynamic programming, and AST transformations. The Micronaut framework is a general-purpose application framework for the JVM built around an annotation-based approach.|
In this session, we will see how we can apply Groovy AST transformations to convert the annotation-based approach of the Micronaut framework into a convention-based approach.
|Groovy-Powered Microservices with Micronaut||The Micronaut Framework makes building performant microservices and serverless applications with Groovy not only practical, but enjoyable! Using AST transformations and AOT compilation, Micronaut helps Groovy to shine by reducing the runtime overhead incurred by traditional frameworks, and this, together with Groovy's support for static compilation, allows you to play your favorite JVM language to its strengths without compromising runtime performance. Come learn how Micronaut can help make your next cloud, serverless, or IoT project a Groovy reality!||Zachary Klein|
|Key Gradle Concepts And Practices||Gradle has been described as the open source project with the most documentation that doesn't help. Key concepts, like the different steps Gradle takes at initialization time, configuration time, and execution time, are not obvious, but must be understood to use Gradle effectively. This talk will cover those topics, as well as how to use source sets, IDE integration, testing in parallel, the build cache, and multi-project builds.|
Other topics to be included based on the interests of attendees will include writing your own custom tasks, using and building plugins, archiving and expanding files and folders, and incremental builds for efficiency.
Recently revised to include dependency conflict resolution, lazy task creation, and more.
October 3-4, Iris
|Session Title||Session Description||Presenter(s)|
|Keep Identities in Sync the SCIMple Way||What if keeping your user stores in sync across domains was as simple as running "java -jar"? With Apache SCIMPle, it is! SCIMple is a SCIM 2.0-compliant server powered by Spring Boot. You can run it standalone or embedded in your existing app. It exposes user management REST endpoints and handles the hassle of user synchronization for you. If your identity provider supports SCIM, use the simple way!||Matt Raible and Brian Demers|
October 5, Endymion
|Session Title||Session Description||Presenter(s)|
|Explaining Trademark Law For FOSS||Are you trying to build the brand of your community-led project? Is your community struggling to keep vendor marketing teams out of your project’s governance? Do you need a lawyer before you can “trademark” something, or can you do it yourself? (Tip: you can do it yourself!)|
This AMA is here to help answer basic trademark law questions in practical, everyday terms for FOSS projects and the companies that contribute to them. Legal advice can only come from your own lawyer - but most community questions have practical answers that can get you started without a lawyer. Trademarks are all about the public’s association of a brand with a product - and most of that happens in the real world, not a lawyer’s office.
Bring your simple community questions about how trademarks work, and we’ll try to get you some practical advice on what to do. Similarly, corporate questions are welcome - for how you can effectively partner with a Foundation or community-led project without stepping on toes.
|How to Slide Your Release Past the Incubator||All podling releases need to be voted on by the incubator PMC before being released to the world. I'll go through what the incubator PMC looks for in every release and what you can do to make it pass the IPMC vote and get your project one step closer to graduation. More importantly, I'll cover where you can get help if you need it. In this talk, I'll describe current incubator and ASF policy, recent changes that you may not be aware of, and go into detail on the legal requirements of common open source licenses and the best way to assemble your NOTICE and LICENSE files. Where possible, I describe the reasons behind why things are done a certain way, which may not always be obvious from our documentation. I'll show how I review a release and the simple tools I use. I'll go through an example or two and cover common mistakes I've seen in releases.||Justin Mclean|
|Open-Source: It's more than just code||When thinking about starting an open-source project, most of us think about the aspect of what we want to achieve and how to technically achieve that goal.
However, I had to learn the hard way that if you want your project to be more than just a fun side project, there are a lot of additional dimensions you need to pay attention to.
Just because your project is awesome, it doesn't mean that you will be successful.
In this talk I want to share what I had to learn the hard way on my journey with Apache PLC4X.
|Growing your contributors base||Ever wondered how to attract more contributors to your project?|
Have you ever thought about how to make sure that your community is rich and thriving, with new people joining and providing new ideas and sometimes small, sometimes bigger contributions?
You might think this is something that happens "on its own" without too much deliberate effort - just build your fantastic project and people will come.
Not that easy. Building the community for your project, and especially attracting new contributors, is almost a full-time job on its own (though it can be split among many people).
In this talk you will hear the story of how the Apache Airflow team approached it, and what led to surpassing Apache Spark and becoming the PMC with the largest number of contributors among all ASF projects.
|Ismaël Mejía & Jarek Potiuk|
|Apache Toree: A Jupyter Kernel for Scala / Apache Spark||Many data scientists are already making heavy use of the Jupyter ecosystem for analyzing data using interactive notebooks. Apache Toree (incubating) is a Jupyter kernel that enables data scientists and data engineers to easily connect to Apache Spark and leverage its powerful APIs from a standard Jupyter notebook to execute their analytics workloads. In this talk, we will go over what's new in the most recent Apache Toree release. We will cover available magics and visualization extensions that can be integrated with Toree to enable better data exploration and data visualization. We will also describe some of Toree's high-level design and how users can extend its functionality through Toree's powerful plugin system. All of this comes with multiple live demos that demonstrate how Toree can help with your analytics workloads in an Apache Spark environment.||Luciano Resende|
|Coding Presentations with Apache Training (Incubating)||I'll never forget working on my first presentation, for ApacheCon NA in Denver in 2014.|
At that time I spent days, almost a week, laying out my slides one by one in PowerPoint.
That works fine if you give one or two talks a year, but it doesn't scale, especially if you like to create individual talks or variants for every talk you give.
After a lot of experimentation, I settled on a setup consisting of:
- Apache Maven
With this setup I can code my presentations in my IDE of choice, have them compiled into nice-looking presentations via Maven, and run them in any browser. I managed to reduce the time spent preparing presentations from days to a few hours, and for a customization, even a few minutes.
This setup has recently become the core of the Apache Training incubating project.
I want to show you how I write presentations with this and demonstrate most of the amazing features it brings.
I also want to encourage others to participate in the Apache Training project, as I think it has a lot of value to add.
Internet of Things
October 3, Endymion
|Session Title||Session Description||Presenter(s)|
|Having fun with a solar panel, a camera, and a Raspberry Pi||How, with a few dollars, you end up doing IoT!|
The talk will present a fun project: a Raspberry Pi powered by a small solar panel that sends images, temperature, and other information to a server.
The project uses an ATtiny to control the panel and the battery, I2C to drive it and to measure temperature with a small sensor, a Raspberry Pi Zero to take pictures, and a small Apache httpd server to control the whole thing remotely.
|Apache PLC4X: State of the Civet||A lot has happened in the Apache PLC4X project in the last years.|
- New Drivers
- New Languages
- New Integrations
- New Features (Discovery, Browsing)
In this talk I want to give an update on what has been happening behind the scenes and which amazing new features we have to offer.
|Ninety Eight RPis on the wall||Ninety-eight. That's the number of devices that have assigned IP addresses on my network. That includes 24 VMs across a KVM hypervisor and Hyper-V, Raspberry Pis, two smart watches, several metal servers, cameras, wireless devices, Google Smart Home products, cellular phones, a doorbell, switches, custom-made smart home products, and even a garage door opener. This doesn't include the Zigbee devices I've installed that work through a centralized hub! At any given point I have devices offline, in need of update, or forgotten. In this presentation we will discuss the complexities of update management with IoT, and extend that discussion into private enclaves where the problem is even more difficult to manage. Further, we'll investigate the risk profile of IoT in relation to the growth of connected devices as a way to better inform consumers.||Marc Parisi|
|Use Cases and Architectures for Data Streaming in Manufacturing and Industry 4.0||The manufacturing industry must process billions of events per day in real-time and ensure consistent and reliable data processing and correlation across machines, sensors, and standard software such as MES, ERP, PLM, and CRM. Deployments must run in hybrid architectures in factories and across the globe in cloud infrastructures. Mission-critical and secure 24/7 operations 365 days a year is normality and a key requirement.|
Learn how data streaming with open-source frameworks such as Apache Kafka provides a scalable, reliable, and efficient infrastructure to make manufacturing companies more innovative and successful in automotive, aerospace, semiconductors, chemical, food, and other industries.
|From leading edge IoT Protocols to Python Dashboarding: an End2End Journey||First, I'd like to give an overview of common IoT protocols:|
- CoAP (Constrained Application Protocol): close to HTTP/REST
- MQTT (Message Queue Telemetry Transport): pub/sub with a broker and well-defined quality of service; recent additions include Eclipse Amlen (formerly the core of the IBM Watson IoT platform) and Eclipse Sparkplug, which standardizes topics and payloads for interoperability
- DDS (Data Distribution Service): pub/sub without a broker; used in drones and robotics
- LwM2M (Lightweight M2M): runs on top of CoAP or MQTT; provides standard sets of payloads for sensors
- Zenoh (https://zenoh.io/): a pub/sub protocol that combines the advantages of DDS and MQTT
(This list is not complete and does not cover industrial and building automation protocols.)
Then I will show the leading-edge IoT protocol Zenoh. After that, I will dive into Panel and the awesome capabilities of Apache ECharts. To process the incoming data from Zenoh, I will show the Feather file format from Apache Arrow. Optionally, if time allows, I will show a basic example (using Modbus and the pymodbus library) of Apache PLC4Py replacing the Zenoh retriever in the Panel web application.
|IIoT analytics made easy with Apache StreamPipes||Many IoT-driven use cases in areas such as manufacturing require the continuous collection, integration, and analysis of sensor data to detect time-critical situations. Many available tools require a high level of IT expertise to do this, and are thus out of reach for less IT-savvy users. In this talk, we will present Apache StreamPipes (incubating), a self-service solution for IoT data analytics that allows users to connect heterogeneous IoT data sources with just a few clicks and then link analytics building blocks based on reusable algorithms, like the Lego principle. The presentation covers technical basics, applications, and a demo, and presents the latest community work.||Dominik Riemer and Philipp Zehnder|
Libraries, Frameworks and Developer Tools
October 3-4, Bacchus
|Session Title||Session Description||Presenter(s)|
|Apache Camel & Quarkus - Supersonic, Subatomic, Integration at the Speed of Cloud||The venerable Apache Camel has been a cornerstone of open source enterprise integration for many years. In recent times it has also become a major source of innovation for cloud-native development when combined with Quarkus. In this session we will discuss how the combination of Camel and Quarkus could be the key to making you more productive and more efficient.||Deven Phillips|
|Apache Maven survival guide “Bring it on! -Mode” #no-external-tools #only-standard-plugins||In many projects, Apache Maven is used and does some stuff - who knows what? In this session, we will pack the best practices from over 10 years of projects into a pom.xml. And you will know what it does and why it's there. Using practical examples, we will look at how to:|
* solve problems in the build
* make the build reproducible
* find security issues in the code
* find security issues in the build
* make the legal department happy
* make the build faster
* reduce the cost of the build
* emit less CO2
We can do all of this without commercial or additional products - only with Maven standard plugins.
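As one concrete example of the reproducible-builds item above, a single standard Maven property pins the artifact timestamp so repeated builds produce byte-identical output (the timestamp value shown is illustrative):

```xml
<properties>
  <!-- Pin the output timestamp so rebuilding the same sources
       yields byte-identical artifacts (Maven reproducible builds). -->
  <project.build.outputTimestamp>2022-10-01T00:00:00Z</project.build.outputTimestamp>
</properties>
```

In practice this value is usually updated automatically at release time, for example by the maven-release-plugin.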
|Diamonds Aren’t Forever: Resolving the Diamond Dependency Problem Among Open Source Java Libraries||In 2017, Google Cloud Platform Client Libraries for Beam, Cloud Storage, BigQuery, Spanner, gRPC, and more were suffering from an inability to produce an up-to-date list of Maven artifacts that all worked together, due to conflicting library versions in their dependency trees. To unravel the tangled web of dependencies that spanned a dozen repos and hundreds of artifacts, we developed and implemented standards for our libraries based on Apache Maven BOMs and semantic versioning. The resulting best practices are relevant to anyone shipping libraries that other Java projects depend on.|
In this talk we’ll discuss both the issues that led us into this state and the practices that got us out of it including:
Static Linkage Checks
Dependency mediation in Maven vs Gradle
Dependency and API surface minimization
Stable and unstable dependencies
Managing breaking changes
Using BOMs to manage dependencies and keep disparate projects in sync
|Elliotte Rusty Harold|
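For illustration, consuming a BOM in Maven means importing it in `dependencyManagement`, after which individual dependencies can omit their versions and are guaranteed to be mutually compatible. The artifact below is the GCP Libraries BOM this work produced; the version number is illustrative:

```xml
<dependencyManagement>
  <dependencies>
    <!-- Import the BOM: all artifacts it manages now resolve
         to a mutually compatible set of versions. -->
    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>libraries-bom</artifactId>
      <version>26.1.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Downstream `<dependency>` entries for the managed libraries can then drop their `<version>` elements entirely.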
|Inside an Apache Ozone Upgrade||Apache Ozone is a rapidly evolving distributed storage system committed to supporting upgrades, downgrades, and full client cross-compatibility. However, the velocity and complexity of new feature development has the potential to jeopardize these guarantees, clutter existing code with workarounds, and slow down development. Ozone's solution is a pluggable framework to handle upgrade-related concerns for all new features. This talk will provide a background on the upgrade flow of an Ozone cluster, then explain Ozone's annotation-based upgrade/downgrade and client/server versioning frameworks that allow us to preserve compatibility and code maintainability while quickly onboarding major features.||Ethan Rose|
|Apache Zookeeper and Curator Meet the Dining Philosophers||A ZooKeeper walks into a pub … (actually an Outback pub), and ends up helping some Philosophers solve their fork resource contention problem. This talk is an introduction to Apache Zookeeper and Apache Curator to solve a new variant of the classic computer science Dining Philosophers problem. We’ll introduce Zookeeper (a mature and widely-used de facto technology for distributed systems coordination) and the Dining Philosophers problem, and explore how we used Apache Curator (a high-level Java client for Zookeeper) to implement the solution and show how it works. We tested the application on a managed Apache Zookeper cloud service, so we can also reveal performance results using a single Zookeeper server vs. an Ensemble.||Paul Brebner|
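The distributed solution in the talk uses Curator recipes such as InterProcessMutex over ZooKeeper; the core deadlock-avoidance idea can be sketched with plain Java locks as a local stand-in (this is not the talk's actual implementation, just the same ordering trick in miniature):

```java
import java.util.concurrent.locks.ReentrantLock;

public class DiningPhilosophers {
    static final int N = 5;
    static final ReentrantLock[] forks = new ReentrantLock[N];
    static int meals = 0;

    public static void main(String[] args) {
        meals = 0;
        for (int i = 0; i < N; i++) forks[i] = new ReentrantLock();
        Thread[] diners = new Thread[N];
        for (int i = 0; i < N; i++) {
            final int p = i;
            diners[i] = new Thread(() -> {
                // Philosopher p needs forks p and (p+1) % N. Always acquiring
                // the lower-numbered fork first breaks the circular wait,
                // so the table can never deadlock.
                int first = Math.min(p, (p + 1) % N);
                int second = Math.max(p, (p + 1) % N);
                forks[first].lock();
                try {
                    forks[second].lock();
                    try {
                        synchronized (DiningPhilosophers.class) { meals++; } // "eat"
                    } finally {
                        forks[second].unlock();
                    }
                } finally {
                    forks[first].unlock();
                }
            });
            diners[i].start();
        }
        try {
            for (Thread t : diners) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println(meals); // 5: every philosopher ate exactly once
    }
}
```

In the distributed variant, each `ReentrantLock` becomes a Curator InterProcessMutex on a ZooKeeper znode, and the same acquisition ordering applies across processes.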
|Mod_wasm: bringing WebAssembly to Apache||WebAssembly (Wasm) is a portable binary format that allows code from a variety of languages such as Rust, C, and C++ to be run securely and efficiently in a variety of scenarios. Wasm has its origins in the web browser, where it is used to provide desktop-like functionality and performance in regular web pages, powering applications such as Google Earth, Unity, and Figma.
WebAssembly is also increasingly being used in the backend, as an extension mechanism for popular server-side software including Envoy and NGINX as well as a runtime for functions as a service, especially in the Edge. This session introduces mod_wasm, an open source module that integrates a WebAssembly runtime with the Apache HTTPD server.
The session will provide an overview of Wasm, introduce the architecture and functionality of mod_wasm, and provide a demo of its capabilities.
|Daniel Lopez Ridruejo and Jesus Gonzalez Marti|
|OpenNLP 2.0 - What's New and Coming||The first release of Apache OpenNLP was over 10 years ago. The NLP landscape has dramatically changed in the past few years with Python taking over as the dominant language for NLP applications. OpenNLP 2.0 introduces support for modern NLP architectures through ONNX models for entity recognition and document classification. We will show how Java applications, such as Apache Solr, can use these new capabilities in OpenNLP 2.0. This talk will highlight OpenNLP’s journey, what’s new in version 2.0, and plans for future versions.||Jeff Zemerick|
|Low-code Visual Integration Design with Apache Camel Karavan||The latest innovation from the Apache Camel community: Karavan, a mastering tool that facilitates integration development by combining visual integration design and an iterative development cycle with automatic hot-reload.|
The session starts with a short Karavan presentation, followed by a live demo showing quick development: from simple low-code integration for beginners to complex visual-plus-code integration for experts, and one-click deployment to Kubernetes.
|Marat Gubaidullin and Claus Ibsen|
|BuildStream - A distribution agnostic integration tool||BuildStream is a tool for building software stacks with a focus on build correctness and determinism.|
We provide a convenient declarative YAML interface for the user to define software build and integration instructions and then execute these instructions safely inside an isolated container environment, eliminating host tool contamination in builds while maximizing build reproducibility.
BuildStream is also extensible with a python plugin interface, allowing the tool to cater to a wide variety of software integration use cases such as generating container images, binary packages targeting specific distributions, and fully integrated bootable operating system images.
In this talk, we will outline the BuildStream mission, and then discuss some of the challenges we are faced with when catering to languages and build systems which provide their own dependency management systems such as rust, go, python, node.js and notably java/maven.
|Tristan van Berkom|
|Apache Iceberg's REST Catalog - A Gateway to Enriching Data Access via the Simplicity of an HTTP Service||This talk will center around the benefits provided by the new REST Catalog in Apache Iceberg.|
Catalogs are a crucial component of Apache Iceberg, as well as many other systems such as Spark, Flink, and newer versions of Hive. However, the problems that catalogs solve are not always clear to users, many of whom are used to interacting with their tables directly via the file store at all times.
Attendees will leave this talk with a better understanding of the primary function of catalogs in Iceberg and supported engines, specifically their role in ensuring ACID compliance in the face of concurrent readers and concurrent writers.
Iceberg’s REST catalog provides a JSON HTTP API for performing all of Iceberg’s core functionality. The REST catalog allows developers to build catalog services that can be run anywhere, using their favorite programming language and their existing tech stack, with a guarantee of operability with open-source Iceberg.
Additionally, the REST catalog allows developers to follow common server-side patterns which they're already comfortable with for adding things like:
- event logging
- any business-specific requirements for which a library already exists
Code examples and a sample micro-service that implements Iceberg’s REST catalog specification will be provided, demonstrating custom authentication when integrating with common Apache big data frameworks.
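As a flavor of the API style, here is a hedged sketch of building a request against one of the spec's core routes (listing namespaces) using only the JDK's HTTP client; the base URL is hypothetical and a real deployment would add authentication headers:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class RestCatalogDemo {
    // Illustrative endpoint; a real deployment supplies its own base URL.
    static final String BASE = "https://catalog.example.com/v1";

    // Build a GET request for the namespace-listing route defined in the
    // Iceberg REST catalog's OpenAPI specification.
    static HttpRequest listNamespaces() {
        return HttpRequest.newBuilder()
                .uri(URI.create(BASE + "/namespaces"))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        System.out.println(listNamespaces().uri());
        // https://catalog.example.com/v1/namespaces
    }
}
```

Sending the request with `java.net.http.HttpClient` and parsing the JSON response is left out here; the point is that any language with an HTTP client can implement or consume the catalog.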
|Event-Driven Microservices with Apache Kafka & Micronaut||Event-driven architectures and event streaming are paradigm-shifting approaches to application development. Unlike simple message queues and inter-service notifications, event-driven applications make their "source of truth" in the stream of occurrences and updates that are occurring real-time across the system, orchestrated and stored by frameworks like Apache Kafka, and ingested and processed by any number of consumers and/or stream processors. The Micronaut Framework provides seamless Kafka integration that eliminates the "gap" between cloud-native microservices and Kafka-powered event-driven applications, with support for declarative producers, consumers, Kafka Streams, and full access to Kafka's palette of APIs. This talk introduces event-driven architecture, Apache Kafka, and showcases Micronaut's integration with Kafka and how this architecture can be leveraged in both legacy and greenfield applications.||Zachary Klein|
|Introduction to the Grails® framework||The Grails® framework is a powerful web application framework for the JVM that has been around for a decade and provides a smoother development experience than any other framework. It allows developers to quickly onboard onto a project, easily find the relevant pieces of code, and focus more on the business aspects of the application.|
In this talk I will showcase some of the powerful features of Grails framework to provide extreme developer productivity.
October 6, Endymion
|Session Title||Session Description||Presenter(s)|
|Learning from 11+ Years of Nightly Apache Lucene Benchmarks||The Lucene community cares deeply about the performance of each Lucene source code change and release. Lucene (or Elasticsearch, OpenSearch, Solr) offers feature-rich, low-latency search for so many users that even tiny optimizations can lead to massive worldwide energy savings. To continuously empower these savings, Lucene developers created a suite of scrappy tooling, luceneutil, based on multiple real-world corpora (Wikipedia, GeoNames, OpenStreetMaps, NYC taxis, Europarl), to measure performance and detailed Java Flight Recorder profiling across a diverse set of “typical” tasks. This macrobenchmark suite has run (mostly) continuously, every night, for the past 11+ years, generating many eyebrow-raising interactive performance graphs. These tools are now our standard benchmarking suite, not only to catch unexpected slowdowns but also for developers to measure the performance impact of a change in the privacy of their Lucene workspace. Lucene has seen revolutionary improvements in these 11+ years. The JDK has as well (from JDK 1.6 to JDK 18!), but many such improvements make benchmarking noisy. This talk will share battle scars and lessons learned from these 11+ years of Lucene’s continuous performance testing.|
|Performance of Apache Ozone on NVMe||Apache Ozone is an open source, scalable, redundant, distributed storage system.|
Apache Ozone is designed with extremely large datasets and new types of workloads in mind, and is optimized for the latest hardware, including NVMe. We continue to iterate and improve its performance as we learn more.
In this talk, we'll present the latest performance benchmarks of Ozone on NVMe hardware, lessons learned, and the improvements we made based on the performance profiles obtained.
|Wei-Chiu Chuang & Ritesh Shukla|
|The Impact of Hardware and Software Version Changes on Apache Kafka Performance and Scalability||Apache Kafka's performance and scalability can be impacted by both hardware and software dimensions. In this presentation, we explore two recent experiences from running a managed Kafka service.|
The first example recounts our experiences with running Kafka on AWS’s Graviton2 (ARM) instances. We performed extensive benchmarking but didn’t initially see the expected performance benefits. We developed multiple hypotheses to explain the unrealized performance improvement, but we could not experimentally determine the cause. We then profiled the Kafka application, and after identifying and confirming a likely cause, we found a workaround and obtained the hoped-for improved price/performance.
The second example explores the ability of Kafka to scale with increasing partitions. We revisit our previous benchmarking experiments with the newest version of Kafka (3.X), which has the option to replace Zookeeper with the new KRaft protocol. We test the theory that Kafka with KRaft can “scale to millions of partitions” and also provide valuable experimental feedback on how close KRaft is to being production-ready.
|Paul Brebner and Hendra Gunadi|
|Improving Cassandra Client Load Balancing in the Cloud||Out of the box the Datastax Java Driver for Apache Cassandra® 4.x implements a reasonable choice-of-two over replicas for token aware queries, which is a significant improvement over the random algorithm from the 3.x driver. The implementation, however, can be improved in the cloud by introducing latency aware concurrency weighting and removing expensive atomic operations. At Netflix, we were able to scale our clients much further by introducing a new load balancing algorithm: Weighted Least Loaded Load Balancing (WLLLB).|
In this talk we present classical distributed load balancing algorithms, and how they perform in simulated and real world situations. Finding the existing algorithms unable to optimally load balance to Cassandra coordinators, we set out to implement a simple, fast, and highly performant solution. The resulting algorithm demonstrated up to 40% reduction in client side latency in real-world workloads, allowing us to query our Cassandra datasets in mere hundreds of microseconds in modern AWS environments.
Furthermore, we present the results of our chaos testing of the new algorithm, showing that it is resilient to coordinator maintenance (restarting Cassandra nodes), latent coordinators (slightly slow coordinators) and garbage collection pauses (very slow coordinators). With the improvements in latency, availability, and reliability this load balancing algorithm has brought us significant gains across Netflix's fleet of diverse Apache Cassandra workloads.
|Joey Lynch and Ammar Khaku|
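For background, the choice-of-two baseline that the 4.x driver starts from can be sketched in a few lines: sample two distinct replicas and send to the less loaded one. The talk's WLLLB algorithm refines this idea with latency-aware weighting; the counts and seed below are illustrative, not from the talk:

```java
import java.util.Random;

public class ChoiceOfTwo {
    // In-flight request counts per coordinator, indexed by node id.
    static int[] inFlight;
    static final Random rnd = new Random(42);

    // Choice-of-two: sample two distinct nodes at random, pick the less loaded.
    static int pick() {
        int a = rnd.nextInt(inFlight.length);
        int b = rnd.nextInt(inFlight.length);
        while (b == a) b = rnd.nextInt(inFlight.length);
        return inFlight[a] <= inFlight[b] ? a : b;
    }

    public static void main(String[] args) {
        inFlight = new int[]{0, 0, 0};
        // Dispatch 300 requests; choice-of-two keeps the per-node load
        // far more even than uniform random selection would.
        for (int i = 0; i < 300; i++) inFlight[pick()]++;
        for (int load : inFlight) System.out.println(load);
    }
}
```

Weighting each candidate by observed latency and removing the atomic bookkeeping on the hot path are the kinds of refinements the Netflix algorithm adds on top of this skeleton.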
|JMeter scripting: the missing piece||The presentation will first dive into the main pros and cons of JMeter usage comparing it to other main OSS alternatives (like Gatling, Taurus & K6).|
Then we will present jmeter-java-dsl (https://abstracta.github.io/jmeter-java-dsl/) as a new alternative for JMeter users and performance engineers in general. JMeter DSL significantly improves the JMeter experience and moves performance testing closer to developers and the development process.
JMeter DSL will be presented through a demo, implementing a simple performance test, highlighting the main features and benefits of such approach.
The demo will include a comparison of JMX and DSL testplan visibility and readability, IDE autocompletion and inline documentation, live reporting with Grafana, scalability with Blazemeter (or JMeter distributed run) and jmx2dsl built-in tool for easy migration.
The talk will end with a summary, a call to action and 10 mins for Q&A.
|Build your high performance ML inference services||With the growing demands of the machine learning (ML) industry, more and more requests come from modeling teams asking for high-performance infrastructure that can support both offline and online ML inference. Most of the challenges come from the productionization process: How do we optimize tail latency (P90/P99)? How do we solve offline ML inference scaling issues? In this session, Qing will walk you through the design of a high-performance online/offline ML system by sharing stories the team encountered in practice, along with general techniques for finding bottlenecks and optimizing the infrastructure. The infrastructure is built on top of open-source solutions, including Apache Spark, DeepJavaLibrary, Java Spring, and more. By the end of the session, the audience should understand the general pitfalls in ML systems and the high-performance open-source ML architecture that Amazon currently uses.||Qing Lan|
Pulsar Event Streaming
October 5-6, Iris
|Session Title||Session Description||Presenter(s)|
|How to Become an Apache Pulsar Contributor||Making the jump from user to contributor can be intimidating. Perhaps you have an idea to help improve Pulsar, but it's not clear how you should share it. Perhaps you just want to stay in the loop and amplify Pulsar's adoption, but you're not sure how. In this talk, we'll try to break down barriers to help you understand how you too can contribute to the Pulsar Community. We'll start with two Pulsar Contributors describing their experiences and lessons learned as contributors to the Pulsar project. Then, we'll discuss the available Pulsar Community communication tools and how to use them. Finally, we'll focus on enumerating some traditional and non-traditional ways that you can become a contributor/promoter, while also giving practical advice and encouragement on how to engage meaningfully with the Pulsar Community.||Michael Marshall|
|Exploring Apache Pulsar Streaming Platform for MLOps||The exploding interest in machine learning technology in recent years has introduced many aspects of computing required to support the grand vision of what ML is capable of delivering. On the infrastructure layer, we need to handle high-frequency data ingestion with low latency, and one of the best mechanisms we can leverage is streaming. So what is streaming, and what platform choices do we have? We will survey a few options, then zoom in on what a true next-generation, cloud-native streaming platform such as Apache Pulsar is capable of, beyond the more common messaging platforms available today.||Mary Grygleski|
|Performance tuning for Apache Pulsar Kubernetes deployments in the cloud||Apache Pulsar can be deployed on cloud provider managed Kubernetes services with ease. However, one will quickly realize that the out-of-the-box performance is not optimal for high-volume use cases. In this talk, common bottlenecks are demonstrated using an Apache Pulsar and Apache BookKeeper deployment in GCP. It will be shown how monitoring can be used to detect the issues. Furthermore, the bottlenecks are resolved by tuning the system by changing configuration parameters. If you are interested in learning more about Apache Pulsar operations and performance tuning, this talk is for you.||Lari Hotari|
|Optimizing Speed and Scale of Real-Time Analytics Using Apache Pulsar and Apache Pinot||Apache Pulsar is a new generation of platform that offers enterprise-grade event streaming and processing capabilities built for today's Cloud Native environment. But what do you do if you want to perform user-facing, ad-hoc, real-time analytics too? That's where Apache Pinot comes in.|
Apache Pinot is a real-time distributed OLAP datastore used to deliver scalable real-time analytics with low latency. It can ingest data from batch data sources (S3, HDFS, Azure Data Lake, Google Cloud Storage) as well as streaming sources such as Pulsar. Pinot is used extensively at LinkedIn and Uber to power many analytical applications such as Who Viewed My Profile, Ad Analytics, Talent Analytics, and Uber Eats, serving 100k+ queries per second while ingesting 1 million+ events per second.
Apache Pulsar is a highly performant, distributed, fault-tolerant, real-time publish-subscribe and queueing messaging platform that operates seamlessly in a cloud-native environment, with support for geo-replication, multi-tenancy, data warehouse or data lake integrations, and beyond. It is a tried-and-true platform with major enterprise users such as Yahoo, Verizon, GM, and Comcast.
Best of all, Apache Pulsar and Apache Pinot together represent a blissful union in the #OSS "heaven"!
|Karin Wolok and Mary Grygleski|
|Pulsar Heartbeat||To manage Pulsar clusters, we developed Pulsar Heartbeat to monitor the service availability of a Pulsar cluster, track the latency of Pulsar's message pub-sub protocol, and alert on failures of Pulsar components. The tool has become a key component of our Pulsar cluster monitoring stack and has been deployed in all of our clusters. We have open sourced the software to share our best practices for Pulsar cluster management.||Ming Luo|
|Introducing TableView: Pulsar's database table abstraction||In many use cases, applications are using Pulsar consumers or readers to fetch all the updates from a topic and construct a map with the latest value of each key for the messages that were received.|
The new TableView consumer supports this access pattern directly in the Pulsar client API itself, and encapsulates the complexity of constructing such a local cache manually. In this talk, we will demonstrate the new TableView consumer using a simple application and discuss best practices and patterns for using it.
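The access pattern being encapsulated can be sketched in a few lines. This is a minimal, self-contained illustration of the "latest value per key" cache the abstract describes, not the Pulsar TableView API itself (the class and method names here are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the pattern a TableView encapsulates: replay every
// (key, value) update from a topic, in publish order, and keep only
// the most recent value seen for each key.
public class LatestValueCache {
    private final Map<String, String> latest = new HashMap<>();

    // Called once per message received from the topic.
    public void onMessage(String key, String value) {
        latest.put(key, value); // later updates overwrite earlier ones
    }

    public String get(String key) {
        return latest.get(key);
    }
}
```

With TableView, the client library maintains this map for you as messages arrive, so application code only deals with the materialized key-value view.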
|Citizen Streaming Engineer - A How To||Democratizing the ability to build streaming data pipelines lets everyone who needs streaming data do it themselves. Utilizing the open source streaming stack known as FLiPNS lets us achieve this.|
FLiPNS is a stack of Apache Flink, Apache Pulsar, Apache Spark and Apache NiFi. Apache NiFi provides a web UI for citizen streaming engineers to build their own data pipelines. These citizen applications can ingest data, then route, transform, enrich, join and store it. As part of these applications, Pulsar provides a streaming data hub for ML model access as well as a place to plug in additional processing components.
|Starlight-for-RabbitMQ, powered by Apache Pulsar||Starlight for RabbitMQ is an open-source compatibility layer for Apache Pulsar that helps developers use their legacy RabbitMQ applications with Pulsar, a modern streaming platform.|
This talk will highlight the key architectural differences between a traditional message broker such as RabbitMQ and a log-based streaming platform such as Pulsar. Then we will look at the challenges that were overcome while adapting the former to the latter.
|Securing Pulsar for a Zero Trust Environment||Apache Pulsar handles mission critical data in motion. When that data is sensitive, it’s essential to protect it using a Zero Trust Security Model. In this talk, we’ll provide an overview of Zero Trust Principles and how to deploy a Zero Trust Pulsar cluster. We’ll cover Pulsar’s available authentication modes and how they integrate with Pulsar’s built-in authorization provider. We’ll also touch on options for configuring authentication for Zookeeper and Bookkeeper. Then, we’ll take a deep dive on how to configure TLS encryption to secure in-transit data. We’ll describe the network topology of a Pulsar cluster by enumerating every connection opened by each Pulsar component, and show exactly how to configure each of those Pulsar components to use TLS for every connection. Finally, we’ll cover the different protocols used in a cluster in order to define the minimum set of hostnames required to enable hostname verification.||Michael Marshall|
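For a flavor of the in-transit encryption step this talk covers, the broker side boils down to a handful of standard `broker.conf` TLS parameters (the file paths below are placeholders, and this sketch is not the talk's full configuration; the proxy, clients, ZooKeeper and BookKeeper need analogous settings):

```properties
# broker.conf -- enable TLS listeners for the binary and HTTP protocols.
brokerServicePortTls=6651
webServicePortTls=8443

# Broker certificate, private key, and the CA used to verify peers.
tlsCertificateFilePath=/path/to/broker.cert.pem
tlsKeyFilePath=/path/to/broker.key-pk8.pem
tlsTrustCertsFilePath=/path/to/ca.cert.pem
```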
|Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd.||Starting with version 2.10, the Apache ZooKeeper dependency has been eliminated and replaced with a pluggable framework that enables you to reduce the infrastructure footprint of Apache Pulsar by leveraging alternative metadata and coordination systems suited to your deployment environment.|
In this talk, I will walk you through the steps required to use the etcd service already running inside Kubernetes as Pulsar's metadata store, eliminating the need to run ZooKeeper entirely and leaving you with a ZooKeeper-less Pulsar.
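The switch itself is driven by the pluggable metadata store URLs in `broker.conf`. A sketch, assuming an etcd service reachable in-cluster (the hostname below is illustrative, not from this talk):

```properties
# broker.conf -- point the metadata and configuration stores at etcd
# instead of ZooKeeper (the etcd: URL scheme selects the plugin).
metadataStoreUrl=etcd:http://etcd.default.svc.cluster.local:2379
configurationMetadataStoreUrl=etcd:http://etcd.default.svc.cluster.local:2379
```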
|Understanding Broker Rebalancing||Pulsar is a horizontally scalable messaging system, so the traffic in a logical cluster must be balanced across all the available Pulsar brokers as evenly as possible, in order to ensure full utilization of the broker layer.|
You can use multiple settings and tools to control the traffic distribution, which requires some context on how traffic is managed in Pulsar.
In this talk, we will walk you through the load balancing capabilities of Apache Pulsar, and highlight some of the mechanisms available to control the distribution of load across the Pulsar brokers. Finally, we will discuss the various load shedding strategies that are available.
At the end of the talk, you will have a better understanding of how Pulsar's broker-level auto-balancing works, and how to configure it properly to meet your workload demands.
|Deep Dive into Building Streaming Applications with Apache Pulsar||In this session I will get you started with real-time, cloud-native streaming programming with Java, Golang, Python and Apache NiFi.|
I will start off with an introduction to Apache Pulsar and setting up your first standalone cluster in Docker. We will then go over terms and architecture so you have an idea of what is going on with your events.
I will then show you how to produce and consume messages to and from Pulsar topics, as well as how to use the command line and REST interfaces to monitor, manage and perform CRUD operations on things like tenants, namespaces and topics.
After this session you will be able to build simple real-time streaming and messaging applications with the language or tool of your choice.
|Timothy J Spann|
October 6, Rex
|Session Title||Session Description||Presenter(s)|
|Neural Search Comes to Apache Solr: Approximate Nearest Neighbor, BERT and More!||Learning To Rank has been the first integration of machine learning techniques with Apache Solr allowing you to improve the ranking of your search results using training data.|
One limitation is that documents have to contain the keywords that the user typed in the search box in order to be retrieved (and then reranked).
For example, the query “jaguar” won’t retrieve documents containing only the terms “panthera onca”.
This is called the vocabulary mismatch problem.
Neural search is an Artificial Intelligence technique that allows a search engine to reach those documents that are semantically similar to the user’s information need without necessarily containing those query terms; it learns the similarity of terms and sentences in your collection through deep neural networks and numerical vector representation (so no manual synonyms are needed!).
This talk explores the first Apache Solr official contribution about this topic, available from Apache Solr 9.0.
We start with an overview of neural search (Don’t worry - we keep it simple!): we describe vector representations for queries and documents, and how Approximate K-Nearest Neighbor (KNN) vector search works.
We show how neural search can be used along with deep learning techniques (e.g., BERT) or directly on vector data, and how we implemented this feature in Apache Solr, giving usage examples!
Join us as we explore this new exciting Apache Solr feature and learn how you can leverage it to improve your search experience!
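For a flavor of what the contribution looks like in practice, a dense vector field can be declared in the Solr 9 schema and queried with the `knn` query parser. The field name and vector dimension below are illustrative, not from this talk:

```xml
<!-- managed-schema: a dense vector field type and field -->
<fieldType name="knn_vector" class="solr.DenseVectorField"
           vectorDimension="4" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
```

A query such as `{!knn f=vector topK=10}[0.1, 0.2, 0.3, 0.4]` then retrieves the ten documents whose vectors are nearest to the query embedding.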
|Overview and Recent Improvements of Lucene's Faceted Search||Apache Lucene supports rich faceted search capabilities, allowing users to analyze and organize indexed documents across many dimensions, and implement drill down, up and sideways functionality to quickly apply, remove and change search result filters. Large-scale search applications, such as Amazon’s Product Search and MongoDB’s Atlas Database, leverage these capabilities, but many users don't realize Lucene offers these powerful features since Solr, Elasticsearch and OpenSearch implement their own navigation capabilities. While development on Lucene's faceting was somewhat inactive from 2017 to 2020, 2021 and 2022 have seen a significant uptick in activity, with at least 17 diverse (in geo, company, timezone) community members involved in creating new features and optimizations. This talk will give an overview of Lucene's faceted search capabilities, and highlight some new features and performance improvements made over the past two years. It will cover the two broad types of faceting capabilities: enumerable faceting and numeric faceting. It will also discuss the two different enumerable faceting implementations—Taxonomy-based facets and SortedSetDocValue facets—and new developments in both. Finally, the talk will cover some specific performance optimizations made over the last year, which have produced query-per-second benchmark improvements ranging from +20% to +400%.||Greg Miller|
|The Making of Lucene Vector Search||Nearest-neighbor (KNN) search in high-dimensional vector "embedding" spaces has become an essential part of the search toolkit. It finds uses not only in audio and image domains, but also in text-oriented applications. While term matching over tokenized text still predominates in the text world, there are subject areas where machine-learned KNN matching outshines it, and using text and vector search together can provide the best results of all.|
Apache Lucene now includes the KnnVector field type that enables scalable, efficient, and precise approximate vector search using the HNSW algorithm. In this talk we'll describe the feature in detail, and relate the open source journey that brought it to Lucene, propelled by deep collaboration from a worldwide community of contributors. Since its inception, we have continued to see rapid innovation in Lucene's vector search; 9.1 adds support for a truly hierarchical index (H in HNSW), and pre-filtering (finding top K matches subject to a constraint), as well as performance enhancements and improvements to the index encoding. We'll give a brief overview of the algorithm used and describe how we adapted it to Lucene's near-real-time indexing architecture. And we'll describe some of the innovations we look forward to seeing in the future from Lucene's vibrant and active community.
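To ground the terminology: exact k-NN is simply a scan over all indexed vectors ranked by similarity, and HNSW approximates that top-k result at a fraction of the cost. The following is a self-contained sketch of the exact baseline, not Lucene's implementation (class and method names are invented for illustration):

```java
import java.util.Arrays;

// Brute-force exact k-NN over float vectors -- the baseline that
// approximate structures like HNSW trade a little recall against
// in exchange for large speedups at scale.
public class BruteForceKnn {
    // Cosine similarity between two vectors of equal length.
    static float cosine(float[] a, float[] b) {
        float dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return (float) (dot / (Math.sqrt(na) * Math.sqrt(nb)));
    }

    // Returns the indices of the k indexed vectors most similar to the query.
    static int[] topK(float[][] index, float[] query, int k) {
        Integer[] ids = new Integer[index.length];
        for (int i = 0; i < ids.length; i++) ids[i] = i;
        // Sort all candidates by descending similarity, then keep the first k.
        Arrays.sort(ids, (x, y) ->
                Float.compare(cosine(index[y], query), cosine(index[x], query)));
        int[] out = new int[k];
        for (int i = 0; i < k; i++) out[i] = ids[i];
        return out;
    }
}
```

The scan is O(n) per query; HNSW's layered graph reduces the number of candidates visited, which is what makes the KnnVector field practical on large indexes.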
|Decoupling Indexing and Search scalability in Apache Lucene applications that use Segment Replication.||Indexing and search cater to different parts of a search engine, with different compute requirements. Apache Lucene’s near-real-time segment replication allows users to fully decouple indexing from search, and reap efficiency gains by providing the right resources to each of them. This opens doors for independently scalable indexer and searcher shard geometries. Indexing heavy applications could benefit from a wider indexing fleet with a well replicated narrow search fleet. Applications with heavy search compute, like inference models, could save costs on the indexing side.|
While this is good in theory, most search engines that process real-time updates end up with a tight 1-to-1 coupling between indexing and search shards, despite using segment replication. Clearly, something is missing...
This talk dives into why this coupling is hard to break, and into the challenges real-world applications must solve to achieve it, like optimistic concurrency control using a version field and ensuring top-level data consistency across machines. Apache Lucene, with its write-once, segment-based design, has the right foundation to solve these problems. This talk shares how Lucene's existing APIs and recent concurrency improvements are being used to tackle these problems at Amazon Product Search.
|Vigya Sharma and Tue Bui|
|Analyzing Search Clusters Performance with Gatling and Kubernetes||With a number of improvements made in open source search engines like Apache Solr and Elasticsearch in terms of features, ease of use, and documentation, configuring and deploying search clusters of massive scale is possible with limited resources. As easy as it is to set up a distributed Apache Solr or Elasticsearch cluster, getting the correct logical and physical layout, taking into account traffic load, is extremely hard. Revenue-driven e-commerce companies constantly invent and strive for robust, dynamic performance testing tools to determine pain points, misconfigurations and the optimal infrastructure for their search clusters, for events like Black Friday.|
This talk focuses on a robust performance testing and analysis strategy for Apache Solr and Elasticsearch, combining the open-source orchestration tool Kubernetes and the capable load testing tool Gatling to simulate various real-life scenarios involving a conjunction of query and data ingestion load, in order to analyse the performance and behaviour of Apache Solr and Elasticsearch clusters from an end-user perspective. The talk involves live demonstrations and concludes with a discussion of extensibility to other search engines and the future scope of the project.
|Give your Solr Queries some Love with Quepid||You are fighting to make your search engine return the right results for your queries. You are drowning in synonyms, analysis chains, and the Lucene explain syntax. You need to give your queries some love; you need Quepid! Quepid is an ASL 2.0 licensed open source web application that makes relevancy tuning a test-driven process. You can capture "what is good search" from your business users, and then have a safe space to experiment with Solr (and Elastic/OpenSearch). Quepid is the indispensable tool for quickly improving the quality of your search using classic Information Retrieval metrics like NDCG. Come to this session to learn hands-on how to use Quepid.||Eric Pugh|
October 5, Rhythms II
|Session Title||Session Description||Presenter(s)|
|New and Upcoming||The session will cover the new and upcoming features in the Tomcat 10.1 release, and beyond with 11. It will also give an overview on the state of the project.||Remy Maucherat and Jean-Frederic Clere|
|HTTP/2, HTTP/3 the status of the art in our servers||As HTTP/3 is still getting ready, we will look at where we are with it in our servers. The "old" HTTP/2 protocol and the corresponding TLS/SSL are common to Traffic Server, HTTP Server and Tomcat. The presentation will briefly explain the new protocol, look at its state in our three servers, and show the common parts and the specifics of each server. A demo configuration of each server supporting HTTP/3 will be run.||Jean-Frederic Clere|
|Panama: A case study with OpenSSL and Tomcat||The presentation will focus on the use of the OpenSSL native library with Tomcat, and will show how the Panama API (now available as preview in Java 19) was leveraged to rewrite its integration using only Java code, while retaining the performance and capabilities of the existing tomcat-native JNI code.||Remy Maucherat and Jean-Frederic Clere|
|Automating your Tomcat with Ansible||This is an interactive session with an experienced Tomcat and Ansible engineer to:
- Receive hands-on experience via live demo with automating Tomcat deployments and updating processes using Ansible
- Provide direct feedback on how your applications and environments can utilize this Ansible collection and playbooks
- Demonstrate how Ansible can be utilized to automatically test in-progress releases
By leveraging Ansible we can replace the technical debt and development time spent on every user's unique build environment with a standardized, scalable solution that automates building, deploying, and updating your Apache Tomcat instances.
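A deployment like the one demonstrated might be sketched as a playbook along these lines. The version, paths, and service name below are assumptions for illustration, not the session's actual collection:

```yaml
# Illustrative sketch -- download, unpack, and restart Tomcat.
- hosts: tomcat_servers
  become: true
  tasks:
    - name: Download Apache Tomcat
      ansible.builtin.get_url:
        url: "https://archive.apache.org/dist/tomcat/tomcat-10/v10.1.0/bin/apache-tomcat-10.1.0.tar.gz"
        dest: /tmp/apache-tomcat-10.1.0.tar.gz

    - name: Unpack Tomcat into the install directory
      ansible.builtin.unarchive:
        src: /tmp/apache-tomcat-10.1.0.tar.gz
        dest: /opt
        remote_src: true

    - name: Restart the Tomcat service
      ansible.builtin.service:
        name: tomcat
        state: restarted
```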
|Proxying to Tomcat with httpd||Although mostly known as a fast and reliable web server, Apache httpd also excels as a reverse proxy. In this session, find out how to set up httpd as a reverse proxy and how to connect to Tomcat using HTTP/1.1, HTTP/2 and AJP. We will also look at the full feature list of Apache httpd's proxying capabilities, and at how to move from mod_jk and mod_proxy_ajp to a safer proxy module.||Jean-Frederic Clere|
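The basic reverse-proxy setup discussed here boils down to a few mod_proxy directives. A minimal sketch, with illustrative hostnames and paths:

```apache
# Load the proxy core and the HTTP backend module.
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so

# Forward requests under /app to a Tomcat instance on port 8080,
# and rewrite redirect headers coming back from it.
ProxyPass        "/app" "http://localhost:8080/app"
ProxyPassReverse "/app" "http://localhost:8080/app"
```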
|Migrating from AJP to HTTP: It's About Time||The Apache JServ Protocol was developed in 1997 as a proxying protocol between Apache httpd and Apache Jserv. At the time, mod_proxy was not an option for connecting to Apache Jserv, so Apache mod_jk was developed and generations of developers have used it to great effect. But AJP has some serious flaws, including lack of encryption and the inability to upgrade connections to use Websockets. In the intervening years, mod_proxy has become much more fully-featured and can solve all the problems with using AJP. We will cover all of the reasons AJP should be abandoned, all the nice things mod_jk does for you, and how to achieve the same results using mod_proxy with the http and wstunnel child-mods.||Christopher Schultz|
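The migration the talk describes can be sketched as a config change: swap the `ajp://` backend for `http://`, and add a wstunnel mapping so WebSocket upgrades keep working. Hostnames and paths below are illustrative:

```apache
# Before: AJP via mod_proxy_ajp (or the equivalent mod_jk worker)
#ProxyPass "/app" "ajp://localhost:8009/app"

# After: mod_proxy_http for regular traffic, mod_proxy_wstunnel for
# WebSocket connections. The more specific ws:// mapping comes first.
ProxyPass        "/app/ws" "ws://localhost:8080/app/ws"
ProxyPass        "/app"    "http://localhost:8080/app"
ProxyPassReverse "/app"    "http://localhost:8080/app"
```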