Apache Geospatial Track

Wednesday 16:15 UTC
Apache Geospatial: Open Source and Open Standards
George Percivall

The Apache Geospatial Track provides the latest in applying Apache projects to geospatial data and processing. Beginning in 2016, the track has provided a venue for geospatial applications using open source from Apache and other open source foundations. The track includes a focus on the use of open standards to enable interoperability and code reuse between independent software developments. The Geospatial Track for ApacheCon 2020 will include projects from the Apache Software Foundation as well as from LocationTech (http://locationtech.org) and the Open Source Geospatial Foundation (https://www.osgeo.org/). Open standards will be included from the Open Geospatial Consortium (http://www.ogc.org/). Open standards enable reuse of common elements for geospatial information and processing, resulting in increased productivity, lower interoperability friction, and higher data quality. Standards for Coordinate Reference Systems (CRSs), spatial geometries, and data arrays used in projects with geospatial content will be described based on open source projects and open standards. Emphasis is placed on the use of open standards, including the emerging baseline of OGC APIs. OGC APIs are being defined for geospatial resources, e.g., maps, tiles, features, and coverages. Developed using OpenAPI, the APIs can be implemented in a number of languages and patterns. The presentation will describe the state of implementations and plans for standardization. The modular structure gives developers the flexibility to reuse OGC APIs in their own APIs. If open source for geospatial is of interest to you, join the discussion on geospatial@apache.org.
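
As a concrete illustration of the resource-oriented pattern the OGC APIs follow, the sketch below builds an OGC API - Features `/items` request URL. The server address is hypothetical and the helper function is our own invention for illustration, not part of any OGC library.

```python
# Sketch: constructing OGC API - Features request URLs.
# The OGC API family exposes geospatial resources through a small set of
# path templates described in OpenAPI documents, e.g.
#   /collections/{collectionId}/items?bbox=...&limit=...

def features_items_url(base, collection, bbox=None, limit=10):
    """Build an OGC API - Features items request URL.

    bbox is (min_lon, min_lat, max_lon, max_lat) in CRS84 order.
    """
    url = f"{base}/collections/{collection}/items?limit={limit}"
    if bbox is not None:
        url += "&bbox=" + ",".join(str(v) for v in bbox)
    return url

# Hypothetical endpoint, for illustration only.
base = "https://example.org/ogcapi"
print(features_items_url(base, "lakes", bbox=(-180, -90, 180, 90)))
```

The same path templates apply to the other resource types (maps, tiles, coverages), which is what makes the modular reuse described above possible.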

George Percivall serves as CTO and Chief Engineer of the Open Geospatial Consortium (OGC). As CTO, he works with OGC members on strategic technology across OGC programs and leads OGC Technology Forecasting. As Chief Engineer, he leads the OGC Architecture Board. Prior to OGC, Mr. Percivall was Chief Engineer with Hughes Aircraft for NASA's Earth Observing System Data and Information System and principal engineer for NASA's Digital Earth Office; he applied systems engineering on General Motors' EV1 program and was a control system engineer on Hughes satellites. He holds a BS in Engineering Physics and an MSEE from the University of Illinois.

Wednesday 16:55 UTC
Visualize Apache SIS capabilities with raster data
Martin Desruisseaux

Apache SIS is a Java library that helps developers create their own geospatial applications. SIS follows closely the international standards published jointly by the Open Geospatial Consortium (OGC) and the International Organization for Standardization (ISO). But the core standards implemented by SIS are abstract, and their practical use is non-obvious for developers unfamiliar with OGC/ISO conceptual models. Recently a JavaFX application has been developed for showing Apache SIS in action. The main purpose of SIS is still to provide building blocks that developers can use in their own applications, but the SIS application can also be used as-is for navigating raster data. Using that application, some ISO 19115 elements (the common metadata structure used by SIS for representing information stored in the headers of various data formats) get a visual aspect. Some ISO 19111 concepts (the model for reference systems and operations) can be more easily explored. Jacobian matrices (a SIS feature) can be seen in action in the context of map projections. Apache SIS is an implementation of the GeoAPI 3.0.1 interfaces, which are developed by OGC. Another GeoAPI implementation created during the year is a binding to the PROJ 7 C++ API. We will show how GeoAPI allows the use of alternative implementations such as PROJ 7 in a Java application. GeoAPI interoperability with the Python language (work in progress) will also be demonstrated. Finally, recent development of Apache SIS support for raster data (work in progress) will be presented, with an introduction to its API. The emphasis will be on Earth observation data.
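
As a taste of what Jacobian matrices mean in the context of map projections, here is a small, self-contained sketch (in Python, independent of Apache SIS, which is a Java library): it approximates the Jacobian of a spherical Mercator projection by finite differences. The diagonal terms show how a step in longitude or latitude stretches on the map, which is exactly the kind of derivative information a projection library can expose.

```python
import math

R = 6371000.0  # spherical Earth radius in meters (illustrative value)

def mercator(lon_deg, lat_deg):
    """Forward spherical Mercator: degrees in, projected meters out."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    return R * lon, R * math.log(math.tan(math.pi / 4 + lat / 2))

def jacobian(lon_deg, lat_deg, h=1e-6):
    """Numeric Jacobian d(x, y)/d(lon, lat) by forward differences."""
    x0, y0 = mercator(lon_deg, lat_deg)
    x1, y1 = mercator(lon_deg + h, lat_deg)   # perturb longitude
    x2, y2 = mercator(lon_deg, lat_deg + h)   # perturb latitude
    return [[(x1 - x0) / h, (x2 - x0) / h],
            [(y1 - y0) / h, (y2 - y0) / h]]

# At 45 degrees north the north-south scale exceeds the east-west scale
# by a factor of 1/cos(45) -- the familiar Mercator area distortion.
J = jacobian(0.0, 45.0)
```

Analytically, dx/dlon = R·π/180 and dy/dlat = (R/cos φ)·π/180 per degree, so the numeric matrix can be checked term by term.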

Martin did a Ph.D. thesis in oceanography, but has continuously developed tools for helping analysis work. He used C/C++ before switching to Java in 1997, and has developed geospatial libraries since that time, initially as a personal project and then as a GeoTools contributor until 2008. He has been contributing to Apache SIS since 2013. He attends Open Geospatial Consortium (OGC) meetings about twice per year in order to follow standards development closely and improve Apache SIS conformance to those standards. He works at Geomatys, a small IT services company specialized in the development of geoportals. Geomatys is an OGC member and develops a stack of open source software for spatial applications, with Apache SIS as the foundation, to which Geomatys contributes actively.

Wednesday 17:35 UTC
pygeoapi: an OSGeo community project implementing OGC API standards
Tom Kralidis, Francesco Bartoli

The proliferation of REST as an architectural style, as well as OpenAPI, has resulted in broader adoption of a leaner service contract and in the OGC developing a new generation of API specifications in support of discovery, access, visualization and processing of geospatial data. These efforts are aimed at lowering the barrier to implementation, especially for mass-market/non-geospatial developers. pygeoapi is an OGC Reference Implementation compliant with the OGC API - Features specification. Implemented in Python, pygeoapi supports many other OGC APIs via the Flask web framework and a fully integrated OpenAPI structure. Lightweight, easy to deploy and cloud-ready, pygeoapi's architecture facilitates publishing datasets and processes from multiple sources. Implementations of other OGC APIs are in progress for the 1.0 roadmap, including gridded/coverage data (OGC API - Coverages), search (OGC API - Records), and vector/map tiles (OGC API - Tiles). pygeoapi is a community project of the Open Source Geospatial Foundation (OSGeo). pygeoapi follows a clean separation of concerns, with view, provider/plugin and entry-point modules. The view approach allows for easy integration with other Python web frameworks like Starlette and Django. The provider abstracts connectivity to numerous data sources (CSV, SQLite3, GeoJSON, Elasticsearch, GDAL/OGR) and provides extensibility to support additional formats, databases, object storage and more. This presentation will provide an overview of pygeoapi, its current status, and next steps as part of the evolution of the project.
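
To illustrate the provider abstraction described above, here is a minimal self-contained sketch in the spirit of pygeoapi's provider plugins. The class and method names are invented for illustration and do not match pygeoapi's actual provider API; the point is only that a common query interface lets heterogeneous backends serve the same GeoJSON-shaped responses.

```python
# Illustrative provider pattern: one abstract interface, many backends.
# (Hypothetical names; not the real pygeoapi plugin API.)

class BaseProvider:
    """Common interface that every data backend implements."""
    def query(self, bbox=None, limit=10):
        raise NotImplementedError

class MemoryGeoJSONProvider(BaseProvider):
    """Serves an in-memory list of GeoJSON-like point features."""
    def __init__(self, features):
        self.features = features

    def query(self, bbox=None, limit=10):
        def inside(feature):
            if bbox is None:
                return True
            x, y = feature["geometry"]["coordinates"]
            minx, miny, maxx, maxy = bbox
            return minx <= x <= maxx and miny <= y <= maxy
        matches = [f for f in self.features if inside(f)][:limit]
        return {"type": "FeatureCollection", "features": matches}

provider = MemoryGeoJSONProvider([
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [2.35, 48.85]},
     "properties": {"name": "Paris"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-75.7, 45.4]},
     "properties": {"name": "Ottawa"}},
])
# A bbox roughly covering western Europe selects only the first feature.
europe = provider.query(bbox=(-10, 35, 30, 60))
```

Swapping in an Elasticsearch- or GDAL/OGR-backed class behind the same interface is the extensibility the abstract's provider layer refers to.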

Tom is a Senior Systems Scientist for the Meteorological Service of Canada and is a longtime proponent of spatial data infrastructure, interoperability, open standards and open source. He is chief architect of the World Ozone and Ultraviolet Radiation Data Centre (WOUDC) and MSC's GeoMet platform of real-time and archive weather, climate and water APIs. Tom is active in OGC and is currently co-chair of the OGC API - Records SWG. He is also the chair of the UN World Meteorological Organization Expert Team on Metadata. Tom has contributed to numerous FOSS4G projects such as QGIS and MapServer. He is the founder of the pycsw and pygeoapi projects. He is a Charter Member of OSGeo and currently serves on their Board of Directors.

Wednesday 18:15 UTC
Enabling geospatial in big data lakes and databases with LocationTech GeoMesa
James Hughes

Many of the Apache projects serving the big data space do not come with out-of-the-box support for geospatial data types like points, lines, and polygons. LocationTech GeoMesa provides add-on support for Apache database projects such as Accumulo, Cassandra, HBase, and Redis by crafting spatial and spatio-temporal keys. In addition to distributed databases, GeoMesa enables spatial storage in many of the popular Apache file format projects such as Arrow, Avro, ORC, and Parquet. This talk will review the basics of big geo data persistence, either in a data lake or in a database, and provide an overview of the benefits (and limitations) of each technology.
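
A minimal sketch of the key-crafting idea: interleave the bits of discretized longitude and latitude into a Z-order (Morton) key, so that points near each other in space tend to share key prefixes and therefore sort near each other in a key-ordered store. GeoMesa's real key layouts are considerably more elaborate (Z2/Z3 curves plus time dimensions and shards); this is only the core intuition.

```python
# Z-order (Morton) key sketch: discretize each coordinate to `bits` bits,
# then interleave the bits (longitude in odd positions, latitude in even).
# Key-ordered databases like Accumulo/HBase then keep spatially close
# points in nearby rows, enabling efficient range scans for bbox queries.

def z_order_key(lon, lat, bits=16):
    scale = 1 << bits
    x = min(int((lon + 180.0) / 360.0 * scale), scale - 1)
    y = min(int((lat + 90.0) / 180.0 * scale), scale - 1)
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i + 1)   # longitude bit
        key |= ((y >> i) & 1) << (2 * i)       # latitude bit
    return key
```

Two points a block apart in Paris share almost all of their high-order key bits, while a point in Ottawa diverges at the very top of the key.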

Jim Hughes applies training in mathematics and computer science to build distributed, scalable systems capable of supporting data science and machine learning. He is a core committer for GeoMesa, which leverages HBase, Accumulo and other distributed database systems to provide distributed computation and query capabilities. He is also a committer for the LocationTech projects JTS and SFCurve and serves as a mentor for other LocationTech and Eclipse projects. He serves on the LocationTech Project Management Committee and Steering Committee. Through work with LocationTech and OSGeo projects like GeoTools and GeoServer, he works to build end-to-end solutions for big spatio-temporal problems. Jim received his Ph.D. in Mathematics from the University of Virginia for work studying algebraic topology. He enjoys playing outdoors and swing dancing.

Wednesday 18:55 UTC
Map Serving with Apache HTTPD Tile Server Ecosystem (AHTSE)
Dr. Lucian Plesea

AHTSE is a collection of open source Apache httpd modules that can be used independently or combined to implement high performance and scalable tile services. Developed for geospatial applications, AHTSE can be used for other domains that need fast pan and zoom access to large datasets. AHTSE source code is available on GitHub, licensed under Apache License 2.0 terms. Geospatial web services compatible with OGC WMTS, Esri REST, and tiled WMS can be implemented. The tight integration with httpd results in exceptional scalability and reliability. The AHTSE development represents an evolution of NASA's original OnEarth server code. Examples of public services that use AHTSE are NASA's WorldView server (https://worldview.earthdata.nasa.gov/), Esri's Astro server (https://astro.arcgis.com), and Esri's EarthLive server (https://earthlive.maptiles.arcgis.com). This session will describe the core AHTSE concepts, demonstrate some of the existing server instances, and provide sample server configurations.
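
Tile services like those AHTSE implements address imagery by zoom level, row, and column. As background, the sketch below shows the standard Web Mercator tile-addressing math used by XYZ/WMTS-style services; it is generic tile arithmetic, not taken from AHTSE's code.

```python
import math

# Convert a lon/lat position to (x, y) tile indices in the common
# Web Mercator tile grid: at zoom z there are 2^z tiles per axis,
# with y increasing from north to south.

def lonlat_to_tile(lon, lat, zoom):
    n = 1 << zoom                                   # tiles per axis
    xtile = int((lon + 180.0) / 360.0 * n)
    lat_r = math.radians(lat)
    ytile = int((1.0 - math.asinh(math.tan(lat_r)) / math.pi) / 2.0 * n)
    return xtile, ytile
```

A tile server maps each (zoom, x, y) triple to a pre-generated or dynamically composed image, which is what makes pan-and-zoom access fast regardless of dataset size.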

Dr. Plesea worked at NASA's JPL, where he was a pioneer in developing geospatial imagery using supercomputers and later transitioned to building geospatial web services. He built and maintained multiple generations of the well known JPL OnEarth/OnMars/OnMoon geospatial image servers, and was involved in the early development of the NASA WorldWind system. Once the OnEarth server technology was adopted and transitioned to NASA EOSDIS as the core server technology behind the WorldView client, Dr. Plesea moved to Esri, where his responsibilities include developing cloud architecture for the basemap geospatial services and developing cloud raster technologies. Dr. Plesea is also an active GDAL contributor and maintainer, and is the principal OnEarth and AHTSE developer.

Wednesday 19:35 UTC
Lizmap to create Web Map Applications
René-Luc DHONT

Lizmap is an open-source application to create web map applications, based on a QGIS plugin and a web client. The project started in 2011, and the third version was published in 2016. In 2019, the project had to be adapted to QGIS 3. We will present the state of the project, connected projects such as the mapbuilder module and extension scripts, and what is coming in the future.

Founder of 3Liz, an open source GIS software editor * Lizmap developer * QGIS Server maintainer

Thursday 16:15 UTC
Apache Spark Accelerated Deep Learning Inference for Large Scale Satellite Image Analytics
Dalton Lunga, PhD

With volumes of acquired remotely sensed imagery growing at an exponential rate, there is an ever-increasing burden of research and development challenges associated with processing this data at scale. In particular, the application of object detection models across large geographic areas predominantly faces three obstacles: (1) a lack of workflows for gathering representative training data and mitigating data bias, (2) the inability of current machine learning algorithms to generalize across diverse sensor and geographic conditions, and (3) the deployment and reuse of hundreds of unique models at scale. By considering the above challenges in a joint manner, we formulate and present an efficient, geographically agnostic framework for remote sensing imagery analysis at a global scale. The framework addresses the problem of bias-free data selection by mapping observed satellite images to a novel metric space rooted in the manifold geometry of the data itself, forming natural partitions of similar data. Using these partitions to seed training, the framework enables simpler, localized models to be developed, alleviating the challenge of generalization faced by more complex models over larger geographic extents. In an agile manner, the framework further exploits the inherent parallelism of the dataflow and harnesses Apache Spark to implement distributed inference and training strategies that scale favorably. We discuss the challenges and merits of using Spark with current deep learning frameworks, providing insight into solutions developed for overall workflow harmonization. As a test case study, with no training data gathered over any entire country, we deploy the framework to detect buildings and roads over areas that span thousands of square kilometers and are covered by 26 TB of satellite image data. Drawing understanding from the results of this study, we finally present future directions which this exciting research may take.

Dalton is currently a research scientist in machine learning driven geospatial image analytics at ORNL. In this role he deploys machine learning and computer vision techniques in high performance computing environments, focusing on creating imagery-based data layers of interest to various societal problems e.g. enable accurate population distribution estimates and damage mapping for disaster management needs. He currently conducts research and development in machine learning techniques and advanced workflows for handling large volumes of geospatial data. Prior to ORNL, Dalton worked as machine learning research scientist at the council for scientific and industrial research in South Africa on a variety of projects. He received his PhD in electrical and computer engineering from Purdue University, West Lafayette, IN, US.

Thursday 16:55 UTC
AutoRetrain: automated deep learning model training on imagery using Apache Airflow and Apache NiFi
Carlos Caceres

The ability to automate model training is a complex subject that has recently received much attention in the deep learning community. Multiple workflow management systems have also begun gaining traction, and they are necessary in order to orchestrate the steps that make auto-retraining feasible. This work tackles model automation by making use of two such technologies: Apache Airflow and Apache NiFi. Since both the field of automatic model training and the overarching field of AutoML are broad and complex, this work seeks to show the utility of AutoML approaches to object detection in overhead imagery via a simple approach: integrating cycles of model retraining as data becomes available over time. Not only does this approach match the reality of data acquisition, it also seeks to leverage information as it becomes available and, in so doing, reduces the time lag from acquiring new data to extracting useful intelligence. This work tackles a few problems practitioners often encounter in long-term deep learning projects. Questions include: (1) when to start a new round of training, (2) how to minimize the time complexity of training a deep learning network, and (3) how to tackle the problem of selection bias, which occurs when training sets contain uneven probability across classes. The third and most complex question originates from the uneven distribution that may be present in the data. This bias occurs for a variety of reasons, low sampling opportunities chief among them. Selection bias and other forms of dataset bias are only part of the learning problem, as learning through backpropagation also allows the model to ignore uncertainty in its predictions. Instead, certain scenarios have been helped by other techniques, such as curriculum learning, active-bias learning, and hard example mining, which focus training on easy, uncertain, and hard examples, respectively.
Retraining as described consists of training cycles, where each cycle contains the whole data science pipeline, from data gathering and data preparation to training and scoring. In order to automate this process for a production system, it is first necessary to establish a reliable method to orchestrate the execution of the individual pieces of the pipeline. To this end, this work experimented with Apache NiFi and Apache Airflow, two popular data flow management tools. By combining them with a tracking tool such as MLflow, both Apache NiFi and Apache Airflow become extremely useful in managing retraining flows in a way that allows for reliable reproducibility.
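
As a toy illustration of question (1), when to start a new round of training, the function below shows the kind of trigger logic a NiFi or Airflow task might call at the top of each cycle. The thresholds and names are invented for illustration and are not from the work described.

```python
# Hypothetical retraining trigger: start a new cycle either when enough
# new labeled samples have accumulated in absolute terms, or when the
# dataset has grown by a meaningful fraction since the last training run.

def should_retrain(new_samples, samples_at_last_training,
                   min_new=1000, min_growth=0.10):
    """Return True when a new training cycle is warranted."""
    if new_samples >= min_new:
        return True
    if samples_at_last_training > 0:
        return new_samples / samples_at_last_training >= min_growth
    return new_samples > 0   # never trained yet: any data justifies a run
```

In an orchestrated pipeline this check would gate the downstream data-preparation, training, and scoring tasks, with a tool like MLflow recording which dataset snapshot each cycle trained on.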

Carlos Caceres works on cloud computing for government and MILSATCOM applications using satellite data.

Thursday 17:35 UTC
GeoSpark: Manage Big Geospatial Data in Apache Spark
Jia Yu, Mohamed Sarwat

The volume of spatial data increases at a staggering rate. This talk comprehensively studies how GeoSpark extends Apache Spark to support massive-scale spatial data. During the talk, we first provide a background introduction to the characteristics of spatial data and the history of distributed spatial data management systems. A follow-up section presents the vital components of GeoSpark, such as spatial data partitioning, indexing, and query algorithms. The third section then introduces the latest updates in GeoSpark, including geospatial visualization, integration with Apache Zeppelin, and Python and R wrappers. The fourth part concludes the talk to help the audience better grasp the overall content and points out future research directions.
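
To give a flavor of spatial data partitioning, one of the components mentioned above, here is a minimal pure-Python sketch of a fixed-grid partitioner. GeoSpark itself offers more sophisticated partitioners (e.g. KDB-tree and quadtree based) implemented over Spark RDDs; the principle shown here is only that a spatial query then needs to touch only the partitions whose cells intersect the query window.

```python
# Fixed-grid spatial partitioning sketch: assign each (lon, lat) point to
# a cell of a cells_x-by-cells_y grid covering the globe, so co-located
# points land in the same partition.

def grid_partition(points, cells_x=4, cells_y=4):
    """Map (lon, lat) points to partition ids of a cells_x x cells_y grid."""
    parts = {}
    for lon, lat in points:
        cx = min(int((lon + 180.0) / 360.0 * cells_x), cells_x - 1)
        cy = min(int((lat + 90.0) / 180.0 * cells_y), cells_y - 1)
        parts.setdefault(cy * cells_x + cx, []).append((lon, lat))
    return parts

# Two points in Paris share a cell; Ottawa falls in a different one.
parts = grid_partition([(2.35, 48.85), (2.36, 48.86), (-75.7, 45.4)])
```

Adaptive partitioners replace the fixed grid with data-driven cell boundaries so that skewed datasets still yield balanced partitions.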

Jia Yu:
Jia Yu is an Assistant Professor at Washington State University School of EECS. He obtained his Ph.D. in Computer Science from Arizona State University in Summer 2020. Jia’s research focuses on database systems and geospatial data management. In particular, he worked on distributed data management systems, database indexing, and data visualization. He is the main contributor of several open-sourced research projects such as Apache Sedona (incubating), a cluster computing framework for processing big spatial data.
Mohamed Sarwat :
Mohamed is an assistant professor of computer science at Arizona State University. Dr. Sarwat is a recipient of the 2019 National Science Foundation CAREER award. His general research interest lies in developing robust and scalable data systems for spatial and spatiotemporal applications. The outcome of his research has been recognized by two best research paper awards at the IEEE International Conference on Mobile Data Management (MDM 2015) and the International Symposium on Spatial and Temporal Databases (SSTD 2011), a best of conference citation at the IEEE International Conference on Data Engineering (ICDE 2012), as well as a best vision paper award (3rd place) at SSTD 2017. Besides impact through scientific publications, Mohamed is also the co-architect of several software artifacts, including GeoSpark (a scalable system for processing big geospatial data), which is being used by major tech companies. He is an associate editor for the GeoInformatica journal and has served as an organizer, reviewer, and program committee member for major data management and spatial computing venues. In June 2019, Dr. Sarwat was named an Early Career Distinguished Lecturer by the IEEE Mobile Data Management community.

Thursday 18:15 UTC
Rethinking Earth Observation using Deep Learning
Sayantan Das

Earth observation is the gathering of information about the physical, chemical, and biological systems of the planet via remote-sensing technologies. With the advent of better compute, deep learning based methods have emerged that improve on existing remote sensing algorithms. In this talk, we shall go over some examples of computer vision tasks on satellite images, including a showcase of one of my key projects done under the Indian Space Research Organisation. Slides for my abstract: http://bit.ly/sessionzero-geo The talk will be divided into three parts: (1) coverage of what remote sensing is and how deep learning technology is being leveraged for its betterment, (2) a project showcase of semantic segmentation and object and land use classification using TensorFlow/PyTorch, and (3) open source tools and a small example of map visualization.

I am Sayantan Das, a final year undergraduate student. I am a mentor for Google Code-in 2019 in TensorFlow. This summer I completed a research internship at the Space Applications Centre, ISRO Ahmedabad, and I am currently doing a research internship at the CVPR Unit, ISI Kolkata. I am pursuing my bachelor's in Computer Science & Engineering at West Bengal University of Technology. I love to read, review, and reproduce research papers.

Thursday 18:55 UTC
Bring Satellite and Drone Imagery into your Data Science Workflows
Jason Brown

Overhead imagery from satellites and drones has entered the mainstream of how we explore, understand, and tell stories about our world. These images are undeniable and arresting descriptions of cultural events, environmental disasters, economic shifts, and more. Data scientists recognize that their value goes far beyond anecdotal storytelling: they are unstructured data full of distinctive patterns in a high-dimensional space. With machine learning, we can extract structured data from the vast set of imagery available. RasterFrames extends Apache Spark SQL with a strong Python API to enable processing of satellite, drone, and other spatial image data. This talk will discuss the fundamental ideas needed to make sense of this imagery data. We will discuss how the RasterFrames custom DataSource exploits convergent trends in how public and private providers publish images. Through deep Spark SQL integration, RasterFrames lets users consider imagery and other location-aware data sets in their existing data pipelines. RasterFrames builds on an Apache-licensed tech stack, fully supports Spark ML, and interoperates smoothly with scikit-learn, TensorFlow, Keras, and PyTorch. To crystallize these ideas, we will discuss a practical data science case study using overhead imagery in PySpark.

Jason is a Senior Data Scientist at Astraea, Inc. applying machine learning to Earth-observing data to provide actionable insights to clients' and partners' challenges. He brings a background in mathematical modeling and statistics together with an appreciation for data visualization, geography, and software development.

Thursday 19:35 UTC
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka and Cassandra
Paul Brebner

This presentation will explore how we added location data to a scalable real-time anomaly detection application built around Apache Kafka and Cassandra. Kafka and Cassandra are designed for time-series data; however, it's not so obvious how they can efficiently process spatiotemporal data (space and time). In order to find location-specific anomalies, we need ways to represent, index, and query locations. We explore alternative geospatial representations including latitude/longitude points, bounding boxes, and geohashes, and go vertical with 3D representations, including 3D geohashes. For each representation we also explore possible Cassandra implementations including clustering columns, secondary indexes, denormalized tables, and the Cassandra Lucene Index Plugin. To conclude, we measure and compare the query throughput of some of the solutions and summarise the results in terms of accuracy vs. performance to answer the question "Which geospatial data representation and Cassandra implementation is best?"
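
As background for the representations compared in the talk, here is a minimal pure-Python geohash encoder using the standard bisect-and-interleave algorithm. It is illustrative only and not taken from the application described; the useful property for a key-ordered store like Cassandra is that nearby locations share geohash prefixes.

```python
# Geohash encoding sketch: alternately bisect the longitude and latitude
# ranges (longitude first), collect one bit per bisection, and emit a
# base32 character for every 5 bits.

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lon, lat, precision=7):
    lon_lo, lon_hi = -180.0, 180.0
    lat_lo, lat_hi = -90.0, 90.0
    out, bits, bit_count, even = [], 0, 0, True
    while len(out) < precision:
        if even:                       # even bits bisect longitude
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = bits * 2 + 1
                lon_lo = mid
            else:
                bits = bits * 2
                lon_hi = mid
        else:                          # odd bits bisect latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = bits * 2 + 1
                lat_lo = mid
            else:
                bits = bits * 2
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:             # 5 bits -> one base32 character
            out.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(out)
```

Because of the shared-prefix property, a bounding-box query can be approximated by a small set of prefix range scans, which maps naturally onto Cassandra clustering columns.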

Since learning to program on a VAX 11/780, Paul has extensive R&D and consulting experience in distributed systems, technology innovation, software architecture and engineering, software performance and scalability, grid and cloud computing, and data analytics and machine learning. Paul is the Technology Evangelist at Instaclustr. He’s been learning new scalable technologies, solving realistic problems and building applications, and blogging about Apache Cassandra, Spark, Zeppelin, Kafka, and Elasticsearch. Paul has worked at UNSW, several tech start-ups, CSIRO, UCL (London, UK), & NICTA. Paul has helped solve significant software architecture and performance problems for clients including Defence and NBN Co. Paul has an MSc in Machine Learning and a BSc (Computer Science and Philosophy).