Research Project
Heterogeneity-Aware Adaptive Placement in Hybrid Edge–Cloud Stream Processing
This project explores how datacenter-centric stream-processing engines such as Apache Flink can be extended to support hybrid edge–cloud analytics without forcing users to rewrite applications. The goal is to make heterogeneous deployments practical under variable WAN bandwidth, shifting workloads, and constrained edge resources.
Overview
Modern analytics pipelines increasingly ingest data at the edge—through cameras, sensors, wearables, drones, and other devices—while still relying on cloud-hosted systems for storage, aggregation, and decision-making. Traditional dataflow engines assume relatively homogeneous resources and stable datacenter conditions, which makes them poorly suited for deployments where WAN links fluctuate, edge devices differ in compute capacity, and input rates vary over time.
This work introduces a hybrid edge–cloud stream-processing framework built on Apache Flink that automatically rewrites applications into independently deployable segments and dynamically shifts selected operators between edge and cloud. The design preserves the usability benefits of mature stream engines while adding the adaptability required for heterogeneous wide-area settings.
Why This Is Needed
- Resource heterogeneity: Edge devices and cloud servers operate under very different compute, memory, and network conditions.
- Variable WAN performance: Edge-to-cloud bandwidth can fluctuate significantly, making any static placement eventually suboptimal.
- Disruptive recovery in cloud-centric systems: Standard job-wide checkpoint recovery can create unnecessary downtime when failures occur at the edge.
- Need for transparency: Users should be able to submit ordinary Flink programs without manually splitting queries or hand-managing deployment logic.
Core Ideas
Automatic Query Rewriting
Flink DataStream programs are transformed into independent query segments that can be scheduled, monitored, and reconfigured separately.
Adaptive Routing
Stateless operators before the first stateful stage can execute at the edge or in the cloud, allowing the system to shift computation as network and workload conditions change.
Near-Zero-Downtime Reconfiguration
Reconfiguration is driven by routing changes rather than full task migration, which avoids stop-and-restart behavior and keeps the pipeline live.
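To make the idea concrete, the sketch below shows routing-driven reconfiguration in miniature: switching the active path is a small metadata update on a router, so running tasks are never stopped and restarted. The class and path names here are hypothetical illustrations, not the framework's actual API.

```python
# Illustrative sketch: reconfiguration as a routing change. Flipping the
# active replica only updates router state; no task is torn down.
# "SegmentRouter", "edge", and "cloud" are hypothetical names.

class SegmentRouter:
    def __init__(self, paths, active):
        self.paths = set(paths)
        assert active in self.paths
        self.active = active

    def route(self, record):
        """Tag each record with the currently active execution path."""
        return (self.active, record)

    def switch(self, path):
        """Redirect new records; in-flight records finish on the old path."""
        assert path in self.paths
        self.active = path

router = SegmentRouter({"edge", "cloud"}, active="cloud")
print(router.route("sensor-reading"))  # ('cloud', 'sensor-reading')
router.switch("edge")                  # a metadata update, not a restart
print(router.route("sensor-reading"))  # ('edge', 'sensor-reading')
```

Because the switch only changes where new records flow, the pipeline stays live throughout the reconfiguration, which is the property the section above describes.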
Compatibility with Existing Semantics
The framework is designed to preserve event-time processing, watermark propagation, and checkpoint-based fault-tolerance guarantees expected from modern stream engines.
How the Framework Works
- Stateless-to-stateful split: The application is partitioned at the boundary between early stateless operators and downstream stateful processing.
- Replicated execution paths: Stateless operators can have both edge and cloud replicas, with routing decisions selecting the active path.
- Segment-level independence: Query segments are isolated so failures or slowdowns in one segment do not necessarily disrupt the others.
- Metric-driven decisions: Routing decisions are made from observed metrics such as throughput, backpressure, bandwidth conditions, and edge compute contention.
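The stateless-to-stateful split can be sketched as a simple partitioning of the pipeline's operator chain at the first stateful operator. This is an illustration only, with hypothetical operator names; the actual rewriter operates on Flink's internal dataflow graph rather than a flat list.

```python
# Illustrative sketch: partition a logical pipeline at the first
# stateful operator. Everything before that boundary is stateless and
# eligible for edge replication; the rest stays in the cloud.

from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    is_stateful: bool

def split_at_first_stateful(ops):
    """Return (edge_eligible, cloud_only) operator segments."""
    for i, op in enumerate(ops):
        if op.is_stateful:
            return ops[:i], ops[i:]
    return ops, []  # fully stateless pipeline

# Hypothetical pipeline for illustration.
pipeline = [
    Operator("parse", False),
    Operator("filter", False),
    Operator("keyed_window_agg", True),
    Operator("sink", True),
]

edge_part, cloud_part = split_at_first_stateful(pipeline)
print([op.name for op in edge_part])   # ['parse', 'filter']
print([op.name for op in cloud_part])  # ['keyed_window_agg', 'sink']
```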
Adaptive Placement Policy
The routing policy distinguishes between two main bottlenecks: network bottlenecks and edge compute bottlenecks. When WAN bandwidth becomes the limiting factor, the system moves suitable operators toward the edge to reduce traffic volume. When bandwidth is sufficient but the edge cannot keep up, more work is shifted back to the cloud. This process is repeated until the pipeline sustains the target throughput without backpressure.
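The two-bottleneck policy above can be sketched as a small feedback loop. The metric names, thresholds, and convergence scenario below are hypothetical; the real policy evaluates live Flink metrics such as throughput, backpressure, and observed bandwidth.

```python
# Illustrative sketch of the two-bottleneck placement policy. Operators
# [0, split) of the stateless prefix run at the edge; the rest run in
# the cloud. Thresholds (0.9) are hypothetical.

def placement_step(metrics, split, n_stateless):
    """One policy iteration: return an updated split index."""
    if not metrics["backpressure"]:
        return split                       # target throughput sustained
    if metrics["wan_utilization"] > 0.9 and split < n_stateless:
        return split + 1                   # WAN-bound: move work to the edge
    if metrics["edge_cpu"] > 0.9 and split > 0:
        return split - 1                   # edge-bound: move work to the cloud
    return split

# Simulated convergence: the WAN link is the bottleneck until two
# operators (hypothetically enough filtering) execute at the edge.
split = 0
for _ in range(5):
    wan_bound = split < 2
    metrics = {
        "backpressure": wan_bound,
        "wan_utilization": 0.95 if wan_bound else 0.4,
        "edge_cpu": 0.3,
    }
    split = placement_step(metrics, split, n_stateless=3)
print(split)  # 2 -- backpressure-free, so the policy stops moving operators
```

The loop mirrors the prose: move operators edgeward while the WAN is saturated, cloudward while edge compute is saturated, and stop once backpressure clears.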
Correctness and Processing Guarantees
- Event-time awareness: The design carries watermark progress across segments so stateful operators can continue to process time-sensitive workloads correctly.
- Fault tolerance: The implementation is compatible with checkpoint-based fault tolerance and relies on reliable message-broker delivery between segments.
- Semantic consistency: Only stateless, order-agnostic operators are replicated for adaptive routing, which helps preserve logical equivalence with the original query.
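Carrying watermark progress across segments follows the standard multi-input rule in dataflow engines: a downstream stateful stage can only advance its event-time clock to the minimum watermark over its input paths. The sketch below illustrates that rule with hypothetical path names; it is not the framework's actual watermark machinery.

```python
# Illustrative sketch: when a stateless stage has both edge and cloud
# replicas, the downstream stateful segment sees the minimum watermark
# across its input paths, so the slowest path bounds event-time progress.

def merged_watermark(path_watermarks):
    """Watermark visible to the stateful segment: the slowest path wins."""
    return min(path_watermarks.values())

paths = {"edge_replica": 1_000, "cloud_replica": 950}
print(merged_watermark(paths))  # 950

# Once the lagging path catches up, the merged watermark can advance.
paths["cloud_replica"] = 1_200
print(merged_watermark(paths))  # 1000
```

This conservative merge is what lets stateful operators keep firing event-time windows correctly even while routing shifts traffic between the two paths.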
Evaluation Highlights
- Evaluated on a physical Raspberry Pi edge–cloud testbed and a 100-VM large-scale testbed.
- Studied across five real-world applications, including air-quality monitoring, taxi-route monitoring, smart-grid anomaly detection, sentiment analysis, and video-assisted traffic monitoring.
- Demonstrated that adaptive placement can sustain target throughput under changing bandwidth, input rates, and compute contention.
- In the 100-device experiment, the framework converged to backpressure-free configurations after policy triggers and achieved an average throughput-to-target ratio of 99.8%.
Representative Application Scenarios
Air Quality Monitoring
Continuous PM2.5 sensor monitoring with downstream alerting and analytics.
Taxi Route Monitoring
Streaming trip reports to identify frequent routes for mapping and pricing systems.
Smart Grid Anomaly Detection
Monitoring smart-meter streams to detect and escalate abnormal household energy spikes.
Sentiment Analysis
Geo-tagged social media processing feeding downstream advertising and analytics pipelines.
Video-Assisted Traffic Monitoring
Analyzing continuous video feeds from cameras to detect traffic violations and anomalies.
Implementation Notes
The framework is implemented on Apache Flink, with Apache Kafka serving as the message broker between query segments. Query rewriting is performed automatically, so users can submit ordinary Flink applications without manually splitting pipelines or deciding placement in advance. The implementation also includes mechanisms for routing control, cross-segment watermark handling, and scalable policy evaluation.
Research Significance
This work shows that hybrid edge–cloud analytics does not necessarily require a clean-slate processing engine. By extending mature dataflow systems with adaptive placement, it is possible to retain the ecosystem advantages of tools like Apache Flink while adding the dynamic behavior needed for real-world heterogeneous deployments.
Note: This page summarizes an ongoing research effort. Since the work is not yet published under a public project identity, this page intentionally avoids using the internal system name and other details from the draft manuscript.