Research Project

Multi-Level Error Detection (MLED) Framework

MLED is a configurable recursive architecture for large-scale file transfer that uses in-network resources to reduce the probability of undetected errors and localize recovery within the network. It is designed for settings where even a single undetected bit error can invalidate valuable scientific or data-intensive workloads.

Large-Scale File Transfer, In-Network Computation, Recursive Architecture, Error Detection, FABRIC Testbed, C++20

MLED Team

Prateek Jain

Boston University

Arash Sarabi

Arizona State University

Abraham Matta

Boston University

Violet R. Syrotiuk

Arizona State University

Overview

The Multi-Level Error Detection framework, denoted MLED(n, P), is defined by n ≥ 3 levels and a set of policies P. Each level consists of one or more layers, and each layer is governed by a configurable policy over its scope. The architecture generalizes the traditional two-layer approach to error detection by introducing additional levels that can detect and localize errors before corrupted data reaches the final destination.

MLED is designed to be modular and decoupled. In principle, different layers can be configured with their own error-detection, routing, addressing, flow-control, and recovery behavior. The current implementation focuses on error detection for large-scale data transfers, but the architecture is intended to support broader communication functions in the future.

Why MLED Is Needed

Key Features

Recursive Layered Design

MLED organizes communication across multiple levels: lower-level layers, connected by relay processes, realize the layers above them. A layer at level i operates over a scope no larger than that of a layer at level i + 1.

Policy-Driven Operation

Each layer is governed by a policy that defines its behavior over its scope, enabling flexible choices for error detection and other communication functions.

Localized Recovery

Errors can be detected and corrected at intermediate levels inside the network, reducing the need for full file retransmission from the source.

Protocol Compatibility

MLED can be configured to mimic or extend error-detection behavior used by existing large-scale file transfer tools.

Benefits

MLED Architecture

In MLED, each layer starts and ends with its own source and destination processes, while relay processes connect lower layers to realize higher-level logical communication. This recursive organization allows routing and recovery to be handled locally at each level while still achieving an end-to-end transfer objective.

MLED(4, P) configured across five nodes with recursive levels and relay processes.

Implementation Highlights

Configurable Policies

The current implementation supports error detection with CRC-8, CRC-16, CRC-32, the Internet checksum, MD5, and SHA-1. It also supports configurable payload sizing, static routing, sliding-window flow control, ARQ-based recovery, and static addressing.
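As an illustration of one of the listed checks, the following is a minimal sketch of CRC-32 in C++20, assuming the common IEEE 802.3 variant (reflected polynomial 0xEDB88320). The source does not say which CRC-32 parameters MLED uses, and the `Pdu`, `make_pdu`, and `verify_pdu` names are hypothetical helpers introduced here only to show how a receiving process might validate a payload. This is not the MLED implementation.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>

// Build the 256-entry lookup table for the reflected CRC-32 polynomial
// 0xEDB88320 (the IEEE 802.3 variant; an assumption, not MLED's documented choice).
constexpr std::array<std::uint32_t, 256> make_crc32_table() {
    std::array<std::uint32_t, 256> table{};
    for (std::uint32_t i = 0; i < 256; ++i) {
        std::uint32_t c = i;
        for (int bit = 0; bit < 8; ++bit) {
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        }
        table[i] = c;
    }
    return table;
}

inline constexpr auto kCrc32Table = make_crc32_table();

// Table-driven CRC-32 over a byte sequence.
constexpr std::uint32_t crc32(std::string_view data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (char ch : data) {
        const auto byte = static_cast<std::uint8_t>(ch);
        crc = kCrc32Table[(crc ^ byte) & 0xFFu] ^ (crc >> 8);
    }
    return crc ^ 0xFFFFFFFFu;
}

// Standard check value for this CRC-32 variant.
static_assert(crc32("123456789") == 0xCBF43926u);

// Hypothetical PDU: a payload plus the check value computed by the sender.
struct Pdu {
    std::string payload;
    std::uint32_t check;
};

// Sender side: attach the CRC-32 of the payload.
inline Pdu make_pdu(std::string payload) {
    const std::uint32_t check = crc32(payload);
    return Pdu{std::move(payload), check};
}

// Receiver side: recompute the CRC-32 and compare with the carried value.
inline bool verify_pdu(const Pdu& pdu) {
    return crc32(pdu.payload) == pdu.check;
}
```

In a layered setting, a destination or relay process at a given layer could call something like `verify_pdu` on arrival and trigger that layer's recovery policy when it returns false.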

JSON Configuration

MLED uses a JSON configuration file to describe processes, addresses, ports, routing, payload lengths, layer structure, and optional integrity-check settings.

GUI Support

A drag-and-drop web interface helps users build an MLED configuration and generate a validated configuration file automatically.

C++20 Implementation

The current implementation is written to the C++20 standard and uses a decentralized setup in which managers distribute layer-specific information across nodes.

Experimental Validation on FABRIC

The framework was evaluated on the FABRIC testbed using a five-node deployment spanning sites in Chicago, New York, Washington, Atlanta, and Dallas. The implementation was compared against the traditional two-layer approach under an adversarial error model that injects errors able to evade both CRC and Internet checksum checks in the baseline design.
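To make concrete why a weak check can be evaded, the sketch below implements the Internet checksum (RFC 1071) and shows one well-known blind spot: because the checksum is a one's-complement sum of 16-bit words, reordering whole words leaves it unchanged. This only illustrates evasion of the Internet checksum. The source does not describe how the adversarial errors in the evaluation are constructed, and the `swap_words` helper is a hypothetical name used for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// RFC 1071 Internet checksum: the one's complement of the one's-complement
// sum of the data, taken as big-endian 16-bit words (odd tail padded with 0).
inline std::uint16_t internet_checksum(const std::vector<std::uint8_t>& data) {
    std::uint64_t sum = 0;
    std::size_t i = 0;
    for (; i + 1 < data.size(); i += 2) {
        sum += (static_cast<std::uint64_t>(data[i]) << 8) | data[i + 1];
    }
    if (i < data.size()) {
        sum += static_cast<std::uint64_t>(data[i]) << 8;
    }
    // Fold the carries back into the low 16 bits.
    while (sum >> 16) {
        sum = (sum & 0xFFFFu) + (sum >> 16);
    }
    return static_cast<std::uint16_t>(~sum & 0xFFFFu);
}

// Hypothetical corruption: exchange the 16-bit words at word indices a and b.
// The caller must ensure both words lie fully inside the buffer.
inline std::vector<std::uint8_t> swap_words(std::vector<std::uint8_t> data,
                                            std::size_t a, std::size_t b) {
    std::swap(data[2 * a], data[2 * b]);
    std::swap(data[2 * a + 1], data[2 * b + 1]);
    return data;
}
```

Since addition is commutative, a payload and its word-swapped copy produce the same checksum even though their contents differ, which is one reason stronger or additional checks at intermediate levels are useful.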

Performance Results

The following results summarize the behavior of MLED and the traditional two-layer design for a 20,480 MB file under different PDU error rates and an adversarial error model. In this setup, errors are introduced in protocol data units (PDUs) in a way that allows them to evade the CRC and Internet checksum checks used by the traditional approach. As a result, the baseline design can deliver a corrupted final file, which then requires retransmission of the entire file after file-level integrity verification. At a PDU error rate of 0.000%, both approaches perform similarly, reaching roughly 810 Mbps goodput and completing the transfer in about 200 seconds. As the error rate increases, however, the difference becomes pronounced.

MLED consistently achieves error-free transfers under this adversarial model, sustaining goodput close to 790–810 Mbps and a file delivery time of around 200–210 seconds. In contrast, the traditional approach drops to roughly 400 Mbps goodput and requires about 400–415 seconds to complete the transfer, because corruption propagates to the final file and forces a full retransmission. These results show that MLED preserves goodput and roughly halves file delivery time relative to the traditional approach by detecting and correcting errors inside the network, rather than relying on end-to-end detection after the full file has already been transferred.
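The reported figures are consistent with a simple back-of-the-envelope calculation, sketched below. It assumes 1 MB = 10^6 bytes, a sustained rate of about 810 Mbps on every pass, exactly one full retransmission for the traditional approach, and no time spent on file-level verification. These are simplifying assumptions for illustration, not the authors' model, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// File size in megabits, assuming 1 MB = 10^6 bytes (8 bits per byte).
constexpr double file_megabits(double file_mb) {
    return file_mb * 8.0;
}

// Time to send the whole file `full_transfers` times at a fixed rate.
constexpr double delivery_time_s(double file_mb, double rate_mbps,
                                 int full_transfers) {
    return full_transfers * file_megabits(file_mb) / rate_mbps;
}

// Goodput: useful file bits delivered divided by total elapsed time.
constexpr double goodput_mbps(double file_mb, double total_time_s) {
    return file_megabits(file_mb) / total_time_s;
}

// One pass (no end-to-end retransmission) versus two passes
// (one corrupted delivery followed by a full retransmission).
constexpr double kOnePassSeconds = delivery_time_s(20480.0, 810.0, 1);
constexpr double kTwoPassSeconds = delivery_time_s(20480.0, 810.0, 2);
constexpr double kTwoPassGoodput = goodput_mbps(20480.0, kTwoPassSeconds);
```

Under these assumptions a single pass takes about 202 seconds, while two passes take about 405 seconds at an effective goodput of about 405 Mbps, in line with the roughly 200 seconds and 400–415 seconds reported above.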

Figure: Boxplot of goodput for MLED and the traditional approach across three PDU error rates for a 20,480 MB transfer. At non-zero error rates, MLED sustains close to 800 Mbps while the traditional design drops to about 400 Mbps.

Figure: Boxplot of file delivery time for MLED and the traditional approach across PDU error rates for a 20,480 MB transfer. Under non-zero error rates, MLED completes the transfer in about 200–210 seconds, while the traditional approach takes roughly 400–415 seconds.

Undergraduate Student Researchers

Ethan Frink

Arizona State University

Noah Barnes

Boston University

MLED Publications

Prateek Jain, Arash Sarabi, Abraham Matta, and Violet R. Syrotiuk. (2025). Design and Modeling of a New File Transfer Architecture to Reduce Undetected Errors Evaluated in the FABRIC Testbed. Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), 9(2), Article 19. https://doi.org/10.1145/3727111

Prateek Jain, Arash Sarabi, Abraham Matta, and Violet R. Syrotiuk. (2025). Design and Modeling of a New File Transfer Architecture to Reduce Undetected Errors Evaluated in the FABRIC Testbed. In Abstracts of the 2025 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS Abstracts ’25), June 9–13, 2025, Stony Brook, NY, USA. ACM. https://doi.org/10.1145/3726854.3727281

Resources

This work is supported in part by NSF grants CNS-2215671 and CNS-2215672.

If you have any questions or would like to discuss the MLED framework further, please feel free to contact me.