Research Project
Multi-Level Error Detection (MLED) Framework
MLED is a configurable recursive architecture for large-scale file transfer that uses in-network resources to reduce the probability of undetected errors and localize recovery within the network. It is designed for settings where even a single undetected bit error can invalidate valuable scientific or data-intensive workloads.
MLED Team
Prateek Jain
Boston University
Arash Sarabi
Arizona State University
Abraham Matta
Boston University
Violet R. Syrotiuk
Arizona State University
Overview
The Multi-Level Error Detection framework, denoted MLED(n, P), is defined by n ≥ 3 levels and a set of policies P. Each level consists of one or more layers, and each layer is governed by a configurable policy over its scope. The architecture generalizes the traditional two-layer approach to error detection by introducing additional levels that can detect and localize errors before corrupted data reaches the final destination.
MLED is designed to be modular and decoupled. In principle, different layers can be configured with their own error-detection, routing, addressing, flow-control, and recovery behavior. The current implementation focuses on error detection for large-scale data transfers, but the architecture is intended to support broader communication functions in the future.
Why MLED Is Needed
- Packet sizes have grown substantially, increasing the chance that errors slip past conventional transport- and link-layer checks.
- Large file transfers are especially vulnerable, since a single undetected error can invalidate a petabyte-scale scientific dataset and force expensive retransmission.
- Traditional end-to-end recovery is inefficient when corruption is detected only after the full transfer reaches the destination.
- In-network detection enables localization, allowing recovery at intermediate scopes rather than retransmitting the full file from the source.
Key Features
Recursive Layered Design
MLED organizes communication across multiple levels: lower-level layers, connected by relay processes, realize higher-level ones. A layer at level i operates over a scope that is smaller than or equal to the scope of a layer at level i + 1.
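The scope rule above can be sketched as a small data model. This is only an illustration: the `Layer` class, the `covers` and `check_scopes` helpers, the node names A to E, and the policy strings are all hypothetical and are not taken from the MLED implementation. The sketch also assumes that "smaller or equal scope" means each layer's path is a contiguous sub-path of some layer one level up.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One layer: its level, the ordered nodes in its scope, and a policy label."""
    level: int
    scope: tuple[str, ...]  # nodes from the layer's source to its destination
    policy: str             # illustrative label only, e.g. "CRC-32"


def covers(outer: Layer, inner: Layer) -> bool:
    """True if inner's scope is a contiguous sub-path of outer's scope."""
    n, m = len(outer.scope), len(inner.scope)
    return any(outer.scope[i:i + m] == inner.scope for i in range(n - m + 1))


def check_scopes(layers: list[Layer]) -> bool:
    """Check that every layer below the top level fits inside a layer one level up."""
    top = max(layer.level for layer in layers)
    for inner in layers:
        if inner.level == top:
            continue
        uppers = [l for l in layers if l.level == inner.level + 1]
        if not any(covers(outer, inner) for outer in uppers):
            return False
    return True


# Hypothetical three-level arrangement over five nodes.
example = [
    Layer(1, ("A", "B"), "CRC-32"),
    Layer(1, ("B", "C"), "CRC-32"),
    Layer(1, ("C", "D"), "CRC-32"),
    Layer(1, ("D", "E"), "CRC-32"),
    Layer(2, ("A", "B", "C"), "CRC-8"),
    Layer(2, ("C", "D", "E"), "CRC-8"),
    Layer(3, ("A", "B", "C", "D", "E"), "SHA-1"),
]
print(check_scopes(example))
```

In this arrangement node C would host the relay process that joins the two level-2 layers into the single level-3 layer.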
Policy-Driven Operation
Each layer is governed by a policy that defines its behavior over its scope, enabling flexible choices for error detection and other communication functions.
Localized Recovery
Errors can be detected and corrected at intermediate levels inside the network, reducing the need for full file retransmission from the source.
Protocol Compatibility
MLED can be configured to mimic or extend error-detection behavior used by existing large-scale file transfer tools.
Benefits
- Reduced undetected error probability: adding extra levels reduces the effective undetected error probability across the transfer path.
- Efficient recovery: retransmissions can be triggered from intermediate scopes instead of always restarting from the source.
- Configurability: different policies and payload sizes can be selected to satisfy a target admissible undetected error probability.
- Strong experimental validation: the architecture has been implemented and evaluated on the FABRIC testbed.
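The first benefit can be illustrated with a deliberately simplified calculation. This is not the analytical model from the MLED publications: it assumes that a randomly corrupted PDU evades a k-bit check with probability 2^-k and that the checks fail independently. Neither assumption holds for adversarial errors, and the corruption probability used below is an arbitrary placeholder.

```python
def residual_probability(check_bits: list[int], p_corrupt: float) -> float:
    """Probability that a PDU is corrupted and evades every check.

    Illustration only: each k-bit check is assumed to miss a random
    corruption with probability 2**-k, independently of the other checks.
    """
    p = p_corrupt
    for k in check_bits:
        p *= 2.0 ** -k
    return p


P_CORRUPT = 1e-6  # placeholder value, not a measured error rate

# A 32-bit CRC plus the 16-bit Internet checksum (two-layer baseline).
two_layer = residual_probability([32, 16], P_CORRUPT)
# The same checks plus one additional level protected by CRC-8.
with_extra = residual_probability([32, 16, 8], P_CORRUPT)

print(two_layer, with_extra, with_extra / two_layer)
```

Under these assumptions every added k-bit check scales the residual probability by 2^-k, which is why check size and payload size can be traded off against a target admissible probability.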
MLED Architecture
In MLED, each layer starts and ends with its own source and destination processes, while relay processes connect lower layers to realize higher-level logical communication. This recursive organization allows routing and recovery to be handled locally at each level while still achieving an end-to-end transfer objective.
MLED(4, P) configured across five nodes with recursive levels and relay processes.
Implementation Highlights
Configurable Policies
The current implementation supports error detection with CRC-8, CRC-16, CRC-32, the Internet checksum, MD5, and SHA-1, along with configurable payload sizing, static routing, sliding-window flow control, ARQ-based recovery, and static addressing.
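For readers unfamiliar with these checks, the sketch below computes several of them over one payload using Python's standard library. It is not code from the MLED implementation, which is written in C++20. The CRC-8 polynomial (0x07, zero initial value) is an assumption, since the text does not say which CRC-8 variant MLED uses, and CRC-16 is omitted for the same reason.

```python
import hashlib
import zlib


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, MSB first, zero initial value (polynomial is an assumption)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def internet_checksum(data: bytes) -> int:
    """Ones' complement of the ones' complement sum of 16-bit words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


payload = b"123456789"
print("CRC-8            ", hex(crc8(payload)))
print("CRC-32           ", hex(zlib.crc32(payload)))
print("Internet checksum", hex(internet_checksum(payload)))
print("MD5              ", hashlib.md5(payload).hexdigest())
print("SHA-1            ", hashlib.sha1(payload).hexdigest())
```

The checks differ in width from 8 bits up to 160 bits, which is the property a per-layer policy can exploit when balancing overhead against detection strength.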
JSON Configuration
MLED uses a JSON configuration file to describe processes, addresses, ports, routing, payload lengths, layer structure, and optional integrity-check settings.
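To give a feel for what such a file might contain, here is a hypothetical configuration and a minimal loader. Every field name (`processes`, `layers`, `path`, `payload_length`, `integrity`), every address, and every value is invented for illustration. The actual MLED schema is not given in this text and may differ substantially.

```python
import json

# Hypothetical configuration; not the real MLED schema.
EXAMPLE_CONFIG = """
{
  "processes": [
    {"name": "src",   "address": "10.0.0.1", "port": 5000},
    {"name": "relay", "address": "10.0.0.2", "port": 5000},
    {"name": "dst",   "address": "10.0.0.3", "port": 5000}
  ],
  "layers": [
    {"level": 1, "path": ["src", "relay"], "payload_length": 1400, "integrity": "CRC-32"},
    {"level": 1, "path": ["relay", "dst"], "payload_length": 1400, "integrity": "CRC-32"},
    {"level": 2, "path": ["src", "relay", "dst"], "payload_length": 1024, "integrity": "CRC-8"}
  ]
}
"""


def load_config(text: str) -> dict:
    """Parse the JSON text and run two basic sanity checks on each layer."""
    config = json.loads(text)
    known = {process["name"] for process in config["processes"]}
    for layer in config["layers"]:
        unknown = [name for name in layer["path"] if name not in known]
        if unknown:
            raise ValueError(f"layer references unknown processes: {unknown}")
        if layer["payload_length"] <= 0:
            raise ValueError("payload_length must be positive")
    return config


config = load_config(EXAMPLE_CONFIG)
print(len(config["processes"]), "processes,", len(config["layers"]), "layers")
```

Checks of this kind are the sort of validation a configuration tool could run before a file is distributed to the nodes.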
GUI Support
A drag-and-drop web interface helps users build an MLED configuration and generate a validated configuration file automatically.
C++20 Implementation
The current implementation is written to the C++20 standard and uses a decentralized setup in which managers distribute layer-specific information across nodes.
Experimental Validation on FABRIC
The framework was evaluated on the FABRIC testbed using a five-node deployment spanning sites in Chicago, New York, Washington, Atlanta, and Dallas. The implementation was compared against the traditional two-layer approach under an adversarial error model that injects errors able to evade both CRC and Internet checksum checks in the baseline design.
- In the traditional approach, such adversarial corruption led to a corrupt file transfer and full retransmission from the source.
- With one additional MLED level using CRC-8, the framework detected and corrected these errors inside the network.
- Under non-zero error rates, MLED achieved roughly a 100% gain in goodput over the traditional approach (close to 800 Mbps versus about 400 Mbps).
- For error-free transfers, MLED reached a maximum goodput of about 810 Mbps with no appreciable increase in delay.
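To see how corruption can evade a check at all, consider the Internet checksum. Because it is a sum of 16-bit words, reordering those words leaves it unchanged. The sketch below demonstrates only this one well-known weakness. It does not reproduce the adversarial error model used in the experiments, which also evades the CRC, and the payload and helper names are made up for the example.

```python
import hashlib


def internet_checksum(data: bytes) -> int:
    """Ones' complement of the ones' complement sum of 16-bit words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def swap_words(data: bytes, i: int, j: int) -> bytes:
    """Return a copy of data with the 16-bit words at indices i and j swapped."""
    words = [data[k:k + 2] for k in range(0, len(data), 2)]
    words[i], words[j] = words[j], words[i]
    return b"".join(words)


original = b"scientific dataset chunk"   # 24 bytes, i.e. 12 whole words
corrupted = swap_words(original, 0, 1)   # content changes, word sum does not

print(corrupted != original)
print(internet_checksum(corrupted) == internet_checksum(original))
print(hashlib.sha1(corrupted).digest() != hashlib.sha1(original).digest())
```

A file-level hash still exposes the change, but only after the whole file has arrived, which is the situation in-network detection is meant to avoid.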
Performance Results
The following results summarize the behavior of MLED and the traditional two-layer design for a 20,480 MB file under different PDU error rates and an adversarial error model. In this setup, errors are introduced in protocol data units (PDUs) in a way that allows them to evade the CRC and Internet checksum checks used by the traditional approach. As a result, the baseline design can deliver a corrupted final file, which then requires retransmission of the entire file after file-level integrity verification. At a 0.000% error rate, both approaches perform similarly, reaching roughly 810 Mbps goodput and completing the transfer in about 200 seconds. As the error rate increases, however, the difference becomes pronounced.
MLED consistently achieves error-free transfers under this adversarial model, sustaining goodput close to 790–810 Mbps and file delivery time around 200–210 seconds. In contrast, the traditional approach drops to roughly 400 Mbps goodput and requires about 400–415 seconds to complete the transfer because corruption propagates to the final file and forces a full retransmission. These results show that MLED both preserves goodput and cuts file delivery time nearly in half by detecting and correcting errors inside the network rather than relying on end-to-end detection after the full file has already been transferred.
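The reported figures are consistent with simple arithmetic, sketched below. The calculation takes 1 MB as 10^6 bytes and assumes the retransmitted file travels at the same rate as the first attempt. Both are assumptions made for this check, and the function names are hypothetical.

```python
FILE_SIZE_MB = 20_480   # file size from the text; 1 MB taken as 10**6 bytes
RATE_MBPS = 810         # error-free goodput reported in the text


def delivery_time_s(file_mb: float, rate_mbps: float, passes: int = 1) -> float:
    """Seconds needed to send the file `passes` times at a constant rate."""
    return passes * (file_mb * 8) / rate_mbps


def goodput_mbps(file_mb: float, total_time_s: float) -> float:
    """Goodput counts the file's bits once, however many passes were needed."""
    return (file_mb * 8) / total_time_s


one_pass = delivery_time_s(FILE_SIZE_MB, RATE_MBPS)             # no retransmission
two_pass = delivery_time_s(FILE_SIZE_MB, RATE_MBPS, passes=2)   # full resend

print(round(one_pass), "s in one pass")
print(round(two_pass), "s with a full retransmission")
print(round(goodput_mbps(FILE_SIZE_MB, two_pass)), "Mbps effective goodput")
```

One pass takes about 202 seconds. A corrupt delivery followed by a full resend takes about 405 seconds and halves the goodput to about 405 Mbps, which matches the ranges reported for the traditional approach.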
Goodput comparison for MLED and the traditional approach across three PDU error rates. At non-zero error rates, MLED sustains close to 800 Mbps while the traditional design drops to about 400 Mbps.
File delivery time comparison for MLED and the traditional approach. Under non-zero error rates, MLED completes the transfer in about 200–210 seconds, while the traditional approach takes roughly 400–415 seconds.
Undergraduate Student Researchers
Ethan Frink
Arizona State University
Noah Barnes
Boston University
MLED Publications
Prateek Jain, Arash Sarabi, Abraham Matta, and Violet R. Syrotiuk. (2025). Design and Modeling of a New File Transfer Architecture to Reduce Undetected Errors Evaluated in the FABRIC Testbed. Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), 9(2), Article 19. https://doi.org/10.1145/3727111
Prateek Jain, Arash Sarabi, Abraham Matta, and Violet R. Syrotiuk. (2025). Design and Modeling of a New File Transfer Architecture to Reduce Undetected Errors Evaluated in the FABRIC Testbed. In Abstracts of the 2025 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS Abstracts ’25), June 9–13, 2025, Stony Brook, NY, USA. ACM. https://doi.org/10.1145/3726854.3727281
Resources
This work is supported in part by NSF grants CNS-2215671 and CNS-2215672.
If you have any questions or would like to discuss the MLED framework further, please feel free to contact me.