Streaming and optimization of MapReduce

Sponsor: NSF


As the field stands today, operating systems provide a poor interface for data-intensive computing, requiring programmers to engage in tedious, non-reconfigurable, and error-prone code development. These software-engineering practices often lead to easily exploitable vulnerabilities and devastating security breaches. This project creates the algorithms and implementation for a virtual-stream interface that enables development of big-data software that is more easily managed, simpler to understand, inherently faster, and bug-free.

Conference Papers

  • G. Stella and D. Loguinov, "On High-Latency Bowtie Data Streaming," IEEE BigData, December 2022.

  • A. Arman and D. Loguinov, "Origami: A High-Performance Mergesort Framework," VLDB, September 2022.

  • C. Hanel, A. Arman, D. Xiao, J. Keech, and D. Loguinov, "Vortex: Extreme-Performance Memory Abstractions for Data-Intensive Streaming Applications," ACM ASPLOS, March 2020.



Vortex is a high-performance streaming produce-consumer framework for large data workloads, with the reference implementation included below. The code also contains such applications as file I/O, in-memory pipelines, and in-place integer sorting. It can be compiled on Windows and Linux, each explained in a separate section below.

  • Windows

    1) Visual Studio 2019 16.3.10 or earlier is recommended. Later versions of the compiler have problems avoiding conditional branches in sorting networks, for which we implemented a fix; however, there is no guarantee that future versions of VS won't break. A more long-term solution would be conversion of sorting networks to assembly, which we may release in the future.

    2) Make sure to check that the project is set for x64 Release.

    3) To manage physical memory in user space, Windows requires that the SE_LOCK privilege be enabled for the current user. It is not sufficient to just be an administrator.

    4) The code itself must run with elevated privileges, which can be done by any of the following: a) running Visual Studio as administrator; b) right clicking the executable generated by Visual Studio and selecting "Run as Administrator"; or c) disabling the Run All Administrators in Admin Approval Mode local security policy. Additionally, the project manifest can be modified to automatically execute the program with admin permissions.

    5) For maximum speed, Windows Server 2016 or older is recommended. In newer kernels, MapUserPhysicalPages() (i.e., the kernel API utilized by Vortex to map/unmap pages) is much slower, e.g., by 50% in Server 2019 (build 17763) and 60% in Windows 10 (build 19035). Additionally, both of these operating systems experience memory "rot" after some uptime, which causes memory-intensive benchmarks (e.g., producer-consumer, sorting) to take a ~30% performance hit. This can be fixed by rebooting the system.
  • Linux

    1) The code has been tested in Ubuntu 18.10 and gcc 9.3. If earlier versions of gcc are used, _xgetbv() may need to be invoked manually via inline assembly. This function is used to detect presence of AVX, but is not critical to the rest of Vortex. It can thus be commented out and cpuId.avx be hardcoded to either true or false.

    2) Compilation is performed via the command line with the included makefile in Vortex-1.0/Vortex. Type "make" and "./Vortex" upon success.

    3) Due to time limitations, file I/O benchmarks have not been ported to Linux due to the extra effort needed to rewrite the overlapped I/O model of Windows.

    4) Linux mremap() does not support remapping across VMAs. This leads to a small increase in memory overhead during sorting compared to Windows.
  • Additional considerations

    1) Mitigation of Spectre/Meltdown bugs affects the speed of Vortex bucket sort and various memory-intensive benchmarks. As such, it is recommended that these features be disabled to achieve peak performance.

    2) The bucket sort prefers to use AVX offload during write-combine for a small gain in speed; however, certain motherboards run AVX at lower frequencies, which can make AVX slower than SSE. One option for overcoming this is to set the AVX offset to zero in BIOS. The other is to assign cpuId.avx = false and let Vortex fall back to SSE intrinsics.

    3) If sorting large inputs (e.g., 256 GB), two levels of write-combine might be beneficial, especially on multi-socket Xeons; however, this is currently not implemented. Additionally, big arrays may need 5-level paging from the OS and CPU to avoid running out of virtual memory. Future work will look into this further.

    4) While the main aim of Vortex sort is multi-GB datasets, small inputs (e.g., a few hundred bytes) are supported; however, they are not meant to be top-speed or memory-efficient.

    5) For completeness, the code is templated to handle 8/16/32-bit keys in addition to regular 64-bit keys, but these use cases are not well-developed, extensively-tested, or super-efficient.

    6) Vortex sort can handle non-uniform keys without overflowing memory or running into problems; however, the nature of MSD radix sort guarantees that certain skewed distributions will incur a performance drop. Future work will examine how to overcome these issues and offer additional improvements (e.g., sorting key-value pairs & variable-size keys).

The code is released under the GPLv3 license. The source can be downloaded here or from github.


Origami is a fast SIMD mergesort framework for 32/64-bit keys and 128-bit key-value pairs. It is released under the GPLv3 license and its source is available from github or local download.


Tuxedo is an efficient external-memory framework for bowtie streaming (i.e., reading/writing concurrently from/to multiple files). Its main contribution is to minimize the impact of seek delays on application throughput. The Tuxedo source code is available from Dmitri Loguinov.

     Copyright 2002-2023 IRL at Texas A&M. All Rights Reserved.