Selected Publications

Existing wired optical interconnects face a challenge of supporting wide-spread communications in production clusters. Initial proposals are constrained to only support hotspots between a small number of racks (e.g., 2 or 4) at a time, reconfigurable at milliseconds. Recent efforts on reducing optical circuit reconfiguration time from milliseconds to microseconds partially mitigate this problem by rapidly time-sharing optical circuits across more nodes, but are still limited by the total number of parallel circuits available simultaneously. In this paper, we seek an optical interconnect that can enable unconstrained communications within a computing cluster of thousands of servers. In particular, we present MegaSwitch, a multi-fiber ring optical fabric that exploits space division multiplexing across multiple fibers to deliver rearrangeably non-blocking communications to 30+ racks and 6000+ servers. We have implemented a 5-rack 40-server MegaSwitch prototype with real optical devices, and used testbed experiments as well as large-scale simulations to explore MegaSwitch’s architectural benefits and tradeoffs.
USENIX NSDI’17

Cloud applications generate a mix of flows with and without deadlines. Scheduling such mix-flows is a key challenge; our experiments show that trivially combining existing schemes for deadline/non-deadline flows is problematic. For example, prioritizing deadline flows hurts flow completion time (FCT) for non-deadline flows, with minor improvement for deadline miss rate. We present Karuna, a first systematic solution for scheduling mix-flows.
ACM SIGCOMM’16

Leveraging application-level requirements using coflows has recently been shown to improve application-level communication performance in data-parallel clusters. However, existing coflow-based solutions rely on modifying applications to extract coflows, making them inapplicable to many practical scenarios. In this paper, we present CODA, a first attempt at automatically identifying and scheduling coflows without any application-level modifications. We employ an incremental clustering algorithm to perform fast, application-transparent coflow identification and complement it by proposing an error-tolerant coflow scheduler to mitigate occasional identification errors. Testbed experiments and large-scale simulations with production workloads show that CODA can identify coflows with over 90% accuracy, and its scheduler is robust to inaccuracies, enabling communication stages to complete 2.4x (5.1x) faster on average (95-th percentile) compared to per-flow mechanisms. Overall, CODA's performance is comparable to that of solutions requiring application modifications.
ACM SIGCOMM’16

Recent Publications

All Publications

  • Enabling Wide-spread Communications on Optical Fabric with MegaSwitch

    USENIX NSDI’17


  • Enabling ECN over Generic Packet Scheduling

    ACM CoNEXT’16


  • Scheduling Mix-flows in Commodity Datacenters with Karuna

    ACM SIGCOMM’16


  • CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark

    ACM SIGCOMM’16


  • Enabling ECN in Multi-Service Multi-Queue Data Centers

    USENIX NSDI’16


Projects

  • Angel: Network-Accelerated Large-Scale Machine Learning

    Angel is an in-house large-scale machine learning framework at Tencent. We cooperated with the Technology Engineering Group (TEG) and developed a network accelerator. Through algorithm-specific flow scheduling, we achieved a 70x reduction in job completion time compared to vanilla Apache Spark.

  • Chukonu: Application-Aware Networking

    Datacenters exist because a standalone server or rack can no longer meet the requirements of modern applications: web search, ad recommendation, online commerce, machine learning, etc. Unlike traditional networks, datacenter networks offer high bandwidth, low latency, and minimal packet loss. These features, however, are not fully utilized today, because application developers are usually unfamiliar with the datacenter environment, the networking stack, or how to tune it. We aim to design a system that lets application developers access networking functions in datacenters and unlock their full potential.

Professional & Teaching Experience

Internship experiences:

  • Software Engineer at Tencent (Jun 2015 - Now)
  • Technology Analyst at Royal Bank of Scotland (Jun - Aug 2010)

I have been a teaching assistant for the following courses at HKUST:

  • COMP 3511 - Operating Systems
  • COMP 4621 - Computer Communication Networks I
  • ELEC 2100 - Signals and Systems
  • ELEC 2600 - Probability and Random Processes in Engineering
  • ELEC 4120 - Computer Communication Networks
  • ELEC 5350 - Multimedia Networking

Awards

2016

MSRA Ph.D. Fellowship

2011 - Now

HKUST Postgraduate Studentship

2013

HKUST Research Travel Grant

2010

Meritorious Winner, Mathematical Contest in Modeling (MCM)

2010

The Commercial Radio 50th Anniversary Scholarships

2007 - 2011

HKUST Scholarship for Continuing UG Students

...