Research Motivation

I have a fascination with synergies created at grand scales. As an electronics engineer at Sun Microsystems Inc., I took part in designing and integrating a quarter billion transistors into a high-performance microprocessor. In 2004, after three years in the industry and two US patents, I decided to pursue a career in research, starting with a doctoral degree at the University of California, Berkeley. My work in the PALLAS group aims at extracting parallelism from complex concurrent applications and at automating a design flow that can efficiently utilize hundreds of processing elements on billion-transistor parallel platforms.

Here are some of my recent projects:

Scalable HMM based Inference Engine in Large Vocabulary Continuous Speech Recognition: Parallel scalability allows an application to efficiently utilize an increasing number of processing elements. In this paper we explore a design space for application scalability for an inference engine in large vocabulary continuous speech recognition (LVCSR). Our implementation of the inference engine involves a parallel graph traversal through an irregular graph-based knowledge network with millions of states and arcs. The challenge is not only to define a software architecture that exposes sufficient fine-grained application concurrency, but also to efficiently synchronize between an increasing number of concurrent tasks and to effectively utilize the parallelism opportunities in today’s highly parallel processors. We propose four application-level implementation alternatives we call “algorithm styles”, and construct highly optimized implementations on two parallel platforms: an Intel Core i7 multicore processor and an NVIDIA GTX280 manycore processor. The highest performing algorithm style varies with the implementation platform. On a data set of 44 minutes of speech, we demonstrate substantial speedups of 3.4x on Core i7 and 10.5x on GTX280 compared to a highly optimized sequential implementation on Core i7 without sacrificing accuracy. The parallel implementations contain less than 2.5% sequential overhead, promising scalability and significant potential for further speedup on future platforms.

Kisun You, Jike Chong, Youngmin Yi, Ekaterina Gonina, Christopher Hughes, Yen-Kuang Chen, Wonyong Sung, Kurt Keutzer, "Scalable HMM based Inference Engine in Large Vocabulary Continuous Speech Recognition", IEEE Signal Processing Magazine, March, 2010. (PDF)

Jike Chong, Ekaterina Gonina, Youngmin Yi, Kurt Keutzer, "A Fully Data Parallel WFST-based Large Vocabulary Continuous Speech Recognition on a Graphics Processing Unit", 10th Annual Conference of the International Speech Communication Association (InterSpeech), September, 2009. (PDF)

Jike Chong, Kisun You, Youngmin Yi, Ekaterina Gonina, Christopher Hughes, Wonyong Sung, Kurt Keutzer, "Scalable HMM based Inference Engine in Large Vocabulary Continuous Speech Recognition", IEEE International Conference on Multimedia & Expo (ICME), pages 1797-1800, July, 2009. (PDF)
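
As a rough illustration of why a small sequential portion matters for scalability, the sketch below applies Amdahl's law. It assumes that the "less than 2.5% sequential overhead" figure can be read as the serial fraction of the workload; the abstract does not define the figure precisely, so this reading and the function names are illustrative assumptions, not part of the published analysis.

```python
import math


def amdahl_speedup(sequential_fraction: float, n_workers: int) -> float:
    """Speedup predicted by Amdahl's law for a given serial fraction.

    Assumption: the serial fraction is constant as workers are added.
    """
    if not 0.0 <= sequential_fraction <= 1.0:
        raise ValueError("sequential_fraction must be in [0, 1]")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / n_workers)


def amdahl_limit(sequential_fraction: float) -> float:
    """Upper bound on speedup as the number of workers grows without bound."""
    if sequential_fraction == 0.0:
        return math.inf
    return 1.0 / sequential_fraction


if __name__ == "__main__":
    s = 0.025  # assumed serial fraction, taken from the 2.5% figure
    for n in (8, 64, 512):
        print(n, round(amdahl_speedup(s, n), 2))
    print("limit:", amdahl_limit(s))
```

Under this reading, a serial fraction of 2.5% caps the achievable speedup at 40x regardless of how many processing elements are added.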

Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors: Automatic speech recognition is a key technology for enabling rich human-computer interaction in emerging applications. Hidden Markov Model (HMM) based recognition approaches are widely used for modeling the human speech process by constructing probabilistic estimates of the underlying word sequence from an acoustic signal. High-accuracy speech recognition, however, requires complex models, large vocabulary sizes, and exploration of a very large search space, making the computation too intense for current personal and mobile platforms. In this paper, we explore opportunities for parallelizing the HMM based Viterbi search algorithm typically used for large-vocabulary continuous speech recognition (LVCSR), and present an efficient implementation on current many-core platforms. For the case study, we use a recognition model of 50,000 English words, with more than 500,000 word bigram transitions, and one million hidden states. We examine important implementation tradeoffs for shared-memory single-chip many-core processors by implementing LVCSR on the NVIDIA G80 Graphics Processing Unit (GPU) in Compute Unified Device Architecture (CUDA), leading to significant speedups. This work is an important step forward for LVCSR-based applications to leverage many-core processors in achieving real-time performance on personal and mobile computing platforms.

Jike Chong, Youngmin Yi, Arlo Faria, Nadathur Satish, Kurt Keutzer, "Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors", Proceedings of the 1st Annual Workshop on Emerging Applications and Many Core Architecture (EAMA), pages 23-35, June 2008. (PDF)
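
For readers unfamiliar with the Viterbi search mentioned above, here is a minimal sequential sketch on a toy two-state HMM. It is not the GPU implementation from the paper: the dense transition table, the toy probabilities, and all function names are assumptions chosen for illustration, and there is no pruning or large-vocabulary network. The inner loop over states is the part that is independent per state at each time step, which is the kind of structure a data-parallel implementation can exploit.

```python
import math


def to_log(table):
    """Convert a nested list of positive probabilities to log-probabilities."""
    if isinstance(table, list):
        return [to_log(entry) for entry in table]
    return math.log(table)


def viterbi(obs, log_init, log_trans, log_emit):
    """Return the most likely state sequence for the observation sequence."""
    if not obs:
        return []
    n_states = len(log_init)
    score = [log_init[s] + log_emit[s][obs[0]] for s in range(n_states)]
    backpointers = []
    for o in obs[1:]:
        new_score, pointers = [], []
        # Each destination state is updated independently of the others.
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[s][o])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy model (hypothetical numbers): two states, two observation symbols.
LOG_INIT = to_log([0.5, 0.5])
LOG_TRANS = to_log([[0.9, 0.1], [0.1, 0.9]])
LOG_EMIT = to_log([[0.9, 0.1], [0.1, 0.9]])

if __name__ == "__main__":
    print(viterbi([0, 0, 1, 1], LOG_INIT, LOG_TRANS, LOG_EMIT))  # [0, 0, 1, 1]
```

Log-probabilities are used so that long observation sequences do not underflow.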

Efficient Parallelization of H.264 Decoding: H.264 is an advanced video compression algorithm. It contains complex Digital Signal Processing (DSP) routines and significant control flow. The complexity enables a very high data compression rate, making it an ideal standard to use in today's media-rich communication and storage domains. However, its sequential control flow makes H.264 decoding very difficult to parallelize. My work proposes a functional parallelization technique to expose the rich data-level parallelism in the decoding process, allowing effective parallelization of H.264 on multiprocessor platforms.

Jike Chong, Nadathur Satish, Bryan Catanzaro, Kaushik Ravindran, Kurt Keutzer, "Efficient Parallelization of H.264 Decoding with Macro Block Level Scheduling", ICME, July 2007. (PDF)
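
The abstract does not describe the scheduler itself, so the following is only a sketch of the commonly described macroblock wavefront pattern, not the technique from the paper. It assumes each macroblock depends on its left, top-left, top, and top-right neighbors, and groups macroblocks that can be decoded concurrently. The function names and the frame size are hypothetical.

```python
from collections import defaultdict


def neighbors(x, y, width):
    """Assumed dependencies: left, top-left, top, and top-right macroblocks."""
    candidates = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
    return [(nx, ny) for nx, ny in candidates if 0 <= nx < width and ny >= 0]


def wavefront_schedule(width, height):
    """Group macroblocks into waves; blocks within a wave are independent.

    A macroblock at (x, y) is placed in wave x + 2*y, so every assumed
    dependency falls in a strictly earlier wave.
    """
    waves = defaultdict(list)
    for y in range(height):
        for x in range(width):
            waves[x + 2 * y].append((x, y))
    return [waves[k] for k in sorted(waves)]


if __name__ == "__main__":
    for index, wave in enumerate(wavefront_schedule(4, 3)):
        print(index, wave)
```

The schedule shows that available parallelism ramps up and down across a frame, which is one reason macroblock-level scheduling decisions affect how well the decoder uses multiple processors.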

MILP for Task Allocation and Scheduling: Task allocation and scheduling for heterogeneous multi-core platforms must be automated for such platforms to be successful. Techniques such as Mixed Integer Linear Programming (MILP) provide the ability to easily customize the allocation and scheduling problem to application or platform-specific peculiarities. The representation of the core problem in a MILP form has a large impact on the solution time required. In this paper, we investigate a variety of such representations and propose a taxonomy for them. A promising representation is chosen with extensive computational characterization. The MILP formulation is customized for a multimedia case study involving the deployment of a Motion JPEG encoder application onto a Xilinx Virtex II Pro FPGA platform. We demonstrate that our approach can produce solutions that are competitive with manual designs.

Abhijit Davare, Jike Chong, Qi Zhu, Douglas Michael Densmore and Alberto L. Sangiovanni-Vincentelli, "Classification, Customization, and Characterization: Using MILP for Task Allocation and Scheduling", Technical Report, EECS Department, University of California, Berkeley, UCB/EECS-2006-166, 2006. (HTML, PDF)

A Flexible and Scalable DRAM Interface for Networking: A fundamental challenge to successful deployment of DRAMs is the availability of a flexible and scalable DRAM interface. This is exacerbated by the application-specific nature of the logic-side DRAM interface. This paper presents a study that attempts to overcome this challenge for the networking application domain. We quantify the various challenges and present techniques that were implemented to build a flexible and scalable interface to an existing multi-port memory controller for DDR DRAM using an FPGA. We demonstrate the deployment of this new interface in two example applications. We present two novel techniques that enable us to reduce the latency of DRAM-related memory accesses and improve throughput.

Jike Chong, Chidamber Kulkarni, Gordon Brebner, "Building a Flexible and Scalable DRAM Interface for Networking Applications on FPGAs", WMPI 2006. (PDF)

Extensible and Scalable Time Triggered Scheduling: We investigate how to design a system that can accommodate additional functionality either with no changes to the design or by adding architectural modules without changing the implementation of the legacy functionality. This objective is very relevant to industrial domains where an architecture is designed before the full range of functionalities to support is known. We focus on an important aspect of the design of automotive systems: the scheduling problem for hard real-time distributed embedded systems. Two metrics are used to capture the design goals. The metrics are optimized subject to a set of constraints within a mathematical programming framework. The cost of modifying a legacy system is characterized at an Electronic Control Unit (ECU) component level. Results obtained in automotive applications show that the optimization framework is effective in reducing development and re-verification efforts after incremental design changes.

Zheng Wei, Jike Chong, Claudio Pinello, Sri Kanajan and Alberto L. Sangiovanni-Vincentelli, "Extensible and Scalable Time Triggered Scheduling", Proceedings of the Fifth International Conference on Application of Concurrency to System Design, 2005. (PDF)