Energy-scalable Speech-recognition Circuits
MTL Annual Report 2012 (https://mtlsites.mit.edu/annual_reports/2012/energy-scalable-speech-recognition-circuits/)

Speech recognition is becoming a ubiquitous component of the digital infrastructure, serving as an I/O adapter between people and electronic devices. However, the computational demands of speech recognition make it difficult to integrate into a wide variety of systems. Today's popular practice of transmitting voice data to cloud servers requires an Internet connection that may impose unwanted complexity, bandwidth, and latency constraints. In order to lift these constraints, we are developing a hardware speech decoder architecture that can easily be scaled to trade off among performance (vocabulary, decoding speed, and accuracy), power consumption, and cost.

We are now designing a "baseline" speech-recognizer IC to serve as a starting point for architectural studies and improved designs in the future. This chip is intended to decode the 5,000-word Wall Street Journal (Nov. 1992) data set in real time with 10% word error rate (WER) and system power consumption of 100 mW. Building on the architecture of [1], it performs a Viterbi search over a hidden Markov model using industry-standard weighted finite-state transducer (WFST) transition probabilities [2] and Gaussian mixture model (GMM) emission probabilities. The WFST and GMM parameters are stored in an off-chip NAND flash memory; models for different speakers, vocabularies, and/or languages can be prepared offline and loaded from a computer. The chip integrates the front-end, modeling, and search components needed to convert audio samples from an ADC directly to text without software assistance. Its tradeoffs among energy, speed, and accuracy can be manipulated via model complexity, runtime parameters (e.g., beam width), and voltage/frequency scaling.
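The decoding loop described above can be illustrated in software. The following is a minimal Python sketch, not the chip's actual datapath: the data structures (`active`, `arcs`, `emissions`) and function names are hypothetical, and real systems evaluate diagonal-covariance GMMs and beam-prune over a much larger WFST. It shows one frame of beam-pruned Viterbi search, with emission scores computed from a GMM.

```python
import math

def log_gmm_emission(frame, mixtures):
    """Log-likelihood of a feature frame under a diagonal-covariance GMM.

    mixtures: list of (log_weight, means, log_vars) tuples, one per
    Gaussian component; all vectors have the frame's dimensionality.
    """
    comp_lls = []
    for log_w, means, log_vars in mixtures:
        ll = log_w
        for x, m, lv in zip(frame, means, log_vars):
            # log N(x; m, var) with var stored as log-variance
            ll += -0.5 * (lv + math.log(2 * math.pi)
                          + (x - m) ** 2 / math.exp(lv))
        comp_lls.append(ll)
    # log-sum-exp over mixture components
    top = max(comp_lls)
    return top + math.log(sum(math.exp(p - top) for p in comp_lls))

def viterbi_frame(active, arcs, emissions, beam_width):
    """One frame of beam-pruned Viterbi search over a WFST.

    active:    {state: log score} for the current active state list
    arcs:      {state: [(next_state, log_trans_prob, gmm_id), ...]}
    emissions: {gmm_id: log emission score for this frame}
    Returns the pruned active state list for the next frame.
    """
    new_active = {}
    for state, score in active.items():
        for next_state, log_trans, gmm_id in arcs.get(state, []):
            s = score + log_trans + emissions[gmm_id]
            if s > new_active.get(next_state, -math.inf):
                new_active[next_state] = s
    # Beam pruning: keep only hypotheses within beam_width of the best
    best = max(new_active.values())
    return {s: v for s, v in new_active.items() if v >= best - beam_width}
```

Narrowing `beam_width` shrinks the active state list, trading accuracy for fewer memory accesses per frame, which is the runtime energy/accuracy knob mentioned above.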

Memory access is expected to be the limiting factor in both decoding speed and power consumption. Previous FPGA implementations required DDR SDRAM with multi-GB/s bandwidth, which is not practical for low-power systems. We are focusing our efforts on minimizing off-chip memory bandwidth demands using model compression (e.g., nonlinear quantization of parameters [3]), access reordering, and caching techniques. Future implementations will also allow larger active state list sizes to improve decoding accuracy, especially with larger vocabularies.
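One common form of nonlinear parameter quantization is a scalar codebook trained with the Lloyd-Max (1-D k-means) algorithm. The sketch below is illustrative only and is not taken from [3]; the function name and parameters are hypothetical. It shows how GMM means or variances could be replaced by small indices into a per-parameter codebook, so that, for example, 32-bit floats stored as 4-bit indices cut off-chip bandwidth by roughly 8x plus a small codebook.

```python
def lloyd_max_quantize(values, levels, iters=20):
    """Train a 1-D nonlinear quantizer and encode `values` with it.

    Returns (codebook, indices): `codebook` holds `levels` reconstruction
    points placed by Lloyd-Max iteration (denser where data is denser),
    and `indices[i]` is the codebook entry nearest to values[i].
    """
    lo, hi = min(values), max(values)
    # Start from a uniform codebook over the data range
    codebook = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        # Assignment step: bucket each value with its nearest code point
        buckets = [[] for _ in range(levels)]
        for v in values:
            nearest = min(range(levels), key=lambda k: abs(v - codebook[k]))
            buckets[nearest].append(v)
        # Update step: move each code point to its bucket's centroid
        codebook = [sum(b) / len(b) if b else codebook[i]
                    for i, b in enumerate(buckets)]
    indices = [min(range(levels), key=lambda k: abs(v - codebook[k]))
               for v in values]
    return codebook, indices
```

At decode time the hardware would fetch only the index stream and dequantize by codebook lookup on-chip, which is where the bandwidth saving comes from.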