Energy-scalable Speech-recognition Circuits
MTL Annual Report 2012 (https://mtlsites.mit.edu/annual_reports/2012/energy-scalable-speech-recognition-circuits/)

Speech recognition is becoming a ubiquitous component of the digital infrastructure, serving as an I/O adapter between people and electronic devices. However, the computational demands of speech recognition make it difficult to integrate into a wide variety of systems. Today's popular practice of transmitting voice data to cloud servers requires an Internet connection that may impose unwanted complexity, bandwidth, and latency constraints. In order to lift these constraints, we are developing a hardware speech decoder architecture that can easily be scaled to trade off among performance (vocabulary, decoding speed, and accuracy), power consumption, and cost.

We are now designing a "baseline" speech-recognizer IC to serve as a starting point for architectural studies and improved designs in the future. This chip is intended to decode the 5,000-word Wall Street Journal (Nov. 1992) data set in real time with 10% word error rate (WER) and system power consumption of 100 mW. Building on the architecture of [1], it performs a Viterbi search over a hidden Markov model using industry-standard weighted finite-state transducer (WFST) transition probabilities [2] and Gaussian mixture model (GMM) emission probabilities. The WFST and GMM parameters are stored in an off-chip NAND flash memory; models for different speakers, vocabularies, and/or languages can be prepared offline and loaded from a computer. The chip integrates the front-end, modeling, and search components needed to convert audio samples from an ADC directly to text without software assistance. Its tradeoffs among energy, speed, and accuracy can be manipulated via model complexity, runtime parameters (e.g., beam width), and voltage/frequency scaling.
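The decoding loop described above can be illustrated in software. The following is a minimal Python sketch, not the chip's actual datapath: the data structures (`active`, `arcs`, `emissions`) and function names are hypothetical, and real systems evaluate diagonal-covariance GMMs and beam-prune over a much larger WFST. It shows one frame of beam-pruned Viterbi search, with emission scores computed from a GMM.

```python
import math

def log_gmm_emission(frame, mixtures):
    """Log-likelihood of a feature frame under a diagonal-covariance GMM.

    mixtures: list of (log_weight, means, log_vars) tuples, one per
    Gaussian component; all vectors have the frame's dimensionality.
    """
    comp_lls = []
    for log_w, means, log_vars in mixtures:
        ll = log_w
        for x, m, lv in zip(frame, means, log_vars):
            # log N(x; m, var) with var stored as log-variance
            ll += -0.5 * (lv + math.log(2 * math.pi)
                          + (x - m) ** 2 / math.exp(lv))
        comp_lls.append(ll)
    # log-sum-exp over mixture components
    top = max(comp_lls)
    return top + math.log(sum(math.exp(p - top) for p in comp_lls))

def viterbi_frame(active, arcs, emissions, beam_width):
    """One frame of beam-pruned Viterbi search over a WFST.

    active:    {state: log score} for the current active state list
    arcs:      {state: [(next_state, log_trans_prob, gmm_id), ...]}
    emissions: {gmm_id: log emission score for this frame}
    Returns the pruned active state list for the next frame.
    """
    new_active = {}
    for state, score in active.items():
        for next_state, log_trans, gmm_id in arcs.get(state, []):
            s = score + log_trans + emissions[gmm_id]
            if s > new_active.get(next_state, -math.inf):
                new_active[next_state] = s
    # Beam pruning: keep only hypotheses within beam_width of the best
    best = max(new_active.values())
    return {s: v for s, v in new_active.items() if v >= best - beam_width}
```

Narrowing `beam_width` shrinks the active state list, trading accuracy for fewer memory accesses per frame, which is the runtime energy/accuracy knob mentioned above.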

Memory access is expected to be the limiting factor in both decoding speed and power consumption. Previous FPGA implementations required DDR SDRAM with multi-GB/s bandwidth, which is not practical for low-power systems. We are focusing our efforts on minimizing off-chip memory bandwidth demands using model compression (e.g., nonlinear quantization of parameters [3]), access reordering, and caching techniques. Future implementations will also allow larger active state list sizes to improve decoding accuracy, especially with larger vocabularies.
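One common form of nonlinear parameter quantization is a scalar codebook trained with the Lloyd-Max (1-D k-means) algorithm. The sketch below is illustrative only and is not taken from [3]; the function name and parameters are hypothetical. It shows how GMM means or variances could be replaced by small indices into a per-parameter codebook, so that, for example, 32-bit floats stored as 4-bit indices cut off-chip bandwidth by roughly 8x plus a small codebook.

```python
def lloyd_max_quantize(values, levels, iters=20):
    """Train a 1-D nonlinear quantizer and encode `values` with it.

    Returns (codebook, indices): `codebook` holds `levels` reconstruction
    points placed by Lloyd-Max iteration (denser where data is denser),
    and `indices[i]` is the codebook entry nearest to values[i].
    """
    lo, hi = min(values), max(values)
    # Start from a uniform codebook over the data range
    codebook = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        # Assignment step: bucket each value with its nearest code point
        buckets = [[] for _ in range(levels)]
        for v in values:
            nearest = min(range(levels), key=lambda k: abs(v - codebook[k]))
            buckets[nearest].append(v)
        # Update step: move each code point to its bucket's centroid
        codebook = [sum(b) / len(b) if b else codebook[i]
                    for i, b in enumerate(buckets)]
    indices = [min(range(levels), key=lambda k: abs(v - codebook[k]))
               for v in values]
    return codebook, indices
```

At decode time the hardware would fetch only the index stream and dequantize by codebook lookup on-chip, which is where the bandwidth saving comes from.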