  • Poster presentation
  • Open access

A hierarchical model of vision (HMAX) can also recognize speech

HMAX is a well-known computational model of visual recognition in cortex consisting of just two computational operations – a “template match” and non-linear pooling – alternating in a feedforward hierarchy in which receptive fields exhibit increasing specificity and invariance [1]. Interestingly, auditory recognition problems (such as speech recognition) share similar computational requirements, and recent work in auditory neuroscience suggests that auditory and visual cortex share similar anatomical and functional organization. Based on these similarities, we tested whether HMAX could support an auditory recognition task (specifically, word spotting).
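The two alternating operations can be illustrated with a minimal sketch. It assumes a Gaussian (radial basis) similarity for the "template match" and a max over non-overlapping neighborhoods for the pooling step; the function names `s_layer` and `c_layer`, the toy 5 x 5 input, and all parameter values are hypothetical and are not taken from the model in [1] or [4].

```python
import math

def s_layer(image, template, sigma=1.0):
    """Template match: Gaussian similarity between the template and every
    same-sized patch of the input (1.0 means a perfect match)."""
    th, tw = len(template), len(template[0])
    out = []
    for r in range(len(image) - th + 1):
        row = []
        for c in range(len(image[0]) - tw + 1):
            d2 = sum((image[r + i][c + j] - template[i][j]) ** 2
                     for i in range(th) for j in range(tw))
            row.append(math.exp(-d2 / (2 * sigma ** 2)))
        out.append(row)
    return out

def c_layer(responses, pool=2):
    """Non-linear pooling: max over non-overlapping pool x pool blocks."""
    n_rows, n_cols = len(responses), len(responses[0])
    out = []
    for r in range(0, n_rows, pool):
        row = []
        for c in range(0, n_cols, pool):
            block = [responses[i][j]
                     for i in range(r, min(r + pool, n_rows))
                     for j in range(c, min(c + pool, n_cols))]
            row.append(max(block))
        out.append(row)
    return out

def blank(n=5):
    return [[0.0] * n for _ in range(n)]

# Toy input: the same 2 x 2 pattern placed at two nearby positions.
template = [[1.0, 0.0], [0.0, 1.0]]
img_a = blank()
img_a[1][2] = img_a[2][3] = 1.0   # pattern with top-left corner at (1, 2)
img_b = blank()
img_b[0][3] = img_b[1][4] = 1.0   # same pattern shifted to (0, 3)

c_a = c_layer(s_layer(img_a, template))
c_b = c_layer(s_layer(img_b, template))
print(c_a[0][1], c_b[0][1])
```

In this toy case the S-layer response peaks at a different position for each input, while the pooled C-layer unit covering both positions gives the same response. This is the sense in which alternating the two operations trades specificity for invariance.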

To test HMAX on word spotting, recorded speech samples from the TIMIT corpus [2] were first converted into time-frequency spectrograms using a computational model of the auditory periphery [3]. These spectrograms were then split into 750 ms frames and input to a standard HMAX model [4]. Based on observed similarities between receptive fields in primary auditory cortex (spectro-temporal receptive fields, or STRFs) and primary visual cortex (typically modeled as oriented Gabor filters), we used S1 filters identical to those used in vision [4]. Similarly, S2 “patches” were randomly selected from C1 representations of speech sounds drawn from an independent speech corpus. One-vs-all linear support vector machines (SVMs) were then trained to discriminate frames that contained a target word from those that did not. These SVMs were then tested on a novel set of test sentences using a sliding-frame approach (750 ms frame size, 20 ms step size). For each frame in a sentence, the SVM produced a distance from the hyperplane, and a threshold was applied to produce a binary classification of whether the target word was present in the sentence. When tested on target words that appeared in a fixed context (i.e., SA sentences in TIMIT), performance was highly robust, with ROC areas consistently above 0.9. When tested on target words that appeared in variable contexts (i.e., SI sentences in TIMIT), performance decreased somewhat, with ROC areas around 0.8. This decrease is likely due to the inclusion of “clutter” (i.e., target-irrelevant features) within the frame, an effect also commonly observed when HMAX is applied to visual object recognition tasks [1].
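The sliding-frame test and the ROC summary can be sketched as follows. The frame and step sizes come from the text; everything else is an assumption made for illustration. In particular, the abstract does not state how per-frame distances are combined into a sentence-level decision, so the sketch takes the maximum signed distance over frames, and the ROC area is computed with the standard rank-based formulation rather than any specific toolbox. The helper names and the example scores are hypothetical.

```python
def frame_starts(duration_ms, frame_ms=750, step_ms=20):
    """Start times (ms) of every full frame that fits in a sentence."""
    if duration_ms < frame_ms:
        return []
    return list(range(0, duration_ms - frame_ms + 1, step_ms))

def sentence_score(frame_distances):
    """Assumption: a sentence is scored by its largest signed distance
    from the SVM hyperplane across all frames."""
    return max(frame_distances)

def contains_target(frame_distances, threshold=0.0):
    """Binary decision obtained by thresholding the sentence score."""
    return sentence_score(frame_distances) > threshold

def roc_area(pos_scores, neg_scores):
    """Area under the ROC curve, computed as the probability that a
    target-present sentence outscores a target-absent one (ties = 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# A 1000 ms sentence yields frames starting at 0, 20, ..., 240 ms.
starts = frame_starts(1000)
print(len(starts), starts[-1])

# Made-up sentence scores, purely to exercise the function.
print(roc_area([0.9, 0.4], [0.1, 0.5]))
```

Sweeping the threshold in `contains_target` over all observed scores traces out the ROC curve; the rank-based area avoids choosing any single threshold.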

These results are novel in that they provide support for the hypothesis that the simple computational framework implemented in HMAX – consisting of a feedforward hierarchy of only two alternating computational operations – may generalize beyond vision to support auditory recognition as well. It is possible that such a representation could give rise to stable neural encodings that are invariant to behaviorally irrelevant characteristics as seen in higher order visual and auditory cortices [5, 6]. While it is likely that this auditory version of the HMAX model would benefit from the use of more auditory-specific filters based on STRF models [7], the Gabor features used here are largely compatible with previous computational models based on STRFs up to the level of primary auditory cortex [8]. Additional benefit may also be gained by learning sparse representations from natural sounds, at both the S1 and S2 levels [9].
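To make the Gabor-to-STRF comparison concrete, the sketch below builds a single oriented Gabor kernel using the common form of a Gaussian envelope multiplied by a cosine carrier. The function name and all parameter values are placeholders chosen for illustration, not the S1 settings of [4]. It also assumes the kernel is applied to a spectrogram whose rows index frequency and whose columns index time.

```python
import math

def gabor_kernel(size=7, wavelength=4.0, theta=0.0, sigma=2.0, gamma=0.5):
    """Oriented Gabor kernel: Gaussian envelope times a cosine carrier.
    theta rotates the carrier; gamma sets the envelope's aspect ratio."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):          # rows (frequency axis, assumed)
        row = []
        for x in range(-half, half + 1):      # columns (time axis, assumed)
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

temporal = gabor_kernel(theta=0.0)            # carrier varies along time
spectral = gabor_kernel(theta=math.pi / 2)    # carrier varies along frequency
sweep = gabor_kernel(theta=math.pi / 4)       # oblique: frequency-sweep-like
```

Under the stated axis assumption, the orientation of the kernel determines whether it responds to temporal modulation, spectral modulation, or a joint pattern such as a frequency sweep, which is the kind of selectivity that STRF models describe.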


References

  1. Riesenhuber M, Poggio T: Hierarchical models of object recognition in cortex. Nat Neurosci. 1999, 2: 1019-25. 10.1038/14819.
  2. Garofolo JS: TIMIT Acoustic-Phonetic Continuous Speech Corpus. 1993
  3. Yang X, Wang K, Shamma SA: Auditory representations of acoustic signals. IEEE Trans Inf Theory. 1992, 38: 824-839. 10.1109/18.119739.
  4. Serre T, Wolf L, Bileschi S, Riesenhuber M, Poggio T: Robust object recognition with cortex-like mechanisms. IEEE Trans Pattern Anal Mach Intell. 2007, 29: 411-26.
  5. Quiroga RQ, Reddy L, Kreiman G, Koch C, Fried I: Invariant visual representation by single neurons in the human brain. Nature. 2005, 435: 1102-7. 10.1038/nature03687.
  6. Chan AM, Dykstra AR, Jayaram V, Leonard MK, Travis KE, Gygi B, Baker JM, Eskandar E, Hochberg LR, Halgren E, Cash SS: Speech-Specific Tuning of Neurons in Human Superior Temporal Gyrus. Cereb Cortex. 2013, 10.1093/cercor/bht127.
  7. Theunissen FE, Sen K, Doupe AJ: Spectral-Temporal Receptive Fields of Nonlinear Auditory Neurons Obtained Using Natural Sounds. J Neurosci. 2000, 20: 2315-2331.
  8. Mesgarani N, Shamma S, Slaney M: Speech discrimination based on multiscale spectro-temporal modulations. 2004 IEEE Int Conf Acoust Speech, Signal Process. 2004, 1: 601-4. 10.1109/ICASSP.2004.1326057.
  9. Hu X, Zhang J, Li J, Zhang B: Sparsity-Regularized HMAX for Visual Recognition. PLoS One. 2014, 9: e81813. 10.1371/journal.pone.0081813.


Author information

Corresponding author

Correspondence to Matthew J Roos.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Roos, M.J., Wolmetz, M. & Chevillet, M.A. A hierarchical model of vision (HMAX) can also recognize speech. BMC Neurosci 15 (Suppl 1), P187 (2014).
