Published by De Gruyter Mouton, December 11, 2019

Computational construction grammar for visual question answering

Jens Nevens, Paul Van Eecke and Katrien Beuls
From the journal Linguistics Vanguard

Abstract

In order to answer a natural language question, a computational system needs three main capabilities. First, the system needs to be able to analyze the question into a structured query, revealing its component parts and how these are combined. Second, it needs to have access to relevant knowledge sources, such as databases, texts or images. Third, it needs to be able to execute the query on these knowledge sources. This paper focuses on the first capability, presenting a novel approach to semantically parsing questions expressed in natural language. The method makes use of a computational construction grammar model for mapping questions onto their executable semantic representations. We demonstrate and evaluate the methodology on the CLEVR visual question answering benchmark task. Our system achieves 100% accuracy, effectively solving the language understanding part of the benchmark task. Additionally, we demonstrate how this solution can be embedded in a full visual question answering system, in which a question is answered by executing its semantic representation on an image. The main advantages of the approach include (i) its transparent and interpretable properties, (ii) its extensibility, and (iii) the fact that the method does not rely on any annotated training data.
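The pipeline the abstract describes, mapping a question onto an executable semantic representation and then executing it on a scene, can be illustrated with a minimal sketch. The scene encoding, the program format, and the primitive names (`filter_color`, `filter_shape`, `count`) below are illustrative assumptions loosely modeled on CLEVR-style functional programs, not the paper's actual Fluid Construction Grammar implementation.

```python
# Illustrative sketch (assumed representations, not the paper's system):
# a question's semantic representation is a linear functional program
# that is executed step by step on a symbolic scene description.

scene = [
    {"shape": "cube",   "color": "red",  "size": "large"},
    {"shape": "sphere", "color": "blue", "size": "small"},
    {"shape": "cube",   "color": "blue", "size": "small"},
]

# Hypothetical semantic representation for "How many blue cubes are there?"
program = [("filter_color", "blue"), ("filter_shape", "cube"), ("count", None)]

def execute(program, scene):
    """Run a linear functional program over the scene."""
    result = scene
    for op, arg in program:
        if op.startswith("filter_"):
            attribute = op.split("_", 1)[1]  # e.g. "color" or "shape"
            result = [obj for obj in result if obj[attribute] == arg]
        elif op == "count":
            result = len(result)
    return result

print(execute(program, scene))  # -> 1
```

In the full system, the construction grammar supplies the mapping from the question string to such a program, and the scene description is extracted from the image; only the execution step is shown here.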

Funding source: FWO

Award Identifier / Grant number: 1SB6219N

Funding statement: We would like to thank Roxana Radulescu, Mathieu Reymond and Kyriakos Efthymiadis for the brainstorming sessions that have led to this publication. We also thank Remi van Trijp for his constructive feedback on earlier versions of this paper. Finally, we are grateful to the two anonymous reviewers for their valuable comments that greatly improved the final version of this paper. This work was supported by the Research Foundation Flanders (FWO), funder id: http://dx.doi.org/10.13039/501100003130, through grant 1SB6219N.


Received: 2018-12-03
Accepted: 2019-07-18
Published Online: 2019-12-11

© 2019 Walter de Gruyter GmbH, Berlin/Boston
