Abstract
Much of the discussion in the previous chapters has focused on a bag-of-words representation of text. Although this representation suffices for many practical applications, there are settings in which the sequential structure of text becomes important.
Notes
1. The original definition of the distance graph [15] differs from the skip-gram definition given here in a very minor way. The original definition always assumes that a word is connected to itself, which also allows a distance graph of order k = 0, corresponding to a traditional bag-of-words representation with only self-loops. For example, the sentence “Adam ate an apple” would always have a self-loop at each of its four nodes, even though no word repeats within its own context. The slightly modified definition used here includes a self-loop only when a word actually occurs in its own context (sketched in code after these notes). For many traditional applications, however, this distinction does not seem to affect the results.
2. Note that \(\overline{u}_{i}\) and \(\overline{v}_{j}\) are added in the updates, which is a slight abuse of notation: \(\overline{u}_{i}\) is a row vector and \(\overline{v}_{j}\) is a column vector, but the intent of the updates is clear (see the second sketch after these notes).
3. An LSTM was used, which is a variation on the vanilla RNN discussed here.
4.
5. In principle, one can also allow it to be input at all time-stamps, but this only seems to worsen performance.
6.
7. Here, we treat the forget bits as a vector of binary bits, although the vector actually contains continuous values in (0, 1), which can be viewed as probabilities (see the third sketch after these notes). As discussed earlier, the binary abstraction helps in understanding the conceptual nature of the operations.
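To make note 1 concrete, the following is a minimal sketch of the modified distance-graph construction, in which a self-loop arises only when a word actually recurs within its own window of k positions. The function name and the edge representation (a counter of directed word pairs) are illustrative choices, not the notation of [15].

```python
# Minimal sketch of the modified distance-graph construction of note 1.
# Nodes are distinct words; the directed edge (w, w') counts how often w'
# occurs within k positions after an occurrence of w. A self-loop (w, w)
# appears only when a word actually recurs within its own window.
from collections import Counter

def distance_graph(tokens, k):
    edges = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + k + 1, len(tokens))):
            edges[(w, tokens[j])] += 1  # (w, w) only on an actual repetition
    return edges

# "Adam ate an apple" yields five directed edges and no self-loops for k = 2:
print(distance_graph("adam ate an apple".split(), 2))
```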
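For note 2, the sketch below shows what the update amounts to once \(\overline{u}_{i}\) and \(\overline{v}_{j}\) are both treated as plain 1-D arrays. It assumes a logistic (skip-gram with negative sampling) loss and a learning rate alpha; the names are illustrative rather than the book's exact notation.

```python
# Sketch of the update in note 2, with both embeddings as 1-D numpy arrays
# (which is what the row-vector/column-vector abuse of notation amounts to).
import numpy as np

def sgns_update(u_i, v_j, label, alpha=0.025):
    """One SGD step for a (target i, context j) pair; label is 1 or 0."""
    p = 1.0 / (1.0 + np.exp(-u_i.dot(v_j)))  # predicted P(j occurs near i)
    err = label - p
    du = alpha * err * v_j  # a multiple of v_j is added to u_i ...
    dv = alpha * err * u_i  # ... and a multiple of u_i to v_j
    u_i += du
    v_j += dv
    return u_i, v_j
```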
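For note 7, this sketch of the LSTM cell-state update shows that the “forget bits” f are sigmoid outputs in (0, 1), applied elementwise, even though it is conceptually convenient to imagine them as hard binary gates. The weight matrices are illustrative, and biases are omitted for brevity.

```python
# Sketch of the LSTM cell-state update referred to in note 7.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cell_update(c_prev, h_prev, x, Wf, Wi, Wc):
    z = np.concatenate([h_prev, x])   # previous hidden state and current input
    f = sigmoid(Wf @ z)               # "forget bits": soft values in (0, 1)
    i = sigmoid(Wi @ z)               # input gate
    c_tilde = np.tanh(Wc @ z)         # candidate cell contents
    return f * c_prev + i * c_tilde   # elementwise partial forgetting
```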
Bibliography
C. Aggarwal. Data mining: The textbook. Springer, 2015.
C. Aggarwal and P. Zhao. Towards graphical models for text processing. Knowledge and Information Systems, 36(1), pp. 1–21, 2013. [Preliminary version in ACM SIGIR, 2010]
M. Baroni, G. Dinu, and G. Kruszewski. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL, pp. 238–247, 2014.
M. Baroni and A. Lenci. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), pp. 673–721, 2010.
Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3, pp. 1137–1155, 2003.
C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
J. Bullinaria and J. Levy. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3), pp. 510–526, 2007.
R. Bunescu and R. Mooney. A shortest path dependency kernel for relation extraction. Human Language Technology and Empirical Methods in Natural Language Processing, pp. 724–731, 2005.
R. Bunescu and R. Mooney. Subsequence kernels for relation extraction. NIPS Conference, pp. 171–178, 2005.
K. Cho, B. Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP, 2014. https://arxiv.org/pdf/1406.1078.pdf
J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014. https://arxiv.org/abs/1412.3555
K. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), pp. 22–29, 1990.
M. Collins and N. Duffy. Convolution kernels for natural language. NIPS Conference, pp. 625–632, 2001.
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, pp. 2493–2537, 2011.
R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML Conference, pp. 160–167, 2008.
A. Culotta and J. Sorensen. Dependency tree kernels for relation extraction. ACL Conference, 2004.
A. Fader, L. Zettlemoyer, and O. Etzioni. Paraphrase-Driven Learning for Open Question Answering. ACL, pp. 1608–1618, 2013.
A. Fader, L. Zettlemoyer, and O. Etzioni. Open question answering over curated and extracted knowledge bases. ACM KDD Conference, 2014.
Y. Goldberg. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research (JAIR), 57, pp. 345–420, 2016.
Y. Goldberg and O. Levy. word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv:1402.3722, 2014. https://arxiv.org/abs/1402.3722
I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT Press, 2016.
A. Graves. Supervised sequence labelling with recurrent neural networks. Springer, 2012. http://rd.springer.com/book/10.1007%2F978-3-642-24797-2
A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013. https://arxiv.org/abs/1308.0850
A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649, 2013.
M. Gutmann and A. Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS, 2010.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), pp. 1735–1785, 1997.
S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, A Field Guide to Dynamical Recurrent Neural Networks, IEEE Press, 2001.
K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), pp. 359–366, 1989.
M. Iyyer, J. Boyd-Graber, L. Claudino, R. Socher, and H. Daume III. A Neural Network for Factoid Question Answering over Paragraphs. EMNLP, 2014.
C. Johnson. Logistic matrix factorization for implicit feedback data. NIPS Conference, 2014.
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. EMNLP, 2013.
A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015. https://arxiv.org/abs/1506.02078
A. Karpathy. The unreasonable effectiveness of recurrent neural networks, Blog post, 2015. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
J. Lau and T. Baldwin. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv:1607.05368, 2016. https://arxiv.org/abs/1607.05368
Q. Le. Personal communication, 2017.
Q. Le and T. Mikolov. Distributed representations of sentences and documents. ICML Conference, pp. 1188–1196, 2014.
O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. NIPS Conference, pp. 2177–2185, 2014.
O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, pp. 211–225, 2015.
O. Levy, Y. Goldberg, and I. Ramat-Gan. Linguistic regularities in sparse and explicit word representations. CoNLL, 2014.
Z. Lipton, J. Berkowitz, and C. Elkan. A critical review of recurrent neural networks for sequence learning. arXiv:1506.00019, 2015. https://arxiv.org/abs/1506.00019
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2, pp. 419–444, 2002.
K. Lund and C. Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, and Computers, 28(2), pp. 203–208, 1996.
J. Martens and I. Sutskever. Learning recurrent neural networks with hessian-free optimization. ICML Conference, pp. 1033–1040, 2011.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781
T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. NIPS Conference, pp. 3111–3119, 2013.
T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur. Recurrent neural network based language model. Interspeech, 2010.
T. Mikolov, W. Yih, and G. Zweig. Linguistic Regularities in Continuous Space Word Representations. HLT-NAACL, pp. 746–751, 2013.
T. Mikolov, Q. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013. https://arxiv.org/abs/1309.4168
A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. ICML Conference, pp. 641–648, 2007.
A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. NIPS Conference, pp. 2265–2273, 2013.
A. Mnih and Y. Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv:1206.6426, 2012. https://arxiv.org/abs/1206.6426
H. Niitsuma and M. Lee. Word2Vec is a special case of kernel correspondence analysis and kernels for natural language processing, arXiv preprint arXiv:1605.05087, 2016. https://arxiv.org/abs/1605.05087
R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. ICML, pp. 1310–1318, 2013. http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf
J. Pennington, R. Socher, and C. Manning. Glove: Global Vectors for Word Representation. EMNLP, pp. 1532–1543, 2014.
L. Polanyi and A. Zaenen. Contextual valence shifters. Computing Attitude and Affect in Text: Theory and Applications, pp. 1–10, Springer, 2006.
L. Qian, G. Zhou, F. Kong, Q. Zhu, and P. Qian. Exploiting constituent dependencies for tree kernel-based semantic relation extraction. International Conference on Computational Linguistics, pp. 697–704, 2008.
R. Rehurek and P. Sojka. Software framework for topic modelling with large corpora. LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50, 2010. https://radimrehurek.com/gensim/index.html
X. Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738, 2014. https://arxiv.org/abs/1411.2738
M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), pp. 2673–2681, 1997.
S. Siencnik. Adapting word2vec to named entity recognition. Nordic Conference of Computational Linguistics, NODALIDA, 2015.
M. Sundermeyer, R. Schluter, and H. Ney. LSTM neural networks for language modeling. Interspeech, 2012.
I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. NIPS Conference, pp. 3104–3112, 2014.
P. Turney and P. Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), pp. 141–188, 2010.
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. CVPR Conference, pp. 3156–3164, 2015.
J. Weston, A. Bordes, S. Chopra, A. Rush, B. van Merrienboer, A. Joulin, and T. Mikolov. Towards ai-complete question answering: A set of pre-requisite toy tasks. arXiv preprint arXiv:1502.05698, 2015. https://arxiv.org/abs/1502.05698
J. Weston, S. Chopra, and A. Bordes. Memory networks. ICLR, 2015.
D. Zelenko, C. Aone, and A. Richardella. Kernel methods for relation extraction. Journal of Machine Learning Research, 3, pp. 1083–1106, 2003.
M. Zhang, J. Zhang, and J. Su. Exploring syntactic features for relation extraction using a convolution tree kernel. Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 288–295, 2006.
M. Zhang, J. Zhang, J. Su, and G. Zhou. A composite kernel to extract relations between entities with both flat and structured features. International Conference on Computational Linguistics and the Annual Meeting of the Association for Computational Linguistics, pp. 825–832, 2006.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this chapter
Aggarwal, C.C. (2018). Text Sequence Modeling and Deep Learning. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_10
DOI: https://doi.org/10.1007/978-3-319-73531-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73530-6
Online ISBN: 978-3-319-73531-3
eBook Packages: Computer Science, Computer Science (R0)