Abstract
Much of the discussion in the previous chapters has focused on a bag-of-words representation of text. Although this representation suffices for many practical applications, there are settings in which the sequential structure of text becomes important.
Notes
1. The original definition of the distance graph [15] differs from the skip-gram definition given here in a very minor way. The original definition always assumes that a word is connected to itself, which also allows a distance graph of order k = 0, corresponding to a traditional bag-of-words representation with only self-loops. For example, the sentence “Adam ate an apple” would always have a self-loop at each of its four nodes, even though no word repeats within its own context. The slightly modified definition used here includes a self-loop only when a word actually occurs in its own context (sketched in code after these notes). For many traditional applications, however, this distinction does not seem to affect the results.
2. Note that \(\overline{u}_{i}\) and \(\overline{v}_{j}\) are added in the updates, which is a slight abuse of notation: \(\overline{u}_{i}\) is a row vector and \(\overline{v}_{j}\) is a column vector, but the intent of the updates is clear (see the second sketch after these notes).
3. An LSTM was used, which is a variation on the vanilla RNN discussed here.
4.
5. In principle, one can also allow it to be input at all time-stamps, but this only seems to worsen performance.
6.
7. Here, we treat the forget bits as a vector of binary bits, although the vector actually contains continuous values in (0, 1), which can be viewed as probabilities (see the third sketch after these notes). As discussed earlier, the binary abstraction helps in understanding the conceptual nature of the operations.
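To make note 1 concrete, the following is a minimal sketch of the modified distance-graph construction, in which a self-loop arises only when a word actually recurs within its own window of k positions. The function name and the edge representation (a counter of directed word pairs) are illustrative choices, not the notation of [15].

```python
# Minimal sketch of the modified distance-graph construction of note 1.
# Nodes are distinct words; the directed edge (w, w') counts how often w'
# occurs within k positions after an occurrence of w. A self-loop (w, w)
# appears only when a word actually recurs within its own window.
from collections import Counter

def distance_graph(tokens, k):
    edges = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + k + 1, len(tokens))):
            edges[(w, tokens[j])] += 1  # (w, w) only on an actual repetition
    return edges

# "Adam ate an apple" yields five directed edges and no self-loops for k = 2:
print(distance_graph("adam ate an apple".split(), 2))
```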
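For note 2, the sketch below shows what the update amounts to once \(\overline{u}_{i}\) and \(\overline{v}_{j}\) are both treated as plain 1-D arrays. It assumes a logistic (skip-gram with negative sampling) loss and a learning rate alpha; the names are illustrative rather than the book's exact notation.

```python
# Sketch of the update in note 2, with both embeddings as 1-D numpy arrays
# (which is what the row-vector/column-vector abuse of notation amounts to).
import numpy as np

def sgns_update(u_i, v_j, label, alpha=0.025):
    """One SGD step for a (target i, context j) pair; label is 1 or 0."""
    p = 1.0 / (1.0 + np.exp(-u_i.dot(v_j)))  # predicted P(j occurs near i)
    err = label - p
    du = alpha * err * v_j  # a multiple of v_j is added to u_i ...
    dv = alpha * err * u_i  # ... and a multiple of u_i to v_j
    u_i += du
    v_j += dv
    return u_i, v_j
```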
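For note 7, this sketch of the LSTM cell-state update shows that the “forget bits” f are sigmoid outputs in (0, 1), applied elementwise, even though it is conceptually convenient to imagine them as hard binary gates. The weight matrices are illustrative, and biases are omitted for brevity.

```python
# Sketch of the LSTM cell-state update referred to in note 7.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cell_update(c_prev, h_prev, x, Wf, Wi, Wc):
    z = np.concatenate([h_prev, x])   # previous hidden state and current input
    f = sigmoid(Wf @ z)               # "forget bits": soft values in (0, 1)
    i = sigmoid(Wi @ z)               # input gate
    c_tilde = np.tanh(Wc @ z)         # candidate cell contents
    return f * c_prev + i * c_tilde   # elementwise partial forgetting
```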
Bibliography
C. Aggarwal. Data mining: The textbook. Springer, 2015.
C. Aggarwal and P. Zhao. Towards graphical models for text processing. Knowledge and Information Systems, 36(1), pp. 1–21, 2013. [Preliminary version in ACM SIGIR, 2010]
M. Baroni, G. Dinu, and G. Kruszewski. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL, pp. 238–247, 2014.
M. Baroni and A. Lenci. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), pp. 673–721, 2010.
Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3, pp. 1137–1155, 2003.
C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
J. Bullinaria and J. Levy. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3), pp. 510–526, 2007.
R. Bunescu and R. Mooney. A shortest path dependency kernel for relation extraction. Human Language Technology and Empirical Methods in Natural Language Processing, pp. 724–731, 2005.
R. Bunescu and R. Mooney. Subsequence kernels for relation extraction. NIPS Conference, pp. 171–178, 2005.
K. Cho, B. Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP, 2014. https://arxiv.org/pdf/1406.1078.pdf
J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014. https://arxiv.org/abs/1412.3555
K. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), pp. 22–29, 1990.
M. Collins and N. Duffy. Convolution kernels for natural language. NIPS Conference, pp. 625–632, 2001.
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, pp. 2493–2537, 2011.
R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML Conference, pp. 160–167, 2008.
A. Culotta and J. Sorensen. Dependency tree kernels for relation extraction. ACL Conference, 2004.
A. Fader, L. Zettlemoyer, and O. Etzioni. Paraphrase-Driven Learning for Open Question Answering. ACL, pp. 1608–1618, 2013.
A. Fader, L. Zettlemoyer, and O. Etzioni. Open question answering over curated and extracted knowledge bases. ACM KDD Conference, 2014.
Y. Goldberg. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research (JAIR), 57, pp. 345–420, 2016.
Y. Goldberg and O. Levy. word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv:1402.3722, 2014. https://arxiv.org/abs/1402.3722
I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT Press, 2016.
A. Graves. Supervised sequence labelling with recurrent neural networks. Springer, 2012. http://rd.springer.com/book/10.1007%2F978-3-642-24797-2
A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013. https://arxiv.org/abs/1308.0850
A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649, 2013.
M. Gutmann and A. Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS, 2010.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), pp. 1735–1785, 1997.
S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, A Field Guide to Dynamical Recurrent Neural Networks, IEEE Press, 2001.
K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), pp. 359–366, 1989.
M. Iyyer, J. Boyd-Graber, L. Claudino, R. Socher, and H. Daume III. A Neural Network for Factoid Question Answering over Paragraphs. EMNLP, 2014.
C. Johnson. Logistic matrix factorization for implicit feedback data. NIPS Conference, 2014.
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. EMNLP, 2013.
A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015. https://arxiv.org/abs/1506.02078
A. Karpathy. The unreasonable effectiveness of recurrent neural networks, Blog post, 2015. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
J. Lau and T. Baldwin. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv:1607.05368, 2016. https://arxiv.org/abs/1607.05368
Q. Le. Personal communication, 2017.
Q. Le and T. Mikolov. Distributed representations of sentences and documents. ICML Conference, pp. 1188–1196, 2014.
O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. NIPS Conference, pp. 2177–2185, 2014.
O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, pp. 211–225, 2015.
O. Levy, Y. Goldberg, and I. Ramat-Gan. Linguistic regularities in sparse and explicit word representations. CoNLL, 2014.
Z. Lipton, J. Berkowitz, and C. Elkan. A critical review of recurrent neural networks for sequence learning. arXiv:1506.00019, 2015. https://arxiv.org/abs/1506.00019
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2, pp. 419–444, 2002.
K. Lund and C. Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, and Computers, 28(2), pp. 203–208, 1996.
J. Martens and I. Sutskever. Learning recurrent neural networks with hessian-free optimization. ICML Conference, pp. 1033–1040, 2011.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781
T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. NIPS Conference, pp. 3111–3119, 2013.
T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur. Recurrent neural network based language model. Interspeech, 2010.
T. Mikolov, W. Yih, and G. Zweig. Linguistic Regularities in Continuous Space Word Representations. HLT-NAACL, pp. 746–751, 2013.
T. Mikolov, Q. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013. https://arxiv.org/abs/1309.4168
A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. ICML Conference, pp. 641–648, 2007.
A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. NIPS Conference, pp. 2265–2273, 2013.
A. Mnih and Y. Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv:1206.6426, 2012. https://arxiv.org/abs/1206.6426
H. Niitsuma and M. Lee. Word2Vec is a special case of kernel correspondence analysis and kernels for natural language processing, arXiv preprint arXiv:1605.05087, 2016. https://arxiv.org/abs/1605.05087
R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. ICML, pp. 1310–1318, 2013. http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf
J. Pennington, R. Socher, and C. Manning. Glove: Global Vectors for Word Representation. EMNLP, pp. 1532–1543, 2014.
L. Polanyi and A. Zaenen. Contextual valence shifters. Computing Attitude and Affect in Text: Theory and Applications, pp. 1–10, Springer, 2006.
L. Qian, G. Zhou, F. Kong, Q. Zhu, and P. Qian. Exploiting constituent dependencies for tree kernel-based semantic relation extraction. International Conference on Computational Linguistics, pp. 697–704, 2008.
R. Rehurek and P. Sojka. Software framework for topic modelling with large corpora. LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50, 2010. https://radimrehurek.com/gensim/index.html
X. Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738, 2014. https://arxiv.org/abs/1411.2738
M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), pp. 2673–2681, 1997.
S. Siencnik. Adapting word2vec to named entity recognition. Nordic Conference of Computational Linguistics, NODALIDA, 2015.
M. Sundermeyer, R. Schluter, and H. Ney. LSTM neural networks for language modeling. Interspeech, 2012.
I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. NIPS Conference, pp. 3104–3112, 2014.
P. Turney and P. Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), pp. 141–188, 2010.
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. CVPR Conference, pp. 3156–3164, 2015.
J. Weston, A. Bordes, S. Chopra, A. Rush, B. van Merrienboer, A. Joulin, and T. Mikolov. Towards ai-complete question answering: A set of pre-requisite toy tasks. arXiv preprint arXiv:1502.05698, 2015. https://arxiv.org/abs/1502.05698
J. Weston, S. Chopra, and A. Bordes. Memory networks. ICLR, 2015.
D. Zelenko, C. Aone, and A. Richardella. Kernel methods for relation extraction. Journal of Machine Learning Research, 3, pp. 1083–1106, 2003.
M. Zhang, J. Zhang, and J. Su. Exploring syntactic features for relation extraction using a convolution tree kernel. Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 288–295, 2006.
M. Zhang, J. Zhang, J. Su, and G. Zhou. A composite kernel to extract relations between entities with both flat and structured features. International Conference on Computational Linguistics and the Annual Meeting of the Association for Computational Linguistics, pp. 825–832, 2006.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this chapter
Aggarwal, C.C. (2018). Text Sequence Modeling and Deep Learning. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_10
DOI: https://doi.org/10.1007/978-3-319-73531-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73530-6
Online ISBN: 978-3-319-73531-3
eBook Packages: Computer Science, Computer Science (R0)