Dynamic Programming for Linear-Time Incremental Parsing
Liang Huang (Information Sciences Institute, University of Southern California)
Kenji Sagae (Institute for Creative Technologies, University of Southern California)
ACL 2010, Uppsala, Sweden, July 2010 (slightly expanded)
Ambiguities in Parsing
"feed cats nearby in the garden ..." -- let's focus on dependency structures for simplicity. The attachments of "nearby" and "in" are ambiguous, and the ambiguity explodes exponentially with sentence length, so we must design an efficient (polynomial-time) search algorithm, typically using dynamic programming (DP), e.g. CKY.
But Full DP Is Too Slow...
"feed cats nearby in the garden ..." Full DP (like CKY) is too slow: cubic time, while human parsing is fast and incremental (linear time). How about incremental parsing then? Yes, but only with greedy search, so accuracy suffers: it explores a tiny fraction of the possible trees (even with beam search). Can we combine the merits of both approaches -- a fast, incremental parser with dynamic programming that explores exponentially many trees in linear time?
Linear-Time Incremental DP
Incremental parsing (e.g. shift-reduce; Nivre 04, Collins/Roark 04, ...) is fast (linear-time) but uses greedy search; full DP (e.g. CKY; Eisner 96, Collins 99, ...) offers principled search but is slow (cubic-time). This work: fast shift-reduce parsing with dynamic programming -- principled search in linear time.
Preview of the Results
A very fast linear-time dynamic programming parser with the best reported dependency accuracy on PTB/CTB; it explores exponentially many trees (and outputs a forest).
[Figures: parsing time (secs) vs. sentence length for Charniak, Berkeley, MST, and this work; number of trees explored vs. sentence length -- DP: exponential, non-DP beam search: constant.]
Outline
Motivation; Incremental (Shift-Reduce) Parsing; Dynamic Programming for Incremental Parsing; Experiments.
Shift-Reduce Parsing
Example: "feed cats nearby in the garden."
[Step-by-step derivation table showing the action, stack, and queue at each step: shift, shift, l-reduce, shift, ... At step 5 a shift-reduce conflict arises: r-reduce (5a) or shift (5b).]
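As a concrete illustration (not taken from the slides), here is a minimal Python sketch of an arc-standard shift-reduce transition system, assuming the common convention that l-reduce attaches the second item on the stack as a left dependent of the top item, and r-reduce attaches the top item as a right dependent of the item below it; all names are illustrative.

# Minimal arc-standard shift-reduce sketch; a state is (stack, queue, arcs),
# with words represented by their indices and arcs stored as (head, dependent) pairs.

def shift(stack, queue, arcs):
    # move the front of the queue onto the stack
    return stack + [queue[0]], queue[1:], arcs

def l_reduce(stack, queue, arcs):
    # s1 becomes a left dependent of s0; s0 stays on the stack
    s1, s0 = stack[-2], stack[-1]
    return stack[:-2] + [s0], queue, arcs | {(s0, s1)}

def r_reduce(stack, queue, arcs):
    # s0 becomes a right dependent of s1; s1 stays on the stack
    s1, s0 = stack[-2], stack[-1]
    return stack[:-2] + [s1], queue, arcs | {(s1, s0)}

words = "feed cats nearby in the garden".split()
state = ([], list(range(len(words))), frozenset())
state = shift(*state)                       # stack: [feed]
state = shift(*state)                       # stack: [feed, cats] -- shift or reduce next?
print([words[i] for i in state[0]], [words[i] for i in state[1]])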
Choosing Parser Actions
State: stack ... s2 s1 s0 | queue q0 q1 ...
Features such as (s0.w, s0.rc, q0, ...) = (cats, nearby, in, ...); each action is scored using features f and weights w. Features are drawn from a local window -- an abstraction (or signature) of a state -- and this is what inspires DP! Weights are trained with the structured perceptron (Collins 02).
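A hedged sketch of how such scoring might look: a linear model over (feature, action) pairs with weights that would be learned by the structured perceptron. The feature templates below are illustrative, not the paper's actual feature set.

from collections import defaultdict

def extract_features(stack, queue, words):
    # a small local window around the stack top and the queue front
    s0 = words[stack[-1]] if stack else "<empty>"
    s1 = words[stack[-2]] if len(stack) > 1 else "<empty>"
    q0 = words[queue[0]] if queue else "<end>"
    return [f"s0.w={s0}", f"s1.w={s1}", f"q0.w={q0}", f"s0.w,q0.w={s0},{q0}"]

def score_action(weights, feats, action):
    # perceptron-style linear score: sum of weights of fired (feature, action) pairs
    return sum(weights[(f, action)] for f in feats)

def best_action(weights, stack, queue, words,
                actions=("shift", "l-reduce", "r-reduce")):
    feats = extract_features(stack, queue, words)
    return max(actions, key=lambda a: score_action(weights, feats, a))

weights = defaultdict(float)      # would be trained by the structured perceptron
words = "feed cats nearby in the garden".split()
print(best_action(weights, [0, 1], [2, 3, 4, 5], words))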
Greedy Search
Each state yields three new states (shift, l-reduce, r-reduce), so the search space is exponential. Greedy search always picks the single best next state.
Beam Search
Each state yields three new states (shift, l-reduce, r-reduce), so the search space is exponential. Beam search keeps only the top-b states at each step.
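A minimal beam-search sketch over such parser states, reusing extract_features and score_action from the previous sketch; the successor function and beam width are illustrative. It keeps only the b best states per step and therefore explores only a sliver of the exponential space.

import heapq

def successors(stack, queue, arcs):
    # the three possible actions from a state (when legal)
    if queue:
        yield "shift", (stack + [queue[0]], queue[1:], arcs)
    if len(stack) >= 2:
        s1, s0 = stack[-2], stack[-1]
        yield "l-reduce", (stack[:-2] + [s0], queue, arcs | {(s0, s1)})
        yield "r-reduce", (stack[:-2] + [s1], queue, arcs | {(s1, s0)})

def beam_search(words, weights, b=8):
    n = len(words)
    beam = [(0.0, [], list(range(n)), frozenset())]   # (score, stack, queue, arcs)
    for _ in range(2 * n - 1):                         # 2n-1 actions parse n words
        cands = []
        for score, stack, queue, arcs in beam:
            feats = extract_features(stack, queue, words)
            for name, (st, q, a) in successors(stack, queue, arcs):
                cands.append((score + score_action(weights, feats, name), st, q, a))
        beam = heapq.nlargest(b, cands, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])               # best complete state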
Dynamic Programming
Each state yields three new states (shift, l-reduce, r-reduce). The key idea of DP is to share common subproblems: merge equivalent states, giving polynomial space -- the graph-structured stack (Tomita, 1988). Each DP state corresponds to exponentially many non-DP states.
[Figure: number of trees explored vs. sentence length -- DP: exponential; non-DP beam search: constant.]
Merging Equivalent States
Two states are equivalent if they agree on their features, because the same features guarantee the same cost. Assume the features only look at the root of s0; then two states are equivalent if they agree on the root of s0.
[Animation: on the shift-reduce conflict in "feed cats nearby in the garden", the shift branch and the reduce branch later reach states that agree on the root of s0 and are merged -- (local) ambiguity packing -- yielding a graph-structured stack.]
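A hedged sketch of the merging idea (not the paper's actual data structures): DP states are keyed by a feature signature, and equivalent states are packed into a single entry that keeps the best score and remembers all predecessor links -- which is what makes the packed search space behave like a graph-structured stack. The toy signature below stands in for the paper's kernel features.

def signature(stack, queue, words):
    # exactly what the scoring features are allowed to see (toy version)
    s0 = words[stack[-1]] if stack else "<empty>"
    q0 = words[queue[0]] if queue else "<end>"
    return (s0, q0)

class DPState:
    def __init__(self, sig, score):
        self.sig = sig
        self.score = score        # best (Viterbi) score among the merged states
        self.preds = []           # predecessor links: the graph-structured stack

def merge(chart, sig, score, pred):
    # pack a new state into the chart: keep the best score, remember every predecessor
    if sig not in chart:
        chart[sig] = DPState(sig, score)
    state = chart[sig]
    state.preds.append(pred)      # (local) ambiguity packing
    state.score = max(state.score, score)
    return state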
Theory: Polynomial-Time DP
State: stack ... s2 s1 s0 | queue q0 q1 ...
This DP is exact and polynomial-time if the feature functions are (a) bounded -- for polynomial time: features can only look at a local window -- and (b) monotonic -- for correctness (optimal substructure): features should draw no more information from trees farther from the stack top than from trees closer to the top. Both conditions are intuitive: (a) is always true; (b) is almost always true.
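One way to make the two conditions concrete, as an illustrative and simplified check that is not from the paper: represent each feature template as a (stack position, attribute) pair; boundedness means positions come from a fixed window, and monotonicity means anything inspected at a deeper stack position is also inspected at every shallower one.

# Illustrative check on feature templates; a template is (stack_depth, attribute),
# e.g. (0, "w") for s0's word or (1, "rc") for s1's rightmost child.

def bounded(templates, max_depth=2):
    return all(d <= max_depth for d, _ in templates)

def monotonic(templates):
    # whatever is inspected at stack depth d must also be inspected at depth d-1
    by_depth = {}
    for d, attr in templates:
        by_depth.setdefault(d, set()).add(attr)
    return all(attrs <= by_depth.get(d - 1, set())
               for d, attrs in by_depth.items() if d > 0)

ok  = [(0, "w"), (0, "rc"), (1, "w"), (2, "w")]    # monotonic
bad = [(0, "w"), (2, "rc")]                        # uses s2 without using s1
print(bounded(ok), monotonic(ok), monotonic(bad))  # True True False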
Theory: Monotonic History
Related: grammar refinement by annotation (Johnson, 1998), which annotates vertical-context history (e.g., the parent). Monotonicity there means you can't annotate the grandparent without annotating the parent (otherwise DP would fail). Our features use left-context history instead of vertical context: similarly, we can't annotate s2 without annotating s1. But we can always design a minimum monotonic superset of a feature set.
Related Work
Graph-structured stack (Tomita 88): Generalized LR. The GSS is just a chart viewed from left to right (cf. Earley 70). This line of work started with Lang (1974) but has been stuck since about 1990, because an explicit LR table is impossible with modern grammars; the general idea is to compile a CFG parse chart into FSAs (cf. our beam).
We revive and advance this line of work in two respects. Theoretical: an implicit LR table based on features -- states merge and split on the fly, with no pre-compilation needed -- and monotonic feature functions guarantee correctness (new). Practical: linear-time performance achieved with pruning.
Experiments
Speed Comparison
5 times faster with the same parsing accuracy.
[Figure: time (hours), DP vs. non-DP.]
Correlation of Search and Parsing
Better search quality <=> better parsing accuracy.
[Figure: dependency accuracy vs. average model score, DP vs. non-DP.]
Search Space: Exponential
[Figure: number of trees explored vs. sentence length -- DP: exponential; non-DP: fixed (beam width).]
N-Best / Forest Oracles
[Figure: oracle accuracy vs. k -- DP forest oracle (98.15), DP k-best in forest, non-DP k-best in beam.]
Better Search => Better Learning
DP leads to faster and better learning with the perceptron.
Learning Details: Early Updates
Greedy search: update at the first error (Collins/Roark 04). Beam search: update when the gold item is pruned from the beam (Zhang/Clark 08). DP search: also update when the gold item is merged (new!), because then we know gold can't make it to the top again.
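A hedged sketch of structured-perceptron training with early update under beam search, reusing the extract_features, score_action, and successors sketches above; gold_actions is the oracle action sequence for the sentence. The DP variant would additionally trigger an update when the gold item is merged away, as described on the slide.

import heapq

def train_one(weights, words, gold_actions, b=8):
    # early update: stop and update as soon as the gold action prefix leaves the beam
    beam = [(0.0, [], list(range(len(words))), frozenset(), [])]   # ... plus action history
    for step in range(len(gold_actions)):
        cands = []
        for score, stack, queue, arcs, hist in beam:
            feats = extract_features(stack, queue, words)
            for name, (st, q, a) in successors(stack, queue, arcs):
                cands.append((score + score_action(weights, feats, name),
                              st, q, a, hist + [name]))
        beam = heapq.nlargest(b, cands, key=lambda c: c[0])
        gold_prefix = gold_actions[:step + 1]
        if not any(c[4] == gold_prefix for c in beam):
            adjust(weights, words, gold_prefix, +1)   # reward the gold prefix
            adjust(weights, words, beam[0][4], -1)    # penalize the best wrong prefix
            return
    if beam[0][4] != gold_actions:                    # standard update at the end
        adjust(weights, words, gold_actions, +1)
        adjust(weights, words, beam[0][4], -1)

def adjust(weights, words, actions, delta):
    # replay an action sequence, shifting the weight of every fired (feature, action) pair
    stack, queue, arcs = [], list(range(len(words))), frozenset()
    for name in actions:
        for f in extract_features(stack, queue, words):
            weights[(f, name)] += delta
        stack, queue, arcs = dict(successors(stack, queue, arcs))[name]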
Parsing Time vs. Sentence Length
Parsing speed (scatter plot) compared to other parsers.
[Figure: parsing time (secs) vs. sentence length -- Charniak: O(n^2.5), Berkeley: O(n^2.4), MST: O(n^2), this work: O(n).]
Final Results
Much faster than major parsers (even implemented in Python!); the first linear-time incremental dynamic programming parser; best reported dependency accuracy on the Penn Treebank.

parser                     accuracy  time (secs)  complexity  trees searched
McDonald et al 05 (MST)    90.2      0.12         O(n^2)      exponential
Koo et al 08 baseline*     92.0      -            O(n^4)      exponential
Zhang & Clark 08 single    91.4      0.11         O(n)        constant
this work                  92.1      0.04         O(n)        exponential
Charniak 00                92.5      0.49         O(n^2.5)    exponential
Petrov & Klein 07          92.4      0.21         O(n^2.4)    exponential

*at this ACL: Koo & Collins 10: 93.0 with O(n^4)
Final Results on Chinese
Also the best parsing accuracy on Chinese (Penn Chinese Treebank, CTB 5); all numbers below use gold-standard POS tags.

parser                     word  non-root  root
Duan et al. 2007           83.9  84.3      73.7
Zhang & Clark 08 (single)  84.4  84.7      76.7
this work                  85.2  85.5      78.3
Conclusion
Incremental parsing (e.g. shift-reduce) is fast (linear-time) but greedy; full dynamic programming (e.g. CKY) offers principled search but is slow (cubic-time). This work: linear-time shift-reduce parsing with dynamic programming.
Thank You
A general theory of DP for shift-reduce parsing: it works as long as the features are bounded and monotonic. A fast, accurate DP parser; release coming soon: http://www.isi.edu/~lhuang, http://www.ict.usc.edu/~sagae
Future work: adapt to constituency parsing (straightforward); other grammar formalisms like CCG and TAG; integrate POS tagging into the parser.