Changeset 22654 for docs

Timestamp:
10/19/08 19:55:59 (3 months ago)
Author:
moritz
Message:

[docs/tutorial] (aka book) many clean ups and enhancements to ch07_grammars.pod.
Also other small fixes.

Location:
docs/tutorial
Files:
3 modified

  • docs/tutorial/ch01_overview.pod

    r22436 r22654  
    253253catalyst for his thoughts. 
    254254 
     255=head3 The test suite 
     256 
     257Z<CHP-1-SECT-3.1.4> 
     258 
     259X<Tang, Audrey> 
     260X<Pugs> 
     261X<Test suite> 
     262 
     263The design documents describe the Perl 6 language in prose, and the test 
     264suite is intended to translate that specification into code. 
     265 
     266In 2005 Audrey Tang started a Perl 6 compiler named I<Pugs>. It is 
     267written in Haskell, and its development moved very fast. The test suite began both as 
     268regression tests and as a feature wish list, and is now slowly being 
     269translated into an implementation-agnostic, official test suite that can 
     270be used by all implementations. 
     271 
     272Once it is done, every compiler that passes the test suite may name 
     273itself I<Perl 6>. 
     274 
    255275=cut 
     276 
     277# vim: sw=3 ts=3 expandtab tw=72 
  • docs/tutorial/ch05_subroutines.pod

    r22445 r22654  
    923923 
    924924   macro funky (Str $whatever)  
    925       is parsed (/:w like (\w+), you know/) 
     925      is parsed (/:s like (\w+), you know/) 
    926926  { 
    927927      return { plain($whatever); }; 
  • docs/tutorial/ch07_grammars.pod

    r22445 r22654  
    88TODO: This chapter is outdated in some ways 
    99 
    10   * It uses the old word "rule" instead of "regex" in far too many places 
     10  * It should be explained when we use "rule" and when "regex", and what 
     11    a "subrule" is. 
    1112  * The interpolation rules are outdated 
    12   * <'...'> and <"..."> are now '...' and "..." 
    1313  * some of the assertion syntax has changed, for example <foo()> means 
    1414    something different now 
    15   * Modifiers: explain :sigspace and :ratchet modifiers 
    16   * Modifiers: :u1, :u2... are now :bytes, :codes, :graphemes etc. 
    17   * Other missing modifiers: :overlap, :ignoreaccent 
    18   * The match object needs explanation 
     15  * Modifiers: explain :ratchet modifier 
     16  * The match object needs more explanation 
    1917 
    2018Z<CHP-7> 
     
    131129  regex digits {\d+} 
    132130 
    133 There are two more keywords that defines regexes similarly to C<regex>, but 
    134 they imply different modifiers. C<token> introduces a regex that does 
     131There are two more keywords that define regexes similarly to C<regex>, but 
     132imply slightly different behavior. C<token> introduces a regex that does 
    135133not backtrack,N<technically it implies the C<:ratchet> modifier> (more details 
    136134on that below; for now it's enough to know that it matches simple regexes 
     
    272270The "word"-like metacharacters are C<.>, C<^>, C<^^>, C<$>, C<$$>. The 
    273271C<.> matches any single character, even a newline character. Actually, 
    274 what it matches by default is a Unicode grapheme, but you can change 
    275 that behavior with a pragma in your code, or a modifier on the regex. 
     272Perl 6 has the notion of a Unicode level, which determines if string 
     273manipulation happens on the byte, codepoint or grapheme level. C<.> 
     274matches a character in the current level, which defaults to grapheme. 
     275The Unicode level can be adjusted with a pragma or with modifiers. 
    276276We'll talk more about modifiers in A<CHP-7-SECT-2.5>"Modifiers" later 
    277277in this chapter. The C<^> and C<$> metacharacters are zero-width 
     
    416416=cell Match an assertion. 
    417417 
     418=row 
     419 
     420=cell C<< <( >> 
     421 
     422=cell Start of capturing 
     423 
     424=row 
     425 
     426=cell C<< )> >> 
     427 
     428=cell End of capturing 
     429 
    418430=end table 
    419431 
     
    438450X<regexes;variable interpolation> 
    439451Note that since an ordinary variable now interpolates as a literal 
    440 string by default, the C<\Q> escape is rarely needed. 
     452string by default, the C<\Q> escape is rarely needed. An interpolated 
     453array is interpreted as an alternation of all array elements. 
    441454 
    442455A<CHP-7-TABLE-3>Table 7-3 shows the escape sequences for regexes.  
     
    455468 
    456469=bodyrows  
     470 
     471=row 
     472 
     473=cell  C<'...'> 
     474 
     475=cell Treat everything between the quotes literally, except the backslash C<\> 
     476and single quotes C<'> 
     477 
     478=row 
     479 
     480=cell  C<"..."> 
     481 
     482=cell Like C<'...'>, but backslash escape sequences and variable interpolation 
     483are enabled 
    457484 
    458485=row  
     
    704731=row  
    705732 
    706 =cell C**n> 
     733=cell C<**n> 
    707734 
    708735=cell C<**?n> 
     
    712739=row  
    713740 
    714 =cell C<**{n..m}> 
    715 =cell C<**?{n..m}> 
     741=cell C<**n..m> 
     742 
     743=cell C<**?n..m> 
    716744 
    717745=cell Match at least R<n> and no more than R<m> times. 
     
    719747=row  
    720748 
    721 =cell C<E<lt>>R<n>C<...E<gt>> 
    722  
    723 =cell C<E<lt>>R<n>C<...E<gt>?> 
     749=cell C<**n..*> 
     750 
     751=cell C<**?n..*> 
    724752 
    725753=cell Match at least R<n> times. 
     
    802830=cell Negate any assertion. 
    803831 
     832=row 
     833 
     834=cell C<< <.rule> >> 
     835 
     836=cell Match a named rule without capturing. 
     837 
    804838=row  
    805839 
     
    810844=row  
    811845 
    812 =cell C<E<lt>[...]E<gt>> 
     846=cell C<E<lt>[...]E<gt>>, C<< <+[...]> >> 
    813847 
    814848=cell Match an enumerated character class. 
     
    819853 
    820854=cell Complement a character class (named or enumerated). 
    821  
    822 =row  
    823  
    824 =cell C<E<lt>"..."E<gt>> 
    825  
    826 =cell Match a literal string (interpolated at match time). 
    827  
    828 =row  
    829  
    830 =cell C<E<lt>'...'E<gt>> 
    831  
    832 =cell Match a literal string (not interpolated). 
    833  
    834 =row  
    835  
    836 =cell C<E<lt>(...)E<gt>> 
    837  
    838 =cell Boolean assertion. Execute a closure and match if it returns a true 
    839 result. 
    840855 
    841856=row  
     
    860875=row  
    861876 
    862 =cell C<E<lt>E<amp>sub()E<gt>> 
    863  
    864 =cell Match an anonymous rule returned by a sub. 
     877=cell C<< <rule(...)> >> 
     878 
     879=cell Call a named rule with arguments. 
    865880 
    866881=row  
     
    890905cannot attach to the outside of a bare C</.../>. For example: 
    891906 
    892   m:i/marvin/ # case insensitive 
     907  m:i /marvin/ # case insensitive 
    893908  rule names :i { marvin | ford | arthur } 
    894909 
    895 The single-character modifiers can be grouped, but the others must be 
    896 separated by a colon: 
    897  
    898   m:wig/ zaphod /                        # OK 
    899   m:words:ignorecase:globally / zaphod / # OK 
    900   m:wordsignorecaseglobally / zaphod /   # Not OK 
     910Multiple modifiers can be chained, and short and long names can 
     911be mixed: 
     912 
     913  m:s :i :g/ zaphod / 
     914  m:sigspace :i :global / zaphod / 
     915 
     916Modifiers can be negated with the C<:!pair> notation, so C<:!i> forces 
     917case-sensitive matching. 
    901918 
    902919Most of the modifiers can also go inside the rule, attached to the 
     
    905922alteration of the pattern: 
    906923 
    907   m/:w I saw [:i zaphod] / # only 'zaphod' is case insensitive 
    908  
    909 The repetition modifiers (C<:R<N>x>, C<:R<N>th>, C<:once>, 
    910 C<:globally>, and C<:exhaustive>) and the continue modifier (C<:cont>) 
     924  m/:s I saw [:i zaphod] / # only 'zaphod' is case insensitive 
     925 
     926The repetition modifiers (C<:R<N>x>, C<:R<N>th>, 
     927C<:global>, and C<:exhaustive>) and the continue modifier (C<:cont>) 
    911928can't be lexically scoped, because they alter the return value of the 
    912929entire rule. 
     
    917934of the number.  
    918935 
    919 The C<:once> modifier on a rule only allows it to match once. The rule 
    920 will not match again until the you call the C<.reset> method on the rule 
    921 object. 
    922  
    923 The C<:globally> modifier matches as many times as possible. The 
     936The C<:global> modifier matches as many times as possible. The 
    924937C<:exhaustive> modifier also matches as many times as possible, but in 
    925938as many different ways as possible. 
     
    935948 
    936949By default, rules ignore literal whitespace within the pattern.  The 
    937 C<:w> modifier makes rules sensitive to literal whitespace, but in an 
    938 intelligent way. Any cluster of literal whitespace acts like an explicit 
    939 C<\s+> when it separates two identifiers and C<\s*> everywhere else. 
     950C<:s> or C<:sigspace> modifier makes rules sensitive to literal whitespace, 
     951but in an intelligent way. Any cluster of literal whitespace acts like an 
     952explicit C<\s+> when it separates two identifiers and C<\s*> everywhere else. 
     953 
     954More specifically, any literal whitespace in the regex is translated to 
     955an implicit call to C<E<lt>.wsE<gt>>, where the C<ws> rule matches as 
     956mentioned above, but can also be overridden by the user. 
    940957 
    941958There are no modifiers to alter whether the matched string is treated as 
     
    970987=cell Case-insensitive match. 
    971988 
    972 =row  
    973  
    974 =cell C<:I> 
    975  
    976 =cell  
    977  
    978 =cell Case-sensitive match (on by default). 
    979  
    980 =row  
    981  
    982 =cell C<:c> 
    983  
    984 =cell C<:cont> 
    985  
    986 =cell Continue where the previous match on the string left off. 
    987  
    988 =row  
    989  
    990 =cell C<:w> 
    991  
    992 =cell C<:words> 
     989=row 
     990 
     991=cell C<:a> 
     992 
     993=cell C<:ignoreaccent> 
     994 
     995=cell Ignore accents and other markings on characters. 
     996 
     997=row  
     998 
     999=cell C<:c($pos)> 
     1000 
     1001=cell C<:continue($pos)> 
     1002 
     1003=cell Match at position C<$pos> or later. If C<$pos> is omitted, start where the previous match left off. 
     1004 
     1005=row 
     1006 
     1007=cell C<:p> 
     1008 
     1009=cell C<:pos> 
     1010 
     1011=cell Match anchored at position C<$pos>. If C<$pos> is omitted, start where 
     1012the previous match left off. 
     1013 
     1014=row  
     1015 
     1016=cell C<:s> 
     1017 
     1018=cell C<:sigspace> 
    9931019 
    9941020=cell Literal whitespace in the pattern matches as C<\s+> 
     
    10131039=row  
    10141040 
    1015 =cell  
    1016  
    1017 =cell C<:once> 
    1018  
    1019 =cell Only match the pattern once. 
    1020  
    1021 =row  
    1022  
    10231041=cell C<:g> 
    10241042 
    1025 =cell C<:globally> 
     1043=cell C<:global> 
    10261044 
    10271045=cell Match the pattern as many times as possible, but only possibilities 
    10281046that don't overlap. 
    10291047 
    1030 =row  
    1031  
    1032 =cell C<:e> 
     1048=row 
     1049 
     1050=cell C<:ov> 
     1051 
     1052=cell C<:overlap> 
     1053 
     1054=cell Match the pattern as many times as possible, and allow overlapping 
     1055matches, but only one match per starting position. 
     1056 
     1057=row  
     1058 
     1059=cell C<:ex> 
    10331060 
    10341061=cell C<:exhaustive> 
     
    10411068=cell  
    10421069 
    1043 =cell C<:u0> 
    1044  
    1045 =cell . is a byte. 
     1070=cell C<:bytes> 
     1071 
     1072=cell C<.> is a byte. 
    10461073 
    10471074=row  
     
    10491076=cell  
    10501077 
    1051 =cell C<:u1> 
    1052  
    1053 =cell . is a Unicode codepoint. 
     1078=cell C<:codes> 
     1079 
     1080=cell C<.> is a Unicode codepoint. 
    10541081 
    10551082=row  
     
    10571084=cell  
    10581085 
    1059 =cell C<:u2> 
    1060  
    1061 =cell . is a Unicode grapheme. 
     1086=cell C<:graphs> 
     1087 
     1088=cell C<.> is a Unicode grapheme. 
    10621089 
    10631090=row  
     
    10651092=cell  
    10661093 
    1067 =cell C<:u3> 
    1068  
    1069 =cell . is language dependent. 
     1094=cell C<:chars> 
     1095 
     1096=cell C<.> matches whatever the current Unicode level corresponds to 
     1097(this is the default). 
     1098 
     1099=row 
     1100 
     1101=cell 
     1102 
     1103=cell C<:ratchet> 
     1104 
     1105=cell Imply a C<:> after each atom (see "Backtracking Control" below). 
    10701106 
    10711107=row  
     
    10791115=end table 
    10801116 
     1117=head2 Substitution Modifiers 
     1118 
     1119Special modifiers are available for substitutions that do not make sense 
     1120on normal matches. 
     1121 
     1122The C<:samecase> modifier (short form C<:ii>) implies the C<:ignorecase> 
     1123modifier, but also carries over the case information on a 
     1124character-by-character basis: 
     1125 
     1126   my $s = 'The Quick Brown Fox'; 
     1127   $s ~~ s:ii/brown/blue/; 
     1128   say $s;           # The Quick Blue Fox 
     1129 
     1130If the C<:sigspace> modifier is also present, a slightly more 
     1131intelligent algorithm is used. If the source string follows one of the 
     1132case patterns in $table (XXX: make that a proper cross-link), 
     1133that pattern is recognized and applied onto the 
     1134substitution string. 
     1135 
     1136   $_ = 'All Words Capitalized'; 
     1137   s:s:ii/.*/other words/; 
     1138   .say;             # Other Words 
     1139 
     1140There's a shortcut for C<s:s> named C<ss>, so you could have written the 
     1141example above as 
     1142 C<ss:ii/.*/other words/>. 
     1143 
     1144=begin table picture Case patterns for the :samecase modifier 
     1145 
     1146=headrow 
     1147 
     1148=cell Pattern 
     1149 
     1150=cell Corresponding code 
     1151 
     1152=bodyrows 
     1153 
     1154=row 
     1155 
     1156=cell ALL UPPERCASE 
     1157 
     1158=cell C<.uc> 
     1159 
     1160=row 
     1161 
     1162=cell all lowercase 
     1163 
     1164=cell C<.lc> 
     1165 
     1166=row 
     1167 
     1168=cell Every Word Capitalized 
     1169 
     1170=cell C<.lc.capitalize> 
     1171 
     1172=row 
     1173 
     1174=cell First letter upper, rest lower 
     1175 
     1176=cell C<.lc.ucfirst> 
     1177 
     1178=row 
     1179 
     1180=cell fIRST LETTER LOWER, REST UPPER 
     1181 
     1182=cell C<.uc.lcfirst> 
     1183 
     1184=end table 
     1185 
     1186A similar modifier is C<:sameaccent> (short C<:aa>). Instead of carrying 
     1187case information, it carries accent and marking information. 
     1188 
     1189   my $stuff = 'Möhre'; 
     1190   $stuff ~~ s:aa/o/a/; 
     1191   say $stuff;          # Mähre 
     1192 
     1193The third substitution modifier is C<:samespace>, short C<:ss>. It preserves 
     1194whitespace that is matched by implicit C<E<lt>.wsE<gt>> rules: 
     1195 
     1196   my $s = "Some   white\t\n spaces"; 
     1197   $s ~~ s:ss/\w+ \w+ \w+/Completely different text/; 
     1198   # $s is now "Completely   different\t\n text" 
    10811199 
    10821200=head1 Built-in Rules 
     
    11601278=cell Zero-width lookbehind. Assert that you're I<after> a pattern. 
    11611279 
    1162 =row  
    1163  
    1164 =cell C<E<lt>prop ...E<gt>> 
    1165  
    1166 =cell Match any character with the named property. 
    1167  
    1168 =row  
    1169  
    1170 =cell C<E<lt>replace(...)E<gt>> 
    1171  
    1172 =cell Replace everything matched so far in the rule or subrule with the 
    1173 given string (under consideration). 
    1174  
    11751280=end table 
    11761281 
     
    12451350=end table 
    12461351 
    1247 =head1 Hypothetical Variables 
     1352The C<:ratchet> modifier, which is implied by regexes declared with the 
     1353C<token> or C<rule> keyword, disables backtracking in the subrule, which 
     1354is the same as adding a C<:> after every atom. 
     1355 
     1356=head1 The Match Object 
    12481357 
    12491358Z<CHP-7-SECT-5> 
    12501359 
    1251 X<variables;hypothetical> 
    1252 X<hypothetical variables> 
     1360X<object;match> 
    12531361X<rules;captures> 
    1254 Hypothetical variables are a powerful way of building up data structures 
    1255 from within a match. Ordinary captures with C<()> store the result of 
    1256 the captures in C<$0>, C<$1>, etc. The values stored in these variables 
    1257 will be kept if the match is successful, but thrown away if the match 
    1258 fails (hence the term "hypothetical"). The numbered capture variables 
    1259 are accessible outside the match, but only within the immediate 
    1260 surrounding lexical scope: 
    1261  
    1262   "Zaphod Beeblebrox" ~~ m:w/ (\w+) (\w+) /; 
    1263    
    1264   print $0; # prints Zaphod 
    1265  
    1266 You can also capture into any user-defined variable with the binding 
    1267 operator C<:=>. These variables must already be defined in the lexical 
    1268 scope surrounding the rule: 
    1269  
    1270   my $person; 
    1271   "Zaphod's just this guy." ~~ / ^ $person := (\w+) /; 
    1272   print $person; # prints Zaphod 
    1273  
    1274 Repeated matches can be captured into an array: 
    1275  
    1276   my @words; 
    1277   "feefifofum" ~~ / @words := (f<-[f]>+)* /; 
    1278   # @words contains ("fee", "fi", "fo", "fum") 
    1279  
    1280 Pairs of repeated matches can be captured into a hash: 
    1281  
    1282   my %customers; 
    1283   $records ~~ m:w/ %customers := [ E<lt>idE<gt> = E<lt>nameE<gt> \n]* /; 
    1284  
    1285 If you don't need the captured value outside the rule, use a C<$?> 
    1286 variable instead. These are only directly accessible within the rule: 
    1287  
    1288   "Zaphod saw Zaphod" ~~ m:w/ $?name := (\w+) \w+ $?name/; 
    1289  
    1290 A match of a named rule stores the result in a C<$?> variable with the 
    1291 same name as the rule. These variables are also accessible only within 
    1292 the rule: 
    1293  
    1294   "Zaphod saw Zaphod" ~~ m:w/ E<lt>nameE<gt> \w+ $?name /; 
    1295  
     1362 
     1363A regex match produces a I<Match> object, which contains all information 
     1364about the match, including start and end position, matched string, and all 
     1365captures. 
     1366 
     1367The match object is returned from a regex match, and is also stored in 
     1368the special variable C<$/>. 
     1369 
     1370   my $match = 'Zaphod Beeblebrox' ~~ m/\w+/;    
     1371   say $match;    # prints Zaphod 
     1372 
     1373In string context it evaluates to the text of the matched part of the 
     1374string. 
     1375 
     1376Table A<CHP-7-TABLE-Match> summarises the properties of the match object. 
     1377 
     1378The variables C<$0>, C<$1>, C<$2> etc. are aliases to C<$/[0]>, 
     1379C<$/[1]>, C<$/[2]>, and C<$E<lt>nameE<gt>> is an alias to 
     1380C<$/E<lt>nameE<gt>>. Likewise an empty C<@()> is the same as C<@($/)>, 
     1381and C<%()> stands for C<%($/)>. 
     1382 
     1383Match variables can also store a different scalar object. A closure in a 
     1384regex can store such an object by calling C<make>, and it can be accessed 
     1385by forcing scalar context with C<$( $/ )>: 
     1386 
     1387   regex herd :i :s { 
     1388         (\d+) 
     1389         (\w+)s? 
     1390         { 
     1391            make Herd.new( 
     1392                  animal => $1.capitalize, 
     1393                  count  => $0, 
     1394                 ); 
     1395         } 
     1396   } 
     1397   'Yesterday we saw 4 mooses' ~~ m/ <herd> /; 
     1398   # now $($<herd>) contains the new Herd object 
     1399 
     1400This can be used to build object trees directly from regex matches. 
     1401 
     1402=begin table picture Properties of the Match object 
     1403 
     1404Z<CHP-7-TABLE-Match> 
     1405 
     1406=headrow 
     1407 
     1408=cell Property  
     1409 
     1410=cell Description 
     1411 
     1412=bodyrows 
     1413 
     1414=row 
     1415 
     1416=cell C<?$/> 
     1417 
     1418=cell True if the match was successful. 
     1419 
     1420=row 
     1421 
     1422=cell C<$/.text> 
     1423 
     1424=cell The matched part of the string. 
     1425 
     1426=row 
     1427 
     1428=cell C<$/.from> 
     1429 
     1430=cell Start position of the match. 
     1431 
     1432=row 
     1433 
     1434=cell C<$/.to> 
     1435 
     1436=cell End position of the match. 
     1437 
     1438=row 
     1439 
     1440=cell C<@( $/ )> 
     1441 
     1442=cell List of all positional captures. 
     1443 
     1444=row  
     1445 
     1446=cell C<%( $/ )> 
     1447 
     1448=cell Hash of all named captures. 
     1449 
     1450=row 
     1451 
     1452=cell C<$/[$n]> 
     1453 
     1454=cell C<$n>th positional capture. 
     1455 
     1456=row 
     1457 
     1458=cell C<$/E<lt>nameE<gt>> 
     1459 
     1460=cell Access to particular named capture. 
     1461 
     1462=end table 
     1463 
     1464Capture variables are always match objects, and contain the information 
     1465of their respective sub matches. 
     1466 
     1467   m/ ( a ( geek ) ( passes ) )  ( many tests ) / 
     1468      |   |      | |        | |  |            | 
     1469      |   $/[0][0] $/[0][1]-+ |  |            | 
     1470      |                       |  |            | 
     1471      $/[0]-------------------+  $/[1] -------+ 
     1472 
     1473If a capturing group is quantified, it automatically becomes an array of 
     1474match objects. Subsequent matches are not renumbered: 
     1475 
     1476   '12 45 books' ~~ m:s/ ( \d+ )+ (\w+) /; 
     1477   say $0[0];     # 12 
     1478   say $0[1];     # 45 
     1479   say $1;        # books 
     1480 
     1481When a subrule is called with the C<E<lt>subruleE<gt>> syntax, it 
     1482produces a named capture of name C<subrule>. That name can be 
     1483changed with the C<E<lt>newname=subruleE<gt>> syntax. 
     1484 
     1485   token identifier { \w+ } 
     1486   token number     { \d+ } 
     1487   $_ = '24 hours'; 
     1488   if m:s/<number> <unit=identifier> / { 
     1489      say "Number: $<number>. Unit: $<unit>"; 
     1490   } 
     1491 
     1492These variables are also available in the regex itself: 
     1493 
     1494  "Zaphod saw Zaphod" ~~ m:s/ E<lt>nameE<gt> \w+ $/<name> /; 
    12961495 
    12971496 
    12981497=cut 
     1498 
     1499# vim: sw=3 ts=3 expandtab ft=pod tw=72