- Timestamp: 10/19/08 19:55:59 (3 months ago)
- Location: docs/tutorial
- Files: 3 modified
  - ch01_overview.pod (modified) (1 diff)
  - ch05_subroutines.pod (modified) (1 diff)
  - ch07_grammars.pod (modified) (26 diffs)
docs/tutorial/ch01_overview.pod
--- r22436
+++ r22654
@@ -253,3 +253,25 @@
 catalyst for his thoughts.
 
+=head3 The test suite
+
+Z<CHP-1-SECT-3.1.4>
+
+X<Tang, Audrey>
+X<Pugs>
+X<Test suite>
+
+The design documents describe the Perl 6 language in prose, and the test
+suite is intended to translate that specification into code.
+
+In 2005 Audrey Tang started a Perl 6 compiler named I<Pugs>. It is
+written in Haskell, and its development moved very fast. The test suite
+began both as regression tests and as a feature wish list, and is now
+slowly being translated into an implementation-agnostic, official test
+suite that can be used by all implementations.
+
+Once it is done, every compiler that passes the test suite may name
+itself I<Perl 6>.
+
 =cut
+
+# vim: sw=3 ts=3 expandtab tw=72
docs/tutorial/ch05_subroutines.pod
--- r22445
+++ r22654
@@ -923,5 +923,5 @@
 
     macro funky (Str $whatever)
-      is parsed (/:w like (\w+), you know/)
+      is parsed (/:s like (\w+), you know/)
     {
       return { plain($whatever); };
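The ch05 change above swaps the old :w modifier for :s (:sigspace). As a rough illustration of what that modifier does, the sketch below compares a :s pattern with the explicit \s+ form that the ch07 text describes. The sample string and variable name are made up for the example, and it assumes an implementation that supports the :s and :i modifiers as documented in this changeset.

```perl6
# Hypothetical input; note the run of three spaces after "saw".
my $text = 'I saw   Zaphod';

# With :s, the literal space between the two words in the pattern
# behaves like an explicit \s+.
say 'sigspace form matched' if $text ~~ m:s:i/saw zaphod/;

# The same match with the whitespace requirement spelled out by hand.
say 'explicit form matched' if $text ~~ m:i/saw \s+ zaphod/;
```

Both patterns should accept the input; the :s form is simply easier to read when a pattern contains many words.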
docs/tutorial/ch07_grammars.pod
--- r22445
+++ r22654
@@ -8,13 +8,11 @@
 TODO: This chapter is outdated in some ways
 
-* It uses the old word "rule" instead of "regex" in far too many places
+* It should be explained when we use "rule" and when "regex", and what
+  a "subrule" is.
 * The interpolation rules are outdated
-* <'...'> and <"..."> are now '...' and "..."
 * some of the assertion syntax has changed, for example <foo()> means
   something different now
-* Modifiers: explain :sigspace and :ratchet modifiers
-* Modifiers: :u1, :u2... are now :bytes, :codes, :graphemes etc.
-* Other missing modifiers: :overlap, :ignoreaccent
-* The match object needs explanation
+* Modifiers: explain :ratchet modifier
+* The match object needs more explanation
 
 Z<CHP-7>
@@ -131,6 +129,6 @@
     regex digits {\d+}
 
-There are two more keywords that defines regexes similarly to C<regex>, but
-they imply different modifiers. C<token> introduces a regex that does
+There are two more keywords that define regexes similarly to C<regex>, which
+imply slightly different behavior. C<token> introduces a regex that does
 not backtrack,N<technically it implies the C<:ratchet> modifier> (more details
 on that below; for now it's enough to know that it matches simple regexes
@@ -272,6 +270,8 @@
 The "word"-like metacharacters are C<.>, C<^>, C<^^>, C<$>, C<$$>. The
 C<.> matches any single character, even a newline character. Actually,
-what it matches by default is a Unicode grapheme, but you can change
-that behavior with a pragma in your code, or a modifier on the regex.
+Perl 6 has the notion of a Unicode level, which determines if string
+manipulation happens on the byte, codepoint or grapheme level. C<.>
+matches a character in the current level, which defaults to grapheme.
+The Unicode level can be adjusted with a pragma or with modifiers.
 We'll talk more about modifiers in A<CHP-7-SECT-2.5>"Modifiers" later
 in this chapter.  The C<^> and C<$> metacharacters are zero-width
@@ -416,4 +416,16 @@
 =cell Match an assertion.
 
+=row
+
+=cell C<< <( >>
+
+=cell Start of the capture
+
+=row
+
+=cell C<< )> >>
+
+=cell End of the capture
+
 =end table
 
@@ -438,5 +450,6 @@
 X<regexes;variable interpolation>
 Note that since an ordinary variable now interpolates as a literal
-string by default, the C<\Q> escape is rarely needed.
+string by default, the C<\Q> escape is rarely needed. An interpolated
+array is interpreted as an alternation of all array elements.
 
 A<CHP-7-TABLE-3>Table 7-3 shows the escape sequences for regexes.
@@ -455,4 +468,18 @@
 
 =bodyrows
+
+=row
+
+=cell C<'...'>
+
+=cell Treat everything between the quotes literally, except the backslash C<\>
+and single quotes C<'>
+
+=row
+
+=cell C<"...">
+
+=cell Like C<'...'>, but backslash escape sequences and variable interpolation
+are enabled
 
 =row
@@ -704,5 +731,5 @@
 =row
 
-=cell C **n>
+=cell C<**n>
 
 =cell C<**?n>
@@ -712,6 +739,7 @@
 =row
 
-=cell C<**{n..m}>
-=cell C<**?{n..m}>
+=cell C<**n..m>
+
+=cell C<**?n..m>
 
 =cell Match at least R<n> and no more than R<m> times.
@@ -719,7 +747,7 @@
 =row
 
-=cell C<E<lt>>R<n>C<...E<gt>>
-
-=cell C<E<lt>>R<n>C<...E<gt>?>
+=cell C<**n..*>
+
+=cell C<**?n..*>
 
 =cell Match at least R<n> times.
@@ -802,4 +830,10 @@
 =cell Negate any assertion.
 
+=row
+
+=cell C<< <.rule> >>
+
+=cell Match named rule, without capturing.
+
 =row
 
@@ -810,5 +844,5 @@
 =row
 
-=cell C<E<lt>[...]E<gt>>
+=cell C<E<lt>[...]E<gt>>, C<< <+[...]> >>
 
 =cell Match an enumerated character class.
@@ -819,23 +853,4 @@
 
 =cell Complement a character class (named or enumerated).
-
-=row
-
-=cell C<E<lt>"..."E<gt>>
-
-=cell Match a literal string (interpolated at match time).
-
-=row
-
-=cell C<E<lt>'...'E<gt>>
-
-=cell Match a literal string (not interpolated).
-
-=row
-
-=cell C<E<lt>(...)E<gt>>
-
-=cell Boolean assertion. Execute a closure and match if it returns a true
-result.
 
 =row
@@ -860,7 +875,7 @@
 =row
 
-=cell C<E<lt>E<amp>sub()E<gt>>
-
-=cell Match an anonymous rule returned by a sub.
+=cell C<< <rule(...)> >>
+
+=cell Call a named rule with arguments.
 
 =row
@@ -890,13 +905,15 @@
 cannot attach to the outside of a bare C</.../>. For example:
 
-    m:i /marvin/ # case insensitive
+    m:i /marvin/ # case insensitive
     rule names :i { marvin | ford | arthur }
 
-The single-character modifiers can be grouped, but the others must be
-separated by a colon:
-
-    m:wig/ zaphod / # OK
-    m:words:ignorecase:globally / zaphod / # OK
-    m:wordsignorecaseglobally / zaphod / # Not OK
+Multiple modifiers can be chained, and short and long names can
+be mixed:
+
+    m:s :i :g/ zaphod /
+    m:sigspace :i :global / zaphod /
+
+Modifiers can be negated with the C<:!pair> notation, so C<:!i> forces
+case-sensitive matching.
 
 Most of the modifiers can also go inside the rule, attached to the
@@ -905,8 +922,8 @@
 alteration of the pattern:
 
-    m/:w I saw [:i zaphod] / # only 'zaphod' is case insensitive
-
-The repetition modifiers (C<:R<N>x>, C<:R<N>th>, C<:once>,
-C<:globally>, and C<:exhaustive>) and the continue modifier (C<:cont>)
+    m/:s I saw [:i zaphod] / # only 'zaphod' is case insensitive
+
+The repetition modifiers (C<:R<N>x>, C<:R<N>th>,
+C<:global>, and C<:exhaustive>) and the continue modifier (C<:cont>)
 can't be lexically scoped, because they alter the return value of the
 entire rule.
@@ -917,9 +934,5 @@
 of the number.
 
-The C<:once> modifier on a rule only allows it to match once.  The rule
-will not match again until the you call the C<.reset> method on the rule
-object.
-
-The C<:globally> modifier matches as many times as possible. The
+The C<:global> modifier matches as many times as possible. The
 C<:exhaustive> modifier also matches as many times as possible, but in
 as many different ways as possible.
@@ -935,7 +948,11 @@
 
 By default, rules ignore literal whitespace within the pattern. The
-C<:w> modifier makes rules sensitive to literal whitespace, but in an
-intelligent way. Any cluster of literal whitespace acts like an explicit
-C<\s+> when it separates two identifiers and C<\s*> everywhere else.
+C<:s> or C<:sigspace> modifier makes rules sensitive to literal whitespace,
+but in an intelligent way. Any cluster of literal whitespace acts like an
+explicit C<\s+> when it separates two identifiers and C<\s*> everywhere else.
+
+More specifically, any literal whitespace in the regex is translated to
+an implicit call to C<E<lt>.wsE<gt>>, where the C<ws> rule matches as
+mentioned above, but can also be overridden by the user.
 
 There are no modifiers to alter whether the matched string is treated as
@@ -970,25 +987,34 @@
 =cell Case-insensitive match.
 
-=row
-
-=cell C<:I>
-
-=cell
-
-=cell Case-sensitive match (on by default).
-
-=row
-
-=cell C<:c>
-
-=cell C<:cont>
-
-=cell Continue where the previous match on the string left off.
-
-=row
-
-=cell C<:w>
-
-=cell C<:words>
+=row
+
+=cell C<:a>
+
+=cell C<:ignoreaccent>
+
+=cell Ignore accents and other markings on characters.
+
+=row
+
+=cell C<:c($pos)>
+
+=cell C<:continue($pos)>
+
+=cell Match at position C<$pos> or later. If C<$pos> is omitted, start where the previous match left off.
+
+=row
+
+=cell C<:p>
+
+=cell C<:pos>
+
+=cell Match anchored at position C<$pos>.  If C<$pos> is omitted, start where
+the previous match left off.
+
+=row
+
+=cell C<:s>
+
+=cell C<:sigspace>
 
 =cell Literal whitespace in the pattern matches as C<\s+>
@@ -1013,22 +1039,23 @@
 =row
 
-=cell
-
-=cell C<:once>
-
-=cell Only match the pattern once.
-
-=row
-
 =cell C<:g>
 
-=cell C<:globally>
+=cell C<:global>
 
 =cell Match the pattern as many times as possible, but only possibilities
 that don't overlap.
 
-=row
-
-=cell C<:e>
+=row
+
+=cell C<:ov>
+
+=cell C<:overlap>
+
+=cell Match the pattern as many times as possible, and allow overlapping
+matches, but only one match per starting position.
+
+=row
+
+=cell C<:ex>
 
 =cell C<:exhaustive>
@@ -1041,7 +1068,7 @@
 =cell
 
-=cell C<:u0>
-
-=cell . is a byte.
+=cell C<:bytes>
+
+=cell C<.> is a byte.
 
 =row
@@ -1049,7 +1076,7 @@
 =cell
 
-=cell C<:u1>
-
-=cell . is a Unicode codepoint.
+=cell C<:codes>
+
+=cell C<.> is a Unicode codepoint.
 
 =row
@@ -1057,7 +1084,7 @@
 =cell
 
-=cell C<:u2>
-
-=cell . is a Unicode grapheme.
+=cell C<:graphs>
+
+=cell C<.> is a Unicode grapheme.
 
 =row
@@ -1065,7 +1092,16 @@
 =cell
 
-=cell C<:u3>
-
-=cell . is language dependent.
+=cell C<:chars>
+
+=cell C<.> matches whatever the current Unicode level corresponds to
+(this is the default).
+
+=row
+
+=cell
+
+=cell C<:ratchet>
+
+=cell Imply a C<:> after each atom (see "Backtracking Control" below).
 
 =row
@@ -1079,4 +1115,86 @@
 =end table
 
+=head2 Substitution Modifiers
+
+Special modifiers are available for substitutions that do not make sense
+on normal matches.
+
+The C<:samecase> modifier, or C<:ii> for short, implies the C<:ignorecase>
+modifier, but also carries the case information on a
+character-by-character basis:
+
+    my $s = 'The Quick Brown Fox';
+    $s ~~ s:ii/brown/blue/;
+    say $s;         # The Quick Blue Fox
+
+If the C<:sigspace> modifier is also present, a slightly more
+intelligent algorithm is used. If the source string follows one of the
+case patterns in $table (XXX: make that a proper cross-link),
+that pattern is recognized and applied onto the
+substitution string.
+
+    $_ = 'All Words Capitalized';
+    s:s:ii/.*/other words/;
+    .say;           # Other Words
+
+There's a shortcut for C<s:s> named C<ss>, so you could have written the
+example above as
+C<ss:ii/.*/other words/>.
+
+=begin table picture Case patterns for the :samecase modifier
+
+=headrow
+
+=cell Pattern
+
+=cell Corresponding code
+
+=bodyrows
+
+=row
+
+=cell ALL UPPERCASE
+
+=cell C<.uc>
+
+=row
+
+=cell all lowercase
+
+=cell C<.lc>
+
+=row
+
+=cell Every Word Capitalized
+
+=cell C<.lc.capitalize>
+
+=row
+
+=cell First letter upper, rest lower
+
+=cell C<.lc.ucfirst>
+
+=row
+
+=cell fIRST LETTER LOWER, REST UPPER
+
+=cell C<.uc.lcfirst>
+
+=end table
+
+A similar modifier is C<:sameaccent> (short C<:aa>). Instead of carrying
+case information, it carries accent and marking information.
+
+    my $stuff = 'Möhre';
+    $stuff ~~ s:aa/o/a/;
+    say $stuff;     # Mähre
+
+The third substitution modifier is C<:samespace>, short C<:ss>.  It preserves
+whitespace that is matched by implicit C<E<lt>.wsE<gt>> rules:
+
+    my $s = "Some white\t\n spaces";
+    $s ~~ s:ss/\w+ \w+ \w+/Completely different text/;
+    # $s is now "Completely different\t\n text"
 
 =head1 Built-in Rules
@@ -1160,17 +1278,4 @@
 =cell Zero-width lookbehind. Assert that you're I<after> a pattern.
 
-=row
-
-=cell C<E<lt>prop ...E<gt>>
-
-=cell Match any character with the named property.
-
-=row
-
-=cell C<E<lt>replace(...)E<gt>>
-
-=cell Replace everything matched so far in the rule or subrule with the
-given string (under consideration).
-
 =end table
 
@@ -1245,54 +1350,150 @@
 =end table
 
-=head1 Hypothetical Variables
+The C<:ratchet> modifier, which is implied by regexes declared with the
+C<token> or C<rule> keyword, disables backtracking in the subrule, which
+is the same as adding a C<:> after every atom.
+
+=head1 The Match Object
 
 Z<CHP-7-SECT-5>
 
-X<variables;hypothetical>
-X<hypothetical variables>
+X<object;match>
 X<rules;captures>
-Hypothetical variables are a powerful way of building up data structures
-from within a match. Ordinary captures with C<()> store the result of
-the captures in C<$0>, C<$1>, etc. The values stored in these variables
-will be kept if the match is successful, but thrown away if the match
-fails (hence the term "hypothetical"). The numbered capture variables
-are accessible outside the match, but only within the immediate
-surrounding lexical scope:
-
-    "Zaphod Beeblebrox" ~~ m:w/ (\w+) (\w+) /;
-
-    print $0; # prints Zaphod
-
-You can also capture into any user-defined variable with the binding
-operator C<:=>. These variables must already be defined in the lexical
-scope surrounding the rule:
-
-    my $person;
-    "Zaphod's just this guy."  ~~ / ^ $person := (\w+) /;
-    print $person; # prints Zaphod
-
-Repeated matches can be captured into an array:
-
-    my @words;
-    "feefifofum" ~~ / @words := (f<-[f]>+)* /;
-    # @words contains ("fee", "fi", "fo", "fum")
-
-Pairs of repeated matches can be captured into a hash:
-
-    my %customers;
-    $records ~~ m:w/ %customers := [ E<lt>idE<gt> = E<lt>nameE<gt> \n]* /;
-
-If you don't need the captured value outside the rule, use a C<$?>
-variable instead. These are only directly accessible within the rule:
-
-    "Zaphod saw Zaphod" ~~ m:w/ $?name := (\w+) \w+ $?name/;
-
-A match of a named rule stores the result in a C<$?> variable with the
-same name as the rule. These variables are also accessible only within
-the rule:
-
-    "Zaphod saw Zaphod" ~~ m:w/ E<lt>nameE<gt> \w+ $?name /;
-
+
+A regex match produces a I<Match> object, which contains all information
+about the match, including start and end position, matched string, and all
+captures.
+
+The match object is returned from a regex match, and is also stored in
+the special variable C<$/>.
+
+    my $match = 'Zaphod Beeblebrox' ~~ m/\w+/;
+    say $match;     # prints Zaphod
+
+In string context it evaluates to the text of the matched part of the
+string.
+
+Table A<CHP-7-TABLE-Match> summarises the properties of the match object.
+
+The variables C<$0>, C<$1>, C<$2> etc. are aliases to C<$/[0]>,
+C<$/[1]>, C<$/[2]>, and C<$E<lt>nameE<gt>> is an alias to
+C<$/E<lt>nameE<gt>>. Likewise an empty C<@()> is the same as C<@($/)>,
+and C<%()> stands for C<%($/)>.
+
+Match variables can also store a different scalar object. A closure in a
+regex can store such an object by calling C<make>, and it can be accessed
+by forcing scalar context with C<$( $/ )>:
+
+    regex herd :i :s {
+        (\d+)
+        (\w+)s?
+        {
+            make Herd.new(
+                animal => $1.capitalize,
+                count  => $0,
+            );
+        }
+    }
+    'Yesterday we saw 4 mooses' ~~ m/ <herd> /;
+    # now $($<herd>) contains the new Herd object
+
+This can be used to build object trees directly from regex matches.
+
+=begin table picture Properties of the Match object
+
+Z<CHP-7-TABLE-Match>
+
+=headrow
+
+=cell Property
+
+=cell Description
+
+=bodyrows
+
+=row
+
+=cell C<?$/>
+
+=cell True if the match was successful.
+
+=row
+
+=cell C<$/.text>
+
+=cell The matched part of the string.
+
+=row
+
+=cell C<$/.from>
+
+=cell Start position of the match.
+
+=row
+
+=cell C<$/.to>
+
+=cell End position of the match.
+
+=row
+
+=cell C<@( $/ )>
+
+=cell List of all positional captures.
+
+=row
+
+=cell C<%( $/ )>
+
+=cell Hash of all named captures.
+
+=row
+
+=cell C<$/[$n]>
+
+=cell C<$n>th positional capture.
+
+=row
+
+=cell C<$/E<lt>nameE<gt>>
+
+=cell Access to a particular named capture.
+
+=end table
+
+Capture variables are always match objects, and contain the information
+of their respective submatches.
+
+    m/ ( a ( geek ) ( passes ) ) ( many tests ) /
+       |   |      | |        | | |            |
+       |   $/[0][0] $/[0][1]-+ | |            |
+       |                       | |            |
+       $/[0]-------------------+ $/[1] -------+
+
+If a capturing group is quantified, it automatically becomes an array of
+match objects. Subsequent matches are not renumbered:
+
+    '12 45 books' ~~ m:s/ ( \d+ )+ (\w+) /;
+    say $0[0];      # 12
+    say $0[1];      # 45
+    say $1;         # books
+
+When a subrule is called with the C<E<lt>subruleE<gt>> syntax, it
+produces a named capture of name C<subrule>. That name can be
+changed with the C<E<lt>newname=subruleE<gt>> syntax.
+
+    token identifier { \w+ }
+    token number     { \d+ }
+    $_ = '24 hours';
+    if m:s/<number> <unit=identifier> / {
+        say "Number: $<number>. Unit: $<unit>";
+    }
+
+These variables are also available in the regex itself:
+
+    "Zaphod saw Zaphod" ~~ m:s/ E<lt>nameE<gt> \w+ $/<name> /;
 
 
 =cut
+
+# vim: sw=3 ts=3 expandtab ft=pod tw=72
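The Match object properties introduced in the ch07 diff can be exercised with a short script. This is a minimal sketch, assuming an implementation in which the match result supports .from, .to and stringification as the new table describes; the pattern and variable name are illustrative and not taken from the tutorial.

```perl6
# Two positional captures separated by explicit whitespace.
my $match = 'Zaphod Beeblebrox' ~~ m/ (\w+) \s+ (\w+) /;

if $match {
    say ~$match;       # the matched text: Zaphod Beeblebrox
    say $match.from;   # start position: 0
    say $match.to;     # end position: 17
    say ~$0;           # first positional capture: Zaphod
    say ~$/[1];        # second positional capture: Beeblebrox
}
```

The prefix ~ forces string context, which is the portable way to get at the matched text whether or not a given implementation provides the .text method listed in the table.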
