` (with restrictions [2]). * - The empty end tag `` which is ignored in the browser and DOM. * * [1]: There are no CDATA sections in HTML. When encountering `` becomes a bogus HTML comment, meaning there can be no CDATA * section in an HTML document containing `>`. The Tag Processor will first find * all valid and bogus HTML comments, and then if the comment _would_ have been a * CDATA section _were they to exist_, it will indicate this as the type of comment. * * [2]: XML allows a broader range of characters in a processing instruction's target name * and disallows "xml" as a name, since it's special. The Tag Processor only recognizes * target names with an ASCII-representable subset of characters. It also exhibits the * same constraint as with CDATA sections, in that `>` cannot exist within the token * since Processing Instructions do no exist within HTML and their syntax transforms * into a bogus comment in the DOM. * * ## Design and limitations * * The Tag Processor is designed to linearly scan HTML documents and tokenize * HTML tags and their attributes. It's designed to do this as efficiently as * possible without compromising parsing integrity. Therefore it will be * slower than some methods of modifying HTML, such as those incorporating * over-simplified PCRE patterns, but will not introduce the defects and * failures that those methods bring in, which lead to broken page renders * and often to security vulnerabilities. On the other hand, it will be faster * than full-blown HTML parsers such as DOMDocument and use considerably * less memory. It requires a negligible memory overhead, enough to consider * it a zero-overhead system. * * The performance characteristics are maintained by avoiding tree construction * and semantic cleanups which are specified in HTML5. Because of this, for * example, it's not possible for the Tag Processor to associate any given * opening tag with its corresponding closing tag, or to return the inner markup * inside an element. Systems may be built on top of the Tag Processor to do * this, but the Tag Processor is and should be constrained so it can remain an * efficient, low-level, and reliable HTML scanner. * * The Tag Processor's design incorporates a "garbage-in-garbage-out" philosophy. * HTML5 specifies that certain invalid content be transformed into different forms * for display, such as removing null bytes from an input document and replacing * invalid characters with the Unicode replacement character `U+FFFD` (visually "�"). * Where errors or transformations exist within the HTML5 specification, the Tag Processor * leaves those invalid inputs untouched, passing them through to the final browser * to handle. While this implies that certain operations will be non-spec-compliant, * such as reading the value of an attribute with invalid content, it also preserves a * simplicity and efficiency for handling those error cases. * * Most operations within the Tag Processor are designed to minimize the difference * between an input and output document for any given change. For example, the * `add_class` and `remove_class` methods preserve whitespace and the class ordering * within the `class` attribute; and when encountering tags with duplicated attributes, * the Tag Processor will leave those invalid duplicate attributes where they are but * update the proper attribute which the browser will read for parsing its value. An * exception to this rule is that all attribute updates store their values as * double-quoted strings, meaning that attributes on input with single-quoted or * unquoted values will appear in the output with double-quotes. * * ### Scripting Flag * * The Tag Processor parses HTML with the "scripting flag" disabled. This means * that it doesn't run any scripts while parsing the page. In a browser with * JavaScript enabled, for example, the script can change the parse of the * document as it loads. On the server, however, evaluating JavaScript is not * only impractical, but also unwanted. * * Practically this means that the Tag Processor will descend into NOSCRIPT * elements and process its child tags. Were the scripting flag enabled, such * as in a typical browser, the contents of NOSCRIPT are skipped entirely. * * This allows the HTML API to process the content that will be presented in * a browser when scripting is disabled, but it offers a different view of a * page than most browser sessions will experience. E.g. the tags inside the * NOSCRIPT disappear. * * ### Text Encoding * * The Tag Processor assumes that the input HTML document is encoded with a * text encoding compatible with 7-bit ASCII's '<', '>', '&', ';', '/', '=', * "'", '"', 'a' - 'z', 'A' - 'Z', and the whitespace characters ' ', tab, * carriage-return, newline, and form-feed. * * In practice, this includes almost every single-byte encoding as well as * UTF-8. Notably, however, it does not include UTF-16. If providing input * that's incompatible, then convert the encoding beforehand. * * @since 6.2.0 * @since 6.2.1 Fix: Support for various invalid comments; attribute updates are case-insensitive. * @since 6.3.2 Fix: Skip HTML-like content inside rawtext elements such as STYLE. * @since 6.5.0 Pauses processor when input ends in an incomplete syntax token. * Introduces "special" elements which act like void elements, e.g. TITLE, STYLE. * Allows scanning through all tokens and processing modifiable text, where applicable. */ class WP_HTML_Tag_Processor { const MAX_BOOKMARKS = 10; const MAX_SEEK_OPS = 1000; protected $html; private $last_query; private $sought_tag_name; private $sought_class_name; private $sought_match_offset; private $stop_on_tag_closers; protected $parser_state = self::STATE_READY; protected $compat_mode = self::NO_QUIRKS_MODE; private $parsing_namespace = 'html'; protected $comment_type = null; protected $text_node_classification = self::TEXT_IS_GENERIC; private $bytes_already_parsed = 0; private $token_starts_at; private $token_length; private $tag_name_starts_at; private $tag_name_length; private $text_starts_at; private $text_length; private $is_closing_tag; private $attributes = array(); private $duplicate_attributes = null; private $classname_updates = array(); protected $bookmarks = array(); const ADD_CLASS = true; const REMOVE_CLASS = false; const SKIP_CLASS = null; protected $lexical_updates = array(); protected $seek_count = 0; private $skip_newline_at = null; public function __construct( $html ) { $this->html = $html; } public function change_parsing_namespace( string $new_namespace ): bool { if ( ! in_array( $new_namespace, array( 'html', 'math', 'svg' ), true ) ) { return false; } $this->parsing_namespace = $new_namespace; return true; } public function next_tag( $query = null ): bool { $this->parse_query( $query ); $already_found = 0; do { if ( false === $this->next_token() ) { return false; } if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { continue; } if ( $this->matches() ) { ++$already_found; } } while ( $already_found < $this->sought_match_offset ); return true; } public function next_token(): bool { return $this->base_class_next_token(); } private function base_class_next_token(): bool { $was_at = $this->bytes_already_parsed; $this->after_tag(); // Don't proceed if there's nothing more to scan. if ( self::STATE_COMPLETE === $this->parser_state || self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { return false; } /* * The next step in the parsing loop determines the parsing state; * clear it so that state doesn't linger from the previous step. */ $this->parser_state = self::STATE_READY; if ( $this->bytes_already_parsed >= strlen( $this->html ) ) { $this->parser_state = self::STATE_COMPLETE; return false; } // Find the next tag if it exists. if ( false === $this->parse_next_tag() ) { if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { $this->bytes_already_parsed = $was_at; } return false; } /* * For legacy reasons the rest of this function handles tags and their * attributes. If the processor has reached the end of the document * or if it matched any other token then it should return here to avoid * attempting to process tag-specific syntax. */ if ( self::STATE_INCOMPLETE_INPUT !== $this->parser_state && self::STATE_COMPLETE !== $this->parser_state && self::STATE_MATCHED_TAG !== $this->parser_state ) { return true; } // Parse all of its attributes. while ( $this->parse_next_attribute() ) { continue; } // Ensure that the tag closes before the end of the document. if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state || $this->bytes_already_parsed >= strlen( $this->html ) ) { // Does this appropriately clear state (parsed attributes)? $this->parser_state = self::STATE_INCOMPLETE_INPUT; $this->bytes_already_parsed = $was_at; return false; } $tag_ends_at = strpos( $this->html, '>', $this->bytes_already_parsed ); if ( false === $tag_ends_at ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; $this->bytes_already_parsed = $was_at; return false; } $this->parser_state = self::STATE_MATCHED_TAG; $this->bytes_already_parsed = $tag_ends_at + 1; $this->token_length = $this->bytes_already_parsed - $this->token_starts_at; /* * Certain tags require additional processing. The first-letter pre-check * avoids unnecessary string allocation when comparing the tag names. * * - IFRAME * - LISTING (deprecated) * - NOEMBED (deprecated) * - NOFRAMES (deprecated) * - PRE * - SCRIPT * - STYLE * - TEXTAREA * - TITLE * - XMP (deprecated) */ if ( $this->is_closing_tag || 'html' !== $this->parsing_namespace || 1 !== strspn( $this->html, 'iIlLnNpPsStTxX', $this->tag_name_starts_at, 1 ) ) { return true; } $tag_name = $this->get_tag(); /* * For LISTING, PRE, and TEXTAREA, the first linefeed of an immediately-following * text node is ignored as an authoring convenience. * * @see static::skip_newline_at */ if ( 'LISTING' === $tag_name || 'PRE' === $tag_name ) { $this->skip_newline_at = $this->bytes_already_parsed; return true; } /* * There are certain elements whose children are not DATA but are instead * RCDATA or RAWTEXT. These cannot contain other elements, and the contents * are parsed as plaintext, with character references decoded in RCDATA but * not in RAWTEXT. * * These elements are described here as "self-contained" or special atomic * elements whose end tag is consumed with the opening tag, and they will * contain modifiable text inside of them. * * Preserve the opening tag pointers, as these will be overwritten * when finding the closing tag. They will be reset after finding * the closing to tag to point to the opening of the special atomic * tag sequence. */ $tag_name_starts_at = $this->tag_name_starts_at; $tag_name_length = $this->tag_name_length; $tag_ends_at = $this->token_starts_at + $this->token_length; $attributes = $this->attributes; $duplicate_attributes = $this->duplicate_attributes; // Find the closing tag if necessary. switch ( $tag_name ) { case 'SCRIPT': $found_closer = $this->skip_script_data(); break; case 'TEXTAREA': case 'TITLE': $found_closer = $this->skip_rcdata( $tag_name ); break; /* * In the browser this list would include the NOSCRIPT element, * but the Tag Processor is an environment with the scripting * flag disabled, meaning that it needs to descend into the * NOSCRIPT element to be able to properly process what will be * sent to a browser. * * Note that this rule makes HTML5 syntax incompatible with XML, * because the parsing of this token depends on client application. * The NOSCRIPT element cannot be represented in the XHTML syntax. */ case 'IFRAME': case 'NOEMBED': case 'NOFRAMES': case 'STYLE': case 'XMP': $found_closer = $this->skip_rawtext( $tag_name ); break; // No other tags should be treated in their entirety here. default: return true; } if ( ! $found_closer ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; $this->bytes_already_parsed = $was_at; return false; } /* * The values here look like they reference the opening tag but they reference * the closing tag instead. This is why the opening tag values were stored * above in a variable. It reads confusingly here, but that's because the * functions that skip the contents have moved all the internal cursors past * the inner content of the tag. */ $this->token_starts_at = $was_at; $this->token_length = $this->bytes_already_parsed - $this->token_starts_at; $this->text_starts_at = $tag_ends_at; $this->text_length = $this->tag_name_starts_at - $this->text_starts_at; $this->tag_name_starts_at = $tag_name_starts_at; $this->tag_name_length = $tag_name_length; $this->attributes = $attributes; $this->duplicate_attributes = $duplicate_attributes; return true; } public function paused_at_incomplete_token(): bool { return self::STATE_INCOMPLETE_INPUT === $this->parser_state; } public function class_list() { if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { return; } $class = $this->get_attribute( 'class' ); if ( ! is_string( $class ) ) { return; } $seen = array(); $is_quirks = self::QUIRKS_MODE === $this->compat_mode; $at = 0; while ( $at < strlen( $class ) ) { // Skip past any initial boundary characters. $at += strspn( $class, " \t\f\r\n", $at ); if ( $at >= strlen( $class ) ) { return; } // Find the byte length until the next boundary. $length = strcspn( $class, " \t\f\r\n", $at ); if ( 0 === $length ) { return; } $name = str_replace( "\x00", "\u{FFFD}", substr( $class, $at, $length ) ); if ( $is_quirks ) { $name = strtolower( $name ); } $at += $length; /* * It's expected that the number of class names for a given tag is relatively small. * Given this, it is probably faster overall to scan an array for a value rather * than to use the class name as a key and check if it's a key of $seen. */ if ( in_array( $name, $seen, true ) ) { continue; } $seen[] = $name; yield $name; } } public function has_class( $wanted_class ): ?bool { if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { return null; } $case_insensitive = self::QUIRKS_MODE === $this->compat_mode; $wanted_length = strlen( $wanted_class ); foreach ( $this->class_list() as $class_name ) { if ( strlen( $class_name ) === $wanted_length && 0 === substr_compare( $class_name, $wanted_class, 0, strlen( $wanted_class ), $case_insensitive ) ) { return true; } } return false; } public function set_bookmark( $name ): bool { // It only makes sense to set a bookmark if the parser has paused on a concrete token. if ( self::STATE_COMPLETE === $this->parser_state || self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { return false; } if ( ! array_key_exists( $name, $this->bookmarks ) && count( $this->bookmarks ) >= static::MAX_BOOKMARKS ) { _doing_it_wrong( __METHOD__, __( 'Too many bookmarks: cannot create any more.' ), '6.2.0' ); return false; } $this->bookmarks[ $name ] = new WP_HTML_Span( $this->token_starts_at, $this->token_length ); return true; } public function release_bookmark( $name ): bool { if ( ! array_key_exists( $name, $this->bookmarks ) ) { return false; } unset( $this->bookmarks[ $name ] ); return true; } private function skip_rawtext( string $tag_name ): bool { /* * These two functions distinguish themselves on whether character references are * decoded, and since functionality to read the inner markup isn't supported, it's * not necessary to implement these two functions separately. */ return $this->skip_rcdata( $tag_name ); } private function skip_rcdata( string $tag_name ): bool { $html = $this->html; $doc_length = strlen( $html ); $tag_length = strlen( $tag_name ); $at = $this->bytes_already_parsed; while ( false !== $at && $at < $doc_length ) { $at = strpos( $this->html, 'tag_name_starts_at = $at; // Fail if there is no possible tag closer. if ( false === $at || ( $at + $tag_length ) >= $doc_length ) { return false; } $at += 2; /* * Find a case-insensitive match to the tag name. * * Because tag names are limited to US-ASCII there is no * need to perform any kind of Unicode normalization when * comparing; any character which could be impacted by such * normalization could not be part of a tag name. */ for ( $i = 0; $i < $tag_length; $i++ ) { $tag_char = $tag_name[ $i ]; $html_char = $html[ $at + $i ]; if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) { $at += $i; continue 2; } } $at += $tag_length; $this->bytes_already_parsed = $at; if ( $at >= strlen( $html ) ) { return false; } /* * Ensure that the tag name terminates to avoid matching on * substrings of a longer tag name. For example, the sequence * "' !== $c ) { continue; } while ( $this->parse_next_attribute() ) { continue; } $at = $this->bytes_already_parsed; if ( $at >= strlen( $this->html ) ) { return false; } if ( '>' === $html[ $at ] ) { $this->bytes_already_parsed = $at + 1; return true; } if ( $at + 1 >= strlen( $this->html ) ) { return false; } if ( '/' === $html[ $at ] && '>' === $html[ $at + 1 ] ) { $this->bytes_already_parsed = $at + 2; return true; } } return false; } private function skip_script_data(): bool { $state = 'unescaped'; $html = $this->html; $doc_length = strlen( $html ); $at = $this->bytes_already_parsed; while ( false !== $at && $at < $doc_length ) { $at += strcspn( $html, '-<', $at ); /* * For all script states a "-->" transitions * back into the normal unescaped script mode, * even if that's the current state. */ if ( $at + 2 < $doc_length && '-' === $html[ $at ] && '-' === $html[ $at + 1 ] && '>' === $html[ $at + 2 ] ) { $at += 3; $state = 'unescaped'; continue; } if ( $at + 1 >= $doc_length ) { return false; } /* * Everything of interest past here starts with "<". * Check this character and advance position regardless. */ if ( '<' !== $html[ $at++ ] ) { continue; } /* * Unlike with "-->", the "`. Unlike other comment * and bogus comment syntax, these leave no clear insertion point for text and * they need to be modified specially in order to contain text. E.g. to store * `?` as the modifiable text, the `` needs to become ``, which * involves inserting an additional `-` into the token after the modifiable text. */ $this->parser_state = self::STATE_COMMENT; $this->comment_type = self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT; $this->token_length = $closer_at + $span_of_dashes + 1 - $this->token_starts_at; // Only provide modifiable text if the token is long enough to contain it. if ( $span_of_dashes >= 2 ) { $this->comment_type = self::COMMENT_AS_HTML_COMMENT; $this->text_starts_at = $this->token_starts_at + 4; $this->text_length = $span_of_dashes - 2; } $this->bytes_already_parsed = $closer_at + $span_of_dashes + 1; return true; } /* * Comments may be closed by either a --> or an invalid --!>. * The first occurrence closes the comment. * * See https://html.spec.whatwg.org/#parse-error-incorrectly-closed-comment */ --$closer_at; // Pre-increment inside condition below reduces risk of accidental infinite looping. while ( ++$closer_at < $doc_length ) { $closer_at = strpos( $html, '--', $closer_at ); if ( false === $closer_at ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; return false; } if ( $closer_at + 2 < $doc_length && '>' === $html[ $closer_at + 2 ] ) { $this->parser_state = self::STATE_COMMENT; $this->comment_type = self::COMMENT_AS_HTML_COMMENT; $this->token_length = $closer_at + 3 - $this->token_starts_at; $this->text_starts_at = $this->token_starts_at + 4; $this->text_length = $closer_at - $this->text_starts_at; $this->bytes_already_parsed = $closer_at + 3; return true; } if ( $closer_at + 3 < $doc_length && '!' === $html[ $closer_at + 2 ] && '>' === $html[ $closer_at + 3 ] ) { $this->parser_state = self::STATE_COMMENT; $this->comment_type = self::COMMENT_AS_HTML_COMMENT; $this->token_length = $closer_at + 4 - $this->token_starts_at; $this->text_starts_at = $this->token_starts_at + 4; $this->text_length = $closer_at - $this->text_starts_at; $this->bytes_already_parsed = $closer_at + 4; return true; } } } /* * ` * These are ASCII-case-insensitive. * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state */ if ( $doc_length > $at + 8 && ( 'D' === $html[ $at + 2 ] || 'd' === $html[ $at + 2 ] ) && ( 'O' === $html[ $at + 3 ] || 'o' === $html[ $at + 3 ] ) && ( 'C' === $html[ $at + 4 ] || 'c' === $html[ $at + 4 ] ) && ( 'T' === $html[ $at + 5 ] || 't' === $html[ $at + 5 ] ) && ( 'Y' === $html[ $at + 6 ] || 'y' === $html[ $at + 6 ] ) && ( 'P' === $html[ $at + 7 ] || 'p' === $html[ $at + 7 ] ) && ( 'E' === $html[ $at + 8 ] || 'e' === $html[ $at + 8 ] ) ) { $closer_at = strpos( $html, '>', $at + 9 ); if ( false === $closer_at ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; return false; } $this->parser_state = self::STATE_DOCTYPE; $this->token_length = $closer_at + 1 - $this->token_starts_at; $this->text_starts_at = $this->token_starts_at + 9; $this->text_length = $closer_at - $this->text_starts_at; $this->bytes_already_parsed = $closer_at + 1; return true; } if ( 'html' !== $this->parsing_namespace && strlen( $html ) > $at + 8 && '[' === $html[ $at + 2 ] && 'C' === $html[ $at + 3 ] && 'D' === $html[ $at + 4 ] && 'A' === $html[ $at + 5 ] && 'T' === $html[ $at + 6 ] && 'A' === $html[ $at + 7 ] && '[' === $html[ $at + 8 ] ) { $closer_at = strpos( $html, ']]>', $at + 9 ); if ( false === $closer_at ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; return false; } $this->parser_state = self::STATE_CDATA_NODE; $this->text_starts_at = $at + 9; $this->text_length = $closer_at - $this->text_starts_at; $this->token_length = $closer_at + 3 - $this->token_starts_at; $this->bytes_already_parsed = $closer_at + 3; return true; } /* * Anything else here is an incorrectly-opened comment and transitions * to the bogus comment state - skip to the nearest >. If no closer is * found then the HTML was truncated inside the markup declaration. */ $closer_at = strpos( $html, '>', $at + 1 ); if ( false === $closer_at ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; return false; } $this->parser_state = self::STATE_COMMENT; $this->comment_type = self::COMMENT_AS_INVALID_HTML; $this->token_length = $closer_at + 1 - $this->token_starts_at; $this->text_starts_at = $this->token_starts_at + 2; $this->text_length = $closer_at - $this->text_starts_at; $this->bytes_already_parsed = $closer_at + 1; /* * Identify nodes that would be CDATA if HTML had CDATA sections. * * This section must occur after identifying the bogus comment end * because in an HTML parser it will span to the nearest `>`, even * if there's no `]]>` as would be required in an XML document. It * is therefore not possible to parse a CDATA section containing * a `>` in the HTML syntax. * * Inside foreign elements there is a discrepancy between browsers * and the specification on this. * * @todo Track whether the Tag Processor is inside a foreign element * and require the proper closing `]]>` in those cases. */ if ( $this->token_length >= 10 && '[' === $html[ $this->token_starts_at + 2 ] && 'C' === $html[ $this->token_starts_at + 3 ] && 'D' === $html[ $this->token_starts_at + 4 ] && 'A' === $html[ $this->token_starts_at + 5 ] && 'T' === $html[ $this->token_starts_at + 6 ] && 'A' === $html[ $this->token_starts_at + 7 ] && '[' === $html[ $this->token_starts_at + 8 ] && ']' === $html[ $closer_at - 1 ] && ']' === $html[ $closer_at - 2 ] ) { $this->parser_state = self::STATE_COMMENT; $this->comment_type = self::COMMENT_AS_CDATA_LOOKALIKE; $this->text_starts_at += 7; $this->text_length -= 9; } return true; } /* * is a missing end tag name, which is ignored. * * This was also known as the "presumptuous empty tag" * in early discussions as it was proposed to close * the nearest previous opening tag. * * See https://html.spec.whatwg.org/#parse-error-missing-end-tag-name */ if ( '>' === $html[ $at + 1 ] ) { // `<>` is interpreted as plaintext. if ( ! $this->is_closing_tag ) { ++$at; continue; } $this->parser_state = self::STATE_PRESUMPTUOUS_TAG; $this->token_length = $at + 2 - $this->token_starts_at; $this->bytes_already_parsed = $at + 2; return true; } /* * ` * See https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state */ if ( ! $this->is_closing_tag && '?' === $html[ $at + 1 ] ) { $closer_at = strpos( $html, '>', $at + 2 ); if ( false === $closer_at ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; return false; } $this->parser_state = self::STATE_COMMENT; $this->comment_type = self::COMMENT_AS_INVALID_HTML; $this->token_length = $closer_at + 1 - $this->token_starts_at; $this->text_starts_at = $this->token_starts_at + 2; $this->text_length = $closer_at - $this->text_starts_at; $this->bytes_already_parsed = $closer_at + 1; /* * Identify a Processing Instruction node were HTML to have them. * * This section must occur after identifying the bogus comment end * because in an HTML parser it will span to the nearest `>`, even * if there's no `?>` as would be required in an XML document. It * is therefore not possible to parse a Processing Instruction node * containing a `>` in the HTML syntax. * * XML allows for more target names, but this code only identifies * those with ASCII-representable target names. This means that it * may identify some Processing Instruction nodes as bogus comments, * but it will not misinterpret the HTML structure. By limiting the * identification to these target names the Tag Processor can avoid * the need to start parsing UTF-8 sequences. * * > NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | * [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | * [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | * [#x10000-#xEFFFF] * > NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040] * * @todo Processing instruction nodes in SGML may contain any kind of markup. XML defines a * special case with `` syntax, but the `?` is part of the bogus comment. * * @see https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget */ if ( $this->token_length >= 5 && '?' === $html[ $closer_at - 1 ] ) { $comment_text = substr( $html, $this->token_starts_at + 2, $this->token_length - 4 ); $pi_target_length = strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_' ); if ( 0 < $pi_target_length ) { $pi_target_length += strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.', $pi_target_length ); $this->comment_type = self::COMMENT_AS_PI_NODE_LOOKALIKE; $this->tag_name_starts_at = $this->token_starts_at + 2; $this->tag_name_length = $pi_target_length; $this->text_starts_at += $pi_target_length; $this->text_length -= $pi_target_length + 1; } } return true; } /* * If a non-alpha starts the tag name in a tag closer it's a comment. * Find the first `>`, which closes the comment. * * This parser classifies these particular comments as special "funky comments" * which are made available for further processing. * * See https://html.spec.whatwg.org/#parse-error-invalid-first-character-of-tag-name */ if ( $this->is_closing_tag ) { // No chance of finding a closer. if ( $at + 3 > $doc_length ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; return false; } $closer_at = strpos( $html, '>', $at + 2 ); if ( false === $closer_at ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; return false; } $this->parser_state = self::STATE_FUNKY_COMMENT; $this->token_length = $closer_at + 1 - $this->token_starts_at; $this->text_starts_at = $this->token_starts_at + 2; $this->text_length = $closer_at - $this->text_starts_at; $this->bytes_already_parsed = $closer_at + 1; return true; } ++$at; } /* * This does not imply an incomplete parse; it indicates that there * can be nothing left in the document other than a #text node. */ $this->parser_state = self::STATE_TEXT_NODE; $this->token_starts_at = $was_at; $this->token_length = $doc_length - $was_at; $this->text_starts_at = $was_at; $this->text_length = $this->token_length; $this->bytes_already_parsed = $doc_length; return true; } private function parse_next_attribute(): bool { $doc_length = strlen( $this->html ); // Skip whitespace and slashes. $this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n/", $this->bytes_already_parsed ); if ( $this->bytes_already_parsed >= $doc_length ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; return false; } /* * Treat the equal sign as a part of the attribute * name if it is the first encountered byte. * * @see https://html.spec.whatwg.org/multipage/parsing.html#before-attribute-name-state */ $name_length = '=' === $this->html[ $this->bytes_already_parsed ] ? 1 + strcspn( $this->html, "=/> \t\f\r\n", $this->bytes_already_parsed + 1 ) : strcspn( $this->html, "=/> \t\f\r\n", $this->bytes_already_parsed ); // No attribute, just tag closer. if ( 0 === $name_length || $this->bytes_already_parsed + $name_length >= $doc_length ) { return false; } $attribute_start = $this->bytes_already_parsed; $attribute_name = substr( $this->html, $attribute_start, $name_length ); $this->bytes_already_parsed += $name_length; if ( $this->bytes_already_parsed >= $doc_length ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; return false; } $this->skip_whitespace(); if ( $this->bytes_already_parsed >= $doc_length ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; return false; } $has_value = '=' === $this->html[ $this->bytes_already_parsed ]; if ( $has_value ) { ++$this->bytes_already_parsed; $this->skip_whitespace(); if ( $this->bytes_already_parsed >= $doc_length ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; return false; } switch ( $this->html[ $this->bytes_already_parsed ] ) { case "'": case '"': $quote = $this->html[ $this->bytes_already_parsed ]; $value_start = $this->bytes_already_parsed + 1; $end_quote_at = strpos( $this->html, $quote, $value_start ); $end_quote_at = false === $end_quote_at ? $doc_length : $end_quote_at; $value_length = $end_quote_at - $value_start; $attribute_end = $end_quote_at + 1; $this->bytes_already_parsed = $attribute_end; break; default: $value_start = $this->bytes_already_parsed; $value_length = strcspn( $this->html, "> \t\f\r\n", $value_start ); $attribute_end = $value_start + $value_length; $this->bytes_already_parsed = $attribute_end; } } else { $value_start = $this->bytes_already_parsed; $value_length = 0; $attribute_end = $attribute_start + $name_length; } if ( $attribute_end >= $doc_length ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; return false; } if ( $this->is_closing_tag ) { return true; } /* * > There must never be two or more attributes on * > the same start tag whose names are an ASCII * > case-insensitive match for each other. * - HTML 5 spec * * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive */ $comparable_name = strtolower( $attribute_name ); // If an attribute is listed many times, only use the first declaration and ignore the rest. if ( ! isset( $this->attributes[ $comparable_name ] ) ) { $this->attributes[ $comparable_name ] = new WP_HTML_Attribute_Token( $attribute_name, $value_start, $value_length, $attribute_start, $attribute_end - $attribute_start, ! $has_value ); return true; } /* * Track the duplicate attributes so if we remove it, all disappear together. * * While `$this->duplicated_attributes` could always be stored as an `array()`, * which would simplify the logic here, storing a `null` and only allocating * an array when encountering duplicates avoids needless allocations in the * normative case of parsing tags with no duplicate attributes. */ $duplicate_span = new WP_HTML_Span( $attribute_start, $attribute_end - $attribute_start ); if ( null === $this->duplicate_attributes ) { $this->duplicate_attributes = array( $comparable_name => array( $duplicate_span ) ); } elseif ( ! isset( $this->duplicate_attributes[ $comparable_name ] ) ) { $this->duplicate_attributes[ $comparable_name ] = array( $duplicate_span ); } else { $this->duplicate_attributes[ $comparable_name ][] = $duplicate_span; } return true; } private function skip_whitespace(): void { $this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n", $this->bytes_already_parsed ); } private function after_tag(): void { /* * There could be lexical updates enqueued for an attribute that * also exists on the next tag. In order to avoid conflating the * attributes across the two tags, lexical updates with names * need to be flushed to raw lexical updates. */ $this->class_name_updates_to_attributes_updates(); /* * Purge updates if there are too many. The actual count isn't * scientific, but a few values from 100 to a few thousand were * tests to find a practically-useful limit. * * If the update queue grows too big, then the Tag Processor * will spend more time iterating through them and lose the * efficiency gains of deferring applying them. */ if ( 1000 < count( $this->lexical_updates ) ) { $this->get_updated_html(); } foreach ( $this->lexical_updates as $name => $update ) { /* * Any updates appearing after the cursor should be applied * before proceeding, otherwise they may be overlooked. */ if ( $update->start >= $this->bytes_already_parsed ) { $this->get_updated_html(); break; } if ( is_int( $name ) ) { continue; } $this->lexical_updates[] = $update; unset( $this->lexical_updates[ $name ] ); } $this->token_starts_at = null; $this->token_length = null; $this->tag_name_starts_at = null; $this->tag_name_length = null; $this->text_starts_at = 0; $this->text_length = 0; $this->is_closing_tag = null; $this->attributes = array(); $this->comment_type = null; $this->text_node_classification = self::TEXT_IS_GENERIC; $this->duplicate_attributes = null; } private function class_name_updates_to_attributes_updates(): void { if ( count( $this->classname_updates ) === 0 ) { return; } $existing_class = $this->get_enqueued_attribute_value( 'class' ); if ( null === $existing_class || true === $existing_class ) { $existing_class = ''; } if ( false === $existing_class && isset( $this->attributes['class'] ) ) { $existing_class = substr( $this->html, $this->attributes['class']->value_starts_at, $this->attributes['class']->value_length ); } if ( false === $existing_class ) { $existing_class = ''; } $class = ''; $at = 0; $modified = false; $seen = array(); $to_remove = array(); $is_quirks = self::QUIRKS_MODE === $this->compat_mode; if ( $is_quirks ) { foreach ( $this->classname_updates as $updated_name => $action ) { if ( self::REMOVE_CLASS === $action ) { $to_remove[] = strtolower( $updated_name ); } } } else { foreach ( $this->classname_updates as $updated_name => $action ) { if ( self::REMOVE_CLASS === $action ) { $to_remove[] = $updated_name; } } } // Remove unwanted classes by only copying the new ones. $existing_class_length = strlen( $existing_class ); while ( $at < $existing_class_length ) { // Skip to the first non-whitespace character. $ws_at = $at; $ws_length = strspn( $existing_class, " \t\f\r\n", $ws_at ); $at += $ws_length; // Capture the class name – it's everything until the next whitespace. $name_length = strcspn( $existing_class, " \t\f\r\n", $at ); if ( 0 === $name_length ) { // If no more class names are found then that's the end. break; } $name = substr( $existing_class, $at, $name_length ); $comparable_class_name = $is_quirks ? strtolower( $name ) : $name; $at += $name_length; // If this class is marked for removal, remove it and move on to the next one. if ( in_array( $comparable_class_name, $to_remove, true ) ) { $modified = true; continue; } // If a class has already been seen then skip it; it should not be added twice. if ( in_array( $comparable_class_name, $seen, true ) ) { continue; } $seen[] = $comparable_class_name; /* * Otherwise, append it to the new "class" attribute value. * * There are options for handling whitespace between tags. * Preserving the existing whitespace produces fewer changes * to the HTML content and should clarify the before/after * content when debugging the modified output. * * This approach contrasts normalizing the inter-class * whitespace to a single space, which might appear cleaner * in the output HTML but produce a noisier change. */ if ( '' !== $class ) { $class .= substr( $existing_class, $ws_at, $ws_length ); } $class .= $name; } // Add new classes by appending those which haven't already been seen. foreach ( $this->classname_updates as $name => $operation ) { $comparable_name = $is_quirks ? strtolower( $name ) : $name; if ( self::ADD_CLASS === $operation && ! in_array( $comparable_name, $seen, true ) ) { $modified = true; $class .= strlen( $class ) > 0 ? ' ' : ''; $class .= $name; } } $this->classname_updates = array(); if ( ! $modified ) { return; } if ( strlen( $class ) > 0 ) { $this->set_attribute( 'class', $class ); } else { $this->remove_attribute( 'class' ); } } private function apply_attributes_updates( int $shift_this_point ): int { if ( ! count( $this->lexical_updates ) ) { return 0; } $accumulated_shift_for_given_point = 0; /* * Attribute updates can be enqueued in any order but updates * to the document must occur in lexical order; that is, each * replacement must be made before all others which follow it * at later string indices in the input document. * * Sorting avoid making out-of-order replacements which * can lead to mangled output, partially-duplicated * attributes, and overwritten attributes. */ usort( $this->lexical_updates, array( self::class, 'sort_start_ascending' ) ); $bytes_already_copied = 0; $output_buffer = ''; foreach ( $this->lexical_updates as $diff ) { $shift = strlen( $diff->text ) - $diff->length; // Adjust the cursor position by however much an update affects it. if ( $diff->start < $this->bytes_already_parsed ) { $this->bytes_already_parsed += $shift; } // Accumulate shift of the given pointer within this function call. if ( $diff->start < $shift_this_point ) { $accumulated_shift_for_given_point += $shift; } $output_buffer .= substr( $this->html, $bytes_already_copied, $diff->start - $bytes_already_copied ); $output_buffer .= $diff->text; $bytes_already_copied = $diff->start + $diff->length; } $this->html = $output_buffer . substr( $this->html, $bytes_already_copied ); /* * Adjust bookmark locations to account for how the text * replacements adjust offsets in the input document. */ foreach ( $this->bookmarks as $bookmark_name => $bookmark ) { $bookmark_end = $bookmark->start + $bookmark->length; /* * Each lexical update which appears before the bookmark's endpoints * might shift the offsets for those endpoints. Loop through each change * and accumulate the total shift for each bookmark, then apply that * shift after tallying the full delta. */ $head_delta = 0; $tail_delta = 0; foreach ( $this->lexical_updates as $diff ) { $diff_end = $diff->start + $diff->length; if ( $bookmark->start < $diff->start && $bookmark_end < $diff->start ) { break; } if ( $bookmark->start >= $diff->start && $bookmark_end < $diff_end ) { $this->release_bookmark( $bookmark_name ); continue 2; } $delta = strlen( $diff->text ) - $diff->length; if ( $bookmark->start >= $diff->start ) { $head_delta += $delta; } if ( $bookmark_end >= $diff_end ) { $tail_delta += $delta; } } $bookmark->start += $head_delta; $bookmark->length += $tail_delta - $head_delta; } $this->lexical_updates = array(); return $accumulated_shift_for_given_point; } public function has_bookmark( $bookmark_name ): bool { return array_key_exists( $bookmark_name, $this->bookmarks ); } public function seek( $bookmark_name ): bool { if ( ! array_key_exists( $bookmark_name, $this->bookmarks ) ) { _doing_it_wrong( __METHOD__, __( 'Unknown bookmark name.' ), '6.2.0' ); return false; } $existing_bookmark = $this->bookmarks[ $bookmark_name ]; if ( $this->token_starts_at === $existing_bookmark->start && $this->token_length === $existing_bookmark->length ) { return true; } if ( ++$this->seek_count > static::MAX_SEEK_OPS ) { _doing_it_wrong( __METHOD__, __( 'Too many calls to seek() - this can lead to performance issues.' ), '6.2.0' ); return false; } // Flush out any pending updates to the document. $this->get_updated_html(); // Point this tag processor before the sought tag opener and consume it. $this->bytes_already_parsed = $this->bookmarks[ $bookmark_name ]->start; $this->parser_state = self::STATE_READY; return $this->next_token(); } private static function sort_start_ascending( WP_HTML_Text_Replacement $a, WP_HTML_Text_Replacement $b ): int { $by_start = $a->start - $b->start; if ( 0 !== $by_start ) { return $by_start; } $by_text = isset( $a->text, $b->text ) ? strcmp( $a->text, $b->text ) : 0; if ( 0 !== $by_text ) { return $by_text; } /* * This code should be unreachable, because it implies the two replacements * start at the same location and contain the same text. */ return $a->length - $b->length; } private function get_enqueued_attribute_value( string $comparable_name ) { if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { return false; } if ( ! isset( $this->lexical_updates[ $comparable_name ] ) ) { return false; } $enqueued_text = $this->lexical_updates[ $comparable_name ]->text; // Removed attributes erase the entire span. if ( '' === $enqueued_text ) { return null; } /* * Boolean attribute updates are just the attribute name without a corresponding value. * * This value might differ from the given comparable name in that there could be leading * or trailing whitespace, and that the casing follows the name given in `set_attribute`. * * Example: * * $p->set_attribute( 'data-TEST-id', 'update' ); * 'update' === $p->get_enqueued_attribute_value( 'data-test-id' ); * * Detect this difference based on the absence of the `=`, which _must_ exist in any * attribute containing a value, e.g. ``. * ¹ ² * 1. Attribute with a string value. * 2. Boolean attribute whose value is `true`. */ $equals_at = strpos( $enqueued_text, '=' ); if ( false === $equals_at ) { return true; } /* * Finally, a normal update's value will appear after the `=` and * be double-quoted, as performed incidentally by `set_attribute`. * * e.g. `type="text"` * ¹² ³ * 1. Equals is here. * 2. Double-quoting starts one after the equals sign. * 3. Double-quoting ends at the last character in the update. */ $enqueued_value = substr( $enqueued_text, $equals_at + 2, -1 ); return WP_HTML_Decoder::decode_attribute( $enqueued_value ); } public function get_attribute( $name ) { if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { return null; } $comparable = strtolower( $name ); /* * For every attribute other than `class` it's possible to perform a quick check if * there's an enqueued lexical update whose value takes priority over what's found in * the input document. * * The `class` attribute is special though because of the exposed helpers `add_class` * and `remove_class`. These form a builder for the `class` attribute, so an additional * check for enqueued class changes is required in addition to the check for any enqueued * attribute values. If any exist, those enqueued class changes must first be flushed out * into an attribute value update. */ if ( 'class' === $name ) { $this->class_name_updates_to_attributes_updates(); } // Return any enqueued attribute value updates if they exist. $enqueued_value = $this->get_enqueued_attribute_value( $comparable ); if ( false !== $enqueued_value ) { return $enqueued_value; } if ( ! isset( $this->attributes[ $comparable ] ) ) { return null; } $attribute = $this->attributes[ $comparable ]; /* * This flag distinguishes an attribute with no value * from an attribute with an empty string value. For * unquoted attributes this could look very similar. * It refers to whether an `=` follows the name. * * e.g.
* ¹ ² * 1. Attribute `boolean-attribute` is `true`. * 2. Attribute `empty-attribute` is `""`. */ if ( true === $attribute->is_true ) { return true; } $raw_value = substr( $this->html, $attribute->value_starts_at, $attribute->value_length ); return WP_HTML_Decoder::decode_attribute( $raw_value ); } public function get_attribute_names_with_prefix( $prefix ): ?array { if ( self::STATE_MATCHED_TAG !== $this->parser_state || $this->is_closing_tag ) { return null; } $comparable = strtolower( $prefix ); $matches = array(); foreach ( array_keys( $this->attributes ) as $attr_name ) { if ( str_starts_with( $attr_name, $comparable ) ) { $matches[] = $attr_name; } } return $matches; } public function get_namespace(): string { return $this->parsing_namespace; } public function get_tag(): ?string { if ( null === $this->tag_name_starts_at ) { return null; } $tag_name = substr( $this->html, $this->tag_name_starts_at, $this->tag_name_length ); if ( self::STATE_MATCHED_TAG === $this->parser_state ) { return strtoupper( $tag_name ); } if ( self::STATE_COMMENT === $this->parser_state && self::COMMENT_AS_PI_NODE_LOOKALIKE === $this->get_comment_type() ) { return $tag_name; } return null; } public function get_qualified_tag_name(): ?string { $tag_name = $this->get_tag(); if ( null === $tag_name ) { return null; } if ( 'html' === $this->get_namespace() ) { return $tag_name; } $lower_tag_name = strtolower( $tag_name ); if ( 'math' === $this->get_namespace() ) { return $lower_tag_name; } if ( 'svg' === $this->get_namespace() ) { switch ( $lower_tag_name ) { case 'altglyph': return 'altGlyph'; case 'altglyphdef': return 'altGlyphDef'; case 'altglyphitem': return 'altGlyphItem'; case 'animatecolor': return 'animateColor'; case 'animatemotion': return 'animateMotion'; case 'animatetransform': return 'animateTransform'; case 'clippath': return 'clipPath'; case 'feblend': return 'feBlend'; case 'fecolormatrix': return 'feColorMatrix'; case 'fecomponenttransfer': return 'feComponentTransfer'; case 'fecomposite': return 'feComposite'; case 'feconvolvematrix': return 'feConvolveMatrix'; case 'fediffuselighting': return 'feDiffuseLighting'; case 'fedisplacementmap': return 'feDisplacementMap'; case 'fedistantlight': return 'feDistantLight'; case 'fedropshadow': return 'feDropShadow'; case 'feflood': return 'feFlood'; case 'fefunca': return 'feFuncA'; case 'fefuncb': return 'feFuncB'; case 'fefuncg': return 'feFuncG'; case 'fefuncr': return 'feFuncR'; case 'fegaussianblur': return 'feGaussianBlur'; case 'feimage': return 'feImage'; case 'femerge': return 'feMerge'; case 'femergenode': return 'feMergeNode'; case 'femorphology': return 'feMorphology'; case 'feoffset': return 'feOffset'; case 'fepointlight': return 'fePointLight'; case 'fespecularlighting': return 'feSpecularLighting'; case 'fespotlight': return 'feSpotLight'; case 'fetile': return 'feTile'; case 'feturbulence': return 'feTurbulence'; case 'foreignobject': return 'foreignObject'; case 'glyphref': return 'glyphRef'; case 'lineargradient': return 'linearGradient'; case 'radialgradient': return 'radialGradient'; case 'textpath': return 'textPath'; default: return $lower_tag_name; } } // This unnecessary return prevents tools from inaccurately reporting type errors. return $tag_name; } public function get_qualified_attribute_name( $attribute_name ): ?string { if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { return null; } $namespace = $this->get_namespace(); $lower_name = strtolower( $attribute_name ); if ( 'math' === $namespace && 'definitionurl' === $lower_name ) { return 'definitionURL'; } if ( 'svg' === $this->get_namespace() ) { switch ( $lower_name ) { case 'attributename': return 'attributeName'; case 'attributetype': return 'attributeType'; case 'basefrequency': return 'baseFrequency'; case 'baseprofile': return 'baseProfile'; case 'calcmode': return 'calcMode'; case 'clippathunits': return 'clipPathUnits'; case 'diffuseconstant': return 'diffuseConstant'; case 'edgemode': return 'edgeMode'; case 'filterunits': return 'filterUnits'; case 'glyphref': return 'glyphRef'; case 'gradienttransform': return 'gradientTransform'; case 'gradientunits': return 'gradientUnits'; case 'kernelmatrix': return 'kernelMatrix'; case 'kernelunitlength': return 'kernelUnitLength'; case 'keypoints': return 'keyPoints'; case 'keysplines': return 'keySplines'; case 'keytimes': return 'keyTimes'; case 'lengthadjust': return 'lengthAdjust'; case 'limitingconeangle': return 'limitingConeAngle'; case 'markerheight': return 'markerHeight'; case 'markerunits': return 'markerUnits'; case 'markerwidth': return 'markerWidth'; case 'maskcontentunits': return 'maskContentUnits'; case 'maskunits': return 'maskUnits'; case 'numoctaves': return 'numOctaves'; case 'pathlength': return 'pathLength'; case 'patterncontentunits': return 'patternContentUnits'; case 'patterntransform': return 'patternTransform'; case 'patternunits': return 'patternUnits'; case 'pointsatx': return 'pointsAtX'; case 'pointsaty': return 'pointsAtY'; case 'pointsatz': return 'pointsAtZ'; case 'preservealpha': return 'preserveAlpha'; case 'preserveaspectratio': return 'preserveAspectRatio'; case 'primitiveunits': return 'primitiveUnits'; case 'refx': return 'refX'; case 'refy': return 'refY'; case 'repeatcount': return 'repeatCount'; case 'repeatdur': return 'repeatDur'; case 'requiredextensions': return 'requiredExtensions'; case 'requiredfeatures': return 'requiredFeatures'; case 'specularconstant': return 'specularConstant'; case 'specularexponent': return 'specularExponent'; case 'spreadmethod': return 'spreadMethod'; case 'startoffset': return 'startOffset'; case 'stddeviation': return 'stdDeviation'; case 'stitchtiles': return 'stitchTiles'; case 'surfacescale': return 'surfaceScale'; case 'systemlanguage': return 'systemLanguage'; case 'tablevalues': return 'tableValues'; case 'targetx': return 'targetX'; case 'targety': return 'targetY'; case 'textlength': return 'textLength'; case 'viewbox': return 'viewBox'; case 'viewtarget': return 'viewTarget'; case 'xchannelselector': return 'xChannelSelector'; case 'ychannelselector': return 'yChannelSelector'; case 'zoomandpan': return 'zoomAndPan'; } } if ( 'html' !== $namespace ) { switch ( $lower_name ) { case 'xlink:actuate': return 'xlink actuate'; case 'xlink:arcrole': return 'xlink arcrole'; case 'xlink:href': return 'xlink href'; case 'xlink:role': return 'xlink role'; case 'xlink:show': return 'xlink show'; case 'xlink:title': return 'xlink title'; case 'xlink:type': return 'xlink type'; case 'xml:lang': return 'xml lang'; case 'xml:space': return 'xml space'; case 'xmlns': return 'xmlns'; case 'xmlns:xlink': return 'xmlns xlink'; } } return $attribute_name; } public function has_self_closing_flag(): bool { if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { return false; } /* * The self-closing flag is the solidus at the _end_ of the tag, not the beginning. * * Example: * *
* ^ this appears one character before the end of the closing ">". */ return '/' === $this->html[ $this->token_starts_at + $this->token_length - 2 ]; } public function is_tag_closer(): bool { return ( self::STATE_MATCHED_TAG === $this->parser_state && $this->is_closing_tag && /* * The BR tag can only exist as an opening tag. If something like `
` * appears then the HTML parser will treat it as an opening tag with no * attributes. The BR tag is unique in this way. * * @see https://html.spec.whatwg.org/#parsing-main-inbody */ 'BR' !== $this->get_tag() ); } public function get_token_type(): ?string { switch ( $this->parser_state ) { case self::STATE_MATCHED_TAG: return '#tag'; case self::STATE_DOCTYPE: return '#doctype'; default: return $this->get_token_name(); } } public function get_token_name(): ?string { switch ( $this->parser_state ) { case self::STATE_MATCHED_TAG: return $this->get_tag(); case self::STATE_TEXT_NODE: return '#text'; case self::STATE_CDATA_NODE: return '#cdata-section'; case self::STATE_COMMENT: return '#comment'; case self::STATE_DOCTYPE: return 'html'; case self::STATE_PRESUMPTUOUS_TAG: return '#presumptuous-tag'; case self::STATE_FUNKY_COMMENT: return '#funky-comment'; } return null; } public function get_comment_type(): ?string { if ( self::STATE_COMMENT !== $this->parser_state ) { return null; } return $this->comment_type; } public function get_full_comment_text(): ?string { if ( self::STATE_FUNKY_COMMENT === $this->parser_state ) { return $this->get_modifiable_text(); } if ( self::STATE_COMMENT !== $this->parser_state ) { return null; } switch ( $this->get_comment_type() ) { case self::COMMENT_AS_HTML_COMMENT: case self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT: return $this->get_modifiable_text(); case self::COMMENT_AS_CDATA_LOOKALIKE: return "[CDATA[{$this->get_modifiable_text()}]]"; case self::COMMENT_AS_PI_NODE_LOOKALIKE: return "?{$this->get_tag()}{$this->get_modifiable_text()}?"; /* * This represents "bogus comments state" from HTML tokenization. * This can be entered by `html[ $this->text_starts_at - 1 ]; $comment_start = '?' === $preceding_character ? '?' : ''; return "{$comment_start}{$this->get_modifiable_text()}"; } return null; } public function subdivide_text_appropriately(): bool { if ( self::STATE_TEXT_NODE !== $this->parser_state ) { return false; } $this->text_node_classification = self::TEXT_IS_GENERIC; /* * NULL bytes are treated categorically different than numeric character * references whose number is zero. `�` is not the same as `"\x00"`. */ $leading_nulls = strspn( $this->html, "\x00", $this->text_starts_at, $this->text_length ); if ( $leading_nulls > 0 ) { $this->token_length = $leading_nulls; $this->text_length = $leading_nulls; $this->bytes_already_parsed = $this->token_starts_at + $leading_nulls; $this->text_node_classification = self::TEXT_IS_NULL_SEQUENCE; return true; } /* * Start a decoding loop to determine the point at which the * text subdivides. This entails raw whitespace bytes and any * character reference that decodes to the same. */ $at = $this->text_starts_at; $end = $this->text_starts_at + $this->text_length; while ( $at < $end ) { $skipped = strspn( $this->html, " \t\f\r\n", $at, $end - $at ); $at += $skipped; if ( $at < $end && '&' === $this->html[ $at ] ) { $matched_byte_length = null; $replacement = WP_HTML_Decoder::read_character_reference( 'data', $this->html, $at, $matched_byte_length ); if ( isset( $replacement ) && 1 === strspn( $replacement, " \t\f\r\n" ) ) { $at += $matched_byte_length; continue; } } break; } if ( $at > $this->text_starts_at ) { $new_length = $at - $this->text_starts_at; $this->text_length = $new_length; $this->token_length = $new_length; $this->bytes_already_parsed = $at; $this->text_node_classification = self::TEXT_IS_WHITESPACE; return true; } return false; } public function get_modifiable_text(): string { $has_enqueued_update = isset( $this->lexical_updates['modifiable text'] ); if ( ! $has_enqueued_update && ( null === $this->text_starts_at || 0 === $this->text_length ) ) { return ''; } $text = $has_enqueued_update ? $this->lexical_updates['modifiable text']->text : substr( $this->html, $this->text_starts_at, $this->text_length ); /* * Pre-processing the input stream would normally happen before * any parsing is done, but deferring it means it's possible to * skip in most cases. When getting the modifiable text, however * it's important to apply the pre-processing steps, which is * normalizing newlines. * * @see https://html.spec.whatwg.org/#preprocessing-the-input-stream * @see https://infra.spec.whatwg.org/#normalize-newlines */ $text = str_replace( "\r\n", "\n", $text ); $text = str_replace( "\r", "\n", $text ); // Comment data is not decoded. if ( self::STATE_CDATA_NODE === $this->parser_state || self::STATE_COMMENT === $this->parser_state || self::STATE_DOCTYPE === $this->parser_state || self::STATE_FUNKY_COMMENT === $this->parser_state ) { return str_replace( "\x00", "\u{FFFD}", $text ); } $tag_name = $this->get_token_name(); if ( // Script data is not decoded. 'SCRIPT' === $tag_name || // RAWTEXT data is not decoded. 'IFRAME' === $tag_name || 'NOEMBED' === $tag_name || 'NOFRAMES' === $tag_name || 'STYLE' === $tag_name || 'XMP' === $tag_name ) { return str_replace( "\x00", "\u{FFFD}", $text ); } $decoded = WP_HTML_Decoder::decode_text_node( $text ); /* * Skip the first line feed after LISTING, PRE, and TEXTAREA opening tags. * * Note that this first newline may come in the form of a character * reference, such as ` `, and so it's important to perform * this transformation only after decoding the raw text content. */ if ( ( "\n" === ( $decoded[0] ?? '' ) ) && ( ( $this->skip_newline_at === $this->token_starts_at && '#text' === $tag_name ) || 'TEXTAREA' === $tag_name ) ) { $decoded = substr( $decoded, 1 ); } /* * Only in normative text nodes does the NULL byte (U+0000) get removed. * In all other contexts it's replaced by the replacement character (U+FFFD) * for security reasons (to avoid joining together strings that were safe * when separated, but not when joined). * * @todo Inside HTML integration points and MathML integration points, the * text is processed according to the insertion mode, not according * to the foreign content rules. This should strip the NULL bytes. */ return ( '#text' === $tag_name && 'html' === $this->get_namespace() ) ? str_replace( "\x00", '', $decoded ) : str_replace( "\x00", "\u{FFFD}", $decoded ); } public function set_modifiable_text( string $plaintext_content ): bool { if ( self::STATE_TEXT_NODE === $this->parser_state ) { $this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement( $this->text_starts_at, $this->text_length, htmlspecialchars( $plaintext_content, ENT_QUOTES | ENT_HTML5 ) ); return true; } // Comment data is not encoded. if ( self::STATE_COMMENT === $this->parser_state && self::COMMENT_AS_HTML_COMMENT === $this->comment_type ) { // Check if the text could close the comment. if ( 1 === preg_match( '/--!?>/', $plaintext_content ) ) { return false; } $this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement( $this->text_starts_at, $this->text_length, $plaintext_content ); return true; } if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { return false; } switch ( $this->get_tag() ) { case 'SCRIPT': /* * This is over-protective, but ensures the update doesn't break * out of the SCRIPT element. A more thorough check would need to * ensure that the script closing tag doesn't exist, and isn't * also "hidden" inside the script double-escaped state. * * It may seem like replacing `lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement( $this->text_starts_at, $this->text_length, $plaintext_content ); return true; case 'STYLE': $plaintext_content = preg_replace_callback( '~style)~i', static function ( $tag_match ) { return "\\3c\\2f{$tag_match['TAG_NAME']}"; }, $plaintext_content ); $this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement( $this->text_starts_at, $this->text_length, $plaintext_content ); return true; case 'TEXTAREA': case 'TITLE': $plaintext_content = preg_replace_callback( "~{$this->get_tag()})~i", static function ( $tag_match ) { return "</{$tag_match['TAG_NAME']}"; }, $plaintext_content ); /* * These don't _need_ to be escaped, but since they are decoded it's * safe to leave them escaped and this can prevent other code from * naively detecting tags within the contents. * * @todo It would be useful to prefix a multiline replacement text * with a newline, but not necessary. This is for aesthetics. */ $this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement( $this->text_starts_at, $this->text_length, $plaintext_content ); return true; } return false; } public function set_attribute( $name, $value ): bool { if ( self::STATE_MATCHED_TAG !== $this->parser_state || $this->is_closing_tag ) { return false; } /* * WordPress rejects more characters than are strictly forbidden * in HTML5. This is to prevent additional security risks deeper * in the WordPress and plugin stack. Specifically the * less-than (<) greater-than (>) and ampersand (&) aren't allowed. * * The use of a PCRE match enables looking for specific Unicode * code points without writing a UTF-8 decoder. Whereas scanning * for one-byte characters is trivial (with `strcspn`), scanning * for the longer byte sequences would be more complicated. Given * that this shouldn't be in the hot path for execution, it's a * reasonable compromise in efficiency without introducing a * noticeable impact on the overall system. * * @see https://html.spec.whatwg.org/#attributes-2 * * @todo As the only regex pattern maybe we should take it out? * Are Unicode patterns available broadly in Core? */ if ( preg_match( '~[' . // Syntax-like characters. '"\'>& The values "true" and "false" are not allowed on boolean attributes. * > To represent a false value, the attribute has to be omitted altogether. * - HTML5 spec, https://html.spec.whatwg.org/#boolean-attributes */ if ( false === $value ) { return $this->remove_attribute( $name ); } if ( true === $value ) { $updated_attribute = $name; } else { $comparable_name = strtolower( $name ); /* * Escape URL attributes. * * @see https://html.spec.whatwg.org/#attributes-3 */ $escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) ? esc_url( $value ) : esc_attr( $value ); // If the escaping functions wiped out the update, reject it and indicate it was rejected. if ( '' === $escaped_new_value && '' !== $value ) { return false; } $updated_attribute = "{$name}=\"{$escaped_new_value}\""; } /* * > There must never be two or more attributes on * > the same start tag whose names are an ASCII * > case-insensitive match for each other. * - HTML 5 spec * * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive */ $comparable_name = strtolower( $name ); if ( isset( $this->attributes[ $comparable_name ] ) ) { /* * Update an existing attribute. * * Example – set attribute id to "new" in
: * *
* ^-------------^ * start end * replacement: `id="new"` * * Result:
*/ $existing_attribute = $this->attributes[ $comparable_name ]; $this->lexical_updates[ $comparable_name ] = new WP_HTML_Text_Replacement( $existing_attribute->start, $existing_attribute->length, $updated_attribute ); } else { /* * Create a new attribute at the tag's name end. * * Example – add attribute id="new" to
: * *
* ^ * start and end * replacement: ` id="new"` * * Result:
*/ $this->lexical_updates[ $comparable_name ] = new WP_HTML_Text_Replacement( $this->tag_name_starts_at + $this->tag_name_length, 0, ' ' . $updated_attribute ); } /* * Any calls to update the `class` attribute directly should wipe out any * enqueued class changes from `add_class` and `remove_class`. */ if ( 'class' === $comparable_name && ! empty( $this->classname_updates ) ) { $this->classname_updates = array(); } return true; } public function remove_attribute( $name ): bool { if ( self::STATE_MATCHED_TAG !== $this->parser_state || $this->is_closing_tag ) { return false; } /* * > There must never be two or more attributes on * > the same start tag whose names are an ASCII * > case-insensitive match for each other. * - HTML 5 spec * * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive */ $name = strtolower( $name ); /* * Any calls to update the `class` attribute directly should wipe out any * enqueued class changes from `add_class` and `remove_class`. */ if ( 'class' === $name && count( $this->classname_updates ) !== 0 ) { $this->classname_updates = array(); } /* * If updating an attribute that didn't exist in the input * document, then remove the enqueued update and move on. * * For example, this might occur when calling `remove_attribute()` * after calling `set_attribute()` for the same attribute * and when that attribute wasn't originally present. */ if ( ! isset( $this->attributes[ $name ] ) ) { if ( isset( $this->lexical_updates[ $name ] ) ) { unset( $this->lexical_updates[ $name ] ); } return false; } /* * Removes an existing tag attribute. * * Example – remove the attribute id from
: *
* ^-------------^ * start end * replacement: `` * * Result:
*/ $this->lexical_updates[ $name ] = new WP_HTML_Text_Replacement( $this->attributes[ $name ]->start, $this->attributes[ $name ]->length, '' ); // Removes any duplicated attributes if they were also present. foreach ( $this->duplicate_attributes[ $name ] ?? array() as $attribute_token ) { $this->lexical_updates[] = new WP_HTML_Text_Replacement( $attribute_token->start, $attribute_token->length, '' ); } return true; } public function add_class( $class_name ): bool { if ( self::STATE_MATCHED_TAG !== $this->parser_state || $this->is_closing_tag ) { return false; } if ( self::QUIRKS_MODE !== $this->compat_mode ) { $this->classname_updates[ $class_name ] = self::ADD_CLASS; return true; } /* * Because class names are matched ASCII-case-insensitively in quirks mode, * this needs to see if a case variant of the given class name is already * enqueued and update that existing entry, if so. This picks the casing of * the first-provided class name for all lexical variations. */ $class_name_length = strlen( $class_name ); foreach ( $this->classname_updates as $updated_name => $action ) { if ( strlen( $updated_name ) === $class_name_length && 0 === substr_compare( $updated_name, $class_name, 0, $class_name_length, true ) ) { $this->classname_updates[ $updated_name ] = self::ADD_CLASS; return true; } } $this->classname_updates[ $class_name ] = self::ADD_CLASS; return true; } public function remove_class( $class_name ): bool { if ( self::STATE_MATCHED_TAG !== $this->parser_state || $this->is_closing_tag ) { return false; } if ( self::QUIRKS_MODE !== $this->compat_mode ) { $this->classname_updates[ $class_name ] = self::REMOVE_CLASS; return true; } /* * Because class names are matched ASCII-case-insensitively in quirks mode, * this needs to see if a case variant of the given class name is already * enqueued and update that existing entry, if so. This picks the casing of * the first-provided class name for all lexical variations. */ $class_name_length = strlen( $class_name ); foreach ( $this->classname_updates as $updated_name => $action ) { if ( strlen( $updated_name ) === $class_name_length && 0 === substr_compare( $updated_name, $class_name, 0, $class_name_length, true ) ) { $this->classname_updates[ $updated_name ] = self::REMOVE_CLASS; return true; } } $this->classname_updates[ $class_name ] = self::REMOVE_CLASS; return true; } public function __toString(): string { return $this->get_updated_html(); } public function get_updated_html(): string { $requires_no_updating = 0 === count( $this->classname_updates ) && 0 === count( $this->lexical_updates ); /* * When there is nothing more to update and nothing has already been * updated, return the original document and avoid a string copy. */ if ( $requires_no_updating ) { return $this->html; } /* * Keep track of the position right before the current tag. This will * be necessary for reparsing the current tag after updating the HTML. */ $before_current_tag = $this->token_starts_at ?? 0; /* * 1. Apply the enqueued edits and update all the pointers to reflect those changes. */ $this->class_name_updates_to_attributes_updates(); $before_current_tag += $this->apply_attributes_updates( $before_current_tag ); /* * 2. Rewind to before the current tag and reparse to get updated attributes. * * At this point the internal cursor points to the end of the tag name. * Rewind before the tag name starts so that it's as if the cursor didn't * move; a call to `next_tag()` will reparse the recently-updated attributes * and additional calls to modify the attributes will apply at this same * location, but in order to avoid issues with subclasses that might add * behaviors to `next_tag()`, the internal methods should be called here * instead. * * It's important to note that in this specific place there will be no change * because the processor was already at a tag when this was called and it's * rewinding only to the beginning of this very tag before reprocessing it * and its attributes. * *

Previous HTMLMore HTML

* ↑ │ back up by the length of the tag name plus the opening < * └←─┘ back up by strlen("em") + 1 ==> 3 */ $this->bytes_already_parsed = $before_current_tag; $this->base_class_next_token(); return $this->html; } private function parse_query( $query ) { if ( null !== $query && $query === $this->last_query ) { return; } $this->last_query = $query; $this->sought_tag_name = null; $this->sought_class_name = null; $this->sought_match_offset = 1; $this->stop_on_tag_closers = false; // A single string value means "find the tag of this name". if ( is_string( $query ) ) { $this->sought_tag_name = $query; return; } // An empty query parameter applies no restrictions on the search. if ( null === $query ) { return; } // If not using the string interface, an associative array is required. if ( ! is_array( $query ) ) { _doing_it_wrong( __METHOD__, __( 'The query argument must be an array or a tag name.' ), '6.2.0' ); return; } if ( isset( $query['tag_name'] ) && is_string( $query['tag_name'] ) ) { $this->sought_tag_name = $query['tag_name']; } if ( isset( $query['class_name'] ) && is_string( $query['class_name'] ) ) { $this->sought_class_name = $query['class_name']; } if ( isset( $query['match_offset'] ) && is_int( $query['match_offset'] ) && 0 < $query['match_offset'] ) { $this->sought_match_offset = $query['match_offset']; } if ( isset( $query['tag_closers'] ) ) { $this->stop_on_tag_closers = 'visit' === $query['tag_closers']; } } private function matches(): bool { if ( $this->is_closing_tag && ! $this->stop_on_tag_closers ) { return false; } // Does the tag name match the requested tag name in a case-insensitive manner? if ( isset( $this->sought_tag_name ) && ( strlen( $this->sought_tag_name ) !== $this->tag_name_length || 0 !== substr_compare( $this->html, $this->sought_tag_name, $this->tag_name_starts_at, $this->tag_name_length, true ) ) ) { return false; } if ( null !== $this->sought_class_name && ! $this->has_class( $this->sought_class_name ) ) { return false; } return true; } public function get_doctype_info(): ?WP_HTML_Doctype_Info { if ( self::STATE_DOCTYPE !== $this->parser_state ) { return null; } return WP_HTML_Doctype_Info::from_doctype_token( substr( $this->html, $this->token_starts_at, $this->token_length ) ); } const STATE_READY = 'STATE_READY'; const STATE_COMPLETE = 'STATE_COMPLETE'; const STATE_INCOMPLETE_INPUT = 'STATE_INCOMPLETE_INPUT'; const STATE_MATCHED_TAG = 'STATE_MATCHED_TAG'; const STATE_TEXT_NODE = 'STATE_TEXT_NODE'; const STATE_CDATA_NODE = 'STATE_CDATA_NODE'; const STATE_COMMENT = 'STATE_COMMENT'; const STATE_DOCTYPE = 'STATE_DOCTYPE'; const STATE_PRESUMPTUOUS_TAG = 'STATE_PRESUMPTUOUS_TAG'; const STATE_FUNKY_COMMENT = 'STATE_WP_FUNKY'; const COMMENT_AS_ABRUPTLY_CLOSED_COMMENT = 'COMMENT_AS_ABRUPTLY_CLOSED_COMMENT'; const COMMENT_AS_CDATA_LOOKALIKE = 'COMMENT_AS_CDATA_LOOKALIKE'; const COMMENT_AS_HTML_COMMENT = 'COMMENT_AS_HTML_COMMENT'; const COMMENT_AS_PI_NODE_LOOKALIKE = 'COMMENT_AS_PI_NODE_LOOKALIKE'; const COMMENT_AS_INVALID_HTML = 'COMMENT_AS_INVALID_HTML'; const NO_QUIRKS_MODE = 'no-quirks-mode'; const QUIRKS_MODE = 'quirks-mode'; const TEXT_IS_GENERIC = 'TEXT_IS_GENERIC'; const TEXT_IS_NULL_SEQUENCE = 'TEXT_IS_NULL_SEQUENCE'; const TEXT_IS_WHITESPACE = 'TEXT_IS_WHITESPACE'; }