XPath
XPath (XML Path Language) is a language for building expressions that traverse and process an XML document. The idea is similar to using regular expressions to select parts of plain text, except that XPath searches and selects with the hierarchical structure of the XML taken into account. XPath was created for use in the XSLT standard, where it selects and examines the structure of the transformation's input document. It was defined by the W3C.
Introduction
All processing of an XML file relies on being able to address each of the parts that compose it, so that each element can be treated individually.
Processing begins by locating the XML file among all the documents that exist. To do this unambiguously, URIs (Uniform Resource Identifiers) are used, of which URLs (Uniform Resource Locators) are the best known.
Once the XML document has been located, information within it is selected using XPath, which is short for XML Path Language. With XPath we can select and refer to text, elements, attributes and any other information contained in an XML file.
XPath is a sophisticated and complex language, but it differs from the procedural languages in common use (C, C++, Basic, Java...). Also, like almost everything in the XML world, it is still under development, so it is not easy to find tools that implement all of its features.
XPath is, in turn, the basis on which other tools for processing XML documents have been specified, such as XPointer, XLink and XQuery (a language that handles XML documents as if they were a database). Thus, XPath is used to say how a style sheet should process the content of an XML page, but also to create links to specific areas of an XML page, or to load only those areas in a browser instead of the entire page.
The XPath data model
An XML document is processed by a parser, which builds a tree of nodes. The tree starts at the root, branches out through the elements that hang from it, and ends in leaf nodes, which contain only text, comments or processing instructions, or which are empty elements that have only attributes.
XPath selects parts of an XML document on the basis of this tree representation. In fact, the "operators" that make up the language recall the terminology used for trees in computing: root, child, ancestor, descendant, etc.
Attribute nodes are a special case. An element can have any number of attributes, and an attribute node is created for each one. However, attribute nodes are NOT considered children of the element; they are treated as labels attached to the element node.
The following is an example of how an XML document is converted to a tree. This same example will be used throughout the tutorial. First, the XML document is shown, and then the tree it generates.
XML document:
<libro>
  <título>Two by three streets</título>
  <autor>Josefa Santos</autor>
  <capítulo num="1">
    The first street
    <párrafo>
      It was a grim night of August...
    </párrafo>
    <párrafo destacar="si">
      She, innocent as <enlace href="...">mariposa</enlace>
      that crosses the sky in search of libations...
    </párrafo>
  </capítulo>
  <capítulo num="2" public="si">
    The second street
    <párrafo>
      It was a dark night of September...
    </párrafo>
    <párrafo>
      She, innocent as <enlace href="...">abejilla</enlace>
      that crosses the wind in search of the nectar of the flowers...
    </párrafo>
  </capítulo>
  <apéndice num="a" public="si">
    Third street
    <párrafo>
      It was a dense December night...
    </párrafo>
    <párrafo>
      She, ... <enlace href="...">abejilla</enlace>
      that crosses space in search of bugs to eat...
    </párrafo>
  </apéndice>
</libro>
Generated tree:
/
+---libro
    +---título
    |   +---(text)Two by three streets
    +---autor
    |   +---(text)Josefa Santos
    +---capítulo [num=1]
    |   +---(text)The first street
    |   +---párrafo
    |   |   +---(text)It was a grim night...
    |   +---párrafo [destacar=si]
    |       +---(text)She, innocent as
    |       +---enlace [href=...]
    |       |   +---(text)mariposa
    |       +---(text)that crosses the sky in search of libations...
    +---capítulo [num=2, public=si]
        +---(text)The second street
        +---párrafo
        |   +---(text)It was a dark night...
        +---párrafo
            +---(text)She, innocent as abejilla...
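As a rough illustration of this tree model, the sketch below parses a shortened version of the example document with Python's standard xml.etree.ElementTree module and builds one line per element. It is only an approximation of the XPath data model: ElementTree has no separate root node and no text nodes (text is stored in the text and tail fields of each element). The shortened document and the helper name dump are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example document (an assumption made for brevity).
XML = """<libro>
  <título>Two by three streets</título>
  <autor>Josefa Santos</autor>
  <capítulo num="1">The first street
    <párrafo destacar="si">She, innocent as <enlace href="...">mariposa</enlace> that crosses the sky...</párrafo>
  </capítulo>
</libro>"""


def dump(elem, depth=0):
    """Return one indented line per element: name, attributes, leading text."""
    attrs = ", ".join(f"{k}={v}" for k, v in elem.attrib.items())
    label = elem.tag + (f" [{attrs}]" if attrs else "")
    text = (elem.text or "").strip()
    lines = ["    " * depth + "+---" + label + (f"  (text){text}" if text else "")]
    for child in elem:  # iterating an element yields child elements only, never attributes
        lines.extend(dump(child, depth + 1))
    return lines


root = ET.fromstring(XML)
tree_lines = dump(root)
print("/")
print("\n".join(tree_lines))
```

Attributes appear in the label of their element and never as separate child lines, which mirrors the rule that attribute nodes are not children of the element node.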
Types of Nodes
A tree built from an XML document contains different types of nodes, namely: root, element, attribute, text, comment and processing instruction.
Root Node
Identified by /. The root node should not be confused with the root element of the document. Thus, if the XML document in our example has libro as its root element, that element is the first node hanging from the root node of the tree, which is /.
To repeat: / refers to the root node of the tree, not to the root element of the XML document, even though an XML document can have only one root element. In fact, we can say that the root node of the tree contains the root element of the document.
Element Node
Every element in an XML document becomes an element node in the tree. Each element has a parent node. The parent of any element is itself an element, except for the root element, whose parent is the root node. Element nodes in turn have children, which may be element nodes, text nodes, comment nodes and processing instruction nodes. Element nodes also have properties such as their name, their attributes, and information about the namespaces that are active for them.
An interesting property of element nodes is that they can have unique identifiers (for this, the document must be accompanied by a DTD declaring that the attributes in question take unique values). This allows such elements to be referenced much more directly.
Text nodes
By text we mean all the characters of the document that are not part of any markup. A text node has no children; that is, the individual characters that make it up are not considered its children.
Attribute nodes
As already noted, attribute nodes are not so much children of the element node that contains them as labels attached to that element node. Each attribute node consists of a name, a value (which is always a string) and, possibly, a namespace.
Attributes that have a default value assigned in the DTD are treated as if that value had been written in the XML document. By contrast, no nodes are created for attributes that are not specified in the XML document and are declared #IMPLIED in the DTD. Nor are attribute nodes created for namespace declarations. All of this is to be expected, given that a DTD is not required in order to process an XML document.
Comment and processing instruction nodes
In addition to the nodes described above, a node is generated in the tree for each comment and each processing instruction. The content of these nodes can be accessed through the string-value property.
Syntax and Semantics (XPath 1.0)
The most important kind of expression in XPath is the location path. A location path consists of a sequence of location steps. Each location step has three components:
- An axis
- A node test
- Zero or more predicates
An XPath expression is evaluated with respect to a context node. An axis specifier such as child or descendant specifies the direction in which to navigate from the context node. The node test and the predicates then filter the nodes found along that axis. For example, the node test A requires that all the nodes navigated to have the name A. A predicate can be used to require that the selected nodes have certain properties, which are themselves specified by XPath expressions.
The XPath syntax comes in two forms. The abbreviated syntax is more compact and allows XPaths to be written and read easily, using intuitive and, in many cases, familiar characters and constructs. The full syntax is more verbose, but allows more options to be specified and is more descriptive if read carefully.
Shortcut Syntax
The compact notation allows many defaults and abbreviations for the most common cases. Given source XML containing at least:
<A>
  <B>
    <C/>
  </B>
</A>
A simple select with XPath shorthand syntax takes a form like this:
/A/B/C
This selects the C elements that are children of B elements that are children of the A element forming the outermost element of the XML document. The XPath syntax is designed to mimic URI (Uniform Resource Identifier) and Unix-style file path syntax.
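A minimal sketch of this selection using Python's standard xml.etree.ElementTree module, which implements only a small subset of XPath. ElementTree has no node above the document element and evaluates paths relative to an element, so the absolute path /A/B/C is emulated here by checking the tag of the document element and then evaluating the relative path B/C from it. The sample document is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<A><B><C/><C/></B><B><D/></B><C/></A>")

# /A/B/C: the document element must be A; then follow child::B/child::C.
selected = doc.findall("B/C") if doc.tag == "A" else []

print(len(selected))  # 2: only the C elements whose parent is a B child of A
```

The C element that is a direct child of A is not selected, because every step of the path must match in order.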
More complex expressions can be constructed by specifying an axis other than the default child axis, a node test other than a simple name, or predicates, which can be written in square brackets after any step. For example, the expression:
A//B/*[1]
selects the first child (*[1]), whatever its name, of every B element that is a child or a deeper descendant (//) of an A element, where that A element is a child of the current context node (the expression does not begin with a /). Note that the predicate [1] binds more tightly than the / operator. To select the first node selected by the expression A//B/*, write (A//B/*)[1]. Note also that index values in XPath predicates (technically, the 'proximity positions' of XPath node sets) start at 1, not 0 as is common in languages such as JavaScript, C and Java.
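The difference between A//B/*[1] and (A//B/*)[1] can be sketched in Python. The positional predicates are emulated here with ordinary list indexing rather than passed to xml.etree.ElementTree, whose position predicates are documented as working only after a tag name. The sample document and variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# The context node is <root>; the expression is relative (no leading "/").
context = ET.fromstring(
    "<root><A><B><x/><y/></B><D><B><z/></B></D></A></root>"
)

# A//B/*[1]: the predicate belongs to the last step, so each B
# contributes its own first child element.
first_of_each_b = [b[0].tag for b in context.findall("A//B") if len(b)]

# (A//B/*)[1]: the predicate applies to the whole result, in document order.
all_children = context.findall("A//B/*")
first_overall = all_children[0].tag  # Python index 0 is XPath position 1

print(first_of_each_b)  # ['x', 'z']
print(first_overall)    # x
```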
Expanded Syntax
We can write the two examples above in the expanded (unabbreviated) syntax as follows:
/child::A/child::B/child::C
child::A/descendant-or-self::node()/child::B/child::*[position()=1]
Here, at each step of the XPath, the axis (for example child or descendant-or-self) is specified explicitly, followed by :: and then the node test, such as A or node() in the examples above.
The same expression, but shorter:
A//B/*[position()=1]
Axis Specifier
The axis specifier indicates the direction of navigation within the representation tree of the XML document. The axes available are:
Full Syntax | Abbreviated Syntax | Notes |
---|---|---|
ancestor | | |
ancestor-or-self | | |
attribute | @ | @abc is the short form for attribute::abc |
child | | xyz is the short form for child::xyz |
descendant | | |
descendant-or-self | // | // is the short form for /descendant-or-self::node()/ |
following | | |
following-sibling | | |
namespace | | |
parent | .. | .. is the short form for parent::node() |
preceding | | |
preceding-sibling | | |
self | . | . is the short form for self::node() |
As an example of using the attribute axis in the abbreviated syntax, //a/@href selects the attribute named href of every a element anywhere in the document tree.
The expression . (short for self::node()) is most commonly used within a predicate to refer to the currently selected node. For example, h3[.='See also'] selects an element named h3 in the current context whose text content is See also.
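A small sketch of both ideas with Python's standard xml.etree.ElementTree module. ElementTree cannot return attribute nodes, so //a/@href is approximated by selecting the a elements that carry href and reading the attribute from each. The [.='text'] predicate is assumed to be available (Python 3.7 or later), and the sample document is made up for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<body><h3>See also</h3><h3>Notes</h3>"
    "<p><a href='help.php'>Help</a><a>plain</a></p></body>"
)

# //a/@href: select the a elements that have href, then read the attribute.
hrefs = [a.get("href") for a in doc.findall(".//a[@href]")]

# h3[.='See also']: compare the string value of the element itself.
see_also = doc.findall("h3[.='See also']")

print(hrefs)          # ['help.php']
print(len(see_also))  # 1
```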
Node Test
The node test can consist of a specific node name or a more general expression. In an XML document in which the namespace prefix gs has been defined, //gs:enquiry will find all the enquiry elements in that namespace, and //gs:* will find all elements in that namespace, regardless of their local name.
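The two namespace tests might be sketched as follows with Python's standard xml.etree.ElementTree module. The namespace URI is hypothetical, and the //gs:* test is emulated by filtering on ElementTree's {uri}localname tag convention instead of relying on wildcard support in its path language.

```python
import xml.etree.ElementTree as ET

GS = "http://example.com/gs"  # hypothetical namespace URI
doc = ET.fromstring(
    f"<root xmlns:gs='{GS}'><gs:enquiry/><gs:reply/><enquiry/></root>"
)

# //gs:enquiry: the prefix is resolved through an explicit mapping.
enquiries = doc.findall(".//gs:enquiry", namespaces={"gs": GS})

# //gs:*: every element in the namespace, whatever its local name.
in_namespace = [e for e in doc.iter() if e.tag.startswith("{" + GS + "}")]

print(len(enquiries))                                # 1
print([e.tag.split("}")[1] for e in in_namespace])   # ['enquiry', 'reply']
```

The unprefixed enquiry element is not matched by either test, because it belongs to no namespace.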
Other node test formats are:
- comment(): finds an XML comment node.
- text(): finds a node of type text, for example the hello world inside an element whose content is hello world.
- processing-instruction(): finds XML processing instructions. For an instruction whose target is php, processing-instruction('php') would match.
- node(): finds any node at all.
Predicates
Predicates, written as expressions in square brackets, can be used to filter a node set according to some condition. For example, a returns a node set (all the a elements that are children of the context node), and a[@href='help.php'] keeps only those elements that have an href attribute with the value help.php.
There is no limit to the number of predicates in a step, and they need not be confined to the last step of an XPath. They can also be nested to any depth. Paths specified in predicates begin at the context of the current step (that is, that of the immediately preceding node test) and do not alter that context. All predicates must be satisfied for a match to occur.
When the value of the predicate is numeric, it is syntactic sugar for a comparison against the position of the node in the node set (as given by the function position()). So p[1] is shorthand for p[position()=1] and selects the first p element child, while p[last()] is shorthand for p[position()=last()] and selects the last p child of the context node.
In other cases, the value of the predicate is automatically converted to a boolean. When the predicate evaluates to a node set, the result is true when the node set is non-empty. Thus p[@x] selects those p elements that have an attribute named x.
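These three predicate forms fall within the XPath subset supported by Python's standard xml.etree.ElementTree module, so they can be tried directly. The sample document below is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

context = ET.fromstring(
    "<div><p>one</p><p x='1'>two</p><p>three</p></div>"
)

first = context.findall("p[1]")       # p[position()=1]
last = context.findall("p[last()]")   # p[position()=last()]
with_x = context.findall("p[@x]")     # p elements that have an x attribute

print(first[0].text, last[0].text, with_x[0].text)  # one three two
```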
A more complex example: the expression a[/html/@lang='en'][@href='help.php'][1]/@target selects the value of the target attribute of the first a element among the children of the context node that has its href attribute set to help.php, provided the document's top-level html element also has a lang attribute set to en. The reference to an attribute of the top-level element in the first predicate affects neither the context of the other predicates nor that of the location step itself.
Predicate order is significant if predicates test the position of a node. Each predicate takes a node set and returns a (potentially) smaller node set. So a[1][@href='help.php'] will find a match only if the first a child of the context node satisfies the condition @href='help.php', while a[@href='help.php'][1] will find the first a child that satisfies this condition.
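The effect of predicate order can be modelled as successive filters over a list. The sketch below uses Python's standard xml.etree.ElementTree module only for the child::a step and applies the predicates in plain Python; the sample document and the helper has_help_href are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

context = ET.fromstring(
    "<div><a href='index.php'/><a href='help.php'/><a href='help.php'/></div>"
)
anchors = context.findall("a")  # the step: child::a


def has_help_href(node):
    return node.get("href") == "help.php"


# a[1][@href='help.php']: take position 1 first, then test the attribute.
position_then_href = [a for a in anchors[:1] if has_help_href(a)]

# a[@href='help.php'][1]: filter by attribute first, then take position 1.
href_then_position = [a for a in anchors if has_help_href(a)][:1]

print(len(position_then_href))               # 0: the first a points to index.php
print(href_then_position[0] is anchors[1])   # True: the second a is the first match
```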
Functions and operators
XPath 1.0 defines four data types: node sets (sets of nodes with no intrinsic order), strings, numbers and booleans.
The available operators are:
- The "/", "//" and "[...]" operators, used in path expressions, as described above.
- The union operator "|", which forms the union of two node sets.
- The boolean operators "and" and "or", together with the function "not()".
- The arithmetic operators "+", "-", "*", "div" (division) and "mod" (modulo).
- The comparison operators "=", "!=", "<", ">", "<=" and ">=".
The function library includes:
- Functions to manipulate strings: concat(), substring(), contains(), substring-before(), substring-after(), translate(), normalize-space(), string-length()
- Functions to manipulate numbers: sum(), round(), floor(), ceiling()
- Functions to obtain the properties of a node: name(), local-name(), namespace-uri()
- Functions for obtaining information on the processing context: position(), last()
- Conversion functions: string(), number(), boolean()
Some of the commonly used functions are described below.
Node set functions
- position(): returns a number representing the position of the context node in the sequence of nodes currently being processed (for example, the nodes selected by an xsl:for-each instruction in XSLT).
- count(node-set): returns the number of nodes in the node set supplied as its argument.
String Functions
- string(object?): converts any of the four XPath data types into a string according to built-in rules. If the value of the argument is a node set, the function returns the string value of the first node in document order, ignoring any further nodes.
- concat(string, string, string*): concatenates two or more strings.
- starts-with(s1, s2): returns true if s1 starts with s2.
- contains(s1, s2): returns true if s1 contains s2.
- substring(string, start, length?): for example, substring("ABCDEF",2,3) returns "BCD".
- substring-before(s1, s2): for example, substring-before("1999/04/01","/") returns 1999.
- substring-after(s1, s2): for example, substring-after("1999/04/01","/") returns 04/01.
- string-length(string?): returns the number of characters in the string.
- normalize-space(string?): removes all leading and trailing whitespace and replaces any sequence of whitespace characters with a single space. This is very useful when the original XML may have been formatted for pretty-printing, which could make further string processing unreliable.
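To make the behaviour of some of these functions concrete, here are rough Python analogues. They are illustrative helpers written for this example, not part of any XPath library, and the substring sketch assumes integer arguments (XPath itself also defines rounding rules for non-integer values).

```python
import re


def substring(s, start, length=None):
    """XPath-style substring for integer arguments; positions start at 1."""
    end = len(s) + 1 if length is None else start + length
    return "".join(ch for pos, ch in enumerate(s, start=1) if start <= pos < end)


def substring_before(s1, s2):
    """Part of s1 before the first occurrence of s2, or "" if s2 is absent."""
    index = s1.find(s2)
    return s1[:index] if index >= 0 else ""


def substring_after(s1, s2):
    """Part of s1 after the first occurrence of s2, or "" if s2 is absent."""
    index = s1.find(s2)
    return s1[index + len(s2):] if index >= 0 else ""


def normalize_space(s):
    """Strip leading/trailing whitespace and collapse inner runs to one space."""
    return " ".join(part for part in re.split(r"[ \t\r\n]+", s) if part)


print(substring("ABCDEF", 2, 3))             # BCD
print(substring_before("1999/04/01", "/"))   # 1999
print(substring_after("1999/04/01", "/"))    # 04/01
print(normalize_space("  pretty \n  printed  "))  # pretty printed
```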
Boolean Functions
- not(boolean): negates any boolean expression.
- true(): evaluates to true.
- false(): evaluates to false.
Number functions
- sum(node-set): converts the string values of all the nodes found by the XPath argument into numbers, according to the built-in casting rules, then returns the sum of these numbers.
Example of use
Expressions can be created inside predicates using the operators =, !=, <=, <, >= and >. Boolean expressions can be combined with parentheses () and the boolean operators and and or, as well as the not() function described above. Numeric calculations can use *, +, -, div and mod. Strings can consist of any Unicode characters.
//item[@price > 2*@discount]
selects items whose price attribute is greater than twice the numeric value of the discount attribute.
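Python's standard xml.etree.ElementTree module does not evaluate numeric comparisons in predicates, so the sketch below selects the item elements and applies the comparison in Python. The helper number is a rough stand-in for the XPath number() conversion, and the sample data is invented for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<order>"
    "<item price='10' discount='3'/>"
    "<item price='10' discount='5'/>"
    "<item price='8'/>"
    "</order>"
)


def number(value):
    """Rough stand-in for XPath number(): NaN when there is no usable value."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return float("nan")


# //item[@price > 2*@discount]
selected = [
    item for item in doc.iter("item")
    if number(item.get("price")) > 2 * number(item.get("discount"))
]

print(len(selected))  # 1: only 10 > 2*3 holds; comparisons with NaN are false
```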
Entire node sets can be combined ('unioned') using the vertical bar character |. Node sets that meet one or more of several conditions can be found by combining the conditions inside a predicate with 'or'.
v[x or y] | w[z]
will return a single node set consisting of all the v elements that have x or y child elements, as well as all the w elements that have z child elements, that were found in the current context.
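Neither the or operator nor the | operator is available in the XPath subset of Python's standard xml.etree.ElementTree module, so this sketch evaluates each part separately and merges the results in Python, removing duplicates and restoring document order. The sample document and the helper has_child are assumptions.

```python
import xml.etree.ElementTree as ET

context = ET.fromstring(
    "<ctx><w><z/></w><v><x/></v><v><y/></v><v/><w/></ctx>"
)


def has_child(node, name):
    return node.find(name) is not None


# v[x or y] | w[z], evaluated against the children of the context node.
v_part = [n for n in context.findall("v") if has_child(n, "x") or has_child(n, "y")]
w_part = context.findall("w[z]")

# A node set holds each node once; report the merged set in document order.
wanted = {id(n) for n in v_part + w_part}
union = [n for n in context if id(n) in wanted]

print([n.tag for n in union])  # ['w', 'v', 'v']
```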