Regular expression

The regular expression (?<=\.) {2,}(?=[A-Z]) matches runs of at least two spaces that follow a period (.) and precede a capital letter.
Stephen Kleene, who helped found the concept

In theoretical computer science and formal language theory, a regular expression (also called a rational expression, and commonly abbreviated to regex or regexp) is a sequence of characters that forms a search pattern. Regular expressions are mainly used to search for string patterns and to perform substitution operations.

Regular expressions are patterns used to find a certain combination of characters within a text string, and they provide a very flexible way to search for or recognize strings of text. For example, the group consisting of the strings Handel, Händel, and Haendel is described by the pattern "H(a|ä|ae)ndel".

A regular expression is a way of representing regular languages (finite or infinite) and is constructed using characters from the alphabet over which the language is defined.

Building a regular expression

Specifically, regular expressions are constructed using the union, concatenation, and Kleene closure operators. Every regular expression has an associated finite automaton. Most formalizations provide the following constructs:

Alternation

A vertical bar separates the alternatives, which are evaluated from left to right. For example, "yellow|blue" corresponds to yellow or blue.

Quantification

A quantifier after a character specifies how often the character can occur. The most common quantifiers are "?", "+" and "*":

  • The question mark ? indicates that the character that precedes it can appear at most once. For example, "ob?scuro" matches oscuro and obscuro (two spellings of the Spanish word for "dark").
  • The plus sign + indicates that the character that precedes it must appear at least once. For example, "ho+la" describes the infinite set hola, hoola, hooola, hoooola, etc.
  • The asterisk * indicates that the character that precedes it may appear zero, one, or more times. For example, "0*42" matches 42, 042, 0042, 00042, etc.

Grouping

Parentheses can be used to define the scope and precedence of the other operators. For example, "(p|m)adre" is the same as "padre|madre" (Spanish for "father" and "mother"), and "(des)?amor" matches amor and desamor ("love" and "heartbreak").

These constructs can be freely combined within the same expression, so "H(ae?|ä)ndel" is equivalent to "H(a|ae|ä)ndel". The precise syntax of regular expressions varies with the tools and applications considered.
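As an illustrative sketch, the three constructs can be checked against the examples above. Python's standard re module is an assumption made here for illustration (the text does not prescribe a language), and accepts is a hypothetical helper name.

```python
import re

def accepts(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# Alternation: either alternative is accepted.
print(accepts(r"yellow|blue", "blue"))   # True

# Quantification: + requires at least one "o", * allows any number of zeros.
print(accepts(r"ho+la", "hooola"))       # True
print(accepts(r"ho+la", "hla"))          # False
print(accepts(r"0*42", "0042"))          # True

# Grouping: the compact and the explicit pattern accept the same spellings.
names = ["Handel", "Händel", "Haendel"]
both_agree = all(
    accepts(r"H(ae?|ä)ndel", n) and accepts(r"H(a|ae|ä)ndel", n) for n in names
)
print(both_agree)                        # True
```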

Applications

The most obvious use of regular expressions is to describe a set of strings for some purpose, which makes them useful in text editors and other computer applications for searching and manipulating text.

Numerous text editors and other tools use regular expressions to find and replace patterns in text. For example, the tools provided by Unix distributions (including the sed editor and the grep filter) popularized the concept of a regular expression among non-programmers, even though it was already familiar to programmers.

Initially, string recognition was programmed separately for each application, without any mechanism built into the programming language. Over time, regular expression support has been incorporated into languages to make it easier to program the detection of certain strings. For example, Perl has a powerful regular expression engine built right into its syntax. Other languages provide regular expressions through specific functions, without incorporating them into their syntax.

Regular expressions in programming

Note: For the full understanding of this section it is necessary to possess general knowledge about programming languages

In programming, regular expressions are a method by which you can search within strings. Regardless of the breadth of the search required for a defined pattern of characters, regular expressions provide a practical solution to the problem. Additionally, a derivative use of pattern matching is the validation of a specific format on a given string, such as dates or identifiers.

In order to use regular expressions when programming, you need access to a search engine that supports them. The available engines can be classified into two types according to their use: engines for the end user and engines for the programmer.

End-user engines: these are programs that allow searches on the content of a file or on text pasted into the program. They are designed to let the user perform advanced searches with this mechanism; however, it is necessary to learn how to write proper regular expressions in order to use them efficiently. Some available programs of this type are:

  • grep: Unix/Linux operating systems program.
  • sed: Unix/Linux operating systems program that allows the modification of the output.
  • PowerGrep: grep version for Windows operating systems.
  • Sublime Text: allows you to perform searches and replacements with regular expressions in files (free).
  • RegexBuddy: helps create regular expressions interactively and then allows the user to use them and save them (not free).
  • EditPad Pro: allows you to perform searches with regular expressions in files and color-codes the expressions to make them easier to read and understand (not free).

Developer Engines: These allow you to automate the search process so that it can be used many times for a specific purpose. Here are some of the programming tools available that offer search engines with regular expression support:

  • AWK: regular expressions are an essential part of the language and, by extension, of the Unix/Linux awk tool.
  • C++: since C++11 it is possible to use regular expressions through the standard library, using the <regex> header.
  • Java: the standard library has provided regular expression support through the java.util.regex package since J2SE 1.4.
  • JavaScript: from version 1.2 (IE4+, NS4+) JavaScript has integrated support for regular expressions.
  • Perl: the language that drove the growth of regular expressions in programming to what they are today.
  • PCRE: regular expression library for C, C++ and other languages that can use DLL libraries (Visual Basic 6, for example).
  • PHP: has offered two different types of regular expressions to the programmer, PCRE (preg) and POSIX (ereg); the POSIX (ereg) variant was deprecated in PHP 5.3 and removed in PHP 7.
  • Python: scripting language with regular expression support through its re library.
  • .NET Framework: provides a set of classes that make it possible to use regular expressions to search, replace strings and validate patterns.

Note: of the tools mentioned above, EditPad Pro and the .NET Framework are used for the examples, although regular expressions can be used with any combination of the tools mentioned. Regular expressions use a largely common language across all the tools, but the practical explanations about the use of each tool and the code examples must be adapted to the tool at hand. It should also be noted that some syntax details of regular expressions are specific to the .NET Framework and are written differently in other programming tools. When these cases occur, they will be explicitly noted so that the reader can look for information about these details in additional sources. Examples for other programming languages and tools will be included in the future.

Regular expressions as a search engine

Regular expressions allow you to find specific portions of text within a larger string of characters. Thus, if it is necessary to find the text "lot" in the sentence "the ocelot jumped into the next lot", any search engine would be able to do the job. However, most search engines would also find the "lot" inside the word "ocelot", which might not be the expected result. Some search engines additionally allow you to specify that you want to find only whole words, which solves this problem. Regular expressions let you express this and many other refinements without setting any additional options: the search text itself acts as a language that tells the search engine exactly what you want to find.
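The whole-word case can be sketched as follows, assuming Python's re module (an illustrative choice, not one made by the text). The word-boundary marker \b, described later among the backslash combinations, expresses the "whole words only" option inside the pattern itself.

```python
import re

text = "the ocelot jumped into the next lot"

# Plain search: also hits the "lot" inside "ocelot".
plain_hits = [m.start() for m in re.finditer(r"lot", text)]

# \b marks a word boundary, so only the standalone word is found.
whole_word_hits = [m.start() for m in re.finditer(r"\blot\b", text)]

print(len(plain_hits))       # 2
print(len(whole_word_hits))  # 1
```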

Regular expressions as a language

To specify options within the text to be searched for, a language or convention is used by which the result to be obtained is transmitted to the search engine. This language gives a special meaning to a series of characters. So when the regular expression search engine finds these characters it will not search for them in the text in literal form, but will look for what the characters mean. These characters are sometimes called "meta-characters". Listed below are the main meta-characters and their function and how the regular expression engine interprets them.

Description of regular expressions

The dot "."

The dot is interpreted by the search engine as "any character", that is, it matches any single character except, by default, newlines. Regular expression engines usually have a configuration option that modifies this behavior. In the .NET Framework, the RegexOptions.Singleline option makes the dot match every character, including the newline (\n).

The dot is used as follows: if you tell the regex engine to look for "g.t" in the string "the stone cat on the gothic getisboro goot gate", the search engine will find "got" (in "gothic"), "get" (in "getisboro") and finally "gat" (in "gate"). Note that the search engine does not find "goot"; this is because the dot represents one character and only one. If the engine must also find "goot", it is necessary to use repetitions, which are explained later.
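A minimal sketch of this behavior, assuming Python's re module for illustration. The re.DOTALL flag is used here as Python's counterpart of the single-line option mentioned above.

```python
import re

text = "the stone cat on the gothic getisboro goot gate"

# The dot stands for exactly one character, so "goot" is not found.
matches = re.findall(r"g.t", text)
print(matches)  # ['got', 'get', 'gat']

# By default the dot does not match a newline; re.DOTALL changes that.
default_hits = re.findall(r"a.b", "a\nb")
dotall_hits = re.findall(r"a.b", "a\nb", flags=re.DOTALL)
print(default_hits)  # []
print(dotall_hits)   # ['a\nb']
```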

Although the dot is very useful for matching characters we do not know in advance, it is necessary to remember that it matches any character, and that this is often not what is required. Searching for any character is very different from searching for any alphanumeric character, any digit, any non-digit or any non-alphanumeric character. This should be taken into account before using the dot, to avoid unwanted results.

The exclamation mark "!"

Used to perform a "negative lookahead". The construct is written as a pair of parentheses in which the opening parenthesis is followed by a question mark and an exclamation mark, "(?!...)", with the regular expression to be excluded placed inside. For example, to match anything except exactly one word, use "^(word.+|(?!word).*)$".
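The pattern above can be tried out as follows. This is a small sketch assuming Python's re module; the candidate strings are made up for illustration.

```python
import re

# Pattern from the text: accept any string except the exact string "word".
not_exactly_word = re.compile(r"^(word.+|(?!word).*)$")

results = {
    candidate: not_exactly_word.match(candidate) is not None
    for candidate in ["word", "wordy", "sword", "other"]
}
print(results)  # {'word': False, 'wordy': True, 'sword': True, 'other': True}
```

The first alternative accepts strings that start with "word" and continue with at least one more character; the second uses the lookahead to accept strings that do not start with "word" at all.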

The backslash "\"

The backslash is used to escape the next character in the search expression, either removing its special meaning or giving it one. That is, the backslash is never used by itself, but in combination with other characters. For example, when it is combined with the dot, as in "\.", the dot no longer has its normal meaning and behaves like a literal character.

In the same way, when the backslash is followed by any of the special characters that we will discuss below, they lose their special meaning and become literal search characters.

As mentioned above, the backslash can also give special meaning to characters that don't have it. Below is a list of some of these combinations:

  • \t: Represents a tab character.
  • \r: Represents the "carriage return", that is, the return to the position where the line starts again.
  • \n: Represents the "new line" character, with which a new line starts. Remember that Windows uses the combination \r\n to start a new line, while Unix uses only \n and classic Mac OS uses only \r.
  • \a: Represents a "bell" or "beep" that sounds when this character is printed.
  • \e: Represents the "Esc" or "Escape" key.
  • \f: Represents a page break (form feed).
  • \v: Represents a vertical tab.
  • \x: Used to represent ASCII or ANSI characters when their code is known. For example, if you are looking for the copyright symbol and the text being searched uses the Latin-1 character set, it can be found with "\xA9".
  • \u: Used to represent Unicode characters when their code is known. "\u00A2" represents the cent sign. Not all regular expression engines support Unicode. The .NET Framework does, but EditPad Pro does not, for example.
  • \d: Represents a digit from 0 to 9.
  • \w: Represents any alphanumeric character (in most engines, letters, digits and the underscore).
  • \s: Represents a whitespace character.
  • \D: Represents any character other than a digit from 0 to 9.
  • \W: Represents any non-alphanumeric character.
  • \S: Represents any character other than whitespace.
  • \A: Represents the start of the string. It is not a character but a position.
  • \Z: Represents the end of the string. It is not a character but a position.
  • \b: Marks a word boundary, that is, the position where a word is delimited by whitespace, punctuation or the beginning/end of the string.
  • \B: Marks the position between two alphanumeric characters or between two non-alphanumeric characters.
  • \Q and \E: Everything between these two marks is interpreted as literal text. Example: "\Q.*\E" is interpreted as the literal ".*".
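A few of these combinations can be exercised with a short sketch, assuming Python's re module for illustration. The sample string is invented, and not every escape in the list is available in every engine: Python's re, for instance, has no \Q...\E, and re.escape is used below as the closest substitute.

```python
import re

sample = "Order 66:\tpaid on 2024-01-15"

digits = re.findall(r"\d+", sample)     # runs of digits
words = re.findall(r"\w+", sample)      # runs of letters, digits and underscores
chunks = re.findall(r"\S+", sample)     # runs of non-whitespace characters
has_tab = re.search(r"\t", sample) is not None

# \A and \Z are positions: the start and the end of the string.
starts_with_order = re.search(r"\AOrder", sample) is not None
ends_with_15 = re.search(r"15\Z", sample) is not None

# Escaped metacharacters match literally; re.escape builds such a pattern.
literal = re.findall(r"\.\*", "a.*b")
escaped = re.search(re.escape(".*"), "a.*b").group()

# \xA9 addresses a character by its code (the copyright sign in Latin-1).
has_copyright = re.search(r"\xA9", "\u00a9 2024") is not None

print(digits)  # ['66', '2024', '01', '15']
print(chunks)  # ['Order', '66:', 'paid', 'on', '2024-01-15']
```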


Notes:

  • Utilities such as Charmap.exe from Windows or GNOME gucharmap allow you to find the ASCII/ANSI/UNICODE codes for use in Regular Expressions.
  • Some languages, such as Java, give the backslash their own meaning inside string literals, so it must be doubled to reach the regular expression engine (for example, String expresion = "\\d.\\d"; denotes the pattern \d.\d).

The square brackets "[ ]"

The function of square brackets in the language of regular expressions is to represent "character classes", that is, to group characters into sets. They are useful when you need to search for any one of a group of characters. Within the brackets it is possible to use the hyphen "-" to specify ranges of characters. Additionally, most metacharacters lose their special meaning and become literals when they are inside the brackets.

For example, as seen above, "\d" is useful for finding any character that represents a digit. However, this class does not include the period "." that separates the decimal part of a number. To search for any character that is a digit or a period we can use the regular expression "[\d.]". As noted above, within the brackets the period represents a literal character and not a metacharacter, so there is no need to precede it with the backslash. The backslash itself must always be escaped inside the brackets; depending on the engine and on their position, the closing bracket "]", the caret "^" and the hyphen "-" may need escaping as well. The regular expression "[\dA-Fa-f]" allows us to find hexadecimal digits.

Brackets also make it possible to find words even when they are misspelled. For example, the regular expression "expresi[oó]n" finds the Spanish word "expresión" whether or not it was written with its accent mark. It is necessary to clarify that, regardless of how many characters are placed inside the square brackets, the class only tells the search engine to find a single character at a time; that is, "expresi[oó]n" will find "expresion" or "expresión", but not "expresioón".
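The character classes discussed above can be sketched as follows, assuming Python's re module for illustration. The sample strings are invented, and the patterns are written with their backslashes.

```python
import re

# A digit or a literal period: no backslash is needed before "." in a class.
numbers = re.findall(r"[\d.]+", "pi is 3.14 and e is 2.718")
print(numbers)  # ['3.14', '2.718']

# Hexadecimal digits: \d plus two letter ranges.
hex_digits = re.findall(r"[\dA-Fa-f]", "0x1F9g")
print(hex_digits)  # ['0', '1', 'F', '9']

# A class matches exactly one character at a time.
spellings = re.findall(r"expresi[oó]n", "expresion, expresión, expresioón")
print(spellings)  # ['expresion', 'expresión']
```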

The Bar "|"

It is used to indicate one of several options. For example, the regular expression "a|e" will find any "a" or "e" within the text. The regular expression "east|west|north|south" will find any of the names of the cardinal points. The bar is commonly used in conjunction with other special characters.

The dollar sign "$"

Represents the end of the character string or, if multi-line mode is used, the end of the line. It does not represent a character but a position. If you use the regular expression "\.$", the engine will find all the places where a period ends the line, which is useful for moving between paragraphs.

The caret "^"

This character has a dual functionality, which differs depending on whether it is used on its own or in conjunction with other special characters. Used on its own, the character "^" represents the start of the string (in the same way that the dollar sign "$" represents the end of the string). Therefore, if you use the regular expression "^[a-z]", the engine will find all paragraphs that start with a lowercase letter. Used at the start of a square-bracket class, as in "[^\w ]", it finds any character that is NOT within the indicated set. This expression finds, for example, any character that is neither alphanumeric nor a space, that is, all punctuation symbols and other special characters.

The special characters "^" and "$" make simple validation easy. For example, "^\d$" ensures that the string being checked consists of a single digit, and "^\d\d/\d\d/\d\d\d\d$" validates a date in short format. The latter does not check whether the date is valid, since 99/99/9999 also fits this format; full validation of a date is also possible using regular expressions, as exemplified later.
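A minimal validation sketch along these lines, assuming Python's re module. The function name is_short_date is hypothetical, and the check covers only the shape of the string, exactly as described above.

```python
import re

SHORT_DATE = re.compile(r"^\d\d/\d\d/\d\d\d\d$")

def is_short_date(text: str) -> bool:
    """Check the dd/dd/dddd shape only; calendar validity is not checked."""
    return SHORT_DATE.match(text) is not None

print(is_short_date("31/12/1999"))  # True
print(is_short_date("99/99/9999"))  # True: well-formed, though not a real date
print(is_short_date("1/2/1999"))    # False

# "^" at line starts (multi-line mode) and a negated class.
line_starts = re.findall(r"^[a-z]", "one\nTwo\nthree", flags=re.MULTILINE)
punctuation = re.findall(r"[^\w ]", "Hello, world!")
print(line_starts)  # ['o', 't']
print(punctuation)  # [',', '!']
```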

The parentheses "()"

Similar to brackets, parentheses are used to group characters, however there are several fundamental differences between groups set by brackets and groups set by parentheses:

  • Special characters retain their meaning within parentheses.
  • Groups established with parentheses set a "label" or "reference point" for the search engine that can be used later, as described below.
  • Used in conjunction with the bar "|", they allow a choice within a larger expression. For example, the regular expression "al (este|oeste|norte|sur) de" (Spanish for "to the (east|west|north|south) of") finds texts that give directions by means of cardinal points, while the regular expression "este|oeste|norte|sur" on its own would also find "este" in the name "Esteban", which defeats the purpose.
  • Used in conjunction with other special characters that are subsequently detailed, it offers additional functionality.
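The difference between a bare alternation and an alternation inside parentheses can be sketched as follows, assuming Python's re module. The Spanish sample sentence is invented for illustration, and case is ignored so that the capitalized name is searched too.

```python
import re

text = "Esteban vive al norte de la plaza"

# Bare alternation: also hits the "Este" at the start of "Esteban".
bare = re.findall(r"este|oeste|norte|sur", text, flags=re.IGNORECASE)
print(bare)     # ['Este', 'norte']

# Alternation inside parentheses, surrounded by context: only the direction.
grouped = re.findall(r"al (este|oeste|norte|sur) de", text, flags=re.IGNORECASE)
print(grouped)  # ['norte']
```

With a single group in the pattern, re.findall returns the text captured by that group, which is the "reference point" role of parentheses mentioned in the list.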

The question mark "?"

The question mark has several functions within the regular expression language. The first is to specify that a part of the search is optional. For example, the regular expression "ob?scuro" finds both "oscuro" and "obscuro". In conjunction with parentheses it specifies that a larger set of characters is optional; for example, "Nov(\.|iembre|ember)?" finds "Nov", "Nov.", "Noviembre" and "November".

As mentioned above, parentheses set a "reference point" for the search engine. However, sometimes they are not wanted for this purpose, as in the "Nov(\.|iembre|ember)?" example. In that case, recording the reference (detailed below) is wasted work for the search engine. To avoid it, use the question mark as follows: "Nov(?:\.|iembre|ember)?". The result obtained is the same, but the engine does not store a reference for this group. When the group does not need to be reused, this form is advisable.

The question mark has yet another meaning. Parentheses define "anonymous" groups, but the question mark together with angle brackets "<>" allows these groups to be named, as follows: "^(?<Day>\d\d)/(?<Month>\d\d)/(?<Year>\d\d\d\d)$". This tells the search engine that the first two digits found will be labeled "Day", the next two "Month" and the last four "Year".
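These three uses of the question mark can be sketched as follows, assuming Python's re module for illustration. Note one syntax difference that is specific to that assumption: Python's re spells named groups (?P<name>...) rather than (?<name>...).

```python
import re

# Optional character.
optional_b = re.compile(r"ob?scuro")
print(bool(optional_b.fullmatch("oscuro")), bool(optional_b.fullmatch("obscuro")))

# Optional group, with and without a stored reference.
capturing = re.compile(r"Nov(\.|iembre|ember)?")
non_capturing = re.compile(r"Nov(?:\.|iembre|ember)?")
spellings = ["Nov", "Nov.", "Noviembre", "November"]
all_match = all(non_capturing.fullmatch(s) for s in spellings)

# .groups is the number of capturing groups in a compiled pattern.
print(capturing.groups, non_capturing.groups)  # 1 0

# Named groups (Python spelling of the Day/Month/Year example).
date = re.compile(r"^(?P<Day>\d\d)/(?P<Month>\d\d)/(?P<Year>\d\d\d\d)$")
parts = date.match("25/12/2024").groupdict()
print(parts)  # {'Day': '25', 'Month': '12', 'Year': '2024'}
```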

NOTE: Despite the complexity and flexibility given by the special characters studied so far, they mostly allow us to find only one character at a time, or a group of characters at a time. The metacharacters listed below allow repetitions to be set.

The braces "{}"

Braces are normally literal characters when used on their own in a regular expression. To act as metacharacters they must enclose one or more numbers separated by commas and be placed to the right of another regular expression, as in "\d{2}". This expression tells the search engine to find two contiguous digits. Using this form, the example "^\d\d/\d\d/\d\d\d\d$", which was used to validate a date format, becomes "^\d{2}/\d{2}/\d{4}$", which is easier to read.

"\d{2,4}": this form adds a second number separated by a comma, which tells the search engine that the regular expression \d must appear at least 2 and at most 4 times. The possible matches are equivalent to:

  • "^\d\d$" (minimum of 2 repetitions)
  • "^\d\d\d$" (3 repetitions, which falls within the 2-4 range)
  • "^\d\d\d\d$" (maximum of 4 repetitions)
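The counted repetitions can be checked with a short sketch, assuming Python's re module for illustration; the candidate strings are invented.

```python
import re

# Between two and four digits, and nothing else.
two_to_four_digits = re.compile(r"^\d{2,4}$")
candidates = ["1", "12", "123", "1234", "12345"]
accepted = [c for c in candidates if two_to_four_digits.match(c)]
print(accepted)  # ['12', '123', '1234']

# The braces form of the date pattern behaves like the spelled-out form.
long_form = re.compile(r"^\d\d/\d\d/\d\d\d\d$")
short_form = re.compile(r"^\d{2}/\d{2}/\d{4}$")
samples = ["01/02/2003", "1/2/2003", "99/99/9999"]
same = all(bool(long_form.match(s)) == bool(short_form.match(s)) for s in samples)
print(same)  # True
```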

Note: although this way of finding repeated elements is very useful, sometimes it is not clearly known how many times what you are looking for is repeated or its degree of repetition is variable. In these cases the following metacharacters are useful.

The asterisk "*"

The asterisk is used to find something that is repeated 0 or more times. For example, the expression "[a-zA-Z]\d*" finds "H" as well as "H1", "H01", "H100" and "H1000", that is, a letter followed by any number of digits.

It is necessary to be careful with the behavior of the asterisk, since by default it tries to match the largest possible number of characters that fit the pattern. Thus, if you use "\(.*\)" to find any text within parentheses and apply it to the text "See (Fig. 1) and (Fig. 2)", you might expect the search engine to find the texts "(Fig. 1)" and "(Fig. 2)"; because of this behavior, however, it finds the text "(Fig. 1) and (Fig. 2)" instead. This happens because the asterisk tells the search engine to take everything it can between an opening and a closing parenthesis. To obtain the desired result, the asterisk must be used together with the question mark, as follows: "\(.*?\)". This is equivalent to telling the search engine to "find an opening parenthesis and then take characters only until the first closing parenthesis is found".
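The greedy and non-greedy behaviors described above can be compared side by side in a short sketch, assuming Python's re module for illustration.

```python
import re

text = "See (Fig. 1) and (Fig. 2)"

# Greedy: .* runs to the last closing parenthesis.
greedy = re.findall(r"\(.*\)", text)
print(greedy)  # ['(Fig. 1) and (Fig. 2)']

# Lazy: .*? stops at the first closing parenthesis it can.
lazy = re.findall(r"\(.*?\)", text)
print(lazy)    # ['(Fig. 1)', '(Fig. 2)']

# Zero or more digits after a letter.
labels = [s for s in ["H", "H1", "H1000", "42"] if re.fullmatch(r"[a-zA-Z]\d*", s)]
print(labels)  # ['H', 'H1', 'H1000']
```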

The plus sign "+"

Used to find something that is repeated one or more times. Unlike the asterisk, the expression "[a-zA-Z]\d+" will find "H1" but will not find "H". This metacharacter can also be combined with the question mark to limit how far the repetition extends.

Anonymous groups

Anonymous groups are created whenever a regular expression is enclosed in parentheses, so the expression "<([a-zA-Z]\w*?)>" defines an anonymous group. The search engine stores a reference to the text matched by the expression enclosed in the parentheses.

The most immediate way to use the defined groups is within the regular expression itself, which is done using the backslash "\" followed by the number of the group you want to refer to, as follows: "<([a-zA-Z]\w*?)>.*?</\1>". This regular expression will match both the string "<span>This</span>" and the string "<b>test</b>" in the text "<span>This</span> is a <b>test</b>", even though the expression does not contain the literals "span" and "b".
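The backreference can be sketched as follows, assuming Python's re module for illustration (the .NET examples that follow use the same pattern syntax for this case).

```python
import re

# \1 must repeat whatever text group 1 captured, so tags have to pair up.
tag_pair = re.compile(r"<([a-zA-Z]\w*?)>.*?</\1>")

text = "<span>This</span> is a <b>test</b>"
whole = [m.group(0) for m in tag_pair.finditer(text)]
names = [m.group(1) for m in tag_pair.finditer(text)]

print(whole)  # ['<span>This</span>', '<b>test</b>']
print(names)  # ['span', 'b']

# Mismatched tags are rejected.
mismatch = tag_pair.search("<i>oops</b>")
print(mismatch)  # None
```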

Another way to use groups is from the programming language you are using. Each language has a different way of accessing groups. The examples listed below use the .NET Framework classes with C# syntax (which can easily be adapted to VB.NET or any other .NET language, or even to Java or JavaScript).

To use the .NET Framework search engine, you first need to reference the System.Text.RegularExpressions namespace. Then it is necessary to declare an instance of the Regex class as follows:

 using System.Text.RegularExpressions;

 Regex _TagParser = new Regex(@"<([a-zA-Z]\w*?)>");

Then assuming that the text you want to examine with the regular expression is in the variable "sText" we can iterate through all the instances found as follows:

 foreach (Match CurrentMatch in _TagParser.Matches(sText))
 {
     // --- extra code here ---
 }

You can then use the Groups property of the Match class to return the search result:

 foreach (Match CurrentMatch in _TagParser.Matches(sText))
 {
     // Group 1 is the first anonymous group: the tag name.
     String sTagName = CurrentMatch.Groups[1].Value;
 }

Named groups

Named groups are those that are assigned a name within the regular expression so that they can be used later. This is done differently in different search engines; the following explains how to do it in the .NET Framework engine.

Using the above example, it is possible to convert "<([a-zA-Z]\w*?)>" into "<(?<TagName>[a-zA-Z]\w*?)>" to find HTML tags. Note the question mark followed by the text "TagName" enclosed in angle brackets. To use this example in the .NET Framework it is possible to use the following code:

 Regex _TagParser = new Regex(@"<(?<TagName>[a-zA-Z]\w*?)>");
 foreach (Match CurrentMatch in _TagParser.Matches(sText))
 {
     // Named groups are read by name instead of by number.
     String sTagName = CurrentMatch.Groups["TagName"].Value;
 }

It is possible to define as many groups as necessary. In this way you can define something like "<(?<TagName>[a-zA-Z]\w*) ?(?<Attributes>.*?)>" to find not only the name of the HTML tag but also its attributes (the tag name is matched greedily here so that the attributes group cannot absorb part of it), as follows:

 Regex _TagParser = new Regex(@"<(?<TagName>[a-zA-Z]\w*) ?(?<Attributes>.*?)>");
 foreach (Match CurrentMatch in _TagParser.Matches(sText))
 {
     String sTagName = CurrentMatch.Groups["TagName"].Value;
     String sAttributes = CurrentMatch.Groups["Attributes"].Value;
 }

But it is possible to go much further as follows:

 <(?<TagName>[a-zA-Z]\w*?) ?(?:(?<Attribute>[\w\-\r\n]*?)='?"?(?<Value>[\w\-:;. \r\n]*?)'?"? ?)>

This expression allows you to find the name of the tag, the name of the attribute and its value.

However, an HTML tag can have more than one attribute. This can be solved using iterations as follows:

 <(?<TagName>[a-zA-Z]\w*?) ?(?:(?<Attribute>[\w\-\r\n]*?)='?"?(?<Value>[\w\-:;. \r\n]*?)'?"? ?)*?>

And in code it can be used as follows:

 Regex _TagParser = new Regex(@"<(?<TagName>[a-zA-Z]\w*?) ?(?:(?<Attribute>[\w\-\r\n]*?)='?""?(?<Value>[\w\-:;. \r\n]*?)'?""? ?)*?>");
 foreach (Match CurrentMatch in _TagParser.Matches(sText))
 {
     String sTagName = CurrentMatch.Groups["TagName"].Value;
     // A repeated group keeps every text it matched in its Captures collection.
     foreach (Capture CurrentCapture in CurrentMatch.Groups["Attribute"].Captures)
     {
         AttributesCollection.Add(CurrentCapture.Value);
     }
     foreach (Capture CurrentCapture in CurrentMatch.Groups["Value"].Captures)
     {
         ValuesCollection.Add(CurrentCapture.Value);
     }
 }

It is possible to drill down using a regular expression like this:

 <(?<TagName>[a-zA-Z]\w*?) ?(?:(?<Attribute>[\w\-\r\n]*?)='?"?(?<Value>[\w\-:;. \r\n]*?)'?"? ?)*?>(?<Content>.*?)</\1>

This would make it possible to find the name of the tag, its attributes, their values and the tag's content, all with a single regular expression.
