Computer Science

Regular Expressions

Regular expressions are sequences of characters that define a search pattern, used for string manipulation and searching within text. They provide a powerful way to match, search, and manipulate text based on patterns, allowing for complex and flexible text processing. In computer science, regular expressions are widely used in tasks such as text parsing, data validation, and pattern matching.

Written by Perlego with AI-assistance

3 Key excerpts on "Regular Expressions"

Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.

eBook - ePub
Language and Computers
- Markus Dickinson, Chris Brew, Detmar Meurers (Authors)
- 2012 (Publication Date)
- Wiley-Blackwell (Publisher)
In formal language theory, language is treated mathematically, and a set of strings defines a language. For instance, English is defined as the set of all legitimate English sentences. As in other formalisms, Regular Expressions as such have no linguistic content; they are simply descriptions of a set of strings encoding a natural language text. While some patterns cannot be specified using Regular Expressions (see Under the Hood 4 on the complexity of grammar), Regular Expressions are quite suitable for our purposes.

Regular Expressions are used throughout the computer world, and thus there are a variety of Unix tools (grep, sed, etc.), editors (Emacs, jEdit, etc.), and programming languages (Perl, Python, Java, etc.) that incorporate Regular Expressions. There is even some support for regular expression usage on Windows platforms (e.g., wingrep). The various tools and languages differ with respect to the exact syntax of the Regular Expressions they allow, but the principles are the same. Implementations are very efficient so that large text files can be searched quickly, but they are generally not efficient enough for web searching.

4.4.1 Syntax of Regular Expressions

We can now turn to how Regular Expressions are used to describe strings. In this section we will discuss the basics of the syntax, and in the next section we will walk through the use of Regular Expressions with one particular tool. Note that while some of the symbols are the same as with basic search operators (e.g., *), as outlined in section 4.2, they often have different meanings. Unlike search operators – whose definitions can vary across systems – Regular Expressions have a mathematical grounding, so the definition of operators does not change. In fact, Regular Expressions can consist of a variety of different types of special characters, but there is a very small set of them
eBook - ePub
Understanding Corpus Linguistics
- Danielle Barth, Stefan Schnell (Authors)
- 2021 (Publication Date)
- Routledge (Publisher)
A corpus that has at least some level of annotation will be very helpful in finding categories of words. One of the most common types of annotation is PoS tagging (cf. 7.2.3), which indicates the part of speech of each word in the corpus (also called grammatical tagging). Depending on how the tagging was done, there may just be simple categories such as verb, noun, adjective, or the categories may be more refined such as past tense verb, present tense verb, etc. One thing to watch out for with PoS tagging is that it is often automated. Hand-tagging an entire corpus, even a small one, is a monumental effort. Therefore, rules are often used to describe to a computer when to label a word with a particular category or another and then a tagger is run over the corpus. Usually a tagger is trained on a smaller hand-annotated sample and then applied (tested) on a larger corpus. This means that you get a lot of coverage quickly, but that there may be incorrect labels given to words. Some words may also get more than one possible label. This happens when a word is ambiguous and the tagger’s rules cannot unambiguously determine the word category. There are also semantic taggers that give meaning information, among others. Tags are often somewhere ‘in the background’, so if you are using an interface to query your corpus, you may not see the tags, although they will constrain your results. If you are using any tagged corpus, it is good to look through a portion of the actual tagged data before you start your searches so that you can adapt your search to what is really available in the corpus, not just what you expect or hope to be available. Searching for some constructions or grammar phenomena may require more than PoS tagged corpora.

For many languages there are no tagged corpora available, so to find grammatical phenomena, corpus linguists may have to rely on string searches. Many constructions use particular words or strings or a limited set of these. When this is a small number, Regular Expressions can be helpful (cf. 5.11). Another way one can find grammatical constructions or search for grammatical patterns is to first identify the possible structures in a smaller tagged corpus and then use string information in a larger corpus (à la Bresnan et al. 2007 described in 4.2.5).

Some kinds of grammatical phenomena are difficult to find with string searches. In that case, additional annotation for specific categories is probably needed. That is why many corpora are hand-annotated for a limited set of phenomena, although this can be time-consuming (cf. Chapter 7).

5.11 Regular Expressions and specialised query languages

Some corpus programs and interfaces have their own query languages that you need to learn to use if you want to search for more than basic strings. Many programs, however, use a standard kind of query called Regular Expressions, also abbreviated as regex or regexp, used for finding patterns in data. Regular Expressions can be combined with text strings to pull out information or replace it with other information.

Regular Expressions treat everything as a character, and certain Regular Expressions are used to refer to certain types of characters like letters [a-zA-Z] and classes of characters (like digits with \d). But Regular Expressions also allow us to match less obvious things, like every non-digit character \D. Letters, digits, and underscores can be referenced with \w to match so-called word characters and non-word characters with \W
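The character classes named in this excerpt can be tried out with a short sketch. It uses Python's re module, which is an assumption on our part (the excerpt does not name a tool), and the sample sentence is made up purely for illustration. It shows both pulling information out and replacing it.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Room 42 opens at 9am, room_7 stays closed!"

letters = re.findall(r"[a-zA-Z]+", text)    # runs of ASCII letters
digits = re.findall(r"\d+", text)           # runs of digits
non_digits = re.findall(r"\D+", text)       # runs of anything except digits
words = re.findall(r"\w+", text)            # letters, digits and underscores
non_words = re.findall(r"\W+", text)        # spaces and punctuation

# Replacement: mask every run of digits with a placeholder character.
masked = re.sub(r"\d+", "#", text)

print(digits)   # ['42', '9', '7']
print(words)    # ['Room', '42', 'opens', 'at', '9am', 'room_7', 'stays', 'closed']
print(masked)   # Room # opens at #am, room_# stays closed!
```

Note that \w keeps room_7 together as one token because the underscore counts as a word character, whereas [a-zA-Z] stops at it.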

eBook - ePub
Bioinformatics Algorithms: Design and Implementation in Python
- Miguel Rocha, Pedro G. Ferreira (Authors)
- 2018 (Publication Date)
- Academic Press (Publisher)

Combining these definitions with the previous ones, it is easy to define an RE for a natural number as the string “[0-9]*”, or to define a DNA sequence as “[ACTGactg]*”, or a protein sequence with between 100 and 200 amino acids as “[ACDEFGHIKLMNPQRSTVWY]{100,200}”. And, of course, the possibilities are endless.
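As a minimal sketch, the three patterns above can be used as whole-string validators with Python's re.fullmatch. The helper name is_valid and the test strings are our own, introduced only for illustration.

```python
import re

# The three patterns quoted in the text, compiled once.
NATURAL = re.compile(r"[0-9]*")
DNA = re.compile(r"[ACTGactg]*")
PROTEIN = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]{100,200}")


def is_valid(pattern, s):
    """Hypothetical helper: True when the whole string s matches pattern."""
    return pattern.fullmatch(s) is not None


print(is_valid(NATURAL, "2018"))        # True
print(is_valid(NATURAL, "20a8"))        # False
print(is_valid(DNA, "ACTGgtca"))        # True
print(is_valid(DNA, "ACUG"))            # False: U is not in the character set
print(is_valid(PROTEIN, "M" * 150))     # True
print(is_valid(PROTEIN, "M" * 99))      # False: shorter than 100 residues

# Because * means "zero or more", the empty string is accepted as well.
print(is_valid(NATURAL, ""))            # True
```

If empty input should be rejected, replacing * with + (one or more) is the usual adjustment.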

There are other ways to select groups of characters, using the \ symbol followed by a letter. Some examples of this syntax are given below:

• \s – includes all white space (spaces, newlines, tabs, etc);
• \S – is the negation of the previous, thus matches with all non-white-space characters;
• \d – matches with digits;
• \D – matches with non-digits.
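The four shorthand classes in the list can be illustrated with a small sketch in Python's re module. The two-line record below is invented for the example and does not come from the book.

```python
import re

# Hypothetical tab- and newline-separated record, for illustration only.
record = "id\tACTG 1234\nname\tGGCA 77"

tokens = re.findall(r"\S+", record)      # runs of non-white-space characters
separators = re.findall(r"\s", record)   # each single white-space character
numbers = re.findall(r"\d+", record)     # runs of digits
digits_only = re.sub(r"\D", "", record)  # delete every non-digit character

print(tokens)        # ['id', 'ACTG', '1234', 'name', 'GGCA', '77']
print(separators)    # ['\t', ' ', '\n', '\t', ' ']
print(numbers)       # ['1234', '77']
print(digits_only)   # 123477
```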

Other important meta-characters include |, which works as a logical or (disjunction), stating that the pattern can match either the expression on the left or the expression on the right; $, which matches the end of a line; and ^, which matches the beginning of a line.
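A short sketch of these three meta-characters in Python follows. One detail is specific to Python's re module rather than to the text: by default ^ and $ anchor to the start and end of the whole string, and the re.MULTILINE flag is needed for them to apply to every line. The three-line input is made up for the example.

```python
import re

# Hypothetical input: three short sequences, one per line.
lines = "ACCTG\nGACTT\nTTACC"

# Lines that begin with ACC or with GAC (alternation inside a group).
starts = re.findall(r"^(?:ACC|GAC)", lines, flags=re.MULTILINE)

# Lines that end with ACC.
ends = re.findall(r"ACC$", lines, flags=re.MULTILINE)

# Without MULTILINE, ^ only anchors at the very start of the string.
starts_whole_string = re.findall(r"^(?:ACC|GAC)", lines)

print(starts)                # ['ACC', 'GAC']
print(ends)                  # ['ACC']
print(starts_whole_string)   # ['ACC']
```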

Some examples of strings representing Regular Expressions and possible matching strings are given in Table 5.1.

Table 5.1 Examples of Regular Expressions and matching strings.

RE	Matching strings
ACTG	ACTG
AC.TC	ACATC, ACCTC, ACXTC, ...
A[AC]A	AAA, ACA
A*CCC	CCC, ACCC, AACCC, ...
ACC|G.C	ACC, GAC, GCC, ...
AC(AC){1,2}A	ACACA, ACACACA
[AC]{3}	CAC, AAA, ACC, ...
[actg]*	a, ac, tg, gcgctgc, ...
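The rows of Table 5.1 can be checked mechanically. The sketch below, using Python's re module, tests each listed string against its pattern with re.fullmatch. It assumes the alternation row is meant as ACC|G.C and the repetition row as [AC]{3}. The dictionary name TABLE and the helper all_match are our own.

```python
import re

# Patterns and sample strings transcribed from Table 5.1.
# Assumption: alternation is written "|" and "exactly three" is "{3}".
TABLE = {
    "ACTG": ["ACTG"],
    "AC.TC": ["ACCTC", "ACXTC"],
    "A[AC]A": ["AAA", "ACA"],
    "A*CCC": ["CCC", "ACCC", "AACCC"],
    "ACC|G.C": ["ACC", "GAC", "GCC"],
    "AC(AC){1,2}A": ["ACACA", "ACACACA"],
    "[AC]{3}": ["CAC", "AAA", "ACC"],
    "[actg]*": ["a", "ac", "tg", "gcgctgc"],
}


def all_match(pattern, strings):
    """Hypothetical helper: True when pattern matches every string in full."""
    return all(re.fullmatch(pattern, s) is not None for s in strings)


results = {pattern: all_match(pattern, strings)
           for pattern, strings in TABLE.items()}

for pattern, ok in results.items():
    print(f"{pattern:15} {'ok' if ok else 'MISMATCH'}")
```

re.fullmatch is used rather than re.search so that a string only counts when the pattern covers it entirely, which is how the table reads.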

Python includes, within the package re, a number of tools for working with REs, which make it possible to test whether they match strings. The main functions and their descriptions are provided in Table 5.2.

Table 5.2 Functions/methods working over Regular Expressions.

Function	Description
re.search(regexp, str)	checks if regexp matches str; returns results on the first match
re.match(regexp, str)	checks if regexp matches str at the beginning of the string
re.findall(regexp, str)	checks if regexp matches str; returns results on all matches as a list
re.finditer(regexp, str)	same as previous, but returns results as an iterator
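The four functions in Table 5.2 can be compared side by side on a single string. This is a minimal sketch. The sequence is made up, and the group() and span() calls on the returned match objects are standard re methods.

```python
import re

seq = "TGAAGTATGAGA"  # hypothetical DNA string, for illustration only

# re.search scans the whole string and stops at the first match.
first = re.search("TGA", seq)
print(first.group(), first.span())   # TGA (0, 3)

# re.match only succeeds if the pattern matches at position 0.
at_start = re.match("TGA", seq)
not_at_start = re.match("GAA", seq)  # None: GAA occurs, but not at the start
found_later = re.search("GAA", seq)
print(not_at_start, found_later.span())   # None (1, 4)

# re.findall returns every non-overlapping match as a list of strings.
all_hits = re.findall("TGA", seq)
print(all_hits)                      # ['TGA', 'TGA']

# re.finditer yields match objects, so positions are available too.
spans = [m.span() for m in re.finditer("TGA", seq)]
print(spans)                         # [(0, 3), (7, 10)]
```

The difference between re.match and re.search is a common source of confusion: both return None when nothing is found, but re.match never looks past the start of the string.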

In these functions, the result of a match is kept in a Python object that holds relevant information about the match. The methods m.group() and m.span(), applied over an object m
