Capturing Items with Regular Expressions

Introduction

When capturing items, whether they exist in Microsoft Word documents or source code, a tool must be told what patterns to search for during its parsing phase. These patterns are most often described by a regular expression (regex), a sequence of characters that defines a search pattern. Given this search pattern, a parser can recognize tags within documents and pull specific design items or other types of requirements from them. This type of character matching is implemented in Spec-TRACER™, and the purpose of this document is to provide an introduction to forming and using your own search patterns when capturing design items.

Writing a Search Pattern

Regular expressions work by matching characters against a given pattern. The pattern is composed of the letters, numbers, or symbols of the term to be matched, and is especially simple when matching a single, defined term. For example, to match a specific phrase, the regex is simply that phrase. Suppose we have the string "Hello, World!". To match the first word, we would simply use the regex "Hello". This is easy for simple, known patterns. However, variable terms, such as the many different tags used for item codes within requirement documents or source code, call for more complex pattern matching.
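As a minimal sketch of literal matching, the "Hello, World!" example can be tried in Python's `re` module. Python is used here only as a convenient stand-in, on the assumption that its syntax agrees with the Perl flavor for a plain literal pattern; the "Goodbye" pattern is added purely for contrast.

```python
import re

text = "Hello, World!"

# A literal pattern matches exactly the characters it spells out.
match = re.search(r"Hello", text)
literal_hit = match.group(0) if match else None

# A literal that does not occur in the text produces no match at all.
missing = re.search(r"Goodbye", text)

print(literal_hit)  # Hello
print(missing)      # None
```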

NOTE: Regex syntax can vary among different engines. This document covers the more widely used Perl flavor of regex, which is the accepted syntax within Spec-TRACER.

Character Classes

Character classes let a regex match whole categories of characters, such as digits, word characters, or whitespace. Instead of providing the exact characters of a target phrase, a character class matches any phrase whose characters belong to that class. The table below summarizes some of the most common character classes.

| Character Class | Function | Example Regex | Matches |
|---|---|---|---|
| [characters] | Matches a single character with any of the provided characters. | [ch]at | "cat" and "hat" from "bat, cat, hat" |
| [start-end] | Matches a single character with any character from the specified range. | [a-c]at | "bat" and "cat" from "bat, cat, hat" |
| [^group] | Matches any character not present in the specified group or range. | [^a-c]at | "hat" from "bat, cat, hat" |
| . | Matches any single character (wildcard). | .at | "bat", "cat", "hat" from "bat, cat, hat" |
| \d | Matches decimal characters. Equal to: [0-9] | \d\d\d | "254" in "DO-254" |
| \D | Matches non-decimal characters. Equal to: [^0-9] | \D\D\D | "DO-" in "DO-254" |
| \w | Matches word characters. Equal to: [a-zA-Z0-9_] | \w | 'v', '1', and '2' in "v1.2" |
| \W | Matches any non-word characters. Equal to: [^a-zA-Z0-9_] | \W | '.' in "v1.2" |

As shown, character classes can be defined by the user with square brackets or selected with special symbols involving the backslash '\'. They can be inserted anywhere within a regex. For example, in the wildcard '.' case, the regex matches any single character followed by "at".
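The character classes can be checked with a short sketch in Python's `re` module, assuming its behavior agrees with the Perl flavor for these constructs on plain ASCII input. The variable names are illustrative only.

```python
import re

words = "bat, cat, hat"

set_hits = re.findall(r"[ch]at", words)        # explicit set of characters
range_hits = re.findall(r"[a-c]at", words)     # range of characters
negated_hits = re.findall(r"[^a-c]at", words)  # anything outside the range
wildcard_hits = re.findall(r".at", words)      # any single character

digit_hits = re.findall(r"\d\d\d", "DO-254")     # three digits in a row
nondigit_hits = re.findall(r"\D\D\D", "DO-254")  # three non-digits in a row
word_hits = re.findall(r"\w", "v1.2")            # each word character
nonword_hits = re.findall(r"\W", "v1.2")         # each non-word character

print(set_hits)       # ['cat', 'hat']
print(range_hits)     # ['bat', 'cat']
print(negated_hits)   # ['hat']
print(wildcard_hits)  # ['bat', 'cat', 'hat']
print(digit_hits)     # ['254']
print(nondigit_hits)  # ['DO-']
print(word_hits)      # ['v', '1', '2']
print(nonword_hits)   # ['.']
```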

Anchors

While the classes above match characters from their respective groups anywhere in a string, it is often necessary to specify where in a string a regex should match. This is done with anchors. Anchors are particularly useful for item capture when item codes are mentioned in the text in places other than where the code acts as an identifier. Without anchors, a parser would match an item code at every occurrence, which can lead to incomplete or duplicate items being captured. It is therefore useful to specify that some matches should occur only at the beginning of a string, or perhaps at the end of a line. The following table describes two of the most common anchors:

| Anchor | Function | Example Regex | Matches |
|---|---|---|---|
| ^ | The match must occur at the beginning of a string or line. | ^...... | "Spec-T" in "Spec-TRACER" |
| $ | The match must occur at the end of a string (or before a line break). | ......$ | "TRACER" in "Spec-TRACER" |

In this case, both example regex were written to match any six consecutive characters, as indicated by the six wildcards. The difference between the two is the anchor used in each. To match the first six characters of the example string, the '^' anchor was placed before the search term. To match the last six characters, the '$' anchor was placed at the end of the search term instead.
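The effect of the two anchors can be sketched in Python's `re` module, assuming it behaves like the Perl flavor for '^' and '$'. The second half uses a hypothetical three-line requirements text and the `re.MULTILINE` flag (a Python-specific way of making '^' match at each line start) to show how an anchor separates identifiers from passing mentions.

```python
import re

name = "Spec-TRACER"

first_six = re.search(r"^......", name).group(0)  # anchored at the start
last_six = re.search(r"......$", name).group(0)   # anchored at the end

# Hypothetical document text: CD-001 appears once as an identifier at the
# start of a line and once as a passing mention in the middle of a line.
doc = (
    "CD-001 The system shall boot.\n"
    "See also CD-001 for details.\n"
    "CD-002 The system shall log."
)

all_mentions = re.findall(r"CD-\d{3}", doc)
identifiers = re.findall(r"^CD-\d{3}", doc, flags=re.MULTILINE)

print(first_six)     # Spec-T
print(last_six)      # TRACER
print(all_mentions)  # ['CD-001', 'CD-001', 'CD-002']
print(identifiers)   # ['CD-001', 'CD-002']
```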

Quantifiers

In some cases, the length of target codes is unknown, or can vary in size. The search patterns shown so far have all worked on a one-to-one type of match, and would not be able to handle such variability by themselves. For such cases, quantifiers may be used to specify how many times a certain element within a search pattern should occur. Quantifiers may be used to limit or expand the constraints of a particular search pattern, and they are implemented in a number of ways:

| Quantifier | Function | Example Regex | Matches |
|---|---|---|---|
| {n} | Matches the previous element exactly n times. | ID-\d{2} | "ID-22" in "ID-1, ID-22" |
| {n,m} | Matches the previous element between n and m times (inclusive). | ID-\d{1,2} | "ID-1" and "ID-22" in "ID-1, ID-22" |
| {n,} | Matches the previous element n or more times. | ID-\d{2,} | "ID-22" and "ID-333" in "ID-1, ID-22, ID-333" |
| * | Matches the previous element zero or more times. Equal to: {0,} | \D* | "ISO " in "ISO 26262" |
| + | Matches the previous element one or more times. Equal to: {1,} | \d+ | "26262" in "ISO 26262" |
| ? | Matches the previous element zero or one times. Equal to: {0,1} | ID\W?\d | "ID1", "ID.2", and "ID-3" in "ID1, ID.2, ID-3" |

As shown, quantifiers allow shorter regex while also providing the flexibility to capture variable-length terms. Instead of writing out as many digit "\d" search terms as required, the count can be specified by a simple quantifying pattern, either through the built-in quantifiers (*, +, and ?) or through the more customizable curly-bracket syntax. In either case, quantifiers are necessary for item or tag codes that do not share a uniform length.
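A short sketch of the quantifiers in Python's `re` module, assuming it agrees with the Perl flavor for these constructs. It also shows why '*' and '+' differ: because '*' accepts zero repetitions, it can succeed with an empty match where '+' finds nothing.

```python
import re

exactly_two = re.findall(r"ID-\d{2}", "ID-1, ID-22")
one_or_two = re.findall(r"ID-\d{1,2}", "ID-1, ID-22")
two_or_more = re.findall(r"ID-\d{2,}", "ID-1, ID-22, ID-333")
optional_sep = re.findall(r"ID\W?\d", "ID1, ID.2, ID-3")

# re.match only tries the pattern at the very start of the string.
leading_nondigits = re.match(r"\D*", "ISO 26262").group(0)
one_or_more_digits = re.findall(r"\d+", "ISO 26262")

# "\d*" succeeds at the start with an empty match (zero digits),
# whereas "\d+" needs at least one digit there and fails.
star_at_start = re.match(r"\d*", "ISO 26262").group(0)
plus_at_start = re.match(r"\d+", "ISO 26262")

print(exactly_two)         # ['ID-22']
print(one_or_two)          # ['ID-1', 'ID-22']
print(two_or_more)         # ['ID-22', 'ID-333']
print(optional_sep)        # ['ID1', 'ID.2', 'ID-3']
print(leading_nondigits)   # 'ISO ' (including the trailing space)
print(one_or_more_digits)  # ['26262']
```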

Special Characters

As we've seen in previous examples, a number of special characters are used for classes, anchors, and quantifiers. This means that searching for those particular characters within a string isn't as simple as including them in a search term, as they will be parsed using their special behaviors. When these symbols themselves are to be searched for, they can be included via a character escape, that is, preceded by a backslash '\'. Any special character preceded by a backslash is parsed as that character alone, with no special behavior. The inverse has been shown as well: when certain normal characters (such as letters) are preceded by a backslash, they take on special behavior, such as "\d" matching any single digit character.
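Escaping can be tried out with a short sketch in Python's `re` module, assuming it agrees with the Perl flavor for these constructs. The last two patterns contrast a negated class with a greedy ".*" between bracket characters: ".*" consumes as much as it can, so it runs to the last closing bracket in the string rather than the nearest one.

```python
import re

# "\\d" in the pattern means: a literal backslash, then the letter d.
literal_backslash_d = re.findall(r"\\d", r"\d searches for digits such as 123")

# "\." is a literal dot; an unescaped "." would match any character.
versions = re.findall(r"v1\.\d", "The latest versions are: v1.8, v1.9, and v2.0.")

# "\*" is a literal asterisk rather than the zero-or-more quantifier.
product = re.findall(r"\d\*\d", "9*8+11=83")

text = "{this} <is> [a] (test)"

# Escaped brackets inside classes; the negated class in the middle stops
# each match at the first closing bracket it meets.
bracketed = re.findall(r"[\[\{][^\]\}]*[\]\}]", text)

# With a greedy ".*" in the middle, one match spans to the last bracket.
greedy = re.findall(r"[\[\{].*[\]\}]", text)

print(literal_backslash_d)  # ['\\d']
print(versions)             # ['v1.8', 'v1.9']
print(product)              # ['9*8']
print(bracketed)            # ['{this}', '[a]']
print(greedy)               # ['{this} <is> [a]']
```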

| Example Regex | Description | Matches |
|---|---|---|
| \\d | Matches a backslash followed by 'd'. | "\d" in "\d searches for digits such as 123" |
| v1\.\d | Matches "v1." followed by any single digit. | "v1.8" and "v1.9" in "The latest versions are: v1.8, v1.9, and v2.0." |
| \d\*\d | Matches two digits with a '*' between them. | "9*8" in "9*8+11=83" |
| [\[\{][^\]\}]*[\]\}] | Matches any text contained in square or curly brackets. A negated class is used between the brackets because a greedy ".*" would run on to the last closing bracket in the string. | "{this}" and "[a]" in "{this} <is> [a] (test)" |

Parsing With Regex in Spec-TRACER

While there is much more to regular expressions such as additional escape characters, substitutions, grouping, alternation, and other miscellaneous constructs, the basics covered here can provide powerful search terms for capturing the types of item codes used for traceability. Looking at some of the examples included in Spec-TRACER, we can see how regex is used within the tool's different parsers to match certain tags within different files.


Following the "Traceability Tutorials using Parsers" guide within the user manual, we can set up the MS Word Parser like so to capture our concept design codes from a Microsoft Word document:

Figure 1. MS Word Parser Settings for Concept Design.


Notice that the parser here has a number of places for including different regular expressions. For simply capturing the item code, we can look at the first section and see the main regular expression provided to the parser is "CD-\d{3}". As you might recall, this search pattern will match the text "CD-" exactly, and then match three more digit-type characters. As such, "CD-002" will be included as a match within the document.

Figure 2. Concept Design item as it appears within the document.


Notice, however, that within this document the item code is enclosed in parentheses. Because item capture works by capturing all information between matches, the closing parenthesis would be included within the item if that single regular expression were provided by itself. We also can't include the parentheses within the main regular expression, as we want the item codes to be free of miscellaneous characters when entered into the database. In such a case we must provide precondition and postcondition regular expressions, which help locate the correct tags while capturing exactly the information we need. In the example above, the precondition looks for any number of characters followed by an open parenthesis (because parentheses are used for grouping within regular expressions, the character must follow a backslash to be matched literally). Similarly, the postcondition searches for the closing parenthesis.
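The effect of a precondition and postcondition can be approximated outside the tool. The sketch below is not how Spec-TRACER works internally. It assumes the three expressions behave as if joined end to end, with only the main expression kept, and it emulates that in Python with a capturing group. The pattern strings are one way of writing what the text describes, and the sample line is hypothetical, since the actual settings and document text appear only in the figures.

```python
import re

# One possible spelling of the three expressions described in the text.
precondition = r".*\("      # any characters, then a literal open parenthesis
main_pattern = r"CD-\d{3}"  # the item code itself
postcondition = r"\)"       # a literal closing parenthesis

# Hypothetical heading line from a Word document.
line = "Concept Design (CD-002)"

# Assumption: wrap only the main pattern in a group so that the surrounding
# parentheses are required for a match but excluded from the captured code.
combined = re.compile(precondition + "(" + main_pattern + ")" + postcondition)

match = combined.search(line)
item_code = match.group(1) if match else None

# Without the surrounding parentheses the combined pattern does not match.
no_parens = combined.search("Concept Design CD-002")

print(item_code)  # CD-002
print(no_parens)  # None
```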

NOTE: If either the precondition or the postcondition is empty, it acts as an anchor without the relevant character having to be added to the main regex. For example, an empty precondition acts as a '^' at the beginning of your regular expression, while an empty postcondition acts as a '$' at the end of it. As such, these anchors are not needed when forming your regex, but their behavior still applies.

Not only can regular expressions be used to match those item codes, but they are also used to automatically set up traceability links, as shown within the parser settings. For that particular configuration, any codes or tags appearing after the matched "Covers:" term will automatically be added to an item's upstream links.
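As a rough illustration of link capture, the sketch below finds a "Covers:" term and collects the codes that follow it. Everything apart from the "Covers:" keyword is an assumption: the "REQ-" naming scheme, the sample item text, and the two-step approach are hypothetical and are not taken from the parser settings.

```python
import re

# Hypothetical item body; "REQ-" codes are an assumed upstream naming scheme.
body = "The controller shall debounce inputs. Covers: REQ-010, REQ-011"

# Step 1: locate the "Covers:" term and keep the rest of the line.
covers = re.search(r"Covers:(.*)$", body)

# Step 2: pull every code of the assumed form out of that remainder.
upstream_links = re.findall(r"REQ-\d{3}", covers.group(1)) if covers else []

print(upstream_links)  # ['REQ-010', 'REQ-011']
```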


We can see a similar use of regular expressions within the Source Code Parser:

Figure 3. Source Code Parser Settings.


Looking at these settings, we can see that this regular expression will match any text "HDL_" followed by 3 digits. Because these tags are provided as comments, we can simply provide wildcards for our pre and postconditions.

Figure 4. HDL items within source code.


Looking at the source code, we can see that HDL_005 will be captured with this configuration. Notice that this time the tags are set up to encapsulate a certain part of the code by using BEGIN and END keywords. These keywords are recognized automatically and should not be included in the regular expression.
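The source code case can be sketched the same way. The comment layout below is hypothetical (VHDL-style "--" comments with the tag followed by BEGIN or END), since the actual source appears only in the figure, and joining the wildcard pre and postconditions around a capturing group is an assumption made for illustration, not a description of the tool's internals.

```python
import re

# Wildcard precondition and postcondition around the main tag pattern.
combined = re.compile(r".*" + r"(HDL_\d{3})" + r".*")

# Hypothetical source fragment with a tagged region.
source = """-- HDL_005 BEGIN
counter <= counter + 1;
-- HDL_005 END
"""

tags = []
for line in source.splitlines():
    match = combined.search(line)
    if match:
        # Only the tag is kept; BEGIN and END are not part of the regex.
        tags.append(match.group(1))

print(tags)  # ['HDL_005', 'HDL_005']
```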

Summary

Overall, regular expressions are necessary when it comes to parsing and matching the different item codes within requirements documents and source code files. Through various special characters and constructs, regex can be formed to match any syntax used for tagging content for a simple and convenient means of automatically detecting and parsing items for traceability.
