Strings : Parse Regular Expressions

0
1406

Parsing Regular Expressions at HiringLibrary-Com

How do you parse regular expressions?

Rationale: First, what is a regular expression? A regular expression, in short RegEx or RegExp, is a string describing a search pattern for another string, or even a whole file. With regular expressions you can find or replace specific parts of a text easily. They are also a powerful way of checking the validity of e-mail addresses, Internet domains or postal codes.

RegEx expressions define models (or templates). For example, a regular expression for an email address would tell us that it needs to contain an @ symbol and a domain. All we need to do at this point is parse the expression and check whether a given string matches the template it defines.

Let’s see how we write patterns using regex. The following characters are allowed:

^ [A-Z] {2}[0-9] +*% $

That and nothing more. Seems puzzling, but this is actually very simple. First, the character ^ marks the beginning of the string, and $ marks its end. Between these two characters we write the actual expression. % denotes any punctuation character.

Square brackets [] are used to denote a sequence of characters. We can define them some selectors, or by writing directly the characters we can think of. If, for example, we want our expression to only accept digits, we would use the selector: [0-9]. If we want a lower case letter we use [a-z] and [A-Z] for upper case letters. These selectors can be combined to make a sequence such as [a-zA-Z0-9_]. This tells us that the only strings accepted are those that start with a lowercase letter, continue with an uppercase one and with a digit, followed by a white space.

 

Braces {} are used to define lengths. For example, [AZ] {2} means a sequence of exactly two upper case letters. If you want to set limits on both sides, we can do this using, for example, {2,5} , which tells us that a sequence must have 2 to 5 characters. The operators + and * are also used for lengths. For example [a-z]+ means at least one (but could be more than one) lower-case letter and [a-z]*denotes any number of lower case letters.

 

Knowing this, we can write a regular expression that would serve for the validation of an email address, as follows:

^[A-Z0-9._%-][email protected][A-Z0-9.-]+.[A-Z]{2,4}$

This means we are allowing at least one lower-case, upper-case, digit or punctuation character, followed by ‘@’, followed by at least one letter/digit/punctuation character, name, a ‘.’, and a domain made of letters, between 2 and 4 characters.

 

Every programming language comes with support for regular expressions. In C++, for example, we can use the <regex> library. This standard C++ library provides support for various regular expressions, with useful functions such as regex_match and regex_search. In C++ the difference to the general regular expression form we presented so far is that instead of right brackets ([,]) we use round brackets. For example, the expression a+, which we will use in our simple example, accepts a sequence of ‘a’ characters.

 

Before solving this problem do the following (not limited to):

  • Clarify doubts
  • Write/loudly speak/analyze single or multiple pseudocode(e.g. trivial algorithm to advanced algorithm/Data Structures) which can lead to solution
  • Explain any assumptions or limitations on the written pseudocode
  • Mention major data structures to be used
  • Also how the positive/negative functional tests prove that the resultant solution is correct. 

Professional Teacher at HiringLibrary.com

Clarification Points 

Clarification Point-1:

What kind of characters does the regular expression contain? What is the maximum length of the strings it can accept?

 

Diversity in WorkPlace - Japanese,Korea,Chinese

Pseudo Code

Pseudo code :

Regex_match:
    regex expression(expression value)
    s = string we are checking 
    if (string matches regular expression)
       return true
    else
       return false

How did you find your first job?

Assumptions/Limitations

Assumption/Limitation:

  • We will assume our regular expression is a standard one, of type regex in C++
  • The strings we are checking against this regular expression will have at most 100 characters.

Career Summary from a Professional for HiringLibrar

Major Data Structures 

Major data structures:

Strings, regular expressions

Example at Hiring Library

Unit tests:  
Postive Functional Test cases  
Input Expected Output
Expression = “aa” string = “a” false
Expression = “a+” string = “aa”

Expression = “(0-9)” string = “0”

true

true

 

 
Negative Functional Test Cases
Expression=”#%!#^#!” string=any Undefined (Regex error)

Answer of the coding Problem

Here is the sample answer that implements the solution described above

// regex_match example
#include <cstdio>
#include <string>
#include <regex>

using namespace std;

int main ()
{
  char string1[100] = "aaa9";
  char string2[100] = "aaa";

  regex expression("a+");

  if (regex_match(string1, expression))
    printf("String %s matches expression!\n", string1);
  else
    printf("String %s does not match expression!\n", string1);

  if (regex_match(string2, expression))
      printf("String %s matches expression!\n", string2);
    else
      printf("String %s does not match expression!\n", string2);
  return 0;
}

Output at HiringLibrary-Com

Output and Further Excercise

Further Exercise at HiringLibrary

Output: Further exercise:
String aaa9 does not match expression!

String aaa matches expression!

Q: How would you implement the regex_match function manually?

A: This process requires turning the regular expression into a graph, where each node is a vertex in the graph and each character is used to label an edge. For example, the expression “a*b” would have 2 states, one of which can be reached with character ‘a’, while the other can be reached with character ‘b’:

RegularExpressions

We have marked the vertex on the right with a double line because it is a final state. Given a string, we would then have to determine whether there is a path to the final state, in which case the string is accepted by the regular expression. For example, ‘aa’ would not be accepted by the regular expression represented in the graph, because there is no path to the final state just by using a sequence of ‘a’s.

Because of the complexity of regular expressions, and the various operators it can contain, writing a manual parser is a complex task that requires plenty of lines of code. This is why every programming language comes with a pre-defined library, to facilitate working with such expressions.

A very basic parser, on the other hand, can be implemented without building a graph, simply by iterating through the regular expression and splitting it. There are three basic operations to take into account, concatenation (e.g. ab), union (e.g. a|b) and frequency (e.g. a{2} or a*). Suppose we have [a-z]{2}[0-9]. Given a string to match against this expression, this basic parser would firstly iterate through the regular expression and check the values between the sets of straight brackets. If a ‘-‘ is detected, it would just check its two extremities (a and z in this case) and check whether the first character of the given string has an ASCII character between these two values. Finally, it would also check the frequency when iteration through the regular expression continues, and so on..