R Language Implement State Machine Pattern using S4 Class Parsing Lines using State Machine

Help us to keep this website almost Ad Free! It takes only 10 seconds of your time:
> Step 1: Go view our video on YouTube: EF Core Bulk Insert
> Step 2: And Like the video. BONUS: You can also share it!

Example

Let's apply the State Machine pattern for parsing lines with the specific pattern using S4 Class feature from R.

PROBLEM ENUNCIATION

We need to parse a file where each line provides information about a person, using a delimiter (";"), but some information provided is optional, and instead of providing an empty field, it is missing. On each line we can have the following information: Name;[Address;]Phone. Where the address information is optional, sometimes we have it and sometimes don’t, for example:

GREGORY BROWN; 25 NE 25TH; +1-786-987-6543
DAVID SMITH;786-123-4567
ALAN PEREZ; 25 SE 50TH; +1-786-987-5553

The second line does not provide address information. Therefore the number of delimiters may be deferent like in this case with one delimiter and for the other lines two delimiters. Because the number of delimiters may vary, one way to atack this problem is to recognize the presence or not of a given field based on its pattern. In such case we can use a regular expression for identifying such patterns. For example:

  • Name: "^([A-Z]'?\\s+)* *[A-Z]+(\\s+[A-Z]{1,2}\\.?,? +)*[A-Z]+((-|\\s+)[A-Z]+)*$". For example: RAFAEL REAL, DAVID R. SMITH, ERNESTO PEREZ GONZALEZ, 0' CONNOR BROWN, LUIS PEREZ-MENA, etc.
  • Address: "^\\s[0-9]{1,4}(\\s+[A-Z]{1,2}[0-9]{1,2}[A-Z]{1,2}|[A-Z\\s0-9]+)$". For example: 11020 LE JEUNE ROAD, 87 SW 27TH. For the sake of simplicity we don't include here the zipcode, city, state, but I can be included in this field or adding additional fields.
  • Phone: "^\\s*(\\+1(-|\\s+))*[0-9]{3}(-|\\s+)[0-9]{3}(-|\\s+)[0-9]{4}$". For example: 305-123-4567, 305 123 4567, +1-786-123-4567.

Notes:

  • I am considering the most common pattern of US addresses and phones, it can be easy extended to consider more general situations.
  • In R the sign "\" has special meaning for character variables, therefore we need to escape it.
  • In order to simplify the process of defining regular expressions a good recommendation is to use the following web page: regex101.com, so you can play with it, with a given example, until you get the expected result for all possible combinations.

The idea is to identify each line field based on previously defined patterns. The State pattern define the following entities (classes) that collaborate to control the specific behavior (The State Pattern is a behavior pattern):

GOF State Pattern

Let's describe each element considering the context of our problem:

  • Context: Stores the context information of the parsing process, i.e. the current state and handles the entire State Machine Process. For each state, an action is executed (handle()), but the context delegates it, based on the state, on the action method defined for a particular state (handle() from State class). It defines the interface of interest to clients. Our Context class can be defined like this:
    • Attributes: state
    • Methods: handle(), ...
  • State: The abstract class that represents any state of the State Machine. It defines an interface for encapsulating the behavior associated with a particular state of the context. It can be defined like this:
    • Attributes: name, pattern
    • Methods: doAction(), isState (using pattern attribute verify whether the input argument belong to this state pattern or not), …
  • Concrete States (state sub-classes): Each subclass of the class State that implements a behavior associated with a state of the Context. Our sub-classes are: InitState, NameState, AddressState, PhoneState. Such classes just implements the generic method using the specific logic for such states. No additional attributes are required.

Note: It is a matter of preference how to name the method that carries out the action, handle(), doAction() or goNext(). The method name doAction() can be the same for both classes (Stateor Context) we preferred to name as handle() in the Context class for avoiding a confusion when defining two generic methods with the same input arguments, but different class.

PERSON CLASS

Using the S4 syntax we can define a Person class like this:

setClass(Class = "Person",
    slots = c(name = "character", address = "character", phone = "character")
)

It is a good recommendation to initialize the class attributes. The setClass documentation suggests using a generic method labeled as "initialize", instead of using deprecated attributes such as: prototype, representation.

setMethod("initialize", "Person",
  definition = function(.Object, name = NA_character_,
    address = NA_character_, phone = NA_character_) {
        .Object@name <- name
        .Object@address <- address
        .Object@phone <- phone
        .Object
    }
)

Because the initialize method is already a standard generic method of package methods, we need to respect the original argument definition. We can verify it typing on R prompt:

> initialize

It returns the entire function definition, you can see at the top who the function is defined like:

function (.Object, ...) {...}

Therefore when we use setMethod we need to follow exaclty the same syntax (.Object).

Another existing generic method is show, it is equivalent toString() method from Java and it is a good idea to have a specific implementation for class domain:

setMethod("show", signature = "Person",
  definition = function(object) {
      info <- sprintf("%s@[name='%s', address='%s', phone='%s']", 
        class(object), object@name, object@address, object@phone)
      cat(info)
      invisible(NULL)
  }
)

Note: We use the same convention as in the default toString() Java implementation.

Let's say we want to save the parsed information (a list of Person objects) into a dataset, then we should be able first to convert a list of objects to into something the R can transform (for example coerce the object as a list). We can define the following additional method (for more detail about this see the post)

setGeneric(name = "as.list", signature = c('x'),
    def = function(x) standardGeneric("as.list"))

# Suggestion taken from here:
# http://stackoverflow.com/questions/30386009/how-to-extend-as-list-in-a-canonical-way-to-s4-objects
setMethod("as.list", signature = "Person",
    definition = function(x) {
        mapply(function(y) {
        #apply as.list if the slot is again an user-defined object
        #therefore, as.list gets applied recursively
        if (inherits(slot(x,y),"Person")) {
          as.list(slot(x,y))
        } else {
          #otherwise just return the slot
          slot(x,y)
        }
      },
        slotNames(class(x)),
        SIMPLIFY=FALSE)
    }
)

R does not provide a sugar syntax for OO because the language was initially conceived to provide valuable functions for Statisticians. Therefore each user method requires two parts: 1) the Definition part (via setGeneric) and 2) the implementation part (via setMethod). Like in the above example.

STATE CLASS

Following S4 syntax, let's define the abstract State class.

setClass(Class = "State", slots = c(name = "character", pattern = "character"))

setMethod("initialize", "State",
  definition = function(.Object, name = NA_character_, pattern = NA_character_) {
      .Object@name <- name
      .Object@pattern <- pattern
      .Object
  }
)

setMethod("show", signature = "State",
  definition = function(object) {
      info <- sprintf("%s@[name='%s', pattern='%s']", class(object), 
          object@name, object@pattern)
      cat(info)
      invisible(NULL)
  }
)

setGeneric(name = "isState", signature = c('obj', 'input'),
    def = function(obj, input) standardGeneric("isState"))

setGeneric(name = "doAction", signature = c('obj', 'input', 'context'),
    def = function(obj, input, context) standardGeneric("doAction"))

Every sub-class from State will have associated a name and pattern, but also a way to identify whether a given input belongs to this state or not (isState() method), and also implement the corresponding actions for this state (doAction() method).

In order to understand the process, let's define the transition matrix for each state based on the input received:

Input/Current StateInitNameAddressPhone
NameName
AddressAddress
PhonePhonePhone
EndEnd

Note: The cell [row, col]=[i,j] represents the destination state for the current state j, when it receives the input i.

It means that under the state Name it can receive two inputs: an address or a phone number. Another way to represents the transaction table is using the following UML State Machine diagram:

State Machine Diagram representation

Let's implement each particular state as a sub-state of the class State

STATE SUB-CLASSES

Init State:

The initial state will be implemented via the following class:

setClass("InitState", contains = "State")

setMethod("initialize", "InitState",
  definition = function(.Object, name = "init", pattern = NA_character_) {
      .Object@name <- name
      .Object@pattern <- pattern
      .Object
  }
)

setMethod("show", signature = "InitState",
  definition = function(object) {
      callNextMethod()
  }
)

In R to indicate a class is a sub-class of other class is using the attribute contains and indicating the class name of the parent class.

Because the sub-classes just implement the generic methods, without adding additional attributes, then the show method, just call the equivalent method from the upper class (via method: callNextMethod())

The initial state does not have associated a pattern, it just represents the beginning of the process, then we initialize the class with an NA value.

Now lets to implement the generic methods from the State class:

setMethod(f = "isState", signature = "InitState",
  definition = function(obj, input) {
      nameState <- new("NameState")
      result <- isState(nameState, input)
      return(result)
  }
)

For this particular state (without pattern), the idea it just initializes the parsing process expecting the first field will be a name, otherwise it will be an error.

setMethod(f = "doAction", signature = "InitState",
    definition = function(obj, input, context) {
        nameState <- new("NameState")
        if (isState(nameState, input)) {
            person <- context@person
            person@name <- trimws(input)
            context@person <- person
            context@state <- nameState
        } else {
            msg <- sprintf("The input argument: '%s' cannot be identified", input)
            stop(msg)
        }
        return(context)
    }
)

The doAction method provides the transition and updates the context with the information extracted. Here we are accessing to context information via the @-operator. Instead, we can define get/set methods, to encapsulate this process (as it is mandated in OO best practices: encapsulation), but that would add four more methods per get-set without adding value for the purpose of this example.

It is a good recommendation in all doAction implementation, to add a safeguard when the input argument is not properly identified.

Name State

Here is the definition of this class definition:

setClass ("NameState", contains = "State")

setMethod("initialize","NameState",
  definition=function(.Object, name="name",
        pattern = "^([A-Z]'?\\s+)* *[A-Z]+(\\s+[A-Z]{1,2}\\.?,? +)*[A-Z]+((-|\\s+)[A-Z]+)*$") {
        .Object@pattern <- pattern
        .Object@name <- name
        .Object       
  }
)

setMethod("show", signature = "NameState",
  definition = function(object) {
      callNextMethod()
  }
)

We use the function grepl for verifying the input belongs to a given pattern.

setMethod(f="isState", signature="NameState",
  definition=function(obj, input) {
      result <- grepl(obj@pattern, input, perl=TRUE)
      return(result)
  }
)

Now we define the action to carry out for a given state:

setMethod(f = "doAction", signature = "NameState",
  definition=function(obj, input, context) {
      addressState <- new("AddressState")
      phoneState <- new("PhoneState")
      person <- context@person
      if (isState(addressState, input)) {
          person@address <- trimws(input)
          context@person <- person
          context@state <- addressState
      } else if (isState(phoneState, input)) {
          person@phone <- trimws(input)
          context@person <- person
          context@state <- phoneState
      } else {
          msg <- sprintf("The input argument: '%s' cannot be identified", input)
          stop(msg)
      }
      return(context)
  }
)

Here we consider to possible transitions: one for Address state and the other one for Phone state. In all cases we update the context information:

  • The person information: address or phone with the input argument.
  • The state of the process

The way to identify the state is to invoke the method: isState() for a particular state. We create a default specific states (addressState, phoneState) and then ask for a particular validation.

The logic for the other sub-classes (one per state) implementation is very similar.

Address State

setClass("AddressState", contains = "State")

setMethod("initialize", "AddressState",
  definition = function(.Object, name="address",
    pattern = "^\\s[0-9]{1,4}(\\s+[A-Z]{1,2}[0-9]{1,2}[A-Z]{1,2}|[A-Z\\s0-9]+)$") {
        .Object@pattern <- pattern
        .Object@name <- name
        .Object
    }
)

setMethod("show", signature = "AddressState",
  definition = function(object) {
      callNextMethod()
  }
)

setMethod(f="isState", signature="AddressState",
    definition=function(obj, input) {
        result <- grepl(obj@pattern, input, perl=TRUE)
        return(result)
    }
)
    
setMethod(f = "doAction", "AddressState",
    definition=function(obj, input, context) {
        phoneState <- new("PhoneState")
        if (isState(phoneState, input)) {
            person <- context@person
            person@phone <- trimws(input)
            context@person <- person
            context@state <- phoneState
        } else {
            msg <- sprintf("The input argument: '%s' cannot be identified", input)
            stop(msg)
        }
        return(context)
    }
)

Phone State

setClass("PhoneState", contains = "State")

setMethod("initialize", "PhoneState",
  definition = function(.Object, name = "phone",
    pattern = "^\\s*(\\+1(-|\\s+))*[0-9]{3}(-|\\s+)[0-9]{3}(-|\\s+)[0-9]{4}$") {
        .Object@pattern <- pattern
        .Object@name <- name
        .Object
    }
)

setMethod("show", signature = "PhoneState",
  definition = function(object) {
      callNextMethod()
  }
)

setMethod(f = "isState", signature = "PhoneState",
    definition = function(obj, input) {
        result <- grepl(obj@pattern, input, perl = TRUE)
        return(result)
    }
)

Here is where we add the person information into the list of persons of the context.

setMethod(f = "doAction", "PhoneState",
    definition = function(obj, input, context) {
        context <- addPerson(context, context@person)
        context@state <- new("InitState")
        return(context)
    }   
)

CONTEXT CLASS

Now the lets to explain the Context class implementation. We can define it considering the following attributes:

setClass(Class = "Context",
     slots = c(state = "State", persons = "list", person = "Person")
)

Where

  • state: The current state of the process
  • person: The current person, it represents the information we have already parsed from the current line.
  • persons: The list of parsed persons processed.

Note: Optionally, we can add a name to identify the context by name in case we are working with more than one parser type.

setMethod(f="initialize", signature="Context",
  definition = function(.Object) {
        .Object@state <- new("InitState")
        .Object@persons <- list()
        .Object@person <- new("Person")
        return(.Object)
    }
)

setMethod("show", signature = "Context",
  definition = function(object) {
      cat("An object of class ", class(object), "\n", sep = "")
      info <- sprintf("[state='%s', persons='%s', person='%s']", object@state,
          toString(object@persons), object@person)
      cat(info)
      invisible(NULL)
  }
)

setGeneric(name = "handle", signature = c('obj', 'input', 'context'),
    def = function(obj, input, context) standardGeneric("handle"))

setGeneric(name = "addPerson", signature = c('obj', 'person'),
    def = function(obj, person) standardGeneric("addPerson"))

setGeneric(name = "parseLine", signature = c('obj', 's'),
    def = function(obj, s) standardGeneric("parseLine"))

setGeneric(name = "parseLines", signature = c('obj', 's'),
    def = function(obj, s) standardGeneric("parseLines"))

setGeneric(name = "as.df", signature = c('obj'),
    def = function(obj) standardGeneric("as.df"))

With such generic methods, we control the entire behavior of the parsing process:

  • handle(): Will invoke the particular doAction() method of the current state.
  • addPerson: Once we reach the end state, we need to add a person to the list of persons we have parsed.
  • parseLine(): Parse a single line
  • parseLines(): Parse multiple lines (an array of lines)
  • as.df(): Extract the information from persons list into a data frame object.

Let's go on now with the corresponding implementations:

handle() method, delegates on doAction() method from the current state of the context:

setMethod(f = "handle", signature = "Context",
    definition = function(obj, input) {
        obj <- doAction(obj@state, input, obj)
        return(obj)
    }
)

setMethod(f = "addPerson", signature = "Context",
  definition = function(obj, person) {
      obj@persons <- c(obj@persons, person)
      return(obj)
  }
)

First, we split the original line in an array using the delimiter to identify each element via the R-function strsplit(), then iterate for each element as an input value for a given state. The handle() method returns again the context with the updated information (state, person, persons attribute).

setMethod(f = "parseLine", signature = "Context",
  definition = function(obj, s) {
      elements <- strsplit(s, ";")[[1]]
      # Adding an empty field for considering the end state.
      elements <- c(elements, "")
      n <- length(elements)
      input <- NULL
      for (i in (1:n)) {
        input <- elements[i]  
        obj <- handle(obj, input)
      }
      return(obj@person)
  }
)

Becuase R makes a copy of the input argument, we need to return the context (obj):

setMethod(f = "parseLines", signature = "Context",
  definition = function(obj, s) {
      n <- length(s)
      listOfPersons <- list()
      for (i in (1:n)) {
          ipersons <- parseLine(obj, s[i])
          listOfPersons[[i]] <- ipersons
      }
      obj@persons <- listOfPersons
      return(obj)
  }
)

The attribute persons is a list of instance of S4 Person class. This something cannot be coerced to any standard type because R does not know of to treat an instance of a user defined class. The solution is to convert a Person into a list, using the as.list method previously defined. Then we can apply this function to each element of the list persons, via the lapply() function. Then in the next invocation to lappy() function, now applies the data.frame function for converting each element of the persons.list into a data frame. Finally, the rbind() function is called for adding each element converted as a new row of the data frame generated (for more detail about this see this post)

# Sugestion taken from this post:
# http://stackoverflow.com/questions/4227223/r-list-to-data-frame
setMethod(f = "as.df", signature = "Context",
  definition = function(obj) {
    persons <- obj@persons
    persons.list <- lapply(persons, as.list)
    persons.ds <- do.call(rbind, lapply(persons.list, data.frame, stringsAsFactors = FALSE))
    return(persons.ds)
  }
)

PUTTING ALL TOGETHER

Finally, lets to test the entire solution. Define the lines to parse where for the second line the address information is missing.

s <- c(
    "GREGORY BROWN; 25 NE 25TH; +1-786-987-6543",
    "DAVID SMITH;786-123-4567",
     "ALAN PEREZ; 25 SE 50TH; +1-786-987-5553"
)

Now we initialize the context, and parse the lines:

context <- new("Context")
context <- parseLines(context, s)

Finally obtain the corresponding dataset and print it:

df <- as.df(context)
> df
           name    address           phone
1 GREGORY BROWN 25 NE 25TH +1-786-987-6543
2   DAVID SMITH       <NA>    786-123-4567
3    ALAN PEREZ 25 SE 50TH +1-786-987-5553

Let's test now the show methods:

> show(context@persons[[1]])
Person@[name='GREGORY BROWN', address='25 NE 25TH', phone='+1-786-987-6543']

And for some sub-state:

>show(new("PhoneState"))
PhoneState@[name='phone', pattern='^\s*(\+1(-|\s+))*[0-9]{3}(-|\s+)[0-9]{3}(-|\s+)[0-9]{4}$']

Finally, test the as.list() method:

> as.list(context@persons[[1]])
$name
[1] "GREGORY BROWN"

$address
[1] "25 NE 25TH"

$phone
[1] "+1-786-987-6543"

> 

CONCLUSION

This example shows how to implement the State pattern, using one of the available mechanisms from R for using the OO paradigm. Nevertheless, the R OO solution is not user-friendly and differs so much from other OOP languages. You need to switch your mindset because the syntax is completely different, it reminds more the functional programming paradigm. For example instead of: object.setID("A1") as in Java/C#, for R you have to invoke the method in this way: setID(object, "A1"). Therefore you always have to include the object as an input argument to provide the context of the function. On the same way, there is no special this class attribute and either a "." notation for accessing methods or attributes of the given class. It is more error prompt because to refer a class or methods is done via attribute value ("Person", "isState", etc.).

Said the above, S4 class solution, requires much more lines of codes than a traditional Java/C# languages for doing simple tasks. Anyway, the State Pattern is a good and generic solution for such kind of problems. It simplifies the process delegating the logic into a particular state. Instead of having a big if-else block for controlling all situations, we have smaller if-else blocks inside on each State sub-class implementation for implementing the action to carry out in each state.

Attachment: Here you can download the entire script.

Any suggestion is welcome.



Got any R Language Question?