tauZaman DESIGN DECISIONS

tauZaman DESIGN DECISIONS (Parsing)

Parsing in the system has two components. Input parsing takes a temporal constant and converts it into a timestamp. Output parsing takes a timestamp and converts it into a temporal constant. This document does not cover the whole process of forming a timestamp from a temporal constant or vice versa. It focuses on one question: how do we fetch the relevant information from an input?

For input parsing, we are given a temporal constant as a String and a Property. We first have to form the Property's format, then apply it to the temporal constant to extract the useful information (the field values), if possible. For output parsing, we are given a timestamp and again a Property. We have to fetch the useful information from the timestamp and apply the Property's format to it to produce a formatted temporal constant. These two processes are the reverse of each other, but they share an important part: a Property is always given. For Input, useful information is fetched from the input string according to the Property's Format and put into Fields, which are derived from the Property. For Output, useful information is fetched from Fields, which are derived from the Property, and written out according to the Property's Format. Both therefore start by forming the Format and Fields of the given Property. So we first explain how Format and Fields are derived from a Property, and then move on to Input/Output parsing.

Format, Fields Derivation and Input/Output Parsing

A Property is represented as XML and contains three different elements: Format, FieldInfo and ImportFormat. (There is also another Property type, called a "simple Property", which contains only a string value and no sub-elements.) Explanations of these elements can be found either in the API or here.

If a Property contains a Format in which a variable name points to an ImportFormat, then processing this Format requires another Property, the one linked by the ImportFormat. We call this situation recursive parsing, since while parsing a Format we have to get another Format, imported through the ImportFormat. Because of this, we form a flattened Format for a given Property: we go through all the ImportFormats, get the imported Formats and merge them with the original Format. While doing this, we also form Fields according to each individual Format and its corresponding FieldInfo elements. So when we merge Formats, we also merge their Fields. We'll see this with an example, but let's first look at an alternative to this approach.

Instead of building a Format and Fields, we could use a "Suspension and Resumption" parsing technique. In this style, we would go through a Format and an input simultaneously, fetching useful information. When we come across an ImportFormat, we suspend, fetch the Format linked by that ImportFormat, and then resume parsing, remembering our original Format. We did not go with this approach for the following reasons:

It is clearly more complex than the building style, both to implement and to understand.
To keep response time small, we plan to cache Properties once they are parsed. With the "Suspension and Resumption" technique, however, we still have to get the Format linked by each ImportFormat whether or not it is cached. With the building style, once a parsed Property is cached, we can use it without importing anything.
In the building style, building Format and Fields from a given Property is separated from fetching useful information from the input string. In the "Suspension and Resumption" technique the two have to proceed in parallel, which makes the implementation hard to write and hard to understand.

Now, having stated the alternative approach, let's see an example.

Assume we have an IndeterminateInstantInputFormat as follows:

<property name = "IndeterminateInstantInputFormat">
       <value>

             <format>

                            <support> $lower </support>

                            <support> $upper </support>

             </format>


            <importFormat variable = "lower" name = "InstantInputFormat" />
            <importFormat variable = "upper" name = "InstantInputFormat" />


       </value>
</property>

And assume that (since the importFormat elements do not have url attributes) we have InstantInputFormat as follows:

<property name = "InstantInputFormat">
       <value>

             <format>

                 <instant month = "$month" year = "$year" day = "$day" />

             </format>


             <fieldInfo variable = "month" name = "monthOfYear" using = "EnglishMonthNames" />

             <fieldInfo variable = "day" name = "dayOfMonth" using = "ArabicNumeral" />

             <fieldInfo variable = "year" name = "year" using = "ArabicNumeral" />

       </value>
</property>

When we are given the "IndeterminateInstantInputFormat" above, Input first asks the related Property (fetched via PropertyManager) whether it has a PropertyCache. If it does, Format and Fields are fetched from the cache without any further calls, and the process advances to parsing. Otherwise, Input calls FieldsBuilder to form Format and Fields.

FieldsBuilder takes three parameters: the name of the Property, the url of the Property, and a PropertyManager (for loading Properties linked by ImportFormats). It first tries to get the actual Property object. If the url is null, the Property object is already in the Property stack; otherwise the Property has to be loaded first. In either case, PropertyManager returns the Property object. FieldsBuilder also allocates a new Fields with the name of the Property. After these initialization steps, FieldsBuilder fetches the Format element of the Property and calls FormatParser to tokenize the Format string into variables and non-variables. For example, the Format string "This is my $month and$year and $day" is tokenized into the String array {"This is my ", "$month", " and", "$year", " and ", "$day"}. FieldsBuilder then takes all variable tokens and checks them against the FieldInfo and ImportFormat elements. If a variable corresponds to a FieldInfo, a Field is added to Fields. If a variable corresponds to an ImportFormat element, a FieldsBuilder object is constructed recursively with the information in that ImportFormat, and the Fields it builds are merged in. The process repeats until every variable is resolved.
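The recursive construction described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: PropertyDef, Node, build, describe and exampleRepo are hypothetical names invented here, Properties are held in a plain in-memory map instead of being loaded through PropertyManager, and leaf Fields are added in the order their variables occur in the Format string.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these types are the real tauZaman classes.
class FieldsBuilderSketch {
    // In-memory stand-in for a Property that has already been loaded.
    static class PropertyDef {
        final String format;
        final Map<String, String> fieldInfos = new HashMap<>(); // variable -> field name
        final Map<String, String> imports = new HashMap<>();    // variable -> imported Property name
        PropertyDef(String format) { this.format = format; }
    }

    // A node is a leaf Field (no children) or a nested Fields (one per Property).
    static class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    // '$' is escaped because a bare '$' is an anchor in java.util.regex.
    private static final Pattern VARIABLE = Pattern.compile("\\$([a-zA-Z]+)");

    // Builds the Fields tree for one Property, recursing into every ImportFormat.
    // No cycle detection: a Property that imports itself would recurse forever.
    static Node build(String propertyName, Map<String, PropertyDef> repo) {
        PropertyDef property = repo.get(propertyName);
        if (property == null) {
            throw new IllegalArgumentException("Unknown property: " + propertyName);
        }
        Node fields = new Node(propertyName);
        Matcher m = VARIABLE.matcher(property.format);
        while (m.find()) {
            String variable = m.group(1);
            if (property.fieldInfos.containsKey(variable)) {
                fields.children.add(new Node(variable));          // plain Field
            } else if (property.imports.containsKey(variable)) {
                fields.children.add(build(property.imports.get(variable), repo)); // nested Fields
            } else {
                throw new IllegalStateException("Unbound variable $" + variable);
            }
        }
        return fields;
    }

    // Renders the tree on one line, e.g. Parent[Child[a, b], Child[a, b]].
    static String describe(Node node) {
        if (node.children.isEmpty()) return node.label;
        List<String> parts = new ArrayList<>();
        for (Node child : node.children) parts.add(describe(child));
        return node.label + "[" + String.join(", ", parts) + "]";
    }

    // The two Properties from the example above, with Formats kept as raw strings.
    static Map<String, PropertyDef> exampleRepo() {
        PropertyDef instant = new PropertyDef(
            "<instant month = \"$month\" year = \"$year\" day = \"$day\" />");
        instant.fieldInfos.put("month", "monthOfYear");
        instant.fieldInfos.put("day", "dayOfMonth");
        instant.fieldInfos.put("year", "year");
        PropertyDef indeterminate = new PropertyDef(
            "<support> $lower </support><support> $upper </support>");
        indeterminate.imports.put("lower", "InstantInputFormat");
        indeterminate.imports.put("upper", "InstantInputFormat");
        Map<String, PropertyDef> repo = new HashMap<>();
        repo.put("InstantInputFormat", instant);
        repo.put("IndeterminateInstantInputFormat", indeterminate);
        return repo;
    }

    public static void main(String[] args) {
        System.out.println(describe(build("IndeterminateInstantInputFormat", exampleRepo())));
    }
}
```

Because the tree is built once per Property, the result is a natural unit to cache, which is the main argument made above for the building style.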

Once FieldsBuilder has formed the new Format and Fields, the only thing that remains for Input is to parse the input string and the format string (fetched from Format) into DOM objects and traverse them in parallel, "to fetch useful information from input according to format into Fields". Useful information can only be stored in Text nodes or Attr node values of the input DOM. So when Input comes across one of these in both input and format, it fetches the string from each, and a further matching process takes place between the two strings. As an example, the input string might be " %September some text" and the format string "%$month some text ". Applying the process to these strings gives {$month -> "September"}, and the process fills the Field named "month" in the corresponding Fields. Since Fields can contain Fields recursively, the Field to fill is found by searching for the first Field whose name equals "month". (Fields have markers, so once a Field is filled it is marked as "dirty".)
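The lookup of the Field to fill can be sketched as a depth-first search. The sketch below assumes that "first" means the first Field with the given name that is not yet marked dirty, which is one reading of the description above and would explain how the two imported InstantInputFormats receive different values. The class and method names (Field, firstUnfilled, fill, exampleTree) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of filling recursive Fields; not the real tauZaman classes.
class FieldsFillSketch {
    static class Field {
        final String name;
        String value;   // null until filled
        boolean dirty;  // set once a value has been stored
        final List<Field> children = new ArrayList<>(); // non-empty for nested Fields
        Field(String name) { this.name = name; }
        Field add(Field child) { children.add(child); return this; }
    }

    // Depth-first search, in tree order, for the first leaf Field with the
    // given name that has not been filled yet (assumption: dirty ones are skipped).
    static Field firstUnfilled(Field root, String name) {
        if (root.children.isEmpty()) {
            return (root.name.equals(name) && !root.dirty) ? root : null;
        }
        for (Field child : root.children) {
            Field hit = firstUnfilled(child, name);
            if (hit != null) return hit;
        }
        return null;
    }

    // Stores the value and marks the Field dirty; returns false if no Field is left.
    static boolean fill(Field root, String name, String value) {
        Field target = firstUnfilled(root, name);
        if (target == null) return false;
        target.value = value;
        target.dirty = true;
        return true;
    }

    // Empty Fields shaped like the IndeterminateInstantInputFormat example.
    static Field exampleTree() {
        Field root = new Field("IndeterminateInstantInputFormat");
        for (int i = 0; i < 2; i++) {
            root.add(new Field("InstantInputFormat")
                .add(new Field("month")).add(new Field("day")).add(new Field("year")));
        }
        return root;
    }

    public static void main(String[] args) {
        Field root = exampleTree();
        fill(root, "month", "September"); // lands in the first InstantInputFormat
        fill(root, "month", "October");   // first one is dirty, so lands in the second
        System.out.println(root.children.get(1).children.get(0).value);
    }
}
```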

Once Fields are filled successfully, Input calls IOMultiplexer to get the Granule(s) produced from them. We will not go into the details of IOMultiplexer here. In short, it chooses a strategy according to the name of the Fields and forms the Granule(s) by using the implementations of the Calendars of the CalendricSystem in focus.

For Output, the process is the same up to the building of Format and Fields from a given Property. Once they are formed, however, Output fills the Fields by calling IOMultiplexer directly with the Fields and the Granule(s) it has. IOMultiplexer fills and returns the Fields by using the implementations of the Calendars of the CalendricSystem in focus. After that, Output tokenizes Format using FormatParser and builds the output string by replacing the variable tokens with their corresponding Field values.
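The last step of Output, replacing variable tokens with Field values, can be sketched as follows. This is a simplified illustration, not the actual implementation: the render method and the flat variable-to-value map are assumptions made here, whereas the real Fields are recursive and may contain the same variable more than once.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the substitution step; not the real tauZaman Output class.
class OutputFormatSketch {
    // '$' is escaped because a bare '$' is an anchor in java.util.regex.
    private static final Pattern VARIABLE = Pattern.compile("\\$([a-zA-Z]+)");

    // Copies non-variable text as is and replaces each $variable with its value.
    static String render(String format, Map<String, String> values) {
        StringBuilder out = new StringBuilder();
        Matcher m = VARIABLE.matcher(format);
        int last = 0;
        while (m.find()) {
            out.append(format, last, m.start());      // literal text before the variable
            String value = values.get(m.group(1));
            if (value == null) {
                throw new IllegalStateException("No value for $" + m.group(1));
            }
            out.append(value);
            last = m.end();
        }
        out.append(format, last, format.length());    // trailing literal text
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("month", "November");
        values.put("year", "2004");
        values.put("day", "24");
        System.out.println(render("This is my $month and$year and $day", values));
    }
}
```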

Having gone through the whole process, let's apply it to our example above. Input will first form Format and Fields:

</format>

and

. IndeterminateInstantInputFormat
    . InstantInputFormat
        . name = "month", name = "monthOfYear", using = "EnglishMonthNames", value = null
        . name = "day", name = "dayOfMonth", using = "ArabicNumeral", value = null
        . name = "year", name = "year", using = "ArabicNumeral", value = null
    . InstantInputFormat
        . name = "month", name = "monthOfYear", using = "EnglishMonthNames", value = null
        . name = "day", name = "dayOfMonth", using = "ArabicNumeral", value = null
        . name = "year", name = "year", using = "ArabicNumeral", value = null

As can be seen, the values are null. Input then parses the format and the input string into DOM. Let's say that our input is;

As we traverse input and format, we get filled Fields:

. IndeterminateInstantInputFormat
    . InstantInputFormat
        . name = "month", name = "monthOfYear", using = "EnglishMonthNames", value = 11
        . name = "day", name = "dayOfMonth", using = "ArabicNumeral", value = 24
        . name = "year", name = "year", using = "ArabicNumeral", value = 2004
    . InstantInputFormat
        . name = "month", name = "monthOfYear", using = "EnglishMonthNames", value = 10
        . name = "day", name = "dayOfMonth", using = "ArabicNumeral", value = 12
        . name = "year", name = "year", using = "ArabicNumeral", value = 2003

Finally, Input calls IOMultiplexer to get the Granule(s) produced from these Fields.

An example for Output follows the same steps and differs only in the final stage.

One issue needs further explanation for Input: once we get down to traversing the input and format DOM trees and have strings to match, how do we fetch the useful information from the input? Let's use a slightly different version of the previous example to see this problem. Say we have "%September Turco12 some text" fetched from a Text node of the input and "%$month12 some text " fetched from a Text node of the format. Also note that at this point we have the Fields shown above. How do we get the {month, "September Turco"} pair?

At this point Input again makes use of FormatParser to tokenize the format, in this case "%$month12 some text " into {"%", "$month", "12 some text "} (the regular expression for variables is "$[a-zA-Z]+", where "$" stands for a literal dollar sign). After tokenizing, Input matches each token against the input. Non-variable tokens need an exact match, except for whitespace handling: if the whitespace characteristic of the format is friendly, non-matching whitespace may simply be ignored. For variable tokens, we fetch the regular expression, here for month, from the corresponding Field in Fields by using its "using" attribute. If the FVSupport that corresponds to the "using" attribute does not have a regular expression, the default regular expression, which can be found in CalendricSystem, is used.
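The tokenization step can be sketched with java.util.regex as below. FormatTokenizerSketch and tokenize are hypothetical names and not the real FormatParser API. Note that in a Java regular expression the dollar sign has to be escaped, so the pattern is written as "\\$[a-zA-Z]+" in source code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for FormatParser; not the actual tauZaman class.
class FormatTokenizerSketch {
    // A variable is a literal '$' followed by one or more ASCII letters.
    private static final Pattern VARIABLE = Pattern.compile("\\$[a-zA-Z]+");

    // Splits a format string into alternating non-variable and variable tokens.
    static List<String> tokenize(String format) {
        List<String> tokens = new ArrayList<>();
        Matcher m = VARIABLE.matcher(format);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                tokens.add(format.substring(last, m.start())); // text before the variable
            }
            tokens.add(m.group());                             // the variable itself
            last = m.end();
        }
        if (last < format.length()) {
            tokens.add(format.substring(last));                // trailing text
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The digits end the variable name, so "12 some text " is a separate token.
        System.out.println(tokenize("%$month12 some text "));
    }
}
```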

With the regular expression in hand, we apply it to the input and look for lookingAt behavior: the regular expression must match from the beginning of the input, but not necessarily to the end. The first match, which is the maximal one, is fetched as the useful information and put into the Field's value. In our example:

Once we get to "September Turco12 some text" in the input, with the regular expression "[a-zA-Z]+ [a-zA-Z]+" for the token "$month", applying this regular expression to that input gives "September Turco" as the month name and leaves the rest as the remaining input. The process continues like this until all of the input and the format are consumed. The implementation also contains other checks and details that are not mentioned here for clarity.
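The matching loop can be sketched with Matcher.region and Matcher.lookingAt from java.util.regex. This is a minimal illustration under assumptions: InputMatchSketch and match are hypothetical names, the regular expressions are passed in as a plain map instead of being looked up through FVSupport, and literal text must match exactly, so the whitespace-friendly mode is left out.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of matching one format string against one input string.
class InputMatchSketch {
    private static final Pattern VARIABLE = Pattern.compile("\\$([a-zA-Z]+)");

    static Map<String, String> match(String format, String input,
                                     Map<String, String> regexByVariable) {
        Map<String, String> values = new LinkedHashMap<>();
        Matcher vars = VARIABLE.matcher(format);
        int formatPos = 0;
        int inputPos = 0;
        while (vars.find()) {
            // Literal text before the variable must match the input exactly.
            inputPos = expectLiteral(format.substring(formatPos, vars.start()), input, inputPos);
            String variable = vars.group(1);
            String regex = regexByVariable.get(variable);
            if (regex == null) {
                throw new IllegalArgumentException("No regular expression for $" + variable);
            }
            // lookingAt: must match at the start of the region, not necessarily to its end.
            Matcher m = Pattern.compile(regex).matcher(input).region(inputPos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Input does not match $" + variable);
            }
            values.put(variable, m.group());
            inputPos = m.end();
            formatPos = vars.end();
        }
        inputPos = expectLiteral(format.substring(formatPos), input, inputPos);
        if (inputPos != input.length()) {
            throw new IllegalArgumentException("Unconsumed input at index " + inputPos);
        }
        return values;
    }

    // Returns the input position just after the literal, or fails if it is absent.
    private static int expectLiteral(String literal, String input, int at) {
        if (!input.startsWith(literal, at)) {
            throw new IllegalArgumentException("Expected \"" + literal + "\" at index " + at);
        }
        return at + literal.length();
    }

    public static void main(String[] args) {
        Map<String, String> regexes = new HashMap<>();
        regexes.put("month", "[a-zA-Z]+ [a-zA-Z]+");
        System.out.println(match("%$month12 some text", "%September Turco12 some text", regexes));
    }
}
```

The format in main has no trailing space, unlike the example above, because this sketch does not implement the friendly whitespace handling.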

A previous attempt to design and implement parsing can be found here. We rejected that version because of its complexity and because of a bug in java.util.regex's Pattern.compile() for big Strings. The previous version also did not handle attribute order independence in a clear manner, and it lacked support for recursive parsing.