tauZaman DESIGN DECISIONS (Parsing, deprecated)

 

 

    Parsing consists of two components in the system. One of them is input parsing, which takes a temporal constant and parses it into a timestamp. The other one is output parsing, which takes a timestamp and parses it into temporal constant. However, in this context, we'll not talk about the whole process of forming a timestamp given a temporal constant or vice versa. We'll try to focus on the question of, "how can we parse (or fetch) related information from an input?".

 

    For input parsing, we are given a temporal constant and a format, which we have to apply format to temporal constant and get useful information, field values, if possible. For output parsing, we are given a timestamp and again a format, which we have to fetch useful information from timestamp and apply format to these useful information to get a formatted temporal constant. Let's see each of the processes separately.

 

Input Parsing

 

    Givens: String input, Format format

    Output: Fields consists of Fields (basically a Field contains a "useful string" fetched from input)

               

        To make it clear, let's see some examples;

Example 1

 

input = "January    01, 1970"

format = { "$month    $day, $year" , not whitespace friendly}

 

output = { {month, "January"}, {day, "01"}, {year, "1970"} }

 

Example 2

 

input = "<date day = " 12" month = 'Pre September' year = " 2003 " />"

format = { "< date year = "$year " day = "$day" month = "$month" />" , whitespace friendly}

 

output = { {day, "01"}, {month, "Pre September"}, {year, "2003"} }

 

Example 3

 

input = "<period>

               <instant day = "01" month = "September" year = "2002" />

               <instant>

                    <day> 12 </day>

                    <month value = 'October' />

                    <year value = "2003" />

               </instant>

            </period>"

format = { "<period>

                   <instant day = "$day " month = "$month" year = "$year" />

                   <instant>

                      $instantEnd

                   </instant>

                  </period>" , whitespace friendly}

 

format for instantEnd = {"

                    <day> $day </day>

                    <month value = '$month' />

                    <year value = "$year" />

                    ", whitespace friendly}

 

at the end of, so called, recursive evaluation;

 

output = { {day, "01"}, {month, "September"}, {year, "2002"},  {day, "12"}, {month, "October"}, {year, "2003"}}

    All these examples explain the goal very nicely, however, implementation is a little bit difficult then it seems at first. Let's see the design alternatives and final decisions.

 

    Design Alternative 1

 

        We are processing XML fragments (givens; input and format), so, it is natural to parse them into DOM objects and traverse both in order, getting the "useful values" when acrossing each corresponding node pairs. There are two problems with this approach; First problem is that, this approach involves in DOM parsing for both of the givens. Even if we cache parsed version of format for future uses, we still have to store them as DOM, which takes more space then a string representation (such as regular expressions, as we will explain it in a minute). And also, each time we have an input (even again we have a cached parsed format), we have to parse input into a DOM. For excessive number of inputs that can reduce the performance.

        Second problem is more related to implementation view. The problem comes from our system's ability to represent extensive kinds of inputs. Let's see one example;

 

        format: {"<period>

                        $instantStart / $instantEnd

                     </period>", whitespace friendly}

 

        input 1: "<period>

                       01

                       < month value = "September"/>

                       2003 / 02

                       < month value = "October"/>

                       2004

                     </period>"

 

        input 2: "<period>

                       01/04/2002 / 2, October 2004

                     </period>"

 

        input 3: "<period>

                        <instant>

                            01 12 1999

                        </instant>

                        /

                        <instant>

                            02 12 1999

                        </instant>

                     </period>"

 

        The difficulty is that format should get the right pieces into the right recursive variables, instantStart and instantEnd, for further processing. However, since structure of xml could be complex, it is very hard process to have a correct state after we parse it. To implement this, most likely, we have to have iterators, but since they can not foresee the items, there will be lots of stops and restarts.

 

        For example, when we traverse the format above, we see that we have get value of $instantStart from input. But when we are fetching text nodes, elements, attributes from input to satisfy this, where should we stop? We know that in format $instantStart ends with a '/'. So, we may use it to stop fetching process in input. But this is hard to do with DOM structure. For example in input 1, we find '/' at the middle of a text node, and we have to divide it in order to fetch it.

 

    Design Alternative 2

       

        We may first produce a list of regular expressions, which is equivalent to given format. And we can fetch the "useful values" by parsing inputs with these regular expressions. This approach inherently contains some disadvantages and advantages. Main disadvantage is, we have to enforce some DOM (XML) related characteristics on regular expressions. For example; we can not have attribute order independency without using a trick on produced regular expressions. Another example would be, forming regular expression such that we will be able to represent both kinds of empty elements (<foo/> and <foo></foo>). For XML there will be no difference between them, however, since we parse inputs as character based, we have to be prepared for that. And also there is no difference between <foo> and <foo >, but if we apply a not whitespace friendly format, although it applies only to text nodes and attribute values since we parse input as characters we have to ignore this for elements. All these make forming of regular expression a little tedious.

        Advantages of this approach are, we can process resume and restart with regular expressions using java built in package. That surely makes implementation more reliable. Also, another important advantage is, once we formed regular expression bag for a format, then we may cache it, and use for future inputs. As we will point that later in test runs, this reduces parsing time sharply.

 

    Before going through the details, let's see the trick that is used for enforcing attribute order independency. Attribute order independency can be handled in two ways. First way is to change input's attributes' order as format has it. But this brings additional over head, for each input we have to perform XML to DOM and then again DOM to XML transformation, moreover, depending on the transformation we use, attribute order may still change (unless we write our own DOM to XML program), also when we cache the parsed format, we have to keep some additional information about order of attributes. Second way consists of a trick which is, when we come across with attributes of an element we will form a regular expression that says (attr_name1 | attr_name2 | attr_name3 | ...). And since attribute names are unique in a single element, this works. However, for attribute values, we have to keep them as separate, since in input we never now which attribute name comes in which order, and having (attr_name1 =  "regex for attr1" | attr_name2  = "regex for attr2"| attr_name3  = "regex for attr3" | ...) does not work, since fetching values for variables depend on concatenating regular expressions that are not processed yet. So, for these reasons attribute values are kept separately from all other regular expressions.

 

    We chose the second design alternative. And here are the details;

 

    2 Stage Input Parsing

 

        As stated in alternative 2 briefly, there are two stages for input parsing.

        In the first stage, we parse format and produce a list of regular expressions. Before going that let's state some facts about format.

So, how do we produce a list of regular expression from a format? Basically for elements and attributes regular expressions are formed containing their names. Only in text nodes and attribute values, we may include variables, so, we have a special parsing for them. They are parsed character by character until a "$" sign, which is a variable and handled in a different way. All the regular expressions in the list are formed in order, however, regular expressions for attribute values in an element, are not formed in an order to enforce order independency.

 

Each regular expression is kept in an object, which is an instance of RegexElement class. This class, in addition to string regular expressions, keeps supplementary information about them. For example, type. A regular expression may correspond to a variable or a variable attribute or a recursive variable (recursive variables are the ones that need further processing, such as instants).

 

As one can note, the end product of this stage is independent to any input. Additionally, it takes time to process it each time we come across with an input. So, it will reduce respond time if we can cache this list instead of format itself.

 

In the second stage, after list of RegexElements are formed from a format, we apply them in order to an input according to different types. For example, if a RegexElement is of type variable, then it should be taken care of differently than if it is of type normal. The important questions are, how do we fetch the value of a variable and value of a recursive variable from the input? Let's see these with a visualizations. First fetching a variable;

 

Assume we are at the middle of parsing an input, and we have current input, and regular expressions that are left.

 

 

                               

First apply regular expression for this variable

 

                               

                               

 

If there is no match, then parsing failed. If there is a match, since the java.util.regex runs on maximum matching semantic, we might over matched. So, to understand it, we concatenate all other regexes (regular expressions) together and apply to the rest of the input.

 

                               

 

If it does not match, we will try to find a smaller match of regex for variable, if we succeed then we will apply concatenated regexes again.

 

                               

 

 

This time we had a match, matched part for variable will be one of the "useful value" we are seeking.

 

Second fetching a recursive variable;

 

Again assume we have the same situation, we are at the middle of parsing an input, and we have current input, and regular expressions that are left. This time we have a regex for a recursive variable.

 

                               

 

We will concatenate all left regexes together and apply it to whole input.

 

                               

When there is no match, we will apply the same chunk to input, which is trimmed from the beginning with a one character.

 

                               

 

When we find a match, then accumulated characters from trimmings will form "useful value" for recursive variable. You may be asking, then why do we need regular expression for recursive variable, after all we don't use it? Answer is, think about a regular expression for recursive variable in the other RegexElements set. If we don't have a regular expression , which is generally a general one, such as .+, we would not get a matched part when we apply it into the input.

 

Some performance results from initial implementation;

 

 A test run with different configurations...


 input:

 "<date>\n <instant year = \"2002\" month = \"September Turco\" day = \"12\"/>\n <instant year = \"2003\" /></date>"

 format: {
 "<date>\n <instant year = \"$year\" month = \"$month\" day = \"$day\"/>\n $instantEnd</date>", not whitespace friendly}


 forcing attribute order without caching: 150 ms

 forcing attribute order with caching: 24 ms

 not forcing attribute order without caching: 189 ms

 not forcing attribute order with caching: 61 ms

 

Output Parsing

 

    Givens: Granule [] granules, Format format

    Output: String output (temporal constant corresponding to granules -timestamp- )

   

    To make the discussion clear, we'll assume that given Granule [] granules are converted succesfully into Fields by a Calendar in active CalendricSystem.

         

        Let's see some examples;

Example 1

 

input = { {month, "January"}, {day, "01"}, {year, "1970"} }

format = { "$month  $day, $year" }

 

output = "January  01, 1970"

 

Example 2

 

input = { {day, "01"}, {month, "Pre September"}, {year, "2003"} }

format = { "< date year = "$year " day = "$day" month = "$month" />" }

 

output = "<date day = " 12" month = 'Pre September' year = " 2003 " />"

 

Example 3

 

input = { {day, "01"}, {month, "September"}, {year, "2002"},  {day, "12"}, {month, "October"}, {year, "2003"}}

 

format = { "<period>

                   <instant day = "$day " month = "$month" year = "$year" />

                   <instant>

                      $instantEnd

                   </instant>

                  </period>" , whitespace friendly}

 

format for instantEnd = {"

                    <day> $day </day>

                    <month value = '$month' />

                    <year value = "$year" />

                    ", whitespace friendly}

 

at the end of, so called, recursive evaluation;

 

output = "<period>

               <instant day = "01" month = "September" year = "2002" />

               <instant>

                    <day> 12 </day>

                    <month value = 'October' />

                    <year value = "2003" />

               </instant>

            </period>"

    As it can be noted, Output is quiet similar to Input in the sense that they are opposite of each other in terms of input and output. However, the idea is quite opposite. Since in Input, what we have is an input and a format, which we want to match. But in Output, what we have is a Granule, which is parsed into a Fields, and a format. So, in Input, our criteria is matching of format and input. However, in Output criteria should be different. Fields contain Granularity names and so is format. So, for Output, our criteria is to match these names in Fields to format. As long as we can match the names in Fields to names in format, this format should be a valid one. But there is one question here. This criteria may lead to wrong outputs. For example a Period format may have its end Instant before its start Instants. However,  corresponding Fields may have exactly the opposite (which is usually the normal way). There is no way to understand this contradiction in our Output algorithm.

 

Let's see a complete example, which goes from Input and forms a timestamp and then outputs it back to a temporal constant by using Output;

 

    Assume that we have an input  and a format and we want to form a Period temporal data type

   

    input = "<date>

                    <instantS> September, 20, 2002 </instantS>

                    <instantE> October, 20, 2003 </instantE>

                </date>"

 

    format = "<date>

                       <instantS> $instantStart </instantS>

                       <instantE> $instantEnd </instantE>

                   </date>"

 

    format for instants = { " $month, $day, $year ", not whitespace friendly}

 

    After we call Input's parse method, we will get this Fields;

 

    output = {{{month, "September"},{day, "20"}, {year, "2002"}}{{month, "September"},{day, "20"}, {year, "2002"}}}

 

    After this step a Calendar method for converting Fields to Granule [] will be called and a Granule [] will be produced as a timestamp.

 

    Now, when we want to output this timestamp, in Granule [], then we ask (probably the same) a Calendar to convert it to Fields. And then we'll call Output to parse this Fields into an output string.

 

    Here we have a design decision. Format that we want to use could be same as above, or some other format that is still conformant with the Fields by the above process. For example

 

    format 1 = "<date>

                       <instantS> $instantStart </instantS>

                       <instantE> $instantEnd </instantE>

                   </date>"

 

    or it could be

 

    format 2 = "<date>

                       <instantS> $month, $day, $year </instantS>

                       <instantE> $month, $day, $year </instantE>

                   </date>"

 

    We want both of the format be able to applied successfully on the Fields we have, no matter what kind of format we used when inputting. And when we apply either of them the result would be same as input string (this is specific to this example, result of Output parsing may be different than input string).

   

    Initial performance results goes here...