Defining and Loading the Data

XML Input Data

Input data to MathymaStats has to be in XML format. This is not quite as restrictive as it sounds, as the package includes Windows Scripts to convert other formats to XML.

XML is a format which is increasingly used to hold data, especially data which is transferred between machines, for example on the Web. If you've never come across XML and would like to know more, then the W3Schools tutorial is an excellent place to start. All you need to know to create XML data for MathymaStats will be described here.

XML in brief

In XML the data items are stored between tags. Each item has an opening tag and a closing tag. Tags are enclosed between less-than (<) and greater-than (>) symbols, so a typical XML element looks like this:

<HEIGHT>175</HEIGHT>

Those of you used to writing HTML will find these concepts quite familiar, but be aware that the syntax rules for XML are much stricter. In particular:

  • XML is case sensitive, so the opening and closing tags must match exactly, e.g. the following would be wrong:
    <HEIGHT>175</height>
  • Each opening tag must have a corresponding closing tag, unlike HTML where, for example, leaving out a paragraph end tag </P> does not give an error.
  • Tags must be properly nested. For example, in HTML you can write
    <i><b>This is in bold italics</i></b>
    whereas in XML this must be:
    <b><i>This is in bold italics</i></b>
    i.e. the italics are "nested" in the bold.
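
The rules above can be tried out with a few lines of JavaScript. The following is only a rough sketch for illustration: the function name checkTags is made up, it is not part of MathymaStats, and it is not a full XML parser (it ignores attributes, comments and so on). It simply keeps a stack of open tags and compares names exactly, which is enough to show the case and nesting rules.

```javascript
// Illustrative sketch only (checkTags is a hypothetical helper).
// Every opening tag needs an identically-cased closing tag, and tags
// must be closed in the reverse of the order they were opened.
function checkTags(text) {
  const tagPattern = /<(\/?)([A-Za-z][\w.-]*)[^<>]*?(\/?)>/g;
  const open = [];                        // stack of currently open tag names
  let match;
  while ((match = tagPattern.exec(text)) !== null) {
    const [, closing, name, selfClosing] = match;
    if (selfClosing) continue;            // e.g. <br/> opens and closes itself
    if (!closing) {
      open.push(name);
    } else if (open.pop() !== name) {     // the comparison is case-sensitive
      return "mismatched closing tag </" + name + ">";
    }
  }
  return open.length ? "unclosed tag <" + open[open.length - 1] + ">" : "ok";
}

console.log(checkTags("<HEIGHT>175</HEIGHT>"));   // ok
console.log(checkTags("<HEIGHT>175</height>"));   // mismatched closing tag </height>
console.log(checkTags("<i><b>text</i></b>"));     // mismatched closing tag </i>
```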

XML files usually take the extension .xml.

Input File Structure

The easiest way of thinking of a MathymaStats input file is as a "file" of "records". You will need a tag name for "file" and a tag name for "record". These could be "<file>" and "<record>", or "<observations>" and "<obs>", or "<data>" and "<measurement>", and so on. The actual names do not matter.

You will also need a tag name for each piece of information in the record. Add the XML header and you have your input file.

The simplest way is to consider an example, which I'll use later to illustrate data definition and so on. The data in the example are completely fabricated. Consider a survey of school-children from various towns, where we measure their weight (to the nearest kilogram), height (to the nearest centimeter), and whether the child is a boy or a girl. The town the school is in is also noted. To cut down on the number of records, a frequency field is included, so that instead of repeating the same measure, we just note the number of times it occurs.

We will use the tag <DATA> for the file and <MEASURE> for each record. <WEIGHT>, <HEIGHT>, <SEX> and <TOWN> should be self-explanatory; <FREQ> will be used for the frequency of that measurement. The file will look something like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<DATA>
   <MEASURE>
      <HEIGHT>148</HEIGHT>
      <WEIGHT>38</WEIGHT>
      <SEX>m</SEX>
      <TOWN>London</TOWN>
      <FREQ>10</FREQ>
   </MEASURE>
   <MEASURE>
      <HEIGHT>146</HEIGHT>
      <WEIGHT>32</WEIGHT>
      <SEX>m</SEX>
      <TOWN>London</TOWN>
      <FREQ>24</FREQ>
   </MEASURE>
   .
   .
   .
</DATA>

I say "something like this" because XML does not require the data to be indented and laid out neatly as shown. Browsers such as Internet Explorer will display the data in this way however it is written in the file. Also, from MathymaStats' point of view, the order of the elements within the record is not important. The following example is equivalent to the previous one:

<?xml version="1.0" encoding="ISO-8859-1"?>
<DATA>
<MEASURE> <HEIGHT>148</HEIGHT> <WEIGHT>38</WEIGHT>
 <SEX>m</SEX> <TOWN>London</TOWN> <FREQ>10</FREQ> </MEASURE>
<MEASURE> <SEX>m</SEX> <WEIGHT>32</WEIGHT>
 <TOWN>London</TOWN> <FREQ>24</FREQ> <HEIGHT>146</HEIGHT> </MEASURE>
   .
   .
   .
</DATA>
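
To see why layout and element order make no difference, here is a small JavaScript sketch that reads the same record written in both styles. It is an illustration under stated assumptions: readRecords is a made-up helper, it uses regular expressions rather than a real XML parser, and it says nothing about how MathymaStats itself reads the file.

```javascript
// Illustrative sketch only (readRecords is a hypothetical helper).
// Pulls flat records out of a small XML string; a real application
// would use a proper XML parser instead of regular expressions.
function readRecords(xml, recordTag) {
  const recordPattern = new RegExp(
    "<" + recordTag + ">([\\s\\S]*?)</" + recordTag + ">", "g");
  const fieldPattern = /<([A-Za-z][\w.-]*)>([^<]*)<\/\1>/g;
  const records = [];
  for (const rec of xml.matchAll(recordPattern)) {
    const fields = {};
    for (const [, name, value] of rec[1].matchAll(fieldPattern)) {
      fields[name] = value.trim();       // keyed by tag name, so order is irrelevant
    }
    records.push(fields);
  }
  return records;
}

const neat = `<?xml version="1.0" encoding="ISO-8859-1"?>
<DATA>
   <MEASURE>
      <HEIGHT>148</HEIGHT>
      <WEIGHT>38</WEIGHT>
      <SEX>m</SEX>
      <TOWN>London</TOWN>
      <FREQ>10</FREQ>
   </MEASURE>
</DATA>`;

const compact = `<?xml version="1.0" encoding="ISO-8859-1"?>
<DATA><MEASURE><SEX>m</SEX><FREQ>10</FREQ><TOWN>London</TOWN>
<WEIGHT>38</WEIGHT><HEIGHT>148</HEIGHT></MEASURE></DATA>`;

const fromNeat = readRecords(neat, "MEASURE")[0];
const fromCompact = readRecords(compact, "MEASURE")[0];

// Both layouts give the same fields with the same values.
const sameRecord =
  Object.keys(fromNeat).length === Object.keys(fromCompact).length &&
  Object.keys(fromNeat).every((tag) => fromNeat[tag] === fromCompact[tag]);
console.log(sameRecord);
```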

Getting the data into MathymaStats

There are three stages to getting the data into MathymaStats:

  1. Creating the DataBlock object.
  2. Defining the data.
  3. Loading the data from the file.

Creating the DataBlock object

MathymaStats is a collection of JavaScript "objects" representing the data, variables, factors, models etc. The controlling object is the cDataBlock object (the 'c' stands for 'class' - each cDataBlock object is an instance of the cDataBlock class of objects).

All MathymaStats statements are JavaScript statements, and must be written between <script> tags in your HTML file - this is a small in-line program on your page. The first statement we will use creates the cDataBlock object and links it to the input file. Suppose that the school-children data shown above is in a file called "pupil.xml". The following statements define the data block, giving it the name "pupilData".

<script>
   var pupilData = new cDataBlock("StatsData/pupil.xml");
   . . .
</script>

Viewing the XML data

Once the data block has been defined, there is a simple way of putting a link to the XML input file on the HTML page. Simply use the cDataBlock method "ViewXML":

<script>
   var pupilData = new cDataBlock("StatsData/pupil.xml");
   pupilData.ViewXML();
   . . .
</script>

Clicking on the link will open a new browser window where the XML data can be viewed.

Defining the data

MathymaStats recognises three types of variables:

  • Numerical measurements, e.g. height and weight. These are just called "variables".
  • Categorical or grouping variables, i.e. they show which category or group the individual measurement falls into. They can be numerical or alphabetic. These are called "factors".
  • Variables which show how many times this particular combination of "variables" and "factors" has occurred, i.e. the frequency of this observation. These are called a "count". There will only be one "count" variable per record.

Variables

There are two variables in the school-children example: height and weight. The cDataBlock method to define variables is AddVariables. The simplest way to define a variable is just to give its XML tag name. AddVariables can be called several times to define each variable in turn, or just once to define all variables in one go:

<script>
   var pupilData = new cDataBlock("StatsData/pupil.xml");
   pupilData.AddVariables ("HEIGHT");
   pupilData.AddVariables ("WEIGHT");
   . . .
</script>

or:

<script>
   var pupilData = new cDataBlock("StatsData/pupil.xml");
   pupilData.AddVariables ("HEIGHT;WEIGHT");
   . . .
</script>

Note that the variable names are separated by a semi-colon (;).

Note also that JavaScript and MathymaStats are both case-sensitive, so JavaScript keywords like "var" and "new", and object and method names like "cDataBlock" and "AddVariables", must be typed as shown. The variable names you use must also be in the same case as the XML tags, so if you have a <WEIGHT> tag, writing 'AddVariables("Weight")' would not give the desired result.

Renaming Variables

You may want the variable name that is displayed to be different from the XML tag in the input data. Suppose for example that instead of "WEIGHT" we wanted to call it "weight in kilos". This is achieved by typing the following:

<script>
   var pupilData = new cDataBlock("StatsData/pupil.xml");
   pupilData.AddVariables ("HEIGHT;weight in kilos=WEIGHT");
   . . .
</script>

Warning: the variable is now called "weight in kilos", and you have to type this in every time you refer to the variable subsequently - so don't go overboard with variable names.
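
To make the format of the definition string concrete, the following JavaScript sketch splits one into its parts. The helper parseDefinitions is hypothetical and only mirrors the rules described above (items separated by semi-colons, an optional "name=" in front of the XML tag); how MathymaStats handles the string internally is not documented here.

```javascript
// Illustrative sketch only (parseDefinitions is a hypothetical helper).
// Splits a definition string into { name, source } pairs, where "source"
// is the XML tag and "name" is what the variable will be called.
function parseDefinitions(definition) {
  return definition.split(";").map((item) => {
    const eq = item.indexOf("=");
    if (eq === -1) {
      const tag = item.trim();
      return { name: tag, source: tag };  // no "=": the name is the tag itself
    }
    return {
      name: item.slice(0, eq).trim(),
      source: item.slice(eq + 1).trim(),
    };
  });
}

const definitions = parseDefinitions("HEIGHT;weight in kilos=WEIGHT");
console.log(definitions);
// [ { name: 'HEIGHT', source: 'HEIGHT' },
//   { name: 'weight in kilos', source: 'WEIGHT' } ]
```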

Defining your own variables

Suppose the variables in the input file do not cover your needs exactly. For example in the school-children data suppose we want a variable to show height:weight ratio. To do this we use the line:

   pupilData.AddVariables ("ratio=HEIGHT/WEIGHT");

Note that on the right hand side are the XML tag names (as when we renamed variables) and not the name of the renamed variable.

Suppose also we wanted the logarithm of the height. To achieve this we use the JavaScript Math object method - "Math.log" :

   pupilData.AddVariables ("logHeight=Math.log(HEIGHT)");

So the right hand side of the equation is any valid JavaScript expression using the XML tag names.

You can use the operators: + (plus), - (minus), * (multiply), / (divide) and the Math methods:

  • Math.abs (absolute value)
  • Math.sin, Math.cos, Math.tan (trigonometrical functions - argument in radians)
  • Math.asin, Math.acos, Math.atan, Math.atan2 (inverse trigonometrical functions - return value in radians)
  • Math.exp, Math.log (exponential and natural logarithm)
  • Math.ceil (least integer greater than or equal to the argument)
  • Math.floor (greatest integer less than or equal to the argument)
  • Math.round (rounds argument to nearest integer)
  • Math.max, Math.min (maximum or minimum (respectively) of the arguments)
  • Math.pow (raises first argument to the power of the second)
  • Math.sqrt (the non-negative square root of the argument - which must not be negative)
  • Math.random (a random number from 0 up to, but not including, 1)

Also you can use the constant "Math.PI".
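
As a rough picture of what such an expression means, the sketch below evaluates one against a single record in plain JavaScript. It is an assumption-laden illustration: the helper evaluate is made up, it relies on the tag names being valid JavaScript identifiers, and it uses the Function constructor, which should only ever be given expressions you wrote yourself.

```javascript
// Illustrative sketch only (evaluate is a hypothetical helper).
// Each XML tag in the record is passed to the expression as a named
// argument, so "HEIGHT/WEIGHT" sees HEIGHT and WEIGHT as numbers.
function evaluate(expression, record) {
  const tags = Object.keys(record);
  const values = tags.map((tag) => Number(record[tag]));
  const compute = new Function(...tags, "return (" + expression + ");");
  return compute(...values);
}

const firstRecord = { HEIGHT: "148", WEIGHT: "38" };
const ratio = evaluate("HEIGHT/WEIGHT", firstRecord);          // 148 divided by 38
const logHeight = evaluate("Math.log(HEIGHT)", firstRecord);   // natural log of 148
console.log(ratio, logHeight);
```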

Factors

Factors are sometimes called "categorical variables", i.e. they tell you what category this observation or measurement falls into. Factors are defined by the cDataBlock method "AddFactors". As for variables, you can define one factor at a time, or several in one go separated by semi-colons (;). In the school-children example there are two factors "sex" and "town":

<script>
   var pupilData = new cDataBlock("StatsData/pupil.xml");
   pupilData.AddVariables ("HEIGHT;WEIGHT;ratio=HEIGHT/WEIGHT");
   pupilData.AddFactors ("SEX;TOWN");
   . . .
</script>

Renaming Factors

You can rename factors in the same way as variables. So if the XML tag was "GEOGRAPHICAL_LOCATION" (XML tag names cannot contain spaces), and you would prefer to refer to it as "place", then you use the statement:

   . . .
   dataBlock.AddFactors ("place=GEOGRAPHICAL_LOCATION");
   . . .

Factors from Variables

Sometimes there is a continuous numerical variable in the input data, but for your analysis you would like to treat this as a factor, i.e. just take the measurements in a few broad categories. Suppose for example that we're not interested in the actual weight, only in whether the subject is "light" (under 60 kg, say), "medium" (60 - 75 kg), or "heavy" (over 75 kg). You can create a "weight category" factor in the following way:

   . . .
   dataBlock.AddFactors ("weightCat=WEIGHT:light,60,medium,75,heavy");
   . . .

The strict interpretation of this is:

MathymaStats does not set a limit on the length of the definition chain. Just note that there should always be an odd number of elements: the odd-numbered elements are the level (category) names, the even-numbered elements are the boundaries between categories, and the boundaries have to be in ascending order.
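
The shape of the definition chain can be sketched in JavaScript as follows. The helper makeCategoriser is hypothetical, and one detail is an assumption: a value exactly on a boundary (e.g. 60) is placed in the higher category here, which may or may not be what MathymaStats does.

```javascript
// Illustrative sketch only (makeCategoriser is a hypothetical helper).
// ASSUMPTION: a value equal to a boundary goes into the higher category.
function makeCategoriser(chain) {
  const parts = chain.split(",").map((part) => part.trim());
  if (parts.length % 2 === 0) {
    throw new Error("definition chain needs an odd number of elements");
  }
  // 1st, 3rd, 5th ... elements are names; 2nd, 4th ... are boundaries.
  const names = parts.filter((_, i) => i % 2 === 0);
  const bounds = parts.filter((_, i) => i % 2 === 1).map(Number);
  for (let i = 1; i < bounds.length; i++) {
    if (!(bounds[i] > bounds[i - 1])) {
      throw new Error("boundaries must be in ascending order");
    }
  }
  return (value) => {
    let level = 0;
    while (level < bounds.length && value >= bounds[level]) level++;
    return names[level];
  };
}

const weightCat = makeCategoriser("light,60,medium,75,heavy");
console.log(weightCat(38), weightCat(68), weightCat(80));   // light medium heavy
```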

Cross Factors

You can also define the "cross-factor" of two other factors. E.g. in our example we could be interested in the combination of sex and town, so we would want to add the cross-factor "SEX*TOWN" to our analysis:

<script>
   var pupilData = new cDataBlock("StatsData/pupil.xml");
   pupilData.AddVariables ("HEIGHT;WEIGHT;ratio=HEIGHT/WEIGHT");
   pupilData.AddFactors ("SEX;TOWN");
   pupilData.AddFactors ("SEX*TOWN");
   . . .
</script>

Note that the component factors must be the XML tag name, not the renamed factor:

<script>
   var pupilData = new cDataBlock("StatsData/pupil.xml");
   pupilData.AddVariables ("HEIGHT;WEIGHT;ratio=HEIGHT/WEIGHT");
   pupilData.AddFactors ("gender=SEX;TOWN");
   pupilData.AddFactors ("SEX*TOWN");
   . . .
</script>

Also it is not necessary to define the component factors if these are not required in your analysis. The following is perfectly legitimate, but you won't be able to refer to "sex" and "town" separately in the subsequent analysis:

<script>
   var pupilData = new cDataBlock("StatsData/pupil.xml");
   pupilData.ViewXML ();
   pupilData.AddVariables ("HEIGHT;WEIGHT;ratio=HEIGHT/WEIGHT");
   pupilData.AddFactors ("SEX*TOWN");
   . . .
</script>
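
Conceptually, each level of a cross-factor is one combination of the levels of its component factors. The JavaScript sketch below shows the idea; crossLevel is a made-up helper, the records (including the town "Leeds" and the level "f") are invented for illustration, and the way the combined level is labelled is an assumption.

```javascript
// Illustrative sketch only (crossLevel is a hypothetical helper).
// The level of a cross-factor is built from the levels of the component
// factors, looked up by their XML tag names.
function crossLevel(record, definition) {
  return definition
    .split("*")
    .map((tag) => record[tag.trim()])
    .join("*");
}

// Invented records for illustration.
const pupils = [
  { SEX: "m", TOWN: "London" },
  { SEX: "f", TOWN: "London" },
  { SEX: "m", TOWN: "Leeds" },
];
const levels = pupils.map((pupil) => crossLevel(pupil, "SEX*TOWN"));
console.log(levels);   // [ 'm*London', 'f*London', 'm*Leeds' ]
```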

Count

Variables which show how many times this particular combination of "variables" and "factors" has occurred, i.e. the frequency of this observation, are called a "count". There will only be one "count" variable per record. In the school-children example the count variable is coded in the XML tag <FREQ>. To tell MathymaStats this, use the cDataBlock method "DefineCount":

<script>
   var pupilData = new cDataBlock("StatsData/pupil.xml");
   pupilData.AddVariables ("HEIGHT;WEIGHT;ratio=HEIGHT/WEIGHT");
   pupilData.AddFactors ("gender=SEX;TOWN;SEX*TOWN");
   pupilData.DefineCount ("FREQ");
   . . .
</script>

There does not have to be a count defined for the data. If the DefineCount method is not used MathymaStats will assume a count of one for each record.
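
The effect of a count on a simple statistic can be worked through in JavaScript. This is a sketch of the arithmetic only, using the two example records shown earlier; weightedMean is a hypothetical helper and not a MathymaStats method.

```javascript
// Illustrative sketch only (weightedMean is a hypothetical helper).
// Each record contributes its value "count" times; without a count tag
// every record counts once.
function weightedMean(records, tag, countTag) {
  let total = 0;
  let n = 0;
  for (const rec of records) {
    const count = countTag ? Number(rec[countTag]) : 1;
    total += Number(rec[tag]) * count;
    n += count;
  }
  return total / n;
}

const sample = [
  { HEIGHT: "148", WEIGHT: "38", FREQ: "10" },
  { HEIGHT: "146", WEIGHT: "32", FREQ: "24" },
];
const withCount = weightedMean(sample, "WEIGHT", "FREQ");   // (38*10 + 32*24) / 34
const withoutCount = weightedMean(sample, "WEIGHT");        // (38 + 32) / 2
console.log(withCount, withoutCount);
```

With the count, the 24 lighter children pull the mean weight down towards 32; without it, the two records are treated as two children.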

Data with no variables

Sometimes there is data with no variables (in the MathymaStats sense) but just factors and counts, for example the number of children in a group divided by gender and hair colour.

Each record will have the gender (factor), hair colour (factor), and number of children (count) that fit into that category.
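
Data of this kind is naturally laid out as a two-way table of counts. The JavaScript sketch below builds such a table; the helper tabulate, the tag names GENDER, HAIRCOLOUR and NUMBER, and all of the figures are invented for illustration.

```javascript
// Illustrative sketch only (tabulate is a hypothetical helper; the tag
// names and numbers below are invented).
// Builds a table of counts indexed by the levels of two factors.
function tabulate(records, rowTag, colTag, countTag) {
  const table = {};
  for (const rec of records) {
    const row = rec[rowTag];
    const col = rec[colTag];
    table[row] = table[row] || {};
    table[row][col] = (table[row][col] || 0) + Number(rec[countTag]);
  }
  return table;
}

const children = [
  { GENDER: "girl", HAIRCOLOUR: "brown", NUMBER: "7" },
  { GENDER: "girl", HAIRCOLOUR: "fair", NUMBER: "5" },
  { GENDER: "boy", HAIRCOLOUR: "brown", NUMBER: "6" },
  { GENDER: "boy", HAIRCOLOUR: "fair", NUMBER: "4" },
];
const hairTable = tabulate(children, "GENDER", "HAIRCOLOUR", "NUMBER");
console.log(hairTable);
// { girl: { brown: 7, fair: 5 }, boy: { brown: 6, fair: 4 } }
```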

Loading the data from the file

Once all the variables, factors and counts are defined the data can be loaded. You don't have to do this explicitly as MathymaStats loads the data as soon as a procedure that requires it is called. For example the cDataBlock method "Summary" which prints out basic means, minimum and maximum values, will load the data:

<script>
   var pupilData = new cDataBlock("StatsData/pupil.xml");
   pupilData.AddVariables ("HEIGHT;WEIGHT;ratio=HEIGHT/WEIGHT");
   pupilData.AddFactors ("gender=SEX;TOWN;SEX*TOWN");
   pupilData.DefineCount ("FREQ");
   pupilData.Summary("WEIGHT");
   . . .
</script>

Note that if you define or redefine variables, factors and counts once the data is loaded, then the data will have to be reloaded for subsequent analysis. MathymaStats does this automatically, though it will affect the speed at which your page builds. The best thing is to define all the variables up-front.
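
The cost of late definitions is easier to see with a small model of "load on first use". The class below is a generic JavaScript sketch of that pattern under my own assumptions; LazyBlock and its methods are invented names and do not describe how cDataBlock is actually implemented.

```javascript
// Illustrative sketch only (LazyBlock is a hypothetical class).
// Data is loaded the first time it is needed; any later definition marks
// the loaded data as out of date, so the next use loads it again.
class LazyBlock {
  constructor(loader) {
    this.loader = loader;      // function that reads and returns the records
    this.definitions = [];
    this.records = null;       // null means "not loaded, or out of date"
    this.loadCount = 0;
  }
  define(definition) {
    this.definitions.push(definition);
    this.records = null;       // a new definition invalidates loaded data
  }
  data() {
    if (this.records === null) {
      this.records = this.loader(this.definitions);
      this.loadCount++;
    }
    return this.records;
  }
}

const block = new LazyBlock(() => [{ WEIGHT: 38 }, { WEIGHT: 32 }]);
block.define("WEIGHT");
block.data();             // first use triggers the load
block.data();             // already loaded, no reload
block.define("HEIGHT");   // late definition marks the data as stale
block.data();             // loads a second time
console.log(block.loadCount);   // 2
```

Had both definitions been made before the first call to data(), the file would have been read only once.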

Writing MathymaStats Statements

MathymaStats methods either define objects for your analysis, or print out the results to the HTML page. You will often want to add HTML statements of your own, interspersing MathymaStats output in the main text of the page. There are two ways of doing this. Either you can jump in and out of JavaScript to produce new output each time:

. . .
HTML statements/text
. . .
<script>
   var pupilData = new cDataBlock("StatsData/pupil.xml");
   pupilData.AddVariables ("HEIGHT;WEIGHT;ratio=HEIGHT/WEIGHT");
   pupilData.AddFactors ("gender=SEX;TOWN;SEX*TOWN");
   pupilData.DefineCount ("FREQ");
</script>
. . .
more HTML statements/text
. . .
<script>
   pupilData.Summary("WEIGHT");
</script>
. . .
more HTML statements/text
. . .
<script>
   pupilData.Plot("HEIGHT:WEIGHT");
</script>
. . .
more HTML statements/text
. . .

Alternatively, if you only want to add a small amount of HTML between two MathymaStats outputs, you can use the JavaScript DOM method "document.write":

. . .
HTML statements/text
. . .
<script>
   var pupilData = new cDataBlock("StatsData/pupil.xml");
   pupilData.AddVariables ("HEIGHT;WEIGHT;ratio=HEIGHT/WEIGHT");
   pupilData.AddFactors ("gender=SEX;TOWN;SEX*TOWN");
   pupilData.DefineCount ("FREQ");
   document.write("<br/>The data looks like this:");
   pupilData.Summary("WEIGHT");
   document.write("<hr/>");
   pupilData.Plot("HEIGHT:WEIGHT");
</script>
. . .
HTML statements/text
. . .