Recently I got a question about processing an XML sample like this:
The challenge in this case is to link some kind of team identifier to each of the players. As you can see, there is no team data at all, so the only possible identifier is the occurrence number of the team in the lineup structure. The weapon of choice is IBM InfoSphere DataStage (version 220.127.116.11), and more specifically the hierarchical stage.
With DataStage I created a job with 6 steps. The original XML payload and XSD are used in the first hierarchical stage to break down the lineup into individual teams (step 3). The job then counts the teams (step 4), breaks down each team (step 5), and reviews the results. The XSD was generated with Freeformatter.
In this first hierarchical stage, the XSD will be added.
The XML processing consists of three steps.
Within the XML_Parser step, the special option for “chunking” is used. It puts all data at team level (including the XML tags) into 1 string.
Whereas the original XML was in 1 row, it is now broken down into 2 rows: 1 row per team. Because DataStage keeps track of the rows, the row instance number of each team can be determined and assigned to that team.
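To make the effect of chunking concrete outside DataStage, here is a minimal Python sketch. It is not how the hierarchical stage works internally, and because the original payload is not reproduced here, the sample XML and its element names (`lineup`, `team`, `player`) are assumptions. The helper `chunk_teams` is hypothetical too.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element names and player names are assumptions,
# chosen to match the description (3 players divided over 2 teams).
LINEUP_XML = """
<lineup>
  <team>
    <player>Alice</player>
    <player>Bob</player>
  </team>
  <team>
    <player>Carol</player>
  </team>
</lineup>
"""


def chunk_teams(lineup_xml):
    """Return one XML string per <team> element, tags included."""
    root = ET.fromstring(lineup_xml.strip())
    chunks = []
    for team in root.findall("team"):
        team.tail = None  # drop the whitespace that follows the element
        chunks.append(ET.tostring(team, encoding="unicode"))
    return chunks


# One "row" per team; the position in the list plays the role of the
# row instance number that DataStage provides.
team_rows = chunk_teams(LINEUP_XML)
for row_number, chunk in enumerate(team_rows, start=1):
    print(row_number, chunk)
```

Each list entry is still raw XML, which is what allows a later step to parse the team on its own.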
In a transformer stage, the current row number is determined using a stage variable. This number is added to the team XML as an attribute. To process the team XML, a second hierarchical stage is used, but it only processes the (smaller) team XML. The approach is the same as for the first hierarchical stage: define an XSD, add it to the stage, and process the XML in three steps. This time, however, “chunking” is not used in the XML_Parser step.
The XML player data is directly mapped to the output of the stage.
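The transformer and the second hierarchical stage can be approximated in Python as follows. This is only a sketch under assumptions: the team chunks, the attribute name `id`, and the helpers `tag_team` and `parse_players` are all hypothetical and stand in for the stage variable and the XML_Parser step.

```python
import xml.etree.ElementTree as ET

# Hypothetical team chunks, one string per row, standing in for the
# output of the first hierarchical stage.
TEAM_CHUNKS = [
    "<team><player>Alice</player><player>Bob</player></team>",
    "<team><player>Carol</player></team>",
]


def tag_team(chunk, team_number):
    """Mimic the transformer: add the row number as an attribute on <team>."""
    team = ET.fromstring(chunk)
    team.set("id", str(team_number))  # "id" is an assumed attribute name
    return ET.tostring(team, encoding="unicode")


def parse_players(tagged_chunk):
    """Mimic the second stage: emit one (team id, player) row per player."""
    team = ET.fromstring(tagged_chunk)
    team_id = int(team.get("id"))
    return [(team_id, player.text) for player in team.findall("player")]


player_rows = []
# The running counter plays the role of the stage variable.
for row_number, chunk in enumerate(TEAM_CHUNKS, start=1):
    player_rows.extend(parse_players(tag_team(chunk, row_number)))

for row in player_rows:
    print(row)
```

With these sample chunks the result is three player rows carrying team identifiers 1, 1 and 2.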
As you can see, the original XML has 3 players divided over 2 teams. The output has 3 (player) lines, each with a team identifier (1 or 2).