Recoding of data can be very complex and SDMX and FMR respond to this complexity by supporting the use of regular expressions (regex) to specify recoding rules. This unit explains how.
Previously we said that representation maps can include complex rules, such as regular expressions (regex) on source values and can even define periods of time for which a mapping relationship is true. For example,
If values do not require mapping, for example if the source FREQ maps to target FREQ and the values are the same in both the source and target DSD, then the component map should not link to a representation map. The lack of a link will inform the system that the value should be copied across verbatim.
The order of rules in a representation map can be important, specifically when using regular expressions. The regular expressions are tested in the same order that they appear, this allows for more specific expressions to be tested before a general catch all.
Example
Source data will first be checked against the exact match rule (A maps to B), followed by each regular expression rule until a match is found. As the last expression matched on anything, this can be considered as ‘if nothing matches then output _Z’.
A rule can be defined to match a specific pattern, which is then used in the output. An additional use case is where there is more than one source component for a mapping.
Select each example use case for the details.
A rule can be defined to match a specific pattern, which is then used in the output.
For example, the rule can state any three characters followed by a number is converted to the same three characters without the number. This can be satisfied by using regular expressions to match the input, with a capture group. A capture group is where the regular expression rule is in parentheses which can then be referred to by number (capture group 1, 2, 3, and so on).
Example
This example consists of two capture groups:
The output expression then reverses the order of the information by outputting capture group 2, an underscore, followed by capture group 1.
An example input for the above expression, and corresponding output is as follows:
An additional use case is where there is more than one source component for a mapping e.g. CURRENCY and REF_AREA. The value for one of the source components is based on a pattern and the value for the second component is based on what matched the first pattern.
Like the pattern match on output, this rule makes use of regular expression capture groups to copy matched information from one rule to another. The capture group (everything matched in the parenthesis) is referred to by number, with a leading slash .
Example
In this example the first input matches on anything, but the REF_AREA rule is using the matched value from CURRENCY, defined by capture group \1, followed by _X. The following shows a match and a miss:
To reshape a dataset from wide to tidy, we take source data with multiple measures e.g. BIRTHS, DEATHS, MARRIAGES and convert it to a DSD with only one OBS_VALUE.
In the following example, the measure is converted into a dimension value i.e. INDICATOR.
Select the options below for the details.
REF_AREA | TIME | BIRTHS | DEATHS | MARRIAGES |
---|---|---|---|---|
UK | 2020 | 11 | 12 | 13 |
FR | 2020 | 21 | 22 | 23 |
The dataset should be mapped to convert the BIRTHS, DEATHS and MARRIAGES to the INDICATOR B, D, and M respectively. The observation value is the value of each corresponding measure.
REF_AREA | INDICATOR | TIME_PERIOD | OBS_VALUE |
---|---|---|---|
UK | B | 2020 | 11 |
UK | D | 2020 | 12 |
UK | M | 2020 | 13 |
FR | B | 2020 | 21 |
FR | D | 2020 | 22 |
FR | M | 2020 | 23 |
This mapping relationship can be defined by mapping each source MEASURE to both the OBS_VALUE component and the INDICATOR component.
Example
Component Map 1: BIRTHS maps to OBS_VALUE
Component Map 2: BIRTHS maps to INDICATOR (uses Births Representation Map)
Births Representation Map: source=[anything], target=B
Where the [anything] rule is simply the regular expression .*
Component Map 3: DEATHS maps to OBS_VALUE
Component Map 4: DEATHS maps to INDICATOR (uses Deaths Representation Map)
Deaths Representation Map: source=.*, target=D
Component Map 5: MARRIAGES maps to OBS_VALUE
Component Map 6: MARRIAGES maps to INDICATOR (uses Marriages Representation Map)
Marriages Representation Map: source=.*, target=M
Recoding capabilities in SDMX are extremely powerful and the possibilities endless. The best way to become familiar with SDMX capabilities for recoding and transforming data is to experiment with both simple and more complex transformation scenarios.
In doing so, it is important to be able to test and validate the transformations against different test cases (test datasets). The next unit provides more information on testing and automating SDMX data transformations.