pdsparser Module
PDS Ring-Moon Systems Node, SETI Institute
pdsparser is a Python module that reads a PDS3 label file and converts its entire
content to a Python dictionary.
The typical way to use this is as follows:
from pdsparser import Pds3Label
label = Pds3Label(label_path)
where label_path is the path to a PDS3 label file or a data file containing an attached
PDS3 label. The returned object, label, is an instance of class Pds3Label, which
supports the Python dictionary API and provides access to the content of the label.
Example 1
Suppose this is the content of a PDS3 label:
PDS_VERSION_ID = PDS3
RECORD_TYPE = FIXED_LENGTH
RECORD_BYTES = 2000
FILE_RECORDS = 1001
^VICAR_HEADER = ("C3450702_GEOMED.IMG", 1)
^IMAGE = ("C3450702_GEOMED.IMG", 2000 <BYTES>)
/* Image Description */
INSTRUMENT_HOST_NAME = "VOYAGER 1"
INSTRUMENT_HOST_NAME = VG1
IMAGE_TIME = 1980-10-29T09:58:10.00
FILTER_NAME = VIOLET
EXPOSURE_DURATION = 1.920 <SECOND>
DESCRIPTION = "This image is the result of geometrically
correcting the corresponding CALIB image (C3450702_CALIB.IMG)."
OBJECT = VICAR_HEADER
HEADER_TYPE = VICAR
BYTES = 2000
RECORDS = 1
INTERCHANGE_FORMAT = ASCII
DESCRIPTION = "VICAR format label for the image."
END_OBJECT = VICAR_HEADER
OBJECT = IMAGE
LINES = 1000
LINE_SAMPLES = 1000
SAMPLE_TYPE = LSB_INTEGER
SAMPLE_BITS = 16
BIT_MASK = 16#7FFF#
END_OBJECT = IMAGE
END
The returned dictionary will be as follows:
{'PDS_VERSION_ID': 'PDS3',
'RECORD_TYPE': 'FIXED_LENGTH',
'RECORD_BYTES': 2000,
'FILE_RECORDS': 1001,
'^VICAR_HEADER': 'C3450702_GEOMED.IMG',
'^VICAR_HEADER_offset': 1,
'^VICAR_HEADER_unit': '',
'^VICAR_HEADER_fmt': '("C3450702_GEOMED.IMG", 1)',
'^IMAGE': 'C3450702_GEOMED.IMG',
'^IMAGE_offset': 2000,
'^IMAGE_unit': '<BYTES>',
'^IMAGE_fmt': '("C3450702_GEOMED.IMG", 2000 <BYTES>)',
'INSTRUMENT_HOST_NAME_1': 'VOYAGER 1',
'INSTRUMENT_HOST_NAME_2': 'VG1',
'IMAGE_TIME': datetime.datetime(1980, 10, 29, 9, 58, 10),
'IMAGE_TIME_day': -7003,
'IMAGE_TIME_sec': 35890.0,
'IMAGE_TIME_fmt': '1980-10-29T09:58:10.000',
'FILTER_NAME': 'VIOLET',
'EXPOSURE_DURATION': 1.92,
'EXPOSURE_DURATION_unit': '<SECOND>',
'DESCRIPTION': 'This image is the result of geometrically
correcting the corresponding CALIB image (C3450702_CALIB.IMG).',
'DESCRIPTION_unwrap': 'This image is the result of geometrically correcting the
corresponding CALIB image (C3450702_CALIB.IMG).',
'VICAR_HEADER': {'OBJECT': 'VICAR_HEADER',
'HEADER_TYPE': 'VICAR',
'BYTES': 2000,
'RECORDS': 1,
'INTERCHANGE_FORMAT': 'ASCII',
'DESCRIPTION': 'VICAR format label for the image.',
'END_OBJECT': 'VICAR_HEADER'},
'IMAGE': {'OBJECT': 'IMAGE',
'LINES': 1000,
'LINE_SAMPLES': 1000,
'SAMPLE_TYPE': 'LSB_INTEGER',
'SAMPLE_BITS': 16,
'BIT_MASK': 32767,
'BIT_MASK_radix': 16,
'BIT_MASK_digits': '7FFF',
'BIT_MASK_fmt': '16#7FFF#',
'END_OBJECT': 'IMAGE'},
'END': '',
'objects': ['VICAR_HEADER', 'IMAGE']}
As you can see:
Most PDS3 label keywords become keys in the dictionary without change.
OBJECTs and GROUPs are converted to sub-dictionaries and are keyed by the value of the PDS3 keyword. In this example, label['VICAR_HEADER']['HEADER_TYPE'] returns 'VICAR'.
If a keyword is repeated at the top level or within an object or group, it receives a suffix “_1”, “_2”, “_3”, etc. to distinguish it.
If a value has units, there is an additional keyword in the dictionary with “_unit” as a suffix, containing the name of the unit.
For text values that contain a newline, trailing blanks are suppressed. In addition, a dictionary key with the suffix “_unwrap” contains the same text as full paragraphs separated by newlines.
For a file pointer of the form (filename, offset) or (filename, offset <BYTES>), the keyed value is just the filename. The offset value is provided with “_offset” appended to the dictionary key, and the unit is provided with “_unit” appended to the key.
For based integers of the form “radix#digits#”, the dictionary value is converted to an integer. However, the radix and the digit string are provided using keys with the suffix “_radix” and “_digits”. Also, the key with suffix “_fmt” provides a full, PDS3-formatted version of the value.
Dates and times are converted to Python datetime objects. However, additional dictionary keys appear with the suffix “_day” for the day number relative to January 1, 2000 and “_sec” for the elapsed seconds within that day.
For items that have special formatting within a label, such as file pointers, dates, and integers with a radix, the key with a “_fmt” suffix provides the PDS3-formatted value for reference.
Each dictionary containing OBJECTs ends with an entry keyed by “objects”, which returns the ordered list of all the OBJECT keys in that dictionary. Similarly, each dictionary containing GROUPs has an entry keyed by “groups”, which returns the list of all the GROUP keys. These provide an easy way to iterate through objects and groups in the label.
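The “objects” entry makes it straightforward to walk the structure of a label. The following is a minimal sketch that uses a hand-written plain dictionary mirroring part of the example above, rather than an actual Pds3Label; the helper name walk_objects is hypothetical and is not part of pdsparser.

```python
# A hand-built stand-in for part of the dictionary shown above.
label = {
    'RECORD_BYTES': 2000,
    'VICAR_HEADER': {'OBJECT': 'VICAR_HEADER', 'BYTES': 2000,
                     'END_OBJECT': 'VICAR_HEADER'},
    'IMAGE': {'OBJECT': 'IMAGE', 'LINES': 1000, 'LINE_SAMPLES': 1000,
              'END_OBJECT': 'IMAGE'},
    'objects': ['VICAR_HEADER', 'IMAGE'],
}

def walk_objects(d, depth=0):
    """Yield (depth, key, sub-dictionary) for each OBJECT, in label order."""
    for key in d.get('objects', []):
        yield depth, key, d[key]
        # A nested object carries its own "objects" list, if it has any.
        yield from walk_objects(d[key], depth + 1)

found = [(depth, key) for depth, key, _ in walk_objects(label)]
image_shape = (label['IMAGE']['LINES'], label['IMAGE']['LINE_SAMPLES'])
```

Because a Pds3Label supports the dictionary API, the same loop should apply to a parsed label; the same pattern works for the “groups” entry.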
Example 2
Within TABLE and SPREADSHEET objects, the dictionary keys of the embedded COLUMN, BIT_COLUMN, FIELD, and ELEMENT_DEFINITION objects are keyed by the value of the NAME keyword (rather than by using repeated keywords “COLUMN_1”, “COLUMN_2”, “COLUMN_3”, etc.). For example, suppose this appears in a PDS3 label:
OBJECT = TABLE
OBJECT = COLUMN
NAME = VOLUME_ID
START_BYTE = 1
END_OBJECT = COLUMN
OBJECT = COLUMN
NAME = FILE_SPECIFICATION_NAME
START_BYTE = 15
END_OBJECT = COLUMN
END_OBJECT = TABLE
The returned section of the dictionary will look like this:
{'TABLE': {'OBJECT': 'TABLE',
'VOLUME_ID': {'OBJECT': 'COLUMN',
'NAME': 'VOLUME_ID',
'START_BYTE': 1,
'END_OBJECT': 'COLUMN'},
'FILE_SPECIFICATION_NAME': {'OBJECT': 'COLUMN',
'NAME': 'FILE_SPECIFICATION_NAME',
'START_BYTE': 15,
'END_OBJECT': 'COLUMN'},
'END_OBJECT': 'TABLE'},
'objects': ['VOLUME_ID', 'FILE_SPECIFICATION_NAME']}
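Keying columns by NAME means a column can be looked up directly. The sketch below works on a plain dictionary copied from the TABLE entry above (not on a real Pds3Label) and shows one way to select the COLUMN sub-dictionaries; the variable names are illustrative only.

```python
# The TABLE sub-dictionary from the example above, written out by hand.
table = {
    'OBJECT': 'TABLE',
    'VOLUME_ID': {'OBJECT': 'COLUMN', 'NAME': 'VOLUME_ID',
                  'START_BYTE': 1, 'END_OBJECT': 'COLUMN'},
    'FILE_SPECIFICATION_NAME': {'OBJECT': 'COLUMN',
                                'NAME': 'FILE_SPECIFICATION_NAME',
                                'START_BYTE': 15, 'END_OBJECT': 'COLUMN'},
    'END_OBJECT': 'TABLE',
}

# Direct lookup by column name.
start = table['FILE_SPECIFICATION_NAME']['START_BYTE']

# Select every sub-dictionary that describes a COLUMN object.
columns = {name: value for name, value in table.items()
           if isinstance(value, dict) and value.get('OBJECT') == 'COLUMN'}
start_bytes = {name: column['START_BYTE'] for name, column in columns.items()}
```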
Example 3
“Set” notation (using curly braces “{}”) was sometimes mis-used in PDS3 labels where “sequence” notation (using parentheses “()”) was meant. For example, this might appear in a label:
CUTOUT_WINDOW = {1, 1, 200, 800}
which is supposed to define the four boundaries of an image region. The user might be surprised to learn that in the dictionary, its value is the Python set {1, 200, 800}. To address this situation, for every set value, the dictionary also has a key with the same name but suffix “_list”, which contains the elements of the value as a list in their original order and including duplicates. In this example, the dictionary contains:
'CUTOUT_WINDOW': {1, 200, 800},
'CUTOUT_WINDOW_list': [1, 1, 200, 800]
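The difference matters as soon as code relies on the number or order of elements. This short sketch uses only built-in Python types (no pdsparser calls) to show why the “_list” form is the safer one to unpack:

```python
cutout_set = {1, 1, 200, 800}    # a set literal: the duplicate 1 collapses
cutout_list = [1, 1, 200, 800]   # a list keeps order and duplicates

n_set = len(cutout_set)
n_list = len(cutout_list)

# Unpacking four boundaries works with the list form...
b1, b2, b3, b4 = cutout_list

# ...but fails with the set form, which holds only three elements.
try:
    s1, s2, s3, s4 = cutout_set
    set_unpacked = True
except ValueError:
    set_unpacked = False
```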
Options
The Pds3Label() constructor provides a variety of additional options for how to
parse the label and present its content.
You can provide the label to be parsed as a string containing the label’s content rather than as a path to a file.
Use types=True to include the PDS3 data type of each keyword’s value (e.g., “integer”, “based_integer”, “text”, “date_time”, or “file_offset_pointer”) in the dictionary using the keyword plus suffix “_type”.
Use sources=True to include the source text as extracted from the PDS3 label in the dictionary using the keyword plus suffix “_source”.
Use expand=True to insert the content of any referenced ^STRUCTURE keywords into the returned dictionary.
Use vax=True to read attached labels from old-style Vax variable-length record files.
Use the repairs option to correct any known syntax errors in the label, using regular expressions, prior to parsing.
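To preview what a repair does before handing it to the constructor, you can apply it to some label text yourself. The sketch below assumes that each (pattern, replacement) pair behaves like a call to Python’s re.sub, which this document does not state explicitly; the pair itself is the “N/A” example given in the description of the repairs parameter, and the label text is made up for illustration.

```python
import re

# Quote any N/A that is not already preceded by a quote character.
repair = (r'(?<!["\'])N/A', "'N/A'")

# Made-up label content: one unquoted and one quoted occurrence.
content = 'TARGET_NAME = N/A\nNOTE = "N/A"\n'

pattern, replacement = repair
repaired = re.sub(pattern, replacement, content)
```

Only the unquoted occurrence changes; the look-behind leaves the already-quoted value alone.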
Four methods of parsing the label are provided.
method="strict" uses a strict implementation of the PDS3 syntax. It is sure to provide accurate results, but can be rather slow. This method can also be used to validate the syntax within a PDS3 label, because it will raise a SyntaxError if the label contains invalid syntax.
method="loose" uses a variant of the “strict” method, in which allowance is made for certain common syntax errors. Specifically:
It allows slashes in file names and in text strings that are not quoted (e.g., ‘N/A’).
It allows the value of END_OBJECT and END_GROUP to be absent, as long as they are still properly paired with associated OBJECT and GROUP keywords.
It allows time zone expressions, which were disallowed starting in Version 4 of the standards.
It allows blanks where leading zeros belong in dates and times, e.g., “2026- 7 - 4” instead of “2026-07-04” and “12: 3: 4” instead of “12:03:04”.
It allows commas to be missing between the elements of a sequence or set.
It allows the final line terminator after END to be missing from a detached label.
method="fast" is a different and much faster (often 30x faster) parser, which takes various “shortcuts” during the parsing. As a result, it may fail on occasions where the other methods succeed, and it may not return correct results for some oddly formatted labels. However, it handles all the most common aspects of the PDS3 syntax correctly, and so may be a good choice when handling large numbers of labels.
method="compound" is similar to “loose”, but it parses a “compound” label, i.e., one that might contain more than one END statement.
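One possible way to combine these methods is to try them in order and keep the first one that succeeds. The sketch below is an assumption about how you might do that, not a feature of pdsparser: parse_with_fallback is a hypothetical helper, and fake_parse is a stand-in so that the example runs without the module installed.

```python
def parse_with_fallback(parse, label, methods=('strict', 'loose')):
    """Try each method in turn; return (method, result) for the first success."""
    if not methods:
        raise ValueError('at least one method is required')
    last_error = None
    for method in methods:
        try:
            return method, parse(label, method=method)
        except SyntaxError as err:
            last_error = err    # remember the failure and try the next method
    raise last_error

# Stand-in parser for demonstration only; it rejects an unquoted N/A
# when asked to be strict.
def fake_parse(label, method='strict'):
    if method == 'strict' and 'N/A' in label:
        raise SyntaxError('unquoted slash')
    return {'method_used': method}

used, result = parse_with_fallback(fake_parse, 'TARGET_NAME = N/A\nEND\n')
```

In real use you would pass Pds3Label as the parse argument, since its constructor accepts the label followed by a method keyword.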
Utilities
The pdsparser module provides several additional utilities for handling PDS3 labels.
read_label(): Reads a PDS3 label from a file. Supports attached labels within binary files.
read_vax_binary_label(): Reads the attached PDS3 label from an old-style Vax binary file that uses variable-length records.
expand_structures(): Replaces any ^STRUCTURE keywords in a label string with the content of the associated “.FMT” files.
- class pdsparser.Pds3Label(label, method='strict', *, expand=False, fmt_dirs=[], repairs=[], vax=False, types=False, sources=False, first_suffix=True, _details=False)
Bases: object
Class representing the parsed content of a PDS3 label.
- __init__(label, method='strict', *, expand=False, fmt_dirs=[], repairs=[], vax=False, types=False, sources=False, first_suffix=True, _details=False)
Constructor for a Pds3Label.
- Parameters:
label (str, list, pathlib.Path, or filecache.FCPath) – The label, defined as a path to a file or as the content of a label. The content can be represented by a single string with <LF> or <CR><LF> terminators, or as a list of strings with optional terminators. If the file contains an attached PDS3 label, that file is read up to the END statement and the remainder is ignored. If the file does not contain a label but a detached label (ending in “.lbl” or “.LBL”) exists, that file is read instead.
method (str, optional) –
The method of parsing to apply to the label. One of:
“strict” performs strict parsing, which requires that the label conform to the full PDS3 standard.
“loose” is similar to the above, but tolerates some common syntax errors.
“compound” is similar to “loose”, but it parses a “compound” label, i.e., one that might contain more than one “END” statement. This option is not supported for attached labels.
“fast” uses a different parser, which executes about 30x faster than the above and handles all the most common aspects of the PDS3 standard. However, it is not guaranteed to provide an accurate parsing under all circumstances.
expand (bool, optional) – True to replace the content of any ^STRUCTURE keyword in the label with the content of the associated “.FMT” file.
fmt_dirs (str, pathlib.Path, filecache.FCPath, or list, optional) – One or more directory paths to search for “.FMT” files. Note that if label indicates a file path, the parent directory of that file is always searched first.
repairs (tuple or list[tuple]) –
One or more two-element tuples of the form (pattern, replacement), where the first item is a regular expression and the second is the string with which to replace it. These repair patterns are applied to the label content before it is parsed, and make it possible to repair known syntax errors. For example, this tuple uses a “negative look-behind” pattern (?<!…) to ensure that every occurrence of “N/A” is surrounded by quotes:
(r'(?<!["\'])N/A', "'N/A'")
The replacement can include back-references (“\1”, “\2”, etc.) to captured substrings of the pattern; see any documentation about regular expressions for more details.
vax (bool, optional) – True to read an attached label from a Vax binary file.
types (bool, optional) – If True, for each PDS keyword in the label, there will be an extra key in the dictionary with the same name but suffix “_type” identifying the PDS3 data type, e.g., “integer”, “based_integer”, “text”, “date_time”, “file_offset_pointer”, etc.
sources (bool, optional) – If True, for each PDS keyword in the label, there will be an extra key in the dictionary with the same name but suffix “_source” returning the substring of the label from which this value was derived.
first_suffix (bool, optional) – If True and a keyword is duplicated, append a suffix “_1” to the first occurrence; otherwise, the first occurrence of the keyword has no suffix.
_details (bool, optional) – Used for debugging. If True, for each PDS keyword in the label, there will be an extra key in the dictionary with the same name but suffix “_detail” returning an object (of a class internal to this module) containing details about how the entry was parsed. Not provided if method="fast".
- Raises:
FileNotFoundError – If the label file is missing.
SyntaxError – If the label content contains invalid syntax.
Notes
The label information is preserved as a dictionary using the value before each equal sign as the key. If a keyword is repeated in the label, later dictionary keys have a suffix “_1”, “_2”, “_3”, etc.
OBJECT and GROUP elements are described by internal dictionaries, which are organized the same as the overall label. The key for COLUMN, BIT_COLUMN, FIELD, and ELEMENT_DEFINITION objects is their NAME attribute; for others, it is the value after the equal sign in the OBJECT or GROUP statement.
Numeric values are represented as ints or floats. If the value has a unit, the unit value can be accessed by appending “_unit” to the key. Integers given with a radix are provided as ints, but you can view the radix value and the digit string by appending “_radix” and “_digits” to the key; in addition, suffix “_fmt” returns the full formatted value using the radix notation.
Text strings are represented as Python str values. For those that extend beyond a single line, you can append “_unwrap” to the key to get a version of the text in which indents and newlines within paragraphs have been removed.
Dates, times, and date-times are all represented using classes of the Python datetime module. Dates and date-times have an additional dictionary entry using suffix “_day” returning the elapsed days since January 1, 2000. Times and date-times have an additional entry using suffix “_sec” returning the number of elapsed seconds since the beginning of that day. In addition, all of these have an additional entry with suffix “_fmt” returning the PDS3-formatted value.
Sequences are represented by lists. 2-D sequences are represented by lists of lists. Append “_unit” to the key to see any unit values that appeared within the sequence; if all units are the same, the “_unit” suffix returns a single value; otherwise, it returns a list or list of lists containing the unit value associated with each value in the sequence.
Set values (enclosed in curly braces {}) are represented by Python set objects. However, because this notation was sometimes mis-used in labels for values that should have been given as sequences, you can also view these values as an ordered list by appending “_list” to the key.
For pointers involving a filename and an offset, the keyword name in the label returns the filename only; append “_offset” to the key to get the offset and “_unit” to get the unit, which is either “<BYTES>” or an empty string (meaning the unit is records).
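The derived entries described in these notes can be reproduced with the standard library alone. The sketch below recomputes the “_day”, “_sec”, “_radix”, “_digits”, and “_fmt” values for IMAGE_TIME and BIT_MASK from Example 1. It illustrates the definitions given above and is not taken from the module’s own code.

```python
import datetime

# IMAGE_TIME = 1980-10-29T09:58:10.00
image_time = datetime.datetime(1980, 10, 29, 9, 58, 10)

# "_day": elapsed days since January 1, 2000 (negative for earlier dates).
day = (image_time.date() - datetime.date(2000, 1, 1)).days

# "_sec": elapsed seconds since the beginning of that day.
midnight = datetime.datetime.combine(image_time.date(), datetime.time())
sec = (image_time - midnight).total_seconds()

# BIT_MASK = 16#7FFF#
radix, digits = 16, '7FFF'
bit_mask = int(digits, radix)        # the integer value
bit_mask_fmt = f'{radix}#{digits}#'  # the radix notation
```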
- content
The full content of the label as a string with <LF> line separators. If expand is True, this will be the expanded content, with any ^STRUCTURE values replaced.
- Type:
str
- get(key, default=None)
The value for key if key is in the label; otherwise, the specified default.
- as_dict()
This label as a Python dictionary. Part of the old PdsLabel API.
DEPRECATED; use the dict_ attribute or apply the dict API directly to this Pds3Label object.
Note that this function matches the previous output of as_dict(). Specifically,
dates, times, and datetimes are returned as strings in ISO format.
file_offset_pointers are returned as a tuple (filename, offset, unit).
set values are returned as lists.
units are omitted.
- static from_file(filename)
Load and parse a PDS label. Part of the old PdsLabel API.
DEPRECATED; use Pds3Label(filename).
Utility functions.
- pdsparser.utils.read_label(filepath, *, chars=4000)
Read the PDS3 label from a file. Supports attached labels within binary files.
- Parameters:
filepath (str, pathlib.Path, or filecache.FCPath) – The path to the file. If the file does not contain a PDS3 label, a detached label (with the same path but ending in “.lbl” or “.LBL”) is read instead.
chars (int, optional) – Initial number of characters to read from the top of a binary file when extracting the label. Reads will continue until the END statement is found.
- Returns:
The content of the label as a single string with newline terminators.
- Return type:
str
- Raises:
FileNotFoundError – If the label file is missing.
SyntaxError – If the END statement is not found in a binary file.
Notes
If the filepath ends in “.lbl” or “.LBL”, it is assumed to refer to a detached label and the entire file content is returned. Otherwise, it reads the file (which may be binary) until it finds an “END” statement.
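The idea of reading a possibly binary file in pieces until the END statement appears can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation of read_label(): the function name read_attached_label is hypothetical, END is assumed to stand alone on a line followed by a line terminator, and the bytes are decoded as Latin-1.

```python
import io
import re

# Assumed form of an END statement: the word END alone on a line.
END_LINE = re.compile(rb'(?:^|\n)[ \t]*END[ \t]*\r?\n')

def read_attached_label(stream, chars=4000):
    """Read a binary stream in pieces until an END line is found."""
    buffer = b''
    while True:
        chunk = stream.read(chars)
        buffer += chunk
        match = END_LINE.search(buffer)
        if match:
            # Keep everything through END; normalize <CR><LF> to newline.
            text = buffer[:match.end()].decode('latin-1')
            return text.replace('\r\n', '\n')
        if not chunk:
            raise SyntaxError('END statement not found')

# Made-up file content: a short label followed by binary data.
data = (b'PDS_VERSION_ID = PDS3\r\nOBJECT = IMAGE\r\n'
        b'END_OBJECT = IMAGE\r\nEND\r\n' + bytes(range(256)))
label_text = read_attached_label(io.BytesIO(data), chars=16)
```

Note that END_OBJECT does not end the search, because the pattern requires END to be the only word on its line.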
- pdsparser.utils.read_vax_binary_label(filepath)
Read an attached PDS3 label from a Vax binary file that uses variable-length records.
- Parameters:
filepath (str, pathlib.Path, or filecache.FCPath) – The path to the file. A detached label (ending in “.lbl” or “.LBL”) is read using “stream” format; any other file is read assuming Vax variable-length format (in which the first two bytes of each record contain the length of the remaining record). If the file does not contain a PDS3 label, a detached label (with the same path but ending in “.lbl” or “.LBL”) is read instead.
- Returns:
The content of the label as a single string with newline terminators.
- Return type:
str
- Raises:
FileNotFoundError – If the label file is missing.
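To make the record layout concrete, here is a small sketch that splits a byte string into records using a two-byte length prefix. It is not the module’s implementation. Two details are assumptions that this document does not confirm: the length is read as a little-endian unsigned 16-bit integer, and odd-length records are followed by one padding byte. Both function names are hypothetical.

```python
import struct

def split_vax_records(data):
    """Split bytes in variable-length record format into a list of records."""
    records = []
    pos = 0
    while pos + 2 <= len(data):
        # First two bytes: length of the rest of the record (assumed '<H').
        (length,) = struct.unpack_from('<H', data, pos)
        pos += 2
        records.append(data[pos:pos + length])
        # Assumed: records are padded to an even number of bytes.
        pos += length + (length % 2)
    return records

def pack_vax_records(records):
    """Inverse of the above; used here only to build demonstration data."""
    return b''.join(struct.pack('<H', len(r)) + r + b'\x00' * (len(r) % 2)
                    for r in records)

lines = [b'PDS_VERSION_ID = PDS3', b'RECORD_TYPE = VARIABLE_LENGTH', b'END']
recovered = split_vax_records(pack_vax_records(lines))
label_text = '\n'.join(r.decode('latin-1') for r in recovered) + '\n'
```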
- pdsparser.utils.expand_structures(content, fmt_dirs=[], *, repairs=[], label_path=None)
Replace any ^STRUCTURE keywords in the label with the content of the associated “.FMT” files.
- Parameters:
fmt_dirs (str, pathlib.Path, filecache.FCPath, or list, optional) – One or more directory paths to search for the “.FMT” files.
repairs (tuple or list[tuple]) – One or more two-element tuples of the form (pattern, replacement), where the first item is a regular expression and the second is the string with which to replace it. These repair patterns are applied to the label content before it is parsed, and make it possible to repair known syntax errors.
label_path (str, pathlib.Path, filecache.FCPath, optional) – The path to the label file from which the content was obtained; if provided, the parent directory of this file is the first to be searched for .FMT files.
- Returns:
The revised content string.
- Return type:
str
- Raises:
FileNotFoundError – If a referenced .FMT file cannot be found in any of the directories specified.
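The general idea of structure expansion can be sketched with a regular expression and a directory search. This is a simplified illustration, not the code behind expand_structures(): the function name is hypothetical, the pointer is assumed to occupy a line of its own with a quoted file name, repairs are not applied, and ^STRUCTURE pointers nested inside an included file are not followed.

```python
import re
import tempfile
from pathlib import Path

# Matches a whole line of the assumed form:  ^STRUCTURE = "NAME.FMT"
STRUCTURE_LINE = re.compile(
    r'^[ \t]*\^STRUCTURE[ \t]*=[ \t]*"([^"]+)"[ \t]*\r?\n', re.MULTILINE)

def expand_structures_sketch(content, fmt_dirs):
    """Replace each ^STRUCTURE line with the text of the first matching file."""
    def replace(match):
        name = match.group(1)
        for directory in fmt_dirs:
            path = Path(directory) / name
            if path.is_file():
                text = path.read_text()
                # Keep the label line-oriented after the substitution.
                return text if text.endswith('\n') else text + '\n'
        raise FileNotFoundError(name)
    return STRUCTURE_LINE.sub(replace, content)

# Demonstration with a temporary directory standing in for a format directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'COLUMNS.FMT').write_text('OBJECT = COLUMN\nEND_OBJECT = COLUMN\n')
    label = ('OBJECT = TABLE\n^STRUCTURE = "COLUMNS.FMT"\n'
             'END_OBJECT = TABLE\nEND\n')
    expanded = expand_structures_sketch(label, [tmp])
```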