Advanced Functionalities

NOAAStudy (pyleotups.utils.NOAAStudy)

class pyleotups.utils.NOAAStudy.NOAAStudy(study_data)[source]

This class encapsulates study metadata and its related components (e.g. publications, sites) retrieved from the NOAA API.

study_id

The unique NOAA study identifier.

Type:

str

xml_id

The XML identifier of the study.

Type:

str

metadata

A dictionary containing basic metadata such as studyName, dataType, earliestYearBP, etc.

Type:

dict

investigators

A comma-separated string of investigator names.

Type:

str

publications

A list of Publication objects associated with the study.

Type:

list of Publication

sites

A list of Site objects associated with the study.

Type:

list of Site

to_dict()[source]

Convert the study data and its components to a dictionary.

Returns:

A dictionary representing the study including metadata, investigators, publications, and sites.

Return type:

dict

PaleoData (pyleotups.utils.PaleoData)

class pyleotups.utils.PaleoData.PaleoData(paleo_data, study_id, site_id)[source]

Represents paleo data associated with a site, including multiple data files and full variable metadata per file.

datatable_id

Unique NOAA data table identifier.

Type:

str

dataTableName

Name of the data table.

Type:

str

timeUnit

Time unit used in the data table.

Type:

str

files

List of raw file info dicts.

Type:

list of dict

file_variable_map

Maps fileUrl to a dict of variables and their full metadata.

Type:

dict

file_url

Shortcut to first file URL (for backward compatibility).

Type:

str or np.nan

variables

Shortcut to variable names in first file (for backward compatibility).

Type:

list of str

to_dict(file_obj=None)[source]

Convert PaleoData into a dictionary, optionally for a specific file.

Parameters:

file_obj (dict, optional) – Specific file object (default is first file).

Returns:

Dictionary of core metadata for one file.

Return type:

dict

Publication (pyleotups.utils.Publication)

class pyleotups.utils.Publication.Publication(pub_data)[source]

Represents a publication within a study.

author

The name of the author(s) of the publication.

Type:

str

title

The title of the publication.

Type:

str

journal

The journal where the publication appeared.

Type:

str

year

The publication year.

Type:

str

volume

The volume number (if applicable).

Type:

str or None

number

The issue number (if applicable).

Type:

str or None

pages

The page numbers (if applicable).

Type:

str or None

pub_type

The type of publication.

Type:

str or None

doi

The Digital Object Identifier.

Type:

str or None

url

URL for the publication.

Type:

str or None

study_id

The NOAA study ID to which this publication belongs.

Type:

str or None

get_citation_key()[source]

Generate a unique citation key for the publication.

Returns:

A citation key in the format: “<LastName>_<FirstSignificantWord>_<Year>_<StudyID>”.

Return type:

str
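
The key format above can be illustrated with a small standalone sketch. This is not the library's implementation: `make_citation_key`, the `STOP_WORDS` list, and the assumed "Last, F.M.; Last, F.M." author layout are all hypothetical, and the way pyleotups picks the "first significant word" may differ.

```python
import re

# Hypothetical stop-word list; the library's notion of a "significant" word may differ.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "and", "for", "to"}


def make_citation_key(author, title, year, study_id):
    """Build '<LastName>_<FirstSignificantWord>_<Year>_<StudyID>' (illustrative only)."""
    # Assumption: authors are separated by ';' and written as "Last, Initials".
    first_author = author.split(";")[0].strip()
    last_name = first_author.split(",")[0].strip().replace(" ", "")
    # Take the first title word that is not in the stop-word list.
    words = re.findall(r"[A-Za-z0-9]+", title)
    significant = next((w for w in words if w.lower() not in STOP_WORDS), "Untitled")
    return f"{last_name}_{significant}_{year}_{study_id}"


key = make_citation_key(
    "Smith, J.A.; Doe, R.", "The Holocene record of Lake X", "2015", "12345"
)
print(key)
```

Including the study ID in the key keeps two publications by the same author and year distinct when they belong to different studies.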

to_dict()[source]

Convert the publication data into a dictionary.

Returns:

A dictionary representation of the publication.

Return type:

dict

Site (pyleotups.utils.Site)

class pyleotups.utils.Site.Site(site_data, study_id)[source]

Represents a site within a study.

to_dict()[source]

Convert the site into a list of dictionaries, one per PaleoData file.

PangaeaStudy (pyleotups.utils.PangaeaStudy)

class pyleotups.utils.PangaeaStudy.PangaeaStudy(study_id: str, cache_dir: str | None = None, auth_token: str | None = None)[source]

Utility class representing a single PANGAEA study.

This class wraps a persistent pangaeapy.PanDataSet instance and provides:

  • Lazy data loading

  • NOAA-style summary normalization

  • Geographic extraction

  • Deep publication parsing (including supplement handling)

Parameters:
  • study_id (str) – DOI, URI, or identifier of the PANGAEA dataset.

  • cache_dir (str or None, optional) – Directory for pangaeapy cache.

  • auth_token (str or None, optional) – PANGAEA authentication token for restricted datasets.

get_data() → DataFrame[source]

Retrieve the dataset as a pandas DataFrame.

Returns:

Copy of the dataset table with metadata stored in df.attrs.

Return type:

pandas.DataFrame

get_funding() → DataFrame[source]

Retrieve funding information for this study.

Returns:

DataFrame with columns: [‘StudyID’, ‘StudyName’, ‘FundingAgency’, ‘FundingGrant’].

If no funding metadata is available, returns an empty DataFrame with columns preserved.

Return type:

pandas.DataFrame

get_geo() → DataFrame[source]

Retrieve geographic metadata for study events.

Returns:

DataFrame containing event-level geographic information.

Return type:

pandas.DataFrame

get_variables() → DataFrame[source]

Retrieve variable (parameter) metadata for this study.

Returns:

One row per parameter with the following columns:

  • StudyID

  • VariableName

  • ShortName

  • Unit

  • OntologyTerms

Return type:

pandas.DataFrame

Notes

For collection datasets, this returns an empty DataFrame.

to_summary_dict() → Dict[str, Any][source]

Convert study metadata to NOAA-style summary dictionary.

Returns:

Dictionary with standardized summary fields.

Return type:

dict

Parsers

StandardParser (pyleotups.utils.Parser.StandardParser)

class pyleotups.utils.Parser.StandardParser(url=None)[source]

StandardParser parses NOAA .txt data files in the standard format. Standard format refers to the NOAA templated file layout: metadata on lines starting with #, variables on lines starting with ##, and tab-delimited data.

url

URL of the file to parse.

Type:

str

lines

Fetched lines from file.

Type:

list of str

meta_start

Index where metadata block starts.

Type:

int

meta_end

Index where metadata block ends.

Type:

int

variables

Extracted variable names.

Type:

list of str

skip_lines

Lines to skip after metadata to reach data.

Type:

int

data

Parsed data rows.

Type:

list of list of str

df

Final constructed dataframe.

Type:

pandas.DataFrame

parse(url=None)[source]

Public method to parse the NOAA file.

Parameters:

url (str, optional) – URL to override the existing one.

Return type:

pandas.DataFrame
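
To make the standard layout concrete, here is a minimal sketch that separates the three kinds of lines. It is not StandardParser itself: `parse_templated_text` and the sample text are hypothetical, and the sketch assumes the variable short name is the first tab-separated field on each `##` line.

```python
def parse_templated_text(text):
    """Split NOAA-template-style text into metadata, variable names, and rows."""
    metadata, variables, rows = [], [], []
    for line in text.splitlines():
        if line.startswith("##"):
            # Variable line; assumption: the short name is the first tab-separated field.
            variables.append(line[2:].strip().split("\t")[0])
        elif line.startswith("#"):
            # Any other comment line is treated as metadata.
            metadata.append(line[1:].strip())
        elif line.strip():
            # Everything else is tab-delimited data (the first such row is the header).
            rows.append(line.split("\t"))
    return metadata, variables, rows


sample = "\n".join([
    "# Study_Name: Example core",
    "## age\tage,,,calendar years BP",
    "## d18O\tdelta 18O,,,per mil",
    "age\td18O",
    "100\t-1.2",
    "200\t-1.5",
])
metadata, variables, rows = parse_templated_text(sample)
header, data = rows[0], rows[1:]
print(variables, header, data)
```

The rows stay as lists of strings here; turning them into a DataFrame and casting types is left out to keep the sketch dependency-free.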

NonStandardParser (pyleotups.utils.Parser.NonStandardParser)

class pyleotups.utils.Parser.NonStandardParser(file_path, use_skip=True, use_refinement=True)[source]

Parses non-standard, fixed-width, or misaligned text files (like those from NOAA) into a structured list of Blocks, each potentially containing a pandas DataFrame.

The parser uses statistical heuristics to classify contiguous blocks of text and then applies different parsing strategies based on the classification.

file_path

The local path or URL to the file being parsed.

Type:

str

use_skip

Whether to skip to the “DATA:” descriptor in the file.

Type:

bool

lines

A list of all lines read from the file.

Type:

list[str]

blocks

The final list of processed Block objects.

Type:

list[Block]

Notes

The parsing workflow is as follows:

  • A NonStandardParser instance is created with a file_path.

  • The public parse() method is called.

  • _fetch_lines() reads the file into self.lines.

  • _segregate_blocks() splits self.lines into Block objects (groups of contiguous non-empty lines) and saves them to self.blocks.

  • parse() iterates through each block in self.blocks.

  • _process_block() is called on each block. It (a) computes statistics for the block, (b) classifies it (e.g., TABULAR, DATA, NARRATIVE), and (c) dispatches to a specific parsing method (e.g., _parse_tabular_block).

  • The specific parse methods (e.g., _parse_data_block) handle logic for header borrowing, DataFrame generation, and error handling, modifying the block object in place.

  • parse() returns the fully processed self.blocks list.
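
The block-segregation step in this workflow can be sketched in a few lines. This is an illustration of the idea only, not the body of _segregate_blocks(): `segregate_blocks` is a hypothetical helper that returns plain `(idx, start, end)` tuples with an inclusive `end`, rather than Block objects.

```python
def segregate_blocks(lines):
    """Group contiguous non-empty lines; return (idx, start, end) with end inclusive."""
    blocks, start = [], None
    for i, line in enumerate(lines):
        if line.strip():
            if start is None:
                start = i  # first non-empty line opens a new block
        elif start is not None:
            blocks.append((len(blocks), start, i - 1))  # blank line closes the block
            start = None
    if start is not None:
        # The file ended while a block was still open.
        blocks.append((len(blocks), start, len(lines) - 1))
    return blocks


lines = ["Title", "", "a  b", "1  2", "3  4", "", "", "note"]
blocks = segregate_blocks(lines)
print(blocks)
```

Consecutive blank lines produce no empty blocks, so each tuple always covers at least one line.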

parse()[source]

Executes the full parsing workflow on the file.

Returns:

A list of processed Block objects. Each block may contain a DataFrame (block.df) if parsing was successful, or an error message (block.error_message) if it failed.

Return type:

list[Block]

Raises:
  • ValueError – If use_skip is True and no “DATA:” line is found.

  • requests.exceptions.RequestException – If file_path is a URL and the file cannot be fetched.

NonStandardParserUtils (pyleotups.utils.Parser.NonStandardParserUtils)

class pyleotups.utils.Parser.NonStandardParserUtils.Block(idx, start, end)[source]

Represents a contiguous block of non-empty lines from the source file.

This is the main data structure used by the parser, holding the lines, their classified type, and the resulting parsed DataFrame.

idx

The sequential index (0, 1, 2…) of the block in the file.

Type:

int

start

The starting line number (index) of this block in the source file.

Type:

int

end

The ending line number (index) of this block in the source file.

Type:

int

lines

A list of LineInfo objects contained within this block.

Type:

list[LineInfo]

block_type

The classified type of the block (e.g., NARRATIVE, TABULAR).

Type:

BlockType

headers

A list of header dictionaries, where each dict contains:

  • “name” (str): The parsed header name.

  • “interval” (tuple[int, int]): The (start, end) char position.

Type:

list[dict]

title

A potential title line detected above the headers.

Type:

str or None

stats

Aggregated statistics computed for the entire block.

Type:

dict

header_extent

The number of lines detected as being part of the header.

Type:

int

delimiter

The regex string of the delimiter chosen for this block.

Type:

str or None

df

The resulting pandas DataFrame if parsing was successful.

Type:

pd.DataFrame or None

used_as_header_for

A list of block indices that successfully borrowed this block’s headers.

Type:

list[int]

class pyleotups.utils.Parser.NonStandardParserUtils.BlockType(*values)[source]

Enumeration for the different types a Block can be classified as.

class pyleotups.utils.Parser.NonStandardParserUtils.LineInfo(idx, text)[source]

Holds the text and pre-computed statistics for a single line.

idx

The original line number (index) from the source file.

Type:

int

text

The raw text of the line.

Type:

str

line_len

The character length of the line.

Type:

int

count_single_tokens

Token count using a single-space delimiter (r"\s+").

Type:

int

count_multispace_tokens

Token count using a multi-space delimiter (r"(\s{2,})").

Type:

int

count_tab_tokens

Token count using a tab delimiter (r"\t").

Type:

int

numeric_single_ratio

Ratio of numeric tokens (0.0 to 1.0) using r"\s+".

Type:

float

numeric_multispace_ratio

Ratio of numeric tokens (0.0 to 1.0) using r"(\s{2,})".

Type:

float

numeric_tab_ratio

Ratio of numeric tokens (0.0 to 1.0) using r"\t".

Type:

float
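
The per-line statistics listed above can be sketched as follows, to show how the same line looks under the three delimiters. `LineStats`, `split_tokens`, and `looks_numeric` are hypothetical names, the numeric test is a plain `float()` check rather than the library's is_numeric, and the multi-space pattern is written without a capture group so that `re.split` does not return the delimiters.

```python
import re
from dataclasses import dataclass, field

# Assumed delimiter set, mirroring the single-space, multi-space, and tab variants.
DELIMITERS = {"single": r"\s+", "multispace": r"\s{2,}", "tab": r"\t"}


def split_tokens(text, pattern):
    """Split on a regex delimiter and drop empty tokens."""
    return [t for t in re.split(pattern, text.strip()) if t]


def looks_numeric(token):
    """Simplified numeric test: anything float() accepts."""
    try:
        float(token)
        return True
    except ValueError:
        return False


@dataclass
class LineStats:
    text: str
    counts: dict = field(default_factory=dict)
    ratios: dict = field(default_factory=dict)

    def __post_init__(self):
        for name, pattern in DELIMITERS.items():
            tokens = split_tokens(self.text, pattern)
            self.counts[name] = len(tokens)
            numeric = sum(looks_numeric(t) for t in tokens)
            self.ratios[name] = numeric / len(tokens) if tokens else 0.0


stats = LineStats("Lake A   12.5   30")
print(stats.counts, stats.ratios)
```

On this line the multi-space split keeps "Lake A" together as one token, which is why comparing counts across delimiters is informative for fixed-width text.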

pyleotups.utils.Parser.NonStandardParserUtils.assign_tokens_by_overlap(lines_info, delimiter, headers, header_extent=0)[source]

Generates a DataFrame by assigning tokens based on character-level overlap.

This is a fallback for misaligned data. It proceeds in two stages:

  • Assigns a token to the header with the maximum overlap.

  • If there is no overlap, assigns it to the header with the minimum distance (closest neighbor).

Parameters:
  • lines_info (list[LineInfo]) – The list of LineInfo objects to parse.

  • delimiter (str) – The regex delimiter to split lines.

  • headers (list[dict]) – The list of header objects (must contain “name” and “interval”).

  • header_extent (int, optional) – The number of lines to skip from the start of lines_info. Defaults to 0.

Returns:

The parsed DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If delimiter or headers are missing or malformed.
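
The two-stage assignment can be sketched for a single token as below. This is a simplified illustration, not the library function: `overlap`, `distance`, and `assign_token` are hypothetical helpers, intervals are assumed to be half-open `(start, end)` character positions, and no DataFrame is built.

```python
def overlap(a, b):
    """Number of shared character positions between half-open intervals a and b."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def distance(a, b):
    """Gap between two half-open intervals (0 if they touch or overlap)."""
    return max(a[0] - b[1], b[0] - a[1], 0)


def assign_token(token_interval, headers):
    """Return the name of the header a token belongs to."""
    # Stage 1: pick the header with the largest character overlap.
    best = max(headers, key=lambda h: overlap(token_interval, h["interval"]))
    if overlap(token_interval, best["interval"]) > 0:
        return best["name"]
    # Stage 2: no overlap anywhere, so fall back to the closest header.
    nearest = min(headers, key=lambda h: distance(token_interval, h["interval"]))
    return nearest["name"]


headers = [
    {"name": "Depth", "interval": (0, 5)},
    {"name": "Age", "interval": (10, 13)},
]
print(assign_token((3, 8), headers))  # overlaps "Depth"
print(assign_token((7, 9), headers))  # overlaps nothing; "Age" is closer
```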

pyleotups.utils.Parser.NonStandardParserUtils.auto_cast_df(df: DataFrame) → DataFrame[source]

Attempt to convert object columns to numeric where possible. Leaves non-convertible columns unchanged.

pyleotups.utils.Parser.NonStandardParserUtils.compute_interval_overlap(interval1, interval2)[source]

Calculates the number of overlapping characters between two intervals.

pyleotups.utils.Parser.NonStandardParserUtils.count_tokens(line, delimiter)[source]

Counts non-empty tokens in a line given a regex delimiter.

pyleotups.utils.Parser.NonStandardParserUtils.generate_df(lines_info, delimiter, headers, header_extent=0)[source]

Generates a DataFrame using a simple split, assuming columns align.

Parameters:
  • lines_info (list[LineInfo]) – The list of LineInfo objects to parse.

  • delimiter (str) – The regex delimiter to split lines.

  • headers (list[dict]) – The list of header objects (must contain “name”).

  • header_extent (int, optional) – The number of lines to skip from the start of lines_info. Defaults to 0.

Returns:

The parsed DataFrame.

Return type:

pd.DataFrame

Raises:
  • ValueError – If delimiter or headers are missing.

  • ValueError – If the number of tokens in a data row does not match the number of headers (and data rows exist).
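
The simple-split strategy and its two error conditions can be sketched without pandas. `split_rows` is a hypothetical stand-in: it takes plain strings instead of LineInfo objects and returns a list of dicts (which could be passed to `pandas.DataFrame`) rather than a DataFrame.

```python
import re


def split_rows(lines, delimiter, headers, header_extent=0):
    """Split each data line on `delimiter` and pair tokens with header names."""
    if not delimiter or not headers:
        raise ValueError("delimiter and headers are required")
    names = [h["name"] for h in headers]
    records = []
    for text in lines[header_extent:]:  # skip the header lines at the top
        tokens = [t for t in re.split(delimiter, text.strip()) if t]
        if len(tokens) != len(names):
            raise ValueError(
                f"expected {len(names)} tokens, got {len(tokens)}: {text!r}"
            )
        records.append(dict(zip(names, tokens)))
    return records


headers = [{"name": "Depth"}, {"name": "Age"}]
lines = ["Depth  Age", "10  1200", "20  2400"]
records = split_rows(lines, r"\s{2,}", headers, header_extent=1)
print(records)
```

Raising on a token-count mismatch, instead of padding or truncating, gives the caller a clear signal to try a different strategy such as overlap-based assignment.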

pyleotups.utils.Parser.NonStandardParserUtils.generate_row_pattern(tokens)[source]

Generates a string pattern (‘N’ for numeric, ‘S’ for string) for a list of tokens.
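
A row pattern of this kind might be built as in the sketch below. `row_pattern` is a hypothetical helper; it assumes the pattern is one character per token joined into a single string, and it uses a plain `float()` check in place of the library's is_numeric.

```python
def row_pattern(tokens):
    """Return one character per token: 'N' for numeric, 'S' for string."""

    def numeric(token):
        try:
            float(token)
            return True
        except ValueError:
            return False

    return "".join("N" if numeric(t) else "S" for t in tokens)


header_pattern = row_pattern(["Site", "Depth", "Age"])
data_pattern = row_pattern(["Lake A", "12.5", "30"])
print(header_pattern, data_pattern)
```

Comparing patterns across rows is one plausible way to tell an all-string header row from the data rows beneath it.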

pyleotups.utils.Parser.NonStandardParserUtils.get_token_intervals_multi(line, delimiter)[source]

Splits a line by a regex delimiter and returns token intervals.

Parameters:
  • line (str) – The line to parse.

  • delimiter (str) – The regex delimiter string (e.g., r"(\s{2,})").

Returns:

A list of token dictionaries, each with:

  • “key” (str): A unique key for the token.

  • “display” (str): The stripped token text.

  • “interval” (tuple[int, int]): The (start, end) char position.

Return type:

list[dict]
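
The following sketch shows one way to recover token text together with character positions. It is not the library's implementation: `token_intervals` is hypothetical, the `"key"` format (`text@start`) is invented for illustration, and intervals are half-open `(start, end)` offsets into the line.

```python
import re


def token_intervals(line, delimiter=r"\s{2,}"):
    """Split `line` on a regex delimiter and record where each token sits."""
    spans, pos = [], 0
    for match in re.finditer(delimiter, line):
        if match.start() > pos:
            spans.append((pos, match.start()))  # text between two delimiters
        pos = match.end()
    if pos < len(line):
        spans.append((pos, len(line)))  # trailing token after the last delimiter

    tokens = []
    for start, end in spans:
        raw = line[start:end]
        display = raw.strip()
        if not display:
            continue
        # Tighten the interval so it covers only the stripped text.
        begin = start + (len(raw) - len(raw.lstrip()))
        tokens.append({
            "key": f"{display}@{begin}",  # hypothetical key format
            "display": display,
            "interval": (begin, begin + len(display)),
        })
    return tokens


line = "Depth" + " " * 3 + "Age" + " " * 2 + "d18O"
tokens = token_intervals(line)
print([(t["display"], t["interval"]) for t in tokens])
```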

pyleotups.utils.Parser.NonStandardParserUtils.intervals_overlap(interval1, interval2)[source]

Checks if two intervals overlap at all.

pyleotups.utils.Parser.NonStandardParserUtils.is_numeric(token)[source]

Robustly checks if a token is numeric.

Handles plain numbers, ranges (e.g., ‘10-20’), values with uncertainty (e.g., ‘1.5 ± 0.1’ or ‘1.5±0.1’), and wrapped values (e.g., ‘(10)’ or ‘6.80 (8.98)’).

Parameters:

token (str) – The string token to check.

Returns:

True if the token is considered numeric, False otherwise.

Return type:

bool
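
A regex-based check covering the cases listed above might look like the sketch below. The patterns and the name `looks_numeric` are assumptions made for illustration; the actual is_numeric may accept more or fewer forms.

```python
import re

# A signed decimal number with an optional exponent.
NUMBER = r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?"

PATTERNS = [
    re.compile(rf"^{NUMBER}$"),                          # plain number: 12.5
    re.compile(rf"^{NUMBER}\s*-\s*{NUMBER}$"),           # range: 10-20
    re.compile(rf"^{NUMBER}\s*(?:±|\+/-)\s*{NUMBER}$"),  # uncertainty: 1.5 ± 0.1
    re.compile(rf"^\({NUMBER}\)$"),                      # wrapped: (10)
    re.compile(rf"^{NUMBER}\s*\({NUMBER}\)$"),           # value plus wrapped: 6.80 (8.98)
]


def looks_numeric(token):
    """Return True if the token matches any of the numeric forms above."""
    token = token.strip()
    return any(p.match(token) for p in PATTERNS)


for example in ["12.5", "10-20", "1.5 ± 0.1", "(10)", "6.80 (8.98)", "Lake A"]:
    print(example, looks_numeric(example))
```

Treating ranges and bracketed values as numeric matters for block classification: a row such as "10-20  1.5 ± 0.1" should still count as data rather than narrative text.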

pyleotups.utils.Parser.NonStandardParserUtils.merge_headers_by_overlap(token_maps)[source]

Merges multiple lines of header tokens into a single header list.

Used for multi-line headers, where tokens from subsequent lines are merged into the first line’s headers based on character overlap.

Parameters:

token_maps (list[list[dict]]) – A list where each item is the output of get_token_intervals_multi for one header line.

Returns:

A single list of merged header dictionaries.

Return type:

list[dict]
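
Merging a two-line header by overlap can be sketched as follows. `merge_header_lines` is a hypothetical helper: it appends each lower-line token to the first-line header it overlaps most, widens that header's interval, and simply skips tokens that overlap nothing, which may differ from what the library does.

```python
def overlap(a, b):
    """Number of shared character positions between half-open intervals a and b."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def merge_header_lines(token_maps):
    """Fold tokens from later header lines into the first line's headers."""
    if not token_maps or not token_maps[0]:
        return []
    first, *rest = token_maps
    merged = [{"name": t["display"], "interval": t["interval"]} for t in first]
    for line_tokens in rest:
        for tok in line_tokens:
            best = max(merged, key=lambda h: overlap(h["interval"], tok["interval"]))
            if overlap(best["interval"], tok["interval"]) == 0:
                continue  # simplification: tokens with no overlap are dropped
            best["name"] = f'{best["name"]} {tok["display"]}'
            best["interval"] = (
                min(best["interval"][0], tok["interval"][0]),
                max(best["interval"][1], tok["interval"][1]),
            )
    return merged


line1 = [{"display": "Depth", "interval": (0, 5)}, {"display": "Age", "interval": (8, 11)}]
line2 = [{"display": "(cm)", "interval": (0, 4)}, {"display": "(yr BP)", "interval": (8, 15)}]
merged = merge_header_lines([line1, line2])
print(merged)
```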

pyleotups.utils.Parser.NonStandardParserUtils.numeric_ratio(line, delimiter)[source]

Calculates the ratio of numeric tokens in a line.

pyleotups.utils.Parser.NonStandardParserUtils.refine_headers_by_correspondence(header_lines, data_lines, delimiter, broadcast_identical=False)[source]

Refines headers by analyzing the physical layout (vertical alignment) of the data lines.

It creates a density mask (histogram) of the data to find physical columns, then maps the header tokens to these physical columns. If multiple distinct header tokens map to a single wide data column, it forces a split (preserving granular headers). If adjacent data columns share the exact same header identity, it merges them (unless broadcast_identical is True).

Parameters:
  • header_lines (list[LineInfo]) – The lines identified as headers.

  • data_lines (list[LineInfo]) – The lines identified as data.

  • delimiter (str) – The regex delimiter used to tokenize the lines.

  • broadcast_identical (bool, optional) – If True, adjacent columns with identical headers are kept separate (suffixed). If False (default), they are merged into one column.

Returns:

A list of refined header dictionaries containing “name” and “interval”. Returns None if refinement is not possible (e.g., no data lines).

Return type:

list[dict] or None
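
The density-mask idea can be sketched in isolation. This covers only the first step (finding physical columns from the data lines); the header mapping, splitting, and merging described above are not shown. `physical_columns` is a hypothetical helper that works on plain strings and treats any character position occupied in at least one line as part of a column.

```python
def physical_columns(data_lines):
    """Return (density, columns) where columns are half-open character ranges."""
    width = max((len(line) for line in data_lines), default=0)
    density = [0] * width
    for line in data_lines:
        for i, ch in enumerate(line):
            if not ch.isspace():
                density[i] += 1  # count lines with a non-space character here

    columns, start = [], None
    for i, count in enumerate(density):
        if count and start is None:
            start = i  # a run of occupied positions begins
        elif not count and start is not None:
            columns.append((start, i))  # an empty position ends the run
            start = None
    if start is not None:
        columns.append((start, width))
    return density, columns


data = ["10" + " " * 4 + "1200", "200" + " " * 3 + "98"]
density, columns = physical_columns(data)
print(density)
print(columns)
```

With only a few data lines, a value containing a single space can split one column into two, which is one reason a real implementation would need additional rules on top of this mask.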

ExcelParser (pyleotups.utils.Parser.ExcelParser)

class pyleotups.utils.Parser.ExcelParser(file_path: str)[source]

Parses Excel files by detecting contiguous blocks of non-empty cells and converting them into structured DataFrames.

It handles:

  • Spatial segmentation (BFS) to find tables.

  • Merged cell propagation.

  • Statistical header detection.

  • Multi-row header merging.

  • Header borrowing for data-only blocks.

file_path

The local path or URL to the Excel file.

Type:

str

sheets

The loaded sheets content.

Type:

List[SheetGrid]

blocks

The segregated and processed blocks.

Type:

List[Block]

_header_registry

Internal registry for header borrowing logic.

Type:

Dict[str, List[Block]]

parse() → List[Block][source]

Executes the full parsing workflow.

Returns:

A list of processed Block objects, potentially containing DataFrames.

Return type:

List[Block]
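
The spatial segmentation step can be illustrated with a breadth-first search over a grid of cell values. This is a sketch under assumptions, not ExcelParser's code: `find_blocks` is hypothetical, the sheet is modelled as a list of lists, cells are connected through their four direct neighbours, and each block is reported as a bounding box `(min_row, min_col, max_row, max_col)`.

```python
from collections import deque


def find_blocks(grid):
    """Return bounding boxes of connected groups of non-empty cells."""
    rows = len(grid)
    cols = max((len(row) for row in grid), default=0)

    def filled(r, c):
        return c < len(grid[r]) and grid[r][c] not in (None, "")

    seen, blocks = set(), []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or not filled(r, c):
                continue
            # Breadth-first search from this unvisited, non-empty cell.
            queue, cells = deque([(r, c)]), []
            seen.add((r, c))
            while queue:
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and (nr, nc) not in seen and filled(nr, nc)):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            rs = [p[0] for p in cells]
            cs = [p[1] for p in cells]
            blocks.append((min(rs), min(cs), max(rs), max(cs)))
    return blocks


grid = [
    ["Depth", "Age", None, None],
    [10, 1200, None, "Note"],
    [None, None, None, None],
    ["Site", "Lat", None, None],
]
blocks = find_blocks(grid)
print(blocks)
```

An empty row or column is enough to separate two tables here, which is why a lone annotation cell such as "Note" ends up as its own block.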