Advanced Functionalities

NOAAStudy (pyleotups.utils.NOAAStudy)

class pyleotups.utils.NOAAStudy.NOAAStudy(study_data)[source]

This class encapsulates study metadata and its related components (e.g. publications, sites) retrieved from the NOAA API.

study_id

The unique NOAA study identifier.

Type:: str

xml_id

The XML identifier of the study.

Type:: str

metadata

A dictionary containing basic metadata such as studyName, dataType, earliestYearBP, etc.

Type:: dict

investigators

A comma-separated string of investigator names.

Type:: str

publications

A list of Publication objects associated with the study.

Type:: list of Publication

sites

A list of Site objects associated with the study.

Type:: list of Site

to_dict()[source]

Convert the study data and its components to a dictionary.

Returns:: A dictionary representing the study including metadata, investigators, publications, and sites.
Return type:: dict

PaleoData (pyleotups.utils.PaleoData)

class pyleotups.utils.PaleoData.PaleoData(paleo_data, study_id, site_id)[source]

Represents paleo data associated with a site, including multiple data files and full variable metadata per file.

datatable_id

Unique NOAA data table identifier.

Type:: str

dataTableName

Name of the data table.

Type:: str

timeUnit

Time unit used in the data table.

Type:: str

files

List of raw file info dicts.

Type:: list of dict

file_variable_map

Maps fileUrl to a dict of variables and their full metadata.

Type:: dict

file_url

Shortcut to first file URL (for backward compatibility).

Type:: str or np.nan

variables

Shortcut to variable names in first file (for backward compatibility).

Type:: list of str

to_dict(file_obj=None)[source]

Convert PaleoData into a dictionary, optionally for a specific file.

Parameters:: file_obj (dict, optional) – Specific file object (default is first file).
Returns:: Dictionary of core metadata for one file.
Return type:: dict

Publication (pyleotups.utils.Publication)

class pyleotups.utils.Publication.Publication(pub_data)[source]

Represents a publication within a study.

author

The name of the author(s) of the publication.

Type:: str

title

The title of the publication.

Type:: str

journal

The journal where the publication appeared.

Type:: str

year

The publication year.

Type:: str

volume

The volume number (if applicable).

Type:: str or None

number

The issue number (if applicable).

Type:: str or None

pages

The page numbers (if applicable).

Type:: str or None

pub_type

The type of publication.

Type:: str or None

doi

The Digital Object Identifier.

Type:: str or None

url

URL for the publication.

Type:: str or None

study_id

The NOAA study ID to which this publication belongs.

Type:: str or None

get_citation_key()[source]

Generate a unique citation key for the publication.

Returns:: A citation key in the format: “<LastName>_<FirstSignificantWord>_<Year>_<StudyID>”.
Return type:: str
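The key format above can be illustrated with a minimal sketch. This is not the library's implementation: the helper name sketch_citation_key, the stop-word list, and the assumption that the author string is written "Last, F.; Other, G." are all hypothetical, and the sample values are made up.

```python
import re

# Hypothetical stop-word list; how the library decides which word is
# "significant" is not stated in this reference.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "and", "for", "to"}


def sketch_citation_key(author, title, year, study_id):
    """Build '<LastName>_<FirstSignificantWord>_<Year>_<StudyID>' (illustrative only)."""
    # Assumption: the first author comes first and is written "Last, F."
    first_author = re.split(r"[;,]", author.strip())[0]
    parts = first_author.split()
    last_name = parts[-1] if parts else "Unknown"
    # First title word that is not in the stop-word list.
    words = re.findall(r"[A-Za-z0-9]+", title)
    significant = next((w for w in words if w.lower() not in STOP_WORDS), "Untitled")
    return f"{last_name}_{significant}_{year}_{study_id}"


# Made-up publication values, used only to show the shape of the key.
key = sketch_citation_key("Doe, J.; Roe, R.", "The coral record of a reef", "2001", "12345")
print(key)
```

Joining the four parts with underscores keeps the key readable while the study ID suffix keeps keys distinct across studies.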

to_dict()[source]

Convert the publication data into a dictionary.

Returns:: A dictionary representation of the publication.
Return type:: dict

Site (pyleotups.utils.Site)

class pyleotups.utils.Site.Site(site_data, study_id)[source]

Represents a site within a study.

to_dict()[source]: Convert the site into a list of dictionaries, one per PaleoData file.

PangaeaStudy (pyleotups.utils.PangaeaStudy)

class pyleotups.utils.PangaeaStudy.PangaeaStudy(study_id: str, cache_dir: str | None = None, auth_token: str | None = None)[source]

Utility class representing a single PANGAEA study.

This class wraps a persistent pangaeapy.PanDataSet instance and provides:
- Lazy data loading
- NOAA-style summary normalization
- Geographic extraction
- Deep publication parsing (including supplement handling)

Parameters:

study_id (str) – DOI, URI, or identifier of the PANGAEA dataset.
cache_dir (str or None, optional) – Directory for pangaeapy cache.
auth_token (str or None, optional) – PANGAEA authentication token for restricted datasets.

get_data() → DataFrame[source]

Retrieve the dataset as a pandas DataFrame.

Returns:: Copy of the dataset table with metadata stored in df.attrs.
Return type:: pandas.DataFrame

get_funding() → DataFrame[source]

Retrieve funding information for this study.

Returns:

DataFrame with columns: [‘StudyID’, ‘StudyName’, ‘FundingAgency’, ‘FundingGrant’].

If no funding metadata is available, returns an empty DataFrame with columns preserved.

Return type:

pandas.DataFrame

get_geo() → DataFrame[source]

Retrieve geographic metadata for study events.

Returns:: DataFrame containing event-level geographic information.
Return type:: pandas.DataFrame

get_variables() → DataFrame[source]

Retrieve variable (parameter) metadata for this study.

Returns:

One row per parameter with the following columns:

StudyID
VariableName
ShortName
Unit
OntologyTerms

Return type:

pandas.DataFrame

Notes

For collection datasets, this returns an empty DataFrame.

to_summary_dict() → Dict[str, Any][source]

Convert study metadata to NOAA-style summary dictionary.

Returns:: Dictionary with standardized summary fields.
Return type:: dict

Parsers

StandardParser (pyleotups.utils.Parser.StandardParser)

class pyleotups.utils.Parser.StandardParser(url=None)[source]

StandardParser parses NOAA .txt data files in the standard format, that is, the NOAA templated file with metadata on "#" lines, variables on "##" lines, and tab-delimited data.

url

URL of the file to parse.

Type:: str

lines

Fetched lines from file.

Type:: list of str

meta_start

Index where metadata block starts.

Type:: int

meta_end

Index where metadata block ends.

Type:: int

variables

Extracted variable names.

Type:: list of str

skip_lines

Lines to skip after metadata to reach data.

Type:: int

data

Parsed data rows.

Type:: list of list of str

df

Final constructed dataframe.

Type:: pandas.DataFrame

parse(url=None)[source]

Public method to parse the NOAA file.

Parameters:: url (str, optional) – URL to override the existing one.
Return type:: pandas.DataFrame
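As a rough illustration of the standard layout described for StandardParser, the sketch below splits a small in-memory sample into variable names and data rows using only the standard library. It is not the library's parser: the function name sketch_parse_standard and the sample text are invented, and it assumes the short variable name is the first tab-separated field on each "##" line.

```python
# Made-up sample in the templated layout: '#' metadata, '##' variables,
# then a tab-delimited table.
SAMPLE = "\n".join([
    "# Study_Name: Example record (made-up sample)",
    "# Variables",
    "## age\tage,,,calendar year before present,,,,,N",
    "## d18O\tdelta 18O,,,per mil,,,,,N",
    "# Data:",
    "age\td18O",
    "100\t-1.2",
    "200\t-1.5",
])


def sketch_parse_standard(text):
    """Return (variables, header, rows) from a templated text (illustrative only)."""
    lines = text.splitlines()
    # Assumption: the short name is the first tab-separated field after '##'.
    variables = [ln[2:].strip().split("\t")[0] for ln in lines if ln.startswith("##")]
    # Every non-empty line that does not start with '#' belongs to the table.
    table = [ln.split("\t") for ln in lines if ln.strip() and not ln.startswith("#")]
    header, rows = table[0], table[1:]
    return variables, header, rows


variables, header, rows = sketch_parse_standard(SAMPLE)
print(variables, header, rows)
```

The real parser returns a pandas.DataFrame; the sketch stops at lists of strings so that it runs without third-party packages.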

NonStandardParser (pyleotups.utils.Parser.NonStandardParser)

class pyleotups.utils.Parser.NonStandardParser(file_path, use_skip=True, use_refinement=True)[source]

Parses non-standard, fixed-width, or misaligned text files (like those from NOAA) into a structured list of Blocks, each potentially containing a pandas DataFrame.

The parser uses statistical heuristics to classify contiguous blocks of text and then applies different parsing strategies based on the classification.

file_path

The local path or URL to the file being parsed.

Type:: str

use_skip

Whether to skip to the “DATA:” descriptor in the file.

Type:: bool

lines

A list of all lines read from the file.

Type:: list[str]

blocks

The final list of processed Block objects.

Type:: list[Block]

Notes

The parsing workflow is as follows:
- A NonStandardParser instance is created with a file_path.
- The public parse() method is called.
- _fetch_lines() reads the file into self.lines.
- _segregate_blocks() splits self.lines into Block objects (groups of contiguous non-empty lines) and saves them to self.blocks.
- parse() iterates through each block in self.blocks.
- _process_block() is called on each block, which (a) computes statistics for the block, (b) classifies it (e.g., TABULAR, DATA, NARRATIVE), and (c) dispatches to a specific parsing method (e.g., _parse_tabular_block).
- The specific parse methods (e.g., _parse_data_block) handle header borrowing, DataFrame generation, and error handling, modifying the block object in place.
- parse() returns the fully processed self.blocks list.
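The block segregation step of this workflow can be pictured with a small sketch. It is only an approximation of what _segregate_blocks() might do: the function name sketch_segregate_blocks is hypothetical, and treating the end index as inclusive is an assumption.

```python
def sketch_segregate_blocks(lines):
    """Return (start, end) index pairs for runs of non-empty lines.

    Assumption: 'end' is inclusive; the library's convention may differ.
    """
    blocks, start = [], None
    for i, line in enumerate(lines):
        if line.strip():
            # A non-empty line opens a block if none is open yet.
            if start is None:
                start = i
        elif start is not None:
            # A blank line closes the block that is currently open.
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        blocks.append((start, len(lines) - 1))
    return blocks


sample = ["Title line", "", "Depth  Age", "1.0    100", "2.0    200", "", "", "Notes here"]
blocks = sketch_segregate_blocks(sample)
print(blocks)
```

Each resulting pair would then be classified and parsed on its own, which is what lets a single file mix narrative text and several tables.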

parse()[source]

Executes the full parsing workflow on the file.

Returns:

A list of processed Block objects. Each block may contain a DataFrame (block.df) if parsing was successful, or an error message (block.error_message) if it failed.

Return type:

list[Block]

Raises:

ValueError – If use_skip is True and no “DATA:” line is found.
requests.exceptions.RequestException – If the file_path is a URL and it fails to fetch.

NonStandardParserUtils (pyleotups.utils.Parser.NonStandardParserUtils)

class pyleotups.utils.Parser.NonStandardParserUtils.Block(idx, start, end)[source]

Represents a contiguous block of non-empty lines from the source file.

This is the main data structure used by the parser, holding the lines, their classified type, and the resulting parsed DataFrame.

idx

The sequential index (0, 1, 2…) of the block in the file.

Type:: int

start

The starting line number (index) of this block in the source file.

Type:: int

end

The ending line number (index) of this block in the source file.

Type:: int

lines

A list of LineInfo objects contained within this block.

Type:: list[LineInfo]

block_type

The classified type of the block (e.g., NARRATIVE, TABULAR).

Type:: BlockType

headers

A list of header dictionaries, where each dict contains:
- “name” (str): The parsed header name.
- “interval” (tuple[int, int]): The (start, end) char position.

Type:: list[dict]

title

A potential title line detected above the headers.

Type:: str or None

stats

Aggregated statistics computed for the entire block.

Type:: dict

header_extent

The number of lines detected as being part of the header.

Type:: int

delimiter

The regex string of the delimiter chosen for this block.

Type:: str or None

df

The resulting pandas DataFrame if parsing was successful.

Type:: pd.DataFrame or None

used_as_header_for

A list of block indices that successfully borrowed this block’s headers.

Type:: list[int]

class pyleotups.utils.Parser.NonStandardParserUtils.BlockType(*values)[source]: Enumeration for the different types a Block can be classified as.

class pyleotups.utils.Parser.NonStandardParserUtils.LineInfo(idx, text)[source]

Holds the text and pre-computed statistics for a single line.

idx

The original line number (index) from the source file.

Type:: int

text

The raw text of the line.

Type:: str

line_len

The character length of the line.

Type:: int

count_single_tokens

Token count using a single-space delimiter (r"\s+").

Type:: int

count_multispace_tokens

Token count using a multi-space delimiter (r"(\s{2,})").

Type:: int

count_tab_tokens

Token count using a tab delimiter (r"\t").

Type:: int

numeric_single_ratio

Ratio of numeric tokens (0.0 to 1.0) using r"\s+".

Type:: float

numeric_multispace_ratio

Ratio of numeric tokens (0.0 to 1.0) using r"(\s{2,})".

Type:: float

numeric_tab_ratio

Ratio of numeric tokens (0.0 to 1.0) using r"\t".

Type:: float
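To make the per-line statistics concrete, here is a minimal sketch that computes a token count and a numeric ratio under three delimiters. The helper names are hypothetical, the numeric test is a plain float() conversion rather than the library's is_numeric, and the sample line is invented.

```python
import re

# Whitespace, multi-space, and tab delimiters, written as regex strings.
DELIMITERS = {"single": r"\s+", "multispace": r"(\s{2,})", "tab": r"\t"}


def sketch_tokens(line, delimiter):
    # re.split returns the separators too when the pattern has a capturing
    # group, so whitespace-only pieces are filtered out here.
    return [t for t in re.split(delimiter, line) if t.strip()]


def sketch_is_plain_number(token):
    # Simplified stand-in for a numeric check: accepts what float() accepts.
    try:
        float(token)
        return True
    except ValueError:
        return False


def sketch_line_stats(line):
    stats = {}
    for name, delimiter in DELIMITERS.items():
        tokens = sketch_tokens(line, delimiter)
        numeric = sum(1 for t in tokens if sketch_is_plain_number(t))
        ratio = numeric / len(tokens) if tokens else 0.0
        stats[name] = (len(tokens), ratio)
    return stats


stats = sketch_line_stats("Core A   12.5   340")
print(stats)
```

Comparing the counts across delimiters is what lets a heuristic guess the layout: here the multi-space split keeps "Core A" together as one label, while the single-space split breaks it in two.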

pyleotups.utils.Parser.NonStandardParserUtils.assign_tokens_by_overlap(lines_info, delimiter, headers, header_extent=0)[source]

Generates a DataFrame by assigning tokens based on character-level overlap.

This is a fallback for misaligned data. It proceeds in two stages:
1. Assign the token to the header with the maximum overlap.
2. If no header overlaps, assign it to the header at the minimum distance (closest neighbor).

Parameters:

lines_info (list[LineInfo]) – The list of LineInfo objects to parse.
delimiter (str) – The regex delimiter to split lines.
headers (list[dict]) – The list of header objects (must contain “name” and “interval”).
header_extent (int, optional) – The number of lines to skip from the start of lines_info. Defaults to 0.

Returns:

The parsed DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If delimiter or headers are missing or malformed.
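The two-stage assignment described for assign_tokens_by_overlap can be sketched for a single token as follows. This is an illustration under stated assumptions, not the library's code: the helper names are hypothetical, intervals are treated as half-open (start, end) pairs, and ties are resolved in favor of the first header.

```python
def sketch_overlap(a, b):
    """Shared character positions between two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def sketch_distance(a, b):
    """Gap between two intervals (0 if they touch or overlap)."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def sketch_assign(token_interval, headers):
    # Stage 1: choose the header with the largest overlap.
    best = max(headers, key=lambda h: sketch_overlap(token_interval, h["interval"]))
    if sketch_overlap(token_interval, best["interval"]) > 0:
        return best["name"]
    # Stage 2: nothing overlaps, so fall back to the closest header.
    nearest = min(headers, key=lambda h: sketch_distance(token_interval, h["interval"]))
    return nearest["name"]


headers = [
    {"name": "depth", "interval": (0, 5)},
    {"name": "age", "interval": (10, 13)},
]
# Three token positions: inside "depth", overlapping "age", and in the gap.
assigned = [sketch_assign(iv, headers) for iv in [(1, 4), (11, 16), (6, 8)]]
print(assigned)
```

The third token overlaps neither header and lands on "depth" because its gap to that header (1 character) is smaller than its gap to "age" (2 characters).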

pyleotups.utils.Parser.NonStandardParserUtils.auto_cast_df(df: DataFrame) → DataFrame[source]: Attempt to convert object columns to numeric where possible. Leaves non-convertible columns unchanged.

pyleotups.utils.Parser.NonStandardParserUtils.compute_interval_overlap(interval1, interval2)[source]: Calculates the number of overlapping characters between two intervals.

pyleotups.utils.Parser.NonStandardParserUtils.count_tokens(line, delimiter)[source]: Counts non-empty tokens in a line given a regex delimiter.

pyleotups.utils.Parser.NonStandardParserUtils.generate_df(lines_info, delimiter, headers, header_extent=0)[source]

Generates a DataFrame using a simple split, assuming columns align.

Parameters:

lines_info (list[LineInfo]) – The list of LineInfo objects to parse.
delimiter (str) – The regex delimiter to split lines.
headers (list[dict]) – The list of header objects (must contain “name”).
header_extent (int, optional) – The number of lines to skip from the start of lines_info. Defaults to 0.

Returns:

The parsed DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If delimiter or headers are missing.
ValueError – If the number of tokens in a data row does not match the number of headers (and data rows exist).

pyleotups.utils.Parser.NonStandardParserUtils.generate_row_pattern(tokens)[source]: Generates a string pattern (‘N’ for numeric, ‘S’ for string) for a list of tokens.

pyleotups.utils.Parser.NonStandardParserUtils.get_token_intervals_multi(line, delimiter)[source]

Splits a line by a regex delimiter and returns token intervals.

Parameters:

line (str) – The line to parse.
delimiter (str) – The regex delimiter string (e.g., r"(\s{2,})").

Returns:

A list of token dictionaries, each with:
- “key” (str): A unique key for the token.
- “display” (str): The stripped token text.
- “interval” (tuple[int, int]): The (start, end) char position.

Return type:

list[dict]
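A minimal sketch of interval-preserving tokenization is shown below. It mirrors the documented return shape ("key", "display", "interval") but is not the library function: the name sketch_token_intervals, the "tokN" key format, and the half-open interval convention are assumptions.

```python
import re


def sketch_token_intervals(line, delimiter=r"(\s{2,})"):
    """Split a line on a regex delimiter and keep each token's char span."""
    spans, pos = [], 0
    for match in re.finditer(delimiter, line):
        # Text between the previous delimiter and this one is a token.
        if match.start() > pos:
            spans.append((pos, match.start()))
        pos = match.end()
    if pos < len(line):
        spans.append((pos, len(line)))
    tokens = []
    for start, end in spans:
        text = line[start:end].strip()
        if text:
            # Hypothetical key format; the library's keys may look different.
            tokens.append({"key": f"tok{len(tokens)}", "display": text, "interval": (start, end)})
    return tokens


tokens = sketch_token_intervals("Depth (cm)   Age    d18O")
for tok in tokens:
    print(tok)
```

Keeping the character span alongside the text is what makes the later overlap-based steps possible, since a header and a data value can then be compared by position instead of by column index.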

pyleotups.utils.Parser.NonStandardParserUtils.intervals_overlap(interval1, interval2)[source]: Checks if two intervals overlap at all.

pyleotups.utils.Parser.NonStandardParserUtils.is_numeric(token)[source]

Robustly checks if a token is numeric.

Handles plain numbers, ranges (e.g., ‘10-20’), values with uncertainty (e.g., ‘1.5 ± 0.1’ or ‘1.5±0.1’), and wrapped values (e.g., ‘(10)’ or ‘6.80 (8.98)’).

Parameters:: token (str) – The string token to check.
Returns:: True if the token is considered numeric, False otherwise.
Return type:: bool
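One way to cover the token shapes listed for is_numeric is a small set of regular expressions, as in the sketch below. The patterns and the names sketch_is_numeric and sketch_row_pattern are assumptions made for illustration; the library's actual rules may accept more or fewer forms.

```python
import re

# A signed integer or decimal, optionally with an exponent.
NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"

PATTERNS = [re.compile(p) for p in (
    rf"{NUMBER}",                   # plain number: 12.5
    rf"{NUMBER}\s*-\s*{NUMBER}",    # range: 10-20
    rf"{NUMBER}\s*±\s*{NUMBER}",    # uncertainty: 1.5 ± 0.1
    rf"\({NUMBER}\)",               # wrapped value: (10)
    rf"{NUMBER}\s*\({NUMBER}\)",    # value plus wrapped value: 6.80 (8.98)
)]


def sketch_is_numeric(token):
    """True if the whole token matches one of the numeric shapes above."""
    token = token.strip()
    return any(p.fullmatch(token) for p in PATTERNS)


def sketch_row_pattern(tokens):
    """'N' for numeric tokens and 'S' for everything else."""
    return "".join("N" if sketch_is_numeric(t) else "S" for t in tokens)


print(sketch_row_pattern(["Core", "10-20", "(3)", "1.5 ± 0.1"]))
```

Using fullmatch rather than search matters here: a token such as "12a" contains a number but should still count as a string.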

pyleotups.utils.Parser.NonStandardParserUtils.merge_headers_by_overlap(token_maps)[source]

Merges multiple lines of header tokens into a single header list.

Used for multi-line headers, where tokens from subsequent lines are merged into the first line’s headers based on character overlap.

Parameters:: token_maps (list[list[dict]]) – A list where each item is the output of get_token_intervals_multi for one header line.
Returns:: A single list of merged header dictionaries.
Return type:: list[dict]
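The merging of a multi-line header can be sketched as below. This is a simplified illustration, not the library's algorithm: the function names are hypothetical, widening the merged interval is an assumption, and so is appending a later-line token as a new header when it overlaps nothing.

```python
def sketch_overlap(a, b):
    """Shared character positions between two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def sketch_merge_headers(token_maps):
    """Fold tokens from later header lines into the first line's headers."""
    merged = [{"name": t["display"], "interval": t["interval"]} for t in token_maps[0]]
    for line_tokens in token_maps[1:]:
        for tok in line_tokens:
            best = max(
                merged,
                key=lambda h: sketch_overlap(h["interval"], tok["interval"]),
                default=None,
            )
            if best is not None and sketch_overlap(best["interval"], tok["interval"]) > 0:
                # Append the text and widen the span (assumed behavior).
                best["name"] = f"{best['name']} {tok['display']}"
                best["interval"] = (
                    min(best["interval"][0], tok["interval"][0]),
                    max(best["interval"][1], tok["interval"][1]),
                )
            else:
                # Assumed behavior: an unmatched token becomes its own header.
                merged.append({"name": tok["display"], "interval": tok["interval"]})
    return sorted(merged, key=lambda h: h["interval"][0])


line1 = [{"display": "Depth", "interval": (0, 5)}, {"display": "Age", "interval": (10, 13)}]
line2 = [{"display": "(cm)", "interval": (0, 4)}, {"display": "(yr BP)", "interval": (9, 16)}]
merged = sketch_merge_headers([line1, line2])
print(merged)
```

In this made-up two-line header, the unit tokens on the second line attach to the names directly above them.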

pyleotups.utils.Parser.NonStandardParserUtils.numeric_ratio(line, delimiter)[source]: Calculates the ratio of numeric tokens in a line.

pyleotups.utils.Parser.NonStandardParserUtils.refine_headers_by_correspondence(header_lines, data_lines, delimiter, broadcast_identical=False)[source]

Refines headers by analyzing the physical layout (vertical alignment) of the data lines.

It creates a density mask (histogram) of the data to find physical columns, then maps the header tokens to these physical columns. If multiple distinct header tokens map to a single wide data column, it forces a split (preserving granular headers). If adjacent data columns share the exact same header identity, it merges them (unless broadcast_identical is True).

Parameters:

header_lines (list[LineInfo]) – The lines identified as headers.
data_lines (list[LineInfo]) – The lines identified as data.
delimiter (str) – The regex delimiter used to tokenize the lines.
broadcast_identical (bool, optional) – If True, adjacent columns with identical headers are kept separate (suffixed). If False (default), they are merged into one column.

Returns:

A list of refined header dictionaries containing “name” and “interval”. Returns None if refinement is not possible (e.g., no data lines).

Return type:

list[dict] or None
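The density-mask idea behind refine_headers_by_correspondence can be illustrated with the sketch below, which covers only the first step: locating physical columns from the data lines. The name sketch_physical_columns, the min_density parameter, and the half-open spans are assumptions, and the mapping of header tokens onto these columns is left out.

```python
def sketch_physical_columns(data_lines, min_density=0.0):
    """Return (start, end) spans where data lines contain non-space text."""
    width = max(len(line) for line in data_lines)
    # Fraction of lines that have a non-space character at each position.
    density = [
        sum(1 for line in data_lines if i < len(line) and not line[i].isspace())
        / len(data_lines)
        for i in range(width)
    ]
    columns, start = [], None
    for i, value in enumerate(density):
        occupied = value > min_density
        if occupied and start is None:
            start = i
        elif not occupied and start is not None:
            columns.append((start, i))
            start = None
    if start is not None:
        columns.append((start, width))
    return columns


rows = [
    "1.0    100   -1.2",
    "2.0    200   -1.5",
    "10.5   1500  -0.9",
]
columns = sketch_physical_columns(rows)
print(columns)
```

Character positions that are blank in every data line act as column separators, so the spans follow the data's real alignment even where header text is shifted.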

ExcelParser (pyleotups.utils.Parser.ExcelParser)

class pyleotups.utils.Parser.ExcelParser(file_path: str)[source]

Parses Excel files by detecting contiguous blocks of non-empty cells and converting them into structured DataFrames.

It handles:
- Spatial segmentation (BFS) to find tables.
- Merged cell propagation.
- Statistical header detection.
- Multi-row header merging.
- Header borrowing for data-only blocks.

file_path

The local path or URL to the Excel file.

Type:: str

sheets

The loaded sheets content.

Type:: List[SheetGrid]

blocks

The segregated and processed blocks.

Type:: List[Block]

_header_registry

Internal registry for header borrowing logic.

Type:: Dict[str, List[Block]]

parse() → List[Block][source]

Executes the full parsing workflow.

Returns:: A list of processed Block objects, potentially containing DataFrames.
Return type:: List[Block]
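The spatial segmentation (BFS) step named in the ExcelParser description can be sketched on a plain list-of-lists grid. This is an illustration, not the parser's code: the function name sketch_find_blocks, the use of 4-connectivity, and the (top, left, bottom, right) bounding-box output are assumptions, and the grid values are made up.

```python
from collections import deque


def sketch_find_blocks(grid):
    """Group non-empty cells into connected blocks and return bounding boxes."""
    n_rows = len(grid)
    n_cols = max((len(row) for row in grid), default=0)

    def filled(r, c):
        return c < len(grid[r]) and grid[r][c] not in (None, "")

    seen, boxes = set(), []
    for r in range(n_rows):
        for c in range(n_cols):
            if (r, c) in seen or not filled(r, c):
                continue
            # Breadth-first search from the first unseen non-empty cell.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, bottom, left, right = r, r, c, c
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    in_bounds = 0 <= nr < n_rows and 0 <= nc < n_cols
                    if in_bounds and (nr, nc) not in seen and filled(nr, nc):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            boxes.append((top, left, bottom, right))
    return boxes


grid = [
    ["depth", "age", None, None],
    [1.0, 100, None, "note"],
    [2.0, 200, None, None],
    [None, None, None, None],
    ["site", "lat", None, None],
]
boxes = sketch_find_blocks(grid)
print(boxes)
```

The empty column and empty row separate the sheet into three blocks: the main table, the stray "note" cell, and the small table at the bottom.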