Advanced Functionalities
NOAAStudy (pyleotups.utils.NOAAStudy)
- class pyleotups.utils.NOAAStudy.NOAAStudy(study_data)[source]
This class encapsulates study metadata and its related components (e.g. publications, sites) retrieved from the NOAA API.
- study_id
The unique NOAA study identifier.
- Type:
str
- xml_id
The XML identifier of the study.
- Type:
str
- metadata
A dictionary containing basic metadata such as studyName, dataType, earliestYearBP, etc.
- Type:
dict
- investigators
A comma-separated string of investigator names.
- Type:
str
- publications
A list of Publication objects associated with the study.
- Type:
list of Publication
PaleoData (pyleotups.utils.PaleoData)
- class pyleotups.utils.PaleoData.PaleoData(paleo_data, study_id, site_id)[source]
Represents paleo data associated with a site, including multiple data files and full variable metadata per file.
- datatable_id
Unique NOAA data table identifier.
- Type:
str
- dataTableName
Name of the data table.
- Type:
str
- timeUnit
Time unit used in the data table.
- Type:
str
- files
List of raw file info dicts.
- Type:
list of dict
- file_variable_map
Maps fileUrl to a dict of variables and their full metadata.
- Type:
dict
- file_url
Shortcut to first file URL (for backward compatibility).
- Type:
str or np.nan
- variables
Shortcut to variable names in first file (for backward compatibility).
- Type:
list of str
Publication (pyleotups.utils.Publication)
- class pyleotups.utils.Publication.Publication(pub_data)[source]
Represents a publication within a study.
- author
The name of the author(s) of the publication.
- Type:
str
- title
The title of the publication.
- Type:
str
- journal
The journal where the publication appeared.
- Type:
str
- year
The publication year.
- Type:
str
- volume
The volume number (if applicable).
- Type:
str or None
- number
The issue number (if applicable).
- Type:
str or None
- pages
The page numbers (if applicable).
- Type:
str or None
- pub_type
The type of publication.
- Type:
str or None
- doi
The Digital Object Identifier.
- Type:
str or None
- url
URL for the publication.
- Type:
str or None
- study_id
The NOAA study ID to which this publication belongs.
- Type:
str or None
Site (pyleotups.utils.Site)
PangaeaStudy (pyleotups.utils.PangaeaStudy)
- class pyleotups.utils.PangaeaStudy.PangaeaStudy(study_id: str, cache_dir: str | None = None, auth_token: str | None = None)[source]
Utility class representing a single PANGAEA study.
This class wraps a persistent pangaeapy.PanDataSet instance and provides lazy data loading, NOAA-style summary normalization, geographic extraction, and deep publication parsing (including supplement handling).
- Parameters:
study_id (str) – DOI, URI, or identifier of the PANGAEA dataset.
cache_dir (str or None, optional) – Directory for pangaeapy cache.
auth_token (str or None, optional) – PANGAEA authentication token for restricted datasets.
- get_data() DataFrame[source]
Retrieve the dataset as a pandas DataFrame.
- Returns:
Copy of the dataset table with metadata stored in df.attrs.
- Return type:
pandas.DataFrame
- get_funding() DataFrame[source]
Retrieve funding information for this study.
- Returns:
DataFrame with columns: [‘StudyID’, ‘StudyName’, ‘FundingAgency’, ‘FundingGrant’].
If no funding metadata is available, returns an empty DataFrame with columns preserved.
- Return type:
pandas.DataFrame
- get_geo() DataFrame[source]
Retrieve geographic metadata for study events.
- Returns:
DataFrame containing event-level geographic information.
- Return type:
pandas.DataFrame
Parsers
StandardParser (pyleotups.utils.Parser.StandardParser)
- class pyleotups.utils.Parser.StandardParser(url=None)[source]
StandardParser parses NOAA .txt data files in the standard format, that is, the NOAA templated file layout: metadata on lines starting with #, variable definitions on lines starting with ##, and tab-delimited data.
- url
URL of the file to parse.
- Type:
str
- lines
Fetched lines from file.
- Type:
list of str
- meta_start
Index where metadata block starts.
- Type:
int
- meta_end
Index where metadata block ends.
- Type:
int
- variables
Extracted variable names.
- Type:
list of str
- skip_lines
Lines to skip after metadata to reach data.
- Type:
int
- data
Parsed data rows.
- Type:
list of list of str
- df
Final constructed dataframe.
- Type:
pandas.DataFrame
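The three-part layout described for StandardParser can be illustrated with a minimal sketch. This is not the library's implementation: the function name split_noaa_template and the sample lines are hypothetical, and it assumes each ## line carries the variable's short name before the first tab.

```python
def split_noaa_template(lines):
    """Split template lines into metadata, variable names, and data rows."""
    metadata, variables, rows = [], [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            # Variable line: assume the short name precedes the first tab.
            variables.append(line[2:].split("\t")[0].strip())
        elif line.startswith("#"):
            # Metadata line: keep the text after the leading '#'.
            metadata.append(line[1:].strip())
        elif line.strip():
            # Anything else that is non-empty is treated as tab-delimited data.
            rows.append(line.split("\t"))
    return metadata, variables, rows


# Hypothetical sample content, made up for illustration only.
sample = [
    "# Study_Name: Example record",
    "## age_calBP\tage,,,calendar years before present",
    "## d18O\tdelta 18O,,,per mil",
    "age_calBP\td18O",
    "100\t-1.2",
    "200\t-1.5",
]
metadata, variables, rows = split_noaa_template(sample)
header, data = rows[0], rows[1:]
```

Here the first data row is treated as the column header, which is an assumption of the sketch rather than something the attribute list above states.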
NonStandardParser (pyleotups.utils.Parser.NonStandardParser)
- class pyleotups.utils.Parser.NonStandardParser(file_path, use_skip=True, use_refinement=True)[source]
Parses non-standard, fixed-width, or misaligned text files (like those from NOAA) into a structured list of Blocks, each potentially containing a pandas DataFrame.
The parser uses statistical heuristics to classify contiguous blocks of text and then applies different parsing strategies based on the classification.
- file_path
The local path or URL to the file being parsed.
- Type:
str
- use_skip
Whether to skip to the “DATA:” descriptor in the file.
- Type:
bool
- lines
A list of all lines read from the file.
- Type:
list[str]
Notes
The parsing workflow is as follows:
1. A NonStandardParser instance is created with a file_path.
2. The public parse() method is called.
3. _fetch_lines() reads the file into self.lines.
4. _segregate_blocks() splits self.lines into Block objects (groups of contiguous non-empty lines) and saves them to self.blocks.
5. parse() iterates through each block in self.blocks.
6. _process_block() is called on each block, which (a) computes statistics for the block, (b) classifies it (e.g., TABULAR, DATA, NARRATIVE), and (c) dispatches to a specific parsing method (e.g., _parse_tabular_block).
7. The specific parse methods (e.g., _parse_data_block) handle header borrowing, DataFrame generation, and error handling, modifying the block object in place.
8. parse() returns the fully processed self.blocks list.
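The block segregation step of this workflow can be sketched in a few lines. This is an illustration only, not the code behind _segregate_blocks(): the function name segregate_blocks is hypothetical, and it returns plain (start, end) index pairs with an inclusive end instead of Block objects.

```python
def segregate_blocks(lines):
    """Return (start, end) index pairs for runs of non-empty lines (end inclusive)."""
    blocks = []
    start = None
    for i, line in enumerate(lines):
        if line.strip():
            # A non-empty line opens a block if none is currently open.
            if start is None:
                start = i
        elif start is not None:
            # An empty line closes the block that was open.
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        # Close a block that runs to the end of the file.
        blocks.append((start, len(lines) - 1))
    return blocks


lines = ["Title", "a  b", "", "", "1  2", "3  4", ""]
blocks = segregate_blocks(lines)
```

Each pair could then be classified and parsed independently, which is the role the notes above assign to _process_block().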
- parse()[source]
Executes the full parsing workflow on the file.
- Returns:
A list of processed Block objects. Each block may contain a DataFrame (block.df) if parsing was successful, or an error message (block.error_message) if it failed.
- Return type:
list[Block]
- Raises:
ValueError – If use_skip is True and no “DATA:” line is found.
requests.exceptions.RequestException – If the file_path is a URL and it fails to fetch.
NonStandardParserUtils (pyleotups.utils.Parser.NonStandardParserUtils)
- class pyleotups.utils.Parser.NonStandardParserUtils.Block(idx, start, end)[source]
Represents a contiguous block of non-empty lines from the source file.
This is the main data structure used by the parser, holding the lines, their classified type, and the resulting parsed DataFrame.
- idx
The sequential index (0, 1, 2…) of the block in the file.
- Type:
int
- start
The starting line number (index) of this block in the source file.
- Type:
int
- end
The ending line number (index) of this block in the source file.
- Type:
int
- headers
A list of header dictionaries, where each dict contains “name” (str), the parsed header name, and “interval” (tuple[int, int]), the (start, end) character position.
- Type:
list[dict]
- title
A potential title line detected above the headers.
- Type:
str or None
- stats
Aggregated statistics computed for the entire block.
- Type:
dict
- header_extent
The number of lines detected as being part of the header.
- Type:
int
- delimiter
The regex string of the delimiter chosen for this block.
- Type:
str or None
- df
The resulting pandas DataFrame if parsing was successful.
- Type:
pd.DataFrame or None
- used_as_header_for
A list of block indices that successfully borrowed this block’s headers.
- Type:
list[int]
- class pyleotups.utils.Parser.NonStandardParserUtils.BlockType(*values)[source]
Enumeration for the different types a Block can be classified as.
- class pyleotups.utils.Parser.NonStandardParserUtils.LineInfo(idx, text)[source]
Holds the text and pre-computed statistics for a single line.
- idx
The original line number (index) from the source file.
- Type:
int
- text
The raw text of the line.
- Type:
str
- line_len
The character length of the line.
- Type:
int
- count_single_tokens
Token count using a single-space delimiter (r"\s+").
- Type:
int
- count_multispace_tokens
Token count using a multi-space delimiter (r"(\s{2,})").
- Type:
int
- count_tab_tokens
Token count using a tab delimiter (r"\t").
- Type:
int
- numeric_single_ratio
Ratio of numeric tokens (0.0 to 1.0) using r"\s+".
- Type:
float
- numeric_multispace_ratio
Ratio of numeric tokens (0.0 to 1.0) using r"(\s{2,})".
- Type:
float
- numeric_tab_ratio
Ratio of numeric tokens (0.0 to 1.0) using r"\t".
- Type:
float
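How per-line statistics of this kind could be computed is sketched below. It is a rough approximation, not the LineInfo implementation: line_stats and looks_numeric are hypothetical names, the three delimiter patterns are assumed to be whitespace, two-or-more whitespace, and tab, and the numeric test is a plain float() conversion rather than the library's is_numeric.

```python
import re

# Assumed delimiter patterns for the three tokenizations.
DELIMITERS = {"single": r"\s+", "multispace": r"(\s{2,})", "tab": r"\t"}


def looks_numeric(token):
    """Simplified numeric test: True if float() accepts the token."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def line_stats(text):
    """Compute the length, token counts, and numeric ratios for one line."""
    stats = {"line_len": len(text)}
    for name, pattern in DELIMITERS.items():
        # The capturing group makes re.split return the separators too;
        # the strip() filter drops them along with empty strings.
        tokens = [t.strip() for t in re.split(pattern, text) if t.strip()]
        numeric = sum(looks_numeric(t) for t in tokens)
        stats[f"count_{name}_tokens"] = len(tokens)
        stats[f"numeric_{name}_ratio"] = numeric / len(tokens) if tokens else 0.0
    return stats


stats = line_stats("Site A   12.5   300")
```

Comparing the three token counts across the lines of a block is one way to guess which delimiter fits: here the multi-space split keeps "Site A" together, while the single-space split breaks it in two.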
- pyleotups.utils.Parser.NonStandardParserUtils.assign_tokens_by_overlap(lines_info, delimiter, headers, header_extent=0)[source]
Generates a DataFrame by assigning tokens based on character-level overlap.
This is a fallback for misaligned data. Each token is assigned in two stages:
1. The token is assigned to the header with the maximum overlap.
2. If no header overlaps, the token is assigned to the header at the minimum distance (closest neighbor).
- Parameters:
lines_info (list[LineInfo]) – The list of LineInfo objects to parse.
delimiter (str) – The regex delimiter to split lines.
headers (list[dict]) – The list of header objects (must contain “name” and “interval”).
header_extent (int, optional) – The number of lines to skip from the start of lines_info. Defaults to 0.
- Returns:
The parsed DataFrame.
- Return type:
pd.DataFrame
- Raises:
ValueError – If delimiter or headers are missing or malformed.
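The two-stage assignment can be sketched for a single token as follows. This is a simplified illustration and not the library's code: assign_token, overlap, and distance are hypothetical helpers, and intervals are assumed to be half-open (start, end) character positions.

```python
def overlap(a, b):
    """Number of characters shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def distance(a, b):
    """Gap between two intervals (0 if they touch or overlap)."""
    return max(a[0] - b[1], b[0] - a[1], 0)


def assign_token(token_interval, headers):
    """Pick a header name: maximum overlap first, nearest neighbor otherwise."""
    best = max(headers, key=lambda h: overlap(token_interval, h["interval"]))
    if overlap(token_interval, best["interval"]) > 0:
        return best["name"]
    # Stage 2: no header overlaps the token, so fall back to the closest one.
    nearest = min(headers, key=lambda h: distance(token_interval, h["interval"]))
    return nearest["name"]


headers = [
    {"name": "depth", "interval": (0, 5)},
    {"name": "age", "interval": (10, 13)},
    {"name": "d18O", "interval": (20, 24)},
]
overlapping = assign_token((9, 12), headers)   # overlaps "age" by 2 characters
orphan_left = assign_token((15, 17), headers)  # no overlap; "age" is 2 away
orphan_right = assign_token((18, 19), headers) # no overlap; "d18O" is 1 away
```

Ties are resolved by header order in this sketch; how the library breaks ties is not stated above.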
- pyleotups.utils.Parser.NonStandardParserUtils.auto_cast_df(df: DataFrame) DataFrame[source]
Attempt to convert object columns to numeric where possible. Leaves non-convertible columns unchanged.
- pyleotups.utils.Parser.NonStandardParserUtils.compute_interval_overlap(interval1, interval2)[source]
Calculates the number of overlapping characters between two intervals.
- pyleotups.utils.Parser.NonStandardParserUtils.count_tokens(line, delimiter)[source]
Counts non-empty tokens in a line given a regex delimiter.
- pyleotups.utils.Parser.NonStandardParserUtils.generate_df(lines_info, delimiter, headers, header_extent=0)[source]
Generates a DataFrame using a simple split, assuming columns align.
- Parameters:
lines_info (list[LineInfo]) – The list of LineInfo objects to parse.
delimiter (str) – The regex delimiter to split lines.
headers (list[dict]) – The list of header objects (must contain “name”).
header_extent (int, optional) – The number of lines to skip from the start of lines_info. Defaults to 0.
- Returns:
The parsed DataFrame.
- Return type:
pd.DataFrame
- Raises:
ValueError – If delimiter or headers are missing.
ValueError – If the number of tokens in a data row does not match the number of headers (and data rows exist).
- pyleotups.utils.Parser.NonStandardParserUtils.generate_row_pattern(tokens)[source]
Generates a string pattern (‘N’ for numeric, ‘S’ for string) for a list of tokens.
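A row pattern of this kind might be built as in the sketch below. The name row_pattern is hypothetical, the numeric check is a plain float() conversion rather than the library's is_numeric, and concatenating the letters without a separator is an assumption.

```python
def row_pattern(tokens):
    """Map each token to 'N' (numeric) or 'S' (string) and join the letters."""

    def is_number(tok):
        try:
            float(tok)
            return True
        except ValueError:
            return False

    return "".join("N" if is_number(t) else "S" for t in tokens)


pattern = row_pattern(["Core1", "12.5", "300", "n/a"])
```

Rows that share a pattern are likely to belong to the same table, which makes the pattern a cheap signature for comparing neighboring lines.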
- pyleotups.utils.Parser.NonStandardParserUtils.get_token_intervals_multi(line, delimiter)[source]
Splits a line by a regex delimiter and returns token intervals.
- Parameters:
line (str) – The line to parse.
delimiter (str) – The regex delimiter string (e.g., r"(\s{2,})").
- Returns:
A list of token dictionaries, each with “key” (str), a unique key for the token; “display” (str), the stripped token text; and “interval” (tuple[int, int]), the (start, end) character position.
- Return type:
list[dict]
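One way to produce token dictionaries of this shape is sketched below with re.finditer. It is not the library's implementation: token_intervals is a hypothetical name, the key format (text plus start position) is invented for the example, and intervals are assumed to be half-open.

```python
import re


def token_intervals(line, delimiter):
    """Split a line on a regex delimiter, recording where each token sits."""
    tokens = []

    def add(start, end):
        chunk = line[start:end]
        text = chunk.strip()
        if not text:
            return
        # Shift the start past any leading whitespace left in the chunk.
        begin = start + (len(chunk) - len(chunk.lstrip()))
        tokens.append(
            {
                "key": f"{text}_{begin}",  # hypothetical key format
                "display": text,
                "interval": (begin, begin + len(text)),
            }
        )

    pos = 0
    for match in re.finditer(delimiter, line):
        add(pos, match.start())
        pos = match.end()
    add(pos, len(line))  # trailing token after the last delimiter
    return tokens


tokens = token_intervals("Depth   Age BP   d18O", r"(\s{2,})")
```

With a two-or-more-whitespace delimiter, "Age BP" survives as a single token because its internal gap is only one space wide.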
- pyleotups.utils.Parser.NonStandardParserUtils.intervals_overlap(interval1, interval2)[source]
Checks if two intervals overlap at all.
- pyleotups.utils.Parser.NonStandardParserUtils.is_numeric(token)[source]
Robustly checks if a token is numeric.
Handles plain numbers, ranges (e.g., ‘10-20’), values with uncertainty (e.g., ‘1.5 ± 0.1’ or ‘1.50.1’), and wrapped values (e.g., ‘(10)’ or ‘6.80 (8.98)’).
- Parameters:
token (str) – The string token to check.
- Returns:
True if the token is considered numeric, False otherwise.
- Return type:
bool
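A check covering the categories listed above could look like the following sketch. It is an assumption-laden approximation, not the library's is_numeric: looks_numeric is a hypothetical name, and the accepted spellings (for instance "±" or "+/-" for uncertainty) are choices made for this example.

```python
import re

# A signed decimal number with an optional exponent.
NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"

PATTERNS = [
    re.compile(rf"{NUM}"),                       # plain number: 12.5
    re.compile(rf"{NUM}\s*-\s*{NUM}"),           # range: 10-20
    re.compile(rf"{NUM}\s*(?:±|\+/-)\s*{NUM}"),  # uncertainty: 1.5 ± 0.1
    re.compile(rf"\(\s*{NUM}\s*\)"),             # wrapped: (10)
    re.compile(rf"{NUM}\s*\(\s*{NUM}\s*\)"),     # value plus wrapped: 6.80 (8.98)
]


def looks_numeric(token):
    """True if the whole token matches one of the numeric shapes above."""
    text = token.strip()
    return any(p.fullmatch(text) for p in PATTERNS)


results = {
    tok: looks_numeric(tok)
    for tok in ["12.5", "10-20", "1.5 ± 0.1", "(10)", "6.80 (8.98)", "n/a", "Site1"]
}
```

Using fullmatch keeps mixed tokens such as "Site1" out, since a partial numeric match is not enough.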
- pyleotups.utils.Parser.NonStandardParserUtils.merge_headers_by_overlap(token_maps)[source]
Merges multiple lines of header tokens into a single header list.
Used for multi-line headers, where tokens from subsequent lines are merged into the first line’s headers based on character overlap.
- Parameters:
token_maps (list[list[dict]]) – A list where each item is the output of get_token_intervals_multi for one header line.
- Returns:
A single list of merged header dictionaries.
- Return type:
list[dict]
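The merging of multi-line headers can be sketched as below. This is an illustration under assumptions rather than the library's algorithm: merge_header_lines is a hypothetical name, merged names are joined with a space, the merged interval is widened to cover both tokens, and a token that overlaps nothing is kept as its own header.

```python
def overlap(a, b):
    """Number of characters shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def merge_header_lines(token_maps):
    """Fold tokens from later header lines into the first line's headers."""
    merged = [{"name": t["display"], "interval": t["interval"]} for t in token_maps[0]]
    for line_tokens in token_maps[1:]:
        for tok in line_tokens:
            best = max(
                merged,
                key=lambda h: overlap(h["interval"], tok["interval"]),
                default=None,
            )
            if best is not None and overlap(best["interval"], tok["interval"]) > 0:
                # Append the text and widen the interval to cover both tokens.
                best["name"] += " " + tok["display"]
                best["interval"] = (
                    min(best["interval"][0], tok["interval"][0]),
                    max(best["interval"][1], tok["interval"][1]),
                )
            else:
                # Assumption: an unmatched token becomes a header of its own.
                merged.append({"name": tok["display"], "interval": tok["interval"]})
    return sorted(merged, key=lambda h: h["interval"][0])


line1 = [{"display": "Depth", "interval": (0, 5)}, {"display": "Age", "interval": (10, 13)}]
line2 = [{"display": "(cm)", "interval": (0, 4)}, {"display": "(yr BP)", "interval": (9, 16)}]
headers = merge_header_lines([line1, line2])
```

In this example the unit line is absorbed into the name line, giving one header per column.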
- pyleotups.utils.Parser.NonStandardParserUtils.numeric_ratio(line, delimiter)[source]
Calculates the ratio of numeric tokens in a line.
- pyleotups.utils.Parser.NonStandardParserUtils.refine_headers_by_correspondence(header_lines, data_lines, delimiter, broadcast_identical=False)[source]
Refines headers by analyzing the physical layout (vertical alignment) of the data lines.
It creates a density mask (histogram) of the data to find physical columns, then maps the header tokens to these physical columns. If multiple distinct header tokens map to a single wide data column, it forces a split (preserving granular headers). If adjacent data columns share the exact same header identity, it merges them (unless broadcast_identical is True).
- Parameters:
header_lines (list[LineInfo]) – The lines identified as headers.
data_lines (list[LineInfo]) – The lines identified as data.
delimiter (str) – The regex delimiter used to tokenize the lines.
broadcast_identical (bool, optional) – If True, adjacent columns with identical headers are kept separate (suffixed). If False (default), they are merged into one column.
- Returns:
A list of refined header dictionaries containing “name” and “interval”. Returns None if refinement is not possible (e.g., no data lines).
- Return type:
list[dict] or None
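The density mask idea can be illustrated with a short sketch that counts non-space characters per character position and reads physical columns off the non-zero runs. It covers only that first step, not the header mapping, splitting, or merging, and the names density_mask and physical_columns are hypothetical.

```python
def density_mask(data_lines):
    """Count, per character position, how many lines have a non-space there."""
    width = max((len(line) for line in data_lines), default=0)
    counts = [0] * width
    for line in data_lines:
        for i, ch in enumerate(line):
            if not ch.isspace():
                counts[i] += 1
    return counts


def physical_columns(counts):
    """Return half-open (start, end) spans where the mask is non-zero."""
    columns, start = [], None
    for i, c in enumerate(counts):
        if c > 0 and start is None:
            start = i
        elif c == 0 and start is not None:
            columns.append((start, i))
            start = None
    if start is not None:
        columns.append((start, len(counts)))
    return columns


# Two data rows with columns starting at positions 0, 6, and 13.
data = [
    "10".ljust(6) + "1.25".ljust(7) + "abc",
    "200".ljust(6) + "3.5".ljust(7) + "de",
]
columns = physical_columns(density_mask(data))
```

Header tokens whose intervals fall inside one of these spans would then be candidates for that column.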
ExcelParser (pyleotups.utils.Parser.ExcelParser)
- class pyleotups.utils.Parser.ExcelParser(file_path: str)[source]
Parses Excel files by detecting contiguous blocks of non-empty cells and converting them into structured DataFrames.
It handles spatial segmentation (BFS) to find tables, merged cell propagation, statistical header detection, multi-row header merging, and header borrowing for data-only blocks.
- file_path
The local path or URL to the Excel file.
- Type:
str
- sheets
The loaded sheets content.
- Type:
List[SheetGrid]
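The spatial segmentation mentioned for ExcelParser can be sketched as a breadth-first search over a grid of cell values. This is a minimal illustration, not the parser's code: find_blocks is a hypothetical name, the sheet is modelled as a plain list of lists, cells holding None or an empty string count as empty, and only the four direct neighbors are treated as connected.

```python
from collections import deque


def find_blocks(grid):
    """Return bounding boxes (top, left, bottom, right) of connected filled cells."""
    rows = len(grid)
    cols = max((len(r) for r in grid), default=0)

    def filled(r, c):
        return c < len(grid[r]) and grid[r][c] not in (None, "")

    seen, boxes = set(), []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or not filled(r, c):
                continue
            # Breadth-first search from this unvisited filled cell.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = cr + dr, cc + dc
                    if (
                        0 <= nr < rows
                        and 0 <= nc < cols
                        and (nr, nc) not in seen
                        and filled(nr, nc)
                    ):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            boxes.append((top, left, bottom, right))
    return boxes


grid = [
    ["depth", "age", None, None],
    [1, 100, None, "note"],
    [2, 200, None, None],
    [None, None, None, None],
    ["site", "lat", None, None],
]
boxes = find_blocks(grid)
```

Each bounding box marks one candidate table; the isolated "note" cell comes out as a block of its own.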