tab-analysis.docs.user_guide

Created on Fri Sep 30 10:09:01 2022 @author: philippe@loco-labs.io

TAB-analysis: a tool to analyse tabular and multi-dimensional structures

TAB-analysis analyzes and measures the relationships between Fields in any tabular Dataset.

The TAB-analysis tool is part of the Environmental Sensing Project.

For more information, see the user guide or the GitHub repository.

What is TAB-analysis?

Principles

Each field in a dataset has global properties (e.g. the number of distinct values). The relationship between two fields can be characterized in a similar way (e.g. the number of distinct pairs of values taken from the two fields).

Analyzing these properties gives us a measure of the entire dataset.

The TAB-analysis module carries out these measurements and analyses. It also identifies data that does not respect a given relationship or multidimensional structure.
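As a rough illustration of the two kinds of measurement mentioned above, the sketch below counts the distinct values of a field and the distinct pairs of values of two fields in plain Python. It is not the library's implementation; the helper names `count_values` and `count_pairs` and the small `colour` / `size` dataset are hypothetical.

```python
# Standalone sketch: these helpers are illustrative, not part of tab_analysis.
def count_values(field):
    """Number of distinct values in a single field."""
    return len(set(field))


def count_pairs(field_a, field_b):
    """Number of distinct (a, b) pairs formed row by row by two fields."""
    return len(set(zip(field_a, field_b)))


# hypothetical four-row dataset with two fields
colour = ['red', 'red', 'blue', 'blue']
size = ['S', 'L', 'S', 'L']

print(count_values(colour))       # 2
print(count_pairs(colour, size))  # 4
```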

Examples

Here is a price list of different foods based on packaging.

'plants'     'quantity'  'product'  'price'
'fruit'      '1 kg'      'apple'    1
'fruit'      '10 kg'     'apple'    10
'fruit'      '1 kg'      'orange'   2
'fruit'      '10 kg'     'orange'   20
'vegetable'  '1 kg'      'peppers'  1.5
'vegetable'  '10 kg'     'peppers'  15
'vegetable'  '1 kg'      'carrot'   0.5
'vegetable'  '10 kg'     'carrot'   5

In this example, we observe two kinds of relationships:

  • classification ("derived" relationship): between 'plants' and 'product' (each product belongs to exactly one plant category)
  • crossing ("crossed" relationship): between 'product' and 'quantity' (every combination of the values of the two fields is present).
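The two relationships can be recognised from simple counts. The sketch below is a simplified stand-in, not the library's algorithm: the function `relation_kind` is hypothetical, it assumes that "derived" means the number of distinct pairs equals the larger of the two distinct-value counts and that "crossed" means it equals their product, and it lumps every other case into 'other'.

```python
# Standalone sketch: relation_kind is a hypothetical helper, not a tab_analysis call.
def relation_kind(field_a, field_b):
    """Classify two fields as 'derived', 'crossed' or 'other' from value counts."""
    n_a, n_b = len(set(field_a)), len(set(field_b))
    n_pairs = len(set(zip(field_a, field_b)))
    if n_pairs == max(n_a, n_b):   # each value of the finer field has one partner
        return 'derived'
    if n_pairs == n_a * n_b:       # every combination of values is present
        return 'crossed'
    return 'other'


plants = ['fruit', 'fruit', 'fruit', 'fruit',
          'vegetable', 'vegetable', 'vegetable', 'vegetable']
quantity = ['1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg']
product = ['apple', 'apple', 'orange', 'orange',
           'peppers', 'peppers', 'carrot', 'carrot']

print(relation_kind(plants, product))    # derived
print(relation_kind(product, quantity))  # crossed
```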

This Dataset can be represented as a matrix of prices indexed by 'quantity' ['1 kg', '10 kg'] and 'product' ['apple', 'orange', 'peppers', 'carrot'].
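A minimal sketch of that matrix view, in plain Python and independent of the library: because 'quantity' and 'product' are crossed, each (quantity, product) pair appears exactly once and can serve as the key of a price cell. The variable names `cell`, `rows`, `cols` and `matrix` are illustrative.

```python
quantity = ['1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg']
product = ['apple', 'apple', 'orange', 'orange',
           'peppers', 'peppers', 'carrot', 'carrot']
price = [1, 10, 2, 20, 1.5, 15, 0.5, 5]

rows = ['1 kg', '10 kg']                         # matrix rows: quantity
cols = ['apple', 'orange', 'peppers', 'carrot']  # matrix columns: product

# one price per (quantity, product) pair, since the two fields are crossed
cell = dict(zip(zip(quantity, product), price))
matrix = [[cell[(q, p)] for p in cols] for q in rows]

print(matrix)  # [[1, 2, 1.5, 0.5], [10, 20, 15, 5]]
```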

In [1]: # creation of the `analysis` object 
        from tab_dataset import Sdataset
        from tab_analysis import AnaDataset
        tabular = {'plants':   ['fruit', 'fruit','fruit',   'fruit','vegetable','vegetable','vegetable','vegetable' ],
                   'quantity': ['1 kg' , '10 kg', '1 kg',   '10 kg',  '1 kg',    '10 kg',   '1 kg',     '10 kg'     ], 
                   'product':  ['apple', 'apple', 'orange', 'orange', 'peppers', 'peppers', 'carrot',   'carrot'    ], 
                   'price':    [1,       10,      2,        20,       1.5,       15,        0.5,        5           ]}
        analysis = AnaDataset(Sdataset.ntv(tabular).to_analysis(True))
        # `analysis` is also available from pandas data
        import pandas as pd
        import ntv_pandas as npd
        analysis = pd.DataFrame(tabular).npd.analysis(distr=True)

In [2]: # each relationship is evaluated and measured 
        analysis.get_relation('plants', 'product').typecoupl
Out[2]: 'derived'

In [3]: analysis.get_relation('quantity', 'product').typecoupl
Out[3]: 'crossed'

In [4]: # the 'distance' between two Fields is measured (number of codec links to change for the Fields to be coupled)
        analysis.get_relation('quantity', 'product').distance
Out[4]: 6
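The guide does not spell out how this number is computed. One formula that reproduces the value 6 for 'quantity' and 'product' is sketched below; it is an assumption inferred from the output shown here, not taken from the library source, and the function name `distance` is only illustrative. It adds the number of pairs in excess of the minimum possible to the difference between the two distinct-value counts, so two coupled fields get a distance of 0.

```python
# Assumed formula, inferred from the output above; not the library's code.
def distance(field_a, field_b):
    """Pairs beyond the minimum possible, plus the gap between value counts."""
    n_a, n_b = len(set(field_a)), len(set(field_b))
    n_pairs = len(set(zip(field_a, field_b)))
    return n_pairs - max(n_a, n_b) + abs(n_a - n_b)


quantity = ['1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg']
product = ['apple', 'apple', 'orange', 'orange',
           'peppers', 'peppers', 'carrot', 'carrot']

print(distance(quantity, product))  # 6  (8 pairs - 4 + 2)
```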

In [5]: # the dataset can be represented as a 'derived tree'
        print(analysis.tree())
Out[5]: -1: root-derived (8)
           1 : quantity (6 - 2)
           2 : product (4 - 4)
              0 : plants (2 - 2)
           3 : price (0 - 8)

In [6]: # 'partitions' are found (partitions are multi-dimensional data)
        analysis.partitions()
Out[6]: [['quantity', 'product'], ['price']]
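To give an intuition for this result, the sketch below tests whether a group of fields behaves like the axes of a multidimensional array. It assumes that a partition is a group of fields whose value combinations are all present and identify each row exactly once; this reading and the helper `is_partition` are assumptions for illustration, not the library's definition or API.

```python
from math import prod


# Standalone sketch: is_partition is a hypothetical helper.
def is_partition(*fields):
    """True if the fields' value combinations cover the rows exactly once."""
    n_rows = len(fields[0])
    n_combinations = len(set(zip(*fields)))          # distinct tuples found
    n_expected = prod(len(set(f)) for f in fields)   # size of the full grid
    return n_combinations == n_expected == n_rows


plants = ['fruit', 'fruit', 'fruit', 'fruit',
          'vegetable', 'vegetable', 'vegetable', 'vegetable']
quantity = ['1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg']
product = ['apple', 'apple', 'orange', 'orange',
           'peppers', 'peppers', 'carrot', 'carrot']
price = [1, 10, 2, 20, 1.5, 15, 0.5, 5]

print(is_partition(quantity, product))  # True
print(is_partition(price))              # True
print(is_partition(plants, quantity))   # False
```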

In [7]: # the `field_partition` method returns the main structure of the dataset
        analysis.field_partition()
Out[7]: {'primary': ['quantity', 'product'],
         'secondary': ['plants'],
         'mixte': [],
         'unique': [],
         'variable': ['price']}

Uses

A TAB-analysis object is initialized by a set of properties (a dict with specific keys). It can therefore be used from any tabular data manager (e.g. pandas).

Possible uses are as follows:

  • control of a dataset in relation to a data model,
  • quality indicators of a dataset
  • analysis of datasets

and in connection with the tabular application:

  • error detection and correction,
  • generation of optimized data formats
  • conversion to multidimensional data
  • interface to specific applications
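As an illustration of error detection against a data model, the sketch below flags the rows that break an expected "derived" relationship, i.e. rows whose child value is attached to more than one parent value. It is a plain-Python sketch rather than a tab_analysis call: the helper `derived_violations` is hypothetical and the error in the sample data (an 'apple' row labelled 'vegetable') is introduced on purpose.

```python
from collections import defaultdict


# Standalone sketch: derived_violations is a hypothetical helper.
def derived_violations(child, parent):
    """Indices of rows whose child value is linked to several parent values."""
    parents = defaultdict(set)
    for c, p in zip(child, parent):
        parents[c].add(p)
    return [i for i, c in enumerate(child) if len(parents[c]) > 1]


product = ['apple', 'apple', 'orange', 'orange',
           'peppers', 'peppers', 'carrot', 'carrot']
# second row deliberately wrong: 'apple' labelled as 'vegetable'
plants = ['fruit', 'vegetable', 'fruit', 'fruit',
          'vegetable', 'vegetable', 'vegetable', 'vegetable']

print(derived_violations(product, plants))  # [0, 1]
```

Both 'apple' rows are reported because the counts alone cannot tell which of the two conflicting labels is the wrong one.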

installation

tab_analysis package

The tab_analysis package includes

  • analysis module
    • class AnaField : Structure of a single field
    • class AnaRelation : Relationship between two fields
    • class AnaDfield : Structure and relationships of fields inside a dataset
    • class AnaDataset : Structure of a dataset
    • a utility class with static methods : Util

Installation

tab_analysis itself is a pure Python package, maintained on the tab-analysis GitHub repository.

It can be installed with pip.

pip install tab_analysis

dependency:

  • None

Examples and uses

One Notebook is available:

  • example of uses presents the main tab_analysis uses (with pandas.DataFrame)

documentation

The documentation presents :

documents

Python Connectors documentation

Roadmap

  • interface : standard interfaces with tabular applications
  • visualization : interface with tree visualization (e.g. mermaid)
  • characterization : search for approximate or specific relationships
  • simulation : relationship modification simulation
# -*- coding: utf-8 -*-
"""
Created on Fri Sep 30 10:09:01 2022
@author: philippe@loco-labs.io

.. include:: ../README.md
.. include:: ../tab_analysis/README.md
.. include:: ../example/README.md
.. include:: ./README.md
"""