tab-analysis.docs.user_guide
Created on Fri Sep 30 10:09:01 2022 @author: philippe@loco-labs.io
TAB-analysis : A tool to Analyse tabular and multi-dimensional structures
TAB-analysis analyzes and measures the relationships between Fields in any tabular Dataset.
The TAB-analysis tool is part of the Environmental Sensing Project
For more information, see the user guide or the github repository.
What is TAB-analysis ?
Principles
Each field in a dataset has global properties (e.g. the number of different values). The relationships between two fields can also be characterized in a similar way (e.g. number of pairs of values from the two different fields).
Analyzing these properties gives us a measure of the entire dataset.
The TAB-analysis module carries out these measurements and analyzes. It also identifies data that does not respect given relationships and multidimensional structure.* .
Examples
Here is a price list of different foods based on packaging.
'plants' | 'quantity' | 'product' | 'price' |
---|---|---|---|
'fruit' | '1 kg' | 'apple' | 1 |
'fruit' | '10 kg' | 'apple' | 10 |
'fruit' | '1 kg' | 'orange' | 2 |
'fruit' | '10 kg' | 'orange' | 20 |
'vegetable' | '1 kg' | 'peppers' | 1.5 |
'vegetable' | '10 kg' | 'peppers' | 15 |
'vegetable' | '1 kg' | 'carrot' | 0.5 |
'vegetable' | '10 kg' | 'carrot' | 5 |
In this example, we observe two kinds of relationships:
- classification ("derived" relationship): between 'plants' and 'product' (each product belongs a plant)
- crossing ("crossed" relationship): between 'product' and 'quantity' (all the combinations of the two fields are present).
This Dataset can be translated in a matrix between 'quantity' ['1 kg', '10 kg'] and 'product' ['apple', 'orange', 'peppers', 'carrot']
In [1]: # creation of the `analysis` object
from tab_dataset import Sdataset
from tab_analysis import AnaDataset
tabular = {'plants': ['fruit', 'fruit','fruit', 'fruit','vegetable','vegetable','vegetable','vegetable' ],
'quantity': ['1 kg' , '10 kg', '1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg' ],
'product': ['apple', 'apple', 'orange', 'orange', 'peppers', 'peppers', 'carrot', 'carrot' ],
'price': [1, 10, 2, 20, 1.5, 15, 0.5, 5 ]}
analysis = AnaDataset(Sdataset.ntv(tabular).to_analysis(True))
# `analysis` is also available from pandas data
import pandas as pd
import ntv_pandas as npd
analysis = pd.DataFrame(tabular).npd.analysis(distr=True)
In [2]: # each relationship is evaluated and measured
analysis.get_relation('plants', 'product').typecoupl
Out[2]: 'derived'
In [3]: analysis.get_relation('quantity', 'product').typecoupl
Out[3]: 'crossed'
In [4]: # the 'distance' between to Fields is measured (number of codec links to change to be coupled))
analysis.get_relation('quantity', 'product').distance
Out[4]: 6
In [5]: # the dataset can be represented as a 'derived tree'
print(analysis.tree())
Out[5]: -1: root-derived (8)
1 : quantity (6 - 2)
2 : product (4 - 4)
0 : plants (2 - 2)
3 : price (0 - 8)
In [6]: # 'partitions' are found (partitions are multi-dimensionnal data)'
analysis.partitions()
Out[6]: [['quantity', 'product'], ['price']]
In [7]: # the `field_partition` method return the main structure of the dataset
analysis.field_partition()
Out[7]: {'primary': ['quantity', 'product'],
'secondary': ['plants'],
'mixte': [],
'unique': [],
'variable': ['price']}
Uses
A TAB-analysis object is initialized by a set of properties (a dict with specific keys). It can therefore be used from any tabular data manager (e.g. pandas).
Possible uses are as follows:
- control of a dataset in relation to a data model,
- quality indicators of a dataset
- analysis of datasets
and in connection with the tabular application:
- error detection and correction,
- generation of optimized data formats
- conversion to multidimensional data
- interface to specific applications
installation
tab_analysis package
The tab_analysis
package includes
analysis
module- class
AnaField
: Structure of a single field - class
AnaRelation
: Relationship between two fields - class
AnaDfield
: Structure and relationships of fields inside a dataset - class
AnaDataset
: Structure of a dataset - an utility class with static methods :
Util
- class
Installation
tab_analysis
itself is a pure Python package. maintained on tab-analysis github repository.
It can be installed with pip
.
pip install tab_analysis
dependency:
- None
Examples and uses
One Notebook is available:
- example of uses presents the main tab_analysis uses (with pandas.DataFrame)
documentation
The documentation presents :
documents
- NTV-TAB specification (Json for tabular data)
- tabular analysis: principles and concepts
- tabular relationships
Python Connectors documentation
- API
- Release
Roadmap
- interface : standard interfaces with tabular applications
- visualization : interface with tree visualization (e.g. mermaid)
- characterization : research of almost or specific relationships
- simulation : relationship modification simulation