Root folder organization
│ .gitignore
│ environment_RelFCI.yml // Current conda env settings used
│ README.md
│ LICENSE
│ gen_data.py // File showing how to generate random schemas and models
│
├───algorithms // Relational causal discovery algorithms
│ ├───RCD.py // Relational Causal Discovery algorithm
│ └───RelFCI.py // Relational FCI algorithm
│
├───datageneration
│ ├───ModelGenerator.py // Methods to generate a model from dependencies
│ └───SchemaGenerator.py // Methods to generate a schema from input data
│
├───dseparation
│ └───DSeparation.py // Methods to compute DSeparation in relational data between variables
│
├───input // Algorithm config data location with example configs
│
├───models // Models used by the algorithms
│ ├───AbstractGroundGraph.py // AGG class used by RCD and for computing DSeparation
│ ├───MixedAbstractGroundGraph.py // AGG class used by RelFCI, containing circle markers
│ ├───Model.py // Relational Model class
│ └───Schema.py // Relational Schema class
│
├───output // Algorithms output location, also for logger
│
└───utils
  ├───EdgeOrientation.py // Methods implementing orientation rules for RCD
  ├───FCIEdgeOrientation.py // Methods implementing orientation rules for RelFCI
  ├───Oracle.py // Oracle class used to perform conditional independence tests
  ├───PlotCode.py // Methods to plot AGGs and Models to file
  ├───RelationalDependency.py // Classes implementing Relational Variables and Dependencies
  ├───RelationalSpace.py // Helper methods to generate relational data
  ├───RelationalValidity.py // Helper methods to check the validity of relational data
  └───ValidationTest.py // Methods to evaluate learned data against the ground truth
To run the code, either install all the packages it requires or import our Anaconda environment with the following commands:
conda env create -f environment_RelFCI.yml
conda activate RelFCI
Now move into the algorithms directory and run RelFCI.py:
cd algorithms
python RelFCI.py
By default, the code runs example E01, but you can modify the main of RelFCI.py to try the other inputs provided in the input directory.
You can also generate a random model/schema or define one yourself.
Data is generated with the following naming system:
- Entities are user defined, e.g., A;
- Relationship names are the union of the names of the two entities they connect, followed by an index, e.g., AB1 for A and B;
- Attributes have the same name as the respective entity or relationship, followed by an underscore and an increasing index, e.g., A_1, A_2 and AB1_1.
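The naming scheme above can be sketched in a few lines of Python. The helper names `relationship_name` and `attribute_names` are hypothetical and are not functions from this repository. The sketch assumes that the index is simply appended to the concatenated entity names and that attribute indices start at 1.

```python
from typing import List


def relationship_name(entity_a: str, entity_b: str, index: int = 1) -> str:
    """Concatenate the two entity names and append the relationship index."""
    return f"{entity_a}{entity_b}{index}"


def attribute_names(owner: str, count: int) -> List[str]:
    """Name attributes as owner, underscore, 1-based index (assumed convention)."""
    return [f"{owner}_{i}" for i in range(1, count + 1)]


rel = relationship_name("A", "B")
print(rel)                        # AB1
print(attribute_names("A", 2))    # ['A_1', 'A_2']
print(attribute_names(rel, 1))    # ['AB1_1']
```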
Custom data
To create your own example, define a .json file in the input directory as follows:
{
"entities": {"A": 3, "B": 4},
"latent": ["A_2"],
"relationships": [["A", "B", 0]],
"dependencies": ["[A].A_3 -> [A].A_2", "[B, AB1, A].A_3 -> [B].B_1",
"[A, AB1, B].B_1 -> [A].A_1",
"[B, AB1, A].A_2 -> [B].B_2",
"[B].B_4 -> [B].B_2", "[B].B_4 -> [B].B_1",
"[A, AB1, B].B_3 -> [A].A_1", "[A, AB1, B].B_3 -> [A].A_2"]
}
JSON skeleton:
- entities: a dictionary containing the names of the entities and the number of attributes of each one;
- latent: the list of attributes that you want to be latent (hidden from the algorithm);
- relationships: the list of relationships. A relationship is defined as a list in which the first two elements are the entities to link and the final number is the number of attributes of the relationship;
- dependencies: the list of dependencies that you want the algorithm to learn.
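To make the dependency syntax concrete, here is a minimal sketch that loads a config of this shape and splits each dependency string into its relational path and attribute. The function names `parse_variable` and `parse_dependency` are hypothetical. This is not the parser used by the repository, only an illustration of the `[path].attribute -> [path].attribute` format shown above.

```python
import json
import re

# Example config in the format described above (dependency list shortened).
CONFIG_TEXT = """
{
  "entities": {"A": 3, "B": 4},
  "latent": ["A_2"],
  "relationships": [["A", "B", 0]],
  "dependencies": ["[A].A_3 -> [A].A_2", "[B, AB1, A].A_3 -> [B].B_1"]
}
"""

# "[path].attribute", where path is a comma-separated list of item names.
_VARIABLE = re.compile(r"^\[(?P<path>[^\]]+)\]\.(?P<attr>\w+)$")


def parse_variable(text):
    """Split '[B, AB1, A].A_3' into (['B', 'AB1', 'A'], 'A_3')."""
    match = _VARIABLE.match(text.strip())
    if match is None:
        raise ValueError(f"not a relational variable: {text!r}")
    path = [item.strip() for item in match.group("path").split(",")]
    return path, match.group("attr")


def parse_dependency(text):
    """Split 'cause -> effect' into two parsed relational variables."""
    cause, effect = text.split("->")
    return parse_variable(cause), parse_variable(effect)


config = json.loads(CONFIG_TEXT)
for dep in config["dependencies"]:
    (cause_path, cause_attr), (effect_path, effect_attr) = parse_dependency(dep)
    print(cause_path, cause_attr, "->", effect_path, effect_attr)
```

Each path starts at the entity that owns the effect attribute, so in `[B, AB1, A].A_3 -> [B].B_1` both sides are written from the perspective of `B`.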
Random data
To generate random data, define a .json file in the input directory as follows:
{
"seed" : 750,
"algos" : ["d-RelFCI"],
"runs" : 1, # Number of runs for each configuration
"num_entities" : 4, # Initial number of entities
"num_dependencies" : 8, # Initial number of dependencies
"hop_threshold" : 2, # Hop threshold for AGGs
"num_latent_vars" : 1, # Initial number of latent variables
"delta_ent" : 1, # Difference of number of entities between configurations
"delta_dep" : 1, # Difference of number of dependencies between configurations
"delta_lat" : 2, # Difference of number of latent variables between configurations
"step_ent" : 2, # Number of variations for entities
"step_dep" : 2, # Number of variations for dependencies
"step_lat" : 2 # Number of variations for latent variables
}
For example, num_entities = 4, delta_ent = 1, step_ent = 2 means that the first graph has 4 entities, the second (1st step) has 5 and the last one (2nd step) has 6.
Avoid a high number of steps, as the total number of runs is the product of runs, step_ent, step_dep and step_lat.
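The way these parameters expand into a grid of configurations can be sketched as follows. The function names are hypothetical, and the sketch follows the worked example literally (the initial value plus one value per step), which is an assumption. The repository's own loop may enumerate the values differently, so treat the counts as illustrative.

```python
from itertools import product


def parameter_values(start, delta, steps):
    """Initial value followed by one value per step, each `delta` apart."""
    return [start + i * delta for i in range(steps + 1)]


def enumerate_configurations(cfg):
    """Yield (num_entities, num_dependencies, num_latent_vars) combinations."""
    entities = parameter_values(cfg["num_entities"], cfg["delta_ent"], cfg["step_ent"])
    dependencies = parameter_values(cfg["num_dependencies"], cfg["delta_dep"], cfg["step_dep"])
    latents = parameter_values(cfg["num_latent_vars"], cfg["delta_lat"], cfg["step_lat"])
    yield from product(entities, dependencies, latents)


# Values taken from the example config above.
cfg = {
    "num_entities": 4, "delta_ent": 1, "step_ent": 2,
    "num_dependencies": 8, "delta_dep": 1, "step_dep": 2,
    "num_latent_vars": 1, "delta_lat": 2, "step_lat": 2,
}

print(parameter_values(4, 1, 2))  # [4, 5, 6], as in the worked example
for n_ent, n_dep, n_lat in enumerate_configurations(cfg):
    print(n_ent, n_dep, n_lat)
```

Because every parameter is combined with every other, the grid grows multiplicatively, which is why small step counts are advisable.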
Real-world data
We also provide real-world data about the French elections, extracted from Twitter. The data is available in the real-world/ folder.
To run the real-world example:
- Set up a MySQL server.
- Start the MySQL server instance:
sudo /usr/local/mysql/support-files/mysql.server start
- Connect to the server:
mysql -u test -p
- Run the following commands:
mysql> SET GLOBAL log_bin_trust_function_creators = 1;
mysql> SET GLOBAL local_infile=1;
mysql> SET GLOBAL connect_timeout=28800;
mysql> SET GLOBAL interactive_timeout=28800;
mysql> SET GLOBAL wait_timeout=28800;
- Run algorithms/RelFCI.py with linearCITest_run().
- Update the MySQLDataStore connection parameters on line 630 with your own server data.
Input test data can be found in the input folder. This data also shows the JSON format required to create custom data.
We provide 5 examples to test both single-entity and multi-entity models, with and without latent variables:
| Name | Entities | Relationships | Latent Variables | Description |
|---|---|---|---|---|
| E01 | 1 | 0 | Yes | Simple Example |
| E02 | 1 | 0 | Yes | Complex Example |
| E03 | 1 | 0 | Yes | Simple Example with additional learned edge (see RFCI paper) |
| E04 | 2 | 1 | Yes | Multi-entity model created by merging E01 and E03, with an additional latent attribute in the relationship |
| E05 | 2 | 1 | No | Multi-entity, medium-complexity model used to test the algorithm without latent variables |