# Example Workflows

Here you can find example workflows that combine two or more REST API endpoints to perform common operations.
## Upload a CSV file containing a single-table dataset and download a CSV file containing a synthetic dataset

- Upload a dataset as a CSV file
- (optional) Detect its schema
- Create a source catalog that reads from the file
- Generate a synthetic dataset
  - Alternative A: create a generator and an execution with separate requests
  - Alternative B: create a generator and an execution with one request
- Download a ZIP file containing the synthetic dataset as a CSV file
### Assumptions

We assume:

- our requests contain the required `Authorization` header (see the OpenAPI Documentation for more information)
- we have a CSV file named `sample.csv` containing the following text:

```
One,Two,Three
1,2.32,abc
2,2.55,bce
3,2.42,ced
4,2.74,def
5,7.32,efg
6,7.61,fgh
7,2.29,hij
8,3.34,ijk
9,5.85,jkl
10,4.42,klm
11,3.11,lmn
```
### Upload a dataset as a CSV file

First we need to upload the dataset to the platform. We do so with a `POST /api/v1/catalogs/uploads` request with a `multipart/form-data` payload containing the file:

```
-----------------------------4498002542570133167595167108
Content-Disposition: form-data; name="file"; filename="sample.csv"
Content-Type: text/csv

One,Two,Three
1,2.32,abc
2,2.55,bce
3,2.42,ced
4,2.74,def
5,7.32,efg
6,7.61,fgh
7,2.29,hij
8,3.34,ijk
9,5.85,jkl
10,4.42,klm
11,3.11,lmn
-----------------------------4498002542570133167595167108--
```

and receive a `200 OK` response with payload:

```json
{
  "status": "ok",
  "data": {
    "id": "aDJHVmM0SFFWaEZnOlpyQjVvTEF2bFV0ZkZ0eklSMXp6MkZmqjRjkNldEeLPGuF-6Pq5fgHAFxAIUe_P7Xkk-ZH01oFiJk8",
    ...
  }
}
```
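The upload step can be sketched in Python using only the standard library. `BASE_URL` and `TOKEN` are placeholders for your deployment URL and credentials (our assumptions, not values from the API), and the helper names are ours:

```python
import json
import urllib.request

# Hypothetical placeholders -- replace with your deployment and token.
BASE_URL = "https://platform.example.com"
TOKEN = "Bearer <your-token>"

def build_multipart(filename: str, content: bytes, boundary: str) -> bytes:
    """Assemble a multipart/form-data body with a single "file" part."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: text/csv\r\n\r\n"
    ).encode()
    return head + content + f"\r\n--{boundary}--\r\n".encode()

def upload_csv(path: str) -> str:
    """POST the file to /api/v1/catalogs/uploads and return the upload id."""
    boundary = "----sketch-boundary-1234567890"
    with open(path, "rb") as fh:
        body = build_multipart(path, fh.read(), boundary)
    req = urllib.request.Request(
        f"{BASE_URL}/api/v1/catalogs/uploads",
        data=body,
        headers={
            "Authorization": TOKEN,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
    with urllib.request.urlopen(req) as resp:  # requires a live platform
        return json.load(resp)["data"]["id"]
```

Calling `upload_csv("sample.csv")` performs the actual request and yields the `data.id` value used in the following steps.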
### (optional) Detect its schema

To correctly read data from the dataset, the platform needs to know its schema. The schema can be detected automatically by the platform, but this is not mandatory if we already know it or can determine it ourselves. For this example, we will detect it.

NOTE: the detection is heuristic, so the result must be reviewed to catch inaccuracies.

We make a `POST /api/v1/catalogs/create/introspect` request with payload:

```json
{
  "type": "rel_file_src",
  "config": {
    "files": [
      "aDJHVmM0SFFWaEZnOlpyQjVvTEF2bFV0ZkZ0eklSMXp6MkZmqjRjkNldEeLPGuF-6Pq5fgHAFxAIUe_P7Xkk-ZH01oFiJk8"
    ]
  }
}
```

and receive a `200 OK` response with payload:

```json
{
  "status": "ok",
  "data": {
    "schema": {
      "tables": [
        {
          "name": "sample",
          "columns": [
            { "name": "One", "type": { "nullable": true, "type": "integer" }, "primary": false },
            { "name": "Two", "type": { "nullable": true, "type": "numeric" }, "primary": false },
            { "name": "Three", "type": { "nullable": true, "caseInsensitive": false, "type": "categorical" }, "primary": false }
          ],
          "foreign": []
        }
      ]
    }
  }
}
```

We review it and confirm it matches our data correctly.
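The introspection request can be sketched as follows; `BASE_URL`, `TOKEN`, and the helper names are our placeholders, not part of the API:

```python
import json
import urllib.request

# Hypothetical placeholders -- replace with your deployment and token.
BASE_URL = "https://platform.example.com"
TOKEN = "Bearer <your-token>"

def introspect_payload(upload_id: str) -> dict:
    """Build the body for POST /api/v1/catalogs/create/introspect."""
    return {"type": "rel_file_src", "config": {"files": [upload_id]}}

def detect_schema(upload_id: str) -> dict:
    """Send the introspection request and return the detected schema."""
    req = urllib.request.Request(
        f"{BASE_URL}/api/v1/catalogs/create/introspect",
        data=json.dumps(introspect_payload(upload_id)).encode(),
        headers={"Authorization": TOKEN, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires a live platform
        return json.load(resp)["data"]["schema"]
```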
### Create a source catalog that reads from the file

We make a `POST /api/v1/catalogs` request with a payload containing:

- `config`: the same input we used for the introspection (schema detection) step
- `schema`: the schema we received from the introspection step (or the one we determined on our own if we skipped it)

```json
{
  "type": "rel_file_src",
  "config": {
    "files": [
      "aDJHVmM0SFFWaEZnOlpyQjVvTEF2bFV0ZkZ0eklSMXp6MkZmqjRjkNldEeLPGuF-6Pq5fgHAFxAIUe_P7Xkk-ZH01oFiJk8"
    ]
  },
  "schema": {
    "tables": [
      {
        "columns": [
          { "name": "One", "type": { "nullable": true, "type": "integer" }, "primary": false },
          { "name": "Two", "type": { "nullable": true, "type": "numeric" }, "primary": false },
          { "name": "Three", "type": { "nullable": true, "caseInsensitive": false, "type": "categorical" }, "primary": false }
        ],
        "foreign": [],
        "name": "sample"
      }
    ]
  }
}
```

and receive a `200 OK` response with payload:

```json
{
  "status": "ok",
  "data": {
    "id": "h56Q4wRMcvhjvhMQPWXJMXjcP2",
    ...
  }
}
```
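Since the catalog body is just the introspection input plus the reviewed schema, it can be assembled with a small helper (the helper name and stub values below are ours, for illustration only):

```python
def catalog_payload(source: dict, schema: dict) -> dict:
    """Combine the source description used for introspection with the
    reviewed schema into the body for POST /api/v1/catalogs."""
    return {**source, "schema": schema}

# Stub values standing in for the real upload id and detected schema.
source = {"type": "rel_file_src", "config": {"files": ["<upload-id>"]}}
schema = {"tables": [{"name": "sample", "columns": [], "foreign": []}]}
body = catalog_payload(source, schema)
```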
### Generate a synthetic dataset

In order to generate a synthetic dataset we need to create:

- a generator, which is trained on the source data
- an execution of that generator, which uses the trained generator to produce synthetic data

These can be created with separate requests or with a single request.

Creating executions separately from their generator is useful when the generator must be trained at one time but produce synthetic data later, or when it must be trained once but produce synthetic data several times without incurring the training overhead again for each synthetic dataset.

Creating both a generator and its (first) execution in one request is useful when there is no need for a delay between generator training and synthetic data generation; it also performs better, as the data is read only once.
#### Alternative A: create a generator and an execution with separate requests

##### Create a generator

In order to generate synthetic data we need a generator that uses our source catalog. We create it with a `POST /api/v1/generators` request with payload:

```json
{
  "sourceId": "h56Q4wRMcvhjvhMQPWXJMXjcP2",
  "config": {
    "tables": [
      {
        "columns": [
          { "name": "One", "type": { "type": "integer" } },
          { "name": "Two", "type": { "type": "numeric" } },
          { "name": "Three", "type": { "type": "categorical" } }
        ],
        "name": "sample"
      }
    ]
  }
}
```

and receive a `200 OK` response with payload:

```json
{
  "status": "ok",
  "data": {
    "id": "xXFQV82VG84wphJrR3f9MHmhgw",
    "status": "pending",
    ...
  }
}
```
Generators can be configured with additional parameters via the `config` parameter. In this tutorial their default values are acceptable, so we omit them.

Notice how the `status` attribute's value is `"pending"`. This is because generator creation is a long-running operation: the generator undergoes training when created. To confirm its completion we wait some time, then make a `GET /api/v1/generators/xXFQV82VG84wphJrR3f9MHmhgw` request and receive a `200 OK` response with payload:

```json
{
  "status": "ok",
  "data": {
    "id": "xXFQV82VG84wphJrR3f9MHmhgw",
    "status": "completed",
    ...
  }
}
```

This time `status` is `"completed"`.
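The wait-then-check pattern can be wrapped in a small polling helper. This is a sketch of ours, not a platform feature; `fetch` stands for any callable that performs the `GET` request and returns the response's `data` object:

```python
import time

def wait_until_completed(fetch, poll_seconds: float = 5.0, timeout: float = 600.0) -> dict:
    """Poll a long-running resource until its status leaves "pending".

    fetch: callable returning the resource's `data` dict on each call.
    Returns the final `data` dict, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        data = fetch()
        if data["status"] != "pending":
            return data
        time.sleep(poll_seconds)
    raise TimeoutError("resource did not finish within the timeout")
```

The same helper works for both generators and executions, since both expose the same `status` lifecycle.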
##### Create a synthetic dataset using the generator

We make a `POST /api/v1/generators/xXFQV82VG84wphJrR3f9MHmhgw/executions` request with no payload and receive a `200 OK` response with payload:

```json
{
  "status": "ok",
  "data": {
    "id": "C62jpV9JFX78HgC3VGmMqQgx6q",
    "status": "pending",
    ...
  }
}
```

Notice how the `status` attribute's value is `"pending"`. This is because execution creation is also a long-running operation. To confirm its completion we wait some time, then make a `GET /api/v1/generators/xXFQV82VG84wphJrR3f9MHmhgw/executions/C62jpV9JFX78HgC3VGmMqQgx6q` request and receive a `200 OK` response with payload:

```json
{
  "status": "ok",
  "data": {
    "status": "completed",
    "destination": { "id": "cjg4XFmvW52CGvhp5v7XhRv3vg" },
    ...
  }
}
```

This time `status` is `"completed"`. Notice how a `destination` attribute containing an `id` attribute is now present. This is the id of the catalog created on the platform to store the generated synthetic data.
Executions too can be configured with additional parameters via a `config` parameter. In this tutorial their default values are acceptable, so we omit them.
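When automating this step, the destination catalog id can be extracted defensively once the execution completes; a minimal sketch (the helper name is ours):

```python
def destination_id(execution_data: dict) -> str:
    """Return the destination catalog id of a completed execution's data."""
    if execution_data.get("status") != "completed":
        raise ValueError("execution has not completed yet; poll again later")
    return execution_data["destination"]["id"]
```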
#### Alternative B: create a generator and an execution with one request

The request is similar to alternative A, but includes an `executionRequest` parameter. This can contain the same data accepted by the execution creation endpoint we used previously. Since the default values are acceptable, we set it to an empty object, and make a `POST /api/v1/generators` request with payload:

```json
{
  "sourceId": "h56Q4wRMcvhjvhMQPWXJMXjcP2",
  "executionRequest": {},
  "config": {
    "tables": [
      {
        "columns": [
          { "name": "One", "type": { "type": "integer" } },
          { "name": "Two", "type": { "type": "numeric" } },
          { "name": "Three", "type": { "type": "categorical" } }
        ],
        "name": "sample"
      }
    ]
  }
}
```

and receive a `200 OK` response with payload:

```json
{
  "status": "ok",
  "data": {
    "id": "xXFQV82VG84wphJrR3f9MHmhgw",
    "status": "pending",
    "lastExecutionId": "C62jpV9JFX78HgC3VGmMqQgx6q",
    ...
  }
}
```

This is analogous to the response we received when creating a generator in alternative A. Notice how the response now includes a `lastExecutionId` attribute. This holds the id of the execution created alongside the generator, and it can be used in the same way as the execution id we received in alternative A.
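The only difference between the two alternatives' request bodies is the presence of `executionRequest`, so both can be produced by one builder; a sketch under that assumption (the function name is ours):

```python
def generator_payload(source_id: str, tables: list, execution_request=None) -> dict:
    """Build the body for POST /api/v1/generators.

    Passing execution_request (even an empty dict, for defaults) also
    schedules the generator's first execution in the same request.
    """
    payload = {"sourceId": source_id, "config": {"tables": tables}}
    if execution_request is not None:
        payload["executionRequest"] = execution_request
    return payload
```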
### Download a ZIP file containing the synthetic dataset as a CSV file

We make a `GET /api/v1/catalogs/cjg4XFmvW52CGvhp5v7XhRv3vg/download?fmt=csv` request and receive a `200 OK` response with payload:

```json
{
  "status": "ok",
  "data": {
    "url": "/api/v1/cd/Z0pMeDlMbksxS29wTWxzR2RpbFRzNjNydUlyWU5JOWF4Qjl4TUQxVnVXN1JRYTVqc0JiQkNMZXRTYlFIOUhPR2aqFDs5Ylza6SUsFMgg2UvoRF4MObt-WKVJ2bM62npx63NStA/catalog_cjg4XFmvW52CGvhp5v7XhRv3vg.zip"
  }
}
```

Here `url` contains the URL of a ZIP file that can be downloaded and extracted to find the `sample.csv` file containing the CSV synthetic dataset:

```
One,Two,Three
1,2.29,Dsa
7,3.51,Qsc
6,2.35,Ewq
8,2.29,Dsa
4,2.42,Edc
1,2.29,Dsa
5,3.32,Cxz
2,2.75,Edc
6,5.61,Qwe
6,3.61,Asd
5,3.14,Qwe
```
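Assuming the ZIP has been downloaded from the returned `url` (with the same `Authorization` header as the other requests), its CSV members can be extracted in memory with the standard library; a sketch:

```python
import io
import zipfile

def extract_csvs(zip_bytes: bytes) -> dict:
    """Map each CSV member of the downloaded ZIP to its decoded text."""
    out = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith(".csv"):
                out[name] = zf.read(name).decode("utf-8")
    return out
```

Keeping everything in memory avoids writing temporary files; for very large synthetic datasets, extracting to disk with `ZipFile.extractall` may be preferable.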