
Apify Dataset

Overview

Apify is a web scraping and web automation platform providing both ready-made and custom solutions, an open-source JavaScript SDK and Python SDK for web scraping, proxies, and many other tools to help you build and run web automation jobs at scale.

The results of a scraping job are usually stored in an Apify Dataset. This Airbyte connector provides streams for working with datasets, including syncing their content to your chosen destination.

To sync data from a dataset, you need only your API token and the dataset ID.

You can find your personal API token in the Apify Console under Settings -> Integrations, and the dataset ID under Storage -> Datasets.

Running an Airbyte sync from an Apify webhook

When your Apify job (an Actor run) finishes, it can trigger an Airbyte sync by calling the Airbyte API's manual connection trigger (POST /v1/connections/sync). Make the call from an Apify webhook, which executes when the run finishes.
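
As a rough sketch of the HTTP call such a webhook would make, the Python snippet below builds the POST request without sending it. Several details are assumptions that are not stated on this page: the base URL of your Airbyte instance, the placeholder connection ID, and the JSON body with a `connectionId` field. Authentication is omitted because it depends on your Airbyte deployment. In practice you would enter the equivalent URL and payload in the Apify webhook settings rather than run Python.

```python
import json
import urllib.request


def build_sync_request(airbyte_url: str, connection_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that triggers a manual sync."""
    # Assumption: the endpoint accepts a JSON body naming the connection to sync.
    body = json.dumps({"connectionId": connection_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{airbyte_url.rstrip('/')}/v1/connections/sync",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values, for illustration only.
request = build_sync_request(
    "http://localhost:8000/api",
    "00000000-0000-0000-0000-000000000000",
)
print(request.get_method(), request.full_url)

# To actually trigger the sync, send it:
# urllib.request.urlopen(request)
```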

Features

| Feature           | Supported? |
|-------------------|------------|
| Full Refresh Sync | Yes        |
| Incremental Sync  | Yes        |

Performance considerations

The Apify Dataset connector uses the Apify Python Client under the hood and should handle API limitations under normal usage.

Streams

dataset_collection

  • Calls https://api.apify.com/v2/datasets
  • Properties:
    • Apify personal API token (found in the Apify Console under Settings -> Integrations)

dataset

  • Calls https://api.apify.com/v2/datasets/{datasetId}
  • Properties:
    • Apify personal API token (found in the Apify Console under Settings -> Integrations)
    • Dataset ID (found in the Apify Console under Storage -> Datasets)

item_collection

  • Calls https://api.apify.com/v2/datasets/{datasetId}/items
  • Properties:
    • Apify personal API token (found in the Apify Console under Settings -> Integrations)
    • Dataset ID (found in the Apify Console under Storage -> Datasets)
  • Limitations:
    • The stream uses a dynamic schema (all data is stored under the "data" key), so it should support any Apify Dataset, regardless of which Actor produced it.
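
To illustrate what storing everything under the "data" key means, here is a minimal sketch. The `to_record` helper and the sample items are hypothetical and are not taken from the connector's code; they only show how items with unrelated fields end up with the same top-level shape.

```python
from typing import Any


def to_record(item: dict[str, Any]) -> dict[str, Any]:
    """Wrap one dataset item under the "data" key, whatever its fields are."""
    return {"data": item}


# Two made-up items with unrelated shapes, as different Actors might produce.
items = [
    {"url": "https://example.com", "title": "Example"},
    {"sku": "A-1", "price": 9.99, "tags": ["new"]},
]

records = [to_record(item) for item in items]

# Every record has the same single top-level key, so one schema fits all of them.
print([list(record) for record in records])
```

Because the top-level schema never changes, downstream consumers have to unpack the "data" object themselves to reach individual fields.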

item_collection_website_content_crawler

  • Calls the same endpoint and uses the same properties as the item_collection stream.
  • Limitations:
    • The stream uses a static schema that corresponds to datasets produced by the Website Content Crawler Actor, so only datasets produced by this Actor are supported.
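
The endpoints listed for the streams above can be summarized in a small lookup. This is a sketch rather than the connector's implementation: the `stream_url` and `auth_headers` helpers are hypothetical, and passing the token as a Bearer `Authorization` header is an assumption about how you might call the Apify API yourself.

```python
from typing import Optional
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"

# Path template per stream, as listed in the Streams section.
STREAM_PATHS = {
    "dataset_collection": "/datasets",
    "dataset": "/datasets/{datasetId}",
    "item_collection": "/datasets/{datasetId}/items",
    "item_collection_website_content_crawler": "/datasets/{datasetId}/items",
}


def stream_url(stream: str, dataset_id: Optional[str] = None) -> str:
    """Return the Apify API URL a stream calls; raises KeyError for unknown streams."""
    template = STREAM_PATHS[stream]
    if "{datasetId}" in template:
        if not dataset_id:
            raise ValueError(f"stream '{stream}' requires a dataset ID")
        template = template.replace("{datasetId}", quote(dataset_id, safe=""))
    return API_BASE + template


def auth_headers(token: str) -> dict[str, str]:
    """Assumption: the token is sent as a Bearer token in the Authorization header."""
    return {"Authorization": f"Bearer {token}"}


# Placeholder dataset ID, for illustration only.
print(stream_url("dataset_collection"))
print(stream_url("item_collection", "my-dataset-id"))
```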

Reference

Config fields reference

| Field      | Type   | Property name |
|------------|--------|---------------|
| API token  | string | token         |
| Dataset ID | string | dataset_id    |
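
As an illustration of the two config fields, the sketch below checks a configuration dictionary before use. The `validate_config` helper is hypothetical, the values are placeholders, and treating both fields as required non-empty strings is an assumption based on the description at the top of this page.

```python
# Property names and types from the config fields reference.
CONFIG_FIELDS = {"token": str, "dataset_id": str}


def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    errors = []
    for name, expected_type in CONFIG_FIELDS.items():
        value = config.get(name)
        if not isinstance(value, expected_type) or not value:
            errors.append(f"'{name}' must be a non-empty {expected_type.__name__}")
    return errors


# Placeholder values, not real credentials.
config = {"token": "YOUR_APIFY_API_TOKEN", "dataset_id": "YOUR_DATASET_ID"}

print(validate_config(config))
print(validate_config({"token": ""}))
```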

Changelog

| Version | Date       | Pull Request | Subject |
|---------|------------|--------------|---------|
| 2.1.5   | 2024-04-19 | 37115        | Updating to 0.80.0 CDK |
| 2.1.4   | 2024-04-18 | 37115        | Manage dependencies with Poetry. |
| 2.1.3   | 2024-04-15 | 37115        | Base image migration: remove Dockerfile and use the python-connector-base image |
| 2.1.2   | 2024-04-12 | 37115        | schema descriptions |
| 2.1.1   | 2023-12-14 | 33414        | Prepare for airbyte-lib |
| 2.1.0   | 2023-10-13 | 31333        | Add stream for arbitrary datasets |
| 2.0.0   | 2023-09-18 | 30428        | Fix broken stream, manifest refactor |
| 1.0.0   | 2023-08-25 | 29859        | Migrate to lowcode |
| 0.2.0   | 2022-06-20 | 28290        | Make connector work with platform changes not syncing empty stream schemas. |
| 0.1.11  | 2022-04-27 | 12397        | No changes. Used connector to test publish workflow changes. |
| 0.1.9   | 2022-04-05 | PR#11712     | No changes from 0.1.4. Used connector to test publish workflow changes. |
| 0.1.4   | 2021-12-23 | PR#8434      | Update fields in source-connectors specifications |
| 0.1.2   | 2021-11-08 | PR#7499      | Remove base-python dependencies |
| 0.1.0   | 2021-07-29 | PR#5069      | Initial version of the connector |