Skip to main content

HTTP file

This connector captures data from an HTTP endpoint into a Flow collection.

It is available for use in the Flow web application. For local development or open-source workflows, ghcr.io/estuary/source-http-file:dev provides the latest version of the connector as a Docker image. You can also follow the link in your browser to see past image versions.

Supported data types

This connector automatically captures the data hosted at the specified URL into a single Flow collection.

The following file types are supported:

  • Avro
  • CSV
  • JSON
  • Protobuf
  • W3C Extended Log

The following compression methods are supported:

  • ZIP
  • GZIP
  • ZSTD

By default, Flow automatically detects the file type and compression method. If necessary, you can specify the correct file type, compression, and other properties (CSV only) using the optional parser configuration.

Prerequisites

To use this connector, you'll need the URL to an HTTP endpoint that hosts data of one of the supported types. The HTTP endpoint must support HEAD HTTP requests, and the response to this request must include a Last-Modified header.

tip

You can send a test HEAD request using Curl with the -I parameter, for example: curl -I https://my-site.com/my_hosted_dataset.json.zip Use this online tool to easily do so in your browser.

Some HTTP endpoints require credentials for access. If this is the case, have your username and password ready.

Configuration

You configure connectors either in the Flow web app, or by directly editing the catalog specification file. See connectors to learn more about using connectors. The values and specification sample below provide configuration details specific to the HTTP file source connector.

Properties

Endpoint

PropertyTitleDescriptionTypeRequired/Default
/credentialsCredentialsUser credentials, if required to access the data at the HTTP URL.object
/credentials/passwordPasswordPassword, if required to access the HTTP endpoint.string
/credentials/userUserUsername, if required to access the HTTP endpoint.string
/headersHeadersobject
/headers/itemsAdditional HTTP HeadersAdditional HTTP headers when requesting the file. These are uncommon.array
/headers/items/-/keyHeader Keystring
/headers/items/-/valueHeader Valuestring
/parserParser ConfigurationConfigures how files are parsedobject
/parser/compressionCompressionDetermines how to decompress the contents. The default, 'Auto', will try to determine the compression automatically.null, stringnull
/parser/formatFormatDetermines how to parse the contents. The default, 'Auto', will try to determine the format automatically based on the file extension or MIME type, if available.object{"type":"auto"}
/parser/format/typeTypestring
/urlHTTP File URLA valid HTTP url for downloading the source file.stringRequired

Bindings

PropertyTitleDescriptionTypeRequired/Default
/streamStreamName of the datasetstringRequired
/syncModeSync modeConnection method. Set to incremental for real-time updates.stringRequired

Sample

captures:
${PREFIX}/${CAPTURE_NAME}:
endpoint:
connector:
image: ghcr.io/estuary/source-http-file:dev
config:
url: https://my-site.com/my_hosted_dataset.json.zip
parser:
compression: zip
format:
type: csv
config:
delimiter: ","
encoding: UTF-8
errorThreshold: 5
headers: [ID, username, first_name, last_name]
lineEnding: "\\r"
quote: "\""
bindings:
- resource:
stream: my_hosted_dataset.json.zip
syncMode: incremental
target: ${PREFIX}/${COLLECTION_NAME}

Advanced: Parsing HTTP-hosted data

HTTP endpoints can support a variety of file types. For each file type, Flow must parse and translate data into collections with defined fields and JSON schemas.

By default, the parser will automatically detect the type and shape of the data at the HTTP endpoint, so you won't need to change the parser configuration for most captures.

However, the automatic detection may be incorrect in some cases. To fix or prevent this, you can provide explicit information in the parser configuration, which is part of the endpoint configuration for this connector.

The parser configuration includes:

  • Compression: Specify how the data is compressed. If no compression type is specified, the connector will try to determine the compression type automatically. Options are:

    • zip
    • gzip
    • zstd
    • none
  • Format: Specify the data format, which determines how it will be parsed. If no file type is specified, the connector will try to determine the file type automatically Options are:

    • Avro
    • CSV
    • JSON
    • Protobuf
    • W3C Extended Log

CSV configuration

CSV files include several additional properties that are important to the parser. In most cases, Flow is able to automatically determine the correct values, but you may need to specify for unusual datasets. These properties are:

  • Delimiter. Options are:

    • Comma (",")
    • Pipe ("|")
    • Space ("0x20")
    • Semicolon (";")
    • Tab ("0x09")
    • Vertical tab ("0x0B")
    • Unit separator ("0x1F")
    • SOH ("0x01")
    • Auto
  • Encoding type, specified by its WHATWG label.

  • Optionally, an Error threshold, as an acceptable percentage of errors. If set to a number greater than zero, malformed rows that fall within the threshold will be excluded from the capture.

  • Escape characters. Options are:

    • Backslash ("\\")
    • Disable escapes ("")
    • Auto
  • Optionally, a list of column Headers, if not already included in the first row of the CSV file.

    If any headers are provided, it is assumed that the provided list of headers is complete and authoritative. The first row of your CSV file will be assumed to be data (not headers), and you must provide a header value for every column in the file.

  • Line ending values

    • CRLF ("\\r\\n") (Windows)
    • CR ("\\r")
    • LF ("\\n")
    • Record Separator ("0x1E")
    • Auto
  • Quote character

    • Double Quote ("\"")
    • Single Quote (")
    • Disable Quoting ("")
    • Auto

The sample specification above includes these fields.

Advanced: Using HTTP headers

For data accessed through certain APIs, you may need to send headers as part of your HTTP request. This is uncommon, and is supported by the optional Headers configuration.

This configuration section is encrypted with sops, so you can safely include secretes such as API keys.

See the source data's API documentation for headers that may be required for your capture.