# How to set up a disk-backed connection so you can work with files larger than RAM (R / dplyr + DuckDB-style workflow)

## Strategy overview: switch from in-memory to an on-disk database

When your dataset won't fit in memory, keep it on disk and access it through a DB-backed connection instead of loading the whole file into R memory.

## Create a disk-backed database connection

Open a DB connection that stores data on disk (not in-memory) by giving the connector a file path (the tutorial passes a db-file argument/directory when calling `dbConnect()`). In the DuckDB-style setup shown below, setting `read_only = FALSE` lets you add data; if you want a read-only mount, set it to `TRUE`.

## Reference the remote/large file without pulling it fully into memory

Use `dplyr::tbl()` (first argument = the DB connection) to refer to a table, or to register an on-disk resource, so you can run dplyr verbs against it rather than materializing it in memory. Example patterns:

- If you already imported a table into the DB: `tbl(con, "pumps")`.
- If DuckDB-style CSV registration is available, you can reference a CSV on disk via a read helper (e.g., a `read_csv_auto()`-like call inside an SQL expression) so the DB reads from the file on demand.

## Work with the data using lazy operations (dplyr verbs) and remote-aware functions

Don't use `nrow()`, which expects an in-memory data frame. Use database-aware verbs like `count()`, or other dplyr verbs, which translate to SQL and run in the DB.

## Inspect and manage the DB file and connection lifecycle

List DB tables with `dbListTables()` and close the connection with `dbDisconnect()`. The file on disk persists between R sessions, so you can reconnect later and it will still contain your data.

## Concurrency and safety notes

If you open the DB with `read_only = FALSE`, only one process can write at a time (concurrent writers are restricted); choose `read_only` appropriately for multi-user access.

## Practical tips from the tutorial

Start by testing your workflow on a small sample (the tutorial used a smaller dataset) before switching to the full large dataset.

## A concise, ready-to-run R example

The example below creates a disk-backed DuckDB file, registers a CSV and a Parquet dataset without loading them into R memory, and runs a few common dplyr queries, showing the SQL translation and how to materialize small results. It covers four steps, with a worked sketch after this list:

- **Setup (packages + connect).** Use DBI to open a DuckDB connection and store the DB on disk rather than in-memory.
- **Register a large CSV without loading it.** DuckDB lets you reference file readers (like `read_csv_auto`) as SQL expressions, so the DB reads data on demand rather than materializing the full file in R first.
- **Register a Parquet dataset (on disk or in the cloud) without loading it.** DuckDB's Parquet reader (`parquet_scan` / `read_parquet`) lets you work with Parquet files lazily; you can also point to Parquet stored in cloud object stores (S3/GCS/Azure), and DuckDB/Arrow will read only the needed portions. Parquet datasets can also be partitioned so queries scan only the relevant files.
- **Example queries (remote, lazy; show SQL, then collect the small result).** Use database-aware verbs like `count()`, `group_by()`, and `summarise()`; dplyr/dbplyr translate these to SQL, and the computation runs inside the DB, avoiding loading the whole dataset into R memory.
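A minimal sketch of the first steps, assuming the DBI, duckdb, dplyr, and dbplyr packages are installed; the file names (`my_analysis.duckdb`, `pumps.csv`, `events/*.parquet`) and the column names (`pump_type`, `event_type`) are hypothetical placeholders, not from the tutorial:

```r
library(DBI)
library(duckdb)
library(dplyr)
library(dbplyr)  # supplies sql() and the dplyr-to-SQL translation

## --- Setup: disk-backed connection ------------------------------------
## Passing a file path as dbdir stores the database on disk instead of
## in memory; read_only = FALSE allows writes.
con <- dbConnect(duckdb::duckdb(),
                 dbdir = "my_analysis.duckdb",  # hypothetical file name
                 read_only = FALSE)

## --- Register a large CSV without loading it ---------------------------
## read_csv_auto() runs inside DuckDB, so the CSV is read on demand;
## 'pumps.csv' is a placeholder path.
dbExecute(con, "CREATE OR REPLACE VIEW pumps AS
                SELECT * FROM read_csv_auto('pumps.csv')")
pumps <- tbl(con, "pumps")  # lazy reference; nothing is loaded into R yet

## --- Register a Parquet dataset without loading it ---------------------
## read_parquet() accepts globs, so a (possibly partitioned) directory of
## Parquet files is scanned lazily; 'events/*.parquet' is a placeholder.
events <- tbl(con, sql("SELECT * FROM read_parquet('events/*.parquet')"))
## For Parquet in cloud object stores, DuckDB's httpfs extension can read
## remote files and fetch only the portions a query needs, e.g.:
## dbExecute(con, "INSTALL httpfs; LOAD httpfs;")

## --- Lazy dplyr queries: the SQL runs inside DuckDB ---------------------
summary_q <- pumps |>
  count(pump_type)        # translated to SQL; not executed in R yet

events_q <- events |>
  group_by(event_type) |>
  summarise(n = n())      # also lazy; computed in the DB when needed

show_query(summary_q)     # inspect the SQL that dbplyr generated
```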
Use `collect()` only to bring back the (small) result you need; `collect()` triggers evaluation and transfers the results into R objects.

## Inspecting tables and closing the DB

Use `dbListTables()` to inspect what's registered and `dbDisconnect()` to close the connection when finished; both appear in the closing sketch below.

## Short practical notes

- Prefer lazy dplyr/dbplyr operations, so computation runs in DuckDB and only the required data is read.
- `count()` works for remote tables; avoid `nrow()`, which expects in-memory data frames.
- If data sits in cloud storage (S3/GCS/Azure), DuckDB/Arrow can read remotely and transfer only the needed slices; partitioned Parquet can greatly speed up selective queries.

This example gives a minimal, reproducible pattern: create a disk-backed DB, register file-backed views that don't load into R, use dplyr verbs to run the work in the DB, and collect only the small results you need.
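To close out the sketch, here are the final steps, continuing from the `summary_q` query and `con` connection defined above (`shutdown = TRUE` asks the duckdb driver to shut the database instance down when the connection closes):

```r
## collect() triggers evaluation inside DuckDB and transfers only the
## (small) result into an R data frame.
summary_df <- collect(summary_q)

dbListTables(con)                    # views/tables registered in the DB file
dbDisconnect(con, shutdown = TRUE)   # close; the .duckdb file persists on disk
```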