[GH-ISSUE #65] How to speed up loading in 4GB CSV? #47

Closed
opened 2026-05-23 08:27:55 -06:00 by gitea-mirror · 16 comments
Owner

Originally created by @qpwo on GitHub (Feb 4, 2022).
Original GitHub issue: https://github.com/appy-one/acebase/issues/65

Originally assigned to: @appy-one on GitHub.

Hey I'm wondering if I'm doing something wrong. I have a 4GB CSV file where each line is points one string to a list of strings. I have about 3 million lines.

  • I tried setting them one at a time and it was taking a while.
  • I noticed readme recommends doing large batches with update(). So I'm batching my updates into groups of 10k rows, but each batch is still taking a minute or two, so the whole file is going to take five-ten hours at this rate. I can't do the whole thing at once because I'll run out of RAM.
  • I have logging set to error only
  • I think node's memory should be 12gb because I set --max-old-space-size=12000
  • Node is running in Terminal.app, not vscode's integrated terminal because I've heard that can slow processes
  • Loading the whole file into object batches without doing acebase only took a couple minutes
  • After each batch update completes I get a bunch of these messages:

write lock on path "whatever" by tid whatever (_lockAndWrite "whatever") is taking a long time to complete [1]

Here's the main bit of code:

  let giantObj: Record<string, string[]> = {}

  await processLines(path + '/data.tsv', async (line, num) => {
      if (num % 10_000 === 0) {
          log('on line', frac(num, numStargazerRows))
          log('putting batch in database:')
          await db.ref('gazers').update(giantObj)
          log('batch done')
          log('freeing old obj hopefully')
          giantObj = {}
      }
      const cols = line.split('\t')
      const head = cols[0].replace('/', '!!')
      const tail = cols.slice(1)
      giantObj[head] = tail
  })

Any suggestions for loading the data in faster?

Originally created by @qpwo on GitHub (Feb 4, 2022). Original GitHub issue: https://github.com/appy-one/acebase/issues/65 Originally assigned to: @appy-one on GitHub. Hey I'm wondering if I'm doing something wrong. I have a 4GB CSV file where each line is points one string to a list of strings. I have about 3 million lines. - I tried setting them one at a time and it was taking a while. - I noticed readme recommends doing large batches with `update()`. So I'm batching my updates into **groups of 10k rows**, but each batch is still taking a minute or two, so the whole file is going to take five-ten hours at this rate. I can't do the whole thing at once because I'll run out of RAM. - I have logging set to error only - I think node's memory should be 12gb because I set `--max-old-space-size=12000` - Node is running in Terminal.app, not vscode's integrated terminal because I've heard that can slow processes - Loading the whole file into object batches without doing acebase only took a couple minutes - After each batch update completes I get a bunch of these messages: `write lock on path "whatever" by tid whatever (_lockAndWrite "whatever") is taking a long time to complete [1]` Here's the main bit of code: ```ts let giantObj: Record<string, string[]> = {} await processLines(path + '/data.tsv', async (line, num) => { if (num % 10_000 === 0) { log('on line', frac(num, numStargazerRows)) log('putting batch in database:') await db.ref('gazers').update(giantObj) log('batch done') log('freeing old obj hopefully') giantObj = {} } const cols = line.split('\t') const head = cols[0].replace('/', '!!') const tail = cols.slice(1) giantObj[head] = tail }) ``` Any suggestions for loading the data in faster?
gitea-mirror 2026-05-23 08:27:55 -06:00
Author
Owner

@qpwo commented on GitHub (Feb 4, 2022):

Could be something with the async functions, maybe causing some kind of mutex thrashing? I don't think that two update() calls ever run over each other.

<!-- gh-comment-id:1029611804 --> @qpwo commented on GitHub (Feb 4, 2022): Could be something with the async functions, maybe causing some kind of mutex thrashing? I don't think that two `update()` calls ever run over each other.
Author
Owner

@appy-one commented on GitHub (Feb 4, 2022):

Thanks for your detailed report!

On the database side there are a couple of things to consider:

  • Running AceBase in a single Node.JS process means only 1 CPU thread doing all the work. Although AceBase is able to run in clusters, it does not share already assigned workload with other processes. There is no internal load balancer, good to keep that in mind!
  • There is currently a hard limit on how long a database operation is allowed to run, it will show the warnings you mentioned 3 times ([1] to [3]) before canceling the write lock on the resource. If the update takes longer, it will deny further lock requests for that transaction, effectively canceling them. The warnings are shown every 30s and effectively canceled after 2 minutes. I will be making this timeout optional and configurable soon, but you can increase it yourself by editing LOCK_TIMEOUT constant in acebase/src/node-lock.js. Another option is making your batches smaller so they are able to be stored before timing out.
  • If you have indexes on the data you are importing to, consider removing them until your import is done. An index slows down inserts. I assume this is not the case here, but I do want to mention it.
  • If you have transaction logging enabled on your db, this will also slow down inserts and you might also want to disable that.

Looking at your code, you might also want to try preparing the 10,000 records synchronously and then do the batch update. I don't think that will make a huge difference, but now it's an additional tick per record. Maybe also try to reduce the batch size to 1,000 records at a time.

I've recently worked on developing a streaming import for json data, but the current implementation is (too) slow. Improvements are coming, but will take some time. I might actually look at implementing a csv import before then, because of it simple "flat" format that makes it very easy to create batches. Note that transforming the import data (as in your code) will not be possible then, so batch updating will remain the preferred strategy in your case.

Let me know if you are able to speed up the process with above info, I'll also perform some tests with generated data.

<!-- gh-comment-id:1029810268 --> @appy-one commented on GitHub (Feb 4, 2022): Thanks for your detailed report! On the database side there are a couple of things to consider: - Running AceBase in a single Node.JS process means only 1 CPU thread doing all the work. Although AceBase is able to run in clusters, it does not share already assigned workload with other processes. There is no internal load balancer, good to keep that in mind! - There is currently a hard limit on how long a database operation is allowed to run, it will show the warnings you mentioned 3 times (`[1]` to `[3]`) before canceling the write lock on the resource. If the update takes longer, it will deny further lock requests for that transaction, effectively canceling them. The warnings are shown every 30s and effectively canceled after 2 minutes. I will be making this timeout optional and configurable soon, but you can increase it yourself by editing LOCK_TIMEOUT constant in [acebase/src/node-lock.js](https://github.com/appy-one/acebase/blob/15e42e3fdd530275573f203e659b604ffcb1b829/src/node-lock.js#L7). Another option is making your batches smaller so they are able to be stored before timing out. - If you have indexes on the data you are importing to, consider removing them until your import is done. An index slows down inserts. I assume this is not the case here, but I do want to mention it. - If you have transaction logging enabled on your db, this will also slow down inserts and you might also want to disable that. Looking at your code, you might also want to try preparing the 10,000 records synchronously and then do the batch update. I don't think that will make a huge difference, but now it's an additional tick per record. Maybe also try to reduce the batch size to 1,000 records at a time. I've recently worked on developing a streaming import for json data, but the current implementation is (too) slow. Improvements are coming, but will take some time. I might actually look at implementing a csv import before then, because of it simple "flat" format that makes it very easy to create batches. Note that transforming the import data (as in your code) will not be possible then, so batch updating will remain the preferred strategy in your case. Let me know if you are able to speed up the process with above info, I'll also perform some tests with generated data.
Author
Owner

@appy-one commented on GitHub (Feb 6, 2022):

I've done some testing with generated data, I definitely agree its performance must be improved here. I have a hunch where the bottleneck might be, I'll dive deeper into it..

<!-- gh-comment-id:1030832507 --> @appy-one commented on GitHub (Feb 6, 2022): I've done some testing with generated data, I definitely agree its performance must be improved here. I have a hunch where the bottleneck might be, I'll dive deeper into it..
Author
Owner

@qpwo commented on GitHub (Feb 8, 2022):

Where do you think it might be?

<!-- gh-comment-id:1033000351 --> @qpwo commented on GitHub (Feb 8, 2022): Where do you think it might be?
Author
Owner

@appy-one commented on GitHub (Feb 8, 2022):

There's multiple places, but I found the biggest performance improvements can be made in the way existing child nodes are queried, and new ones are added. For large object collections, an index is created for the child nodes to enable quick lookups. When doing small queries and updates, it's fast enough to query the index for each requested child - but that becomes really slow if 10,000 children have to be looked up one at a time. Similarly, adding new entries to an index one at a time is not very fast when 10,000s are added.

I've made quite a few improvements that allow multiple index lookups and storing at the same time, I am seeing performance improvements in the 20x range already in my current tests. These changes do impact critical parts of the storage engine so I'll have to make 100% none of this can corrupt a database at any time, so I'll be doing extensive testing this week.

<!-- gh-comment-id:1033018199 --> @appy-one commented on GitHub (Feb 8, 2022): There's multiple places, but I found the biggest performance improvements can be made in the way existing child nodes are queried, and new ones are added. For large object collections, an index is created for the child nodes to enable quick lookups. When doing small queries and updates, it's fast enough to query the index for each requested child - but that becomes really slow if 10,000 children have to be looked up one at a time. Similarly, adding new entries to an index one at a time is not very fast when 10,000s are added. I've made quite a few improvements that allow multiple index lookups and storing at the same time, I am seeing performance improvements in the 20x range already in my current tests. These changes do impact critical parts of the storage engine so I'll have to make 100% none of this can corrupt a database at any time, so I'll be doing extensive testing this week.
Author
Owner

@appy-one commented on GitHub (Feb 19, 2022):

As you can see, truckloads of commits! Improved many parts of the code and tackled quite a number of issues along the way, tests are now looking good, I'm performing last long-running tests this weekend. If all stays this way I'll publish to npm Monday 🥳

<!-- gh-comment-id:1045976830 --> @appy-one commented on GitHub (Feb 19, 2022): As you can see, truckloads of commits! Improved many parts of the code and tackled quite a number of issues along the way, tests are now looking good, I'm performing last long-running tests this weekend. If all stays this way I'll publish to npm Monday 🥳
Author
Owner

@appy-one commented on GitHub (Feb 21, 2022):

I just published acebase version 1.15.0, let me know if it works!

<!-- gh-comment-id:1047097228 --> @appy-one commented on GitHub (Feb 21, 2022): I just published acebase version 1.15.0, let me know if it works!
Author
Owner

@qpwo commented on GitHub (Feb 21, 2022):

Appreciated I'll give it a run

<!-- gh-comment-id:1047209612 --> @qpwo commented on GitHub (Feb 21, 2022): Appreciated I'll give it a run
Author
Owner

@appy-one commented on GitHub (Mar 3, 2022):

@qpwo Any news about this?

<!-- gh-comment-id:1057990555 --> @appy-one commented on GitHub (Mar 3, 2022): @qpwo Any news about this?
Author
Owner

@qpwo commented on GitHub (Mar 4, 2022):

Hey sorry to report I'm getting 30 seconds on 10k batches and a crash after 6 batches.

I had a bug in my script I wasn't awaiting my transactions!

So I'm getting about 30 seconds to a minute per 10k records. So for the full 3 million records I would expect it to take about (30 seconds * 3 million / 10 k) = 2 1/2 hours . That's not crazy slow.

Full test script and logs here

https://gist.github.com/qpwo/1445f4a7053ba5e712ea2628eb1c6e38

<!-- gh-comment-id:1059444594 --> @qpwo commented on GitHub (Mar 4, 2022): ~~Hey sorry to report I'm getting 30 seconds on 10k batches and a crash after 6 batches.~~ I had a bug in my script I wasn't awaiting my transactions! So I'm getting about 30 seconds to a minute per 10k records. So for the full 3 million records I would expect it to take about (30 seconds * 3 million / 10 k) = 2 1/2 hours . That's not crazy slow. Full test script and logs here https://gist.github.com/qpwo/1445f4a7053ba5e712ea2628eb1c6e38
Author
Owner

@qpwo commented on GitHub (Mar 4, 2022):

Is there a simple example somewhere of using multiple CPU cores?

100k records in 10k batches took a total of 389 seconds.

I should also mention each record is a list of up to like 10k <40 character strings. (Maybe averaging a few dozen strings per record.)

<!-- gh-comment-id:1059453260 --> @qpwo commented on GitHub (Mar 4, 2022): Is there a simple example somewhere of using multiple CPU cores? 100k records in 10k batches took a total of 389 seconds. I should also mention each record is a list of up to like 10k <40 character strings. (Maybe averaging a few dozen strings per record.)
Author
Owner

@appy-one commented on GitHub (May 9, 2022):

Is there a simple example somewhere of using multiple CPU cores?

Sorry for the late reply. See Standard Node.js clusters for info about running multiple threads. Kindly note you can't use multiple threads to speed up your import because each thread would acquire an exclusive write lock on the target collection - they'll effectively just be taking turns in importing batches.

<!-- gh-comment-id:1120896854 --> @appy-one commented on GitHub (May 9, 2022): > Is there a simple example somewhere of using multiple CPU cores? Sorry for the late reply. See [Standard Node.js clusters](https://github.com/appy-one/acebase-ipc-server#standard-nodejs-clusters) for info about running multiple threads. Kindly note you can't use multiple threads to speed up your import because each thread would acquire an exclusive write lock on the target collection - they'll effectively just be taking turns in importing batches.
Author
Owner

@appy-one commented on GitHub (May 19, 2022):

I'm closing this for now. Feel free to comment later, I'll post an update once I've worked on this again.

Spread the word contribute Sponsor AceBase

<!-- gh-comment-id:1131494779 --> @appy-one commented on GitHub (May 19, 2022): I'm closing this for now. Feel free to comment later, I'll post an update once I've worked on this again. [![Spread the word](https://user-images.githubusercontent.com/26569719/169265089-3d593555-e1ad-4390-986b-877ac2c38a47.svg)](https://twitter.com/intent/tweet?button=&url=https://github.com/appy-one/acebase&text=I'm+using+@AcebaseRealtime+in+my+project+to+make+my+life+easier!&button=) [![contribute](https://user-images.githubusercontent.com/26569719/169265318-30c4c6a5-7c89-46a0-a7a2-ef433a8192f4.svg)](https://github.com/appy-one/acebase#contributing) [![Sponsor AceBase](https://user-images.githubusercontent.com/26569719/168233053-8e56b243-4140-40ab-9a30-4cb3cc149bfe.svg)](https://github.com/sponsors/appy-one)
Author
Owner

@YanDevDe commented on GitHub (Nov 1, 2022):

I'm also facing the issue here, acebase is just simply too slow for it. I've decided to use LokiJS instead which is way quicker (1 Million entries only took 1206ms to insert, 878ms for query, 896ms for updating and 2555ms to save it in IndexeDB)

acebase in this case rather takes 1 mins in some area :/

<!-- gh-comment-id:1298980489 --> @YanDevDe commented on GitHub (Nov 1, 2022): I'm also facing the issue here, acebase is just simply too slow for it. I've decided to use LokiJS instead which is way quicker (1 Million entries only took 1206ms to insert, 878ms for query, 896ms for updating and 2555ms to save it in IndexeDB) acebase in this case rather takes 1 mins in some area :/
Author
Owner

@qpwo commented on GitHub (Nov 1, 2022):

((Disk can never compete with memory -- different use cases right))

<!-- gh-comment-id:1298994130 --> @qpwo commented on GitHub (Nov 1, 2022): ((Disk can never compete with memory -- different use cases right))
Author
Owner

@YanDevDe commented on GitHub (Nov 1, 2022):

Sorry, I misread - I thought acebase was stored in memory too. Different usecase, yep :)

<!-- gh-comment-id:1298995513 --> @YanDevDe commented on GitHub (Nov 1, 2022): Sorry, I misread - I thought acebase was stored in memory too. Different usecase, yep :)
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: github-starred/acebase#47
No description provided.