Transforming large (+1gb) JSON and CSV files | Coursera Community
Question

Transforming large (+1gb) JSON and CSV files

  • 18 December 2019
  • 3 replies
  • 158 views


Hi guys, I'm typically playing with data of several GBs, be it JSON or CSV. It's something I do at home, with my own PC and a little server I have (no fancy hardware, I just don't want to hear the fans running at night, so I've installed the server in the living room).

Now, I typically transform this data with Knime and PowerQuery because they're easy to work with, but they really struggle with the ~7 GB allocated to them. My server has even less memory, and it's running other stuff.

I tried all sorts of tricks, but it either becomes very slow or just cannot work because of the available RAM (Knime). I also tried with R and, yeah, I get better results, but it removes the fast & easy element and becomes a time sink. Same with Python, where I'm even less proficient.

I've been wondering about loading that data into MariaDB or another SQL database and transforming it on the server. I know databases are meant to handle this sort of problem, and I already know a bit of SQL, but I'm just exploring this option, so I've come here for advice before sinking a lot of time into it again, because I'll need a lot of transformations.

Any advice?


3 replies


I would recommend installing either MySQL or PostgreSQL on your server and finding an efficient way of loading that data…

If transformations are the biggest bottleneck, you could use something like Talend Open Studio to create an ETL process that handles the transformations for you and pushes the result into JSON or into any number of SQL-flavored databases.
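To make the ETL idea concrete, here is a rough sketch in plain Python (not Talend itself, just a stand-in for the same extract/transform/load pattern). It reads a CSV one row at a time, applies a transformation, and writes one JSON object per line, so memory use stays flat regardless of file size. The function names and the whitespace-trimming transformation are made up for illustration; swap in whatever your data needs.

```python
import csv
import json


def transform_row(row):
    """Hypothetical transformation: trim whitespace and drop empty fields."""
    return {
        key: value.strip()
        for key, value in row.items()
        if isinstance(value, str) and value.strip()
    }


def csv_to_json_lines(csv_path, out_path, transform=transform_row):
    """Stream a CSV row by row, transform each row, write one JSON object per line."""
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        # DictReader yields one row at a time, so the file is never fully in RAM.
        for row in csv.DictReader(src):
            dst.write(json.dumps(transform(row)) + "\n")
            count += 1
    return count
```

Because each row is handled independently, this only works for row-level transformations; anything that needs to see the whole dataset (joins, group-bys, sorting) is usually easier once the data is in a database.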


As David mentions, your problem isn’t entirely clear. 

  • What are you trying to do?
  • What is the problem?

If you need/want to create a database server it may be worth looking into Apache Spark. Spark does have a bit of a learning curve, but it is a technology that scales to ‘big data’ really well.

 


Hi @ConnorEzra, I don’t understand exactly what type of suggestions you’re looking for.

If it is a matter of disk space and computation cost, I would suggest moving to SQL. It may be a little less manageable than JSON, but it definitely takes less time and effort. I think SQLite can easily handle data that, as JSON, would be 1 GB to load, read, write, etc…
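As a minimal sketch of what that could look like, the snippet below streams a CSV into an SQLite table in batches using only Python's standard library, so only one batch is in memory at a time. The function name, the batch size, and the choice to create untyped columns from the header row are my assumptions, and it assumes every row has the same number of fields as the header.

```python
import csv
import itertools
import sqlite3


def load_csv_in_batches(csv_path, db_path, table, batch_size=50_000):
    """Stream a CSV file into an SQLite table without loading it all into RAM."""
    conn = sqlite3.connect(db_path)
    total = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # first line is assumed to hold the column names
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'
        while True:
            # Pull at most batch_size rows from the reader.
            batch = list(itertools.islice(reader, batch_size))
            if not batch:
                break
            with conn:  # commit one transaction per batch
                conn.executemany(insert_sql, batch)
            total += len(batch)
    conn.close()
    return total
```

Once the data is loaded, the transformations themselves can be written in SQL (for example `CREATE TABLE summary AS SELECT ... GROUP BY ...`), and SQLite does the work on disk rather than in RAM. Values arrive as text, so numeric columns may need a `CAST` in queries.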

 

Don’t know if this is the kind of answer you were looking for :|
