
Answer by a_guest for How to parse a large JSON file efficiently in Python?

You can scan the JSON file once to find the position of each level-1 separator, i.e. a comma that belongs directly to the top-level object, and then divide the file into sections delimited by these positions. For example:

    {"a": [1, 2, 3], "b": "Hello, World!", "c": {"d": 4, "e": 5}}
            ^      ^            ^        ^             ^
            |      |            |        |             |
         level-2   |         quoted      |          level-2
                   |                     |
                level-1               level-1

Here we want to find the level-1 commas, which separate the items contained directly in the top-level object. We can use a generator that scans the JSON stream character by character and keeps track of descending into and stepping out of nested objects and arrays. When it encounters a level-1 comma that is not inside a string, it yields the corresponding position:

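    def find_sep_pos(stream, *, sep=','):
        level = 0  # current nesting depth
        quoted = False  # True while inside a JSON string
        backslash = False  # True right after an escaping backslash
        for pos, char in enumerate(stream):
            if backslash:
                backslash = False  # this character is escaped, skip it
            elif quoted:
                # Inside a string only escapes and the closing quote matter,
                # so brackets and commas in string values are ignored.
                if char == '\\':
                    backslash = True
                elif char == '"':
                    quoted = False
            elif char == '"':
                quoted = True
            elif char in '{[':
                level += 1
            elif char in ']}':
                level -= 1
            elif char == sep and level == 1:
                yield pos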

Used on the example data above, this would give list(find_sep_pos(example)) == [15, 37].
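
To make it concrete what these positions delimit, here is a minimal in-memory sketch that slices the example string at the two positions quoted above. The names `bounds`, `sections` and `parts` are only illustrative, and slicing a string like this is meant as an explanation of the idea, not as the way to handle a large file:

```python
import json

example = '{"a": [1, 2, 3], "b": "Hello, World!", "c": {"d": 4, "e": 5}}'
sep_pos = [15, 37]  # the level-1 comma positions quoted in the text

# Section boundaries: the opening bracket, each level-1 comma, the closing bracket.
bounds = [0, *sep_pos, len(example) - 1]
sections = [example[start + 1:stop] for start, stop in zip(bounds, bounds[1:])]

# Each section is valid JSON once it is wrapped in the top-level brackets again.
parts = [json.loads('{' + section + '}') for section in sections]
print(parts)
```

Each element of `parts` is a small dict holding exactly one key of the top-level object.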

Then we can divide the file into sections that correspond to the separator positions and load each section individually via json.loads:

    import itertools as it
    import json

    with open('example.json') as fh:
        # Iterating over `fh` yields lines, so we chain them in order to get characters.
        sep_pos = tuple(find_sep_pos(it.chain.from_iterable(fh)))
        fh.seek(0)  # reset to the beginning of the file
        stream = it.chain.from_iterable(fh)
        opening_bracket = next(stream)
        closing_bracket = dict(('{}', '[]'))[opening_bracket]
        offset = 1  # the bracket we just consumed adds an offset of 1
        for pos in sep_pos:
            json_str = (
                opening_bracket
                + ''.join(it.islice(stream, pos - offset))
                + closing_bracket
            )
            obj = json.loads(json_str)  # this is your object
            next(stream)  # step over the separator
            offset = pos + 1  # adjust where we are in the stream right now
            print(obj)
        # The last object still remains in the stream, so we load it here.
        obj = json.loads(opening_bracket + ''.join(stream))
        print(obj)
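
If you would rather consume the sections lazily than print them, the same loop can be wrapped in a generator. The following is only a sketch under a few assumptions: `iter_sections` is a hypothetical helper name, the separator positions are passed in (for instance the ones computed by `find_sep_pos`), and an `io.StringIO` stands in for the open file so the snippet runs on its own:

```python
import io
import itertools as it
import json


def iter_sections(stream, sep_pos):
    """Yield one parsed section per level-1 item of the JSON in `stream`.

    `stream` is any iterable of characters and `sep_pos` holds the positions
    of the level-1 separators (hypothetical helper, for illustration only).
    """
    stream = iter(stream)
    opening = next(stream)
    closing = {'{': '}', '[': ']'}[opening]
    offset = 1  # the opening bracket has already been consumed
    for pos in sep_pos:
        body = ''.join(it.islice(stream, pos - offset))
        yield json.loads(opening + body + closing)
        next(stream)  # step over the separator
        offset = pos + 1
    # The remainder still ends with the original closing bracket.
    yield json.loads(opening + ''.join(stream))


example = '{"a": [1, 2, 3], "b": "Hello, World!", "c": {"d": 4, "e": 5}}'
fh = io.StringIO(example)  # stands in for an open file
merged = {}
for section in iter_sections(it.chain.from_iterable(fh), [15, 37]):
    merged.update(section)  # or process each section and discard it
print(merged == json.loads(example))
```

Only one section is joined into a string at a time, so peak memory is bounded by the largest level-1 item rather than by the whole file, as long as you do not accumulate the results the way `merged` does here.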
