Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbmmodule bool function is performance prohibitive #46412

Closed
johansen mannequin opened this issue Feb 22, 2008 · 26 comments
Closed

dbmmodule bool function is performance prohibitive #46412

johansen mannequin opened this issue Feb 22, 2008 · 26 comments
Assignees
Labels
3.12 bugs and security fixes extension-modules C modules in the Modules dir performance Performance or resource usage

Comments

@johansen
Copy link
Mannequin

johansen mannequin commented Feb 22, 2008

BPO 2159
Nosy @jcea
Files
  • dbm_bool.patch: Add nb_bool for ndbm and gdbm
  • dbm_bool_b.patch: Add nb_bool for ndbm and gdbm
  • dbm_bool_c.patch
  • dbm_bool_d.patch: Add nb_bool for ndbm and gdbm
  • dbm_bool_e.patch: Add nb_bool for ndbm and gdbm
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/jcea'
    closed_at = None
    created_at = <Date 2008-02-22.00:41:14.969>
    labels = ['extension-modules', 'performance']
    title = 'dbmmodule inquiry function is performance prohibitive'
    updated_at = <Date 2014-05-04.15:01:48.260>
    user = 'https://bugs.python.org/johansen'

    bugs.python.org fields:

    activity = <Date 2014-05-04.15:01:48.260>
    actor = 'eolson'
    assignee = 'jcea'
    closed = False
    closed_date = None
    closer = None
    components = ['Extension Modules']
    creation = <Date 2008-02-22.00:41:14.969>
    creator = 'johansen'
    dependencies = []
    files = ['34879', '34886', '34895', '34916', '35150']
    hgrepos = []
    issue_num = 2159
    keywords = ['patch']
    message_count = 23.0
    messages = ['62668', '64126', '64447', '65080', '65081', '68106', '68107', '80737', '216330', '216361', '216382', '216519', '217098', '217542', '217557', '217568', '217662', '217669', '217670', '217671', '217676', '217685', '217877']
    nosy_count = 5.0
    nosy_names = ['jafo', 'jcea', 'johansen', 'ysj.ray', 'eolson']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue2159'
    versions = ['Python 3.5']

    @johansen
    Copy link
    Mannequin Author

    johansen mannequin commented Feb 22, 2008

    We've been using Python 2.4 to build the new package management software
    for OpenSolaris. We use a ndbm database to hold keywords about
    packages, and found that each time we added a new OpenSolaris build to
    our package repository, the time to import would increase by about 1/3
    of the previous time.

    It turns out that we were continually invoking a function in the
    dbmmodule that walked the entire database every time the function was
    called.

    Looking at dbmmodule.c, the source for dbm.so, is instructive:

    This is dbm_length, the function that we're _always_ calling.

    static int
    dbm_length(dbmobject *dp)
    {
            if (dp->di_dbm == NULL) {
                     PyErr_SetString(DbmError, "DBM object has already been
    closed"); 
                     return -1; 
            }
            if ( dp->di_size < 0 ) {
                    datum key;
                    int size;
    
                    size = 0;
                    for ( key=dbm_firstkey(dp->di_dbm); key.dptr;
                          key = dbm_nextkey(dp->di_dbm))
                            size++;
                    dp->di_size = size;
            }
            return dp->di_size;
    }

    It's a knock-off of function shown in ndbm(3C) that traverses the
    database. It looks like this function walks every record in the
    database, and then returns that as its size.

    Further examination of dbmmodule shows that dbm_length has been assigned
    as the function for the inquiry operator:

    static PyMappingMethods dbm_as_mapping = {
    (inquiry)dbm_length, /mp_length/
    (binaryfunc)dbm_subscript, /mp_subscript/
    (objobjargproc)dbm_ass_sub, /mp_ass_subscript/
    };

    It looks like dbm_length stashes the size of the database, so it doesn't
    always have to traverse it. However, further examination of the source
    shows that an insertion doesn't update the di_size counter. Worse yet,
    an update or a delete will cause the counter to be set to -1. This
    means that the next call to dbm_length will have to walk the entire
    database all over again. Ick.

    One of the problem parts of the code is this line in catalog.py:
    update_searchdb():

                    if fmri_list:
                            if not self.searchdb:
                                    self.searchdb = \
                                        dbm.open(self.searchdb_file, "c")

    This if not triggers the PyObject_IsTrue that invokes the inquiry operator
    for the dbm module. Every time we run this code, we're going to walk
    the entire database. By changing this to:

                    if fmri_list:
                            if self.searchdb is None:
                                    self.searchdb = \
                                        dbm.open(self.searchdb_file, "c")

    We were able to work around the problem by using the is None check,
    instead of if not self.searchdb; however, this seems like it is really a
    problem with the dbmmodule and should ultimately be resolved there.

    @johansen johansen mannequin added performance Performance or resource usage extension-modules C modules in the Modules dir labels Feb 22, 2008
    @jafo
    Copy link
    Mannequin

    jafo mannequin commented Mar 19, 2008

    Proposed patch is inline.

    @jafo jafo mannequin assigned gvanrossum Mar 19, 2008
    @benjaminp benjaminp added performance Performance or resource usage and removed performance Performance or resource usage labels Mar 24, 2008
    @jcea
    Copy link
    Member

    jcea commented Mar 24, 2008

    I think that "-1" is a sanity check. If the count is updated in the
    database, but it is not transactional (or there are bugs, or the DB is
    updated by a not up-to-date library, and so on), the cached counter and
    the real data can diverge.

    Anybody using "Berkeley DB" related databases knows that "length" is
    costly. By good reasons, actually :-).

    Checking for empty databases should be fast, nevertheless (just iterate
    over the first item in the database). We could simply define a
    "__nonzero__()" method for that. That would solve the "if" issue.

    @jcea
    Copy link
    Member

    jcea commented Apr 7, 2008

    Somebody posted a similar issue in pybsddb. The patch I proposed would
    be consistent with current "__len__" *internal* use, but the real
    intention, I think, is to return True or False if the database is open
    or closed, not if the database is empty or not.

    Opinions?

    See http://mailman.argo.es/pipermail/pybsddb/2008-April/000028.html

    @gvanrossum
    Copy link
    Member

    Assigning anything not related to Py3k design to me is a mistake; I
    don't have the bandwidth to handle this, sorry.

    @gvanrossum gvanrossum removed their assignment Apr 7, 2008
    @jcea
    Copy link
    Member

    jcea commented Jun 12, 2008

    johansen, could you be happy returning True of False according to
    database being open/closed?.

    @johansen
    Copy link
    Mannequin Author

    johansen mannequin commented Jun 12, 2008

    Yes, True/False should be sufficient for our purposes. IIRC, we were
    trying to determine if we had a stale handle to the database and needed
    to open it again.

    @johansen
    Copy link
    Mannequin Author

    johansen mannequin commented Jan 29, 2009

    I haven't been able to find any of the patches listed in the comments,
    but it does look like providing a nb_nonzero method in the module would
    solve our issue. PyObject_IsTrue checks the tp_as_number methods before
    the sequence and mapping methods. I'm not sure if it's safe to count on
    this behavior as always being true, but for 2.4 and the dbmmodule, it
    would work.

    @eolson
    Copy link
    Mannequin

    eolson mannequin commented Apr 15, 2014

    The performance is still an issue in python 3.
    Attaching a patch for python 3, performance numbers are below.

    Measuring ndbm time for a falsey check on an open db with 100 entries. gdbm performance is similar.
    Before patch:
    db is not None: 6.9141387939453125e-06
    not db: 0.0006985664367675781
    Factor: 101X (slow)
    After patch:
    db is not None: 4.76837158203125e-06
    not db: 4.0531158447265625e-06
    Factor: 1X (expected)

    Also, added a couple tests to verify bool(db) returns the correct value.

    @eolson
    Copy link
    Mannequin

    eolson mannequin commented Apr 15, 2014

    Uploading patch with minor test changes (dbm_bool_b.patch).

    Backwards compatibility note:

    Result of running bool(db) on a db that has been closed:
    Old: _dbm.error: DBM object has already been closed
    With patch: returns False instead of raising.

    I think this is desireable, but we could have bool(db) raise an exception when the db is closed if that is preferred for backwards compatibility.

    @eolson
    Copy link
    Mannequin

    eolson mannequin commented Apr 15, 2014

    Make the changes backward compatible after getting input on possible problems from r.david.murray
    patch: dbm_bool_c.patch

    Now, the only change should be faster performance for bool(db).

    @eolson
    Copy link
    Mannequin

    eolson mannequin commented Apr 16, 2014

    New patch with PEP-7 fix - no c++ // style comments. -Thanks johansen.

    @jcea
    Copy link
    Member

    jcea commented Apr 23, 2014

    First, Python 2.4 has been out of support for a really long time. Deleting.

    Eric, let me clarify the situation, because this report is old and I forgot the details.

    I think current situation is this, when doing something like "if db : DO_SOMETHING":

    a) If the database is closed, raise an exception.

    b) If database is empty, returns False.

    c) If database has any entry, returns True. Takes time proportional to database length, because it is going to go thru it.

    The patch you are proposing:

    a) If the database is closed, raise an exception.

    b) If database is empty, returns 0.

    c) If database has any entry, returns 1. Fast and simply checking if the database has at least a single record.

    Why 0/1 instead of True/False?. This is going to be a 3.5 patch, True/False are really there, they are not aliases.

    PS: When done, I will probably port this patch to current "pybsddb" work, although I am not sure that Berkeley DB has "firstkey" for all database types it support. Would you allow this porting, Eric? https://pypi.python.org/pypi/bsddb3/6.0.1

    @jcea
    Copy link
    Member

    jcea commented Apr 29, 2014

    Eric, would you mind to clarify the points I raised in the last message?. Lets move this forward.

    @eolson
    Copy link
    Mannequin

    eolson mannequin commented Apr 29, 2014

    Thank you for the feedback. Sorry I didn't see your previous response until today. I will take a look and respond tonight.

    @eolson
    Copy link
    Mannequin

    eolson mannequin commented Apr 30, 2014

    Hi Jesús,

    I believe the patch should have this behavior:
    a) If the database is closed, raise an exception.
    b) If database is empty, return False.
    c) If database has any entry, return True. Fast and simply checking if the database has at least a single record.

    The 0/1 appears to be converted to Py_TRUE/Py_FALSE in PyObject_IsTrue() in boolobject.c.

    For background, here's the code path I'm seeing:

    bool_new(...) (in Objects/boolobject.c)
    ...
    ok = PyObject_IsTrue(obj) (in Objects/object.c)
    ...
    return PyBool_FromLong(ok)

    Looking at PyObject_IsTrue is informative. It shows that since tp_as_number->nb_bool is defined (as dbm_bool), it is used instead of the old default of tp_as_mapping->mp_length.

    I'm still learning CPython, and I'm not sure if everything goes through bool_new(), but I think this makes sense. If you see more places I can/should look for details here, let me know.

    @eolson
    Copy link
    Mannequin

    eolson mannequin commented Apr 30, 2014

    Also, I'm happy to allow the code to be ported to pybsddb. As long as it doesn't cause any problems with CPython licensing - and I can't think of any way it could.

    @jcea
    Copy link
    Member

    jcea commented May 1, 2014

    I would like a bit more comfortable if you return True/False. Maybe I am missing something, I am not familiar with this either, but looks like more... sensible, instead of counting on implicit and magical 0/1 -> False/True conversion.

    What do you think?. Maybe I am missing something, I am not familiar with the details either.

    @jcea
    Copy link
    Member

    jcea commented May 1, 2014

    BTW, would you mind to sign a contributor form?.

    https://www.python.org/psf/contrib/

    @jcea
    Copy link
    Member

    jcea commented May 1, 2014

    Oh, I see that your already did. Sorry for the noise.

    @eolson
    Copy link
    Mannequin

    eolson mannequin commented May 1, 2014

    I did try the suggestion to return Py_False, but that gives the wrong result since Py_False is not 0 and gets returned as Py_True.

    I looked for similar code, and this looks like the convention for handling "if obj". PyObject_IsTrue() is called on the object. What it returns can be affected by an nb_bool function that returns 0 or 1 (or -1).

    Here are a few similar examples that need to implement nb_bool to handle converting an obj to a bool:
    Objects/floatobject.c:float_bool
    Modules/_datetimemodule.c:delta_bool
    Objects/complexobject.c:complex_bool

    For many types, there is no nb_bool, and other things are tried such as getting the length.

    @jcea
    Copy link
    Member

    jcea commented May 1, 2014

    OK, you did your homework.

    I checked "PyObject_Is_True()" function and I agree. This actually looks like a "leak" when True/False were added to Python. Python3 is inheriting it :-).

    OK.

    I see three issues in the code:

    a) You are getting a key from the database, and you are not freeing it. So, this code leaks memory.

    b) You should check database errors too. Documentation says "When the end of the database is reached, the dptr member of the key is a null pointer. If an error is detected, the dptr member of the key shall be a null pointer and the error condition of the database shall be set.". Raise an exception with the proper error. Would be nice to test that too, but it is probably nos possible.

    c) Please, use NULL instead of "0".

    Thanks for your effort, Eric.

    @jcea jcea self-assigned this May 1, 2014
    @eolson
    Copy link
    Mannequin

    eolson mannequin commented May 4, 2014

    Hi,
    Thanks for finding those issues. I attached a new patch.

    a) Good find, I added the free() for gdbm. ndbm doesn't need free().
    b) Added the error check. I don't know if a test can be made for this. If there was a common way to patch C libraries in CPython, I would give that a try.
    Also, do you know if we should check for the gdbm error before the dbm error, or does it not matter?
    c) Yes, it looks like NULL is used here instead of 0. Changed 0s to NULLS.

    Let me know if you see anything else.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @gvanrossum
    Copy link
    Member

    This looks worthy. I propose to make a new patch that works with 3.11. Also Python code that uses “if db:” should probably use “if db is not None:”, else an empty db will be reopened over and over.

    @gvanrossum gvanrossum self-assigned this Sep 4, 2022
    @iritkatriel iritkatriel added the 3.12 bugs and security fixes label Sep 6, 2022
    @gvanrossum gvanrossum changed the title dbmmodule inquiry function is performance prohibitive dbmmodule bool function is performance prohibitive Sep 8, 2022
    gvanrossum added a commit to gvanrossum/cpython that referenced this issue Sep 8, 2022
    Based on patches uploaded to BPO by BPO user eolson (Eric Olson).
    No known email or GitHub username for attribution.
    
    (The bug was opened in 2008, not quite a record but still impressive.)
    @gvanrossum
    Copy link
    Member

    I have a new PR (#96692). It's based on eolson's (Eric Olson's) last patch, from 2014. Alas, I don't have an email or GH username to attribute.

    @gvanrossum
    Copy link
    Member

    This will be in 3.12.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.12 bugs and security fixes extension-modules C modules in the Modules dir performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants