Gold patches fail 13 tasks in SWE-Bench-Verified #20

@KawaiiNotHawaii

I retrieved the gold patches from the swe-bench-verified dataset and uploaded them with sb-cli for evaluation. Only 487 of the 500 instances passed all the test cases; 5 were marked as 'incompleted' and 8 as 'unresolved'.

    "unresolved_ids": [
        "astropy__astropy-7606",
        "astropy__astropy-8707",
        "astropy__astropy-8872",
        "django__django-10097",
        "psf__requests-1724",
        "psf__requests-2317",
        "pylint-dev__pylint-6528",
        "pylint-dev__pylint-7277"
    ],

I then ran mini-swe-agent with Claude and uploaded the resulting preds.json to sb-cli. Two of the unresolved_ids above were marked as resolved in that run, which suggests that the gold patch for these instances is not really 'gold':

"django__django-10097",
"psf__requests-1724"
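
For anyone who wants to repeat the comparison, here is a minimal sketch of the cross-check described above. It assumes you already have both ID lists in hand. The helper name `gold_failures_resolved_by_agent` is hypothetical and not part of sb-cli, and `agent_resolved` holds only the two IDs named in this issue rather than a full report.

```python
# IDs left unresolved by the gold patches, copied from the report above.
gold_unresolved = [
    "astropy__astropy-7606",
    "astropy__astropy-8707",
    "astropy__astropy-8872",
    "django__django-10097",
    "psf__requests-1724",
    "psf__requests-2317",
    "pylint-dev__pylint-6528",
    "pylint-dev__pylint-7277",
]

# Assumption: in practice this would be the full list of resolved IDs from
# the agent run's report; only the two IDs named in this issue appear here.
agent_resolved = [
    "django__django-10097",
    "psf__requests-1724",
]


def gold_failures_resolved_by_agent(gold_unresolved, agent_resolved):
    """Return the IDs the gold patch failed on but the agent run resolved.

    Hypothetical helper, not an sb-cli function.
    """
    return sorted(set(gold_unresolved) & set(agent_resolved))


suspect_ids = gold_failures_resolved_by_agent(gold_unresolved, agent_resolved)
print(suspect_ids)
```

An instance that lands in `suspect_ids` is worth a closer look, since a failing gold patch next to a passing agent patch could point to either a patch problem or a flaky evaluation environment.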
