Alex Chan (alexwlchan) • @wellcometrust • https://alexwlchan.net
“Lexie’s solution probably consists of several robots doing plenty of complicated magic” ~ @Samathy
digital preservation @ Wellcome • fun projects • they/them

alan-turing-institute/the-turing-way 646

Host repository for The Turing Way: a how to guide for reproducible data science

alexwlchan/contributions-graph 76

A Python clone of GitHub’s Contributions graph

alexwlchan/backup-slack 60

A script for backing up your message history from Slack

alexwlchan/backup-pinboard 51

Create a local backup of your Pinboard bookmarks

alexwlchan/ao3 47

A scripted Python interface to some of the data on AO3

alexwlchan/asexual 40

🖤💜 Asexual Pride in GitHub repository languages

alexwlchan/alexwlchan.net 26

Source code and plugins for my website, a static site built with Jekyll

alexwlchan/docstore 22

Organising my scanned documents and reference files with keyword tagging

alexwlchan/backup_tumblr 14

Scripts for backing up your posts, likes and media files from Tumblr

alexwlchan/auto_merge_my_pull_requests 6

A GitHub Action for automatically merging my pull requests on personal repos

pull request comment wellcomecollection/catalogue

Fix ElasticRetrieverTest

Ah, is this to ensure there’s something in the index?

warrd

comment created time in 37 minutes

push event wellcomecollection/catalogue

Alex Chan

commit sha 20a2ba47f29f5f52ed313579c7cd5dec70ee3d02

Handle checkout --track errors properly

view details

push time in 3 hours

push event wellcomecollection/catalogue

Alex Chan

commit sha a596c23a7144528cdc5d2fa9d2c6bf3fbf7024ce

Handle having the branch already checked out

view details

push time in 3 hours

push event wellcomecollection/catalogue

Alex Chan

commit sha 648aba0b4197ddc605fa1a473cb5bff3e87f76d0

We never expect to look up an empty list of identifiers

view details

push time in 4 hours

push event wellcomecollection/catalogue

Alex Chan

commit sha 8e0cafdd6180dac19a02dd30d0d3c0e085b8dbe9

The ElasticIndexer should never be asked to index an empty list

view details

Alex Chan

commit sha 8d0a817ff36cc2571d245dfeb7bf63dc156555fa

Log the complete set of works we tried to index in the merger

The Left(documents) will only contain documents that failed to index -- also log what we tried to index.

view details

push time in 4 hours

Pull request review comment wellcomecollection/catalogue

add WECO_DEPLOY_STAGE_DEPLOY_TIMEOUT to build pipeline

 steps:
               "--from-label", "latest",
               "--environment-id", "staging",
               "--description", $BUILDKITE_BUILD_URL,
-              "--confirmation-wait-for", 1200]
+              "--confirmation-wait-for", ${WECO_DEPLOY_STAGE_DEPLOY_TIMEOUT:-1200}]

Is the intention that we'd eventually replace this with a hard-coded value, and ditch the env var?

jamesgorrie

comment created time in 5 hours
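
For reference, ${WECO_DEPLOY_STAGE_DEPLOY_TIMEOUT:-1200} expands to the variable's value when it is set and non-empty, and falls back to 1200 otherwise. Below is a rough Python sketch of that fallback behaviour, purely for illustration – it is not how the pipeline or weco-deploy actually reads the value:

    import os

    # Mirror the shell's ${WECO_DEPLOY_STAGE_DEPLOY_TIMEOUT:-1200}: use the env var
    # if it's set to a non-empty value, otherwise fall back to 1200 seconds.
    confirmation_wait_for = int(os.environ.get("WECO_DEPLOY_STAGE_DEPLOY_TIMEOUT") or 1200)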

delete branch wellcomecollection/weco-deploy

delete branch : show-confirmation-defaults

delete time in 5 hours

push event wellcomecollection/weco-deploy

Alex Chan

commit sha 2570b17582ac573b77f9cc98fed3ca669f2915cd

Show defaults for the --confirmation flags

view details

Alex Chan

commit sha 7a7efa81f531c65607cc4eea1615e228c36dacf8

Merge pull request #60 from wellcomecollection/show-confirmation-defaults

Show defaults for the --confirmation flags

view details

push time in 5 hours

create branch wellcomecollection/weco-deploy

branch : show-confirmation-defaults

created branch time in 6 hours

pull request review event

push event wellcomecollection/catalogue

push time in 6 hours

push event wellcomecollection/catalogue

Alex Chan

commit sha 93e5eddefb0acd99c070fda780d2d25bd36ea2cf

Clear out the .sbt_metadata directory of old entries

view details

push time in 6 hours

pull request comment wellcomecollection/catalogue

Fix ID minter dependencies

Wonder if we can make sbt do this automatically.

jamieparkinson

comment created time in 6 hours

pull request review event

PR opened wellcomecollection/catalogue

Use generic test cases for ElasticIndexer

This patch replaces the existing ElasticIndexerTest with a generic IndexerTestCases, with two implementations (ElasticIndexerTest and MemoryIndexerTest).

This follows the pattern we've used elsewhere, and ensures the MemoryIndexer behaves in the same way as the ElasticIndexer. This paves the way for us to actually use the MemoryIndexer in some of our tests.

+307 -177

0 comments

9 changed files

pr created time in 9 hours
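
The shared-test-cases pattern described in this PR works roughly as follows. The catalogue itself is Scala, so this is only a minimal Python sketch of the idea, and every name in it (IndexerTestCases, MemoryIndexer, and so on) is illustrative rather than the real API:

    import unittest

    class MemoryIndexer:
        """A trivial in-memory indexer, defined here only to make the sketch runnable."""
        def __init__(self):
            self._documents = {}

        def index(self, documents):
            for document in documents:
                self._documents[document["id"]] = document

        def get(self, document_id):
            return self._documents[document_id]

    class IndexerTestCases:
        """Shared test cases; each concrete test class supplies its own indexer."""
        def create_indexer(self):
            raise NotImplementedError

        def test_indexed_documents_can_be_retrieved(self):
            indexer = self.create_indexer()
            indexer.index([{"id": "b1234", "title": "An example work"}])
            self.assertEqual(indexer.get("b1234")["title"], "An example work")

    class MemoryIndexerTest(IndexerTestCases, unittest.TestCase):
        def create_indexer(self):
            return MemoryIndexer()

    # An ElasticIndexerTest would mix in IndexerTestCases in the same way, returning
    # an Elasticsearch-backed indexer from create_indexer(), so both implementations
    # run exactly the same assertions.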

create branch wellcomecollection/catalogue

branch : elastic-indexer-tests

created branch time in 9 hours

delete branch wellcomecollection/storage-service

delete branch : python-client-from-path

delete time in 11 hours

push event wellcomecollection/storage-service

Alex Chan

commit sha 8cf11833f4ed7d8b4068c0b75851ad67c1190900

We care about testing on *a* Python 3, not a specific version

view details

Alex Chan

commit sha 7e5d03f45d4e270e16aa3a5e81512ae61bd2aa8c

Fix a linting issue

view details

Alex Chan

commit sha 0b939c1b52b75065028548218bdd6a6c8a78c4bc

Commonise the fetching of environment secrets

view details

Alex Chan

commit sha 07e04a94dd7022ec771a1f60a1cf84d2a8719dcf

Add a helper method for getting a client from a JSON creds file

view details

Alex Chan

commit sha d92ef855a712c4074371cf6c8dab74cc682121bd

Bump the version, write a changelog entry

view details

Alex Chan

commit sha dc60eeeb1abaf6483d1021cb7b788bb56d3b9778

Automate the release process for the Python client

view details

Alex Chan

commit sha f088eb1f4bb7a2c9bb0ad3df1fcc500a1a7c350e

Fix some links in the Python client metadata

view details

Buildkite on behalf of Wellcome Collection

commit sha 73bd7281797d0d5fea80ebee333d9578ee655ad5

Apply auto-formatting rules

view details

Alex Chan

commit sha 1b66827a0fdb26ca9b9e6f566314fdf4c9823e16

Merge pull request #763 from wellcomecollection/python-client-from-path

Allow creating a Python client for the storage service from a path

view details

push time in 11 hours

PR merged wellcomecollection/storage-service

Allow creating a Python client for the storage service from a path

Closes https://github.com/wellcomecollection/platform/issues/4852

Plus automating the release process.

+126 -30

2 comments

9 changed files

alexwlchan

pr closed time in 11 hours

pull request comment wellcomecollection/catalogue

Add Terraform for the relation embedder

ECR repo came separately to unbreak the build.

alexwlchan

comment created time in a day

pull request comment wellcomecollection/storage-service

Allow creating a Python client for the storage service from a path

For now it’s still manually released, but easier for whoever does so.

alexwlchan

comment created time in a day

delete branch wellcomecollection/platform-infrastructure

delete branch : add-warmup-script

delete time in a day

push event wellcomecollection/platform-infrastructure

Alex Chan

commit sha 251e348f655c703e2c63da8c1d12d22a966978f3

Add a script for warming/cooling an autoscaling ECS cluster

view details

Buildkite on behalf of Wellcome Collection

commit sha 571d24b599cf8e398bc080166ffa1f4922881923

Apply auto-formatting rules

view details

Alex Chan

commit sha 9f0a20ac01addc8cc02acc2cc726ae5afd429c4c

Merge pull request #70 from wellcomecollection/add-warmup-script

Add a script for warming/cooling an autoscaling ECS cluster

view details

push time in a day

PR opened wellcomecollection/storage-service

Allow creating a Python client for the storage service from a path

Closes https://github.com/wellcomecollection/platform/issues/4852

Plus automating the release process.

+123 -31

0 comments

10 changed files

pr created time in a day

create branch wellcomecollection/storage-service

branch : python-client-from-path

created branch time in a day

Pull request review comment wellcomecollection/storage-service

Archivematica Bag Migration

[Diff context: the full Archivematica bag-migration script (a Python script that appends an Archivematica UUID as an Internal-Sender-Identifier to existing born-digital bags), ending at the line under discussion:]

    elastic_query = {"query": {"prefix": {"space": {"value": "born-digital"}}}}

Yeah, I saw that later. I would do it ASAP to avoid doing unnecessary work, but not a big deal for a one-off migration.

kenoir

comment created time in a day

pull request review event

Pull request review comment wellcomecollection/storage-service

Archivematica Bag Migration

[Diff context: the bag-migration script, ending at the lines under discussion in _append_archivematica_uuid:]

    with open(f"{working_folder}/bag-info.txt", "a") as fp:
        fp.write(f"Internal-Sender-Identifier: {archivematica_uuid}\n")

What if the existing bag-info.txt doesn't end with a newline?

kenoir

comment created time in a day
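
On the trailing-newline question above: a defensive way to append would be to check the file's last byte first. This is only a sketch of that idea – the helper name and approach are mine, not part of the PR:

    # Hypothetical newline-safe append for bag-info.txt: only add a separating
    # newline if the existing file doesn't already end with one.
    def append_internal_sender_identifier(bag_info_path, archivematica_uuid):
        with open(bag_info_path, "rb") as fp:
            fp.seek(0, 2)                        # jump to the end of the file
            if fp.tell() == 0:
                ends_with_newline = True         # empty file: nothing to separate
            else:
                fp.seek(-1, 2)                   # read just the final byte
                ends_with_newline = fp.read(1) == b"\n"

        with open(bag_info_path, "a") as fp:
            if not ends_with_newline:
                fp.write("\n")
            fp.write(f"Internal-Sender-Identifier: {archivematica_uuid}\n")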

Pull request review comment wellcomecollection/storage-service

Archivematica Bag Migration

[Diff context: the bag-migration script, ending at the lines under discussion in migrate():]

    if not did_append_uuid:
        _log(
            f"Internal-Sender-Identifier found in bag-info.txt: {archivematica_uuid}"
        )
        _log(f"Not migrating {id} (already migrated)")

Small optimisation: once you've got the bag from the storage-service API, look for Internal-Sender-Identifier in the API response. That saves you doing any downloads from S3 before you skip it here.

kenoir

comment created time in a day
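
A sketch of that short-circuit, assuming the bag returned by the storage-service API exposes its bag-info fields under an "info" key with an "internalSenderIdentifier" field – both names are assumptions to check against the real response, not something shown in this diff:

    # Hypothetical early exit: if the API response already reports an
    # Internal-Sender-Identifier, skip the S3 downloads entirely.
    # The "info" / "internalSenderIdentifier" keys are assumed, not confirmed.
    def already_migrated(storage_manifest):
        info = storage_manifest.get("info", {})
        return bool(info.get("internalSenderIdentifier"))

    # In migrate(), before writing fetch.txt or downloading anything:
    #
    #     if already_migrated(storage_manifest):
    #         _log(f"Not migrating {id} (already migrated)")
    #         return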

Pull request review comment wellcomecollection/storage-service

Archivematica Bag Migration

[Diff context: the bag-migration script, ending at the lines under discussion in migrate():]

    id = storage_manifest["id"]
    bucket = storage_manifest["location"]["bucket"]
    path = storage_manifest["location"]["path"]
    files = storage_manifest["manifest"]["files"]

nit: I would call these payload_files.

kenoir

comment created time in a day

Pull request review comment wellcomecollection/storage-service

Archivematica Bag Migration

[Diff context: the bag-migration script, ending at the lines under discussion in _get_archivematica_uuid:]

    archivematica_uuid = mets_file_with_id.split("/METS.")[-1].split(".xml")[0]

    assert UUID(archivematica_uuid, version=4)

👍

kenoir

comment created time in a day

Pull request review comment wellcomecollection/storage-service

Archivematica Bag Migration

[Diff context: the bag-migration script, ending at the line under discussion:]

    def load_space_separated_file(file_location, key_first=True):

minor: took me a while to figure out what this function was for – FWIW, I think the term is "tag file".

kenoir

comment created time in a day

Pull request review comment wellcomecollection/storage-service

Archivematica Bag Migration

[Diff context: the bag-migration script, ending at the lines under discussion in _get_bagit_files_from_s3:]

    for location in file_locations:
        filename = location.split("/")[-1]
        save_path = f"{working_folder}/{filename}"
        storage_s3_client.download_file(bucket, location, save_path)

minor: prefer os.path.basename(…) instead of manipulating paths directly, for portability.

Also, you already have the filename as file['name'].

kenoir

comment created time in a day
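
For illustration, the os.path.basename form of that line (the key here is just an example value, not a real bag):

    import os

    location = "born-digital/EXAMPLE_BAG/v1/bag-info.txt"   # illustrative S3 key
    filename = os.path.basename(location)                   # "bag-info.txt", instead of location.split("/")[-1]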

Pull request review comment wellcomecollection/storage-service

Archivematica Bag Migration

[Diff context: the bag-migration script, ending at the lines under discussion in migrate():]

    working_id = id.replace("/", "_")
    working_folder = f"{self.target_folder}/{working_id}"

minor: prefer os.path.join(…) to combining paths directly. Not an issue here, but a bit more portable in general.

kenoir

comment created time in a day
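
And the os.path.join form of building the working folder (values here are illustrative):

    import os

    target_folder = "target"
    working_id = "example_bag_id"                               # illustrative value
    working_folder = os.path.join(target_folder, working_id)    # instead of f"{target_folder}/{working_id}"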

Pull request review comment wellcomecollection/storage-service

Archivematica Bag Migration

[Diff context: the bag-migration script, ending at the lines under discussion in _get_archivematica_uuid:]

    files = filter_s3_objects(
        s3_client=storage_s3_client,
        bucket=bucket,
        prefix=f"{path}/{version}/data/METS.",
    )

You could save yourself a trip to S3 here – you have the payload files from the storage manifest. Something like:

mets_files = [f for f in files if f["name"].startswith("data/METS.") and f["name"].endswith(".xml")]

would be faster.

kenoir

comment created time in a day

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

(diff excerpt from the Archivematica bag-migration script under review)

Not important here, but a suggestion for the future: rather than creating and advancing a progress bar manually, the way I'd do this would be to create a new function that generates documents, and tqdm.tqdm(…) its output.

e.g.

def get_documents_to_migrate(elastic_client):
    elastic_query = {…}
    results = helpers.scan(
        client=elastic_client, index=index, size=5, query=elastic_query
    )
    for result in results:
        yield result["_source"]

if __name__ == "__main__":
    for document in tqdm.tqdm(get_documents_to_migrate(…), total=document_count):
        # do stuff with document
kenoir

comment created time in a day

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

(diff excerpt from the Archivematica bag-migration script under review)

You've lost a TODO for:

  1. Getting stuff from the born-digital-accessions space
  2. Filtering out bags that already have an I-S-I.
kenoir

comment created time in a day

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

(diff excerpt from the Archivematica bag-migration script under review)

Maybe move the index down into the environments block? It differs for staging/prod.
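
For example (a rough sketch; the staging index name here is just illustrative):

environments = {
    "prod": {
        "bucket": "wellcomecollection-archivematica-ingests",
        "api_url": "https://api.wellcomecollection.org/storage/v1",
        "index": "storage_bags",
    },
    "stage": {
        "bucket": "wellcomecollection-archivematica-staging-ingests",
        "api_url": "https://api-stage.wellcomecollection.org/storage/v1",
        "index": "storage_bags_staging",  # illustrative name
    },
}

index = environments[environment_id]["index"]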

kenoir

comment created time in a day

PullRequestReviewEvent
PullRequestReviewEvent

issue commentwellcomecollection/platform

Spurious errors in the ID minter logs

Some discussion in Slack here: https://wellcome.slack.com/archives/CN56BRQ5B/p1603095998110000

And a tentative fix: https://github.com/wellcomecollection/catalogue/commit/1c49488ed862ed37db69c96d4a79c482a8cc147e

alexwlchan

comment created time in a day

issue openedwellcomecollection/platform

Spurious errors in the ID minter logs

If you look in the ID minter logs, you see errors like this:

08:45:30.718 [main-actor-system-akka.actor.default-dispatcher-140] ERROR u.a.wellcome.messaging.sqs.SQSStream - Unrecognised failure while: next on empty iterator
java.util.NoSuchElementException: next on empty iterator
	at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
	at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
	at scala.collection.LinearSeqLike$$anon$1.next(LinearSeqLike.scala:50)
	at scala.collection.TraversableOnce$FlattenOps$$anon$2.next(TraversableOnce.scala:502)
	at uk.ac.wellcome.platform.id_minter.database.IdentifiersDao.readOnlySession(IdentifiersDao.scala:170)
	at uk.ac.wellcome.platform.id_minter.database.IdentifiersDao.lookupIds$default$2(IdentifiersDao.scala:29)
	at uk.ac.wellcome.platform.id_minter.steps.IdentifierGenerator.retrieveOrGenerateCanonicalIds(IdentifierGenerator.scala:25)
	at uk.ac.wellcome.platform.id_minter_works.services.IdMinterWorkerService.$anonfun$processJson$1(IdMinterWorkerService.scala:57)
	at scala.util.Success.flatMap(Try.scala:251)
	at uk.ac.wellcome.platform.id_minter_works.services.IdMinterWorkerService.processJson(IdMinterWorkerService.scala:55)
	at uk.ac.wellcome.platform.id_minter_works.services.IdMinterWorkerService.$anonfun$processMessage$1(IdMinterWorkerService.scala:51)
	at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
	at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:56)
	at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:93)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:93)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:48)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

They don't seem to affect works being minted – stuff still gets through okay – but it's a PITA if you're trying to follow the logs. We should find out where these are being thrown and fix them.

created time in a day

push eventwellcomecollection/catalogue

Alex Chan

commit sha 1c49488ed862ed37db69c96d4a79c482a8cc147e

Don't make the poolNames iterator lazy. For some reason, we get a NoSuchElementException when calling poolNames.next() in this method. It doesn't seem to be terminal -- nothing ends up on the DLQ -- but it does spam the ID minter logs. Try making this iterator non-lazy; see if that fixes things.

view details

push time in a day

delete branch wellcomecollection/catalogue

delete branch : dont-pass-json-as-string

delete time in a day

push eventwellcomecollection/catalogue

Alex Chan

commit sha b7718c25317edcbaa1112613afab42f03795c64b

Log the name of the index if we get an error in ElasticRetriever

view details

Alex Chan

commit sha fd00e4d671043f4b72ab3bc6d6b2df762979fe96

Test that ElasticRetriever can retrieve a document with a slash in the ID

view details

Alex Chan

commit sha 70d135c19ad09d269cd91549f8c72023b853835c

Improve some test descriptions in MergerWorkerServiceTest

view details

Alex Chan

commit sha 6f6434baaf12c68e235cb5f11043a3a28a70c322

Add some helper functions in the merger tests to get onward messages. Having a single place where we decide what the "right" output looks like makes the tests easier to update when we want to change the output of the merger.

view details

Alex Chan

commit sha 12ebde06263aa9c993baa342267a003ca3f3f5dd

Remove an unused variable

view details

Alex Chan

commit sha d4ecba6685b01002878846ee16e6406dcfeb3042

Don't JSON encode the work ID sent by the merger

view details

Buildkite on behalf of Wellcome Collection

commit sha 55411ade469fed66dddfa6fbb13dcca0d0f3bb4e

Apply auto-formatting rules

view details

Alex Chan

commit sha 0e62d83692519ad4bf21ad70f1f4e85e9d2820b4

Merge pull request #965 from wellcomecollection/dont-pass-json-as-string Don't JSON-encode the work ID sent by the merger

view details

push time in a day

PR merged wellcomecollection/catalogue

Don't JSON-encode the work ID sent by the merger

Closes https://github.com/wellcomecollection/platform/issues/4850

+59 -29

1 comment

6 changed files

alexwlchan

pr closed time in a day

issue closedwellcomecollection/platform

Merger should not be JSON-encoding the work ID it sends

The ID minter is confused because it receives the message (exact bytes):

"sierra-system-number/b16590442"

The string has been JSON encoded by the merger, but the ID minter isn't JSON-unencoding it, because it expects to get a raw string. We should remove the JSON encoding in the merger.
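
A quick standalone illustration of the double encoding (plain Python, not the pipeline code):

import json

work_id = "sierra-system-number/b16590442"
encoded = json.dumps(work_id)

print(encoded)              # "sierra-system-number/b16590442"  (the quotes are part of the bytes)
print(json.loads(encoded))  # sierra-system-number/b16590442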

Should be a simple enough fix, but it's late enough I'm not going to try it this week.

Part of https://github.com/wellcomecollection/platform/issues/4828

closed time in a day

alexwlchan

pull request commentwellcomecollection/catalogue

Don't JSON-encode the work ID sent by the merger

I’m going to merge this so I can deploy it to staging, but please comment if you think there are things I should change or fix, and I’ll sort them out in a separate PR.

alexwlchan

comment created time in a day

delete branch wellcomecollection/catalogue

delete branch : add-relation-embedder-ecr-repo

delete time in a day

push eventwellcomecollection/catalogue

Alex Chan

commit sha 90fc5d14ee6eababb46629fcf6e4ac094e48986a

Add an ECR repository for the relation embedder

view details

Alex Chan

commit sha b89fdfe0203b3a6a444b32987d08ae7ff67b245d

Merge pull request #966 from wellcomecollection/add-relation-embedder-ecr-repo Add an ECR repository for the relation embedder

view details

push time in a day

PR merged wellcomecollection/catalogue

Add an ECR repository for the relation embedder

The build on master is failing because the ECR repo for the relation_embedder doesn't exist yet. I've created it in the console and imported it into the Terraform state and restarted the latest build. Fingers crossed this makes it go green!

Part of https://github.com/wellcomecollection/platform/issues/4830

+4 -0

0 comment

1 changed file

alexwlchan

pr closed time in a day

PR opened wellcomecollection/catalogue

Add an ECR repository for the relation embedder

The build on master is failing because the ECR repo for the relation_embedder doesn't exist yet. I've created it in the console and imported it into the Terraform state, so builds should go green again.

+4 -0

0 comment

1 changed file

pr created time in a day

create branchwellcomecollection/catalogue

branch : add-relation-embedder-ecr-repo

created branch time in a day

PR opened wellcomecollection/catalogue

Don't JSON-encode the work ID sent by the merger

Closes https://github.com/wellcomecollection/platform/issues/4850

+55 -29

0 comment

6 changed files

pr created time in a day

create branchwellcomecollection/catalogue

branch : dont-pass-json-as-string

created branch time in a day

push eventalexwlchan/ttml2srt

Alex Chan

commit sha bba90f9dd67688f7b308af813cf92e05965e54f9

Add a shebang; make ttml2srt executable

view details

Alex Chan

commit sha 2e70ef22ca941c37cd1a1bf684d690947c2b19f7

Throw away the .py extension

view details

Alex Chan

commit sha e762bc37b4aaa35c5c2e62e43b3f97c104f6bea8

Remove a deprecated use of .getiterator(). Previously this dropped a warning: /repos/ttml2srt/ttml2srt:14: DeprecationWarning: This method will be removed in future versions. Use 'tree.iter()' or 'list(tree.iter())' instead. for elem in root.getiterator():

view details

Alex Chan

commit sha bbe428684c7a7a7c7130c2c20ff17dd21c5c7870

Handle multiple classes being applied in the style attribute

view details

Alex Chan

commit sha eb576f3bdf393b7f66e448375263c269fb218b73

Push everything inside an if __name__ == '__main__' block

view details

Alex Chan

commit sha 99a3faf3bec1e452b2cdb22fe42364e07925abf6

Don't use global variables for everything

view details

Alex Chan

commit sha 6407a077d2602649b3629b4893fbdce8ea575ca7

Write the subtitles directly to an SRT file

view details

push time in 2 days

fork alexwlchan/ttml2srt

convert TTML subtitles to SRT subtitles

fork in 2 days

issue commentterraform-providers/terraform-provider-aws

terraform import aws_eip_association says that the remote object does not exist but it does.

I’m not sure if this is significant, but the example in the Terraform docs uses a different-looking ID to the one you're using (emphasis mine):

EIP Assocations can be imported using their association ID.

$ terraform import aws_eip_association.test **eipassoc**-ab12c345

Since an EIP association exports both an association and an allocation ID, maybe you need to try the other one?

(I have no idea how you go about finding the association ID.)

trajano

comment created time in 2 days

push eventalexwlchan/alexwlchan.net

Azure Pipelines on behalf of Alex Chan

commit sha c35946b302b2083e372d0b27f3d532fd35ffc08c

Publish new post how-do-i-use-my-iphone-cameras.md

view details

push time in 3 days

push eventalexwlchan/alexwlchan.net

Alex Chan

commit sha ea26936493318afde3c0ab3b95a40af4731cb735

Run optipng over a couple of images

view details

push time in 3 days

delete branch alexwlchan/alexwlchan.net

delete branch : exif-lenses

delete time in 3 days

push eventalexwlchan/alexwlchan.net

Alex Chan

commit sha 5f7be1251739666777736689acd7613ea90e1fb0

First draft of a post about EXIF metadata

view details

Alex Chan

commit sha 57dac427161183c833cb184f3893b2aa49d01191

Markups on how I use my iPhone cameras

view details

Alex Chan

commit sha d7d4aa0cca23cc2b8de048b4284edf17d3ea8615

More markups on my EXIF post

view details

Alex Chan

commit sha 980e18a26678335cffa83b6996ae08f85932b376

Add some alt text to the info screenshot

view details

Alex Chan

commit sha 1791ef9c5115f429e180fd905fdfc263806478ea

Merge pull request #398 from alexwlchan/exif-lenses Add a post about finding out how often I use my iPhone cameras

view details

push time in 3 days

create branchalexwlchan/alexwlchan.net

branch : exif-lenses

created branch time in 3 days

issue openedwellcomecollection/platform

Merger should not be JSON-encoding the work ID it sends

The ID minter is confused because it receives the message (exact bytes):

"sierra-system-number/b16590442"

The string has been JSON encoded by the merger, but the ID minter isn't JSON-unencoding it, because it expects to get a raw string. We should remove the JSON encoding in the merger.

Should be a simple enough fix, but it's late enough I'm not going to try it this week.

created time in 4 days

push eventwellcomecollection/catalogue

Alex Chan

commit sha d7e66b232220d8421de36522f4260325bb221882

Log the response if asked to look up a non-existent ID

view details

push time in 4 days

delete branch wellcomecollection/catalogue

delete branch : fix-id-minter

delete time in 4 days

push eventwellcomecollection/catalogue

Alex Chan

commit sha db26fae3316169c9f20b67b1af3e9a1025b62f8c

Allow building the RDS database sans-Typesafe. This is useful for setting up a local instance of the ID minter for testing, without passing the config out to Typesafe and back again.

view details

Alex Chan

commit sha 35a43b1b35726e27c27bd0ea730715a797b56fb5

Assert that the list of identifiers is non-empty in IdentifiersDao

view details

Alex Chan

commit sha ba0880781c2d0accde7f63aee272ffa787f818f0

Rename the withIdentifiersDatabase fixture -- it returns TableConfig

view details

Alex Chan

commit sha 1c58a47249c17d8a476e785eb29bd8ac4672e963

Log the ID of the T we're about to retrieve from ElasticRetriever

view details

Alex Chan

commit sha 1c42e4f31aaab0db9cd5c64b02ebbcf657147b95

Write some generic RetrieverTestCases

view details

Alex Chan

commit sha d9625093b8999dfc6071a2e34ca9e981a656ecf4

Throw an exception if you can't find a document. This should never happen in the normal operation of the pipeline.

view details

Alex Chan

commit sha 30a0db5cbcb4bf5732de362e72e27524ef205254

Send a vanilla SNS message from the merger

view details

Alex Chan

commit sha 95676ffb133097b30d120ed2e1e8b9e94a980742

Remove the old ElasticRetrieverTest

view details

Buildkite on behalf of Wellcome Collection

commit sha 3f8bc1639999c5e05194145c6dc91f815536a59a

Apply auto-formatting rules

view details

Alex Chan

commit sha 65587ff11e762eca97107e2ddbf47c81f8ed1f8d

Revert "Sorry" This reverts commit fb3c3b6bb10c8d34189579e212a5e29df7bc1fa2.

view details

Alex Chan

commit sha cfbbbfc8c0bde16f5dfb1804927bd9f3686c2e26

Merge pull request #964 from wellcomecollection/fix-id-minter Make failures in the ID minter more explicit, earlier

view details

push time in 4 days

PR merged wellcomecollection/catalogue

Make failures in the ID minter more explicit, earlier

Pulling messages from catalogue-20201016_work_id_minter_dlq, I see lots of messages like:

{"jsonString": "\"sierra-system-number/b28972995\"", "type": "InlineNotification"}

The ID minter then passes that JSON string directly to the ElasticRetriever, which returns nothing. Normally we'd catch that, but because the ID minter reads Json instead of a stronger type, that nothing gets coerced to an empty list, and we see all the downstream weirdness we've been experiencing.

This patch:

  • Logs the ID we're trying to look up in the ElasticRetriever. If you see a DEBUG log go by with this JSON string, you know something is up.
  • Throws an exception from the Retriever if asked to retrieve a non-existent document – this should never happen in the normal operation of the pipeline. Don't let a weird response float downstream.
  • Asserts immediately in the ID minter if asked to mint an empty set of sourceIdentifiers, which is a sign something weird is happening. Don't wait to hit it while trying to get the .tail of an empty list.
  • Updates the merger to emit the work ID as a pure string, rather than an InlineNotification
+187 -86

1 comment

20 changed files

alexwlchan

pr closed time in 4 days

push eventwellcomecollection/catalogue

Alex Chan

commit sha ba0880781c2d0accde7f63aee272ffa787f818f0

Rename the withIdentifiersDatabase fixture -- it returns TableConfig

view details

Alex Chan

commit sha 1c58a47249c17d8a476e785eb29bd8ac4672e963

Log the ID of the T we're about to retrieve from ElasticRetriever

view details

Alex Chan

commit sha 1c42e4f31aaab0db9cd5c64b02ebbcf657147b95

Write some generic RetrieverTestCases

view details

Alex Chan

commit sha d9625093b8999dfc6071a2e34ca9e981a656ecf4

Throw an exception if you can't find a document. This should never happen in the normal operation of the pipeline.

view details

Alex Chan

commit sha 30a0db5cbcb4bf5732de362e72e27524ef205254

Send a vanilla SNS message from the merger

view details

Alex Chan

commit sha 95676ffb133097b30d120ed2e1e8b9e94a980742

Remove the old ElasticRetrieverTest

view details

Buildkite on behalf of Wellcome Collection

commit sha 3f8bc1639999c5e05194145c6dc91f815536a59a

Apply auto-formatting rules

view details

Alex Chan

commit sha 65587ff11e762eca97107e2ddbf47c81f8ed1f8d

Revert "Sorry" This reverts commit fb3c3b6bb10c8d34189579e212a5e29df7bc1fa2.

view details

push time in 4 days

push eventwellcomecollection/catalogue

Alex Chan

commit sha d5744b36438be4df2d34d06794acf98bbca1de67

Revert "Sorry" This reverts commit fb3c3b6bb10c8d34189579e212a5e29df7bc1fa2.

view details

push time in 4 days

push eventwellcomecollection/catalogue

Alex Chan

commit sha 9daf6f3e295cd84765ffcad214d5bba806541d16

Remove the old ElasticRetrieverTest

view details

push time in 4 days

PR opened wellcomecollection/catalogue

Make failures in the ID minter more explicit, earlier

Pulling messages from catalogue-20201016_work_id_minter_dlq, I see lots of messages like:

{'jsonString': '"sierra-system-number/b28972995"', 'type': 'InlineNotification'}

The ID minter then passes that JSON string directly to the ElasticRetriever, which returns nothing. Normally we'd catch that, but because the ID minter reads Json instead of a stronger type, that nothing gets coerced to an empty list, and we see all the downstream weirdness we've been experiencing.

This patch:

  • Logs the ID we're trying to look up in the ElasticRetriever. If you see a DEBUG log go by with this JSON string, you know something is up.
  • Throws an exception from the Retriever if asked to retrieve a non-existent document – this should never happen in the normal operation of the pipeline. Don't let a weird response float downstream.
  • Asserts immediately in the ID minter if asked to mint an empty set of sourceIdentifiers, which is a sign something weird is happening. Don't wait to hit it while trying to get the .tail of an empty list.
  • Updates the merger to emit the work ID as a pure string, rather than an InlineNotification
+168 -23

0 comment

17 changed files

pr created time in 4 days

push eventwellcomecollection/catalogue

Alex Chan

commit sha b1cc0e0bd493081bfe39b575058d5a1b653345ae

Throw an exception if you can't find a document. This should never happen in the normal operation of the pipeline.

view details

Alex Chan

commit sha a0c86d6a6eeadf367834e595a64b8dcd65ba09d0

Send a vanilla SNS message from the merger

view details

push time in 4 days

push eventwellcomecollection/catalogue

Alex Chan

commit sha 1cf6c42fba73542f8b360f618e54bff7db857c1b

Log the ID of the T we're about to retrieve from ElasticRetriever

view details

Alex Chan

commit sha abbfc7f6824474132351ad0c4c2698fc19b2db50

Write some generic RetrieverTestCases

view details

Alex Chan

commit sha 79737cb05ea4d882f0b9a570ce446cd4d28e7a7e

Throw an exception if you can't find a document

view details

push time in 4 days

create branchwellcomecollection/catalogue

branch : fix-id-minter

created branch time in 4 days

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

(diff excerpt from the Archivematica bag-migration script under review)

This response should include the ingest ID in the header – how about printing it, so it's easy to find?
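
Something like this would do – a sketch only, since I haven't double-checked exactly what the Python client returns from create_s3_ingest, but anything that surfaces the ID/URL on stdout would help:

            response = storage_client.create_s3_ingest(
                space=space,
                external_identifier=external_identifier,
                s3_bucket=s3_upload_bucket,
                s3_key=f"{s3_upload_prefix}/{archive_name}",
                ingest_type="update",
            )
            print(f"Requested ingest: {response}")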

kenoir

comment created time in 4 days

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

        def _compress_bag():
            shutil.make_archive(working_folder, "gztar", working_folder)

This method returns the name of the archive it generates, which is more robust than hard-coding it as archive_name above.
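
e.g. (a sketch – same behaviour, just capturing the return value instead of recomputing the name):

        def _compress_bag():
            archive_path = shutil.make_archive(working_folder, "gztar", working_folder)
            shutil.rmtree(working_folder, ignore_errors=True)
            return archive_path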

kenoir

comment created time in 4 days

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

        def _load_existing_checksums():
            tag_manifest = {}

            with open(f"{working_folder}/{tagmanifest_name}") as fp:
                for _, line in enumerate(fp):

minor: don't think enumerate adds anything?

                for line in fp:
kenoir

comment created time in 4 days

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

    elastic_client = get_elastic_client(
        role_arn=storage_role_arn, elastic_secret_id=elastic_secret_id
    )

    storage_client = get_storage_client(api_url=api_url)

    # TODO: update query to include born-digital-accessions

And to ignore bags that already have an Internal-Sender-Identifier field, so you don't send multiple updates for the same bag.
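
A must_not/exists clause would do it – a sketch only, and I'm guessing at the field name, so check it against the storage_bags mapping:

    elastic_query = {
        "query": {
            "bool": {
                "must_not": [
                    # hypothetical field name -- check the actual mapping
                    {"exists": {"field": "info.internalSenderIdentifier"}}
                ]
            }
        }
    }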

kenoir

comment created time in 4 days

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

def get_aws_client(resource, *, role_arn):
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )


def get_storage_client(api_url):

This can be cached as well – the storage service class has its own logic for refreshing tokens. You don't need a new instance unless the on-disk credentials have changed.
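
e.g. functools.lru_cache would cover it – a sketch, leaving the body as it is:

    import functools

    @functools.lru_cache()
    def get_storage_client(api_url):
        ...  # existing body unchanged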

kenoir

comment created time in 4 days

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

        def _load_existing_checksums():
            tag_manifest = {}

            with open(f"{working_folder}/{tagmanifest_name}") as fp:
                for _, line in enumerate(fp):
                    split_manifest_line = line.split(" ")
                    checksum = split_manifest_line[0].strip()
                    filename = split_manifest_line[1].strip()

I'd suggest using tuple unpacking here:

                    checksum, filename = line.strip().split()

That way, Python will warn you if there isn't enough text in the line (or conversely, too much!).

kenoir

comment created time in 4 days

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

                    split_manifest_line = line.split(" ")
                    checksum = split_manifest_line[0].strip()
                    filename = split_manifest_line[1].strip()

                    assert checksum
                    assert filename

                    tag_manifest[filename] = checksum

I would throw in an assertion that filename isn't already in this dictionary – a duplicate would surely indicate something very weird is up.
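
e.g.

                    assert filename not in tag_manifest, filename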

kenoir

comment created time in 4 days

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

        def _update_bag_info():
            archivematica_uuid = _get_archivematica_uuid()
            with open(f"{working_folder}/bag-info.txt", "a") as fp:
                fp.write(f"Internal-Sender-Identifier: {archivematica_uuid}")

Don't forget the trailing newline!

                fp.write(f"Internal-Sender-Identifier: {archivematica_uuid}\n")
kenoir

comment created time in 4 days

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

        def _get_archivematica_uuid():
            files = _filter_bag_files("data/METS.")

            assert len(files) == 1

minor: Python will just give you a generic AssertionError if this goes wrong. I like to include the value I was just testing, so I've got a chance of debugging it:

            assert len(files) == 1, files

(Some test frameworks will give you nicer assertions, but not vanilla Python.)

kenoir

comment created time in 4 days

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

        def _get_archivematica_uuid():
            files = _filter_bag_files("data/METS.")

            assert len(files) == 1
            mets_file_with_id = files[0]

            archivematica_uuid = mets_file_with_id.split("/METS.")[-1].split(".xml")[0]

            assert archivematica_uuid

Checking the UUID is non-empty – nice. Even nicer: how about using the uuid library to check this is actually a UUID?
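
e.g. (assuming these are standard UUIDs, with import uuid at the top of the script):

            uuid.UUID(archivematica_uuid)  # raises ValueError if it isn't a well-formed UUID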

kenoir

comment created time in 4 days

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

        def _generate_checksum(file_location):
            sha256_hash = hashlib.sha256()

            with open(file_location, "rb") as f:
                for byte_block in iter(lambda: f.read(4096), b""):
                    sha256_hash.update(byte_block)

                return sha256_hash.hexdigest()

Nice – this could be a standalone function, which would make the bag_creator function easier to read.
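
e.g. hoisted to module level (same logic, just pulled out of the closure):

    def generate_checksum(file_location):
        sha256_hash = hashlib.sha256()

        with open(file_location, "rb") as f:
            for byte_block in iter(lambda: f.read(4096), b""):
                sha256_hash.update(byte_block)

        return sha256_hash.hexdigest()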

kenoir

comment created time in 4 days

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

        def _write_fetch_file():
            def _create_fetch_line(file):
                return f"{path_prefix}/{file['path']}\t{file['size']}\t{file['name']}\n"

            files = document['files']

            fetch_file = open(f"{working_folder}/fetch.txt", 'w')
            for file in files:
                fetch_file.write(_create_fetch_line(file))
            fetch_file.close()

The more idiomatic Python approach for opening files is to use a context manager (aka the with statement). It's syntactic sugar around try … finally.

            with open(f"{working_folder}/fetch.txt", "w") as fetch_file:
                for file in files:
                    fetch_file.write(_create_fetch_line(file))

Given the number of functions swirling around here, I'm not sure the _create_fetch_line function makes much sense. I would inline it:

def _write_fetch_file():
    with open(f"{working_folder}/fetch.txt", "w") as fetch_file:
        for file in files:
            fetch_file.write(f"{path_prefix}/{file['path']}\t{file['size']}\t{file['name']}\n")

You could go one step further and pull this out from bag_creator:

def write_fetch_file(path, fetch_uri_prefix, files):
    with open(path, "w") as fetch_file:
        for file in files:
            fetch_file.write(f"{fetch_uri_prefix}/{file['path']}\t{file['size']}\t{file['name']}\n")

Not necessarily important here, but having it as a standalone would make it easier to test and reuse later.

kenoir

comment created time in 4 days

Pull request review commentwellcomecollection/storage-service

Archivematica Bag Migration

        def _get_bag_files():
            file_locations = []
            for prefix in prefix_patterns:
                file_locations = file_locations + _filter_bag_files(prefix)

            _check_expected_prefixes_exist(file_locations)

            for location in file_locations:
                filename = location.split("/")[-1]
                save_path = f"{working_folder}/{filename}"
                storage_s3_client.download_file(bucket, location, save_path)

Couple of suggestions:

  • I would call this _get_bag_manifest_files to be really explicit that you mean things like fetch.txt and the tag manifest, not payload files.
  • You can already get a list of these files from the "tagManifest" entry on a JSON storage manifest from the storage service – see the sketch below.
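
Something like this, as a hypothetical sketch only – I haven't checked the exact shape of the indexed manifest, so the "tagManifest" structure is an assumption, and it's worth checking it covers everything _get_bag_files currently downloads:

        def _get_bag_manifest_files():
            # Hypothetical: assumes document["tagManifest"]["files"] exists and
            # each entry has a "name", like the entries in document["files"].
            for file in document["tagManifest"]["files"]:
                location = f"{path}/{version}/{file['name']}"
                save_path = f"{working_folder}/{file['name']}"
                storage_s3_client.download_file(bucket, location, save_path)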
kenoir

comment created time in 4 days

Pull request review comment wellcomecollection/storage-service

Archivematica Bag Migration

        def _check_expected_prefixes_exist(file_locations):
            for prefix in prefix_patterns:
                found_match = False

                for location in file_locations:
                    if prefix in location:
                        found_match = True

                if not found_match:
                    raise RuntimeError(f"Missing any files matching prefix: {prefix}")

Python has a built-in function you might find helpful

https://docs.python.org/3/library/functions.html?highlight=any#any:

Return True if any element of the iterable is true. If the iterable is empty, return False.

Consider:

            for prefix in prefix_patterns:
                if not any(prefix in location for location in file_locations):
                    raise RuntimeError(f"Missing any files matching prefix: {prefix}")
kenoir

comment created time in 4 days

Pull request review comment wellcomecollection/storage-service

Archivematica Bag Migration

def bag_creator(workflow_s3_client, storage_s3_client, storage_client, s3_upload_bucket):
    def create_bag(result):
        document = result['_source']

        id = document['id']
        bucket = document['location']['bucket']
        path = document['location']['path']
        version = f"v{document['version']}"
        space = document['space']
        external_identifier = document['info']['externalIdentifier']

        provider = document['location']['provider']
        assert provider == 'amazon-s3'

        path_prefix = f"s3://{bucket}/{path}"

        working_id = id.replace("/", "_")
        target_folder = "target"
        working_folder = f"{target_folder}/{working_id}"
        tagmanifest_name = "tagmanifest-sha256.txt"
        archive_name = f"{working_id}.tar.gz"
        s3_upload_prefix = "born-digital/archivematica-uuid-update"

        if not os.path.exists(working_folder):
            os.makedirs(working_folder)

Minor, but this is equivalent to mkdir -p on the command line, so you can drop the os.path.exists check:

        os.makedirs(working_folder, exist_ok=True)
kenoir

comment created time in 4 days

Pull request review comment wellcomecollection/storage-service

Archivematica Bag Migration

def query_index(elastic_client, elastic_index, elastic_query, transform):
    initial_query = elastic_client.search(
        index=elastic_index,
        body=elastic_query,
        size=0
    )

    document_count = initial_query['hits']['total']['value']
    results = helpers.scan(
        client=elastic_client,
        index=elastic_index,
        size=5,
        query=elastic_query
    )

    with tqdm(total=document_count, file=sys.stdout) as pbar:
        for result in results:
            transform(result)
            sys.exit(1)
            pbar.update(1)

This approach of taking a function as a callback is less common in Python (although perfectly doable). Because we don't have explicit types, it can be harder to see what's expected from an API like this – what is the transform() function meant to do?

If I were writing this, I might make it a generator function that just yields the documents, and let the caller decide how to transform them.

I can't remember if I've recommended it before, but Ned Batchelder's talk is all about iteration and generation, and that sort of function is pretty idiomatic in Python: https://nedbatchelder.com/text/iter.html
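
For illustration, a rough sketch of what the generator version might look like, reusing the search/scan/tqdm setup from the hunk above (get_documents is just a placeholder name):

import sys

from elasticsearch import helpers
from tqdm import tqdm


def get_documents(elastic_client, elastic_index, elastic_query):
    # Count the matches up front so tqdm can show overall progress.
    initial_query = elastic_client.search(
        index=elastic_index, body=elastic_query, size=0
    )
    document_count = initial_query["hits"]["total"]["value"]

    results = helpers.scan(
        client=elastic_client, index=elastic_index, size=5, query=elastic_query
    )

    with tqdm(total=document_count, file=sys.stdout) as pbar:
        for result in results:
            yield result
            pbar.update(1)

The caller then decides what happens to each document, e.g.:

for result in get_documents(elastic_client, elastic_index, elastic_query):
    create_bag(result)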

kenoir

comment created time in 4 days
