aptise/peter_sslers 26

or how i stopped worrying and learned to love the ssl certificate

jvanasco/facebook_utils 9

simple facebook utils in python

jvanasco/dogpile_backend_redis_advanced 8

advanced redis backend for dogpile.cache

jvanasco/bleach_extras 5

unofficial "extras" and utilities for `bleach`

jvanasco/bbedit_mako 1

mako plist for textwrangler & bbedit

jvanasco/embedly-python 1

Python lib for Embedly

jvanasco/forgodsakes 1

Simple Twitter bot to correct people when they misspeak about pete and his sakes

jvanasco/acme-dns 0

Limited DNS server with RESTful HTTP API to handle ACME DNS challenges easily and securely.

jvanasco/acme-dns-certbot-joohoi 0

Certbot client hook for acme-dns

started Shopify/maintenance_tasks

started time in 12 days

started jvanasco/metadata_parser

started time in a month

started jvanasco/metadata_parser

started time in a month

started rubygarage/api_struct

started time in a month

started jvanasco/metadata_parser

started time in a month

started fbn-roussel/aws-sam-typescript-webpack-backend

started time in 2 months

fork alexzorin/imap-backup

Backup GMail (or other IMAP) accounts to disk

fork in 2 months

started dgraham/vim-eslint

started time in 2 months

fork janispauls/pyramid_session_redis

This is an extensive fork with large rewrites and functionality changes. Originally ericrasmussen/pyramid_redis_sessions.

fork in 2 months

pull request comment jvanasco/metadata_parser

Remove unsilenceable exception log message (fixes #32)

Superseded by #34.

fmarier

comment created time in 2 months

issue comment jvanasco/metadata_parser

`NotParsableFetchError`s lead to output even when handled

Thanks for the clarification.

Dropping to INFO or WARNING would work. It would keep it out of tools like Sentry by default without having to adjust the module logging manually.
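In the meantime, silencing it from the consumer side is doable, assuming the library logs under a logger named after the module (a sketch, not verified against the package itself):

```python
import logging

# Assumption: metadata_parser calls log.error() on a logger named
# "metadata_parser". Raising that logger's threshold suppresses the
# message without touching the root logger or other handlers.
logging.getLogger("metadata_parser").setLevel(logging.CRITICAL)

# ERROR-level records are now discarded for this logger
silenced = not logging.getLogger("metadata_parser").isEnabledFor(logging.ERROR)
```

This is a workaround, though; dropping the call to INFO/WARNING in the library would still be the better default.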

fmarier

comment created time in 2 months

fork fmarier/metadata_parser

python library for getting metadata

fork in 2 months

issue opened jvanasco/metadata_parser

`NotParsableFetchError`s lead to output even when handled

Because of an explicit log.error() call, a NotParsableFetchError produces log output even when the exception is caught in a try..except statement:

import metadata_parser

try:
    page = metadata_parser.MetadataParser(url=url, requests_timeout=5)
except metadata_parser.NotParsableFetchError as e:
    if e.code and e.code not in (403, 429, 500):
        print("Error parsing: %s [%s]" % (url, e.code))

created time in 2 months

started jvanasco/metadata_parser

started time in 2 months

started jvanasco/metadata_parser

started time in 2 months

started TaDaa/vimade

started time in 2 months

started Shopify/draggable

started time in 2 months

issue comment jvanasco/metadata_parser

Creeping featurism and adding one or two more `page` fields ...

Unfortunately, I am way too busy over the next week to address this stuff in depth, but I can give high-level feedback.

Understood, and thanks for taking the time to answer this.

  1. The refresh stuff seems somewhere between fine and too much. I need to look at it against some datasets. I do like the idea of parsing some of the info, such as:

    "meta_refresh": {
        "refresh": "2;url=/account?param1=xyz"
    },
    "_meta_refresh_parsed": {
        "time": 2,
        "url": "/account?param1=xyz"
    }

I'm not sure about the further parsing

Yeah, I agree with you here. Despite implementing that extra level of cracking in my own code, I don't think that level of detail is appropriate within metadata_parser. I only really consider it reasonable to split the content into the time and URL/path components; the extra granularity has minimal value for nearly everyone, and it definitely wanders out of the scope of the project, going from parsing HTML to deeply interpreting it. It doesn't seem appropriate for core metadata_parser functionality; users can crack the value themselves after metadata_parser extracts it.

  1. The h2/h3/etc stuff seems okay, but I worry about things getting too carried away and how this may affect existing installs. It might make sense to use callback hooks or even a subclass model to do the extra work.

I agree- I definitely do not want to go beyond h1, h2, h3, especially within metadata_parser code. For anything beyond that, the extensibility approach seems like the right way to go. It would allow much less invasive changes while giving users the flexibility to do this themselves, either via their own adapters/callbacks and/or via some included as optional components of the package. Maybe one common adapter/subclass (mainly to function as an example for users) would be appropriate for inclusion if this were implemented.

There was a lot of feature creep a few years back, and the number of args/kwargs exploded. This has really pushed me towards looking more at subclasses and callbacks, so the library can support more use cases... but still remain usable.

I noticed that some of the function kwargs had started to look like pandas, which commonly has 10+ kwargs on its most ubiquitous and powerful classes. I bet pylint's default settings love that kwargs list :)

Correct me if I am wrong (if you have time), but here is what I am ultimately taking away from your comments, which were very helpful:

  1. I don't see any areas of philosophical or implementation disagreement, aside from the extensive cracking of the meta refresh that I cited in my first comment- and I happen to actually agree with you there. I think metadata_parser can serve to break the content into the time value and the URL/relative path, but anything beyond that belongs in user code or in a special (optional) adapter.

  2. h1/h2/h3 are suitable candidates for being bolted on to the code, but going anywhere beyond that with HTML body tags is too slippery a slope. Fully agree here and I actually concluded these were the only tags with universal appeal across my dataset- I didn't choose them arbitrarily, nor were they just a starting point for a laundry list of additional tags, so we're on the same page. Otherwise at some point you might as well just return the entire bs4.soup object :P

  3. In general, you want to move towards providing a generic/cleaner way for functionality to be extended, via classes/adapters and/or callbacks/hooks, some of which may be included in the package for user convenience or to serve as examples. This provides flexibility, makes the project easier to maintain in the long term, and avoids adding more and more kwargs.

  4. I don't know that you said this directly- but if there are opportunities to retroactively apply this design in areas where it can improve maintainability, they're worth looking at. One example I can think of is replacing one or two of the kwargs with a class that acts as an adapter to influence the standard behavior. Though I would guess you would prefer to increment the major version of the package for changes like these, as opposed to modifying the current major and incrementing the minor, even if they are mostly abstracted from the user.

It doesn't matter to me whether you leave this issue open or not; it served its purpose as a communication medium and helped me quite a bit.

I will probably fork master in the short term and at some point send a PR for you to glance at and see what you think- at the very least for the h1/h2/h3 stuff, which is quick and easy. If you don't want to merge it at that time (or ever), it won't be a big problem for me; I can just pip install from my fork+branch. Or we can work out modifications. The more overarching changes will be a separate effort.

Thanks again for the insight and the brief backstory on that initial expansion of the features. I can relate to being short on time and I realize this is not a sponsored project for you, so taking the time to write this note is much appreciated

mzpqnxow

comment created time in 3 months

fork sayrer/repc

Rust port of replicache-client

fork in 3 months

issue opened jvanasco/metadata_parser

[Feature] Cache soup (and other) objects by sha256sum

When running against a very large corpus, there is an opportunity for a non-negligible performance gain from caching the full operations of MetadataParser using a sum over the HTML content- at least when working across 100k-200k+ pages that contain a high percentage of duplicates, which is the case for my use.

This is something I'm planning on implementing anyway because in my case, my corpus already has the sha256_sum available before sending it to MetadataParser. The only question is whether I do it on the front-end on my side of the project, or contribute it as a PR to your project

If I implement it for MetadataParser, I'll do it such that the sha256sum can be passed precalculated via a constructor kwarg along with html=

If you do not want this as a feature or have any comments about possible problems that might crop up, it would be great if you could post a comment in the next day or two so I can make an educated decision on where to place this

I'm expecting it to be a simple CPU/RAM tradeoff, which I can afford in my case

The only constraint/requirement for me is that sha256 must be used as the algorithm- though it wouldn't be much work for me to add multiple algorithms as optional

Thanks!

created time in 3 months

issue opened jvanasco/metadata_parser

Creeping featurism and adding one or two more `page` fields ...

Hello- first, thank you for your work on this; I was very happy to find it. Unfortunately for me, I went through the pain of writing something very similar two days before stumbling upon it. Probably much like your implementation, mine involved a mishmash of bs4 and regex to try to intelligently handle malformed/non-compliant HTML, as well as just take whatever bs4 could cleanly pick up on the first try. I trust your implementation much more because of how hastily I wrote mine; I have switched over to it, it's working great, and it chopped out a lot of cruft.

However... there are a few (minor) things I was grabbing from the page, in addition to the title, that you chose not to grab. For my use (fingerprinting specific technologies based on HTTP protocol info and HTML content), I found that grabbing h1, h2, and h3 came in handy in a bunch of cases. Right now I've switched over to metadata_parser for the title and the meta tag data, but I still have a bit of bs4 and re.search logic for those hn tags. Any thoughts one way or another on adding capture of these? I'd be happy to PR it if you would accept it, as I would love to dump the last few remnants of my code from my project.

If you think this goes a bit too far beyond the scope of what you have in mind, no worries. I only even thought to ask because you pull the <title> tag content.

Something I find nice about your interface is that you thought to include clean, direct access to the bs4 soup object via MetadataParser::soup, so either way I can at least avoid the cost of parsing the page multiple times with bs4.
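For reference, the hn-grabbing logic I still carry on my side amounts to roughly this (sketched here with the stdlib parser instead of bs4, and simplified from my real code):

```python
from html.parser import HTMLParser

class HeadingGrabber(HTMLParser):
    """Collect the text of h1/h2/h3 tags (stdlib stand-in for the bs4 logic)."""

    def __init__(self):
        super().__init__()
        self.headings = {"h1": [], "h2": [], "h3": []}
        self._current = None  # tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.headings:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.headings[self._current].append(data.strip())

grabber = HeadingGrabber()
grabber.feed("<h1>Login</h1><h2>Welcome</h2><p>body text</p>")
```

With the soup object exposed, the same thing is of course a one-liner per tag in bs4; this is just the shape of what I'd want metadata_parser (or an adapter) to return.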

Thanks again, appreciate your work on this

Ah, one last note- I went through the trouble of adding a function to parse the http-equiv refresh content value: not just to split out the seconds value and the URL or relative path, but also, in the case where it's a URL, to crack the URL into its urlparse() components. This is probably a bit much for most people but was important for my project. I really doubt you would be interested in adding this to your project (either yourself or as a PR), but happy to hear your thoughts either way. I was breaking it into something like this:

          "meta_refresh": {
            "refresh": "2;URL=https://www.somesite.co.za:2087/login"
          },
          "meta_refresh_cracked": {
            "path": "/login",
            "port": 2087,
            "username": null,
            "password": null,
            "params": "",
            "fragment": "",
            "refresh_seconds": 2,
            "scheme": "https",
            "private suffix": "somesite.co.za",
            "hostname": "www.somesite.co.za",
            "raw_url": "https://www.somesite.co.za:2087/login"
          }

For instances where it's a relative path instead of a full URL, I give that relative path a dummy scheme and hostname, to let urlparse() cleanly break out the params from the path. I was partially too lazy to manually split the path and query params and partially worried I might mishandle some bizarre edge-case that I trusted urlparse() with more. So I break down:

          "meta_refresh": {
            "refresh": "2;url=/account?param1=xyz",
            "Robots": "noindex, nofollow"
          },
          "meta_refresh_cracked": {
            "path": "/account",
            "port": null,
            "username": null,
            "password": null,
            "params": "param1=xyz",
            "fragment": null,
            "refresh_seconds": 2,
            "scheme": null,
            "hostname": null,
            "raw_url": "/account?param1=x"
          }
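The cracking itself is roughly this (a simplified sketch of my code; `dummy.invalid` is the throwaway base I mentioned for relative paths):

```python
from urllib.parse import urlparse

def crack_refresh(content):
    """Split a meta http-equiv refresh value into seconds + URL components.

    For relative paths, a dummy scheme/host lets urlparse() cleanly
    separate the path from the query, rather than splitting by hand.
    """
    seconds, _, target = content.partition(";")
    target = target.strip()
    if target.lower().startswith("url="):  # strip "url=" / "URL=" prefix
        target = target[4:]
    parsed = urlparse(target)
    if not parsed.scheme:
        # relative path: re-parse against a throwaway base
        parsed = urlparse("http://dummy.invalid" + target)
        hostname = None
    else:
        hostname = parsed.hostname
    return {
        "refresh_seconds": int(seconds),
        "path": parsed.path,
        "params": parsed.query,
        "hostname": hostname,
        "raw_url": target,
    }
```

My real version breaks out more fields (port, scheme, private suffix, etc.), but this is the core of the dummy-base trick.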

This is probably not worth much, but for some reason I felt compelled to "finish" the job- once I start cracking a field... I can get carried away and crack it as far as it can go. At least I didn't break the query parameters into a dict :>

created time in 3 months

started jvanasco/metadata_parser

started time in 3 months

started bda-research/node-crawler

started time in 3 months

started github/time-elements

started time in 3 months
