or how I stopped worrying and learned to love the SSL certificate
Simple Facebook utils in Python
jvanasco/dogpile_backend_redis_advanced
Advanced Redis backend for dogpile.cache
Unofficial "extras" and utilities for `bleach`
Mako plist for TextWrangler & BBEdit
Python lib for Embedly
Simple Twitter bot to correct people when they misspeak about pete and his sakes
Limited DNS server with RESTful HTTP API to handle ACME DNS challenges easily and securely.
jvanasco/acme-dns-certbot-joohoi
Certbot client hook for acme-dns
started Shopify/maintenance_tasks
started time in 12 days
started jvanasco/metadata_parser
started time in a month
started rubygarage/api_struct
started time in a month
started fbn-roussel/aws-sam-typescript-webpack-backend
started time in 2 months
Backup GMail (or other IMAP) accounts to disk
fork in 2 months
started dgraham/vim-eslint
started time in 2 months
fork janispauls/pyramid_session_redis
This is an extensive fork with large rewrites and functionality changes. Originally ericrasmussen/pyramid_redis_sessions.
fork in 2 months
pull request comment jvanasco/metadata_parser
Remove unsilenceable exception log message (fixes #32)
Superseded by #34.
comment created time in 2 months
issue comment jvanasco/metadata_parser
`NotParsableFetchError`s lead to output even when handled
Thanks for the clarification. Dropping to `INFO` or `WARNING` would work. It would keep it out of tools like Sentry by default without having to adjust the module logging manually.
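For reference, the manual workaround on the caller's side looks something like the following, assuming the library logs through the stdlib `logging` module under its module name (an assumption, not verified here):

    import logging

    # Assumed logger name: the usual logging.getLogger(__name__) pattern
    # would register the library's logger as "metadata_parser".
    logging.getLogger("metadata_parser").setLevel(logging.CRITICAL)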
comment created time in 2 months
Python library for getting metadata
fork in 2 months
issue opened jvanasco/metadata_parser
`NotParsableFetchError`s lead to output even when handled
Because of an explicit `log.error()`, it's not possible to silence the `NotParsableFetchError` exception even when catching it in a `try..except` statement:
    import metadata_parser

    try:
        page = metadata_parser.MetadataParser(url=url, requests_timeout=5)
    except metadata_parser.NotParsableFetchError as e:
        # the exception is handled here, but the library's log.error() still emits output
        if e.code and e.code not in (403, 429, 500):
            print("Error parsing: %s [%s]" % (url, e.code))
created time in 2 months
started jvanasco/metadata_parser
started time in 2 months
started TaDaa/vimade
started time in 2 months
started Shopify/draggable
started time in 2 months
issue comment jvanasco/metadata_parser
Creeping featurism and adding one or two more `page` fields ...
> Unfortunately, I am way too busy over the next week to address this stuff in depth, but I can give high-level feedback.
Understood; thanks for taking the time to answer this.
> - The refresh stuff seems somewhere between fine and too much. I need to look at it against some datasets. I do like the idea of parsing some of the info, such as... `"meta_refresh": {"refresh": "2;url=/account?param1=xyz"}`, `"_meta_refresh_parsed": {"time": 2, "url": "/account?param1=xyz"}`. I'm not sure about the further parsing.
Yeah, I agree with you here. Despite implementing that extra level of cracking in my own code, I don't think that level of detail is appropriate within metadata_parser. I only really consider it reasonable to split the content into the time and URL/path components; the extra granularity has minimal value for nearly everyone, and in my opinion it wanders out of scope of the project, going from parsing HTML to deeply interpreting the HTML. It doesn't seem appropriate for core metadata_parser functionality; the user can crack the value themselves after using metadata_parser to retrieve it, if they want to.
> - The h2/h3/etc stuff seems okay, but I worry about things getting too carried away and how this may affect existing installs. It might make sense to use callback hooks or even a subclass model to do extra work.
I agree, I definitely do not want to go beyond `h1`, `h2`, `h3`, especially within the metadata_parser code. Anything beyond that is where the extensibility approach seems like the right way to go: it would allow much less invasive changes while giving users the flexibility to do this themselves, either via their own adapters/callbacks or via optional ones included in the package. Maybe one common adapter/subclass (mainly to serve as an example for users) would be appropriate to include if this were implemented.
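To make the adapter/callback idea concrete, here is a rough, untested sketch of a user-side subclass; the `soup_hooks` argument and the `extras` attribute are hypothetical, not existing `metadata_parser` API, and it assumes the soup is available once the base `__init__` returns (the constructor both fetches and parses):

    import metadata_parser

    class ExtensibleParser(metadata_parser.MetadataParser):
        """Run user-supplied hooks against the parsed soup after init."""

        def __init__(self, *args, soup_hooks=None, **kwargs):
            super().__init__(*args, **kwargs)
            # each hook receives the BeautifulSoup object and returns a dict of extras
            self.extras = {}
            for hook in (soup_hooks or []):
                self.extras.update(hook(self.soup))

    def heading_hook(soup):
        # example hook: capture h1/h2/h3 text without touching the core parser
        return {name: [t.get_text(strip=True) for t in soup.find_all(name)]
                for name in ("h1", "h2", "h3")}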
> There was a lot of feature creep a few years back, and the number of args/kwargs exploded. This has really pushed me towards looking more at subclasses and callbacks, so the library can support more use cases... but still remain usable.
I noticed that some of the function `kwargs` had started to look like pandas, which commonly has 10+ `kwargs` for its most ubiquitous and powerful classes. I bet `pylint`'s default settings love that `kwargs` list :)
Correct me if I am wrong (if you have time), but this is what I am ultimately getting out of your comments, which were very helpful:
- I don't see any areas of philosophical or implementation disagreement, aside from the extensive cracking of the meta refresh I cited in my first comment, and I happen to actually agree with you there. I think metadata_parser can serve to break the content into the time value and the URL/relative path, but anything beyond that belongs in user code or in a special (optional) adapter.
- h1/h2/h3 are suitable candidates for being bolted onto the code, but going anywhere beyond that with HTML body tags is too slippery a slope. Fully agree here; I actually concluded these were the only tags with universal appeal across my dataset. I didn't choose them arbitrarily, nor were they just a starting point for a laundry list of additional tags, so we're on the same page. Otherwise, at some point you might as well just return the entire `bs4.soup` object :P
- In general, you want to move towards providing a generic/cleaner way for functionality to be extended, via classes/adapters and/or callbacks/hooks, some of which may be included in the package for user convenience or to serve as examples. This provides flexibility, makes the project easier to maintain in the long term, and doesn't require adding more and more `kwargs`.
- I don't know that you said this directly, but if there are opportunities for retroactively applying this design in areas where it can provide a maintainability improvement, they're worth looking at. One example I can think of is replacing one or two of the kwargs with a class that acts as an adapter to influence the standard behavior. Though I would guess you would prefer to increment the major version of the package for changes like these, as opposed to keeping the current major and incrementing the minor, even if they are mostly abstracted from the user.
It doesn't matter to me whether you leave this issue open or not; it served its purpose as a communication medium and helped me quite a bit.
I will probably fork master in the short term and at some point send a PR for you to glance at and see what you think, at the very least for the h1/h2/h3 stuff, which is quick and easy. If you don't want to merge it at that time (or ever), it won't be a big problem for me; I can just `pip install` from my fork+branch. Or we can work out modifications. The more overarching changes will be a separate effort.
Thanks again for the insight and the brief backstory on that initial expansion of the features. I can relate to being short on time, and I realize this is not a sponsored project for you, so taking the time to write this note is much appreciated.
comment created time in 3 months
issue opened jvanasco/metadata_parser
[Feature] Cache soup (and other) objects by sha256sum
When running against a very large corpus, there is an opportunity for a non-negligible performance gain from caching the full operations of `MetadataParser` using a checksum of the HTML content, at least when running across 100k-200k+ pages that contain a high percentage of duplicates, which is the case for my use.
This is something I'm planning on implementing anyway because, in my case, my corpus already has the `sha256_sum` available before sending it to `MetadataParser`. The only question is whether I do it on the front end, on my side of the project, or contribute it as a PR to your project.
If I implement it for `MetadataParser`, I'll be doing it such that the sha256sum can be sent precalculated via a constructor `kwargs` value along with `html=`.
If you do not want this as a feature, or have any comments about possible problems that might crop up, it would be great if you could post a comment in the next day or two so I can make an educated decision on where to place this.
I'm expecting it to be a simple CPU/RAM tradeoff, which I can afford in my case.
The only constraint/requirement for me is that sha256 must be used as the algorithm, though it wouldn't be much work to add support for multiple algorithms as an option.
Thanks!
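A rough sketch of such a caller-side cache, assuming only that the constructor accepts `html=` (as noted above); the helper and cache names are illustrative:

    import hashlib
    import metadata_parser

    _CACHE = {}  # sha256 hex digest -> MetadataParser instance

    def cached_parse(html, sha256_sum=None, **kwargs):
        # accept a precalculated digest, or compute one from the HTML
        key = sha256_sum or hashlib.sha256(html.encode("utf-8")).hexdigest()
        if key not in _CACHE:
            _CACHE[key] = metadata_parser.MetadataParser(html=html, **kwargs)
        return _CACHE[key]

Duplicate pages then skip re-parsing at the cost of holding parsed objects in memory, i.e. the CPU/RAM tradeoff described above.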
created time in 3 months
issue opened jvanasco/metadata_parser
Creeping featurism and adding one or two more `page` fields ...
Hello, first, thank you for your work on this; I was very happy to find it. Unfortunately for me, I went through the pain of writing something very similar two days before stumbling upon it. Probably similar to your implementation, it involved a mishmash of `bs4` and regex to try to intelligently handle malformed/non-compliant HTML, as well as to just take whatever `bs4` could cleanly pick up on the first try. I trust your implementation much more than my hastily written one; I have switched over to it, it's working great, and it chopped out a lot of cruft.
However... there are a few (minor) things I was grabbing from the page, in addition to the title, that you chose not to grab. For my use (fingerprinting specific technologies based on HTTP protocol info and HTML content), I found that grabbing `h1`, `h2`, and `h3` came in handy in a bunch of my cases. Right now, I've switched over to `metadata_parser` to get the `title` and the `meta` tag data, but I still have a bit of code doing the `bs4` and `re.search` logic for those `hn` tags. Any thoughts one way or another on adding capture of these? I'd be happy to PR that if you would accept it, as I would love to dump the last few remnants of my code from my project.
If you think this goes a bit too far beyond the scope of what you have in mind, no worries. I only even thought to ask because you pull the `<title>` tag content.
Something I find nice about your interface is that you thought to include clean, direct access to the `bs4` soup object via `MetadataParser::soup`, so either way I can at least avoid the cost of parsing the page multiple times with `bs4`. Thanks again, I appreciate your work on this.
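A minimal sketch of that kind of heading capture, reusing the soup `MetadataParser` has already built; the URL is a placeholder, and the exact attribute usage is inferred from the description above rather than verified against the library:

    import metadata_parser

    page = metadata_parser.MetadataParser(url="https://example.com", requests_timeout=5)
    # reuse the soup that metadata_parser already parsed instead of re-parsing with bs4
    headings = {
        name: [tag.get_text(strip=True) for tag in page.soup.find_all(name)]
        for name in ("h1", "h2", "h3")
    }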
Ah, one last note: I went through the trouble of adding a function to parse the `http-equiv` `refresh` `content` value, not just to split out the seconds value and the URL or relative-path value, but also, in the case where it's a URL, to crack the URL into its `urlparse()` components. This is probably a bit much for most people but was important for my project. I really doubt you would be interested in adding this to your project (either yourself or as a PR), but I'm happy to hear your thoughts either way. I was breaking it into something like this:
"meta_refresh": {
"refresh": "2;URL=https://www.somesite.co.za:2087/login"
},
"meta_refresh_cracked": {
"path": "/login",
"port": 2087,
"username": null,
"password": null,
"params": "",
"fragment": "",
"refresh_seconds": 2,
"scheme": "https",
"private suffix": "somesite.co.za"
"hostname": "www.somesite.co.za",
"raw_url": "https://www.somesite.co.za:2087/login"
}
For instances where it's a relative path instead of a full URL, I give that relative path a dummy scheme and hostname to let `urlparse()` cleanly break out the `params` from the `path`. I was partially too lazy to manually split the path and query params, and partially worried I might mishandle some bizarre edge case that I trusted `urlparse()` to handle better. So I break it down as:
"meta_refresh": {
"refresh": "2;url=/account?param1=xyz",
"Robots": "noindex, nofollow"
},
"meta_refresh_cracked": {
"path": "/account",
"port": null,
"username": null,
"password": null,
"params": "param1=xyz",
"fragment": null,
"refresh_seconds": 2,
"scheme": null,
"hostname": null,
"raw_url": "/account?param1=x"
}
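A simplified sketch of this kind of cracking using only the standard library (`urlsplit` is a close cousin of the `urlparse()` mentioned above); it skips the dummy scheme/hostname trick and the private-suffix lookup, and the function name is illustrative:

    from urllib.parse import urlsplit

    def crack_refresh(content):
        # "2;url=/account?param1=xyz" -> seconds plus the URL/path components
        seconds, _, target = content.partition(";")
        target = target.strip()
        if target.lower().startswith("url="):
            target = target[len("url="):]
        parts = urlsplit(target)  # handles both absolute URLs and bare paths
        return {
            "refresh_seconds": int(seconds.strip()),
            "scheme": parts.scheme or None,
            "hostname": parts.hostname,
            "port": parts.port,
            "path": parts.path,
            "params": parts.query,
            "fragment": parts.fragment or None,
            "raw_url": target,
        }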
This is probably not worth much, but for some reason I felt compelled to "finish" the job; once I start cracking a field... I can get carried away and crack it as far as it can go. At least I didn't break the query parameters into a `dict` :>
created time in 3 months
started jvanasco/metadata_parser
started time in 3 months
started bda-research/node-crawler
started time in 3 months
started github/time-elements
started time in 3 months