The Joys Of Programming

So here's a thing that happened. I have a voxel game engine that I've been working on for 4-5 years. It's MIT licensed and on GitHub, but as far as I know nobody else has ever done anything with it.

Then last week, Mojang apparently released a new retro version of Minecraft built on it.

Somebody who poked around the source was kind enough to email and let me know; otherwise I suppose I might not have found out...

That's awesome!

neat!

It's weird... ostensibly it's great, but I'm kind of down about it. I wish I'd known sooner, so I could have participated in the Reddit threads and Hacker News comments and whatever, instead of picking through them a week later.

I mean, if Mojang has further plans for it, then maybe there's more to come. But somehow it has the look and feel of a one-time marketing goof that's now over with.

Gah. Well, I get to bother you folks about it, anyway.

I guess you can add that to the project's GitHub page. Minecraft approved!

Speaking of that, here's a reminder for everybody: put proper attribution in your open source work! Mojang built their files in such a way that attribution comment blocks would have been retained in the source bundle, but I never got around to adding one, so my engine went uncredited.

Pushed a fix today, but considering they're using an engine version from mid-2017, I have a feeling they may not update immediately.

I mentioned this in the GWJ Slack, but like many before me I have embarked on the project of parsing PDF files to extract information. Unfortunately the format and locations of the data of interest will vary, even within files from the same provider.

The existing solution we use is built on the .NET port of iText. It reads through a PDF, infers a row/column structure, gathers atomic text objects, and logs the points defining the rectangular cells that contain them. Finally, the page, the content of the text object, and the coordinates defining its cell are dumped to SQLite.
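For anyone who'd rather poke at the same idea without .NET, a rough Python analogue of that pipeline (sketched with PyMuPDF and the stdlib sqlite3 rather than iText; the paths and table schema here are invented for illustration, not our actual code) would look something like:

    import sqlite3

    import fitz  # PyMuPDF

    def dump_words(pdf_path, db_path):
        """Extract every word with its bounding box and store it in SQLite."""
        con = sqlite3.connect(db_path)
        con.execute(
            "CREATE TABLE IF NOT EXISTS words "
            "(page INTEGER, text TEXT, x0 REAL, y0 REAL, x1 REAL, y1 REAL)"
        )
        for page_num, page in enumerate(fitz.open(pdf_path)):
            # get_text("words") yields (x0, y0, x1, y1, word, block, line, word_no)
            for x0, y0, x1, y1, word, *rest in page.get_text("words"):
                con.execute(
                    "INSERT INTO words VALUES (?, ?, ?, ?, ?, ?)",
                    (page_num, word, x0, y0, x1, y1),
                )
        con.commit()
        con.close()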

From there my team queries the database and attempts to make sense of the data, transforming it into Excel and JSON files. It’s messy and at present error-prone because, for instance, a given text object may span multiple columns, so we need to find the orphans and merge them.
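The merge step is the fiddly part. Here’s a minimal sketch of the idea in plain Python (the tolerances are made up, and it assumes a split cell shows up as two boxes on the same baseline with a tiny horizontal gap):

    def merge_row_fragments(words, y_tol=2.0, x_gap=3.0):
        """Merge fragments that share a baseline and nearly touch.

        words: (text, x0, y0, x1, y1) tuples, e.g. from the words table.
        """
        rows = {}
        for frag in sorted(words, key=lambda w: (w[2], w[1])):
            key = round(frag[2] / y_tol)  # bucket by approximate baseline
            rows.setdefault(key, []).append(list(frag))

        merged = []
        for row in rows.values():
            current = row[0]
            for frag in row[1:]:
                if frag[1] - current[3] <= x_gap:  # boxes nearly touch
                    current[0] += frag[0]          # join the text
                    current[3] = frag[3]           # extend the right edge
                else:
                    merged.append(tuple(current))
                    current = frag
            merged.append(tuple(current))
        return merged

Real documents need smarter tolerances than this, but that’s the shape of the orphan merge.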

I’ve resolved to read the PDF spec as far as I need to in order to make sense of the control codes that define fonts, font sizes, and whatnot. I’ve used qpdf to uncompress a PDF to see how the actual text of interest is encoded, thinking there’s got to be a more direct way to parse the information than inferring an inflexible rectilinear structure.
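One thing that may save some squinting: pikepdf (Python bindings over the same qpdf library) can hand you the parsed content stream directly, so you can inspect the operators without reading the uncompressed file by eye. A minimal sketch, with a made-up file name:

    import pikepdf

    pdf = pikepdf.open("statement.pdf")
    page = pdf.pages[0]

    # Tf selects the font and size, Td/Tm position the text cursor,
    # and Tj/TJ are the operators that actually show text.
    for operands, operator in pikepdf.parse_content_stream(page, "Tf Td Tm Tj TJ"):
        print(operator, operands)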

That said, many far smarter people have tried this and ended up doing more or less the same thing we do. It just seems like the abundance of PDF readers/renderers suggests we could make more efficient sense of the underlying data than that.

Anyone know of better approaches (not including “Don’t use PDF”; that’s not an option yet)?

muraii wrote:

I’ve resolved to read the PDF spec as far as I need...

And he was never seen again. Heard he was still gibbering when they gently closed the door to the padded room.

One can only hope.

I actually wrote some Elixir code a few years back at work to convert landscape pages in a PDF to portrait for faxing. Doctors kept uploading entire patient histories, ImageMagick would use more than the allowed RAM, the job would get killed and put right back on the front of the queue, and that took the system down over and over 'til I figured it out.
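For the curious, the page-flipping step by itself is small. A rough sketch in Python with pypdf (not the Elixir we actually used; this just sets the page's /Rotate flag and ignores any rotation already on the page):

    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("upload.pdf")   # hypothetical input
    writer = PdfWriter()
    for page in reader.pages:
        # Landscape pages are wider than tall; turn them upright.
        if page.mediabox.width > page.mediabox.height:
            page.rotate(90)
        writer.add_page(page)
    writer.write("portrait.pdf")

That only flips metadata, so it's cheap; whether it's enough depends on what the fax pipeline expects downstream.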

I don't envy you having to look at PDF internals at all.

Feel free to PM me any questions, though, and I'll see what I can do to help.

Obliged!

libvips seems to be the new hotness for image and PDF manipulation; is it more sane than ImageMagick? Haven't had a chance to use it yet, but I have some tools in production putting ImageMagick to heavy use.
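From skimming the docs, the Python binding looks something like this (assuming a libvips build with PDF support via poppler or pdfium; file names invented):

    import pyvips

    # Load page 0 of the PDF at 300 dpi. access="sequential" lets libvips
    # stream the image instead of buffering it all, which keeps RAM low.
    image = pyvips.Image.new_from_file(
        "doc.pdf", dpi=300, page=0, access="sequential"
    )
    image.write_to_file("page0.png")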

Mr Crinkle wrote:
muraii wrote:

I’ve resolved to read the PDF spec as far as I need...

And he was never seen again. Heard he was still gibbering when they gently closed the door to the padded room.

One of the lucky ones.

Mr Crinkle wrote:
muraii wrote:

I’ve resolved to read the PDF spec as far as I need...

And he was never seen again. Heard he was still gibbering when they gently closed the door to the padded room.

Which is why when my work needed to scrape PDFs, we used the iTextSharp library.

Quintin_Stone wrote:

Which is why when my work needed to scrape PDFs, we used the iTextSharp library.

Yup. We rolled our own and eventually scrapped it, switching to iTextSharp.

-BEP

I'm wondering if anyone has recommendations for reputable training courses. I'm feeling a little rusty on some specific technical skills and am thinking of brushing up, since I might be on the job market in the future. In the past I would skim or read big technical books, but I haven't done that for a while.

I've been working at ad agency jobs for the past 5 years. By their nature, ad agencies are fairly reactive, unfortunately, so you generally learn in the moment by simply looking up language or framework documentation. What I'm thinking of is something more structured and guided to get the cobwebs out of the corners of my skillset.

bepnewt wrote:
Quintin_Stone wrote:

Which is why when my work needed to scrape PDFs, we used the iTextSharp library.

Yup. We rolled our own and eventually scrapped it, switching to iTextSharp.

-BEP

We use it too, but the other new hire and I came into this project with neither of us particularly familiar with PDFs. We’ve learned a lot and tried a bunch of different parsers/scrapers, mostly in Python, but we may need to step back a bit.