Zhexamples #628

Merged
merged 8 commits into from
May 22, 2024

kristian-clausal
Collaborator

I've added one more branch to the extract_example multi-line if-tree to account for how Template:zh-x formats its examples. Because it's a special case, it's simpler than usual (just using classify_desc to put each line into a box).

For example in: https://en.wiktionary.org/wiki/%E6%9C%AA%E9%9B%A8%E7%B6%A2%E7%B9%86

Note that each line is split on square brackets because [ interferes with classify_desc, most probably on purpose (it classifies it as 'other' instead of 'romanization'). zh-x uses a lot of "text [Specific Chinese language, trad. or simp.]" style qualifiers, which we're going to ignore and add as part of the text itself. Examples inside senses don't have a tags field, and I don't want to add them unless there's a lot more need for it.
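A rough sketch of the per-line handling described above. `toy_classify` is a hypothetical stand-in for `classify_desc` (it does not reproduce the real heuristics), `classify_zh_x_line` is an invented helper name, and the sample lines are made up for illustration:

```python
import re

# Hypothetical stand-in for classify_desc; the real heuristics are far richer.
def toy_classify(text: str) -> str:
    if "[" in text:
        return "other"  # mimics the bracket interference described above
    if re.search(r"[\u4e00-\u9fff]", text):
        return "text"  # contains CJK ideographs
    if re.search(r"[āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ]", text):
        return "romanization"  # contains pinyin tone marks
    return "english"


def classify_zh_x_line(line: str) -> tuple[str, str]:
    """Classify on the part before the first '[', but keep the whole line.

    The bracketed qualifier stays in the returned text; it is not
    turned into a tag.
    """
    head = line.split("[", 1)[0].strip()
    return toy_classify(head), line.strip()


print(classify_zh_x_line("wèi yǔ chóu móu [Pinyin]"))
```

Classifying only the head and returning the untouched line keeps the qualifier in the example text, which matches the decision not to add a tags field to examples.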

The [ also broke the heuristics for the original code, so I also added a negative condition regarding zh-x up in the first if condition to counteract that. If there are more Chinese templates like this, or maybe if all Chinese examples have this exact format, then we can expand the condition or make it more general (with "lang_code == 'zh'", for example).

It took me too long messing with this code to realize it would be a royal mess to integrate this stuff into the 'general' branches, and adding a special case isn't going to be that expensive. Just makes the code even longer.

- if any(re.search(r"[]\d:)]\s*$", x) for x in lines[:-1]):
+ if any(
+     re.search(r"[]\d:)]\s*$", x) for x in lines[:-1]
+ ) and example_template_names in (["zh-x"], ["zh-usex"]):
Collaborator

Should be not in.

This might be difficult, but I think it's better to put the "zh-x" code in a new function so it can be reused for the etymology section. Processing the expanded HTML tags would also fix the unreliable classify_desc problem.
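For the first point in this comment (`not in`), a minimal sketch of the intended condition. The function name `uses_general_branch` and the constants are invented for illustration; only the regex and the template names come from the diff under review:

```python
import re

# Same pattern as in the diff: a line ending in ']', a digit, ':' or ')'.
TRAILING_MARKER = re.compile(r"[]\d:)]\s*$")
ZH_EXAMPLE_TEMPLATES = (["zh-x"], ["zh-usex"])


def uses_general_branch(lines: list[str], example_template_names: list[str]) -> bool:
    """True when the general multi-line heuristic should handle the example.

    zh-x style examples are excluded (hence `not in`) so that they fall
    through to their own special-case branch.
    """
    return (
        any(TRAILING_MARKER.search(x) for x in lines[:-1])
        and example_template_names not in ZH_EXAMPLE_TEMPLATES
    )


sample = ["some text [MSC, trad.]", "a translation"]
print(uses_general_branch(sample, ["zh-x"]))
print(uses_general_branch(sample, ["ux"]))
```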

Collaborator Author

Thanks for spotting that, I was hasty.

The classify_desc problem isn't with HTML tags; it's that classify_desc breaks on square brackets, like "[Chinese language name, trad.]".

Collaborator

I mean that if we could process the HTML tags, separating the text would be easier, because the language variety and the other text can be distinguished by HTML tag name and CSS class name. This is just a suggestion; if we want to move the data inside the brackets to tags in the future, we could copy the zh edition code, since they use the same HTML tags.
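To illustrate the suggestion, here is a minimal sketch that groups text by the CSS class of the enclosing tag, using only the standard library. The markup and the class names `zh-text` and `zh-variety` are invented for the example; the HTML that Template:zh-x actually expands to is not shown in this thread and may differ.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect text grouped by the class of the innermost open tag.

    Sketch only: assumes well-nested tags and no void elements like <br>.
    """

    def __init__(self):
        super().__init__()
        self.stack = []  # class attribute of each currently open tag
        self.by_class = {}

    def handle_starttag(self, tag, attrs):
        self.stack.append(dict(attrs).get("class") or "")

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            key = self.stack[-1] if self.stack else ""
            self.by_class.setdefault(key, []).append(text)


# Hypothetical markup, not the real zh-x expansion.
sample = (
    '<span class="zh-text">未雨綢繆</span> '
    '<span class="zh-variety">[MSC, trad.]</span>'
)
collector = ClassTextCollector()
collector.feed(sample)
collector.close()
print(collector.by_class)
```

With the variety qualifier isolated by class name, it could later be moved into a tags field without any bracket splitting.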

@xxyzz
Collaborator

xxyzz commented May 14, 2024

The "etymology_text" field is still missing from the Chinese section JSON data for the page "作"; there might be a bug in the code that adds the extracted etymology data to the final dictionary variable.

@kristian-clausal
Collaborator Author

I tested this with a stripped-down version of the article, but didn't try the whole thing. Using the whole Chinese section, there's no etymology data.

@kristian-clausal
Collaborator Author

kristian-clausal commented May 14, 2024

The issue with 作 seems to be that Glyph origin, Etymology and Pronunciation are all on the same level. My minimal zuo.txt didn't have the Pronunciation titles + pron subsections, which is why the etymologies got associated with its definitions.

Chinese articles look like this:

===Glyph Origin===
...
===Etymology===
...
===Pronunciation 1===
...
====Definitions====
...
===Pronunciation 2===
...
====Definitions====

Currently it seems etymology information for POS sections is not duplicated and the first POS section on the page receives it.

Ok, I found a Chinese article where the Pronunciation is a level 4 ====Pronunciation====.

A hack might be to make the Pronunciation section a level 4 when it's right after an Etymology template. The other level 4 sections after that (under the Pronunciation template previously) will now be children of the Etymology node, and other Pronunciation sections will be left alone.

EDIT:

Well, that didn't work.

@xxyzz
Collaborator

xxyzz commented May 15, 2024

The etymology section data are added now (this also restores the etymology data in the Japanese section), but the first pronunciation section's data are added to the second "definitions" POS section, and the second pronunciation section's data are added to the first POS section.

In Chinese entries, you sometimes get a Glyph Origin
and an Etymology section next to each other
(without subsections in Glyph Origin, hopefully),
and Glyph Origin doesn't get properly extracted,
or gets overwritten by the etymology in the
following Etymology section. An etymology section
(in this case the Glyph Origin section) without
contents isn't very useful, and is probably culled
at some point anyhow, so let's just combine the
two into one big etymology section.

Mostly affecting Chinese etymology template data

Chinese sections often have up to three level 3
sections one after another: a Glyph Origin, an
Etymology and a Pronunciation. Later sections
override the earlier ones (or rather, the earlier
ones are ignored). In the previous commit, we combined
Etymology sections with an immediately preceding
Level 3 section (Glyph Origin; without subsections,
which is important). That worked fine in simple examples,
but when you have Pronunciation sections (taking
the slot of an Etymology on Level 3) there was
still no etymology data.

This commit is trying to put the Pronunciation section
on level 4 so that it's the child of the Etymology
section, but it didn't work out. This commit is
mainly for history so that I can wipe the previous
changes to fix_subtitle_hierarchy()

This is needed for Chinese entries with structures like:

===Glyph Origin===
===Etymology===
===Pronunciation 1===
====POS Section====
===Pronunciation 2===
====POS Section====

Glyph Origin and Etymology are merged in fix_subtitle_hierarchy
(previous commit), and Pronunciation sections are moved to
Level 4, while all other sections under that are moved one
step further, also; Levels 2 and 3 (Language and Etymology)
are left alone, and happily they're also the only truly
"meaningful" sections; The code doesn't really check for
`NodeKind.LEVEL[3-6]` anywhere.

The in-between stage for Pronunciation only involves acting
as a bridge between push_etym() and push_pos(). 99% of the time
(or maybe a slightly lower percentage, there are a lot of
Chinese entries) it doesn't really do anything.
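
The level shuffling described in this commit message can be sketched on a flat list of headings. This is a toy under stated assumptions, not the actual fix_subtitle_hierarchy code: `demote_pronunciation` is an invented name, headings are modelled as plain `(level, title)` tuples, the Glyph Origin merge is left out, and demoting only after an Etymology heading has been seen is my reading of the description.

```python
def demote_pronunciation(headings):
    """Move level-3 Pronunciation headings to level 4.

    Headings beneath a demoted Pronunciation are pushed one level
    deeper; levels 2 and 3 are otherwise left alone.
    """
    out = []
    seen_etymology = False  # reset at each language (level 2) heading
    shifting = False  # True while inside a demoted Pronunciation
    for level, title in headings:
        if level <= 2:
            seen_etymology = False
            shifting = False
        elif level == 3:
            if title.startswith("Pronunciation") and seen_etymology:
                shifting = True
                out.append((4, title))
                continue
            shifting = False
            if title.startswith("Etymology"):
                seen_etymology = True
        elif shifting:
            level += 1  # subsection of a demoted Pronunciation
        out.append((level, title))
    return out


page = [
    (2, "Chinese"),
    (3, "Glyph Origin"),
    (3, "Etymology"),
    (3, "Pronunciation 1"),
    (4, "Definitions"),
    (3, "Pronunciation 2"),
    (4, "Definitions"),
]
for level, title in demote_pronunciation(page):
    print("=" * level + title + "=" * level)
```
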
@xxyzz
Collaborator

xxyzz commented May 17, 2024

All issues seem to be fixed. Should we merge this, or you want to test the code on more pages?

@kristian-clausal
Collaborator Author

I don't trust this at all; I had so much trouble getting it to work. I will try to diff (or whatever is appropriate for JSON) the output of the main branch against the output of this branch. Thanks for taking a look!

Chinese word entries sometimes have several level 3
sections after each other like this:

```
===Glyph Origin===
===Etymology===
===Pronunciation 1===
===Pronunciation 2===
```

where these should all be part of the same thing.

Previous changes attempted to fix this by inserting a
virtual Level 4 into the mix, by changing Pronunciation
sections to level 4 so they would be hierarchically
under Etymology sections (and Etymology sections
are combined under one).

The last commit didn't work perfectly, but I think I got it
right this time... At least it looks like it, and
tests pass.

It took me ages. I just couldn't get it right.

Anyhow, currently it works like this:

There's POS data and etym data as previously, but they
are bridged by level_four data. Usually the level_four
data is empty, except when a pronunciation section is found,
in which case we flip a flag to true that means that
when we parse stuff and add the data to something, that
data is added to the level_four data instead of `etym`.

`etym` is not etymological data, it's data for an
Etymology *section*, so it's everything under an
Etymology title, unless there's a Pronunciation title,
in which case `etym` should contain mostly etymology
and general data (like categories plucked from the
etymology section). The name is a bit confusing, but
changing to "level_three" is annoyingly abstract too.
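
A toy model of the bridging described above, to make the data flow concrete. The class and method names are invented (the real code uses push_etym() and push_pos(), whose signatures are not shown here), and the field names and values are placeholders.

```python
class SectionState:
    """Toy model: etym data and POS data bridged by level_four data."""

    def __init__(self):
        self.etym = {}  # everything under the Etymology title
        self.level_four = {}  # data found under a Pronunciation title
        self.in_level_four = False  # the flag flipped by Pronunciation
        self.entries = []

    def start_etymology(self):
        self.etym = {}
        self.level_four = {}
        self.in_level_four = False

    def start_pronunciation(self):
        # A new Pronunciation section replaces the previous bridge data.
        self.level_four = {}
        self.in_level_four = True

    def add_section_data(self, key, value):
        target = self.level_four if self.in_level_four else self.etym
        target[key] = value

    def emit_pos(self, pos_data):
        # Usually level_four is empty and this is just etym + POS data.
        self.entries.append({**self.etym, **self.level_four, **pos_data})


state = SectionState()
state.start_etymology()
state.add_section_data("etymology_text", "placeholder etymology")
state.start_pronunciation()
state.add_section_data("sounds", ["pron-1"])
state.emit_pos({"pos": "verb"})
state.start_pronunciation()
state.add_section_data("sounds", ["pron-2"])
state.emit_pos({"pos": "noun"})
print(state.entries)
```

Resetting level_four at each Pronunciation title is what keeps the first pronunciation's data from leaking into the second POS section.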
@kristian-clausal
Collaborator Author

I FINALLY got the jsondiff thing to work. The script is slow as molasses, and I got caught trying to figure out why things didn't work due to several minor bugs compounding each other (for example: a continue that was indented too low, and having the name of the file be 'jsondiff', which messed up the import of jsondiff...). Then I had to re-extract stuff due to unrelated changes to the code (ignored etymology templates causing diffs, of course)... But looking at the diff right now, and after adding some new section terms to a couple of places, it seems this is where I want it to be!!!
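
For readers who want to repeat this kind of check, here is a minimal standard-library sketch of comparing two extraction outputs. It is not the script used above and does not use the jsondiff package. It assumes one JSON object per line and that `word`, `lang_code` and `pos` are a reasonable grouping key; both are assumptions for illustration.

```python
import json
from collections import defaultdict


def load_entries(lines):
    """Group entries by (word, lang_code, pos), serialized canonically."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        key = (entry.get("word"), entry.get("lang_code"), entry.get("pos"))
        # sort_keys makes the comparison independent of key order.
        grouped[key].append(json.dumps(entry, sort_keys=True, ensure_ascii=False))
    return {key: sorted(items) for key, items in grouped.items()}


def diff_entries(old_lines, new_lines):
    """Return the keys that were added, removed, or whose data changed."""
    old, new = load_entries(old_lines), load_entries(new_lines)
    return {
        "added": new.keys() - old.keys(),
        "removed": old.keys() - new.keys(),
        "changed": {k for k in old.keys() & new.keys() if old[k] != new[k]},
    }


old_output = ['{"word": "作", "lang_code": "zh", "pos": "verb"}']
new_output = [
    '{"word": "作", "lang_code": "zh", "pos": "verb", "etymology_text": "x"}',
    '{"word": "a", "lang_code": "en", "pos": "noun"}',
]
print(diff_entries(old_output, new_output))
```

Passing open file objects instead of lists works the same way, since both are iterables of lines.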

@kristian-clausal kristian-clausal merged commit b972423 into master May 22, 2024
10 checks passed
@kristian-clausal kristian-clausal deleted the zhexamples branch May 22, 2024 08:22
2 participants