Zhexamples #628

kristian-clausal · 2024-05-13T10:51:01Z

I've added one more branch to the extract_example multi-line if-tree to account for how Template:zh-x formats its examples. Because it's a special case, it's simpler than usual (just using classify_desc to put each line into a box).

For example in: https://en.wiktionary.org/wiki/%E6%9C%AA%E9%9B%A8%E7%B6%A2%E7%B9%86

Note that each line is split on square brackets because [ interferes with classify_desc, most probably on purpose (it classifies it as 'other' instead of 'romanization'). zh-x uses a lot of "text [Specific Chinese language, trad. or simp.]" style qualifiers, which we're going to ignore and add as part of the text itself. Examples inside senses don't have a tags field, and I don't want to add them unless there's a lot more need for it.
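The bracket handling described above can be sketched roughly as follows. This is only an illustration under assumptions, not the code in this PR: `toy_classify` is a hypothetical stand-in that merely imitates the reported behaviour of classify_desc (a `[` forces 'other'), and `classify_zh_x_line` is an invented helper name.

```python
import re

# Matches a trailing "[...]" qualifier such as "[MSC, trad.]".
BRACKET_RE = re.compile(r"\s*\[[^\]]*\]\s*$")


def toy_classify(text: str) -> str:
    """Hypothetical stand-in for classify_desc (NOT the real function)."""
    if "[" in text:
        return "other"  # imitates the interference described above
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "other"  # CJK text is not a romanization
    return "romanization"


def classify_zh_x_line(line: str) -> tuple[str, str]:
    """Classify on the part before a trailing bracket qualifier,
    but keep the qualifier as part of the returned text."""
    head = BRACKET_RE.sub("", line)
    return toy_classify(head), line.strip()
```

The qualifier is only hidden from the classifier; the stored text still contains it, which matches the decision above not to add a tags field to examples.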

The [ also broke the heuristics for the original code, so I also added a negative condition regarding zh-x up in the first if condition to counteract that. If there are more Chinese templates like this, or maybe if all Chinese examples have this exact format, then we can expand the condition or make it more general (with "lang_code == 'zh'", for example).

It took me too long messing with this code to realize it would be a royal mess to integrate this stuff into the 'general' branches, and adding a special case isn't going to be that expensive. Just makes the code even longer.

xxyzz · 2024-05-14T01:33:08Z

src/wiktextract/extractor/en/page.py

-                    if any(re.search(r"[]\d:)]\s*$", x) for x in lines[:-1]):
+                    if any(
+                        re.search(r"[]\d:)]\s*$", x) for x in lines[:-1]
+                    ) and example_template_names in (["zh-x"], ["zh-usex"]):


This should be `not in`.
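A minimal sketch of the condition with the suggested `not in`, pulled out into a standalone function so it can be tried in isolation. The regex and the template names come from the diff; the function name `passes_first_condition` and the sample lines are made up for illustration.

```python
import re

# Same pattern as in the diff: a line ending in "]", a digit, ":" or ")".
TRAILING_RE = re.compile(r"[]\d:)]\s*$")

# Template names taken from the diff.
ZH_TEMPLATES = (["zh-x"], ["zh-usex"])


def passes_first_condition(lines: list[str], example_template_names: list[str]) -> bool:
    """True when some non-final line matches the pattern and the example
    does NOT come from one of the zh-x style templates."""
    return (
        any(TRAILING_RE.search(x) for x in lines[:-1])
        and example_template_names not in ZH_TEMPLATES
    )
```

With `in` instead of `not in`, the first condition would fire only for zh-x examples, which is the opposite of the negative condition described in the PR description.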

This might be difficult, but I think it would be better to put the "zh-x" code in a new function so it can be reused for the etymology section. Processing the expanded HTML tags would also fix the unreliable classify_desc problem.

kristian-clausal · 2024-05-14

Thanks for spotting that, I was hasty.

The classify_desc problem isn't with HTML tags; it's that classify_desc breaks on square brackets, like "[Chinese language name, trad.]".

xxyzz · 2024-05-14

I mean that if we could process the HTML tags, separating the text would be easier, because language variety and other text can be distinguished through HTML tag names and CSS class names. This is just a suggestion; if we want to move the data inside brackets to tags in the future, we could copy the zh edition code, they use the same HTML tags.

xxyzz · 2024-05-14T01:55:29Z

The "etymology_text" field is still missing from the Chinese section JSON data for the page "作"; there might be a bug in the code that adds the extracted etymology data to the final dictionary variable.

kristian-clausal · 2024-05-14T05:24:00Z

I tested this with a stripped-down version of the article, but didn't try the whole thing. Using the whole Chinese section, there's no etymology data.

kristian-clausal · 2024-05-14T05:55:35Z

The issue with 作 seems to be that Glyph origin, Etymology and Pronunciation are all on the same level. My minimal zuo.txt didn't have the Pronunciation titles + pron subsections, which is why the etymologies got associated with its definitions.

Chinese articles look like this:

===Glyph Origin===
...
===Etymology===
...
===Pronunciation 1===
...
====Definitions====
...
===Pronunciation 2===
...
====Definitions====

Currently it seems that etymology information is not duplicated across POS sections: only the first POS section on the page receives it.
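To make the flat structure concrete, here is a small sketch that lists heading levels from wikitext like the outline above. It is a plain regex scan written for illustration, not the parser wiktextract uses, and `heading_levels` is an invented name.

```python
import re

# "===Title===" style headings: the same run of "=" on both sides.
HEADING_RE = re.compile(r"^(=+)\s*(.*?)\s*\1\s*$")


def heading_levels(wikitext: str) -> list[tuple[int, str]]:
    """Return (level, title) for every heading line, in document order."""
    found = []
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match:
            found.append((len(match.group(1)), match.group(2)))
    return found


SAMPLE = """\
===Glyph Origin===
...
===Etymology===
...
===Pronunciation 1===
...
====Definitions====
...
===Pronunciation 2===
...
====Definitions====
"""

for level, title in heading_levels(SAMPLE):
    print(level, title)
```

Glyph Origin, Etymology and both Pronunciation headings all come out at level 3, so the Definitions sections are children of Pronunciation and have no Etymology ancestor.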

Ok, I found a Chinese article where the Pronunciation is a level 4 ====Pronunciation====.

A hack might be to make the Pronunciation section a level 4 when it's right after an Etymology template. The other level 4 sections after that (under the Pronunciation template previously) will now be children of the Etymology node, and other Pronunciation sections will be left alone.

EDIT:

Well, that didn't work.

xxyzz · 2024-05-15T01:22:22Z

The etymology section data are added now (this also restores the etymology data in the Japanese section), but the first pronunciation section's data are added to the second "definitions" POS section, and the second pronunciation section's data are added to the first POS section.

In Chinese entries, you sometimes get a Glyph Origin and an Etymology section next to each other (without subsections in Glyph Origin, hopefully), and Glyph Origin doesn't get properly extracted, or gets overwritten by the etymology in the following Etymology section. An etymology section (in this case the Glyph Origin section) without contents isn't very useful, and is probably culled at some point anyhow, so let's just combine the two into one big etymology section.

Mostly affecting Chinese etymology template data

Chinese sections often have up to three level 3 sections one after another: a Glyph Origin, Etymology and Pronunciation. Later sections override the previous ones (or rather, the previous ones are ignored). In the previous commit, we combined Etymology sections with an immediately preceding Level 3 section (Glyph Origin) that has no subsections (important), which worked fine in simple examples, but when you have Pronunciation sections (taking the slot of an Etymology on Level 3) there was still no etymology data. This commit tries to put the Pronunciation section on level 4 so that it's the child of the Etymology section, but it didn't work out. This commit is mainly for history so that I can wipe the previous changes to fix_subtitle_hierarchy()

This is needed for Chinese entries with structures like:

===Glyph Origin===
===Etymology===
===Pronunciation 1===
====POS Section====
===Pronunciation 2===
====POS Section====

Glyph Origin and Etymology are merged in fix_subtitle_hierarchy (previous commit), and Pronunciation sections are moved to Level 4, while all other sections under them are moved one step further down as well. Levels 2 and 3 (Language and Etymology) are left alone, and happily they're also the only truly "meaningful" sections; the code doesn't really check for `NodeKind.LEVEL[3-6]` anywhere.

The in-between stage for Pronunciation only acts as a bridge between push_etym() and push_pos(). 99% of the time (or maybe a slightly lower percentage, there are a lot of Chinese entries) it doesn't really do anything.
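The level shift can be sketched on a flat list of (level, title) pairs. This is a simplified illustration, not the logic of fix_subtitle_hierarchy itself: it assumes that a level 3 Pronunciation section is demoted only when a level 3 Etymology section precedes it within the same language section, and `demote_pronunciation` is an invented name.

```python
def demote_pronunciation(sections: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Move level 3 Pronunciation sections that follow an Etymology
    section to level 4, and push everything beneath them down one level."""
    result = []
    seen_etymology = False  # a level 3 Etymology exists in this language
    shifting = False        # we are at or below a demoted Pronunciation
    for level, title in sections:
        if level <= 2:
            # A new language section resets all state.
            seen_etymology = shifting = False
        elif level == 3:
            if title.startswith("Etymology"):
                seen_etymology, shifting = True, False
            else:
                shifting = seen_etymology and title.startswith("Pronunciation")
        if shifting:
            level += 1
        result.append((level, title))
    return result


print(demote_pronunciation([
    (2, "Chinese"),
    (3, "Etymology"),
    (3, "Pronunciation 1"),
    (4, "POS Section"),
    (3, "Pronunciation 2"),
    (4, "POS Section"),
]))
```

After the shift, both Pronunciation sections and their POS sections sit underneath the single Etymology section, while levels 2 and 3 are untouched.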

xxyzz · 2024-05-17T02:34:41Z

All issues seem to be fixed. Should we merge this, or do you want to test the code on more pages?

kristian-clausal · 2024-05-17T04:41:02Z

I don't trust this at all; I had so much trouble getting it to work. I will try to diff (or whatever is appropriate for JSON) the output of the main branch against the output of this branch. Thanks for taking a look!

Chinese word entries sometimes have several level 3 sections after each other like this:

```
===Glyph Origin===
===Etymology===
===Pronunciation 1===
===Pronunciation 2===
```

where these should all be part of the same thing. Previous changes attempted to fix this by inserting a virtual Level 4 into the mix, by changing Pronunciation sections to level 4 so they would be hierarchically under Etymology sections (and Etymology sections are combined under one). The last commit didn't work perfectly, but I think I got it right this time... At least it looks like it, and tests pass. It took me ages. I just couldn't get it right.

Anyhow, currently it works like this: There's POS data and etym data as previously, but they are bridged by level_four data. Usually the level_four data is empty, except when a pronunciation section is found, in which case we flip a flag to true that means that when we parse stuff and add the data to something, that data is added to the level_four data instead of `etym`.

`etym` is not etymological data, it's data for an Etymology *section*, so it's everything under an Etymology title, unless there's a Pronunciation title, in which case `etym` should contain mostly etymology and general data (like categories plucked from the etymology section). The name is a bit confusing, but changing to "level_three" is annoyingly abstract too.
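A toy model of that bridging, to show the intended data flow. Everything here is hypothetical: the class, the method names and the flag name are invented for illustration and do not mirror the real push_etym()/push_pos() code; only the three layers (`etym`, `level_four`, POS data) come from the description above.

```python
class SectionState:
    """Toy three-layer state: etym -> level_four -> POS."""

    def __init__(self) -> None:
        self.etym: dict[str, list] = {}        # data for the Etymology *section*
        self.level_four: dict[str, list] = {}  # bridge layer, usually empty
        self.in_pronunciation = False          # hypothetical flag name

    def start_pronunciation(self) -> None:
        """A Pronunciation title was found: reset the bridge, flip the flag."""
        self.level_four = {}
        self.in_pronunciation = True

    def add(self, key: str, value) -> None:
        """Route parsed data to level_four when the flag is set, else to etym."""
        target = self.level_four if self.in_pronunciation else self.etym
        target.setdefault(key, []).append(value)

    def build_entry(self, pos_data: dict[str, list]) -> dict[str, list]:
        """Combine shared etym data, the current bridge data and POS data."""
        entry: dict[str, list] = {}
        for layer in (self.etym, self.level_four, pos_data):
            for key, values in layer.items():
                entry.setdefault(key, []).extend(values)
        return entry


state = SectionState()
state.add("etymology_text", "From X")
state.start_pronunciation()
state.add("sounds", "zuò")
first = state.build_entry({"senses": ["to do"]})
state.start_pronunciation()
state.add("sounds", "zuō")
second = state.build_entry({"senses": ["workshop"]})
```

In this toy run each entry gets the shared etymology plus only its own pronunciation data, which is the pairing that went wrong in the earlier attempt.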

kristian-clausal · 2024-05-22T08:22:15Z

I FINALLY got the jsondiff thing to work. The script is slow as molasses, and I got stuck figuring out why things didn't work because several minor bugs compounded each other (for example, a continue that was indented at the wrong level, and naming the file 'jsondiff', which messed up the import of jsondiff...). Then I had to re-extract everything because of unrelated changes to the code (ignored etymology templates causing diffs, of course)... But looking at the diff right now, and after adding some new section terms in a couple of places, it seems this is where I want it to be!!!
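For readers who want to run a similar comparison, here is a rough standard-library sketch. It is not the script used here (which relied on the jsondiff package); it assumes the output is JSON Lines with "word", "lang_code" and "pos" fields, and the function names are invented.

```python
import json
from collections import defaultdict


def index_entries(lines):
    """Group entries by (word, lang_code, pos), serialised canonically."""
    index = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        key = (entry.get("word"), entry.get("lang_code"), entry.get("pos"))
        # sort_keys makes the comparison independent of key order.
        index[key].append(json.dumps(entry, sort_keys=True, ensure_ascii=False))
    return index


def changed_keys(old_lines, new_lines):
    """Return the keys whose entries differ between the two outputs."""
    old, new = index_entries(old_lines), index_entries(new_lines)
    return sorted(
        (
            key
            for key in old.keys() | new.keys()
            if sorted(old.get(key, [])) != sorted(new.get(key, []))
        ),
        key=repr,  # repr keeps sorting safe if a field is missing (None)
    )
```

Both arguments can be open file objects, since files iterate line by line. Saving such a script under a name other than jsondiff.py avoids shadowing an installed package of that name, which is the import problem mentioned above.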

kristian-clausal requested a review from xxyzz May 13, 2024 10:51

xxyzz reviewed May 14, 2024

kristian-clausal added 7 commits May 16, 2024 07:42

Handle Chinese zh-x example template output

2afd75b

Examples: use "zh-x" as a condition for branch

1f7db5b

Add zh-x aliases

a9e2db6

More ignored etymology templates

3e7ef7b

Mostly affecting Chinese etymology template data

kristian-clausal force-pushed the zhexamples branch from 741c3e0 to 76fdba7 Compare May 16, 2024 11:09

kristian-clausal force-pushed the zhexamples branch from 76fdba7 to 822a101 Compare May 22, 2024 08:14

kristian-clausal force-pushed the zhexamples branch from 822a101 to 24b7d98 Compare May 22, 2024 08:15

kristian-clausal merged commit b972423 into master May 22, 2024
10 checks passed

kristian-clausal deleted the zhexamples branch May 22, 2024 08:22
