Dongdong She Profile
Dongdong She

@DongdongShe

Followers
605
Following
743
Statuses
63

Assistant Prof @HKUST. CS Ph.D. @ColumbiaCompSci. Security, machine learning, program analysis, fuzzing.

Joined May 2013
@DongdongShe
Dongdong She
4 months
RT @HKUSTCSE: We are recruiting! Applications including 1) a cover letter, 2) a full curriculum vitae, 3) names and contact information of…
0
10
0
@DongdongShe
Dongdong She
4 months
@whexyshi @is_eqv It seems we used the same FuzzBench script to collect code coverage and didn't compare the address of the Sancov section. Could you explain your concerns in more detail? I am happy to discuss.
1
0
0
@DongdongShe
Dongdong She
4 months
@andreafioraldi @is_eqv @laosong Yes, we do have FuzzBench-infra results that are consistent with our paper. Hope it makes the excitement great again. Please see the more detailed info below.
@DongdongShe
Dongdong She
4 months
Thanks @is_eqv for raising such an important question for fuzzing research. Open questions and discussion are key to the advancement of the field. @laosong, @andreafioraldi, I fully agree that results from the third-party FuzzBench infra are more reliable and convincing. Thankfully, we do have FuzzBench-infra results that are consistent with our paper.

Your question: Why not use FuzzBench results in the paper? Are there any hidden dirty tricks/hacks/unfair engineering tricks? How can we trust the results?

Our answer: It takes some engineering effort to build all programs automatically under the FuzzBench infra. When building the FOX prototype, we had to fix many compilation issues locally. Due to the time constraint before the paper submission deadline, we set up the fuzzing campaign locally to avoid potential build failures (all code and replication results are open-sourced). We evaluated FOX under the FuzzBench infra after the paper submission. The per-benchmark breakdown of FOX vs. AFL++ is consistent with our paper results. Please see the attached plot from these FuzzBench-infra experiments. Note that there is still a compilation issue on "proj4" under the FuzzBench infra, so the FOX result there is "nan". This build failure also hurts FOX's average rank, because of the "nan" result on "proj4".

Let me know if you have any questions or concerns about FOX. I am happy to discuss further.

We also found several fatal bugs regarding the MLFuzz evaluation and confirmed them with MLFuzz's authors. We are now discussing with MLFuzz's authors and would like to collaborate on an erratum for MLFuzz and a future adversarial collaboration. We plan to share the experience and lessons from this incident with the community later. Stay tuned.
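For reference, here is a minimal sketch of how per-benchmark coverage can be turned into average ranks, and how a single "nan" benchmark can drag a fuzzer's aggregate rank down. The table layout, fuzzer columns, and coverage numbers are hypothetical placeholders, not FuzzBench's actual report format or data.

```python
# Hypothetical per-benchmark coverage table: rows = benchmarks,
# columns = fuzzers, values = median edge coverage. "proj4" is NaN
# for fox because its build failed under the FuzzBench infra.
import pandas as pd

cov = pd.DataFrame(
    {"fox": [1200.0, 980.0, float("nan")],
     "aflplusplus": [1100.0, 1010.0, 450.0]},
    index=["libpng", "zlib", "proj4"],
)

# Rank fuzzers within each benchmark (1 = highest coverage).
# With na_option="bottom", the failed benchmark counts as a last
# place, so one build failure drags the average rank down even
# though no fuzzing result exists for it.
ranks = cov.rank(axis=1, ascending=False, na_option="bottom")
print(ranks)
print(ranks.mean())  # average rank per fuzzer across benchmarks
```

FuzzBench's own report may handle missing values differently; the point is only that a failed build can dominate the aggregate statistic.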
[Attached image]
1
7
36
@DongdongShe
Dongdong She
4 months
@laosong @is_eqv We are working on upstreaming to the AFL++ code repo now. The frontier scheduler is designed to be simple and easy to use, and does not need an external toolchain like gllvm (we spent quite some effort to make that happen). Stay tuned!
0
0
3
@DongdongShe
Dongdong She
4 months
@is_eqv Right, it could be interesting future work centering on how to solve various constraints in a principled way.
1
0
1
@DongdongShe
Dongdong She
4 months
RT @pkqzy888: @zhouxinan will present "Untangling the Knot: Breaking Access Control in Home Wireless Mesh Networks" at @acm_ccs this aftern…
0
8
0
@DongdongShe
Dongdong She
4 months
@is_eqv @aflplusplus Yes, this insight makes sense! Mutator “direction,” e.g., length checking or bitwise computation, can further boost the gradient calculation and hard-branch solving.
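To illustrate what a mutation "direction" buys, here is a minimal sketch of gradient-guided byte mutation in the NEUZZ style: the bytes with the largest gradient magnitude are nudged along the gradient sign with growing step sizes. The gradient source, step schedule, and helper names are hypothetical, not FOX's or NEUZZ's actual implementation.

```python
import numpy as np

def gradient_guided_mutants(seed: bytes, grad: np.ndarray, k: int = 8):
    """Yield mutants of `seed` by nudging the k highest-gradient bytes
    in the direction of the gradient sign (a toy NEUZZ-style step).

    `grad` is assumed to be d(branch score)/d(input byte), e.g. taken
    from a surrogate neural network; how it is obtained is out of scope.
    """
    data = np.frombuffer(seed, dtype=np.uint8).astype(np.int16)
    top = np.argsort(-np.abs(grad))[:k]      # most influential byte positions
    for step in (1, 4, 16, 64):              # growing step sizes
        mutant = data.copy()
        mutant[top] += np.sign(grad[top]).astype(np.int16) * step
        yield np.clip(mutant, 0, 255).astype(np.uint8).tobytes()

# Usage with a made-up gradient vector:
seed = b"HELLO FUZZING WORLD!"
grad = np.random.randn(len(seed))
mutants = list(gradient_guided_mutants(seed, grad))
```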
1
0
6
@DongdongShe
Dongdong She
5 months
@is_eqv A response does not mean our rebuttal has issues. Please check the latest episode 5.
@DongdongShe
Dongdong She
5 months
Ep5. MLFuzz Rebuttal. Thanks for Irina's response. We never heard back from you and @AndreasZeller since last month, when we sent our last email asking whether you were willing to write an erratum for MLFuzz acknowledging the bugs and the wrong conclusion. So I am happy to communicate with you in a public channel about this issue and clarify the misleading conclusions of your paper MLFuzz in front of the fuzzing community.

Our first email pointed out 4 bugs in MLFuzz and showed that if you fix those 4 bugs, you can successfully reproduce our results. We also provided a fixed version of your code and preliminary results on 4 FuzzBench programs. Your first response confirmed 3 bugs but refused to acknowledge the most severe one: an error in training data collection. For any ML model, garbage in, garbage out. If you manipulate the training data distribution, you can cook arbitrarily poor results for an ML model. Why are you reluctant to fix the training data collection error? Instead, you insist on running NEUZZ with the WRONG training data and cooking invalid results, even though we already notified you of this issue. We suspect that may be the only way to keep reproducing your wrong experimental results and avoid acknowledging the error in MLFuzz.

Your research conduct raises a serious question about how to properly reproduce fuzzing performance in the fuzzing community. The devil's advice: blindly, deliberately, or stealthily run a fuzzer with WRONG settings, or patch a few bugs into it, and then claim its performance does not hold? Only an ill-configured fuzzer is a good baseline fuzzer. We think a fair and scientific way to reproduce or revisit a fuzzer should run it properly, as the original paper did, rather than with free-style wrong settings and bug injections.

The fact is that you wrote buggy code (which you confirmed in the email) and cooked invalid results and wrong conclusions published in a top-tier conference, @FSEconf 2023. We wrote a rebuttal to point out 4 fatal bugs in your code and the wrong conclusions. A responsible and professional response should directly address our questions about those 4 fatal bugs and wrong conclusions. But your response discussed the inconsistent performance numbers of NEUZZ (due to a different metric choice), the benchmark, the seed corpus, and the IID issue of MLFuzz. Those are research questions about NEUZZ and MLFuzz, but they are not the topic of this post, the MLFuzz rebuttal. They can only shift the audience's attention; they cannot fix the bugs and errors in MLFuzz. I promise I will address every question in your response in a separate post on X, but not in this one. Stay tuned!

@is_eqv @moyix @thorstenholz @mboehme_
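For context on why the training data collection error matters: in a NEUZZ-style fuzzer, each training sample pairs a seed's raw bytes with the edge-coverage bitmap observed when the target runs that seed, so any mistake in gathering these pairs distorts the distribution the network learns from. Below is a minimal sketch under that assumption; `run_target_get_edges`, the map size, and the corpus layout are hypothetical placeholders, not MLFuzz's or NEUZZ's actual code.

```python
import os
import numpy as np

def run_target_get_edges(target: str, testcase_path: str) -> set:
    """Hypothetical helper: execute `target` on one test case and return
    the set of covered edge IDs (e.g. parsed from an instrumentation
    trace). The real collection mechanism is tool-specific."""
    raise NotImplementedError

def collect_training_data(target: str, corpus_dir: str, map_size: int = 65536):
    """Build (input bytes, coverage bitmap) pairs for NN training.
    If seeds are skipped or traced incorrectly here, the model is
    trained on a distorted coverage distribution: garbage in,
    garbage out."""
    xs, ys = [], []
    for name in sorted(os.listdir(corpus_dir)):
        path = os.path.join(corpus_dir, name)
        with open(path, "rb") as f:
            xs.append(np.frombuffer(f.read(), dtype=np.uint8))
        bitmap = np.zeros(map_size, dtype=np.uint8)
        bitmap[sorted(run_target_get_edges(target, path))] = 1
        ys.append(bitmap)
    return xs, ys
```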
[Attached image]
1
8
38
@DongdongShe
Dongdong She
5 months
@is_eqv @AndreasZeller Check our response here
@DongdongShe
Dongdong She
5 months
What happens if you write buggy code and misconfigure the experimental setup when evaluating a fuzzer's performance? Wrong and misleading conclusions! We found several fatal bugs and wrong experimental settings in MLFuzz (a revisit of NEUZZ published at a top-tier software engineering conference, ASE 2023, @AndreasZeller, @ASE_conf). The following bugs led to wrong and misleading conclusions in MLFuzz.

• An initialization bug ⇒ failed setup of persistent-mode fuzzing.
• A program crash ⇒ unexpected early termination of NEUZZ.
• An error in training dataset collection ⇒ a poorly trained neural network model.
• An error in result collection ⇒ an incomplete code coverage report.

We confirmed these bugs with MLFuzz's authors and wrote a rebuttal paper to explain the errors in MLFuzz and summarize the lessons for a fair and scientific fuzzing experiment/revisit:

1. Ensure the correctness of the code implementation. Careful and rigorous debugging is needed. If you would like to patch a prior work, double-check that your setting or patch is correct and seek help from the original developers if needed. MLFuzz introduced 3 implementation bugs that led to wrong experimental results and conclusions.
2. Diverse benchmark selection. Try to evaluate your fuzzer on multiple benchmarks, like FuzzBench, Magma, and UniFuzz.
3. Uniform code coverage metric. Convert different code coverage metrics, like AFL XOR hash, LLVM coverage sanitizer (pruned), LLVM coverage sanitizer (non-pruned), and AFL++ code coverage, into a uniform one by replaying (see the sketch after this list).
4. Complete test case collection. Be sure to collect all the test cases generated by the fuzzer.
5. Uniform fuzzing mode. Ensure all fuzzers are running under the same mode, either the default mode or the faster persistent mode. An apple-to-banana comparison like MLFuzz only leads to wrong conclusions.
6. Open-source your fuzzing corpus. Fuzzing is an optimization, and different seed corpora (starting points) can lead to drastically different results.
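As a concrete sketch of point 3, one way to get a uniform metric is to replay every fuzzer's corpus through a single standard coverage build, here assumed to be compiled with clang's -fprofile-instr-generate -fcoverage-mapping; the binary path, corpus directory, and target command line are hypothetical.

```python
import glob
import os
import subprocess

TARGET = "./target_cov"    # hypothetical binary built with
                           # -fprofile-instr-generate -fcoverage-mapping
CORPUS = "findings/queue"  # hypothetical corpus directory to replay
PROFDIR = "profraw"
os.makedirs(PROFDIR, exist_ok=True)

# Replay every test case through the same standard binary; each run
# emits one .profraw profile.
for i, testcase in enumerate(sorted(glob.glob(os.path.join(CORPUS, "*")))):
    env = dict(os.environ, LLVM_PROFILE_FILE=os.path.join(PROFDIR, f"{i}.profraw"))
    subprocess.run([TARGET, testcase], env=env,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

# Merge the profiles and report coverage with llvm-cov, so every
# fuzzer's corpus is measured with the exact same metric and binary.
subprocess.run(["llvm-profdata", "merge", "-sparse", "-o", "cov.profdata"]
               + glob.glob(os.path.join(PROFDIR, "*.profraw")), check=True)
subprocess.run(["llvm-cov", "report", TARGET, "-instr-profile=cov.profdata"],
               check=True)
```

Measuring every corpus on the same binary with the same llvm-cov report removes the metric discrepancy from the comparison.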
[Attached image]
0
15
73
@DongdongShe
Dongdong She
1 year
@AndreasZeller @ririnicolae @MaxCamillo @FSEconf At the same time, any WRONG conclusions from your work can be far more misleading to the ML-based fuzzing field. As a responsible researcher, could you clarify your paper's WRONG claim, "ML-based fuzzer is magnitude slower than modern fuzzer", which is caused by the file-retrieval mode?
2
0
0
@DongdongShe
Dongdong She
1 year
@AndreasZeller @ririnicolae @MaxCamillo @FSEconf In-memory mode requires setup on both the tested binary and the fuzzer. NEUZZ is a research prototype initially designed for file-retrieval mode ONLY. It would take a patch of a few lines to support in-memory fuzzing, but your implementation didn't add that.
0
0
1
@DongdongShe
Dongdong She
1 year
3. Open-source your seed corpus along with your source code. Fuzzing is a continuous optimization process. Without the same starting point (seed corpus), it's hard to reproduce the fuzzing result.
0
0
4
@DongdongShe
Dongdong She
1 year
1. Ensure all fuzzers are testing the SAME binary. Otherwise, please do a coverage replay on a standard binary before comparing the raw numbers. Be careful with the discrepancies among AFL coverage, AFL++ coverage, LLVM sanitizer coverage, and LLVM sanitizer coverage with the no-prune feature.
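If the comparison is in AFL-style edge counts rather than llvm-cov lines, the same idea applies: replay every corpus through one shared instrumented binary, for example with afl-showmap, and count the union of covered edges. The binary path and corpus layout below are hypothetical, and the exact afl-showmap options depend on the AFL/AFL++ version.

```python
import glob
import os
import subprocess

TARGET = "./target_afl"    # hypothetical binary instrumented with afl-cc
CORPUS = "findings/queue"  # hypothetical corpus to replay

covered = set()
for testcase in sorted(glob.glob(os.path.join(CORPUS, "*"))):
    # afl-showmap writes one "edge_id:hit_count" line per covered edge.
    subprocess.run(["afl-showmap", "-q", "-o", "trace.txt",
                    "--", TARGET, testcase],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    with open("trace.txt") as f:
        covered.update(line.split(":")[0] for line in f if ":" in line)

print(f"{len(covered)} edges covered on the standard binary")
```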
1
0
3