DSA

Concepts

Time complexity:

These notes analyze time complexity from the following five points:

  • What exactly is time complexity
  • What is big O
  • How different data scales compare
  • Simplifying complex expressions
  • What is the base of the log in O(log n)?

What exactly is time complexity

Time complexity is a function that qualitatively describes the running time of an algorithm.

In software development, time complexity lets us estimate roughly how long a program will take to run.

How do we estimate running time? Usually by counting the number of operation units the algorithm performs, assuming each unit costs the same amount of CPU time.

Suppose the problem size is n. The number of operation units is then some function f(n). As n grows, the growth rate of the running time matches the growth rate of f(n); this is called the asymptotic time complexity of the algorithm, time complexity for short, written O(f(n)).

What is big O

What does "big O" refer to here? When talking about time complexity everyone says O(n) or O(n^2), but few can say exactly what big O means.

The explanation given in Introduction to Algorithms: big O denotes an upper bound. When used as an upper bound on the worst-case running time of an algorithm, it bounds the running time for every input.

Introduction to Algorithms also gives an example: take insertion sort, whose time complexity we all quote as O(n^2).

The shape of the input strongly affects running time. If the data is already sorted, insertion sort runs in O(n); if the data is in reverse order, it runs in O(n^2). So over all inputs the worst case is O(n^2), which is why we say insertion sort is O(n^2).

Now look at quicksort the same way. Everyone says quicksort is O(nlogn), but when the data is already sorted quicksort degrades to O(n^2). **So strictly by the definition of big O, quicksort's time complexity should be O(n^2).**

Yet we still say quicksort is O(nlogn); that is an accepted convention in the field, where O refers to the general case rather than a strict upper bound.

What we usually care about is the general case of the data.

When you state an algorithm's time complexity in an interview, it refers to the general case. But if the interviewer digs into an implementation and its performance, always keep in mind that the time complexity differs with different input data; that point must not be forgotten.

Differences across data scales

The figure below showed how the time complexities of different algorithms diverge across input sizes.

(figure: time complexity at different data scales)

When choosing an algorithm, lower time complexity is not automatically better (the simplified complexity discards constant factors and lower-order terms); the data scale matters. If the data is small, an O(n^2) algorithm can even be more suitable than an O(n) one (when constant factors are involved).

As in the figure, before n reaches 20, O(5n^2) is clearly better than O(100n) and takes the least time.

Then why do we drop constant factors when computing time complexity, saying O(100n) is O(n), O(5n^2) is O(n^2), and taking O(n) to beat O(n^2) by default?

This again comes back to the definition of big O: big O describes the behavior once the data size passes a certain point and becomes very large — the point past which the constant factors no longer play a decisive role.

In the figure, 20 is that point: once n exceeds 20, the constant factors stop being decisive.

So the time complexities we quote drop constant factors, because we assume by default that the data scale is large enough. On that assumption, the ranking of algorithm time complexities is:

O(1) constant < O(logn) logarithmic < O(n) linear < O(nlogn) linearithmic < O(n^2) quadratic < O(n^3) cubic < O(2^n) exponential

Beware of very large constants, though: if the constant is huge, say 10^7 or 10^9, it becomes a factor you cannot ignore.

Simplifying complex expressions

Sometimes when we work out a time complexity it is not a simple O(n) or O(n^2) but a messier expression, for example:

O(2*n^2 + 10*n + 1000)

How do we describe the algorithm's time complexity then? One way is simplification.

Drop the additive constant term (a constant does not increase the number of operations as n grows):

O(2*n^2 + 10*n)

Drop the constant coefficients (why they can be dropped was explained in detail above):

O(n^2 + n)

Keep only the highest-order term, dropping the n that is one order lower (since n^2 dwarfs n at scale), which finally simplifies to:

O(n^2)

If that step feels hard to accept, you can instead factor out n to get O(n(n+1)); dropping the additive constant again gives:

O(n^2)

So in the end we say: the time complexity of this algorithm is O(n^2).

Another way to simplify: once n exceeds 40, the expression is always less than 3 × n^2, i.e. O(2 × n^2 + 10 × n + 1000) < O(3 × n^2), so after dropping the constant factor the final time complexity is again O(n^2).
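Written out, the bound used in that last step is (a quick check, not from the original notes):

\[
2n^2 + 10n + 1000 \;\le\; 3n^2 \quad \text{for all } n \ge 40,
\]
\[
\text{since } n \ge 40 \implies n^2 - 10n - 1000 \ge 1600 - 400 - 1000 \ge 0,
\]
\[
\text{hence } O(2n^2 + 10n + 1000) = O(n^2).
\]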

What is the base of the log in O(logn)?

When we say an algorithm's time complexity is logn, must that be log base 2 of n?

Not necessarily. It could be log base 10 of n, or log base 20 of n. We uniformly say logn, omitting the base.

Why is that allowed? Suppose two algorithms have complexities log base 2 of n and log base 10 of n. If you remember high-school math, it is not hard to see that log base 2 of n = (log base 2 of 10) × (log base 10 of n).

And log base 2 of 10 is a constant, and as explained above we ignore constant factors when computing time complexity.

In general, within a time-complexity calculation, log base i of n is equivalent to log base j of n up to a constant factor, so we drop the base i and simply say logn.

That should make it clear why the base is omitted.
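In formulas (the change-of-base identity, as a quick check):

\[
\log_2 n = \log_2 10 \cdot \log_{10} n,
\qquad
\log_i n = \frac{\log_j n}{\log_j i},
\]

so any two bases differ only by the constant factor \(1/\log_j i\), which big O discards.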

An example

Let's analyze time complexity through an interview question. Problem: among n strings, find the two that are identical (assume exactly one such pair exists).

With brute-force enumeration, what is the time complexity — is it O(n^2)?

Some people forget the cost of comparing strings: unlike comparing two ints, besides the n^2 pairs to examine, each string comparison costs up to m operations (m being the string length), so the time complexity is O(m × n × n).

Now consider another approach.

First sort the n strings lexicographically; afterwards the two identical strings must be adjacent, so one more pass over the n strings finds them.

What is the time complexity of that? Quicksort is O(nlogn), but each comparison now compares strings of length m, giving O(m × n × logn). The final pass over the n strings also compares strings, so the total is O(m × n × logn + n × m).

Simplifying O(m × n × logn + n × m): factor out m × n to get O(m × n × (logn + 1)), then drop the additive constant, leaving O(m × n × logn).

Clearly O(m × n × logn) beats O(m × n × n)!

So sorting the string collection first and then scanning once for the adjacent duplicate is faster than brute-force enumeration.

We arrived at that by analyzing the time complexity of the two algorithms.

This is not the optimal solution for the problem, of course; the point was only to illustrate time complexity.
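A minimal Java sketch of the sort-then-scan idea above (the method name findDuplicate and the driver are my own assumptions, not from the original notes):

import java.util.Arrays;

public class DuplicateString {
    // Sort lexicographically (O(m * n log n) character comparisons),
    // then the two equal strings must be adjacent: one O(m * n) scan finds them.
    public static String findDuplicate(String[] strs) {
        String[] sorted = Arrays.copyOf(strs, strs.length);
        Arrays.sort(sorted);
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i].equals(sorted[i - 1])) {
                return sorted[i];
            }
        }
        return null; // no duplicate found
    }

    public static void main(String[] args) {
        System.out.println(findDuplicate(new String[]{"ab", "cd", "ab", "ef"})); // ab
    }
}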

Summary

This section covered what time complexity is, what it is used for, and how the data scale affects it.

It also covered the definition of big O, which most people overlook, and the question of which base the log is taken in.

It then showed how to simplify a complicated time-complexity expression, and finally tied everything together with a concrete example.

After reading this, your understanding of time complexity should be much deeper!

Performance comparison of languages, and their internals

Since my resume says I am familiar with C, Python, and Java, I need to understand the underlying implementation and logic of these three languages, and be able to compare their running speeds.

Let's start from the underlying implementations:

Underlying implementation

JAVA

Questions

Greedy algorithm:

1. 455. Assign Cookies (Easy)

Problem: each child has a greed factor g, and each cookie has a size s. A child is content only if some cookie's size is at least the child's greed factor. Find the maximum number of children that can be made content.

  1. The cookie given to a child should be as small as possible while still satisfying that child, so the larger cookies are saved for children with larger greed factors.
  2. The child with the smallest greed factor is the easiest to satisfy, so satisfy that child first.

In the solution above, at each assignment we only pick what looks like the locally best choice, which by itself does not guarantee that this local optimum leads to a global optimum. Assume it does, and prove it by contradiction: suppose there exists a strategy strictly better than our greedy one. If no such strategy exists, the greedy strategy is the optimal strategy and its answer is the global optimum.

Proof: suppose at some step the greedy strategy gives the currently least-greedy child cookie m, where m is the smallest cookie that satisfies this child. Suppose an optimal strategy instead gives this child cookie n, with m < n. After this round, the cookies remaining under the greedy strategy include one that is at least as large as under that "optimal" strategy. Hence in all later assignments the greedy strategy can satisfy at least as many children. So no strategy beats the greedy one, i.e., the greedy strategy is optimal.

public int findContentChildren(int[] g, int[] s) {
    if (g == null || s == null) return 0;
    Arrays.sort(g);
    Arrays.sort(s);
    int gi = 0;
    int si = 0;
    while (gi < g.length && si < s.length) {
        if (s[si] >= g[gi]) {
            gi++;
            si++;
        } else {
            si++;
        }
    }
    return gi;
}

2. 435. Non-overlapping Intervals (Medium)

Given a collection of intervals intervals where intervals[i] = [starti, endi], return the minimum number of intervals to remove so that the remaining intervals do not overlap.

Example 1:

Input: intervals = [[1,2],[2,3],[3,4],[1,3]]
Output: 1
Explanation: after removing [1,3], the remaining intervals do not overlap.

Example 2:

Input: intervals = [[1,2],[1,2],[1,2]]
Output: 2
Explanation: you need to remove two [1,2]s so that the remaining intervals do not overlap.

Example 3:

Input: intervals = [[1,2],[2,3]]
Output: 0
Explanation: no interval needs removing; they already do not overlap.

Solution:

First compute the maximum number of non-overlapping intervals, then subtract that from the total count.

In each choice the interval's end matters most: the smaller the chosen end, the more room is left for later intervals, and the more intervals can be chosen afterwards.

Sort the intervals by end; each time pick the interval with the smallest end that does not overlap the previous pick. The earlier the end time, the better.

public int eraseOverlapIntervals(int[][] intervals) {
    if (intervals.length == 0) {
        return 0;
    }
    Arrays.sort(intervals, Comparator.comparingInt(o -> o[1]));
    int cnt = 1;
    int end = intervals[0][1];
    for (int i = 1; i < intervals.length; i++) {
        if (intervals[i][0] < end) {
            continue;
        }
        end = intervals[i][1];
        cnt++;
    }
    return intervals.length - cnt;
}
Arrays.sort(intervals, new Comparator<int[]>() {
    @Override
    public int compare(int[] o1, int[] o2) {
        return (o1[1] < o2[1]) ? -1 : ((o1[1] == o2[1]) ? 0 : 1);
    }
});
// This sorts the intervals by their right endpoints in ascending order, hence o1[1] < o2[1].
// return o1[1] - o2[1] would also work, but the subtraction can overflow.

3. 452. Minimum Number of Arrows to Burst Balloons

There are spherical balloons taped to a wall represented by the XY-plane. The balloons are given as an integer array points, where points[i] = [xstart, xend] means a balloon whose horizontal diameter stretches from xstart to xend. You do not know their exact y-coordinates.

Arrows can be shot straight up along the x-axis from different points. An arrow shot at position x bursts every balloon whose diameter satisfies xstart ≤ x ≤ xend. There is no limit on the number of arrows, and an arrow keeps traveling forever once shot.

Given the array points, return the minimum number of arrows needed to burst all balloons.

Example 1:

Input: points = [[10,16],[2,8],[1,6],[7,12]]
Output: 2
Explanation: the balloons can be burst with 2 arrows:
- shoot at x = 6, bursting [2,8] and [1,6];
- shoot at x = 11, bursting [10,16] and [7,12].

The idea is basically the same as in the previous problem: find the maximum number of non-overlapping intervals, then shoot that many arrows. Mind the boundary values: touching endpoints count as overlapping here.

class Solution {
    class myComparator implements Comparator<int[]> {
        @Override
        public int compare(int[] o1, int[] o2) {
            // return o1[1] - o2[1] could overflow int, so compare explicitly
            return (o1[1] < o2[1]) ? -1 : ((o1[1] == o2[1]) ? 0 : 1);
        }
    }

    public int findMinArrowShots(int[][] points) {
        if (points == null) {
            return 0;
        }
        Arrays.sort(points, new Solution.myComparator());
        int cnt = 1;
        int end = points[0][1];
        for (int i = 1; i < points.length; i++) {
            if (points[i][0] <= end) {
                continue;
            }
            cnt++;
            end = points[i][1];
        }
        return cnt;
    }
}

4. 406. Queue Reconstruction by Height(Medium)

Suppose you have a scrambled queue of people, and an array people describes their attributes (not necessarily in order). Each people[i] = [hi, ki] means the i-th person has height hi and there are exactly ki people in front of them whose height is greater than or equal to hi.

Reconstruct and return the queue represented by the input array people. The returned queue is formatted as an array queue, where queue[j] = [hj, kj] are the attributes of the j-th person (queue[0] is the person at the front).

Example 1:

Input: people = [[7,0],[4,4],[7,1],[5,0],[6,1],[5,2]]
Output: [[5,0],[7,0],[5,2],[6,1],[4,4],[7,1]]

Idea: so that an insertion never disturbs the already-placed people, taller people must be inserted first; otherwise a shorter person correctly placed at position k could get pushed to position k+1.

Sort by height h descending and by count k ascending, then insert each person at position k of the queue.

Solution:

class Solution {
    class myComparator implements Comparator<int[]> {
        @Override
        public int compare(int[] o1, int[] o2) {
            if (o1[0] != o2[0]) {
                return o2[0] - o1[0];
            } else {
                return o1[1] - o2[1];
            }
        }
    }

    public int[][] reconstructQueue(int[][] people) {
        if (people == null || people.length == 0 || people[0].length == 0) {
            return new int[0][0];
        }
        Arrays.sort(people, new myComparator()); // equals Arrays.sort(people, (a, b) -> (a[0] == b[0] ? a[1] - b[1] : b[0] - a[0]));
        List<int[]> queue = new ArrayList<>();
        for (int[] p : people) {
            queue.add(p[1], p);
        }
        return queue.toArray(new int[queue.size()][]);
    }
}

5. 121. Best Time to Buy and Sell Stock (Easy)

Given an array prices where prices[i] is the price of a given stock on day i.

You may pick one day to buy the stock and a different day in the future to sell it. Design an algorithm that computes the maximum profit you can make.

Return the maximum profit from this one transaction. If no profit is possible, return 0.

Example 1:

Input: [7,1,5,3,6,4]
Output: 5
Explanation: buy on day 2 (price = 1) and sell on day 5 (price = 6); maximum profit = 6 - 1 = 5. Note the profit cannot be 7 - 1 = 6, because the sell must come after the buy; you cannot sell before buying.

Idea:

Track the minimum price seen so far as the buy price, treat the current price as the sell price, and check whether the current profit is the best so far.

Solution:

public int maxProfit(int[] prices) {
    int n = prices.length;
    if (n == 0) return 0;
    int soFarMin = prices[0];
    int max = 0;
    for (int i = 1; i < n; i++) {
        if (soFarMin > prices[i]) soFarMin = prices[i];
        else max = Math.max(max, prices[i] - soFarMin);
    }
    return max;
}

6. 122. Best Time to Buy and Sell Stock II (Easy)

Given an integer array prices where prices[i] is the price of some stock on day i.

On each day you may decide to buy and/or sell the stock. You can hold at most one share at any time, and you may buy and then sell on the same day.

Return the maximum profit you can achieve.

Input: prices = [7,1,5,3,6,4]
Output: 7
Explanation: buy on day 2 (price = 1), sell on day 3 (price = 5), profit 5 - 1 = 4. Then buy on day 4 (price = 3), sell on day 5 (price = 6), profit 6 - 3 = 3. Total profit 4 + 3 = 7.

Idea:

Multiple transactions are allowed, but they must not overlap.

For [a, b, c, d] with a <= b <= c <= d, the maximum profit is d - a, and d - a = (d - c) + (c - b) + (b - a). So whenever prices[i] - prices[i-1] > 0, add prices[i] - prices[i-1] to the profit.

Solution:

class Solution {
    public int maxProfit(int[] prices) {
        int profit = 0;
        for (int i = 1; i < prices.length; i++) {
            if (prices[i] > prices[i - 1]) {
                profit += (prices[i] - prices[i - 1]);
            }
        }
        return profit;
    }
}

Linked List

What is a linked list? It is a linear structure whose nodes are chained together by pointers. Each node has two parts: a data field and a pointer field (holding the pointer to the next node). The pointer field of the last node points to null.

The entry node of the list is called the head.

Having covered that, let's talk about how a linked list is laid out in memory.

An array is contiguous in memory, but a linked list is not.

A linked list links its nodes, scattered across memory, through the pointers in each node.

So the nodes of a linked list are not contiguous in memory; they sit at various addresses, with the allocation determined by the operating system's memory management.

Consider a list whose start node is 2 and end node is 7: the nodes sit at different addresses and are chained together by pointers.

public class ListNode {
    // the node's value
    int val;

    // the next node
    ListNode next;

    // constructor (no arguments)
    public ListNode() {
    }

    // constructor (one argument)
    public ListNode(int val) {
        this.val = val;
    }

    // constructor (two arguments)
    public ListNode(int val, ListNode next) {
        this.val = val;
        this.next = next;
    }
}

JavaScript:

class ListNode {
    val;
    next = null;
    constructor(value) {
        this.val = value;
        this.next = null;
    }
}

203. Remove Linked List Elements

Given the head of a linked list and an integer val, remove all nodes satisfying Node.val == val and return the new head.

Example 1:

Input: head = [1,2,6,3,4,5,6], val = 6
Output: [1,2,3,4,5]

Solution 1. Use a dummy node. The dummy lets us remove the head node the same way as any other node; every node is deleted via its predecessor.

public ListNode removeElements(ListNode head, int val) {
    ListNode dummy = new ListNode(-1);
    dummy.next = head;
    ListNode pre = dummy;
    ListNode cur = head;
    while (cur != null) {
        if (cur.val == val) {
            pre.next = cur.next;
        } else {
            pre = cur;
        }
        cur = cur.next;
    }
    return dummy.next;
}

Solution 2. Without a dummy node: first advance head past any leading nodes that match val, then delete the remaining matches in place.

public ListNode removeElements(ListNode head, int val) {
    while (head != null && head.val == val) {
        head = head.next; // keep moving head forward while it matches
    }
    if (head == null) {
        return head;
    }
    ListNode pre = head;
    ListNode cur = head.next;
    while (cur != null) {
        if (cur.val == val) {
            pre.next = cur.next;
        } else {
            pre = cur;
        }
        cur = cur.next;
    }
    return head;
}

707. Design MyLinkedList

Solution 1

Singly linked list

// singly linked list
class ListNode {
    int val;
    ListNode next;
    ListNode() {}
    ListNode(int val) {
        this.val = val;
    }
}

class MyLinkedList {
    // size stores the number of elements in the list
    int size;
    // dummy head node
    ListNode head;

    // initialize the list
    public MyLinkedList() {
        size = 0;
        head = new ListNode(0);
    }

    // get the value of the index-th node; index starts at 0, and index 0 is the first real node
    public int get(int index) {
        // return -1 for an invalid index
        if (index < 0 || index >= size) {
            return -1;
        }
        ListNode currentNode = head;
        // there is a dummy head, so step to the (index+1)-th node
        for (int i = 0; i <= index; i++) {
            currentNode = currentNode.next;
        }
        return currentNode.val;
    }

    // insert at the front of the list, i.e. before element 0
    public void addAtHead(int val) {
        addAtIndex(0, val);
    }

    // alternative addAtHead that does not go through addAtIndex
    // (commented out: keeping both would duplicate the method signature):
    // public void addAtHead(int val) {
    //     size++;
    //     ListNode node = new ListNode(val);
    //     if (head.next == null) {
    //         head.next = node;
    //     } else {
    //         ListNode temp = head.next;
    //         head.next = node;
    //         node.next = temp;
    //     }
    // }

    // append at the end of the list, i.e. insert before position (size)
    public void addAtTail(int val) {
        addAtIndex(size, val);
    }

    // insert before the index-th node; if index is 0 the new node becomes the new head.
    // If index equals the length of the list, the new node becomes the tail.
    // If index is greater than the length, do nothing.
    public void addAtIndex(int index, int val) {
        if (index > size) {
            return;
        }
        if (index < 0) {
            index = 0;
        }
        size++;
        // find the predecessor of the insertion point
        ListNode pred = head;
        for (int i = 0; i < index; i++) {
            pred = pred.next;
        }
        ListNode toAdd = new ListNode(val);
        toAdd.next = pred.next;
        pred.next = toAdd;
    }

    // delete the index-th node
    public void deleteAtIndex(int index) {
        if (index < 0 || index >= size) {
            return;
        }
        size--;
        if (index == 0) {
            head = head.next; // the removed node's slot becomes the new dummy
            return;
        }
        ListNode pred = head;
        for (int i = 0; i < index; i++) {
            pred = pred.next;
        }
        pred.next = pred.next.next;
    }
}

Solution 2

Doubly linked list

// doubly linked list
class ListNode {
    int val;
    ListNode next, prev;
    ListNode() {};
    ListNode(int val) {
        this.val = val;
    }
}


class MyLinkedList {

    // number of elements in the list
    int size;
    // dummy head and tail nodes
    ListNode head, tail;

    public MyLinkedList() {
        // initialization
        this.size = 0;
        this.head = new ListNode(0);
        this.tail = new ListNode(0);
        // this step is crucial; without it, adding at the head hits a null.next error!
        head.next = tail;
        tail.prev = head;
    }

    public int get(int index) {
        // validate index
        if (index < 0 || index >= size) {
            return -1;
        }
        ListNode cur = this.head;
        // walk from whichever end is closer
        if (index >= size / 2) {
            // start from the tail
            cur = tail;
            for (int i = 0; i < size - index; i++) {
                cur = cur.prev;
            }
        } else {
            for (int i = 0; i <= index; i++) {
                cur = cur.next;
            }
        }
        return cur.val;
    }

    public void addAtHead(int val) {
        // equivalent to inserting before element 0
        addAtIndex(0, val);
    }

    public void addAtTail(int val) {
        // equivalent to inserting before the end (null)
        addAtIndex(size, val);
    }

    public void addAtIndex(int index, int val) {
        // index beyond the length of the list
        if (index > size) {
            return;
        }
        // index less than 0
        if (index < 0) {
            index = 0;
        }
        size++;
        // find the predecessor
        ListNode pre = this.head;
        for (int i = 0; i < index; i++) {
            pre = pre.next;
        }
        // create the new node
        ListNode newNode = new ListNode(val);
        newNode.next = pre.next;
        pre.next.prev = newNode;
        newNode.prev = pre;
        pre.next = newNode;
    }

    public void deleteAtIndex(int index) {
        // validate index
        if (index < 0 || index >= size) {
            return;
        }
        // delete
        size--;
        ListNode pre = this.head;
        for (int i = 0; i < index; i++) {
            pre = pre.next;
        }
        pre.next.next.prev = pre;
        pre.next = pre.next.next;
    }
}

160. Intersection of Two Linked Lists — the simplest solution

public ListNode getIntersectionNode(ListNode headA, ListNode headB) {
    ListNode l1 = headA, l2 = headB;
    while (l1 != l2) {
        l1 = (l1 == null) ? headB : l1.next;
        l2 = (l2 == null) ? headA : l2.next;
    }
    return l1;
}

206. Reverse Linked List

// recursive version
public ListNode reverseList(ListNode head) {
    return reverse(null, head);
}

private ListNode reverse(ListNode pre, ListNode cur) {
    if (cur == null) {
        return pre;
    }
    ListNode temp = cur.next;
    cur.next = pre;
    return reverse(cur, temp);
}

// recursion from the back forward
ListNode reverseList(ListNode head) {
    // edge-case checks
    if (head == null) return null;
    if (head.next == null) return head;

    // recursively reverse the list starting from the second node
    ListNode last = reverseList(head.next);
    // flip the link between the head and the second node
    head.next.next = head;
    // head is now the tail node, so its next must point to null
    head.next = null;
    return last;
}


Trees

leetcode 104


Double pointers

leetcode 167

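The original notes for this problem were images that did not survive. As a reconstruction (my own sketch, not the original notes' code), a minimal two-pointer solution for LeetCode 167 (two sum on a sorted array, 1-based result indices):

public int[] twoSum(int[] numbers, int target) {
    int i = 0, j = numbers.length - 1;
    while (i < j) {
        int sum = numbers[i] + numbers[j];
        if (sum == target) return new int[]{i + 1, j + 1};
        if (sum < target) i++; // need a larger sum: advance the left pointer
        else j--;              // need a smaller sum: retreat the right pointer
    }
    return new int[0]; // the problem guarantees an answer exists
}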

Sorting


Arrays

leetcode 704

We define target to be inside a left-closed, right-closed interval, i.e. [left, right] (this is very, very important).

That interval definition determines how the binary-search code is written. Because target lives in [left, right]:

  • while (left <= right) must use <=, since left == right is still meaningful here
  • if (nums[middle] > target), right is set to middle - 1, because nums[middle] is certainly not the target, so the right end of the next search interval is middle - 1
public int search(int[] nums, int target) {
    int left = 0, right = nums.length - 1;
    while (left <= right) {
        int mid = left + ((right - left) >> 1);
        if (target > nums[mid]) left = mid + 1;
        else if (target < nums[mid]) right = mid - 1;
        else return mid;
    }
    return -1;
}

leetcode 27

Fast and slow pointers

Brute force

0,1,2,3,3,0,4,2

0,1,3,3,0,4,2,[2]

0,1,3,3,0,4,[2,2]

1,3,3,4,5 i=1 result = 5

1,3,4,5,[5] i=1 result = 4

public int removeElement(int[] nums, int val) {
    int result = nums.length;
    for (int i = 0; i < result; i++) {
        if (nums[i] == val) {
            for (int j = i + 1; j < result; j++) {
                nums[j - 1] = nums[j];
            }
            result--;
            i--;
        }
    }
    return result;
}

Fast and slow pointer


public int removeElement(int[] nums, int val) {
    // fast and slow pointers
    int slowIndex = 0;
    for (int fastIndex = 0; fastIndex < nums.length; fastIndex++) {
        if (nums[fastIndex] != val) {
            nums[slowIndex] = nums[fastIndex];
            slowIndex++;
        }
    }
    return slowIndex;
}
// two pointers moving toward each other
class Solution {
    public int removeElement(int[] nums, int val) {
        int left = 0;
        int right = nums.length - 1;
        while (right >= 0 && nums[right] == val) right--; // move right to the first value != val from the right
        while (left <= right) {
            if (nums[left] == val) { // the element at left must be removed
                // overwrite left with the element at right, then drop position right
                nums[left] = nums[right];
                right--;
            }
            left++;
            while (right >= 0 && nums[right] == val) right--;
        }
        return left;
    }
}

Leetcode 977

Given an integer array nums sorted in non-decreasing order, return a new array of the squares of each number, also sorted in non-decreasing order.

Input: nums = [-4,-1,0,3,10]
Output: [0,1,9,16,100]
Explanation: after squaring, the array becomes [16,1,0,9,100];
after sorting, it becomes [0,1,9,16,100]
public int[] sortedSquares(int[] nums) {
    int right = nums.length - 1;
    int left = 0;
    // construct a new array to store the result, otherwise it would cost O(n^2) in time complexity
    int[] result = new int[nums.length];
    int index = result.length - 1;
    while (left <= right) {
        if (nums[left] * nums[left] > nums[right] * nums[right]) {
            result[index] = nums[left] * nums[left];
            index--;
            left++;
        } else {
            result[index] = nums[right] * nums[right];
            index--;
            right--;
        }
    }
    return result;
}

Leetcode 209

Given an array of n positive integers and a positive integer target.

Find the minimal length of a contiguous subarray [nums_l, nums_l+1, …, nums_r-1, nums_r] whose sum is ≥ target, and return its length. If there is no such subarray, return 0.

Example 1:

Input: target = 7, nums = [2,3,1,2,4,3]
Output: 2
Explanation: the subarray [4,3] has the minimal length under the constraint.

Example 2:

Input: target = 4, nums = [1,4,4]
Output: 1

Example 3:

Input: target = 11, nums = [1,1,1,1,1,1,1,1]
Output: 0

Brute force

class Solution {
public:
    int minSubArrayLen(int s, vector<int>& nums) {
        int result = INT32_MAX; // the final answer
        int sum = 0;            // sum of the current subarray
        int subLength = 0;      // length of the current subarray
        for (int i = 0; i < nums.size(); i++) { // i is the start of the subarray
            sum = 0;
            for (int j = i; j < nums.size(); j++) { // j is the end of the subarray
                sum += nums[j];
                if (sum >= s) { // once the sum reaches s, update result
                    subLength = j - i + 1; // length of this subarray
                    result = result < subLength ? result : subLength;
                    break; // we want the shortest qualifying subarray, so break as soon as one qualifies
                }
            }
        }
        // if result was never updated, return 0: no qualifying subarray exists
        return result == INT32_MAX ? 0 : result;
    }
};

Sliding window:

public int minSubArrayLen(int s, int[] nums) {
    int left = 0;
    int sum = 0;
    int result = Integer.MAX_VALUE;
    for (int right = 0; right < nums.length; right++) {
        sum += nums[right];
        while (sum >= s) {
            result = Math.min(result, right - left + 1);
            sum -= nums[left];
            left++;
        }
    }
    return result == Integer.MAX_VALUE ? 0 : result;
}


Leetcode 59

Mind the boundary conditions and stick to the left-closed, right-open convention.

The loop runs n/2 times.

public int[][] generateMatrix(int n) {
    int start = 0;
    int count = 1;
    int loop = 0;
    int[][] nums = new int[n][n];
    int i, j;
    while (loop++ < n / 2) {
        for (j = start; j < n - loop; j++) {
            nums[start][j] = count++;
        }
        for (i = start; i < n - loop; i++) {
            nums[i][j] = count++;
        }
        for (; j >= loop; j--) {
            nums[i][j] = count++;
        }
        for (; i >= loop; i--) {
            nums[i][j] = count++;
        }
        start++;
    }
    if (n % 2 == 1) {
        nums[n / 2][n / 2] = n * n;
    }
    return nums;
}

Summary of Array

  • Array indices start at 0.
  • An array occupies a contiguous block of memory addresses.

Precisely because arrays are contiguous in memory, deleting or inserting an element inevitably means moving the other elements around.

A Java example: int[][] rating = new int[3][4]; — this 2D array is not one contiguous 3×4 block of addresses.

A Java 2D array is not a contiguous 3×4 address space; it is made up of several separate contiguous rows!
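A small illustration of this point (my own snippet, not from the original notes): each row of a Java 2D array is an independent object on the heap, so rows can even be replaced or sized independently.

public class TwoDArrayLayout {
    public static void main(String[] args) {
        int[][] rating = new int[3][4];
        // Each row is a separate int[] object; the outer array stores references.
        System.out.println(rating[0] == rating[1]); // false: distinct row objects
        rating[0] = new int[7];                     // rows may be swapped independently
        System.out.println(rating[0].length);      // 7
        System.out.println(rating[1].length);      // 4
    }
}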

Hash table

Hash collisions

When, say, Xiao Li and Xiao Wang both map to index 1, that phenomenon is called a hash collision.

There are two common ways to resolve hash collisions: chaining and linear probing.

Chaining

Xiao Li and Xiao Wang collided at index 1, so the colliding elements are stored in a linked list at that slot. We can then still reach both of them through the index.

(Call the data size dataSize and the table size tableSize.)

Chaining comes down to picking an appropriate table size: not so large that empty slots waste a lot of memory, and not so small that long chains waste too much time in lookups.

Linear probing

With linear probing, tableSize must be greater than dataSize, because we rely on empty slots in the table to resolve collisions.

For example, if the colliding slot already holds Xiao Li, we walk down to find an empty slot for Xiao Wang. That is why tableSize must exceed dataSize; otherwise there would be no empty slots left for the colliding data.

There is plenty more detail to hash collisions; dig into it if you are interested, but we won't belabor it here.

A hash table trades space for time.
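To make the chaining idea concrete, a minimal sketch of a separate-chaining table (my own illustration; the class and method names are made up, not a library API):

import java.util.LinkedList;

public class ChainedHashTable {
    private final LinkedList<int[]>[] buckets; // each entry is {key, value}

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int tableSize) {
        buckets = new LinkedList[tableSize];
        for (int i = 0; i < tableSize; i++) buckets[i] = new LinkedList<>();
    }

    private int index(int key) {
        return Math.floorMod(key, buckets.length); // hash: key mod tableSize
    }

    public void put(int key, int value) {
        for (int[] e : buckets[index(key)]) {
            if (e[0] == key) { e[1] = value; return; } // update an existing key
        }
        buckets[index(key)].add(new int[]{key, value}); // collision: append to the chain
    }

    public Integer get(int key) {
        for (int[] e : buckets[index(key)]) { // scan only the chain at this slot
            if (e[0] == key) return e[1];
        }
        return null;
    }
}

With a sensible tableSize the chains stay short, so put/get stay close to O(1) — the space-for-time trade mentioned above.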

Leetcode 242

Given two strings s and t, write a function to determine whether t is an anagram of s.

Example 1: Input: s = "anagram", t = "nagaram" Output: true

Example 2: Input: s = "rat", t = "car" Output: false

Note: you may assume the strings contain only lowercase letters.

public boolean isAnagram(String s, String t) {
    Map<Character, Integer> map = new HashMap<>();
    for (char c : s.toCharArray()) {
        map.put(c, map.getOrDefault(c, 0) + 1);
    }
    for (int i = 0; i < t.length(); i++) {
        char c = t.charAt(i);
        int count = map.getOrDefault(c, 0);
        if (count == 0) {
            return false;
        }
        if (count == 1) {
            map.remove(c);
        } else {
            map.put(c, count - 1);
        }
    }
    return map.isEmpty();
}

A simpler approach uses an array.

An array is really just a primitive hash table. Since the strings here contain only lowercase letters, we can define an array to count how often each character appears in s. How big? An array called record of size 26 suffices, initialized to 0, because the ASCII codes of 'a' through 'z' are 26 consecutive values.

Define the array record to count the occurrences of the characters in s.

We need to map characters onto array (i.e., hash table) indices: because 'a' through 'z' are 26 consecutive ASCII values, 'a' maps to index 0 and 'z' maps to index 25.

While scanning s, just increment the element at position s[i] - 'a'. We never need to know the actual ASCII value of 'a'; a relative offset is all that is required. That tallies the character counts of s.

To check the characters of t, scan t and decrement the counter at the corresponding index.

Finally, if any element of record is non-zero, then s and t differ in some character's count: return false.

/**
 * 242. Valid Anagram — counting-array solution
 * Time complexity O(m+n), space complexity O(1)
 */
class Solution {
    public boolean isAnagram(String s, String t) {
        int[] record = new int[26];

        for (int i = 0; i < s.length(); i++) {
            record[s.charAt(i) - 'a']++; // no need to know the ASCII of 'a'; a relative offset suffices
        }

        for (int i = 0; i < t.length(); i++) {
            record[t.charAt(i) - 'a']--;
        }

        for (int count : record) {
            if (count != 0) { // a non-zero entry means s and t differ in some character's count
                return false;
            }
        }
        return true; // all entries are zero, so s and t are anagrams
    }
}

Leetcode 349

public int[] intersection(int[] nums1, int[] nums2) {
    if (nums1 == null || nums2 == null || nums1.length == 0 || nums2.length == 0) {
        return new int[0];
    }
    Set<Integer> set = new HashSet<>();
    Set<Integer> resSet = new HashSet<>();
    for (int i : nums1) {
        set.add(i);
    }
    for (int i : nums2) {
        if (set.contains(i)) {
            resSet.add(i);
        }
    }
    int[] arr = new int[resSet.size()];
    int j = 0;
    for (int i : resSet) {
        arr[j++] = i;
    }
    return arr;
}

Leetcode 202 Happy Number

Consider three hypothetical outcomes:

  1. Eventually n1 -> n2 -> … -> 1: true
  2. Eventually it cycles, n1 -> n2 -> … -> n1: false
  3. The values keep growing

In fact the values cannot keep growing. Take 9s as an example: 9^2 = 81, 99 → 162, …, 999 → 243, 9999 → 324, 99999 → 405 — the digit-square sum stays far below the number itself, so the sequence cannot grow without bound. Only the first two cases remain: stop and return as soon as a repeated value shows up, and stop and return when 1 shows up.

public boolean isHappy(int n) {
    Set<Integer> record = new HashSet<>();
    while (n != 1 && !record.contains(n)) {
        record.add(n);
        n = getNextNumber(n);
    }
    return n == 1;
}

private int getNextNumber(int n) {
    int res = 0;
    while (n > 0) {
        int temp = n % 10;
        res += temp * temp;
        n = n / 10;
    }
    return res;
}

Leetcode 1 Two Sum

Brute force:

public int[] twoSum(int[] nums, int target) {
    for (int i = 0; i < nums.length; i++) {
        for (int j = 0; j < nums.length; j++) {
            if (i == j) {
                continue;
            }
            if ((nums[i] + nums[j]) == target) {
                return new int[]{i, j};
            }
        }
    }
    return null;
}

HashMap:

There are four key points to this solution:

  • Why think of a hash table at all
  • Why a map specifically
  • What the map stores in this problem
  • What the key and the value each store

The map stores the elements we have already visited: while iterating over the array, we must remember which elements we have seen and at which indices, so we can find the one that pairs with the current element (i.e., that sums to target).

Next, what do the key and value represent?

For a given element we need to ask whether it has appeared before, and if so, at which index.

Since we look elements up by value, the element itself must be the key; the value then stores the index.

So the map's structure is {key: array element, value: that element's index}.

While iterating, just query the map for the value that complements the current element. If it is there, we have found the matching pair; if not, insert the current element into the map, since the map holds the elements visited so far.

public int[] twoSum(int[] nums, int target) {
    int[] res = new int[2];
    if (nums == null || nums.length == 0) {
        return res;
    }
    Map<Integer, Integer> map = new HashMap<>();
    for (int i = 0; i < nums.length; i++) {
        int temp = target - nums[i];
        if (map.containsKey(temp)) {
            res[0] = i;
            res[1] = map.get(temp);
            break;
        }
        map.put(nums[i], i);
    }
    return res;
}
Leetcode 454 4Sum II

public int fourSumCount(int[] nums1, int[] nums2, int[] nums3, int[] nums4) {
    Map<Integer, Integer> map = new HashMap<>();
    int temp;
    int res = 0;
    for (int i : nums1) {
        for (int j : nums2) {
            temp = i + j;
            if (map.containsKey(temp)) {
                map.put(temp, map.get(temp) + 1);
            } else {
                map.put(temp, 1);
            }
        }
    }
    for (int i : nums3) {
        for (int j : nums4) {
            temp = i + j;
            if (map.containsKey(0 - temp)) {
                res += map.get(0 - temp);
            }
        }
    }
    return res;
}

String questions

344. Reverse String

Use two pointers, left and right.

Swap left and right in a loop until left >= right.

public void reverseString(char[] s) {
    int i = 0;
    int j = s.length - 1;
    while (i < j) {
        swap(s, i, j);
        i++;
        j--;
    }
}

private void swap(char[] s, int i, int j) {
    char temp = s[i];
    s[i] = s[j];
    s[j] = temp;
}

541. Reverse String II

Given a string s and an integer k, reverse the first k characters of every 2k characters counting from the start of the string.

If fewer than k characters remain, reverse all of them.
If fewer than 2k but at least k characters remain, reverse the first k and leave the rest as they are.

public String reverseStr(String s, int k) {
    int length = s.length();
    StringBuilder res = new StringBuilder();
    int count = 0;
    int remain = length;

    while (remain >= 2 * k) {
        res.append(reverse(s, 0 + 2 * k * count, k + 2 * k * count));
        res.append(s.substring(k + 2 * k * count, 2 * k + 2 * k * count));
        remain -= 2 * k;
        count++;
    }

    if (remain >= k) {
        res.append(reverse(s, 0 + 2 * k * count, k + 2 * k * count));
        res.append(s.substring(k + 2 * k * count));
    } else {
        String sub = s.substring(0 + 2 * k * count);
        res.append(reverse(sub, 0, sub.length()));
    }

    return res.toString();
}

private String reverse(String s, int i, int j) {
    char[] a = s.substring(i, j).toCharArray();
    int left = 0;
    int right = a.length - 1;

    while (left < right) {
        swap(a, left, right);
        left++;
        right--;
    }

    return new String(a);
}

private void swap(char[] s, int i, int j) {
    char temp = s[i];
    s[i] = s[j];
    s[j] = temp;
}

This problem is mainly about handling the boundaries carefully; nothing else is tricky.

剑指offer05

public String replaceSpace(String s) {
    StringBuilder sb = new StringBuilder();
    char[] ch = s.toCharArray();
    for (char c : ch) {
        if (c == 32) { // 32 is the ASCII code for a space
            sb.append("%20");
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}

A two-pointer version — quite clever

public String replaceSpace(String s) {
    if (s == null || s.length() == 0) {
        return s;
    }
    // grow the string: each space needs 2 extra characters ("%20" replaces " ")
    StringBuilder str = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) == ' ') {
            str.append("  ");
        }
    }
    // no spaces: return directly
    if (str.length() == 0) {
        return s;
    }
    // there are spaces: define two pointers
    int left = s.length() - 1; // left pointer: last position of the original string
    s += str.toString();
    int right = s.length() - 1; // right pointer: last position of the expanded string
    char[] chars = s.toCharArray();
    while (left >= 0) {
        if (chars[left] == ' ') {
            chars[right--] = '0';
            chars[right--] = '2';
            chars[right] = '%';
        } else {
            chars[right] = chars[left];
        }
        left--;
        right--;
    }
    return new String(chars);
}

151. Reverse Words in a String

My solution: use a stack

public String reverseWords(String s) {
    String[] words = s.split("\\s+");
    Stack<String> stack = new Stack<>();
    for (String word : words) {
        stack.push(word);
    }
    StringBuilder sb = new StringBuilder();
    while (!stack.isEmpty()) {
        sb.append(stack.pop());
        sb.append(" ");
    }
    return sb.toString().trim();
}
public String reverseWords(String s) {
    // 1. remove leading, trailing, and duplicate inner spaces
    StringBuilder sb = removeSpace(s);
    // 2. reverse the whole string
    reverseString(sb, 0, sb.length() - 1);
    // 3. reverse each word
    reverseEachWord(sb);
    return sb.toString();
}

private StringBuilder removeSpace(String s) {
    int start = 0;
    int end = s.length() - 1;
    while (s.charAt(start) == ' ') start++;
    while (s.charAt(end) == ' ') end--; // these two lines strip leading and trailing spaces
    StringBuilder sb = new StringBuilder();
    while (start <= end) {
        char c = s.charAt(start);
        if (c != ' ' || sb.charAt(sb.length() - 1) != ' ') {
            sb.append(c);
        }
        start++;
    }
    return sb;
}

/**
 * reverse the characters of sb in the range [start, end]
 */
public void reverseString(StringBuilder sb, int start, int end) {
    while (start < end) {
        char temp = sb.charAt(start);
        sb.setCharAt(start, sb.charAt(end));
        sb.setCharAt(end, temp);
        start++;
        end--;
    }
}

private void reverseEachWord(StringBuilder sb) {
    int start = 0;
    int end = 1;
    int n = sb.length();
    while (start < n) {
        while (end < n && sb.charAt(end) != ' ') {
            end++;
        }
        reverseString(sb, start, end - 1);
        start = end + 1;
        end = start + 1;
    }
}

剑指offer58 II

Basic solution:

public String reverseLeftWords(String s, int n) {
    char[] ss = s.toCharArray();
    StringBuilder sb = new StringBuilder();
    for (int i = n; i < s.length(); i++) {
        sb.append(ss[i]);
    }
    for (int i = 0; i < n; i++) {
        sb.append(ss[i]);
    }
    return sb.toString();
}

Solution 2, saving space

class Solution {
    public String reverseLeftWords(String s, int n) {
        char[] chars = s.toCharArray();
        reverse(chars, 0, chars.length - 1);
        reverse(chars, 0, chars.length - 1 - n);
        reverse(chars, chars.length - n, chars.length - 1);
        return new String(chars);
    }

    public void reverse(char[] chars, int left, int right) {
        while (left < right) {
            // swap via XOR, using no extra variable
            chars[left] ^= chars[right];
            chars[right] ^= chars[left];
            chars[left] ^= chars[right];
            left++;
            right--;
        }
    }
}

leetcode 28

Solution 1:

public int strStr(String haystack, String needle) {
    if (needle.length() > haystack.length()) {
        return -1;
    }
    if (haystack.equals(needle)) {
        return 0;
    }
    char[] s1 = haystack.toCharArray();
    int nLen = needle.length();
    for (int i = 0; i <= (s1.length - nLen); i++) {
        if (s1[i] == needle.charAt(0)) {
            if (haystack.substring(i, i + nLen).equals(needle)) {
                return i;
            }
        }
    }
    return -1;
}

Solution 2: KMP

class Solution {
    public int strStr(String haystack, String needle) {
        if (needle.length() == 0) return 0;
        int[] next = new int[needle.length()];
        getNext(next, needle);

        int j = 0;
        for (int i = 0; i < haystack.length(); i++) {
            while (j > 0 && needle.charAt(j) != haystack.charAt(i))
                j = next[j - 1];
            if (needle.charAt(j) == haystack.charAt(i))
                j++;
            if (j == needle.length())
                return i - needle.length() + 1;
        }
        return -1;
    }

    private void getNext(int[] next, String s) {
        int j = 0;
        next[0] = 0;
        for (int i = 1; i < s.length(); i++) {
            while (j > 0 && s.charAt(j) != s.charAt(i))
                j = next[j - 1];
            if (s.charAt(j) == s.charAt(i))
                j++;
            next[i] = j;
        }
    }
}

How the next array is computed

Build a getNext method that takes the next array and the pattern string as parameters:

1. Initialize.

2. Handle the case where the prefix and suffix characters differ.

3. Handle the case where the prefix and suffix characters match.

4. Update the next array.

void getNext(int[] next, String s) {
    int j = 0;
    next[0] = 0;
    for (int i = 1; i < s.length(); i++) {
        // on a mismatch, fall back along the prefix table
        while (j > 0 && s.charAt(i) != s.charAt(j)) {
            j = next[j - 1];
        }
        // on a match, extend the common prefix-suffix
        if (s.charAt(i) == s.charAt(j)) {
            j++;
        }
        next[i] = j;
    }
}

Leetcode 459

Repeated Substring Pattern

KMP-based solution: build the next (prefix) array for the whole string to get the length of its longest equal prefix-suffix. Subtracting that length from the string length gives the length of the repeating unit; if the string is built by repeating that unit, the unit's length divides the string's length evenly.

class Solution {
    public boolean repeatedSubstringPattern(String s) {
        int len = s.length();
        if (s.equals("")) return false;
        int j = 0;
        int[] next = new int[len];
        next[0] = 0;
        for (int i = 1; i < len; i++) {
            while (j > 0 && s.charAt(j) != s.charAt(i)) {
                j = next[j - 1];
            }
            if (s.charAt(j) == s.charAt(i)) {
                j++;
            }
            next[i] = j;
        }
        if (len % (len - next[len - 1]) == 0 && next[len - 1] > 0) {
            return true;
        }
        return false;
    }
}

Stacks and Queues

leetcode 232

Implement a queue using stacks

import java.util.Stack;

class MyQueue {
    Stack<Integer> s1;
    Stack<Integer> s2;

    public MyQueue() {
        s1 = new Stack<>();
        s2 = new Stack<>();
    }

    public void push(int x) {
        s1.push(x);
    }

    public int pop() {
        while (!s1.isEmpty()) {
            s2.push(s1.pop());
        }
        int result = s2.pop();
        while (!s2.isEmpty()) {
            s1.push(s2.pop());
        }
        return result;
    }

    public int peek() {
        while (!s1.isEmpty()) {
            s2.push(s1.pop());
        }
        int result = s2.peek();
        while (!s2.isEmpty()) {
            s1.push(s2.pop());
        }
        return result;
    }

    public boolean empty() {
        return s1.isEmpty();
    }
}

The code pours the elements of s1 into s2 (and back) four times, so that step can be extracted into a helper method, as sketched below.
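A sketch of that refactor (the helper name dumpTo is made up, not from the original notes):

// Move every element from 'from' onto 'to', reversing their order.
private void dumpTo(Stack<Integer> from, Stack<Integer> to) {
    while (!from.isEmpty()) {
        to.push(from.pop());
    }
}

// pop() then becomes: dumpTo(s1, s2); int result = s2.pop(); dumpTo(s2, s1); return result;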

leetcode 225

Implement a stack using queues

class MyStack {

    Queue<Integer> queue1; // queue holding the same elements as the stack
    Queue<Integer> queue2; // auxiliary queue

    /** Initialize your data structure here. */
    public MyStack() {
        queue1 = new LinkedList<>();
        queue2 = new LinkedList<>();
    }

    /** Push element x onto stack. */
    public void push(int x) {
        queue2.offer(x); // put it in the auxiliary queue first
        while (!queue1.isEmpty()) {
            queue2.offer(queue1.poll());
        }
        Queue<Integer> queueTemp;
        queueTemp = queue1;
        queue1 = queue2;
        queue2 = queueTemp; // finally swap queue1 and queue2 so everything ends up in queue1
    }
    // push is arranged so that later polls come out in stack (LIFO) order

    /** Removes the element on top of the stack and returns that element. */
    public int pop() {
        return queue1.poll(); // queue1 mirrors the stack, so this and the two methods below only touch queue1
    }

    /** Get the top element. */
    public int top() {
        return queue1.peek();
    }

    /** Returns whether the stack is empty. */
    public boolean empty() {
        return queue1.isEmpty();
    }
}

Leetcode 20

Bracket matching

The clever trick here: while scanning the string, whenever we meet an opening bracket we push its corresponding closing bracket onto the stack; whenever we meet a closing bracket, we check whether it matches the top of the stack, returning false if it does not match or the stack is already empty. After the scan, a non-empty stack also means some brackets went unmatched.

import java.util.*;

class Solution {
    public boolean isValid(String s) {
        if (s.length() % 2 != 0) {
            return false;
        }
        char[] ss = s.toCharArray();
        Deque<Character> dq = new LinkedList<>();
        for (char c : ss) {
            if (c == '(') {
                dq.push(')');
            } else if (c == '{') {
                dq.push('}');
            } else if (c == '[') {
                dq.push(']');
            } else if (dq.isEmpty() || dq.peek() != c) {
                return false;
            } else {
                dq.pop();
            }
        }
        return dq.isEmpty();
    }
}

Leetcode 1047 Remove All Adjacent Duplicates in String

Like the bracket-matching problem above, this is solved with a stack.

import java.util.*;

class Solution {
    public String removeDuplicates(String s) {
        char[] ss = s.toCharArray();
        Deque<Character> dq = new LinkedList<>();
        for (char c : ss) {
            if (!dq.isEmpty() && dq.peek() == c) {
                dq.pop();
            } else {
                dq.push(c);
            }
        }
        StringBuilder sb = new StringBuilder();
        while (!dq.isEmpty()) {
            sb.append(dq.pollLast());
        }
        return sb.toString();
    }
}

Leetcode 150 Evaluate Reverse Polish Notation

public int evalRPN(String[] tokens) {
    Deque<String> dq = new LinkedList<>();
    for (String s : tokens) {
        if (s.equals("+")) {
            int a = Integer.parseInt(dq.pop());
            int b = Integer.parseInt(dq.pop());
            dq.push(Integer.toString(a + b));
        } else if (s.equals("*")) {
            int a = Integer.parseInt(dq.pop());
            int b = Integer.parseInt(dq.pop());
            dq.push(Integer.toString(a * b));
        } else if (s.equals("-")) {
            int a = Integer.parseInt(dq.pop());
            int b = Integer.parseInt(dq.pop());
            dq.push(Integer.toString(b - a));
        } else if (s.equals("/")) {
            int a = Integer.parseInt(dq.pop());
            int b = Integer.parseInt(dq.pop());
            dq.push(Integer.toString(b / a));
        } else {
            dq.push(s);
        }
    }
    return Integer.parseInt(dq.pop());
}

Leetcode 239 Sliding Window Maximum

class MyQueue {
    Deque<Integer> deque = new LinkedList<>();

    // when popping, only remove if the value equals the one at the front of the queue
    // (also check that the queue is not empty)
    void poll(int val) {
        if (!deque.isEmpty() && val == deque.peek()) {
            deque.poll();
        }
    }

    // when adding, pop entries from the back while the new value is larger,
    // keeping the queue monotonically decreasing.
    // e.g. the queue holds 3,1 and 2 arrives: 2 > 1, so 1 is dropped, leaving 3,2
    void add(int val) {
        while (!deque.isEmpty() && val > deque.getLast()) {
            deque.removeLast();
        }
        deque.add(val);
    }

    // the front of the queue is always the maximum
    int peek() {
        return deque.peek();
    }
}

class Solution {
    public int[] maxSlidingWindow(int[] nums, int k) {
        if (nums.length == 1) {
            return nums;
        }
        int len = nums.length - k + 1;
        // array holding the results
        int[] res = new int[len];
        int num = 0;
        // the custom queue
        MyQueue myQueue = new MyQueue();
        // put the first k elements into the queue
        for (int i = 0; i < k; i++) {
            myQueue.add(nums[i]);
        }
        res[num++] = myQueue.peek();
        for (int i = k; i < nums.length; i++) {
            // the window slides: remove the leftmost element if it is still in the queue
            myQueue.poll(nums[i - k]);
            // add the new rightmost element
            myQueue.add(nums[i]);
            // record the current maximum
            res[num++] = myQueue.peek();
        }
        return res;
    }
}

Trees

Kinds of trees

Full binary tree: a binary tree in which every node has degree 0 or 2, and all degree-0 nodes are on the same level.

Complete binary tree: every level except possibly the last is completely filled, and the nodes of the last level are packed as far left as possible. If the bottom level is level h (counting from 1), it contains between 1 and 2^(h-1) nodes.

Binary search tree (an ordered tree):

  • if its left subtree is non-empty, every value in the left subtree is less than the root's value;
  • if its right subtree is non-empty, every value in the right subtree is greater than the root's value;
  • its left and right subtrees are themselves binary search trees.

**Balanced binary tree, AVL (Adelson-Velsky and Landis):** either an empty tree, or a tree in which the heights of the two subtrees differ by at most 1 and both subtrees are themselves balanced binary trees.

Implementing trees

Linked storage (pointers) or sequential storage (arrays).

Traversing trees

1. DFS:

Go deep first; turn back when a leaf is reached.

  • preorder traversal (recursive or iterative)
  • inorder traversal (recursive or iterative)
  • postorder traversal (recursive or iterative)

2. BFS:

Traverse level by level.

  • level-order traversal (iterative)
public class TreeNode {
    int val;
    TreeNode left;
    TreeNode right;

    TreeNode() {}
    TreeNode(int val) { this.val = val; }
    TreeNode(int val, TreeNode left, TreeNode right) {
        this.val = val;
        this.left = left;
        this.right = right;
    }
}

Recursive binary tree traversal, DFS

Preorder: root-left-right, leetcode 144

Inorder: left-root-right, leetcode 94

Postorder: left-right-root, leetcode 145

// preorder, recursive — LC144 Binary Tree Preorder Traversal
class Solution {
    public List<Integer> preorderTraversal(TreeNode root) {
        List<Integer> result = new ArrayList<Integer>();
        preorder(root, result);
        return result;
    }

    public void preorder(TreeNode root, List<Integer> result) {
        if (root == null) {
            return;
        }
        result.add(root.val);
        preorder(root.left, result);
        preorder(root.right, result);
    }
}

// inorder, recursive — LC94 Binary Tree Inorder Traversal
class Solution {
    public List<Integer> inorderTraversal(TreeNode root) {
        List<Integer> res = new ArrayList<>();
        inorder(root, res);
        return res;
    }

    void inorder(TreeNode root, List<Integer> list) {
        if (root == null) {
            return;
        }
        inorder(root.left, list);
        list.add(root.val); // note this line
        inorder(root.right, list);
    }
}

// postorder, recursive — LC145 Binary Tree Postorder Traversal
class Solution {
    public List<Integer> postorderTraversal(TreeNode root) {
        List<Integer> res = new ArrayList<>();
        postorder(root, res);
        return res;
    }

    void postorder(TreeNode root, List<Integer> list) {
        if (root == null) {
            return;
        }
        postorder(root.left, list);
        postorder(root.right, list);
        list.add(root.val); // note this line
    }
}

Preorder and postorder traversal of an N-ary tree, leetcode 589, 590

Preorder:

public List<Integer> preorder(Node root) {
    List<Integer> res = new ArrayList<>();
    if (root == null) {
        return res;
    }
    pre(root, res);
    return res;
}

public void pre(Node root, List<Integer> list) {
    if (root == null) {
        return;
    }
    list.add(root.val);
    for (Node child : root.children) {
        pre(child, list);
    }
}

Postorder:

public List<Integer> postorder(Node root) {
    List<Integer> res = new ArrayList<>();
    if (root == null) {
        return res;
    }
    post(root, res);
    return res;
}

public void post(Node root, List<Integer> list) {
    if (root == null) {
        return;
    }
    for (Node child : root.children) {
        post(child, list);
    }
    list.add(root.val);
}

Iterative binary tree traversal, DFS

Iterate with a stack

Preorder

public List<Integer> preorderTraversal(TreeNode root) {
    List<Integer> result = new ArrayList<>();
    if (root == null) {
        return result;
    }
    Stack<TreeNode> stack = new Stack<>();
    stack.push(root);
    while (!stack.isEmpty()) {
        TreeNode node = stack.pop();
        result.add(node.val);
        if (node.right != null) {
            stack.push(node.right);
        }
        if (node.left != null) {
            stack.push(node.left);
        }
    }
    return result;
}

Inorder

public List<Integer> inorderTraversal(TreeNode root) {
    List<Integer> result = new ArrayList<>();
    if (root == null) {
        return result;
    }
    Stack<TreeNode> stack = new Stack<>();
    TreeNode cur = root;
    while (cur != null || !stack.isEmpty()) {
        if (cur != null) {
            stack.push(cur);
            cur = cur.left;
        } else {
            cur = stack.pop();
            result.add(cur.val);
            cur = cur.right;
        }
    }
    return result;
}

Postorder

public List<Integer> postorderTraversal(TreeNode root) {
    List<Integer> result = new ArrayList<>();
    if (root == null) {
        return result;
    }
    Stack<TreeNode> stack = new Stack<>();
    stack.push(root);
    while (!stack.isEmpty()) {
        TreeNode node = stack.pop();
        result.add(node.val);
        if (node.left != null) {
            stack.push(node.left);
        }
        if (node.right != null) {
            stack.push(node.right);
        }
    }
    Collections.reverse(result);
    return result;
}

BFS

Leetcode 102/107

Implemented with a queue (or with recursion)

public List<List<Integer>> levelOrder(TreeNode root) {
    List<List<Integer>> res = new ArrayList<List<Integer>>();
    if (root == null) {
        return res;
    }
    Queue<TreeNode> que = new LinkedList<TreeNode>();
    que.offer(root);

    while (!que.isEmpty()) {
        List<Integer> list = new ArrayList<Integer>();
        int len = que.size(); // the number of nodes on this level, so we know when to move to the next level
        while (len > 0) {
            TreeNode tem = que.poll();
            list.add(tem.val);
            if (tem.left != null) que.offer(tem.left);
            if (tem.right != null) que.offer(tem.right);
            len--;
        }
        res.add(list);
    }
    return res;
}

Leetcode 199 Binary Tree Right Side View

Still BFS, but nothing needs rebuilding: just return the rightmost value on each level.

public List<Integer> rightSideView(TreeNode root) {
    List<Integer> res = new ArrayList<>();
    if (root == null) {
        return res;
    }
    Queue<TreeNode> que = new LinkedList<>();
    que.offer(root);
    while (!que.isEmpty()) {
        int len = que.size();
        while (len > 0) {
            TreeNode node = que.poll();
            if (len == 1) {
                res.add(node.val);
            }
            if (node.left != null) que.offer(node.left);
            if (node.right != null) que.offer(node.right);
            len--;
        }
    }
    return res;
}

Leetcode 637

Average of each level

public List<Double> averageOfLevels(TreeNode root) {
    List<Double> averages = new ArrayList<Double>();
    Queue<TreeNode> queue = new LinkedList<TreeNode>();
    queue.offer(root);
    while (!queue.isEmpty()) {
        double sum = 0;
        int size = queue.size();
        for (int i = 0; i < size; i++) {
            TreeNode node = queue.poll();
            sum += node.val;
            TreeNode left = node.left, right = node.right;
            if (left != null) {
                queue.offer(left);
            }
            if (right != null) {
                queue.offer(right);
            }
        }
        averages.add(sum / size);
    }
    return averages;
}

Leetcode 429

Level-order traversal of an N-ary tree

public List<List<Integer>> levelOrder(Node root) {
    List<List<Integer>> res = new ArrayList<>();
    if (root == null) {
        return res;
    }
    Queue<Node> que = new LinkedList<>();
    que.offer(root);
    while (!que.isEmpty()) {
        int size = que.size();
        List<Integer> list = new ArrayList<>();
        while (size > 0) {
            Node node = que.poll();
            list.add(node.val);
            for (Node child : node.children) {
                if (child != null) {
                    que.offer(child);
                }
            }
            size--;
        }
        res.add(list);
    }
    return res;
}

Leetcode 515

Maximum value on each level

public List<Integer> largestValues(TreeNode root) {
    List<Integer> res = new ArrayList<>();
    if (root == null) {
        return res;
    }
    Queue<TreeNode> que = new LinkedList<>();
    que.offer(root);
    while (!que.isEmpty()) {
        int size = que.size();
        int max = Integer.MIN_VALUE;
        while (size > 0) {
            TreeNode node = que.poll();
            if (node.left != null) que.offer(node.left);
            if (node.right != null) que.offer(node.right);
            if (node.val > max) max = node.val;
            size--;
        }
        res.add(max);
    }
    return res;
}

Leetcode 116/117

This one answer passes both problems on LeetCode.

Add the next pointers; take care to introduce and update a previous node.

public Node connect(Node root) {
    if (root == null) {
        return root;
    }
    Queue<Node> que = new LinkedList<>();
    que.offer(root);
    while (!que.isEmpty()) {
        int len = que.size();
        Node previous = null;
        while (len > 0) {
            Node node = que.poll();
            if (previous != null) {
                previous.next = node;
            }
            if (len == 1) {
                node.next = null;
            }
            if (node.left != null) que.offer(node.left);
            if (node.right != null) que.offer(node.right);
            previous = node;
            len--;
        }
    }
    return root;
}

Leetcode 104

Maximum depth is plain recursion; no full traversal needed.

public int maxDepth(TreeNode root) {
    if (root == null) {
        return 0;
    }
    return Math.max((maxDepth(root.left) + 1), (maxDepth(root.right) + 1));
}

Leetcode 111

For minimum depth you must handle the degenerate (linear) case: naive recursion would return 1 there, but the true minimum depth of such a tree is its full height.

public int minDepth(TreeNode root) {
    if (root == null) {
        return 0;
    }
    if (root.right == null) {
        return minDepth(root.left) + 1;
    } else if (root.left == null) {
        return minDepth(root.right) + 1;
    } else {
        return Math.min((minDepth(root.left) + 1), (minDepth(root.right) + 1));
    }
}
public int minDepth(TreeNode root) {
    if (root == null) return 0;
    int left = minDepth(root.left);
    int right = minDepth(root.right);
    if (left == 0 || right == 0) return left + right + 1;
    return Math.min(left, right) + 1;
}

Leetcode 226 Invert Binary Tree

public TreeNode invertTree(TreeNode root) {
    if (root == null) return null;
    TreeNode left = root.left; // the next line overwrites root.left, so save it first
    root.left = invertTree(root.right);
    root.right = invertTree(left);
    return root;
}

Leetcode 100 Same Tree

public boolean isSameTree(TreeNode p, TreeNode q) {
    if (p == null && q == null) {
        return true;
    }
    if (p == null || q == null) {
        return false;
    }
    if (p.val != q.val) {
        return false;
    }
    return isSameTree(p.left, q.left) && isSameTree(p.right, q.right);
}

Leetcode 101 Symmetric Tree

Recursive

public boolean isSymmetric(TreeNode root) {
    if (root == null) return false;
    return sym(root.left, root.right);
}

private boolean sym(TreeNode t1, TreeNode t2) {
    if (t1 == null && t2 == null) {
        return true;
    }
    if (t1 == null || t2 == null) {
        return false;
    }
    if (t1.val != t2.val) {
        return false;
    }
    return sym(t1.left, t2.right) && sym(t2.left, t1.right);
}

Iterative, using a queue

public boolean isSymmetric(TreeNode root) {
    Queue<TreeNode> deque = new LinkedList<>();
    deque.offer(root.left);
    deque.offer(root.right);
    while (!deque.isEmpty()) {
        TreeNode leftNode = deque.poll();
        TreeNode rightNode = deque.poll();
        if (leftNode == null && rightNode == null) {
            continue;
        }
        // three checks (left null, right null, values differing) merged into one condition:
        if (leftNode == null || rightNode == null || leftNode.val != rightNode.val) {
            return false;
        }
        // note the offer order here differs from the Deque-based variant
        deque.offer(leftNode.left);
        deque.offer(rightNode.right);
        deque.offer(leftNode.right);
        deque.offer(rightNode.left);
    }
    return true;
}

Leetcode 222

Count Complete Tree Nodes

public int countNodes(TreeNode root) {
    if (root == null) {
        return 0;
    }
    return countNodes(root.left) + countNodes(root.right) + 1;
}

Leetcode 110

Balanced binary tree

Unlike the maximum-depth problem: if any subtree has a height difference greater than one, the whole tree is not balanced.

public boolean isBalanced(TreeNode root) {
    return getHeight(root) != -1;
}

private int getHeight(TreeNode root) {
    if (root == null) {
        return 0;
    }
    int leftHeight = getHeight(root.left);
    if (leftHeight == -1) {
        return -1;
    }
    int rightHeight = getHeight(root.right);
    if (rightHeight == -1) {
        return -1;
    }
    // subtree height difference greater than 1: return -1 to signal "not balanced"
    if (Math.abs(leftHeight - rightHeight) > 1) {
        return -1;
    }
    return Math.max(leftHeight, rightHeight) + 1;
}
class Solution {
    public boolean isBalanced(TreeNode root) {
        if (root == null) {
            return true;
        } else {
            return Math.abs(height(root.left) - height(root.right)) <= 1 && isBalanced(root.left) && isBalanced(root.right);
        }
    }

    public int height(TreeNode root) {
        if (root == null) {
            return 0;
        } else {
            return Math.max(height(root.left), height(root.right)) + 1;
        }
    }
}

Leetcode 257 Binary Tree Paths

Preorder traversal; remember to record every path as you go.

public List<String> binaryTreePaths(TreeNode root) {
    List<String> res = new ArrayList<>();
    if (root == null) {
        return res;
    }
    Stack<TreeNode> stack = new Stack<>();
    stack.push(root);

    Stack<StringBuilder> paths = new Stack<>();
    StringBuilder sb = new StringBuilder();
    sb.append(Integer.toString(root.val));
    paths.push(sb);

    while (!stack.isEmpty()) {
        TreeNode node = stack.pop();
        sb = paths.pop();

        if (node.right != null) {
            stack.push(node.right);
            StringBuilder newPath = new StringBuilder(sb);
            newPath.append("->");
            newPath.append(Integer.toString(node.right.val));
            paths.push(newPath);
        }
        if (node.left != null) {
            stack.push(node.left);
            StringBuilder newPath = new StringBuilder(sb);
            newPath.append("->");
            newPath.append(Integer.toString(node.left.val));
            paths.push(newPath);
        }
        if (node.left == null && node.right == null) {
            String route = sb.toString();
            res.add(route);
        }
    }
    return res;
}

Recursive method (with backtracking)

public List<String> binaryTreePaths(TreeNode root) {
    List<String> res = new ArrayList<>(); // holds the final result
    if (root == null) {
        return res;
    }
    List<Integer> paths = new ArrayList<>(); // holds the current path
    traversal(root, paths, res);
    return res;
}

private void traversal(TreeNode root, List<Integer> paths, List<String> res) {
    paths.add(root.val); // preorder: visit the root
    // reached a leaf
    if (root.left == null && root.right == null) {
        // emit the path
        StringBuilder sb = new StringBuilder(); // StringBuilder concatenates strings faster
        for (int i = 0; i < paths.size() - 1; i++) {
            sb.append(paths.get(i)).append("->");
        }
        sb.append(paths.get(paths.size() - 1)); // append the last node
        res.add(sb.toString()); // collect one path
        return;
    }
    // recursion and backtracking happen together, so they go in the same block
    if (root.left != null) { // left
        traversal(root.left, paths, res);
        paths.remove(paths.size() - 1); // backtrack
    }
    if (root.right != null) { // right
        traversal(root.right, paths, res);
        paths.remove(paths.size() - 1); // backtrack
    }
}

Leetcode 404 Sum of Left Leaves

Sum the values of the tree's left leaves; a leaf has no children.

The idea is to detect a left leaf through its parent, i.e. node.left != null && node.left.left == null && node.left.right == null.

class Solution {
    public int sumOfLeftLeaves(TreeNode root) {
        if (root == null) return 0;
        int leftValue = sumOfLeftLeaves(root.left);   // left
        int rightValue = sumOfLeftLeaves(root.right); // right

        int midValue = 0;
        if (root.left != null && root.left.left == null && root.left.right == null) {
            midValue = root.left.val;
        }
        int sum = midValue + leftValue + rightValue;  // root
        return sum;
    }
}

Leetcode 513

Find the value of the bottom-left-most node of the tree; note this is not necessarily the leftmost value overall.

// recursive version
class Solution {
    private int Deep = -1;
    private int value = 0;

    public int findBottomLeftValue(TreeNode root) {
        value = root.val;
        findLeftValue(root, 0);
        return value;
    }

    private void findLeftValue(TreeNode root, int deep) {
        if (root == null) return;
        if (root.left == null && root.right == null) {
            if (deep > Deep) {
                value = root.val;
                Deep = deep;
            }
        }
        if (root.left != null) findLeftValue(root.left, deep + 1);
        if (root.right != null) findLeftValue(root.right, deep + 1);
    }
}
// iterative version
class Solution {

    public int findBottomLeftValue(TreeNode root) {
        Queue<TreeNode> queue = new LinkedList<>();
        queue.offer(root);
        int res = 0;
        while (!queue.isEmpty()) {
            int size = queue.size();
            for (int i = 0; i < size; i++) {
                TreeNode poll = queue.poll();
                if (i == 0) {
                    res = poll.val;
                }
                if (poll.left != null) {
                    queue.offer(poll.left);
                }
                if (poll.right != null) {
                    queue.offer(poll.right);
                }
            }
        }
        return res;
    }
}

Mixed practice

Leetcode 852 Peak Index in a Mountain Array

Linear scan, O(n)

public int peakIndexInMountainArray(int[] arr) {
    int len = arr.length;
    for (int i = 0, j = len - 1; i < len; i++, j--) {
        if (arr[i] > arr[i + 1]) {
            return i;
        }
        if (arr[j - 1] < arr[j]) {
            return j;
        }
    }
    return -1;
}

Binary search, O(log n)

public int peakIndexInMountainArray(int[] arr) {
    int left = 0;
    int right = arr.length - 1;

    while (left < right) {
        int mid = left + (right - left) / 2;

        if (arr[mid] < arr[mid + 1]) {
            left = mid + 1;
        } else {
            right = mid;
        }
    }

    return left;
}

Quicksort

public static void quickSort(int[] array) {
    quick(array, 0, array.length - 1);
}

private static void quick(int[] array, int left, int right) {

    // base case: a single element (==) or an empty range (>) —
    // e.g. for 1 2 3 4 with 1 as the pivot, pivot-1 falls off the left end
    if (left >= right) {
        return;
    }

    int pivot = partition(array, left, right);
    quick(array, left, pivot - 1);
    quick(array, pivot + 1, right);
}

public static int partition(int[] array, int start, int end) {
    int i = start; // remember the pivot's index so the meeting value can be swapped with it at the end
    int key = array[start]; // the pivot
    while (start < end) {
        while (start < end && array[end] >= key) {
            end--;
        }
        while (start < end && array[start] <= key) {
            start++;
        }
        swap(array, start, end);
    }
    // start and end have met; swap the meeting value with the pivot
    swap(array, start, i);
    return start; // return the pivot's final index
}

// swap helper (missing from the original notes, needed by partition)
private static void swap(int[] array, int i, int j) {
    int temp = array[i];
    array[i] = array[j];
    array[j] = temp;
}


Secured Software Engineering

The case studies below are provided here in advance; they will be part of the final exam.

Case Study 1: You are part of a team that has been tasked with developing a web-based application for a healthcare organisation. The application will allow doctors to view patient information, including medical history, test results and treatment plans. Patients will also be able to access the application to view their own medical information and communicate with their doctors. The application will be used by a large number of healthcare providers and patients, and therefore security is a top priority. (Provided ahead of the exam)

Case Study 2: ABC airlines operates a large number of flights globally. In 2021, the airline suffered a significant data breach that affected the personal information of thousands of customers. In response to this breach, ABC airlines invested in a new approach to secure software engineering, including implementing secure coding practices, testing and validating software, and conducting regular security audits. The company has a complex network infrastructure that includes several web servers, database servers, and application servers. (Provided ahead of the exam)


1. Questions on Case Study 1 (restated above; provided ahead of the exam):

(a) Identify and list down five use cases and five misuse cases of the system. [5]

Use Cases:

  1. Registration: A new patient can securely register on the platform, providing their personal and contact information, and setting up a strong, unique password.
  2. Medical Record Access: A doctor can access a patient’s medical history, including previous diagnoses, treatment plans, medications, and test results, to make informed decisions regarding the patient’s care.
  3. Appointment Scheduling: Patients can use the application to schedule appointments with their healthcare providers, selecting available dates and times and receiving confirmation notifications.
  4. Secure Messaging: Doctors and patients can communicate via a secure messaging platform within the application, discussing treatment plans, symptoms, and other healthcare-related topics.
  5. Prescription Management: Healthcare providers can securely prescribe medications through the application, enabling patients to access their prescriptions and facilitating communication with pharmacies.

Misuse Cases:

  1. Unauthorized Access: An attacker gains unauthorized access to a patient’s account, potentially viewing, altering, or deleting sensitive medical information, or even impersonating the patient.
  2. Man-in-the-Middle Attack: An attacker intercepts the communication between the patient and the healthcare provider, potentially modifying the messages sent or received or eavesdropping on sensitive information.
  3. Phishing Attacks: An attacker sends phishing emails or messages to patients, posing as a healthcare provider or the application itself, in an attempt to steal login credentials or other sensitive information.
  4. Distributed Denial-of-Service (DDoS) Attack: An attacker overwhelms the application with a large volume of traffic, causing it to become unresponsive or slow, preventing healthcare providers and patients from accessing necessary information and services.
  5. Security Misconfiguration: The application’s security settings are not properly configured, leaving it vulnerable to attacks and exposing sensitive patient data.

(b) List down five actors and five attackers of the system. [5]

Actors:

Patients, doctors, administrators, nurses, and patients' family members.

Attackers:

External attackers, attackers targeting the payment gateway, and malicious insiders such as nurses, doctors, or patients' family members.

(c) List down five high-level threats to the system. [5]

  1. Insecure authentication
  2. DDoS attacks
  3. Leakage of patients' medical information
  4. An attacker modifying patient information or drug prescriptions
  5. Large file uploads leading to buffer overflow

(d) Against each threat what would be your countermeasure? [5]

  1. Introduce multi-factor authentication.

  2. Add defences against DoS/DDoS attacks.

  3. Implement a chain of custody to monitor access to patient data.

  4. Implement logging of all actions on the website, with audits performed by doctors.

  5. Enforce a maximum file-size limit on uploads and add defences against buffer overflow.

2. (a) If requirements are introduced at a later stage in the project, what would be their impact on security?

Support your answer with an example. [3]

Introducing requirements late often leads to incomplete or insufficient security measures, making the system vulnerable to threats and attacks. The main impacts are:

  1. Increased complexity: Adding new requirements can make the system more complex, which in turn makes it harder to analyze and secure. Complex systems are more prone to vulnerabilities, as they have more potential points of failure and attack.

Example: Imagine a web application that initially only required user authentication via username and password. If the project scope is changed to include multi-factor authentication, this introduces additional components such as SMS or email verification. The complexity of the system increases, making it more difficult to ensure that all security aspects are correctly implemented and that no vulnerabilities are introduced.

  2. Insufficient time for security analysis and testing: Introducing new requirements late in the project often means that there is less time available for security analysis and testing, increasing the likelihood of vulnerabilities remaining undetected.

Example: If a new payment feature is added to an e-commerce website just before the project’s deadline, there may not be enough time to perform a thorough security analysis and test all potential attack vectors, such as SQL injection or cross-site scripting vulnerabilities.

  3. Cost and resource constraints: Late requirement changes can lead to increased costs and put additional strain on resources. This can result in cutting corners on security measures or failing to allocate enough time for thorough security testing.

Example: If a late change consumes the remaining budget and staff time, management may cut the planned penetration test to stay on schedule, leaving vulnerabilities undetected at release.

  4. Incomplete integration of security measures: New requirements might not be fully compatible with existing security measures, requiring rework or additional implementation to ensure that the overall security posture remains strong.

Example: An organization decides to migrate its on-premises infrastructure to the cloud during the project’s later stages. This change requires a reassessment of the security measures in place, as cloud environments have different security considerations compared to on-premises infrastructure. If not addressed properly, this can lead to vulnerabilities in the cloud environment.

(b) An organisation is performing risk analysis during the development stage for their asset, which has an estimated value of £100,000. The exposure factor of the asset is estimated to be 0.25 and the annual rate of occurrence is expected to be 0.5. Mitigating the risk will cost the organisation £5,000; is it feasible to fix the vulnerability in the code or not? [5]

Using the formulas from the risk-control notes: SLE = AV × EF = £100,000 × 0.25 = £25,000, and ALE = SLE × ARO = £25,000 × 0.5 = £12,500. If the vulnerability is not fixed, the expected annual loss is therefore £12,500.

Since the £5,000 mitigation cost is well below the £12,500 annual loss expectancy, it is feasible (and worthwhile) to spend £5,000 to fix this vulnerability.
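As a quick check, here is a minimal sketch of the same calculation; the variable names are illustrative and the values come from the question:

```python
# Minimal sketch of the feasibility check (illustrative variable names).
asset_value = 100_000    # AV, in GBP
exposure_factor = 0.25   # EF: fraction of the asset affected
aro = 0.5                # Annualized Rate of Occurrence

sle = asset_value * exposure_factor   # SLE = AV * EF = 25,000
ale = sle * aro                       # ALE = SLE * ARO = 12,500
cost_of_fix = 5_000

print(f"SLE = £{sle:,.0f}, ALE = £{ale:,.0f}")
print("Fix is cost-effective:", cost_of_fix < ale)   # True
```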

(c) A company wants to audit their software for GDPR compliance. Which three Secure Software Engineering methods would you recommend for this purpose? Give a brief explanation for each method. [6]

Biometric recognition; multi-factor authentication.

(d) In the code given on the next page, identify which variables are considered tainted, identify any two potential security vulnerabilities in the code, and suggest how to fix them (write in your own words, no coding is required). [6]

```javascript
function authenticateUser(username, password) {
  var userCredentials = {
    username: username,
    password: password
  };

  var input = document.getElementById("input").value;
  if (isValidEmail(input)) {
    userCredentials.email = input;
  }

  $.post("/login", userCredentials, function(response) {
    if (response.success) {
      alert("You are logged in!");
    } else {
      alert("Authentication failed.");
    }
  });
}

function isValidEmail(email) {
  var emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
  return emailRegex.test(email);
}
```
  1. Tainted variables:

The username and password fields of userCredentials may be tainted, because they come directly from the function's parameters.

The input variable is likewise read directly from the page, so it may also be tainted.

  2. Solutions:

For the username and password, submit them through a validated form, encode them safely, and check all user input; a sanitization library such as DOMPurify can help with this.

The POST request should be sent over HTTPS, and sensitive user input should be encrypted or hashed so that it cannot be hijacked or stolen in transit; encryption algorithms, hashing algorithms, or SSL/TLS can be used to secure the communication.

There are two potential security vulnerabilities in the code:

Cross-site scripting (XSS):

The code accepts user input (the username and password) and sends it directly to the server without proper validation or processing. If an attacker injects a malicious script into an input field, the server may execute it, potentially compromising the application.

To fix this vulnerability, sanitize and validate user input before sending it to the server. A library such as DOMPurify can sanitize the input, and a validation function can check the input fields for invalid characters.

Insecure transmission of sensitive data:

The code sends the user's credentials (username and password) to the server in plain text, without any encryption. This can expose sensitive information to eavesdroppers or attackers, especially if the connection is not protected by HTTPS.

To address this, ensure the site uses HTTPS, which encrypts data in transit between the client and the server. In addition, consider a more secure authentication mechanism (such as OAuth2) or password hashing (such as bcrypt) to store and transmit passwords securely.

3.

(a). You are being asked to consult with a start-up who are looking to put together a security policy for integrity. You are told that there are three security levels for the objects: o1 < o2 < o3, where o3 is the highest and o1 is the lowest security level. The following table presents the associated security levels and categories for the objects (o). The subjects S1 lies on Level 1, S2 on Level 2 and S3 on Level 3. Fill out the following table to indicate all applicable permissions when the Simple Security Property is applied. Write '-r' for read, '-w' for write and '-rw' for both read and write operations. If a subject is not authorized to perform any action, leave the cell blank. [8]

|    | O1 | O2 | O3 |
|----|----|----|----|
| S1 | -r |    |    |
| S2 | -r | -r |    |
| S3 | -r | -r | -r |

Table 1: A security model indicating subject and object integrity levels
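As a quick illustration of how the table is filled in, here is a minimal sketch assuming the "no read up" reading of the Simple Security Property (the numeric levels and code are illustrative, not part of the model answer):

```python
# Simple Security Property sketch: a subject may read an object only if
# the subject's level dominates (>=) the object's level.
subjects = {"S1": 1, "S2": 2, "S3": 3}
objects = {"O1": 1, "O2": 2, "O3": 3}

for subject, s_level in subjects.items():
    row = ["-r" if s_level >= o_level else "" for o_level in objects.values()]
    print(subject, row)   # S1 reads only O1; S3 reads O1, O2 and O3
```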

(b). How can an organization effectively apply confidentiality and integrity in their software products? Provide a brief explanation of the consequences.

(c). ABC airlines operates a large number of flights globally. In 2021, the airline suffered a significant data breach that affected the personal information of thousands of customers. In response to this breach, ABC airlines invested in a new approach to secure software engineering, including implementing secure coding practices, testing and validating software, and conducting regular security audits. The company has a complex network infrastructure that includes several web servers, database servers, and application servers. Now, the airline company wants to implement confidentiality and integrity in their system, how it will be done? List any three suggestions.

(d). Why is it important to rank stakeholders in the system? Provide an example scenario and rank 3 different stakeholders in it.

Reason:

Ranking is important to satisfy competing needs; needs can contradict each other, e.g., privacy in one case can conflict with a requirement to send a name and email in another.

Candidate stakeholders are: end users, project manager, CEO, CIO, team lead, network admin, db manager, marketing department, legal department and sales team.

The number of external stakeholders is limited.

revision

Risk

Risk and Risk management

  • Risk is the potential that a given threat will exploit vulnerabilities of an asset or group of assets and thereby cause harm to the organization
  • Risk management— “Process of identifying, controlling and minimizing or eliminating security risks that may affect information systems, for an acceptable cost.”
  • Risk assessment—“assessment of threats to, impact on and vulnerabilities of information and information processing facilities and the likelihood of their occurrence.”

Who is the enemy? Why do they do it?
• Offenders

  • Crackers—mostly teenagers doing it as an intellectual challenge
  • Information system criminals—espionage and/or fraud/abuse—for a nation or company to gain a competitive advantage over its rivals
  • Vandals—authorized users and strangers (a cracker or a criminal)—motivated by anger directed at an individual, an organization, or life in general

Risk = Threats × Vulnerabilities

| Risk | Threats | Vulnerabilities |
|---|---|---|
| business disruption | angry employees | software bugs |
| financial losses | dishonest employees | broken processes |
| loss of privacy | criminals | ineffective controls |
| damage to reputation | governments | hardware flaws |
| loss of confidence | terrorists | business change |
| legal penalties | the press | legacy systems |
| impaired growth | competitors | inadequate BCP |
| loss of life | hackers | human error |
|  | nature |  |

Types of Damage

  • Interruption—destroyed/unavailable services/resources
  • Interception—unauthorized party snooping or getting access to a resource
  • Modification—unauthorized party modifying a resource
  • Fabrication—unauthorized party inserts a fake asset/resource

The purpose of risk management

  • Ensure overall business and business assets are safe
  • Protect against competitive disadvantage
  • Compliance with laws and best business practices
  • Maintain a good public reputation

Accountability for Risk Management

  • It is the responsibility of each community of interest to manage risks; each community has a role to play:

  • Information Security - best understands the threats and attacks that introduce risk into the organization

  • Management and Users - play a part in the early detection and response process - they also ensure sufficient resources are allocated

  • Information Technology - must assist in building secure systems and operating them safely

image-20230426143138064

Steps of a risk management plan

  • Step 1: Identify Risk
  • Step 2: Assess Risk
  • Step 3: Control Risk
  • Steps are similar regardless of context (InfoSec, Physical Security, Financial, etc.)
  • This presentation will focus on controlling risk within an InfoSec context

Risk Identification

  • The steps to risk identification are:

  • Identify your organization’s information assets

  • Classify and categorize said assets into useful groups

  • Rank assets necessity to the organization


Risk Assessment

  • The steps to risk assessment are:

  • Identify threats and threat agents

  • Prioritize threats and threat agents

  • Assess vulnerabilities in current InfoSec plan

  • Determine risk of each threat

R = P × V - M + U, where:

  • R = Risk
  • P = Probability of threat attack
  • V = Value of the information asset
  • M = Mitigation by current controls
  • U = Uncertainty of vulnerability
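A worked example of this formula with made-up numbers (the scale is illustrative):

```python
# Worked example of R = P * V - M + U (all values are made up).
P = 0.5    # probability of threat attack
V = 80     # value of the information asset (relative scale)
M = 10     # mitigation provided by current controls
U = 4      # uncertainty of the vulnerability assessment

R = P * V - M + U
print(R)   # 0.5 * 80 - 10 + 4 = 34
```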

Risk control

  • The steps to risk control are:
  • Cost-Benefit Analysis (CBA)
    • Single Loss Expectancy (SLE)
    • Annualized Rate of Occurrence (ARO)
    • Annual Loss Expectancy (ALE)
    • Annual Cost of the Safeguard (ASG)
  • Feasibility Analysis
    • Organizational Feasibility
    • Operational Feasibility
    • Technical Feasibility
    • Political Feasibility
  • Risk Control Strategy Implementation

Cost-Benefit analysis

  • Determine what risk control strategies are cost effective

  • Below are some common formulas used to calculate cost-benefit analysis

  • SLE = AV × EF
    • AV = Asset Value, EF = Exposure Factor (% of asset affected)
  • ALE = SLE × ARO
  • CBA = ALE (pre-control) - ALE (post-control) - ACE (Annual Countermeasure Expectancy)

SLE - single loss expectancy, ALE - Annual loss expectancy, CBA – Cost benefit analysis
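A minimal sketch of these formulas in code; all of the values below are made up for illustration:

```python
# Minimal CBA sketch using the formulas above (all values are made up).
def ale(asset_value, exposure_factor, aro):
    sle = asset_value * exposure_factor   # SLE = AV * EF
    return sle * aro                      # ALE = SLE * ARO

ale_pre = ale(100_000, 0.25, 0.5)    # ALE before the control: 12,500
ale_post = ale(100_000, 0.25, 0.1)   # ALE after the control: 2,500
ace = 5_000                          # annual cost of the countermeasure

cba = ale_pre - ale_post - ace       # positive => the control pays for itself
print(cba)                           # 12,500 - 2,500 - 5,000 = 5,000
```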

Feasibility analysis

Organizational: Does the plan correspond to the organization’s objectives? What is in it for the organization? Does it limit the organization’s capabilities in any way?

Operational: Will stakeholders (users, managers, etc.) be able/willing to accept the plan? Is the system compatible with the new changes? Have the possible changes been communicated to the employees?

Technical: Is the necessary technology owned or obtainable? Are our employees trained and if not can we afford to train them? Should we hire new employees?

Political: Can InfoSec acquire the necessary budget and approval to implement the plan? Is the budget required justifiable? Does InfoSec have to compete with other departments to acquire the desired budget?

Risk control Strategies

  • Defense
  • Transferal
  • Mitigation
  • Acceptance (Abandonment)
  • Termination

Risk control Strategy: defense

• Defense: Prevent the exploitation of the system via application of policy, training/education, and technology. Preferably layered security (defense in depth)

  • Counter threats
  • Remove vulnerabilities from assets
  • Limit access to assets
  • Add protective safeguards

Figure 1-Application Security Layered Approach

image-20230426144850808

Risk control Strategy: transferal

  • Transferal: Shift risks to other areas or outside entities to handle
  • Can include:
    • Purchasing insurance
    • Outsourcing to other organizations
    • Implementing service contracts with providers
    • Revising deployment models

image-20230426145019527

Risk control Strategy: Mitigation

image-20230426145147173

Risk control Strategy: Acceptance

Appropriate when:

  • The cost to protect an asset or assets exceeds the cost to replace it/them
  • The probability of risk is very low and the asset is of low priority
  • Otherwise acceptance = negligence

Risk control Strategy: Termination

  • Termination: Removing or discontinuing the information asset from the organization
  • Examples include:

Equipment disposal

Discontinuing a provided service

Firing an employee

Pros and cons of each strategy

image-20230426145601149

Standard approaches to risk management

  • U.S. CERT's Operationally Critical Threat, Asset, and Vulnerability Evaluation (OCTAVE) methods (Original, OCTAVE-S, OCTAVE-Allegro)
  • ISO 27005 Standard for InfoSec Risk Management
  • NIST Risk Management Model
  • Microsoft Risk Management Approach
  • Jack A. Jones’ Factor Analysis of Information Risk
    (FAIR)
  • Delphi Technique

Risk Determination

For the purpose of relative risk assessment:

RISK = likelihood of vulnerability occurrence × value (or impact)

MINUS percentage of risk already controlled

PLUS an element of uncertainty

Access Controls

  • One particular application of controls is in the area of access controls

  • Access controls are those controls that specifically address admission of a user into a trusted area of the organization

  • There are a number of approaches to controlling access

  • Access controls can be

  • discretionary

  • mandatory

  • nondiscretionary

Types of Access Controls

  • Discretionary Access Controls (DAC) are implemented at the discretion or option of the data user
  • Mandatory Access Controls (MACs) are structured and coordinated with a data classification scheme, and are required
  • Nondiscretionary Controls are those determined by a central authority in the organization and can be based on that individual’s role

Lattice-based Control

  • Another type of nondiscretionary access is lattice-based control, where a lattice structure (or matrix) is created containing subjects and objects, and the boundaries associated with each pair is contained
  • This specifies the level of access each subject has to each object
  • In a lattice-based control the column of attributes associated with a particular object are referred to as an access control list or ACL
  • The row of attributes associated with a particular subject (such as a user) is referred to as a capabilities table
  • This is part of Principles of Information Security

Documenting Results of Risk Assessment

  • The goal of this process has been to identify the information assets of the organization that have specific vulnerabilities and create a list of them, ranked for focus on those most needing protection first
  • In preparing this list, we have collected and preserved factual information about the assets, the threats they face, and the vulnerabilities they experience
  • We should also have collected some information about the controls that are already in place

Asset Identification and Valuation

  • This iterative process begins with the identification of assets, including all of the elements of an organization’s system
  • Then, we classify and categorize the assets adding details as we dig deeper into the analysis

Components of an Information System

image-20230426150102652

Hardware, Software, and Network Asset Identification

• Automated tools can sometimes uncover the system elements that make up the hardware, software, and network components

• Once created, the inventory listing must be kept current, often through a tool that periodically refreshes the data

  • What attributes of each of these information assets should be tracked?
  • When deciding which information assets to track, consider including these asset attributes:

    • Name
    • IP address
    • MAC address
    • Element type
    • Serial number
    • Manufacturer name

People, Procedures, and Data Asset Identification

  • Unlike the tangible hardware and software elements already described, the human resources, documentation, and data information assets are not as readily discovered and documented
  • These assets should be identified, described, and evaluated by people using knowledge, experience, and judgment
  • As these elements are identified, they should also be recorded into some reliable data handling process

Asset Information for People

  • For People:

    • Position name/number/ID - try to avoid names and stick to identifying positions, roles, or functions

    • Supervisor

    • Security clearance level

    • Special skills

Asset Information for Procedures

  • For Procedures:

    • Description

    • Intended purpose

    • What elements is it tied to

    • Where is it stored for reference

    • Where is it stored for update purposes

Asset Information for Data

  • For Data:

    • Classification

    • Owner/creator/manager

    • Size of data structure

    • Data structure used - sequential, relational

    • Online or offline

    • Where located

    • Backup procedures employed

Information Asset Classification

  • Many organizations already have a classification scheme

  • Examples of these kinds of classifications are:

    • confidential data

    • internal data

    • public data

  • Informal organizations may have to organize themselves to create a useable data classification model

  • The other side of the data classification scheme is the personnel security clearance structure

Information Asset Valuation

  • Each asset is categorized

  • Questions to assist in developing the criteria to be used for asset valuation:

    • Which information asset is the most critical to the success of the organization?

    • Which information asset generates the most revenue?

    • Which information asset generates the most profitability?

    • Which information asset would be the most expensive to replace?

    • Which information asset would be the most expensive to protect?

    • Which information asset would be the most embarrassing or cause the greatest liability if revealed?

Examples of Information Security Vulnerabilities

  • Information security vulnerabilities are weaknesses that expose an organization to risk.
  • Through employees—Social interaction, Customer interaction, Discussing work in public locations
  • Through former employees—Former employees working for competitors, Former employees retaining company data, Former employees discussing company matters
  • Through technology—Social networking, File sharing, Rapid technological changes, Legacy systems, Storing data on mobile devices such as mobile phones, Internet browsers
  • Through hardware—Susceptibility to dust, heat and humidity, Hardware design flaws, Out-of-date hardware, Misconfiguration of hardware

Examples of Information Security Vulnerabilities (Cont.)

  • Through software—Insufficient testing, Lack of audit trail, Software bugs and design faults, Unchecked user input, Software that fails to consider human factors, Software complexity (bloatware), Software as a service (relinquishing control of data), Software vendors that go out of business or change ownership
  • Through Network—Unprotected network communications, Open physical connections, IPs and ports, Insecure network architecture, Unused user ids, Excessive privileges, Unnecessary jobs and scripts executing, Wifi networks
  • Through IT Management—Insufficient IT capacity, Missed security patches, Insufficient incident and problem management, Configuration errors and missed security notices, System operation errors
  • Partners and suppliers—Disruption of telecom services, Disruption of utility services such as electric, gas, water, Hardware failure, Software failure, Lost mail and courier packages, Supply disruptions, Sharing confidential data with partners and suppliers

image-20230426150722235

Programme Criticality

  • The programme criticality framework is a common
    United Nations system framework for decision-making that puts in place a systematic structured approach that uses programme criticality as a way to ensure that programme activities can be balanced against security risks.

  • The concept of criticality means the critical impact of an activity on the population, not necessarily on the organisation.

  • Programme criticality assessment is mandatory in areas with residual risk levels of 'high' and 'very high', as determined in the Security Risk Assessments (SRAs).

  • Primary accountability for programme criticality is with United Nations senior management at the country level.

A programme criticality assessment has steps as follows:

  • 1. Establish geographical scope and timeframe

  • 2. List strategic results (SRs)

  • 3. List UN activities/outputs (involving UN personnel)

  • 4. Assess contribution to strategic results

  • 5. Assess likelihood of implementation

  • 6. Evaluate activities/outputs with PCI criteria

  • 7. View PC level results, form consensus within the UN system and approve final results

  • 8. Agree on a process to address and manage the results of the PC
    assessment

  • 9. Follow-up and review.

  • There are two possible criteria for an activity to be considered a PCI activity:

    • Either the activity is assessed as lifesaving (humanitarian or non-humanitarian) at scale (defined as any activity to support processes or services, including needs assessments), that would have an immediate and significant impact on mortality; or

    • The activity is a directed activity that receives the endorsement of the Office of the Secretary-General for this particular situation.

  • Risk level has no impact on programme criticality.
    There must be no consideration of risk level when determining PC.

  • Programme criticality has no impact on risk level.
    There must be no consideration of PC when determining risk level.

image-20230426152530740

Other Processes with Security Implications

  • Intelligence and Information Cycle.
  • Strategic-level Integrated Mission Planning Process.
  • Mission-level Integrated Planning, Coordination and implementation.
  • Mission Component Planning and Implementation
    Processes.
  • UN Budget Processes.
  • Staff Selection and Managed Mobility System.
  • Any other process that impacts the substance of UN security.

Security Patterns

Values of security pattern:

Security patterns can apply security principles, guide the design and implementation, guide the use of security mechanisms, help understanding and use of complex **standards** (XACML, WiMax), and are convenient for teaching security principles and mechanisms.

An Abstract Security Pattern (ASP), describes a conceptual security mechanism that realizes one or more security policies able to control (stop or mitigate) a threat or comply with a security-related regulation or institutional policy (no implementation aspects).

Conceptual security

  • Security is a quality aspect that constrains the semantic behavior of applications (by imposing access restrictions), so the requirements stage is the right development stage to start addressing security
  • However, we only want to indicate at this stage which specific security controls are needed, not their convenient or optimal implementation.
  • For example, in bank applications we only want to specify the semantic aspects of accounts, customers, and transactions with their corresponding restrictions.

An ASP example: Authenticator:

  • This is the Intent section of an Authenticator pattern: “When a user or system (subject) identifies itself to the system, how do we verify that the subject intending to access the system is who it says it is? Present some information that is recognized by the system as identifying this subject.
    After being recognized, the requestor is given some proof that it has been authenticated.”
  • Authentication restricts access to a system to only registered users; it handles the threat where an intruder enters a system and may try to perform unauthorized access to information
  • It is clear that there are many ways to perform this authentication, that go from manual ways, as done in voting places, to purely automatic ways, as when accessing a web site, but all of them must include the requirements of the abstract Authenticator

Authentication as an abstract function requires a basic sequence of activities. Concrete realizations of this sequence implement these steps in different ways, but all must perform these two steps:

  • The subject requests to enter a system, indicating its identity and presenting some proof of identity.
  • If the system recognizes the subject using its identity information, it grants her entrance to the system and provides her with a proof of authentication for further use. If not, the request is denied.

We can define a hierarchy of authentication patterns starting from the abstract Authenticator.
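A minimal sketch of these two abstract steps; all names are illustrative, and any real implementation would hash credentials and manage sessions securely rather than compare plain values:

```python
# Minimal sketch of the abstract Authenticator's two steps (illustrative).
import secrets

class Authenticator:
    def __init__(self):
        self._registered = {}   # identity -> proof-of-identity information

    def register(self, identity, proof_of_identity):
        self._registered[identity] = proof_of_identity

    def authenticate(self, identity, proof_of_identity):
        # Step 1: the subject presents its identity and a proof of identity.
        if self._registered.get(identity) == proof_of_identity:
            # Step 2: grant entry and return a proof of authentication.
            return secrets.token_hex(16)
        return None   # closed system: unrecognized subjects get no access

auth = Authenticator()
auth.register("alice", "correct horse battery staple")
print(auth.authenticate("alice", "correct horse battery staple"))  # a token
print(auth.authenticate("mallory", "guess"))                       # None
```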

Forces of the abstract Authenticator

  • Closed system. If the authentication information presented by the user is not recognized, there is no access. In an open system all subjects would have access except some who are blacklisted for some reason.

  • Registration. Users must register their identity information so that the system can recognize them later.

  • Flexibility. There may be a variety of individuals (users) who require access to the system and a variety of system units with different access restrictions. We need to be able to handle all this variety appropriately or we risk security exposures.

  • Dependability. We need to authenticate users in a reliable and secure way.

    This means a robust protocol and a high degree of availability. Otherwise, users may fool the authentication process or enter when the system authentication is down.

  • Protection of authentication information. Users should not be able to read or modify the authentication information. Otherwise, they can give themselves access to the system.

  • Simplicity. The authentication process must be relatively simple or the users or administrators may be confused. User errors are annoying to them but administrator errors may lead to security exposures.

  • Reach. Successful authentication only gives access to the system, not to any specific resource in the system. Access to these resources must be controlled using other mechanisms, typically authorization.

  • Tamper freedom. It should be very difficult to falsify the proof of identity presented by the user.

  • Cost. There should be tradeoffs between security and cost, more security can be obtained at a higher cost.

  • Performance. Authentication should not take a long time or users will be annoyed.

  • Frequency. We should not make users authenticate frequently. Frequent authentications waste time and annoy the users.

All these properties must be present in the lower-level ways of performing authentication, e.g. in a Password Authenticator (see next slide). A Password Authenticator needs to make concrete its Authentication Information (a list of passwords) and its proof of authentication (a session).

Reference Monitor

  • Authorization rules define who has access to what and how. They must be enforced when a process requests a resource
  • Each request for resources must be intercepted and evaluated for authorized access; this is the concept of Reference Monitor
  • An abstract concept, implemented as memory access manager, file permission checks, CORBA adapters, etc.
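A minimal sketch of the Reference Monitor idea: every resource request is intercepted and evaluated against the authorization rules (the rule table and names are illustrative):

```python
# Minimal Reference Monitor sketch (illustrative names and rules).
rules = {
    ("alice", "patient_record"): {"read"},
    ("bob", "patient_record"): {"read", "write"},
}

def reference_monitor(subject, resource, access_type):
    # Intercept the request and evaluate it against the rules.
    if access_type in rules.get((subject, resource), set()):
        return True    # request allowed to proceed
    raise PermissionError(f"{subject} may not {access_type} {resource}")

print(reference_monitor("alice", "patient_record", "read"))   # True
# reference_monitor("alice", "patient_record", "write")       # PermissionError
```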

Role-Based Access Control

  • Users are assigned roles according to their functions and given the needed rights (access types for specific objects)
  • When users are assigned by administrators, this is a mandatory model
  • Can implement least privilege and separation of duty policies
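A minimal RBAC sketch under these ideas: rights attach to roles and users are assigned roles (the roles and objects are invented; in a mandatory model administrators would assign the roles):

```python
# Minimal RBAC sketch (illustrative roles, objects, and assignments).
role_rights = {
    "doctor": {("patient_record", "read"), ("prescription", "write")},
    "nurse": {("patient_record", "read")},   # least privilege: no prescribing
}
user_roles = {"alice": {"doctor"}, "bob": {"nurse"}}

def allowed(user, obj, access):
    return any((obj, access) in role_rights[role]
               for role in user_roles.get(user, set()))

print(allowed("alice", "prescription", "write"))   # True
print(allowed("bob", "prescription", "write"))     # False
```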

XML firewall

Controls input/output of XML applications:

  • Well-formed documents (schema as reference)
  • Harmful data (wrong type or length)
  • Encryption/decryption
  • Sign and verify signatures in documents


Building secure systems

  • Secure systems need to be built in a systematic way where security is an integral part of the lifecycle, and the same applies to safety.
  • The platform should match the type of application, and all compliance, safety and security constraints should be defined at the application level, where their semantics are understood and propagated to the lower levels.
  • The lower levels must provide the assurance that the constraints are being followed, i.e., they implement these constraints and enforce that there are no ways to bypass them.
  • Following these ideas, the Authors of the book in reference developed a secure systems development methodology, which considers all lifecycle stages and all architectural levels. It is expanded with architectural aspects, and recently with process aspects.

Security methodology:

systematic way of introducing security into a software system during the development life-cycle

Consists of two aspects/facets: a security process (SP) and a conceptual security framework (CF)

ASE: a comprehensive security methodology for distributed systems

  • Many methodologies exist with different paradigms
  • Very important class is methodologies that use security patterns
  • ASE: a security methodology using patterns and related constructs designed specifically for general distributed systems

Basic security principles for system design

  • Security constraints must be defined at the highest layer, where their semantics are clear, and propagated to the lower levels, which enforce them.
  • All the layers of the architecture must be secure.
  • We can define patterns at all levels. This allows a designer to make sure that all levels are secured, and also makes easier propagating down the high-level constraints.
  • We must apply security in all development stages
  • A two-dimensional approach: time and space

Reference Architecture (RA)

• A Reference Architecture (RA) is a generic software architecture, based on one or more domains, with no implementation aspects
• An RA is reusable, extendable, and configurable.
• It specifies the components of the system, their individual functionalities and their mutual interaction.
• An RA can be considered as a compound pattern and its components described as patterns.
• In addition to domain models, an RA may include a set of use cases (UC) and a set of roles (R) corresponding to its stakeholders (actors).

We can measure security by counting the threats that have been neutralized by using patterns

Incident Response Plan

• Before system is released, an incident response plan should be created (compromise or failure)

• Contact personnel, availability and procedures for multiple levels of failure

• A minor glitch does not require CEO to be informed but should be documented

• Network admin or security admin to determine an appropriate response when incident occurs

Key elements of incident response plan:

  • Monitoring duties for the software in live operation

  • A definition for incidents

  • A contact for incidents

  • An emergency contact for priority incidents

  • A clear chain of escalation

  • Procedures for shutting down the software or components of the software

  • Procedures for specified exploits or attacks

  • Security documentation or the references for external code or hardware used

*It depends on the organization whether the person responsible is from the development team or outside.

evolving attacks / periodic review and archiving

Web Application Threats

  • Never trust the client at the server side
  • Never trust the browser on the client side
  • Never execute client input as code
  • Never allow client input to pass into the system without validation internally
  • Scrub client for any known exploits and suspect characters

XSS

CSRF

File Upload

Buffer Overflow

SQL Injection

Threat Modelling

• In a nutshell, threat modelling is the use of abstractions to help you consider risks.
• It involves developing a shared understanding of a product or service architecture and the problems that could happen.
• When you model threats, you typically use one of two types of model.

  1. The model for what you are building.
  2. The model for the threats of what you are building.

The Four-Step Framework

  1. Model System: model the system you're building, deploying or changing.
  2. Find Threats: find threats using the model.
  3. Address Threats: address the threats.
  4. Validate: validate the result for completeness and effectiveness.

(1) What are you Building?

  • Diagrams are a good way to communicate what you are building.
  • There are lots of ways to diagram software and you can start with a whiteboard diagram of how data flows through the system.

(2) What can go wrong?

  • Given a simple diagram, we can start thinking about what can go wrong. Example:

  • How do you know that the web browser is being used by the person you expect?

  • What happens if someone modifies data in the database?

  • Is it okay for information to move from one box to the next without encryption?

  • You can identify threats like these using the STRIDE approach.

  • Use STRIDE to walk through each part of the diagram:

  • Spoofing.

  • Tampering.

  • Repudiation.

  • Information Disclosure.

  • Denial of Service.

  • Elevation of Privilege.
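As a rough illustration of the walk, here is a minimal sketch that pairs each element of a simple data-flow diagram with each STRIDE category (the element names are invented):

```python
# Minimal sketch: walk STRIDE over each element of a simple DFD.
STRIDE = ["Spoofing", "Tampering", "Repudiation", "Information Disclosure",
          "Denial of Service", "Elevation of Privilege"]
elements = ["browser", "web app", "database", "browser -> web app flow"]

for element in elements:
    for threat in STRIDE:
        # In a real session each pairing becomes a question to the team,
        # e.g. "Could someone tamper with the browser -> web app flow?"
        print(f"{element}: consider {threat}")
```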

(3) What Are We Going To Do About It

• There are four types of action you can take against a threat:

  1. Mitigate it.

image-20230427001002917

image-20230427001035008

image-20230427001116901

image-20230427001224625

image-20230427001300080

image-20230427001333023

  2. Eliminate it. (For example, a VPN can be used to bypass geo-restrictions and access content that may be blocked in certain regions, reducing the risk of inadvertently visiting malicious websites or downloading malicious content.)

  3. Transfer it.

  4. Accept it.

(4) Did We Do A Good Job

  1. Look at the diagram, does it represent the system well?
  2. Look at the list of threats, did you find at least 5 threats per node in the diagram?
  3. Did you file a bug per threat?
  • When you find threats that violate your requirements and cannot be mitigated, it generally makes sense to adjust your requirements.
    Sometimes it’s possible to either mitigate the threat operationally, or defer a decision to the person using the system.
  • There are many other threat modelling techniques; we have only touched on the most popular approaches.
  • Some other threat modelling approaches include:
  • PASTA (Process for Attack Simulation and Threat Analysis)
  • CVSS (Common Vulnerability Scoring System)
  • HMM (Hybrid Threat Modelling Method)

Attack Trees

image-20230427001820867

  • Top level node (or root) represents the ultimate goal of an attacker.
  • The nodes (or leaves) represent sub goals that need to be achieved (together or independently) to arrive at the top level goal.

• An alternative to STRIDE (Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege).

• You can use attack trees as a way to find threats, or as a way to organize threats.

• Attack trees work well as a building block for threat enumeration in the four-step framework.

We can use attack trees to find the threats

Attack-Defence Scenarios

• Attack trees are used to model attack-defence scenarios.

• The attack-defence scenario is a game between two players:

• The proponent (denoted as p)

• The opponent (denoted as o)

  • The root of an attack tree represents the main goal of the proponent.

    • When the root is an attack node, the proponent is an attacker and the opponent is a defender.
    • Conversely, when the root is a defence node, the proponent is a defender and the opponent is an attacker.
  • Notations:

  • Attack nodes are represented with circles.

  • Defence nodes are represented with rectangles.

  • Refinements:

    • Refinements can be conjunctive (AND aggregation) or disjunctive (OR choice)
    • Refinement relations are indicated by solid edges between nodes.
    • Countermeasures are indicated by dotted edges.
    • A conjunctive refinement of a node is represented by an arc over all edges connecting the node and its children of equal type.

The basic steps to create an attack tree are as follows:

1. Decide on a representation.

Two types of trees:

AND trees (aggregation): the state of a node depends on all of the nodes below it being true (drawn with an arc across the child edges).

OR trees (choice): a node is true if any of its subnodes are true (no arc). A minimal evaluation sketch for both kinds is given after step 6 below.

2. Create a root node.

If the root node is a countermeasure mitigation action…

Then the subnodes are used to identify what can go wrong.

Decompose the mitigation action and identify how the results can be threatened.

If the root node is a goal of the attacker:

Then we consider ways to achieve that goal.

Each alternative way to achieve the goal should be drawn as a subnode.

3. Create subnodes.

• You typically create subnodes by brainstorming in some structured manner.

• The relation between subnodes and parent node can be AND or OR.

• Some possible structures for first-level subnodes include:

Attacking a system:

  • Physical access.
  • Subvert software.
  • Subvert a person.

Attacking a system via:

  • People.
  • Process.
  • Technology.

Attacking a product during:

  • Design.
  • Production.
  • Distribution.
  • Usage.

4. Consider completeness.

• An attack tree can be checked for quality by iterating over the nodes, looking for additional ways to reach the goal. It may be helpful to use STRIDE or attack libraries.

5. Prune the tree.

• In this step, go through each node in the tree and consider whether the action in each subnode is prevented or duplicative.

6. Check the presentation.

• Trees can be represented in two ways:

• As a free form (human viewable) model without any technical structure.

• As a structured representation with variable types and/or metadata to facilitate programmatic analysis.
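As promised under step 1, here is a minimal sketch of evaluating an AND/OR attack tree in code. The tree, the facts, and the node format are all invented for illustration:

```python
# Minimal sketch of evaluating an AND/OR attack tree: a leaf is achieved if
# its fact is true; an AND node needs all children, an OR node needs any.
def achieved(node, facts):
    children = node.get("children", [])
    if not children:
        return facts.get(node["name"], False)
    results = [achieved(child, facts) for child in children]
    return all(results) if node.get("kind") == "AND" else any(results)

# Illustrative tree: the root is the attacker's goal.
tree = {"name": "open safe", "kind": "OR", "children": [
    {"name": "pick lock"},
    {"name": "learn combo", "kind": "AND", "children": [
        {"name": "bribe employee"},
        {"name": "employee knows combo"},
    ]},
]}
print(achieved(tree, {"pick lock": True}))   # True: one OR branch suffices
```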

Human Viewable Representations

• Attack trees can be drawn graphically or shown in outline form.

• Care should be taken to ensure that the graphics are actually information rich and useful.

• Outline representations are easier to create than graphical representations, but they tend to be less attention-grabbing.

If you are using someone else's tree, be sure to understand their intent. If you are creating one, be sure you are clear on your intent and communicate it clearly.

Risk Analysis

Risk = (Probability of Incident) x (Incident Impact)

• When dealing with hostile risk, consider the following:

  • Vulnerabilities in systems are NOT constants.
  • Capabilities of the adversary are NOT constant.
  • Motivators for the adversary change instantaneously.
  • Adversaries can attack your system anytime.

Probability of Incident = Vulnerability x Threat x Motivation
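A minimal sketch of these two formulas together; all of the scores and the impact figure are made up:

```python
# Minimal sketch of the hostile-risk formulas above (all values made up).
vulnerability, threat, motivation = 0.6, 0.8, 0.5
incident_impact = 100_000   # e.g. cost of the incident in GBP

probability = vulnerability * threat * motivation   # 0.24
risk = probability * incident_impact                # 24,000
print(probability, risk)
```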

Security Functional Requirements

  • FAU: Security auditing.
  • FCO: Communication.
  • FCS: Cryptographic support.
  • FDP: User data protection.
  • FIA: Identification and authentication.
  • FMT: Security management.
  • FPR: Privacy.
  • FPT: Protection of security function.
  • FRU: Resource utilization.
  • FTA: Access.
  • FTP: Trusted path.

Security Mechanisms

  • There are several ways in which the security policy can be enforced.
  • Assurance requirements detail the ways in which security is demonstrated.
  • There are also different ways of evaluating these requirements.
  • Here are the FIPS standards.

image-20230427124928638

image-20230427125103365

forensics

5080 UofG

[TOC]

Revision and Reflection:

  1. Raphael Aerospace

The fictional Raphael Aerospace is a leading multinational, specialising in software engineering for defense systems. The organisation has seized a laptop from one of its employees during a routine search that occurs at the entry and exit point to its North America campus. The company is concerned that the laptop contains sensitive source code. The employee has refused to speak or cooperate with the organisation since the laptop was seized. The digital investigation team believes the laptop is password protected, employs software-based full-disk encryption and has been seized in sleep mode. The laptop has highly integrated components to achieve a ‘slim-line’ profile and potentially has a Trusted Platform Module (TPM). The investigation team needs to acquire the keys associated with decryption and encryption of hard disk contents.

a). Outline the SIX states of the Advanced Configuration and Power Interface (ACPI) and argue the relevancy to software-based full-disk encryption for each state.

The Advanced Configuration and Power Interface (ACPI) is a specification that defines power management and system configuration for computers. It establishes different power states, which play a significant role in the energy consumption and performance of a device. These power states are relevant to software-based full-disk encryption, as they can impact the security and accessibility of encrypted data. The six ACPI power states are:

  1. G0 (S0) - Working State: In this state, the system is fully powered on and operational. The CPU is executing instructions, and all peripherals are active. In the context of full-disk encryption, this state is when the system is most vulnerable to attacks. However, it’s also the state in which encryption and decryption processes can be performed.
  2. G1 (S1-S3) - Sleeping States: there are several sleep states, with S1 being the lightest; as the sleep state deepens, more components are powered down and the system consumes less power (S4, hibernate, is covered separately below). Full-disk encryption keys are typically stored in RAM or the TPM. In sleep states, the encryption keys remain in memory, making it possible to access encrypted data without re-entering the password. However, these states also increase the risk of cold boot attacks, where attackers can extract the keys from RAM if they can gain physical access to the machine.
  3. G2 (S5) - Soft Off State: In this state, the system is powered off, but some components may still receive power to support features like Wake-on-LAN. The encryption keys are not present in memory, so an attacker cannot extract them. However, encrypted data on the disk remains secure, and a password is required to access the system upon startup.
  4. G3 - Mechanical Off State: The system is completely powered off, and no components receive power. In this state, the encryption keys are not present in memory, and the encrypted data is secure. To access the data, the user must power on the system and enter the password.
  5. S4 - Hibernate State: In this state, the system’s state is saved to the hard drive before powering down. The encryption keys are not present in memory, making the data secure. However, the system will require the password upon waking from hibernation to access the encrypted data.
  6. C0-C3 - Processor Power States: These are the power states of the CPU itself, with C0 being the fully operational state and C3 being the deepest sleep state. These states are less relevant to full-disk encryption, as they primarily affect the CPU’s power consumption and performance. However, the encryption and decryption processes may be slower in deeper sleep states due to reduced CPU performance.

In summary, ACPI power states have varying degrees of relevancy to software-based full-disk encryption. The Working State (G0) and Sleeping States (G1) are the most relevant, as they impact the security and accessibility of encrypted data. The Soft Off State (G2), Mechanical Off State (G3), Hibernate State (S4), and Processor Power States (C0-C3) are less directly relevant but may still influence the security of encrypted data in certain situations.

b). Evaluate and describe THREE potential approaches to recover the keys associated with software-based full-disk encryption and argue for the optimal approach in the given context.

There are several potential approaches to recover the keys associated with software-based full-disk encryption (FDE). Here are three possible methods:

  1. Brute-force attack: This approach involves systematically trying every possible combination of characters to find the correct encryption key or password. Given enough time and computational resources, a brute-force attack will eventually succeed.

Pros:

  • Guaranteed to find the correct key or password eventually.

Cons:

  • Requires a significant amount of time and computational power, making it inefficient.
  • May be ineffective against long, complex passwords or strong encryption algorithms.
  2. Dictionary attack: This method involves using a pre-compiled list of words or phrases (a “dictionary”) to attempt to recover the encryption key or password. Dictionary attacks are faster than brute-force attacks since they rely on a smaller set of possibilities, usually based on known common passwords or phrases.

Pros:

  • Faster than brute-force attacks.
  • Effective against weak passwords or phrases.

Cons:

  • Less effective against strong, unique passwords or phrases.
  • Relies on the quality of the dictionary used.
  3. Cryptanalysis attack: This approach involves analyzing the encrypted data or the encryption algorithm itself to discover weaknesses or flaws that can be exploited to recover the encryption key. This method often requires deep knowledge of cryptography and a thorough understanding of the specific encryption algorithm used.

Pros:

  • Can be more efficient than brute-force or dictionary attacks.
  • Exploits weaknesses or flaws in the encryption algorithm itself, potentially making it more successful.

Cons:

  • Requires extensive knowledge and expertise in cryptography.
  • May not be successful if the encryption algorithm is well-designed and without significant weaknesses.

In the given context, the optimal approach to recover the keys associated with software-based full-disk encryption would depend on several factors, such as the strength of the encryption algorithm, the complexity of the password, and the available resources (time and computational power).

If the encryption algorithm is known to have weaknesses or flaws, a cryptanalysis attack could be the most efficient method to recover the keys. However, this approach requires a high level of expertise in cryptography.

In cases where the password is known to be weak or likely to be found in a dictionary, a dictionary attack would be the preferred approach, as it is faster and more efficient than brute-force attacks.

If no information about the password or encryption algorithm’s weaknesses is available, a brute-force attack could be the only viable option. However, this method may be time-consuming and resource-intensive.

Ultimately, the choice of the optimal approach will depend on the specific circumstances and the available resources. In many cases, a combination of these approaches may be required to successfully recover the keys associated with software-based full-disk encryption.
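For illustration, here is a minimal dictionary-attack sketch along the lines of approach 2. The recovered digest, the wordlist, and the use of plain SHA-256 are simplifying assumptions; real FDE schemes use slow key-derivation functions such as PBKDF2, which is precisely what makes these attacks expensive:

```python
# Minimal dictionary-attack sketch (purely illustrative, not a real FDE format).
import hashlib

recovered_digest = hashlib.sha256(b"letmein").hexdigest()   # assumed known

for candidate in ["password", "123456", "letmein", "qwerty"]:
    if hashlib.sha256(candidate.encode()).hexdigest() == recovered_digest:
        print("passphrase found:", candidate)
        break
```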

c). Raphael Aerospace is a British company, but the laptop was seized at the North American campus. The employee is a United Kingdom (UK) citizen and is concerned about the laws regarding software-based full-disk encryption in the United States (US). The employee believes that the UK will be a more favorable jurisdiction from the perspective of being forced to reveal any keys or passwords associated with encryption. Contrast the UK and US legal perspectives regards compelled decryption, speculate on the optimal jurisdiction in the given context.

The legal perspectives on compelled decryption differ between the United States (US) and the United Kingdom (UK). Here is a brief overview of the legal stances in both jurisdictions:

United States (US): In the US, the Fifth Amendment to the Constitution protects individuals from self-incrimination. This has been interpreted by some courts as providing protection against being forced to reveal encryption keys or passwords, as doing so could be seen as self-incrimination. However, the interpretation of the Fifth Amendment in the context of compelled decryption is not uniform across all courts, and some have ruled that individuals can be compelled to provide decryption keys or passwords under certain circumstances.

United Kingdom (UK): In the UK, the Regulation of Investigatory Powers Act 2000 (RIPA) governs the legal framework surrounding encryption and compelled decryption. Under RIPA, individuals can be legally compelled to provide encryption keys or passwords to law enforcement authorities when ordered to do so by a court. Failure to comply with such an order can result in severe penalties, including imprisonment.

In the given context, the employee’s belief that the UK might be a more favorable jurisdiction from the perspective of being forced to reveal encryption keys or passwords might be misguided. The UK’s legal framework, as established by RIPA, clearly allows for compelled decryption under certain circumstances, whereas the US legal system provides some degree of protection against self-incrimination through the Fifth Amendment.

However, it is important to note that the specific circumstances of the case and the legal interpretations of the relevant laws may vary, leading to different outcomes in each jurisdiction. The optimal jurisdiction would depend on various factors, including the details of the case and the stance of the courts involved.

In conclusion, while the US legal system may offer more protection against compelled decryption than the UK, there are no guarantees, and the optimal jurisdiction would depend on the specific circumstances and the courts involved. The employee should seek legal counsel to better understand their rights and potential risks in both jurisdictions.

  2. BCO Case

The fictional United States (US) Borders and Customs Office (BCO) wants to strengthen border controls. The BCO wants to ensure rigorous checks are possible at the border to ensure illegal digital content does not come across the border on physical devices. The BCO is particularly concerned that such files are hidden in unallocated space on drives. The BCO want a rapid process that can confirm a target file or traces of a target file are present on a suspected system. The suspected system can then be kept for further, deeper analysis. Argue an appropriate file carving technique and outline an implementation for the given context.

a). In the given context, the BCO aims to quickly identify whether a target file or traces of it are present on a suspected system, particularly in unallocated space on drives. An appropriate technique for this purpose is file carving; among the file carving techniques, Hash Based Carving is a suitable choice for the BCO's requirements.

Hash Based Carving involves creating hash values of known target files and comparing them to the hash values of data blocks in the unallocated space of a drive. This technique is fast and can efficiently identify complete or partial matches of target files on a suspected system.

The implementation of Hash Based Carving for the BCO’s context can be outlined as follows:

Preparation: Compile a list of known target files that the BCO is concerned about. Calculate the hash values for these files and store them in a database.

Drive imaging: At the border, if a suspicious device is identified, create a forensically sound image of the drive to avoid tampering with the original evidence.

Data extraction: Extract data blocks from the unallocated space of the drive image. Divide the extracted data into fixed-size blocks, which will be used for hash comparison.

Hash comparison: Calculate hash values for each data block extracted from the unallocated space. Compare these hash values with the hash values of the known target files stored in the database.

Identification: If a match is found between the hash values, it indicates that the target file or traces of it are present on the suspected system. In such cases, the system can be retained for further, deeper analysis.

Hash Based Carving allows the BCO to rapidly check devices at the border for illegal digital content, enabling them to focus on suspicious systems for more in-depth investigation. This technique not only minimizes false positives but also helps streamline the process of detecting and preventing the transportation of illegal digital content across the border.
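A minimal sketch of the block-hashing core of steps 3-4 above; the block size, function name, and the contents of the hash database are illustrative:

```python
# Minimal Hash Based Carving sketch: hash fixed-size blocks from the
# unallocated-space image and compare them with known target-file hashes.
import hashlib

BLOCK_SIZE = 4096
known_block_hashes = {"<sha256 of a target-file block>"}   # prepared database

def scan_image(path):
    hits = []
    with open(path, "rb") as image:
        offset = 0
        while block := image.read(BLOCK_SIZE):
            if hashlib.sha256(block).hexdigest() in known_block_hashes:
                hits.append(offset)   # candidate trace of a target file
            offset += BLOCK_SIZE
    return hits
```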

b). The BCO want to reduce the number of false positives as this can result in unnecessary workload and delays at the border. Evaluate your proposed approach in (a), indicate potential causes of false positives and argue how they can be addressed.

The proposed approach in (a) is Hash Based Carving. While it is a fast and efficient method to identify target files or their traces, it is not without potential causes of false positives. Here, we’ll evaluate the approach, highlight the possible reasons for false positives, and suggest ways to address them.

Potential causes of false positives:

Hash collisions: Although rare, hash collisions can occur when two different data blocks result in the same hash value. In such cases, the carving tool might falsely identify a non-target file as a target file, leading to false positives.

Partial matches: Hash Based Carving can detect partial matches of target files. However, there might be instances where the partial matches are unrelated to the target files, thus causing false positives.

Addressing false positives:

Utilize multiple hash algorithms: To minimize the chances of hash collisions, the BCO can use multiple hash algorithms (such as SHA-256, SHA-3, or others) and perform a comparison based on a combination of hash values. This approach significantly reduces the likelihood of false positives due to hash collisions.

Verify file headers and footers: In addition to hash comparison, the BCO can implement a secondary check for file headers and footers to ensure that the identified files are indeed the target files. By verifying the unique file signatures of known target files, the BCO can further minimize false positives.

Threshold-based matching: To address false positives due to partial matches, the BCO can set a threshold value for the level of similarity required for a match. By refining the matching criteria, the BCO can filter out unrelated partial matches, thereby reducing false positives.
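To illustrate the multiple-hash idea, a small sketch pairing SHA-256 with SHA3-256, so that a false positive would require a simultaneous collision in both algorithms; the block contents are illustrative:

```python
import hashlib

def dual_digest(block):
    """Pair two independent algorithms: a false match now requires a
    simultaneous collision in both SHA-256 and SHA3-256."""
    return (hashlib.sha256(block).hexdigest(),
            hashlib.sha3_256(block).hexdigest())

# Illustrative database of known-target block digest pairs.
known_pairs = {dual_digest(b"known target block")}

candidate = b"known target block"  # a block carved from unallocated space
if dual_digest(candidate) in known_pairs:
    print("match confirmed by both algorithms")
```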

By addressing these potential causes of false positives, the BCO can enhance the efficiency of the Hash Based Carving approach, ensuring a more accurate and streamlined process at the border. This will minimize unnecessary workload and delays, allowing the BCO to focus on genuine cases that require deeper analysis.

c). The BCO also want to ensure the legality of the approach. The BCO want to ensure the approach does not require a specific search warrant, as this would impact on the speed and efficiency of the approach in terms of border control. Argue the potential legal concerns and outline how they may be addressed in any implementation for the given context.

Potential legal concerns:

  1. Privacy rights: Conducting a file carving process on a suspected system may raise concerns about an individual’s right to privacy, as it involves searching and potentially extracting personal and private information without their consent.
  2. Search and seizure laws: Depending on the jurisdiction, searching an individual’s digital device without a specific search warrant could potentially violate search and seizure laws, which generally require law enforcement to obtain a warrant before conducting a search that infringes on an individual’s privacy.
  3. Chain of custody: Ensuring the integrity and admissibility of any digital evidence obtained through file carving in a court of law requires maintaining a proper chain of custody. This involves documenting every step of the evidence handling process, from the initial search to the final analysis.

Ways to address legal concerns:

  1. Establish clear policies and guidelines: Develop and implement clear policies and guidelines for border agents to follow when conducting file carving or other digital forensic searches. These guidelines should outline the circumstances under which such searches are permissible, the extent of the search, and the steps to be followed to ensure legal compliance.
  2. Train border agents: Provide regular training for border agents on the legal aspects of digital forensics and the proper procedures for conducting file carving and other digital searches. This can help minimize the risk of violating privacy rights and search and seizure laws.
  3. Obtain appropriate authorization: While the BCO aims to avoid the need for specific search warrants, it is essential to obtain the necessary legal authorization to conduct file carving searches. This could involve establishing a reasonable suspicion or probable cause before conducting a search, depending on the jurisdiction’s requirements.
  4. Implement a tiered search approach: To minimize potential privacy intrusions, consider implementing a tiered search approach that starts with less invasive techniques (such as basic keyword searches) and only escalates to more intrusive methods like file carving when there’s a reasonable basis for suspicion.
  5. Maintain proper documentation: Ensure that a proper chain of custody is maintained throughout the entire digital forensics process. This includes documenting every step of the evidence handling process, from the initial search to the final analysis, to ensure the admissibility of any evidence obtained in a court of law.

Ultimately, it is crucial for the BCO to consult with legal experts to develop a compliant and legally defensible approach to file carving and other digital forensic techniques at the border. This can help ensure that the method is both effective in identifying illegal digital content and respecting individual privacy rights and due process requirements.

  1. Conway Energy Case

Conway Energy is a large enterprise with many customers. The company recently discovered that an employee generated letters demanding missed payments from hundreds of customers. The employee used a variant of a standard company letter and altered it to instruct recipients to make payment into their bank account. The employee then lodged the letters with the corporate file store for automatic dispatch. The technical team state the letters can be retrieved, but have concerns as the corporate file store contains millions of documents and letters. The company legal and management team have approved an investigation by the technical team to extract the hundreds of generated letters. The technical team have uncovered a template for the fraudulent standard letter on a corporate workstation. The technical team have altered the letter to include a known affected customer name and address. The technical team then generated a hash of the file, but were unable to identify a match in the file store.

a). The management team are concerned that evidence discovered during the internal investigation may eventually be presented in court. The management team are confident the fraudulent standard letter has been seized legally with appropriate authority. However, the management team want to ensure the discovered standard letter is admissible evidence in court. Evaluate and argue if the uncovered fraudulent letter is admissible evidence to a court of law in the given context.

Digital forensics is defined as: the use of scientifically derived and proven methods towards the preservation, collection, validation, identification, analysis, interpretation and presentation of digital evidence derived from digital sources for the purposes of facilitating or furthering the reconstruction of events found to be criminal, or helping to anticipate unauthorised actions shown to be disruptive to planned operations.

Admissibility depends upon several factors: (1) authenticity, (2) relevancy, and (3) competency.

In the context of the discovered fraudulent standard letter, the following factors can be considered:

  1. Relevance: The term relevancy means that the information must reasonably tend to prove or disprove any matter in issue. The question or test involved is, “Does the evidence aid the court in answering the question before it?”. The fraudulent letter is directly related to the case at hand, as it demonstrates the employee’s actions to create and distribute the letters demanding missed payments. It is likely to be considered relevant evidence.
  2. Reliability: The technical team must be able to demonstrate that the letter was discovered through a reliable and consistent process, and that the investigation methods were accurate and thorough. Proper documentation of the investigation process, such as the steps taken to identify the fraudulent letter, can help establish its reliability.
  3. Authenticity: The term authenticity refers to the genuine character of the evidence. The court will want to ensure that the discovered letter is indeed the fraudulent standard letter created by the employee. The technical team should be prepared to provide evidence that confirms the letter’s authenticity, such as metadata, timestamps, and any other identifying information. A proper chain of custody should also be maintained to document the handling, storage, and transfer of the letter.
  4. Competency: Competent as used to describe evidence means that the evidence is relevant and not barred by any exclusionary rule. The competency of the evidence in Conway Energy’s case will depend on the technical team’s qualifications and expertise, the methods and techniques used in the investigation, proper documentation and record-keeping, and compliance with legal and procedural requirements. Ensuring these factors are addressed will increase the likelihood of the evidence being considered competent and admissible in court.

b). The technical team have uncovered more fraudulent letters, but hashes of each do not match any in the corporate file store. Upon closer inspection the technical team have determined that the employee has inserted words with the font colour set to white. The words are effectively ‘hidden’ to visual inspection as they are not easily observable. The technical team have generated a definitive list of the hidden words present in the fraudulent letters. The technical team are unconvinced that generating a hash of each fraudulent letter is an effective route. The technical team need to utilise a hashing approach that is able to identify homologous patterns between the known fraudulent letters and those in the file store. Devise and explain an effective hashing approach in the given context.

Since the traditional hashing approach does not seem to be effective in identifying the fraudulent letters, the technical team can explore alternative hashing techniques that focus on content-based similarity rather than exact file matches. One such approach is known as locality-sensitive hashing (LSH).

Locality-sensitive hashing is a technique used to identify similar documents by generating hashes that have a higher probability of colliding when the documents are similar. This approach is more effective in identifying homologous patterns between the known fraudulent letters and those in the file store.

Here’s a possible approach to implementing LSH in this context:

  1. Preprocess the documents: Convert all the documents in the corporate file store and the known fraudulent letters to plain text, including the hidden white text, to ensure a consistent format for comparison.
  2. Tokenize and create document vectors: Break the text of each document into tokens (e.g., words or phrases) and represent each document as a high-dimensional vector using techniques like term frequency-inverse document frequency (TF-IDF) or word embeddings. This process converts the documents into a suitable format for LSH.
  3. Implement locality-sensitive hashing: Apply an LSH algorithm to the document vectors. The algorithm will generate similar hashes for documents with similar content, making it easier to identify the fraudulent letters with homologous patterns in the file store.
  4. Set a similarity threshold: Determine an appropriate similarity threshold to identify potential matches. This threshold will depend on the specific LSH algorithm used and the desired balance between precision and recall.
  5. Compare and flag potential matches: Compare the LSH hashes of the known fraudulent letters with those in the corporate file store. Flag any documents with hashes that exceed the set similarity threshold for further investigation.
  6. Verify flagged documents: Manually review the flagged documents to ensure they are indeed fraudulent letters and not false positives. Make note of any discrepancies or issues to further refine the LSH algorithm or similarity threshold, if necessary.
  7. Preserve evidence and implement preventive measures: Once the fraudulent letters have been identified and extracted, preserve the evidence and consider implementing additional security measures to prevent similar incidents in the future.

By employing locality-sensitive hashing, the technical team can identify fraudulent letters with similar content patterns, even when the exact file hashes do not match. This approach should be more effective in detecting the hidden white text and other subtle alterations made by the employee.
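A minimal from-scratch sketch of the MinHash/LSH idea, assuming character 5-shingles, 64 hash functions, and 16 bands; a production system would use a tested library, and all parameters here are illustrative:

```python
import hashlib
import random

NUM_HASHES = 64  # MinHash functions in the signature
BANDS = 16       # LSH bands (4 rows per band)

random.seed(42)
SALTS = [random.getrandbits(64) for _ in range(NUM_HASHES)]

def shingles(text, k=5):
    """The set of character k-shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(text):
    """Signature: for each salt, the minimum salted hash over all shingles."""
    hashes = [int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")
              for s in shingles(text)]
    return [min(salt ^ h for h in hashes) for salt in SALTS]

def bands(signature):
    """Split the signature into bands; similar documents tend to share one."""
    rows = NUM_HASHES // BANDS
    return {tuple(signature[b * rows:(b + 1) * rows]) for b in range(BANDS)}

doc1 = "please make payment to the following account immediately"
doc2 = "please make payment to the following account without delay"
shared = bands(minhash(doc1)) & bands(minhash(doc2))
print("candidate pair:", bool(shared))  # True with high probability
```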

Alternatively, CTPH (Context-Triggered Piecewise Hashing, also known as fuzzy hashing) can be used:

  1. Preprocess the documents: Convert all the documents in the corporate file store and the known fraudulent letters to plain text, including the hidden white text, to ensure a consistent format for comparison.
  2. Apply CTPH algorithm: Implement a CTPH algorithm, such as ssdeep, to generate fuzzy hash values for each document. The algorithm will create hashes that are similar for documents with similar content.
  3. Set a similarity threshold: Determine an appropriate similarity threshold for comparing the generated fuzzy hashes. This threshold will depend on the desired balance between precision and recall in identifying similar documents.
  4. Compare and flag potential matches: Compare the fuzzy hashes of the known fraudulent letters with those in the corporate file store. Flag any documents with hashes that exceed the set similarity threshold for further investigation.
  5. Verify flagged documents: Manually review the flagged documents to ensure they are indeed fraudulent letters and not false positives. Make note of any discrepancies or issues to further refine the CTPH algorithm or similarity threshold, if necessary.
  6. Preserve evidence and implement preventive measures: Once the fraudulent letters have been identified and extracted, preserve the evidence and consider implementing additional security measures to prevent similar incidents in the future.

By employing CTPH, the technical team can identify fraudulent letters with similar content patterns, even when there are subtle differences such as the hidden white text. This approach should be more effective in detecting the homologous patterns between the known fraudulent letters and those in the file store compared to traditional hashing methods.
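A minimal sketch of the comparison step, assuming the ssdeep Python bindings are installed and that all documents have already been converted to plain text; the file names and the threshold of 60 are illustrative:

```python
import ssdeep  # CTPH / fuzzy-hashing bindings (assumed installed)

SIMILARITY_THRESHOLD = 60  # illustrative; tune for the precision/recall balance

# Fuzzy-hash the known fraudulent letter.
with open("fraudulent_letter.txt", "rb") as f:
    known_hash = ssdeep.hash(f.read())

# Compare against each document in the file store (names illustrative).
for path in ["store_doc_001.txt", "store_doc_002.txt"]:
    with open(path, "rb") as f:
        score = ssdeep.compare(known_hash, ssdeep.hash(f.read()))  # 0-100
    if score >= SIMILARITY_THRESHOLD:
        print(f"{path}: flag for manual review (similarity {score})")
```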

c). The technical team are concerned that the hashing approach devised in (b) may not be appropriate in the given context. Identify potential concerns with the hashing approach devised in (b) for the given context.

  1. Sensitivity to small changes: Although CTPH is designed to detect similar files, it may not always be sensitive enough to detect subtle differences, such as the hidden white text used in the fraudulent letters. This could lead to false negatives, where the technical team fails to identify some fraudulent letters.
  2. False positives: CTPH can sometimes produce false positives, where non-fraudulent documents are flagged as potentially fraudulent due to similarity in content or structure. This could result in the technical team spending time and resources on manually reviewing non-fraudulent documents.
  3. Scalability: Given that the corporate file store contains millions of documents, comparing the fuzzy hashes of the known fraudulent letters with those in the file store could be computationally expensive and time-consuming.
  4. Accuracy: The accuracy of CTPH in identifying fraudulent letters depends on the chosen similarity threshold. Setting an appropriate threshold can be challenging, as a high threshold might result in false negatives, while a low threshold could lead to false positives.
  5. Legal admissibility: There may be concerns about the legal admissibility of the evidence gathered using CTPH, as it relies on similarity rather than exact matches. The court may require additional validation or proof that the flagged documents are indeed fraudulent.

Given these concerns, the technical team should carefully consider whether the CTPH approach is appropriate for their specific context. They may need to explore alternative methods, such as advanced text analytics or machine learning techniques, to more accurately and efficiently identify the fraudulent letters in the file store. Additionally, the technical team should consult with legal professionals to ensure the chosen approach meets the requirements for evidence admissibility in court.

However, if we use LSH instead:

Using locality-sensitive hashing (LSH) as the hashing approach could be a suitable alternative for Conway Energy’s case, as LSH is designed to identify similar documents by generating hashes that have a higher probability of colliding when the documents are similar. This approach can be more effective in identifying homologous patterns between the known fraudulent letters and those in the file store.

Still, we need to consider the potential problems of LSH:

Locality-sensitive hashing (LSH) is a powerful technique for identifying similar documents, but it comes with some potential concerns that should be considered in the context of the Conway Energy case:

  1. False positives: LSH can produce false positives, where non-fraudulent documents are flagged as potentially fraudulent due to similarity in content or structure. This can lead to spending additional time and resources on manual review of non-fraudulent documents.
  2. False negatives: Depending on the chosen similarity threshold and LSH algorithm, LSH can also produce false negatives, where fraudulent documents are not flagged due to insufficient similarity in their LSH hashes. This can result in missing important evidence.
  3. Scalability: LSH requires a considerable amount of computation and storage, especially when dealing with large datasets like Conway Energy’s corporate file store. This can lead to increased processing time and resource requirements.
  4. Parameter selection: LSH algorithms often have several parameters that need to be fine-tuned, such as the similarity threshold, the number of hash functions, and the number of hash tables. Selecting appropriate parameters can be challenging and may require empirical testing and validation.
  5. Preprocessing and feature extraction: LSH relies on converting documents into high-dimensional vectors, which may require considerable preprocessing and feature extraction, such as tokenization, stemming, and text vectorization using techniques like TF-IDF or word embeddings. This can be computationally expensive and may introduce additional complexity.
  6. Legal admissibility: Similar to CTPH, there may be concerns about the legal admissibility of the evidence gathered using LSH, as it relies on similarity rather than exact matches. The court may require additional validation or proof that the flagged documents are indeed fraudulent.

Given these potential concerns, the technical team should carefully consider whether the LSH approach is appropriate for their specific context. They may need to explore alternative methods or combine LSH with other techniques to improve accuracy, efficiency, and legal admissibility. Additionally, the technical team should consult with legal professionals to ensure the chosen approach meets the requirements for evidence admissibility in court.

  1. Laputa University Case

The University of Laputa replaces computer systems for staff every five years. The management team have been informed by research staff that some systems have been replaced without sufficient notice and as a result important files have been lost. The management team have also been informed that some systems are being sold through various online auction websites, rather than being recycled.

The management team suspects a member of the systems support team has been selling the systems via online auction websites. The management team have authorised the digital investigations team to purchase several systems from online auction websites that they suspect have come from the institution. The management team have also authorised the digital investigation team to utilise appropriate data recovery techniques to recover files.

a). The digital investigations team want to recover any previous Personal Storage Table (PST) files from many of the systems they have purchased from online auction websites. The digital investigations team believe such a file, in general, will not be heavily fragmented due to the numerous approaches adopted by modern file systems. Argue whether the position of the digital investigations team is accurate.

The digital investigations team’s position that Personal Storage Table (PST) files, in general, will not be heavily fragmented due to the numerous approaches adopted by modern file systems is mostly accurate. However, some factors can still contribute to the fragmentation of PST files, even on modern file systems.

Modern file systems, such as NTFS, HFS+, and ext4, are designed to minimize fragmentation by using various allocation strategies and techniques. These file systems attempt to keep related data blocks close together and allocate new blocks in a way that minimizes fragmentation. As a result, the overall fragmentation of files on these file systems tends to be less severe compared to older file systems like FAT.

However, PST files, which are used by Microsoft Outlook to store email messages, contacts, and other data, can still become fragmented under certain conditions:

Large file sizes: PST files can grow quite large over time, especially if users have many emails and attachments. Large files can be more susceptible to fragmentation as they are more likely to be allocated in non-contiguous blocks.

Frequent updates: PST files are updated frequently as new emails are received, sent, or deleted. These updates can lead to fragmentation as the file system may need to allocate new blocks to accommodate the changes in file size and content.

Insufficient free space: If there is insufficient free space available on the storage device, it may be challenging for the file system to allocate contiguous blocks for new or updated data, resulting in fragmentation.

Multiple concurrent users: In a shared environment, multiple users might be accessing and modifying different files simultaneously. This can create a higher likelihood of fragmentation as the file system attempts to allocate blocks for various files concurrently.

While modern file systems are better at managing fragmentation, it is still essential for the digital investigations team to consider the factors mentioned above when attempting to recover PST files from the purchased systems. Fragmentation can affect the ease and success of the data recovery process, and the team may need to employ specialized data recovery tools or techniques to recover fragmented PST files effectively.

b). The digital investigations team eventually assume the PST files they want to extract from the purchased systems are likely to comprise of more than two fragments and the relevant clusters are not necessarily in sequence. Devise and justify a carving approach to recover a single file in the given context. Highlight any limitations or constraints in the proposed solution.

In the given context, the digital investigations team can employ a carving approach that combines signature-based carving and file system metadata analysis to recover the fragmented PST files. This approach involves the following steps:

  1. Signature-based carving: Scan the storage device for known file signatures or magic numbers associated with PST files. This process can help identify the starting point of each PST file fragment. PST files begin with the magic number !BDN (0x21 0x42 0x44 0x4E); the ANSI (Outlook 97-2002) and Unicode (Outlook 2003 and later) formats are distinguished by a version field in the header rather than by the signature itself. A sketch of this scan is given after the list below.
  2. File system metadata analysis: Analyze the file system metadata to gather information about the allocation and location of clusters associated with PST files. This can help identify the correct sequence of fragmented clusters and uncover additional fragments that may not have been detected through signature-based carving.
  3. Cluster chaining: Once the starting points of the file fragments and their metadata are identified, attempt to reconstruct the file by chaining the clusters in the correct order based on their allocation in the file system. This can be done using specialized data recovery tools or custom-built scripts.
  4. File validation: After the PST file has been reconstructed, validate its integrity by checking its internal structure and attempting to open it using a compatible email client or PST viewer. This step helps ensure that the recovered file is complete and functional.
  5. Iterative refinement: If the initial reconstruction is unsuccessful, refine the carving approach by adjusting parameters, such as the search window for signature-based carving or the cluster allocation strategy. Repeat the process until a successful recovery is achieved or it becomes clear that the file cannot be recovered.
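A minimal sketch of the signature-scanning step, assuming a 4096-byte cluster size, cluster-aligned file headers, and an illustrative image name:

```python
PST_MAGIC = b"!BDN"  # 0x21 0x42 0x44 0x4E, the PST header signature
CLUSTER = 4096       # assumed cluster size

def find_pst_headers(image_path):
    """Return the offsets of cluster-aligned PST signatures in an image."""
    hits = []
    with open(image_path, "rb") as img:
        offset = 0
        while True:
            cluster = img.read(CLUSTER)
            if not cluster:
                break
            if cluster.startswith(PST_MAGIC):
                hits.append(offset)
            offset += CLUSTER
    return hits

print(find_pst_headers("purchased_system.dd"))  # illustrative image name
```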

Limitations and constraints of the proposed solution:

  1. Incomplete recovery: The carving approach may not always be successful in recovering the entire PST file, particularly if some fragments are missing or corrupted.
  2. Time-consuming: This process can be time-consuming, especially when dealing with large PST files or complex fragmentation patterns.
  3. False positives: Signature-based carving can sometimes produce false positives, where unrelated data is mistakenly identified as part of the PST file.
  4. Expertise required: The proposed carving approach requires a certain level of expertise in data recovery and file system analysis, as well as access to specialized tools or custom scripts.

Despite these limitations, the proposed carving approach should provide the digital investigations team with a robust method for recovering fragmented PST files from the purchased systems. The team may need to iterate and refine the approach as necessary to maximize the chances of successful file recovery.

c). The digital investigations team have since learned that they need to recover several PST files from each system, not just a single PST file. The digital investigations team have decided that the speed of recovery of the multiple files is more important than the accuracy of recovery. Devise a carving approach to recover multiple files in the given context. Highlight any limitations or constraints in the proposed solution.

In the given context, where the speed of recovery is more important than accuracy, the digital investigations team can employ a streamlined carving approach to recover multiple PST files from each system. This approach involves the following steps:

  1. Signature-based carving: Perform a bulk scan of the storage device for the PST magic number !BDN, as in (b). This process helps identify the starting point of each PST file fragment.
  2. File size estimation: Estimate the size of each PST file based on the distance between consecutive file signatures. This can help in the quick recovery of files without needing extensive file system metadata analysis.
  3. File extraction: Extract the identified file fragments based on the estimated size and signature locations. This step may involve some level of over-extraction or under-extraction to ensure that complete files are recovered, at the expense of potential inaccuracies.
  4. File validation (optional): If time permits, validate the integrity of the recovered PST files by checking their internal structure and attempting to open them using a compatible email client or PST viewer. This step can help identify any major issues with the recovered files.

Limitations and constraints of the proposed solution:

  1. Inaccurate recovery: By prioritizing speed over accuracy, the carving approach may result in inaccurately recovered PST files, with potentially missing or corrupted data.
  2. False positives: Signature-based carving can produce false positives, where unrelated data is mistakenly identified as part of the PST file. This may lead to the recovery of irrelevant or incomplete files.
  3. File fragmentation: This approach does not account for fragmented files, which may result in incomplete recovery of some PST files.
  4. File validation: Skipping or minimizing the file validation step can increase the risk of recovering unusable or corrupted files.
  5. Expertise required: The proposed carving approach requires a certain level of expertise in data recovery and the ability to quickly analyze and adapt to the specific storage device’s conditions.

Despite these limitations, the proposed carving approach should provide the digital investigations team with a faster method for recovering multiple PST files from the purchased systems. The team may need to accept the trade-off between speed and accuracy, understanding that some of the recovered files may be incomplete or corrupted.

d). The digital investigations team have recovered the PST file from one of the systems purchased online with the revelation that the PST does not belong to any researcher or member of the staff at the University of Laputa. The digital investigations team actually suspect the file might belong to another university. The digital investigations team have decided to investigate the system further to identify the specific individual. Argue whether the actions of the digital investigation team are appropriate in the given context.

In the given context, the actions of the digital investigations team can be seen as both appropriate and inappropriate, depending on the objectives and the ethical considerations involved.

Arguments for the appropriateness of the digital investigations team’s actions:

  1. Prevent potential misuse of data: The recovery of a PST file that does not belong to any researcher or staff member at the University of Laputa raises concerns about the potential misuse of the data contained within it. Investigating the system further could help the team understand how this file ended up on the system and prevent any potential misuse of the information.
  2. Uphold data privacy and security: Universities are responsible for protecting the privacy and security of personal and sensitive information. Investigating the origin of the unknown PST file and identifying the individual it belongs to could help the team ensure that the university is upholding its data protection obligations.

Arguments against the appropriateness of the digital investigations team’s actions:

  1. Privacy concerns: Investigating the contents of a PST file that does not belong to a member of the University of Laputa could be seen as an invasion of privacy. The team should consider the ethical implications of accessing someone else’s personal data without their consent.
  2. Legal considerations: The digital investigations team should be aware of any legal implications associated with accessing and analyzing data that does not belong to their institution. There might be laws and regulations that govern the handling of such data, and the team should ensure they are acting within the legal framework.
  3. Scope of investigation: The primary objective of the investigation was to determine whether a member of the systems support team was selling university-owned systems online. The discovery of a PST file that does not belong to any researcher or staff member may not be directly relevant to this objective. The team should consider whether further investigation of the file falls within the scope of their initial mandate.

In conclusion, the actions of the digital investigations team can be considered appropriate if they are conducted within legal and ethical boundaries and if they serve a legitimate purpose, such as protecting data privacy and security. However, the team should carefully weigh the potential risks and implications of their actions, ensuring they do not infringe upon the privacy rights of individuals or act outside the scope of their initial investigation.

Sample exam paper 2020

The management team for Lime Legal, a large legal firm that conducts numerous digital investigations, has decided to develop its own hash function for use in digital investigations. The management team has commissioned a specialised software developer to design and implement the hash function. The specialised software developer states that compression is an important requirement for a hash function.

a). Argue for another TWO important requirements for a hash function in the given context. (approximately 200 words)

To design a hash algorithm, the following requirements need to be satisfied:

  1. The original data cannot be derived from the hash value (a hash algorithm is essentially a one-way function).
  2. Sensitivity to input: changing even 1 bit of the input data must produce a different hash value.
  3. Low collision probability: for different original data, the probability of producing the same hash value must be very small.
  4. High execution efficiency: even for long inputs, the hash value can be computed quickly.
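The input-sensitivity requirement (point 2, often called the avalanche effect) is easy to demonstrate with a standard hash such as SHA-256; flipping one input bit changes roughly half of the output bits:

```python
import hashlib

msg = bytearray(b"transfer 100 GBP")
h1 = hashlib.sha256(msg).hexdigest()

msg[0] ^= 0x01  # flip a single bit of the input
h2 = hashlib.sha256(msg).hexdigest()

changed = bin(int(h1, 16) ^ int(h2, 16)).count("1")
print(h1)
print(h2)
print(f"{changed} of 256 output bits changed")  # typically close to 128
```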

While compression is indeed an important requirement for a hash function, there are two other crucial requirements that must be considered for Lime Legal’s digital investigations: security and performance.

Firstly, security is paramount for a hash function in the context of digital investigations. A secure hash function needs to exhibit several properties, including collision resistance, preimage resistance, and second preimage resistance. Collision resistance ensures that it is computationally infeasible to find two distinct inputs that map to the same hash output, which is crucial to maintain the integrity of the evidence. Preimage resistance makes it difficult to find an input for a given hash output, while second preimage resistance ensures that it is hard to find a different input with the same hash output as an existing input. These security properties are essential for Lime Legal’s work as they ensure the reliability and trustworthiness of the digital evidence in legal proceedings.

Secondly, performance is another key requirement for a hash function in digital investigations. Lime Legal’s work likely involves processing large volumes of data in a timely manner. As such, the hash function must be efficient in terms of computational and memory requirements. A fast and resource-efficient hash function will not only minimize the time spent on processing the data but also reduce the likelihood of bottlenecks in the investigation process. This enables Lime Legal to provide more effective and timely services to their clients.

In conclusion, besides compression, security and performance are two essential requirements for a hash function in the context of Lime Legal’s digital investigations. Ensuring a secure and efficient hash function will not only maintain the integrity of digital evidence but also enhance the effectiveness of the firm’s investigative processes.

b). The specialised software developer states that elements of the bespoke hash function will rely upon some internal initial values and constants. The specialised software developer states that these values and constants will be generated using a sophisticated and secret algorithm. The specialised software developer informs the management team that the initial values and constants will be made public along with the design and implementation details, but the algorithm to generate them will be kept secret and managed by the company. Argue whether the approach favoured by the specialised software developer is appropriate in the given context. (approximately 400 words)

The approach favoured by the specialised software developer raises several problems. One of the primary concerns is the lack of transparency in the process. In the field of cryptography, it is widely accepted that security should rely on the strength of the algorithm rather than the secrecy of its design; this principle is known as Kerckhoffs’s principle. By keeping the algorithm for generating initial values and constants secret, Lime Legal risks undermining the trust and credibility of their hash function. Digital evidence generated using a hash function with undisclosed components may face challenges in legal proceedings, as opposing parties could question its integrity.

Additionally, the secrecy of the algorithm prevents independent verification and analysis by the broader cryptographic community. Peer review and open scrutiny are essential to establishing the security and reliability of cryptographic algorithms. Closed-source designs may contain unintentional flaws or vulnerabilities that would otherwise be identified and resolved through a transparent review process.

Furthermore, the reliance on a secret algorithm for generating initial values and constants introduces the possibility of a single point of failure. If the secret algorithm is compromised, the entire hash function could be rendered insecure, potentially jeopardizing ongoing and past investigations.

In conclusion, the approach favored by the specialized software developer is not appropriate in the given context. Lime Legal should consider adhering to established cryptographic principles and industry best practices, which emphasize transparency, open scrutiny, and independent verification to ensure the credibility and robustness of their bespoke hash function.

c). The specialised software developer is not entirely sure how to design the bespoke hash function. Devise a potential hash function that exhibits a Merkle-Damgård construction, highlight and argue the importance of any core components. (approximately 400 words)

d). The management team want to employ the use of the bespoke hash function to identify unauthorised files on employee smartphones and laptops. The specialised software developer states they can develop a system that can be used to rapidly inspect employee smartphones and laptops as part of random security searches as employees leave campus. A member of the management team is concerned that such a process may violate the privacy of the employee and some employees may feel targeted. Argue whether the approach favoured by the management team is appropriate in the given context. (approximately 250 words)

  1. Orange Entertainment

The management team for Orange Entertainment want to recover files from a Microsoft Windows 10 workstation that have been destroyed by a disgruntled employee. The management team believe the employee destroyed the files as they had been manipulating them for their own gain over several months. The management team have authorised the systems support team to recover the files as part of their investigation. The systems support team have allocated trainees Bill and Ben to lead the investigation and recover the files.

a). Ben has identified ShadowExplorer as a useful tool to recover files from the Microsoft Windows 10 workstation. Discuss TWO relevant features of the ShadowExplorer tool and argue the relevance in the given context. (approximately 200 words)

ShadowExplorer is a valuable tool for recovering lost or damaged files, offering two key features that make it particularly relevant for the Orange Entertainment management team’s investigation.

  1. Access to Shadow Copies: One of the primary features of ShadowExplorer is its ability to access and browse the shadow copies of files created by the Windows Volume Shadow Copy Service (VSS), which is relevant here because the workstation runs Windows 10. These shadow copies act as snapshots of the files and their respective states at different points in time. In the context of Orange Entertainment’s investigation, this feature is crucial as it allows Bill and Ben to potentially recover earlier versions of the manipulated files. By restoring these earlier versions, the management team can gain insight into the disgruntled employee’s actions and better understand the extent of the manipulation.
  2. User-friendly interface: ShadowExplorer’s intuitive and user-friendly interface is another important feature that makes it suitable for the investigation. The tool presents a familiar Explorer-like interface, enabling Bill and Ben to easily navigate through the shadow copies and locate the relevant files. This ease of use will help streamline the recovery process, allowing the trainees to efficiently identify and restore the destroyed files. Furthermore, since both Bill and Ben are trainees, the simplicity of the tool will make it easier for them to learn and utilize in their investigation, reducing the chances of making mistakes during the recovery process.

In summary, ShadowExplorer’s ability to access shadow copies and its user-friendly interface make it an ideal tool for Bill and Ben to recover the destroyed files, providing Orange Entertainment’s management team with the information they need to assess the situation and address the employee’s misconduct.

b). Bill has decided to use ShadowExplorer on the employee workstation in-situ, but Ben is concerned whether such an approach is appropriate. Ben also suggests the pair should at least take some simple notes of their actions; Bill argues it is not necessary. Critique the differing positions of Bill and Ben in the given context. (approximately 300 words)

Bill’s decision to use ShadowExplorer directly on the employee workstation in-situ might seem efficient and time-saving; however, Ben’s concerns are valid, particularly in the context of an investigation where preserving evidence and maintaining a clear chain of custody is crucial.

Using ShadowExplorer in-situ poses several risks. First, the process might inadvertently alter the state of the workstation, potentially corrupting or overwriting evidence. Such modifications can jeopardize the integrity of the investigation and might also impact the legal admissibility of the evidence, should the management team decide to pursue legal action against the disgruntled employee. Instead, it is more appropriate to create a forensic image of the hard drive and work on a copy of that image to ensure the original data remains unaltered.

Second, working directly on the employee workstation increases the risk of accidental data loss or damage, especially given that both Bill and Ben are trainees. Utilizing a forensic copy provides a safety net, allowing them to revert to the original state if any mistakes are made during the recovery process, keeping the investigation repeatable.

Regarding the documentation of their actions, Ben’s suggestion to take simple notes is actually a necessary step in a proper investigation. Maintaining detailed records of their actions, tools used, and findings is essential for several reasons:

  1. Accountability: Documenting the investigation process ensures that all actions taken can be justified and reviewed, which helps maintain the credibility and integrity of the investigation.
  2. Reproducibility: Detailed notes allow others, including senior team members or external experts, to review and reproduce the steps taken in the investigation if needed, helping to validate the findings.
  3. Legal purposes: Should the case go to court, proper documentation is vital for establishing the chain of custody and proving the legitimacy of the evidence obtained.

In conclusion, Ben’s concerns about using ShadowExplorer in-situ and the need for documentation are valid. Adopting a more cautious approach that preserves the integrity of the evidence and maintains a clear record of their actions will not only improve the quality of the investigation but also ensure that the recovered data can be used effectively in any potential legal proceedings.

2019 Sample Paper

1. Janus in BBFS

Janus is a software engineer for Bill and Ben Financial Services (BBFS). Janus has concerns about algorithms that unfairly disadvantage business customers. Janus has raised the issue with his line manager, but she was uninterested in his concerns. Janus decides to effectively smuggle elements of source code and associated documents outside the organisation using an external disk.

(a) Janus is aware that the company utilises forensic techniques to identify encrypted files on external disks taken outside the organisation. Janus decides to smuggle the data via an external disk using steganography. Contrast steganography with cryptography and argue for steganography in the given context.[4]

Steganography and cryptography are two distinct techniques used for protecting and concealing data. While both methods have their applications, steganography might be more suitable for Janus’s situation due to its ability to hide information within other data.

Steganography involves concealing information within another file or data stream, such as an image, audio, or video file, in such a way that it is virtually undetectable to an observer. The information is embedded in the carrier file without changing its perceptible characteristics, making it difficult to identify the presence of hidden data. This technique allows for the secret transfer of information, as the carrier file appears innocuous and attracts little suspicion.

Cryptography, on the other hand, focuses on encrypting data to make it unreadable and incomprehensible to unauthorized parties. While cryptography protects the contents of a message, it does not hide the fact that encrypted data exists. Encrypted files can draw attention and raise suspicion, potentially leading to further investigation.

In the given context, steganography might be more suitable for Janus’s needs. Since BBFS uses forensic techniques to identify encrypted files, using cryptography to protect the data on an external disk could raise red flags and make it more likely for Janus’s actions to be discovered. Steganography, however, would allow Janus to hide the source code and documents within seemingly harmless files, avoiding detection by BBFS’s security measures. By employing steganography, Janus can minimize the risk of his actions being discovered while still smuggling the data out of the organization.

(b) Janus has decided to use the Bit Plane Complexity Segmentation (BPCS) algorithm to ensure high-capacity use of vessel images. Janus wants to ensure that insertion of the payload will not result in images that are vulnerable to human visual inspection. Janus plans to use several holiday images in Pure Binary Coding (PBC) with many ‘noisy’ qualities, e.g. sand and rain. Explain THREE operations of the BPCS algorithm to ensure the payload is effectively hidden in the given context.[9]

BPCS steganography (Bit-Plane Complexity Segmentation steganography) is a form of digital steganography.

Digital steganography can hide confidential data (i.e. secret files) very securely by embedding it into media data known as 'container data'. Container data is also called 'carrier, cover, or dummy data'. In BPCS steganography, true-colour images (i.e. 24-bit colour images) are mainly used as the vessel data. The embedding operation in practice replaces the 'complex regions' on the bit planes of the vessel image with the confidential data. The most important aspect of BPCS steganography is that its embedding capacity is very large. Simple image steganography that uses only the least significant bits can (for a 24-bit colour image) embed data equivalent to only 1/8 of the total size, whereas BPCS steganography uses multiple bit planes and can therefore embed far more data, although this depends on the individual image. For a 'normal' image, roughly 50% of the data can be replaced with secret data before image degradation becomes noticeable.

The Bit-Plane Complexity Segmentation (BPCS) algorithm is an effective steganographic method for hiding data in images, particularly those with 'noisy' qualities. In Janus's case, using holiday images containing sand and rain works in favour of BPCS steganography. The following three key operations of the BPCS algorithm help ensure the payload is effectively hidden:

Decomposition into bit planes: The BPCS algorithm first decomposes the vessel image into a series of bit planes. Each bit plane represents a different level of significance in the binary representation of the image. For example, the most significant bit (MSB) plane contains the highest-contrast information, while the least significant bit (LSB) plane contains the lowest-contrast information. By decomposing the image in this way, BPCS can manipulate the lower-contrast, 'noisier' bit planes to embed the payload without causing noticeable changes to the image.

Complexity calculation and segmentation: BPCS evaluates the complexity of each bit-plane segment by calculating the proportion of bordering pixel pairs with different values (0 or 1). If a segment's complexity exceeds a predefined threshold, it is considered 'noisy' and suitable for embedding payload data. In Janus's case, using images containing sand and rain increases the number of complex segments available for data hiding. This helps ensure the payload is well hidden and difficult to detect through human visual inspection.

Adaptive data embedding: BPCS embeds the payload into the complex segments identified in the previous step. By adaptively selecting suitable segments according to their complexity, BPCS ensures the embedded data does not cause noticeable changes to the image. The algorithm replaces the original complex segments with payload data in a way that preserves the overall complexity of the image. This adaptive embedding process is essential for hiding the payload effectively in the given context, as it reduces the risk of detection by visual inspection.

By using the BPCS algorithm, Janus can exploit the noisy characteristics of the holiday images to hide the payload effectively. Decomposition into bit planes, complexity calculation and segmentation, and adaptive data embedding work together to ensure the hidden data is difficult to detect while minimally affecting the visual quality of the images.

(c) Janus is planning on implementing the BPCS algorithm on his workstation so as to ensure he can embed payload data in the vessel images. Devise a simple BPCS algorithm to embed payload data in vessel images.[6]

A simple BPCS algorithm for embedding payload data in vessel images can be broken down into the following steps:

  1. Image Preparation: Convert the vessel image to a suitable format, such as a lossless format like PNG or BMP, to prevent compression artifacts from affecting the steganography process. Resize the image if necessary to accommodate the payload data.
  2. Bit Plane Decomposition: Decompose the vessel image into a series of bit planes. Separate the image into its color channels (e.g., red, green, and blue) and represent each channel using binary values. Then, create bit planes for each level of significance, from the most significant bit (MSB) to the least significant bit (LSB).
  3. Payload Preparation: Convert the payload data into binary format. You may consider compressing and encrypting the data beforehand to further protect and optimize the payload.
  4. Complexity Calculation and Segmentation: Evaluate the complexity of each bit plane segment by calculating the proportion of bordering pixel pairs with different values (0 or 1). If a segment’s complexity surpasses a predefined threshold, it is considered ‘noisy’ and suitable for embedding the payload.
  5. Adaptive Data Embedding: Iterate through the noisy segments identified in the previous step, and embed the payload data by replacing the original complex segments. Ensure that the embedding process maintains the overall complexity of the image to avoid arousing suspicion.
  6. Image Reconstruction: Reassemble the modified bit planes into their respective color channels, and then combine the channels to create the final stego-image. Save the stego-image in a lossless format to preserve the embedded data.
  7. Payload Extraction: To extract the payload data from the stego-image, reverse the process by decomposing the stego-image into bit planes, identifying the noisy segments where the payload data was embedded, and reconstructing the original payload data from the binary values stored in those segments.

By following these steps, Janus can implement a simple BPCS algorithm to effectively embed payload data in vessel images. The use of noisy segments for data embedding makes the hidden information difficult to detect through visual inspection, providing a level of security for the concealed data.
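A minimal sketch of the bit-plane decomposition and border-complexity measure in Python with NumPy; the 8x8 segment size and the 0.3 threshold are illustrative assumptions:

```python
import numpy as np

THRESHOLD = 0.3  # assumed complexity threshold (maximum complexity is 1.0)

def bit_plane(channel, k):
    """Extract bit plane k (0 = LSB) from an 8-bit image channel."""
    return (channel >> k) & 1

def complexity(segment):
    """Border complexity: fraction of adjacent pixel pairs that differ."""
    h = np.count_nonzero(segment[:, 1:] != segment[:, :-1])
    v = np.count_nonzero(segment[1:, :] != segment[:-1, :])
    rows, cols = segment.shape
    return (h + v) / (rows * (cols - 1) + (rows - 1) * cols)

# Illustrative 8x8 segment taken from one channel of a noisy image.
rng = np.random.default_rng(0)
channel = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
segment = bit_plane(channel, 0)  # the LSB plane is typically noisy
if complexity(segment) > THRESHOLD:
    print("segment is complex enough to carry payload data")
```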

(d) Janus is concerned that inspection techniques will reveal use of the steganography approach. Janus is concerned he will be legally required to reveal the devised algorithm in (c) to relevant authorities under the (UK) Regulation of Investigatory Powers Act 2000 (RIPA). Argue whether Janus would be required to reveal the devised algorithm in (c) under the (UK) Regulation of Investigatory Powers Act 2000 (RIPA).[4]

The Regulation of Investigatory Powers Act 2000 (RIPA) in the UK provides a legal framework for the use of investigatory powers by authorities, including the interception of communications, the acquisition of communications data, and the use of covert human intelligence sources, among others.

Under Part III of RIPA, authorities may legally demand the disclosure of protected information, which includes encrypted data or keys necessary to decrypt the information. If served with a notice under RIPA, individuals or organizations are required to provide the requested information or assistance, or face penalties for non-compliance.

In the case of Janus, if the relevant authorities become aware of his use of steganography and suspect that he has hidden sensitive or illegal information within the images, they may issue a notice under RIPA, requiring Janus to disclose the hidden data or provide the necessary means to access it. This could potentially include revealing the BPCS algorithm devised in (c).

However, RIPA notices are typically issued when there is a justified need for access to the protected information, such as in cases of national security, crime prevention, or public safety concerns. Whether Janus would be required to reveal the devised algorithm under RIPA would depend on the specific circumstances of his case and whether the authorities deem it necessary to obtain the hidden data for a lawful purpose.

It is essential for Janus to consider the legal implications of his actions and consult with a legal professional if he has concerns regarding the use of steganography and potential requirements under RIPA.

2. Pagli and Antonellis

Pagli and Antonellis are novice cyber system forensic investigators and have started a small start-up business. The pair have invested in two, basic laptop computers. The pair have been contracted by a large company to investigate an employee workstation. The company has multiple workstations, comprising of basic components, e.g. limited processing capabilities. The company management are particularly interested in specific Microsoft Word documents. The company state the workstations do not make use of any anti-forensics techniques, e.g. full-disk encryption.

(a) Pagli and Antonellis have recovered the files of particular interest to company management but have discovered the files are encrypted and protected by unknown passwords. The pair are concerned that the passwords are sophisticated and cannot be easily determined. Pagli argues that the Distributed Network Attack (DNA) software tool from AccessData could be valuable. Describe TWO technical approaches employed by the Distributed Network Attack (DNA) tool and argue the relevance in the given context.[6]

The Distributed Network Attack (DNA) tool from AccessData is designed to assist in recovering passwords for encrypted files by leveraging the power of distributed computing. In the given context, where Pagli and Antonellis have recovered encrypted Microsoft Word documents with unknown passwords, DNA could be a valuable tool to help them gain access to the files. Here are two technical approaches employed by the DNA tool and their relevance in this context:

  1. Brute-force attack: DNA can perform a brute-force attack, which involves systematically attempting every possible password combination until the correct one is found. Brute-force attacks can be time-consuming, especially if the password is long and complex. However, DNA’s distributed computing capabilities allow it to harness the processing power of multiple computers, including the company’s workstations, to expedite the password recovery process. This distributed approach makes it more feasible to crack sophisticated passwords within a reasonable time frame, increasing the likelihood of success for Pagli and Antonellis.
  2. Dictionary attack: Another approach employed by DNA is the dictionary attack. This method involves using a precompiled list of words, phrases, or known passwords (a dictionary) to attempt to guess the password. DNA can also utilize rules-based variations, such as common substitutions or character additions, to further expand the list of potential passwords. Dictionary attacks are generally faster than brute-force attacks, as they focus on more likely password candidates. In the given context, this approach could be relevant if the employee used a password based on a dictionary word, a common phrase, or a known pattern.

In conclusion, the Distributed Network Attack tool could be valuable for Pagli and Antonellis in their efforts to recover the passwords for the encrypted Microsoft Word documents. By employing both brute-force and dictionary attacks, while utilizing distributed computing resources, DNA increases the chances of successfully cracking the passwords, even if they are sophisticated. This would ultimately help Pagli and Antonellis fulfill their contract and provide the company management with access to the files of interest.
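To make the dictionary-attack idea concrete, a minimal rules-based sketch; try_password is a stand-in for whatever routine actually tests a candidate against the encrypted document, and the word list, substitutions, and suffixes are illustrative:

```python
def variations(word):
    """Yield common rule-based variants of a dictionary word."""
    subs = str.maketrans("aeios", "43105")  # common character substitutions
    for base in (word, word.capitalize(), word.translate(subs)):
        yield base
        for suffix in ("1", "123", "!", "2023"):  # common additions
            yield base + suffix

def dictionary_attack(wordlist, try_password):
    """Return the first candidate accepted by try_password, else None."""
    for word in wordlist:
        for candidate in variations(word):
            if try_password(candidate):
                return candidate
    return None

# Illustrative use with a stand-in oracle; a real implementation would
# call into a password-recovery routine for the encrypted Word files.
print(dictionary_attack(["invoice", "orange"], lambda p: p == "Invoice123"))
```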

(b) Antonellis argues that the pair cannot afford to invest in Distributed Network Attack (DNA) software as resources are limited. Pagli argues the tool would be invaluable to the current case. Antonellis argues the pair should use a combination of command line tools and tailored scripts. Compare and contrast the tools suggested by Pagli and Antonellis and argue for the optimal approach in the given context.[8]

Both the Distributed Network Attack (DNA) software and a combination of command line tools and tailored scripts have their merits and drawbacks in the given context. Here, we will compare and contrast these approaches and argue for the optimal solution for Pagli and Antonellis.

Distributed Network Attack (DNA) software:

Pros:

  1. Comprehensive and user-friendly: DNA is a dedicated tool designed for password recovery, with built-in features and functionality that simplify the process for users.
  2. Distributed computing: DNA leverages the power of multiple computers to speed up the password recovery process, making it more efficient for cracking complex passwords.
  3. Multiple attack strategies: DNA supports both brute-force and dictionary attacks, offering a versatile approach to password recovery.

Cons:

  1. Cost: DNA may be expensive, particularly for a small start-up with limited resources.
  2. Overkill for simple cases: DNA’s advanced capabilities may not be necessary if the target password is weak or follows a predictable pattern.

Command line tools and tailored scripts:

Pros:

  1. Cost-effective: Using open-source command line tools and custom scripts can be more budget-friendly, as there is no need to invest in expensive software.
  2. Flexibility: Tailored scripts can be customized to the specific needs of the case, allowing Pagli and Antonellis to adapt their approach as required.

Cons:

  1. Time-consuming setup: Developing and configuring custom scripts and tools may require a significant investment of time and expertise.
  2. Limited scalability: The performance of command line tools and scripts may be constrained by the available hardware resources, making it less suitable for cracking complex passwords in a timely manner.

In the given context, the optimal approach depends on several factors, including the available budget, time constraints, and the complexity of the passwords. If Pagli and Antonellis believe that the encrypted files are of high importance and the passwords are likely to be sophisticated, investing in the DNA software could prove invaluable for its speed, efficiency, and user-friendly features. The distributed computing capabilities and the support for multiple attack strategies can significantly increase the chances of success.

However, if the pair’s budget is truly limited and they possess the technical expertise to develop custom scripts, using command line tools and tailored scripts may be a more cost-effective alternative. This approach would allow them to retain control over the process and adapt their strategy to the specific case.

Ultimately, the optimal approach will depend on the pair’s assessment of the case’s importance, the potential value of the encrypted files, and their available resources. It is crucial for Pagli and Antonellis to weigh the pros and cons of each option carefully before deciding on the best course of action.

Terminology & Jargon:

FDE:

Full Disk Encryption (FDE) is an encryption technology implemented on hard disk drives or solid state drives. It protects all data stored on the disk, including the operating system, program files and user data. Its main purpose is to ensure that sensitive data on the disk cannot be deciphered in the event of unauthorized access.

File Carving:

File carving is a process used in [computer forensics](https://www.infosecinstitute.com/courses/computer-forensics-boot-camp/?utm_source=resources&utm_medium=infosec network&utm_campaign=course pricing&utm_content=hyperlink) to extract data from a disk drive or other storage device without the assistance of the file system that originally created the file.

Unallocated area:

Unallocated space refers to the area of the drive which no longer holds any file information as indicated by the file system structures like the file table.

IP:

investigative process.

DFI (Digital Forensics Investigation):

investigation of both the tools used to commit a crime and the subjects of the crime.

Hash:

A hash algorithm maps a binary string of arbitrary length to a binary string of fixed length. The fixed-length binary string obtained by mapping the original data is its hash value.
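As a quick illustration (my own sketch, not from the original notes), Java's standard java.security.MessageDigest computes exactly such fixed-length digests; MD5 is used here only because the next section walks through its internals:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashDemo {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] input = "hello world".getBytes();          // arbitrary-length input
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(input);                  // always 16 bytes for MD5
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));          // render each byte as two hex digits
        }
        System.out.println(hex); // 5eb63bbbe01eeed093cb22bb8f5acdc3
    }
}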

MD5: the algorithm, step by step:

a) Padding: extend the message so that its total length becomes a multiple of 512 bits:

The padding works as follows:

1) Append a single 1 bit to the message, followed by 0 bits, stopping only when the condition below is met.

2) Append the length (in bits) of the original, unpadded message as a 64-bit binary value; if the original length does not fit in 64 bits, only the low 64 bits are used.

After these two steps the bit length of the message is N×512 + 448 + 64 = (N+1)×512, i.e. exactly a multiple of 512. This is done to satisfy the length requirements of the later processing stages.

b) Initialise variables

The initial 128-bit value is the initial chaining value. These parameters are used in the first round and are written in big-endian byte order: A=0x01234567, B=0x89ABCDEF, C=0xFEDCBA98, D=0x76543210.

(For each variable, the high byte is stored at the low memory address, i.e. big-endian. In a program the variables A, B, C, D therefore hold the values 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476.)

c) Process the message in blocks

Each block is processed as follows:

For the first block, the four chaining variables are copied into four working variables: A to a, B to b, C to c, D to d. From the second block onwards, the working variables start from the previous block's result, i.e. A = a, B = b, C = c, D = d.

The main loop has four rounds (MD4 has only three), and the rounds are all similar. Each round performs 16 operations. Each operation applies a non-linear function to three of a, b, c and d, then adds the result to the fourth variable, a 32-bit sub-block of the message, and a constant. The sum is rotated left by a varying amount and added to one of a, b, c or d; the result then replaces one of a, b, c or d.

The four non-linear functions (one per round) are:

F( X ,Y ,Z ) = ( X & Y ) | ( (~X) & Z )

G( X ,Y ,Z ) = ( X & Z ) | ( Y & (~Z) )

H( X ,Y ,Z ) = X ^ Y ^ Z

I( X ,Y ,Z ) = Y ^ ( X | (~Z) )

(& is And, | is Or, ~ is Not, ^ is Xor)
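A minimal Java rendering of those four round functions (my own illustration; the names follow the formulas above, operating on 32-bit ints as MD5 does):

public final class Md5RoundFunctions {
    // Round 1: F selects bits from y or z depending on x.
    static int f(int x, int y, int z) { return (x & y) | (~x & z); }

    // Round 2: G selects bits from x or y depending on z.
    static int g(int x, int y, int z) { return (x & z) | (y & ~z); }

    // Round 3: H is the bitwise parity of the three inputs.
    static int h(int x, int y, int z) { return x ^ y ^ z; }

    // Round 4: I mixes y with (x OR NOT z).
    static int i(int x, int y, int z) { return y ^ (x | ~z); }
}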


Merkle–Damgård

The Merkle–Damgård construction, abbreviated as the MD construction, is used in hash algorithms mainly to resist collision attacks. It is the basis of several well-known hash algorithms such as MD5, SHA-1 and SHA-2. This section explains the MD construction and the length-extension attack against it.

Steps:

Padding

The MD construction first pads the input message to a multiple of a fixed block length (for example 512 or 1024 bits), because the compression function cannot process messages of arbitrary length, so padding is mandatory before processing. A 1 bit followed by 0 bits is appended to the tail of the original data, then the binary length of the original message, so that the total length becomes a multiple of 512 or 1024. Using an additional block just for the length is somewhat wasteful; a more space-efficient approach is to place the message length inside the final block, provided there is enough room among the padding zeros.

Compress

Once padding is complete, compression can begin. The message is split into blocks: the initialisation vector is combined with the first block by the compression function f, the result is combined with the second block, and so on in a loop; the output of the final application is the result.

(figure: the Merkle–Damgård construction)
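A minimal sketch of that chaining loop (my own illustration; compress stands in for a real compression function and only does toy mixing):

import java.util.Arrays;

public class MerkleDamgardSketch {
    static final int BLOCK_SIZE = 64; // 512 bits, as in MD5/SHA-1

    // Stand-in for a real compression function: mixes a chaining value with one block.
    static byte[] compress(byte[] chain, byte[] block) {
        byte[] out = chain.clone();
        for (int i = 0; i < out.length; i++) {
            out[i] ^= block[i % block.length]; // toy mixing only, not cryptographic
        }
        return out;
    }

    // Fold every block into the chaining value, starting from the IV.
    // Assumes the input has already been padded to a multiple of BLOCK_SIZE.
    static byte[] digest(byte[] padded, byte[] iv) {
        byte[] chain = iv;
        for (int off = 0; off < padded.length; off += BLOCK_SIZE) {
            chain = compress(chain, Arrays.copyOfRange(padded, off, off + BLOCK_SIZE));
        }
        return chain; // the final chaining value is the digest
    }
}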

Length-extension attack

The MD construction splits the message into blocks, and the value computed from one block is fed into the computation for the next block. This structure makes a length-extension attack convenient, provided we know the length of the original message. In cryptography, a length-extension attack is one where an attacker who knows hash(message1) and the length of message1 can compute hash(message1 ‖ message2), where ‖ denotes concatenation, without needing to know what message1 actually is.

Wide pipe

To resist length-extension attacks, the MD construction can be modified in a few ways.

(figure: the wide-pipe construction)

Wide pipe follows essentially the same flow as MD; the difference is that the intermediate chaining values are twice as long as the final output.

This is why the figure shows two initialisation vectors, IV1 and IV2. If the final output length is n, the intermediate results have length 2n, and the final step must reduce the 2n bits of state down to n bits.

Fast wide pipe

SHA-512/224 and SHA-512/256 simply discard half of the data.

There is also a construction faster than wide pipe, called fast wide pipe:

(figure: the fast-wide-pipe construction)

Unlike wide pipe, its main idea is to forward half of the previous chaining value to an XOR, XORing it with the output of the compression function.

SLACK SPACE:

Slack space occurs when a file cannot be efficiently compartmentalised into file system containers.

Feature: in practice containers are not completely full, so some slack space remains.

Example: consider a file that is 59 bytes in size and is allocated a 2048-byte cluster; the remaining 1989 bytes are slack space.

Slack space can potentially contain interesting data, including data from previous files.

Data may exist between the end of the allocated file data and the end of the sector.

Data may also exist in the sectors within the cluster that are not allocated data.

An important aspect of slack space is that it is allocated space, not unallocated space.
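A tiny helper for the arithmetic in the example above (my own illustration):

public class SlackSpace {
    // Bytes left over in the last cluster allocated to a file.
    static long slackBytes(long fileSize, long clusterSize) {
        long remainder = fileSize % clusterSize;
        return remainder == 0 ? 0 : clusterSize - remainder;
    }

    public static void main(String[] args) {
        System.out.println(slackBytes(59, 2048)); // 1989, matching the example above
    }
}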

Sector and Cluster:

In computer file systems, a sector is the smallest unit of storage on a disk. A cluster is a group of sectors that is treated as a single unit of storage. A sector is a fixed-size, contiguous block of storage space on the disk, typically ranging from 512 bytes to 4096 bytes depending on the disk format. Clusters are used to allocate disk space for files and are typically much larger than sectors, ranging from a few sectors to several kilobytes in size, depending on the file system and disk size.

FIRST AVAILABLE:

in terms of forensics, recovery of deleted data will likely be more fruitful near the end of the file system.

NEXT AVAILABLE:

in terms of forensics, recovery of deleted data may be more balanced in comparison.

BEST FIT:

recall, the file itself may grow and data can become scattered across the system.


FRAGMENTATION

• systems become fragmented as files are deleted, added and altered.

• a file is considered fragmented when its containers are not consecutive but are scattered across the storage device.

• fragmentation typically occurs due to alteration of files, low disk space, and specific allocation approaches.

• fragmentation of files is relatively uncommon in modern systems.

• modern operating systems are effective at avoiding fragmentation as this affords faster reading and writing.

• disk space has become less of a concern, suggesting that fragmentation is more likely on relatively smaller disks.

• researchers argue fragmented files are more likely of interest to investigators.

FILE EXTENSION FRAGMENTATION:

• different fragmentation rates are observed for different file types.
• temporary and log files are often fragmented, as they grow over the system's lifetime.
• movie, image, document and personal-organisation files are often highly fragmented.
• arguably such files are more pertinent to an investigation than benign system files.

HIGH FRAGMENTATION:

• there are some files that are highly fragmented, potentially into more than 100 or even over 1000 fragments.

• such fragments are typically associated with large system updates or patches

FILE CARVING:

• process of reconstructing files based on structure and content, instead of meta-data.

• typically used to recover data from unallocated space on the disk as indicated by the file system.

• useful for data recovery when the device itself has been damaged, e.g. hard disk.

• valuable in forensics when specific files have been deleted, e.g. data still present in sectors

• may not recognise the file system used or even trust the file system itself.

• early file carver approaches relied on magic numbers to discover and recover files.

• a limitation is that the file carver would recover contiguous data without being sure that it is actually valid or properly associated with the file.

Challenges:

  1. the initial challenge is to identify the files that are to be carved from the image itself.

  2. process must exist that ensures the files are actually intact.

  3. files then need to be carved or extracted from the image.

LIMITATIONS:

• the problem is that unless the clusters are contiguous it can be difficult to recover the file.

• even if the associated clusters are recovered, it is difficult to validate that the file is what is expected.

Hex Carving:

In hex carving, the analyst searches for specific hexadecimal signatures (also known as file headers and footers) that characterise a file type, e.g. JPEG images or PDF documents. Once these signatures are found, the corresponding data blocks can be extracted from the storage medium and reassembled into a complete file.

Bitfragment Gap Carving(BGC):

Steps:

• the initial step is to determine the header and footer of the file.
• process the clusters between the header and footer to confirm the contents of the container files.
• perform the computationally expensive step of validating each cluster.
• knowing bh and bz (the header and footer fragments), start with gap size g = 1 and grow it until each fragment validates.

Limitation:

• the approach works when the file is bi-fragmented; with any more fragments it will not work.

• corrupted or lost clusters result in worst-case performance.

• the approach works for files whose structure can actually be validated and/or decoded.

• it is not always possible to trust the validation and/or decoding approach.

• the approach struggles with large gaps.

Bitfragment Gap Carving (BGC) aims to overcome the limitations of traditional hex carving when dealing with fragmented files. In some cases a file may be scattered across the storage medium, meaning its parts are not stored contiguously.

BGC addresses this by searching the storage medium for specific file fragments (bit fragments) rather than only for headers and footers. These fragments may contain important parts of the file, such as content and metadata. Once the bit fragments have been identified, BGC attempts to reassemble them into a complete file.

A key advantage of BGC is that it can recover scattered files without knowledge of the file system, which makes it very useful when the file system is damaged or when recovering deleted files. BGC also has its challenges, however: bit-fragment signatures must be developed for specific file types, false positives can occur, and large numbers of fragment combinations may need to be processed.

Do we think it is wise to carve out the fragments between the header and footer?

When using Bitfragment Gap Carving (BGC), carving out the fragments between the header and footer usually makes sense. In some cases the parts of a file are scattered rather than stored contiguously, which means other fragments containing important information, such as content and metadata, may lie between the header and the footer.

What else do we typically know about the files we're interested in?

When using Bitfragment Gap Carving (BGC) we usually need some information about the files of interest, to improve the success rate and accuracy of recovery. Information that may be needed includes:

  1. File type: knowing the target file's type helps determine its fragment signatures. JPEG images, PDF documents and Microsoft Word files, for example, have different structures and characteristics; knowing the type narrows the search and improves recovery.
  2. Fragment signatures: for a given file type, its bit-fragment signatures are needed so that fragments can be found and identified on the medium. These may include the header, the footer, and other characteristic information such as metadata and content markers.
  3. File size: where possible, knowing the approximate size helps estimate the number and likely locations of fragments, improving search efficiency and reducing false positives.
  4. Storage device and file system: knowing the device (hard disk, USB flash drive, etc.) and file system (NTFS, FAT32, ext4, etc.) may indicate how, and how badly, files are scattered, and help tune the BGC method.
  5. Cause of deletion or damage: knowing why the data was lost or damaged (accidental deletion, disk damage, malware attack, etc.) may help choose the best recovery strategy.

If we have determined the header and footers, plus container structure - what else could we do?

  1. Analyse the file's internal structure: understanding the internal structure of the file type helps identify and extract more relevant bit fragments, including metadata, encoding, markers and other characteristic information, so fragments can be searched for and identified more accurately.
  2. Optimise the search strategy: with the header, footer and container structure known, the search can be made more efficient, e.g. by restricting the search range, tuning search parameters, or predicting likely fragment locations from the known file size.
  3. Validate the recovered result: after extracting and reassembling the fragments, check the result carefully for accuracy and completeness, e.g. by validating the file's metadata, content and internal structure; if problems are found, return to the search and extraction stage and try different parameters and strategies.
  4. Combine with other recovery techniques: where BGC cannot fully recover a file, other techniques and tools such as hex carving or file-system analysis can be combined with it to improve accuracy and completeness.
  5. Optimise and learn: continuously refining fragment signatures, search strategies and parameter settings improves BGC's performance; in practice analysts face many different file types and storage devices and must keep adapting the method.

Theoretical Graph Carving:

THEORETICAL GRAPH CARVING

• need to determine which clusters are adjacent to each other.

• the approach to determining the correct ordering is to weight fragment pairs.

• use a function to generate a weight for each pair of clusters and select the heaviest pairing.

• the ideal permutation is the one whose ordering has the maximum total weight.

• determining the path is the same as finding a maximum-weight Hamiltonian path in a complete graph.

Graph Carving:

• The Hamiltonian path approach does not consider the situation where we have multiple files.

• The problem can be reconsidered as the k-vertex disjoint path problem.

• Here k is the number of files, identified from the number of headers.

• It is a disjoint path problem if we consider that each cluster belongs to only one file.

​ PARALLEL UNIQUE PATH(PUP):

Hash Carving:

The hash-carving workflow consists of the following steps:

  1. Determine the hashes of known files: compute and record the hash values of the known files, using a tool independent of the disk image or file system.
  2. Scan the disk image or file system: use the hash-carving tool to scan the disk image or file system and compute a hash for each data block.
  3. Match hashes: compare the hash of each scanned block against the hashes of the known files. A match indicates that the block may contain content from the deleted file.
  4. Recover file content: extract the matching blocks and attempt to reconstruct the deleted file's content.
  5. Verify data integrity: validate recovered files to ensure their integrity and reliability.

Note that hash carving can produce false matches or incorrect results, so its output needs further verification and analysis. Data-privacy and legal requirements must also be considered, to ensure the process is lawful and the evidence reliable.
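As a rough illustration of steps 2 and 3 (my own sketch, not from the notes; it assumes fixed 4096-byte blocks, SHA-256, and a precomputed set of known block hashes; java.util.HexFormat needs Java 17+):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Set;

public class HashCarvingSketch {
    static final int BLOCK_SIZE = 4096; // assumed block size

    // Scan an image block by block, reporting offsets whose hash is "known".
    static void scan(Path image, Set<String> knownBlockHashes) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(image)) {
            byte[] block = new byte[BLOCK_SIZE];
            long offset = 0;
            int n;
            while ((n = in.readNBytes(block, 0, BLOCK_SIZE)) > 0) {
                md.reset();
                md.update(block, 0, n);
                String hex = HexFormat.of().formatHex(md.digest());
                if (knownBlockHashes.contains(hex)) {
                    System.out.println("candidate block at offset " + offset);
                }
                offset += n;
            }
        }
    }
}

As the notes say, a match here is only a candidate: it still needs verification before it counts as recovered content.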

**Non-probative block test:**(非证据块测试)

A testing method used in digital forensics to exclude non-probative blocks from a disk image or file system, reducing the time and resources needed for subsequent analysis.

In digital forensics, non-probative blocks are blocks that contain no useful data or are irrelevant to the case, such as the operating system's free blocks or blocks of deleted files. The test computes a hash for every data block and compares it against hashes of known non-probative blocks, so that those blocks can be quickly identified and excluded. This greatly reduces the time and resources required for later analysis and helps concentrate effort on data of evidential value.

Note that the non-probative block test is not 100% accurate and may produce false positives or false negatives, so in digital forensics it should be combined with other analysis methods and tools for comprehensive verification, to ensure the accuracy and reliability of the results.

Magic numbers:

In the context of computer file formats, a magic number is a sequence of bytes, typically at the start of the file, that identifies the format of the file. It is called a "magic number" because it is used like a magic spell to identify the file format, much as a spell might be used to identify a person or object.
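For example (my own sketch), a carver might sniff a handful of well-known signatures like this; the JPEG (FF D8 FF), PNG (89 50 4E 47) and PDF ("%PDF") prefixes used here are documented magic numbers:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class MagicSniffer {
    static String sniff(Path p) throws IOException {
        byte[] head = Arrays.copyOf(Files.readAllBytes(p), 4); // first 4 bytes, zero-padded
        if (startsWith(head, 0xFF, 0xD8, 0xFF)) return "JPEG";
        if (startsWith(head, 0x89, 0x50, 0x4E, 0x47)) return "PNG";
        if (startsWith(head, 0x25, 0x50, 0x44, 0x46)) return "PDF"; // "%PDF"
        return "unknown";
    }

    static boolean startsWith(byte[] data, int... magic) {
        for (int i = 0; i < magic.length; i++) {
            if (i >= data.length || (data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}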

data analyst

  • Have completed an undergraduate degree (2.1 minimum) with a Mathematical or Scientific background
  • Knowledge of SQL or Power BI
  • Strong analytical and numerical skills
  • Experience preparing, consolidating and normalising data: ETL processes
  • Experience / knowledge of data visualisation and dashboard tools, e.g. Qlikview, Tableau or Power BI
  • Strong communication skills to work comfortably with all levels of users
  • Be within commuting distance of Croydon

Csharp

C# Get started

Hello guys, follow me and let's study C#.

  • Commercial experience of software development in C#
  • Experience with source code version control such as Git
  • Experience with software development tools such as Visual Studio, JIRA, MSBuild, Jenkins
  • Experience with unit test frameworks such as NUnit or MSTest
  • Experience designing user interfaces. (Exposure to DevExpress an advantage).
  • Exposure to Iterative or Agile development methodologies (Scrum / Kanban)
  • Understanding of OOP
  • Understanding of SOLID

Chapter1: Basic

  • using: this keyword includes the System namespace in the program. A program usually has multiple using statements.
  • namespace: contains different classes.
  • class: the declaration of the class.
  • Main: the entry point for all C# programs; it defines what the class does when executed.
  • Formatted output, aka interpolation: Console.WriteLine($"Hello World! The counter is {counter}"); by using the $ dollar symbol.

Here are a few things to note:

  • C# is case-sensitive.
  • All statements must end with a semicolon (;).
  • Execution of the program starts with the Main method.
  • Unlike Java, the file name can be different from the name of the class.

Numbers and math in C#: int, float, double, decimal, string...

int minVal = int.MinValue; // -2147483648
int maxVal = int.MaxValue; // 2147483647

double minDouble = double.MinValue; // -1.79769313486232E+308
double maxDouble = double.MaxValue; // 1.79769313486232E+308

decimal minDecimal = decimal.MinValue; // -79228162514264337593543950335
decimal maxDecimal = decimal.MaxValue; // 79228162514264337593543950335

The decimal type has a smaller range but greater precision than double.

The M suffix on the numbers is how you indicate that a constant should use the decimal type. For example:

decimal c = 1.0M;
decimal d = 3.0M;
Console.WriteLine(c / d);

Arrays:

// Define an integer array of length 3
int[] nums = new int[3];

// Define a string array of length 4 and initialise its elements
string[] names = new string[] { "Tom", "Jerry", "Alice", "Bob" };

// Define an array using the var keyword
var scores = new int[] { 90, 80, 95, 85, 70 };

In the examples above, the first definition creates an integer array of length 3 whose elements default to 0; the second creates a string array of length 4 and initialises each element; the third uses the var keyword to infer the array type from the initialiser and creates an integer array of length 5.

 Common Math function

  • Math.Abs: Returns the absolute value of a number.
  • Math.Ceiling: Rounds up a number and returns the smallest integer greater than or equal to that number.
  • Math.Floor: Rounds down a number and returns the largest integer less than or equal to the number.
  • Math.Max: Returns the maximum of two numbers.
  • Math.Min: Returns the minimum of two numbers.
  • Math.Pow: Returns the specified power of a number.
  • Math.Round: Round to the nearest whole number or number of specified decimal places.
  • Math.Sqrt: Returns the square root of a number.
  • Math.Log: Returns the natural log base e of a number.
  • Math.Exp: Returns e to the specified power.
  • Math.Truncate: Truncates a number into its integer part.

Chapter2. if/else/loop

1. if statement

bool carries true and false, which is different from Java.

if (condition) {
    // executes when the condition is true
} else {
    // executes when the condition is false
}

Logical AND Operator:&&

Logical OR Operator:||

Logical NOT Operator: !

Equality Operator: ==

2. loop

  1. while loop

    same as in Java

    int counter = 0;
    while (counter < 10)
    {
        Console.WriteLine($"Hello World! The counter is {counter}");
        counter++;
    }
  2. do while

    same as in Java

    int counter = 0;
    do
    {
        Console.WriteLine($"Hello World! The counter is {counter}");
        counter++;
    } while (counter < 10);
  3. for loop

    same as in Java

    for (int counter = 0; counter < 10; counter++)
    {
        Console.WriteLine($"Hello World! The counter is {counter}");
    }

(for initializer; for condition; for iterator), same as in Java

Nested for loops can be used to create a matrix, e.g.:

for (int row = 1; row < 11; row++)
{
    for (char column = 'a'; column < 'k'; column++)
    {
        Console.WriteLine($"The cell is ({row}, {column})");
    }
}

Chapter3. Lists collection

Example:

var names = new List<string> { "Joshua", "Ana", "Felipe" };
foreach (var name in names)
{
    Console.WriteLine($"Hello {name.ToUpper()}!");
}
  • You specify the type of the elements between the angle brackets, <>.
  • One important aspect of this List type is that it can grow or shrink, enabling you to add or remove elements.
var names = new List<string> { "Joshua", "Ana", "Felipe" };
Console.WriteLine();
names.Add("Joshua");
names.Add("Bill");
names.RemoveAll(name => name == "Joshua");
foreach (var name in names)
{
    Console.WriteLine($"Hello {name.ToUpper()}!");
}

The RemoveAll method is different from Java's; it is provided by List and takes a predicate:

public int RemoveAll(Predicate<T> match);
  • List lets you reference individual items by index directly, unlike Java's list.get().

  • List also provides a Count property, which gives the number of elements in the list: xxx.Count.

  • List provides an IndexOf method, which finds the index of a specific element.

  • The items in your list can be sorted as well. The Sort method sorts all the items in the list into their natural order (alphabetical for strings), quite similar to Java's Collections.sort().

// print Fibonacci numbers

var fibonacciNumbers = new List<int> { 1, 1 };

while (fibonacciNumbers.Count < 20)
{
    var previous = fibonacciNumbers[fibonacciNumbers.Count - 1];
    var previous2 = fibonacciNumbers[fibonacciNumbers.Count - 2];

    fibonacciNumbers.Add(previous + previous2);
}
foreach (var item in fibonacciNumbers)
{
    Console.WriteLine(item);
}

Fundamentals.

1. Program Structure.

1. Overview

// A skeleton of a C# program
using System;

// Your program starts here:
Console.WriteLine("Hello world!");

namespace YourNamespace
{
    class YourClass
    {
    }

    struct YourStruct
    {
    }

    interface IYourInterface
    {
    }

    delegate int YourDelegate();

    enum YourEnum
    {
    }

    namespace YourNestedNamespace
    {
        struct YourStruct
        {
        }
    }
}

The preceding example uses top-level statements for the program’s entry point. This feature was added in C# 9. Prior to C# 9, the entry point was a static method named Main, as shown in the following example:

// A skeleton of a C# program
using System;
namespace YourNamespace
{
    class YourClass
    {
    }

    struct YourStruct
    {
    }

    interface IYourInterface
    {
    }

    delegate int YourDelegate();

    enum YourEnum
    {
    }

    namespace YourNestedNamespace
    {
        struct YourStruct
        {
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // Your program starts here...
            Console.WriteLine("Hello world!");
        }
    }
}

The file extension of a C# file is .cs.

2. Main method

The Main method is the entry point of a C# application. (Libraries and services do not require a Main method as an entry point.) When the application is started, the Main method is the first method that is invoked.

There can only be one entry point in a C# program. If you have more than one class that has a Main method, you must compile your program with the StartupObject compiler option to specify which Main method to use as the entry point. For more information, see StartupObject (C# Compiler Options).

Starting in C# 9, you can omit the Main method and write C# statements as if they were in the Main method (this is similar to Python), as in the following example:

using System.Text;

StringBuilder builder = new();
builder.AppendLine("Hello");
builder.AppendLine("World!");

Console.WriteLine(builder.ToString());

For information about how to write application code with an implicit entry point method, see Top-level statements.

  • The Main method is the entry point of an executable program; it is where the program control starts and ends.
  • Main is declared inside a class or struct. Main must be static and it need not be public. (In the earlier example, it receives the default access of private.) The enclosing class or struct is not required to be static.
  • Main can either have a void, int, Task, or Task<int> return type.
  • If and only if Main returns a Task or Task<int>, the declaration of Main may include the async modifier. This specifically excludes an async void Main method.
  • The Main method can be declared with or without a string[] parameter that contains command-line arguments. When using Visual Studio to create Windows applications, you can add the parameter manually or else use the GetCommandLineArgs() method to obtain the command-line arguments. Parameters are read as zero-indexed command-line arguments. Unlike C and C++, the name of the program is not treated as the first command-line argument in the args array, but it is the first element of the GetCommandLineArgs() method.

Valid Main declarations:

public static void Main() { }
public static int Main() { }
public static void Main(string[] args) { }
public static int Main(string[] args) { }
public static async Task Main() { }
public static async Task<int> Main() { }
public static async Task Main(string[] args) { }
public static async Task<int> Main(string[] args) { }

The preceding examples all use the public accessor modifier. That’s typical, but not required. (which means they can be private).

The addition of async and Task, Task<int> return types simplifies program code when console applications need to start and await asynchronous operations in Main

Main() return values:

You can return an int from the Main method by defining the method in one of the following ways:

| Main method code | Main signature |
| --- | --- |
| No use of args or await | static int Main() |
| Uses args, no use of await | static int Main(string[] args) |
| No use of args, uses await | static async Task<int> Main() |
| Uses args and await | static async Task<int> Main(string[] args) |

If the return value from Main is not used, returning void or Task allows for slightly simpler code.

| Main method code | Main signature |
| --- | --- |
| No use of args or await | static void Main() |
| Uses args, no use of await | static void Main(string[] args) |
| No use of args, uses await | static async Task Main() |
| Uses args and await | static async Task Main(string[] args) |

However, returning int or Task<int> enables the program to communicate status information to other programs or scripts that invoke the executable file.

The following example shows how the exit code for the process can be accessed.

This example uses .NET Core command-line tools. If you are unfamiliar with .NET Core command-line tools, you can learn about them in this get-started article.

Create a new application by running dotnet new console. Modify the Main method in Program.cs as follows:

// Save this program as MainReturnValTest.cs.
class MainReturnValTest
{
    static int Main()
    {
        //...
        return 0;
    }
}

When a program is executed in Windows OS, any value returned from the Main function is stored in an environment variable. This environment variable can be retrieved using ERRORLEVEL from a batch file, or $LastExitCode from PowerShell.

You can build the application using the dotnet CLI dotnet build command.

Next, create a PowerShell script to run the application and display the result. Paste the following code into a text file and save it as test.ps1 in the folder that contains the project. Run the PowerShell script by typing test.ps1 at the PowerShell prompt.

Because the code returns zero, the script will report success. However, if you change MainReturnValTest.cs to return a non-zero value and then recompile the program, subsequent execution of the PowerShell script will report failure.

dotnet run
if ($LastExitCode -eq 0) {
    Write-Host "Execution succeeded"
} else {
    Write-Host "Execution Failed"
}
Write-Host "Return value = " $LastExitCode

Output:

Execution succeeded
Return value = 0

Async Main return values

When you declare an async return value for Main, the compiler generates the boilerplate code for calling asynchronous methods in Main. If you don’t specify the async keyword, you need to write that code yourself, as shown in the following example. The code in the example ensures that your program runs until the asynchronous operation is completed:

public static void Main()
{
    AsyncConsoleWork().GetAwaiter().GetResult();
}

private static async Task<int> AsyncConsoleWork()
{
    // Main body here
    return 0;
}

This boilerplate code can be replaced by:

static async Task<int> Main(string[] args)
{
    return await AsyncConsoleWork();
}

An advantage of declaring Main as async is that the compiler always generates the correct code.

When the application entry point returns a Task or Task<int>, the compiler generates a new entry point that calls the entry point method declared in the application code. Assuming that this entry point is called $GeneratedMain, the compiler generates the following code for these entry points:

  • static Task Main() results in the compiler emitting the equivalent of private static void $GeneratedMain() => Main().GetAwaiter().GetResult();
  • static Task Main(string[]) results in the compiler emitting the equivalent of private static void $GeneratedMain(string[] args) => Main(args).GetAwaiter().GetResult();
  • static Task<int> Main() results in the compiler emitting the equivalent of private static int $GeneratedMain() => Main().GetAwaiter().GetResult();
  • static Task<int> Main(string[]) results in the compiler emitting the equivalent of private static int $GeneratedMain(string[] args) => Main(args).GetAwaiter().GetResult();

Note: if the examples used the async modifier on the Main method, the compiler would generate the same code.

Command-Line Arguments

You can send arguments to the Main method by defining the method in one of the following ways:

| Main method code | Main signature |
| --- | --- |
| No return value, no use of await | static void Main(string[] args) |
| Return value, no use of await | static int Main(string[] args) |
| No return value, uses await | static async Task Main(string[] args) |
| Return value, uses await | static async Task<int> Main(string[] args) |

If the arguments are not used, you can omit args from the method signature for slightly simpler code:

| Main method code | Main signature |
| --- | --- |
| No return value, no use of await | static void Main() |
| Return value, no use of await | static int Main() |
| No return value, uses await | static async Task Main() |
| Return value, uses await | static async Task<int> Main() |

Note

You can also use Environment.CommandLine or Environment.GetCommandLineArgs to access the command-line arguments from any point in a console or Windows Forms application. To enable command-line arguments in the Main method signature in a Windows Forms application, you must manually modify the signature of Main. The code generated by the Windows Forms designer creates Main without an input parameter.

The parameter of the Main method is a String array that represents the command-line arguments. Usually you determine whether arguments exist by testing the Length property, for example:


if (args.Length == 0)
{
    System.Console.WriteLine("Please enter a numeric argument.");
    return 1;
}

Tip

The args array can’t be null. So, it’s safe to access the Length property without null checking.

You can also convert the string arguments to numeric types by using the Convert class or the Parse method. For example, the following statement converts the string to a long number by using the Parse method:


long num = Int64.Parse(args[0]);

It is also possible to use the C# type long, which aliases Int64:


long num = long.Parse(args[0]);

You can also use the Convert class method ToInt64 to do the same thing:


long num = Convert.ToInt64(args[0]);

For more information, see Parse and Convert.

The following example shows how to use command-line arguments in a console application. The application takes one argument at run time, converts the argument to an integer, and calculates the factorial of the number. If no arguments are supplied, the application issues a message that explains the correct usage of the program.

To compile and run the application from a command prompt, follow these steps:

  1. Paste the following code into any text editor, and then save the file as a text file with the name Factorial.cs.


    public class Functions
    {
        public static long Factorial(int n)
        {
            // Test for invalid input.
            if ((n < 0) || (n > 20))
            {
                return -1;
            }

            // Calculate the factorial iteratively rather than recursively.
            long tempResult = 1;
            for (int i = 1; i <= n; i++)
            {
                tempResult *= i;
            }
            return tempResult;
        }
    }

    class MainClass
    {
        static int Main(string[] args)
        {
            // Test if input arguments were supplied.
            if (args.Length == 0)
            {
                Console.WriteLine("Please enter a numeric argument.");
                Console.WriteLine("Usage: Factorial <num>");
                return 1;
            }

            // Try to convert the input arguments to numbers. This will throw
            // an exception if the argument is not a number.
            // num = int.Parse(args[0]);
            int num;
            bool test = int.TryParse(args[0], out num);
            if (!test)
            {
                Console.WriteLine("Please enter a numeric argument.");
                Console.WriteLine("Usage: Factorial <num>");
                return 1;
            }

            // Calculate factorial.
            long result = Functions.Factorial(num);

            // Print result.
            if (result == -1)
                Console.WriteLine("Input must be >= 0 and <= 20.");
            else
                Console.WriteLine($"The Factorial of {num} is {result}.");

            return 0;
        }
    }
    // If 3 is entered on command line, the
    // output reads: The factorial of 3 is 6.
  2. From the Start screen or Start menu, open a Visual Studio Developer Command Prompt window, and then navigate to the folder that contains the file that you created.

  3. Enter the following command to compile the application.

    dotnet build

    If your application has no compilation errors, an executable file that’s named Factorial.exe is created.

  4. Enter the following command to calculate the factorial of 3:

    dotnet run -- 3

  5. The command produces this output: The factorial of 3 is 6.

Note

When running an application in Visual Studio, you can specify command-line arguments in the Debug Page, Project Designer.

3. Top-level statements

Java study notes

Java

Comparable vs Comparator

  • First, the literal meanings differ

Starting from the words themselves: Comparable means "able to be compared", while Comparator means "a comparer". Comparable ends in -able, indicating that the class itself has the capability; Comparator ends in -or, indicating that it is a participant that performs the comparison. This is the first way to understand the difference.

  • Second, the usage differs

Both are top-level interfaces, but their methods and usage differ, as we will see below.

Using Comparable

The Comparable interface has a single method, compareTo. Implementing Comparable and overriding compareTo is enough to make a class sortable; it works with both Collections.sort and Arrays.sort.

Before we use Comparable, the program behaves like this:

(figures from the original source, not preserved: unsorted List output before implementing Comparable)

As the figures show, when the custom class Person does not implement Comparable, the List has no ordering; elements can only be output in insertion order.

Now suppose the boss has a requirement: sort Person objects by their age field in descending order, from oldest to youngest. This is where the protagonist of this section, Comparable, comes in.

Comparable is used by implementing the interface in the custom class and overriding compareTo to define the sort rule. The concrete implementation looks like this:

(figure, not preserved: Person implementing Comparable)

The program output is shown in the figure below:

(figure, not preserved: sorted output)

Notes on the compareTo sort method

The parameter p received by compareTo is the object to compare against. The rule is to compare the current object with that parameter and return an int. For ascending order, subtract the other object's value from the current object's value; for descending order, do the opposite and subtract the current object's value from the other object's.

Note: if a custom class does not implement the Comparable interface, it cannot be sorted with Collections.sort.
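The screenshots from the original source are missing here; the following is my own minimal reconstruction of the descending-by-age example described above:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class Person implements Comparable<Person> {
    String name;
    int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public int compareTo(Person p) {
        return Integer.compare(p.age, this.age); // compare other to self: descending by age
    }

    @Override
    public String toString() {
        return name + "(" + age + ")";
    }
}

public class ComparableDemo {
    public static void main(String[] args) {
        List<Person> people = new ArrayList<>(List.of(
                new Person("Tom", 18), new Person("Bob", 30), new Person("Amy", 25)));
        Collections.sort(people); // works because Person implements Comparable
        System.out.println(people); // [Bob(30), Amy(25), Tom(18)]
    }
}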

Using Comparator

Comparator sorts differently from Comparable: Comparable's sorting method is compareTo, while Comparator's is compare. The concrete implementation looks like this:

(figure, not preserved: AgeComparator implementation)

The program output is shown in the figure below:

(figure, not preserved: output using a Comparator)

Anonymous class:

(figure, not preserved: Comparator written as an anonymous class)
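Again the screenshot is missing; here is my own minimal sketch of the anonymous-class version it most likely showed (reusing the Person class from the sketch above):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sort without touching Person: pass an anonymous Comparator to Collections.sort.
public class ComparatorDemo {
    public static void main(String[] args) {
        List<Person> people = new ArrayList<>(List.of(
                new Person("Tom", 18), new Person("Bob", 30), new Person("Amy", 25)));
        Collections.sort(people, new Comparator<Person>() {
            @Override
            public int compare(Person p1, Person p2) {
                return Integer.compare(p2.age, p1.age); // descending by age
            }
        });
        System.out.println(people); // [Bob(30), Amy(25), Tom(18)]
    }
}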

Third, the use cases differ

From the example code above we can see that Comparable requires modifying the original class: whichever class you want to sort must itself implement Comparable and override compareTo, so Comparable is more like an "internal" sorting interface.

Comparator is different: it requires no changes to the original class. In the most extreme case, even if Person came from a third party, we could still sort it by creating a new custom Comparator. In other words, Comparator decouples sorting from the original class, adding sorting without modifying it, so Comparator can be viewed as an "external" sorting interface.

Summary

Comparable and Comparator are both used to sort elements; their differences are:

  • Comparable means "able to be compared", while Comparator means "a comparer";
  • Comparable sorts by overriding the compareTo method, while Comparator sorts by overriding the compare method;
  • Comparable must be implemented inside the class being sorted, while Comparator is defined and implemented externally.

In one sentence: Comparable is the "internal" sorting interface, while Comparator is the "external" sorting interface.

List

In Java, List is an interface type that defines a set of methods for operating on list data structures. ArrayList is an implementation class of the List interface that uses an array to implement the methods List defines. The syntax List<Integer> list = new ArrayList<>(); is used because:

  1. Generics: the Integer in List<Integer> means this List can only store Integer elements. This is an application of Java generics: type errors are caught at compile time, avoiding type-mismatch errors at runtime.
  2. Polymorphism: declaring the variable with the List interface rather than the concrete ArrayList class makes the code more extensible; to change the implementation you only change the right-hand side of the assignment, not the rest of the code.
  3. Brevity: the diamond operator (<>) keeps the code concise. Before Java 7 you had to write List<Integer> list = new ArrayList<Integer>();.

Collections.sort()

To sort a collection of a custom generic type, make sure that the type implements the Comparable interface and overrides the compareTo method, which compares two objects so they can be ordered. Alternatively, implement a Comparator for the custom type.

// 1
public class Person implements Comparable<Person> {
    private String name;
    private int age;

    // constructor, getter and setter methods

    @Override
    public int compareTo(Person other) {
        return Integer.compare(this.age, other.age);
    }
}

// 2
import java.util.*;

class Person {
    private String name;
    private int age;

    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    public String getName() {
        return name;
    }

    public int getAge() {
        return age;
    }
}

class AgeComparator implements Comparator<Person> {
    @Override
    public int compare(Person p1, Person p2) {
        return Integer.compare(p1.getAge(), p2.getAge());
    }
}

public class Main {
    public static void main(String[] args) {
        List<Person> persons = new ArrayList<>();
        persons.add(new Person("John", 25));
        persons.add(new Person("Alice", 30));
        persons.add(new Person("Bob", 20));

        Collections.sort(persons, new AgeComparator());

        for (Person p : persons) {
            System.out.println(p.getName() + " " + p.getAge());
        }
    }
}

Object-Oriented

(slides Page 1 to Page 9: images not preserved)

Collection

In Java, an object that can internally hold a number of other Java objects and expose interfaces for accessing them is called a collection.

A Java array can be seen as a kind of collection:

  • once initialised, an array's size cannot change;
  • array elements can only be accessed by index, in order.

The java.util package of the Java standard library provides the collection root interface Collection, which is the root interface of all collection classes except Map. java.util mainly provides three kinds of collections:

  • List: an ordered list collection, e.g. a StudentList arranged by index
  • Set: a collection guaranteed to contain no duplicate elements, e.g. a StudentSet of unique names
  • Map: a key-value lookup table, e.g. a StudentMap that looks up the corresponding Student by name

List

ArrayList stores all its elements in an internal array. For example, an ArrayList holding 5 elements may have an actual array size of 6 (one empty slot).

ArrayList encapsulates the add and remove operations, so that operating on the List feels like operating on an array without worrying about how elements move internally.

Looking at the List<E> interface, the main methods are:

  • append an element: boolean add(E e)
  • insert an element at a given index: boolean add(int index, E e)
  • remove the element at a given index: E remove(int index)
  • remove a given element: boolean remove(Object e)
  • get the element at a given index: E get(int index)
  • get the size (number of elements): int size()

LinkedList also implements the List interface, via a linked list: internally, each element points to the next.

[Iterating over a List]

Always use an Iterator to access a List. An Iterator is itself an object, created when the List instance's iterator() method is called. The Iterator object knows how to traverse the List; different List types return different Iterator implementations, but each always offers the most efficient traversal.

An Iterator has two methods: boolean hasNext() tells whether there is a next element, and E next() returns the next element. Iterating a List with an Iterator therefore looks like this:

List<String> list = List.of("apple", "pear", "banana");
for (Iterator<String> it = list.iterator(); it.hasNext(); ) {
    String s = it.next();
    System.out.println(s);
}

Java's for each loop uses an Iterator for us under the hood, so the code above can be rewritten as:

List<String> list = List.of("apple", "pear", "banana");
for (String s : list) {
    System.out.println(s);
}

[Converting a List to an array]

There are three ways to turn a List into an array. The first is to call toArray(), which directly returns an Object[] array:

List<String> list = List.of("apple", "pear", "banana");
Object[] array = list.toArray();
for (Object s : array) {
    System.out.println(s);
}

The second way is to pass an array of the same type to toArray(T[]); the ArrayList automatically copies its elements into the given array:

List<Integer> list = List.of(12, 34, 56);
Integer[] array = list.toArray(new Integer[3]);
for (Integer n : array) {
    System.out.println(n);
}

The last and most concise way is the T[] toArray(IntFunction<T[]> generator) method defined by the List interface:

Integer[] array = list.toArray(Integer[]::new);

Map

Map is also an interface; its most commonly used implementation class is HashMap.

Always remember: a Map cannot contain duplicate keys. Putting a value under an existing key simply replaces the value previously mapped to that key.

Also, although keys cannot repeat within a Map, values can.

  • put(K key, V value): store the given key-value pair in the HashMap.
  • get(Object key): return the value associated with the key, or null if the key is absent.
  • containsKey(Object key): check whether the HashMap contains the given key.
  • containsValue(Object value): check whether the HashMap contains the given value.
  • remove(Object key): remove the key-value pair for the given key.
  • size(): the number of key-value pairs in the HashMap.
  • isEmpty(): check whether the HashMap is empty.
  • clear(): empty the HashMap, removing all key-value pairs.
  • keySet(): return the Set of all keys in the HashMap.
  • values(): return the Collection of all values in the HashMap.
  • entrySet(): return the Set of all key-value entries in the HashMap.
  • putAll(Map<? extends K, ? extends V> m): add all key-value pairs from another Map to the HashMap.
  • replaceAll(BiFunction<? super K, ? super V, ? extends V> function): replace every key-value pair in the HashMap using the given function.
  • computeIfAbsent(K key, Function<? super K, ? extends V> mappingFunction): if the given key has no associated value yet, compute one with the given function and store it in the HashMap.
  • computeIfPresent(K key, BiFunction<? super K, ? super V, ? extends V> remappingFunction): if the given key exists and is associated with a non-null value, recompute that value with the given function.

Objects used as Map keys must also correctly override the equals() method.

The index for a key is computed by calling the key's hashCode() method, which returns an int. HashMap uses this value to locate the index of the key's value directly and return the value.

Correct use of a Map therefore requires that:

  1. key objects correctly override equals(): two equal key instances must return true from equals();
  2. key objects also correctly override hashCode(), strictly following these rules:
  • if two objects are equal, their hashCode() values must be equal;
  • if two objects are not equal, their hashCode() values should preferably differ.

How do we write a correct equals()?

public class Person {
    String firstName;
    String lastName;
    int age;
}

Identify the fields that need to be compared:

  • firstName
  • lastName
  • age

Then compare reference types with Objects.equals() and primitive types with ==.

On top of a correct equals() we also need a correct hashCode(): instances whose three fields above are respectively equal must return the same int from hashCode():

public class Person {
    String firstName;
    String lastName;
    int age;

    @Override
    public int hashCode() {
        int h = 0;
        h = 31 * h + firstName.hashCode();
        h = 31 * h + lastName.hashCode();
        h = 31 * h + age;
        return h;
    }
}

Note that the String class already implements hashCode() correctly. When computing Person's hashCode() we repeatedly use 31*h, which spreads the hashCode() values of different Person instances as evenly as possible across the whole int range.

As with the problems encountered implementing equals(), if firstName or lastName is null, the code above throws a NullPointerException. To solve this, we often use the Objects.hash() helper when computing hashCode():

public int hashCode() {
    return Objects.hash(firstName, lastName, age);
}

So the principle for writing equals() and hashCode() is:

every field used for comparison in equals() must be used in the hashCode() computation; a field not used in equals() must never appear in the hashCode() computation.

Extension

  1. hashCode() returns an int spanning roughly ±2.1 billion. Ignoring negatives for a moment, how big would HashMap's internal array have to be?

Since HashMap uses an array internally and locates a value's index directly from the key's hashCode(), the first question is exactly that: with hashCode() ranging over ±2.1 billion, how big must the internal array be?

In fact, a HashMap's default initial array size is only 16. Any key, however large its hashCode(), can simply be mapped into the range 0 to 15, so the index never exceeds the array bounds:

int index = key.hashCode() & 0xf; // 0xf = 15

(The computation above is just the simplest possible implementation.)

  2. What happens when more than 16 key-value pairs are added and the array runs out?

When the number of entries exceeds a certain amount, HashMap automatically grows its internal array, doubling it each time, e.g. from length 16 to length 32. Accordingly, the index computed from hashCode() must be re-determined. For an array of length 32 the index computation becomes:

int index = key.hashCode() & 0x1f; // 0x1f = 31

Because growing redistributes all the existing key-value pairs, frequent resizing hurts HashMap's performance badly. If we know we are going to use a HashMap holding 10000 key-value pairs, it is better to specify the capacity when creating it:

Map<String, Integer> map = new HashMap<>(10000);

Although the specified capacity is 10000, HashMap's internal array length is always a power of two, so it is actually initialised to 16384, which is 2^14, the first power of two above 10000.

  3. Suppose two different keys, e.g. "a" and "b", happen to have the same hashCode() (entirely possible, since unequal instances are only encouraged, not required, to have different hash codes). Then when we put:

map.put("a", new Person("Xiao Ming"));
map.put("b", new Person("Xiao Hong"));

the computed array index is the same. Will the "Xiao Hong" put in later overwrite "Xiao Ming"?

Of course not! When using a Map, as long as the keys are different, the values they map to do not interfere with each other. But inside the HashMap, different keys really can map to the same hashCode(), i.e. the same array index. So what happens then?

Let's assume the keys "a" and "b" both end up at index 5. In the HashMap's array, what is actually stored there is not a single Person instance but a List containing two Entry objects, one for the "a" mapping and one for the "b" mapping:

  ┌───┐
0 │   │
  ├───┤
1 │   │
  ├───┤
2 │   │
  ├───┤
3 │   │
  ├───┤
4 │   │
  ├───┤
5 │ ●─┼───> List<Entry<String, Person>>
  ├───┤
6 │   │
  ├───┤
7 │   │
  └───┘

When looking up, e.g.:

Person p = map.get("a");

what the HashMap finds via "a" is actually a List<Entry<String, Person>>; it must then walk this List and find the Entry whose key field is "a" before it can return the corresponding Person instance.

The situation where different keys have the same hashCode() is called a hash collision. On collision, the simplest solution is to store the colliding key-value pairs in a List. Clearly, the higher the collision probability, the longer this List grows and the slower Map's get() method becomes, which is why rule two should be satisfied as far as possible:

if two objects are not equal, their hashCode() values should preferably differ.

Summary

To use HashMap correctly, the class used as the key must correctly override both equals() and hashCode();

a class that overrides equals() must also override hashCode(), following these rules:

  • if equals() returns true, the hashCode() values must be equal;
  • if equals() returns false, the hashCode() values should preferably differ.

hashCode() can be implemented with the Objects.hash() helper method.

TreeMap

There is another kind of Map that sorts its keys internally: SortedMap. Note that SortedMap is an interface; its implementation class is TreeMap.

       ┌───┐
       │Map│
       └───┘
         ▲
    ┌────┴─────┐
    │          │
┌───────┐  ┌─────────┐
│HashMap│  │SortedMap│
└───────┘  └─────────┘
                ▲
                │
           ┌─────────┐
           │ TreeMap │
           └─────────┘

SortedMap guarantees iteration in key order. For example, if the keys "apple", "pear" and "orange" are put in, the iteration order is always "apple", "orange", "pear", because String sorts alphabetically by default.

When using a TreeMap, the keys put in must implement the Comparable interface. Classes like String and Integer already implement Comparable, so they can be used as keys directly. Objects used as values have no such requirement.

If the class used as the key does not implement Comparable, a custom sorting algorithm must be supplied when the TreeMap is created:

public class Main {
    public static void main(String[] args) {
        Map<Person, Integer> map = new TreeMap<>(new Comparator<Person>() {
            public int compare(Person p1, Person p2) {
                return p1.name.compareTo(p2.name);
            }
        });
        map.put(new Person("Tom"), 1);
        map.put(new Person("Bob"), 2);
        map.put(new Person("Lily"), 3);
        for (Person key : map.keySet()) {
            System.out.println(key);
        }
        // {Person: Bob}, {Person: Lily}, {Person: Tom}
        System.out.println(map.get(new Person("Bob"))); // 2
    }
}

class Person {
    public String name;
    Person(String name) {
        this.name = name;
    }
    public String toString() {
        return "{Person: " + name + "}";
    }
}

Note that the Comparator interface requires implementing a comparison method that compares two given elements a and b: if a < b it returns a negative number, usually -1; if a == b it returns 0; if a > b it returns a positive number, usually 1. TreeMap sorts the keys internally according to the comparison results.

The output above shows that the keys really are printed in the order defined by the Comparator. To look up a value by key, we can pass in a new Person("Bob") as the key, and it returns the corresponding Integer, 2.

Also note that this Person class does not override equals() or hashCode(), because TreeMap uses neither equals() nor hashCode().

Let's look at a slightly more complex example: this time we define a Student class and sort by score, highest first:

public class Main {
    public static void main(String[] args) {
        Map<Student, Integer> map = new TreeMap<>(new Comparator<Student>() {
            public int compare(Student p1, Student p2) {
                return p1.score > p2.score ? -1 : 1;
            }
        });
        map.put(new Student("Tom", 77), 1);
        map.put(new Student("Bob", 66), 2);
        map.put(new Student("Lily", 99), 3);
        for (Student key : map.keySet()) {
            System.out.println(key);
        }
        System.out.println(map.get(new Student("Bob", 66))); // null?
    }
}

class Student {
    public String name;
    public int score;
    Student(String name, int score) {
        this.name = name;
        this.score = score;
    }
    public String toString() {
        return String.format("{%s: score=%d}", name, score);
    }
}

(The lookup prints null because this comparator never returns 0: even for an equal score it returns 1, so the TreeMap never considers the lookup key equal to a stored key.)

Properties

Because configuration files are so common, the Java collections library provides a Properties class to represent a set of "configuration" entries. For historical reasons, Properties is internally essentially a Hashtable, but we only need Properties' own interface for reading and writing configuration.

A .properties file can be read from the file system:

String f = "setting.properties";
Properties props = new Properties();
props.load(new java.io.FileInputStream(f));

String filepath = props.getProperty("last_open_file");
String interval = props.getProperty("auto_save_interval", "120");

So reading configuration with Properties takes three steps:

  1. create a Properties instance;
  2. call load() to read the file;
  3. call getProperty() to get a configuration value.

When calling getProperty(), if the key does not exist, null is returned. We can also supply a default value, which is returned when the key is absent.

A .properties file can also be read from the classpath, because the load(InputStream) method accepts an InputStream instance representing a byte stream, which need not be a file stream; it can also be a resource stream read from a jar:

Properties props = new Properties();
props.load(getClass().getResourceAsStream("/common/setting.properties"));

Try reading a byte stream from memory:

import java.io.*;
import java.util.Properties;

public class Main {
    public static void main(String[] args) throws IOException {
        String settings = "# test" + "\n" + "course=Java" + "\n" + "last_open_date=2019-08-07T12:35:01";
        ByteArrayInputStream input = new ByteArrayInputStream(settings.getBytes("UTF-8"));
        Properties props = new Properties();
        props.load(input);

        System.out.println("course: " + props.getProperty("course"));
        System.out.println("last_open_date: " + props.getProperty("last_open_date"));
        System.out.println("last_open_file: " + props.getProperty("last_open_file"));
        System.out.println("auto_save: " + props.getProperty("auto_save", "60"));
    }
}

Set

If we only need to store unique keys without mapped values, we can use a Set.

A Set stores a collection of unique elements and mainly provides the following methods:

  • add an element to the Set<E>: boolean add(E e)
  • remove an element from the Set<E>: boolean remove(Object e)
  • check whether an element is present: boolean contains(Object e)
public class Main {
    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        System.out.println(set.add("abc")); // true
        System.out.println(set.add("xyz")); // true
        System.out.println(set.add("xyz")); // false, add fails: the element already exists
        System.out.println(set.contains("xyz")); // true, the element exists
        System.out.println(set.contains("XYZ")); // false, the element does not exist
        System.out.println(set.remove("hello")); // false, remove fails: the element does not exist
        System.out.println(set.size()); // 2, two elements in total
    }
}

Elements placed in a Set are like Map keys: they must correctly implement equals() and hashCode(), otherwise the element cannot be stored in the Set correctly.

The most commonly used Set implementation is HashSet. In fact, HashSet is just a thin wrapper around HashMap; its core code looks like this:

public class HashSet<E> implements Set<E> {
    // holds a HashMap:
    private HashMap<E, Object> map = new HashMap<>();

    // the value put into the HashMap:
    private static final Object PRESENT = new Object();

    public boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }

    public boolean contains(Object o) {
        return map.containsKey(o);
    }

    public boolean remove(Object o) {
        return map.remove(o) == PRESENT;
    }
}

The Set interface does not guarantee ordering, while the SortedSet interface guarantees that the elements are ordered:

  • HashSet is unordered: it implements only Set, not SortedSet;
  • TreeSet is ordered: it implements SortedSet.

As a diagram:

       ┌───┐
       │Set│
       └───┘
         ▲
    ┌────┴─────┐
    │          │
┌───────┐  ┌─────────┐
│HashSet│  │SortedSet│
└───────┘  └─────────┘
                ▲
                │
           ┌─────────┐
           │ TreeSet │
           └─────────┘
public class Main {
    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        set.add("apple");
        set.add("banana");
        set.add("pear");
        set.add("orange");
        for (String s : set) {
            System.out.println(s);
        }
    }
}

The order printed here is neither the insertion order nor String's sort order, and it may even differ between JDK versions.

But with a TreeSet, the output is sorted:

public class Main {
    public static void main(String[] args) {
        Set<String> set = new TreeSet<>();
        set.add("apple");
        set.add("banana");
        set.add("pear");
        set.add("orange");
        for (String s : set) {
            System.out.println(s);
        }
    }
}

Using TreeSet has the same requirement as using TreeMap: the elements added must correctly implement the Comparable interface; if they do not, a Comparator object must be passed in when the TreeSet is created.

Queue

In the Java standard library, the Queue interface defines the following methods:

  • int size(): get the queue length;
  • boolean add(E)/boolean offer(E): add an element to the tail;
  • E remove()/E poll(): take the head element and remove it from the queue;
  • E element()/E peek(): get the head element without removing it from the queue.

For the concrete implementation classes, some Queues have a maximum length limit and some do not. Note that there are always two methods each for adding, removing and inspecting elements, because the two methods behave differently when the operation fails, as shown in the table below (for example, add may throw an exception while offer does not):

| | Throws an exception | Returns false or null |
| --- | --- | --- |
| Add to the tail | add(E e) | boolean offer(E e) |
| Take the head and remove it | E remove() | E poll() |
| Take the head without removing it | E element() | E peek() |

Note: do not add null to a queue, otherwise when poll() returns null it is hard to tell whether a null element was retrieved or the queue is empty.
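A quick usage sketch of those paired methods (my own illustration, using ArrayDeque as the Queue implementation):

import java.util.ArrayDeque;
import java.util.Queue;

public class QueueDemo {
    public static void main(String[] args) {
        Queue<String> q = new ArrayDeque<>();
        q.offer("a");                      // add to the tail; returns false on a full bounded queue
        q.add("b");                        // add to the tail; throws on a full bounded queue
        System.out.println(q.peek());      // "a": inspect the head, queue unchanged
        System.out.println(q.poll());      // "a": remove the head
        System.out.println(q.poll());      // "b"
        System.out.println(q.poll());      // null: the queue is empty
        // q.remove() here would throw NoSuchElementException instead of returning null
    }
}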

PriorityQueue.

import java.util.PriorityQueue;
import java.util.Queue;

public class Main {
    public static void main(String[] args) {
        Queue<String> q = new PriorityQueue<>();
        // add 3 elements to the queue:
        q.offer("apple");
        q.offer("pear");
        q.offer("banana");
        System.out.println(q.poll()); // apple
        System.out.println(q.poll()); // banana
        System.out.println(q.poll()); // pear
        System.out.println(q.poll()); // null, the queue is empty
    }
}

The insertion order was apple, pear, banana, but they come out as apple, banana, pear. This is because we used a PriorityQueue, which takes elements out of the queue in sorted order. Elements stored in a priority queue must therefore implement the Comparable interface; if they do not, a Comparator object must be supplied to determine the order of two elements.

import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Queue;

public class Main {
    public static void main(String[] args) {
        Queue<User> q = new PriorityQueue<>(new UserComparator());
        // add 3 elements to the queue:
        q.offer(new User("Bob", "A1"));
        q.offer(new User("Alice", "A2"));
        q.offer(new User("Boss", "V1"));
        System.out.println(q.poll()); // Boss/V1
        System.out.println(q.poll()); // Bob/A1
        System.out.println(q.poll()); // Alice/A2
        System.out.println(q.poll()); // null, the queue is empty
    }
}

class UserComparator implements Comparator<User> {
    public int compare(User u1, User u2) {
        if (u1.number.charAt(0) == u2.number.charAt(0)) {
            // both numbers start with A, or both with V: compare the numbers themselves
            return u1.number.compareTo(u2.number);
        }
        if (u1.number.charAt(0) == 'V') {
            // u1's number starts with V, so it has higher priority:
            return -1;
        } else {
            return 1;
        }
    }
}

class User {
    public final String name;
    public final String number;

    public User(String name, String number) {
        this.name = name;
        this.number = number;
    }

    public String toString() {
        return name + "/" + number;
    }
}

The comparison logic of the UserComparator above is actually still flawed: because it compares the numbers as plain strings, it sorts A10 before A2.

Deque (Double Ended Queue)

The Java collections library provides the Deque interface for a double-ended queue, which can:

  • add elements at the tail or at the head;
  • take elements from the head or from the tail.

Comparing how Queue and Deque enqueue and dequeue:

| | Queue | Deque |
| --- | --- | --- |
| Add to the tail | add(E e) / offer(E e) | addLast(E e) / offerLast(E e) |
| Take the head and remove it | E remove() / E poll() | E removeFirst() / E pollFirst() |
| Take the head without removing it | E element() / E peek() | E getFirst() / E peekFirst() |
| Add to the head | - | addFirst(E e) / offerFirst(E e) |
| Take the tail and remove it | - | E removeLast() / E pollLast() |
| Take the tail without removing it | - | E getLast() / E peekLast() |
// not recommended:
LinkedList<String> d1 = new LinkedList<>();
d1.offerLast("z");
// recommended:
Deque<String> d2 = new LinkedList<>();
d2.offerLast("z");

This shows one principle of programming to abstractions: hold a reference to the interface rather than to the concrete implementation class whenever possible.

Summary

Deque implements a double-ended queue, which can:

  • add elements to the tail or the head: addLast()/offerLast()/addFirst()/offerFirst();
  • take and remove elements from the head/tail: removeFirst()/pollFirst()/removeLast()/pollLast();
  • get elements from the head/tail without removing them: getFirst()/peekFirst()/getLast()/peekLast();
  • always call the xxxFirst()/xxxLast() variants to distinguish them from the Queue methods (a usage sketch follows below);
  • avoid adding null to the queue.
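A short usage sketch (my own illustration, using ArrayDeque):

import java.util.ArrayDeque;
import java.util.Deque;

public class DequeDemo {
    public static void main(String[] args) {
        Deque<String> d = new ArrayDeque<>();
        d.offerLast("b");                  // tail: [b]
        d.offerLast("c");                  // tail: [b, c]
        d.offerFirst("a");                 // head: [a, b, c]
        System.out.println(d.peekFirst()); // a (not removed)
        System.out.println(d.pollLast());  // c (removed from the tail)
        System.out.println(d.pollFirst()); // a (removed from the head)
    }
}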

Stack

In Java we can use a Deque to provide Stack functionality:

  • push an element: push(E)/addFirst(E)
  • pop the top element: pop()/removeFirst()
  • get the top element without popping it: peek()/peekFirst()

Why is there no standalone Stack interface among the Java collection classes? Because a legacy class named Stack already exists, and for compatibility reasons a Stack interface cannot be created, so the Deque interface has to be used to "simulate" a Stack.

When using a Deque as a Stack, note that you should only call push()/pop()/peek() and not addFirst()/removeFirst()/peekFirst(); this keeps the code clearer.

What stacks are used for

Stacks are used very widely in computing. For example, the JVM maintains the nesting of Java method calls with a stack data structure, e.g.:

static void main(String[] args) {
    foo(123);
}

static String foo(int x) {
    return "F-" + bar(x + 1);
}

static int bar(int x) {
    return x << 2;
}

The JVM creates a method-call stack: each time a method is called, its arguments are pushed first and then the method executes; when the method returns, its return value is pushed and the caller obtains it through a pop operation.

Because the method-call stack has limited capacity, too many nested calls overflow the stack, raising a StackOverflowError.

Using the stack idea to convert a decimal number to hexadecimal:

import java.util.Stack;

public class DecimalToHexadecimal {
    public static String decimalToHex(int decimal) {
        if (decimal == 0) {
            return "0"; // return "0" directly when the decimal number is 0
        }

        Stack<Character> stack = new Stack<>();

        while (decimal > 0) {
            int remainder = decimal % 16;
            char hexDigit = getHexDigit(remainder);
            stack.push(hexDigit);
            decimal /= 16;
        }

        StringBuilder hexBuilder = new StringBuilder();
        while (!stack.isEmpty()) {
            hexBuilder.append(stack.pop());
        }

        return hexBuilder.toString();
    }

    private static char getHexDigit(int digit) {
        if (digit >= 0 && digit <= 9) {
            return (char) (digit + '0'); // convert the digit to a character
        } else {
            return (char) (digit - 10 + 'A'); // convert 10-15 to a character A-F
        }
    }

    public static void main(String[] args) {
        int decimalNumber = 255;
        String hexadecimalNumber = decimalToHex(decimalNumber);
        System.out.println("Decimal " + decimalNumber + " in hexadecimal is " + hexadecimalNumber);
    }
}

Iterator

All Java collection classes can be used with the for each loop:

List<String> list = List.of("Apple", "Orange", "Pear");
for (String s : list) {
    System.out.println(s);
}

In fact, the Java compiler does not know how to traverse a List. The code above compiles only because the compiler rewrites the for each loop into an ordinary loop over an Iterator:

for (Iterator<String> it = list.iterator(); it.hasNext(); ) {
    String s = it.next();
    System.out.println(s);
}

The benefit of iterators is that the caller always traverses every collection type in a uniform way, without caring about its internal storage structure.

For example, we know that ArrayList stores its elements in an internal array and also provides a get(int) method. So we could iterate with an index loop:

for (int i = 0; i < list.size(); i++) {
    Object value = list.get(i);
}

But then the caller has to know the collection's internal storage structure. Moreover, if the ArrayList is replaced with a LinkedList, get(int) becomes more expensive as the index grows; and if it is replaced with a Set, the code above does not even compile, because a Set has no index.

Iterator-based traversal has none of these problems, because the Iterator object is created internally by the collection itself, which knows how to traverse its own data efficiently. The caller gets uniform code, and the compiler can automatically rewrite the standard for each loop into Iterator traversal.

If we write our own collection class and want it to work with the for each loop, we only need to meet the following conditions:

  • the collection class implements the Iterable interface, which requires returning an Iterator object;
  • the Iterator object iterates over the collection's internal data.

The key point is that the collection class returns an Iterator object via its iterator() method, and that object must itself know how to traverse the collection.

A simple Iterator example follows; it always traverses the collection in reverse order:

import java.util.*;

public class Main {
    public static void main(String[] args) {
        ReverseList<String> rlist = new ReverseList<>();
        rlist.add("Apple");
        rlist.add("Orange");
        rlist.add("Pear");
        for (String s : rlist) {
            System.out.println(s);
        }
    }
}

class ReverseList<T> implements Iterable<T> {

    private List<T> list = new ArrayList<>();

    public void add(T t) {
        list.add(t);
    }

    @Override
    public Iterator<T> iterator() {
        return new ReverseIterator(list.size());
    }

    class ReverseIterator implements Iterator<T> {
        int index;

        ReverseIterator(int index) {
            this.index = index;
        }

        @Override
        public boolean hasNext() {
            return index > 0;
        }

        @Override
        public T next() {
            index--;
            return ReverseList.this.list.get(index);
        }
    }
}

The Collections class

It lives in the java.util package.

It provides many static methods for collection classes, making it convenient to operate on all kinds of collections.

1. Creating empty collections:

  • create an empty List: List<T> emptyList()
  • create an empty Map: Map<K, V> emptyMap()
  • create an empty Set: Set<T> emptySet()

Note that the returned empty collections are immutable: elements cannot be added to or removed from them.

Alternatively, the of(T...) methods provided by the collection interfaces can also create empty collections. For example, the following two ways of creating an empty List are equivalent:

List<String> list1 = List.of();
List<String> list2 = Collections.emptyList();

2. Creating singleton collections:

Collections provides a series of methods to create collections holding exactly one element:

  • a one-element List: List<T> singletonList(T o)
  • a one-element Map: Map<K, V> singletonMap(K key, V value)
  • a one-element Set: Set<T> singleton(T o)

Note that the returned singleton collections are also immutable: elements cannot be added or removed.

Alternatively, the of(T...) methods can also create singleton collections. For example, these two ways of creating a one-element List are equivalent:

List<String> list1 = List.of("apple");
List<String> list2 = Collections.singletonList("apple");

In practice, List.of(T...) is more convenient, since it can create empty collections, single-element collections, and collections of any number of elements alike:

List<String> list1 = List.of(); // empty list
List<String> list2 = List.of("apple"); // 1 element
List<String> list3 = List.of("apple", "pear"); // 2 elements
List<String> list4 = List.of("apple", "pear", "orange"); // 3 elements

3. Sorting:

Collections can sort a List. Because sorting directly changes the positions of the List's elements, a mutable List must be passed in:

public class Main {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        list.add("apple");
        list.add("pear");
        list.add("orange");
        // before sorting:
        System.out.println(list);
        Collections.sort(list);
        // after sorting:
        System.out.println(list);
    }
}

4. Shuffle:

import java.util.*;


public class Main {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            list.add(i);
        }
        // before shuffling:
        System.out.println(list);
        Collections.shuffle(list);
        // after shuffling:
        System.out.println(list);
    }
}

5. Immutable collections

Collections also provides a set of methods to wrap mutable collections as immutable ones:

  • wrap as an immutable List: List<T> unmodifiableList(List<? extends T> list)
  • wrap as an immutable Set: Set<T> unmodifiableSet(Set<? extends T> set)
  • wrap as an immutable Map: Map<K, V> unmodifiableMap(Map<? extends K, ? extends V> m)

This wrapping is actually implemented by creating a proxy object that intercepts all mutating methods. Let's see the effect:
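The demo is missing from the original notes; here is my own minimal sketch of what it would show:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UnmodifiableDemo {
    public static void main(String[] args) {
        List<String> mutable = new ArrayList<>();
        mutable.add("apple");
        List<String> readOnly = Collections.unmodifiableList(mutable);
        System.out.println(readOnly.get(0));  // reading works: apple
        try {
            readOnly.add("pear");             // mutating methods are intercepted
        } catch (UnsupportedOperationException e) {
            System.out.println("add() rejected: " + e);
        }
        mutable.add("pear");                  // note: it is a view, so changes made
        System.out.println(readOnly);         // via the original still show through: [apple, pear]
    }
}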

6. Thread-safe collections

Collections also provides a set of methods that turn thread-unsafe collections into thread-safe ones:

  • a thread-safe List: List<T> synchronizedList(List<T> list)
  • a thread-safe Set: Set<T> synchronizedSet(Set<T> s)
  • a thread-safe Map: Map<K,V> synchronizedMap(Map<K,V> m)

Multithreading is covered later. Since Java 5 introduced more efficient concurrent collection classes, these synchronized wrappers are of little use nowadays.

IO

File object

import java.io.*;

File f = new File(path); // the path may be absolute or relative

Note that Windows uses \ as the path separator, and in a Java string a \ must be written as \\. Linux uses / as the path separator:

File f = new File("/usr/bin/javac");

When a relative path is passed in, the absolute path is the current directory prepended to the relative path:

1
2
3
4
// 假设当前目录是C:\Docs
File f1 = new File("sub\\javac"); // 绝对路径是C:\Docs\sub\javac
File f3 = new File(".\\sub\\javac"); // 绝对路径是C:\Docs\sub\javac
File f3 = new File("..\\sub\\javac"); // 绝对路径是C:\sub\javac

This is can help to find the separator of the currect system.

1
File.separator

InputStream

InputStream is the most basic input stream provided by the Java standard library. It lives in the java.io package, which provides all synchronous IO functionality.

Note in particular that InputStream is not an interface but an abstract class, and it is the superclass of all input streams. The most important method it defines is int read(), with the signature:

public abstract int read() throws IOException;

This method reads the next byte of the input stream and returns it as an int value (0~255). At the end of the stream it returns -1, meaning nothing more can be read.

FileInputStream is a subclass of InputStream; as the name suggests, it reads data from a file stream. The code below shows how to read all bytes of a FileInputStream:

public void readFile() throws IOException {
    // create a FileInputStream object:
    InputStream input = new FileInputStream("src/readme.txt");
    for (;;) {
        int n = input.read(); // call read() repeatedly until it returns -1
        if (n == -1) {
            break;
        }
        System.out.println(n); // print the byte value
    }
    input.close(); // close the stream
}

Both InputStream and OutputStream are closed with the close() method, which releases the underlying resources.

Also note that errors may occur while reading or writing an IO stream: the file may not exist, we may lack write permission, and so on. The JVM wraps such low-level errors into an IOException and throws it, so all IO-related code must handle IOException properly.

A careful look at the code above reveals a latent problem: if an IO error occurs during reading, the InputStream cannot be closed properly and the resource is not released in time.

We therefore need try ... finally to guarantee that the InputStream is closed whether or not an IO error occurs.

Even better, Java 7 introduced try-with-resources: write only the try statement and let the compiler close the resource for us. The recommended form is:

public void readFile() throws IOException {
    try (InputStream input = new FileInputStream("src/readme.txt")) {
        int n;
        while ((n = input.read()) != -1) {
            System.out.println(n);
        }
    } // the compiler writes the finally block and calls close() for us here
}

In fact, the compiler does not treat InputStream specially. It only checks whether the object in try(resource = ...) implements the java.lang.AutoCloseable interface; if so, it adds a finally block and calls close(). Both InputStream and OutputStream implement this interface and can therefore be used in try(resource).

The code above is equivalent to:

public void readFile() throws IOException {
    InputStream input = null;
    try {
        input = new FileInputStream("src/readme.txt");
        int n;
        while ((n = input.read()) != -1) { // read and test in the same while condition
            System.out.println(n);
        }
    } finally {
        if (input != null) { input.close(); }
    }
}

Buffering

InputStream provides two overloaded methods that read several bytes at once:

  • int read(byte[] b): reads bytes into the byte[] array and returns the number of bytes read
  • int read(byte[] b, int off, int len): additionally specifies an offset into the byte[] array and a maximum count

To read several bytes at a time, first define a byte[] array as a buffer; read() fills as many bytes as possible into the buffer without exceeding its size. The return value is no longer a byte value but the number of bytes actually read; -1 still means there is no more data.

public void readFile() throws IOException {
    try (InputStream input = new FileInputStream("src/readme.txt")) {
        // define a 1000-byte buffer:
        byte[] buffer = new byte[1000];
        int n;
        while ((n = input.read(buffer)) != -1) { // read into the buffer
            System.out.println("read " + n + " bytes.");
        }
    }
}

Blocking

When reading data with InputStream's read() method, we say that read() is blocking. For the following code:

int n;
n = input.read(); // must wait for read() to return before the next line runs
int m = n;

when execution reaches the second line, it must wait until read() returns before continuing. Reading from an IO stream is much slower than running ordinary code, so we cannot know in advance how long a read() call will take.

InputStream implementations

FileInputStream, which obtains an input stream from a file, is the most commonly used implementation of InputStream. In addition, ByteArrayInputStream can simulate an InputStream in memory:

import java.io.*;

public class Main {
    public static void main(String[] args) throws IOException {
        byte[] data = { 72, 101, 108, 108, 111, 33 };
        try (InputStream input = new ByteArrayInputStream(data)) {
            int n;
            while ((n = input.read()) != -1) {
                System.out.println((char) n);
            }
        }
    }
}

ByteArrayInputStream essentially turns a byte[] array in memory into an InputStream. It is rarely needed in production code, but it is handy for constructing an InputStream in tests.

For example, to read all bytes from a file, convert them to chars and join them into a string, we can write:

import java.io.*;

public class Main {
    public static void main(String[] args) throws IOException {
        String s;
        try (InputStream input = new FileInputStream("C:\\test\\README.txt")) {
            int n;
            StringBuilder sb = new StringBuilder();
            while ((n = input.read()) != -1) {
                sb.append((char) n);
            }
            s = sb.toString();
        }
        System.out.println(s);
    }
}

OutputStream:

Like InputStream, OutputStream is an abstract class and the superclass of all output streams. The most important method it defines is void write(int b), with the signature:

public abstract void write(int b) throws IOException;

This method writes one byte to the output stream. Note that although the parameter is an int, only one byte is written, namely the lowest 8 bits of the int (equivalent to b & 0xff).

Like InputStream, OutputStream provides a close() method to close the stream and release system resources. Pay special attention to the flush() method: its purpose is to actually push buffered content out to the destination.

public void writeFile() throws IOException {
    OutputStream output = new FileOutputStream("out/readme.txt");
    output.write(72);  // H
    output.write(101); // e
    output.write(108); // l
    output.write(108); // l
    output.write(111); // o
    output.close();
}

Writing one byte at a time is tedious; more commonly we write several bytes at once using the overloaded method void write(byte[]) provided by OutputStream:

public void writeFile() throws IOException {
    OutputStream output = new FileOutputStream("out/readme.txt");
    output.write("Hello".getBytes("UTF-8")); // Hello
    output.close();
}

As with InputStream, the code above does not close the resource correctly when an exception occurs, and IO errors happen frequently when writing too, e.g. the disk is full or we lack write permission. We need try(resource) to guarantee that the OutputStream is closed correctly whether or not an IO error occurs:

public void writeFile() throws IOException {
    try (OutputStream output = new FileOutputStream("out/readme.txt")) {
        output.write("Hello".getBytes("UTF-8")); // Hello
    } // the compiler writes the finally block and calls close() for us here
}

Blocking

Like InputStream's read(), OutputStream's write() method is also blocking.

OutputStream implementations

FileOutputStream, which obtains an output stream to a file, is the most commonly used implementation of OutputStream. In addition, ByteArrayOutputStream can simulate an OutputStream in memory:

import java.io.*;

public class Main {
    public static void main(String[] args) throws IOException {
        byte[] data;
        try (ByteArrayOutputStream output = new ByteArrayOutputStream()) {
            output.write("Hello ".getBytes("UTF-8"));
            output.write("world!".getBytes("UTF-8"));
            data = output.toByteArray();
        }
        System.out.println(new String(data, "UTF-8"));
    }
}

The Filter pattern

Grouped by data source, the InputStreams in Java's IO library include:

  • FileInputStream: reads data from a file — an ultimate data source;
  • ServletInputStream: reads data from an HTTP request — an ultimate data source;
  • Socket.getInputStream(): reads data from a TCP connection — an ultimate data source.

If we wanted to add buffering to FileInputStream, we could keep deriving classes from it, e.g. BufferedFileInputStream, DigestFileInputStream and CipherFileInputStream, but this approach quickly leads to a subclass explosion:

                         ┌─────────────────┐
                         │ FileInputStream │
                         └─────────────────┘
                                  ▲
          ┌───────────────────────┼───────────────────────┐
          │                       │                       │
┌───────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│BufferedFileInputStream│ │  DigestInputStream  │ │CipherFileInputStream│
└───────────────────────┘ └─────────────────────┘ └─────────────────────┘
          │                                               │
┌─────────────────────────────┐           ┌─────────────────────────────┐
│BufferedDigestFileInputStream│           │BufferedCipherFileInputStream│
└─────────────────────────────┘           └─────────────────────────────┘

To solve this problem, the JDK first splits InputStream into two groups:

one group of basic InputStreams that directly supply data, e.g.:

  • FileInputStream
  • ByteArrayInputStream
  • ServletInputStream

and one group of InputStreams that provide extra add-on functionality, e.g.:

  • BufferedInputStream
  • DigestInputStream
  • CipherInputStream

When we want to attach features to a "basic" InputStream, we first pick the InputStream that supplies the data, since the data has to come from somewhere; e.g. FileInputStream, whose data comes from a file:

InputStream file = new FileInputStream("test.gz");

Next, suppose we want FileInputStream to buffer its reads for better efficiency; we wrap the InputStream in a BufferedInputStream. The wrapped type is BufferedInputStream, but it is still viewed as an InputStream:

InputStream buffered = new BufferedInputStream(file);

Finally, suppose the file is gzip-compressed and we want to read the decompressed content directly; we wrap it once more in a GZIPInputStream:

InputStream gzip = new GZIPInputStream(buffered);

No matter how many times we wrap, the result is always an InputStream. We refer to it through InputStream and read from it as usual:

┌─────────────────────────┐
│GZIPInputStream          │
│┌───────────────────────┐│
││BufferedFileInputStream││
││┌─────────────────────┐││
│││   FileInputStream   │││
││└─────────────────────┘││
│└───────────────────────┘│
└─────────────────────────┘

This pattern of stacking add-on functional components on top of a "basic" component is called the Filter pattern (also known as the Decorator pattern). It lets us combine a wide range of functionality out of a small number of classes:

                     ┌─────────────┐
                     │ InputStream │
                     └─────────────┘
                        ▲       ▲
┌────────────────────┐  │       │  ┌─────────────────┐
│  FileInputStream   │──┤       └──│FilterInputStream│
└────────────────────┘  │          └─────────────────┘
┌────────────────────┐  │             ▲ ┌───────────────────┐
│ByteArrayInputStream│──┤             ├─│BufferedInputStream│
└────────────────────┘  │             │ └───────────────────┘
┌────────────────────┐  │             │ ┌───────────────────┐
│ ServletInputStream │──┘             ├─│  DataInputStream  │
└────────────────────┘                │ └───────────────────┘
                                      │ ┌───────────────────┐
                                      └─│CheckedInputStream │
                                        └───────────────────┘

Similarly, OutputStream provides its functionality through the same pattern:

                      ┌─────────────┐
                      │OutputStream │
                      └─────────────┘
                         ▲       ▲
┌─────────────────────┐  │       │  ┌──────────────────┐
│  FileOutputStream   │──┤       └──│FilterOutputStream│
└─────────────────────┘  │          └──────────────────┘
┌─────────────────────┐  │             ▲ ┌────────────────────┐
│ByteArrayOutputStream│──┤             ├─│BufferedOutputStream│
└─────────────────────┘  │             │ └────────────────────┘
┌─────────────────────┐  │             │ ┌────────────────────┐
│ ServletOutputStream │──┘             ├─│ DataOutputStream   │
└─────────────────────┘                │ └────────────────────┘
                                       │ ┌────────────────────┐
                                       └─│CheckedOutputStream │
                                         └────────────────────┘

Working with Zip

ZipInputStream is a kind of FilterInputStream that can read the contents of a zip archive directly:

┌───────────────────┐
│    InputStream    │
└───────────────────┘
          ▲
┌───────────────────┐
│ FilterInputStream │
└───────────────────┘
          ▲
┌───────────────────┐
│InflaterInputStream│
└───────────────────┘
          ▲
┌───────────────────┐
│  ZipInputStream   │
└───────────────────┘
          ▲
┌───────────────────┐
│  JarInputStream   │
└───────────────────┘

JarInputStream derives from ZipInputStream; its main addition is reading the MANIFEST.MF file inside a jar directly. A jar is essentially just a zip with some fixed descriptor files attached.
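A minimal sketch of reading the manifest (the jar path is made up for illustration; JarInputStream and Manifest are in java.util.jar):

import java.io.*;
import java.util.jar.*;

public class Main {
    public static void main(String[] args) throws IOException {
        try (JarInputStream jar = new JarInputStream(new FileInputStream("test.jar"))) {
            Manifest mf = jar.getManifest(); // parsed from META-INF/MANIFEST.MF
            if (mf != null) {
                System.out.println(mf.getMainAttributes().getValue("Main-Class"));
            }
        }
    }
}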

Reading a zip archive

To create a ZipInputStream, we usually pass in a FileInputStream as the data source, then call getNextEntry() in a loop until it returns null, which marks the end of the zip stream.

A ZipEntry represents a compressed file or a directory; if it is a file, we keep calling read() until it returns -1:

try (ZipInputStream zip = new ZipInputStream(new FileInputStream(...))) {
    ZipEntry entry = null;
    while ((entry = zip.getNextEntry()) != null) {
        String name = entry.getName();
        if (!entry.isDirectory()) { // if the current entry is a file rather than a directory, start reading
            int n;
            while ((n = zip.read()) != -1) {
                ...
            }
        }
    }
}

Writing a zip archive

ZipOutputStream, a kind of FilterOutputStream, writes content directly into a zip archive.

First create a ZipOutputStream, usually wrapping a FileOutputStream. Before writing each file, call putNextEntry(); then write the byte[] data with write(); finally call closeEntry() to finish packing that file:

try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(...))) {
    File[] files = ...
    for (File file : files) {
        zip.putNextEntry(new ZipEntry(file.getName()));
        zip.write(Files.readAllBytes(file.toPath()));
        zip.closeEntry();
    }
}

Reading classpath resources

As we know, the directories and jar files that hold Java .class files can also contain arbitrary other kinds of files, e.g.:

  • configuration files, such as .properties
  • image files, such as .jpg
  • text files, such as .txt and .csv

Reading files from the classpath avoids path inconsistencies across environments: if we put default.properties on the classpath, we need not care where it is actually stored.

Resource paths on the classpath always start with /. Get the current Class object first, then call getResourceAsStream() to read any resource file directly from the classpath.

One thing needs special attention when calling getResourceAsStream(): if the resource file does not exist, it returns null. So we must check whether the returned InputStream is null; if it is, the resource was not found on the classpath:

try (InputStream input = getClass().getResourceAsStream("/default.properties")) {
    if (input != null) {
        // TODO:
    }
}

If we put the default configuration inside the jar and then read an optional configuration file from the external file system, we get both a default configuration and user-overridable settings:

Properties props = new Properties();
props.load(inputStreamFromClassPath("/default.properties"));
props.load(inputStreamFromFile("./conf.properties"));

The classpath specifies where a Java application looks for classes and resource files. It is a collection of directories and JAR files containing the compiled Java class files and other resources.

Caution

Storing resources on the classpath avoids file-path dependencies;

getResourceAsStream() on a Class object reads a given resource from the classpath;

when reading resources from the classpath, check whether the returned InputStream is null.

Serialization

Serialization turns a Java object into binary content, essentially a byte[] array.

Why serialize a Java object? Because the byte[] can then be saved to a file or transmitted over the network to a remote machine, which amounts to storing the Java object in a file or sending it across the network.

For a Java object to be serializable, it must implement the special java.io.Serializable interface, defined as:

public interface Serializable {
}

Serializable defines no methods; it is an empty interface. We call such empty interfaces "marker interfaces": a class implementing one merely tags itself and gains no methods.

A serialization example

Here we use ObjectOutputStream, which writes a Java object into a byte stream.

import java.io.*;
import java.util.Arrays;

public class Main {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream output = new ObjectOutputStream(buffer)) {
            // write an int:
            output.writeInt(12345);
            // write a String:
            output.writeUTF("Hello");
            // write an Object:
            output.writeObject(Double.valueOf(123.456));
        }
        System.out.println(Arrays.toString(buffer.toByteArray()));
    }
}

ObjectOutputStream can write primitive types such as int and boolean, Strings (UTF-8 encoded), and any Object that implements Serializable.

Writing an Object requires a lot of type information, so the written content is large.

A deserialization example

Conversely, ObjectInputStream reads Java objects back from a byte stream:

try (ObjectInputStream input = new ObjectInputStream(...)) {
    int n = input.readInt();
    String s = input.readUTF();
    Double d = (Double) input.readObject();
}

Besides primitive types and Strings, readObject() returns an Object directly; to get a specific type, a cast is required.

readObject() may throw:

  • ClassNotFoundException: no corresponding class was found;
  • InvalidClassException: the class does not match.

ClassNotFoundException typically occurs when one machine serializes an object, say a Person, and sends it over the network to a Java program on another machine that has no Person class defined, so deserialization is impossible.

InvalidClassException typically occurs when the serialized Person defined an int age field, but by deserialization time the Person class has changed age to long, making the classes incompatible.

To deal with these two problems:

Java serialization lets a class define a special static serialVersionUID field that identifies the serialization "version" of the class; it can usually be generated by the IDE. When fields are added or changed, changing serialVersionUID automatically rejects mismatched class versions:

public class Person implements Serializable {
    private static final long serialVersionUID = 2709425275741743919L;
}

During deserialization the JVM constructs the Java object directly, without invoking the constructor; code inside the constructor simply never runs at deserialization time.
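A minimal sketch illustrating this (a hypothetical Person whose constructor prints a message; serializing then deserializing prints the message only once, at construction):

import java.io.*;

public class Main {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream output = new ObjectOutputStream(buffer)) {
            output.writeObject(new Person()); // "constructor called" printed here
        }
        try (ObjectInputStream input = new ObjectInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            Person p = (Person) input.readObject(); // no constructor call happens here
            System.out.println("deserialized: " + p);
        }
    }
}

class Person implements Serializable {
    private static final long serialVersionUID = 1L;

    public Person() {
        System.out.println("constructor called");
    }
}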

Summary

A serializable Java object must implement the java.io.Serializable interface; empty interfaces like Serializable are called "marker interfaces";

deserialization does not invoke the constructor; serialVersionUID can be set as a version number (optional);

Java's serialization mechanism works only between Java programs. To exchange data with other languages, use a general-purpose serialization format such as JSON.

Reader/Writer

Reader

These are the other two input/output stream types in Java's IO library. The difference from InputStream is that InputStream is a byte stream read in units of byte, whereas Reader is a character stream read in units of char.

| InputStream | Reader |
| --- | --- |
| byte stream, in units of byte | character stream, in units of char |
| read a byte (-1, 0~255): int read() | read a char (-1, 0~65535): int read() |
| read into a byte array: int read(byte[] b) | read into a char array: int read(char[] c) |

java.io.Reader is the superclass of all character input streams; its principal method is:

public int read() throws IOException;

public void readFile() throws IOException {
    // create a FileReader object:
    Reader reader = new FileReader("src/readme.txt"); // what is the character encoding?
    for (;;) {
        int n = reader.read(); // call read() repeatedly until it returns -1
        if (n == -1) {
            break;
        }
        System.out.println((char) n); // print the char
    }
    reader.close(); // close the stream
}

If we read a pure ASCII text file, the code above works fine. But if the file contains Chinese characters, we get mojibake, because FileReader's default encoding is system-dependent: on Windows the default may be GBK, so opening a UTF-8 file produces garbled text.

To avoid this, specify the encoding when creating the FileReader:

Reader reader = new FileReader("src/readme.txt", StandardCharsets.UTF_8);

Like InputStream, a Reader is a resource that must be closed correctly even when errors occur, so we use try (resource) to guarantee the Reader is closed whether or not an IO error happens:

try (Reader reader = new FileReader("src/readme.txt", StandardCharsets.UTF_8)) {
    // TODO
}

Reader also provides a method that reads several characters at once into a char[] array:

public int read(char[] c) throws IOException

It returns the number of characters actually read, at most the length of the char[] array, and -1 at the end of the stream.

With this method we can set up a buffer and fill it as fully as possible on each call:

public void readFile() throws IOException {
    try (Reader reader = new FileReader("src/readme.txt", StandardCharsets.UTF_8)) {
        char[] buffer = new char[1000];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            System.out.println("read " + n + " chars.");
        }
    }
}

CharArrayReader

CharArrayReader simulates a Reader in memory; it effectively turns a char[] array into a Reader, very much like ByteArrayInputStream:

try (Reader reader = new CharArrayReader("Hello".toCharArray())) {
}

StringReader

StringReader uses a String directly as the data source; it is almost identical to CharArrayReader:

try (Reader reader = new StringReader("Hello")) {
}

InputStreamReader

What is the relationship between Reader and InputStream?

Apart from the special CharArrayReader and StringReader, an ordinary Reader is actually built on top of an InputStream: the Reader reads bytes from the InputStream and converts them to chars according to the configured encoding. Looking at the source of FileReader, it internally holds a FileInputStream.

Since a Reader is essentially a byte-to-char converter over an InputStream, we can convert any existing InputStream into a Reader. InputStreamReader is exactly this converter; it turns any InputStream into a Reader. Example:

// hold an InputStream:
InputStream input = new FileInputStream("src/readme.txt");
// convert it to a Reader:
Reader reader = new InputStreamReader(input, "UTF-8");

To construct an InputStreamReader we pass in the InputStream plus the encoding and get a Reader. The code above can be written more concisely with try (resource):

try (Reader reader = new InputStreamReader(new FileInputStream("src/readme.txt"), "UTF-8")) {
    // TODO:
}

This is effectively one way FileReader is implemented.

With try (resource), closing the Reader automatically calls the InputStream's close() internally, so only the outermost Reader object needs to be closed.

Using InputStreamReader, any InputStream can be converted into a Reader.

Summary

Reader defines the superclass of all character input streams:

  • FileReader implements character input from a file; specify an encoding when using it;
  • CharArrayReader and StringReader simulate character input in memory.

A Reader is built on top of an InputStream: InputStreamReader converts any InputStream into a Reader while specifying the encoding.

Always use try (resource) to make sure the Reader is closed correctly.

Writer

| OutputStream | Writer |
| --- | --- |
| byte stream, in units of byte | character stream, in units of char |
| write a byte (0~255): void write(int b) | write a char (0~65535): void write(int c) |
| write a byte array: void write(byte[] b) | write a char array: void write(char[] c) |
| no equivalent | write a String: void write(String s) |

Writer is the superclass of all character output streams; its main methods are:

  • write one character (0~65535): void write(int c);
  • write all characters of a char array: void write(char[] c);
  • write all characters of a String: void write(String s).

FileWriter

FileWriter is a Writer that writes a character stream to a file. Its usage mirrors FileReader:

try (Writer writer = new FileWriter("readme.txt", StandardCharsets.UTF_8)) {
    writer.write('H');                   // write a single char
    writer.write("Hello".toCharArray()); // write a char[]
    writer.write("Hello");               // write a String
}

CharArrayWriter

CharArrayWriter creates a Writer in memory; it effectively builds a buffer that chars can be written into, with the written char[] array retrievable at the end, very much like ByteArrayOutputStream:

try (CharArrayWriter writer = new CharArrayWriter()) {
    writer.write(65);
    writer.write(66);
    writer.write(67);
    char[] data = writer.toCharArray(); // { 'A', 'B', 'C' }
}

StringWriter

StringWriter is also an in-memory Writer, similar to CharArrayWriter. Internally, StringWriter maintains a StringBuffer and exposes the Writer interface.

OutputStreamWriter

Apart from CharArrayWriter and StringWriter, an ordinary Writer is actually built on top of an OutputStream: it accepts chars, converts them internally to one or more bytes, and writes them to the OutputStream. OutputStreamWriter is therefore a converter that turns any OutputStream into a Writer:

try (Writer writer = new OutputStreamWriter(new FileOutputStream("readme.txt"), "UTF-8")) {
    // TODO:
}

This is effectively one way FileWriter is implemented, just like InputStreamReader in the previous section.

Summary

Writer defines the superclass of all character output streams:

  • FileWriter implements character output to a file;
  • CharArrayWriter and StringWriter simulate character output in memory.

Use try (resource) to make sure the Writer is closed correctly.

A Writer is built on top of an OutputStream: OutputStreamWriter converts an OutputStream into a Writer; specify the encoding during conversion.

PrintStream/PrintWriter

PrintStream is a kind of FilterOutputStream that adds, on top of the OutputStream interface, methods for writing various data types:

  • write an int: print(int)
  • write a boolean: print(boolean)
  • write a String: print(String)
  • write an Object: print(Object), effectively print(object.toString())

PrintWriter

PrintStream ultimately outputs byte data, whereas PrintWriter extends Writer and its print()/println() methods ultimately output char data. Their usage is practically identical:

import java.io.*;

public class Main {
    public static void main(String[] args) {
        StringWriter buffer = new StringWriter();
        try (PrintWriter pw = new PrintWriter(buffer)) {
            pw.println("Hello");
            pw.println(12345);
            pw.println(true);
        }
        System.out.println(buffer.toString());
    }
}

Summary

PrintStream accepts output of all kinds of data types and is convenient for printing:

  • System.out is standard output;
  • System.err is standard error output.

PrintWriter is the Writer-based counterpart.

Files

Starting with Java 7, the Files utility class greatly simplifies reading and writing files.

Although Files (together with Paths) lives in the java.nio package, it wraps many simple methods for file IO. For example, to read the entire content of a file into a byte[]:

byte[] data = Files.readAllBytes(Path.of("/path/to/file.txt"));

For a text file, the whole content can be read as a String:

// read with UTF-8 encoding by default:
String content1 = Files.readString(Path.of("/path/to/file.txt"));
// or with an explicit encoding:
String content2 = Files.readString(Path.of("/path", "to", "file.txt"), StandardCharsets.ISO_8859_1);
// read line by line, returning each line:
List<String> lines = Files.readAllLines(Path.of("/path/to/file.txt"));

Writing files is just as convenient:

// write a binary file:
byte[] data = ...
Files.write(Path.of("/path/to/file.txt"), data);
// write text with an explicit encoding:
Files.writeString(Path.of("/path/to/file.txt"), "text content...", StandardCharsets.ISO_8859_1);
// write text line by line:
List<String> lines = ...
Files.write(Path.of("/path/to/file.txt"), lines);

In addition, the Files utility class has shortcut methods such as copy(), delete(), exists() and move() for operating on files and directories, as sketched below.
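A minimal sketch of these helpers (the paths are made up for illustration):

import java.nio.file.*;

public class Main {
    public static void main(String[] args) throws Exception {
        Path src = Path.of("/tmp/a.txt"); // hypothetical paths
        Path dst = Path.of("/tmp/b.txt");
        if (Files.exists(src)) {
            Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
            Files.move(dst, Path.of("/tmp/c.txt"), StandardCopyOption.REPLACE_EXISTING);
            Files.delete(Path.of("/tmp/c.txt"));
        }
    }
}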

Finally, note that the read/write helpers in Files are memory-bound and only suited to small files such as configuration files; do not read multi-gigabyte files in one go. Large files should still be processed with file streams, reading or writing a portion at a time.

Reflection

The Class class

Classes are loaded dynamically by the JVM during execution. The first time the JVM reads a given class type, it loads it into memory.

For each loaded class, the JVM creates an instance of type Class and associates it with the class. Note: this Class type is a class whose name is Class. It looks like this:

public final class Class {
    private Class() {}
}

Take String as an example: when the JVM loads String, it first reads the String.class file into memory, then creates a Class instance for String and associates the two:

Class cls = new Class(String);

This Class instance is created internally by the JVM. Looking at the JDK source, the constructor of Class is private: only the JVM can create Class instances; our own Java code cannot.

So every Class instance held by the JVM points to one data type (a class or an interface):

┌───────────────────────────┐
│      Class Instance       │──────> String
├───────────────────────────┤
│name = "java.lang.String"  │
└───────────────────────────┘
┌───────────────────────────┐
│      Class Instance       │──────> Random
├───────────────────────────┤
│name = "java.util.Random"  │
└───────────────────────────┘
┌───────────────────────────┐
│      Class Instance       │──────> Runnable
├───────────────────────────┤
│name = "java.lang.Runnable"│
└───────────────────────────┘
A Class instance holds the complete information about its class:

┌───────────────────────────┐
│      Class Instance       │──────> String
├───────────────────────────┤
│name = "java.lang.String"  │
├───────────────────────────┤
│package = "java.lang"      │
├───────────────────────────┤
│super = "java.lang.Object" │
├───────────────────────────┤
│interface = CharSequence...│
├───────────────────────────┤
│field = value[],hash,...   │
├───────────────────────────┤
│method = indexOf()...      │
└───────────────────────────┘

So if we obtain a Class instance, we can get all information about the class it corresponds to.

Obtaining class information through a Class instance is called reflection.

There are three ways to obtain the Class instance of a class

Method 1: directly via the static class variable of the class:

Class cls = String.class;

Method 2: given an instance variable, via its getClass() method:

String s = "Hello";
Class cls = s.getClass();

Method 3: given the fully qualified class name, via the static method Class.forName():

Class cls = Class.forName("java.lang.String");

Because Class instances are unique inside the JVM, all of the methods above return the same instance, which can be compared with ==:

Class cls1 = String.class;

String s = "Hello";
Class cls2 = s.getClass();

boolean sameClass = cls1 == cls2; // true

Note the difference between comparing Class instances and using instanceof:

Integer n = new Integer(123);

boolean b1 = n instanceof Integer; // true, n is an Integer
boolean b2 = n instanceof Number;  // true, n is a subclass of Number

boolean b3 = n.getClass() == Integer.class; // true, n.getClass() returns Integer.class
boolean b4 = n.getClass() == Number.class;  // false, Integer.class != Number.class

The information held by a reflected Class instance can be retrieved like this:

public class Main {
    public static void main(String[] args) {
        printClassInfo("".getClass()); // String; on an object, use .getClass() to get its Class
        printClassInfo(Runnable.class); // the Runnable interface
        printClassInfo(java.time.Month.class); // the Month enum
        printClassInfo(String[].class); // String[]
        printClassInfo(int.class); // int
    }

    static void printClassInfo(Class cls) {
        System.out.println("Class name: " + cls.getName());
        System.out.println("Simple name: " + cls.getSimpleName());
        if (cls.getPackage() != null) {
            System.out.println("Package name: " + cls.getPackage().getName());
        }
        System.out.println("is interface: " + cls.isInterface());
        System.out.println("is enum: " + cls.isEnum());
        System.out.println("is array: " + cls.isArray());
        System.out.println("is primitive: " + cls.isPrimitive());
    }
}

// get String's Class instance:
Class cls = String.class;
// create a String instance:
String s = (String) cls.newInstance();
// equivalent to
String s = new String();

Creating instances via String s = (String) cls.newInstance(); is limited to public no-argument constructors.
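For constructors that take arguments (or are not public), the reflection API provides getConstructor() and related methods on Class; a minimal sketch:

import java.lang.reflect.Constructor;

public class Main {
    public static void main(String[] args) throws Exception {
        // look up the String(String) constructor and invoke it:
        Constructor<String> cons = String.class.getConstructor(String.class);
        String s = cons.newInstance("Hello");
        System.out.println(s); // Hello
    }
}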

When executing a Java program, the JVM does not load every class it might use into memory at once; it loads each class the first time it is needed. For example:

// Main.java
public class Main {
    public static void main(String[] args) {
        if (args.length > 0) {
            create(args[0]);
        }
    }

    static void create(String name) {
        Person p = new Person(name);
    }
}

When Main.java runs, the JVM first loads Main.class because Main is used. Person.class, however, is not loaded until execution reaches the create() method and the JVM finds it needs the Person class. If create() is never executed, Person.class is never loaded at all.

Dynamic class loading is very important to Java programs. It is what allows different implementation classes to be loaded at runtime depending on conditions. For example, Commons Logging always prefers Log4j and falls back to JDK logging only when Log4j is absent. Using the JVM's dynamic loading, the implementation looks roughly like this:

// Commons Logging prefers Log4j:
LogFactory factory = null;
if (isClassPresent("org.apache.logging.log4j.Logger")) {
    factory = createLog4j();
} else {
    factory = createJdkLog();
}

boolean isClassPresent(String name) {
    try {
        Class.forName(name);
        return true;
    } catch (Exception e) {
        return false;
    }
}

This is why dropping the Log4j jar onto the classpath is enough for Commons Logging to pick up Log4j automatically.

Accessing fields

Let's first see how to get field information through a Class instance. Class provides these methods:

  • Field getField(name): gets a public field by name (including inherited ones)
  • Field getDeclaredField(name): gets a field of this class by name (excluding inherited ones)
  • Field[] getFields(): gets all public fields (including inherited ones)
  • Field[] getDeclaredFields(): gets all fields of this class (excluding inherited ones)

public class Main {
    public static void main(String[] args) throws Exception {
        Class stdClass = Student.class;
        // get the public field "score":
        System.out.println(stdClass.getField("score"));
        // get the inherited public field "name":
        System.out.println(stdClass.getField("name"));
        // get the private field "grade":
        System.out.println(stdClass.getDeclaredField("grade"));
    }
}

class Student extends Person {
    public int score;
    private int grade;
}

class Person {
    public String name;
}

It prints the following (org.example is the package this class lives in):

public int org.example.Student.score
public java.lang.String org.example.Person.name
private int org.example.Student.grade

The Field object

A Field object holds all information about one field:

  • getName(): returns the field name, e.g. "name";
  • getType(): returns the field type, also a Class instance, e.g. String.class;
  • getModifiers(): returns the field's modifiers as an int whose bits encode different meanings.

Field f = String.class.getDeclaredField("value");
f.getName(); // "value"
f.getType(); // class [B, i.e. byte[]
int m = f.getModifiers();
Modifier.isFinal(m);     // true
Modifier.isPublic(m);    // false
Modifier.isProtected(m); // false
Modifier.isPrivate(m);   // true
Modifier.isStatic(m);    // false

The following code first obtains the Field for the name field, then reads the value of that field from an instance.

import java.lang.reflect.Field;

public class Main {

    public static void main(String[] args) throws Exception {
        Object p = new Person("Xiao Ming");
        Class c = p.getClass();
        Field f = c.getDeclaredField("name");
        Object value = f.get(p);
        System.out.println(value); // "Xiao Ming"
    }
}

class Person {
    private String name;

    public Person(String name) {
        this.name = name;
    }
}

The code above first obtains the Class instance, then the Field instance, and finally reads the given field of the given instance with Field.get(Object).

Running it, however, throws an IllegalAccessException, because name is declared private, and under normal circumstances Main cannot access Person's private fields. To fix this, either change private to public, or add one line before Object value = f.get(p);:

f.setAccessible(true);

Calling Field.setAccessible(true) means: allow access to this field regardless of whether it is public.
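Symmetrically, Field.set(Object, Object) writes a field. A minimal sketch, reusing the Person class above:

import java.lang.reflect.Field;

public class Main {
    public static void main(String[] args) throws Exception {
        Person p = new Person("Xiao Ming");
        Field f = p.getClass().getDeclaredField("name");
        f.setAccessible(true); // ignore the private modifier
        f.set(p, "Xiao Hong"); // write the field via reflection
        System.out.println(f.get(p)); // "Xiao Hong"
    }
}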

Invoking methods

Invoking constructors

Obtaining inheritance relationships

Dynamic proxies

Annotations

Generics

Multithreading (multi-thread)

Basics

Process: in a computer, a task is called a process; the browser is one process, the video player is another.

Thread: some processes need to run multiple subtasks at the same time. For example, while using Word we can type, have spelling checked, and print in the background all at once; these subtasks are called threads.

The relationship between processes and threads: a process contains one or more threads, and it always has at least one thread.

The smallest unit of task scheduled by the operating system is the thread. Common operating systems such as Windows and Linux use preemptive multitasking: thread scheduling is entirely up to the OS, and a program cannot decide when it runs or for how long.

Since one application can have multiple processes and multiple threads, multitasking can be implemented in several ways:

  1. multi-process mode (each process has a single thread)

  2. multi-thread mode (one process with multiple threads)

  3. multi-process plus multi-thread (multiple processes, each possibly with multiple threads)

Processes contain threads, but multitasking can be achieved with multiple processes, with multiple threads inside a single process, or with a mix of processes and threads.

Compared with multithreading, the drawbacks of multiprocessing are:

  • creating a process costs more than creating a thread, especially on Windows;
  • inter-process communication is slower than inter-thread communication, because inter-thread communication is just reading and writing the same variables.

The advantage of multiprocessing is:

stability: with multiple processes, one crashing process does not affect the others, whereas with multiple threads, any crashing thread takes the whole process down.

Multithreading

Java has built-in multithreading support: a Java program is really a JVM process, which uses a main thread to run the main() method; inside main() we can start more threads. The JVM also has other worker threads, such as those responsible for garbage collection.

So for most Java programs, "multitasking" really means implementing multiple tasks with multiple threads.

Compared with single-threaded code, the hallmark of multithreaded programming is that threads frequently read and write shared data and need synchronization. For example, when playing a movie, one thread plays the video and another plays the audio; the two must be coordinated or picture and sound drift apart. This makes multithreaded programs more complex and harder to debug.

Java multithreading is also characterized by:

  • the multithreaded model being the most basic concurrency model of Java programs;
  • networking, databases, web development and the rest all depending on the Java multithreading model.

So mastering Java multithreading is a prerequisite for going deeper into those topics.

Creating a new thread

Java has built-in thread support. Starting a Java program starts a JVM process, which starts the main thread to run main(); within main() we can start other threads:

public class Main {
    public static void main(String[] args) {
        Thread t = new Thread();
        t.start(); // start the new thread
    }
}

Making a thread run specific code

Method 1: derive a class from Thread and override run():

public class Main {
    public static void main(String[] args) {
        Thread t = new MyThread();
        t.start(); // start the new thread
    }
}

class MyThread extends Thread {
    @Override
    public void run() {
        System.out.println("start new thread!");
    }
}

Method 2: pass a Runnable instance when creating the Thread:

public class Main {
    public static void main(String[] args) {
        Thread t = new Thread(new MyRunnable());
        t.start(); // start the new thread
    }
}

class MyRunnable implements Runnable {
    @Override
    public void run() {
        System.out.println("start new thread!");
    }
}

which is equivalent to:

public class Main {
    public static void main(String[] args) {
        Thread t = new Thread(() -> {
            System.out.println("start new thread!");
        });
        t.start(); // start the new thread
    }
}

public class Main {
    public static void main(String[] args) {
        System.out.println("main start...");
        Thread t = new Thread() {
            public void run() {
                System.out.println("thread run...");
                System.out.println("thread end.");
            }
        };
        t.start();
        System.out.println("main end...");
    }
}

Consider the execution order of the threads:

  1. the main thread certainly prints main start before main end;
  2. the t thread certainly prints thread run before thread end.

However, apart from main start being printed first, there is no telling whether main end is printed before thread run, after thread end, or in between. Once t starts, the two threads run concurrently and are scheduled by the operating system; the program itself cannot determine the scheduling order.

To observe the interleaving, call Thread.sleep():

public class Main {
    public static void main(String[] args) {
        System.out.println("main start...");
        Thread t = new Thread() {
            public void run() {
                System.out.println("thread run...");
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {}
                System.out.println("thread end.");
            }
        };
        t.start();
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {}
        System.out.println("main end...");
    }
}

sleep() takes milliseconds. By adjusting the pauses we can observe the relative order of the main thread and the t thread.

Note in particular: calling run() directly on a Thread instance is useless.

Calling run() directly is just an ordinary Java method call on the current thread; no new thread is started. Such code merely invokes run() inside main(), so the printed output comes from the main thread and no new thread is ever created.

A new thread is started only by calling start() on the Thread instance. Looking at the source of Thread, start() internally calls private native void start0(); the native modifier means the method is implemented in C inside the JVM, not in Java.

A thread's priority can be set with:

Thread.setPriority(int n) // 1~10, default 5

The JVM maps priorities 1 (low) to 10 (high) onto actual OS priorities (different operating systems have different numbers of priority levels). High-priority threads may be scheduled more often by the OS, but we can never rely on priority to guarantee that a high-priority thread runs first.

Thread states

In a Java program, a thread object can call start() only once, launching a new thread that runs run(). When run() finishes, the thread ends. Java threads therefore have the following states:

  • New: newly created, not yet running;
  • Runnable: running, executing the Java code of run();
  • Blocked: running but suspended because some operation is blocked;
  • Waiting: running but waiting for some operation;
  • Timed Waiting: running but in a timed wait due to sleep();
  • Terminated: ended, because run() has finished.

A state-transition diagram:

         ┌─────────────┐
         │     New     │
         └─────────────┘
                │
                ▼
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
 ┌─────────────┐ ┌─────────────┐
││  Runnable   │ │   Blocked   ││
 └─────────────┘ └─────────────┘
│┌─────────────┐ ┌─────────────┐│
 │   Waiting   │ │Timed Waiting│
│└─────────────┘ └─────────────┘│
 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
                │
                ▼
         ┌─────────────┐
         │ Terminated  │
         └─────────────┘

Once started, a thread switches among Runnable, Blocked, Waiting and Timed Waiting until it finally becomes Terminated and ends.

A thread terminates because:

  • it ends normally: run() returns via a return statement;
  • it ends unexpectedly: run() dies from an uncaught exception;
  • stop() is called on its Thread instance to kill it forcibly (strongly discouraged).

join()

A thread can wait for another thread to finish. For example, after starting thread t, the main thread can call t.join() to wait for t to end before continuing:

public class Main {
    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> {
            System.out.println("hello");
        });
        System.out.println("start");
        t.start();
        t.join();
        System.out.println("end");
    }
}

Interrupting a thread

Interrupting a thread is simple: some other thread calls interrupt() on the target thread; the target thread should repeatedly check whether its interrupted flag is set and, if so, stop running immediately.

public class Main {
    public static void main(String[] args) throws InterruptedException {
        Thread t = new MyThread();
        t.start();
        Thread.sleep(1); // pause for 1 millisecond
        t.interrupt(); // interrupt thread t
        t.join(); // wait for thread t to finish
        System.out.println("end");
    }
}

class MyThread extends Thread {
    public void run() {
        int n = 0;
        while (!isInterrupted()) {
            n++;
            System.out.println(n + " hello!");
        }
    }
}

Look carefully at the code: main interrupts t via t.interrupt(), but note that interrupt() merely sends an "interrupt request" to t; whether t responds promptly depends on its code. Here t's while loop checks isInterrupted(), so the request is honored and run() exits immediately.

public class Main {
    public static void main(String[] args) throws InterruptedException {
        Thread t = new MyThread();
        t.start();
        Thread.sleep(1000);
        t.interrupt(); // interrupt thread t
        t.join(); // wait for thread t to finish
        System.out.println("end");
    }
}

class MyThread extends Thread {
    public void run() {
        Thread hello = new HelloThread();
        hello.start(); // start the hello thread
        try {
            hello.join(); // wait for the hello thread to finish
        } catch (InterruptedException e) {
            System.out.println("interrupted!");
        }
        hello.interrupt();
    }
}

class HelloThread extends Thread {
    public void run() {
        int n = 0;
        while (!isInterrupted()) {
            n++;
            System.out.println(n + " hello!");
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                break;
            }
        }
    }
}

main notifies t to interrupt via t.interrupt() while t is waiting inside hello.join(); that method immediately stops waiting and throws InterruptedException. Since we catch InterruptedException in thread t, t can prepare to finish. Before ending, t also calls interrupt() on the hello thread to notify it. If that line is removed, the hello thread keeps running and the JVM never exits.

public class Main {
    public static void main(String[] args) throws InterruptedException {
        HelloThread t = new HelloThread();
        t.start();
        Thread.sleep(1);
        t.running = false; // set the flag to false
    }
}

class HelloThread extends Thread {
    public volatile boolean running = true;
    public void run() {
        int n = 0;
        while (running) {
            n++;
            System.out.println(n + " hello!");
        }
        System.out.println("end!");
    }
}

Another common way to stop a thread is a flag variable. We typically use a running flag to mark whether the thread should keep going; setting HelloThread.running to false from outside ends the thread.

Note that the boolean running flag of HelloThread is a variable shared between threads. Variables shared between threads must be marked with the volatile keyword to make sure every thread reads the updated value.

Why declare shared variables volatile? This involves the Java memory model. In the JVM, a variable's value lives in main memory, but when a thread accesses it, it first takes a copy into its own working memory. If the thread modifies the variable, the JVM writes the new value back to main memory at some point, but exactly when is undetermined!

┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
             Main Memory
│                               │
   ┌───────┐┌───────┐┌───────┐
│  │ var A ││ var B ││ var C │  │
   └───────┘└───────┘└───────┘
│     │ ▲             │ ▲       │
 ─ ─ ─│─│─ ─ ─ ─ ─ ─ ─│─│─ ─ ─ ─
      │ │             │ │
┌ ─ ─ ┼ ┼ ─ ─ ┐ ┌ ─ ─ ┼ ┼ ─ ─ ┐
      ▼ │             ▼ │
│  ┌───────┐  │ │  ┌───────┐  │
   │ var A │       │ var C │
│  └───────┘  │ │  └───────┘  │
   Thread 1        Thread 2
└ ─ ─ ─ ─ ─ ─ ┘ └ ─ ─ ─ ─ ─ ─ ┘

The volatile keyword tells the JVM:

  • always fetch the latest value from main memory when reading the variable;
  • write the variable back to main memory immediately after modifying it.

volatile solves the visibility problem: when one thread modifies a shared variable, other threads see the new value immediately.

If we drop volatile and run the program above, the effect looks about the same. That's because on the x86 architecture the JVM writes back to main memory very quickly; on ARM, however, the delay becomes noticeable.

Daemon Thread

Some threads are meant to loop forever, e.g. a thread that triggers timed tasks:

import java.time.LocalTime;

class TimerThread extends Thread {
    @Override
    public void run() {
        while (true) {
            System.out.println(LocalTime.now());
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                break;
            }
        }
    }
}

For this kind of thread, we can use daemon threads to end them.

In the JVM, once all non-daemon threads have finished, the virtual machine exits regardless of any daemon threads, so on JVM exit we need not care whether daemon threads have finished.

How to create a daemon thread? The same as a normal thread, except that setDaemon(true) is called before start() to mark it as a daemon:

Thread t = new MyThread();
t.setDaemon(true);
t.start();

Synchronized

If multiple threads read and write shared variables at the same time, the data becomes inconsistent.

public class Main {
    public static void main(String[] args) throws Exception {
        Thread add = new AddThread();
        Thread dec = new DecThread();
        add.start();
        dec.start();
        add.join();
        dec.join();
        System.out.println(Counter.count);
    }
}

class Counter {
    public static int count = 0;
}

class AddThread extends Thread {
    public void run() {
        for (int i = 0; i < 10000; i++) { Counter.count += 1; }
    }
}

class DecThread extends Thread {
    public void run() {
        for (int i = 0; i < 10000; i++) { Counter.count -= 1; }
    }
}

The code above can print a different result every run, because the atomicity of the operations is not guaranteed.

As this shows, the atomicity of a piece of code is ensured by locking and unlocking. A Java program locks on an object with the synchronized keyword:

synchronized (lock) {
    n = n + 1;
}

Here is an example using synchronized:

public class Main {
    public static void main(String[] args) throws Exception {
        var add = new AddThread();
        var dec = new DecThread();
        add.start();
        dec.start();
        add.join();
        dec.join();
        System.out.println(Counter.count);
    }
}

class Counter {
    public static final Object lock = new Object();
    public static int count = 0;
}

class AddThread extends Thread {
    public void run() {
        for (int i = 0; i < 10000; i++) {
            synchronized (Counter.lock) {
                Counter.count += 1;
            }
        }
    }
}

class DecThread extends Thread {
    public void run() {
        for (int i = 0; i < 10000; i++) {
            synchronized (Counter.lock) {
                Counter.count -= 1;
            }
        }
    }
}

synchronized (Counter.lock) { // acquire the lock
    ...
} // release the lock

Here the Counter.lock instance serves as the lock. Each thread must acquire the lock before executing its block; after the synchronized block ends, the lock is released automatically. As a result, reads and writes of the Counter.count variable can never happen simultaneously.

Beware of using locks incorrectly: guarding what should be one atomic operation with two different locks is useless, as is locking data that can safely be accessed concurrently.

Some operations do not need synchronized:

  • assignment of primitive types (except long and double), e.g. int n = m;
  • assignment of reference types, e.g. List<String> list = anotherList;
  • reads and writes of immutable objects.

Synchronized methods

Rather than letting callers choose the lock object, it is better to encapsulate synchronized when defining the class. For example, the following counter:

public class Counter {
    private int count = 0;

    public void add(int n) {
        synchronized (this) {
            count += n;
        }
    }

    public void dec(int n) {
        synchronized (this) {
            count -= n;
        }
    }

    public int get() {
        return count;
    }
}

This way, threads calling add() or dec() need not care about the synchronization logic, because the synchronized blocks are inside add() and dec(). Also note that the locked object is this, i.e. the current instance, so multiple Counter instances do not affect one another and can run concurrently.

A class designed so that multiple threads can access it correctly is called thread-safe; the Counter above is thread-safe. java.lang.StringBuffer in the standard library is thread-safe too.

There are also immutable classes such as String, Integer and LocalDate: all their member variables are final, so concurrent access is read-only, and these immutable classes are thread-safe as well.

public void add(int n) {
    synchronized (this) { // lock this
        count += n;
    } // unlock
}

is equivalent to

public synchronized void add(int n) { // lock this
    count += n;
} // unlock

So a method declared synchronized is a synchronized method: the whole method is locked on the this instance.

A static method has no this instance, because static methods belong to the class rather than to an instance. But every class has a Class instance created automatically by the JVM, so adding synchronized to a static method locks on the class's Class instance:

public class Counter {
    public synchronized static void test(int n) {
        ...
    }
}

is equivalent to

public class Counter {
    public static void test(int n) {
        synchronized (Counter.class) {
            ...
        }
    }
}

Deadlock

The JVM allows the same thread to acquire the same lock repeatedly; a lock that can be re-acquired by its holder is called a reentrant lock.

Because Java's thread locks are reentrant, acquiring a lock must not only check whether it is the first acquisition but also count acquisitions. Each acquisition adds 1 to the count; leaving a synchronized block subtracts 1; only when the count drops to 0 is the lock truly released.

Deadlock: two threads each hold a different lock and each tries to acquire the lock held by the other, so both wait forever.

Once a deadlock occurs, no mechanism can resolve it; the JVM process can only be killed.

How do we avoid deadlock then? The answer: threads must acquire locks in a consistent order, i.e. strictly acquire lockA first, then lockB, as sketched below.
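A minimal sketch of the "consistent order" rule (lockA and lockB are assumed shared monitors; both methods acquire lockA before lockB, so neither thread can hold one lock while waiting for the other):

public class SharedState {
    static final Object lockA = new Object();
    static final Object lockB = new Object();
    static int value = 0;
    static int another = 0;

    // every thread acquires lockA first, then lockB:
    public static void add(int m) {
        synchronized (lockA) {
            value += m;
            synchronized (lockB) {
                another += m;
            }
        }
    }

    // same order as add(), so no deadlock is possible:
    public static void dec(int m) {
        synchronized (lockA) {
            value -= m;
            synchronized (lockB) {
                another -= m;
            }
        }
    }
}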

Wait and notify

In Java, synchronized solves the problem of threads competing. For example, for a task manager where multiple threads add tasks to a queue, we can lock with synchronized:

class TaskQueue {
    Queue<String> queue = new LinkedList<>();

    public synchronized void addTask(String s) {
        this.queue.add(s);
    }

    public synchronized String getTask() {
        while (queue.isEmpty()) {
        }
        return queue.remove();
    }
}

Wrong!

While spinning inside that while loop, the thread holds the this lock, so no other thread can ever enter addTask().

The fix is to call this.wait() inside the getTask() method:

public synchronized String getTask() {
    while (queue.isEmpty()) {
        // release the this lock:
        this.wait();
        // the this lock is re-acquired here
    }
    return queue.remove();
}

The key point: wait() must be called on the lock object currently held; here that is the this lock, so we call this.wait().

Moreover, wait() may only be called inside a synchronized block. Calling wait() releases the lock the thread holds; after wait() returns, the thread tries to re-acquire the lock.

Calling notify() on the same lock object wakes up a waiting thread, which then returns from wait(). notifyAll() wakes up all threads currently waiting on the this lock. notifyAll() is safer: with careless logic, notify() may wake only one thread while the others keep waiting forever and never wake up.

public synchronized void addTask(String s) {
    this.queue.add(s);
    this.notify(); // wake up a thread waiting on the this lock
}
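Putting it together, a minimal sketch of the full queue with one consumer thread (task names and sleep times are made up for illustration):

import java.util.*;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        TaskQueue q = new TaskQueue();
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String s = q.getTask(); // blocks until a task is available
                    System.out.println("execute task: " + s);
                }
            } catch (InterruptedException e) {
                // interrupted while waiting: exit the worker
            }
        });
        worker.start();
        for (int i = 0; i < 5; i++) {
            q.addTask("t-" + i);
            Thread.sleep(100);
        }
        worker.interrupt();
        worker.join();
    }
}

class TaskQueue {
    private final Queue<String> queue = new LinkedList<>();

    public synchronized void addTask(String s) {
        queue.add(s);
        notifyAll(); // wake up threads waiting on this lock
    }

    public synchronized String getTask() throws InterruptedException {
        while (queue.isEmpty()) {
            wait(); // releases this lock; re-acquires it on return
        }
        return queue.remove();
    }
}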

Design pattern

Factory pattern

The purpose of the factory method is to make creating and using objects separate, and the client always refers to the abstract factory and the abstract product:

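A minimal sketch of this separation (all names are made up; the client touches only the abstract ParserFactory and Parser, never the concrete classes):

// abstract product:
interface Parser {
    int parse(String s);
}

// abstract factory:
interface ParserFactory {
    Parser createParser();

    // the client obtains a factory without naming a concrete class:
    static ParserFactory getFactory() {
        return new DecimalParserFactory();
    }
}

// concrete product, invisible to the client:
class DecimalParser implements Parser {
    public int parse(String s) {
        return Integer.parseInt(s);
    }
}

// concrete factory, invisible to the client:
class DecimalParserFactory implements ParserFactory {
    public Parser createParser() {
        return new DecimalParser();
    }
}

public class Main {
    public static void main(String[] args) {
        ParserFactory factory = ParserFactory.getFactory();
        Parser parser = factory.createParser();
        System.out.println(parser.parse("123")); // 123
    }
}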

mall developing

Bug shooting

Bug1.

captcha disappearance

https://www.javazxz.com/thread-7116-1-1.html

Bug2.

com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure

Network problem: the student.gla.ac.uk network cannot reach the Tencent server.

Bug3.

No spring.config.import property has been defined

The cause: bootstrap.properties has higher priority than application.properties.
bootstrap.properties is the system-level configuration file, read very early during program bootstrap;
application.properties is the user-level configuration file, holding common parameters needed by later configuration.
However, Spring Cloud 2020.* disabled bootstrap, so the file cannot be read and an error is raised; re-importing the bootstrap starter makes it take effect again:

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-bootstrap</artifactId>
    <version>3.0.3</version>
</dependency>

Once all configuration is moved to nacos, this dependency is no longer needed.

Bug4

java: cannot find symbol — symbol: method XXX(), location: type io.renren.modules.sys.entity

Solution:

  1. The Lombok built into IDEA and the version in the project pom are incompatible, i.e. the two conflict; this mostly shows up on IDEA 2020 and 2021.
  2. The Lombok version is generally managed by the Spring Boot version, so upgrading Spring Boot also raises the Lombok version; any Lombok newer than the version above will do.

<!-- Lombok -->
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <version>1.18.14</version>
    <!-- this line must be added; my error was exactly that it was missing -->
    <scope>provided</scope>
</dependency>

Config:

1. redis

Remote configuration: (screenshot omitted)

mysql

(two screenshots of the MySQL configuration omitted)

JDBC

Info:

Url:jdbc:mysql://47.98.63.250:3306/mall_pms

Use the server's public IP address.

DBMS: MySQL (ver. 5.7.41)
Case sensitivity: plain=exact, delimited=exact
Driver: MySQL Connector/J (ver. mysql-connector-java-8.0.25 (Revision: 08be9e9b4cba6aa115f9b27b215887af40b159e0), JDBC4.2)

Ping: 1 sec, 25 ms
SSL: yes

nacos:

Configure nacos according to the latest Spring Cloud Alibaba version.

Do not use a bootstrap configuration file to locate nacos:

<dependency>
    <groupId>com.alibaba.cloud</groupId>
    <artifactId>spring-cloud-starter-alibaba-nacos-config</artifactId>
    <!-- exclude bootstrap; future spring-cloud-alibaba versions should make this dependency optional on spring boot >= 2.4.0 -->
    <exclusions>
        <exclusion>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-starter-bootstrap</artifactId>
        </exclusion>
    </exclusions>
</dependency>

spring:
  cloud:
    nacos:
      config:
        server-addr: 127.0.0.1:8848
        namespace: 2e40d043-66f9-4a92-acc1-af90e80b9204
        group: dev
  config:
    import:
      - optional:nacos:mall-coupon.yml?group=dev # watch group dev for mall-coupon.yml, overriding the default group
      - optional:nacos:datasource.yml?group=dev&refreshEnabled=false # no dynamic refresh
      - optional:nacos:mybatis.yml?group=dev
      - optional:nacos:other.yml?group=dev

Data loss when the container restarts:

MySQL configuration

vi /mydata/mysql/conf/my.cnf

[client]
default-character-set=utf8

[mysql]
default-character-set=utf8

[mysqld]
init_connect='SET collation_connection = utf8_unicode_ci'
init_connect='SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_unicode_ci
skip-character-set-client-handshake
skip-name-resolve
docker run -p 3306:3306 \
  --name mysql \
  -v /mydata/mysql/log:/var/log/mysql \
  -v /mydata/mysql/data:/var/lib/mysql \
  -v /mydata/mysql/conf:/etc/mysql/my.cnf \
  -e MYSQL_ROOT_PASSWORD=root \
  -d mysql:5.7

Connect through the mysql command-line tool inside the container:

docker exec -it mysql mysql -uroot -proot

Enable remote access for root:

grant all privileges on *.* to 'root'@'%' identified by 'root' with grant option;

flush privileges;

docker exec -it mysql /bin/bash

cd /etc/mysql
ls

Solution
Given the background above, the fix is simple:

docker ps -a        # find the id of the container from the previous run
docker restart <id>

That's really all there is to it.

I created a volume container here for persistence, located at /var/lib/docker/volumes/mysql-data/_data; it is mounted from the mysql container to the path my/own on my cloud server.

Gateway

Register renren-fast with nacos. You may hit some dependency errors; add these two coordinates to that microservice's pom:

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-commons</artifactId>
    <version>3.1.5</version>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-context</artifactId>
    <version>3.1.5</version> <!-- Make sure to use the correct version -->
</dependency>

Add the following, in this format, to the gateway:

- id: admin_route
  uri: lb://renren-fast # route to renren-fast; lb means load balancing
  predicates: # when to route to it
    - Path=/api/** # by convention the frontend prefixes everything with /api
  filters:
    - RewritePath=/api/(?<segment>.*),/renren-fast/$\{segment}

Cors

Remember to comment out the CORS configuration inside renren-fast,

and configure this in the gateway:

package com.josh.mall.gateway.config;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.cors.CorsConfiguration;
import org.springframework.web.cors.reactive.CorsWebFilter;
import org.springframework.web.cors.reactive.UrlBasedCorsConfigurationSource;

/**
 * Description:
 * Author: joshua
 * Date: 2023/5/28
 */
@Configuration // gateway
public class MallCorsConfiguration {

    @Bean // register the filter
    public CorsWebFilter corsWebFilter() {
        // URL-based CORS; use the class from the reactive package
        UrlBasedCorsConfigurationSource source = new UrlBasedCorsConfigurationSource();
        // CORS configuration
        CorsConfiguration corsConfiguration = new CorsConfiguration();
        // allowed headers
        corsConfiguration.addAllowedHeader("*");
        // allowed request methods
        corsConfiguration.addAllowedMethod("*");
        // allowed origins
        corsConfiguration.addAllowedOriginPattern("*");
        // whether cookies may be sent cross-origin
        corsConfiguration.setAllowCredentials(true);

        // apply the CORS configuration to every URL
        source.registerCorsConfiguration("/**", corsConfiguration);
        return new CorsWebFilter(source);
    }
}

Sparse skills

Python

1a

import pandas as pd

# Task 1a: Data Loading and Preprocessing

# 1. Load the Data: Load the CSV file into a Pandas DataFrame
file_path = "data.csv"  # Replace with your file path
df = pd.read_csv(file_path)
print("Data loaded successfully.")

# 2. Data Cleaning: Check for missing values and fill them appropriately
if df.isnull().sum().sum() > 0:
    print("Missing values found. Filling with appropriate values...")
    # Assuming numeric columns are filled with 0 and categorical with 'Unknown'
    for column in df.columns:
        if df[column].dtype == 'object':
            df[column].fillna('Unknown', inplace=True)
        else:
            df[column].fillna(0, inplace=True)
else:
    print("No missing values found.")

# 3. Convert the Date column to a datetime object
if 'Date' in df.columns:
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
    if df['Date'].isnull().sum() > 0:
        print("Some dates were invalid. Converting invalid dates to a default value (e.g., today's date).")
        df['Date'].fillna(pd.Timestamp.today(), inplace=True)
    print("Date column converted to datetime successfully.")

# 4. Ensure the Total column correctly represents the product of Quantity and Price
if 'Total' in df.columns and 'Quantity' in df.columns and 'Price' in df.columns:
    df['Calculated_Total'] = df['Quantity'] * df['Price']
    discrepancy = df[df['Total'] != df['Calculated_Total']]
    if not discrepancy.empty:
        print("Discrepancies found in the Total column. Correcting them...")
        df['Total'] = df['Calculated_Total']
    print("Total column verified and corrected if necessary.")
else:
    print("Required columns (Total, Quantity, or Price) not found.")

# Display the cleaned DataFrame (Optional)
print("Cleaned DataFrame:")
print(df.head())

# Save the cleaned data (Optional)
df.to_csv("cleaned_data.csv", index=False)
print("Cleaned data saved to 'cleaned_data.csv'.")

1b

import pandas as pd
import matplotlib.pyplot as plt

data = {
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Quantity': [10, 15, 12, 20, 25, 10, 30, 15, 10, 5],
    'Date': [
        '2023-01-15', '2023-01-20', '2023-02-10', '2023-02-18',
        '2023-03-05', '2023-03-12', '2023-04-07', '2023-04-15',
        '2023-05-01', '2023-05-10'
    ]
}

df = pd.DataFrame(data)

# Ensure 'Date' is a datetime type
df['Date'] = pd.to_datetime(df['Date'])

# 1. Product Sales Distribution (Bar Chart)
product_sales = df.groupby('Product')['Quantity'].sum()

plt.figure(figsize=(10, 6))
product_sales.plot(kind='bar')
plt.title('Total Quantity Sold for Each Product')
plt.xlabel('Product')
plt.ylabel('Total Quantity Sold')
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# 2. Sales Over Time (Line Plot)
df['Month'] = df['Date'].dt.to_period('M')  # Extract month and year
monthly_sales = df.groupby('Month')['Quantity'].sum()

plt.figure(figsize=(10, 6))
monthly_sales.plot(kind='line', marker='o')
plt.title('Total Sales Over the Months of 2023')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.grid(axis='both', linestyle='--', alpha=0.7)
plt.show()

2a

# 2a: Database Creation, Table Creation, and Data Insertion

import sqlite3

# Connect to (or create) the SQLite database named SalesDB
conn = sqlite3.connect('SalesDB.db')
cursor = conn.cursor()

# Create the Sales table with appropriate columns.
# Duplicate entries are prevented with a UNIQUE constraint on (Date, Product).
cursor.execute('''
CREATE TABLE IF NOT EXISTS Sales (
    Date TEXT,
    Product TEXT,
    Quantity INTEGER,
    Price REAL,
    Total REAL,
    UNIQUE(Date, Product)
)
''')

# Insert cleaned and processed data from Part 1.
# Assuming we have a list of tuples as our cleaned data, for example:
# cleaned_data = [
#     ("2023-01-01", "Widget A", 10, 9.99, 99.90),
#     ("2023-01-01", "Widget B", 5, 19.99, 99.95),
#     ("2023-01-02", "Widget A", 8, 9.99, 79.92),
#     ...
# ]
# Replace the above with your actual cleaned data list

cleaned_data = [
    ("2023-01-01", "Widget A", 10, 9.99, 99.90),
    ("2023-01-01", "Widget B", 5, 19.99, 99.95),
    ("2023-02-10", "Widget A", 20, 9.99, 199.80),
    ("2023-03-15", "Widget C", 3, 29.99, 89.97),
    ("2023-03-15", "Widget B", 2, 19.99, 39.98),
]

# Insert data using "INSERT OR IGNORE" to avoid duplicates
for record in cleaned_data:
    cursor.execute('''
    INSERT OR IGNORE INTO Sales (Date, Product, Quantity, Price, Total)
    VALUES (?, ?, ?, ?, ?)
    ''', record)

conn.commit()
conn.close()

2b

# 2b: Querying the Database

import sqlite3

conn = sqlite3.connect('SalesDB.db')
cursor = conn.cursor()

# Query 1: Total Sales for the year 2023
cursor.execute('''
SELECT SUM(Total) AS total_sales_2023
FROM Sales
WHERE Date LIKE '2023-%';
''')
total_sales_2023 = cursor.fetchone()[0]
print("Total Sales in 2023:", total_sales_2023)

# Query 2: Product Sales Summary for 2023 (total quantity sold per product, descending order)
cursor.execute('''
SELECT Product, SUM(Quantity) as total_quantity
FROM Sales
WHERE Date LIKE '2023-%'
GROUP BY Product
ORDER BY total_quantity DESC;
''')
product_sales_summary = cursor.fetchall()
print("Product Sales Summary for 2023:")
for row in product_sales_summary:
    print(row)  # (Product, total_quantity)

conn.close()

3a

import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

# Task 3a: Model Initialization
class SimpleNN(nn.Module):
    def __init__(self, input_size):
        super(SimpleNN, self).__init__()
        self.linear = nn.Linear(input_size, 1)

    def forward(self, x):
        return self.linear(x)

# Instantiate the model
input_size = 10
model = SimpleNN(input_size)

# Define loss function and optimizer
loss_function = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

3b

# Generate synthetic data
num_samples = 100
num_features = 10

data = torch.randn(num_samples, num_features)
target = torch.randn(num_samples, 1)  # 100 target values reshaped for MSE

3c

num_epochs = 20
losses = []

for epoch in range(num_epochs):
    # Forward pass
    predictions = model(data)
    loss = loss_function(predictions, target)

    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Record loss
    losses.append(loss.item())
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {loss.item():.4f}")

3d

plt.plot(range(1, num_epochs + 1), losses, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Training Loss')
plt.title('Training Loss Over Epochs')
plt.grid(True)
plt.show()

Java

1a

package petcare;

public class Animal {
    private String name;
    private AnimalSize size;
    private int comfortableTemperatureLower;
    private int comfortableTemperatureUpper;

    public Animal(String name, AnimalSize size, int comfortableTemperatureLower, int comfortableTemperatureUpper) {
        if (name == null || name.length() < 3) {
            throw new IllegalArgumentException("Name must be at least 3 characters long.");
        }
        if (comfortableTemperatureLower < 0 || comfortableTemperatureUpper > 50 || comfortableTemperatureLower > comfortableTemperatureUpper) {
            throw new IllegalArgumentException("Temperature range must be between 0 and 50 and valid.");
        }
        this.name = name;
        this.size = size;
        this.comfortableTemperatureLower = comfortableTemperatureLower;
        this.comfortableTemperatureUpper = comfortableTemperatureUpper;
    }
}

1b

// The following members also belong to the Animal class above
// (they additionally require: import java.util.Objects;)

@Override
public String toString() {
    return "Animal{" +
            "name='" + name + '\'' +
            ", size=" + size +
            ", comfortableTemperatureLower=" + comfortableTemperatureLower +
            ", comfortableTemperatureUpper=" + comfortableTemperatureUpper +
            '}';
}

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    Animal animal = (Animal) o;
    return size == animal.size && name.equals(animal.name);
}

@Override
public int hashCode() {
    return Objects.hash(name, size);
}

// Getters and Setters
public String getName() {
    return name;
}

public void setName(String name) {
    if (name == null || name.length() < 3) {
        throw new IllegalArgumentException("Name must be at least 3 characters long.");
    }
    this.name = name;
}

public AnimalSize getSize() {
    return size;
}

public void setSize(AnimalSize size) {
    this.size = size;
}

public int getComfortableTemperatureLower() {
    return comfortableTemperatureLower;
}

public void setComfortableTemperatureLower(int comfortableTemperatureLower) {
    if (comfortableTemperatureLower < 0 || comfortableTemperatureLower > comfortableTemperatureUpper) {
        throw new IllegalArgumentException("Invalid lower temperature.");
    }
    this.comfortableTemperatureLower = comfortableTemperatureLower;
}

public int getComfortableTemperatureUpper() {
    return comfortableTemperatureUpper;
}

public void setComfortableTemperatureUpper(int comfortableTemperatureUpper) {
    if (comfortableTemperatureUpper > 50 || comfortableTemperatureUpper < comfortableTemperatureLower) {
        throw new IllegalArgumentException("Invalid upper temperature.");
    }
    this.comfortableTemperatureUpper = comfortableTemperatureUpper;
}

2a

package petcare;

public class Enclosure {
    private AnimalSize size;
    private int temperature;
    private int runningCosts;
    private Animal occupant;

    public Enclosure(AnimalSize size, int temperature, int runningCosts) {
        this.size = size;
        this.temperature = temperature;
        this.runningCosts = runningCosts;
        this.occupant = null;
    }

    public AnimalSize getSize() {
        return size;
    }

    public int getTemperature() {
        return temperature;
    }

    public int getRunningCosts() {
        return runningCosts;
    }

    public Animal getOccupant() {
        return occupant;
    }
}

2b

public boolean checkCompatibility(Animal animal) {
    if (animal == null) {
        throw new IllegalArgumentException("Animal cannot be null.");
    }
    return animal.getSize().ordinal() <= size.ordinal() &&
            temperature >= animal.getComfortableTemperatureLower() &&
            temperature <= animal.getComfortableTemperatureUpper();
}

2c

public void addAnimal(Animal animal) {
    if (occupant != null) {
        throw new IllegalArgumentException("Enclosure already has an occupant.");
    }
    if (!checkCompatibility(animal)) {
        throw new IllegalArgumentException("Animal is not compatible with this enclosure.");
    }
    this.occupant = animal;
}

public void removeAnimal() {
    this.occupant = null;
}

3a

package petcare;

import java.util.ArrayList;
import java.util.List;

public class PetService {
    private List<Enclosure> enclosures;

    public PetService() {
        this.enclosures = new ArrayList<>();
    }

    public List<Enclosure> getEnclosures() {
        return enclosures;
    }
}

3b

public void addEnclosure(Enclosure enclosure) {
    if (enclosure == null) {
        throw new IllegalArgumentException("Enclosure cannot be null.");
    }
    enclosures.add(enclosure);
}

public void printAllEnclosures() {
    for (Enclosure enclosure : enclosures) {
        System.out.println(enclosure);
    }
}

3c

public boolean allocateAnimal(Animal animal) {
    Enclosure bestEnclosure = null;
    for (Enclosure enclosure : enclosures) {
        if (enclosure.getOccupant() == null && enclosure.checkCompatibility(animal)) {
            if (bestEnclosure == null || enclosure.getRunningCosts() < bestEnclosure.getRunningCosts()) {
                bestEnclosure = enclosure;
            }
        }
    }
    if (bestEnclosure != null) {
        bestEnclosure.addAnimal(animal);
        return true;
    }
    return false;
}

3d

public void removeAnimal(Animal animal) {
    for (Enclosure enclosure : enclosures) {
        if (animal.equals(enclosure.getOccupant())) {
            enclosure.removeAnimal();
            return;
        }
    }
}

CS skills

Git study

Knapsack Problem

You have a knapsack of capacity W = 10 and the following items:

| Item | Weight (kg) | Value |
| --- | --- | --- |
| Item 1 | 2 | 6 |
| Item 2 | 3 | 10 |
| Item 3 | 4 | 12 |

The goal is to choose items to put into the knapsack so that the total value is maximized without exceeding the capacity.

Dynamic programming:

Steps:

  1. Define the state

Let dp[i][w] be the maximum value achievable using the first i items with capacity at most w.

  2. State transition

If item i is not taken: dp[i][w] = dp[i-1][w]

If item i is taken (requires w >= weight[i]): dp[i][w] = dp[i-1][w-weight[i]] + value[i]

Taking the better of the two: dp[i][w] = max(dp[i-1][w], dp[i-1][w-weight[i]] + value[i])

  3. Initialization

When i = 0 or w = 0, i.e. there are no items or the capacity is 0: dp[i][w] = 0

  4. Result

dp[n][W] is the maximum value achievable with the first n items and capacity at most W.

(figure of the filled DP table omitted)

def knapsack(weights, values, W):
    n = len(weights)
    dp = [[0] * (W + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        for w in range(W + 1):
            if weights[i-1] <= w:
                dp[i][w] = max(dp[i-1][w], dp[i-1][w-weights[i-1]] + values[i-1])
            else:
                dp[i][w] = dp[i-1][w]

    return dp[n][W]

# sample data
weights = [2, 3, 4]
values = [6, 10, 12]
W = 10

print("maximum value:", knapsack(weights, values, W))

What if the knapsack problem becomes continuous, i.e. items can be split into fractions? See the sketch below.
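Presumably this refers to the fractional knapsack, where any fraction of an item may be taken; a greedy choice by value density (value/weight) is then optimal. A minimal sketch, written in Java to keep the added examples in one language:

import java.util.*;

public class FractionalKnapsack {
    // maximum value when items may be split arbitrarily
    static double solve(double[] weights, double[] values, double capacity) {
        Integer[] idx = new Integer[weights.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // consider items by value density, highest first:
        Arrays.sort(idx, (a, b) -> Double.compare(values[b] / weights[b], values[a] / weights[a]));
        double total = 0;
        for (int i : idx) {
            if (capacity <= 0) break;
            double take = Math.min(weights[i], capacity); // take as much as fits
            total += take * (values[i] / weights[i]);
            capacity -= take;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] w = {2, 3, 4};
        double[] v = {6, 10, 12};
        System.out.println(solve(w, v, 10)); // all items fit here: 28.0
    }
}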

YARN

Yet Another Resource Negotiator

Core YARN components

  1. ResourceManager (RM)
    • The cluster-wide resource manager, coordinating resource allocation across the whole cluster.
    • Contains two main modules:
      • Scheduler: allocates resources to applications according to the scheduling policy.
      • ApplicationsManager: manages application lifecycles, e.g. launching the ApplicationMaster.
  2. NodeManager (NM)
    • The per-node agent, managing that node's resources and running its tasks.
    • Monitors the resource usage (CPU, memory, etc.) of containers and reports it to the ResourceManager.
  3. ApplicationMaster (AM)
    • The per-application job manager, responsible for scheduling the job and managing task execution.
    • Requests resources and launches tasks by communicating with the ResourceManager and NodeManagers.
  4. Container
    • The smallest unit of resource allocation in YARN, comprising some number of CPU cores and memory.
    • Every task runs inside one or more Containers.

How it works

Application submission

  • The client submits an application request to the ResourceManager.
  • The ResourceManager launches an ApplicationMaster to manage the application.

Resource allocation

  • The ApplicationMaster negotiates resource requirements with the ResourceManager.
  • The ResourceManager allocates containers according to the current scheduling policy.

Task execution

  • The ApplicationMaster asks NodeManagers to launch Containers and run the tasks.
  • NodeManagers execute the tasks in the allocated containers and report status back to the ApplicationMaster.

Task completion

  • The ApplicationMaster aggregates the task results and reports the application's completion status to the ResourceManager.